From benh at kernel.crashing.org Tue Mar 1 15:29:25 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 01 Mar 2005 15:29:25 +1100
Subject: [PATCH] ppc64: Fix zImage wrapper incorrect size to flush_cache()
Message-ID: <1109651365.7669.21.camel@gaston>

Hi !

This patch fixes a bug in the ppc64 zImage wrapper causing it to pass an
incorrect size to flush_cache() when flushing the data and instruction
caches prior to jumping to the kernel entry. This causes crashes on
firmware environments that do strict MMU mapping only of actually
allocated areas.

Signed-off-by: Benjamin Herrenschmidt

--- dingo/2.6.10-bk5/arch/ppc64/boot/main.c 2004-12-25 08:35:50.000000000 +1100
+++ 2.6.10-bk5/arch/ppc64/boot/main.c 2005-02-16 17:10:49.194263268 +1100
@@ -200,7 +200,7 @@
 	vmlinux.addr += (unsigned long)elf64ph->p_offset;
 	vmlinux.size -= (unsigned long)elf64ph->p_offset;
-	flush_cache((void *)vmlinux.addr, vmlinux.memsize);
+	flush_cache((void *)vmlinux.addr, vmlinux.size);
 	if (a1)
 		printf("initrd head: 0x%lx\n\r", *((u32 *)initrd.addr));

From kravetz at us.ibm.com Wed Mar 2 09:27:13 2005
From: kravetz at us.ibm.com (Mike Kravetz)
Date: Tue, 1 Mar 2005 14:27:13 -0800
Subject: [PATCH] NUMA memory fixup
Message-ID: <20050301222713.GB5780@w-mikek2.ibm.com>

When I booted my new 720 on a kernel configured for NUMA, I received the
following during bootup:

WARNING: Unexpected node layout: region start 44000000 length 2000000
NUMA is disabled

This is due to memory 'holes' within nodes. If such holes are
encountered, then NUMA is disabled. The following patch adds support
for such configurations. My 720 now boots with the following message:

[boot]0012 Setup Arch
Node 0 Memory: 0x0-0x8000000 0x44000000-0x12a000000
Node 1 Memory: 0x8000000-0x44000000 0x12a000000-0x1ea000000

I'd appreciate any comments on the approach taken. I'm also working on
adding NUMA support on top of the SPARSEMEM implementation being pushed
as part of memory hot add.
However, it seems important to get the current implementation
based on DISCONTIGMEM working first.

This patch is against 2.6.11-rc3, but I can provide a later version if
needed.
--
Signed-off-by: Mike Kravetz

diff -Naupr linux-2.6.11-rc3/arch/ppc64/mm/numa.c linux-2.6.11-rc3.work/arch/ppc64/mm/numa.c
--- linux-2.6.11-rc3/arch/ppc64/mm/numa.c 2005-02-03 01:57:16.000000000 +0000
+++ linux-2.6.11-rc3.work/arch/ppc64/mm/numa.c 2005-03-01 19:39:21.000000000 +0000
@@ -40,7 +40,6 @@ int nr_cpus_in_node[MAX_NUMNODES] = { [0
 struct pglist_data *node_data[MAX_NUMNODES];
 bootmem_data_t __initdata plat_node_bdata[MAX_NUMNODES];
-static unsigned long node0_io_hole_size;
 static int min_common_depth;
 /*
@@ -49,7 +48,8 @@ static int min_common_depth;
  */
 static struct {
 	unsigned long node_start_pfn;
-	unsigned long node_spanned_pages;
+	unsigned long node_end_pfn;
+	unsigned long node_present_pages;
 } init_node_data[MAX_NUMNODES] __initdata;
 EXPORT_SYMBOL(node_data);
@@ -348,33 +348,28 @@ new_range:
 		if (max_domain < numa_domain)
 			max_domain = numa_domain;
-		/*
-		 * For backwards compatibility, OF splits the first node
-		 * into two regions (the first being 0-4GB). Check for
-		 * this simple case and complain if there is a gap in
-		 * memory
+		/*
+		 * Initialize new node struct, or add to an existing one.
 		 */
-		if (init_node_data[numa_domain].node_spanned_pages) {
-			unsigned long shouldstart =
-				init_node_data[numa_domain].node_start_pfn +
-				init_node_data[numa_domain].node_spanned_pages;
-			if (shouldstart != (start / PAGE_SIZE)) {
-				/* Revert to non-numa for now */
-				printk(KERN_ERR
-				       "WARNING: Unexpected node layout: "
-				       "region start %lx length %lx\n",
-				       start, size);
-				printk(KERN_ERR "NUMA is disabled\n");
-				goto err;
-			}
-			init_node_data[numa_domain].node_spanned_pages +=
+		if (init_node_data[numa_domain].node_end_pfn) {
+			if ((start / PAGE_SIZE) <
+			    init_node_data[numa_domain].node_start_pfn)
+				init_node_data[numa_domain].node_start_pfn =
+					start / PAGE_SIZE;
+			else
+				init_node_data[numa_domain].node_end_pfn =
+					(start / PAGE_SIZE) +
+					(size / PAGE_SIZE);
+
+			init_node_data[numa_domain].node_present_pages +=
 				size / PAGE_SIZE;
 		} else {
 			node_set_online(numa_domain);
 			init_node_data[numa_domain].node_start_pfn =
 				start / PAGE_SIZE;
-			init_node_data[numa_domain].node_spanned_pages =
+			init_node_data[numa_domain].node_end_pfn =
+				init_node_data[numa_domain].node_start_pfn +
 				size / PAGE_SIZE;
 		}
@@ -391,14 +386,6 @@ new_range:
 		node_set_online(i);
 	return 0;
-err:
-	/* Something has gone wrong; revert any setup we've done */
-	for_each_node(i) {
-		node_set_offline(i);
-		init_node_data[i].node_start_pfn = 0;
-		init_node_data[i].node_spanned_pages = 0;
-	}
-	return -1;
 }
 static void __init setup_nonnuma(void)
@@ -426,12 +413,11 @@ static void __init setup_nonnuma(void)
 	node_set_online(0);
 	init_node_data[0].node_start_pfn = 0;
-	init_node_data[0].node_spanned_pages = lmb_end_of_DRAM() / PAGE_SIZE;
+	init_node_data[0].node_end_pfn = lmb_end_of_DRAM() / PAGE_SIZE;
+	init_node_data[0].node_present_pages = total_ram / PAGE_SIZE;
 	for (i = 0 ; i < top_of_ram; i += MEMORY_INCREMENT)
 		numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = 0;
-
-	node0_io_hole_size = top_of_ram - total_ram;
 }
 static void __init dump_numa_topology(void)
@@ -512,6 +498,7 @@ static unsigned long careful_allocation(
 void __init do_init_bootmem(void)
 {
 	int nid;
+	struct device_node *memory = NULL;
 	static struct notifier_block ppc64_numa_nb = {
 		.notifier_call = cpu_numa_callback,
 		.priority = 1 /* Must run before sched domains notifier. */
@@ -535,7 +522,7 @@ void __init do_init_bootmem(void)
 		unsigned long bootmap_pages;
 		start_paddr = init_node_data[nid].node_start_pfn * PAGE_SIZE;
-		end_paddr = start_paddr + (init_node_data[nid].node_spanned_pages * PAGE_SIZE);
+		end_paddr = init_node_data[nid].node_end_pfn * PAGE_SIZE;
 		/* Allocate the node structure node local if possible */
 		NODE_DATA(nid) = (struct pglist_data *)careful_allocation(nid,
@@ -551,9 +538,9 @@ void __init do_init_bootmem(void)
 		NODE_DATA(nid)->node_start_pfn =
 			init_node_data[nid].node_start_pfn;
 		NODE_DATA(nid)->node_spanned_pages =
-			init_node_data[nid].node_spanned_pages;
+			end_paddr - start_paddr;
-		if (init_node_data[nid].node_spanned_pages == 0)
+		if (NODE_DATA(nid)->node_spanned_pages == 0)
 			continue;
 		dbg("start_paddr = %lx\n", start_paddr);
@@ -572,33 +559,48 @@ void __init do_init_bootmem(void)
 				  start_paddr >> PAGE_SHIFT,
 				  end_paddr >> PAGE_SHIFT);
-		for (i = 0; i < lmb.memory.cnt; i++) {
-			unsigned long physbase, size;
-
-			physbase = lmb.memory.region[i].physbase;
-			size = lmb.memory.region[i].size;
-
-			if (physbase < end_paddr &&
-			    (physbase+size) > start_paddr) {
-				/* overlaps */
-				if (physbase < start_paddr) {
-					size -= start_paddr - physbase;
-					physbase = start_paddr;
-				}
-
-				if (size > end_paddr - physbase)
-					size = end_paddr - physbase;
-
-				dbg("free_bootmem %lx %lx\n", physbase, size);
-				free_bootmem_node(NODE_DATA(nid), physbase,
-						  size);
+		/*
+		 * We need to do another scan of all memory sections to
+		 * associate memory with the correct node.
+		 */
+		memory = NULL;
+		while ((memory = of_find_node_by_type(memory, "memory")) != NULL) {
+			unsigned long mem_start, mem_size;
+			int numa_domain;
+			unsigned int *memcell_buf;
+			unsigned int len;
+
+			memcell_buf = (unsigned int *)get_property(memory, "reg", &len);
+			if (!memcell_buf || len <= 0)
+				continue;
+
+			mem_start = read_cell_ul(memory, &memcell_buf);
+			mem_size = read_cell_ul(memory, &memcell_buf);
+			numa_domain = of_node_numa_domain(memory);
+
+			if (numa_domain != nid)
+				continue;
+
+			if (mem_start < end_paddr &&
+			    (mem_start+mem_size) > start_paddr) {
+				/* should be no overlaps ! */
+				dbg("free_bootmem %lx %lx\n", mem_start, mem_size);
+				free_bootmem_node(NODE_DATA(nid), mem_start,
+						  mem_size);
 			}
 		}
+		/*
+		 * Mark reserved regions on this node
+		 */
 		for (i = 0; i < lmb.reserved.cnt; i++) {
 			unsigned long physbase = lmb.reserved.region[i].physbase;
 			unsigned long size = lmb.reserved.region[i].size;
+			if (pa_to_nid(physbase) != nid &&
+			    pa_to_nid(physbase+size-1) != nid)
+				continue;
+
 			if (physbase < end_paddr &&
 			    (physbase+size) > start_paddr) {
 				/* overlaps */
@@ -632,13 +634,12 @@ void __init paging_init(void)
 		unsigned long start_pfn;
 		unsigned long end_pfn;
-		start_pfn = plat_node_bdata[nid].node_boot_start >> PAGE_SHIFT;
-		end_pfn = plat_node_bdata[nid].node_low_pfn;
+		start_pfn = init_node_data[nid].node_start_pfn;
+		end_pfn = init_node_data[nid].node_end_pfn;
 		zones_size[ZONE_DMA] = end_pfn - start_pfn;
-		zholes_size[ZONE_DMA] = 0;
-		if (nid == 0)
-			zholes_size[ZONE_DMA] = node0_io_hole_size >> PAGE_SHIFT;
+		zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] -
+			init_node_data[nid].node_present_pages;
 		dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
 		    zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);

From ntl at pobox.com Wed Mar 2 12:47:01 2005
From: ntl at pobox.com (Nathan Lynch)
Date: Tue, 1 Mar 2005 19:47:01 -0600
Subject: [PATCH] explicitly bind idle tasks
In-Reply-To: <20050227144928.6c71adaf.akpm@osdl.org>
References: <20050227031655.67233bb5.akpm@osdl.org>
 <1109542971.14993.217.camel@gaston>
 <20050227144928.6c71adaf.akpm@osdl.org>
Message-ID: <20050302014701.GA5897@otto>

On Sun, Feb 27, 2005 at 02:49:28PM -0800, Andrew Morton wrote:
> Benjamin Herrenschmidt wrote:
> >
> > > -	if (cpu_is_offline(smp_processor_id()) &&
> > > +	if (cpu_is_offline(_smp_processor_id()) &&
> > > 	    system_state == SYSTEM_RUNNING)
> > > 		cpu_die();
> > > }
> > > _
> >
> > This is the idle loop. Is that ever supposed to be preempted ?
>
> Nope, it's a false positive. We had to do the same in x86's idle loop and
> probably others will hit it.

Perhaps I'm missing something, but is there any reason we can't do
the following? I've tested it on ppc64, doesn't seem to break anything.

With hotplug cpu and preempt, we tend to see smp_processor_id warnings
from idle loop code because it's always checking whether its cpu has
gone offline. Replacing every use of smp_processor_id with
_smp_processor_id in all idle loop code is one solution; another way
is explicitly binding idle threads to their cpus (the smp_processor_id
warning does not fire if the caller is bound only to the calling cpu).
This has the (admittedly slight) advantage of letting us know if an
idle thread ever runs on the wrong cpu.

Signed-off-by: Nathan Lynch

Index: linux-2.6.11-rc5-mm1/init/main.c
===================================================================
--- linux-2.6.11-rc5-mm1.orig/init/main.c 2005-03-02 00:12:07.000000000 +0000
+++ linux-2.6.11-rc5-mm1/init/main.c 2005-03-02 00:53:04.000000000 +0000
@@ -638,6 +638,10 @@
 {
 	lock_kernel();
 	/*
+	 * init can run on any cpu.
+	 */
+	set_cpus_allowed(current, CPU_MASK_ALL);
+	/*
 	 * Tell the world that we're going to be the grim
 	 * reaper of innocent orphaned children.
 	 *
Index: linux-2.6.11-rc5-mm1/kernel/sched.c
===================================================================
--- linux-2.6.11-rc5-mm1.orig/kernel/sched.c 2005-03-02 00:12:07.000000000 +0000
+++ linux-2.6.11-rc5-mm1/kernel/sched.c 2005-03-02 00:47:14.000000000 +0000
@@ -4092,6 +4092,7 @@
 	idle->array = NULL;
 	idle->prio = MAX_PRIO;
 	idle->state = TASK_RUNNING;
+	idle->cpus_allowed = cpumask_of_cpu(cpu);
 	set_task_cpu(idle, cpu);
 	spin_lock_irqsave(&rq->lock, flags);

From zwane at arm.linux.org.uk Wed Mar 2 14:13:26 2005
From: zwane at arm.linux.org.uk (Zwane Mwaikambo)
Date: Tue, 1 Mar 2005 20:13:26 -0700 (MST)
Subject: [PATCH] explicitly bind idle tasks
In-Reply-To: <20050302014701.GA5897@otto>
References: <20050227031655.67233bb5.akpm@osdl.org>
 <1109542971.14993.217.camel@gaston>
 <20050227144928.6c71adaf.akpm@osdl.org>
 <20050302014701.GA5897@otto>
Message-ID:

On Tue, 1 Mar 2005, Nathan Lynch wrote:

> On Sun, Feb 27, 2005 at 02:49:28PM -0800, Andrew Morton wrote:
> > Benjamin Herrenschmidt wrote:
> > >
> > > > -	if (cpu_is_offline(smp_processor_id()) &&
> > > > +	if (cpu_is_offline(_smp_processor_id()) &&
> > > > 	    system_state == SYSTEM_RUNNING)
> > > > 		cpu_die();
> > > > }
> > > > _
> > >
> > > This is the idle loop. Is that ever supposed to be preempted ?
> >
> > Nope, it's a false positive. We had to do the same in x86's idle loop and
> > probably others will hit it.
>
> Perhaps I'm missing something, but is there any reason we can't do
> the following? I've tested it on ppc64, doesn't seem to break anything.
>
> With hotplug cpu and preempt, we tend to see smp_processor_id warnings
> from idle loop code because it's always checking whether its cpu has
> gone offline. Replacing every use of smp_processor_id with
> _smp_processor_id in all idle loop code is one solution; another way
> is explicitly binding idle threads to their cpus (the smp_processor_id
> warning does not fire if the caller is bound only to the calling cpu).
> This has the (admittedly slight) advantage of letting us know if an
> idle thread ever runs on the wrong cpu.

Makes sense to me, for some reason I thought the smp_processor_id()
function did a cpu_rq->idle check of some sort.

Thanks,
Zwane

From Derek.Fults at gd-ais.com Tue Mar 1 05:16:12 2005
From: Derek.Fults at gd-ais.com (Derek.Fults at gd-ais.com)
Date: Mon, 28 Feb 2005 12:16:12 -0600
Subject: CPU Freq Scaling
Message-ID: <1109614572.20610.41.camel@kato.gd-ais.com>

Hi All,

I'm looking for information on CPU frequency scaling of the 970. I've
got a request to clock it down to 500 SPECINTs. Is this currently being
worked on or down the pipe a ways?

Thanks for any info.
--
Derek L. Fults

From nacc at us.ibm.com Thu Mar 3 05:12:06 2005
From: nacc at us.ibm.com (Nishanth Aravamudan)
Date: Wed, 2 Mar 2005 10:12:06 -0800
Subject: eeh.h compile warnings / adbhid.c build failure
Message-ID: <20050302181206.GA2741@us.ibm.com>

Hi,

While building 2.6.11 for a G5, I noticed the following errors being
spit out (gcc 3.3.5):

include/asm/eeh.h: In function `eeh_memcpy_fromio':
include/asm/eeh.h:265: warning: statement with no effect
include/asm/eeh.h: In function `eeh_insb':
include/asm/eeh.h:353: warning: statement with no effect
include/asm/eeh.h: In function `eeh_insw_ns':
include/asm/eeh.h:360: warning: statement with no effect
include/asm/eeh.h: In function `eeh_insl_ns':
include/asm/eeh.h:367: warning: statement with no effect
include/asm/eeh.h: In function `eeh_memcpy_fromio':
include/asm/eeh.h:265: warning: statement with no effect
include/asm/eeh.h: In function `eeh_insb':
include/asm/eeh.h:353: warning: statement with no effect
include/asm/eeh.h: In function `eeh_insw_ns':
include/asm/eeh.h:360: warning: statement with no effect
include/asm/eeh.h: In function `eeh_insl_ns':
include/asm/eeh.h:367: warning: statement with no effect

These warnings are emitted for pretty much every driver. It looks like
it is because with CONFIG_EEH undefined (it's a pSeries thing?
-- my interpretation from looking at the ppc64 Kconfig),
eeh_check_failure() becomes #define'd to simply its second parameter,
which in the case of assignment statements is a statement with no
effect. It's not a big deal, the kernels still compile (with a patch to
adbhid.c which I'll mention in a second) but it's a lot of noise to be
generated because I don't have a pSeries machine...

Now, to the build-blocking code:

In drivers/macintosh/adbhid.c::1159:

static int __init adbhid_init(void)
{
#ifndef CONFIG_MAC
	if ( (_machine != _MACH_chrp) && (_machine != _MACH_Pmac) )
		return 0;
#endif
...

I don't see CONFIG_MAC in my .config (attached below [1]), and
_MACH_chrp is not defined for ppc64 (it's in asm-ppc/processor.h not in
ppc64/processor.h). I just removed the _MACH_chrp conditional in my
local code to get the kernel to build. I'm not sure what the actual
solution is, but I thought you all should know about it.

Thanks,
Nish

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.11
# Wed Mar 2 09:27:02 2005
#
CONFIG_64BIT=y
CONFIG_MMU=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_EARLY_PRINTK=y
CONFIG_COMPAT=y
CONFIG_FRAME_POINTER=y
CONFIG_FORCE_MAX_ZONEORDER=13

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
# CONFIG_CLEAN_COMPILE is not set
CONFIG_BROKEN=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HOTPLUG=y
CONFIG_KOBJECT_UEVENT=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_EMBEDDED is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_FUTEX=y
CONFIG_EPOLL=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SHMEM=y
CONFIG_CC_ALIGN_FUNCTIONS=0
CONFIG_CC_ALIGN_LABELS=0
CONFIG_CC_ALIGN_LOOPS=0
CONFIG_CC_ALIGN_JUMPS=0
# CONFIG_TINY_SHMEM is not set

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_OBSOLETE_MODPARM=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_KMOD is not set
CONFIG_STOP_MACHINE=y
CONFIG_SYSVIPC_COMPAT=y

#
# Platform support
#
# CONFIG_PPC_ISERIES is not set
CONFIG_PPC_MULTIPLATFORM=y
# CONFIG_PPC_PSERIES is not set
CONFIG_PPC_PMAC=y
# CONFIG_PPC_MAPLE is not set
CONFIG_PPC=y
CONFIG_PPC64=y
CONFIG_PPC_OF=y
CONFIG_ALTIVEC=y
CONFIG_U3_DART=y
CONFIG_PPC_PMAC64=y
CONFIG_BOOTX_TEXT=y
# CONFIG_POWER4_ONLY is not set
# CONFIG_IOMMU_VMERGE is not set
CONFIG_SMP=y
CONFIG_NR_CPUS=2
CONFIG_SCHED_SMT=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_GENERIC_HARDIRQS=y

#
# General setup
#
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=y
CONFIG_PCI_LEGACY_PROC=y
CONFIG_PCI_NAMES=y

#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set

#
# PC-card bridges
#

#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set
CONFIG_PROC_DEVICETREE=y
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="console=ttyS0,9600 console=tty0 root=/dev/sda2"

#
# Device Drivers
#

#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set

#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#

#
# Block devices
#
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CDROM_PKTCDVD is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_ATA_OVER_ETH is not set

#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
CONFIG_BLK_DEV_IDEDISK=y
# CONFIG_IDEDISK_MULTI_MODE is not set
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
# CONFIG_BLK_DEV_IDESCSI is not set
# CONFIG_IDE_TASK_IOCTL is not set

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_BLK_DEV_GENERIC=y
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_SL82C105 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_SC1200 is not set
# CONFIG_BLK_DEV_PIIX is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
CONFIG_BLK_DEV_IDE_PMAC=y
CONFIG_BLK_DEV_IDE_PMAC_ATA100FIRST=y
CONFIG_BLK_DEV_IDEDMA_PMAC=y
# CONFIG_BLK_DEV_IDE_PMAC_BLINK is not set
# CONFIG_IDE_ARM is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
CONFIG_IDEDMA_AUTO=y
# CONFIG_BLK_DEV_HD is not set

#
# SCSI device support
#
CONFIG_SCSI=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set

#
# SCSI Transport Attributes
#
CONFIG_SCSI_SPI_ATTRS=y
CONFIG_SCSI_FC_ATTRS=y
# CONFIG_SCSI_ISCSI_ATTRS is not set

#
# SCSI low-level drivers
#
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
CONFIG_SCSI_SATA=y
# CONFIG_SCSI_SATA_AHCI is not set
CONFIG_SCSI_SATA_SVW=y
# CONFIG_SCSI_ATA_PIIX is not set
# CONFIG_SCSI_SATA_NV is not set
# CONFIG_SCSI_SATA_PROMISE is not set
# CONFIG_SCSI_SATA_QSTOR is not set
# CONFIG_SCSI_SATA_SX4 is not set
# CONFIG_SCSI_SATA_SIL is not set
# CONFIG_SCSI_SATA_SIS is not set
# CONFIG_SCSI_SATA_ULI is not set
# CONFIG_SCSI_SATA_VIA is not set
# CONFIG_SCSI_SATA_VITESSE is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_CPQFCTS is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_EATA_PIO is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_PCI2000 is not set
# CONFIG_SCSI_PCI2220I is not set
# CONFIG_SCSI_QLOGIC_ISP is not set
# CONFIG_SCSI_QLOGIC_FC is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA2XXX=y
# CONFIG_SCSI_QLA21XX is not set
# CONFIG_SCSI_QLA22XX is not set
# CONFIG_SCSI_QLA2300 is not set
# CONFIG_SCSI_QLA2322 is not set
# CONFIG_SCSI_QLA6312 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set

#
# Multi-device support (RAID and LVM)
#
# CONFIG_MD is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y

#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_IEEE1394_OUI_DB is not set
CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y
CONFIG_IEEE1394_CONFIG_ROM_IP1394=y

#
# Device Drivers
#
# CONFIG_IEEE1394_PCILYNX is not set
CONFIG_IEEE1394_OHCI1394=y

#
# Protocol Drivers
#
CONFIG_IEEE1394_VIDEO1394=y
CONFIG_IEEE1394_SBP2=y
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394=y
CONFIG_IEEE1394_DV1394=y
CONFIG_IEEE1394_RAWIO=y
CONFIG_IEEE1394_CMP=y
# CONFIG_IEEE1394_AMDTP is not set

#
# I2O device support
#
CONFIG_I2O=y
CONFIG_I2O_CONFIG=y
CONFIG_I2O_BLOCK=y
CONFIG_I2O_SCSI=y
CONFIG_I2O_PROC=y

#
# Macintosh device drivers
#
CONFIG_ADB=y
CONFIG_ADB_PMU=y
# CONFIG_PMAC_PBOOK is not set
# CONFIG_PMAC_BACKLIGHT is not set
# CONFIG_MAC_SERIAL is not set
CONFIG_INPUT_ADBHID=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_THERM_PM72=y

#
# Networking support
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
# CONFIG_NETLINK_DEV is not set
CONFIG_UNIX=y
CONFIG_NET_KEY=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=y
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_TUNNEL=y
CONFIG_IP_TCPDIAG=y
# CONFIG_IP_TCPDIAG_IPV6 is not set

#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
# CONFIG_IPV6 is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set

#
# IP: Netfilter Configuration
#
CONFIG_IP_NF_CONNTRACK=y
# CONFIG_IP_NF_CT_ACCT is not set
# CONFIG_IP_NF_CONNTRACK_MARK is not set
# CONFIG_IP_NF_CT_PROTO_SCTP is not set
CONFIG_IP_NF_FTP=y
CONFIG_IP_NF_IRC=y
CONFIG_IP_NF_TFTP=y
CONFIG_IP_NF_AMANDA=y
CONFIG_IP_NF_QUEUE=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_LIMIT=y
CONFIG_IP_NF_MATCH_IPRANGE=y
CONFIG_IP_NF_MATCH_MAC=y
CONFIG_IP_NF_MATCH_PKTTYPE=y
CONFIG_IP_NF_MATCH_MARK=y
CONFIG_IP_NF_MATCH_MULTIPORT=y
CONFIG_IP_NF_MATCH_TOS=y
CONFIG_IP_NF_MATCH_RECENT=y
CONFIG_IP_NF_MATCH_ECN=y
CONFIG_IP_NF_MATCH_DSCP=y
CONFIG_IP_NF_MATCH_AH_ESP=y
CONFIG_IP_NF_MATCH_LENGTH=y
CONFIG_IP_NF_MATCH_TTL=y
CONFIG_IP_NF_MATCH_TCPMSS=y
CONFIG_IP_NF_MATCH_HELPER=y
CONFIG_IP_NF_MATCH_STATE=y
CONFIG_IP_NF_MATCH_CONNTRACK=y
CONFIG_IP_NF_MATCH_OWNER=y
# CONFIG_IP_NF_MATCH_ADDRTYPE is not set
# CONFIG_IP_NF_MATCH_REALM is not set
# CONFIG_IP_NF_MATCH_SCTP is not set
# CONFIG_IP_NF_MATCH_COMMENT is not set
# CONFIG_IP_NF_MATCH_HASHLIMIT is not set
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_LOG=y
CONFIG_IP_NF_TARGET_ULOG=y
CONFIG_IP_NF_TARGET_TCPMSS=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_REDIRECT=y
CONFIG_IP_NF_TARGET_NETMAP=y
CONFIG_IP_NF_TARGET_SAME=y
CONFIG_IP_NF_NAT_SNMP_BASIC=y
CONFIG_IP_NF_NAT_IRC=y
CONFIG_IP_NF_NAT_FTP=y
CONFIG_IP_NF_NAT_TFTP=y
CONFIG_IP_NF_NAT_AMANDA=y
CONFIG_IP_NF_MANGLE=y
CONFIG_IP_NF_TARGET_TOS=y
CONFIG_IP_NF_TARGET_ECN=y
CONFIG_IP_NF_TARGET_DSCP=y
CONFIG_IP_NF_TARGET_MARK=y
CONFIG_IP_NF_TARGET_CLASSIFY=y
# CONFIG_IP_NF_RAW is not set
CONFIG_IP_NF_ARPTABLES=y
CONFIG_IP_NF_ARPFILTER=y
CONFIG_IP_NF_ARP_MANGLE=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_NET_DIVERT is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set
# CONFIG_NET_CLS_ROUTE is not set

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_RX is not set
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
CONFIG_NETDEVICES=y
CONFIG_DUMMY=y
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_OAKNET is not set
# CONFIG_HAPPYMEAL is not set
CONFIG_SUNGEM=y
# CONFIG_NET_VENDOR_3COM is not set

#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
# CONFIG_NET_PCI is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SK98LIN is not set
# CONFIG_TIGON3 is not set

#
# Ethernet (10000 Mbit)
#
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
# CONFIG_NET_RADIO is not set

#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
CONFIG_NETCONSOLE=y

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set

#
# Input I/O drivers
#
# CONFIG_GAMEPORT is not set
CONFIG_SOUND_GAMEPORT=y
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
# CONFIG_INPUT_MISC is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_NR_UARTS=4
# CONFIG_SERIAL_8250_EXTENDED is not set

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_PMACZILOG=y
CONFIG_SERIAL_PMACZILOG_CONSOLE=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
# CONFIG_RTC is not set
# CONFIG_GEN_RTC is not set
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# Ftape, the floppy tape device driver
#
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_GAMMA is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=y
# CONFIG_RAW_DRIVER is not set

#
# I2C support
#
CONFIG_I2C=y
CONFIG_I2C_CHARDEV=y

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=y
CONFIG_I2C_ALGOPCF=y
CONFIG_I2C_ALGOPCA=y

#
# I2C Hardware Bus support
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_I810 is not set
# CONFIG_I2C_ISA is not set
CONFIG_I2C_KEYWEST=y
# CONFIG_I2C_MPC is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_SCx200_ACB is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
# CONFIG_I2C_VOODOO3 is not set
# CONFIG_I2C_PCA_ISA is not set

#
# Hardware Sensors Chip support
#
# CONFIG_I2C_SENSOR is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83627HF is not set

#
# Other I2C Chip support
#
# CONFIG_SENSORS_EEPROM is not set
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_RTC8564 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set

#
# Misc devices
#

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set

#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set

#
# Graphics support
#
CONFIG_FB=y
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
CONFIG_FB_OF=y
# CONFIG_FB_CONTROL is not set
# CONFIG_FB_PLATINUM is not set
# CONFIG_FB_VALKYRIE is not set
# CONFIG_FB_CT65550 is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_S3TRIO is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON_OLD is not set
CONFIG_FB_RADEON=y
CONFIG_FB_RADEON_I2C=y
# CONFIG_FB_RADEON_DEBUG is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_VIRTUAL is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y

#
# Logo configuration
#
CONFIG_LOGO=y
CONFIG_LOGO_LINUX_MONO=y
CONFIG_LOGO_LINUX_VGA16=y
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_BACKLIGHT_LCD_SUPPORT is not set

#
# Sound
#
# CONFIG_SOUND is not set

#
# USB support
#
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_UHCI_HCD is not set
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_USB_BLUETOOTH_TTY is not set
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' may also be needed; see USB_STORAGE Help for more information
#
# CONFIG_USB_STORAGE is not set

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_USB_HIDINPUT=y
# CONFIG_HID_FF is not set
CONFIG_USB_HIDDEV=y
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_MTOUCH is not set
# CONFIG_USB_EGALAX is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USB_HPUSBSCSI is not set

#
# USB Multimedia devices
#
# CONFIG_USB_DABUSB is not set

#
# Video4Linux support is needed for USB Multimedia device support
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set

#
# USB port drivers
#

#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGETKIT is not set
# CONFIG_USB_PHIDGETSERVO is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_TEST is not set

#
# USB ATM/DSL drivers
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
# CONFIG_MMC is not set

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# File systems
#
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
# CONFIG_EXT3_FS_SECURITY is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y

#
# XFS support
#
# CONFIG_XFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
# CONFIG_JOLIET is not set
# CONFIG_ZISOFS is not set
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # # CONFIG_MSDOS_FS is not set # CONFIG_VFAT_FS is not set # CONFIG_NTFS_FS is not set # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_SYSFS=y # CONFIG_DEVFS_FS is not set CONFIG_DEVPTS_FS_XATTR=y # CONFIG_DEVPTS_FS_SECURITY is not set CONFIG_TMPFS=y # CONFIG_TMPFS_XATTR is not set CONFIG_HUGETLBFS=y CONFIG_HUGETLB_PAGE=y CONFIG_RAMFS=y # # Miscellaneous filesystems # # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set CONFIG_HFS_FS=y CONFIG_HFSPLUS_FS=y # CONFIG_BEFS_FS is not set # CONFIG_BFS_FS is not set # CONFIG_EFS_FS is not set CONFIG_CRAMFS=y # CONFIG_VXFS_FS is not set # CONFIG_HPFS_FS is not set # CONFIG_QNX4FS_FS is not set # CONFIG_SYSV_FS is not set # CONFIG_UFS_FS is not set # # Network File Systems # CONFIG_NFS_FS=y CONFIG_NFS_V3=y CONFIG_NFS_V4=y CONFIG_NFS_DIRECTIO=y # CONFIG_NFSD is not set CONFIG_LOCKD=y CONFIG_LOCKD_V4=y CONFIG_SUNRPC=y CONFIG_SUNRPC_GSS=y CONFIG_RPCSEC_GSS_KRB5=y # CONFIG_RPCSEC_GSS_SPKM3 is not set # CONFIG_SMB_FS is not set # CONFIG_CIFS is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set # # Partition Types # CONFIG_PARTITION_ADVANCED=y # CONFIG_ACORN_PARTITION is not set # CONFIG_OSF_PARTITION is not set # CONFIG_AMIGA_PARTITION is not set # CONFIG_ATARI_PARTITION is not set CONFIG_MAC_PARTITION=y CONFIG_MSDOS_PARTITION=y # CONFIG_BSD_DISKLABEL is not set # CONFIG_MINIX_SUBPARTITION is not set # CONFIG_SOLARIS_X86_PARTITION is not set # CONFIG_UNIXWARE_DISKLABEL is not set # CONFIG_LDM_PARTITION is not set # CONFIG_SGI_PARTITION is not set # CONFIG_ULTRIX_PARTITION is not set # CONFIG_SUN_PARTITION is not set # CONFIG_EFI_PARTITION is not set # # Native Language Support # CONFIG_NLS=y CONFIG_NLS_DEFAULT="iso8859-1" CONFIG_NLS_CODEPAGE_437=y # CONFIG_NLS_CODEPAGE_737 is not set # CONFIG_NLS_CODEPAGE_775 is not set # CONFIG_NLS_CODEPAGE_850 is not set # CONFIG_NLS_CODEPAGE_852 is not set # CONFIG_NLS_CODEPAGE_855 is 
not set # CONFIG_NLS_CODEPAGE_857 is not set # CONFIG_NLS_CODEPAGE_860 is not set # CONFIG_NLS_CODEPAGE_861 is not set # CONFIG_NLS_CODEPAGE_862 is not set # CONFIG_NLS_CODEPAGE_863 is not set # CONFIG_NLS_CODEPAGE_864 is not set # CONFIG_NLS_CODEPAGE_865 is not set # CONFIG_NLS_CODEPAGE_866 is not set # CONFIG_NLS_CODEPAGE_869 is not set # CONFIG_NLS_CODEPAGE_936 is not set # CONFIG_NLS_CODEPAGE_950 is not set # CONFIG_NLS_CODEPAGE_932 is not set # CONFIG_NLS_CODEPAGE_949 is not set # CONFIG_NLS_CODEPAGE_874 is not set # CONFIG_NLS_ISO8859_8 is not set # CONFIG_NLS_CODEPAGE_1250 is not set # CONFIG_NLS_CODEPAGE_1251 is not set CONFIG_NLS_ASCII=y CONFIG_NLS_ISO8859_1=y # CONFIG_NLS_ISO8859_2 is not set # CONFIG_NLS_ISO8859_3 is not set # CONFIG_NLS_ISO8859_4 is not set # CONFIG_NLS_ISO8859_5 is not set # CONFIG_NLS_ISO8859_6 is not set # CONFIG_NLS_ISO8859_7 is not set # CONFIG_NLS_ISO8859_9 is not set # CONFIG_NLS_ISO8859_13 is not set # CONFIG_NLS_ISO8859_14 is not set CONFIG_NLS_ISO8859_15=y # CONFIG_NLS_KOI8_R is not set # CONFIG_NLS_KOI8_U is not set CONFIG_NLS_UTF8=y # # Profiling support # # CONFIG_PROFILING is not set # # Kernel hacking # CONFIG_DEBUG_KERNEL=y CONFIG_MAGIC_SYSRQ=y # CONFIG_SCHEDSTATS is not set # CONFIG_DEBUG_SLAB is not set CONFIG_DEBUG_PREEMPT=y CONFIG_DEBUG_SPINLOCK_SLEEP=y # CONFIG_DEBUG_KOBJECT is not set CONFIG_DEBUG_INFO=y # CONFIG_DEBUG_FS is not set # CONFIG_DEBUG_STACKOVERFLOW is not set # CONFIG_KPROBES is not set # CONFIG_DEBUG_STACK_USAGE is not set CONFIG_DEBUGGER=y # CONFIG_XMON is not set CONFIG_PPCDBG=y # CONFIG_IRQSTACKS is not set # # Security options # # CONFIG_KEYS is not set # CONFIG_SECURITY is not set # # Cryptographic options # CONFIG_CRYPTO=y CONFIG_CRYPTO_HMAC=y CONFIG_CRYPTO_NULL=y CONFIG_CRYPTO_MD4=y CONFIG_CRYPTO_MD5=y CONFIG_CRYPTO_SHA1=y CONFIG_CRYPTO_SHA256=y CONFIG_CRYPTO_SHA512=y # CONFIG_CRYPTO_WP512 is not set CONFIG_CRYPTO_DES=y CONFIG_CRYPTO_BLOWFISH=y CONFIG_CRYPTO_TWOFISH=y CONFIG_CRYPTO_SERPENT=y 
CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_CAST5=y CONFIG_CRYPTO_CAST6=y # CONFIG_CRYPTO_TEA is not set CONFIG_CRYPTO_ARC4=y # CONFIG_CRYPTO_KHAZAD is not set # CONFIG_CRYPTO_ANUBIS is not set CONFIG_CRYPTO_DEFLATE=y # CONFIG_CRYPTO_MICHAEL_MIC is not set # CONFIG_CRYPTO_CRC32C is not set CONFIG_CRYPTO_TEST=y # # Hardware crypto devices # # # Library routines # CONFIG_CRC_CCITT=y CONFIG_CRC32=y CONFIG_LIBCRC32C=y CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=y From nacc at us.ibm.com Thu Mar 3 06:28:11 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 2 Mar 2005 11:28:11 -0800 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: <20050302192005.GA21615@pants.nu> References: <20050302181206.GA2741@us.ibm.com> <20050302192005.GA21615@pants.nu> Message-ID: <20050302192811.GD2741@us.ibm.com> On Wed, Mar 02, 2005 at 11:20:06AM -0800, Brad Boyer wrote: > On Wed, Mar 02, 2005 at 10:12:06AM -0800, Nishanth Aravamudan wrote: > > Now, to the build-blocking code: > > > > In drivers/macintosh/adbhid.c::1159: > > > > static int __init adbhid_init(void) > > { > > #ifndef CONFIG_MAC > > if ( (_machine != _MACH_chrp) && (_machine != _MACH_Pmac) ) > > return 0; > > #endif > > ... > > > > I don't see CONFIG_MAC in my .config (attached below [1]), and _MACH_chrp is > > not defined for ppc64 (it's in asm-ppc/processor.h not in > > ppc64/processor.h. I just removed the _MACH_chrp conditional in my local > > code to get the kernel to build. I'm not sure what the actual solution > > is, but I thought you all should know about it. > > The CONFIG_MAC symbol is defined for mac68k support. The m68k arch has a > different way of telling the various machines apart, although I notice > that there isn't an equivalent block here for that. I guess no other 68k > people cared enough to make sure the ADB layer doesn't load on their boxes. > > In my opinion, the machine selectors should be reconciled between ppc and > ppc64 due to the amount of code that expects them to act the same. 
So > even though you wouldn't have a CHRP ppc64 box, the define will be there. I definitely think this is the ideal solution. I did notice, though, that the _MACH_Pmac #define varies between ppc and ppc64, so I'm not sure how that would work for _MACH_chrp. I am more than happy to test code, generate patches (if you tell me what you'd like me to change), etc. Thanks, Nish From flar at allandria.com Thu Mar 3 06:20:06 2005 From: flar at allandria.com (Brad Boyer) Date: Wed, 2 Mar 2005 11:20:06 -0800 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: <20050302181206.GA2741@us.ibm.com> References: <20050302181206.GA2741@us.ibm.com> Message-ID: <20050302192005.GA21615@pants.nu> On Wed, Mar 02, 2005 at 10:12:06AM -0800, Nishanth Aravamudan wrote: > Now, to the build-blocking code: > > In drivers/macintosh/adbhid.c::1159: > > static int __init adbhid_init(void) > { > #ifndef CONFIG_MAC > if ( (_machine != _MACH_chrp) && (_machine != _MACH_Pmac) ) > return 0; > #endif > ... > > I don't see CONFIG_MAC in my .config (attached below [1]), and _MACH_chrp is > not defined for ppc64 (it's in asm-ppc/processor.h not in > ppc64/processor.h. I just removed the _MACH_chrp conditional in my local > code to get the kernel to build. I'm not sure what the actual solution > is, but I thought you all should know about it. The CONFIG_MAC symbol is defined for mac68k support. The m68k arch has a different way of telling the various machines apart, although I notice that there isn't an equivalent block here for that. I guess no other 68k people cared enough to make sure the ADB layer doesn't load on their boxes. In my opinion, the machine selectors should be reconciled between ppc and ppc64 due to the amount of code that expects them to act the same. So even though you wouldn't have a CHRP ppc64 box, the define will be there.
Brad Boyer flar at allandria.com From johnrose at austin.ibm.com Thu Mar 3 08:10:37 2005 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 02 Mar 2005 15:10:37 -0600 Subject: [PATCH] error code cleanups for rtas wrappers Message-ID: <1109797837.9434.2.camel@sinatra.austin.ibm.com> This patch changes the rtas wrapper functions in rtas.c to map RTAS failures to conventional error values. The goal is to make failure conditions obvious in the wrapper functions and in the caller code. Flame away :) John Signed-off-by: John Rose diff -puN arch/ppc64/kernel/pSeries_smp.c~01_rtas_rcs arch/ppc64/kernel/pSeries_smp.c --- 2_6_linus_3/arch/ppc64/kernel/pSeries_smp.c~01_rtas_rcs 2005-03-02 14:50:33.000000000 -0600 +++ 2_6_linus_3-johnrose/arch/ppc64/kernel/pSeries_smp.c 2005-03-02 14:50:33.000000000 -0600 @@ -151,7 +151,7 @@ static unsigned int find_physical_cpu_to if (index) { int state; int rc = rtas_get_sensor(9003, *index, &state); - if (rc != 0 || state != 1) + if (rc < 0 || state != 1) continue; } diff -puN arch/ppc64/kernel/rtas.c~01_rtas_rcs arch/ppc64/kernel/rtas.c --- 2_6_linus_3/arch/ppc64/kernel/rtas.c~01_rtas_rcs 2005-03-02 14:50:33.000000000 -0600 +++ 2_6_linus_3-johnrose/arch/ppc64/kernel/rtas.c 2005-03-02 14:50:33.000000000 -0600 @@ -255,29 +255,59 @@ rtas_extended_busy_delay_time(int status return ms; } -int -rtas_get_power_level(int powerdomain, int *level) +int rtas_error_rc(int rtas_rc) +{ + int rc; + + switch (rtas_rc) { + case -1: /* Hardware Error */ + rc = -EIO; + break; + case -3: /* Bad indicator/domain/etc */ + rc = -EINVAL; + break; + case -9000: /* Isolation error */ + rc = -EFAULT; + break; + case -9001: /* Outstanding TCE/PTE */ + rc = -EEXIST; + break; + case -9002: /* No usable slot */ + rc = -ENODEV; + break; + default: + printk(KERN_ERR "%s: unexpected RTAS error %d\n", + __FUNCTION__, rtas_rc); + rc = -ERANGE; + break; + } + return rc; +} + +int rtas_get_power_level(int powerdomain, int *level) { int token = 
rtas_token("get-power-level"); int rc; if (token == RTAS_UNKNOWN_SERVICE) - return RTAS_UNKNOWN_OP; + return -ENOENT; while ((rc = rtas_call(token, 1, 2, level, powerdomain)) == RTAS_BUSY) udelay(1); + + if (rc < 0) + return rtas_error_rc(rc); return rc; } -int -rtas_set_power_level(int powerdomain, int level, int *setlevel) +int rtas_set_power_level(int powerdomain, int level, int *setlevel) { int token = rtas_token("set-power-level"); unsigned int wait_time; int rc; if (token == RTAS_UNKNOWN_SERVICE) - return RTAS_UNKNOWN_OP; + return -ENOENT; while (1) { rc = rtas_call(token, 2, 2, setlevel, powerdomain, level); @@ -289,18 +319,20 @@ rtas_set_power_level(int powerdomain, in } else break; } + + if (rc < 0) + return rtas_error_rc(rc); return rc; } -int -rtas_get_sensor(int sensor, int index, int *state) +int rtas_get_sensor(int sensor, int index, int *state) { int token = rtas_token("get-sensor-state"); unsigned int wait_time; int rc; if (token == RTAS_UNKNOWN_SERVICE) - return RTAS_UNKNOWN_OP; + return -ENOENT; while (1) { rc = rtas_call(token, 2, 2, state, sensor, index); @@ -312,18 +344,20 @@ rtas_get_sensor(int sensor, int index, i } else break; } + + if (rc < 0) + return rtas_error_rc(rc); return rc; } -int -rtas_set_indicator(int indicator, int index, int new_value) +int rtas_set_indicator(int indicator, int index, int new_value) { int token = rtas_token("set-indicator"); unsigned int wait_time; int rc; if (token == RTAS_UNKNOWN_SERVICE) - return RTAS_UNKNOWN_OP; + return -ENOENT; while (1) { rc = rtas_call(token, 3, 1, NULL, indicator, index, new_value); @@ -337,6 +371,8 @@ rtas_set_indicator(int indicator, int in break; } + if (rc < 0) + return rtas_error_rc(rc); return rc; } diff -puN arch/ppc64/kernel/rtasd.c~01_rtas_rcs arch/ppc64/kernel/rtasd.c --- 2_6_linus_3/arch/ppc64/kernel/rtasd.c~01_rtas_rcs 2005-03-02 14:50:33.000000000 -0600 +++ 2_6_linus_3-johnrose/arch/ppc64/kernel/rtasd.c 2005-03-02 14:50:33.000000000 -0600 @@ -347,7 +347,7 @@ static int 
enable_surveillance(int timeo if (error == 0) return 0; - if (error == RTAS_NO_SUCH_INDICATOR) { + if (error == -EINVAL) { printk(KERN_INFO "rtasd: surveillance not supported\n"); return 0; } diff -puN arch/ppc64/kernel/xics.c~01_rtas_rcs arch/ppc64/kernel/xics.c --- 2_6_linus_3/arch/ppc64/kernel/xics.c~01_rtas_rcs 2005-03-02 14:50:33.000000000 -0600 +++ 2_6_linus_3-johnrose/arch/ppc64/kernel/xics.c 2005-03-02 14:50:33.000000000 -0600 @@ -654,7 +654,7 @@ void xics_migrate_irqs_away(void) /* remove ourselves from the global interrupt queue */ status = rtas_set_indicator(GLOBAL_INTERRUPT_QUEUE, (1UL << interrupt_server_size) - 1 - default_distrib_server, 0); - WARN_ON(status != 0); + WARN_ON(status < 0); /* Allow IPIs again... */ ops->cppr_info(cpu, DEFAULT_PRIORITY); diff -puN include/asm-ppc64/rtas.h~01_rtas_rcs include/asm-ppc64/rtas.h --- 2_6_linus_3/include/asm-ppc64/rtas.h~01_rtas_rcs 2005-03-02 14:50:33.000000000 -0600 +++ 2_6_linus_3-johnrose/include/asm-ppc64/rtas.h 2005-03-02 14:50:33.000000000 -0600 @@ -24,12 +24,9 @@ /* RTAS return status codes */ #define RTAS_BUSY -2 /* RTAS Busy */ -#define RTAS_NO_SUCH_INDICATOR -3 /* No such indicator implemented */ #define RTAS_EXTENDED_DELAY_MIN 9900 #define RTAS_EXTENDED_DELAY_MAX 9905 -#define RTAS_UNKNOWN_OP -1099 /* Unknown RTAS Token */ - /* * In general to call RTAS use rtas_token("string") to lookup * an RTAS token for the given string (e.g. "event-scan"). _ From moilanen at austin.ibm.com Thu Mar 3 09:34:12 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 2 Mar 2005 16:34:12 -0600 Subject: [PATCH][RFC] unlikely spinlocks Message-ID: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> On our raw spinlocks, we currently have an attempt at the lock, and if we do not get it we enter a spin loop. This spin loop will likely continue for a while, and we predict likely. Shouldn't we predict that we will get out of the loop so our next instructions are already prefetched?
Even when we miss because the lock is still held, it won't matter since we are waiting anyways. I did a couple quick benchmarks, but the results are inconclusive. 16-way 690 running specjbb with original code # ./specjbb 3000 16 1 1 19 30 120 ... Valid run, Score is 59282 16-way 690 running specjbb with unlikely code # ./specjbb 3000 16 1 1 19 30 120 ... Valid run, Score is 59541 I saw a smaller increase on a JS20 (~1.6%) JS20 specjbb w/ original code # ./specjbb 400 2 1 1 19 30 120 ... Valid run, Score is 20460 JS20 specjbb w/ unlikely code # ./specjbb 400 2 1 1 19 30 120 ... Valid run, Score is 20803 Jake Signed-off-by: Jake Moilanen --- diff -puN include/asm-ppc64/spinlock.h~unlikely-spinlocks include/asm-ppc64/spinlock.h --- linux-2.6-bk/include/asm-ppc64/spinlock.h~unlikely-spinlocks Wed Mar 2 13:55:39 2005 +++ linux-2.6-bk-moilanen/include/asm-ppc64/spinlock.h Wed Mar 2 13:55:40 2005 @@ -110,7 +110,7 @@ static void __inline__ _raw_spin_lock(sp HMT_low(); if (SHARED_PROCESSOR) __spin_yield(lock); - } while (likely(lock->lock != 0)); + } while (unlikely(lock->lock != 0)); HMT_medium(); } } @@ -128,7 +128,7 @@ static void __inline__ _raw_spin_lock_fl HMT_low(); if (SHARED_PROCESSOR) __spin_yield(lock); - } while (likely(lock->lock != 0)); + } while (unlikely(lock->lock != 0)); HMT_medium(); local_irq_restore(flags_dis); } @@ -194,7 +194,7 @@ static void __inline__ _raw_read_lock(rw HMT_low(); if (SHARED_PROCESSOR) __rw_yield(rw); - } while (likely(rw->lock < 0)); + } while (unlikely(rw->lock < 0)); HMT_medium(); } } @@ -251,7 +251,7 @@ static void __inline__ _raw_write_lock(r HMT_low(); if (SHARED_PROCESSOR) __rw_yield(rw); - } while (likely(rw->lock != 0)); + } while (unlikely(rw->lock != 0)); HMT_medium(); } } _ From benh at kernel.crashing.org Thu Mar 3 10:39:16 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 03 Mar 2005 10:39:16 +1100 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: 
<20050302181206.GA2741@us.ibm.com> References: <20050302181206.GA2741@us.ibm.com> Message-ID: <1109806756.5680.127.camel@gaston> On Wed, 2005-03-02 at 10:12 -0800, Nishanth Aravamudan wrote: > > These warnings are emitted for pretty much every driver. It looks like > it is becuase with CONFIG_EEH undefined (it's a pSeries thing? -- my > interpretation from looking at the ppc64 Kconfig), eeh_check_failure() > becomes #define'd to simply it's second parameter, which in the case of > assignment statements ia statement with no effect. It's not a big deal, > the kernels still compile (with a patch to adbhid.c which I'll mention > in a second) but it's a lot of noise to be generated because I don't > have a pSeries machine... Stupid gcc :) > Now, to the build-blocking code: > > In drivers/macintosh/adbhid.c::1159: > > static int __init adbhid_init(void) > { > #ifndef CONFIG_MAC > if ( (_machine != _MACH_chrp) && (_machine != _MACH_Pmac) ) > return 0; > #endif > ... > > I don't see CONFIG_MAC in my .config (attached below [1]), and _MACH_chrp is > not defined for ppc64 (it's in asm-ppc/processor.h not in > ppc64/processor.h. I just removed the _MACH_chrp conditional in my local > code to get the kernel to build. I'm not sure what the actual solution > is, but I thought you all should know about it. There is no ADB bus on a G5, so the driver isn't useful anyway. Currently, ppc64 allows you to enable pmac drivers that won't build, but they also are useless on G5s. I'll fix that over time though. Ben. 
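As an illustration of the "statement with no effect" warning discussed in this thread, here is a minimal standalone sketch in C. The names check_failure_macro, check_failure_inline and read_and_check are hypothetical stand-ins chosen for this example; they are not the real eeh.h definitions, and the exact warning text depends on the compiler version and flags.

```c
#include <stddef.h>

/*
 * Hypothetical stand-ins for stubs used when a feature is compiled out.
 * These names are illustrative only; they are not the real eeh.h symbols.
 */

/* Macro stub: a call used as a bare statement expands to "(val);",
 * which gcc -Wall may flag as a statement with no effect. */
#define check_failure_macro(token, val) (val)

/* Inline stub: the same call is an ordinary function call whose result
 * may be discarded, and its arguments are type-checked. */
static inline unsigned long check_failure_inline(const volatile void *token,
                                                 unsigned long val)
{
    (void)token;    /* unused in the stub */
    return val;
}

/* Example caller: the first call ignores the return value, the second
 * one uses it, so neither form should trigger the warning here. */
static inline unsigned long read_and_check(const unsigned long *src)
{
    unsigned long v = *src;

    check_failure_inline(src, v);       /* a real call, result discarded */
    return check_failure_macro(src, v); /* value is used */
}
```

Writing "check_failure_macro(src, v);" on a line of its own is the case that would reduce to a bare expression statement. An inline stub avoids that without changing what the caller sees.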
From nacc at us.ibm.com Thu Mar 3 11:23:08 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Wed, 2 Mar 2005 16:23:08 -0800 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: <1109806756.5680.127.camel@gaston> References: <20050302181206.GA2741@us.ibm.com> <1109806756.5680.127.camel@gaston> Message-ID: <20050303002308.GO2741@us.ibm.com> On Thu, Mar 03, 2005 at 10:39:16AM +1100, Benjamin Herrenschmidt wrote: > On Wed, 2005-03-02 at 10:12 -0800, Nishanth Aravamudan wrote: > > > > These warnings are emitted for pretty much every driver. It looks like > > it is becuase with CONFIG_EEH undefined (it's a pSeries thing? -- my > > interpretation from looking at the ppc64 Kconfig), eeh_check_failure() > > becomes #define'd to simply it's second parameter, which in the case of > > assignment statements ia statement with no effect. It's not a big deal, > > the kernels still compile (with a patch to adbhid.c which I'll mention > > in a second) but it's a lot of noise to be generated because I don't > > have a pSeries machine... > > Stupid gcc :) > > > Now, to the build-blocking code: > > > > In drivers/macintosh/adbhid.c::1159: > > > > static int __init adbhid_init(void) > > { > > #ifndef CONFIG_MAC > > if ( (_machine != _MACH_chrp) && (_machine != _MACH_Pmac) ) > > return 0; > > #endif > > ... > > > > I don't see CONFIG_MAC in my .config (attached below [1]), and _MACH_chrp is > > not defined for ppc64 (it's in asm-ppc/processor.h not in > > ppc64/processor.h. I just removed the _MACH_chrp conditional in my local > > code to get the kernel to build. I'm not sure what the actual solution > > is, but I thought you all should know about it. > > There is no ADB bus on a G5, so the driver isn't useful anyway. > Currently, ppc64 allows you to enable pmac drivers that won't build, but > they also are useless on G5s. I'll fix that over time though. Ok, I'll take it out of the config. Thanks! 
-Nish From jschopp at austin.ibm.com Thu Mar 3 11:55:47 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Wed, 02 Mar 2005 18:55:47 -0600 Subject: [PATCH][RFC] unlikely spinlocks In-Reply-To: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> References: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> Message-ID: <42266093.5080101@austin.ibm.com> Jake Moilanen wrote: > On our raw spinlocks, we currently have an attempt at the lock, and if > we do not get it we enter a spin loop. This spinloop will likely > continue for awhile, and we pridict likely. > > Shouldn't we predict that we will get out of the loop so our next > instructions are already prefetched. Even when we miss because the lock > is still held, it won't matter since we are waiting anyways. I agree with you in principle. It would be nice to have some better supporting measurements as well though. > > I did a couple quick benchmarks, but the results are inconclusive. > > 16-way 690 running specjbb with original code > # ./specjbb 3000 16 1 1 19 30 120 > ... > Valid run, Score is 59282 > > 16-way 690 running specjbb with unlikely code > # ./specjbb 3000 16 1 1 19 30 120 > ... > Valid run, Score is 59541 > > I saw a smaller increase on a JS20 (~1.6%) Percentage wise the 690 increase was smaller > > JS20 specjbb w/ original code > # ./specjbb 400 2 1 1 19 30 120 > ... > Valid run, Score is 20460 > > > JS20 specjbb w/ unlikely code > # ./specjbb 400 2 1 1 19 30 120 > ... > Valid run, Score is 20803 > > Jake My guess is there is some variance in specjbb runs. The variance might be greater than the amount of improvement. It is still possible to use statistics to show the amount of the increase. If you could get me the results of say 12 runs on each kernel I could do the analysis for you. On a side note. Do you have the assembly generated by _raw_spin_lock() and brethren? They always get inlined so I doubt a simple objdump would do it. I'm curious how good the compiler is at optimizing away things. 
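The kind of analysis suggested above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the helper names (pct_change, mean, sample_variance, welch_t_squared) are made up for this example, Welch's t-test is one possible choice of test and is not named anywhere in the thread, and the only benchmark numbers used are the single scores already posted.

```c
#include <stddef.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Arithmetic mean of n samples; requires n >= 1. */
static double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Unbiased sample variance (divides by n - 1); requires n >= 2. */
static double sample_variance(const double *x, size_t n)
{
    double m = mean(x, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (x[i] - m) * (x[i] - m);
    return acc / (double)(n - 1);
}

/*
 * Square of Welch's t statistic for two independent samples.
 * The square root is left to the caller so this sketch needs no libm.
 * Assumes the two samples are not both constant (non-zero denominator).
 */
static double welch_t_squared(const double *a, size_t na,
                              const double *b, size_t nb)
{
    double diff = mean(b, nb) - mean(a, na);
    double se2 = sample_variance(a, na) / (double)na
               + sample_variance(b, nb) / (double)nb;
    return diff * diff / se2;
}
```

With the single scores posted in this thread, pct_change(59282, 59541) comes to roughly 0.44 and pct_change(20460, 20803) to roughly 1.68, so the 690 gain is indeed the smaller one in percentage terms. One run per kernel gives no variance estimate at all, which is why several runs per kernel are needed before welch_t_squared has anything to work with.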
From xma at us.ibm.com Thu Mar 3 12:58:58 2005 From: xma at us.ibm.com (Shirley Ma) Date: Thu, 3 Mar 2005 01:58:58 +0000 Subject: PCI: Unable to reserve mem region Message-ID: When I loaded Mellanox device driver on P615, I hit below problem: Mar 2 14:31:39 elm3b5 kernel: ib_mthca: Initializing (0000:62:00.0) Mar 2 14:31:39 elm3b5 kernel: PCI: Enabling device: (0000:62:00.0), cmd 142 Mar 2 14:31:39 elm3b5 kernel: PCI: Unable to reserve mem region #5:8000000 at 3fcc 0000000 for device 0000:62:00.0 Mar 2 14:31:39 elm3b5 kernel: ib_mthca 0000:62:00.0: Cannot obtain PCI resource s, aborting. Mar 2 14:31:39 elm3b5 kernel: ib_mchca: probe of 0000:62:00.0 failed with error -16 Anybody has any idea to fix this problem? Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From ntl at pobox.com Thu Mar 3 14:37:57 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 2 Mar 2005 21:37:57 -0600 Subject: [PATCH] fix eeh.h compile warnings In-Reply-To: <20050302181206.GA2741@us.ibm.com> References: <20050302181206.GA2741@us.ibm.com> Message-ID: <20050303033757.GD5897@otto> On Wed, Mar 02, 2005 at 07:41:13PM -0600, Nishanth Aravamudan wrote: > > While building 2.6.11 for a G5, I noticed the following errors being > spit out (gcc 3.3.5): > > include/asm/eeh.h: In function `eeh_memcpy_fromio': > include/asm/eeh.h:265: warning: statement with no effect > include/asm/eeh.h: In function `eeh_insb': > include/asm/eeh.h:353: warning: statement with no effect > include/asm/eeh.h: In function `eeh_insw_ns': > include/asm/eeh.h:360: warning: statement with no effect > include/asm/eeh.h: In function `eeh_insl_ns': > include/asm/eeh.h:367: warning: statement with no effect > include/asm/eeh.h: In function `eeh_memcpy_fromio': > include/asm/eeh.h:265: warning: 
statement with no effect > include/asm/eeh.h: In function `eeh_insb': > include/asm/eeh.h:353: warning: statement with no effect > include/asm/eeh.h: In function `eeh_insw_ns': > include/asm/eeh.h:360: warning: statement with no effect > include/asm/eeh.h: In function `eeh_insl_ns': > include/asm/eeh.h:367: warning: statement with no effect > > These warnings are emitted for pretty much every driver. It looks like > it is becuase with CONFIG_EEH undefined (it's a pSeries thing? -- my > interpretation from looking at the ppc64 Kconfig), eeh_check_failure() > becomes #define'd to simply it's second parameter, which in the case of > assignment statements ia statement with no effect I don't have a toolchain readily available which gives these warnings, but does this fix them? Use static inlines instead of #defines for stub functions when CONFIG_EEH=n. Signed-off-by: Nathan Lynch Index: linux-2.6.11/include/asm-ppc64/eeh.h =================================================================== --- linux-2.6.11.orig/include/asm-ppc64/eeh.h 2005-03-02 07:38:38.000000000 +0000 +++ linux-2.6.11/include/asm-ppc64/eeh.h 2005-03-03 01:39:25.000000000 +0000 @@ -104,17 +104,30 @@ int eeh_unregister_notifier(struct notif */ #define EEH_IO_ERROR_VALUE(size) (~0U >> ((4 - (size)) * 8)) -#else -#define eeh_init() -#define eeh_check_failure(token, val) (val) -#define eeh_dn_check_failure(dn, dev) (0) -#define pci_addr_cache_build() -#define eeh_add_device_early(dn) -#define eeh_add_device_late(dev) -#define eeh_remove_device(dev) +#else /* !CONFIG_EEH */ +static inline void eeh_init(void) { } + +static inline unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val) +{ + return val; +} + +static inline int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) +{ + return 0; +} + +static inline void pci_addr_cache_build(void) { } + +static inline void eeh_add_device_early(struct device_node *dn) { } + +static inline void eeh_add_device_late(struct 
pci_dev *dev) { } + +static inline void eeh_remove_device(struct pci_dev *dev) { } + #define EEH_POSSIBLE_ERROR(val, type) (0) #define EEH_IO_ERROR_VALUE(size) (-1UL) -#endif +#endif /* CONFIG_EEH */ /* * MMIO read/write operations with EEH support. From apw at us.ibm.com Thu Mar 3 11:59:48 2005 From: apw at us.ibm.com (Amos Waterland) Date: Wed, 2 Mar 2005 19:59:48 -0500 Subject: [patch] init_boot_display link error Message-ID: <20050303005948.GA691@kvasir.austin.ibm.com> In pmac_setup.c, the function init_boot_display as currently written only makes sense with CONFIG_BOOTX_TEXT enabled, and causes a link error if it is not enabled. Signed-off-by: Amos Waterland ===== arch/ppc64/kernel/pmac_setup.c 1.15 vs edited ===== --- 1.15/arch/ppc64/kernel/pmac_setup.c 2005-01-08 00:43:52 -05:00 +++ edited/arch/ppc64/kernel/pmac_setup.c 2005-03-02 19:37:31 -05:00 @@ -244,7 +244,6 @@ { btext_drawchar(c); } -#endif /* CONFIG_BOOTX_TEXT */ static void __init init_boot_display(void) { @@ -280,6 +279,7 @@ return; } } +#endif /* CONFIG_BOOTX_TEXT */ /* * Early initialization. From benh at kernel.crashing.org Thu Mar 3 16:08:14 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 03 Mar 2005 16:08:14 +1100 Subject: PCI: Unable to reserve mem region In-Reply-To: References: Message-ID: <1109826494.5679.174.camel@gaston> On Thu, 2005-03-03 at 01:58 +0000, Shirley Ma wrote: > > When I loaded Mellanox device driver on P615, I hit below problem: > > Mar 2 14:31:39 elm3b5 kernel: ib_mthca: Initializing (0000:62:00.0) > Mar 2 14:31:39 elm3b5 kernel: PCI: Enabling device: (0000:62:00.0), > cmd 142 > Mar 2 14:31:39 elm3b5 kernel: PCI: Unable to reserve mem region > #5:8000000 at 3fcc > 0000000 for device 0000:62:00.0 > Mar 2 14:31:39 elm3b5 kernel: ib_mthca 0000:62:00.0: Cannot obtain > PCI resource > s, aborting. > Mar 2 14:31:39 elm3b5 kernel: ib_mchca: probe of 0000:62:00.0 failed > with error > -16 Is it possible to have remote access to the machine & HMC ? Ben. 
From xma at us.ibm.com Thu Mar 3 17:24:54 2005 From: xma at us.ibm.com (Shirley Ma) Date: Wed, 2 Mar 2005 22:24:54 -0800 Subject: PCI: Unable to reserve mem region In-Reply-To: <1109826494.5679.174.camel@gaston> Message-ID: > Is it possible to have remote access to the machine & HMC ? Sorry. It can't be reachable. If you have any suggestion to debug this problem, I can try it out. I installed 2.6.10 kernel with openib.org Gen2 mthca driver. This driver works OK on both ia64 and ia32. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 From benh at kernel.crashing.org Thu Mar 3 17:24:33 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 03 Mar 2005 17:24:33 +1100 Subject: PCI: Unable to reserve mem region In-Reply-To: References: Message-ID: <1109831073.5610.185.camel@gaston> On Wed, 2005-03-02 at 22:24 -0800, Shirley Ma wrote: > > Is it possible to have remote access to the machine & HMC ? > Sorry. It can't be reachable. > If you have any suggestion to debug this problem, I can try it out. I > installed 2.6.10 kernel with openib.org Gen2 mthca driver. This driver > works OK on both ia64 and ia32. The problem is a core kernel PCI allocation issue it seems, it will require quite a bit of debugging to figure out what's wrong I'm afraid... Ben. From anton at samba.org Thu Mar 3 17:20:05 2005 From: anton at samba.org (Anton Blanchard) Date: Thu, 3 Mar 2005 17:20:05 +1100 Subject: PCI: Unable to reserve mem region In-Reply-To: References: <1109826494.5679.174.camel@gaston> Message-ID: <20050303062005.GA16915@krispykreme.ozlabs.ibm.com> > Sorry. It can't be reachable. > If you have any suggestion to debug this problem, I can try it out. I > installed 2.6.10 kernel with openib.org Gen2 mthca driver. 
This driver > works OK on both ia64 and ia32. How big is the PCI MMIO window on this card? lspci -v should give us this information. Anton From sonny at burdell.org Thu Mar 3 18:02:22 2005 From: sonny at burdell.org (Sonny Rao) Date: Thu, 3 Mar 2005 02:02:22 -0500 Subject: [PATCH][RFC] unlikely spinlocks In-Reply-To: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> References: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> Message-ID: <20050303070222.GA26059@kevlar.burdell.org> On Wed, Mar 02, 2005 at 04:34:12PM -0600, Jake Moilanen wrote: > On our raw spinlocks, we currently have an attempt at the lock, and if > we do not get it we enter a spin loop. This spinloop will likely > continue for a while, and we predict likely. > > Shouldn't we predict that we will get out of the loop so our next > instructions are already prefetched? Even when we miss because the lock > is still held, it won't matter since we are waiting anyways. > > I did a couple quick benchmarks, but the results are inconclusive. > > 16-way 690 running specjbb with original code > # ./specjbb 3000 16 1 1 19 30 120 > ... > Valid run, Score is 59282 > > 16-way 690 running specjbb with unlikely code > # ./specjbb 3000 16 1 1 19 30 120 > ... > Valid run, Score is 59541 > > I saw a smaller increase on a JS20 (~1.6%) > > JS20 specjbb w/ original code > # ./specjbb 400 2 1 1 19 30 120 > ... > Valid run, Score is 20460 > > > JS20 specjbb w/ unlikely code > # ./specjbb 400 2 1 1 19 30 120 > ... > Valid run, Score is 20803 Hmm, I doubt you want to use specjbb to show spin-lock contention. Unless I'm missing something, jbb scales really well in terms of the kernel; most of the benchmark runs in userspace and the JVM's own locking strategies probably have a bigger impact on performance than the kernel's _raw_spin_lock() implementation. I should probably have Java Perf. guys get oprofile data for jbb to confirm this conclusively.
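[Editorial sketch, not part of the original message.] The change proposed in the quoted text amounts to flipping the branch hint on the retry loop of a spinlock. The following is only a user-space analogy under stated assumptions: it uses C11 atomics and the GCC/Clang `__builtin_expect` builtin, and every `sketch_` name is hypothetical. It is not the ppc64 `_raw_spin_lock()` code, which is not shown in this thread.

```c
#include <stdatomic.h>

/* Hypothetical hint macros, modelled on the usual likely/unlikely idiom.
 * __builtin_expect is a GCC/Clang extension, assumed to be available. */
#define sketch_likely(x)   __builtin_expect(!!(x), 1)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical lock type: a single atomic flag. */
struct sketch_lock {
    atomic_flag held;
};

/* Acquire the lock and return how many failed retries were needed. */
static unsigned long sketch_lock_acquire(struct sketch_lock *l)
{
    unsigned long spins = 0;

    /* Fast path: a single attempt, hinted as the common case. */
    if (sketch_likely(!atomic_flag_test_and_set_explicit(&l->held,
                                                         memory_order_acquire)))
        return 0;

    /* Slow path: staying in the loop is hinted as the unlikely branch,
     * so the code after the loop is laid out as the expected path. */
    while (sketch_unlikely(atomic_flag_test_and_set_explicit(&l->held,
                                                             memory_order_acquire)))
        spins++;

    return spins;
}

static void sketch_lock_release(struct sketch_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Whether the hint helps in practice depends on contention, which is the point of the benchmark discussion here; the sketch only shows where the hint sits.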
If you use FFSB with enough threads doing lots of file-descriptor activity, you'll see tons of lock contention on the fget_light function. This is a pretty well known scalability problem, and I've been able to drive my 16-way LPAR to > 40% spin_lock time doing things like this with FFSB and tons of threads. Sonny From moilanen at austin.ibm.com Fri Mar 4 06:40:34 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Thu, 3 Mar 2005 13:40:34 -0600 Subject: [PATCH] PCI address getting truncated to 32-bits Message-ID: <20050303134034.779c79e0.moilanen@austin.ibm.com> While looking at another problem, I ran across this. It looks like we are truncating our PCI addresses coming out of "assigned-addresses" to 32 bits. Signed-off-by: Jake Moilanen -- diff -puN arch/ppc64/kernel/prom.c~offb_dsi arch/ppc64/kernel/prom.c --- linux-2.6.11/arch/ppc64/kernel/prom.c~offb_dsi Thu Mar 3 10:23:22 2005 +++ linux-2.6.11-moilanen/arch/ppc64/kernel/prom.c Thu Mar 3 13:25:54 2005 @@ -333,7 +333,7 @@ static unsigned long __init interpret_pc while ((l -= sizeof(struct pci_reg_property)) >= 0) { if (!measure_only) { adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo; + adr[i].address = ((unsigned long)pci_addrs[i].addr.a_mid << 32) | pci_addrs[i].addr.a_lo; adr[i].size = pci_addrs[i].size_lo; } ++i; From markus at unixforces.net Fri Mar 4 06:54:32 2005 From: markus at unixforces.net (Markus Rothe) Date: Thu, 3 Mar 2005 19:54:32 +0000 Subject: Display problems with kernel > 2.6.9 Message-ID: <20050303195432.GA9010@unixforces.net> Hi, I have a problem with my Apple Cinema Display if I use kernel versions above 2.6.9. The display is connected through the Apple Display Connector (ADC) to my G5 and its ATI Radeon 9600 graphics card. The problem is that there are many "blue lightnings" all over the display. By blue lightning I mean a small set of pixels which turn light blue for about half a second. This happens both in console mode and if I run Xorg.
I've taken three screenshots available at [1], [2] and [3]. The screenshots have been taken while running kernel-2.6.11, but this problem occurred with all kernels after 2.6.9. Markus [1] http://www.unixforces.net/downloads/blue_lightning_1.png (~1.9 MB) [2] http://www.unixforces.net/downloads/blue_lightning_2.png (~2.0 MB) [3] http://www.unixforces.net/downloads/blue_lightning_3.png (~1.9 MB) From moilanen at austin.ibm.com Fri Mar 4 09:02:08 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Thu, 3 Mar 2005 16:02:08 -0600 Subject: [PATCH] PCI address getting truncated to 32-bits In-Reply-To: <20050303134034.779c79e0.moilanen@austin.ibm.com> References: <20050303134034.779c79e0.moilanen@austin.ibm.com> Message-ID: <20050303160208.242e29a9.moilanen@austin.ibm.com> On Thu, 3 Mar 2005 13:40:34 -0600 Jake Moilanen wrote: > While looking at another problem, I ran across this. It looks like we > are truncating our PCI addresses coming out of "assigned-addresses" to > 32 bits. Probably need it for of_finish_dynamic_node() too.
Signed-off-by: Jake Moilanen --- diff -puN arch/ppc64/kernel/prom.c~offb_dsi arch/ppc64/kernel/prom.c --- linux-2.6.11/arch/ppc64/kernel/prom.c~offb_dsi Thu Mar 3 10:23:22 2005 +++ linux-2.6.11-moilanen/arch/ppc64/kernel/prom.c Thu Mar 3 16:09:02 2005 @@ -333,7 +333,8 @@ static unsigned long __init interpret_pc while ((l -= sizeof(struct pci_reg_property)) >= 0) { if (!measure_only) { adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo; + adr[i].address = ((unsigned long)pci_addrs[i].addr.a_mid << 32) + | pci_addrs[i].addr.a_lo; adr[i].size = pci_addrs[i].size_lo; } ++i; @@ -1712,7 +1713,8 @@ static int of_finish_dynamic_node(struct } while ((l -= sizeof(struct pci_reg_property)) >= 0) { adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo; + adr[i].address = ((unsigned long)pci_addrs[i].addr.a_mid << 32) + | pci_addrs[i].addr.a_lo; adr[i].size = pci_addrs[i].size_lo; ++i; } From paulus at samba.org Fri Mar 4 20:18:39 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 4 Mar 2005 20:18:39 +1100 Subject: RFC/Patch more xmon additions In-Reply-To: <421E3BE3.90301@vnet.ibm.com> References: <421E3BE3.90301@vnet.ibm.com> Message-ID: <16936.10223.704710.234312@cargo.ozlabs.ibm.com> will schmidt writes: > Am looking for comments on this additional function i've added to xmon > on the side.. > > the bulk of my intent was to make it easier for me to poke at memory > within a particular user process. The main problem I have with it is that we seem to be accessing a lot of kernel data structures without checking any pointers or using mread() to read the memory safely. One of the goals of xmon is that it should be as reliable as possible even if kernel data structures are corrupted, and I think your patch would reduce that reliability. Also, I'm not sure that there is any point doing a spin_trylock(), since all cpus are supposed to be in xmon by the time you get to a command prompt. 
By all means bail out if spin_is_locked() returns true, but I don't see the need to actually take the lock. Regards, Paul. From linas at austin.ibm.com Sat Mar 5 03:42:36 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Fri, 4 Mar 2005 10:42:36 -0600 Subject: [PATCH] fix eeh.h compile warnings In-Reply-To: <20050303033757.GD5897@otto> References: <20050302181206.GA2741@us.ibm.com> <20050303033757.GD5897@otto> Message-ID: <20050304164236.GT1220@austin.ibm.com> On Wed, Mar 02, 2005 at 09:37:57PM -0600, Nathan Lynch was heard to remark: > > I don't have a toolchain readily available which gives these warnings, > but does this fix them? I think it should. > Use static inlines instead of #defines for stub functions when > CONFIG_EEH=n. It's more elegant your way anyway ... > Signed-off-by: Nathan Lynch From olof at austin.ibm.com Sat Mar 5 03:57:17 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Fri, 4 Mar 2005 10:57:17 -0600 Subject: [PATCH] fix eeh.h compile warnings In-Reply-To: <20050303033757.GD5897@otto> References: <20050302181206.GA2741@us.ibm.com> <20050303033757.GD5897@otto> Message-ID: <20050304165717.GA5789@austin.ibm.com> On Wed, Mar 02, 2005 at 09:37:57PM -0600, Nathan Lynch wrote: > I don't have a toolchain readily available which gives these warnings, > but does this fix them? Yep, it does here. > Use static inlines instead of #defines for stub functions when > CONFIG_EEH=n.
> > Signed-off-by: Nathan Lynch Acked-by: Olof Johansson From nacc at us.ibm.com Sat Mar 5 05:13:32 2005 From: nacc at us.ibm.com (Nishanth Aravamudan) Date: Fri, 4 Mar 2005 10:13:32 -0800 Subject: [PATCH] fix eeh.h compile warnings In-Reply-To: <20050304165717.GA5789@austin.ibm.com> References: <20050302181206.GA2741@us.ibm.com> <20050303033757.GD5897@otto> <20050304165717.GA5789@austin.ibm.com> Message-ID: <20050304181332.GA2689@us.ibm.com> On Fri, Mar 04, 2005 at 10:57:17AM -0600, Olof Johansson wrote: > On Wed, Mar 02, 2005 at 09:37:57PM -0600, Nathan Lynch wrote: > > > I don't have a toolchain readily available which gives these warnings, > > but does this fix them? > > Yep, it does here. Here as well, sorry for the delayed response. > > Use static inlines instead of #defines for stub functions when > > CONFIG_EEH=n. > > > > Signed-off-by: Nathan Lynch > > Acked-by: Olof Johansson Acked-by: Nishanth Aravamudan From paulus at samba.org Sat Mar 5 23:13:02 2005 From: paulus at samba.org (Paul Mackerras) Date: Sat, 5 Mar 2005 23:13:02 +1100 Subject: [PATCH] Updated U3 AGP patch Message-ID: <16937.41550.251010.982065@cargo.ozlabs.ibm.com> This patch is based on Jerome Glisse's work with some extra bits that I found in Darwin. It adds support for the U3 AGP bridge for both ppc32 and ppc64 kernels. It also includes the suspend/resume support that I need for my 1.5GHz powerbook to be able to sleep when AGP is active. This doesn't solve the potential cache problems, but so far I haven't seen them in practice... 
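[Editorial sketch, not part of the original message.] For orientation when reading the diff that follows: the two insert_memory paths in the patch differ mainly in how a GATT entry is built from a page's physical address. The standalone sketch below mirrors the two expressions that appear in the diff; the function names and `SKETCH_PAGE_SHIFT` are hypothetical, 4 KiB pages are an assumption, and the `cpu_to_le32()` byte swap used on the UniNorth path in the patch is left out.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the 0x1000 stride used in the patch. */
#define SKETCH_PAGE_SHIFT 12

/* UniNorth-style entry: the page-aligned low 32 bits of the address with
 * bit 0 set. The patch additionally wraps this in cpu_to_le32(). */
static uint32_t sketch_uninorth_gatt_entry(uint64_t phys)
{
    return (uint32_t)(phys & 0xFFFFF000UL) | 0x1u;
}

/* U3-style entry: the page frame number with bit 31 set. Because the
 * address is shifted rather than masked, pages above 4 GiB still fit. */
static uint32_t sketch_u3_gatt_entry(uint64_t phys)
{
    return (uint32_t)(phys >> SKETCH_PAGE_SHIFT) | 0x80000000u;
}
```

For a page at 0x12345000 the first form gives 0x12345001 and the second gives 0x80012345.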
Signed-off-by: Paul Mackerras diff -urN linux-2.5/drivers/char/agp/Kconfig g5-ppc64/drivers/char/agp/Kconfig --- linux-2.5/drivers/char/agp/Kconfig 2005-01-21 08:40:04.000000000 +1100 +++ g5-ppc64/drivers/char/agp/Kconfig 2005-02-21 18:29:10.000000000 +1100 @@ -1,6 +1,6 @@ config AGP tristate "/dev/agpgart (AGP Support)" if !GART_IOMMU - depends on ALPHA || IA64 || PPC32 || X86 + depends on ALPHA || IA64 || PPC32 || PPC64 || X86 default y if GART_IOMMU ---help--- AGP (Accelerated Graphics Port) is a bus system mainly used to @@ -156,11 +156,11 @@ default AGP config AGP_UNINORTH - tristate "Apple UniNorth AGP support" + tristate "Apple UniNorth & U3 AGP support" depends on AGP && PPC_PMAC help This option gives you AGP support for Apple machines with a - UniNorth bridge. + UniNorth or U3 (Apple G5) bridge. config AGP_EFFICEON tristate "Transmeta Efficeon support" diff -urN linux-2.5/drivers/char/agp/uninorth-agp.c g5-ppc64/drivers/char/agp/uninorth-agp.c --- linux-2.5/drivers/char/agp/uninorth-agp.c 2004-12-28 10:24:26.000000000 +1100 +++ g5-ppc64/drivers/char/agp/uninorth-agp.c 2005-03-05 23:09:40.000000000 +1100 @@ -6,10 +6,26 @@ #include #include #include +#include #include #include +#include #include "agp.h" +/* + * NOTES for uninorth3 (G5 AGP) supports : + * + * There maybe also possibility to have bigger cache line size for + * agp (see pmac_pci.c and look for cache line). Need to be investigated + * by someone. + * + * PAGE size are hardcoded but this may change, see asm/page.h. 
+ * + * Jerome Glisse + */ +static int uninorth_rev; +static int is_u3; + static int uninorth_fetch_size(void) { int i; @@ -39,26 +55,39 @@ static void uninorth_tlbflush(struct agp_memory *mem) { + u32 ctrl = UNI_N_CFG_GART_ENABLE; + + if (is_u3) + ctrl |= U3_N_CFG_GART_PERFRD; pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE | UNI_N_CFG_GART_INVAL); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE | UNI_N_CFG_GART_2xRESET); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE); + ctrl | UNI_N_CFG_GART_INVAL); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, ctrl); + + if (uninorth_rev <= 0x30) { + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + ctrl | UNI_N_CFG_GART_2xRESET); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + ctrl); + } } static void uninorth_cleanup(void) { - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE | UNI_N_CFG_GART_INVAL); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - 0); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_2xRESET); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - 0); + u32 tmp; + + pci_read_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, &tmp); + if (!(tmp & UNI_N_CFG_GART_ENABLE)) + return; + tmp |= UNI_N_CFG_GART_INVAL; + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, tmp); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, 0); + + if (uninorth_rev <= 0x30) { + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + UNI_N_CFG_GART_2xRESET); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + 0); + } } static int uninorth_configure(void) @@ -81,8 +110,21 @@ * the AGP aperture isn't mapped at bus physical address 0 */ 
agp_bridge->gart_bus_addr = 0; +#ifdef CONFIG_PPC64 + /* Assume U3 or later on PPC64 systems */ + /* high 4 bits of GART physical address go in UNI_N_CFG_AGP_BASE */ + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_AGP_BASE, + (agp_bridge->gatt_bus_addr >> 32) & 0xf); +#else pci_write_config_dword(agp_bridge->dev, - UNI_N_CFG_AGP_BASE, agp_bridge->gart_bus_addr); + UNI_N_CFG_AGP_BASE, agp_bridge->gart_bus_addr); +#endif + + if (is_u3) { + pci_write_config_dword(agp_bridge->dev, + UNI_N_CFG_GART_DUMMY_PAGE, + agp_bridge->scratch_page_real >> 12); + } return 0; } @@ -111,14 +153,54 @@ } for (i = 0, j = pg_start; i < mem->page_count; i++, j++) { - agp_bridge->gatt_table[j] = cpu_to_le32((mem->memory[i] & 0xfffff000) | 0x00000001UL); + agp_bridge->gatt_table[j] = + cpu_to_le32((mem->memory[i] & 0xFFFFF000UL) | 0x1UL); + flush_dcache_range((unsigned long)__va(mem->memory[i]), + (unsigned long)__va(mem->memory[i])+0x1000); + } + (void)in_le32((volatile u32*)&agp_bridge->gatt_table[pg_start]); + mb(); + flush_dcache_range((unsigned long)&agp_bridge->gatt_table[pg_start], + (unsigned long)&agp_bridge->gatt_table[pg_start + + mem->page_count]); + + uninorth_tlbflush(mem); + return 0; +} + +static int u3_insert_memory(struct agp_memory *mem, off_t pg_start, int type) +{ + int i, j, num_entries; + void *temp; + + temp = agp_bridge->current_size; + num_entries = A_SIZE_32(temp)->num_entries; + + if (type != 0 || mem->type != 0) + /* We know nothing of memory types */ + return -EINVAL; + if ((pg_start + mem->page_count) > num_entries) + return -EINVAL; + + j = pg_start; + + while (j < (pg_start + mem->page_count)) { + if (!PGE_EMPTY(agp_bridge, agp_bridge->gatt_table[j])) + return -EBUSY; + j++; + } + + for (i = 0, j = pg_start; i < mem->page_count; i++, j++) { + agp_bridge->gatt_table[j] = ((mem->memory[i] >> PAGE_SHIFT) | + 0x80000000UL); flush_dcache_range((unsigned long)__va(mem->memory[i]), (unsigned long)__va(mem->memory[i])+0x1000); } (void)in_le32((volatile 
u32*)&agp_bridge->gatt_table[pg_start]); mb(); flush_dcache_range((unsigned long)&agp_bridge->gatt_table[pg_start], - (unsigned long)&agp_bridge->gatt_table[pg_start + mem->page_count]); + (unsigned long)&agp_bridge->gatt_table[pg_start + + mem->page_count]); uninorth_tlbflush(mem); return 0; @@ -126,15 +208,31 @@ static void uninorth_agp_enable(u32 mode) { - u32 command, scratch; + u32 command, scratch, status; int timeout; pci_read_config_dword(agp_bridge->dev, agp_bridge->capndx + PCI_AGP_STATUS, - &command); + &status); + + command = agp_collect_device_status(mode, status); + command |= PCI_AGP_COMMAND_AGP; + + if (uninorth_rev == 0x21) { + /* + * Darwin disable AGP 4x on this revision, thus we + * may assume it's broken. This is an AGP2 controller. + */ + command &= ~AGPSTAT2_4X; + } - command = agp_collect_device_status(mode, command); - command |= 0x100; + if ((uninorth_rev >= 0x30) && (uninorth_rev <= 0x33)) { + /* + * We need to to set REQ_DEPTH to 7 for U3 versions 1.0, 2.1, + * 2.2 and 2.3, Darwin do so. 
+ */ + command |= (7 << AGPSTAT_RQ_DEPTH_SHIFT); + } uninorth_tlbflush(NULL); @@ -146,15 +244,74 @@ pci_read_config_dword(agp_bridge->dev, agp_bridge->capndx + PCI_AGP_COMMAND, &scratch); - } while ((scratch & 0x100) == 0 && ++timeout < 1000); - if ((scratch & 0x100) == 0) + } while ((scratch & PCI_AGP_COMMAND_AGP) == 0 && ++timeout < 1000); + if ((scratch & PCI_AGP_COMMAND_AGP) == 0) printk(KERN_ERR PFX "failed to write UniNorth AGP command reg\n"); - agp_device_command(command, 0); + if (uninorth_rev >= 0x30) { + /* This is an AGP V3 */ + agp_device_command(command, (status & 0x8)); + } else { + /* AGP V2 */ + agp_device_command(command, 0); + } uninorth_tlbflush(NULL); } +#ifdef CONFIG_PM +static int agp_uninorth_suspend(struct pci_dev *pdev, pm_message_t state) +{ + u32 cmd; + u8 agp; + struct pci_dev *device = NULL; + + if (state != PMSG_SUSPEND) + return 0; + + /* turn off AGP on the video chip, if it was enabled */ + for_each_pci_dev(device) { + /* Don't touch the bridge yet, device first */ + if (device == pdev) + continue; + /* Only deal with devices on the same bus here, no Mac has a P2P + * bridge on the AGP port, and mucking around the entire PCI tree + * is source of problems on some machines because of a bug in + * some versions of pci_find_capability() when hitting a dead device + */ + if (device->bus != pdev->bus) + continue; + agp = pci_find_capability(device, PCI_CAP_ID_AGP); + if (!agp) + continue; + pci_read_config_dword(device, agp + PCI_AGP_COMMAND, &cmd); + if (!(cmd & PCI_AGP_COMMAND_AGP)) + continue; + printk("uninorth-agp: disabling AGP on device %s\n", pci_name(device)); + cmd &= ~PCI_AGP_COMMAND_AGP; + pci_write_config_dword(device, agp + PCI_AGP_COMMAND, cmd); + } + + /* turn off AGP on the bridge */ + agp = pci_find_capability(pdev, PCI_CAP_ID_AGP); + pci_read_config_dword(pdev, agp + PCI_AGP_COMMAND, &cmd); + if (cmd & PCI_AGP_COMMAND_AGP) { + printk("uninorth-agp: disabling AGP on bridge %s\n", pci_name(pdev)); + cmd &= 
~PCI_AGP_COMMAND_AGP; + pci_write_config_dword(pdev, agp + PCI_AGP_COMMAND, cmd); + } + /* turn off the GART */ + uninorth_cleanup(); + + return 0; +} + +static int agp_uninorth_resume(struct pci_dev *pdev) +{ + return 0; +} +#endif + static int uninorth_create_gatt_table(void) { char *table; @@ -202,10 +359,8 @@ agp_bridge->gatt_table = (u32 *)table; agp_bridge->gatt_bus_addr = virt_to_phys(table); - for (i = 0; i < num_entries; i++) { - agp_bridge->gatt_table[i] = - (unsigned long) agp_bridge->scratch_page; - } + for (i = 0; i < num_entries; i++) + agp_bridge->gatt_table[i] = 0; flush_dcache_range((unsigned long)table, (unsigned long)table_end); @@ -258,6 +413,22 @@ {4, 1024, 0, 1} }; +/* + * Not sure that u3 supports that high aperture sizes but it + * would strange if it did not :) + */ +static struct aper_size_info_32 u3_sizes[8] = +{ + {512, 131072, 7, 128}, + {256, 65536, 6, 64}, + {128, 32768, 5, 32}, + {64, 16384, 4, 16}, + {32, 8192, 3, 8}, + {16, 4096, 2, 4}, + {8, 2048, 1, 2}, + {4, 1024, 0, 1} +}; + struct agp_bridge_driver uninorth_agp_driver = { .owner = THIS_MODULE, .aperture_sizes = (void *)uninorth_sizes, @@ -282,6 +453,31 @@ .cant_use_aperture = 1, }; +struct agp_bridge_driver u3_agp_driver = { + .owner = THIS_MODULE, + .aperture_sizes = (void *)u3_sizes, + .size_type = U32_APER_SIZE, + .num_aperture_sizes = 8, + .configure = uninorth_configure, + .fetch_size = uninorth_fetch_size, + .cleanup = uninorth_cleanup, + .tlb_flush = uninorth_tlbflush, + .mask_memory = agp_generic_mask_memory, + .masks = NULL, + .cache_flush = null_cache_flush, + .agp_enable = uninorth_agp_enable, + .create_gatt_table = uninorth_create_gatt_table, + .free_gatt_table = uninorth_free_gatt_table, + .insert_memory = u3_insert_memory, + .remove_memory = agp_generic_remove_memory, + .alloc_by_type = agp_generic_alloc_by_type, + .free_by_type = agp_generic_free_by_type, + .agp_alloc_page = agp_generic_alloc_page, + .agp_destroy_page = agp_generic_destroy_page, + 
.cant_use_aperture = 1, + .needs_scratch_page = 1, +}; + static struct agp_device_ids uninorth_agp_device_ids[] __devinitdata = { { .device_id = PCI_DEVICE_ID_APPLE_UNI_N_AGP, @@ -299,6 +495,18 @@ .device_id = PCI_DEVICE_ID_APPLE_UNI_N_AGP2, .chipset_name = "UniNorth 2", }, + { + .device_id = PCI_DEVICE_ID_APPLE_U3_AGP, + .chipset_name = "U3", + }, + { + .device_id = PCI_DEVICE_ID_APPLE_U3L_AGP, + .chipset_name = "U3L", + }, + { + .device_id = PCI_DEVICE_ID_APPLE_U3H_AGP, + .chipset_name = "U3H", + }, }; static int __devinit agp_uninorth_probe(struct pci_dev *pdev, @@ -306,6 +514,7 @@ { struct agp_device_ids *devs = uninorth_agp_device_ids; struct agp_bridge_data *bridge; + struct device_node *uninorth_node; u8 cap_ptr; int j; @@ -327,11 +536,33 @@ return -ENODEV; found: + /* Set revision to 0 if we could not read it. */ + uninorth_rev = 0; + is_u3 = 0; + /* Locate core99 Uni-N */ + uninorth_node = of_find_node_by_name(NULL, "uni-n"); + /* Locate G5 u3 */ + if (uninorth_node == NULL) { + is_u3 = 1; + uninorth_node = of_find_node_by_name(NULL, "u3"); + } + if (uninorth_node) { + int *revprop = (int *) + get_property(uninorth_node, "device-rev", NULL); + if (revprop != NULL) + uninorth_rev = *revprop & 0x3f; + of_node_put(uninorth_node); + } + bridge = agp_alloc_bridge(); if (!bridge) return -ENOMEM; - bridge->driver = &uninorth_agp_driver; + if (is_u3) + bridge->driver = &u3_agp_driver; + else + bridge->driver = &uninorth_agp_driver; + bridge->dev = pdev; bridge->capndx = cap_ptr; @@ -369,6 +600,10 @@ .id_table = agp_uninorth_pci_table, .probe = agp_uninorth_probe, .remove = agp_uninorth_remove, +#ifdef CONFIG_PM + .suspend = agp_uninorth_suspend, + .resume = agp_uninorth_resume, +#endif }; static int __init agp_uninorth_init(void) diff -urN linux-2.5/include/asm-ppc/uninorth.h g5-ppc64/include/asm-ppc/uninorth.h --- linux-2.5/include/asm-ppc/uninorth.h 2005-02-03 18:00:28.000000000 +1100 +++ g5-ppc64/include/asm-ppc/uninorth.h 2005-03-05 17:28:33.000000000 +1100 @@ 
-27,13 +27,18 @@ #define UNI_N_CFG_AGP_BASE 0x90 #define UNI_N_CFG_GART_CTRL 0x94 #define UNI_N_CFG_INTERNAL_STATUS 0x98 +#define UNI_N_CFG_GART_DUMMY_PAGE 0xa4 /* UNI_N_CFG_GART_CTRL bits definitions */ -/* Not U3 */ #define UNI_N_CFG_GART_INVAL 0x00000001 #define UNI_N_CFG_GART_ENABLE 0x00000100 #define UNI_N_CFG_GART_2xRESET 0x00010000 #define UNI_N_CFG_GART_DISSBADET 0x00020000 +/* The following seems to only be used only on U3 */ +#define U3_N_CFG_GART_SYNCMODE 0x00040000 +#define U3_N_CFG_GART_PERFRD 0x00080000 +#define U3_N_CFG_GART_B2BGNT 0x00200000 +#define U3_N_CFG_GART_FASTDDR 0x00400000 /* My understanding of UniNorth AGP as of UniNorth rev 1.0x, * revision 1.5 (x4 AGP) may need further changes. diff -urN linux-2.5/include/asm-ppc64/agp.h g5-ppc64/include/asm-ppc64/agp.h --- /dev/null 2005-02-22 20:41:05.000000000 +1100 +++ g5-ppc64/include/asm-ppc64/agp.h 2005-02-21 18:30:02.000000000 +1100 @@ -0,0 +1,13 @@ +#ifndef AGP_H +#define AGP_H 1 + +#include + +/* nothing much needed here */ + +#define map_page_into_agp(page) +#define unmap_page_from_agp(page) +#define flush_agp_mappings() +#define flush_agp_cache() mb() + +#endif From stefan at nocrew.org Sun Mar 6 08:01:27 2005 From: stefan at nocrew.org (Stefan Berndtsson) Date: Sat, 05 Mar 2005 22:01:27 +0100 Subject: BTTV in linux/ppc64. Message-ID: <87mzth24ew.fsf@hades.nocrew.org> I'm having trouble getting bttv working in linux/ppc64. The same kernel works fine with linux/ppc on the same hardware. Kernel used is 2.6.11 (from kernel.org) Machine is a 1.8GHz G5 (single cpu). BTTV-card is a PCTV Rave with a bt878 chipset. It compiles nicely, but when the module is loaded, the modprobe process hangs and never returns. The rest of the system keeps working as it should. As far as I've been able to figure out, it calls driver_register(), but never returns from this. Another issue, where I get an oops, is the loading of the sound alsa module for the Vortex card in the machine. It's a Vortex au8820. 
Like the bttv issue, the card and driver works fine with a 32bit kernel. The oops looks like this: PCI: Enabling device: (0001:06:03.0), cmd 7 Vortex: init.... Oops: Kernel access of bad area, sig: 11 [#1] POWERMAC Modules linked in: snd_au8820 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_timer snd_mpu401_uart snd_rawmidi snd soundco ic5 kernel: NIP: D0000000000FF8EC XER: 00000000 LR: D0000000000FF8D8 CTR: C00000000018F188 REGS: c000000001c6f450 TRAP: 0300 Not tainted (2.6.11-ppc64) MSR: 9000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 24008422 DAR: e000000083f30018 DSISR: 0000000042000000 TASK: c00000000fbd77a0[1228] 'modprobe' THREAD: c000000001c6c000 GPR00: E000000083F30018 C000000001C6F6D0 D00000000010CE20 0000000000000014 GPR04: C0000000004916A0 C00000001F788990 0000000000000008 C000000000481B88 GPR08: 00000000FFFBF3FE E000000083F2B000 C000000000481578 FFFFFFFFFFFFFFFF GPR12: 0000000044008428 C000000000399C00 0000000000000000 000000000000000A GPR16: D0000000000F2112 D0000000000F2210 D000000000104600 0000000000000124 GPR20: 0000000000000000 D000000000104650 C00000000044F448 D0000000000E9000 GPR24: 0000000000000000 C000000001813C00 C00000001F79E000 C000000001E2C000 GPR28: C00000001F79E000 0000000000000000 D00000000010C880 C000000001C6F6D0 NIP [d0000000000ff8ec] .snd_vortex_probe+0x194/0x1028 [snd_au8820] LR [d0000000000ff8d8] .snd_vortex_probe+0x180/0x1028 [snd_au8820] Call Trace: [c000000001c6f6d0] [d0000000000ff8d8] .snd_vortex_probe+0x180/0x1028 [snd_au8820] (unreliable) [c000000001c6f7e0] [c00000000016c990] .pci_device_probe+0xec/0x20c [c000000001c6f880] [c0000000001c1dcc] .driver_probe_device+0x80/0x11c [c000000001c6f910] [c0000000001c2010] .driver_attach+0x84/0xfc [c000000001c6f9b0] [c0000000001c2568] .bus_add_driver+0xc4/0x1ec [c000000001c6fa60] [c0000000001c2e1c] .driver_register+0x38/0x50 [c000000001c6faf0] [c00000000016c49c] .pci_register_driver+0x80/0xf0 [c000000001c6fb80] [d0000000001007f8] 
.alsa_card_vortex_init+0x24/0x40 [snd_au8820] [c000000001c6fc00] [c000000000062c98] .sys_init_module+0x3d4/0x1918 [c000000001c6fe30] [c00000000000d400] syscall_exit+0x0/0x18 Instruction dump: 4800150d e8410028 2fa30000 f87b1500 419e0b78 e87e8200 48000fb5 e8410028 e93b1500 3960ffff 3d290002 38095018 <7d60052c> 7c0004ac 38600005 48001051 Any ideas? /Stefan Berndtsson From paulus at samba.org Sun Mar 6 10:42:17 2005 From: paulus at samba.org (Paul Mackerras) Date: Sun, 6 Mar 2005 10:42:17 +1100 Subject: [PATCH] Updated U3 AGP patch In-Reply-To: References: <16937.41550.251010.982065@cargo.ozlabs.ibm.com> Message-ID: <16938.17369.373338.88220@cargo.ozlabs.ibm.com> Andreas Schwab writes: > I can't find these ids being defined anywhere in 2.6.11. Oops, my mistake, you need this bit too. Paul. diff -urN linux-2.5/include/linux/pci_ids.h g5-ppc64/include/linux/pci_ids.h --- linux-2.5/include/linux/pci_ids.h 2005-03-03 08:14:30.000000000 +1100 +++ g5-ppc64/include/linux/pci_ids.h 2005-03-03 09:36:03.000000000 +1100 @@ -861,7 +861,10 @@ #define PCI_DEVICE_ID_APPLE_IPID_ATA100 0x003b #define PCI_DEVICE_ID_APPLE_KEYLARGO_I 0x003e #define PCI_DEVICE_ID_APPLE_K2_ATA100 0x0043 +#define PCI_DEVICE_ID_APPLE_U3_AGP 0x004b #define PCI_DEVICE_ID_APPLE_K2_GMAC 0x004c +#define PCI_DEVICE_ID_APPLE_U3L_AGP 0x0058 +#define PCI_DEVICE_ID_APPLE_U3H_AGP 0x0059 #define PCI_DEVICE_ID_APPLE_TIGON3 0x1645 #define PCI_VENDOR_ID_YAMAHA 0x1073
From domen at coderock.org Mon Mar 7 09:23:30 2005 From: domen at coderock.org (domen at coderock.org) Date: Sun, 06 Mar 2005 23:23:30 +0100 Subject: [patch 2/2] delete unused file include_asm_ppc64_iSeries_iSeries_fixup.h Message-ID: <20050306222330.D06231ED3D@trashy.coderock.org> Remove nowhere referenced file. (egrep "filename\." didn't find anything) Signed-off-by: Domen Puncer --- kj/include/asm-ppc64/iSeries/iSeries_fixup.h | 25 ------------------------- 1 files changed, 25 deletions(-) diff -L include/asm-ppc64/iSeries/iSeries_fixup.h -puN include/asm-ppc64/iSeries/iSeries_fixup.h~remove_file-include_asm_ppc64_iSeries_iSeries_fixup.h /dev/null --- kj/include/asm-ppc64/iSeries/iSeries_fixup.h +++ /dev/null 2005-03-02 11:34:59.000000000 +0100 @@ -1,25 +0,0 @@ - -#ifndef __ISERIES_FIXUP_H__ -#define __ISERIES_FIXUP_H__ -#include - -#ifdef __cplusplus -extern "C" { -#endif - -void iSeries_fixup (void); -void iSeries_fixup_bus (struct pci_bus*); -unsigned int iSeries_scan_slot (struct pci_dev*, u16, u8, u8); - - -/* Need to store information related to the PHB bucc and make it accessible to the hose */ -struct iSeries_hose_arch_data { - u32 hvBusNumber; -}; - - -#ifdef __cplusplus -} -#endif - -#endif /* __ISERIES_FIXUP_H__ */ _ From domen at coderock.org Mon Mar 7 09:23:27 2005 From: domen at coderock.org (domen at coderock.org) Date: Sun, 06 Mar 2005 23:23:27 +0100 Subject: [patch 1/2] delete unused file arch_ppc64_boot_no_initrd.c Message-ID: <20050306222327.CD55F1EC90@trashy.coderock.org> Remove nowhere referenced file. (egrep "filename\."
didn't find anything) Signed-off-by: Domen Puncer --- kj/arch/ppc64/boot/no_initrd.c | 2 -- 1 files changed, 2 deletions(-) diff -L arch/ppc64/boot/no_initrd.c -puN arch/ppc64/boot/no_initrd.c~remove_file-arch_ppc64_boot_no_initrd.c /dev/null --- kj/arch/ppc64/boot/no_initrd.c +++ /dev/null 2005-03-02 11:34:59.000000000 +0100 @@ -1,2 +0,0 @@ -char initrd_data[1]; -int initrd_len = 0; _ From sfr at canb.auug.org.au Mon Mar 7 11:36:00 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Mon, 7 Mar 2005 11:36:00 +1100 Subject: [patch 2/2] delete unused file include_asm_ppc64_iSeries_iSeries_fixup.h In-Reply-To: <20050306222330.D06231ED3D@trashy.coderock.org> References: <20050306222330.D06231ED3D@trashy.coderock.org> Message-ID: <20050307113600.2c1d52b5.sfr@canb.auug.org.au> On Sun, 06 Mar 2005 23:23:30 +0100 domen at coderock.org wrote: > > > Remove nowhere referenced file. (egrep "filename\." didn't find anything) And, in fact, none of the things declared here exist any more. And iSeries builds happily without it. > Signed-off-by: Domen Puncer Acked-by: Stephen Rothwell -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ From paulus at samba.org Mon Mar 7 19:57:50 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 7 Mar 2005 19:57:50 +1100 Subject: [PATCH] PPC64 Addresses from OF getting truncated to 32-bits Message-ID: <16940.6030.778567.924954@cargo.ozlabs.ibm.com> This patch is from Jake Moilanen, reformatted by me. Signed-off-by: Jake Moilanen Signed-off-by: Paul Mackerras The `assigned-addresses' property in the Open Firmware device tree nodes for PCI devices has 64 bits of PCI bus address, but we were only using 32. This patch fixes it so we use all 64.
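[Editorial sketch, not part of the original message.] The fix described above combines two 32-bit cells into one 64-bit bus address. A minimal standalone illustration follows, assuming a simplified stand-in struct: the field names `a_hi`, `a_mid` and `a_lo` follow the patch, while `sketch_pci_addr` and `sketch_bus_address` are hypothetical and do not reproduce the kernel's real structure layout.

```c
#include <stdint.h>

/* Hypothetical stand-in for the three 32-bit address cells of one
 * `assigned-addresses' entry; field names follow the patch. */
struct sketch_pci_addr {
    uint32_t a_hi;   /* address space and flags */
    uint32_t a_mid;  /* upper 32 bits of the bus address */
    uint32_t a_lo;   /* lower 32 bits of the bus address */
};

/* Build the full 64-bit bus address. a_mid must be widened before the
 * shift: shifting a 32-bit value left by 32 is undefined in C. */
static uint64_t sketch_bus_address(const struct sketch_pci_addr *c)
{
    return ((uint64_t)c->a_mid << 32) | c->a_lo;
}
```

With made-up cell values a_mid = 0x1 and a_lo = 0x80000000, the combined address is 0x180000000, whereas using a_lo alone would give 0x80000000.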
diff -urN linux-2.5/arch/ppc64/kernel/prom.c test/arch/ppc64/kernel/prom.c --- linux-2.5/arch/ppc64/kernel/prom.c 2005-03-07 08:21:53.000000000 +1100 +++ test/arch/ppc64/kernel/prom.c 2005-03-07 19:49:13.000000000 +1100 @@ -335,7 +335,8 @@ while ((l -= sizeof(struct pci_reg_property)) >= 0) { if (!measure_only) { adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo; + adr[i].address = pci_addrs[i].addr.a_lo | + ((u64)pci_addrs[i].addr.a_mid << 32); adr[i].size = pci_addrs[i].size_lo; } ++i; @@ -1721,7 +1722,8 @@ } while ((l -= sizeof(struct pci_reg_property)) >= 0) { adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo; + adr[i].address = pci_addrs[i].addr.a_lo | + ((u64)pci_addrs[i].addr.a_mid << 32); adr[i].size = pci_addrs[i].size_lo; ++i; } From paulus at samba.org Mon Mar 7 21:33:20 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 7 Mar 2005 21:33:20 +1100 Subject: [PATCH] error code cleanups for rtas wrappers In-Reply-To: <1109797837.9434.2.camel@sinatra.austin.ibm.com> References: <1109797837.9434.2.camel@sinatra.austin.ibm.com> Message-ID: <16940.11760.291422.528712@cargo.ozlabs.ibm.com> John Rose writes: > This patch changes the rtas wrapper functions in rtas.c to map RTAS failures > to conventional error values. The goal is to make failure conditions > obvious in the wrapper functions and in the caller code. Looks good, got a patch to change all the callers? Paul. From johnrose at austin.ibm.com Tue Mar 8 03:11:52 2005 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 07 Mar 2005 10:11:52 -0600 Subject: [PATCH] error code cleanups for rtas wrappers In-Reply-To: <16940.11760.291422.528712@cargo.ozlabs.ibm.com> References: <1109797837.9434.2.camel@sinatra.austin.ibm.com> <16940.11760.291422.528712@cargo.ozlabs.ibm.com> Message-ID: <1110211912.2538.15.camel@sinatra.austin.ibm.com> > Looks good, got a patch to change all the callers? I do, but the patch only affects RPA PCI Hotplug/DLPAR. 
I figured I'd wait for acceptance on this end before submitting those changes. Callers within PPC64 "base" were either changed by the patch above, or already had sufficient checks (rc == 0, etc). Thanks- John From paulus at samba.org Tue Mar 8 10:02:18 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 8 Mar 2005 10:02:18 +1100 Subject: [PATCH] error code cleanups for rtas wrappers In-Reply-To: <1110211912.2538.15.camel@sinatra.austin.ibm.com> References: <1109797837.9434.2.camel@sinatra.austin.ibm.com> <16940.11760.291422.528712@cargo.ozlabs.ibm.com> <1110211912.2538.15.camel@sinatra.austin.ibm.com> Message-ID: <16940.56698.707687.831617@cargo.ozlabs.ibm.com> John Rose writes: > I do, but the patch only affects RPA PCI Hotplug/DLPAR. I figured I'd > wait for acceptance on this end before submitting those changes. > Callers within PPC64 "base" were either changed by the patch above, or > already had sufficient checks (rc == 0, etc). Yes, it was the callers in drivers/pci/hotplug/rpa* that I was concerned about. Some of them were testing for specific return values. If you have a patch to fix them too I'll forward both patches to Andrew. Paul. From jschopp at austin.ibm.com Tue Mar 8 10:01:28 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 07 Mar 2005 17:01:28 -0600 Subject: [PATCH] explicitly bind idle tasks In-Reply-To: <20050302014701.GA5897@otto> References: <20050227031655.67233bb5.akpm@osdl.org> <1109542971.14993.217.camel@gaston> <20050227144928.6c71adaf.akpm@osdl.org> <20050302014701.GA5897@otto> Message-ID: <422CDD48.10006@austin.ibm.com> Nathan Lynch wrote: > With hotplug cpu and preempt, we tend to see smp_processor_id warnings > from idle loop code because it's always checking whether its cpu has > gone offline. 
Replacing every use of smp_processor_id with > _smp_processor_id in all idle loop code is one solution; another way > is explicitly binding idle threads to their cpus (the smp_processor_id > warning does not fire if the caller is bound only to the calling cpu). > This has the (admittedly slight) advantage of letting us know if an > idle thread ever runs on the wrong cpu. I also prefer explicitly binding idle threads to their cpus instead of replacing use of smp_processor_id with _smp_processor_id. > > > Signed-off-by: Nathan Lynch Acked-by: Joel Schopp > > Index: linux-2.6.11-rc5-mm1/init/main.c > =================================================================== > --- linux-2.6.11-rc5-mm1.orig/init/main.c 2005-03-02 00:12:07.000000000 +0000 > +++ linux-2.6.11-rc5-mm1/init/main.c 2005-03-02 00:53:04.000000000 +0000 > @@ -638,6 +638,10 @@ > { > lock_kernel(); > /* > + * init can run on any cpu. > + */ > + set_cpus_allowed(current, CPU_MASK_ALL); > + /* > * Tell the world that we're going to be the grim > * reaper of innocent orphaned children. 
> * > Index: linux-2.6.11-rc5-mm1/kernel/sched.c > =================================================================== > --- linux-2.6.11-rc5-mm1.orig/kernel/sched.c 2005-03-02 00:12:07.000000000 +0000 > +++ linux-2.6.11-rc5-mm1/kernel/sched.c 2005-03-02 00:47:14.000000000 +0000 > @@ -4092,6 +4092,7 @@ > idle->array = NULL; > idle->prio = MAX_PRIO; > idle->state = TASK_RUNNING; > + idle->cpus_allowed = cpumask_of_cpu(cpu); > set_task_cpu(idle, cpu); > > spin_lock_irqsave(&rq->lock, flags); From johnrose at austin.ibm.com Tue Mar 8 10:54:54 2005 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 07 Mar 2005 17:54:54 -0600 Subject: [PATCH] error code cleanups rpa[php,dlpar] In-Reply-To: <16940.56698.707687.831617@cargo.ozlabs.ibm.com> References: <1109797837.9434.2.camel@sinatra.austin.ibm.com> <16940.11760.291422.528712@cargo.ozlabs.ibm.com> <1110211912.2538.15.camel@sinatra.austin.ibm.com> <16940.56698.707687.831617@cargo.ozlabs.ibm.com> Message-ID: <1110239693.2538.26.camel@sinatra.austin.ibm.com> > Yes, it was the callers in drivers/pci/hotplug/rpa* that I was > concerned about. Some of them were testing for specific return > values. If you have a patch to fix them too I'll forward both patches > to Andrew. This patch changes the RPA PCI Hotplug and DLPAR modules to use more conventional error values for return codes. The goal is to make failure conditions obvious in the wrapper functions and in the caller code. Thanks Paul. 
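The convention described above can be sketched as follows. The sketch maps raw status values onto negative errno values so that a caller only needs to test rc < 0. The mapping of -9000, -9001 and -9002 to -EFAULT, -EEXIST and -ENODEV is inferred from the rpaphp.h and rpaphp_pci.c hunks of this patch. The helper names map_sensor_status() and slot_needs_power_on(), and the -EIO fallback for other failures, are assumptions made for illustration and are not taken from rtas.c or the rpaphp driver.

```c
#include <errno.h>

/* Raw status values, as defined by the macros this patch removes from rpaphp.h. */
#define RAW_NEED_POWER    (-9000)
#define RAW_PWR_ONLY      (-9001)
#define RAW_ERR_SENSE_USE (-9002)

/* Hypothetical helper: translate a raw status into a negative errno value. */
static int map_sensor_status(int raw)
{
	switch (raw) {
	case RAW_NEED_POWER:
		return -EFAULT;
	case RAW_PWR_ONLY:
		return -EEXIST;
	case RAW_ERR_SENSE_USE:
		return -ENODEV;
	default:
		/* Non-negative values pass through unchanged; any other
		 * failure collapses to -EIO (an assumption of this sketch). */
		return raw >= 0 ? raw : -EIO;
	}
}

/* Hypothetical caller-side check: both "needs power" cases are handled alike. */
static int slot_needs_power_on(int rc)
{
	return rc == -EFAULT || rc == -EEXIST;
}
```

With this shape a caller can write `if (rc < 0)` for the general failure path and compare against specific errno values only where the recovery differs.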
Signed-off-by: John Rose diff -puN drivers/pci/hotplug/rpaphp.h~02_rpaphp_rcs drivers/pci/hotplug/rpaphp.h --- 2_6_linus_3/drivers/pci/hotplug/rpaphp.h~02_rpaphp_rcs 2005-03-07 17:52:20.000000000 -0600 +++ 2_6_linus_3-johnrose/drivers/pci/hotplug/rpaphp.h 2005-03-07 17:52:20.000000000 -0600 @@ -45,11 +45,6 @@ #define LED_ID 2 /* slow blinking */ #define LED_ACTION 3 /* fast blinking */ -/* Error status from rtas_get-sensor */ -#define NEED_POWER -9000 /* slot must be power up and unisolated to get state */ -#define PWR_ONLY -9001 /* slot must be powerd up to get state, leave isolated */ -#define ERR_SENSE_USE -9002 /* No DR operation will succeed, slot is unusable */ - /* Sensor values from rtas_get-sensor */ #define EMPTY 0 /* No card in slot */ #define PRESENT 1 /* Card in slot */ diff -puN drivers/pci/hotplug/rpaphp_core.c~02_rpaphp_rcs drivers/pci/hotplug/rpaphp_core.c --- 2_6_linus_3/drivers/pci/hotplug/rpaphp_core.c~02_rpaphp_rcs 2005-03-07 17:52:20.000000000 -0600 +++ 2_6_linus_3-johnrose/drivers/pci/hotplug/rpaphp_core.c 2005-03-07 17:52:20.000000000 -0600 @@ -256,12 +256,12 @@ int rpaphp_get_drc_props(struct device_n my_index = (int *) get_property(dn, "ibm,my-drc-index", NULL); if (!my_index) { /* Node isn't DLPAR/hotplug capable */ - return 1; + return -EINVAL; } rc = get_children_props(dn->parent, &indexes, &names, &types, &domains); if (rc < 0) { - return 1; + return -EINVAL; } name_tmp = (char *) &names[1]; @@ -284,7 +284,7 @@ int rpaphp_get_drc_props(struct device_n type_tmp += (strlen(type_tmp) + 1); } - return 1; + return -EINVAL; } static int is_php_type(char *drc_type) diff -puN drivers/pci/hotplug/rpaphp_pci.c~02_rpaphp_rcs drivers/pci/hotplug/rpaphp_pci.c --- 2_6_linus_3/drivers/pci/hotplug/rpaphp_pci.c~02_rpaphp_rcs 2005-03-07 17:52:20.000000000 -0600 +++ 2_6_linus_3-johnrose/drivers/pci/hotplug/rpaphp_pci.c 2005-03-07 17:52:20.000000000 -0600 @@ -81,8 +81,8 @@ static int rpaphp_get_sensor_state(struc rc = rtas_get_sensor(DR_ENTITY_SENSE, 
slot->index, state); - if (rc) { - if (rc == NEED_POWER || rc == PWR_ONLY) { + if (rc < 0) { + if (rc == -EFAULT || rc == -EEXIST) { dbg("%s: slot must be power up to get sensor-state\n", __FUNCTION__); @@ -91,14 +91,14 @@ static int rpaphp_get_sensor_state(struc */ rc = rtas_set_power_level(slot->power_domain, POWER_ON, &setlevel); - if (rc) { + if (rc < 0) { dbg("%s: power on slot[%s] failed rc=%d.\n", __FUNCTION__, slot->name, rc); } else { rc = rtas_get_sensor(DR_ENTITY_SENSE, slot->index, state); } - } else if (rc == ERR_SENSE_USE) + } else if (rc == -ENODEV) info("%s: slot is unusable\n", __FUNCTION__); else err("%s failed to get sensor state\n", __FUNCTION__); @@ -413,7 +413,7 @@ static int setup_pci_hotplug_slot_info(s if (slot->hotplug_slot->info->adapter_status == NOT_VALID) { err("%s: NOT_VALID: skip dn->full_name=%s\n", __FUNCTION__, slot->dn->full_name); - return -1; + return -EINVAL; } return 0; } @@ -426,15 +426,15 @@ static int set_phb_slot_name(struct slot dn = slot->dn; if (!dn) { - return 1; + return -EINVAL; } phb = dn->phb; if (!phb) { - return 1; + return -EINVAL; } bus = phb->bus; if (!bus) { - return 1; + return -EINVAL; } sprintf(slot->name, "%04x:%02x:%02x.%x", pci_domain_nr(bus), @@ -448,7 +448,7 @@ static int setup_pci_slot(struct slot *s if (slot->type == PHB) { rc = set_phb_slot_name(slot); - if (rc) { + if (rc < 0) { err("%s: failed to set phb slot name\n", __FUNCTION__); goto exit_rc; } @@ -509,12 +509,12 @@ static int setup_pci_slot(struct slot *s return 0; exit_rc: dealloc_slot_struct(slot); - return 1; + return -EINVAL; } int register_pci_slot(struct slot *slot) { - int rc = 1; + int rc = -EINVAL; slot->dev_type = PCI_DEV; if ((slot->type == EMBEDDED) || (slot->type == PHB)) diff -puN drivers/pci/hotplug/rpaphp_slot.c~02_rpaphp_rcs drivers/pci/hotplug/rpaphp_slot.c --- 2_6_linus_3/drivers/pci/hotplug/rpaphp_slot.c~02_rpaphp_rcs 2005-03-07 17:52:20.000000000 -0600 +++ 2_6_linus_3-johnrose/drivers/pci/hotplug/rpaphp_slot.c 
2005-03-07 17:52:20.000000000 -0600 @@ -211,7 +211,7 @@ int register_slot(struct slot *slot) if (is_registered(slot)) { /* should't be here */ err("register_slot: slot[%s] is already registered\n", slot->name); rpaphp_release_slot(slot->hotplug_slot); - return 1; + return -EAGAIN; } retval = pci_hp_register(slot->hotplug_slot); if (retval) { @@ -270,7 +270,7 @@ int rpaphp_set_attention_status(struct s /* status: LED_OFF or LED_ON */ rc = rtas_set_indicator(DR_INDICATOR, slot->index, status); - if (rc) + if (rc < 0) err("slot(name=%s location=%s index=0x%x) set attention-status(%d) failed! rc=0x%x\n", slot->name, slot->location, slot->index, status, rc); diff -puN drivers/pci/hotplug/rpaphp_vio.c~02_rpaphp_rcs drivers/pci/hotplug/rpaphp_vio.c --- 2_6_linus_3/drivers/pci/hotplug/rpaphp_vio.c~02_rpaphp_rcs 2005-03-07 17:52:20.000000000 -0600 +++ 2_6_linus_3-johnrose/drivers/pci/hotplug/rpaphp_vio.c 2005-03-07 17:52:20.000000000 -0600 @@ -71,11 +71,11 @@ int register_vio_slot(struct device_node { u32 *index; char *name; - int rc = 1; + int rc = -EINVAL; struct slot *slot = NULL; rc = rpaphp_get_drc_props(dn, NULL, &name, NULL, NULL); - if (rc) + if (rc < 0) goto exit_rc; index = (u32 *) get_property(dn, "ibm,my-drc-index", NULL); if (!index) diff -puN drivers/pci/hotplug/rpadlpar_core.c~02_rpaphp_rcs drivers/pci/hotplug/rpadlpar_core.c --- 2_6_linus_3/drivers/pci/hotplug/rpadlpar_core.c~02_rpaphp_rcs 2005-03-07 17:52:51.000000000 -0600 +++ 2_6_linus_3-johnrose/drivers/pci/hotplug/rpadlpar_core.c 2005-03-07 17:53:02.000000000 -0600 @@ -142,7 +142,7 @@ static int pci_add_secondary_bus(struct child = pci_add_new_bus(bridge_dev->bus, bridge_dev, sec_busno); if (!child) { printk(KERN_ERR "%s: could not add secondary bus\n", __FUNCTION__); - return 1; + return -ENOMEM; } sprintf(child->name, "PCI Bus #%02x", child->number); @@ -204,7 +204,7 @@ static int dlpar_pci_remove_bus(struct p if (!bridge_dev) { printk(KERN_ERR "%s: unexpected null device\n", __FUNCTION__); - return 
1; + return -EINVAL; } secondary_bus = bridge_dev->subordinate; @@ -212,7 +212,7 @@ static int dlpar_pci_remove_bus(struct p if (unmap_bus_range(secondary_bus)) { printk(KERN_ERR "%s: failed to unmap bus range\n", __FUNCTION__); - return 1; + return -ERANGE; } pci_remove_bus_device(bridge_dev); @@ -282,7 +282,7 @@ static int dlpar_remove_phb(struct slot } rc = dlpar_remove_root_bus(phb); - if (rc) + if (rc < 0) return rc; return 0; @@ -294,7 +294,7 @@ static int dlpar_add_phb(struct device_n phb = init_phb_dynamic(dn); if (!phb) - return 1; + return -EINVAL; return 0; } _ From ntl at pobox.com Tue Mar 8 12:56:38 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 7 Mar 2005 19:56:38 -0600 Subject: [PATCH] call idle_task_exit with irqs disabled Message-ID: <20050308015638.GA21853@otto> Seeing this very occasionally during cpu hotplug testing: Badness in slb_flush_and_rebolt at arch/ppc64/mm/slb.c:52 Call Trace: [c0000000ef0efbe0] [c0000000000127a0] .__switch_to+0xa4/0xf0 (unreliable) [c0000000ef0efc80] [c000000000050178] .idle_task_exit+0xbc/0x15c [c0000000ef0efd10] [c00000000000d108] .cpu_die+0x18/0x68 [c0000000ef0efd90] [c00000000001023c] .dedicated_idle+0x1fc/0x254 [c0000000ef0efe80] [c00000000000fc80] .cpu_idle+0x3c/0x54 [c0000000ef0eff00] [c00000000003aa90] .start_secondary+0x108/0x148 [c0000000ef0eff90] [c00000000000bd28] .enable_64b_mode+0x0/0x28 idle_task_exit can result in a call to slb_flush_and_rebolt, which must not be called with interrupts enabled. Make the call with interrupts disabled. 
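The ordering issue can be modelled in user space. This is only a toy sketch: every model_* function, the irqs_enabled flag and the badness counter are hypothetical stand-ins and not kernel interfaces. It shows that a routine which must run with interrupts off complains under the old call order and stays quiet under the new one.

```c
#include <stdbool.h>

static bool irqs_enabled = true;  /* models the CPU's interrupt-enable state */
static int badness_count;         /* counts calls made in the wrong state */

static void model_local_irq_disable(void)
{
	irqs_enabled = false;
}

/* Stand-in for slb_flush_and_rebolt(): records a warning if irqs are on. */
static void model_flush_and_rebolt(void)
{
	if (irqs_enabled)
		badness_count++;
}

/* Stand-in for idle_task_exit(), which can end up in the flush routine. */
static void model_idle_task_exit(void)
{
	model_flush_and_rebolt();
}

/* Old order: exit the idle task first, then disable interrupts. */
static void model_cpu_die_old_order(void)
{
	irqs_enabled = true;
	model_idle_task_exit();
	model_local_irq_disable();
}

/* New order: disable interrupts first, then exit the idle task. */
static void model_cpu_die_new_order(void)
{
	irqs_enabled = true;
	model_local_irq_disable();
	model_idle_task_exit();
}
```

Calling model_cpu_die_old_order() increments badness_count, while model_cpu_die_new_order() leaves it unchanged.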
Signed-off-by: Nathan Lynch pSeries_setup.c | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.11-bk2/arch/ppc64/kernel/pSeries_setup.c =================================================================== --- linux-2.6.11-bk2.orig/arch/ppc64/kernel/pSeries_setup.c 2005-03-07 04:09:29.000000000 +0000 +++ linux-2.6.11-bk2/arch/ppc64/kernel/pSeries_setup.c 2005-03-07 04:15:22.000000000 +0000 @@ -322,8 +322,8 @@ static void __init pSeries_discover_pic static void pSeries_mach_cpu_die(void) { - idle_task_exit(); local_irq_disable(); + idle_task_exit(); /* Some hardware requires clearing the CPPR, while other hardware does not * it is safe either way */ From ntl at pobox.com Tue Mar 8 13:00:17 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 7 Mar 2005 20:00:17 -0600 Subject: [PATCH] update irq affinity mask when migrating irqs Message-ID: <20050308020017.GB21853@otto> When offlining a cpu, any device interrupts which are bound to the cpu have their affinity forcibly reset to all cpus (the default). However, the value in /proc/irq/XXX/smp_affinity remains unchanged. Since we're doing this while all the other cpus are stopped, it should be safe to just call desc->handler->set_affinity and manually update the irq_affinity array. 
Signed-off-by: Nathan Lynch xics.c | 11 ++--------- 1 files changed, 2 insertions(+), 9 deletions(-) Index: linux-2.6.11-bk2/arch/ppc64/kernel/xics.c =================================================================== --- linux-2.6.11-bk2.orig/arch/ppc64/kernel/xics.c 2005-03-02 07:38:10.000000000 +0000 +++ linux-2.6.11-bk2/arch/ppc64/kernel/xics.c 2005-03-07 03:52:08.000000000 +0000 @@ -704,15 +704,8 @@ void xics_migrate_irqs_away(void) virq, cpu); /* Reset affinity to all cpus */ - xics_status[0] = default_distrib_server; - - status = rtas_call(ibm_set_xive, 3, 1, NULL, irq, - xics_status[0], xics_status[1]); - if (status) - printk(KERN_ERR "migrate_irqs_away: irq=%d " - "ibm,set-xive returns %d\n", - virq, status); - + desc->handler->set_affinity(virq, CPU_MASK_ALL); + irq_affinity[virq] = CPU_MASK_ALL; unlock: spin_unlock_irqrestore(&desc->lock, flags); } From sfr at canb.auug.org.au Tue Mar 8 17:54:22 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 8 Mar 2005 17:54:22 +1100 Subject: [PATCH] PPC64: "invert" dma mapping routines Message-ID: <20050308175422.47ff6e85.sfr@canb.auug.org.au> Hi Andrew, Linus, This patch "inverts" the PPC64 dma mapping routines so that the pci_ and vio_ ... routines are implemented in terms of the dma_ ... routines (the vio_ routines disappear anyway as no one uses them directly any more). The most noticeable change after this patch is applied will be that the flags passed to dma_alloc_coherent will now be honoured (whereas they were previously silently ignored since we used to just call pci_alloc_consistent). 
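The dispatch scheme described above can be sketched in user space. This is a simplified model under stated assumptions: the model_* types and functions are hypothetical stand-ins for struct device, struct dma_mapping_ops and get_dma_ops(), only one operation is shown, and an unknown bus returns NULL here where the patch calls BUG(). The point is that the generic entry point looks up a per-bus ops table and passes the caller's flag through instead of dropping it.

```c
#include <stddef.h>

enum model_bus { MODEL_BUS_PCI, MODEL_BUS_VIO, MODEL_BUS_OTHER };

/* Hypothetical stand-in for struct device. */
struct model_device {
	enum model_bus bus;
	int last_flag;  /* records the flag seen by the backend */
};

/* Hypothetical stand-in for struct dma_mapping_ops (one member only). */
struct model_dma_ops {
	void *(*alloc_coherent)(struct model_device *dev, size_t size, int flag);
};

static char model_pool[64];  /* toy backing store */

/* Backend allocator: note that it receives the caller's flag. */
static void *model_alloc(struct model_device *dev, size_t size, int flag)
{
	dev->last_flag = flag;
	return size <= sizeof(model_pool) ? model_pool : NULL;
}

static const struct model_dma_ops model_pci_ops = { .alloc_coherent = model_alloc };
static const struct model_dma_ops model_vio_ops = { .alloc_coherent = model_alloc };

/* Select the ops table from the device's bus, as get_dma_ops() does. */
static const struct model_dma_ops *model_get_dma_ops(const struct model_device *dev)
{
	switch (dev->bus) {
	case MODEL_BUS_PCI:
		return &model_pci_ops;
	case MODEL_BUS_VIO:
		return &model_vio_ops;
	default:
		return NULL;
	}
}

/* Generic entry point: dispatch through the table, forwarding the flag. */
static void *model_dma_alloc_coherent(struct model_device *dev, size_t size, int flag)
{
	const struct model_dma_ops *ops = model_get_dma_ops(dev);

	if (ops)
		return ops->alloc_coherent(dev, size, flag);
	return NULL;  /* the patch calls BUG() at this point */
}
```

Keeping the bus-specific code behind one table means the generic dma_ functions need no per-bus branches of their own, and a new argument such as the allocation flag only has to be threaded through the table's signature.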
Signed-off-by: Stephen Rothwell diffstat looks like this: arch/ppc64/kernel/dma.c | 100 +++++++++++++-------------- arch/ppc64/kernel/iommu.c | 8 +- arch/ppc64/kernel/pci.c | 2 arch/ppc64/kernel/pci_direct_iommu.c | 34 +++++---- arch/ppc64/kernel/pci_iommu.c | 55 ++++++++------- arch/ppc64/kernel/vio.c | 55 +++++++++------ include/asm-ppc64/dma-mapping.h | 20 +++++ include/asm-ppc64/iommu.h | 6 - include/asm-ppc64/pci.h | 126 +---------------------------------- include/asm-ppc64/vio.h | 27 ------- 10 files changed, 166 insertions(+), 267 deletions(-) This has been compiled for iSeries, pSeries and g5 (default configs) and booted on iSeries. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff -ruNp linus/arch/ppc64/kernel/dma.c linus-dma.4/arch/ppc64/kernel/dma.c --- linus/arch/ppc64/kernel/dma.c 2004-10-26 16:06:41.000000000 +1000 +++ linus-dma.4/arch/ppc64/kernel/dma.c 2005-02-07 17:47:41.000000000 +1100 @@ -13,14 +13,23 @@ #include #include -int dma_supported(struct device *dev, u64 mask) +static struct dma_mapping_ops *get_dma_ops(struct device *dev) { if (dev->bus == &pci_bus_type) - return pci_dma_supported(to_pci_dev(dev), mask); + return &pci_dma_ops; #ifdef CONFIG_IBMVIO if (dev->bus == &vio_bus_type) - return vio_dma_supported(to_vio_dev(dev), mask); -#endif /* CONFIG_IBMVIO */ + return &vio_dma_ops; +#endif + return NULL; +} + +int dma_supported(struct device *dev, u64 mask) +{ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + return dma_ops->dma_supported(dev, mask); BUG(); return 0; } @@ -32,7 +41,7 @@ int dma_set_mask(struct device *dev, u64 return pci_set_dma_mask(to_pci_dev(dev), dma_mask); #ifdef CONFIG_IBMVIO if (dev->bus == &vio_bus_type) - return vio_set_dma_mask(to_vio_dev(dev), dma_mask); + return -EIO; #endif /* CONFIG_IBMVIO */ BUG(); return 0; @@ -42,12 +51,10 @@ EXPORT_SYMBOL(dma_set_mask); void *dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, int 
flag) { - if (dev->bus == &pci_bus_type) - return pci_alloc_consistent(to_pci_dev(dev), size, dma_handle); -#ifdef CONFIG_IBMVIO - if (dev->bus == &vio_bus_type) - return vio_alloc_consistent(to_vio_dev(dev), size, dma_handle); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + return dma_ops->alloc_coherent(dev, size, dma_handle, flag); BUG(); return NULL; } @@ -56,12 +63,10 @@ EXPORT_SYMBOL(dma_alloc_coherent); void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_handle) { - if (dev->bus == &pci_bus_type) - pci_free_consistent(to_pci_dev(dev), size, cpu_addr, dma_handle); -#ifdef CONFIG_IBMVIO - else if (dev->bus == &vio_bus_type) - vio_free_consistent(to_vio_dev(dev), size, cpu_addr, dma_handle); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + dma_ops->free_coherent(dev, size, cpu_addr, dma_handle); else BUG(); } @@ -70,12 +75,10 @@ EXPORT_SYMBOL(dma_free_coherent); dma_addr_t dma_map_single(struct device *dev, void *cpu_addr, size_t size, enum dma_data_direction direction) { - if (dev->bus == &pci_bus_type) - return pci_map_single(to_pci_dev(dev), cpu_addr, size, (int)direction); -#ifdef CONFIG_IBMVIO - if (dev->bus == &vio_bus_type) - return vio_map_single(to_vio_dev(dev), cpu_addr, size, direction); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + return dma_ops->map_single(dev, cpu_addr, size, direction); BUG(); return (dma_addr_t)0; } @@ -84,12 +87,10 @@ EXPORT_SYMBOL(dma_map_single); void dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size, enum dma_data_direction direction) { - if (dev->bus == &pci_bus_type) - pci_unmap_single(to_pci_dev(dev), dma_addr, size, (int)direction); -#ifdef CONFIG_IBMVIO - else if (dev->bus == &vio_bus_type) - vio_unmap_single(to_vio_dev(dev), dma_addr, size, direction); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops 
*dma_ops = get_dma_ops(dev); + + if (dma_ops) + dma_ops->unmap_single(dev, dma_addr, size, direction); else BUG(); } @@ -99,12 +100,11 @@ dma_addr_t dma_map_page(struct device *d unsigned long offset, size_t size, enum dma_data_direction direction) { - if (dev->bus == &pci_bus_type) - return pci_map_page(to_pci_dev(dev), page, offset, size, (int)direction); -#ifdef CONFIG_IBMVIO - if (dev->bus == &vio_bus_type) - return vio_map_page(to_vio_dev(dev), page, offset, size, direction); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + return dma_ops->map_single(dev, + (page_address(page) + offset), size, direction); BUG(); return (dma_addr_t)0; } @@ -113,12 +113,10 @@ EXPORT_SYMBOL(dma_map_page); void dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size, enum dma_data_direction direction) { - if (dev->bus == &pci_bus_type) - pci_unmap_page(to_pci_dev(dev), dma_address, size, (int)direction); -#ifdef CONFIG_IBMVIO - else if (dev->bus == &vio_bus_type) - vio_unmap_page(to_vio_dev(dev), dma_address, size, direction); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + dma_ops->unmap_single(dev, dma_address, size, direction); else BUG(); } @@ -127,12 +125,10 @@ EXPORT_SYMBOL(dma_unmap_page); int dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction) { - if (dev->bus == &pci_bus_type) - return pci_map_sg(to_pci_dev(dev), sg, nents, (int)direction); -#ifdef CONFIG_IBMVIO - if (dev->bus == &vio_bus_type) - return vio_map_sg(to_vio_dev(dev), sg, nents, direction); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + return dma_ops->map_sg(dev, sg, nents, direction); BUG(); return 0; } @@ -141,12 +137,10 @@ EXPORT_SYMBOL(dma_map_sg); void dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nhwentries, enum dma_data_direction direction) { - if (dev->bus == 
&pci_bus_type) - pci_unmap_sg(to_pci_dev(dev), sg, nhwentries, (int)direction); -#ifdef CONFIG_IBMVIO - else if (dev->bus == &vio_bus_type) - vio_unmap_sg(to_vio_dev(dev), sg, nhwentries, direction); -#endif /* CONFIG_IBMVIO */ + struct dma_mapping_ops *dma_ops = get_dma_ops(dev); + + if (dma_ops) + dma_ops->unmap_sg(dev, sg, nhwentries, direction); else BUG(); } diff -ruNp linus/arch/ppc64/kernel/iommu.c linus-dma.4/arch/ppc64/kernel/iommu.c --- linus/arch/ppc64/kernel/iommu.c 2005-01-09 10:05:39.000000000 +1100 +++ linus-dma.4/arch/ppc64/kernel/iommu.c 2005-02-07 15:00:06.000000000 +1100 @@ -513,8 +513,8 @@ void iommu_unmap_single(struct iommu_tab * Returns the virtual address of the buffer and sets dma_handle * to the dma address (mapping) of the first page. */ -void *iommu_alloc_consistent(struct iommu_table *tbl, size_t size, - dma_addr_t *dma_handle) +void *iommu_alloc_coherent(struct iommu_table *tbl, size_t size, + dma_addr_t *dma_handle, int flag) { void *ret = NULL; dma_addr_t mapping; @@ -538,7 +538,7 @@ void *iommu_alloc_consistent(struct iomm return NULL; /* Alloc enough pages (and possibly more) */ - ret = (void *)__get_free_pages(GFP_ATOMIC, order); + ret = (void *)__get_free_pages(flag, order); if (!ret) return NULL; memset(ret, 0, size); @@ -553,7 +553,7 @@ void *iommu_alloc_consistent(struct iomm return ret; } -void iommu_free_consistent(struct iommu_table *tbl, size_t size, +void iommu_free_coherent(struct iommu_table *tbl, size_t size, void *vaddr, dma_addr_t dma_handle) { unsigned int npages; diff -ruNp linus/arch/ppc64/kernel/pci.c linus-dma.4/arch/ppc64/kernel/pci.c --- linus/arch/ppc64/kernel/pci.c 2005-03-06 07:08:24.000000000 +1100 +++ linus-dma.4/arch/ppc64/kernel/pci.c 2005-03-07 10:23:14.000000000 +1100 @@ -71,7 +71,7 @@ void iSeries_pcibios_init(void); LIST_HEAD(hose_list); -struct pci_dma_ops pci_dma_ops; +struct dma_mapping_ops pci_dma_ops; EXPORT_SYMBOL(pci_dma_ops); int global_phb_number; /* Global phb counter */ diff -ruNp 
linus/arch/ppc64/kernel/pci_direct_iommu.c linus-dma.4/arch/ppc64/kernel/pci_direct_iommu.c --- linus/arch/ppc64/kernel/pci_direct_iommu.c 2005-01-09 10:05:39.000000000 +1100 +++ linus-dma.4/arch/ppc64/kernel/pci_direct_iommu.c 2005-02-07 16:00:47.000000000 +1100 @@ -30,12 +30,12 @@ #include "pci.h" -static void *pci_direct_alloc_consistent(struct pci_dev *hwdev, size_t size, - dma_addr_t *dma_handle) +static void *pci_direct_alloc_coherent(struct device *hwdev, size_t size, + dma_addr_t *dma_handle, int flag) { void *ret; - ret = (void *)__get_free_pages(GFP_ATOMIC, get_order(size)); + ret = (void *)__get_free_pages(flag, get_order(size)); if (ret != NULL) { memset(ret, 0, size); *dma_handle = virt_to_abs(ret); @@ -43,24 +43,24 @@ static void *pci_direct_alloc_consistent return ret; } -static void pci_direct_free_consistent(struct pci_dev *hwdev, size_t size, +static void pci_direct_free_coherent(struct device *hwdev, size_t size, void *vaddr, dma_addr_t dma_handle) { free_pages((unsigned long)vaddr, get_order(size)); } -static dma_addr_t pci_direct_map_single(struct pci_dev *hwdev, void *ptr, +static dma_addr_t pci_direct_map_single(struct device *hwdev, void *ptr, size_t size, enum dma_data_direction direction) { return virt_to_abs(ptr); } -static void pci_direct_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr, +static void pci_direct_unmap_single(struct device *hwdev, dma_addr_t dma_addr, size_t size, enum dma_data_direction direction) { } -static int pci_direct_map_sg(struct pci_dev *hwdev, struct scatterlist *sg, +static int pci_direct_map_sg(struct device *hwdev, struct scatterlist *sg, int nents, enum dma_data_direction direction) { int i; @@ -73,17 +73,23 @@ static int pci_direct_map_sg(struct pci_ return nents; } -static void pci_direct_unmap_sg(struct pci_dev *hwdev, struct scatterlist *sg, +static void pci_direct_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents, enum dma_data_direction direction) { } +static int 
pci_direct_dma_supported(struct device *dev, u64 mask) +{ + return mask < 0x100000000ull; +} + void __init pci_direct_iommu_init(void) { - pci_dma_ops.pci_alloc_consistent = pci_direct_alloc_consistent; - pci_dma_ops.pci_free_consistent = pci_direct_free_consistent; - pci_dma_ops.pci_map_single = pci_direct_map_single; - pci_dma_ops.pci_unmap_single = pci_direct_unmap_single; - pci_dma_ops.pci_map_sg = pci_direct_map_sg; - pci_dma_ops.pci_unmap_sg = pci_direct_unmap_sg; + pci_dma_ops.alloc_coherent = pci_direct_alloc_coherent; + pci_dma_ops.free_coherent = pci_direct_free_coherent; + pci_dma_ops.map_single = pci_direct_map_single; + pci_dma_ops.unmap_single = pci_direct_unmap_single; + pci_dma_ops.map_sg = pci_direct_map_sg; + pci_dma_ops.unmap_sg = pci_direct_unmap_sg; + pci_dma_ops.dma_supported = pci_direct_dma_supported; } diff -ruNp linus/arch/ppc64/kernel/pci_iommu.c linus-dma.4/arch/ppc64/kernel/pci_iommu.c --- linus/arch/ppc64/kernel/pci_iommu.c 2004-11-16 16:05:10.000000000 +1100 +++ linus-dma.4/arch/ppc64/kernel/pci_iommu.c 2005-02-07 15:10:05.000000000 +1100 @@ -50,19 +50,23 @@ */ #define PCI_GET_DN(dev) ((struct device_node *)((dev)->sysdata)) -static inline struct iommu_table *devnode_table(struct pci_dev *dev) +static inline struct iommu_table *devnode_table(struct device *dev) { - if (!dev) - dev = ppc64_isabridge_dev; - if (!dev) - return NULL; + struct pci_dev *pdev; + + if (!dev) { + pdev = ppc64_isabridge_dev; + if (!pdev) + return NULL; + } else + pdev = to_pci_dev(dev); #ifdef CONFIG_PPC_ISERIES - return ISERIES_DEVNODE(dev)->iommu_table; + return ISERIES_DEVNODE(pdev)->iommu_table; #endif /* CONFIG_PPC_ISERIES */ #ifdef CONFIG_PPC_MULTIPLATFORM - return PCI_GET_DN(dev)->iommu_table; + return PCI_GET_DN(pdev)->iommu_table; #endif /* CONFIG_PPC_MULTIPLATFORM */ } @@ -71,16 +75,17 @@ static inline struct iommu_table *devnod * Returns the virtual address of the buffer and sets dma_handle * to the dma address (mapping) of the first page. 
*/ -static void *pci_iommu_alloc_consistent(struct pci_dev *hwdev, size_t size, - dma_addr_t *dma_handle) +static void *pci_iommu_alloc_coherent(struct device *hwdev, size_t size, + dma_addr_t *dma_handle, int flag) { - return iommu_alloc_consistent(devnode_table(hwdev), size, dma_handle); + return iommu_alloc_coherent(devnode_table(hwdev), size, dma_handle, + flag); } -static void pci_iommu_free_consistent(struct pci_dev *hwdev, size_t size, +static void pci_iommu_free_coherent(struct device *hwdev, size_t size, void *vaddr, dma_addr_t dma_handle) { - iommu_free_consistent(devnode_table(hwdev), size, vaddr, dma_handle); + iommu_free_coherent(devnode_table(hwdev), size, vaddr, dma_handle); } /* Creates TCEs for a user provided buffer. The user buffer must be @@ -89,46 +94,46 @@ static void pci_iommu_free_consistent(st * need not be page aligned, the dma_addr_t returned will point to the same * byte within the page as vaddr. */ -static dma_addr_t pci_iommu_map_single(struct pci_dev *hwdev, void *vaddr, +static dma_addr_t pci_iommu_map_single(struct device *hwdev, void *vaddr, size_t size, enum dma_data_direction direction) { return iommu_map_single(devnode_table(hwdev), vaddr, size, direction); } -static void pci_iommu_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_handle, +static void pci_iommu_unmap_single(struct device *hwdev, dma_addr_t dma_handle, size_t size, enum dma_data_direction direction) { iommu_unmap_single(devnode_table(hwdev), dma_handle, size, direction); } -static int pci_iommu_map_sg(struct pci_dev *pdev, struct scatterlist *sglist, +static int pci_iommu_map_sg(struct device *pdev, struct scatterlist *sglist, int nelems, enum dma_data_direction direction) { - return iommu_map_sg(&pdev->dev, devnode_table(pdev), sglist, + return iommu_map_sg(pdev, devnode_table(pdev), sglist, nelems, direction); } -static void pci_iommu_unmap_sg(struct pci_dev *pdev, struct scatterlist *sglist, +static void pci_iommu_unmap_sg(struct device *pdev, struct 
scatterlist *sglist, int nelems, enum dma_data_direction direction) { iommu_unmap_sg(devnode_table(pdev), sglist, nelems, direction); } /* We support DMA to/from any memory page via the iommu */ -static int pci_iommu_dma_supported(struct pci_dev *pdev, u64 mask) +static int pci_iommu_dma_supported(struct device *dev, u64 mask) { return 1; } void pci_iommu_init(void) { - pci_dma_ops.pci_alloc_consistent = pci_iommu_alloc_consistent; - pci_dma_ops.pci_free_consistent = pci_iommu_free_consistent; - pci_dma_ops.pci_map_single = pci_iommu_map_single; - pci_dma_ops.pci_unmap_single = pci_iommu_unmap_single; - pci_dma_ops.pci_map_sg = pci_iommu_map_sg; - pci_dma_ops.pci_unmap_sg = pci_iommu_unmap_sg; - pci_dma_ops.pci_dma_supported = pci_iommu_dma_supported; + pci_dma_ops.alloc_coherent = pci_iommu_alloc_coherent; + pci_dma_ops.free_coherent = pci_iommu_free_coherent; + pci_dma_ops.map_single = pci_iommu_map_single; + pci_dma_ops.unmap_single = pci_iommu_unmap_single; + pci_dma_ops.map_sg = pci_iommu_map_sg; + pci_dma_ops.unmap_sg = pci_iommu_unmap_sg; + pci_dma_ops.dma_supported = pci_iommu_dma_supported; } diff -ruNp linus/arch/ppc64/kernel/vio.c linus-dma.4/arch/ppc64/kernel/vio.c --- linus/arch/ppc64/kernel/vio.c 2005-01-09 10:05:39.000000000 +1100 +++ linus-dma.4/arch/ppc64/kernel/vio.c 2005-02-07 15:45:00.000000000 +1100 @@ -557,48 +557,61 @@ int vio_disable_interrupts(struct vio_de EXPORT_SYMBOL(vio_disable_interrupts); #endif -dma_addr_t vio_map_single(struct vio_dev *dev, void *vaddr, +static dma_addr_t vio_map_single(struct device *dev, void *vaddr, size_t size, enum dma_data_direction direction) { - return iommu_map_single(dev->iommu_table, vaddr, size, direction); + return iommu_map_single(to_vio_dev(dev)->iommu_table, vaddr, size, + direction); } -EXPORT_SYMBOL(vio_map_single); -void vio_unmap_single(struct vio_dev *dev, dma_addr_t dma_handle, +static void vio_unmap_single(struct device *dev, dma_addr_t dma_handle, size_t size, enum dma_data_direction 
direction) { - iommu_unmap_single(dev->iommu_table, dma_handle, size, direction); + iommu_unmap_single(to_vio_dev(dev)->iommu_table, dma_handle, size, + direction); } -EXPORT_SYMBOL(vio_unmap_single); -int vio_map_sg(struct vio_dev *vdev, struct scatterlist *sglist, int nelems, - enum dma_data_direction direction) +static int vio_map_sg(struct device *dev, struct scatterlist *sglist, + int nelems, enum dma_data_direction direction) { - return iommu_map_sg(&vdev->dev, vdev->iommu_table, sglist, + return iommu_map_sg(dev, to_vio_dev(dev)->iommu_table, sglist, nelems, direction); } -EXPORT_SYMBOL(vio_map_sg); -void vio_unmap_sg(struct vio_dev *vdev, struct scatterlist *sglist, int nelems, - enum dma_data_direction direction) +static void vio_unmap_sg(struct device *dev, struct scatterlist *sglist, + int nelems, enum dma_data_direction direction) { - iommu_unmap_sg(vdev->iommu_table, sglist, nelems, direction); + iommu_unmap_sg(to_vio_dev(dev)->iommu_table, sglist, nelems, direction); } -EXPORT_SYMBOL(vio_unmap_sg); -void *vio_alloc_consistent(struct vio_dev *dev, size_t size, - dma_addr_t *dma_handle) +static void *vio_alloc_coherent(struct device *dev, size_t size, + dma_addr_t *dma_handle, int flag) { - return iommu_alloc_consistent(dev->iommu_table, size, dma_handle); + return iommu_alloc_coherent(to_vio_dev(dev)->iommu_table, size, + dma_handle, flag); } -EXPORT_SYMBOL(vio_alloc_consistent); -void vio_free_consistent(struct vio_dev *dev, size_t size, +static void vio_free_coherent(struct device *dev, size_t size, void *vaddr, dma_addr_t dma_handle) { - iommu_free_consistent(dev->iommu_table, size, vaddr, dma_handle); + iommu_free_coherent(to_vio_dev(dev)->iommu_table, size, vaddr, + dma_handle); } -EXPORT_SYMBOL(vio_free_consistent); + +static int vio_dma_supported(struct device *dev, u64 mask) +{ + return 1; +} + +struct dma_mapping_ops vio_dma_ops = { + .alloc_coherent = vio_alloc_coherent, + .free_coherent = vio_free_coherent, + .map_single = vio_map_single, + 
.unmap_single = vio_unmap_single, + .map_sg = vio_map_sg, + .unmap_sg = vio_unmap_sg, + .dma_supported = vio_dma_supported, +}; static int vio_bus_match(struct device *dev, struct device_driver *drv) { diff -ruNp linus/include/asm-ppc64/dma-mapping.h linus-dma.4/include/asm-ppc64/dma-mapping.h --- linus/include/asm-ppc64/dma-mapping.h 2004-09-14 21:06:08.000000000 +1000 +++ linus-dma.4/include/asm-ppc64/dma-mapping.h 2005-02-07 14:38:01.000000000 +1100 @@ -113,4 +113,24 @@ dma_cache_sync(void *vaddr, size_t size, /* nothing to do */ } +/* + * DMA operations are abstracted for G5 vs. i/pSeries, PCI vs. VIO + */ +struct dma_mapping_ops { + void * (*alloc_coherent)(struct device *dev, size_t size, + dma_addr_t *dma_handle, int flag); + void (*free_coherent)(struct device *dev, size_t size, + void *vaddr, dma_addr_t dma_handle); + dma_addr_t (*map_single)(struct device *dev, void *ptr, + size_t size, enum dma_data_direction direction); + void (*unmap_single)(struct device *dev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction direction); + int (*map_sg)(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction direction); + void (*unmap_sg)(struct device *dev, struct scatterlist *sg, + int nents, enum dma_data_direction direction); + int (*dma_supported)(struct device *dev, u64 mask); + int (*dac_dma_supported)(struct device *dev, u64 mask); +}; + #endif /* _ASM_DMA_MAPPING_H */ diff -ruNp linus/include/asm-ppc64/iommu.h linus-dma.4/include/asm-ppc64/iommu.h --- linus/include/asm-ppc64/iommu.h 2005-01-09 10:05:41.000000000 +1100 +++ linus-dma.4/include/asm-ppc64/iommu.h 2005-02-07 15:02:01.000000000 +1100 @@ -145,9 +145,9 @@ extern int iommu_map_sg(struct device *d extern void iommu_unmap_sg(struct iommu_table *tbl, struct scatterlist *sglist, int nelems, enum dma_data_direction direction); -extern void *iommu_alloc_consistent(struct iommu_table *tbl, size_t size, - dma_addr_t *dma_handle); -extern void iommu_free_consistent(struct 
iommu_table *tbl, size_t size, +extern void *iommu_alloc_coherent(struct iommu_table *tbl, size_t size, + dma_addr_t *dma_handle, int flag); +extern void iommu_free_coherent(struct iommu_table *tbl, size_t size, void *vaddr, dma_addr_t dma_handle); extern dma_addr_t iommu_map_single(struct iommu_table *tbl, void *vaddr, size_t size, enum dma_data_direction direction); diff -ruNp linus/include/asm-ppc64/pci.h linus-dma.4/include/asm-ppc64/pci.h --- linus/include/asm-ppc64/pci.h 2005-03-05 12:06:15.000000000 +1100 +++ linus-dma.4/include/asm-ppc64/pci.h 2005-03-07 10:24:32.000000000 +1100 @@ -13,11 +13,14 @@ #include #include #include + #include #include #include #include +#include + #define PCIBIOS_MIN_IO 0x1000 #define PCIBIOS_MIN_MEM 0x10000000 @@ -63,131 +66,18 @@ static inline int pcibios_prep_mwi(struc extern unsigned int pcibios_assign_all_busses(void); -/* - * PCI DMA operations are abstracted for G5 vs. i/pSeries - */ -struct pci_dma_ops { - void * (*pci_alloc_consistent)(struct pci_dev *hwdev, size_t size, - dma_addr_t *dma_handle); - void (*pci_free_consistent)(struct pci_dev *hwdev, size_t size, - void *vaddr, dma_addr_t dma_handle); - - dma_addr_t (*pci_map_single)(struct pci_dev *hwdev, void *ptr, - size_t size, enum dma_data_direction direction); - void (*pci_unmap_single)(struct pci_dev *hwdev, dma_addr_t dma_addr, - size_t size, enum dma_data_direction direction); - int (*pci_map_sg)(struct pci_dev *hwdev, struct scatterlist *sg, - int nents, enum dma_data_direction direction); - void (*pci_unmap_sg)(struct pci_dev *hwdev, struct scatterlist *sg, - int nents, enum dma_data_direction direction); - int (*pci_dma_supported)(struct pci_dev *hwdev, u64 mask); - int (*pci_dac_dma_supported)(struct pci_dev *hwdev, u64 mask); -}; - -extern struct pci_dma_ops pci_dma_ops; - -static inline void *pci_alloc_consistent(struct pci_dev *hwdev, size_t size, - dma_addr_t *dma_handle) -{ - return pci_dma_ops.pci_alloc_consistent(hwdev, size, dma_handle); -} - -static 
inline void pci_free_consistent(struct pci_dev *hwdev, size_t size, - void *vaddr, dma_addr_t dma_handle) -{ - pci_dma_ops.pci_free_consistent(hwdev, size, vaddr, dma_handle); -} - -static inline dma_addr_t pci_map_single(struct pci_dev *hwdev, void *ptr, - size_t size, int direction) -{ - return pci_dma_ops.pci_map_single(hwdev, ptr, size, - (enum dma_data_direction)direction); -} - -static inline void pci_unmap_single(struct pci_dev *hwdev, dma_addr_t dma_addr, - size_t size, int direction) -{ - pci_dma_ops.pci_unmap_single(hwdev, dma_addr, size, - (enum dma_data_direction)direction); -} - -static inline int pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg, - int nents, int direction) -{ - return pci_dma_ops.pci_map_sg(hwdev, sg, nents, - (enum dma_data_direction)direction); -} - -static inline void pci_unmap_sg(struct pci_dev *hwdev, struct scatterlist *sg, - int nents, int direction) -{ - pci_dma_ops.pci_unmap_sg(hwdev, sg, nents, - (enum dma_data_direction)direction); -} - -static inline void pci_dma_sync_single_for_cpu(struct pci_dev *hwdev, - dma_addr_t dma_handle, - size_t size, int direction) -{ - BUG_ON(direction == PCI_DMA_NONE); - /* nothing to do */ -} - -static inline void pci_dma_sync_single_for_device(struct pci_dev *hwdev, - dma_addr_t dma_handle, - size_t size, int direction) -{ - BUG_ON(direction == PCI_DMA_NONE); - /* nothing to do */ -} - -static inline void pci_dma_sync_sg_for_cpu(struct pci_dev *hwdev, - struct scatterlist *sg, - int nelems, int direction) -{ - BUG_ON(direction == PCI_DMA_NONE); - /* nothing to do */ -} - -static inline void pci_dma_sync_sg_for_device(struct pci_dev *hwdev, - struct scatterlist *sg, - int nelems, int direction) -{ - BUG_ON(direction == PCI_DMA_NONE); - /* nothing to do */ -} - -/* Return whether the given PCI device DMA address mask can - * be supported properly. 
For example, if your device can - * only drive the low 24-bits during PCI bus mastering, then - * you would pass 0x00ffffff as the mask to this function. - * We default to supporting only 32 bits DMA unless we have - * an explicit override of this function in pci_dma_ops for - * the platform - */ -static inline int pci_dma_supported(struct pci_dev *hwdev, u64 mask) -{ - if (pci_dma_ops.pci_dma_supported) - return pci_dma_ops.pci_dma_supported(hwdev, mask); - return (mask < 0x100000000ull); -} +extern struct dma_mapping_ops pci_dma_ops; /* For DAC DMA, we currently don't support it by default, but * we let the platform override this */ static inline int pci_dac_dma_supported(struct pci_dev *hwdev,u64 mask) { - if (pci_dma_ops.pci_dac_dma_supported) - return pci_dma_ops.pci_dac_dma_supported(hwdev, mask); + if (pci_dma_ops.dac_dma_supported) + return pci_dma_ops.dac_dma_supported(&hwdev->dev, mask); return 0; } -static inline int pci_dma_mapping_error(dma_addr_t dma_addr) -{ - return dma_mapping_error(dma_addr); -} - extern int pci_domain_nr(struct pci_bus *bus); /* Decide whether to display the domain number in /proc */ @@ -201,10 +91,6 @@ int pci_mmap_page_range(struct pci_dev * /* Tell drivers/pci/proc.c that we have pci_mmap_page_range() */ #define HAVE_PCI_MMAP 1 -#define pci_map_page(dev, page, off, size, dir) \ - pci_map_single(dev, (page_address(page) + (off)), size, dir) -#define pci_unmap_page(dev,addr,sz,dir) pci_unmap_single(dev,addr,sz,dir) - /* pci_unmap_{single,page} is not a nop, thus... 
*/ #define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME) \ dma_addr_t ADDR_NAME; diff -ruNp linus/include/asm-ppc64/vio.h linus-dma.4/include/asm-ppc64/vio.h --- linus/include/asm-ppc64/vio.h 2004-06-30 15:40:04.000000000 +1000 +++ linus-dma.4/include/asm-ppc64/vio.h 2005-02-07 15:42:37.000000000 +1100 @@ -57,32 +57,7 @@ int vio_get_irq(struct vio_dev *dev); int vio_enable_interrupts(struct vio_dev *dev); int vio_disable_interrupts(struct vio_dev *dev); -dma_addr_t vio_map_single(struct vio_dev *dev, void *vaddr, - size_t size, enum dma_data_direction direction); -void vio_unmap_single(struct vio_dev *dev, dma_addr_t dma_handle, - size_t size, enum dma_data_direction direction); -int vio_map_sg(struct vio_dev *vdev, struct scatterlist *sglist, - int nelems, enum dma_data_direction direction); -void vio_unmap_sg(struct vio_dev *vdev, struct scatterlist *sglist, - int nelems, enum dma_data_direction direction); -void *vio_alloc_consistent(struct vio_dev *dev, size_t size, - dma_addr_t *dma_handle); -void vio_free_consistent(struct vio_dev *dev, size_t size, void *vaddr, - dma_addr_t dma_handle); - -static inline int vio_dma_supported(struct vio_dev *hwdev, u64 mask) -{ - return 1; -} - -#define vio_map_page(dev, page, off, size, dir) \ - vio_map_single(dev, (page_address(page) + (off)), size, dir) -#define vio_unmap_page(dev,addr,sz,dir) vio_unmap_single(dev,addr,sz,dir) - -static inline int vio_set_dma_mask(struct vio_dev *dev, u64 mask) -{ - return -EIO; -} +extern struct dma_mapping_ops vio_dma_ops; extern struct bus_type vio_bus_type; From amodra at bigpond.net.au Tue Mar 8 20:49:33 2005 From: amodra at bigpond.net.au (Alan Modra) Date: Tue, 8 Mar 2005 20:19:33 +1030 Subject: gcc4 miscompiles glibc math test In-Reply-To: <20050306200139.GA15512@suse.de> References: <20050306164645.GA16851@suse.de> <20050306200139.GA15512@suse.de> Message-ID: <20050308094933.GB15642@bubble.modra.org> On Sun, Mar 06, 2005 at 09:01:39PM +0100, Olaf Hering wrote: > On Sun, Mar 06, Olaf Hering 
wrote:
>
> > I'm building gcc40 with -O1 now and see if that makes any difference.
>
> No, building gcc and glibc with -O1 doesn't fix it, still:
>
> abuild at tangelo:~/objglibc-40-O1> cat /home/abuild/objglibc-40-O1/math/test-float.out
> testing float (without inline functions)
> Failure: Test: Real part of: cpow (2 + 3 i, 4 + 0 i) == -119.0 - 120.0 i
> Result:
>  is:         -1.18999961853027343750e+02  -0x1.dbfff600000000000000p+6
>  should be:  -1.19000000000000000000e+02  -0x1.dc000000000000000000p+6
>  difference:  3.81469726562500000000e-05   0x1.40000000000000000000p-15
>  ulp       :  5.0000
>  max.ulp   :  4.0000
> Maximal error of real part of: cpow
>  is      : 5 ulp
>  accepted: 4 ulp
> Maximal error of imaginary part of: cpow
>  is      : 2 ulp
>  accepted: 2 ulp
>
> Test suite completed:
>   2599 test cases plus 2384 tests for exception flags executed.
>   2 errors occurred.
>
>
> Are you seeing the same, or should I go and extract a self-contained testcase?

I get exactly the same result. I wrote this little testcase to try to
narrow down the problem

#include <stdio.h>
#include <complex.h>

int main (void)
{
  _Complex float x = 2.0 + 3.0i;
  _Complex float y = 4.0 + 0.0i;
  _Complex float z;
  double dr, di;

  printf ("%.20e + %.20e i\n", (double) (__real__ x), (double) (__imag__ x));
  printf ("%.20e + %.20e i\n", (double) (__real__ y), (double) (__imag__ y));
  z = cpowf (x, y);
  printf ("%.20e + %.20e i\n", (double) (__real__ z), (double) (__imag__ z));
  dr = __real__ z;
  dr -= -119.0f;
  di = __imag__ z;
  di -= -120.0f;
  printf ("%.20e + %.20e i\n", dr, di);
  printf ("%f, %f ulp\n",
          (double) dr / -119.0 * (1 << 24),
          (double) di / -120.0 * (1 << 24));
  return 0;
}

Then I played games mixing various parts of a math library compiled with
gcc-4.0 with a math library compiled with gcc-3.4, until I got it down
to just one function from the gcc-4.0 compiled glibc.
$ gcc/xgcc -Bgcc/ -static -Wl,-u,__kernel_sinf ../glibc64-gcc4.0/math/libm.a cpow.o ../glibc64-gcc3.4/math/libm.a

So it's something to do with sysdeps/ieee754/flt-32/k_sinf.c, I thought.
Of course, the function is compiled quite differently by gcc-4.0 as
compared to gcc-3.4, with different registers and insn scheduling. So
it takes a little analysis to find out what is really different. The
first thing that stands out is that gcc-4.0 stores different constants,
preferring to subtract positive values rather than add negative ones.
This doesn't affect the result at all, of course. The only real
difference I found is right at the end of the function, with gcc-4.0
generating

  94: ec 02 03 38 fmsubs f0,f2,f12,f0    # y*0.5 - v*r
  98: ec 09 10 38 fmsubs f0,f9,f0,f2     # z*f0 - y
  9c: ed a8 03 7a fmadds f13,f8,f13,f0   # v*-S1 + f0
  a0: ec 21 68 28 fsubs f1,f1,f13        # x - f13

x-(v*-S1+(z*(y*0.5-v*r)-y))

vs. gcc-3.4 generating

  98: ed a1 03 72 fmuls f13,f1,f13       # v*S1
  9c: ec 02 03 38 fmsubs f0,f2,f12,f0    # y*0.5-v*r
  a0: ec 00 12 78 fmsubs f0,f0,f9,f2     # f0*z - y
  a4: ec 00 68 28 fsubs f0,f0,f13        # f0-v*S1
  a8: ec 28 00 28 fsubs f1,f8,f0         # x-f0

x-((y*0.5-v*r)*z-y-v*S1)

That's exactly the same, even to the order of operations, except that
gcc-3.4 has one extra rounding step, with v*S1 being rounded to float
before being added to the sum. The fmadds used by gcc-4.0 _doesn't_
round the product before adding.

So, not a bug, but just extra precision affecting the result. If the
algorithms in glibc are tuned for best results with ieee754 rounding at
each fp operation, then we probably ought to compile glibc with
-mno-fused-madd. This will make libm a little slower.
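To make the rounding difference concrete, here is a minimal sketch in C. It is not taken from glibc or from either compiler's output; the function names are made up for illustration. The first function rounds the product to float before the add, the way a separate fmuls followed by fsubs does. The second keeps the product unrounded by computing it in double (the product of two floats is always exact in double), which only emulates what fmadds does; a real fused multiply-add rounds once, whereas this emulation can round twice in rare cases.

```c
/* Product rounded to float before the add, as with separate fmuls + fsubs.
 * The volatile store forces the rounding and keeps the compiler from
 * contracting the two operations into a fused multiply-add. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;   /* rounded to 24 bits here */
    return product + c;
}

/* Product kept unrounded before the add, emulating fmadds.
 * Two 24-bit significands multiply into at most 48 bits, which fits
 * in a double, so nothing is lost until the final conversion. */
float mul_add_unrounded(float a, float b, float c)
{
    double exact = (double)a * (double)b;
    return (float)(exact + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding it to float drops the 2^-24 term, so mul_then_add() returns 0, while mul_add_unrounded() returns 2^-24. The same kind of leftover low-order bits is what shifts the cpow result by one ulp above.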
-- 
Alan Modra
IBM OzLabs - Linux Technology Centre
-------------- next part --------------

k_sinf.o:     file format elf64-powerpc

Disassembly of section .text:

0000000000000000 <.__kernel_sinf>:
   0: d0 21 ff f0 stfs f1,-16(r1)
   4: 3d 20 31 ff lis r9,12799
   8: 2f 25 00 00 cmpdi cr6,r5,0
   c: 61 29 ff ff ori r9,r9,65535
  10: 80 01 ff f0 lwz r0,-16(r1)
  14: 54 00 00 7e clrlwi r0,r0,1
  18: 7f 80 48 00 cmpw cr7,r0,r9
  1c: 41 9d 00 20 bgt- cr7,3c <.__kernel_sinf+0x3c>
  20: fc 00 08 90 fmr f0,f1
  24: fd a0 00 1e fctiwz f13,f0
  28: d9 a1 ff e0 stfd f13,-32(r1)
  2c: 60 00 00 00 nop
  30: 80 01 ff e4 lwz r0,-28(r1)
  34: 2f 80 00 00 cmpwi cr7,r0,0
  38: 4d 9e 00 20 beqlr cr7
  3c: ed 21 00 72 fmuls f9,f1,f1         # z=x*x
  40: c1 a2 00 08 lfs f13,8(r2)          # -S5
  44: c0 02 00 00 lfs f0,0(r2)           # S6
  48: c1 82 00 10 lfs f12,16(r2)         # S4
  4c: c1 62 00 18 lfs f11,24(r2)         # -S3
  50: c1 42 00 20 lfs f10,32(r2)         # S2
  54: ec 09 68 38 fmsubs f0,f9,f0,f13    # z*S6 - -S5
  58: ed 01 02 72 fmuls f8,f1,f9         # v=z*x
  5c: ec 09 60 3a fmadds f0,f9,f0,f12    # z*f0 + S4
  60: ec 09 58 38 fmsubs f0,f9,f0,f11    # z*f0 - -S3
  64: ed a9 50 3a fmadds f13,f9,f0,f10   # z*f0 + S2
  68: 40 9a 00 18 bne- cr6,80 <.__kernel_sinf+0x80>
  6c: c0 02 00 28 lfs f0,40(r2)          # -S1
  70: ec 09 03 78 fmsubs f0,f9,f13,f0    # z*r - -S1
  74: ec 28 08 3a fmadds f1,f8,f0,f1     # v*f0 + x
  78: 4e 80 00 20 blr
  7c: 60 00 00 00 nop
  80: 3c 00 3f 00 lis r0,16128
  84: ec 08 03 72 fmuls f0,f8,f13        # v*r
  88: c1 a2 00 28 lfs f13,40(r2)         # -S1
  8c: 90 01 ff f0 stw r0,-16(r1)
  90: c1 81 ff f0 lfs f12,-16(r1)        # 0.5
  94: ec 02 03 38 fmsubs f0,f2,f12,f0    # y*0.5 - v*r
  98: ec 09 10 38 fmsubs f0,f9,f0,f2     # z*f0 - y
  9c: ed a8 03 7a fmadds f13,f8,f13,f0   # v*-S1 + f0
  a0: ec 21 68 28 fsubs f1,f1,f13        # x - f13
  a4: 4e 80 00 20 blr
	...
  b4: 60 00 00 00 nop
  b8: 60 00 00 00 nop
  bc: 60 00 00 00 nop

Contents of section .toc:
 0000 2f2ec9d3 00000000 32d72f34 00000000  /.......2./4....
 0010 3638ef1b 00000000 39500d01 00000000  68......9P......
 0020 3c088889 00000000 3e2aaaab 00000000  <.......>*......

../../glibc64/math/k_sinf.o:     file format elf64-powerpc

Disassembly of section .text:

0000000000000000 <.__kernel_sinf>:
   0: d0 21 ff f0 stfs f1,-16(r1)
   4: 3d 20 31 ff lis r9,12799
   8: fd 00 08 90 fmr f8,f1
   c: 2f 25 00 00 cmpdi cr6,r5,0
  10: 80 01 ff f0 lwz r0,-16(r1)
  14: 61 29 ff ff ori r9,r9,65535
  18: 78 00 00 60 clrldi r0,r0,33
  1c: 7f 80 48 00 cmpw cr7,r0,r9
  20: 41 9d 00 24 bgt- cr7,44 <.__kernel_sinf+0x44>
  24: fc 00 40 90 fmr f0,f8
  28: fc 00 00 1e fctiwz f0,f0
  2c: d8 01 ff f8 stfd f0,-8(r1)
  30: e8 01 ff f8 ld r0,-8(r1)
  34: f8 01 ff e0 std r0,-32(r1)
  38: 81 21 ff e4 lwz r9,-28(r1)
  3c: 2f 89 00 00 cmpwi cr7,r9,0
  40: 4d 9e 00 20 beqlr cr7
  44: ed 28 02 32 fmuls f9,f8,f8         # z=x*x
  48: c1 a2 00 08 lfs f13,8(r2)          # S5
  4c: c0 02 00 00 lfs f0,0(r2)           # S6
  50: c1 82 00 10 lfs f12,16(r2)         # S4
  54: c1 62 00 18 lfs f11,24(r2)         # S3
  58: ec 29 02 32 fmuls f1,f9,f8         # v=z*x
  5c: ec 09 68 3a fmadds f0,f9,f0,f13    # z*S6+S5
  60: c1 42 00 20 lfs f10,32(r2)         # S2
  64: ec 00 62 7a fmadds f0,f0,f9,f12    # f0*z+S4
  68: ec 00 5a 7a fmadds f0,f0,f9,f11    # f0*z+S3
  6c: ed a0 52 7a fmadds f13,f0,f9,f10   # f0*z+S2
  70: 40 9a 00 14 bne- cr6,84 <.__kernel_sinf+0x84>
  74: c0 02 00 28 lfs f0,40(r2)          # S1
  78: ec 09 03 7a fmadds f0,f9,f13,f0    # z*r+S1
  7c: ec 20 40 7a fmadds f1,f0,f1,f8     # f0*v+x
  80: 4e 80 00 20 blr
  84: 3c 00 3f 00 lis r0,16128
  88: ec 01 03 72 fmuls f0,f1,f13        # v*r
  8c: c1 a2 00 28 lfs f13,40(r2)         # S1
  90: 90 01 ff f0 stw r0,-16(r1)
  94: c1 81 ff f0 lfs f12,-16(r1)        # 0.5
  98: ed a1 03 72 fmuls f13,f1,f13       # v*S1
  9c: ec 02 03 38 fmsubs f0,f2,f12,f0    # y*0.5-v*r
  a0: ec 00 12 78 fmsubs f0,f0,f9,f2     # f0*z - y
  a4: ec 00 68 28 fsubs f0,f0,f13        # f0-v*S1
  a8: ec 28 00 28 fsubs f1,f8,f0         # x-f0
  ac: 4e 80 00 20 blr
	...

Contents of section .toc:
 0000 2f2ec9d3 00000000 b2d72f34 00000000  /........./4....
 0010 3638ef1b 00000000 b9500d01 00000000  68.......P......
 0020 3c088889 00000000 be2aaaab 00000000  <........*......

From olh at suse.de Tue Mar 8 23:56:35 2005 From: olh at suse.de (Olaf Hering) Date: Tue, 8 Mar 2005 13:56:35 +0100 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: <1109806756.5680.127.camel@gaston> References: <20050302181206.GA2741@us.ibm.com> <1109806756.5680.127.camel@gaston> Message-ID: <20050308125635.GA19169@suse.de> On Thu, Mar 03, Benjamin Herrenschmidt wrote: > There is no ADB bus on a G5, so the driver isn't useful anyway. > Currently, ppc64 allows you to enable pmac drivers that won't build, but > they also are useless on G5s. I'll fix that over time though. They are of course not useless. Send this patch to Linus to allow the mouse button emulation until either someone split it off the ADB driver, or until someone fixes the stupid userinterfaces in Linux. Signed-off-by: Olaf Hering diff -p -purN R/linux-2.6.3/drivers/macintosh/adb.c linux-2.6.3/drivers/macintosh/adb.c --- R/linux-2.6.3/drivers/macintosh/adb.c 2004-02-18 04:59:56.000000000 +0100 +++ linux-2.6.3/drivers/macintosh/adb.c 2004-02-22 15:16:43.000000000 +0100 @@ -294,6 +294,10 @@ int __init adb_init(void) if ( (_machine != _MACH_chrp) && (_machine != _MACH_Pmac) ) return 0; #endif +#ifdef CONFIG_PPC64 + if (_machine != _MACH_Pmac) + return 0; +#endif #ifdef CONFIG_MAC if (!MACH_IS_MAC) return 0; diff -p -purN R/linux-2.6.3/drivers/macintosh/adbhid.c linux-2.6.3/drivers/macintosh/adbhid.c --- R/linux-2.6.3/drivers/macintosh/adbhid.c 2004-02-18 04:59:57.000000000 +0100 +++ linux-2.6.3/drivers/macintosh/adbhid.c 2004-02-22 15:41:34.000000000 +0100 @@ -1021,10 +1021,14 @@ init_ms_a3(int id) static int __init adbhid_init(void) { -#ifndef CONFIG_MAC +#ifdef CONFIG_PPC32 if ( (_machine != _MACH_chrp) && (_machine != _MACH_Pmac) ) return 0; #endif +#ifdef CONFIG_PPC64 + if (_machine != _MACH_Pmac) + return 0; +#endif led_request.complete = 1; From jschopp at austin.ibm.com Wed Mar 9 04:20:33 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Tue, 08 Mar 2005 11:20:33 -0600 
Subject: [PATCH] update irq affinity mask when migrating irqs
In-Reply-To: <20050308020017.GB21853@otto>
References: <20050308020017.GB21853@otto>
Message-ID: <422DDEE1.5040706@austin.ibm.com>

Comments below.

>
>
> Signed-off-by: Nathan Lynch
>
> xics.c | 11 ++---------
> 1 files changed, 2 insertions(+), 9 deletions(-)
>
> Index: linux-2.6.11-bk2/arch/ppc64/kernel/xics.c
> ===================================================================
> --- linux-2.6.11-bk2.orig/arch/ppc64/kernel/xics.c 2005-03-02 07:38:10.000000000 +0000
> +++ linux-2.6.11-bk2/arch/ppc64/kernel/xics.c 2005-03-07 03:52:08.000000000 +0000
> @@ -704,15 +704,8 @@ void xics_migrate_irqs_away(void)
> virq, cpu);
>
> /* Reset affinity to all cpus */
> - xics_status[0] = default_distrib_server;
> -
> - status = rtas_call(ibm_set_xive, 3, 1, NULL, irq,
> - xics_status[0], xics_status[1]);
> - if (status)
> - printk(KERN_ERR "migrate_irqs_away: irq=%d "
> - "ibm,set-xive returns %d\n",
> - virq, status);
> -
> + desc->handler->set_affinity(virq, CPU_MASK_ALL);

The downside of calling this is it increases the path length and causes
ibm_get_xive to be called again. Usually slightly slower is a fine
tradeoff for more readable code, but in this case I would have left it
how it was. With all the cpus stopped it is best to be as fast as
possible. Maybe this is still fast enough, but you'd have to test under
heavy load on a variety of systems to be sure.

> + irq_affinity[virq] = CPU_MASK_ALL;

This was a good catch.

> unlock:
> spin_unlock_irqrestore(&desc->lock, flags);
> }
> _______________________________________________
> Linuxppc64-dev mailing list
> Linuxppc64-dev at ozlabs.org
> https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev
>

From benh at kernel.crashing.org Wed Mar 9 09:26:55 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 09 Mar 2005 09:26:55 +1100
Subject: eeh.h compile warnings / adbhid.c build failure
In-Reply-To: <20050308125635.GA19169@suse.de>
References: <20050302181206.GA2741@us.ibm.com> <1109806756.5680.127.camel@gaston> <20050308125635.GA19169@suse.de>
Message-ID: <1110320815.13593.279.camel@gaston>

On Tue, 2005-03-08 at 13:56 +0100, Olaf Hering wrote:
> On Thu, Mar 03, Benjamin Herrenschmidt wrote:
>
> > There is no ADB bus on a G5, so the driver isn't useful anyway.
> > Currently, ppc64 allows you to enable pmac drivers that won't build, but
> > they also are useless on G5s. I'll fix that over time though.
>
> They are of course not useless. Send this patch to Linus to allow the
> mouse button emulation until either someone split it off the ADB driver,
> or until someone fixes the stupid userinterfaces in Linux.

Oh well, don't people buy real mice to plug on G5s ? :)

Anyway, mouse button emulation should be split off adb stuff.

Ben.

From moilanen at austin.ibm.com Wed Mar 9 09:59:04 2005
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 8 Mar 2005 16:59:04 -0600
Subject: [PATCH 0/2] No-exec support for ppc64
Message-ID: <20050308165904.0ce07112.moilanen@austin.ibm.com>

These patches add no execute support to PPC64. They prohibit executing
code on the stack, or most any non-text segment for both user space,
and kernel.

No execute is supported on Power4 processors and up. These processors
support pages that have a no-execute permission bit.

The patches include a base fixup from Anton Blanchard. This includes a
fix for the wrong bit being used for no-exec and for read/write on the
hardware PTEs.

Distros that compile w/ PT_GNU_STACK depend on Ben Herrenschmidt's vDSO
patches for the signal trampoline. Without them, the application will
hang on the first signal, because the return code is put on the signal
context stack to return to the kernel on completion of the signal
handler. The changes should be in the latest BK tree.

The patch is broken into two parts:

1/2: PPC64 no-exec support for user space: This will prohibit user space
apps from executing in segments not marked as executable. The base
support is in here as well.

2/2: PPC64 no-exec support for kernel space: This prohibits the kernel
from executing non-text code.

Thanks,

Jake

From flar at allandria.com Wed Mar 9 10:10:47 2005
From: flar at allandria.com (Brad Boyer)
Date: Tue, 8 Mar 2005 15:10:47 -0800
Subject: eeh.h compile warnings / adbhid.c build failure
In-Reply-To: <1110320815.13593.279.camel@gaston>
References: <20050302181206.GA2741@us.ibm.com> <1109806756.5680.127.camel@gaston> <20050308125635.GA19169@suse.de> <1110320815.13593.279.camel@gaston>
Message-ID: <20050308231046.GA19175@pants.nu>

On Wed, Mar 09, 2005 at 09:26:55AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2005-03-08 at 13:56 +0100, Olaf Hering wrote:
> > On Thu, Mar 03, Benjamin Herrenschmidt wrote:
> >
> > > There is no ADB bus on a G5, so the driver isn't useful anyway.
> > > Currently, ppc64 allows you to enable pmac drivers that won't build, but
> > > they also are useless on G5s. I'll fix that over time though.
> >
> > They are of course not useless. Send this patch to Linus to allow the
> > mouse button emulation until either someone split it off the ADB driver,
> > or until someone fixes the stupid userinterfaces in Linux.
>
> Oh well, don't people buy real mice to plug on G5s ? :)

I bought a very fancy USB pointing device for my G5, but it would be
nice to support the older stuff. Eventually I'm planning to add support
for the Griffin iMate, which would give us ADB on anything that supports
USB.
It's just not at the top of my list.

> Anyway, mouse button emulation should be split off adb stuff.

I'm pretty sure it already is. The last time I looked at it, the only
tie it had left was presenting itself as an ADB device to the input
layer (bustype of BUS_ADB). I have to admit I haven't tried it, but it
ought to work without any of the actual ADB code even compiled in.

I should caveat that by saying that you can't compile the in-kernel
mouse emulation in at all unless it's on MAC or PPC_PMAC due to the
fact that it's in the drivers/macintosh directory.

Brad Boyer
flar at allandria.com

From moilanen at austin.ibm.com Wed Mar 9 10:08:26 2005
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 8 Mar 2005 17:08:26 -0600
Subject: [PATCH 1/2] No-exec support for ppc64
In-Reply-To: <20050308165904.0ce07112.moilanen@austin.ibm.com>
References: <20050308165904.0ce07112.moilanen@austin.ibm.com>
Message-ID: <20050308170826.13a2299e.moilanen@austin.ibm.com>

No-exec base and user space support for PPC64.

This will prohibit user space apps that are compiled w/ PT_GNU_STACK
from executing in segments that are non-executable. Non-PT_GNU_STACK
compiled apps will work as well, but will not be able to take advantage
of the no-exec feature.

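The PT_GNU_STACK distinction described above can be sketched as a small user-space C function. This is an illustration only, not code from the patch: wants_exec_stack() is a hypothetical name, and the sketch assumes a Linux system whose <elf.h> provides Elf64_Phdr, PT_GNU_STACK and PF_X. It mirrors the behaviour the patch describes: a binary carrying a PT_GNU_STACK header gets a non-executable stack unless that header asks for PF_X, and a binary without the header keeps the legacy behaviour where readable mappings are also executable.

```c
#include <elf.h>     /* Elf64_Phdr, PT_GNU_STACK, PF_X (Linux/glibc header) */
#include <stddef.h>

/* Hypothetical helper: decide from the program headers whether the
 * binary should get an executable stack.
 * Returns 1 for an executable stack, 0 for a no-exec stack. */
int wants_exec_stack(const Elf64_Phdr *phdr, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (phdr[i].p_type == PT_GNU_STACK)
            /* The header is present: honour its execute flag. */
            return (phdr[i].p_flags & PF_X) != 0;
    }

    /* No PT_GNU_STACK header: a legacy binary, so fall back to
     * "read implies exec" and leave the stack executable. */
    return 1;
}
```

The fallback branch corresponds to the case where elf_read_implies_exec() in this patch evaluates to true, i.e. when no PT_GNU_STACK header was found.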
Signed-off-by: Jake Moilanen --- linux-2.6-bk-moilanen/arch/ppc64/kernel/head.S | 5 + linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_htab.c | 4 + linux-2.6-bk-moilanen/arch/ppc64/kernel/pSeries_lpar.c | 2 linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c | 14 +++-- linux-2.6-bk-moilanen/arch/ppc64/mm/hash_low.S | 12 ++-- linux-2.6-bk-moilanen/arch/ppc64/mm/hugetlbpage.c | 13 ++++ linux-2.6-bk-moilanen/fs/binfmt_elf.c | 2 linux-2.6-bk-moilanen/include/asm-ppc64/elf.h | 7 ++ linux-2.6-bk-moilanen/include/asm-ppc64/page.h | 19 ++++++- linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h | 45 +++++++++-------- 10 files changed, 87 insertions(+), 36 deletions(-) diff -puN arch/ppc64/kernel/head.S~nx-user-ppc64 arch/ppc64/kernel/head.S --- linux-2.6-bk/arch/ppc64/kernel/head.S~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/head.S 2005-03-08 16:08:54 -06:00 @@ -36,6 +36,7 @@ #include #include #include +#include #include #ifdef CONFIG_PPC_ISERIES @@ -950,11 +951,11 @@ END_FTR_SECTION_IFCLR(CPU_FTR_SLB) * accessing a userspace segment (even from the kernel). We assume * kernel addresses always have the high bit set. 
*/ - rlwinm r4,r4,32-23,29,29 /* DSISR_STORE -> _PAGE_RW */ + rlwinm r4,r4,32-25+9,31-9,31-9 /* DSISR_STORE -> _PAGE_RW */ rotldi r0,r3,15 /* Move high bit into MSR_PR posn */ orc r0,r12,r0 /* MSR_PR | ~high_bit */ rlwimi r4,r0,32-13,30,30 /* becomes _PAGE_USER access bit */ - ori r4,r4,1 /* add _PAGE_PRESENT */ + rlwimi r4,r5,22+2,31-2,31-2 /* Set _PAGE_EXEC if trap is 0x400 */ /* * On iSeries, we soft-disable interrupts here, then diff -puN arch/ppc64/kernel/iSeries_htab.c~nx-user-ppc64 arch/ppc64/kernel/iSeries_htab.c --- linux-2.6-bk/arch/ppc64/kernel/iSeries_htab.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_htab.c 2005-03-08 16:08:54 -06:00 @@ -144,6 +144,10 @@ static long iSeries_hpte_updatepp(unsign HvCallHpt_get(&hpte, slot); if ((hpte.dw0.dw0.avpn == avpn) && (hpte.dw0.dw0.v)) { + /* + * Hypervisor expects bit's as NPPP, which is + * different from how they are mapped in our PP. + */ HvCallHpt_setPp(slot, (newpp & 0x3) | ((newpp & 0x4) << 1)); iSeries_hunlock(slot); return 0; diff -puN arch/ppc64/kernel/pSeries_lpar.c~nx-user-ppc64 arch/ppc64/kernel/pSeries_lpar.c --- linux-2.6-bk/arch/ppc64/kernel/pSeries_lpar.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/pSeries_lpar.c 2005-03-08 16:08:54 -06:00 @@ -470,7 +470,7 @@ static void pSeries_lpar_hpte_updatebolt slot = pSeries_lpar_hpte_find(vpn); BUG_ON(slot == -1); - flags = newpp & 3; + flags = newpp & 7; lpar_rc = plpar_pte_protect(flags, slot, 0); BUG_ON(lpar_rc != H_Success); diff -puN arch/ppc64/mm/fault.c~nx-user-ppc64 arch/ppc64/mm/fault.c --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-08 16:08:54 -06:00 @@ -93,6 +93,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long code = SEGV_MAPERR; unsigned long is_write = error_code & 0x02000000; unsigned long trap = TRAP(regs); + unsigned long is_exec = trap == 0x400; 
BUG_ON((trap == 0x380) || (trap == 0x480)); @@ -199,16 +200,19 @@ int do_page_fault(struct pt_regs *regs, good_area: code = SEGV_ACCERR; + if (is_exec) { + /* protection fault */ + if (error_code & 0x08000000) + goto bad_area; + if (!(vma->vm_flags & VM_EXEC)) + goto bad_area; /* a write */ - if (is_write) { + } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; /* a read */ } else { - /* protection fault */ - if (error_code & 0x08000000) - goto bad_area; - if (!(vma->vm_flags & (VM_READ | VM_EXEC))) + if (!(vma->vm_flags & VM_READ)) goto bad_area; } diff -puN arch/ppc64/mm/hash_low.S~nx-user-ppc64 arch/ppc64/mm/hash_low.S --- linux-2.6-bk/arch/ppc64/mm/hash_low.S~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_low.S 2005-03-08 16:08:54 -06:00 @@ -89,7 +89,7 @@ _GLOBAL(__hash_page) /* Prepare new PTE value (turn access RW into DIRTY, then * add BUSY,HASHPTE and ACCESSED) */ - rlwinm r30,r4,5,24,24 /* _PAGE_RW -> _PAGE_DIRTY */ + rlwinm r30,r4,32-9+7,31-7,31-7 /* _PAGE_RW -> _PAGE_DIRTY */ or r30,r30,r31 ori r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE /* Write the linux PTE atomically (setting busy) */ @@ -112,11 +112,11 @@ _GLOBAL(__hash_page) rldicl r5,r5,0,25 /* vsid & 0x0000007fffffffff */ rldicl r0,r3,64-12,48 /* (ea >> 12) & 0xffff */ xor r28,r5,r0 - - /* Convert linux PTE bits into HW equivalents - */ - andi. r3,r30,0x1fa /* Get basic set of flags */ - rlwinm r0,r30,32-2+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ + + /* Convert linux PTE bits into HW equivalents */ + andi. 
r3,r30,0x1fe /* Get basic set of flags */ + xori r3,r3,HW_NO_EXEC /* _PAGE_EXEC -> NOEXEC */ + rlwinm r0,r30,32-9+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ rlwinm r4,r30,32-7+1,30,30 /* _PAGE_DIRTY -> _PAGE_USER (r4) */ and r0,r0,r4 /* _PAGE_RW & _PAGE_DIRTY -> r0 bit 30 */ andc r0,r30,r0 /* r0 = pte & ~r0 */ diff -puN arch/ppc64/mm/hugetlbpage.c~nx-user-ppc64 arch/ppc64/mm/hugetlbpage.c --- linux-2.6-bk/arch/ppc64/mm/hugetlbpage.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hugetlbpage.c 2005-03-08 16:08:54 -06:00 @@ -786,6 +786,7 @@ int hash_huge_page(struct mm_struct *mm, pte_t old_pte, new_pte; unsigned long hpteflags, prpn; long slot; + int is_exec; int err = 1; spin_lock(&mm->page_table_lock); @@ -796,6 +797,10 @@ int hash_huge_page(struct mm_struct *mm, va = (vsid << 28) | (ea & 0x0fffffff); vpn = va >> HPAGE_SHIFT; + is_exec = access & _PAGE_EXEC; + if (unlikely(is_exec && !(pte_val(*ptep) & _PAGE_EXEC))) + goto out; + /* * If no pte found or not present, send the problem up to * do_page_fault @@ -828,7 +833,12 @@ int hash_huge_page(struct mm_struct *mm, old_pte = *ptep; new_pte = old_pte; - hpteflags = 0x2 | (! (pte_val(new_pte) & _PAGE_RW)); + hpteflags = (pte_val(new_pte) & _PAGE_RW) | + (!(pte_val(new_pte) & _PAGE_RW)) | + _PAGE_USER; + + /* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */ + hpteflags |= ((pte_val(new_pte) & _PAGE_EXEC) ? 
0 : HW_NO_EXEC); /* Check if pte already has an hpte (case 2) */ if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) { @@ -898,6 +908,7 @@ repeat: err = 0; out: + spin_unlock(&mm->page_table_lock); return err; diff -puN fs/binfmt_elf.c~nx-user-ppc64 fs/binfmt_elf.c --- linux-2.6-bk/fs/binfmt_elf.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/fs/binfmt_elf.c 2005-03-08 16:08:54 -06:00 @@ -99,6 +99,8 @@ static int set_brk(unsigned long start, up_write(&current->mm->mmap_sem); if (BAD_ADDR(addr)) return addr; + + sys_mprotect(start, end-start, PROT_READ|PROT_WRITE|PROT_EXEC); } current->mm->start_brk = current->mm->brk = end; return 0; diff -puN include/asm-ppc64/elf.h~nx-user-ppc64 include/asm-ppc64/elf.h --- linux-2.6-bk/include/asm-ppc64/elf.h~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/elf.h 2005-03-08 16:08:54 -06:00 @@ -226,6 +226,13 @@ do { \ else if (current->personality != PER_LINUX32) \ set_personality(PER_LINUX); \ } while (0) + +/* + * An executable for which elf_read_implies_exec() returns TRUE will + * have the READ_IMPLIES_EXEC personality flag set automatically.
+ */ +#define elf_read_implies_exec(ex, have_pt_gnu_stack) (!(have_pt_gnu_stack)) + #endif /* diff -puN include/asm-ppc64/page.h~nx-user-ppc64 include/asm-ppc64/page.h --- linux-2.6-bk/include/asm-ppc64/page.h~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/page.h 2005-03-08 16:08:54 -06:00 @@ -235,8 +235,25 @@ extern u64 ppc64_pft_size; /* Log 2 of #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) -#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \ +#define VM_DATA_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | \ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_STACK_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_STACK_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? \ + VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64) + +#define VM_STACK_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? 
\ + VM_STACK_DEFAULT_FLAGS32 : VM_STACK_DEFAULT_FLAGS64) #endif /* __KERNEL__ */ #endif /* _PPC64_PAGE_H */ diff -puN include/asm-ppc64/pgtable.h~nx-user-ppc64 include/asm-ppc64/pgtable.h --- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h 2005-03-08 16:08:54 -06:00 @@ -82,14 +82,14 @@ #define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ #define _PAGE_USER 0x0002 /* matches one of the PP bits */ #define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ -#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */ #define _PAGE_GUARDED 0x0008 #define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ #define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ #define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ #define _PAGE_DIRTY 0x0080 /* C: page changed */ #define _PAGE_ACCESSED 0x0100 /* R: page referenced */ -#define _PAGE_EXEC 0x0200 /* software: i-cache coherence required */ +#define _PAGE_RW 0x0200 /* software: user write access allowed */ #define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ @@ -100,7 +100,7 @@ /* PAGE_MASK gives the right answer below, but only by accident */ /* It should be preserving the high 48 bits and then specifically */ /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ -#define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_HPTEFLAGS) +#define _PAGE_CHG_MASK (_PAGE_GUARDED | _PAGE_COHERENT | _PAGE_NO_CACHE | _PAGE_WRITETHRU | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_HPTEFLAGS | PAGE_MASK) #define _PAGE_BASE (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_COHERENT) @@ -116,31 +116,38 @@ #define PAGE_READONLY __pgprot(_PAGE_BASE | _PAGE_USER) #define 
PAGE_READONLY_X __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC) #define PAGE_KERNEL __pgprot(_PAGE_BASE | _PAGE_WRENABLE) -#define PAGE_KERNEL_CI __pgprot(_PAGE_PRESENT | _PAGE_ACCESSED | \ - _PAGE_WRENABLE | _PAGE_NO_CACHE | _PAGE_GUARDED) + +#define HW_NO_EXEC _PAGE_EXEC /* This is used when the bit is + * inverted, even though it's the + * same value, hopefully it will be + * clearer in the code what is + * going on. */ /* - * The PowerPC can only do execute protection on a segment (256MB) basis, - * not on a page basis. So we consider execute permission the same as read. + * POWER4 and newer have per page execute protection, older chips can only + * do this on a segment (256MB) basis. + * * Also, write permissions imply read permissions. * This is the closest we can get.. + * + * Note due to the way vm flags are laid out, the bits are XWR */ #define __P000 PAGE_NONE -#define __P001 PAGE_READONLY_X +#define __P001 PAGE_READONLY #define __P010 PAGE_COPY -#define __P011 PAGE_COPY_X -#define __P100 PAGE_READONLY +#define __P011 PAGE_COPY +#define __P100 PAGE_READONLY_X #define __P101 PAGE_READONLY_X -#define __P110 PAGE_COPY +#define __P110 PAGE_COPY_X #define __P111 PAGE_COPY_X #define __S000 PAGE_NONE -#define __S001 PAGE_READONLY_X +#define __S001 PAGE_READONLY #define __S010 PAGE_SHARED -#define __S011 PAGE_SHARED_X -#define __S100 PAGE_READONLY +#define __S011 PAGE_SHARED +#define __S100 PAGE_READONLY_X #define __S101 PAGE_READONLY_X -#define __S110 PAGE_SHARED +#define __S110 PAGE_SHARED_X #define __S111 PAGE_SHARED_X #ifndef __ASSEMBLY__ @@ -197,7 +204,8 @@ void hugetlb_mm_free_pgd(struct mm_struc }) #define pte_modify(_pte, newprot) \ - (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | pgprot_val(newprot))) + (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | \ + (pgprot_val(newprot) & ~_PAGE_CHG_MASK))) #define pte_none(pte) ((pte_val(pte) & ~_PAGE_HPTEFLAGS) == 0) #define pte_present(pte) (pte_val(pte) & _PAGE_PRESENT) @@ -266,9 +274,6 @@ static inline int pte_young(pte_t 
pte) { static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;} static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_HUGE;} -static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; } -static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; } - static inline pte_t pte_rdprotect(pte_t pte) { pte_val(pte) &= ~_PAGE_USER; return pte; } static inline pte_t pte_exprotect(pte_t pte) { @@ -438,7 +443,7 @@ static inline void set_pte_at(struct mm_ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty) { unsigned long bits = pte_val(entry) & - (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW); + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC); unsigned long old, tmp; __asm__ __volatile__( _ From moilanen at austin.ibm.com Wed Mar 9 10:13:26 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Tue, 8 Mar 2005 17:13:26 -0600 Subject: [PATCH 2/2] No-exec support for ppc64 In-Reply-To: <20050308165904.0ce07112.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> Message-ID: <20050308171326.3d72363a.moilanen@austin.ibm.com> No-exec support for the kernel on PPC64. This will mark all non-text kernel pages as no-execute. 
Signed-off-by: Jake Moilanen --- linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c | 7 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/module.c | 3 + linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c | 25 ++++++++++++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c | 31 ++++++++++++---- linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h | 1 5 files changed, 59 insertions(+), 8 deletions(-) diff -puN arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 arch/ppc64/kernel/iSeries_setup.c --- linux-2.6-bk/arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c 2005-03-08 16:08:57 -06:00 @@ -624,6 +624,7 @@ static void __init iSeries_bolt_kernel(u { unsigned long pa; unsigned long mode_rw = _PAGE_ACCESSED | _PAGE_COHERENT | PP_RWXX; + unsigned long tmp_mode; HPTE hpte; for (pa = saddr; pa < eaddr ;pa += PAGE_SIZE) { @@ -632,6 +633,12 @@ static void __init iSeries_bolt_kernel(u unsigned long va = (vsid << 28) | (pa & 0xfffffff); unsigned long vpn = va >> PAGE_SHIFT; unsigned long slot = HvCallHpt_findValid(&hpte, vpn); + + tmp_mode = mode_rw; + + /* Make non-kernel text non-executable */ + if (!is_kernel_text(ea)) + tmp_mode = mode_rw | HW_NO_EXEC; if (hpte.dw0.dw0.v) { /* HPTE exists, so just bolt it */ diff -puN arch/ppc64/kernel/module.c~nx-kernel-ppc64 arch/ppc64/kernel/module.c --- linux-2.6-bk/arch/ppc64/kernel/module.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/module.c 2005-03-08 16:08:57 -06:00 @@ -102,7 +102,8 @@ void *module_alloc(unsigned long size) { if (size == 0) return NULL; - return vmalloc(size); + + return vmalloc_exec(size); } /* Free memory returned from module_alloc */ diff -puN arch/ppc64/mm/fault.c~nx-kernel-ppc64 arch/ppc64/mm/fault.c --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-08 16:08:57 -06:00 @@ -76,6 +76,21 @@ static int 
store_updates_sp(struct pt_re return 0; } +pte_t *lookup_address(unsigned long address) +{ + pgd_t *pgd = pgd_offset_k(address); + pmd_t *pmd; + + if (pgd_none(*pgd)) + return NULL; + + pmd = pmd_offset(pgd, address); + if (pmd_none(*pmd)) + return NULL; + + return pte_offset_kernel(pmd, address); +} + /* * The error_code parameter is * - DSISR for a non-SLB data access fault, @@ -94,6 +109,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long is_write = error_code & 0x02000000; unsigned long trap = TRAP(regs); unsigned long is_exec = trap == 0x400; + pte_t *ptep; BUG_ON((trap == 0x380) || (trap == 0x480)); @@ -253,6 +269,15 @@ bad_area_nosemaphore: info.si_addr = (void __user *) address; force_sig_info(SIGSEGV, &info, current); return 0; + } + + ptep = lookup_address(address); + + if (ptep && pte_present(*ptep) && !pte_exec(*ptep)) { + if (printk_ratelimit()) + printk(KERN_CRIT "kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n", current->uid); + show_stack(current, (unsigned long *)__get_SP()); + do_exit(SIGKILL); } return SIGSEGV; diff -puN arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 arch/ppc64/mm/hash_utils.c --- linux-2.6-bk/arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c 2005-03-08 16:08:57 -06:00 @@ -51,6 +51,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) 
udbg_printf(fmt) @@ -89,12 +90,23 @@ static inline void loop_forever(void) ; } +int is_kernel_text(unsigned long addr) +{ + if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end) + return 1; + + return 0; +} + + + #ifdef CONFIG_PPC_MULTIPLATFORM static inline void create_pte_mapping(unsigned long start, unsigned long end, unsigned long mode, int large) { unsigned long addr; unsigned int step; + unsigned long tmp_mode; if (large) step = 16*MB; @@ -112,6 +124,13 @@ static inline void create_pte_mapping(un else vpn = va >> PAGE_SHIFT; + + tmp_mode = mode; + + /* Make non-kernel text non-executable */ + if (!is_kernel_text(addr)) + tmp_mode = mode | HW_NO_EXEC; + hash = hpt_hash(vpn, large); hpteg = ((hash & htab_hash_mask) * HPTES_PER_GROUP); @@ -120,12 +139,12 @@ static inline void create_pte_mapping(un if (systemcfg->platform & PLATFORM_LPAR) ret = pSeries_lpar_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); else #endif /* CONFIG_PPC_PSERIES */ ret = native_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); if (ret == -1) { ppc64_terminate_msg(0x20, "create_pte_mapping"); @@ -238,8 +257,6 @@ unsigned int hash_page_do_lazy_icache(un { struct page *page; -#define PPC64_HWNOEXEC (1 << 2) - if (!pfn_valid(pte_pfn(pte))) return pp; @@ -250,8 +267,8 @@ unsigned int hash_page_do_lazy_icache(un if (trap == 0x400) { __flush_dcache_icache(page_address(page)); set_bit(PG_arch_1, &page->flags); - } else - pp |= PPC64_HWNOEXEC; + } else + pp |= HW_NO_EXEC; } return pp; } @@ -271,7 +288,7 @@ int hash_page(unsigned long ea, unsigned int user_region = 0; int local = 0; cpumask_t tmp; - + switch (REGION_ID(ea)) { case USER_REGION_ID: user_region = 1; diff -puN include/asm-ppc64/pgtable.h~nx-kernel-ppc64 include/asm-ppc64/pgtable.h --- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 +++ 
linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h 2005-03-08 16:08:57 -06:00 @@ -116,6 +116,7 @@ #define PAGE_READONLY __pgprot(_PAGE_BASE | _PAGE_USER) #define PAGE_READONLY_X __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC) #define PAGE_KERNEL __pgprot(_PAGE_BASE | _PAGE_WRENABLE) +#define PAGE_KERNEL_EXEC __pgprot(_PAGE_BASE | _PAGE_WRENABLE | _PAGE_EXEC) #define HW_NO_EXEC _PAGE_EXEC /* This is used when the bit is * inverted, even though it's the _ From benh at kernel.crashing.org Wed Mar 9 10:30:08 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 09 Mar 2005 10:30:08 +1100 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: <20050308231046.GA19175@pants.nu> References: <20050302181206.GA2741@us.ibm.com> <1109806756.5680.127.camel@gaston> <20050308125635.GA19169@suse.de> <1110320815.13593.279.camel@gaston> <20050308231046.GA19175@pants.nu> Message-ID: <1110324608.32556.1.camel@gaston> On Tue, 2005-03-08 at 15:10 -0800, Brad Boyer wrote: > On Wed, Mar 09, 2005 at 09:26:55AM +1100, Benjamin Herrenschmidt wrote: > > On Tue, 2005-03-08 at 13:56 +0100, Olaf Hering wrote: > > > On Thu, Mar 03, Benjamin Herrenschmidt wrote: > > > > > > > There is no ADB bus on a G5, so the driver isn't useful anyway. > > > > Currently, ppc64 allows you to enable pmac drivers that won't build, but > > > > they also are useless on G5s. I'll fix that over time though. > > > > > > They are of course not useless. Send this patch to Linus to allow the > > > mouse button emulation until either someone split it off the ADB driver, > > > or until someone fixes the stupid userinterfaces in Linux. > > > > Oh well, don't people buy real mice to plug on G5s ? :) > > I bought a very fancy USB pointing device for my G5, but it would be > nice to support the older stuff. Eventually I'm planning to add > support for the Griffin iMate, which would give us ADB on anything > that supports USB. It's just not at the top of my list. 
If we go that way, then we should finally bite the bullet and get ADB in
the device model (define an adb bus_type, with proper drivers etc...).
That would also allow a clean mechanism (sysfs properties) for things like
trackpad settings, etc...

> > Anyway, mouse button emulation should be split off adb stuff.
>
> I'm pretty sure it already is. The last time I looked at it, the
> only tie it had left was presenting itself as an ADB device to
> the input layer (bustype of BUS_ADB). I have to admit I haven't
> tried it, but it ought to work without any of the actual ADB code
> even compiled in.
>
> I should caveat that by saying that you can't compile the in-kernel
> mouse emulation in at all unless it's on MAC or PPC_PMAC due to the
> fact that it's in the drivers/macintosh directory.
>
> Brad Boyer
> flar at allandria.com

-- 
Benjamin Herrenschmidt

From olof at austin.ibm.com Wed Mar 9 10:43:54 2005
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 8 Mar 2005 17:43:54 -0600
Subject: [PATCH] update irq affinity mask when migrating irqs
In-Reply-To: <422DDEE1.5040706@austin.ibm.com>
References: <20050308020017.GB21853@otto> <422DDEE1.5040706@austin.ibm.com>
Message-ID: <20050308234354.GB18077@austin.ibm.com>

On Tue, Mar 08, 2005 at 11:20:33AM -0600, Joel Schopp wrote:
> The downside of calling this is it increases the path length and causes
> ibm_get_xive to be called again. Usually slightly slower is a fine
> tradeoff for more readable code, but in this case I would have left it
> how it was. With all the cpus stopped it is best to be as fast as

Is CPU removal really that performance critical a path? How long does
an ibm_get_xive call take?

> possible. Maybe this is still fast enough, but you'd have to test under
> heavy load on a variety of systems to be sure.

Please define "fast enough".

-Olof

From jschopp at austin.ibm.com Wed Mar 9 11:31:46 2005
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Tue, 08 Mar 2005 18:31:46 -0600
Subject: [PATCH] update irq affinity mask when migrating irqs
In-Reply-To: <20050308234354.GB18077@austin.ibm.com>
References: <20050308020017.GB21853@otto> <422DDEE1.5040706@austin.ibm.com> <20050308234354.GB18077@austin.ibm.com>
Message-ID: <422E43F2.80601@austin.ibm.com>

Olof Johansson wrote:
> On Tue, Mar 08, 2005 at 11:20:33AM -0600, Joel Schopp wrote:
>
>
>>The downside of calling this is it increases the path length and causes
>> ibm_get_xive to be called again. Usually slightly slower is a fine
>>tradeoff for more readable code, but in this case I would have left it
>>how it was. With all the cpus stopped it is best to be as fast as
>
>
> Is CPU removal really that performance critical a path? How long does
> an ibm_get_xive call take?

The part of it where we have all the cpus with interrupts disabled
running our high priority tasks is VERY performance critical. Look at
the __stop_machine_run() and stop_machine code. The rest of it is not
performance critical at all.

I couldn't tell you how long an ibm_get_xive call takes, and I suppose it
would vary from system to system. I doubt that adding an ibm_get_xive call
and a few other instructions would do much damage. Still, I'd hate it to
be the straw that broke the camel's back. The way we'd notice such
problems would make it painful to determine the root cause in the field.

>
>
>>possible. Maybe this is still fast enough, but you'd have to test under
>>heavy load on a variety of systems to be sure.
>
>
> Please define "fast enough".

__stop_machine_run doesn't interfere with system operation. No buffers
get overfilled, no interrupts get lost, no packets get dropped, etc.

From paulus at samba.org Wed Mar 9 11:59:13 2005
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 9 Mar 2005 11:59:13 +1100
Subject: [PATCH] update irq affinity mask when migrating irqs
In-Reply-To: <422E43F2.80601@austin.ibm.com>
References: <20050308020017.GB21853@otto> <422DDEE1.5040706@austin.ibm.com> <20050308234354.GB18077@austin.ibm.com> <422E43F2.80601@austin.ibm.com>
Message-ID: <16942.19041.842622.791326@cargo.ozlabs.ibm.com>

Joel Schopp writes:

> The part of it where we have all the cpus with interrupts disabled
> running our high priority tasks is VERY performance critical. Look at
> the __stop_machine_run() and stop_machine code. The rest of it is not
> performance critical at all.

Well... we have to be careful not to take too long, but I really can't
imagine that an extra procedure call and return is going to cause any
problems.

Paul.

From sfr at canb.auug.org.au Wed Mar 9 12:03:43 2005
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 9 Mar 2005 12:03:43 +1100
Subject: [RFC][PATCH] combining header files
Message-ID: <20050309120343.0c22eb0f.sfr@canb.auug.org.au>

Hi all,

I would just like to start a discussion about consolidating (some of) the
ppc and ppc64 header files. As a starting point (and I am not saying that
this is the right way to go) the following patch replaces (semantically)
equivalent ppc64 header files by just including the asm-ppc file.

We *could* use this method to make the journey incremental until there are
no nontrivial files left in asm-ppc64 ....

Diffstat looks like:

 asm-ppc/ipc.h         |    2
 asm-ppc64/ioctl.h     |   75 ---------------
 asm-ppc64/ioctls.h    |  115 ------------------------
 asm-ppc64/ipc.h       |   35 -------
 asm-ppc64/mman.h      |   53 -----------
 asm-ppc64/param.h     |   30 ------
 asm-ppc64/parport.h   |   19 ----
 asm-ppc64/poll.h      |   33 ------
 asm-ppc64/string.h    |   36 -------
 asm-ppc64/termbits.h  |  194 -----------------------------------------
 asm-ppc64/termios.h   |  236 --------------------------------------------------
 asm-ppc64/unaligned.h |   22 ----
 12 files changed, 12 insertions(+), 838 deletions(-)

-- 
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruNp linus/include/asm-ppc/ipc.h linus-headers/include/asm-ppc/ipc.h
--- linus/include/asm-ppc/ipc.h 2003-09-24 10:56:02.000000000 +1000
+++ linus-headers/include/asm-ppc/ipc.h 2005-03-09 11:54:36.000000000 +1100
@@ -4,7 +4,7 @@
 /*
  * These are used to wrap system calls on PowerPC.
  *
- * See arch/ppc/kernel/syscalls.c for ugly details..
+ * See arch/ppc{,64}/kernel/syscalls.c for ugly details..
  */
 struct ipc_kludge {
 	struct msgbuf __user *msgp;
diff -ruNp linus/include/asm-ppc64/ioctl.h linus-headers/include/asm-ppc64/ioctl.h
--- linus/include/asm-ppc64/ioctl.h 2003-12-31 09:39:13.000000000 +1100
+++ linus-headers/include/asm-ppc64/ioctl.h 2005-03-09 01:10:54.000000000 +1100
@@ -1,74 +1 @@
-#ifndef _PPC64_IOCTL_H
-#define _PPC64_IOCTL_H
-
-
-/*
- * This was copied from the alpha as it's a bit cleaner there.
- * -- Cort
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */ - -#define _IOC_NRBITS 8 -#define _IOC_TYPEBITS 8 -#define _IOC_SIZEBITS 13 -#define _IOC_DIRBITS 3 - -#define _IOC_NRMASK ((1 << _IOC_NRBITS)-1) -#define _IOC_TYPEMASK ((1 << _IOC_TYPEBITS)-1) -#define _IOC_SIZEMASK ((1 << _IOC_SIZEBITS)-1) -#define _IOC_DIRMASK ((1 << _IOC_DIRBITS)-1) - -#define _IOC_NRSHIFT 0 -#define _IOC_TYPESHIFT (_IOC_NRSHIFT+_IOC_NRBITS) -#define _IOC_SIZESHIFT (_IOC_TYPESHIFT+_IOC_TYPEBITS) -#define _IOC_DIRSHIFT (_IOC_SIZESHIFT+_IOC_SIZEBITS) - -/* - * Direction bits _IOC_NONE could be 0, but OSF/1 gives it a bit. - * And this turns out useful to catch old ioctl numbers in header - * files for us. - */ -#define _IOC_NONE 1U -#define _IOC_READ 2U -#define _IOC_WRITE 4U - -#define _IOC(dir,type,nr,size) \ - (((dir) << _IOC_DIRSHIFT) | \ - ((type) << _IOC_TYPESHIFT) | \ - ((nr) << _IOC_NRSHIFT) | \ - ((size) << _IOC_SIZESHIFT)) - -/* provoke compile error for invalid uses of size argument */ -extern unsigned int __invalid_size_argument_for_IOC; -#define _IOC_TYPECHECK(t) \ - ((sizeof(t) == sizeof(t[1]) && \ - sizeof(t) < (1 << _IOC_SIZEBITS)) ? \ - sizeof(t) : __invalid_size_argument_for_IOC) - -/* used to create numbers */ -#define _IO(type,nr) _IOC(_IOC_NONE,(type),(nr),0) -#define _IOR(type,nr,size) _IOC(_IOC_READ,(type),(nr),(_IOC_TYPECHECK(size))) -#define _IOW(type,nr,size) _IOC(_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size))) -#define _IOWR(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),(_IOC_TYPECHECK(size))) -#define _IOR_BAD(type,nr,size) _IOC(_IOC_READ,(type),(nr),sizeof(size)) -#define _IOW_BAD(type,nr,size) _IOC(_IOC_WRITE,(type),(nr),sizeof(size)) -#define _IOWR_BAD(type,nr,size) _IOC(_IOC_READ|_IOC_WRITE,(type),(nr),sizeof(size)) - -/* used to decode them.. 
*/ -#define _IOC_DIR(nr) (((nr) >> _IOC_DIRSHIFT) & _IOC_DIRMASK) -#define _IOC_TYPE(nr) (((nr) >> _IOC_TYPESHIFT) & _IOC_TYPEMASK) -#define _IOC_NR(nr) (((nr) >> _IOC_NRSHIFT) & _IOC_NRMASK) -#define _IOC_SIZE(nr) (((nr) >> _IOC_SIZESHIFT) & _IOC_SIZEMASK) - -/* various drivers, such as the pcmcia stuff, need these... */ -#define IOC_IN (_IOC_WRITE << _IOC_DIRSHIFT) -#define IOC_OUT (_IOC_READ << _IOC_DIRSHIFT) -#define IOC_INOUT ((_IOC_WRITE|_IOC_READ) << _IOC_DIRSHIFT) -#define IOCSIZE_MASK (_IOC_SIZEMASK << _IOC_SIZESHIFT) -#define IOCSIZE_SHIFT (_IOC_SIZESHIFT) - -#endif /* _PPC64_IOCTL_H */ +#include diff -ruNp linus/include/asm-ppc64/ioctls.h linus-headers/include/asm-ppc64/ioctls.h --- linus/include/asm-ppc64/ioctls.h 2003-04-03 08:55:29.000000000 +1000 +++ linus-headers/include/asm-ppc64/ioctls.h 2005-03-09 01:13:05.000000000 +1100 @@ -1,114 +1 @@ -#ifndef _ASM_PPC64_IOCTLS_H -#define _ASM_PPC64_IOCTLS_H - -/* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. 
- */ - -#include - -#define FIOCLEX _IO('f', 1) -#define FIONCLEX _IO('f', 2) -#define FIOASYNC _IOW('f', 125, int) -#define FIONBIO _IOW('f', 126, int) -#define FIONREAD _IOR('f', 127, int) -#define TIOCINQ FIONREAD -#define FIOQSIZE _IOR('f', 128, loff_t) - -#define TIOCGETP _IOR('t', 8, struct sgttyb) -#define TIOCSETP _IOW('t', 9, struct sgttyb) -#define TIOCSETN _IOW('t', 10, struct sgttyb) /* TIOCSETP wo flush */ - -#define TIOCSETC _IOW('t', 17, struct tchars) -#define TIOCGETC _IOR('t', 18, struct tchars) -#define TCGETS _IOR('t', 19, struct termios) -#define TCSETS _IOW('t', 20, struct termios) -#define TCSETSW _IOW('t', 21, struct termios) -#define TCSETSF _IOW('t', 22, struct termios) - -#define TCGETA _IOR('t', 23, struct termio) -#define TCSETA _IOW('t', 24, struct termio) -#define TCSETAW _IOW('t', 25, struct termio) -#define TCSETAF _IOW('t', 28, struct termio) - -#define TCSBRK _IO('t', 29) -#define TCXONC _IO('t', 30) -#define TCFLSH _IO('t', 31) - -#define TIOCSWINSZ _IOW('t', 103, struct winsize) -#define TIOCGWINSZ _IOR('t', 104, struct winsize) -#define TIOCSTART _IO('t', 110) /* start output, like ^Q */ -#define TIOCSTOP _IO('t', 111) /* stop output, like ^S */ -#define TIOCOUTQ _IOR('t', 115, int) /* output queue size */ - -#define TIOCGLTC _IOR('t', 116, struct ltchars) -#define TIOCSLTC _IOW('t', 117, struct ltchars) -#define TIOCSPGRP _IOW('t', 118, int) -#define TIOCGPGRP _IOR('t', 119, int) - -#define TIOCEXCL 0x540C -#define TIOCNXCL 0x540D -#define TIOCSCTTY 0x540E - -#define TIOCSTI 0x5412 -#define TIOCMGET 0x5415 -#define TIOCMBIS 0x5416 -#define TIOCMBIC 0x5417 -#define TIOCMSET 0x5418 -# define TIOCM_LE 0x001 -# define TIOCM_DTR 0x002 -# define TIOCM_RTS 0x004 -# define TIOCM_ST 0x008 -# define TIOCM_SR 0x010 -# define TIOCM_CTS 0x020 -# define TIOCM_CAR 0x040 -# define TIOCM_RNG 0x080 -# define TIOCM_DSR 0x100 -# define TIOCM_CD TIOCM_CAR -# define TIOCM_RI TIOCM_RNG - -#define TIOCGSOFTCAR 0x5419 -#define TIOCSSOFTCAR 0x541A 
-#define TIOCLINUX 0x541C -#define TIOCCONS 0x541D -#define TIOCGSERIAL 0x541E -#define TIOCSSERIAL 0x541F -#define TIOCPKT 0x5420 -# define TIOCPKT_DATA 0 -# define TIOCPKT_FLUSHREAD 1 -# define TIOCPKT_FLUSHWRITE 2 -# define TIOCPKT_STOP 4 -# define TIOCPKT_START 8 -# define TIOCPKT_NOSTOP 16 -# define TIOCPKT_DOSTOP 32 - - -#define TIOCNOTTY 0x5422 -#define TIOCSETD 0x5423 -#define TIOCGETD 0x5424 -#define TCSBRKP 0x5425 /* Needed for POSIX tcsendbreak() */ -#define TIOCSBRK 0x5427 /* BSD compatibility */ -#define TIOCCBRK 0x5428 /* BSD compatibility */ -#define TIOCGSID 0x5429 /* Return the session ID of FD */ -#define TIOCGPTN _IOR('T',0x30, unsigned int) /* Get Pty Number (of pty-mux device) */ -#define TIOCSPTLCK _IOW('T',0x31, int) /* Lock/unlock Pty */ - -#define TIOCSERCONFIG 0x5453 -#define TIOCSERGWILD 0x5454 -#define TIOCSERSWILD 0x5455 -#define TIOCGLCKTRMIOS 0x5456 -#define TIOCSLCKTRMIOS 0x5457 -#define TIOCSERGSTRUCT 0x5458 /* For debugging only */ -#define TIOCSERGETLSR 0x5459 /* Get line status register */ - /* ioctl (fd, TIOCSERGETLSR, &result) where result may be as below */ -# define TIOCSER_TEMT 0x01 /* Transmitter physically empty */ -#define TIOCSERGETMULTI 0x545A /* Get multiport config */ -#define TIOCSERSETMULTI 0x545B /* Set multiport config */ - -#define TIOCMIWAIT 0x545C /* wait for a change on serial input line(s) */ -#define TIOCGICOUNT 0x545D /* read serial port inline interrupt counts */ - -#endif /* _ASM_PPC64_IOCTLS_H */ +#include diff -ruNp linus/include/asm-ppc64/ipc.h linus-headers/include/asm-ppc64/ipc.h --- linus/include/asm-ppc64/ipc.h 2004-05-30 11:50:26.000000000 +1000 +++ linus-headers/include/asm-ppc64/ipc.h 2005-03-09 01:15:40.000000000 +1100 @@ -1,34 +1 @@ -#ifndef __PPC64_IPC_H__ -#define __PPC64_IPC_H__ - -/* - * These are used to wrap system calls on PowerPC. - * - * See arch/ppc64/kernel/syscalls.c for ugly details.. 
- * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ -struct ipc_kludge { - struct msgbuf __user *msgp; - long msgtyp; -}; - -#define SEMOP 1 -#define SEMGET 2 -#define SEMCTL 3 -#define SEMTIMEDOP 4 -#define MSGSND 11 -#define MSGRCV 12 -#define MSGGET 13 -#define MSGCTL 14 -#define SHMAT 21 -#define SHMDT 22 -#define SHMGET 23 -#define SHMCTL 24 - -#define IPCCALL(version,op) ((version)<<16 | (op)) - -#endif /* __PPC64_IPC_H__ */ +#include diff -ruNp linus/include/asm-ppc64/mman.h linus-headers/include/asm-ppc64/mman.h --- linus/include/asm-ppc64/mman.h 2003-09-26 07:54:24.000000000 +1000 +++ linus-headers/include/asm-ppc64/mman.h 2005-03-09 01:25:14.000000000 +1100 @@ -1,52 +1 @@ -#ifndef __PPC64_MMAN_H__ -#define __PPC64_MMAN_H__ - -/* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. 
- */ - -#define PROT_READ 0x1 /* page can be read */ -#define PROT_WRITE 0x2 /* page can be written */ -#define PROT_EXEC 0x4 /* page can be executed */ -#define PROT_SEM 0x8 /* page may be used for atomic ops */ -#define PROT_NONE 0x0 /* page can not be accessed */ -#define PROT_GROWSDOWN 0x01000000 /* mprotect flag: extend change to start of growsdown vma */ -#define PROT_GROWSUP 0x02000000 /* mprotect flag: extend change to end of growsup vma */ - -#define MAP_SHARED 0x01 /* Share changes */ -#define MAP_PRIVATE 0x02 /* Changes are private */ -#define MAP_TYPE 0x0f /* Mask for type of mapping */ -#define MAP_FIXED 0x10 /* Interpret addr exactly */ -#define MAP_ANONYMOUS 0x20 /* don't use a file */ -#define MAP_RENAME MAP_ANONYMOUS /* In SunOS terminology */ -#define MAP_NORESERVE 0x40 /* don't reserve swap pages */ -#define MAP_LOCKED 0x80 - -#define MAP_GROWSDOWN 0x0100 /* stack-like segment */ -#define MAP_DENYWRITE 0x0800 /* ETXTBSY */ -#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */ - -#define MS_ASYNC 1 /* sync memory asynchronously */ -#define MS_INVALIDATE 2 /* invalidate the caches */ -#define MS_SYNC 4 /* synchronous memory sync */ - -#define MCL_CURRENT 0x2000 /* lock all currently mapped pages */ -#define MCL_FUTURE 0x4000 /* lock all additions to address space */ - -#define MAP_POPULATE 0x8000 /* populate (prefault) pagetables */ -#define MAP_NONBLOCK 0x10000 /* do not block on IO */ - -#define MADV_NORMAL 0x0 /* default page-in behavior */ -#define MADV_RANDOM 0x1 /* page-in minimum required */ -#define MADV_SEQUENTIAL 0x2 /* read-ahead aggressively */ -#define MADV_WILLNEED 0x3 /* pre-fault pages */ -#define MADV_DONTNEED 0x4 /* discard these pages */ - -/* compatibility flags */ -#define MAP_ANON MAP_ANONYMOUS -#define MAP_FILE 0 - -#endif /* __PPC64_MMAN_H__ */ +#include diff -ruNp linus/include/asm-ppc64/param.h linus-headers/include/asm-ppc64/param.h --- linus/include/asm-ppc64/param.h 2004-02-23 12:05:19.000000000 +1100 +++ 
linus-headers/include/asm-ppc64/param.h 2005-03-09 01:38:01.000000000 +1100 @@ -1,29 +1 @@ -#ifndef _ASM_PPC64_PARAM_H -#define _ASM_PPC64_PARAM_H - -/* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#ifdef __KERNEL__ -# define HZ 1000 /* Internal kernel timer frequency */ -# define USER_HZ 100 /* .. some user interfaces are in "ticks" */ -# define CLOCKS_PER_SEC (USER_HZ) /* like times() */ -#endif - -#ifndef HZ -#define HZ 100 -#endif - -#define EXEC_PAGESIZE 4096 - -#ifndef NOGROUP -#define NOGROUP (-1) -#endif - -#define MAXHOSTNAMELEN 64 /* max length of hostname */ - -#endif /* _ASM_PPC64_PARAM_H */ +#include diff -ruNp linus/include/asm-ppc64/parport.h linus-headers/include/asm-ppc64/parport.h --- linus/include/asm-ppc64/parport.h 2002-02-14 23:14:36.000000000 +1100 +++ linus-headers/include/asm-ppc64/parport.h 2005-03-09 01:40:11.000000000 +1100 @@ -1,18 +1 @@ -/* - * parport.h: platform-specific PC-style parport initialisation - * - * Copyright (C) 1999, 2000 Tim Waugh - * - * This file should only be included by drivers/parport/parport_pc.c. 
- */ - -#ifndef _ASM_PPC64_PARPORT_H -#define _ASM_PPC64_PARPORT_H - -static int __devinit parport_pc_find_isa_ports (int autoirq, int autodma); -static int __devinit parport_pc_find_nonpci_ports (int autoirq, int autodma) -{ - return parport_pc_find_isa_ports (autoirq, autodma); -} - -#endif /* !(_ASM_PPC_PARPORT_H) */ +#include diff -ruNp linus/include/asm-ppc64/poll.h linus-headers/include/asm-ppc64/poll.h --- linus/include/asm-ppc64/poll.h 2002-11-01 05:18:30.000000000 +1100 +++ linus-headers/include/asm-ppc64/poll.h 2005-03-09 01:45:19.000000000 +1100 @@ -1,32 +1 @@ -#ifndef __PPC64_POLL_H -#define __PPC64_POLL_H - -/* - * Copyright (C) 2001 PPC64 Team, IBM Corp - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#define POLLIN 0x0001 -#define POLLPRI 0x0002 -#define POLLOUT 0x0004 -#define POLLERR 0x0008 -#define POLLHUP 0x0010 -#define POLLNVAL 0x0020 -#define POLLRDNORM 0x0040 -#define POLLRDBAND 0x0080 -#define POLLWRNORM 0x0100 -#define POLLWRBAND 0x0200 -#define POLLMSG 0x0400 -#define POLLREMOVE 0x1000 - -struct pollfd { - int fd; - short events; - short revents; -}; - -#endif /* __PPC64_POLL_H */ +#include diff -ruNp linus/include/asm-ppc64/string.h linus-headers/include/asm-ppc64/string.h --- linus/include/asm-ppc64/string.h 2005-01-29 06:05:47.000000000 +1100 +++ linus-headers/include/asm-ppc64/string.h 2005-03-09 02:01:45.000000000 +1100 @@ -1,35 +1 @@ -#ifndef _PPC64_STRING_H_ -#define _PPC64_STRING_H_ - -/* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. 
- */ - -#define __HAVE_ARCH_STRCPY -#define __HAVE_ARCH_STRNCPY -#define __HAVE_ARCH_STRLEN -#define __HAVE_ARCH_STRCMP -#define __HAVE_ARCH_STRCAT -#define __HAVE_ARCH_MEMSET -#define __HAVE_ARCH_MEMCPY -#define __HAVE_ARCH_MEMMOVE -#define __HAVE_ARCH_MEMCMP -#define __HAVE_ARCH_MEMCHR - -extern int strcasecmp(const char *, const char *); -extern int strncasecmp(const char *, const char *, int); -extern char * strcpy(char *,const char *); -extern char * strncpy(char *,const char *, __kernel_size_t); -extern __kernel_size_t strlen(const char *); -extern int strcmp(const char *,const char *); -extern char * strcat(char *, const char *); -extern void * memset(void *,int,__kernel_size_t); -extern void * memcpy(void *,const void *,__kernel_size_t); -extern void * memmove(void *,const void *,__kernel_size_t); -extern int memcmp(const void *,const void *,__kernel_size_t); -extern void * memchr(const void *,int,__kernel_size_t); - -#endif /* _PPC64_STRING_H_ */ +#include diff -ruNp linus/include/asm-ppc64/termbits.h linus-headers/include/asm-ppc64/termbits.h --- linus/include/asm-ppc64/termbits.h 2004-05-11 07:53:05.000000000 +1000 +++ linus-headers/include/asm-ppc64/termbits.h 2005-03-09 02:04:35.000000000 +1100 @@ -1,193 +1 @@ -#ifndef _PPC64_TERMBITS_H -#define _PPC64_TERMBITS_H - -/* - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#include - -typedef unsigned char cc_t; -typedef unsigned int speed_t; -typedef unsigned int tcflag_t; - -/* - * termios type and macro definitions. Be careful about adding stuff - * to this file since it's used in GNU libc and there are strict rules - * concerning namespace pollution. 
- */ - -#define NCCS 19 -struct termios { - tcflag_t c_iflag; /* input mode flags */ - tcflag_t c_oflag; /* output mode flags */ - tcflag_t c_cflag; /* control mode flags */ - tcflag_t c_lflag; /* local mode flags */ - cc_t c_cc[NCCS]; /* control characters */ - cc_t c_line; /* line discipline (== c_cc[19]) */ - speed_t c_ispeed; /* input speed */ - speed_t c_ospeed; /* output speed */ -}; - -/* c_cc characters */ -#define VINTR 0 -#define VQUIT 1 -#define VERASE 2 -#define VKILL 3 -#define VEOF 4 -#define VMIN 5 -#define VEOL 6 -#define VTIME 7 -#define VEOL2 8 -#define VSWTC 9 -#define VWERASE 10 -#define VREPRINT 11 -#define VSUSP 12 -#define VSTART 13 -#define VSTOP 14 -#define VLNEXT 15 -#define VDISCARD 16 - -/* c_iflag bits */ -#define IGNBRK 0000001 -#define BRKINT 0000002 -#define IGNPAR 0000004 -#define PARMRK 0000010 -#define INPCK 0000020 -#define ISTRIP 0000040 -#define INLCR 0000100 -#define IGNCR 0000200 -#define ICRNL 0000400 -#define IXON 0001000 -#define IXOFF 0002000 -#define IXANY 0004000 -#define IUCLC 0010000 -#define IMAXBEL 0020000 -#define IUTF8 0040000 - -/* c_oflag bits */ -#define OPOST 0000001 -#define ONLCR 0000002 -#define OLCUC 0000004 - -#define OCRNL 0000010 -#define ONOCR 0000020 -#define ONLRET 0000040 - -#define OFILL 00000100 -#define OFDEL 00000200 -#define NLDLY 00001400 -#define NL0 00000000 -#define NL1 00000400 -#define NL2 00001000 -#define NL3 00001400 -#define TABDLY 00006000 -#define TAB0 00000000 -#define TAB1 00002000 -#define TAB2 00004000 -#define TAB3 00006000 -#define XTABS 00006000 /* required by POSIX to == TAB3 */ -#define CRDLY 00030000 -#define CR0 00000000 -#define CR1 00010000 -#define CR2 00020000 -#define CR3 00030000 -#define FFDLY 00040000 -#define FF0 00000000 -#define FF1 00040000 -#define BSDLY 00100000 -#define BS0 00000000 -#define BS1 00100000 -#define VTDLY 00200000 -#define VT0 00000000 -#define VT1 00200000 - -/* c_cflag bit meaning */ -#define CBAUD 0000377 -#define B0 0000000 /* hang up */ 
-#define B50 0000001 -#define B75 0000002 -#define B110 0000003 -#define B134 0000004 -#define B150 0000005 -#define B200 0000006 -#define B300 0000007 -#define B600 0000010 -#define B1200 0000011 -#define B1800 0000012 -#define B2400 0000013 -#define B4800 0000014 -#define B9600 0000015 -#define B19200 0000016 -#define B38400 0000017 -#define EXTA B19200 -#define EXTB B38400 -#define CBAUDEX 0000000 -#define B57600 00020 -#define B115200 00021 -#define B230400 00022 -#define B460800 00023 -#define B500000 00024 -#define B576000 00025 -#define B921600 00026 -#define B1000000 00027 -#define B1152000 00030 -#define B1500000 00031 -#define B2000000 00032 -#define B2500000 00033 -#define B3000000 00034 -#define B3500000 00035 -#define B4000000 00036 - -#define CSIZE 00001400 -#define CS5 00000000 -#define CS6 00000400 -#define CS7 00001000 -#define CS8 00001400 - -#define CSTOPB 00002000 -#define CREAD 00004000 -#define PARENB 00010000 -#define PARODD 00020000 -#define HUPCL 00040000 - -#define CLOCAL 00100000 -#define CRTSCTS 020000000000 /* flow control */ - -/* c_lflag bits */ -#define ISIG 0x00000080 -#define ICANON 0x00000100 -#define XCASE 0x00004000 -#define ECHO 0x00000008 -#define ECHOE 0x00000002 -#define ECHOK 0x00000004 -#define ECHONL 0x00000010 -#define NOFLSH 0x80000000 -#define TOSTOP 0x00400000 -#define ECHOCTL 0x00000040 -#define ECHOPRT 0x00000020 -#define ECHOKE 0x00000001 -#define FLUSHO 0x00800000 -#define PENDIN 0x20000000 -#define IEXTEN 0x00000400 - -/* Values for the ACTION argument to `tcflow'. */ -#define TCOOFF 0 -#define TCOON 1 -#define TCIOFF 2 -#define TCION 3 - -/* Values for the QUEUE_SELECTOR argument to `tcflush'. */ -#define TCIFLUSH 0 -#define TCOFLUSH 1 -#define TCIOFLUSH 2 - -/* Values for the OPTIONAL_ACTIONS argument to `tcsetattr'. 
*/ -#define TCSANOW 0 -#define TCSADRAIN 1 -#define TCSAFLUSH 2 - -#endif /* _PPC64_TERMBITS_H */ +#include diff -ruNp linus/include/asm-ppc64/termios.h linus-headers/include/asm-ppc64/termios.h --- linus/include/asm-ppc64/termios.h 2003-04-03 08:55:29.000000000 +1000 +++ linus-headers/include/asm-ppc64/termios.h 2005-03-09 02:13:20.000000000 +1100 @@ -1,235 +1 @@ -#ifndef _PPC64_TERMIOS_H -#define _PPC64_TERMIOS_H - -/* - * Liberally adapted from alpha/termios.h. In particular, the c_cc[] - * fields have been reordered so that termio & termios share the - * common subset in the same order (for brain dead programs that don't - * know or care about the differences). - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#include -#include - -struct sgttyb { - char sg_ispeed; - char sg_ospeed; - char sg_erase; - char sg_kill; - short sg_flags; -}; - -struct tchars { - char t_intrc; - char t_quitc; - char t_startc; - char t_stopc; - char t_eofc; - char t_brkc; -}; - -struct ltchars { - char t_suspc; - char t_dsuspc; - char t_rprntc; - char t_flushc; - char t_werasc; - char t_lnextc; -}; - -struct winsize { - unsigned short ws_row; - unsigned short ws_col; - unsigned short ws_xpixel; - unsigned short ws_ypixel; -}; - -#define NCC 10 -struct termio { - unsigned short c_iflag; /* input mode flags */ - unsigned short c_oflag; /* output mode flags */ - unsigned short c_cflag; /* control mode flags */ - unsigned short c_lflag; /* local mode flags */ - unsigned char c_line; /* line discipline */ - unsigned char c_cc[NCC]; /* control characters */ -}; - -/* c_cc characters */ -#define _VINTR 0 -#define _VQUIT 1 -#define _VERASE 2 -#define _VKILL 3 -#define _VEOF 4 -#define _VMIN 5 -#define _VEOL 6 -#define _VTIME 7 -#define _VEOL2 8 -#define _VSWTC 9 - -/* line 
disciplines */ -#define N_TTY 0 -#define N_SLIP 1 -#define N_MOUSE 2 -#define N_PPP 3 -#define N_STRIP 4 -#define N_AX25 5 -#define N_X25 6 /* X.25 async */ -#define N_6PACK 7 -#define N_MASC 8 /* Reserved for Mobitex module */ -#define N_R3964 9 /* Reserved for Simatic R3964 module */ -#define N_PROFIBUS_FDL 10 /* Reserved for Profibus */ -#define N_IRDA 11 /* Linux IrDa - http://www.cs.uit.no/~dagb/irda/irda.html */ -#define N_SMSBLOCK 12 /* SMS block mode - for talking to GSM data cards about SMS messages */ -#define N_HDLC 13 /* synchronous HDLC */ -#define N_SYNC_PPP 14 - -#ifdef __KERNEL__ -/* ^C ^\ del ^U ^D 1 0 0 0 0 ^W ^R ^Z ^Q ^S ^V ^U */ -#define INIT_C_CC "\003\034\177\025\004\001\000\000\000\000\027\022\032\021\023\026\025" -#endif - -#define FIOCLEX _IO('f', 1) -#define FIONCLEX _IO('f', 2) -#define FIOASYNC _IOW('f', 125, int) -#define FIONBIO _IOW('f', 126, int) -#define FIONREAD _IOR('f', 127, int) -#define TIOCINQ FIONREAD - -#define TIOCGETP _IOR('t', 8, struct sgttyb) -#define TIOCSETP _IOW('t', 9, struct sgttyb) -#define TIOCSETN _IOW('t', 10, struct sgttyb) /* TIOCSETP wo flush */ - -#define TIOCSETC _IOW('t', 17, struct tchars) -#define TIOCGETC _IOR('t', 18, struct tchars) -#define TCGETS _IOR('t', 19, struct termios) -#define TCSETS _IOW('t', 20, struct termios) -#define TCSETSW _IOW('t', 21, struct termios) -#define TCSETSF _IOW('t', 22, struct termios) - -#define TCGETA _IOR('t', 23, struct termio) -#define TCSETA _IOW('t', 24, struct termio) -#define TCSETAW _IOW('t', 25, struct termio) -#define TCSETAF _IOW('t', 28, struct termio) - -#define TCSBRK _IO('t', 29) -#define TCXONC _IO('t', 30) -#define TCFLSH _IO('t', 31) - -#define TIOCSWINSZ _IOW('t', 103, struct winsize) -#define TIOCGWINSZ _IOR('t', 104, struct winsize) -#define TIOCSTART _IO('t', 110) /* start output, like ^Q */ -#define TIOCSTOP _IO('t', 111) /* stop output, like ^S */ -#define TIOCOUTQ _IOR('t', 115, int) /* output queue size */ - -#define TIOCGLTC _IOR('t', 116, 
struct ltchars) -#define TIOCSLTC _IOW('t', 117, struct ltchars) -#define TIOCSPGRP _IOW('t', 118, int) -#define TIOCGPGRP _IOR('t', 119, int) - -#define TIOCEXCL 0x540C -#define TIOCNXCL 0x540D -#define TIOCSCTTY 0x540E - -#define TIOCSTI 0x5412 -#define TIOCMGET 0x5415 -#define TIOCMBIS 0x5416 -#define TIOCMBIC 0x5417 -#define TIOCMSET 0x5418 -#define TIOCGSOFTCAR 0x5419 -#define TIOCSSOFTCAR 0x541A -#define TIOCLINUX 0x541C -#define TIOCCONS 0x541D -#define TIOCGSERIAL 0x541E -#define TIOCSSERIAL 0x541F -#define TIOCPKT 0x5420 - -#define TIOCNOTTY 0x5422 -#define TIOCSETD 0x5423 -#define TIOCGETD 0x5424 -#define TCSBRKP 0x5425 /* Needed for POSIX tcsendbreak() */ - -#define TIOCSERCONFIG 0x5453 -#define TIOCSERGWILD 0x5454 -#define TIOCSERSWILD 0x5455 -#define TIOCGLCKTRMIOS 0x5456 -#define TIOCSLCKTRMIOS 0x5457 -#define TIOCSERGSTRUCT 0x5458 /* For debugging only */ -#define TIOCSERGETLSR 0x5459 /* Get line status register */ -#define TIOCSERGETMULTI 0x545A /* Get multiport config */ -#define TIOCSERSETMULTI 0x545B /* Set multiport config */ - -#define TIOCMIWAIT 0x545C /* wait for a change on serial input line(s) */ -#define TIOCGICOUNT 0x545D /* read serial port inline interrupt counts */ - -/* Used for packet mode */ -#define TIOCPKT_DATA 0 -#define TIOCPKT_FLUSHREAD 1 -#define TIOCPKT_FLUSHWRITE 2 -#define TIOCPKT_STOP 4 -#define TIOCPKT_START 8 -#define TIOCPKT_NOSTOP 16 -#define TIOCPKT_DOSTOP 32 - -/* modem lines */ -#define TIOCM_LE 0x001 -#define TIOCM_DTR 0x002 -#define TIOCM_RTS 0x004 -#define TIOCM_ST 0x008 -#define TIOCM_SR 0x010 -#define TIOCM_CTS 0x020 -#define TIOCM_CAR 0x040 -#define TIOCM_RNG 0x080 -#define TIOCM_DSR 0x100 -#define TIOCM_CD TIOCM_CAR -#define TIOCM_RI TIOCM_RNG -#define TIOCM_OUT1 0x2000 -#define TIOCM_OUT2 0x4000 -#define TIOCM_LOOP 0x8000 - -/* ioctl (fd, TIOCSERGETLSR, &result) where result may be as below */ -#define TIOCSER_TEMT 0x01 /* Transmitter physically empty */ - -#ifdef __KERNEL__ - -/* - * Translate a "termio" 
structure into a "termios". Ugh. - */ -#define SET_LOW_TERMIOS_BITS(termios, termio, x) { \ - unsigned short __tmp; \ - get_user(__tmp,&(termio)->x); \ - (termios)->x = (0xffff0000 & (termios)->x) | __tmp; \ -} - -#define user_termio_to_kernel_termios(termios, termio) \ -({ \ - SET_LOW_TERMIOS_BITS(termios, termio, c_iflag); \ - SET_LOW_TERMIOS_BITS(termios, termio, c_oflag); \ - SET_LOW_TERMIOS_BITS(termios, termio, c_cflag); \ - SET_LOW_TERMIOS_BITS(termios, termio, c_lflag); \ - copy_from_user((termios)->c_cc, (termio)->c_cc, NCC); \ -}) - -/* - * Translate a "termios" structure into a "termio". Ugh. - */ -#define kernel_termios_to_user_termio(termio, termios) \ -({ \ - put_user((termios)->c_iflag, &(termio)->c_iflag); \ - put_user((termios)->c_oflag, &(termio)->c_oflag); \ - put_user((termios)->c_cflag, &(termio)->c_cflag); \ - put_user((termios)->c_lflag, &(termio)->c_lflag); \ - put_user((termios)->c_line, &(termio)->c_line); \ - copy_to_user((termio)->c_cc, (termios)->c_cc, NCC); \ -}) - -#define user_termios_to_kernel_termios(k, u) copy_from_user(k, u, sizeof(struct termios)) -#define kernel_termios_to_user_termios(u, k) copy_to_user(u, k, sizeof(struct termios)) - -#endif /* __KERNEL__ */ - -#endif /* _PPC64_TERMIOS_H */ +#include diff -ruNp linus/include/asm-ppc64/unaligned.h linus-headers/include/asm-ppc64/unaligned.h --- linus/include/asm-ppc64/unaligned.h 2002-02-14 23:14:36.000000000 +1100 +++ linus-headers/include/asm-ppc64/unaligned.h 2005-03-09 02:16:30.000000000 +1100 @@ -1,21 +1 @@ -#ifndef __PPC64_UNALIGNED_H -#define __PPC64_UNALIGNED_H - -/* - * The PowerPC can do unaligned accesses itself in big endian mode. - * - * The strange macros are there to make sure these can't - * be misused in a way that makes them not work on other - * architectures where unaligned accesses aren't as simple. 
- * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#define get_unaligned(ptr) (*(ptr)) - -#define put_unaligned(val, ptr) ((void)( *(ptr) = (val) )) - -#endif /* __PPC64_UNALIGNED_H */ +#include From amodra at bigpond.net.au Wed Mar 9 12:19:29 2005 From: amodra at bigpond.net.au (Alan Modra) Date: Wed, 9 Mar 2005 11:49:29 +1030 Subject: [RFC][PATCH] combining header files In-Reply-To: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> References: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> Message-ID: <20050309011929.GI15642@bubble.modra.org> On Wed, Mar 09, 2005 at 12:03:43PM +1100, Stephen Rothwell wrote: > I would just like to start a discussion about consolidating (some of) the > ppc and ppc64 header files. Marvellous! In case it isn't completely obvious, you can often share structure definitions between ppc32 and ppc64 by judicious selection of types. eg. 
struct stays_the_same { long long some_64bit_var; int some_32bit_var; } struct bigger_in_64bit { long var_sized_by_arch; } -- Alan Modra IBM OzLabs - Linux Technology Centre From benh at kernel.crashing.org Wed Mar 9 14:02:01 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 09 Mar 2005 14:02:01 +1100 Subject: [PATCH 2/2] No-exec support for ppc64 In-Reply-To: <20050308171326.3d72363a.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308171326.3d72363a.moilanen@austin.ibm.com> Message-ID: <1110337321.32556.26.camel@gaston> On Tue, 2005-03-08 at 17:13 -0600, Jake Moilanen wrote: > diff -puN arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 arch/ppc64/kernel/iSeries_setup.c > --- linux-2.6-bk/arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 > +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c 2005-03-08 16:08:57 -06:00 > @@ -624,6 +624,7 @@ static void __init iSeries_bolt_kernel(u > { > unsigned long pa; > unsigned long mode_rw = _PAGE_ACCESSED | _PAGE_COHERENT | PP_RWXX; > + unsigned long tmp_mode; > HPTE hpte; > > for (pa = saddr; pa < eaddr ;pa += PAGE_SIZE) { > @@ -632,6 +633,12 @@ static void __init iSeries_bolt_kernel(u > unsigned long va = (vsid << 28) | (pa & 0xfffffff); > unsigned long vpn = va >> PAGE_SHIFT; > unsigned long slot = HvCallHpt_findValid(&hpte, vpn); > + > + tmp_mode = mode_rw; > + > + /* Make non-kernel text non-executable */ > + if (!is_kernel_text(ea)) > + tmp_mode = mode_rw | HW_NO_EXEC; > > if (hpte.dw0.dw0.v) { > /* HPTE exists, so just bolt it */ tmp_mode doesn't seem to be ever used here ... 
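Editorial sketch of the point above (not part of the patch): the per-page mode only matters if the computed value actually reaches the call that installs the mapping. The flag values and helper names below are hypothetical stand-ins, not the kernel's `_PAGE_*`, `PP_RWXX` or `HW_NO_EXEC` definitions, and `in_text()` merely models `is_kernel_text()`.

```c
/* Hypothetical flag values, for illustration only. */
#define MODE_RW      0x1UL
#define MODE_NO_EXEC 0x4UL

/* Stand-in for is_kernel_text(): treats [start, end) as kernel text. */
static int in_text(unsigned long ea, unsigned long start, unsigned long end)
{
	return ea >= start && ea < end;
}

/*
 * Pick the protection bits for one page. Returning the value, rather
 * than leaving it in a local, makes it hard to forget to pass it on
 * to whatever creates the mapping.
 */
static unsigned long page_mode(unsigned long ea, unsigned long start,
			       unsigned long end)
{
	unsigned long mode = MODE_RW;

	/* Everything outside kernel text is mapped non-executable. */
	if (!in_text(ea, start, end))
		mode |= MODE_NO_EXEC;

	return mode;
}
```

In the quoted hunk, the equivalent fix would presumably be to hand tmp_mode (rather than mode_rw) to the call that creates the HPTE, but that call is not shown in the quote, so this remains an assumption.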
> /* Free memory returned from module_alloc */ > diff -puN arch/ppc64/mm/fault.c~nx-kernel-ppc64 arch/ppc64/mm/fault.c > --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 > +++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-08 16:08:57 -06:00 > @@ -76,6 +76,21 @@ static int store_updates_sp(struct pt_re > return 0; > } > > +pte_t *lookup_address(unsigned long address) > +{ > + pgd_t *pgd = pgd_offset_k(address); > + pmd_t *pmd; > + > + if (pgd_none(*pgd)) > + return NULL; > + > + pmd = pmd_offset(pgd, address); > + if (pmd_none(*pmd)) > + return NULL; > + > + return pte_offset_kernel(pmd, address); > +} Use find_linux_pte() here (asm-ppc64/pgtable.h). It will return NULL of the PTE is not present too, so no need to dbl check that. That way, I won't have to fix your copy of the function when I get the proper 4L headers patch in ;) > /* > * The error_code parameter is > * - DSISR for a non-SLB data access fault, > @@ -94,6 +109,7 @@ int do_page_fault(struct pt_regs *regs, > unsigned long is_write = error_code & 0x02000000; > unsigned long trap = TRAP(regs); > unsigned long is_exec = trap == 0x400; > + pte_t *ptep; > > BUG_ON((trap == 0x380) || (trap == 0x480)); > > @@ -253,6 +269,15 @@ bad_area_nosemaphore: > info.si_addr = (void __user *) address; > force_sig_info(SIGSEGV, &info, current); > return 0; > + } > + > + ptep = lookup_address(address); > + > + if (ptep && pte_present(*ptep) && !pte_exec(*ptep)) { > + if (printk_ratelimit()) > + printk(KERN_CRIT "kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n", current->uid); > + show_stack(current, (unsigned long *)__get_SP()); > + do_exit(SIGKILL); > } Can you try to limit to 80 columns ? 
(I know, I'm not the best for that neither, but I'm trying to cure myself here, I promise my next rewrite of radeonfb will be fully 80-columns safe :) From flar at allandria.com Wed Mar 9 17:34:44 2005 From: flar at allandria.com (Brad Boyer) Date: Tue, 8 Mar 2005 22:34:44 -0800 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: <1110324608.32556.1.camel@gaston> References: <20050302181206.GA2741@us.ibm.com> <1109806756.5680.127.camel@gaston> <20050308125635.GA19169@suse.de> <1110320815.13593.279.camel@gaston> <20050308231046.GA19175@pants.nu> <1110324608.32556.1.camel@gaston> Message-ID: <20050309063443.GB20610@pants.nu> On Wed, Mar 09, 2005 at 10:30:08AM +1100, Benjamin Herrenschmidt wrote: > If we go that way, then we should finally bite the bullet and get ADB in > the device model (define an adb bus_type, with proper drivers etc...). > That would also allow a clean mecanism (sysfs properties) for things > like trackpad settings, etc... I agree. That's one of the reasons I haven't done it yet. I did start on it, but other stuff took higher priority. Would people want to still be able to use the older code? I also intend to make a new drivers/adb directory, since it won't really be Mac specific at that point. That would include the main bus_type definition, as well as stuff like adbhid, but not via-cuda, via-pmu, and so on. I won't be working on it until I get a couple other things done, but it doesn't sound like anyone else will get to it first. Brad Boyer flar at allandria.com From wangzyu at cn.ibm.com Wed Mar 9 17:40:35 2005 From: wangzyu at cn.ibm.com (Zhao Yu Wang) Date: Wed, 9 Mar 2005 14:40:35 +0800 Subject: A question about dlpar add cpu failed Message-ID: Hi, I meet a failed while trying to add cpu to a partition dynamicly. I am not sure what's wrong. Thanks. 1. 
about the shared_proc_pool: hscroot at hmc6lte:~> lshwres -r proc -m fsp-pear --level pool shared_proc_pool_id=0,configurable_pool_proc_units=null,curr_avail_pool_proc_units=2.0,pend_avail_pool_proc_units=null 2. partition "pearlp3 RH"'s config: hscroot at hmc6lte:~> lshwres -r proc -m fsp-pear --level lpar --filter "lpar_names=pearlp3 RH" lpar_name=pearlp3 RH,lpar_id=3,curr_shared_proc_pool_id=0,curr_proc_mode=shared,curr_min_proc_units=0.1,curr_proc_units=0.3,curr_max_proc_units=1.0,curr_min_procs=1,curr_procs=3,curr_max_procs=10,curr_sharing_mode=uncap,curr_uncap_weight=128,pend_shared_proc_pool_id=0,pend_proc_mode=shared,pend_min_proc_units=0.1,pend_proc_units=0.3,pend_max_proc_units=1.0,pend_min_procs=1,pend_procs=3,pend_max_procs=10,pend_sharing_mode=uncap,pend_uncap_weight=128,run_proc_units=0.3,run_procs=3,run_uncap_weight=128 3. The operate and result A add 1 procs: hscroot at hmc6lte:~> chhwres -m fsp-pear -d 5 -r proc -o a -p "pearlp3 RH" --procs 1 HSCL145F Attempted to allocate processing units less than the minimum capacity allowed with the specified virtual processor setting. B add 10 procs: hscroot at hmc6lte:~> chhwres -m fsp-pear -d 5 -r proc -o a -p "pearlp3 RH" --procs 10 Your request exceeds the profile's maximum virtual processor limit. You can add or move up to 7 virtual processors. Please retry the operation. C add 2 procs: hscroot at hmc6lte:~> chhwres -m fsp-pear -d 5 -r proc -o a -p "pearlp3 RH" --procs 2 HSCL145F Attempted to allocate processing units less than the minimum capacity allowed with the specified virtual processor setting. Thanks & Best regards, -------------------------------------------- Wang Zhaoyu Email: wangzyu at cn.ibm.com Notes: Zhao Yu Wang/China/Contr/IBM at IBMCN
From benh at kernel.crashing.org Wed Mar 9 18:19:50 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 09 Mar 2005 18:19:50 +1100 Subject: eeh.h compile warnings / adbhid.c build failure In-Reply-To: <20050309063443.GB20610@pants.nu> References: <20050302181206.GA2741@us.ibm.com> <1109806756.5680.127.camel@gaston> <20050308125635.GA19169@suse.de> <1110320815.13593.279.camel@gaston> <20050308231046.GA19175@pants.nu> <1110324608.32556.1.camel@gaston> <20050309063443.GB20610@pants.nu> Message-ID: <1110352790.32557.63.camel@gaston> On Tue, 2005-03-08 at 22:34 -0800, Brad Boyer wrote: > I won't be working on it until I get a couple other things done, but > it doesn't sound like anyone else will get to it first. Let me know when you start, I may beat you to it if I get bored one of these week-ends :) Ben. From geert at linux-m68k.org Wed Mar 9 20:40:13 2005 From: geert at linux-m68k.org (Geert Uytterhoeven) Date: Wed, 9 Mar 2005 10:40:13 +0100 (CET) Subject: [RFC][PATCH] combining header files In-Reply-To: <20050309011929.GI15642@bubble.modra.org> References: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> <20050309011929.GI15642@bubble.modra.org> Message-ID: On Wed, 9 Mar 2005, Alan Modra wrote: > On Wed, Mar 09, 2005 at 12:03:43PM +1100, Stephen Rothwell wrote: > > I would just like to start a discussion about consolidating (some of) the > > ppc and ppc64 header files. > > Marvellous! In case it isn't completely obvious, you can often share > structure definitions between ppc32 and ppc64 by judicious selection of > types. eg. > > struct stays_the_same { > long long some_64bit_var; > int some_32bit_var; > } If size matters, why not use an explicitly sized type like s64 to make it explicit?
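Editorial sketch of the suggestion above, under stated assumptions: in a user-space build the `<stdint.h>` types `int64_t` and `int32_t` stand in for the kernel's `s64` and `s32`, and `fixed_widths_ok()` is a hypothetical helper added only so the widths can be checked. The struct and member names are the ones from the quoted example.

```c
#include <stdint.h>

/* Every member has a fixed width, so 32-bit and 64-bit builds agree on it. */
struct stays_the_same {
	int64_t some_64bit_var;	/* would be s64 in kernel code */
	int32_t some_32bit_var;	/* would be s32 in kernel code */
};

/* 'long' follows the ABI: 4 bytes on ppc32, 8 bytes on ppc64. */
struct bigger_in_64bit {
	long var_sized_by_arch;
};

/* Returns nonzero when the fixed-width members have the expected sizes. */
static int fixed_widths_ok(void)
{
	struct stays_the_same s;

	return sizeof(s.some_64bit_var) == 8 && sizeof(s.some_32bit_var) == 4;
}
```

The explicit names document the intent at the point of declaration, whereas `long long` and `int` only happen to have those widths on the ABIs in question.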
Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert at linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds From amodra at bigpond.net.au Wed Mar 9 23:06:01 2005 From: amodra at bigpond.net.au (Alan Modra) Date: Wed, 9 Mar 2005 22:36:01 +1030 Subject: [RFC][PATCH] combining header files In-Reply-To: References: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> <20050309011929.GI15642@bubble.modra.org> Message-ID: <20050309120601.GN15642@bubble.modra.org> On Wed, Mar 09, 2005 at 10:40:13AM +0100, Geert Uytterhoeven wrote: > On Wed, 9 Mar 2005, Alan Modra wrote: > > On Wed, Mar 09, 2005 at 12:03:43PM +1100, Stephen Rothwell wrote: > > > I would just like to start a discussion about consolidating (some of) the > > > ppc and ppc64 header files. > > > > Marvellous! In case it isn't completely obvious, you can often share > > structure definitions between ppc32 and ppc64 by judicious selection of > > types. eg. > > > > struct stays_the_same { > > long long some_64bit_var; > > int some_32bit_var; > > } > > If size matters, why not use an explicitly sized type like s64 to make it > explicit? Sure, that's even better. -- Alan Modra IBM OzLabs - Linux Technology Centre From arnd at arndb.de Thu Mar 10 00:01:16 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Wed, 9 Mar 2005 14:01:16 +0100 Subject: [PATCH] linking zImage with biarch ld Message-ID: <200503091401.17143.arnd@arndb.de> I noticed that with the vDSO patch in 2.6.11-bk, it's almost possible to build the kernel with the fedora biarch toolchain. However, I still get warnings from ld about zImage being the wrong architecture, unless I change the script as shown in this patch. I'm not sure if this breaks setups with old binutils that might not understand powerpc:common, otherwise please apply. 
Signed-off-by: Arnd Bergmann --- 1.4/arch/ppc64/boot/zImage.lds 2004-09-17 00:34:55 -04:00 +++ edited/arch/ppc64/boot/zImage.lds 2005-03-08 11:03:50 -05:00 @@ -1,4 +1,4 @@ -OUTPUT_ARCH(powerpc) +OUTPUT_ARCH(powerpc:common) SEARCH_DIR(/lib); SEARCH_DIR(/usr/lib); SEARCH_DIR(/usr/local/lib); SEARCH_DIR(/usr/local/powerpc-any-elf/lib); /* Do we need any of these for elf? __DYNAMIC = 0; */ From segher at kernel.crashing.org Thu Mar 10 02:30:12 2005 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Wed, 9 Mar 2005 16:30:12 +0100 Subject: [PATCH] linking zImage with biarch ld In-Reply-To: <200503091401.17143.arnd@arndb.de> References: <200503091401.17143.arnd@arndb.de> Message-ID: <67866e6be990bc2d27d34b92df86d946@kernel.crashing.org> > I'm not sure if this breaks setups with old binutils that might not > understand > powerpc:common, otherwise please apply. powerpc:common is fine since, erm, forever (that is, at least three years). Segher From linas at austin.ibm.com Thu Mar 10 07:01:09 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 9 Mar 2005 14:01:09 -0600 Subject: [RFC][PATCH] combining header files In-Reply-To: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> References: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> Message-ID: <20050309200109.GG1220@austin.ibm.com> On Wed, Mar 09, 2005 at 12:03:43PM +1100, Stephen Rothwell was heard to remark: > Hi all, > > I would just like to start a discussion about consolidating (some of) the > ppc and ppc64 header files. As a starting point (am I am not saying that > this is the right way to go) the following patch replaces (semantically) > equivalent ppc64 headers files by just including the asm-ppc file. Why not #include instead? > We *could* use this method to make the journey incremental until there > are no nontrivial files left in asm-ppc64 .... sounds good to me. 
--linas From nathanl at austin.ibm.com Thu Mar 10 07:54:15 2005 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Wed, 09 Mar 2005 14:54:15 -0600 Subject: A question about dlpar add cpu failed In-Reply-To: References: Message-ID: <1110401655.12027.7.camel@biclops> On Wed, 2005-03-09 at 14:40 +0800, Zhao Yu Wang wrote: > Hi, > I meet a failed while trying to add cpu to a partition dynamicly. I am > not sure what's wrong. Thanks. ... > A add 1 procs: > hscroot at hmc6lte:~> chhwres -m fsp-pear -d 5 -r proc -o a -p "pearlp3 > RH" --procs 1 > HSCL145F Attempted to allocate processing units less than the minimum > capacity allowed with the specified virtual processor setting. > > B add 10 procs: > hscroot at hmc6lte:~> chhwres -m fsp-pear -d 5 -r proc -o a -p "pearlp3 > RH" --procs 10 > Your request exceeds the profile's maximum virtual processor limit. > You can add or move up to 7 virtual processors. Please retry the > operation. > > C add 2 procs: > hscroot at hmc6lte:~> chhwres -m fsp-pear -d 5 -r proc -o a -p "pearlp3 > RH" --procs 2 > HSCL145F Attempted to allocate processing units less than the minimum > capacity allowed with the specified virtual processor setting. All of these seem to indicate HMC or platform firmware issues (or operator error), the OS and programs on the partition are not even involved at this point. Nathan From benh at kernel.crashing.org Thu Mar 10 08:39:48 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 10 Mar 2005 08:39:48 +1100 Subject: [PATCH] linking zImage with biarch ld In-Reply-To: <200503091401.17143.arnd@arndb.de> References: <200503091401.17143.arnd@arndb.de> Message-ID: <1110404388.32524.101.camel@gaston> On Wed, 2005-03-09 at 14:01 +0100, Arnd Bergmann wrote: > I noticed that with the vDSO patch in 2.6.11-bk, it's almost possible to build > the kernel with the fedora biarch toolchain. 
However, I still get warnings > from ld about zImage being the wrong architecture, unless I change the script > as shown in this patch. "Almost possible" ? What's wrong ? Only that ? > I'm not sure if this breaks setups with old binutils that might not understand > powerpc:common, otherwise please apply. > > Signed-off-by: Arnd Bergmann > > --- 1.4/arch/ppc64/boot/zImage.lds 2004-09-17 00:34:55 -04:00 > +++ edited/arch/ppc64/boot/zImage.lds 2005-03-08 11:03:50 -05:00 > @@ -1,4 +1,4 @@ > -OUTPUT_ARCH(powerpc) > +OUTPUT_ARCH(powerpc:common) > SEARCH_DIR(/lib); SEARCH_DIR(/usr/lib); SEARCH_DIR(/usr/local/lib); SEARCH_DIR(/usr/local/powerpc-any-elf/lib); > /* Do we need any of these for elf? > __DYNAMIC = 0; */ -- Benjamin Herrenschmidt From ntl at pobox.com Thu Mar 10 11:51:32 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:51:32 -0600 (CST) Subject: [PATCH 0/8] reworked support for pSeries dynamic reconfiguration Message-ID: <20050310005132.31309.65485.31668@otto> Hi- This patch series reworks existing ppc64 architecture support for the "dynamic reconfiguration" option of RPA platforms. This includes PCI hotplug and dynamic logical partitioning (DLPAR). This was all motivated by my desire to add code for better handling of processor addition and removal, but I didn't want to just add to the growing mess in prom.c where we have duplicated code for boot and DLPAR/hotplug. This adds very little new function, but gets rid of much duplicated code and introduces a new pSeries-specific file, pSeries_reconfig.c, which contains the core support for dynamic reconfiguration and implements a more refined version of the notifier chain API I posted a few weeks ago. Code that needs to act upon device nodes that are being added or removed can register with this notifier chain. I've ported as much code as possible to that API, and I expect memory DLPAR will want to use it too. 
The last couple of patches in the series modify the pSeries smp code so that we properly manage cpu_present_map with respect to DLPAR, and includes the "make cpu hotplug play well with maxcpus and smt-enabled" patch, which depends on this. The following cases have been tested on a Power5 system: * CPU add/remove * Virtual I/O adapter add/remove * Logical slot add/remove (thanks to John Rose) I also checked the build against all defconfigs in arch/ppc64/configs. diffstat for the combined series: arch/ppc64/kernel/Makefile | 2 arch/ppc64/kernel/pSeries_iommu.c | 25 arch/ppc64/kernel/pSeries_reconfig.c | 427 +++++++++++++ arch/ppc64/kernel/pSeries_smp.c | 231 +++++-- arch/ppc64/kernel/pci_dn.c | 22 arch/ppc64/kernel/proc_ppc64.c | 249 -------- arch/ppc64/kernel/prom.c | 466 ++++----------- arch/ppc64/kernel/setup.c | 12 arch/ppc64/kernel/smp.c | 13 include/asm-ppc64/machdep.h | 1 include/asm-ppc64/pSeries_reconfig.h | 25 include/asm-ppc64/prom.h | 4 12 files changed, 812 insertions(+), 665 deletions(-) Thanks, Nathan From ntl at pobox.com Thu Mar 10 11:51:37 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:51:37 -0600 (CST) Subject: [PATCH 1/8] preliminary changes to OF fixup functions In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005137.31309.32303.42763@otto> Preliminary modifications to support using some of the interpret_func family of functions at runtime. Changes the mem_start argument to be passed by reference, and the return type to int for error handling to be implemented in following patches. 
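Editorial sketch of the calling convention described above, not code from the patch: the helper advances the caller's cursor through a pointer and reports status through an `int` return value, instead of returning the new cursor. `struct toy_node`, `toy_interpret()` and the `-1` error for a negative count are all hypothetical; the patch itself only returns 0 and defers error handling to later patches in the series.

```c
/* Hypothetical stand-in for a device node that owns a carved-out array. */
struct toy_node {
	unsigned long *cells;
	int n_cells;
};

/*
 * New-style helper: bumps *mem_start by the space it needs and returns
 * 0 on success or -1 on error. With measure_only set it only sizes the
 * allocation and writes nothing, mirroring the two-pass scheme in prom.c.
 */
static int toy_interpret(struct toy_node *np, unsigned long *mem_start,
			 int n, int measure_only)
{
	int i;

	if (n < 0)
		return -1;	/* assumed error case, for illustration */

	np->cells = (unsigned long *)(*mem_start);
	*mem_start += (unsigned long)n * sizeof(unsigned long);
	if (measure_only)
		return 0;	/* sizing pass: the memory is not touched */

	for (i = 0; i < n; i++)
		np->cells[i] = 0;
	np->n_cells = n;
	return 0;
}
```

Freeing the return value for a status code is what lets the same functions be reused at runtime, where a failure needs to be reported rather than silently folded into an address.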
Signed-off-by: Nathan Lynch prom.c | 135 ++++++++++++++++++++++++++++++++++------------------------------- 1 files changed, 71 insertions(+), 64 deletions(-) Index: linux-2.6.11-bk5/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/prom.c 2005-03-09 20:01:32.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/prom.c 2005-03-09 20:02:15.000000000 +0000 @@ -73,8 +73,8 @@ struct isa_reg_property { }; -typedef unsigned long interpret_func(struct device_node *, unsigned long, - int, int, int); +typedef int interpret_func(struct device_node *, unsigned long *, + int, int, int); extern struct rtas_t rtas; extern struct lmb lmb; @@ -255,9 +255,9 @@ static int __devinit map_interrupt(unsig return nintrc; } -static unsigned long __init finish_node_interrupts(struct device_node *np, - unsigned long mem_start, - int measure_only) +static int __init finish_node_interrupts(struct device_node *np, + unsigned long *mem_start, + int measure_only) { unsigned int *ints; int intlen, intrcells, intrcount; @@ -267,14 +267,14 @@ static unsigned long __init finish_node_ ints = (unsigned int *) get_property(np, "interrupts", &intlen); if (ints == NULL) - return mem_start; + return 0; intrcells = prom_n_intr_cells(np); intlen /= intrcells * sizeof(unsigned int); - np->intrs = (struct interrupt_info *) mem_start; - mem_start += intlen * sizeof(struct interrupt_info); + np->intrs = (struct interrupt_info *) (*mem_start); + (*mem_start) += intlen * sizeof(struct interrupt_info); if (measure_only) - return mem_start; + return 0; intrcount = 0; for (i = 0; i < intlen; ++i, ints += intrcells) { @@ -315,13 +315,13 @@ static unsigned long __init finish_node_ } np->n_intrs = intrcount; - return mem_start; + return 0; } -static unsigned long __init interpret_pci_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_pci_props(struct 
device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct address_range *adr; struct pci_reg_property *pci_addrs; @@ -331,7 +331,7 @@ static unsigned long __init interpret_pc get_property(np, "assigned-addresses", &l); if (pci_addrs != 0 && l >= sizeof(struct pci_reg_property)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct pci_reg_property)) >= 0) { if (!measure_only) { adr[i].space = pci_addrs[i].addr.a_hi; @@ -343,15 +343,15 @@ static unsigned long __init interpret_pc } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_dbdma_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_dbdma_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct reg_property32 *rp; struct address_range *adr; @@ -372,7 +372,7 @@ static unsigned long __init interpret_db rp = (struct reg_property32 *) get_property(np, "reg", &l); if (rp != 0 && l >= sizeof(struct reg_property32)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct reg_property32)) >= 0) { if (!measure_only) { adr[i].space = 2; @@ -383,16 +383,16 @@ static unsigned long __init interpret_db } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_macio_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_macio_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) 
{ struct reg_property32 *rp; struct address_range *adr; @@ -413,7 +413,7 @@ static unsigned long __init interpret_ma rp = (struct reg_property32 *) get_property(np, "reg", &l); if (rp != 0 && l >= sizeof(struct reg_property32)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct reg_property32)) >= 0) { if (!measure_only) { adr[i].space = 2; @@ -424,16 +424,16 @@ static unsigned long __init interpret_ma } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_isa_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_isa_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct isa_reg_property *rp; struct address_range *adr; @@ -442,7 +442,7 @@ static unsigned long __init interpret_is rp = (struct isa_reg_property *) get_property(np, "reg", &l); if (rp != 0 && l >= sizeof(struct isa_reg_property)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct isa_reg_property)) >= 0) { if (!measure_only) { adr[i].space = rp[i].space; @@ -453,16 +453,16 @@ static unsigned long __init interpret_is } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_root_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_root_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct address_range *adr; int i, l; @@ -472,7 +472,7 @@ static unsigned long __init interpret_ro rp 
= (unsigned int *) get_property(np, "reg", &l); if (rp != 0 && l >= rpsize) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= rpsize) >= 0) { if (!measure_only) { adr[i].space = 0; @@ -484,26 +484,30 @@ static unsigned long __init interpret_ro } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init finish_node(struct device_node *np, - unsigned long mem_start, - interpret_func *ifunc, - int naddrc, int nsizec, - int measure_only) +static int __init finish_node(struct device_node *np, + unsigned long *mem_start, + interpret_func *ifunc, + int naddrc, int nsizec, + int measure_only) { struct device_node *child; - int *ip; + int *ip, rc = 0; /* get the device addresses and interrupts */ if (ifunc != NULL) - mem_start = ifunc(np, mem_start, naddrc, nsizec, measure_only); + rc = ifunc(np, mem_start, naddrc, nsizec, measure_only); + if (rc) + goto out; - mem_start = finish_node_interrupts(np, mem_start, measure_only); + rc = finish_node_interrupts(np, mem_start, measure_only); + if (rc) + goto out; /* Look for #address-cells and #size-cells properties. 
*/ ip = (int *) get_property(np, "#address-cells", NULL); @@ -539,11 +543,14 @@ static unsigned long __init finish_node( || !strcmp(np->type, "media-bay")))) ifunc = NULL; - for (child = np->child; child != NULL; child = child->sibling) - mem_start = finish_node(child, mem_start, ifunc, - naddrc, nsizec, measure_only); - - return mem_start; + for (child = np->child; child != NULL; child = child->sibling) { + rc = finish_node(child, mem_start, ifunc, + naddrc, nsizec, measure_only); + if (rc) + goto out; + } +out: + return rc; } /** @@ -555,7 +562,7 @@ static unsigned long __init finish_node( */ void __init finish_device_tree(void) { - unsigned long mem, size; + unsigned long start, end, size = 0; DBG(" -> finish_device_tree\n"); @@ -568,11 +575,11 @@ void __init finish_device_tree(void) virt_irq_init(); /* Finish device-tree (pre-parsing some properties etc...) */ - size = finish_node(allnodes, 0, NULL, 0, 0, 1); - mem = (unsigned long)abs_to_virt(lmb_alloc(size, 128)); - if (finish_node(allnodes, mem, NULL, 0, 0, 0) != mem + size) - BUG(); - + finish_node(allnodes, &size, NULL, 0, 0, 1); + end = start = (unsigned long)abs_to_virt(lmb_alloc(size, 128)); + finish_node(allnodes, &end, NULL, 0, 0, 0); + BUG_ON(end != start + size); + DBG(" <- finish_device_tree\n"); } From ntl at pobox.com Thu Mar 10 11:51:42 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:51:42 -0600 (CST) Subject: [PATCH 2/8] make OF node fixup code usable at runtime In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005142.31309.45788.99418@otto> At boot we recurse through the device tree "fixing up" various fields and properties in the device nodes. Long ago, to support DLPAR and hotplug, we largely duplicated some of this fixup code, the main difference being that the new code used kmalloc for allocating various data structures which are attached to the new device nodes. 
This patch kills most of the duplicated code and makes finish_node, finish_node_interrupts, and interpret_pci_props suitable for use at runtime. These functions, if passed a null mem_start argument, will use kmalloc for allocating extra data structures for the device node being processed. Not terribly elegant, but it seems worth it to get rid of the duplicated code (and bugs). Signed-off-by: Nathan Lynch prom.c | 169 ++++++++++++++++++++--------------------------------------------- 1 files changed, 54 insertions(+), 115 deletions(-) Index: linux-2.6.11-bk5/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/prom.c 2005-03-09 20:02:15.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/prom.c 2005-03-09 20:08:28.000000000 +0000 @@ -255,9 +255,9 @@ static int __devinit map_interrupt(unsig return nintrc; } -static int __init finish_node_interrupts(struct device_node *np, - unsigned long *mem_start, - int measure_only) +static int __devinit finish_node_interrupts(struct device_node *np, + unsigned long *mem_start, + int measure_only) { unsigned int *ints; int intlen, intrcells, intrcount; @@ -270,8 +270,15 @@ static int __init finish_node_interrupts return 0; intrcells = prom_n_intr_cells(np); intlen /= intrcells * sizeof(unsigned int); - np->intrs = (struct interrupt_info *) (*mem_start); - (*mem_start) += intlen * sizeof(struct interrupt_info); + + if (mem_start) { + np->intrs = (struct interrupt_info *) (*mem_start); + (*mem_start) += intlen * sizeof(struct interrupt_info); + } else { + np->intrs = kmalloc(intlen * sizeof(*(np->intrs)), GFP_KERNEL); + if (!np->intrs) + return -ENOMEM; + } if (measure_only) return 0; @@ -318,33 +325,44 @@ static int __init finish_node_interrupts return 0; } -static int __init interpret_pci_props(struct device_node *np, - unsigned long *mem_start, - int naddrc, int nsizec, - int measure_only) +static int __devinit interpret_pci_props(struct 
device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct address_range *adr; struct pci_reg_property *pci_addrs; - int i, l; + int i, l, n_addrs; pci_addrs = (struct pci_reg_property *) get_property(np, "assigned-addresses", &l); - if (pci_addrs != 0 && l >= sizeof(struct pci_reg_property)) { - i = 0; - adr = (struct address_range *) (*mem_start); - while ((l -= sizeof(struct pci_reg_property)) >= 0) { - if (!measure_only) { - adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo | - ((u64)pci_addrs[i].addr.a_mid << 32); - adr[i].size = pci_addrs[i].size_lo; - } - ++i; - } - np->addrs = adr; - np->n_addrs = i; - (*mem_start) += i * sizeof(struct address_range); + if (!pci_addrs) + return 0; + + n_addrs = l / sizeof(*pci_addrs); + + if (!mem_start) { + adr = kmalloc(n_addrs * sizeof(*adr), GFP_KERNEL); + if (!adr) + return -ENOMEM; + } else { + adr = (struct address_range *)(*mem_start); + (*mem_start) += n_addrs * sizeof(struct address_range); + } + + if (measure_only) + return 0; + + np->addrs = adr; + np->n_addrs = n_addrs; + + for (i = 0; i < n_addrs; i++) { + adr[i].space = pci_addrs[i].addr.a_hi; + adr[i].address = pci_addrs[i].addr.a_lo | + ((u64)pci_addrs[i].addr.a_mid << 32); + adr[i].size = pci_addrs[i].size_lo; } + return 0; } @@ -490,11 +508,12 @@ static int __init interpret_root_props(s return 0; } -static int __init finish_node(struct device_node *np, - unsigned long *mem_start, - interpret_func *ifunc, - int naddrc, int nsizec, - int measure_only) +/* If mem_start == NULL ifuncs should use kmalloc for allocations. 
*/ +static int __devinit finish_node(struct device_node *np, + unsigned long *mem_start, + interpret_func *ifunc, + int naddrc, int nsizec, + int measure_only) { struct device_node *child; int *ip, rc = 0; @@ -1627,54 +1646,6 @@ static void remove_node_proc_entries(str #endif /* CONFIG_PROC_DEVICETREE */ /* - * Fix up n_intrs and intrs fields in a new device node - * - */ -static int of_finish_dynamic_node_interrupts(struct device_node *node) -{ - int intrcells, intlen, i; - unsigned *irq, *ints, virq; - struct device_node *ic; - - ints = (unsigned int *)get_property(node, "interrupts", &intlen); - intrcells = prom_n_intr_cells(node); - intlen /= intrcells * sizeof(unsigned int); - node->n_intrs = intlen; - node->intrs = kmalloc(sizeof(struct interrupt_info) * intlen, - GFP_KERNEL); - if (!node->intrs) - return -ENOMEM; - - for (i = 0; i < intlen; ++i) { - int n, j; - node->intrs[i].line = 0; - node->intrs[i].sense = 1; - n = map_interrupt(&irq, &ic, node, ints, intrcells); - if (n <= 0) - continue; - virq = virt_irq_create_mapping(irq[0]); - if (virq == NO_IRQ) { - printk(KERN_CRIT "Could not allocate interrupt " - "number for %s\n", node->full_name); - return -ENOMEM; - } - node->intrs[i].line = irq_offset_up(virq); - if (n > 1) - node->intrs[i].sense = irq[1]; - if (n > 2) { - printk(KERN_DEBUG "hmmm, got %d intr cells for %s:", n, - node->full_name); - for (j = 0; j < n; ++j) - printk(" %d", irq[j]); - printk("\n"); - } - ints += intrcells; - } - return 0; -} - - -/* * Fix up the uninitialized fields in a new device node: * name, type, n_addrs, addrs, n_intrs, intrs, and pci-specific fields * @@ -1685,7 +1656,9 @@ static int of_finish_dynamic_node_interr * This should probably be split up into smaller chunks. 
*/ -static int of_finish_dynamic_node(struct device_node *node) +static int of_finish_dynamic_node(struct device_node *node, + unsigned long *unused1, int unused2, + int unused3, int unused4) { struct device_node *parent = of_get_parent(node); u32 *regs; @@ -1710,41 +1683,6 @@ static int of_finish_dynamic_node(struct if ((ibm_phandle = (unsigned int *)get_property(node, "ibm,phandle", NULL))) node->linux_phandle = *ibm_phandle; - /* do the work of interpret_pci_props */ - if (parent->type && !strcmp(parent->type, "pci")) { - struct address_range *adr; - struct pci_reg_property *pci_addrs; - int i, l; - - pci_addrs = (struct pci_reg_property *) - get_property(node, "assigned-addresses", &l); - if (pci_addrs != 0 && l >= sizeof(struct pci_reg_property)) { - i = 0; - adr = kmalloc(sizeof(struct address_range) * - (l / sizeof(struct pci_reg_property)), - GFP_KERNEL); - if (!adr) { - err = -ENOMEM; - goto out; - } - while ((l -= sizeof(struct pci_reg_property)) >= 0) { - adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo | - ((u64)pci_addrs[i].addr.a_mid << 32); - adr[i].size = pci_addrs[i].size_lo; - ++i; - } - node->addrs = adr; - node->n_addrs = i; - } - } - - /* now do the work of finish_node_interrupts */ - if (get_property(node, "interrupts", NULL)) { - err = of_finish_dynamic_node_interrupts(node); - if (err) goto out; - } - /* now do the rough equivalent of update_dn_pci_info, this * probably is not correct for phb's, but should work for * IOAs and slots. 
@@ -1796,7 +1734,8 @@ int of_add_node(const char *path, struct return -EINVAL; /* could also be ENOMEM, though */ } - if (0 != (err = of_finish_dynamic_node(np))) { + err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0); + if (err < 0) { kfree(np); return err; } From ntl at pobox.com Thu Mar 10 11:51:47 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:51:47 -0600 (CST) Subject: [PATCH 3/8] introduce pSeries_reconfig.[ch] In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005147.31309.61029.66648@otto> Move as much pSeries-specific DLPAR/hotplug code as possible into its own file, which is built only when pSeries support is enabled in the config. This new file is intended to contain support code for the "Dynamic Reconfiguration" option in the RISC Platform Architecture, which encompasses both PCI hotplug and dynamic logical partitioning (DLPAR). This patch mostly just moves code around, but the device node addition and removal API is slightly modified. In this way, of_add_node and of_remove_node are now responsible only for safely updating the device tree and global list, without all the other stuff like proc entries etc. This also adds the definitions and api for a notifier chain which is meant to be used by code that must act upon device node addition or removal. Patches to migrate code to the notifier api follow in this series. 
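
As a rough illustration of the notifier idea described above, the following userspace sketch models a chain of callbacks that are invoked when a node is added or removed, where any callback may veto the operation. Every name here (sketch_notifier, sketch_register, sketch_call_chain, SKETCH_NODE_ADD, SKETCH_BAD, count_adds) is hypothetical and chosen for this example. It is a simplified model, not the kernel's notifier implementation and not the API added by this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes and return values, for this sketch only. */
enum { SKETCH_NODE_ADD = 1, SKETCH_NODE_REMOVE = 2 };
enum { SKETCH_OK = 0, SKETCH_BAD = 1 };

struct sketch_notifier {
	int (*call)(struct sketch_notifier *self, unsigned long event,
		    void *data);
	struct sketch_notifier *next;
};

/* Append nb to the end of the chain headed by *head. */
void sketch_register(struct sketch_notifier **head,
		     struct sketch_notifier *nb)
{
	assert(nb && nb->call);
	while (*head)
		head = &(*head)->next;
	nb->next = NULL;
	*head = nb;
}

/* Unlink nb from the chain if it is present. */
void sketch_unregister(struct sketch_notifier **head,
		       struct sketch_notifier *nb)
{
	for (; *head; head = &(*head)->next) {
		if (*head == nb) {
			*head = nb->next;
			return;
		}
	}
}

/* Call each callback in order; stop at the first one that vetoes. */
int sketch_call_chain(struct sketch_notifier *head, unsigned long event,
		      void *data)
{
	int rc = SKETCH_OK;

	for (; head; head = head->next) {
		rc = head->call(head, event, data);
		if (rc == SKETCH_BAD)
			break;
	}
	return rc;
}

/* Example subscriber: counts additions, vetoes nodes with no name. */
static int nodes_added;

int count_adds(struct sketch_notifier *self, unsigned long event,
	       void *data)
{
	const char *name = data;

	(void)self;
	if (event != SKETCH_NODE_ADD)
		return SKETCH_OK;
	if (!name || !*name)
		return SKETCH_BAD;
	nodes_added++;
	return SKETCH_OK;
}
```

In this model the code that adds a node runs the chain first and abandons the addition if a subscriber vetoes it. The diff in this patch does the analogous thing when it checks the chain result against NOTIFY_BAD before calling of_add_node.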
Signed-off-by: Nathan Lynch arch/ppc64/kernel/Makefile | 2 arch/ppc64/kernel/pSeries_reconfig.c | 439 +++++++++++++++++++++++++++++++++++ arch/ppc64/kernel/proc_ppc64.c | 249 ------------------- arch/ppc64/kernel/prom.c | 156 +----------- include/asm-ppc64/pSeries_reconfig.h | 25 + include/asm-ppc64/prom.h | 4 6 files changed, 480 insertions(+), 395 deletions(-) Index: linux-2.6.11-bk5/arch/ppc64/kernel/Makefile =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/Makefile 2005-03-09 20:01:32.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/Makefile 2005-03-09 20:16:31.000000000 +0000 @@ -31,7 +31,7 @@ obj-$(CONFIG_PPC_ISERIES) += iSeries_irq obj-$(CONFIG_PPC_MULTIPLATFORM) += nvram.o i8259.o prom_init.o prom.o mpic.o obj-$(CONFIG_PPC_PSERIES) += pSeries_pci.o pSeries_lpar.o pSeries_hvCall.o \ - pSeries_nvram.o rtasd.o ras.o \ + pSeries_nvram.o rtasd.o ras.o pSeries_reconfig.o \ xics.o rtas.o pSeries_setup.o pSeries_iommu.o obj-$(CONFIG_EEH) += eeh.o Index: linux-2.6.11-bk5/arch/ppc64/kernel/proc_ppc64.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/proc_ppc64.c 2005-03-09 20:01:32.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/proc_ppc64.c 2005-03-09 20:16:31.000000000 +0000 @@ -41,20 +41,6 @@ static struct file_operations page_map_f .mmap = page_map_mmap }; -#ifdef CONFIG_PPC_PSERIES -/* routines for /proc/ppc64/ofdt */ -static ssize_t ofdt_write(struct file *, const char __user *, size_t, loff_t *); -static void proc_ppc64_create_ofdt(void); -static int do_remove_node(char *); -static int do_add_node(char *, size_t); -static void release_prop_list(const struct property *); -static struct property *new_property(const char *, const int, const unsigned char *, struct property *); -static char * parse_next_property(char *, char *, char **, int *, unsigned char**); -static struct file_operations ofdt_fops = { - .write = ofdt_write 
-}; -#endif - /* * Create the ppc64 and ppc64/rtas directories early. This allows us to * assume that they have been previously created in drivers. @@ -92,11 +78,6 @@ static int __init proc_ppc64_init(void) pde->size = PAGE_SIZE; pde->proc_fops = &page_map_fops; -#ifdef CONFIG_PPC_PSERIES - if ((systemcfg->platform & PLATFORM_PSERIES)) - proc_ppc64_create_ofdt(); -#endif - return 0; } __initcall(proc_ppc64_init); @@ -145,233 +126,3 @@ static int page_map_mmap( struct file *f return 0; } -#ifdef CONFIG_PPC_PSERIES -/* create /proc/ppc64/ofdt write-only by root */ -static void proc_ppc64_create_ofdt(void) -{ - struct proc_dir_entry *ent; - - ent = create_proc_entry("ppc64/ofdt", S_IWUSR, NULL); - if (ent) { - ent->nlink = 1; - ent->data = NULL; - ent->size = 0; - ent->proc_fops = &ofdt_fops; - } -} - -/** - * ofdt_write - perform operations on the Open Firmware device tree - * - * @file: not used - * @buf: command and arguments - * @count: size of the command buffer - * @off: not used - * - * Operations supported at this time are addition and removal of - * whole nodes along with their properties. Operations on individual - * properties are not implemented (yet). - */ -static ssize_t ofdt_write(struct file *file, const char __user *buf, size_t count, - loff_t *off) -{ - int rv = 0; - char *kbuf; - char *tmp; - - if (!(kbuf = kmalloc(count + 1, GFP_KERNEL))) { - rv = -ENOMEM; - goto out; - } - if (copy_from_user(kbuf, buf, count)) { - rv = -EFAULT; - goto out; - } - - kbuf[count] = '\0'; - - tmp = strchr(kbuf, ' '); - if (!tmp) { - rv = -EINVAL; - goto out; - } - *tmp = '\0'; - tmp++; - - if (!strcmp(kbuf, "add_node")) - rv = do_add_node(tmp, count - (tmp - kbuf)); - else if (!strcmp(kbuf, "remove_node")) - rv = do_remove_node(tmp); - else - rv = -EINVAL; -out: - kfree(kbuf); - return rv ? 
rv : count; -} - -static int do_remove_node(char *buf) -{ - struct device_node *node; - int rv = -ENODEV; - - if ((node = of_find_node_by_path(buf))) - rv = of_remove_node(node); - - of_node_put(node); - return rv; -} - -static int do_add_node(char *buf, size_t bufsize) -{ - char *path, *end, *name; - struct device_node *np; - struct property *prop = NULL; - unsigned char* value; - int length, rv = 0; - - end = buf + bufsize; - path = buf; - buf = strchr(buf, ' '); - if (!buf) - return -EINVAL; - *buf = '\0'; - buf++; - - if ((np = of_find_node_by_path(path))) { - of_node_put(np); - return -EINVAL; - } - - /* rv = build_prop_list(tmp, bufsize - (tmp - buf), &proplist); */ - while (buf < end && - (buf = parse_next_property(buf, end, &name, &length, &value))) { - struct property *last = prop; - - prop = new_property(name, length, value, last); - if (!prop) { - rv = -ENOMEM; - prop = last; - goto out; - } - } - if (!buf) { - rv = -EINVAL; - goto out; - } - - rv = of_add_node(path, prop); - -out: - if (rv) - release_prop_list(prop); - return rv; -} - -static struct property *new_property(const char *name, const int length, - const unsigned char *value, struct property *last) -{ - struct property *new = kmalloc(sizeof(*new), GFP_KERNEL); - - if (!new) - return NULL; - memset(new, 0, sizeof(*new)); - - if (!(new->name = kmalloc(strlen(name) + 1, GFP_KERNEL))) - goto cleanup; - if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) - goto cleanup; - - strcpy(new->name, name); - memcpy(new->value, value, length); - *(((char *)new->value) + length) = 0; - new->length = length; - new->next = last; - return new; - -cleanup: - if (new->name) - kfree(new->name); - if (new->value) - kfree(new->value); - kfree(new); - return NULL; -} - -/** - * parse_next_property - process the next property from raw input buffer - * @buf: input buffer, must be nul-terminated - * @end: end of the input buffer + 1, for validation - * @name: return value; set to property name in buf - * @length: 
return value; set to length of value - * @value: return value; set to the property value in buf - * - * Note that the caller must make copies of the name and value returned, - * this function does no allocation or copying of the data. Return value - * is set to the next name in buf, or NULL on error. - */ -static char * parse_next_property(char *buf, char *end, char **name, int *length, - unsigned char **value) -{ - char *tmp; - - *name = buf; - - tmp = strchr(buf, ' '); - if (!tmp) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - *tmp = '\0'; - - if (++tmp >= end) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - - /* now we're on the length */ - *length = -1; - *length = simple_strtoul(tmp, &tmp, 10); - if (*length == -1) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - if (*tmp != ' ' || ++tmp >= end) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - - /* now we're on the value */ - *value = tmp; - tmp += *length; - if (tmp > end) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - else if (tmp < end && *tmp != ' ' && *tmp != '\0') { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - tmp++; - - /* and now we should be on the next name, or the end */ - return tmp; -} - -static void release_prop_list(const struct property *prop) -{ - struct property *next; - for (; prop; prop = next) { - next = prop->next; - kfree(prop->name); - kfree(prop->value); - kfree(prop); - } - -} -#endif /* defined(CONFIG_PPC_PSERIES) */ Index: linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_reconfig.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 
linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_reconfig.c 2005-03-09 20:16:31.000000000 +0000 @@ -0,0 +1,439 @@ +/* + * pSeries_reconfig.c - support for dynamic reconfiguration (including PCI + * Hotplug and Dynamic Logical Partitioning on RPA platforms). + * + * Copyright (C) 2005 Nathan Lynch + * Copyright (C) 2005 IBM Corporation + * + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + */ + +#include +#include +#include +#include + +#include +#include +#include + + + +/* + * Routines for "runtime" addition and removal of device tree nodes. + */ +#ifdef CONFIG_PROC_DEVICETREE +/* + * Add a node to /proc/device-tree. + */ +static void add_node_proc_entries(struct device_node *np) +{ + struct proc_dir_entry *ent; + + ent = proc_mkdir(strrchr(np->full_name, '/') + 1, np->parent->pde); + if (ent) + proc_device_tree_add_node(np, ent); +} + +static void remove_node_proc_entries(struct device_node *np) +{ + struct property *pp = np->properties; + struct device_node *parent = np->parent; + + while (pp) { + remove_proc_entry(pp->name, np->pde); + pp = pp->next; + } + + /* Assuming that symlinks have the same parent directory as + * np->pde. + */ + if (np->name_link) + remove_proc_entry(np->name_link->name, parent->pde); + if (np->addr_link) + remove_proc_entry(np->addr_link->name, parent->pde); + if (np->pde) + remove_proc_entry(np->pde->name, parent->pde); +} +#else /* !CONFIG_PROC_DEVICETREE */ +static void add_node_proc_entries(struct device_node *np) +{ + return; +} + +static void remove_node_proc_entries(struct device_node *np) +{ + return; +} +#endif /* CONFIG_PROC_DEVICETREE */ + +/** + * derive_parent - basically like dirname(1) + * @path: the full_name of a node to be added to the tree + * + * Returns the node which should be the parent of the node + * described by path. 
E.g., for path = "/foo/bar", returns + * the node with full_name = "/foo". + */ +static struct device_node *derive_parent(const char *path) +{ + struct device_node *parent = NULL; + char *parent_path = "/"; + size_t parent_path_len = strrchr(path, '/') - path + 1; + + /* reject if path is "/" */ + if (!strcmp(path, "/")) + return NULL; + + if (strrchr(path, '/') != path) { + parent_path = kmalloc(parent_path_len, GFP_KERNEL); + if (!parent_path) + return NULL; + strlcpy(parent_path, path, parent_path_len); + } + parent = of_find_node_by_path(parent_path); + if (strcmp(parent_path, "/")) + kfree(parent_path); + return parent; +} + +static struct notifier_block *pSeries_reconfig_chain; + +int pSeries_reconfig_notifier_register(struct notifier_block *nb) +{ + return notifier_chain_register(&pSeries_reconfig_chain, nb); +} + +void pSeries_reconfig_notifier_unregister(struct notifier_block *nb) +{ + notifier_chain_unregister(&pSeries_reconfig_chain, nb); +} + +static int pSeries_reconfig_add_node(const char *path, struct property *proplist) +{ + struct device_node *np; + int err = -ENOMEM; + + np = kcalloc(1, sizeof(*np), GFP_KERNEL); + if (!np) + goto out_err; + + np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL); + if (!np->full_name) + goto out_err; + + strcpy(np->full_name, path); + + np->properties = proplist; + OF_MARK_DYNAMIC(np); + kref_init(&np->kref); + of_node_get(np); + np->parent = derive_parent(path); + if (!np->parent) + goto out_err; + + err = notifier_call_chain(&pSeries_reconfig_chain, + PSERIES_RECONFIG_ADD, np); + if (err == NOTIFY_BAD) { + printk(KERN_ERR "Failed to add device node %s\n", path); + goto out_err; + } + + of_add_node(np); + + add_node_proc_entries(np); + + of_node_put(np->parent); + of_node_put(np); + + return 0; + +out_err: + kfree(np->full_name); + kfree(np); + return err; +} + +/* + * Prepare an OF node for removal from system + * XXX move this to pSeries_iommu.c + */ +static void of_cleanup_node(struct device_node *np) +{ + if 
(np->iommu_table && get_property(np, "ibm,dma-window", NULL)) + iommu_free_table(np); +} + +static int pSeries_reconfig_remove_node(struct device_node *np) +{ + struct device_node *parent, *child; + + parent = of_get_parent(np); + if (!parent) + return -EINVAL; + + if ((child = of_get_next_child(np, NULL))) { + of_node_put(child); + return -EBUSY; + } + + of_cleanup_node(np); + + remove_node_proc_entries(np); + + notifier_call_chain(&pSeries_reconfig_chain, + PSERIES_RECONFIG_REMOVE, np); + of_remove_node(np); + + of_node_put(parent); + of_node_put(np); /* Must decrement the refcount */ + return 0; +} + +/* + * /proc/ppc64/ofdt - yucky binary interface for adding and removing + * OF device nodes. Should be deprecated as soon as we get an + * in-kernel wrapper for the RTAS ibm,configure-connector call. + */ + +static void release_prop_list(const struct property *prop) +{ + struct property *next; + for (; prop; prop = next) { + next = prop->next; + kfree(prop->name); + kfree(prop->value); + kfree(prop); + } + +} + +/** + * parse_next_property - process the next property from raw input buffer + * @buf: input buffer, must be nul-terminated + * @end: end of the input buffer + 1, for validation + * @name: return value; set to property name in buf + * @length: return value; set to length of value + * @value: return value; set to the property value in buf + * + * Note that the caller must make copies of the name and value returned, + * this function does no allocation or copying of the data. Return value + * is set to the next name in buf, or NULL on error. 
+ */ +static char * parse_next_property(char *buf, char *end, char **name, int *length, + unsigned char **value) +{ + char *tmp; + + *name = buf; + + tmp = strchr(buf, ' '); + if (!tmp) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + *tmp = '\0'; + + if (++tmp >= end) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + + /* now we're on the length */ + *length = -1; + *length = simple_strtoul(tmp, &tmp, 10); + if (*length == -1) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + if (*tmp != ' ' || ++tmp >= end) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + + /* now we're on the value */ + *value = tmp; + tmp += *length; + if (tmp > end) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + else if (tmp < end && *tmp != ' ' && *tmp != '\0') { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + tmp++; + + /* and now we should be on the next name, or the end */ + return tmp; +} + +static struct property *new_property(const char *name, const int length, + const unsigned char *value, struct property *last) +{ + struct property *new = kmalloc(sizeof(*new), GFP_KERNEL); + + if (!new) + return NULL; + memset(new, 0, sizeof(*new)); + + if (!(new->name = kmalloc(strlen(name) + 1, GFP_KERNEL))) + goto cleanup; + if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) + goto cleanup; + + strcpy(new->name, name); + memcpy(new->value, value, length); + *(((char *)new->value) + length) = 0; + new->length = length; + new->next = last; + return new; + +cleanup: + if (new->name) + kfree(new->name); + if (new->value) + kfree(new->value); + kfree(new); + return NULL; +} + +static int do_add_node(char *buf, size_t bufsize) +{ + 
char *path, *end, *name; + struct device_node *np; + struct property *prop = NULL; + unsigned char* value; + int length, rv = 0; + + end = buf + bufsize; + path = buf; + buf = strchr(buf, ' '); + if (!buf) + return -EINVAL; + *buf = '\0'; + buf++; + + if ((np = of_find_node_by_path(path))) { + of_node_put(np); + return -EINVAL; + } + + /* rv = build_prop_list(tmp, bufsize - (tmp - buf), &proplist); */ + while (buf < end && + (buf = parse_next_property(buf, end, &name, &length, &value))) { + struct property *last = prop; + + prop = new_property(name, length, value, last); + if (!prop) { + rv = -ENOMEM; + prop = last; + goto out; + } + } + if (!buf) { + rv = -EINVAL; + goto out; + } + + rv = pSeries_reconfig_add_node(path, prop); + +out: + if (rv) + release_prop_list(prop); + return rv; +} + +static int do_remove_node(char *buf) +{ + struct device_node *node; + int rv = -ENODEV; + + if ((node = of_find_node_by_path(buf))) + rv = pSeries_reconfig_remove_node(node); + + of_node_put(node); + return rv; +} + +/** + * ofdt_write - perform operations on the Open Firmware device tree + * + * @file: not used + * @buf: command and arguments + * @count: size of the command buffer + * @off: not used + * + * Operations supported at this time are addition and removal of + * whole nodes along with their properties. Operations on individual + * properties are not implemented (yet). 
+ */ +static ssize_t ofdt_write(struct file *file, const char __user *buf, size_t count, + loff_t *off) +{ + int rv = 0; + char *kbuf; + char *tmp; + + if (!(kbuf = kmalloc(count + 1, GFP_KERNEL))) { + rv = -ENOMEM; + goto out; + } + if (copy_from_user(kbuf, buf, count)) { + rv = -EFAULT; + goto out; + } + + kbuf[count] = '\0'; + + tmp = strchr(kbuf, ' '); + if (!tmp) { + rv = -EINVAL; + goto out; + } + *tmp = '\0'; + tmp++; + + if (!strcmp(kbuf, "add_node")) + rv = do_add_node(tmp, count - (tmp - kbuf)); + else if (!strcmp(kbuf, "remove_node")) + rv = do_remove_node(tmp); + else + rv = -EINVAL; +out: + kfree(kbuf); + return rv ? rv : count; +} + +static struct file_operations ofdt_fops = { + .write = ofdt_write +}; + +/* create /proc/ppc64/ofdt write-only by root */ +static int proc_ppc64_create_ofdt(void) +{ + struct proc_dir_entry *ent; + + if (!(systemcfg->platform & PLATFORM_PSERIES)) + return 0; + + ent = create_proc_entry("ppc64/ofdt", S_IWUSR, NULL); + if (ent) { + ent->nlink = 1; + ent->data = NULL; + ent->size = 0; + ent->proc_fops = &ofdt_fops; + } + + return 0; +} +__initcall(proc_ppc64_create_ofdt); Index: linux-2.6.11-bk5/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/prom.c 2005-03-09 20:08:28.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/prom.c 2005-03-09 20:16:31.000000000 +0000 @@ -27,7 +27,6 @@ #include #include #include -#include #include #include #include @@ -1567,84 +1566,6 @@ void of_node_put(struct device_node *nod } EXPORT_SYMBOL(of_node_put); -/** - * derive_parent - basically like dirname(1) - * @path: the full_name of a node to be added to the tree - * - * Returns the node which should be the parent of the node - * described by path. E.g., for path = "/foo/bar", returns - * the node with full_name = "/foo". 
- */ -static struct device_node *derive_parent(const char *path) -{ - struct device_node *parent = NULL; - char *parent_path = "/"; - size_t parent_path_len = strrchr(path, '/') - path + 1; - - /* reject if path is "/" */ - if (!strcmp(path, "/")) - return NULL; - - if (strrchr(path, '/') != path) { - parent_path = kmalloc(parent_path_len, GFP_KERNEL); - if (!parent_path) - return NULL; - strlcpy(parent_path, path, parent_path_len); - } - parent = of_find_node_by_path(parent_path); - if (strcmp(parent_path, "/")) - kfree(parent_path); - return parent; -} - -/* - * Routines for "runtime" addition and removal of device tree nodes. - */ -#ifdef CONFIG_PROC_DEVICETREE -/* - * Add a node to /proc/device-tree. - */ -static void add_node_proc_entries(struct device_node *np) -{ - struct proc_dir_entry *ent; - - ent = proc_mkdir(strrchr(np->full_name, '/') + 1, np->parent->pde); - if (ent) - proc_device_tree_add_node(np, ent); -} - -static void remove_node_proc_entries(struct device_node *np) -{ - struct property *pp = np->properties; - struct device_node *parent = np->parent; - - while (pp) { - remove_proc_entry(pp->name, np->pde); - pp = pp->next; - } - - /* Assuming that symlinks have the same parent directory as - * np->pde. 
- */ - if (np->name_link) - remove_proc_entry(np->name_link->name, parent->pde); - if (np->addr_link) - remove_proc_entry(np->addr_link->name, parent->pde); - if (np->pde) - remove_proc_entry(np->pde->name, parent->pde); -} -#else /* !CONFIG_PROC_DEVICETREE */ -static void add_node_proc_entries(struct device_node *np) -{ - return; -} - -static void remove_node_proc_entries(struct device_node *np) -{ - return; -} -#endif /* CONFIG_PROC_DEVICETREE */ - /* * Fix up the uninitialized fields in a new device node: * name, type, n_addrs, addrs, n_intrs, intrs, and pci-specific fields @@ -1702,43 +1623,18 @@ out: } /* - * Given a path and a property list, construct an OF device node, add - * it to the device tree and global list, and place it in - * /proc/device-tree. This function may sleep. + * Plug a device node into the tree and global list. */ -int of_add_node(const char *path, struct property *proplist) +void of_add_node(struct device_node *np) { - struct device_node *np; - int err = 0; - - np = kmalloc(sizeof(struct device_node), GFP_KERNEL); - if (!np) - return -ENOMEM; - - memset(np, 0, sizeof(*np)); - - np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL); - if (!np->full_name) { - kfree(np); - return -ENOMEM; - } - strcpy(np->full_name, path); - - np->properties = proplist; - OF_MARK_DYNAMIC(np); - kref_init(&np->kref); - of_node_get(np); - np->parent = derive_parent(path); - if (!np->parent) { - kfree(np); - return -EINVAL; /* could also be ENOMEM, though */ - } + int err; + /* This use of finish_node will be moved to a notifier so + * the error code can be used. 
+ */ err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0); - if (err < 0) { - kfree(np); - return err; - } + if (err < 0) + return; write_lock(&devtree_lock); np->sibling = np->parent->child; @@ -1746,21 +1642,6 @@ int of_add_node(const char *path, struct np->parent->child = np; allnodes = np; write_unlock(&devtree_lock); - - add_node_proc_entries(np); - - of_node_put(np->parent); - of_node_put(np); - return 0; -} - -/* - * Prepare an OF node for removal from system - */ -static void of_cleanup_node(struct device_node *np) -{ - if (np->iommu_table && get_property(np, "ibm,dma-window", NULL)) - iommu_free_table(np); } /* @@ -1768,23 +1649,14 @@ static void of_cleanup_node(struct devic * a reference to the node. The memory associated with the node * is not freed until its refcount goes to zero. */ -int of_remove_node(struct device_node *np) +void of_remove_node(const struct device_node *np) { - struct device_node *parent, *child; - - parent = of_get_parent(np); - if (!parent) - return -EINVAL; + struct device_node *parent; - if ((child = of_get_next_child(np, NULL))) { - of_node_put(child); - return -EBUSY; - } + write_lock(&devtree_lock); - of_cleanup_node(np); + parent = np->parent; - write_lock(&devtree_lock); - remove_node_proc_entries(np); if (allnodes == np) allnodes = np->allnext; else { @@ -1806,10 +1678,8 @@ int of_remove_node(struct device_node *n ; prevsib->sibling = np->sibling; } + write_unlock(&devtree_lock); - of_node_put(parent); - of_node_put(np); /* Must decrement the refcount */ - return 0; } /* Index: linux-2.6.11-bk5/include/asm-ppc64/prom.h =================================================================== --- linux-2.6.11-bk5.orig/include/asm-ppc64/prom.h 2005-03-09 20:01:34.000000000 +0000 +++ linux-2.6.11-bk5/include/asm-ppc64/prom.h 2005-03-09 20:16:31.000000000 +0000 @@ -209,8 +209,8 @@ extern struct device_node *of_node_get(s extern void of_node_put(struct device_node *node); /* For updating the device tree at runtime */ -extern 
int of_add_node(const char *path, struct property *proplist); -extern int of_remove_node(struct device_node *np); +extern void of_add_node(struct device_node *); +extern void of_remove_node(const struct device_node *); /* Other Prototypes */ extern unsigned long prom_init(unsigned long, unsigned long, unsigned long, Index: linux-2.6.11-bk5/include/asm-ppc64/pSeries_reconfig.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.11-bk5/include/asm-ppc64/pSeries_reconfig.h 2005-03-09 20:16:31.000000000 +0000 @@ -0,0 +1,25 @@ +#ifndef _PPC64_PSERIES_RECONFIG_H +#define _PPC64_PSERIES_RECONFIG_H + +#include + +/* + * Use this API if your code needs to know about OF device nodes being + * added or removed on pSeries systems. + */ + +#define PSERIES_RECONFIG_ADD 0x0001 +#define PSERIES_RECONFIG_REMOVE 0x0002 + +#ifdef CONFIG_PPC_PSERIES +extern int pSeries_reconfig_notifier_register(struct notifier_block *); +extern void pSeries_reconfig_notifier_unregister(struct notifier_block *); +#else /* !CONFIG_PPC_PSERIES */ +static inline int pSeries_reconfig_notifier_register(struct notifier_block *nb) +{ + return 0; +} +static inline void pSeries_reconfig_notifier_unregister(struct notifier_block *nb) { } +#endif /* CONFIG_PPC_PSERIES */ + +#endif /* _PPC64_PSERIES_RECONFIG_H */ From ntl at pobox.com Thu Mar 10 11:51:52 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:51:52 -0600 (CST) Subject: [PATCH 4/8] prom.c: use pSeries reconfig notifier In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005152.31309.41959.21947@otto> Use the pSeries_reconfig notifier list to fix up a device node which is about to be added. 
Signed-off-by: Nathan Lynch prom.c | 40 +++++++++++++++++++++++++++++++--------- 1 files changed, 31 insertions(+), 9 deletions(-) Index: linux-2.6.11-bk4/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk4.orig/arch/ppc64/kernel/prom.c 2005-03-09 04:22:07.000000000 +0000 +++ linux-2.6.11-bk4/arch/ppc64/kernel/prom.c 2005-03-09 06:12:30.000000000 +0000 @@ -52,6 +52,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) @@ -1627,15 +1628,6 @@ out: */ void of_add_node(struct device_node *np) { - int err; - - /* This use of finish_node will be moved to a notifier so - * the error code can be used. - */ - err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0); - if (err < 0) - return; - write_lock(&devtree_lock); np->sibling = np->parent->child; np->allnext = allnodes; @@ -1682,6 +1674,36 @@ void of_remove_node(const struct device_ write_unlock(&devtree_lock); } +static int prom_reconfig_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + int err; + + switch (action) { + case PSERIES_RECONFIG_ADD: + err = finish_node(node, NULL, of_finish_dynamic_node, 0, 0, 0); + if (err < 0) { + printk(KERN_ERR "finish_node returned %d\n", err); + err = NOTIFY_BAD; + } + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block prom_reconfig_nb = { + .notifier_call = prom_reconfig_notifier, + .priority = 10, /* This one needs to run first */ +}; + +static int __init prom_reconfig_setup(void) +{ + return pSeries_reconfig_notifier_register(&prom_reconfig_nb); +} +__initcall(prom_reconfig_setup); + /* * Find a property with a given name for a given node * and return the value. 
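[Editorial sketch, not part of the posted patch.] The patch above relies on two properties of a notifier chain: callbacks with a higher .priority run first, and a callback that returns NOTIFY_BAD stops the walk. The following is a minimal userspace model of that behaviour, assuming a simplified chain. The names struct nb, chain_register, chain_call, fixup_cb and pci_cb are hypothetical stand-ins for illustration, not the kernel's notifier API, and the constants are redefined locally.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the notifier return codes (assumed values). */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_BAD       (NOTIFY_STOP_MASK | 0x0002)

/* Local stand-ins for the reconfig actions. */
#define RECONFIG_ADD     0x0001
#define RECONFIG_REMOVE  0x0002

/* Hypothetical, simplified notifier block. */
struct nb {
	int (*call)(struct nb *self, unsigned long action, void *node);
	struct nb *next;
	int priority;
};

static struct nb *chain;       /* head of the singly linked chain */
static char order[16];         /* records which callbacks ran, in order */
static size_t order_len;

static void record(char tag)
{
	assert(order_len < sizeof(order));
	order[order_len++] = tag;
}

/* Insert n ahead of the first entry with a lower priority. */
static void chain_register(struct nb *n)
{
	struct nb **p = &chain;

	while (*p && (*p)->priority >= n->priority)
		p = &(*p)->next;
	n->next = *p;
	*p = n;
}

/* Walk the chain; stop early if a callback sets the stop bit. */
static int chain_call(unsigned long action, void *node)
{
	int ret = NOTIFY_DONE;
	struct nb *n;

	for (n = chain; n; n = n->next) {
		ret = n->call(n, action, node);
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* Models a "fix up the node first" callback; a NULL node simulates failure. */
static int fixup_cb(struct nb *self, unsigned long action, void *node)
{
	(void)self;
	if (action != RECONFIG_ADD)
		return NOTIFY_DONE;
	if (!node)
		return NOTIFY_BAD;
	record('F');
	return NOTIFY_OK;
}

/* Models a later consumer that expects the node to be fixed up already. */
static int pci_cb(struct nb *self, unsigned long action, void *node)
{
	(void)self;
	(void)node;
	if (action != RECONFIG_ADD)
		return NOTIFY_DONE;
	record('P');
	return NOTIFY_OK;
}

static struct nb fixup_nb = { .call = fixup_cb, .priority = 10 };
static struct nb pci_nb   = { .call = pci_cb };   /* default priority 0 */
```

In this model, registering pci_nb before fixup_nb still leaves fixup_nb at the head of the chain, so the fixup runs first regardless of registration order. That is presumably the intent of the "needs to run first" comment on .priority = 10 in the patch.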
From ntl at pobox.com Thu Mar 10 11:51:57 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:51:57 -0600 (CST) Subject: [PATCH 5/8] pci_dn.c: use pSeries reconfig notifier In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005157.31309.82819.78506@otto> Use the pSeries_reconfig notifier list to handle newly added pci device nodes. Remove duplicated version of update_dn_pci_info from prom.c. Signed-off-by: Nathan Lynch pci_dn.c | 22 ++++++++++++++++++++++ prom.c | 14 -------------- 2 files changed, 22 insertions(+), 14 deletions(-) Index: linux-2.6.11-bk5/arch/ppc64/kernel/pci_dn.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/pci_dn.c 2005-03-09 20:01:32.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/pci_dn.c 2005-03-09 20:16:54.000000000 +0000 @@ -27,6 +27,7 @@ #include #include #include +#include #include "pci.h" @@ -161,6 +162,25 @@ struct device_node *fetch_dev_dn(struct } EXPORT_SYMBOL(fetch_dev_dn); +static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + struct device_node *np = node; + int err = NOTIFY_OK; + + switch (action) { + case PSERIES_RECONFIG_ADD: + update_dn_pci_info(np, np->parent->phb); + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block pci_dn_reconfig_nb = { + .notifier_call = pci_dn_reconfig_notifier, +}; /* * Actually initialize the phbs. @@ -173,4 +193,6 @@ void __init pci_devs_phb_init(void) /* This must be done first so the device nodes have valid pci info! 
*/ list_for_each_entry_safe(phb, tmp, &hose_list, list_node) pci_devs_phb_init_dynamic(phb); + + pSeries_reconfig_notifier_register(&pci_dn_reconfig_nb); } Index: linux-2.6.11-bk5/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/prom.c 2005-03-09 20:16:43.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/prom.c 2005-03-09 20:16:54.000000000 +0000 @@ -1583,7 +1583,6 @@ static int of_finish_dynamic_node(struct int unused3, int unused4) { struct device_node *parent = of_get_parent(node); - u32 *regs; int err = 0; phandle *ibm_phandle; @@ -1605,19 +1604,6 @@ static int of_finish_dynamic_node(struct if ((ibm_phandle = (unsigned int *)get_property(node, "ibm,phandle", NULL))) node->linux_phandle = *ibm_phandle; - /* now do the rough equivalent of update_dn_pci_info, this - * probably is not correct for phb's, but should work for - * IOAs and slots. - */ - - node->phb = parent->phb; - - regs = (u32 *)get_property(node, "reg", NULL); - if (regs) { - node->busno = (regs[0] >> 16) & 0xff; - node->devfn = (regs[0] >> 8) & 0xff; - } - out: of_node_put(parent); return err; From ntl at pobox.com Thu Mar 10 11:52:02 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:52:02 -0600 (CST) Subject: [PATCH 6/8] pSeries_iommu.c: use pSeries reconfig notifier In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005202.31309.17710.70791@otto> Use the pSeries_reconfig notifier chain for tearing down the iommu table when a device node is removed. 
Signed-off-by: Nathan Lynch pSeries_iommu.c | 25 +++++++++++++++++++++++++ pSeries_reconfig.c | 12 ------------ 2 files changed, 25 insertions(+), 12 deletions(-) Index: linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_iommu.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/pSeries_iommu.c 2005-03-09 20:01:04.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_iommu.c 2005-03-09 20:17:09.000000000 +0000 @@ -43,6 +43,7 @@ #include #include #include +#include #include #include "pci.h" @@ -455,6 +456,28 @@ static void iommu_dev_setup_pSeries(stru } } +static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + int err = NOTIFY_OK; + struct device_node *np = node; + + switch (action) { + case PSERIES_RECONFIG_REMOVE: + if (np->iommu_table && + get_property(np, "ibm,dma-window", NULL)) + iommu_free_table(np); + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block iommu_reconfig_nb = { + .notifier_call = iommu_reconfig_notifier, +}; + static void iommu_bus_setup_null(struct pci_bus *b) { } static void iommu_dev_setup_null(struct pci_dev *d) { } @@ -487,6 +510,8 @@ void iommu_init_early_pSeries(void) ppc_md.iommu_dev_setup = iommu_dev_setup_pSeries; + pSeries_reconfig_notifier_register(&iommu_reconfig_nb); + pci_iommu_init(); } Index: linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_reconfig.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/pSeries_reconfig.c 2005-03-09 20:16:31.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_reconfig.c 2005-03-09 20:17:09.000000000 +0000 @@ -157,16 +157,6 @@ out_err: return err; } -/* - * Prepare an OF node for removal from system - * XXX move this to pSeries_iommu.c - */ -static void of_cleanup_node(struct device_node *np) -{ - if (np->iommu_table && get_property(np, "ibm,dma-window", NULL)) - 
iommu_free_table(np); -} - static int pSeries_reconfig_remove_node(struct device_node *np) { struct device_node *parent, *child; @@ -180,8 +170,6 @@ static int pSeries_reconfig_remove_node( return -EBUSY; } - of_cleanup_node(np); - remove_node_proc_entries(np); notifier_call_chain(&pSeries_reconfig_chain, From ntl at pobox.com Thu Mar 10 11:52:07 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:52:07 -0600 (CST) Subject: [PATCH 7/8] pSeries_smp.c: use pSeries reconfig notifier for cpu DLPAR In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005207.31309.32546.73375@otto> Use the pSeries_reconfig notifier API to handle processor addition and removal on pSeries LPAR. This is the "right" way to do it, as opposed to setting cpu_present_map = cpu_possible_map at boot (this is fixed in the next patch). Signed-off-by: Nathan Lynch pSeries_smp.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 126 insertions(+) Index: linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/pSeries_smp.c 2005-03-09 20:01:32.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_smp.c 2005-03-09 20:31:06.000000000 +0000 @@ -44,6 +44,7 @@ #include #include #include +#include #include "mpic.h" @@ -213,6 +214,127 @@ static inline int __devinit smp_startup_ } return 1; } + +/* + * Update cpu_present_map and paca(s) for a new cpu node. The wrinkle + * here is that a cpu device node may represent up to two logical cpus + * in the SMT case. We must honor the assumption in other code that + * the logical ids for sibling SMT threads x and y are adjacent, such + * that x^1 == y and y^1 == x. 
+ */ +static int pSeries_add_processor(struct device_node *np) +{ + unsigned int cpu; + cpumask_t candidate_map, tmp = CPU_MASK_NONE; + int err = -ENOSPC, len, nthreads, i; + u32 *intserv; + + intserv = (u32 *)get_property(np, "ibm,ppc-interrupt-server#s", &len); + if (!intserv) + return 0; + + nthreads = len / sizeof(u32); + for (i = 0; i < nthreads; i++) + cpu_set(i, tmp); + + lock_cpu_hotplug(); + + BUG_ON(!cpus_subset(cpu_present_map, cpu_possible_map)); + + /* Get a bitmap of unoccupied slots. */ + cpus_xor(candidate_map, cpu_possible_map, cpu_present_map); + if (cpus_empty(candidate_map)) { + /* If we get here, it most likely means that NR_CPUS is + * less than the partition's max processors setting. + */ + printk(KERN_ERR "Cannot add cpu %s; this system configuration" + " supports %d logical cpus.\n", np->full_name, + cpus_weight(cpu_possible_map)); + goto out_unlock; + } + + while (!cpus_empty(tmp)) + if (cpus_subset(tmp, candidate_map)) + /* Found a range where we can insert the new cpu(s) */ + break; + else + cpus_shift_left(tmp, tmp, nthreads); + + if (cpus_empty(tmp)) { + printk(KERN_ERR "Unable to find space in cpu_present_map for" + " processor %s with %d thread(s)\n", np->name, + nthreads); + goto out_unlock; + } + + for_each_cpu_mask(cpu, tmp) { + BUG_ON(cpu_isset(cpu, cpu_present_map)); + cpu_set(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, *intserv++); + } + err = 0; +out_unlock: + unlock_cpu_hotplug(); + return err; +} + +/* + * Update the present map for a cpu node which is going away, and set + * the hard id in the paca(s) to -1 to be consistent with boot time + * convention for non-present cpus. 
+ */ +static void pSeries_remove_processor(struct device_node *np) +{ + unsigned int cpu; + int len, nthreads, i; + u32 *intserv; + + intserv = (u32 *)get_property(np, "ibm,ppc-interrupt-server#s", &len); + if (!intserv) + return; + + nthreads = len / sizeof(u32); + + lock_cpu_hotplug(); + for (i = 0; i < nthreads; i++) { + for_each_present_cpu(cpu) { + if (get_hard_smp_processor_id(cpu) != intserv[i]) + continue; + BUG_ON(cpu_online(cpu)); + cpu_clear(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, -1); + break; + } + if (cpu == NR_CPUS) + printk(KERN_WARNING "Could not find cpu to remove " + "with physical id 0x%x\n", intserv[i]); + } + unlock_cpu_hotplug(); +} + +static int pSeries_smp_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + int err = NOTIFY_OK; + + switch (action) { + case PSERIES_RECONFIG_ADD: + if (pSeries_add_processor(node)) + err = NOTIFY_BAD; + break; + case PSERIES_RECONFIG_REMOVE: + pSeries_remove_processor(node); + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block pSeries_smp_nb = { + .notifier_call = pSeries_smp_notifier, +}; + #else /* ... 
CONFIG_HOTPLUG_CPU */ static inline int __devinit smp_startup_cpu(unsigned int lcpu) { @@ -336,6 +458,10 @@ void __init smp_init_pSeries(void) #ifdef CONFIG_HOTPLUG_CPU smp_ops->cpu_disable = pSeries_cpu_disable; smp_ops->cpu_die = pSeries_cpu_die; + + /* Processors can be added/removed only on LPAR */ + if (systemcfg->platform == PLATFORM_PSERIES_LPAR) + pSeries_reconfig_notifier_register(&pSeries_smp_nb); #endif /* Start secondary threads on SMT systems; primary threads From ntl at pobox.com Thu Mar 10 11:52:13 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 18:52:13 -0600 (CST) Subject: [PATCH 8/8] make cpu hotplug play well with maxcpus and smt-enabled In-Reply-To: <20050310005132.31309.65485.31668@otto> References: <20050310005132.31309.65485.31668@otto> Message-ID: <20050310005212.31309.16616.43059@otto> This patch allows you to boot a pSeries system with maxcpus=x or smt-enabled=off (or both) and bring up the offline cpus later from userspace, assuming the kernel was built with CONFIG_HOTPLUG_CPU=y. - Record cpus which were started from OF in a cpu map and use that instead of system_state to decide how to start a cpu in smp_startup_cpu. - Change the smp bootup logic slightly so that the path for bringing up secondary threads is exactly the same as hotplugging a cpu later from userspace. - Add a new function to smp_ops - cpu_bootable. This is implemented only by pSeries to filter out secondary threads during boot with smt-enabled=off. Another way this could be done is to change the kick_cpu member to return int and we can check for this case in smp_pSeries_kick_cpu. - Remove the games we play with cpu_present_map and the hard_smp_processor_id to handle smt-enabled=off, since they're now unnecessary. - Remove find_physical_cpu_to_start; assigning threads to logical slots should be done at bootup and at DLPAR time, not during a cpu online operation. 
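[Editorial sketch, not part of the posted patch.] The two boot-time decisions described in the list above can be modelled in userspace roughly as follows. This assumes, as the patch does, that odd-numbered logical cpus are secondary SMT threads. struct boot_state, cpu_bootable and build_spin_map are hypothetical names for illustration; a plain unsigned long bitmask stands in for cpumask_t, and boot_cpu is assumed to be smaller than the number of bits in that mask.

```c
#include <stdbool.h>

/* Hypothetical model of the state the kernel consults; not kernel types. */
struct boot_state {
	bool system_running;  /* models system_state >= SYSTEM_RUNNING */
	bool smt_capable;     /* models the CPU_FTR_SMT feature bit */
	bool smt_enabled;     /* models smt_enabled_at_boot */
};

/*
 * Secondary (odd-numbered) threads are refused only during boot, and only
 * when the cpu is SMT-capable and the user asked for smt-enabled=off.
 * Once the system is running, any cpu may be brought online.
 */
static bool cpu_bootable(const struct boot_state *s, unsigned int nr)
{
	if (!s->system_running && s->smt_capable && !s->smt_enabled &&
	    nr % 2 != 0)
		return false;
	return true;
}

/*
 * Build the set of cpus assumed to be spinning in firmware-started hold
 * loops: on SMT systems only the even-numbered (primary) threads,
 * otherwise every present cpu. The boot cpu is never in the set.
 */
static unsigned long build_spin_map(unsigned long present, bool smt_capable,
				    unsigned int boot_cpu)
{
	unsigned long map = 0;
	unsigned int i;

	if (smt_capable) {
		for (i = 0; i < sizeof(map) * 8; i++)
			if ((present & (1UL << i)) && i % 2 == 0)
				map |= 1UL << i;
	} else {
		map = present;
	}
	return map & ~(1UL << boot_cpu);
}
```

A cpu that is absent from the spin map has to be started through RTAS rather than simply released from its hold loop. That matches the split that smp_startup_cpu makes in the patch, which is how the boot path and the later hotplug path end up sharing the same code.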
One caveat: you need up-to-date firmware on Power5 for the maxcpus option to work on systems with more than one processor. Otherwise interrupts get misrouted, typically resulting in hangs or "unable to find root filesystem" problems. Tested on Power5 with and without CONFIG_HOTPLUG_CPU and with various combinations of the maxcpus= and smt-enabled= parameters. arch/ppc64/kernel/pSeries_smp.c | 183 +++++++++++++++------------------------- arch/ppc64/kernel/setup.c | 12 -- arch/ppc64/kernel/smp.c | 13 -- include/asm-ppc64/machdep.h | 1 4 files changed, 78 insertions(+), 131 deletions(-) Signed-off-by: Nathan Lynch Index: linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/pSeries_smp.c 2005-03-09 20:31:06.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_smp.c 2005-03-09 20:32:55.000000000 +0000 @@ -54,8 +54,16 @@ #define DBG(fmt...) #endif +/* + * The primary thread of each non-boot processor is recorded here before + * smp init. + */ +static cpumask_t of_spin_map; + extern void pSeries_secondary_smp_init(unsigned long); +#ifdef CONFIG_HOTPLUG_CPU + /* Get state of physical CPU. * Return codes: * 0 - The processor is in the RTAS stopped state @@ -82,9 +90,6 @@ static int query_cpu_stopped(unsigned in return cpu_status; } - -#ifdef CONFIG_HOTPLUG_CPU - int pSeries_cpu_disable(void) { systemcfg->processorCount--; @@ -123,98 +128,6 @@ void pSeries_cpu_die(unsigned int cpu) paca[cpu].cpu_start = 0; } -/* Search all cpu device nodes for an offline logical cpu. If a - * device node has a "ibm,my-drc-index" property (meaning this is an - * LPAR), paranoid-check whether we own the cpu. For each "thread" - * of a cpu, if it is offline and has the same hw index as before, - * grab that in preference. 
- */ -static unsigned int find_physical_cpu_to_start(unsigned int old_hwindex) -{ - struct device_node *np = NULL; - unsigned int best = -1U; - - while ((np = of_find_node_by_type(np, "cpu"))) { - int nr_threads, len; - u32 *index = (u32 *)get_property(np, "ibm,my-drc-index", NULL); - u32 *tid = (u32 *) - get_property(np, "ibm,ppc-interrupt-server#s", &len); - - if (!tid) - tid = (u32 *)get_property(np, "reg", &len); - - if (!tid) - continue; - - /* If there is a drc-index, make sure that we own - * the cpu. - */ - if (index) { - int state; - int rc = rtas_get_sensor(9003, *index, &state); - if (rc < 0 || state != 1) - continue; - } - - nr_threads = len / sizeof(u32); - - while (nr_threads--) { - if (0 == query_cpu_stopped(tid[nr_threads])) { - best = tid[nr_threads]; - if (best == old_hwindex) - goto out; - } - } - } -out: - of_node_put(np); - return best; -} - -/** - * smp_startup_cpu() - start the given cpu - * - * At boot time, there is nothing to do. At run-time, call RTAS with - * the appropriate start location, if the cpu is in the RTAS stopped - * state. - * - * Returns: - * 0 - failure - * 1 - success - */ -static inline int __devinit smp_startup_cpu(unsigned int lcpu) -{ - int status; - unsigned long start_here = __pa((u32)*((unsigned long *) - pSeries_secondary_smp_init)); - unsigned int pcpu; - - /* At boot time the cpus are already spinning in hold - * loops, so nothing to do. */ - if (system_state < SYSTEM_RUNNING) - return 1; - - pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu)); - if (pcpu == -1U) { - printk(KERN_INFO "No more cpus available, failing\n"); - return 0; - } - - /* Fixup atomic count: it exited inside IRQ handler. */ - paca[lcpu].__current->thread_info->preempt_count = 0; - - /* At boot this is done in prom.c. 
*/ - paca[lcpu].hw_cpu_id = pcpu; - - status = rtas_call(rtas_token("start-cpu"), 3, 1, NULL, - pcpu, start_here, lcpu); - if (status != 0) { - printk(KERN_ERR "start-cpu failed: %i\n", status); - return 0; - } - return 1; -} - /* * Update cpu_present_map and paca(s) for a new cpu node. The wrinkle * here is that a cpu device node may represent up to two logical cpus @@ -335,12 +248,43 @@ static struct notifier_block pSeries_smp .notifier_call = pSeries_smp_notifier, }; -#else /* ... CONFIG_HOTPLUG_CPU */ +#endif /* CONFIG_HOTPLUG_CPU */ + +/** + * smp_startup_cpu() - start the given cpu + * + * At boot time, there is nothing to do for primary threads which were + * started from Open Firmware. For anything else, call RTAS with the + * appropriate start location. + * + * Returns: + * 0 - failure + * 1 - success + */ static inline int __devinit smp_startup_cpu(unsigned int lcpu) { + int status; + unsigned long start_here = __pa((u32)*((unsigned long *) + pSeries_secondary_smp_init)); + unsigned int pcpu; + + if (cpu_isset(lcpu, of_spin_map)) + /* Already started by OF and sitting in spin loop */ + return 1; + + pcpu = get_hard_smp_processor_id(lcpu); + + /* Fixup atomic count: it exited inside IRQ handler. */ + paca[lcpu].__current->thread_info->preempt_count = 0; + + status = rtas_call(rtas_token("start-cpu"), 3, 1, NULL, + pcpu, start_here, lcpu); + if (status != 0) { + printk(KERN_ERR "start-cpu failed: %i\n", status); + return 0; + } return 1; } -#endif /* CONFIG_HOTPLUG_CPU */ static inline void smp_xics_do_message(int cpu, int msg) { @@ -380,6 +324,8 @@ static void __devinit smp_xics_setup_cpu if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) vpa_init(cpu); + cpu_clear(cpu, of_spin_map); + /* * Put the calling processor into the GIQ. 
This is really only * necessary from a secondary thread as the OF start-cpu interface @@ -429,6 +375,20 @@ static void __devinit smp_pSeries_kick_c paca[nr].cpu_start = 1; } +static int smp_pSeries_cpu_bootable(unsigned int nr) +{ + /* Special case - we inhibit secondary thread startup + * during boot if the user requests it. Odd-numbered + * cpus are assumed to be secondary threads. + */ + if (system_state < SYSTEM_RUNNING && + cur_cpu_spec->cpu_features & CPU_FTR_SMT && + !smt_enabled_at_boot && nr % 2 != 0) + return 0; + + return 1; +} + static struct smp_ops_t pSeries_mpic_smp_ops = { .message_pass = smp_mpic_message_pass, .probe = smp_mpic_probe, @@ -441,12 +401,13 @@ static struct smp_ops_t pSeries_xics_smp .probe = smp_xics_probe, .kick_cpu = smp_pSeries_kick_cpu, .setup_cpu = smp_xics_setup_cpu, + .cpu_bootable = smp_pSeries_cpu_bootable, }; /* This is called very early */ void __init smp_init_pSeries(void) { - int ret, i; + int i; DBG(" -> smp_init_pSeries()\n"); @@ -464,20 +425,20 @@ void __init smp_init_pSeries(void) pSeries_reconfig_notifier_register(&pSeries_smp_nb); #endif - /* Start secondary threads on SMT systems; primary threads - * are already in the running state. - */ - for_each_present_cpu(i) { - if (query_cpu_stopped(get_hard_smp_processor_id(i)) == 0) { - printk("%16.16x : starting thread\n", i); - DBG("%16.16x : starting thread\n", i); - rtas_call(rtas_token("start-cpu"), 3, 1, &ret, - get_hard_smp_processor_id(i), - __pa((u32)*((unsigned long *) - pSeries_secondary_smp_init)), - i); + /* Mark threads which are still spinning in hold loops. */ + if (cur_cpu_spec->cpu_features & CPU_FTR_SMT) + for_each_present_cpu(i) { + if (i % 2 == 0) + /* + * Even-numbered logical cpus correspond to + * primary threads. 
+ */ + cpu_set(i, of_spin_map); } - } + else + of_spin_map = cpu_present_map; + + cpu_clear(boot_cpuid, of_spin_map); /* Non-lpar has additional take/give timebase */ if (rtas_token("freeze-time-base") != RTAS_UNKNOWN_SERVICE) { Index: linux-2.6.11-bk5/include/asm-ppc64/machdep.h =================================================================== --- linux-2.6.11-bk5.orig/include/asm-ppc64/machdep.h 2005-03-09 20:30:34.000000000 +0000 +++ linux-2.6.11-bk5/include/asm-ppc64/machdep.h 2005-03-09 20:32:55.000000000 +0000 @@ -33,6 +33,7 @@ struct smp_ops_t { int (*cpu_enable)(unsigned int nr); int (*cpu_disable)(void); void (*cpu_die)(unsigned int nr); + int (*cpu_bootable)(unsigned int nr); }; #endif Index: linux-2.6.11-bk5/arch/ppc64/kernel/smp.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/smp.c 2005-03-09 20:30:34.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/smp.c 2005-03-09 20:32:55.000000000 +0000 @@ -490,9 +490,8 @@ int __devinit __cpu_up(unsigned int cpu) if (!cpu_enable(cpu)) return 0; - /* At boot, don't bother with non-present cpus -JSCHOPP */ - if (system_state < SYSTEM_RUNNING && !cpu_present(cpu)) - return -ENOENT; + if (smp_ops->cpu_bootable && !smp_ops->cpu_bootable(cpu)) + return -EINVAL; paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock; @@ -606,14 +605,6 @@ void __init smp_cpus_done(unsigned int m smp_ops->setup_cpu(boot_cpuid); set_cpus_allowed(current, old_mask); - - /* - * We know at boot the maximum number of cpus we can add to - * a partition and set cpu_possible_map accordingly. cpu_present_map - * needs to match for the hotplug code to allow us to hot add - * any offline cpus. 
- */ - cpu_present_map = cpu_possible_map; } #ifdef CONFIG_HOTPLUG_CPU Index: linux-2.6.11-bk5/arch/ppc64/kernel/setup.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/setup.c 2005-03-09 20:30:34.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/setup.c 2005-03-09 20:32:55.000000000 +0000 @@ -269,15 +269,9 @@ static void __init setup_cpu_maps(void) nthreads = len / sizeof(u32); for (j = 0; j < nthreads && cpu < NR_CPUS; j++) { - /* - * Only spin up secondary threads if SMT is enabled. - * We must leave space in the logical map for the - * threads. - */ - if (j == 0 || smt_enabled_at_boot) { - cpu_set(cpu, cpu_present_map); - set_hard_smp_processor_id(cpu, intserv[j]); - } + cpu_set(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, intserv[j]); + if (intserv[j] == boot_cpuid_phys) swap_cpuid = cpu; cpu_set(cpu, cpu_possible_map); From ntl at pobox.com Thu Mar 10 12:40:13 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 9 Mar 2005 19:40:13 -0600 Subject: [PATCH] ppc64: fix xmon build break with non-SMP config Message-ID: <20050310014013.GC21853@otto> CC arch/ppc64/xmon/xmon.o arch/ppc64/xmon/xmon.c: In function `set_controlled_dabr': arch/ppc64/xmon/xmon.c:633: warning: implicit declaration of function `plpar_hcall_norets' arch/ppc64/xmon/xmon.c:633: error: `H_SET_DABR' undeclared (first use in this function) arch/ppc64/xmon/xmon.c:633: error: (Each undeclared identifier is reported only once arch/ppc64/xmon/xmon.c:633: error: for each function it appears in.) 
arch/ppc64/xmon/xmon.c:634: error: `H_Success' undeclared (first use in this function) Signed-off-by: Nathan Lynch xmon.c | 1 + 1 files changed, 1 insertion(+) Index: linux-2.6.11-bk5/arch/ppc64/xmon/xmon.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/xmon/xmon.c 2005-03-09 20:01:32.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/xmon/xmon.c 2005-03-10 01:09:13.000000000 +0000 @@ -32,6 +32,7 @@ #include #include #include +#include #include "nonstdio.h" #include "privinst.h" From david at gibson.dropbear.id.au Thu Mar 10 13:18:48 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 10 Mar 2005 13:18:48 +1100 Subject: [PPC64] Allow emulation of mfpvr on ppc64 kernel Message-ID: <20050310021848.GD30435@localhost.localdomain> Andrew, please apply. Allow userspace programs on ppc64 to use the (privileged) mfpvr instruction to determine the processor type. At the moment it emulates the instruction to provide the real PVR value, though it could be made to lie in future if for some reason we wish to restrict what CPU features userspace uses. If nothing else this means that some existing ppc32 applications will now run on a 64-bit kernel (the 32-bit kernel has long supported this emulation). It will also be necessary for ppc64 perfctr support, where userspace requires finer-grained cpu type information than the kernel in order to correctly program the performance monitor control registers. Signed-off-by: David Gibson Index: working-2.6/arch/ppc64/kernel/traps.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/traps.c 2005-03-06 07:08:24.000000000 +1100 +++ working-2.6/arch/ppc64/kernel/traps.c 2005-03-10 13:05:25.000000000 +1100 @@ -279,6 +279,9 @@ * fault. Return zero on success. 
*/ +#define INST_MFSPR_PVR 0x7c1f42a6 +#define INST_MFSPR_PVR_MASK 0xfc1fffff + #define INST_DCBA 0x7c0005ec #define INST_DCBA_MASK 0x7c0007fe @@ -297,6 +300,15 @@ if (get_user(instword, (unsigned int __user *)(regs->nip))) return -EFAULT; + /* Emulate the mfspr rD, PVR. */ + if ((instword & INST_MFSPR_PVR_MASK) == INST_MFSPR_PVR) { + unsigned int rd; + + rd = (instword >> 21) & 0x1f; + regs->gpr[rd] = mfspr(SPRN_PVR); + return 0; + } + /* Emulating the dcba insn is just a no-op. */ if ((instword & INST_DCBA_MASK) == INST_DCBA) { static int warned; @@ -390,11 +402,6 @@ if (regs->msr & 0x100000) { /* IEEE FP exception */ parse_fpe(regs); - - } else if (regs->msr & 0x40000) { - /* Privileged instruction */ - _exception(SIGILL, regs, ILL_PRVOPC, regs->nip); - } else if (regs->msr & 0x20000) { /* trap exception */ @@ -411,7 +418,7 @@ _exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip); } else { - /* Illegal instruction; try to emulate it. */ + /* Privileged or illegal instruction; try to emulate it. */ switch (emulate_instruction(regs)) { case 0: regs->nip += 4; @@ -423,7 +430,12 @@ break; default: - _exception(SIGILL, regs, ILL_ILLOPC, regs->nip); + if (regs->msr & 0x40000) + /* priveleged */ + _exception(SIGILL, regs, ILL_PRVOPC, regs->nip); + else + /* illegal */ + _exception(SIGILL, regs, ILL_ILLOPC, regs->nip); break; } } -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist. NOT _the_ _other_ _way_ | _around_! http://www.ozlabs.org/people/dgibson From sfr at canb.auug.org.au Thu Mar 10 13:42:16 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Thu, 10 Mar 2005 13:42:16 +1100 Subject: [RFC][PATCH] combining header files In-Reply-To: <20050309200109.GG1220@austin.ibm.com> References: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> <20050309200109.GG1220@austin.ibm.com> Message-ID: <20050310134216.5b9b27ef.sfr@canb.auug.org.au> On Wed, 9 Mar 2005 14:01:09 -0600 Linas Vepstas wrote: > > Why not #include instead? 
Because I am talking about similarities between ppc and ppc64 not ppc64 and the generic code (though there may be some of those to be exploited as well). -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050310/6b006c20/attachment.pgp From olof at austin.ibm.com Thu Mar 10 14:25:07 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Wed, 9 Mar 2005 21:25:07 -0600 Subject: [PATCH 2/2] No-exec support for ppc64 In-Reply-To: <20050308171326.3d72363a.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308171326.3d72363a.moilanen@austin.ibm.com> Message-ID: <20050310032507.GC20789@austin.ibm.com> Hi, On Tue, Mar 08, 2005 at 05:13:26PM -0600, Jake Moilanen wrote: > diff -puN arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 arch/ppc64/mm/hash_utils.c > --- linux-2.6-bk/arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 > +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c 2005-03-08 16:08:57 -06:00 > @@ -89,12 +90,23 @@ static inline void loop_forever(void) > ; > } > > +int is_kernel_text(unsigned long addr) > +{ > + if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end) > + return 1; > + > + return 0; > +} This is used in two files, but never declared extern in the second file (iSeries_setup.c). Should it go in a header file as a static inline instead? There also seems to be a local static is_kernel_text() in kallsyms that overlaps (but it's not identical). Removing that redundancy can be taken care of as a janitorial patch outside of the noexec stuff. 
-Olof From olof at austin.ibm.com Thu Mar 10 14:22:13 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Wed, 9 Mar 2005 21:22:13 -0600 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <20050308170826.13a2299e.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> Message-ID: <20050310032213.GB20789@austin.ibm.com> On Tue, Mar 08, 2005 at 05:08:26PM -0600, Jake Moilanen wrote: > No-exec base and user space support for PPC64. Hi, a couple of comments below. -Olof > @@ -786,6 +786,7 @@ int hash_huge_page(struct mm_struct *mm, > pte_t old_pte, new_pte; > unsigned long hpteflags, prpn; > long slot; > + int is_exec; > int err = 1; > > spin_lock(&mm->page_table_lock); > @@ -796,6 +797,10 @@ int hash_huge_page(struct mm_struct *mm, > va = (vsid << 28) | (ea & 0x0fffffff); > vpn = va >> HPAGE_SHIFT; > > + is_exec = access & _PAGE_EXEC; > + if (unlikely(is_exec && !(pte_val(*ptep) & _PAGE_EXEC))) > + goto out; You only use is_exec this one time, you can probably skip it and just add the mask in the if statement. > @@ -898,6 +908,7 @@ repeat: > err = 0; > > out: > + > spin_unlock(&mm->page_table_lock); Whitespace change > diff -puN include/asm-ppc64/pgtable.h~nx-user-ppc64 include/asm-ppc64/pgtable.h > --- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-user-ppc64 2005-03-08 16:08:54 -06:00 > +++ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h 2005-03-08 16:08:54 -06:00 > @@ -82,14 +82,14 @@ > #define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ > #define _PAGE_USER 0x0002 /* matches one of the PP bits */ > #define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ > -#define _PAGE_RW 0x0004 /* software: user write access allowed */ > +#define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */ Good to see the comment there, I remember we talked about that earlier. It can be somewhat confusing. 
:-) > #define _PAGE_GUARDED 0x0008 > #define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ > #define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ > #define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ > #define _PAGE_DIRTY 0x0080 /* C: page changed */ > #define _PAGE_ACCESSED 0x0100 /* R: page referenced */ > -#define _PAGE_EXEC 0x0200 /* software: i-cache coherence required */ > +#define _PAGE_RW 0x0200 /* software: user write access allowed */ > #define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ > #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ > #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ > @@ -100,7 +100,7 @@ > /* PAGE_MASK gives the right answer below, but only by accident */ > /* It should be preserving the high 48 bits and then specifically */ > /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ > -#define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_HPTEFLAGS) > +#define _PAGE_CHG_MASK (_PAGE_GUARDED | _PAGE_COHERENT | _PAGE_NO_CACHE | _PAGE_WRITETHRU | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_HPTEFLAGS | PAGE_MASK) Can you break it into 80 columns with \ ? 
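
A minimal sketch of the inversion that the _PAGE_EXEC comment above refers to, assuming the Linux PTE bit means "executable" while the same bit position in the hardware PTE means "no execute". The constants are the values quoted in this thread; the two helper names are hypothetical and are not taken from the patch.

```c
#define _PAGE_EXEC	0x0004UL	/* Linux PTE: page may be executed */
#define HW_NO_EXEC	_PAGE_EXEC	/* same bit in the HW PTE: no execute */

/*
 * Hypothetical helper: derive only the hardware no-execute bit from a
 * Linux PTE value. The bit is set in the result exactly when
 * _PAGE_EXEC is clear in the input.
 */
static unsigned long linux_pte_to_hw_noexec(unsigned long pte)
{
	return (pte & _PAGE_EXEC) ? 0 : HW_NO_EXEC;
}

/*
 * Hypothetical helper: the same inversion written as a mask followed by
 * an XOR. The 0x1fe mask is an assumption for this sketch; it keeps bit
 * 0x0004 together with the neighbouring flag bits and then flips it.
 */
static unsigned long linux_pte_to_hw_flags(unsigned long pte)
{
	return (pte & 0x1feUL) ^ HW_NO_EXEC;
}
```

Either form gives a result in which the 0x0004 bit is clear for an executable page and set for a non-executable one, which is why a separate name such as HW_NO_EXEC can make the call sites easier to read.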
From benh at kernel.crashing.org Thu Mar 10 18:15:34 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 10 Mar 2005 18:15:34 +1100 Subject: [PATCH 2/2] No-exec support for ppc64 In-Reply-To: <20050310032507.GC20789@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308171326.3d72363a.moilanen@austin.ibm.com> <20050310032507.GC20789@austin.ibm.com> Message-ID: <1110438934.32524.203.camel@gaston> On Wed, 2005-03-09 at 21:25 -0600, Olof Johansson wrote: > Hi, > > On Tue, Mar 08, 2005 at 05:13:26PM -0600, Jake Moilanen wrote: > > diff -puN arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 arch/ppc64/mm/hash_utils.c > > --- linux-2.6-bk/arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 > > +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c 2005-03-08 16:08:57 -06:00 > > @@ -89,12 +90,23 @@ static inline void loop_forever(void) > > ; > > } > > > > +int is_kernel_text(unsigned long addr) > > +{ > > + if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end) > > + return 1; > > + > > + return 0; > > +} > > This is used in two files, but never declared extern in the second file > (iSeries_setup.c). Should it go in a header file as a static inline > instead? Yes, I think it should. > There also seems to be a local static is_kernel_text() in kallsyms that > overlaps (but it's not identical). Removing that redundancy can be taken > care of as a janitorial patch outside of the noexec stuff. 
> >
> > -Olof
> _______________________________________________
> Linuxppc64-dev mailing list
> Linuxppc64-dev at ozlabs.org
> https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev
--
Benjamin Herrenschmidt

From arnd at arndb.de Thu Mar 10 20:34:04 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Thu, 10 Mar 2005 10:34:04 +0100
Subject: [PATCH] linking zImage with biarch ld
In-Reply-To: <1110404388.32524.101.camel@gaston>
References: <200503091401.17143.arnd@arndb.de>
	<1110404388.32524.101.camel@gaston>
Message-ID: <200503101034.05737.arnd@arndb.de>

On Wednesday 09 March 2005 22:39, Benjamin Herrenschmidt wrote:
> > I noticed that with the vDSO patch in 2.6.11-bk, it's almost possible to build
> > the kernel with the fedora biarch toolchain. However, I still get warnings
> > from ld about zImage being the wrong architecture, unless I change the script
> > as shown in this patch.
>
> "Almost possible" ? What's wrong ? Only that ?

Yes, that's the only problem.

	Arnd <><

From johnrose at austin.ibm.com Fri Mar 11 03:50:10 2005
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 10 Mar 2005 10:50:10 -0600
Subject: [PATCH 2/8] make OF node fixup code usable at runtime
In-Reply-To: <20050310005142.31309.45788.99418@otto>
References: <20050310005132.31309.65485.31668@otto>
	<20050310005142.31309.45788.99418@otto>
Message-ID: <1110473410.29353.4.camel@sinatra.austin.ibm.com>

Hi Nathan-

The patch series cleans things up nicely. One comment:

> These functions, if passed a null mem_start argument, will
> use kmalloc for allocating extra data structures for the device node
> being processed.

Might it be possible to use the dynamic flag of the device node to
decide when to use kmalloc?

Thanks-
John

From johnrose at austin.ibm.com Fri Mar 11 04:20:00 2005
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 10 Mar 2005 11:20:00 -0600
Subject: [PATCH 4/8] prom.c: use pSeries reconfig notifier
In-Reply-To: <20050310005152.31309.41959.21947@otto>
References: <20050310005132.31309.65485.31668@otto>
	<20050310005152.31309.41959.21947@otto>
Message-ID: <1110475200.29353.12.camel@sinatra.austin.ibm.com>

Quick comment on this-

>  void of_add_node(struct device_node *np)
>  {
> -	int err;
> -
> -	/* This use of finish_node will be moved to a notifier so
> -	 * the error code can be used.
> -	 */
> -	err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0);
> -	if (err < 0)
> -		return;
> -
>  	write_lock(&devtree_lock);
>  	np->sibling = np->parent->child;
>  	np->allnext = allnodes;
> @@ -1682,6 +1674,36 @@ void of_remove_node(const struct device_
>  	write_unlock(&devtree_lock);
>  }

If I understand correctly, of_add_node() now simply adds the node to
relevant lists and sets relational pointers to position it in the device
tree. The allocation and other setup has been moved out of the
function. Might it be more clear to rename it to of_attach_node() or
something similar?

Thanks-
John

From arnd at arndb.de Fri Mar 11 06:05:28 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Thu, 10 Mar 2005 20:05:28 +0100
Subject: [PATCH 4/8] prom.c: use pSeries reconfig notifier
In-Reply-To: <20050310005152.31309.41959.21947@otto>
References: <20050310005132.31309.65485.31668@otto>
	<20050310005152.31309.41959.21947@otto>
Message-ID: <200503102005.29526.arnd@arndb.de>

On Thursday 10 March 2005 01:51, Nathan Lynch wrote:
>  void of_add_node(struct device_node *np)
>  {

While looking at the of_add_node code, I noticed that there are some
memory leaks in the error path of that function. They should be trivial
to fix with the attached patch, but I can't test this because I don't
have reconfigurable machines.
Unfortunately, this also conflicts with Nathan's patches, but I can submit
a new patch when they show up in bitkeeper.

Signed-off-by: Arnd Bergmann
-------------- next part --------------
A non-text attachment was scrubbed...
Name: of_add_node_leak.diff
Type: text/x-diff
Size: 1307 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050310/448cea66/attachment.diff

From jschopp at austin.ibm.com Fri Mar 11 06:41:38 2005
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Thu, 10 Mar 2005 13:41:38 -0600
Subject: [PATCH 2/8] make OF node fixup code usable at runtime
In-Reply-To: <20050310005142.31309.45788.99418@otto>
References: <20050310005132.31309.65485.31668@otto>
	<20050310005142.31309.45788.99418@otto>
Message-ID: <4230A2F2.7020403@austin.ibm.com>

Nathan Lynch wrote:
> At boot we recurse through the device tree "fixing up" various fields
> and properties in the device nodes. Long ago, to support DLPAR and
> hotplug, we largely duplicated some of this fixup code, the main
> difference being that the new code used kmalloc for allocating various
> data structures which are attached to the new device nodes.
>
> This patch kills most of the duplicated code and makes finish_node,
> finish_node_interrupts, and interpret_pci_props suitable for use at
> runtime. These functions, if passed a null mem_start argument, will
> use kmalloc for allocating extra data structures for the device node
> being processed. Not terribly elegant, but it seems worth it to get
> rid of the duplicated code (and bugs).

Good idea, I wholeheartedly agree.

> -static int of_finish_dynamic_node(struct device_node *node)
> +static int of_finish_dynamic_node(struct device_node *node,
> +				  unsigned long *unused1, int unused2,
> +				  int unused3, int unused4)
>  {

Is there a reason for these 4 unused fields that I am just missing?
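
One possible reading, offered only as a sketch: the removed call finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0) quoted elsewhere in this thread passes of_finish_dynamic_node as a callback, so the extra parameters may exist simply so that the function matches the one function-pointer type that every such callback has to share. The stand-in struct, the typedef name node_fixup_fn, the driver run_fixup() and the parameter names below are all hypothetical and are not the prom.c definitions.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct device_node, for illustration only. */
struct device_node {
	const char *name;
	int fixed_up;
};

/* Hypothetical common callback type: every fixup callback takes five arguments. */
typedef int (*node_fixup_fn)(struct device_node *np, unsigned long *mem_start,
			     int arg1, int arg2, int arg3);

/* Hypothetical driver: calls whatever callback it was given, always with five arguments. */
static int run_fixup(struct device_node *np, unsigned long *mem_start,
		     node_fixup_fn fn)
{
	if (fn == NULL)
		return 0;
	return fn(np, mem_start, 0, 0, 0);
}

/*
 * A callback that needs only the node must still accept all five
 * parameters in order to be assignable to node_fixup_fn.
 */
static int finish_dynamic_node(struct device_node *np,
			       unsigned long *unused1, int unused2,
			       int unused3, int unused4)
{
	(void)unused1;
	(void)unused2;
	(void)unused3;
	(void)unused4;

	if (np == NULL)
		return -1;
	np->fixed_up = 1;
	return 0;
}
```

Under that assumption the unused parameters are a cost of sharing one callback type; the alternative would be a thin wrapper with the full signature around a one-argument function.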

From arnd at arndb.de Fri Mar 11 06:54:36 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Thu, 10 Mar 2005 20:54:36 +0100
Subject: [PATCH] ppc64: kill might_sleep() warnings in __copy_*_user_inatomic
Message-ID: <200503102054.38123.arnd@arndb.de>

I currently get warnings from futex resulting from Olof's futex+rwsem fix
combined with the fact that ppc64 __copy_from_user has a might_sleep check
in it:

[ 9607.577071] Debug: sleeping function called from invalid context at include2/asm/uaccess.h:228
[ 9607.676181] in_atomic():1, irqs_disabled():0
[ 9607.724741] Call Trace:
[ 9607.752058] [c00000000d68fab0] [c000000001f0fb80] 0xc000000001f0fb80 (unreliable)
[ 9607.835030] [c00000000d68fb30] [c000000000042420] .__might_sleep+0xf8/0x108
[ 9607.912936] [c00000000d68fbd0] [c00000000006ac34] .do_futex+0x224/0x858

The fix is to do the check only in copy_*_user, not __copy_*_user. This is
what most other architectures do.

Signed-off-by: Arnd Bergmann

---

 arch/ppc64/lib/usercopy.c   |    2 ++
 include/asm-ppc64/uaccess.h |    6 ++----
 2 files changed, 4 insertions(+), 4 deletions(-)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: uaccess-might-sleep.diff
Type: text/x-diff
Size: 2392 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050310/78bb7edf/attachment.diff

From jschopp at austin.ibm.com Fri Mar 11 07:27:02 2005
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Thu, 10 Mar 2005 14:27:02 -0600
Subject: [PATCH 7/8] pSeries_smp.c: use pSeries reconfig notifier for cpu DLPAR
In-Reply-To: <20050310005207.31309.32546.73375@otto>
References: <20050310005132.31309.65485.31668@otto>
	<20050310005207.31309.32546.73375@otto>
Message-ID: <4230AD96.7020508@austin.ibm.com>

This patch seems fine. Just a couple of trivial comments.

> +static int pSeries_add_processor(struct device_node *np)

This function doesn't really add a processor; it could use a better name.
> +static void pSeries_remove_processor(struct device_node *np) This doesn't remove a processor; it could also use a better name. From moilanen at austin.ibm.com Fri Mar 11 09:25:13 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Thu, 10 Mar 2005 16:25:13 -0600 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <20050310032213.GB20789@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> Message-ID: <20050310162513.74191caa.moilanen@austin.ibm.com> On Wed, 9 Mar 2005 21:22:13 -0600 olof at austin.ibm.com (Olof Johansson) wrote: > On Tue, Mar 08, 2005 at 05:08:26PM -0600, Jake Moilanen wrote: > > No-exec base and user space support for PPC64. > > Hi, a couple of comments below. > Here's the revised user & base support for no-exec on ppc64 with Olof and Ben's comments. Signed-off-by: Jake Moilanen --- linux-2.6-bk-moilanen/arch/ppc64/kernel/head.S | 5 + linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_htab.c | 4 + linux-2.6-bk-moilanen/arch/ppc64/kernel/pSeries_lpar.c | 2 linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c | 14 +++-- linux-2.6-bk-moilanen/arch/ppc64/mm/hash_low.S | 12 ++-- linux-2.6-bk-moilanen/arch/ppc64/mm/hugetlbpage.c | 10 +++ linux-2.6-bk-moilanen/fs/binfmt_elf.c | 2 linux-2.6-bk-moilanen/include/asm-ppc64/elf.h | 7 ++ linux-2.6-bk-moilanen/include/asm-ppc64/page.h | 19 ++++++- linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h | 46 +++++++++-------- 10 files changed, 85 insertions(+), 36 deletions(-) diff -puN arch/ppc64/kernel/head.S~nx-user-ppc64 arch/ppc64/kernel/head.S --- linux-2.6-bk/arch/ppc64/kernel/head.S~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/head.S 2005-03-08 16:08:54 -06:00 @@ -36,6 +36,7 @@ #include #include #include +#include #include #ifdef CONFIG_PPC_ISERIES @@ -950,11 +951,11 @@ END_FTR_SECTION_IFCLR(CPU_FTR_SLB) * accessing a userspace segment (even from the 
kernel). We assume * kernel addresses always have the high bit set. */ - rlwinm r4,r4,32-23,29,29 /* DSISR_STORE -> _PAGE_RW */ + rlwinm r4,r4,32-25+9,31-9,31-9 /* DSISR_STORE -> _PAGE_RW */ rotldi r0,r3,15 /* Move high bit into MSR_PR posn */ orc r0,r12,r0 /* MSR_PR | ~high_bit */ rlwimi r4,r0,32-13,30,30 /* becomes _PAGE_USER access bit */ - ori r4,r4,1 /* add _PAGE_PRESENT */ + rlwimi r4,r5,22+2,31-2,31-2 /* Set _PAGE_EXEC if trap is 0x400 */ /* * On iSeries, we soft-disable interrupts here, then diff -puN arch/ppc64/kernel/iSeries_htab.c~nx-user-ppc64 arch/ppc64/kernel/iSeries_htab.c --- linux-2.6-bk/arch/ppc64/kernel/iSeries_htab.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_htab.c 2005-03-08 16:08:54 -06:00 @@ -144,6 +144,10 @@ static long iSeries_hpte_updatepp(unsign HvCallHpt_get(&hpte, slot); if ((hpte.dw0.dw0.avpn == avpn) && (hpte.dw0.dw0.v)) { + /* + * Hypervisor expects bit's as NPPP, which is + * different from how they are mapped in our PP. 
+ */ HvCallHpt_setPp(slot, (newpp & 0x3) | ((newpp & 0x4) << 1)); iSeries_hunlock(slot); return 0; diff -puN arch/ppc64/kernel/pSeries_lpar.c~nx-user-ppc64 arch/ppc64/kernel/pSeries_lpar.c --- linux-2.6-bk/arch/ppc64/kernel/pSeries_lpar.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/pSeries_lpar.c 2005-03-08 16:08:54 -06:00 @@ -470,7 +470,7 @@ static void pSeries_lpar_hpte_updatebolt slot = pSeries_lpar_hpte_find(vpn); BUG_ON(slot == -1); - flags = newpp & 3; + flags = newpp & 7; lpar_rc = plpar_pte_protect(flags, slot, 0); BUG_ON(lpar_rc != H_Success); diff -puN arch/ppc64/mm/fault.c~nx-user-ppc64 arch/ppc64/mm/fault.c --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-10 16:14:45 -06:00 @@ -93,6 +93,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long code = SEGV_MAPERR; unsigned long is_write = error_code & 0x02000000; unsigned long trap = TRAP(regs); + unsigned long is_exec = trap == 0x400; BUG_ON((trap == 0x380) || (trap == 0x480)); @@ -199,16 +200,19 @@ int do_page_fault(struct pt_regs *regs, good_area: code = SEGV_ACCERR; + if (is_exec) { + /* protection fault */ + if (error_code & 0x08000000) + goto bad_area; + if (!(vma->vm_flags & VM_EXEC)) + goto bad_area; /* a write */ - if (is_write) { + } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; /* a read */ } else { - /* protection fault */ - if (error_code & 0x08000000) - goto bad_area; - if (!(vma->vm_flags & (VM_READ | VM_EXEC))) + if (!(vma->vm_flags & VM_READ)) goto bad_area; } diff -puN arch/ppc64/mm/hash_low.S~nx-user-ppc64 arch/ppc64/mm/hash_low.S --- linux-2.6-bk/arch/ppc64/mm/hash_low.S~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_low.S 2005-03-08 16:08:54 -06:00 @@ -89,7 +89,7 @@ _GLOBAL(__hash_page) /* Prepare new PTE value (turn access RW into DIRTY, then * add BUSY,HASHPTE and ACCESSED) */ - rlwinm 
r30,r4,5,24,24 /* _PAGE_RW -> _PAGE_DIRTY */ + rlwinm r30,r4,32-9+7,31-7,31-7 /* _PAGE_RW -> _PAGE_DIRTY */ or r30,r30,r31 ori r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE /* Write the linux PTE atomically (setting busy) */ @@ -112,11 +112,11 @@ _GLOBAL(__hash_page) rldicl r5,r5,0,25 /* vsid & 0x0000007fffffffff */ rldicl r0,r3,64-12,48 /* (ea >> 12) & 0xffff */ xor r28,r5,r0 - - /* Convert linux PTE bits into HW equivalents - */ - andi. r3,r30,0x1fa /* Get basic set of flags */ - rlwinm r0,r30,32-2+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ + + /* Convert linux PTE bits into HW equivalents */ + andi. r3,r30,0x1fe /* Get basic set of flags */ + xori r3,r3,HW_NO_EXEC /* _PAGE_EXEC -> NOEXEC */ + rlwinm r0,r30,32-9+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ rlwinm r4,r30,32-7+1,30,30 /* _PAGE_DIRTY -> _PAGE_USER (r4) */ and r0,r0,r4 /* _PAGE_RW & _PAGE_DIRTY -> r0 bit 30 */ andc r0,r30,r0 /* r0 = pte & ~r0 */ diff -puN arch/ppc64/mm/hugetlbpage.c~nx-user-ppc64 arch/ppc64/mm/hugetlbpage.c --- linux-2.6-bk/arch/ppc64/mm/hugetlbpage.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hugetlbpage.c 2005-03-10 13:46:08 -06:00 @@ -796,6 +796,9 @@ int hash_huge_page(struct mm_struct *mm, va = (vsid << 28) | (ea & 0x0fffffff); vpn = va >> HPAGE_SHIFT; + if (unlikely((access & _PAGE_EXEC) && !(pte_val(*ptep) & _PAGE_EXEC))) + goto out; + /* * If no pte found or not present, send the problem up to * do_page_fault @@ -828,7 +831,12 @@ int hash_huge_page(struct mm_struct *mm, old_pte = *ptep; new_pte = old_pte; - hpteflags = 0x2 | (! (pte_val(new_pte) & _PAGE_RW)); + hpteflags = (pte_val(new_pte) & _PAGE_RW) | + (!(pte_val(new_pte) & _PAGE_RW)) | + _PAGE_USER; + + /* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */ + hpteflags |= ((pte_val(new_pte) & _PAGE_EXEC) ? 
0 : HW_NO_EXEC); /* Check if pte already has an hpte (case 2) */ if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) { diff -puN fs/binfmt_elf.c~nx-user-ppc64 fs/binfmt_elf.c --- linux-2.6-bk/fs/binfmt_elf.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/fs/binfmt_elf.c 2005-03-08 16:08:54 -06:00 @@ -99,6 +99,8 @@ static int set_brk(unsigned long start, up_write(¤t->mm->mmap_sem); if (BAD_ADDR(addr)) return addr; + + sys_mprotect(start, end-start, PROT_READ|PROT_WRITE|PROT_EXEC); } current->mm->start_brk = current->mm->brk = end; return 0; diff -puN include/asm-ppc64/elf.h~nx-user-ppc64 include/asm-ppc64/elf.h --- linux-2.6-bk/include/asm-ppc64/elf.h~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/elf.h 2005-03-08 16:23:37 -06:00 @@ -226,6 +226,13 @@ do { \ else if (current->personality != PER_LINUX32) \ set_personality(PER_LINUX); \ } while (0) + +/* + * An executable for which elf_read_implies_exec() returns TRUE will + * have the READ_IMPLIES_EXEC personality flag set automatically. 
+ */ +#define elf_read_implies_exec(ex, have_pt_gnu_stack) (!(have_pt_gnu_stack)) + #endif /* diff -puN include/asm-ppc64/page.h~nx-user-ppc64 include/asm-ppc64/page.h --- linux-2.6-bk/include/asm-ppc64/page.h~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/page.h 2005-03-08 16:08:54 -06:00 @@ -235,8 +235,25 @@ extern u64 ppc64_pft_size; /* Log 2 of #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) -#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \ +#define VM_DATA_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | \ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_STACK_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_STACK_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? \ + VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64) + +#define VM_STACK_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? 
\ + VM_STACK_DEFAULT_FLAGS32 : VM_STACK_DEFAULT_FLAGS64) #endif /* __KERNEL__ */ #endif /* _PPC64_PAGE_H */ diff -puN include/asm-ppc64/pgtable.h~nx-user-ppc64 include/asm-ppc64/pgtable.h --- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-user-ppc64 2005-03-08 16:08:54 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h 2005-03-10 16:14:45 -06:00 @@ -82,14 +82,14 @@ #define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ #define _PAGE_USER 0x0002 /* matches one of the PP bits */ #define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ -#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */ #define _PAGE_GUARDED 0x0008 #define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ #define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ #define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ #define _PAGE_DIRTY 0x0080 /* C: page changed */ #define _PAGE_ACCESSED 0x0100 /* R: page referenced */ -#define _PAGE_EXEC 0x0200 /* software: i-cache coherence required */ +#define _PAGE_RW 0x0200 /* software: user write access allowed */ #define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ @@ -100,7 +100,8 @@ /* PAGE_MASK gives the right answer below, but only by accident */ /* It should be preserving the high 48 bits and then specifically */ /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ -#define _PAGE_CHG_MASK (PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_HPTEFLAGS) +#define _PAGE_CHG_MASK (_PAGE_GUARDED | _PAGE_COHERENT | _PAGE_NO_CACHE | _PAGE_WRITETHRU | \ + _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_HPTEFLAGS | PAGE_MASK) #define _PAGE_BASE (_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_COHERENT) @@ -116,31 +117,38 @@ #define PAGE_READONLY __pgprot(_PAGE_BASE | _PAGE_USER) 
#define PAGE_READONLY_X __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC) #define PAGE_KERNEL __pgprot(_PAGE_BASE | _PAGE_WRENABLE) -#define PAGE_KERNEL_CI __pgprot(_PAGE_PRESENT | _PAGE_ACCESSED | \ - _PAGE_WRENABLE | _PAGE_NO_CACHE | _PAGE_GUARDED) + +#define HW_NO_EXEC _PAGE_EXEC /* This is used when the bit is + * inverted, even though it's the + * same value, hopefully it will be + * clearer in the code what is + * going on. */ /* - * The PowerPC can only do execute protection on a segment (256MB) basis, - * not on a page basis. So we consider execute permission the same as read. + * POWER4 and newer have per page execute protection, older chips can only + * do this on a segment (256MB) basis. + * * Also, write permissions imply read permissions. * This is the closest we can get.. + * + * Note due to the way vm flags are laid out, the bits are XWR */ #define __P000 PAGE_NONE -#define __P001 PAGE_READONLY_X +#define __P001 PAGE_READONLY #define __P010 PAGE_COPY -#define __P011 PAGE_COPY_X -#define __P100 PAGE_READONLY +#define __P011 PAGE_COPY +#define __P100 PAGE_READONLY_X #define __P101 PAGE_READONLY_X -#define __P110 PAGE_COPY +#define __P110 PAGE_COPY_X #define __P111 PAGE_COPY_X #define __S000 PAGE_NONE -#define __S001 PAGE_READONLY_X +#define __S001 PAGE_READONLY #define __S010 PAGE_SHARED -#define __S011 PAGE_SHARED_X -#define __S100 PAGE_READONLY +#define __S011 PAGE_SHARED +#define __S100 PAGE_READONLY_X #define __S101 PAGE_READONLY_X -#define __S110 PAGE_SHARED +#define __S110 PAGE_SHARED_X #define __S111 PAGE_SHARED_X #ifndef __ASSEMBLY__ @@ -197,7 +205,8 @@ void hugetlb_mm_free_pgd(struct mm_struc }) #define pte_modify(_pte, newprot) \ - (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | pgprot_val(newprot))) + (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | \ + (pgprot_val(newprot) & ~_PAGE_CHG_MASK))) #define pte_none(pte) ((pte_val(pte) & ~_PAGE_HPTEFLAGS) == 0) #define pte_present(pte) (pte_val(pte) & _PAGE_PRESENT) @@ -266,9 +275,6 @@ static inline int 
pte_young(pte_t pte) { static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;} static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_HUGE;} -static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; } -static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; } - static inline pte_t pte_rdprotect(pte_t pte) { pte_val(pte) &= ~_PAGE_USER; return pte; } static inline pte_t pte_exprotect(pte_t pte) { @@ -438,7 +444,7 @@ static inline void set_pte_at(struct mm_ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty) { unsigned long bits = pte_val(entry) & - (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW); + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC); unsigned long old, tmp; __asm__ __volatile__( _ From moilanen at austin.ibm.com Fri Mar 11 09:27:21 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Thu, 10 Mar 2005 16:27:21 -0600 Subject: [PATCH 2/2] No-exec support for ppc64 In-Reply-To: <1110438934.32524.203.camel@gaston> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308171326.3d72363a.moilanen@austin.ibm.com> <20050310032507.GC20789@austin.ibm.com> <1110438934.32524.203.camel@gaston> Message-ID: <20050310162721.19003dac.moilanen@austin.ibm.com> On Thu, 10 Mar 2005 18:15:34 +1100 Benjamin Herrenschmidt wrote: > On Wed, 2005-03-09 at 21:25 -0600, Olof Johansson wrote: > > Hi, > > > > On Tue, Mar 08, 2005 at 05:13:26PM -0600, Jake Moilanen wrote: > > > diff -puN arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 arch/ppc64/mm/hash_utils.c > > > --- linux-2.6-bk/arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 2005-03-08 16:08:57 -06:00 > > > +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c 2005-03-08 16:08:57 -06:00 > > > @@ -89,12 +90,23 @@ static inline void loop_forever(void) > > > ; > > > } > > > > > > +int is_kernel_text(unsigned long addr) > > > +{ > > > + if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end) > > > + return 1; > 
> > + > > > + return 0; > > > +} > > > > This is used in two files, but never declared extern in the second file > > (iSeries_setup.c). Should it go in a header file as a static inline > > instead? > > Yes, I think it should. > Here is the revised no-exec for the kernel on ppc64 w/ Olof and Ben's comments. Signed-off-by: Jake Moilanen --- linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c | 4 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/module.c | 3 +- linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c | 19 ++++++++++++++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c | 21 ++++++++++------ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h | 1 linux-2.6-bk-moilanen/include/asm-ppc64/sections.h | 9 ++++++ 6 files changed, 49 insertions(+), 8 deletions(-) diff -puN arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 arch/ppc64/kernel/iSeries_setup.c --- linux-2.6-bk/arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c 2005-03-10 13:59:12 -06:00 @@ -633,6 +633,10 @@ static void __init iSeries_bolt_kernel(u unsigned long vpn = va >> PAGE_SHIFT; unsigned long slot = HvCallHpt_findValid(&hpte, vpn); + /* Make non-kernel text non-executable */ + if (!in_kernel_text(ea)) + mode_rw |= HW_NO_EXEC; + if (hpte.dw0.dw0.v) { /* HPTE exists, so just bolt it */ HvCallHpt_setSwBits(slot, 0x10, 0); diff -puN arch/ppc64/kernel/module.c~nx-kernel-ppc64 arch/ppc64/kernel/module.c --- linux-2.6-bk/arch/ppc64/kernel/module.c~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/module.c 2005-03-10 13:54:14 -06:00 @@ -102,7 +102,8 @@ void *module_alloc(unsigned long size) { if (size == 0) return NULL; - return vmalloc(size); + + return vmalloc_exec(size); } /* Free memory returned from module_alloc */ diff -puN arch/ppc64/mm/fault.c~nx-kernel-ppc64 arch/ppc64/mm/fault.c --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 +++ 
linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-10 13:54:14 -06:00 @@ -76,6 +76,13 @@ static int store_updates_sp(struct pt_re return 0; } +pte_t *lookup_address(unsigned long address) +{ + pgd_t *pgd = pgd_offset_k(address); + + return find_linux_pte(pgd, address); +} + /* * The error_code parameter is * - DSISR for a non-SLB data access fault, @@ -94,6 +101,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long is_write = error_code & 0x02000000; unsigned long trap = TRAP(regs); unsigned long is_exec = trap == 0x400; + pte_t *ptep; BUG_ON((trap == 0x380) || (trap == 0x480)); @@ -253,6 +261,17 @@ bad_area_nosemaphore: info.si_addr = (void __user *) address; force_sig_info(SIGSEGV, &info, current); return 0; + } + + ptep = lookup_address(address); + + if (ptep && pte_present(*ptep) && !pte_exec(*ptep)) { + if (printk_ratelimit()) + printk(KERN_CRIT "kernel tried to execute NX-protected " + "page - exploit attempt? (uid: %d)\n", + current->uid); + show_stack(current, (unsigned long *)__get_SP()); + do_exit(SIGKILL); } return SIGSEGV; diff -puN arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 arch/ppc64/mm/hash_utils.c --- linux-2.6-bk/arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c 2005-03-10 13:58:37 -06:00 @@ -51,6 +51,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) 
udbg_printf(fmt) @@ -95,6 +96,7 @@ static inline void create_pte_mapping(un { unsigned long addr; unsigned int step; + unsigned long tmp_mode; if (large) step = 16*MB; @@ -112,6 +114,13 @@ static inline void create_pte_mapping(un else vpn = va >> PAGE_SHIFT; + + tmp_mode = mode; + + /* Make non-kernel text non-executable */ + if (!in_kernel_text(addr)) + tmp_mode = mode | HW_NO_EXEC; + hash = hpt_hash(vpn, large); hpteg = ((hash & htab_hash_mask) * HPTES_PER_GROUP); @@ -120,12 +129,12 @@ static inline void create_pte_mapping(un if (systemcfg->platform & PLATFORM_LPAR) ret = pSeries_lpar_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); else #endif /* CONFIG_PPC_PSERIES */ ret = native_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); if (ret == -1) { ppc64_terminate_msg(0x20, "create_pte_mapping"); @@ -238,8 +247,6 @@ unsigned int hash_page_do_lazy_icache(un { struct page *page; -#define PPC64_HWNOEXEC (1 << 2) - if (!pfn_valid(pte_pfn(pte))) return pp; @@ -250,8 +257,8 @@ unsigned int hash_page_do_lazy_icache(un if (trap == 0x400) { __flush_dcache_icache(page_address(page)); set_bit(PG_arch_1, &page->flags); - } else - pp |= PPC64_HWNOEXEC; + } else + pp |= HW_NO_EXEC; } return pp; } @@ -271,7 +278,7 @@ int hash_page(unsigned long ea, unsigned int user_region = 0; int local = 0; cpumask_t tmp; - + switch (REGION_ID(ea)) { case USER_REGION_ID: user_region = 1; diff -puN include/asm-ppc64/pgtable.h~nx-kernel-ppc64 include/asm-ppc64/pgtable.h --- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h 2005-03-10 13:54:14 -06:00 @@ -117,6 +117,7 @@ #define PAGE_READONLY __pgprot(_PAGE_BASE | _PAGE_USER) #define PAGE_READONLY_X __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC) #define PAGE_KERNEL __pgprot(_PAGE_BASE | _PAGE_WRENABLE) +#define PAGE_KERNEL_EXEC __pgprot(_PAGE_BASE | 
_PAGE_WRENABLE | _PAGE_EXEC) #define HW_NO_EXEC _PAGE_EXEC /* This is used when the bit is * inverted, even though it's the diff -puN include/asm-ppc64/sections.h~nx-kernel-ppc64 include/asm-ppc64/sections.h --- linux-2.6-bk/include/asm-ppc64/sections.h~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/sections.h 2005-03-10 13:58:12 -06:00 @@ -17,4 +17,13 @@ extern char _end[]; #define __openfirmware #define __openfirmwaredata + +static inline int in_kernel_text(unsigned long addr) +{ + if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end) + return 1; + + return 0; +} + #endif _ From benh at kernel.crashing.org Fri Mar 11 09:44:28 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 11 Mar 2005 09:44:28 +1100 Subject: [PATCH 2/2] No-exec support for ppc64 In-Reply-To: <20050310162721.19003dac.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308171326.3d72363a.moilanen@austin.ibm.com> <20050310032507.GC20789@austin.ibm.com> <1110438934.32524.203.camel@gaston> <20050310162721.19003dac.moilanen@austin.ibm.com> Message-ID: <1110494668.32525.283.camel@gaston> > /* Free memory returned from module_alloc */ > diff -puN arch/ppc64/mm/fault.c~nx-kernel-ppc64 arch/ppc64/mm/fault.c > --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 > +++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-10 13:54:14 -06:00 > @@ -76,6 +76,13 @@ static int store_updates_sp(struct pt_re > return 0; > } > > +pte_t *lookup_address(unsigned long address) > +{ > + pgd_t *pgd = pgd_offset_k(address); > + > + return find_linux_pte(pgd, address); > +} static please, even inline in this case. I've removed Andrew from CC upon his request, Paul, Anton or I will forward to him when it's ready, no need to clobber his mailbox in the meantime. Ben. 
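For anyone following the no-exec thread above, here is a minimal user-space sketch of the range check and mode adjustment the patch applies when bolting kernel mappings. It is only an illustration under stated assumptions: FAKE_STEXT and FAKE_INIT_END are made-up bounds standing in for the linker symbols _stext and __init_end, and bolted_mode() is a hypothetical helper name, not something in the patch. The HW_NO_EXEC value matches the 0x0004 bit the patch defines.

```c
/* Hypothetical bounds standing in for the linker symbols _stext and
 * __init_end; the real values come from the kernel link. */
#define FAKE_STEXT     0x1000UL
#define FAKE_INIT_END  0x5000UL

/* Same bit value as _PAGE_EXEC (0x0004) in the patch.  In the hash PTE
 * the sense is inverted: bit set means "no execute". */
#define HW_NO_EXEC     0x0004UL

/* Half-open range check, [start, end), as in the sections.h hunk. */
static inline int in_kernel_text(unsigned long addr)
{
	return addr >= FAKE_STEXT && addr < FAKE_INIT_END;
}

/* Hypothetical helper: return the protection bits to use for a bolted
 * mapping of addr.  Anything outside kernel text gets the no-execute
 * bit ORed in, mirroring the create_pte_mapping() change. */
unsigned long bolted_mode(unsigned long mode, unsigned long addr)
{
	if (!in_kernel_text(addr))
		mode |= HW_NO_EXEC;
	return mode;
}
```

Keeping the check as a static inline in a shared header, as suggested in the review, lets both hash_utils.c and iSeries_setup.c use it without an extern declaration.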
From olof at austin.ibm.com Fri Mar 11 10:39:32 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Thu, 10 Mar 2005 17:39:32 -0600 Subject: [PATCH] ppc64: kill might_sleep() warnings in __copy_*_user_inatomic In-Reply-To: <200503102054.38123.arnd@arndb.de> References: <200503102054.38123.arnd@arndb.de> Message-ID: <20050310233932.GA26823@austin.ibm.com> On Thu, Mar 10, 2005 at 08:54:36PM +0100, Arnd Bergmann wrote: > I currently get warnings from futex resulting from Olof's futex+rwsem fix > combined with the fact that ppc64 __copy_from_user has a might_sleep > check in it: > > [ 9607.577071] Debug: sleeping function called from invalid context at include2/asm/uaccess.h:228 > [ 9607.676181] in_atomic():1, irqs_disabled():0 > [ 9607.724741] Call Trace: > [ 9607.752058] [c00000000d68fab0] [c000000001f0fb80] 0xc000000001f0fb80 (unreliable) > [ 9607.835030] [c00000000d68fb30] [c000000000042420] .__might_sleep+0xf8/0x108 > [ 9607.912936] [c00000000d68fbd0] [c00000000006ac34] .do_futex+0x224/0x858 > > The fix is to do the check only in copy_*_user, not __copy_*_user. This is the > same that most other architectures do. Actually, I think I would prefer the following. It renames current __copy_{to,from}_user to __copy_{to,from}_user_inatomic, adds the old ones as inlines doing the might_sleep() and calling the inatomics afterwards. This way the calls to __copy_{to,from}_user() will be caught if called under lock or preemption as well. This is also how i386 does it. This was coded up during travelling, so I haven't been able to boot the patch, only build it. Dave Jones made me aware of it since he hit exactly the above on ppc64 himself, but it was right before I left town. -Olof --- This implements the __copy_{to,from}_user_inatomic() functions on ppc64. The only difference between the inatomic and regular version is that inatomic does not call might_sleep() to detect possible faults while holding locks/elevated preempt counts.
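The split described above can be modelled in user space roughly as follows. This is a sketch under assumptions, not the kernel code: atomic_depth and sleep_warnings are hypothetical stand-ins for the preempt count and the "sleeping function called from invalid context" message, and the *_model names are invented for illustration. The shape is the same as the patch: the raw copy carries no debug check, and the regular entry point is a thin wrapper that runs the check first.

```c
#include <string.h>

/* Hypothetical stand-ins for kernel state, for illustration only. */
int atomic_depth;     /* models an elevated preempt count / held lock */
int sleep_warnings;   /* how many times the debug check would complain */

/* Models might_sleep(): complain if we could fault while atomic. */
void might_sleep_model(void)
{
	if (atomic_depth > 0)
		sleep_warnings++;
}

/* Models __copy_from_user_inatomic(): the raw copy with no check, for
 * callers that already know they are in atomic context.  Returns the
 * number of bytes NOT copied, as the kernel routines do. */
unsigned long copy_inatomic_model(void *to, const void *from,
				  unsigned long n)
{
	memcpy(to, from, n);
	return 0;
}

/* Models __copy_from_user(): run the debug check, then reuse the
 * inatomic variant so there is a single copy implementation. */
unsigned long copy_model(void *to, const void *from, unsigned long n)
{
	might_sleep_model();
	return copy_inatomic_model(to, from, n);
}
```

With atomic_depth non-zero, calling copy_inatomic_model() leaves sleep_warnings untouched while copy_model() bumps it, which is the behaviour the futex path needs.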
Signed-off-by: Olof Johansson Index: linux-2.5/include/asm-ppc64/uaccess.h =================================================================== --- linux-2.5.orig/include/asm-ppc64/uaccess.h 2005-03-09 17:17:31.000000000 -0600 +++ linux-2.5/include/asm-ppc64/uaccess.h 2005-03-09 17:21:01.000000000 -0600 @@ -223,9 +223,8 @@ extern unsigned long __copy_tofrom_user( unsigned long size); static inline unsigned long -__copy_from_user(void *to, const void __user *from, unsigned long n) +__copy_from_user_inatomic(void *to, const void __user *from, unsigned long n) { - might_sleep(); if (__builtin_constant_p(n)) { unsigned long ret; @@ -248,9 +247,15 @@ __copy_from_user(void *to, const void __ } static inline unsigned long -__copy_to_user(void __user *to, const void *from, unsigned long n) +__copy_from_user(void *to, const void __user *from, unsigned long n) { might_sleep(); + return __copy_from_user_inatomic(to, from, n); +} + +static inline unsigned long +__copy_to_user_inatomic(void __user *to, const void *from, unsigned long n) +{ if (__builtin_constant_p(n)) { unsigned long ret; @@ -272,6 +277,13 @@ __copy_to_user(void __user *to, const vo return __copy_tofrom_user(to, (__force const void __user *) from, n); } +static inline unsigned long +__copy_to_user(void __user *to, const void *from, unsigned long n) +{ + might_sleep(); + return __copy_to_user_inatomic(to, from, n); +} + #define __copy_in_user(to, from, size) \ __copy_tofrom_user((to), (from), (size)) @@ -284,9 +296,6 @@ extern unsigned long copy_in_user(void _ extern unsigned long __clear_user(void __user *addr, unsigned long size); -#define __copy_to_user_inatomic __copy_to_user -#define __copy_from_user_inatomic __copy_from_user - static inline unsigned long clear_user(void __user *addr, unsigned long size) { From kravetz at us.ibm.com Fri Mar 11 10:42:47 2005 From: kravetz at us.ibm.com (mike kravetz) Date: Thu, 10 Mar 2005 15:42:47 -0800 Subject: [PATCH] PPC64 NUMA memory fixup In-Reply-To: 
<20050310023613.23499386.akpm@osdl.org> References: <16942.30144.513313.26103@cargo.ozlabs.ibm.com> <20050310023613.23499386.akpm@osdl.org> Message-ID: <20050310234247.GA8276@w-mikek2.ibm.com> On Thu, Mar 10, 2005 at 02:36:13AM -0800, Andrew Morton wrote: > Paul Mackerras wrote: > > > > When I booted my new 720 on a kernel configured for NUMA, I received > > the following during bootup: > > > > WARNING: Unexpected node layout: region start 44000000 length 2000000 > > NUMA is disabled > > > > This is due to memory 'holes' within nodes. If such holes are > > encountered, then NUMA is disabled. The following patch adds support > > for such configurations. My 720 now boots with the following message: > > This patch causes the non-numa G5 to oops very early in boot in > smp_call_function(). > I can't recreate this on my system here even if I start with Andrew's config file. In addition, I don't have access to a PMAC to even try and figure out the flow at boot time. My guess is that this is related to the extra scan of memory sections (via of_find_node_by_type()) in do_init_bootmem. Is there something inherently different between making these calls on a LPAR as opposed to PMAC? I'm going to start looking for a PMAC so I can get more info. Any other suggestions on how to track this down are appreciated. Thanks, -- Mike From david at gibson.dropbear.id.au Fri Mar 11 11:11:20 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 11 Mar 2005 11:11:20 +1100 Subject: [PPC64] Allow emulation of mfpvr on ppc64 kernel In-Reply-To: <200503102317.04027.ioe-lkml@axxeo.de> References: <20050310021848.GD30435@localhost.localdomain> <200503102317.04027.ioe-lkml@axxeo.de> Message-ID: <20050311001120.GA6512@localhost.localdomain> On Thu, Mar 10, 2005 at 11:17:03PM +0100, Ingo Oeser wrote: > David Gibson wrote: > > Andrew, please apply. > > > > Allow userspace programs on ppc64 to use the (privileged) mfpvr > > instruction to determine the processor type. 
At the moment it > > emulates the instruction to provide the real PVR value, though it > > could be made to lie in future if for some reason we wish to restrict > > what CPU features userspace uses. > > Why not putting the required information into the AUX table > when executing your ELF programs? I loved this feature in the > ix86 arch. Because this is easy and is the way we already do it on ppc32..? -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist. NOT _the_ _other_ _way_ | _around_! http://www.ozlabs.org/people/dgibson From ntl at pobox.com Fri Mar 11 12:24:20 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 10 Mar 2005 19:24:20 -0600 Subject: [PATCH 2/8] make OF node fixup code usable at runtime In-Reply-To: <1110473410.29353.4.camel@sinatra.austin.ibm.com> References: <20050310005132.31309.65485.31668@otto> <20050310005142.31309.45788.99418@otto> <1110473410.29353.4.camel@sinatra.austin.ibm.com> Message-ID: <20050311012420.GD21853@otto> On Thu, Mar 10, 2005 at 10:50:10AM -0600, John Rose wrote: > > > > These functions, if passed a null mem_start argument, will > > use kmalloc for allocating extra data structures for the device node > > being processed. > > Might it be possible to use the dynamic flag of the device node to > decide when to use kmalloc? D'oh... that didn't occur to me. I think I will use this idea, it should reduce the total size of the changes. 
Nathan From ntl at pobox.com Fri Mar 11 12:30:47 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 10 Mar 2005 19:30:47 -0600 Subject: [PATCH 2/8] make OF node fixup code usable at runtime In-Reply-To: <4230A2F2.7020403@austin.ibm.com> References: <20050310005132.31309.65485.31668@otto> <20050310005142.31309.45788.99418@otto> <4230A2F2.7020403@austin.ibm.com> Message-ID: <20050311013047.GE21853@otto> On Thu, Mar 10, 2005 at 01:41:38PM -0600, Joel Schopp wrote: > Nathan Lynch wrote: > > >-static int of_finish_dynamic_node(struct device_node *node) > >+static int of_finish_dynamic_node(struct device_node *node, > >+ unsigned long *unused1, int unused2, > >+ int unused3, int unused4) > > { > > > Is there a reason for these 4 unused fields that I am just missing? > In order for it to be correctly used as an argument to finish_node, of_finish_dynamic_node needs to have a definition compatible with the interpret_func typedef. Nathan From ntl at pobox.com Fri Mar 11 12:37:53 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 10 Mar 2005 19:37:53 -0600 Subject: [PATCH 4/8] prom.c: use pSeries reconfig notifier In-Reply-To: <1110475200.29353.12.camel@sinatra.austin.ibm.com> References: <20050310005132.31309.65485.31668@otto> <20050310005152.31309.41959.21947@otto> <1110475200.29353.12.camel@sinatra.austin.ibm.com> Message-ID: <20050311013753.GF21853@otto> On Thu, Mar 10, 2005 at 11:20:00AM -0600, John Rose wrote: > Quick comment on this- > > > void of_add_node(struct device_node *np) > > { > > - int err; > > - > > - /* This use of finish_node will be moved to a notifier so > > - * the error code can be used. 
> > - */ > > - err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0); > > - if (err < 0) > > - return; > > - > > write_lock(&devtree_lock); > > np->sibling = np->parent->child; > > np->allnext = allnodes; > > @@ -1682,6 +1674,36 @@ void of_remove_node(const struct device_ > > write_unlock(&devtree_lock); > > } > > If I understand correctly, of_add_node() now simply adds the node to > relevant lists and sets relational pointers to position it in the device > tree. The allocation and other setup has been moved out of the > function. Might it be more clear to rename it to of_attach_node() or > something similar? > Yup, and perhaps rename of_remove_node to of_detach_node or similar, since the semantics have changed slightly there also. Nathan From ntl at pobox.com Fri Mar 11 12:43:26 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 10 Mar 2005 19:43:26 -0600 Subject: [PATCH 4/8] prom.c: use pSeries reconfig notifier In-Reply-To: <200503102005.29526.arnd@arndb.de> References: <20050310005132.31309.65485.31668@otto> <20050310005152.31309.41959.21947@otto> <200503102005.29526.arnd@arndb.de> Message-ID: <20050311014326.GG21853@otto> On Thu, Mar 10, 2005 at 08:05:28PM +0100, Arnd Bergmann wrote: > On Dunnersdag 10 M?rz 2005 01:51, Nathan Lynch wrote: > > ?void of_add_node(struct device_node *np) > > ?{ > > While looking at the of_add_node code, I noticed that there are some > memory holes in the error path of that function. They should be trivial > to fix with the attached patch, but I can't test this because I don't > have reconfigurable machines. > I noticed this too and tried to fix it up in pSeries_reconfig_add_node in the previous (#3) patch, but I made a mess of it and will need to rework it. Thanks for pointing this out. 
Nathan From ntl at pobox.com Fri Mar 11 12:46:36 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 10 Mar 2005 19:46:36 -0600 Subject: [PATCH 3/8] introduce pSeries_reconfig.[ch] In-Reply-To: <20050310005147.31309.61029.66648@otto> References: <20050310005132.31309.65485.31668@otto> <20050310005147.31309.61029.66648@otto> Message-ID: <20050311014636.GH21853@otto> > +static int pSeries_reconfig_add_node(const char *path, struct property *proplist) > +{ > + struct device_node *np; > + int err = -ENOMEM; > + > + np = kcalloc(1, sizeof(*np), GFP_KERNEL); > + if (!np) > + goto out_err; > + > + np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL); > + if (!np->full_name) > + goto out_err; > + ... > + > +out_err: > + kfree(np->full_name); > + kfree(np); > + return err; > +} Bah, potential null pointer dereference in the first kfree, I'll need to fix that up. Nathan From ntl at pobox.com Fri Mar 11 12:53:39 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 10 Mar 2005 19:53:39 -0600 Subject: [PATCH 7/8] pSeries_smp.c: use pSeries reconfig notifier for cpu DLPAR In-Reply-To: <4230AD96.7020508@austin.ibm.com> References: <20050310005132.31309.65485.31668@otto> <20050310005207.31309.32546.73375@otto> <4230AD96.7020508@austin.ibm.com> Message-ID: <20050311015339.GI21853@otto> On Thu, Mar 10, 2005 at 02:27:02PM -0600, Joel Schopp wrote: > This patch seems fine. Just a couple trivial comments. > > >+static int pSeries_add_processor(struct device_node *np) > > This function doesn't really add a processor; it could use a better name. > > >+static void pSeries_remove_processor(struct device_node *np) > > This doesn't remove a processor; it could also use a better name. > They do add and remove processors in the sense that they update cpu_present_map, which is the kernel's logical representation of the cpus actually resident in the system. I'm sort of drawing a blank trying to think of alternatives. If you've better ideas please share. 
Nathan From paulus at samba.org Fri Mar 11 13:34:28 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 11 Mar 2005 13:34:28 +1100 Subject: [PPC64] Allow emulation of mfpvr on ppc64 kernel In-Reply-To: <200503102317.04027.ioe-lkml@axxeo.de> References: <20050310021848.GD30435@localhost.localdomain> <200503102317.04027.ioe-lkml@axxeo.de> Message-ID: <16945.948.350317.549743@cargo.ozlabs.ibm.com> Ingo Oeser writes: > Why not putting the required information into the AUX table > when executing your ELF programs? I loved this feature in the > ix86 arch. We do put an AT_HWCAP entry in the aux table, which is a bitmap of features supported by the cpu. But for some applications, such as programming the performance monitor hardware, you need to know the specific CPU model and version, and this is a way to provide that information. Paul. From paulus at samba.org Fri Mar 11 14:02:17 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 11 Mar 2005 14:02:17 +1100 Subject: [PATCH] AGP support for powermac G5 Message-ID: <16945.2617.625095.404994@cargo.ozlabs.ibm.com> This patch adds AGP support for the U3 northbridge used in Apple G5 machines to drivers/char/agp/uninorth-agp.c. This patch is based on earlier work by Jerome Glisse. With this patch, the driver works in both ppc32 and ppc64 kernels. 
Signed-off-by: Paul Mackerras diff -urN linux-2.5/drivers/char/agp/Kconfig g5/drivers/char/agp/Kconfig --- linux-2.5/drivers/char/agp/Kconfig 2005-03-07 14:01:44.000000000 +1100 +++ g5/drivers/char/agp/Kconfig 2005-03-11 13:53:47.000000000 +1100 @@ -1,6 +1,6 @@ config AGP tristate "/dev/agpgart (AGP Support)" if !GART_IOMMU - depends on ALPHA || IA64 || PPC32 || X86 + depends on ALPHA || IA64 || PPC || X86 default y if GART_IOMMU ---help--- AGP (Accelerated Graphics Port) is a bus system mainly used to @@ -146,11 +146,11 @@ default AGP config AGP_UNINORTH - tristate "Apple UniNorth AGP support" + tristate "Apple UniNorth & U3 AGP support" depends on AGP && PPC_PMAC help This option gives you AGP support for Apple machines with a - UniNorth bridge. + UniNorth or U3 (Apple G5) bridge. config AGP_EFFICEON tristate "Transmeta Efficeon support" diff -urN linux-2.5/drivers/char/agp/uninorth-agp.c g5/drivers/char/agp/uninorth-agp.c --- linux-2.5/drivers/char/agp/uninorth-agp.c 2005-03-11 11:47:37.000000000 +1100 +++ g5/drivers/char/agp/uninorth-agp.c 2005-03-11 11:54:54.000000000 +1100 @@ -9,8 +9,23 @@ #include #include #include +#include #include "agp.h" +/* + * NOTES for uninorth3 (G5 AGP) supports : + * + * There maybe also possibility to have bigger cache line size for + * agp (see pmac_pci.c and look for cache line). Need to be investigated + * by someone. + * + * PAGE size are hardcoded but this may change, see asm/page.h. 
+ * + * Jerome Glisse + */ +static int uninorth_rev; +static int is_u3; + static int uninorth_fetch_size(void) { int i; @@ -40,14 +55,20 @@ static void uninorth_tlbflush(struct agp_memory *mem) { + u32 ctrl = UNI_N_CFG_GART_ENABLE; + + if (is_u3) + ctrl |= U3_N_CFG_GART_PERFRD; pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE | UNI_N_CFG_GART_INVAL); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE | UNI_N_CFG_GART_2xRESET); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE); + ctrl | UNI_N_CFG_GART_INVAL); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, ctrl); + + if (uninorth_rev <= 0x30) { + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + ctrl | UNI_N_CFG_GART_2xRESET); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + ctrl); + } } static void uninorth_cleanup(void) @@ -57,14 +78,16 @@ pci_read_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, &tmp); if (!(tmp & UNI_N_CFG_GART_ENABLE)) return; - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_ENABLE | UNI_N_CFG_GART_INVAL); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - 0); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - UNI_N_CFG_GART_2xRESET); - pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, - 0); + tmp |= UNI_N_CFG_GART_INVAL; + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, tmp); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, 0); + + if (uninorth_rev <= 0x30) { + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + UNI_N_CFG_GART_2xRESET); + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_GART_CTRL, + 0); + } } static int uninorth_configure(void) @@ -87,8 +110,21 @@ * the AGP aperture isn't mapped at bus physical address 0 */ 
agp_bridge->gart_bus_addr = 0; +#ifdef CONFIG_PPC64 + /* Assume U3 or later on PPC64 systems */ + /* high 4 bits of GART physical address go in UNI_N_CFG_AGP_BASE */ + pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_AGP_BASE, + (agp_bridge->gatt_bus_addr >> 32) & 0xf); +#else pci_write_config_dword(agp_bridge->dev, UNI_N_CFG_AGP_BASE, agp_bridge->gart_bus_addr); +#endif + + if (is_u3) { + pci_write_config_dword(agp_bridge->dev, + UNI_N_CFG_GART_DUMMY_PAGE, + agp_bridge->scratch_page_real >> 12); + } return 0; } @@ -111,13 +147,14 @@ j = pg_start; while (j < (pg_start + mem->page_count)) { - if (!PGE_EMPTY(agp_bridge, agp_bridge->gatt_table[j])) + if (agp_bridge->gatt_table[j]) return -EBUSY; j++; } for (i = 0, j = pg_start; i < mem->page_count; i++, j++) { - agp_bridge->gatt_table[j] = cpu_to_le32((mem->memory[i] & 0xfffff000) | 0x00000001UL); + agp_bridge->gatt_table[j] = + cpu_to_le32((mem->memory[i] & 0xFFFFF000UL) | 0x1UL); flush_dcache_range((unsigned long)__va(mem->memory[i]), (unsigned long)__va(mem->memory[i])+0x1000); } @@ -130,17 +167,90 @@ return 0; } +static int u3_insert_memory(struct agp_memory *mem, off_t pg_start, int type) +{ + int i, num_entries; + void *temp; + u32 *gp; + + temp = agp_bridge->current_size; + num_entries = A_SIZE_32(temp)->num_entries; + + if (type != 0 || mem->type != 0) + /* We know nothing of memory types */ + return -EINVAL; + if ((pg_start + mem->page_count) > num_entries) + return -EINVAL; + + gp = (u32 *) &agp_bridge->gatt_table[pg_start]; + for (i = 0; i < mem->page_count; ++i) { + if (gp[i]) { + printk("u3_insert_memory: entry 0x%x occupied (%x)\n", + i, gp[i]); + return -EBUSY; + } + } + + for (i = 0; i < mem->page_count; i++) { + gp[i] = (mem->memory[i] >> PAGE_SHIFT) | 0x80000000UL; + flush_dcache_range((unsigned long)__va(mem->memory[i]), + (unsigned long)__va(mem->memory[i])+0x1000); + } + mb(); + flush_dcache_range((unsigned long)gp, (unsigned long) &gp[i]); + uninorth_tlbflush(mem); + + return 0; +} + +int 
u3_remove_memory(struct agp_memory *mem, off_t pg_start, int type) +{ + size_t i; + u32 *gp; + + if (type != 0 || mem->type != 0) + /* We know nothing of memory types */ + return -EINVAL; + + gp = (u32 *) &agp_bridge->gatt_table[pg_start]; + for (i = 0; i < mem->page_count; ++i) + gp[i] = 0; + mb(); + flush_dcache_range((unsigned long)gp, (unsigned long) &gp[i]); + uninorth_tlbflush(mem); + + return 0; +} + static void uninorth_agp_enable(struct agp_bridge_data *bridge, u32 mode) { - u32 command, scratch; + u32 command, scratch, status; int timeout; pci_read_config_dword(bridge->dev, bridge->capndx + PCI_AGP_STATUS, - &command); + &status); - command = agp_collect_device_status(bridge, mode, command); - command |= 0x100; + command = agp_collect_device_status(bridge, mode, status); + command |= PCI_AGP_COMMAND_AGP; + + if (uninorth_rev == 0x21) { + /* + * Darwin disable AGP 4x on this revision, thus we + * may assume it's broken. This is an AGP2 controller. + */ + command &= ~AGPSTAT2_4X; + } + + if ((uninorth_rev >= 0x30) && (uninorth_rev <= 0x33)) { + /* + * We need to to set REQ_DEPTH to 7 for U3 versions 1.0, 2.1, + * 2.2 and 2.3, Darwin do so. 
+ */ + if ((command >> AGPSTAT_RQ_DEPTH_SHIFT) > 7) + command = (command & ~AGPSTAT_RQ_DEPTH) + | (7 << AGPSTAT_RQ_DEPTH_SHIFT); + } uninorth_tlbflush(NULL); @@ -152,11 +262,17 @@ pci_read_config_dword(bridge->dev, bridge->capndx + PCI_AGP_COMMAND, &scratch); - } while ((scratch & 0x100) == 0 && ++timeout < 1000); - if ((scratch & 0x100) == 0) + } while ((scratch & PCI_AGP_COMMAND_AGP) == 0 && ++timeout < 1000); + if ((scratch & PCI_AGP_COMMAND_AGP) == 0) printk(KERN_ERR PFX "failed to write UniNorth AGP command reg\n"); - agp_device_command(command, 0); + if (uninorth_rev >= 0x30) { + /* This is an AGP V3 */ + agp_device_command(command, (status & AGPSTAT_MODE_3_0)); + } else { + /* AGP V2 */ + agp_device_command(command, 0); + } uninorth_tlbflush(NULL); } @@ -229,12 +345,12 @@ struct page *page; /* We can't handle 2 level gatt's */ - if (agp_bridge->driver->size_type == LVL2_APER_SIZE) + if (bridge->driver->size_type == LVL2_APER_SIZE) return -EINVAL; table = NULL; - i = agp_bridge->aperture_size_idx; - temp = agp_bridge->current_size; + i = bridge->aperture_size_idx; + temp = bridge->current_size; size = page_order = num_entries = 0; do { @@ -246,11 +362,11 @@ if (table == NULL) { i++; - agp_bridge->current_size = A_IDX32(agp_bridge); + bridge->current_size = A_IDX32(bridge); } else { - agp_bridge->aperture_size_idx = i; + bridge->aperture_size_idx = i; } - } while (!table && (i < agp_bridge->driver->num_aperture_sizes)); + } while (!table && (i < bridge->driver->num_aperture_sizes)); if (table == NULL) return -ENOMEM; @@ -260,14 +376,12 @@ for (page = virt_to_page(table); page <= virt_to_page(table_end); page++) SetPageReserved(page); - agp_bridge->gatt_table_real = (u32 *) table; - agp_bridge->gatt_table = (u32 *)table; - agp_bridge->gatt_bus_addr = virt_to_phys(table); - - for (i = 0; i < num_entries; i++) { - agp_bridge->gatt_table[i] = - (unsigned long) agp_bridge->scratch_page; - } + bridge->gatt_table_real = (u32 *) table; + bridge->gatt_table = (u32 
*)table; + bridge->gatt_bus_addr = virt_to_phys(table); + + for (i = 0; i < num_entries; i++) + bridge->gatt_table[i] = 0; flush_dcache_range((unsigned long)table, (unsigned long)table_end); @@ -281,7 +395,7 @@ void *temp; struct page *page; - temp = agp_bridge->current_size; + temp = bridge->current_size; page_order = A_SIZE_32(temp)->page_order; /* Do not worry about freeing memory, because if this is @@ -289,13 +403,13 @@ * from the table. */ - table = (char *) agp_bridge->gatt_table_real; + table = (char *) bridge->gatt_table_real; table_end = table + ((PAGE_SIZE * (1 << page_order)) - 1); for (page = virt_to_page(table); page <= virt_to_page(table_end); page++) ClearPageReserved(page); - free_pages((unsigned long) agp_bridge->gatt_table_real, page_order); + free_pages((unsigned long) bridge->gatt_table_real, page_order); return 0; } @@ -320,6 +434,22 @@ {4, 1024, 0, 1} }; +/* + * Not sure that u3 supports that high aperture sizes but it + * would strange if it did not :) + */ +static struct aper_size_info_32 u3_sizes[8] = +{ + {512, 131072, 7, 128}, + {256, 65536, 6, 64}, + {128, 32768, 5, 32}, + {64, 16384, 4, 16}, + {32, 8192, 3, 8}, + {16, 4096, 2, 4}, + {8, 2048, 1, 2}, + {4, 1024, 0, 1} +}; + struct agp_bridge_driver uninorth_agp_driver = { .owner = THIS_MODULE, .aperture_sizes = (void *)uninorth_sizes, @@ -344,6 +474,31 @@ .cant_use_aperture = 1, }; +struct agp_bridge_driver u3_agp_driver = { + .owner = THIS_MODULE, + .aperture_sizes = (void *)u3_sizes, + .size_type = U32_APER_SIZE, + .num_aperture_sizes = 8, + .configure = uninorth_configure, + .fetch_size = uninorth_fetch_size, + .cleanup = uninorth_cleanup, + .tlb_flush = uninorth_tlbflush, + .mask_memory = agp_generic_mask_memory, + .masks = NULL, + .cache_flush = null_cache_flush, + .agp_enable = uninorth_agp_enable, + .create_gatt_table = uninorth_create_gatt_table, + .free_gatt_table = uninorth_free_gatt_table, + .insert_memory = u3_insert_memory, + .remove_memory = u3_remove_memory, + 
.alloc_by_type = agp_generic_alloc_by_type, + .free_by_type = agp_generic_free_by_type, + .agp_alloc_page = agp_generic_alloc_page, + .agp_destroy_page = agp_generic_destroy_page, + .cant_use_aperture = 1, + .needs_scratch_page = 1, +}; + static struct agp_device_ids uninorth_agp_device_ids[] __devinitdata = { { .device_id = PCI_DEVICE_ID_APPLE_UNI_N_AGP, @@ -361,6 +516,18 @@ .device_id = PCI_DEVICE_ID_APPLE_UNI_N_AGP2, .chipset_name = "UniNorth 2", }, + { + .device_id = PCI_DEVICE_ID_APPLE_U3_AGP, + .chipset_name = "U3", + }, + { + .device_id = PCI_DEVICE_ID_APPLE_U3L_AGP, + .chipset_name = "U3L", + }, + { + .device_id = PCI_DEVICE_ID_APPLE_U3H_AGP, + .chipset_name = "U3H", + }, }; static int __devinit agp_uninorth_probe(struct pci_dev *pdev, @@ -368,6 +535,7 @@ { struct agp_device_ids *devs = uninorth_agp_device_ids; struct agp_bridge_data *bridge; + struct device_node *uninorth_node; u8 cap_ptr; int j; @@ -389,13 +557,36 @@ return -ENODEV; found: + /* Set revision to 0 if we could not read it. 
*/ + uninorth_rev = 0; + is_u3 = 0; + /* Locate core99 Uni-N */ + uninorth_node = of_find_node_by_name(NULL, "uni-n"); + /* Locate G5 u3 */ + if (uninorth_node == NULL) { + is_u3 = 1; + uninorth_node = of_find_node_by_name(NULL, "u3"); + } + if (uninorth_node) { + int *revprop = (int *) + get_property(uninorth_node, "device-rev", NULL); + if (revprop != NULL) + uninorth_rev = *revprop & 0x3f; + of_node_put(uninorth_node); + } + bridge = agp_alloc_bridge(); if (!bridge) return -ENOMEM; - bridge->driver = &uninorth_agp_driver; + if (is_u3) + bridge->driver = &u3_agp_driver; + else + bridge->driver = &uninorth_agp_driver; + bridge->dev = pdev; bridge->capndx = cap_ptr; + bridge->flags = AGP_ERRATA_FASTWRITES; /* Fill in the mode register */ pci_read_config_dword(pdev, cap_ptr+PCI_AGP_STATUS, &bridge->mode); diff -urN linux-2.5/include/asm-ppc/uninorth.h g5/include/asm-ppc/uninorth.h --- linux-2.5/include/asm-ppc/uninorth.h 2005-01-31 17:22:37.000000000 +1100 +++ g5/include/asm-ppc/uninorth.h 2005-03-11 11:54:54.000000000 +1100 @@ -27,13 +27,18 @@ #define UNI_N_CFG_AGP_BASE 0x90 #define UNI_N_CFG_GART_CTRL 0x94 #define UNI_N_CFG_INTERNAL_STATUS 0x98 +#define UNI_N_CFG_GART_DUMMY_PAGE 0xa4 /* UNI_N_CFG_GART_CTRL bits definitions */ -/* Not U3 */ #define UNI_N_CFG_GART_INVAL 0x00000001 #define UNI_N_CFG_GART_ENABLE 0x00000100 #define UNI_N_CFG_GART_2xRESET 0x00010000 #define UNI_N_CFG_GART_DISSBADET 0x00020000 +/* The following seems to only be used only on U3 */ +#define U3_N_CFG_GART_SYNCMODE 0x00040000 +#define U3_N_CFG_GART_PERFRD 0x00080000 +#define U3_N_CFG_GART_B2BGNT 0x00200000 +#define U3_N_CFG_GART_FASTDDR 0x00400000 /* My understanding of UniNorth AGP as of UniNorth rev 1.0x, * revision 1.5 (x4 AGP) may need further changes. 
diff -urN linux-2.5/include/asm-ppc64/agp.h g5/include/asm-ppc64/agp.h --- /dev/null 2005-03-10 17:27:14.905983648 +1100 +++ g5/include/asm-ppc64/agp.h 2005-03-11 11:54:54.000000000 +1100 @@ -0,0 +1,13 @@ +#ifndef AGP_H +#define AGP_H 1 + +#include + +/* nothing much needed here */ + +#define map_page_into_agp(page) +#define unmap_page_from_agp(page) +#define flush_agp_mappings() +#define flush_agp_cache() mb() + +#endif diff -urN linux-2.5/include/linux/pci_ids.h g5/include/linux/pci_ids.h --- linux-2.5/include/linux/pci_ids.h 2005-03-11 11:47:38.000000000 +1100 +++ g5/include/linux/pci_ids.h 2005-03-11 11:54:54.000000000 +1100 @@ -876,10 +876,13 @@ #define PCI_DEVICE_ID_APPLE_IPID_ATA100 0x003b #define PCI_DEVICE_ID_APPLE_KEYLARGO_I 0x003e #define PCI_DEVICE_ID_APPLE_K2_ATA100 0x0043 +#define PCI_DEVICE_ID_APPLE_U3_AGP 0x004b #define PCI_DEVICE_ID_APPLE_K2_GMAC 0x004c #define PCI_DEVICE_ID_APPLE_SH_ATA 0x0050 #define PCI_DEVICE_ID_APPLE_SH_SUNGEM 0x0051 #define PCI_DEVICE_ID_APPLE_SH_FW 0x0052 +#define PCI_DEVICE_ID_APPLE_U3L_AGP 0x0058 +#define PCI_DEVICE_ID_APPLE_U3H_AGP 0x0059 #define PCI_DEVICE_ID_APPLE_TIGON3 0x1645 #define PCI_VENDOR_ID_YAMAHA 0x1073 From benh at kernel.crashing.org Fri Mar 11 14:01:26 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 11 Mar 2005 14:01:26 +1100 Subject: [PATCH 2/8] make OF node fixup code usable at runtime In-Reply-To: <20050310005142.31309.45788.99418@otto> References: <20050310005132.31309.65485.31668@otto> <20050310005142.31309.45788.99418@otto> Message-ID: <1110510087.32525.334.camel@gaston> On Wed, 2005-03-09 at 18:51 -0600, Nathan Lynch wrote: > At boot we recurse through the device tree "fixing up" various fields > and properties in the device nodes. Long ago, to support DLPAR and > hotplug, we largely duplicated some of this fixup code, the main > difference being that the new code used kmalloc for allocating various > data structures which are attached to the new device nodes. 
> > This patch kills most of the duplicated code and makes finish_node, > finish_node_interrupts, and interpret_pci_props suitable for use at > runtime. These functions, if passed a null mem_start argument, will > use kmalloc for allocating extra data structures for the device node > being processed. Not terribly elegant, but it seems worth it to get > rid of the duplicated code (and bugs). Maybe hide that logic in a macro or inline ? Ben. From ncunningham at cyclades.com Fri Mar 11 14:52:47 2005 From: ncunningham at cyclades.com (Nigel Cunningham) Date: Fri, 11 Mar 2005 14:52:47 +1100 Subject: [PATCH] AGP support for powermac G5 In-Reply-To: <16945.2617.625095.404994@cargo.ozlabs.ibm.com> References: <16945.2617.625095.404994@cargo.ozlabs.ibm.com> Message-ID: <1110513167.3049.45.camel@desktop.cunningham.myip.net.au> Hi. On Fri, 2005-03-11 at 14:02, Paul Mackerras wrote: > +struct agp_bridge_driver u3_agp_driver = { > + .owner = THIS_MODULE, > + .aperture_sizes = (void *)u3_sizes, > + .size_type = U32_APER_SIZE, > + .num_aperture_sizes = 8, > + .configure = uninorth_configure, > + .fetch_size = uninorth_fetch_size, > + .cleanup = uninorth_cleanup, > + .tlb_flush = uninorth_tlbflush, > + .mask_memory = agp_generic_mask_memory, > + .masks = NULL, > + .cache_flush = null_cache_flush, > + .agp_enable = uninorth_agp_enable, > + .create_gatt_table = uninorth_create_gatt_table, > + .free_gatt_table = uninorth_free_gatt_table, > + .insert_memory = u3_insert_memory, > + .remove_memory = u3_remove_memory, > + .alloc_by_type = agp_generic_alloc_by_type, > + .free_by_type = agp_generic_free_by_type, > + .agp_alloc_page = agp_generic_alloc_page, > + .agp_destroy_page = agp_generic_destroy_page, > + .cant_use_aperture = 1, > + .needs_scratch_page = 1, > +}; > + No power management support? 
:> Regards, Nigel -- Nigel Cunningham Software Engineer, Canberra, Australia http://www.cyclades.com Bus: +61 (2) 6291 9554; Hme: +61 (2) 6292 8028; Mob: +61 (417) 100 574 Maintainer of Suspend2 Kernel Patches http://suspend2.net From paulus at samba.org Fri Mar 11 15:02:11 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 11 Mar 2005 15:02:11 +1100 Subject: [PATCH] AGP support for powermac G5 In-Reply-To: <1110513167.3049.45.camel@desktop.cunningham.myip.net.au> References: <16945.2617.625095.404994@cargo.ozlabs.ibm.com> <1110513167.3049.45.camel@desktop.cunningham.myip.net.au> Message-ID: <16945.6211.331369.393573@cargo.ozlabs.ibm.com> Nigel Cunningham writes: > No power management support? :> The suspend/resume methods are in the pci_driver struct, not the agp_bridge_driver struct. Not that we have suspend/resume on the G5 yet. Paul. From ncunningham at cyclades.com Fri Mar 11 15:08:19 2005 From: ncunningham at cyclades.com (Nigel Cunningham) Date: Fri, 11 Mar 2005 15:08:19 +1100 Subject: [PATCH] AGP support for powermac G5 In-Reply-To: <16945.6211.331369.393573@cargo.ozlabs.ibm.com> References: <16945.2617.625095.404994@cargo.ozlabs.ibm.com> <1110513167.3049.45.camel@desktop.cunningham.myip.net.au> <16945.6211.331369.393573@cargo.ozlabs.ibm.com> Message-ID: <1110514099.3049.47.camel@desktop.cunningham.myip.net.au> Hi. On Fri, 2005-03-11 at 15:02, Paul Mackerras wrote: > Nigel Cunningham writes: > > > No power management support? :> > > The suspend/resume methods are in the pci_driver struct, not the > agp_bridge_driver struct. Not that we have suspend/resume on the G5 > yet. Ah. Thought I'd seen some in others. Humble apologies. 
Nigel -- Nigel Cunningham Software Engineer, Canberra, Australia http://www.cyclades.com Bus: +61 (2) 6291 9554; Hme: +61 (2) 6292 8028; Mob: +61 (417) 100 574 Maintainer of Suspend2 Kernel Patches http://suspend2.net From benh at kernel.crashing.org Fri Mar 11 15:29:02 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 11 Mar 2005 15:29:02 +1100 Subject: [PATCH] AGP support for powermac G5 In-Reply-To: <1110513167.3049.45.camel@desktop.cunningham.myip.net.au> References: <16945.2617.625095.404994@cargo.ozlabs.ibm.com> <1110513167.3049.45.camel@desktop.cunningham.myip.net.au> Message-ID: <1110515343.32524.343.camel@gaston> > > No power management support? :> Heh, not yet :) We can't really put a G5 to sleep yet. I haven't figured out the magic incantations for the PMU chip on those. Ben. From sfr at canb.auug.org.au Fri Mar 11 17:25:11 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Fri, 11 Mar 2005 17:25:11 +1100 Subject: inappropriate use of in_atomic() In-Reply-To: <20050310204006.48286d17.akpm@osdl.org> References: <20050310204006.48286d17.akpm@osdl.org> Message-ID: <20050311172511.1fa0919e.sfr@canb.auug.org.au> Hi Andrew, On Thu, 10 Mar 2005 20:40:06 -0800 Andrew Morton wrote: > > in_atomic() is not a reliable indication of whether it is currently safe > to call schedule(). > > arch/ppc64/kernel/viopath.c in_atomic() in viopath.c was just used to determine if we had initialised enough to be able to wait in a semaphore (i.e. schedule). Thus it can be replaced now with checking system_state for SYSTEM_RUNNING. Signed-off-by: Stephen Rothwell Test booted on iSeries (which is the only place it is used). 
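A minimal user-space sketch of the decision this change makes, for readers following the thread. It assumes only what the message says: sleeping on the semaphore is safe once the system is fully running, and before that the caller has to poll an atomic flag. The sketch_ names are hypothetical stand-ins, not kernel symbols.

```c
/* Hypothetical stand-ins for the kernel's system_state values. */
enum sketch_system_state {
	SKETCH_BOOTING,
	SKETCH_RUNNING
};

/* The two ways the caller can wait for the completion callback. */
enum sketch_wait_mode {
	SKETCH_WAIT_SPIN,	/* poll an atomic flag, never schedule */
	SKETCH_WAIT_SLEEP	/* block on a semaphore, may schedule */
};

/*
 * Sleeping needs a working scheduler, so anything other than the
 * fully-running state falls back to polling.
 */
static enum sketch_wait_mode sketch_pick_wait(enum sketch_system_state state)
{
	return (state == SKETCH_RUNNING) ? SKETCH_WAIT_SLEEP : SKETCH_WAIT_SPIN;
}
```

The point of keying on the boot state rather than in_atomic() is that the former is a fact about the system as a whole, while the latter was only ever a heuristic about the current context.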
-- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff -ruNp linus/arch/ppc64/kernel/viopath.c linus-in_atomic/arch/ppc64/kernel/viopath.c --- linus/arch/ppc64/kernel/viopath.c 2005-01-22 06:09:01.000000000 +1100 +++ linus-in_atomic/arch/ppc64/kernel/viopath.c 2005-03-11 17:19:45.000000000 +1100 @@ -79,7 +79,7 @@ static void handleMonitorEvent(struct Hv /* * We use this structure to handle asynchronous responses. The caller * blocks on the semaphore and the handler posts the semaphore. However, - * if in_atomic() is true in the caller, then wait_atomic is used ... + * if system_state is not SYSTEM_RUNNING, then wait_atomic is used ... */ struct doneAllocParms_t { struct semaphore *sem; @@ -465,7 +465,7 @@ static int allocateEvents(HvLpIndex remo DECLARE_MUTEX_LOCKED(Semaphore); atomic_t wait_atomic; - if (in_atomic()) { + if (system_state != SYSTEM_RUNNING) { parms.used_wait_atomic = 1; atomic_set(&wait_atomic, 1); parms.wait_atomic = &wait_atomic; @@ -475,7 +475,7 @@ static int allocateEvents(HvLpIndex remo } mf_allocate_lp_events(remoteLp, HvLpEvent_Type_VirtualIo, 250, /* It would be nice to put a real number here! */ numEvents, &viopath_donealloc, &parms); - if (in_atomic()) { + if (system_state != SYSTEM_RUNNING) { while (atomic_read(&wait_atomic)) mb(); } else -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050311/266c2f0e/attachment.pgp From arnd at arndb.de Fri Mar 11 22:45:37 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 11 Mar 2005 12:45:37 +0100 Subject: [PATCH] ppc64: kill might_sleep() warnings in __copy_*_user_inatomic In-Reply-To: <20050310233932.GA26823@austin.ibm.com> References: <200503102054.38123.arnd@arndb.de> <20050310233932.GA26823@austin.ibm.com> Message-ID: <200503111245.39257.arnd@arndb.de> On Freedag 11 März 2005 00:39, Olof Johansson wrote: > Actually, I think I would prefer the following. It renames current > __copy_{to,from}_user to __copy_{to,from}_user_inatomic, adds the > old ones as inlines doing the might_sleep() and calling the inatomics > afterwards. This way the calls to __copy_{to,from}_user() will be caught > if called under lock or preemption as well. This is also how i386 does it. Yes, that solution is better than mine. However, you missed the case where __copy_{to,from}_user_inatomic calls __{get,put}_user_size, which in turn does might_sleep(). I now changed the {get,put}_user path accordingly. I have checked that this version boots and does not warn about futex. Arnd <>< --- This implements the __copy_{to,from}_user_inatomic() functions on ppc64. The only difference between the inatomic and regular version is that inatomic does not call might_sleep() to detect possible faults while holding locks/elevated preempt counts. Signed-off-by: Olof Johansson Signed-off-by: Arnd Bergmann -------------- next part -------------- A non-text attachment was scrubbed... 
Name: uaccess-might-sleep-3.diff Type: text/x-diff Size: 3367 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050311/57d29cec/attachment.diff From moilanen at austin.ibm.com Sat Mar 12 01:01:31 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Fri, 11 Mar 2005 08:01:31 -0600 Subject: [PATCH 2/2] No-exec support for ppc64 In-Reply-To: <1110494668.32525.283.camel@gaston> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308171326.3d72363a.moilanen@austin.ibm.com> <20050310032507.GC20789@austin.ibm.com> <1110438934.32524.203.camel@gaston> <20050310162721.19003dac.moilanen@austin.ibm.com> <1110494668.32525.283.camel@gaston> Message-ID: <20050311080131.24419bd4.moilanen@austin.ibm.com> On Fri, 11 Mar 2005 09:44:28 +1100 Benjamin Herrenschmidt wrote: > > > > /* Free memory returned from module_alloc */ > > diff -puN arch/ppc64/mm/fault.c~nx-kernel-ppc64 arch/ppc64/mm/fault.c > > --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-kernel-ppc64 2005-03-10 13:54:14 -06:00 > > +++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-10 13:54:14 -06:00 > > @@ -76,6 +76,13 @@ static int store_updates_sp(struct pt_re > > return 0; > > } > > > > +pte_t *lookup_address(unsigned long address) > > +{ > > + pgd_t *pgd = pgd_offset_k(address); > > + > > + return find_linux_pte(pgd, address); > > +} > > static please, even inline in this case. > > I've removed Andrew from CC upon his request, Paul, Anton or I will > forward to him when it's ready, no need to clobber his mailbox in the > meantime. 3rd time is a charm. 
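As a rough illustration of the idea in the patch that follows: an address inside the kernel text range keeps its mapping mode unchanged, and everything else gets a no-execute bit OR'ed in. This is a plain C sketch under stated assumptions, not the patch itself. The sketch_ names and the bit value are hypothetical, and the range is passed in explicitly instead of coming from linker symbols.

```c
#include <stdint.h>

/* Hypothetical no-execute bit; the value is arbitrary for this sketch. */
#define SKETCH_NO_EXEC 0x4UL

/* Half-open address range [start, end) standing in for kernel text. */
struct sketch_text_range {
	uintptr_t start;
	uintptr_t end;
};

/* Return 1 if addr lies inside the text range, 0 otherwise. */
static int sketch_in_text(const struct sketch_text_range *r, uintptr_t addr)
{
	return addr >= r->start && addr < r->end;
}

/* Text keeps the caller's mode; everything else is marked no-execute. */
static unsigned long sketch_map_mode(const struct sketch_text_range *r,
				     uintptr_t addr, unsigned long mode)
{
	return sketch_in_text(r, addr) ? mode : (mode | SKETCH_NO_EXEC);
}
```

Computing a per-address mode in a temporary, as the patch does with tmp_mode, leaves the caller's original mode untouched for the next address in the loop.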
Signed-off-by: Jake Moilanen --- linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c | 4 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/module.c | 3 +- linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c | 19 ++++++++++++++++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c | 19 ++++++++++------ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h | 1 linux-2.6-bk-moilanen/include/asm-ppc64/sections.h | 9 +++++++ 6 files changed, 48 insertions(+), 7 deletions(-) diff -puN arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 arch/ppc64/kernel/iSeries_setup.c --- linux-2.6-bk/arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 2005-03-11 07:50:39 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c 2005-03-11 07:50:39 -06:00 @@ -633,6 +633,10 @@ static void __init iSeries_bolt_kernel(u unsigned long vpn = va >> PAGE_SHIFT; unsigned long slot = HvCallHpt_findValid(&hpte, vpn); + /* Make non-kernel text non-executable */ + if (!in_kernel_text(ea)) + mode_rw |= HW_NO_EXEC; + if (hpte.dw0.dw0.v) { /* HPTE exists, so just bolt it */ HvCallHpt_setSwBits(slot, 0x10, 0); diff -puN arch/ppc64/kernel/module.c~nx-kernel-ppc64 arch/ppc64/kernel/module.c --- linux-2.6-bk/arch/ppc64/kernel/module.c~nx-kernel-ppc64 2005-03-11 07:50:39 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/kernel/module.c 2005-03-11 07:50:39 -06:00 @@ -102,7 +102,8 @@ void *module_alloc(unsigned long size) { if (size == 0) return NULL; - return vmalloc(size); + + return vmalloc_exec(size); } /* Free memory returned from module_alloc */ diff -puN arch/ppc64/mm/fault.c~nx-kernel-ppc64 arch/ppc64/mm/fault.c --- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-kernel-ppc64 2005-03-11 07:50:39 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c 2005-03-11 07:50:57 -06:00 @@ -76,6 +76,13 @@ static int store_updates_sp(struct pt_re return 0; } +static inline pte_t *lookup_address(unsigned long address) +{ + pgd_t *pgd = pgd_offset_k(address); + + return find_linux_pte(pgd, address); +} + /* * The error_code parameter is * - 
DSISR for a non-SLB data access fault, @@ -94,6 +101,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long is_write = error_code & 0x02000000; unsigned long trap = TRAP(regs); unsigned long is_exec = trap == 0x400; + pte_t *ptep; BUG_ON((trap == 0x380) || (trap == 0x480)); @@ -253,6 +261,17 @@ bad_area_nosemaphore: info.si_addr = (void __user *) address; force_sig_info(SIGSEGV, &info, current); return 0; + } + + ptep = lookup_address(address); + + if (ptep && pte_present(*ptep) && !pte_exec(*ptep)) { + if (printk_ratelimit()) + printk(KERN_CRIT "kernel tried to execute NX-protected " + "page - exploit attempt? (uid: %d)\n", + current->uid); + show_stack(current, (unsigned long *)__get_SP()); + do_exit(SIGKILL); } return SIGSEGV; diff -puN arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 arch/ppc64/mm/hash_utils.c --- linux-2.6-bk/arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 2005-03-11 07:50:39 -06:00 +++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c 2005-03-11 07:59:53 -06:00 @@ -51,6 +51,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) 
udbg_printf(fmt) @@ -95,6 +96,7 @@ static inline void create_pte_mapping(un { unsigned long addr; unsigned int step; + unsigned long tmp_mode; if (large) step = 16*MB; @@ -112,6 +114,13 @@ static inline void create_pte_mapping(un else vpn = va >> PAGE_SHIFT; + + tmp_mode = mode; + + /* Make non-kernel text non-executable */ + if (!in_kernel_text(addr)) + tmp_mode = mode | HW_NO_EXEC; + hash = hpt_hash(vpn, large); hpteg = ((hash & htab_hash_mask) * HPTES_PER_GROUP); @@ -120,12 +129,12 @@ static inline void create_pte_mapping(un if (systemcfg->platform & PLATFORM_LPAR) ret = pSeries_lpar_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); else #endif /* CONFIG_PPC_PSERIES */ ret = native_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); if (ret == -1) { ppc64_terminate_msg(0x20, "create_pte_mapping"); @@ -238,8 +247,6 @@ unsigned int hash_page_do_lazy_icache(un { struct page *page; -#define PPC64_HWNOEXEC (1 << 2) - if (!pfn_valid(pte_pfn(pte))) return pp; @@ -250,8 +257,8 @@ unsigned int hash_page_do_lazy_icache(un if (trap == 0x400) { __flush_dcache_icache(page_address(page)); set_bit(PG_arch_1, &page->flags); - } else - pp |= PPC64_HWNOEXEC; + } else + pp |= HW_NO_EXEC; } return pp; } diff -puN include/asm-ppc64/pgtable.h~nx-kernel-ppc64 include/asm-ppc64/pgtable.h --- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-kernel-ppc64 2005-03-11 07:50:39 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h 2005-03-11 07:50:39 -06:00 @@ -117,6 +117,7 @@ #define PAGE_READONLY __pgprot(_PAGE_BASE | _PAGE_USER) #define PAGE_READONLY_X __pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC) #define PAGE_KERNEL __pgprot(_PAGE_BASE | _PAGE_WRENABLE) +#define PAGE_KERNEL_EXEC __pgprot(_PAGE_BASE | _PAGE_WRENABLE | _PAGE_EXEC) #define HW_NO_EXEC _PAGE_EXEC /* This is used when the bit is * inverted, even though it's the diff -puN include/asm-ppc64/sections.h~nx-kernel-ppc64 
include/asm-ppc64/sections.h --- linux-2.6-bk/include/asm-ppc64/sections.h~nx-kernel-ppc64 2005-03-11 07:50:39 -06:00 +++ linux-2.6-bk-moilanen/include/asm-ppc64/sections.h 2005-03-11 07:50:39 -06:00 @@ -17,4 +17,13 @@ extern char _end[]; #define __openfirmware #define __openfirmwaredata + +static inline int in_kernel_text(unsigned long addr) +{ + if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end) + return 1; + + return 0; +} + #endif _ From jschopp at austin.ibm.com Sat Mar 12 03:15:01 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Fri, 11 Mar 2005 10:15:01 -0600 Subject: [PATCH 2/8] make OF node fixup code usable at runtime In-Reply-To: <20050311013047.GE21853@otto> References: <20050310005132.31309.65485.31668@otto> <20050310005142.31309.45788.99418@otto> <4230A2F2.7020403@austin.ibm.com> <20050311013047.GE21853@otto> Message-ID: <4231C405.8040703@austin.ibm.com> >> >>>-static int of_finish_dynamic_node(struct device_node *node) >>>+static int of_finish_dynamic_node(struct device_node *node, >>>+ unsigned long *unused1, int unused2, >>>+ int unused3, int unused4) >>>{ >> >> >>Is there a reason for these 4 unused fields that I am just missing? >> > > > In order for it to be correctly used as an argument to finish_node, > of_finish_dynamic_node needs to have a definition compatible with the > interpret_func typedef. OK. I'm good with the patch then. 
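The point about the unused arguments can be sketched in plain C: a function passed through a function-pointer typedef has to match that typedef's full signature, even when it ignores most of the parameters. The names below are hypothetical analogues of interpret_func, finish_node and of_finish_dynamic_node. They follow the parameter list quoted above but are not the kernel definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct device_node. */
struct sketch_node {
	int fixed_up;
};

/* Hypothetical analogue of the interpret_func typedef. */
typedef int (*sketch_interpret_func)(struct sketch_node *, unsigned long *,
				     int, int, int);

/*
 * Only the node is needed here, but the extra parameters must stay so
 * the function is type-compatible with sketch_interpret_func.
 */
static int sketch_finish_dynamic(struct sketch_node *node,
				 unsigned long *unused1, int unused2,
				 int unused3, int unused4)
{
	(void)unused1;
	(void)unused2;
	(void)unused3;
	(void)unused4;
	node->fixed_up = 1;
	return 0;
}

/* Generic fixup entry point that calls back through the typedef. */
static int sketch_finish_node(struct sketch_node *node,
			      sketch_interpret_func ifunc)
{
	if (ifunc == NULL)
		return 0;
	return ifunc(node, NULL, 0, 0, 0);
}
```

Calling sketch_finish_node(&node, sketch_finish_dynamic) type-checks only because the two signatures agree, which is the reason given for the four otherwise unused parameters.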
From olof at austin.ibm.com Sat Mar 12 07:30:44 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Fri, 11 Mar 2005 14:30:44 -0600 Subject: [PATCH] ppc64: kill might_sleep() warnings in __copy_*_user_inatomic In-Reply-To: <200503111245.39257.arnd@arndb.de> References: <200503102054.38123.arnd@arndb.de> <20050310233932.GA26823@austin.ibm.com> <200503111245.39257.arnd@arndb.de> Message-ID: <20050311203044.GC6086@austin.ibm.com> On Fri, Mar 11, 2005 at 12:45:37PM +0100, Arnd Bergmann wrote: > On Freedag 11 März 2005 00:39, Olof Johansson wrote: > > Actually, I think I would prefer the following. It renames current > > __copy_{to,from}_user to __copy_{to,from}_user_inatomic, adds the > > old ones as inlines doing the might_sleep() and calling the inatomics > > afterwards. This way the calls to __copy_{to,from}_user() will be caught > > if called under lock or preemption as well. This is also how i386 does it. > > Yes, that solution is better than mine. However, you missed the case where > __copy_{to,from}_user_inatomic calls __{get,put}_user_size, which in turn > does might_sleep(). I now changed the {get,put}_user path accordingly. > > I have checked that this version boots and does not warn about futex. Doh! Great, thanks. > --- > This implements the __copy_{to,from}_user_inatomic() functions on ppc64. > The only difference between the inatomic and regular version is that > inatomic does not call might_sleep() to detect possible faults while > holding locks/elevated preempt counts. 
> > Signed-off-by: Olof Johansson > Signed-off-by: Arnd Bergmann Acked-by: Olof Johansson From linas at austin.ibm.com Sat Mar 12 08:22:16 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Fri, 11 Mar 2005 15:22:16 -0600 Subject: [PATCH] allow dynamic enablement of EEH In-Reply-To: <1109371857.27183.28.camel@sinatra.austin.ibm.com> References: <1109371857.27183.28.camel@sinatra.austin.ibm.com> Message-ID: <20050311212216.GJ1220@austin.ibm.com> Hi John, I just unearthed this patch .. sorry it took so long ... FWIW, its good to me... Signed-off-by: Linas Vepstas or should that be Approved-by: Linas Vepstas --linas On Fri, Feb 25, 2005 at 04:50:57PM -0600, John Rose was heard to remark: > EEH scans the system I/O adapters at boot for EEH-capabilities. If no > EEH-capable adapters are found, the subsystem is marked disabled for the > life of the system. EEH should allow dynamic enabling of the EEH > subsystem when hotplug-adding an adapter. > > Please apply, if appropriate. > > Thanks- > John > > Signed-off-by: John Rose > > diff -puN arch/ppc64/kernel/eeh.c~04_eeh_add_early arch/ppc64/kernel/eeh.c > --- 2_6_linus_2/arch/ppc64/kernel/eeh.c~04_eeh_add_early 2005-02-25 16:29:51.000000000 -0600 > +++ 2_6_linus_2-johnrose/arch/ppc64/kernel/eeh.c 2005-02-25 16:29:51.000000000 -0600 > @@ -808,7 +808,7 @@ void eeh_add_device_early(struct device_ > struct pci_controller *phb; > struct eeh_early_enable_info info; > > - if (!dn || !eeh_subsystem_enabled) > + if (!dn) > return; > phb = dn->phb; > if (NULL == phb || 0 == phb->buid) { > > _ > From linas at austin.ibm.com Sat Mar 12 12:32:51 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Fri, 11 Mar 2005 19:32:51 -0600 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <421E9D16.3000606@jp.fujitsu.com> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> 
<421E9D16.3000606@jp.fujitsu.com> Message-ID: <20050312013251.GA2609@austin.ibm.com> Hi, Appended is my current draft PCI Error Recovery patch. Per previous conversations, it has moved some of the ppc64-specific error reporting code into generic PCI structures: see changes to include/linux/pci.h and a new file drivers/pci/pci-error.c. Note in particular the pci bus states enumerated in "enum pci_device_io_state"; BenH was suggesting having more of these ... BenH do you want to propose a "final list"? I named the generic pci error recovery routines "peh" because my brain froze. Better suggestions invited. The patch includes error recovery code for the IPR scsi device driver that uses the new generic PCI interfaces. There's also some prototype symbios scsi recovery code, but I haven't had a chance to test it due to hardware issues. Ignore the debug statements. The last chunk of this patch is ppc64 specific code; it uses the new generic interfaces where it can. Please review, comment, criticize and suggest. I am eager to get the pci-generic parts nailed down, and want to really start moving in a direction that would let this go into mainline. --linas p.s. It should apply cleanly to kernel.org 2.6.11 and will recover from pci errors sent to IPR and ethernet on power5 boxes. I haven't tested on power4. -------------- next part -------------- --- include/linux/pci.h.linas-orig 2005-03-09 02:11:40.000000000 -0600 +++ include/linux/pci.h 2005-03-11 18:00:46.000000000 -0600 @@ -659,6 +659,63 @@ struct pci_dynids { unsigned int use_driver_data:1; /* pci_driver->driver_data is used */ }; +/* ---------------------------------------------------------------- */ +/** PCI error recovery state. Whenever the PCI bus state changes, + * the io_state_change() callback will be called to notify the + * device driver of state changes. 
+ */ + +enum pci_device_io_state { + pci_device_io_frozen = 1, /* I/O to device is blocked */ + pci_device_io_thawed, /* I/O te device is (re-)enabled */ + pci_device_io_perm_failure, /* pci card is dead */ +}; + +/** + * PCI Error notifier event flags. + */ +#define PEH_NOTIFY_ERROR 1 + +/** PEH event -- structure holding pci controller data that describes + * a change in the isolation status of a PCI slot. A pointer + * to this struct is passed as the data pointer in a notify callback. + */ +struct peh_event { + struct list_head list; + struct pci_dev *dev; /* affected device */ + enum pci_device_io_state state; /* PCI bus state for the affected device */ + int time_unavail; /* milliseconds until device might be available */ +}; + +/** + * peh_send_failure_event - generate a PCI error event + * @dev pci device + * + * This routine builds a PCI error event which will be delivered + * to all listeners on the peh_notifier_chain. + * + * This routine can be called within an interrupt context; + * the actual event will be delivered in a normal context + * (from a workqueue). + */ +int peh_send_failure_event (struct pci_dev *dev, + enum pci_device_io_state state, + int time_unavail); + +/** + * peh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int peh_register_notifier(struct notifier_block *nb); + +/** + * peh_unregister_notifier - Unregister to an EEH event notifier. 
+ * @nb: notifier block to callback on events + */ +int peh_unregister_notifier(struct notifier_block *nb); + +/* ---------------------------------------------------------------- */ + struct module; struct pci_driver { struct list_head node; @@ -670,6 +727,7 @@ struct pci_driver { int (*suspend) (struct pci_dev *dev, pm_message_t state); /* Device suspended */ int (*resume) (struct pci_dev *dev); /* Device woken up */ int (*enable_wake) (struct pci_dev *dev, u32 state, int enable); /* Enable wake event */ + int (*io_state_change) (struct pci_dev *, enum pci_device_io_state); /* state change */ struct device_driver driver; struct pci_dynids dynids; --- drivers/pci/Makefile.linas-orig 2005-03-09 02:12:50.000000000 -0600 +++ drivers/pci/Makefile 2005-03-11 18:19:29.000000000 -0600 @@ -3,7 +3,7 @@ # obj-y += access.o bus.o probe.o remove.o pci.o quirks.o \ - names.o pci-driver.o search.o pci-sysfs.o \ + names.o pci-driver.o pci-error.o search.o pci-sysfs.o \ rom.o obj-$(CONFIG_PROC_FS) += proc.o --- drivers/pci/pci-error.c.linas-orig 2005-03-11 18:21:20.000000000 -0600 +++ drivers/pci/pci-error.c 2005-03-11 18:23:47.000000000 -0600 @@ -0,0 +1,152 @@ +/* + * pci-error.c + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include + +#undef DEBUG + +/** Overview: + * PEH, or "PCI Error Handling" is a PCI bridge technology for + * dealing with PCI bus errors that can't be dealt with within the + * usual PCI framework, except by check-stopping the CPU. Systems + * that are designed for high-availability/reliability cannot afford + * to crash due to a "mere" PCI error, thus the need for PEH. + * An PEH-capable bridge operates by converting a detected error + * into a "slot freeze", taking the PCI adapter off-line, making + * the slot behave, from the OS'es point of view, as if the slot + * were "empty": all reads return 0xff's and all writes are silently + * ignored. PEH slot isolation events can be triggered by parity + * errors on the address or data busses (e.g. during posted writes), + * which in turn might be caused by low voltage on the bus, dust, + * vibration, humidity, radioactivity or plain-old failed hardware. + * + * Note, however, that one of the leading causes of PEH slot + * freeze events are buggy device drivers, buggy device microcode, + * or buggy device hardware. This is because any attempt by the + * device to bus-master data to a memory address that is not + * assigned to the device will trigger a slot freeze. (The idea + * is to prevent devices-gone-wild from corrupting system memory). + * Buggy hardware/drivers will have a miserable time co-existing + * with PEH. + */ + +/* PEH event workqueue setup. */ +static spinlock_t peh_eventlist_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(peh_eventlist); +static void peh_event_handler(void *); +DECLARE_WORK(peh_event_wq, peh_event_handler, NULL); + +static struct notifier_block *peh_notifier_chain; + +/** + * peh_event_handler - dispatch PEH events. 
The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context. + * + * @dummy - unused + */ +static void peh_event_handler(void *dummy) +{ + unsigned long flags; + struct peh_event *event; + + while (1) { + spin_lock_irqsave(&peh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&peh_eventlist)) { + event = list_entry(peh_eventlist.next, struct peh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&peh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "PEH: Detected PCI bus error on device " + "%s %s\n", + pci_name(event->dev), pci_pretty_name(event->dev)); + + notifier_call_chain (&peh_notifier_chain, + PEH_NOTIFY_ERROR, event); + + pci_dev_put(event->dev); + kfree(event); + } +} + + +/** + * peh_send_failure_event - generate a PCI error event + * @dev pci device + * + * This routine builds a PCI error event which will be delivered + * to all listeners on the peh_notifier_chain. + * + * This routine can be called within an interrupt context; + * the actual event will be delivered in a normal context + * (from a workqueue). 
+ */ +int peh_send_failure_event (struct pci_dev *dev, + enum pci_device_io_state state, + int time_unavail) +{ + unsigned long flags; + struct peh_event *event; + + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + printk (KERN_ERR "PEH: out of memory, event not handled\n"); + return 1; + } + + event->dev = dev; + event->state = state; + event->time_unavail = time_unavail; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&peh_eventlist_lock, flags); + list_add(&event->list, &peh_eventlist); + spin_unlock_irqrestore(&peh_eventlist_lock, flags); + + schedule_work(&peh_event_wq); + + return 0; +} + +/** + * peh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int peh_register_notifier(struct notifier_block *nb) +{ + return notifier_chain_register(&peh_notifier_chain, nb); +} + +/** + * peh_unregister_notifier - Unregister to an EEH event notifier. + * @nb: notifier block to callback on events + */ +int peh_unregister_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&peh_notifier_chain, nb); +} + + --- drivers/scsi/ipr.c.linas-orig 2005-03-09 02:13:17.000000000 -0600 +++ drivers/scsi/ipr.c 2005-03-10 14:54:27.000000000 -0600 @@ -80,6 +80,8 @@ #include #include #include + +#define CONFIG_SCSI_IPR_EEH #include "ipr.h" /* @@ -4993,6 +4995,7 @@ static int ipr_reset_start_bist(struct i return rc; } + /** * ipr_reset_allowed - Query whether or not IOA can be reset * @ioa_cfg: ioa config struct @@ -5306,6 +5309,68 @@ static void ipr_initiate_ioa_reset(struc shutdown_type); } +#ifdef CONFIG_SCSI_IPR_EEH + +/** If the PCI slot is frozen, hold off all i/o + * activity; then, as soon as the slot is available again, + * initiate an adapter reset. 
+ */ +static int ipr_reset_freeze(struct ipr_cmnd *ipr_cmd) +{ + list_add_tail(&ipr_cmd->queue, &ipr_cmd->ioa_cfg->pending_q); + ipr_cmd->done = ipr_reset_ioa_job; + return IPR_RC_JOB_RETURN; +} + +static void ipr_eeh_frozen (struct pci_dev *pdev) +{ + unsigned long flags = 0; + struct ipr_ioa_cfg *ioa_cfg = pci_get_drvdata(pdev); + + spin_lock_irqsave(ioa_cfg->host->host_lock, flags); + _ipr_initiate_ioa_reset(ioa_cfg, ipr_reset_freeze, IPR_SHUTDOWN_NONE); + spin_unlock_irqrestore(ioa_cfg->host->host_lock, flags); +} + +static void ipr_eeh_thawed (struct pci_dev *pdev) +{ + unsigned long flags = 0; + struct ipr_ioa_cfg *ioa_cfg = pci_get_drvdata(pdev); + + spin_lock_irqsave(ioa_cfg->host->host_lock, flags); + _ipr_initiate_ioa_reset(ioa_cfg, ipr_reset_restore_cfg_space, + IPR_SHUTDOWN_NONE); + spin_unlock_irqrestore(ioa_cfg->host->host_lock, flags); +} + +static void ipr_eeh_perm_failure (struct pci_dev *pdev) +{ +#if 0 // XXXXXXXXXXXXXXXXXXXXXXX + ipr_cmd->job_step = ipr_reset_shutdown_ioa; + rc = IPR_RC_JOB_CONTINUE; +#endif +} + +static int ipr_io_state_change (struct pci_dev *pdev, + enum pci_device_io_state state) +{ + switch (state) { + case pci_device_io_frozen: + ipr_eeh_frozen (pdev); + break; + case pci_device_io_thawed: + ipr_eeh_thawed (pdev); + break; + case pci_device_io_perm_failure: + ipr_eeh_perm_failure (pdev); + break; + default: + break; + } + return 0; +} +#endif + /** * ipr_probe_ioa_part2 - Initializes IOAs found in ipr_probe_ioa(..) 
* @ioa_cfg: ioa cfg struct @@ -6015,6 +6080,7 @@ static struct pci_driver ipr_driver = { .id_table = ipr_pci_table, .probe = ipr_probe, .remove = ipr_remove, + .io_state_change = ipr_io_state_change, .driver = { .shutdown = ipr_shutdown, }, --- drivers/scsi/ipr.h.linas-orig 2005-03-09 02:11:12.000000000 -0600 +++ drivers/scsi/ipr.h 2005-03-10 14:54:27.000000000 -0600 @@ -1132,9 +1132,11 @@ struct ipr_ucode_image_header { #define ipr_trace ipr_dbg("%s: %s: Line: %d\n",\ __FILE__, __FUNCTION__, __LINE__) +#undef IPR_DBG_TRACE +#define IPR_DBG_TRACE 1 #if IPR_DBG_TRACE -#define ENTER printk(KERN_INFO IPR_NAME": Entering %s\n", __FUNCTION__) -#define LEAVE printk(KERN_INFO IPR_NAME": Leaving %s\n", __FUNCTION__) +#define ENTER printk(KERN_INFO IPR_NAME": Entering %s jiffies=%lu\n", __FUNCTION__, jiffies) +#define LEAVE printk(KERN_INFO IPR_NAME": Leaving %s jiffies=%lu\n", __FUNCTION__, jiffies) #else #define ENTER #define LEAVE --- drivers/scsi/sym53c8xx_2/sym_glue.c.linas-orig 2005-03-09 02:13:09.000000000 -0600 +++ drivers/scsi/sym53c8xx_2/sym_glue.c 2005-03-11 18:54:19.000000000 -0600 @@ -770,6 +770,11 @@ static irqreturn_t sym53c8xx_intr(int ir struct sym_hcb *np = (struct sym_hcb *)dev_id; if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("["); +#define CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY + if (np->s.io_state != pci_device_io_thawed) + return IRQ_HANDLED; +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ spin_lock_irqsave(np->s.host->host_lock, flags); sym_interrupt(np); @@ -844,6 +849,27 @@ static void sym_eh_done(struct scsi_cmnd */ static void sym_eh_timeout(u_long p) { __sym_eh_done((struct scsi_cmnd *)p, 1); } +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +static void sym_eeh_timeout(u_long p) +{ + struct sym_eh_wait *ep = (struct sym_eh_wait *) p; + if (!ep) + return; + complete(&ep->done); +} + +static void sym_eeh_done(struct sym_eh_wait *ep) +{ + if (!ep) + return; + ep->timed_out = 0; + if (!del_timer(&ep->timer)) + 
return; + + complete(&ep->done); +} +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ + /* * Generic method for our eh processing. * The 'op' argument tells what we have to do. @@ -905,6 +931,35 @@ prepare: sts = 0; break; case SYM_EH_HOST_RESET: +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +printk("duuuuuude attempting symbios recovery\n"); +dump_stack(); + int rc = eeh_slot_is_isolated (np->s.device); + +printk ("duude symbios is isolated ??=%d\n", rc); +printk ("duuude the current io state is %d\n", np->s.io_state); + if (rc) { + struct sym_eh_wait eeh, *eep = &eeh; + np->s.io_reset_wait = eep; + init_completion(&eep->done); + init_timer(&eep->timer); + eep->to_do = SYM_EH_DO_WAIT; + eep->timer.expires = jiffies + (10*HZ); + eep->timer.function = sym_eeh_timeout; + eep->timer.data = (u_long)eep; + eep->timed_out = 1; /* Be pessimistic for once :) */ + add_timer(&eep->timer); + spin_unlock_irq(np->s.host->host_lock); + wait_for_completion(&eep->done); + spin_lock_irq(np->s.host->host_lock); + if (eep->timed_out) { +printk ("duude symbios timed out\n"); + } else { +printk ("duude symbios waited for completion\n"); + } + np->s.io_reset_wait = NULL; + } +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ sym_reset_scsi_bus(np, 0); sym_start_up (np, 1); sts = 0; @@ -1577,6 +1632,23 @@ static int sym_setup_bus_dma_mask(struct return -1; } +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +int sym2_io_state_change (struct pci_dev *pdev, enum pci_device_io_state state) +{ + struct sym_hcb *np = pci_get_drvdata(pdev); +printk ("duude symbios got this state change %d jiffies=%ld\n", state, jiffies); + + np->s.io_state = state; + if (state == pci_device_io_thawed) { + sym_eeh_done (np->s.io_reset_wait); + } + + // XXX if perm frozen, then ...? + + return 0; +} +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ + /* * Host attach and initialisations. 
* @@ -1625,6 +1697,8 @@ static struct Scsi_Host * __devinit sym_ if (!np) goto attach_failed; np->s.device = dev->pdev; + np->s.io_state = pci_device_io_thawed; + np->s.io_reset_wait = NULL; np->bus_dmat = dev->pdev; /* Result in 1 DMA pool per HBA */ host_data->ncb = np; np->s.host = instance; @@ -2359,6 +2433,7 @@ static struct pci_driver sym2_driver = { .id_table = sym2_id_table, .probe = sym2_probe, .remove = __devexit_p(sym2_remove), + .io_state_change = sym2_io_state_change, }; static int __init sym2_init(void) --- drivers/scsi/sym53c8xx_2/sym_glue.h.linas-orig 2005-03-09 02:13:03.000000000 -0600 +++ drivers/scsi/sym53c8xx_2/sym_glue.h 2005-03-10 14:54:27.000000000 -0600 @@ -358,6 +358,10 @@ struct sym_shcb { char chip_name[8]; struct pci_dev *device; + /* pci bus i/o state; waiter for clearing of i/o state */ + enum pci_device_io_state io_state; + struct sym_eh_wait *io_reset_wait; + struct Scsi_Host *host; void __iomem * mmio_va; /* MMIO kernel virtual address */ --- drivers/scsi/sym53c8xx_2/sym_hipd.c.linas-orig 2005-03-09 02:11:01.000000000 -0600 +++ drivers/scsi/sym53c8xx_2/sym_hipd.c 2005-03-11 19:06:17.000000000 -0600 @@ -2836,6 +2836,7 @@ void sym_interrupt (struct sym_hcb *np) u_char istat, istatc; u_char dstat; u_short sist; + u_int icnt; /* * interrupt on the fly ? 
@@ -2877,6 +2878,7 @@ void sym_interrupt (struct sym_hcb *np) sist = 0; dstat = 0; istatc = istat; + icnt = 0; do { if (istatc & SIP) sist |= INW (nc_sist); @@ -2884,6 +2886,14 @@ void sym_interrupt (struct sym_hcb *np) dstat |= INB (nc_dstat); istatc = INB (nc_istat); istat |= istatc; + icnt ++; + if (100 < icnt) { +#define CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY + if(eeh_slot_is_isolated (np->s.device)) + return; +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ + } } while (istatc & (SIP|DIP)); if (DEBUG_FLAGS & DEBUG_TINY) --- include/asm-ppc64/eeh.h.linas-orig 2005-03-09 02:13:21.000000000 -0600 +++ include/asm-ppc64/eeh.h 2005-03-11 18:01:19.000000000 -0600 @@ -23,6 +23,7 @@ #include #include #include +#include #include struct pci_dev; @@ -36,6 +37,11 @@ struct notifier_block; #define EEH_MODE_SUPPORTED (1<<0) #define EEH_MODE_NOCHECK (1<<1) #define EEH_MODE_ISOLATED (1<<2) +#define EEH_MODE_RECOVERING (1<<3) + +/* Max number of EEH freezes allowed before we consider the device + * to be permanently disabled. */ +#define EEH_MAX_ALLOWED_FREEZES 5 void __init eeh_init(void); unsigned long eeh_check_failure(const volatile void __iomem *token, @@ -59,35 +65,82 @@ void eeh_add_device_late(struct pci_dev * eeh_remove_device - undo EEH setup for the indicated pci device * @dev: pci device to be removed * - * This routine should be when a device is removed from a running - * system (e.g. by hotplug or dlpar). + * This routine should be called when a device is removed from + * a running system (e.g. by hotplug or dlpar). It unregisters + * the PCI device from the EEH subsystem. I/O errors affecting + * this device will no longer be detected after this call; thus, + * i/o errors affecting this slot may leave this device unusable. 
 */ void eeh_remove_device(struct pci_dev *); -#define EEH_DISABLE 0 -#define EEH_ENABLE 1 -#define EEH_RELEASE_LOADSTORE 2 -#define EEH_RELEASE_DMA 3 +/** + * eeh_slot_is_isolated -- return non-zero value if slot is frozen + */ +int eeh_slot_is_isolated (struct pci_dev *dev); /** - * Notifier event flags. + * eeh_ioaddr_is_isolated -- return non-zero value if device at + * io address is frozen. */ -#define EEH_NOTIFY_FREEZE 1 +int eeh_ioaddr_is_isolated(const volatile void __iomem *token); -/** EEH event -- structure holding pci slot data that describes - * a change in the isolation status of a PCI slot. A pointer - * to this struct is passed as the data pointer in a notify callback. - */ -struct eeh_event { - struct list_head list; - struct pci_dev *dev; - struct device_node *dn; - int reset_state; -}; - -/** Register to find out about EEH events. */ -int eeh_register_notifier(struct notifier_block *nb); -int eeh_unregister_notifier(struct notifier_block *nb); +/** + * eeh_slot_error_detail -- record an EEH error condition to the log + * @severity: 1 if temporary, 2 if permanent failure. + * + * Obtains the EEH error details from the RTAS subsystem, + * and then logs these details with the RTAS error log system. + */ +void eeh_slot_error_detail (struct device_node *dn, int severity); + +/** + * rtas_set_slot_reset -- unfreeze a frozen slot + * + * Clear the EEH-frozen condition on a slot. This routine + * does this by asserting the PCI #RST line for 1/4 of + * a second; this routine will sleep while the adapter is + * being reset. + */ +void rtas_set_slot_reset (struct device_node *dn); + +/** rtas_pci_slot_reset raises/lowers the pci #RST line + * state: 1/0 to raise/lower the #RST + * + * Clear the EEH-frozen condition on a slot. This routine + * asserts the PCI #RST line if the 'state' argument is '1', + * and drops the #RST line if 'state' is '0'. This routine is + * safe to call in an interrupt context. 
+ * + */ +void rtas_pci_slot_reset(struct device_node *dn, int state); +void eeh_pci_slot_reset(struct pci_dev *dev, int state); + +/** eeh_pci_slot_availability -- Indicates whether a PCI + * slot is ready to be used. After a PCI reset, it may take a while + * for the PCI fabric to fully reset the communications path to the + * given PCI card. This routine can be used to determine how long + * to wait before a PCI slot might become usable. + * + * This routine returns how long to wait (in milliseconds) before + * the slot is expected to be usable. A value of zero means the + * slot is immediately usable. A negative value means that the + * slot is permanently disabled. + */ +int eeh_pci_slot_availability(struct pci_dev *dev); + +/** Restore device configuration info across device resets. + */ +void eeh_restore_bars(struct device_node *); +void eeh_pci_restore_bars(struct pci_dev *dev); + +/** + * rtas_configure_bridge -- firmware initialization of pci bridge + * + * Ask the firmware to configure any PCI bridge devices + * located behind the indicated node. Required after a + * pci device reset. + */ +void rtas_configure_bridge(struct device_node *dn); /** * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. --- include/asm-ppc64/prom.h.linas-orig 2005-03-09 02:13:03.000000000 -0600 +++ include/asm-ppc64/prom.h 2005-03-10 14:54:27.000000000 -0600 @@ -119,6 +119,7 @@ struct property { */ struct pci_controller; struct iommu_table; +struct eeh_recovery_ops; struct device_node { char *name; @@ -137,8 +138,12 @@ struct device_node { int devfn; /* for pci devices */ int eeh_mode; /* See eeh.h for possible EEH_MODEs */ int eeh_config_addr; + int eeh_check_count; /* number of times device driver ignored error */ + int eeh_freeze_count; /* number of times this device froze up. 
*/ + int eeh_is_bridge; /* device is pci-to-pci bridge */ struct pci_controller *phb; /* for pci devices */ struct iommu_table *iommu_table; /* for phb's or bridges */ + u32 config_space[16]; /* saved PCI config space */ struct property *properties; struct device_node *parent; --- include/asm-ppc64/rtas.h.linas-orig 2005-03-09 02:13:00.000000000 -0600 +++ include/asm-ppc64/rtas.h 2005-03-10 14:54:27.000000000 -0600 @@ -243,4 +243,6 @@ extern unsigned long rtas_rmo_buf; #define GLOBAL_INTERRUPT_QUEUE 9005 +extern int rtas_write_config(struct device_node *dn, int where, int size, u32 val); + #endif /* _PPC64_RTAS_H */ --- arch/ppc64/kernel/eeh.c.linas-orig 2005-03-09 02:12:13.000000000 -0600 +++ arch/ppc64/kernel/eeh.c 2005-03-11 18:58:50.000000000 -0600 @@ -17,16 +17,17 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ -#include +#include #include +#include #include -#include #include #include #include #include #include #include +#include #include #include #include @@ -49,8 +50,8 @@ * were "empty": all reads return 0xff's and all writes are silently * ignored. EEH slot isolation events can be triggered by parity * errors on the address or data busses (e.g. during posted writes), - * which in turn might be caused by dust, vibration, humidity, - * radioactivity or plain-old failed hardware. + * which in turn might be caused by low voltage on the bus, dust, + * vibration, humidity, radioactivity or plain-old failed hardware. * * Note, however, that one of the leading causes of EEH slot * freeze events are buggy device drivers, buggy device microcode, @@ -75,22 +76,13 @@ #define BUID_HI(buid) ((buid) >> 32) #define BUID_LO(buid) ((buid) & 0xffffffff) -/* EEH event workqueue setup. 
*/ -static DEFINE_SPINLOCK(eeh_eventlist_lock); -LIST_HEAD(eeh_eventlist); -static void eeh_event_handler(void *); -DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); - -static struct notifier_block *eeh_notifier_chain; - /* * If a device driver keeps reading an MMIO register in an interrupt * handler after a slot isolation event has occurred, we assume it * is broken and panic. This sets the threshold for how many read * attempts we allow before panicking. */ -#define EEH_MAX_FAILS 1000 -static atomic_t eeh_fail_count; +#define EEH_MAX_FAILS 100000 /* RTAS tokens */ static int ibm_set_eeh_option; @@ -107,6 +99,10 @@ static DEFINE_SPINLOCK(slot_errbuf_lock) static int eeh_error_buf_size; /* System monitoring statistics */ +static DEFINE_PER_CPU(unsigned long, no_device); +static DEFINE_PER_CPU(unsigned long, no_dn); +static DEFINE_PER_CPU(unsigned long, no_cfg_addr); +static DEFINE_PER_CPU(unsigned long, ignored_check); static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); @@ -225,9 +221,9 @@ pci_addr_cache_insert(struct pci_dev *de while (*p) { parent = *p; piar = rb_entry(parent, struct pci_io_addr_range, rb_node); - if (alo < piar->addr_lo) { + if (ahi < piar->addr_lo) { p = &parent->rb_left; - } else if (ahi > piar->addr_hi) { + } else if (alo > piar->addr_hi) { p = &parent->rb_right; } else { if (dev != piar->pcidev || @@ -245,6 +241,11 @@ pci_addr_cache_insert(struct pci_dev *de piar->addr_hi = ahi; piar->pcidev = dev; piar->flags = flags; + +#ifdef DEBUG + printk (KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", + alo, ahi, pci_name (dev)); +#endif rb_link_node(&piar->rb_node, parent, p); rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); @@ -369,6 +370,7 @@ void pci_addr_cache_remove_device(struct */ void __init pci_addr_cache_build(void) { + struct device_node *dn; struct pci_dev *dev = NULL; 
spin_lock_init(&pci_io_addr_cache_root.piar_lock); @@ -379,6 +381,17 @@ void __init pci_addr_cache_build(void) continue; } pci_addr_cache_insert_device(dev); + + /* Save the BAR's; firmware doesn't restore these after EEH reset */ + dn = pci_device_to_OF_node(dev); + if (dn) { + int i; + for (i = 0; i < 16; i++) + pci_read_config_dword(dev, i * 4, &dn->config_space[i]); + + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) + dn->eeh_is_bridge = 1; + } } #ifdef DEBUG @@ -390,24 +403,32 @@ void __init pci_addr_cache_build(void) /* --------------------------------------------------------------- */ /* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ -/** - * eeh_register_notifier - Register to find out about EEH events. - * @nb: notifier block to callback on events - */ -int eeh_register_notifier(struct notifier_block *nb) +void eeh_slot_error_detail (struct device_node *dn, int severity) { - return notifier_chain_register(&eeh_notifier_chain, nb); -} + unsigned long flags; + int rc; -/** - * eeh_unregister_notifier - Unregister to an EEH event notifier. 
- * @nb: notifier block to callback on events - */ -int eeh_unregister_notifier(struct notifier_block *nb) -{ - return notifier_chain_unregister(&eeh_notifier_chain, nb); + if (!dn) return; + + /* Log the error with the rtas logger */ + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + severity); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); } +EXPORT_SYMBOL(eeh_slot_error_detail); + /** * read_slot_reset_state - Read the reset state of a device node's slot * @dn: device node to read @@ -422,6 +443,7 @@ static int read_slot_reset_state(struct outputs = 4; } else { token = ibm_read_slot_reset_state; + rets[2] = 0; /* fake PE Unavailable info */ outputs = 3; } @@ -430,75 +452,8 @@ static int read_slot_reset_state(struct } /** - * eeh_panic - call panic() for an eeh event that cannot be handled. - * The philosophy of this routine is that it is better to panic and - * halt the OS than it is to risk possible data corruption by - * oblivious device drivers that don't know better. - * - * @dev pci device that had an eeh event - * @reset_state current reset state of the device slot - */ -static void eeh_panic(struct pci_dev *dev, int reset_state) -{ - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt the system - * in light of potential corruption, we can use it here. 
- */ - if (panic_on_oops) - panic("EEH: MMIO failure (%d) on device:%s %s\n", reset_state, - pci_name(dev), pci_pretty_name(dev)); - else { - __get_cpu_var(ignored_failures)++; - printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s %s\n", - reset_state, pci_name(dev), pci_pretty_name(dev)); - } -} - -/** - * eeh_event_handler - dispatch EEH events. The detection of a frozen - * slot can occur inside an interrupt, where it can be hard to do - * anything about it. The goal of this routine is to pull these - * detection events out of the context of the interrupt handler, and - * re-dispatch them for processing at a later time in a normal context. - * - * @dummy - unused - */ -static void eeh_event_handler(void *dummy) -{ - unsigned long flags; - struct eeh_event *event; - - while (1) { - spin_lock_irqsave(&eeh_eventlist_lock, flags); - event = NULL; - if (!list_empty(&eeh_eventlist)) { - event = list_entry(eeh_eventlist.next, struct eeh_event, list); - list_del(&event->list); - } - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - if (event == NULL) - break; - - printk(KERN_INFO "EEH: MMIO failure (%d), notifiying device " - "%s %s\n", event->reset_state, - pci_name(event->dev), pci_pretty_name(event->dev)); - - atomic_set(&eeh_fail_count, 0); - notifier_call_chain (&eeh_notifier_chain, - EEH_NOTIFY_FREEZE, event); - - __get_cpu_var(slot_resets)++; - - pci_dev_put(event->dev); - kfree(event); - } -} - -/** - * eeh_token_to_phys - convert EEH address token to phys address - * @token i/o token, should be address in the form 0xE.... + * eeh_token_to_phys - convert I/O address to phys address + * @token i/o address, should be address in the form 0xA.... 
*/ static inline unsigned long eeh_token_to_phys(unsigned long token) { @@ -513,6 +468,18 @@ static inline unsigned long eeh_token_to return pa | (token & (PAGE_SIZE-1)); } + +static inline struct pci_dev * eeh_find_pci_dev(struct device_node *dn) +{ + struct pci_dev *dev = NULL; + for_each_pci_dev(dev) { + if (pci_device_to_OF_node(dev) == dn) + return dev; + } + return NULL; +} + + /** * eeh_dn_check_failure - check if all 1's data is due to EEH slot freeze * @dn device node @@ -528,29 +495,33 @@ static inline unsigned long eeh_token_to * * It is safe to call this routine in an interrupt context. */ +extern void disable_irq_nosync(unsigned int); + int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) { int ret; int rets[3]; - unsigned long flags; - int rc, reset_state; - struct eeh_event *event; + enum pci_device_io_state state; __get_cpu_var(total_mmio_ffs)++; if (!eeh_subsystem_enabled) return 0; - if (!dn) + if (!dn) { + __get_cpu_var(no_dn)++; return 0; + } /* Access to IO BARs might get this far and still not want checking. */ if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || dn->eeh_mode & EEH_MODE_NOCHECK) { + __get_cpu_var(ignored_check)++; return 0; } if (!dn->eeh_config_addr) { + __get_cpu_var(no_cfg_addr)++; return 0; } @@ -559,12 +530,18 @@ int eeh_dn_check_failure(struct device_n * slot, we know it's bad already, we don't need to check... */ if (dn->eeh_mode & EEH_MODE_ISOLATED) { - atomic_inc(&eeh_fail_count); - if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { + dn->eeh_check_count ++; + if (dn->eeh_check_count >= EEH_MAX_FAILS) { + printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n", + dn->eeh_check_count); + dump_stack(); /* re-read the slot reset state */ if (read_slot_reset_state(dn, rets) != 0) rets[0] = -1; /* reset state unknown */ - eeh_panic(dev, rets[0]); + + /* If we are here, then we hit an infinite loop. Stop. 
*/ + panic("EEH: MMIO halt (%d) on device:%s %s\n", rets[0], + pci_name(dev), pci_pretty_name(dev)); } return 0; } @@ -577,53 +554,42 @@ int eeh_dn_check_failure(struct device_n * In any case they must share a common PHB. */ ret = read_slot_reset_state(dn, rets); - if (!(ret == 0 && rets[1] == 1 && (rets[0] == 2 || rets[0] == 4))) { + if (!(ret == 0 && ((rets[1] == 1 && (rets[0] == 2 || rets[0] >= 4)) + || (rets[0] == 5)))) { __get_cpu_var(false_positives)++; return 0; } - /* prevent repeated reports of this failure */ - dn->eeh_mode |= EEH_MODE_ISOLATED; + /* Note that empty slots will fail; empty slots don't have children... */ + if ((rets[0] == 5) && (dn->child == NULL)) { + __get_cpu_var(false_positives)++; + return 0; + } - reset_state = rets[0]; + /* Prevent repeated reports of this failure */ + dn->eeh_mode |= EEH_MODE_ISOLATED; - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); + /* Some devices go crazy if irq's are not ack'ed; disable irq now */ + disable_irq_nosync (dev->irq); +// get_irq_desc (dev->irq)->handler->disable (dev->irq); + + __get_cpu_var(slot_resets)++; - rc = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, dn->eeh_config_addr, - BUID_HI(dn->phb->buid), - BUID_LO(dn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - 1 /* Temporary Error */); + if (!dev) + dev = eeh_find_pci_dev (dn); - if (rc == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); - spin_unlock_irqrestore(&slot_errbuf_lock, flags); + state = pci_device_io_thawed; + if ((rets[0] == 2) || (rets[0] == 4)) + state = pci_device_io_frozen; + if (rets[0] == 5) + state = pci_device_io_perm_failure; - printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", - rets[0], dn->name, dn->full_name); - event = kmalloc(sizeof(*event), GFP_ATOMIC); - if (event == NULL) { - eeh_panic(dev, reset_state); - return 1; - } - - event->dev = dev; - event->dn = dn; - event->reset_state = reset_state; - - /* We may or may not be 
called in an interrupt context */ - spin_lock_irqsave(&eeh_eventlist_lock, flags); - list_add(&event->list, &eeh_eventlist); - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + peh_send_failure_event (dev, state, rets[2]); /* Most EEH events are due to device driver bugs. Having * a stack trace will help the device-driver authors figure * out what happened. So print that out. */ - dump_stack(); - schedule_work(&eeh_event_wq); + if (rets[0] != 5) dump_stack(); return 0; } @@ -635,7 +601,6 @@ EXPORT_SYMBOL(eeh_dn_check_failure); * @token i/o token, should be address in the form 0xA.... * @val value, should be all 1's (XXX why do we need this arg??) * - * Check for an eeh failure at the given token address. * Check for an EEH failure at the given token address. Call this * routine if the result of a read was all 0xff's and you want to * find out if this is due to an EEH slot freeze event. This routine @@ -643,6 +608,7 @@ EXPORT_SYMBOL(eeh_dn_check_failure); * * Note this routine is safe to call in an interrupt context. */ + unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val) { unsigned long addr; @@ -652,8 +618,10 @@ unsigned long eeh_check_failure(const vo /* Finding the phys addr + pci device; this is pretty quick. 
*/ addr = eeh_token_to_phys((unsigned long __force) token); dev = pci_get_device_by_addr(addr); - if (!dev) + if (!dev) { + __get_cpu_var(no_device)++; return val; + } dn = pci_device_to_OF_node(dev); eeh_dn_check_failure (dn, dev); @@ -664,6 +632,235 @@ unsigned long eeh_check_failure(const vo EXPORT_SYMBOL(eeh_check_failure); +/* ------------------------------------------------------------- */ +/* The code below deals with error recovery */ + +int +eeh_slot_is_isolated(struct pci_dev *dev) +{ + struct device_node *dn; + dn = pci_device_to_OF_node(dev); + return (dn->eeh_mode & EEH_MODE_ISOLATED); +} + +int +eeh_ioaddr_is_isolated(const volatile void __iomem *token) +{ + unsigned long addr; + struct pci_dev *dev; + int rc; + + addr = eeh_token_to_phys((unsigned long __force) token); + dev = pci_get_device_by_addr(addr); + if (!dev) + return 0; + rc = eeh_slot_is_isolated(dev); + pci_dev_put(dev); + return rc; +} + +/** eeh_pci_slot_reset -- raises/lowers the pci #RST line + * state: 1/0 to raise/lower the #RST + */ +void +eeh_pci_slot_reset(struct pci_dev *dev, int state) +{ + struct device_node *dn = pci_device_to_OF_node(dev); + rtas_pci_slot_reset (dn, state); +} + +/** Return negative value if a permanent error, else return + * a number of milliseconds to wait until the PCI slot is + * ready to be used. 
+ */ +static int +eeh_slot_availability(struct device_node *dn) +{ + int rc; + int rets[3]; + + rc = read_slot_reset_state(dn, rets); + if (rc) return rc; + + if (rets[1] == 0) return -1; /* EEH is not supported */ + if (rets[0] == 0) return 0; /* Oll Korrect */ + if (rets[0] == 5) { + if (rets[2] == 0) return -1; /* permanently unavailable */ + return rets[2]; /* number of millisecs to wait */ + } + return -1; +} + +int +eeh_pci_slot_availability(struct pci_dev *dev) +{ + struct device_node *dn = pci_device_to_OF_node(dev); + if (!dn) return -1; + + BUG_ON (dn->phb==NULL); + if (dn->phb==NULL) { + printk (KERN_ERR "EEH, checking on slot with no phb dn=%s dev=%s:%s\n", + dn->full_name, pci_name(dev), pci_pretty_name (dev)); + return -1; + } + return eeh_slot_availability (dn); +} + +void +rtas_pci_slot_reset(struct device_node *dn, int state) +{ + int rc; + + if (!dn) + return; + if (!dn->phb) { + printk (KERN_WARNING "EEH: in slot reset, device node %s has no phb\n", dn->full_name); + return; + } + + dn->eeh_mode |= EEH_MODE_RECOVERING; + rc = rtas_call(ibm_set_slot_reset,4,1, NULL, + dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid), + state); + if (rc) { + printk (KERN_WARNING "EEH: Unable to reset the failed slot, (%d) #RST=%d\n", rc, state); + return; + } + + if (state == 0) + dn->eeh_mode &= ~(EEH_MODE_RECOVERING|EEH_MODE_ISOLATED); +} + +/** rtas_set_slot_reset -- assert the pci #RST line for 1/4 second + * dn -- device node to be reset. + */ + +void +rtas_set_slot_reset(struct device_node *dn) +{ + int i, rc; + + rtas_pci_slot_reset (dn, 1); + + /* The PCI bus requires that the reset be held high for at least + * 100 milliseconds. We wait a bit longer 'just in case'. */ + +#define PCI_BUS_RST_HOLD_TIME_MSEC 250 + msleep (PCI_BUS_RST_HOLD_TIME_MSEC); + rtas_pci_slot_reset (dn, 0); + + /* After a PCI slot has been reset, the PCI Express spec requires + * a 1.5 second idle time for the bus to stabilize, before starting + * up traffic. 
*/ +#define PCI_BUS_SETTLE_TIME_MSEC 1800 + msleep (PCI_BUS_SETTLE_TIME_MSEC); + + /* Now double check with the firmware to make sure the device is + * ready to be used; if not, wait for recovery. */ + for (i=0; i<10; i++) { + rc = eeh_slot_availability (dn); + if (rc <= 0) return; + + msleep (rc+100); + } +} + +EXPORT_SYMBOL(rtas_set_slot_reset); + +void +rtas_configure_bridge(struct device_node *dn) +{ + int token = rtas_token ("ibm,configure-bridge"); + int rc; + + if (token == RTAS_UNKNOWN_SERVICE) + return; + rc = rtas_call(token,3,1, NULL, + dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid)); + if (rc) { + printk (KERN_WARNING "EEH: Unable to configure device bridge (%d) for %s\n", + rc, dn->full_name); + } +} + +EXPORT_SYMBOL(rtas_configure_bridge); + +/* ------------------------------------------------------- */ +/** Save and restore of PCI BARs + * + * Although firmware will set up BARs during boot, it doesn't + * set up device BAR's after a device reset, although it will, + * if requested, set up bridge configuration. Thus, we need to + * configure the PCI devices ourselves. Config-space setup is + * stored in the PCI structures which are normally deleted during + * device removal. Thus, the "save" routine references the + * structures so that they aren't deleted. + */ + +/** + * __restore_bars - Restore the Base Address Registers + * Loads the PCI configuration space base address registers, + * the expansion ROM base address, the latency timer, and etc. + * from the saved values in the device node. 
+ */ +static inline void __restore_bars (struct device_node *dn) +{ + int i; + + if (NULL==dn->phb) return; + for (i=4; i<10; i++) { + rtas_write_config(dn, i*4, 4, dn->config_space[i]); + } + + /* 12 == Expansion ROM Address */ + rtas_write_config(dn, 12*4, 4, dn->config_space[12]); + +#define SAVED_BYTE(OFF) (((u8 *)(dn->config_space))[OFF]) + + rtas_write_config (dn, PCI_CACHE_LINE_SIZE, 1, + SAVED_BYTE(PCI_CACHE_LINE_SIZE)); + + rtas_write_config (dn, PCI_LATENCY_TIMER, 1, + SAVED_BYTE(PCI_LATENCY_TIMER)); + + rtas_write_config (dn, PCI_INTERRUPT_LINE, 1, + SAVED_BYTE(PCI_INTERRUPT_LINE)); +} + +/** + * eeh_restore_bars - restore the PCI config space info + */ +void eeh_restore_bars(struct device_node *dn) +{ + if (! dn->eeh_is_bridge) + __restore_bars (dn); + + if (dn->child) + eeh_restore_bars (dn->child); +#if DO_SIBLINGS + if (dn->sibling) + eeh_restore_bars (dn->sibling); +#endif +} + +void eeh_pci_restore_bars(struct pci_dev *dev) +{ + struct device_node *dn = pci_device_to_OF_node(dev); + eeh_restore_bars (dn); +} + +/* ------------------------------------------------------------- */ +/* The code below deals with enabling EEH for devices during the + * early boot sequence. EEH must be enabled before any PCI probing + * can be done. 
+ */ + +#define EEH_ENABLE 1 + struct eeh_early_enable_info { unsigned int buid_hi; unsigned int buid_lo; @@ -682,6 +879,8 @@ static void *early_enable_eeh(struct dev int enable; dn->eeh_mode = 0; + dn->eeh_check_count = 0; + dn->eeh_freeze_count = 0; if (status && strcmp(status, "ok") != 0) return NULL; /* ignore devices with bad status */ @@ -743,7 +942,7 @@ static void *early_enable_eeh(struct dev dn->full_name); } - return NULL; + return NULL; } /* @@ -824,11 +1023,13 @@ void eeh_add_device_early(struct device_ struct pci_controller *phb; struct eeh_early_enable_info info; - if (!dn || !eeh_subsystem_enabled) + if (!dn) return; phb = dn->phb; if (NULL == phb || 0 == phb->buid) { - printk(KERN_WARNING "EEH: Expected buid but found none\n"); + printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", + dn->full_name); + dump_stack(); return; } @@ -847,6 +1048,9 @@ EXPORT_SYMBOL(eeh_add_device_early); */ void eeh_add_device_late(struct pci_dev *dev) { + int i; + struct device_node *dn; + if (!dev || !eeh_subsystem_enabled) return; @@ -856,6 +1060,14 @@ void eeh_add_device_late(struct pci_dev #endif pci_addr_cache_insert_device (dev); + + /* Save the BAR's; firmware doesn't restore these after EEH reset */ + dn = pci_device_to_OF_node(dev); + for (i = 0; i < 16; i++) + pci_read_config_dword(dev, i * 4, &dn->config_space[i]); + + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) + dn->eeh_is_bridge = 1; } EXPORT_SYMBOL(eeh_add_device_late); @@ -885,12 +1097,17 @@ static int proc_eeh_show(struct seq_file unsigned int cpu; unsigned long ffs = 0, positives = 0, failures = 0; unsigned long resets = 0; + unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; for_each_cpu(cpu) { ffs += per_cpu(total_mmio_ffs, cpu); positives += per_cpu(false_positives, cpu); failures += per_cpu(ignored_failures, cpu); resets += per_cpu(slot_resets, cpu); + no_dev += per_cpu(no_device, cpu); + no_dn += per_cpu(no_dn, cpu); + no_cfg += per_cpu(no_cfg_addr, cpu); + no_check += 
per_cpu(ignored_check, cpu); } if (0 == eeh_subsystem_enabled) { @@ -898,13 +1115,17 @@ static int proc_eeh_show(struct seq_file seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); } else { seq_printf(m, "EEH Subsystem is enabled\n"); - seq_printf(m, "eeh_total_mmio_ffs=%ld\n" + seq_printf(m, + "no device=%ld\n" + "no device node=%ld\n" + "no config address=%ld\n" + "check not wanted=%ld\n" + "eeh_total_mmio_ffs=%ld\n" "eeh_false_positives=%ld\n" "eeh_ignored_failures=%ld\n" - "eeh_slot_resets=%ld\n" - "eeh_fail_count=%d\n", - ffs, positives, failures, resets, - eeh_fail_count.counter); + "eeh_slot_resets=%ld\n", + no_dev, no_dn, no_cfg, no_check, + ffs, positives, failures, resets); } return 0; --- arch/ppc64/kernel/pSeries_pci.c.linas-orig 2005-03-09 02:13:08.000000000 -0600 +++ arch/ppc64/kernel/pSeries_pci.c 2005-03-10 14:54:27.000000000 -0600 @@ -101,7 +101,7 @@ static int rtas_pci_read_config(struct p return PCIBIOS_DEVICE_NOT_FOUND; } -static int rtas_write_config(struct device_node *dn, int where, int size, u32 val) +int rtas_write_config(struct device_node *dn, int where, int size, u32 val) { unsigned long buid, addr; int ret; --- drivers/pci/hotplug/rpaphp.h.linas-orig 2005-03-09 02:11:19.000000000 -0600 +++ drivers/pci/hotplug/rpaphp.h 2005-03-10 14:54:27.000000000 -0600 @@ -118,7 +118,8 @@ extern int rpaphp_enable_pci_slot(struct extern int register_pci_slot(struct slot *slot); extern int rpaphp_unconfig_pci_adapter(struct slot *slot); extern int rpaphp_get_pci_adapter_status(struct slot *slot, int is_init, u8 * value); -extern struct hotplug_slot *rpaphp_find_hotplug_slot(struct pci_dev *dev); +extern void init_eeh_handler (void); +extern void exit_eeh_handler (void); /* rpaphp_core.c */ extern int rpaphp_add_slot(struct device_node *dn); --- drivers/pci/hotplug/rpaphp_core.c.linas-orig 2005-03-09 02:12:58.000000000 -0600 +++ drivers/pci/hotplug/rpaphp_core.c 2005-03-10 14:54:27.000000000 -0600 @@ -460,12 +460,18 @@ static int __init rpaphp_init(void) { 
info(DRIVER_DESC " version: " DRIVER_VERSION "\n"); + /* Get set to handle EEH events. */ + init_eeh_handler(); + /* read all the PRA info from the system */ return init_rpa(); } static void __exit rpaphp_exit(void) { + /* Let EEH know we are going away. */ + exit_eeh_handler(); + cleanup_slots(); } --- drivers/pci/hotplug/rpaphp_pci.c.linas-orig 2005-03-09 02:11:01.000000000 -0600 +++ drivers/pci/hotplug/rpaphp_pci.c 2005-03-11 18:40:28.000000000 -0600 @@ -22,8 +22,13 @@ * Send feedback to * */ +#include +#include +#include #include +#include #include +#include #include #include #include "../pci.h" /* for pci_add_new_bus */ @@ -63,6 +68,7 @@ int rpaphp_claim_resource(struct pci_dev root ? "Address space collision on" : "No parent found for", resource, dtype, pci_name(dev), res->start, res->end); + dump_stack(); } return err; } @@ -188,6 +194,19 @@ rpaphp_fixup_new_pci_devices(struct pci_ static int rpaphp_pci_config_bridge(struct pci_dev *dev); +static void rpaphp_eeh_add_bus_device(struct pci_bus *bus) +{ + struct pci_dev *dev; + list_for_each_entry(dev, &bus->devices, bus_list) { + eeh_add_device_late(dev); + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) { + struct pci_bus *subbus = dev->subordinate; + if (bus) + rpaphp_eeh_add_bus_device (subbus); + } + } +} + /***************************************************************************** rpaphp_pci_config_slot() will configure all devices under the given slot->dn and return the the first pci_dev. 
@@ -215,6 +234,8 @@ rpaphp_pci_config_slot(struct device_nod } if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) rpaphp_pci_config_bridge(dev); + + rpaphp_eeh_add_bus_device(bus); } return dev; } @@ -223,7 +244,6 @@ static int rpaphp_pci_config_bridge(stru { u8 sec_busno; struct pci_bus *child_bus; - struct pci_dev *child_dev; dbg("Enter %s: BRIDGE dev=%s\n", __FUNCTION__, pci_name(dev)); @@ -240,11 +260,7 @@ static int rpaphp_pci_config_bridge(stru /* do pci_scan_child_bus */ pci_scan_child_bus(child_bus); - list_for_each_entry(child_dev, &child_bus->devices, bus_list) { - eeh_add_device_late(child_dev); - } - - /* fixup new pci devices without touching bus struct */ + /* Fixup new pci devices without touching bus struct */ rpaphp_fixup_new_pci_devices(child_bus, 0); /* Make the discovered devices available */ @@ -282,7 +298,7 @@ static void print_slot_pci_funcs(struct return; } #else -static void print_slot_pci_funcs(struct slot *slot) +static inline void print_slot_pci_funcs(struct slot *slot) { return; } @@ -364,7 +380,6 @@ static void rpaphp_eeh_remove_bus_device if (pdev) rpaphp_eeh_remove_bus_device(pdev); } - } return; } @@ -566,36 +581,265 @@ exit: return retval; } -struct hotplug_slot *rpaphp_find_hotplug_slot(struct pci_dev *dev) +/** + * rpaphp_search_bus_for_dev - return 1 if device is under this bus, else 0 + * @bus: the bus to search for this device. + * @dev: the pci device we are looking for. + */ +static int rpaphp_search_bus_for_dev (struct pci_bus *bus, struct pci_dev *dev) +{ + struct list_head *ln; + + if (!bus) return 0; + + for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) { + struct pci_dev *pdev = pci_dev_b(ln); + if (pdev == dev) + return 1; + if (pdev->subordinate) { + int rc; + rc = rpaphp_search_bus_for_dev (pdev->subordinate, dev); + if (rc) + return 1; + } + } + return 0; +} + +/** + * rpaphp_find_slot - find and return the slot holding the device + * @dev: pci device for which we want the slot structure. 
+ */ +static struct slot *rpaphp_find_slot(struct pci_dev *dev) { - struct list_head *tmp, *n; - struct slot *slot; + struct list_head *tmp, *n; + struct slot *slot; list_for_each_safe(tmp, n, &rpaphp_slot_head) { struct pci_bus *bus; - struct list_head *ln; slot = list_entry(tmp, struct slot, rpaphp_slot_list); - if (slot->bridge == NULL) { - if (slot->dev_type == PCI_DEV) { - printk(KERN_WARNING "PCI slot missing bridge %s %s \n", - slot->name, slot->location); - } + + /* PHB's don't have bridges. */ + if (slot->bridge == NULL) continue; - } + + /* The PCI device could be the slot itself. */ + if (slot->bridge == dev) + return slot; bus = slot->bridge->subordinate; if (!bus) { + printk (KERN_WARNING "PCI bridge is missing bus: %s %s\n", + pci_name (slot->bridge), pci_pretty_name (slot->bridge)); continue; /* should never happen? */ } - for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) { - struct pci_dev *pdev = pci_dev_b(ln); - if (pdev == dev) - return slot->hotplug_slot; - } + + if (rpaphp_search_bus_for_dev (bus, dev)) + return slot; } + return NULL; +} + +/** get_phb_of_device -- find the pci controller for the device + * @dev the pci device + * This routine returns a pointer to the device node that + * describes the pci controller for the indicated slot. + */ +static struct device_node * +get_phb_of_device (struct pci_dev *dev) +{ + struct device_node *dn; + struct pci_bus *bus; + + while (1) { + bus = dev->bus; + if (!bus) + break; + dn = pci_bus_to_OF_node(bus); + + if (dn->phb) + return dn; + + dev = bus->self; + BUG_ON (dev==NULL); + if (dev == NULL) + return NULL; + } return NULL; } -EXPORT_SYMBOL_GPL(rpaphp_find_hotplug_slot); +/* ------------------------------------------------------- */ +/** + * handle_eeh_events -- reset a PCI device after hard lockup. 
+ * + * pSeries systems will isolate a PCI slot if the PCI-Host + * bridge detects address or data parity errors, DMA's + * occuring to wild addresses (which usually happen due to + * bugs in device drivers or in PCI adapter firmware). + * Slot isolations also occur if #SERR, #PERR or other misc + * PCI-related errors are detected. + * + * Recovery process consists of unplugging the device driver + * (which generated hotplug events to userspace), then issuing + * a PCI #RST to the device, then reconfiguring the PCI config + * space for all bridges & devices under this slot, and then + * finally restarting the device drivers (which cause a second + * set of hotplug events to go out to userspace). + */ + +int eeh_reset_device (struct pci_dev *dev, struct device_node *dn, int reconfig) +{ + struct slot *frozen_slot= NULL; + + if (!dev) + return 1; + + if (reconfig) + frozen_slot = rpaphp_find_slot(dev); + + if (reconfig && frozen_slot) rpaphp_unconfig_pci_adapter (frozen_slot); + + /* Reset the pci controller. (Asserts RST#; resets config space). + * Reconfigure bridges and devices */ + rtas_set_slot_reset (dn->child); + rtas_configure_bridge(dn); + eeh_restore_bars(dn->child); + enable_irq (dev->irq); + + /* Give the system 5 seconds to finish running the user-space + * hotplug scripts, e.g. ifdown for ethernet. Yes, this is a hack, + * but if we don't do this, weird things happen. + */ + if (reconfig && frozen_slot) { + ssleep (5); + rpaphp_enable_pci_slot (frozen_slot); + } + return 0; +} + +/* The longest amount of time to wait for a pci device + * to come back on line, in seconds. 
+ */ +#define MAX_WAIT_FOR_RECOVERY 15 + +int handle_eeh_events (struct notifier_block *self, + unsigned long reason, void *ev) +{ + int freeze_count=0; + struct device_node *frozen_device; + struct peh_event *event = ev; + struct pci_dev *dev = event->dev; + int perm_failure = 0; + int rc; + + if (!dev) + { + printk ("EEH: EEH error caught, but no PCI device specified!\n"); + return 1; + } + + frozen_device = get_phb_of_device (dev); + + if (!frozen_device) + { + printk (KERN_ERR "EEH: Cannot find PCI conroller for %s %s\n", + pci_name(dev), pci_pretty_name (dev)); + + return 1; + } + + /* We get "permanent failure" messages on empty slots. + * These are false alarms. Empty slots have no child dn. */ + if ((event->state == pci_device_io_perm_failure) && (frozen_device == NULL)) + return 0; + + if (frozen_device) + freeze_count = frozen_device->eeh_freeze_count; + freeze_count ++; + if (freeze_count > EEH_MAX_ALLOWED_FREEZES) + perm_failure = 1; + + /* If the reset state is a '5' and the time to reset is 0 (infinity) + * or is more then 15 seconds, then mark this as a permanent failure. + */ + if ((event->state == pci_device_io_perm_failure) && + ((event->time_unavail <= 0) || + (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000))) + perm_failure = 1; + + /* Log the error with the rtas logger. */ + if (perm_failure) { + /* + * About 90% of all real-life EEH failures in the field + * are due to poorly seated PCI cards. Only 10% or so are + * due to actual, failed cards. + */ + printk (KERN_ERR + "EEH: device %s:%s has failed %d times \n" + "and has been permanently disabled. Please try reseating\n" + "this device or replacing it.\n", + pci_name (dev), + pci_pretty_name (dev), + freeze_count); + + eeh_slot_error_detail (frozen_device, 2 /* Permanent Error */); + + /* Notify the device that its about to go down. 
*/ + /* XXX this should be a recursive walk to children for + * multi-function devices */ + if (dev->driver->io_state_change) { + dev->driver->io_state_change (dev, pci_device_io_perm_failure); + } + + /* If there's a hotplug slot, unconfigure it */ + struct slot * frozen_slot = rpaphp_find_slot(dev); + rpaphp_unconfig_pci_adapter (frozen_slot); + return 1; + } else { + eeh_slot_error_detail (frozen_device, 1 /* Temporary Error */); + } + + printk (KERN_WARNING + "EEH: This device has failed %d times since last reboot: %s:%s\n", + freeze_count, + pci_name (dev), + pci_pretty_name (dev)); + + /* Walk the various device drivers attached to this slot through + * a reset sequence, giving each an opportunity to do what it needs + * to accomplish the reset */ + /* XXX this should be a recursive walk to children for + * multi-function devices; each child should get to report + * status too, if needed ... if any child can't handle the reset, + * then need to hotplug it. */ + if (dev->driver->io_state_change) { + dev->driver->io_state_change (dev, pci_device_io_frozen); + rc = eeh_reset_device (dev, frozen_device, 0); + dev->driver->io_state_change (dev, pci_device_io_thawed); + } else { + rc = eeh_reset_device (dev, frozen_device, 1); + } + + /* Store the freeze count with the pci adapter, and not the slot. + * This way, if the device is replaced, the count is cleared. + */ + frozen_device->eeh_freeze_count = freeze_count; + + return rc; +} + +static struct notifier_block eeh_block; + +void __init init_eeh_handler (void) +{ + eeh_block.notifier_call = handle_eeh_events; + peh_register_notifier (&eeh_block); +} + +void __exit exit_eeh_handler (void) +{ + peh_unregister_notifier (&eeh_block); +} +
From ak at muc.de Sat Mar 12 20:52:32 2005 From: ak at muc.de (Andi Kleen) Date: 12 Mar 2005 10:52:32 +0100 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <20050312013251.GA2609@austin.ibm.com> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> Message-ID: <20050312095232.GA31444@muc.de> On Fri, Mar 11, 2005 at 07:32:51PM -0600, Linas Vepstas wrote: > > Hi, > > Appended is my current draft PCI Error Recovery patch. > Per previous conversations, it has moved some of the ppc64-specific > error reporting code into generic PCI structures: see changes to > include/linux/pci.h and a new file drivers/pci/pci-error.c. Note > in particular the pci bus states enumerated in > "enum pci_device_io_state"; BenH was suggesting having > more of these ... BenH do you want to propose a "final list"? I don't like it very much that the frozen state is exposed so clearly in the API. e.g. on typical PCI Express chipsets there is no such concept. With error reporting there you just get told "one of your previous accesses or DMAs failed", but the device is not completely gone. IMHO the concept of "slot freeze" is too PPC64 specific to be exposed like this. While there are other chipsets that do similar things a lot of widely used ones will not. Can you reformulate it in terms of "error states" ? Just report an error occurred and the device is unreliable.
-Andi From arnd at arndb.de Sat Mar 12 20:51:37 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Sat, 12 Mar 2005 10:51:37 +0100 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <20050312013251.GA2609@austin.ibm.com> References: <20050223002409.GA10909@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> Message-ID: <200503121051.39015.arnd@arndb.de> On Saturday 12 March 2005 02:32, Linas Vepstas wrote: > Appended is my current draft PCI Error Recovery patch. > Per previous conversations, it has moved some of the ppc64-specific > error reporting code into generic PCI structures: see changes to > include/linux/pci.h and a new file drivers/pci/pci-error.c. Note How does that relate to the stuff that Long sent about PCIe advanced error handling yesterday [1]? Is there an overlap? Arnd <>< [1] http://marc.theaimsgroup.com/?l=linux-kernel&m=111058289917142 http://marc.theaimsgroup.com/?l=linux-kernel&m=111058232501296 http://marc.theaimsgroup.com/?l=linux-kernel&m=111058232501527 http://marc.theaimsgroup.com/?l=linux-kernel&m=111058345605486 http://marc.theaimsgroup.com/?l=linux-kernel&m=111058345701999 http://marc.theaimsgroup.com/?l=linux-kernel&m=111058475121258 http://marc.theaimsgroup.com/?l=linux-kernel&m=111058536826662 From benh at kernel.crashing.org Sat Mar 12 22:16:34 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 12 Mar 2005 22:16:34 +1100 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <20050312095232.GA31444@muc.de> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <20050312095232.GA31444@muc.de> Message-ID: <1110626195.5787.12.camel@gaston> > > e.g. on typical PCI Express chipsets there is no such concept.
> With error reporting there you just get told "one of your > previous accesses or DMAs failed", but the device is not > completely gone. > > IMHO the concept of "slot freeze" is too PPC64 specific to > be exposed like this. While there are other chipsets > that do similar things a lot of widely used ones will not. > > Can you reformulate it in terms of "error states" ? > Just report an error occurred and the device is unreliable. I don't want to expose it that way neither, but the fact is we can't just have "generic" states that apply to every architecture. I haven't yet looked at Linas latest patch though, but at one point, we need to define a few states that may or may not apply to a given architecture and give enough info to drivers to deal with them as much as they can. I'm afraid we can't completely avoid some of the complexity here. Ben. From ak at muc.de Sat Mar 12 22:30:16 2005 From: ak at muc.de (Andi Kleen) Date: 12 Mar 2005 12:30:16 +0100 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <1110626195.5787.12.camel@gaston> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <20050312095232.GA31444@muc.de> <1110626195.5787.12.camel@gaston> Message-ID: <20050312113016.GA47310@muc.de> > I don't want to expose it that way neither, but the fact is we can't > just have "generic" states that apply to every architecture. I haven't > yet looked at Linas latest patch though, but at one point, we need to > define a few states that may or may not apply to a given architecture > and give enough info to drivers to deal with them as much as they can. > I'm afraid we can't completely avoid some of the complexity here. Perhaps, but Linas' version seems to be far too PPC64 centric to me. 
It's really not in your interest either to have a too ppc64 specific solution because it means much additional work to fix drivers for ppc64 that have been developed on other architectures. What's wrong with just simply telling the driver. an error occurred. all your recent transactions may be broken. To use the device again call "foo" first to fix the device. foo then returns if it fixed the device or not. I don't get why the driver even needs to know about isolation or not. It's not fundamentally different from an bus abort on other systems, just that it lasts longer. -Andi From benh at kernel.crashing.org Sat Mar 12 22:50:35 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 12 Mar 2005 22:50:35 +1100 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <20050312113016.GA47310@muc.de> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <20050312095232.GA31444@muc.de> <1110626195.5787.12.camel@gaston> <20050312113016.GA47310@muc.de> Message-ID: <1110628235.19810.16.camel@gaston> > Perhaps, but Linas' version seems to be far too PPC64 centric to me. > > It's really not in your interest either to have a too ppc64 specific > solution because it means much additional work to fix drivers for > ppc64 that have been developed on other architectures. > > What's wrong with just simply telling the driver. > > an error occurred. all your recent transactions may be broken. > To use the device again call "foo" first to fix the device. > foo then returns if it fixed the device or not. I want something along those lines, except that I want it asynchronous because of the issue of drivers sharing the same bus segment that need to be all notified before we can re-enable things.
Also, I can either just re-enable IOs, re-enable DMA, both, reset the slot, etc.... I may not offer that rich functionality in the generic API, but I need to find the right "cutting point". Just re-enabling IOs is useful for drivers who can extract diagnostic infos from the device, for example after a DMA error. Resetting the slot may be necessary to get some devices back. > I don't get why the driver even needs to know about isolation > or not. It's not fundamentally different from an bus abort > on other systems, just that it lasts longer. From grundler at parisc-linux.org Sun Mar 13 04:22:25 2005 From: grundler at parisc-linux.org (Grant Grundler) Date: Sat, 12 Mar 2005 10:22:25 -0700 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <1110628235.19810.16.camel@gaston> References: <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <20050312095232.GA31444@muc.de> <1110626195.5787.12.camel@gaston> <20050312113016.GA47310@muc.de> <1110628235.19810.16.camel@gaston> Message-ID: <20050312172225.GA1978@colo.lackof.org> On Sat, Mar 12, 2005 at 10:50:35PM +1100, Benjamin Herrenschmidt wrote: ... > > To use the device again call "foo" first to fix the device. > > foo then returns if it fixed the device or not. > > I want something along those lines, except that I want it asynchronous > because of the issue of drivers sharing the same bus segment that need > to be all notified before we can re-enable things. Also, I can either > just re-enable IOs, re-enable DMA, both, reset the slot, etc.... > > I may not offer that rich functionality in the generic API, Why not? Can't we do that today with various PCI initialization routines that provide arch (pcibios) specific hooks? e.g.
pci_set_master vs pci_enable_device I'm wondering if the second part of the error recovery path in the driver can use its "normal" initialization sequence. Probably needs adjusting to look for error states and the first part will need to clean up pending IO requests. > but I need > to find the right "cutting point". Just re-enabling IOs is useful for > drivers who can extract diagnostic infos from the device, for example > after a DMA error. By "IO", I'm guessing you mean MMIO or IO Port space access. This implies only the device driver knows what/where any diag info lives. But some of the info is architected in PCI: SERR and PERR status bits. PCIe seems to be richer in error reporting but I don't know details. I think the majority of the error info is much more likely to be held in driver state and platform chipset state. E.g. only the driver will be able to associate a particular IO request with the invalid DMA or MMIO address that the chipset captured. The driver can reject that IO (with extreme prejudice so it doesn't get retried) and restart the PCI device. In case it's not obvious, this is all just hand waving and maybe it will inspire something more realistic... > Resetting the slot may be necessary to get some devices back. *nod* Or even several slots. > > I don't get why the driver even needs to know about isolation > > or not. It's not fundamentally different from an bus abort > > on other systems, just that it lasts longer. I think the driver just needs to know if it's ok to do MMIO/IO Port access to the device or not at any given point in time. A simpler strategy could be to just blow away (PCI Bus reset) the failed device(s) and reconfigure the PCI bus. Then call back into the drivers to tell them their devices suffered an "event". But then finer grain recovery isn't really possible.
grant From benh at kernel.crashing.org Sun Mar 13 10:05:05 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 13 Mar 2005 10:05:05 +1100 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <20050312172225.GA1978@colo.lackof.org> References: <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <20050312095232.GA31444@muc.de> <1110626195.5787.12.camel@gaston> <20050312113016.GA47310@muc.de> <1110628235.19810.16.camel@gaston> <20050312172225.GA1978@colo.lackof.org> Message-ID: <1110668705.5787.31.camel@gaston> > Why not? > Can't we do that today with various PCI initialization > routines that provide arch (pcibios) specific hooks? > e.g. pci_set_master vs pci_enable_device Well, it gets complicated. For example, the driver may try to re-enable IO to check for some status, but that itself triggers a new error right away because the HW is dead ... Maybe the driver should "assume" by default indeed that the slot is isolated by default (even if it's not on non-ppc64 archs) and thus has to call pci_enable_device() and pci_set_master() again. It adds some burden to the underlying code though to figure out if those calls are emitted in the context of an error or not, since the operations are completely different at the firmware level. Then, there is need to inform the driver as well of the capability to reset the slot, to be used if the driver decides it can't recover. Finally, I'm not fan at _ALL_ of providing synchronous APIs like pci_enable_device() or pci_set_master().
In fact, those two would be not _too_ bad, but the slot reset is more nasty. The problem is that we have potentially more than one driver affected. Even if the error was triggered by one card/function, several cards/functions may have been isolated etc... We need to "notify" all drivers, give them a chance to re-enable device & gather diagnostic data, etc... before we try to reset the slot if a driver decides it requires that to happen. Also, if a driver is ok after just enabling the device() re-initializes itself, but its sibling decides it needs to reset the slot ? This is why I'm more inclined toward a callback that acts like a state machine. > I'm wondering if the second part of the error recovery path in > the driver can use its "normal" initialization sequence. > Probably needs adjusting to look for error states and the first > part will need to clean up pending IO requests. Oh it could, but I wouldn't make it mandatory by calling probe() or whatever. It's up to each driver to decide, easy enough to move their init code into a function called by both code path. > > but I need > > to find the right "cutting point". Just re-enabling IOs is useful for > > drivers who can extract diagnostic infos from the device, for example > > after a DMA error. > > By "IO", I'm guessing you mean MMIO or IO Port space access. Yes. > This implies only the device driver knows what/where any diag info lives. Yes. > But some of the info is architected in PCI: SERR and PERR status bits. > PCIe seems to be richer in error reporting but I don't know details. Oh, sure, and that's why it may not be worth bothering about this "step" and just always reset the slot when we can. But we then need to inform the driver of what happened since not all platforms will be able to do that.
That would definitely simplify the above problem, and this is what I meant by "I may not offer that rich functionality in the generic API" > I think the majority of the error info is much more likely to be held > in driver state and platform chipset state. E.g. only the driver will > be able to associate a particular IO request with the invalid DMA or > MMIO address that the chipset captured. The driver can reject that IO > (with extreme prejudice so it doesn't get retried) and restart the PCI > device. > > In case it's not obvious, this is all just hand waving and maybe > it will inspire something more realistic... > > > Resetting the slot may be necessary to get some devices back. > > *nod* Or even several slots. > > > > I don't get why the driver even needs to know about isolation > > > or not. It's not fundamentally different from an bus abort > > > on other systems, just that it lasts longer. > > I think the driver just needs to know if it's ok to do MMIO/IO Port > access to the device or not at any given point in time. > > A simpler strategy could be to just blow away (PCI Bus reset) the failed > device(s) and reconfigure the PCI bus. Then call back into the drivers > to tell them their devices suffered an "event". But then finer grain > recovery isn't really possible. Yes, but I think fine grained recovery ends up being an API nightmare when you start dealing with several drivers on the same segment with conflicting requirements for recovery. Now, the problem is that we have to provide both approaches anyway, as a lot of platforms can't do anything but clear the SERR/PERR state and hope the driver can go on. So we need to inform the driver of the platform capability in a way. Ben.
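The "callback that acts like a state machine" idea from this thread can be sketched roughly as below. This is a user-space C illustration under stated assumptions, not code from Linas' patch and not an existing kernel interface: every name in it (err_step, err_vote, sketch_driver, platform_caps, notify_all, recover_segment, and the two toy drivers) is hypothetical. It only shows the shape of the argument: every driver on the affected segment is notified at each step, the most demanding answer wins, and the platform's capabilities decide how far recovery can go.

```c
#include <stddef.h>

/* All names below are hypothetical, for illustration only. */

/* Steps the platform walks every affected driver through. */
enum err_step {
	STEP_ERROR_DETECTED,	/* I/O to the device is blocked or unreliable */
	STEP_MMIO_ENABLED,	/* MMIO works again, e.g. to read diagnostics */
	STEP_SLOT_RESET,	/* slot has been reset and reconfigured */
	STEP_RESUME		/* normal operation may restart */
};

/* What a driver asks for next, ordered from least to most severe. */
enum err_vote {
	VOTE_RECOVERED,
	VOTE_NEED_MMIO,
	VOTE_NEED_RESET,
	VOTE_DISCONNECT
};

struct sketch_driver {
	const char *name;
	enum err_vote (*error_event)(struct sketch_driver *drv,
				     enum err_step step);
	int calls;		/* how many times the callback has run */
};

/* What this (assumed) platform is able to do for a failed segment. */
struct platform_caps {
	int can_enable_mmio;
	int can_reset_slot;
};

/* Notify every driver on the segment; the most severe vote wins. */
static enum err_vote notify_all(struct sketch_driver **drv, size_t n,
				enum err_step step)
{
	enum err_vote worst = VOTE_RECOVERED;
	size_t i;

	for (i = 0; i < n; i++) {
		enum err_vote v = drv[i]->error_event(drv[i], step);
		if (v > worst)
			worst = v;
	}
	return worst;
}

/* Returns 0 if the segment was recovered, -1 if it must stay offline. */
static int recover_segment(struct sketch_driver **drv, size_t n,
			   const struct platform_caps *caps)
{
	enum err_vote v = notify_all(drv, n, STEP_ERROR_DETECTED);

	if (v == VOTE_NEED_MMIO) {
		if (caps->can_enable_mmio)
			v = notify_all(drv, n, STEP_MMIO_ENABLED);
		/* Still not satisfied (or MMIO unavailable): escalate. */
		if (v == VOTE_NEED_MMIO)
			v = VOTE_NEED_RESET;
	}
	if (v == VOTE_NEED_RESET) {
		if (!caps->can_reset_slot)
			return -1;
		v = notify_all(drv, n, STEP_SLOT_RESET);
		if (v != VOTE_RECOVERED)
			return -1;
	}
	if (v == VOTE_DISCONNECT)
		return -1;

	notify_all(drv, n, STEP_RESUME);
	return 0;
}

/* Toy driver that can always carry on. */
static enum err_vote easy_event(struct sketch_driver *drv, enum err_step step)
{
	(void)step;
	drv->calls++;
	return VOTE_RECOVERED;
}

/* Toy driver that insists on a slot reset after an error. */
static enum err_vote picky_event(struct sketch_driver *drv, enum err_step step)
{
	drv->calls++;
	return step == STEP_ERROR_DETECTED ? VOTE_NEED_RESET : VOTE_RECOVERED;
}
```

With an "easy" and a "picky" driver sharing a segment, the picky one forces a reset for both, which is the sibling conflict raised above. On a platform that cannot reset the slot, the same pair cannot be recovered at all, which is why the driver would need to be told what the platform can do.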
From paulus at samba.org Mon Mar 14 08:47:15 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 14 Mar 2005 08:47:15 +1100 Subject: [PATCH] enable DEBUG via config option In-Reply-To: <20050211105453.GA31718@suse.de> References: <20050211105453.GA31718@suse.de> Message-ID: <16948.46307.76370.206088@cargo.ozlabs.ibm.com> Olaf Hering writes: > Its always boring to edit each file and turn the #undef DEBUG into > #define DEBUG. This patch makes it a simple config option. > Now the question is, how verbose will the boot be when all the printk > are enabled? appears to be ok so far on a p620. Having it as a config option seems to be of use only to a few kernel developers. Why don't you just edit the Makefile and add -DDEBUG to the CFLAGS when you want to do that? Paul. From paulus at samba.org Mon Mar 14 20:27:29 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 14 Mar 2005 20:27:29 +1100 Subject: [PATCH] ppc64: fix nvram partition scan In-Reply-To: References: Message-ID: <16949.22785.701996.273162@cargo.ozlabs.ibm.com> Utz Bacher writes: > the following patch against 2.6.11-rc4 corrects some problems with bad NVRAM > contents: > - when the checksum is incorrect, better do not trust anything (instead > of assuming the length is correct) > - when the partition length is zero, stop looking for more partitions > instead of looping I have tidied up some of the messages and changed the loop exits to conform to the usual kernel style. Any comments on this version? Paul. 
diff -urN linux-2.5/arch/ppc64/kernel/nvram.c test/arch/ppc64/kernel/nvram.c --- linux-2.5/arch/ppc64/kernel/nvram.c 2005-03-14 18:03:26.000000000 +1100 +++ test/arch/ppc64/kernel/nvram.c 2005-03-14 20:26:55.000000000 +1100 @@ -507,8 +507,8 @@ struct nvram_partition * tmp_part; unsigned char c_sum; char * header; - long size; int total_size; + int err; if (ppc_md.nvram_size == NULL) return -ENODEV; @@ -522,29 +522,37 @@ while (cur_index < total_size) { - size = ppc_md.nvram_read(header, NVRAM_HEADER_LEN, &cur_index); - if (size != NVRAM_HEADER_LEN) { + err = ppc_md.nvram_read(header, NVRAM_HEADER_LEN, &cur_index); + if (err != NVRAM_HEADER_LEN) { printk(KERN_ERR "nvram_scan_partitions: Error parsing " "nvram partitions\n"); - kfree(header); - return size; + goto out; } cur_index -= NVRAM_HEADER_LEN; /* nvram_read will advance us */ memcpy(&phead, header, NVRAM_HEADER_LEN); + err = 0; c_sum = nvram_checksum(&phead); - if (c_sum != phead.checksum) - printk(KERN_WARNING "WARNING: nvram partition checksum " - "was %02x, should be %02x!\n", phead.checksum, c_sum); - + if (c_sum != phead.checksum) { + printk(KERN_WARNING "WARNING: nvram partition checksum" + " was %02x, should be %02x!\n", + phead.checksum, c_sum); + printk(KERN_WARNING "Terminating nvram partition scan\n"); + goto out; + } + if (!phead.length) { + printk(KERN_WARNING "WARNING: nvram corruption " + "detected: 0-length partition\n"); + goto out; + } tmp_part = (struct nvram_partition *) kmalloc(sizeof(struct nvram_partition), GFP_KERNEL); + err = -ENOMEM; if (!tmp_part) { printk(KERN_ERR "nvram_scan_partitions: kmalloc failed\n"); - kfree(header); - return -ENOMEM; + goto out; } memcpy(&tmp_part->header, &phead, NVRAM_HEADER_LEN); @@ -553,9 +561,11 @@ cur_index += phead.length * NVRAM_BLOCK_LEN; } + err = 0; + out: kfree(header); - return 0; + return err; } static int __init nvram_init(void) From paulus at samba.org Mon Mar 14 20:28:47 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 14 Mar 2005 
20:28:47 +1100 Subject: [RFC/PATCH] Updated: ppc64: Add mem=X option In-Reply-To: <20050225191408.599c613d.michael@ellerman.id.au> References: <20050222192423.727023f7.michael@ellerman.id.au> <20050225191408.599c613d.michael@ellerman.id.au> Message-ID: <16949.22863.622912.175918@cargo.ozlabs.ibm.com> Michael Ellerman writes: > Here is an updated patch for adding support for the mem=X boot option. It now gets rejects, since Mike Kravetz's NUMA patch has gone in. Care to respin it? Paul. From paulus at samba.org Mon Mar 14 20:47:10 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 14 Mar 2005 20:47:10 +1100 Subject: [PATCH][RFC] unlikely spinlocks In-Reply-To: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> References: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> Message-ID: <16949.23966.756568.902508@cargo.ozlabs.ibm.com> Jake Moilanen writes: > On our raw spinlocks, we currently have an attempt at the lock, and if > we do not get it we enter a spin loop. This spin loop will likely > continue for a while, and we predict likely. > > Shouldn't we predict that we will get out of the loop so our next > instructions are already prefetched? Even when we miss because the lock > is still held, it won't matter since we are waiting anyway. Possibly the best thing is not to put a static prediction on it at all, and let the machine's dynamic branch prediction decide which path to predict? Paul.
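To make the two options in this exchange concrete, here is a small user-space C sketch of a try-then-spin lock written both ways: once with a static hint that the spin loop will exit, and once with no hint so the hardware predictor decides. It is not the ppc64 kernel spinlock code. The names toy_lock, toy_trylock, toy_lock_hinted, toy_lock_unhinted and toy_unlock are made up for the example, and the unlikely() macro is assumed to map to the GCC/Clang builtin __builtin_expect.

```c
#include <stdatomic.h>

/* Static branch hint; assumes a GCC-compatible compiler. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical toy lock: 0 means free, 1 means held. */
struct toy_lock {
	atomic_int held;
};

/* One attempt at the lock; returns nonzero on success. */
int toy_trylock(struct toy_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->held, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/*
 * Variant with a static hint: the "lock still held" condition is marked
 * unlikely, so the compiler lays out the code expecting to leave the loop.
 */
void toy_lock_hinted(struct toy_lock *l)
{
	while (!toy_trylock(l)) {
		while (unlikely(atomic_load_explicit(&l->held,
						     memory_order_relaxed) != 0))
			;	/* spin with plain loads until it looks free */
	}
}

/* Variant with no static hint: prediction is left to the hardware. */
void toy_lock_unhinted(struct toy_lock *l)
{
	while (!toy_trylock(l)) {
		while (atomic_load_explicit(&l->held,
					    memory_order_relaxed) != 0)
			;
	}
}

void toy_unlock(struct toy_lock *l)
{
	atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Both variants behave identically; only the generated branch layout and hint bits can differ, so any benefit would have to be measured on the target machine.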
From paulus at samba.org Mon Mar 14 21:13:36 2005 From: paulus at samba.org (Paul Mackerras) Date: Mon, 14 Mar 2005 21:13:36 +1100 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <20050310162513.74191caa.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> Message-ID: <16949.25552.640180.677985@cargo.ozlabs.ibm.com> Jake Moilanen writes: > diff -puN fs/binfmt_elf.c~nx-user-ppc64 fs/binfmt_elf.c > --- linux-2.6-bk/fs/binfmt_elf.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 > +++ linux-2.6-bk-moilanen/fs/binfmt_elf.c 2005-03-08 16:08:54 -06:00 > @@ -99,6 +99,8 @@ static int set_brk(unsigned long start, > up_write(&current->mm->mmap_sem); > if (BAD_ADDR(addr)) > return addr; > + > + sys_mprotect(start, end-start, PROT_READ|PROT_WRITE|PROT_EXEC); I don't think I can push that upstream. What happens if you leave that out? More generally, we are making a user-visible change, even for programs that aren't marked as having non-executable stack or heap, because we are now enforcing that the program can't execute from mappings that don't have PROT_EXEC. Perhaps we should enforce the requirement for execute permission only on those programs that indicate somehow that they can handle it? Paul.
From seto.hidetoshi at jp.fujitsu.com Mon Mar 14 23:33:03 2005 From: seto.hidetoshi at jp.fujitsu.com (Hidetoshi Seto) Date: Mon, 14 Mar 2005 21:33:03 +0900 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <20050312013251.GA2609@austin.ibm.com> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> Message-ID: <4235847F.3080705@jp.fujitsu.com> Linas Vepstas wrote: > "enum pci_device_io_state"; BenH was suggesting having > more of these ... BenH do you want to propose a "final list"? > (snip) > +/* ---------------------------------------------------------------- */ > +/** PCI error recovery state. Whenever the PCI bus state changes, > + * the io_state_change() callback will be called to notify the > + * device driver of state changes. > + */ > + > +enum pci_device_io_state { > + pci_device_io_frozen = 1, /* I/O to device is blocked */ > + pci_device_io_thawed, /* I/O to device is (re-)enabled */ > + pci_device_io_perm_failure, /* pci card is dead */ > +}; I'm not BenH... but I think it's of value to have the list of states. (Though it seems that the list you originally wanted is not a "state list" but an "event list".) IMHO, (according to the current list) there will be at least 3 states: - NORMAL: Standard, usual, healthy state. Strictly speaking, this doesn't mean "everything works well." IOW - unreliable: "works but occasionally fails." You can access the device but checking the result is recommended. - ISOLATED: Physically connected but accesses are temporarily blocked. Devices may be unstable but are believed to be recoverable. Error info on the platform or device would be inaccessible. The system could attempt to recover - change the state to NORMAL. - DEAD: Physically connected but accesses are permanently blocked.
No recovery attempt is required any more. How many other states will there be? And, I guess you would need at least 3 types of event: - ERROR_DETECTED: An error was detected. The notified driver could test the device, collect advanced/extra error info and log it. - STATE_CHANGED: I/O state was changed. The new state will be indicated in the param with this event. - TRY_RECOVER: The OS asks drivers to attempt any device-specific recovery. After gathering all results, the OS will decide whether the device has recovered or not. Depending on the arch's facilities and implementation, the behavior of the system varies greatly. For example, if we get an error when in NORMAL state: case 1) NORMAL -> NORMAL State isn't changed. The error will be reported by some kind of exception, read() will return broken (or poisoned) data, and write will be ignored. Even if subsequent I/O also fails, we can continue to access the device. # ex. ia32 case 2) NORMAL -> ISOLATED/DEAD Even if it was a temporary soft error, the system isolates the affected bus and devices. All subsequent I/O will be blocked (or poisoned/ignored). # ex. ppc64 case 3) System reset Even if it was a temporary soft error, the system reboots immediately. All subsequent/pending I/O will be dismissed. # ex. ia64 (too sensitive...so now I'm engaged in :-p Therefore, you will (case 1) get a lot of ERROR_DETECTED events, or (case 2) get a STATE_CHANGED event with param indicating "ISOLATED," or (case 3) get nothing. Again, currently most arches don't use states other than NORMAL... Now your intent: > + pci_device_io_frozen = 1, /* I/O to device is blocked */ > + pci_device_io_thawed, /* I/O to device is (re-)enabled */ > + pci_device_io_perm_failure, /* pci card is dead */ would be realized by: event(STATE_CHANGED,ISOLATED) + event(TRY_RECOVER,*data) event(STATE_CHANGED,NORMAL) event(STATE_CHANGED,DEAD) I think the latter style is more generic. Do these ideas give you a clue on how to go on?
Thanks, H.Seto From linas at austin.ibm.com Tue Mar 15 04:49:06 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 14 Mar 2005 11:49:06 -0600 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <1110668705.5787.31.camel@gaston> References: <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <20050312095232.GA31444@muc.de> <1110626195.5787.12.camel@gaston> <20050312113016.GA47310@muc.de> <1110628235.19810.16.camel@gaston> <20050312172225.GA1978@colo.lackof.org> <1110668705.5787.31.camel@gaston> Message-ID: <20050314174906.GA498@austin.ibm.com> Hi, > The problem is that we have > potentially more than one driver affected. Even if the error was > triggered by one card/function, several cards/functions may have been > isolated etc... To be specific, on PPC64 we have PCI buses that are physical cables that run from one rack-mounted drawer to the other rack cage that contains the cpu (the "CEC", central electronics complex). Each rack-mounted cage may hold 4 or 8 or 16 PCI cards, and a failure on that bus could take out multiple PCI cards at once. Even on a plain-jane desktop system, one is confronted with "multi-function pci cards" which can cause multiple drivers to be loaded. > We need to "notify" all drivers, give them a chance to re-enable device > & gather diagnostic data, etc... before we try to reset the slot if a > driver decides it requires that to happen. Also, if a driver is ok after > just enabling the device() re-initializes itself, but its sibling > decides it needs to reset the slot ? [...] > Yes, but I think fine grained recovery ends up being an API nightmare > when you start dealing with several drivers on the same segment with > conflicting requirements for recovery. I'm thinking of having a way of asking all affected drivers "what can you deal with?" and then playing to the lowest common denominator.
For example, the current reset sequence tries to do a hotplug add-remove if the driver is ignorant. At this time, I distinguish "ignorant" from "not ignorant" based on whether the 'state change' callback is null or not. I'll try to think of something just a tiny bit more fine-grained than this. --linas From linas at austin.ibm.com Tue Mar 15 04:55:11 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 14 Mar 2005 11:55:11 -0600 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <20050312113016.GA47310@muc.de> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <20050312095232.GA31444@muc.de> <1110626195.5787.12.camel@gaston> <20050312113016.GA47310@muc.de> Message-ID: <20050314175511.GB498@austin.ibm.com> On Sat, Mar 12, 2005 at 12:30:16PM +0100, Andi Kleen was heard to remark: > > Perhaps, but Linas' version seems to be far too PPC64 centric to me. I'm trying to expand my horizons, and part of this includes asking for advice on this mailing list. I'd like to have Long Nguyen from Intel to be a bit more involved in the conversation, so that I don't get surprised by patches that do almost the same thing, but differently (as the last PCI express patch seems to sort-of/somehow might be doing). 
--linas From linas at austin.ibm.com Tue Mar 15 05:00:31 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 14 Mar 2005 12:00:31 -0600 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <200503121051.39015.arnd@arndb.de> References: <20050223002409.GA10909@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <200503121051.39015.arnd@arndb.de> Message-ID: <20050314180031.GC498@austin.ibm.com> On Sat, Mar 12, 2005 at 10:51:37AM +0100, Arnd Bergmann was heard to remark: > On Saturday 12 March 2005 02:32, Linas Vepstas wrote: > > > Appended is my current draft PCI Error Recovery patch. > > Per previous conversations, it has moved some of the ppc64-specific > > error reporting code into generic PCI structures: see changes to > > include/linux/pci.h and a new file drivers/pci/pci-error.c. Note > > How does that relate to the stuff that Long sent about PCIe > advanced error handling yesterday [1]? Is there an overlap? Dunno, I'm looking. I was surprised to see this patch; I invite Long to join the conversation and describe the situation in his view. --linas From linas at austin.ibm.com Tue Mar 15 05:14:20 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 14 Mar 2005 12:14:20 -0600 Subject: [PATCH/RFC] PCI Error Recovery In-Reply-To: <4235847F.3080705@jp.fujitsu.com> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> Message-ID: <20050314181420.GD498@austin.ibm.com> On Mon, Mar 14, 2005 at 09:33:03PM +0900, Hidetoshi Seto was heard to remark: > Linas Vepstas wrote: > >+enum pci_device_io_state { > > ... but I think it's of value to have the list of states. > (Though it seems that the list you originally wanted is not a "state list" > but an "event list".)
Sorry, you are right, I confused the concept of "state transition" with the concept of "state". I will try to clarify the difference in the next patch. > would be realized by: > event(STATE_CHANGED,ISOLATED) + event(TRY_RECOVER,*data) > event(STATE_CHANGED,NORMAL) > event(STATE_CHANGED,DEAD) > I think the latter style is more generic. Hmm, are you suggesting that there **shouldn't** be a callback function in struct pci_driver, and that instead, all state changes should be delivered as events? (i.e. by means of the notifier_chain mechanism?) Hmm ... that's possible, I'd have to rearrange the code a bit. Is there a long-term philosophy for the Linux kernel on a question like this? That is, when should changes add callbacks to structures, as opposed to notifier-chain based events? The callback is a bit simpler, and maybe a tiny bit faster, but it's less flexible in the long run (e.g. anyone can listen for the events, but only device drivers can get callbacks). Comments, please? --linas From linas at austin.ibm.com Tue Mar 15 06:36:40 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 14 Mar 2005 13:36:40 -0600 Subject: [RFC][PATCH] combining header files In-Reply-To: <20050310134216.5b9b27ef.sfr@canb.auug.org.au> References: <20050309120343.0c22eb0f.sfr@canb.auug.org.au> <20050309200109.GG1220@austin.ibm.com> <20050310134216.5b9b27ef.sfr@canb.auug.org.au> Message-ID: <20050314193640.GE498@austin.ibm.com> On Thu, Mar 10, 2005 at 01:42:16PM +1100, Stephen Rothwell was heard to remark: > On Wed, 9 Mar 2005 14:01:09 -0600 Linas Vepstas wrote: > > > > Why not #include instead? > > Because I am talking about similarities between ppc and ppc64 not ppc64 > and the generic code (though there may be some of those to be exploited as > well). Hmm. Well, yes. I just figured that since you're looking at this anyway, you may as well look to see if it can be made generic.
--linas From utz.bacher at de.ibm.com Tue Mar 15 06:26:39 2005 From: utz.bacher at de.ibm.com (Utz Bacher) Date: Mon, 14 Mar 2005 20:26:39 +0100 Subject: [PATCH] ppc64: fix nvram partition scan In-Reply-To: <16949.22785.701996.273162@cargo.ozlabs.ibm.com> Message-ID: Paul Mackerras wrote: > I have tidied up some of the messages and changed the loop exits to > conform to the usual kernel style. Any comments on this version? Even better! Thanks, Utz :wq From moilanen at austin.ibm.com Tue Mar 15 08:51:25 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Mon, 14 Mar 2005 15:51:25 -0600 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <16949.25552.640180.677985@cargo.ozlabs.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> <16949.25552.640180.677985@cargo.ozlabs.ibm.com> Message-ID: <20050314155125.68dcff70.moilanen@austin.ibm.com> On Mon, 14 Mar 2005 21:13:36 +1100 Paul Mackerras wrote: > Jake Moilanen writes: > > > diff -puN fs/binfmt_elf.c~nx-user-ppc64 fs/binfmt_elf.c > > --- linux-2.6-bk/fs/binfmt_elf.c~nx-user-ppc64 2005-03-08 16:08:54 -06:00 > > +++ linux-2.6-bk-moilanen/fs/binfmt_elf.c 2005-03-08 16:08:54 -06:00 > > @@ -99,6 +99,8 @@ static int set_brk(unsigned long start, > > up_write(&current->mm->mmap_sem); > > if (BAD_ADDR(addr)) > > return addr; > > + > > + sys_mprotect(start, end-start, PROT_READ|PROT_WRITE|PROT_EXEC); > > I don't think I can push that upstream. What happens if you leave > that out? The bss and the plt are in the same segment, and plt obviously needs to be executable.
Section Headers:
  [Nr] Name     Type      Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]          NULL      00000000 000000 000000 00      0   0  0
  [ 1] .interp  PROGBITS  10000154 000154 00000d 00   A  0   0  1
  ...
  ...
  [26] .plt     NOBITS    10013c5c 003c34 000210 00 WAX  0   0  4
  [27] .bss     NOBITS    10013e6c 003c34 000128 00  WA  0   0  4

 Segment Sections...
  00
  01     .interp
  02     .interp .note.ABI-tag .note.SuSE .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .text .fini .rodata
  03     .data .eh_frame .got2 .dynamic .ctors .dtors .jcr .got .sdata .sbss .plt .bss
  04     .dynamic
  05     .note.ABI-tag
  06     .note.SuSE
  07

Anton mentioned that Alan was considering putting plt into a new segment. > More generally, we are making a user-visible change, even for programs > that aren't marked as having non-executable stack or heap, because we > are now enforcing that the program can't execute from mappings that > don't have PROT_EXEC. Perhaps we should enforce the requirement for > execute permission only on those programs that indicate somehow that > they can handle it? Unless a program is compiled w/ pt_gnu_stacks we will set the READ_IMPLIES_EXEC personality and those applications should still work as normal. Jake From johnrose at austin.ibm.com Tue Mar 15 09:04:25 2005 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 14 Mar 2005 16:04:25 -0600 Subject: [PATCH] remove unnecessary ISA ioports Message-ID: <1110837865.3586.28.camel@sinatra.austin.ibm.com> During boot, pSeries_request_regions() should only request I/O ports for legacy ISA in the case that ISA exists on the system. Add a check for this. This patch was suggested by Anton. Please apply, if appropriate.
Thanks- John Signed-off-by: John Rose diff -puN arch/ppc64/kernel/pSeries_pci.c~02_ppc64_request_regions arch/ppc64/kernel/pSeries_pci.c --- 2_6_linus_4/arch/ppc64/kernel/pSeries_pci.c~02_ppc64_request_regions 2005-03-14 15:59:44.000000000 -0600 +++ 2_6_linus_4-johnrose/arch/ppc64/kernel/pSeries_pci.c 2005-03-14 15:59:44.000000000 -0600 @@ -540,6 +540,9 @@ EXPORT_SYMBOL(pcibios_remove_root_bus); static void __init pSeries_request_regions(void) { + if (!isa_io_base) + return; + request_region(0x20,0x20,"pic1"); request_region(0xa0,0x20,"pic2"); request_region(0x00,0x20,"dma1"); _ From paulus at samba.org Tue Mar 15 09:18:04 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 15 Mar 2005 09:18:04 +1100 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <20050314155125.68dcff70.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> <16949.25552.640180.677985@cargo.ozlabs.ibm.com> <20050314155125.68dcff70.moilanen@austin.ibm.com> Message-ID: <16950.3484.416343.832453@cargo.ozlabs.ibm.com> Jake Moilanen writes: > > I don't think I can push that upstream. What happens if you leave > > that out? > > The bss and the plt are in the same segment, and plt obviously needs to > be executable. Yes... what I was asking was "do things actually break if you leave that out, or does the binfmt_elf loader honour the 'x' permission on the PT_LOAD entry for the data/bss region, meaning that it all just works anyway?" I did an objdump -p on some random 32-bit binaries, and they all have "rwx" flags on the data/bss segment (the second PT_LOAD entry). And when I look in /proc/<pid>/maps, it seems that the heap is in fact marked executable (this is without your patch). So why do we need the hack in binfmt_elf.c? Paul.
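For readers following the question about the 'x' permission on a PT_LOAD entry, the sketch below shows the usual translation from ELF program header flags (PF_R, PF_W, PF_X) to mmap/mprotect protection bits, written as a stand-alone user-space C function for Linux. The function name prot_from_pflags is made up for this example and is not the kernel's own helper. The sketch only covers how segment flags map to protections; whether the brk-allocated tail of the bss ends up with the same permissions is exactly the open question in the message above, and the sketch does not answer it.

```c
#include <elf.h>	/* PF_R, PF_W, PF_X */
#include <sys/mman.h>	/* PROT_READ, PROT_WRITE, PROT_EXEC */

/*
 * Hypothetical helper: translate the p_flags field of an ELF program
 * header into the protection bits one would pass to mmap()/mprotect().
 */
int prot_from_pflags(unsigned int p_flags)
{
	int prot = 0;

	if (p_flags & PF_R)
		prot |= PROT_READ;
	if (p_flags & PF_W)
		prot |= PROT_WRITE;
	if (p_flags & PF_X)
		prot |= PROT_EXEC;
	return prot;
}
```

A data/bss segment that objdump -p reports as "rwx" would map to PROT_READ|PROT_WRITE|PROT_EXEC under this translation, while an "rw-" segment would lose PROT_EXEC.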
From ntl at pobox.com Tue Mar 15 13:49:23 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:49:23 -0600 (CST) Subject: [PATCH 0/8] reworked support for pSeries dynamic reconfiguration (v2) Message-ID: <20050315024923.11665.85622.82498@otto> Thanks to those who gave feedback on the previous submission of this patch series. I've noted the changes I've made in the changelogs of the individual patch mails to follow. This patch series reworks existing ppc64 architecture support for the "dynamic reconfiguration" option of RPA platforms. This includes PCI hotplug and dynamic logical partitioning (DLPAR). This was all motivated by my desire to add code for better handling of processor addition and removal, but I didn't want to just add to the growing mess in prom.c where we have duplicated code for boot and DLPAR/hotplug. This adds very little new function, but gets rid of much duplicated code and introduces a new pSeries-specific file, pSeries_reconfig.c, which contains the core support for dynamic reconfiguration and implements a more refined version of the notifier chain API I posted a few weeks ago. Code that needs to act upon device nodes that are being added or removed can register with this notifier chain. I've ported as much code as possible to that API, and I expect memory DLPAR will want to use it too. The last couple of patches in the series modify the pSeries smp code so that we properly manage cpu_present_map with respect to DLPAR, and includes the "make cpu hotplug play well with maxcpus and smt-enabled" patch, which depends on this. The following cases have been tested on a Power5 system: * CPU add/remove * Virtual I/O adapter add/remove * Logical slot add/remove (thanks to John Rose) I also checked the build against all defconfigs in arch/ppc64/configs. 
diffstat for the combined series: arch/ppc64/kernel/Makefile | 2 arch/ppc64/kernel/pSeries_iommu.c | 25 arch/ppc64/kernel/pSeries_reconfig.c | 434 +++++++++++++ arch/ppc64/kernel/pSeries_smp.c | 231 +++++-- arch/ppc64/kernel/pci_dn.c | 22 arch/ppc64/kernel/proc_ppc64.c | 249 ------- arch/ppc64/kernel/prom.c | 474 ++++----------- arch/ppc64/kernel/setup.c | 12 arch/ppc64/kernel/smp.c | 13 include/asm-ppc64/machdep.h | 1 include/asm-ppc64/pSeries_reconfig.h | 25 include/asm-ppc64/prom.h | 4 12 files changed, 827 insertions(+), 665 deletions(-) Thanks, Nathan From ntl at pobox.com Tue Mar 15 13:49:28 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:49:28 -0600 (CST) Subject: [PATCH 1/8] preliminary changes to OF fixup functions In-Reply-To: <20050315024923.11665.85622.82498@otto> References: <20050315024923.11665.85622.82498@otto> Message-ID: <20050315024928.11665.31398.52683@otto> Preliminary modifications to support using some of the interpret_func family of functions at runtime. Changes the mem_start argument to be passed by reference, and the return type to int for error handling to be implemented in following patches. 
Signed-off-by: Nathan Lynch arch/ppc64/kernel/prom.c | 135 ++++++++++++++------------- 1 files changed, 71 insertions(+), 64 deletions(-) Index: linux-2.6.11-bk10/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/prom.c 2005-03-14 21:49:40.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/prom.c 2005-03-14 21:49:46.000000000 +0000 @@ -73,8 +73,8 @@ struct isa_reg_property { }; -typedef unsigned long interpret_func(struct device_node *, unsigned long, - int, int, int); +typedef int interpret_func(struct device_node *, unsigned long *, + int, int, int); extern struct rtas_t rtas; extern struct lmb lmb; @@ -255,9 +255,9 @@ static int __devinit map_interrupt(unsig return nintrc; } -static unsigned long __init finish_node_interrupts(struct device_node *np, - unsigned long mem_start, - int measure_only) +static int __init finish_node_interrupts(struct device_node *np, + unsigned long *mem_start, + int measure_only) { unsigned int *ints; int intlen, intrcells, intrcount; @@ -267,14 +267,14 @@ static unsigned long __init finish_node_ ints = (unsigned int *) get_property(np, "interrupts", &intlen); if (ints == NULL) - return mem_start; + return 0; intrcells = prom_n_intr_cells(np); intlen /= intrcells * sizeof(unsigned int); - np->intrs = (struct interrupt_info *) mem_start; - mem_start += intlen * sizeof(struct interrupt_info); + np->intrs = (struct interrupt_info *) (*mem_start); + (*mem_start) += intlen * sizeof(struct interrupt_info); if (measure_only) - return mem_start; + return 0; intrcount = 0; for (i = 0; i < intlen; ++i, ints += intrcells) { @@ -315,13 +315,13 @@ static unsigned long __init finish_node_ } np->n_intrs = intrcount; - return mem_start; + return 0; } -static unsigned long __init interpret_pci_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_pci_props(struct device_node *np, + 
unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct address_range *adr; struct pci_reg_property *pci_addrs; @@ -331,7 +331,7 @@ static unsigned long __init interpret_pc get_property(np, "assigned-addresses", &l); if (pci_addrs != 0 && l >= sizeof(struct pci_reg_property)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct pci_reg_property)) >= 0) { if (!measure_only) { adr[i].space = pci_addrs[i].addr.a_hi; @@ -343,15 +343,15 @@ static unsigned long __init interpret_pc } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_dbdma_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_dbdma_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct reg_property32 *rp; struct address_range *adr; @@ -372,7 +372,7 @@ static unsigned long __init interpret_db rp = (struct reg_property32 *) get_property(np, "reg", &l); if (rp != 0 && l >= sizeof(struct reg_property32)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct reg_property32)) >= 0) { if (!measure_only) { adr[i].space = 2; @@ -383,16 +383,16 @@ static unsigned long __init interpret_db } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_macio_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_macio_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct 
reg_property32 *rp; struct address_range *adr; @@ -413,7 +413,7 @@ static unsigned long __init interpret_ma rp = (struct reg_property32 *) get_property(np, "reg", &l); if (rp != 0 && l >= sizeof(struct reg_property32)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct reg_property32)) >= 0) { if (!measure_only) { adr[i].space = 2; @@ -424,16 +424,16 @@ static unsigned long __init interpret_ma } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_isa_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_isa_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct isa_reg_property *rp; struct address_range *adr; @@ -442,7 +442,7 @@ static unsigned long __init interpret_is rp = (struct isa_reg_property *) get_property(np, "reg", &l); if (rp != 0 && l >= sizeof(struct isa_reg_property)) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= sizeof(struct isa_reg_property)) >= 0) { if (!measure_only) { adr[i].space = rp[i].space; @@ -453,16 +453,16 @@ static unsigned long __init interpret_is } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init interpret_root_props(struct device_node *np, - unsigned long mem_start, - int naddrc, int nsizec, - int measure_only) +static int __init interpret_root_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct address_range *adr; int i, l; @@ -472,7 +472,7 @@ static unsigned long __init interpret_ro rp = 
(unsigned int *) get_property(np, "reg", &l); if (rp != 0 && l >= rpsize) { i = 0; - adr = (struct address_range *) mem_start; + adr = (struct address_range *) (*mem_start); while ((l -= rpsize) >= 0) { if (!measure_only) { adr[i].space = 0; @@ -484,26 +484,30 @@ static unsigned long __init interpret_ro } np->addrs = adr; np->n_addrs = i; - mem_start += i * sizeof(struct address_range); + (*mem_start) += i * sizeof(struct address_range); } - return mem_start; + return 0; } -static unsigned long __init finish_node(struct device_node *np, - unsigned long mem_start, - interpret_func *ifunc, - int naddrc, int nsizec, - int measure_only) +static int __init finish_node(struct device_node *np, + unsigned long *mem_start, + interpret_func *ifunc, + int naddrc, int nsizec, + int measure_only) { struct device_node *child; - int *ip; + int *ip, rc = 0; /* get the device addresses and interrupts */ if (ifunc != NULL) - mem_start = ifunc(np, mem_start, naddrc, nsizec, measure_only); + rc = ifunc(np, mem_start, naddrc, nsizec, measure_only); + if (rc) + goto out; - mem_start = finish_node_interrupts(np, mem_start, measure_only); + rc = finish_node_interrupts(np, mem_start, measure_only); + if (rc) + goto out; /* Look for #address-cells and #size-cells properties. 
*/ ip = (int *) get_property(np, "#address-cells", NULL); @@ -539,11 +543,14 @@ static unsigned long __init finish_node( || !strcmp(np->type, "media-bay")))) ifunc = NULL; - for (child = np->child; child != NULL; child = child->sibling) - mem_start = finish_node(child, mem_start, ifunc, - naddrc, nsizec, measure_only); - - return mem_start; + for (child = np->child; child != NULL; child = child->sibling) { + rc = finish_node(child, mem_start, ifunc, + naddrc, nsizec, measure_only); + if (rc) + goto out; + } +out: + return rc; } /** @@ -555,7 +562,7 @@ static unsigned long __init finish_node( */ void __init finish_device_tree(void) { - unsigned long mem, size; + unsigned long start, end, size = 0; DBG(" -> finish_device_tree\n"); @@ -568,11 +575,11 @@ void __init finish_device_tree(void) virt_irq_init(); /* Finish device-tree (pre-parsing some properties etc...) */ - size = finish_node(allnodes, 0, NULL, 0, 0, 1); - mem = (unsigned long)abs_to_virt(lmb_alloc(size, 128)); - if (finish_node(allnodes, mem, NULL, 0, 0, 0) != mem + size) - BUG(); - + finish_node(allnodes, &size, NULL, 0, 0, 1); + end = start = (unsigned long)abs_to_virt(lmb_alloc(size, 128)); + finish_node(allnodes, &end, NULL, 0, 0, 0); + BUG_ON(end != start + size); + DBG(" <- finish_device_tree\n"); } From ntl at pobox.com Tue Mar 15 13:49:33 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:49:33 -0600 (CST) Subject: [PATCH 2/8] make OF node fixup code usable at runtime In-Reply-To: <20050315024923.11665.85622.82498@otto> References: <20050315024923.11665.85622.82498@otto> Message-ID: <20050315024933.11665.3281.50892@otto> Updates since last submission: o I decided to use Ben's suggestion to introduce a small wrapper function for handling allocations. At first I thought I would use John's idea of checking the dynamic flag in the node to decide whether to use kmalloc, but I think this way is better since it keeps all that logic out of the interpret_func-style routines. 
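The wrapper idea described above can be tried outside the kernel. What follows is a minimal user-space sketch, not the patch itself: malloc() stands in for kmalloc(), uintptr_t stands in for the kernel's unsigned long cursor, and every sketch_ name and the SKETCH_BIAS constant are hypothetical. One detail the sketch assumes on its own account: the measuring pass starts its cursor at a small nonzero bias, because a cursor of 0 would make the first returned pointer look like an allocation failure to any caller that checks for NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BIAS 16	/* hypothetical: keeps the measuring cursor off 0 */

struct sketch_range {
	unsigned long address;
	unsigned long size;
};

/* With a cursor, carve from the reserved region; without one, use the heap. */
static void *sketch_alloc(size_t size, uintptr_t *mem_start)
{
	uintptr_t tmp;

	if (!mem_start)
		return malloc(size);	/* runtime path (kmalloc in the kernel) */

	tmp = *mem_start;		/* boot path: bump the cursor */
	*mem_start += size;
	return (void *)tmp;
}

/* Hypothetical interpret_func-style routine: no allocation policy in here. */
static int sketch_fill(struct sketch_range **out, int n,
		       uintptr_t *mem_start, int measure_only)
{
	struct sketch_range *adr;
	int i;

	adr = sketch_alloc(n * sizeof(*adr), mem_start);
	if (!adr)
		return -1;
	if (measure_only)
		return 0;		/* cursor advanced, nothing written */

	for (i = 0; i < n; i++) {
		adr[i].address = 0x1000UL * i;
		adr[i].size = 0x100;
	}
	*out = adr;
	return 0;
}

/* Two passes: measure, reserve one region, then fill it in place. */
static uintptr_t sketch_two_pass(int n, struct sketch_range **out)
{
	uintptr_t size = SKETCH_BIAS, start, end;

	if (sketch_fill(out, n, &size, 1))
		return 0;
	size -= SKETCH_BIAS;

	start = end = (uintptr_t)malloc(size);
	if (!start)
		return 0;
	if (sketch_fill(out, n, &end, 0))
		return 0;
	assert(end == start + size);	/* both passes must agree on the total */
	return size;
}
```

Calling sketch_fill() with a NULL cursor exercises the runtime path, and sketch_two_pass() exercises the boot path; the routine doing the filling is the same in both cases, which is the point of keeping the decision inside the wrapper.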
At boot we recurse through the device tree "fixing up" various fields
and properties in the device nodes. Long ago, to support DLPAR and
hotplug, we largely duplicated some of this fixup code, the main
difference being that the new code used kmalloc for allocating various
data structures which are attached to the new device nodes.

This patch introduces a helper function (prom_alloc) for handling
allocations at both boot and runtime, kills most of the duplicated
code, and makes finish_node, finish_node_interrupts, and
interpret_pci_props suitable for use at runtime by converting them to
use prom_alloc.

Signed-off-by: Nathan Lynch

 arch/ppc64/kernel/prom.c | 177 +++++++++------------------
 1 files changed, 62 insertions(+), 115 deletions(-)

Index: linux-2.6.11-bk10/arch/ppc64/kernel/prom.c
===================================================================
--- linux-2.6.11-bk10.orig/arch/ppc64/kernel/prom.c	2005-03-14 21:49:46.000000000 +0000
+++ linux-2.6.11-bk10/arch/ppc64/kernel/prom.c	2005-03-14 21:54:08.000000000 +0000
@@ -103,6 +103,25 @@ static DEFINE_RWLOCK(devtree_lock);
 struct device_node *of_chosen;
 
 /*
+ * Wrapper for allocating memory for various data that needs to be
+ * attached to device nodes as they are processed at boot or when
+ * added to the device tree later (e.g. DLPAR). At boot there is
+ * already a region reserved so we just increment *mem_start by size;
+ * otherwise we call kmalloc.
+ */
+static void * prom_alloc(unsigned long size, unsigned long *mem_start)
+{
+	unsigned long tmp;
+
+	if (!mem_start)
+		return kmalloc(size, GFP_KERNEL);
+
+	tmp = *mem_start;
+	*mem_start += size;
+	return (void *)tmp;
+}
+
+/*
  * Find the device_node with a given phandle.
*/ static struct device_node * find_phandle(phandle ph) @@ -255,9 +274,9 @@ static int __devinit map_interrupt(unsig return nintrc; } -static int __init finish_node_interrupts(struct device_node *np, - unsigned long *mem_start, - int measure_only) +static int __devinit finish_node_interrupts(struct device_node *np, + unsigned long *mem_start, + int measure_only) { unsigned int *ints; int intlen, intrcells, intrcount; @@ -270,8 +289,10 @@ static int __init finish_node_interrupts return 0; intrcells = prom_n_intr_cells(np); intlen /= intrcells * sizeof(unsigned int); - np->intrs = (struct interrupt_info *) (*mem_start); - (*mem_start) += intlen * sizeof(struct interrupt_info); + + np->intrs = prom_alloc(intlen * sizeof(*(np->intrs)), mem_start); + if (!np->intrs) + return -ENOMEM; if (measure_only) return 0; @@ -318,33 +339,39 @@ static int __init finish_node_interrupts return 0; } -static int __init interpret_pci_props(struct device_node *np, - unsigned long *mem_start, - int naddrc, int nsizec, - int measure_only) +static int __devinit interpret_pci_props(struct device_node *np, + unsigned long *mem_start, + int naddrc, int nsizec, + int measure_only) { struct address_range *adr; struct pci_reg_property *pci_addrs; - int i, l; + int i, l, n_addrs; pci_addrs = (struct pci_reg_property *) get_property(np, "assigned-addresses", &l); - if (pci_addrs != 0 && l >= sizeof(struct pci_reg_property)) { - i = 0; - adr = (struct address_range *) (*mem_start); - while ((l -= sizeof(struct pci_reg_property)) >= 0) { - if (!measure_only) { - adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo | - ((u64)pci_addrs[i].addr.a_mid << 32); - adr[i].size = pci_addrs[i].size_lo; - } - ++i; - } - np->addrs = adr; - np->n_addrs = i; - (*mem_start) += i * sizeof(struct address_range); + if (!pci_addrs) + return 0; + + n_addrs = l / sizeof(*pci_addrs); + + adr = prom_alloc(n_addrs * sizeof(*adr), mem_start); + if (!adr) + return -ENOMEM; + + if (measure_only) + 
return 0; + + np->addrs = adr; + np->n_addrs = n_addrs; + + for (i = 0; i < n_addrs; i++) { + adr[i].space = pci_addrs[i].addr.a_hi; + adr[i].address = pci_addrs[i].addr.a_lo | + ((u64)pci_addrs[i].addr.a_mid << 32); + adr[i].size = pci_addrs[i].size_lo; } + return 0; } @@ -490,11 +517,11 @@ static int __init interpret_root_props(s return 0; } -static int __init finish_node(struct device_node *np, - unsigned long *mem_start, - interpret_func *ifunc, - int naddrc, int nsizec, - int measure_only) +static int __devinit finish_node(struct device_node *np, + unsigned long *mem_start, + interpret_func *ifunc, + int naddrc, int nsizec, + int measure_only) { struct device_node *child; int *ip, rc = 0; @@ -1627,54 +1654,6 @@ static void remove_node_proc_entries(str #endif /* CONFIG_PROC_DEVICETREE */ /* - * Fix up n_intrs and intrs fields in a new device node - * - */ -static int of_finish_dynamic_node_interrupts(struct device_node *node) -{ - int intrcells, intlen, i; - unsigned *irq, *ints, virq; - struct device_node *ic; - - ints = (unsigned int *)get_property(node, "interrupts", &intlen); - intrcells = prom_n_intr_cells(node); - intlen /= intrcells * sizeof(unsigned int); - node->n_intrs = intlen; - node->intrs = kmalloc(sizeof(struct interrupt_info) * intlen, - GFP_KERNEL); - if (!node->intrs) - return -ENOMEM; - - for (i = 0; i < intlen; ++i) { - int n, j; - node->intrs[i].line = 0; - node->intrs[i].sense = 1; - n = map_interrupt(&irq, &ic, node, ints, intrcells); - if (n <= 0) - continue; - virq = virt_irq_create_mapping(irq[0]); - if (virq == NO_IRQ) { - printk(KERN_CRIT "Could not allocate interrupt " - "number for %s\n", node->full_name); - return -ENOMEM; - } - node->intrs[i].line = irq_offset_up(virq); - if (n > 1) - node->intrs[i].sense = irq[1]; - if (n > 2) { - printk(KERN_DEBUG "hmmm, got %d intr cells for %s:", n, - node->full_name); - for (j = 0; j < n; ++j) - printk(" %d", irq[j]); - printk("\n"); - } - ints += intrcells; - } - return 0; -} - - -/* * Fix 
up the uninitialized fields in a new device node: * name, type, n_addrs, addrs, n_intrs, intrs, and pci-specific fields * @@ -1685,7 +1664,9 @@ static int of_finish_dynamic_node_interr * This should probably be split up into smaller chunks. */ -static int of_finish_dynamic_node(struct device_node *node) +static int of_finish_dynamic_node(struct device_node *node, + unsigned long *unused1, int unused2, + int unused3, int unused4) { struct device_node *parent = of_get_parent(node); u32 *regs; @@ -1710,41 +1691,6 @@ static int of_finish_dynamic_node(struct if ((ibm_phandle = (unsigned int *)get_property(node, "ibm,phandle", NULL))) node->linux_phandle = *ibm_phandle; - /* do the work of interpret_pci_props */ - if (parent->type && !strcmp(parent->type, "pci")) { - struct address_range *adr; - struct pci_reg_property *pci_addrs; - int i, l; - - pci_addrs = (struct pci_reg_property *) - get_property(node, "assigned-addresses", &l); - if (pci_addrs != 0 && l >= sizeof(struct pci_reg_property)) { - i = 0; - adr = kmalloc(sizeof(struct address_range) * - (l / sizeof(struct pci_reg_property)), - GFP_KERNEL); - if (!adr) { - err = -ENOMEM; - goto out; - } - while ((l -= sizeof(struct pci_reg_property)) >= 0) { - adr[i].space = pci_addrs[i].addr.a_hi; - adr[i].address = pci_addrs[i].addr.a_lo | - ((u64)pci_addrs[i].addr.a_mid << 32); - adr[i].size = pci_addrs[i].size_lo; - ++i; - } - node->addrs = adr; - node->n_addrs = i; - } - } - - /* now do the work of finish_node_interrupts */ - if (get_property(node, "interrupts", NULL)) { - err = of_finish_dynamic_node_interrupts(node); - if (err) goto out; - } - /* now do the rough equivalent of update_dn_pci_info, this * probably is not correct for phb's, but should work for * IOAs and slots. 
@@ -1796,7 +1742,8 @@ int of_add_node(const char *path, struct
 		return -EINVAL; /* could also be ENOMEM, though */
 	}
 
-	if (0 != (err = of_finish_dynamic_node(np))) {
+	err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0);
+	if (err < 0) {
 		kfree(np);
 		return err;
 	}

From ntl at pobox.com Tue Mar 15 13:49:38 2005
From: ntl at pobox.com (Nathan Lynch)
Date: Mon, 14 Mar 2005 20:49:38 -0600 (CST)
Subject: [PATCH 3/8] introduce pSeries_reconfig.[ch]
In-Reply-To: <20050315024923.11665.85622.82498@otto>
References: <20050315024923.11665.85622.82498@otto>
Message-ID: <20050315024938.11665.274.97750@otto>

Updates since last submission:

o The memory leaks in the error path of of_add_node which Arnd pointed
  out should be gone (see pSeries_reconfig_add_node).

o Fixed the potential null pointer dereference in the error path of
  pSeries_reconfig_add_node, where the code would try to
  kfree(np->full_name) even though the allocation for np had failed.

o As suggested by John, changed the names of of_add_node and
  of_remove_node to of_attach_node and of_detach_node, respectively,
  to reflect the changes in their meanings.

Move as much pSeries-specific DLPAR/hotplug code as possible into its
own file, which is built only when pSeries support is enabled in the
config. This new file is intended to contain support code for the
"Dynamic Reconfiguration" option in the RISC Platform Architecture,
which encompasses both PCI hotplug and dynamic logical partitioning
(DLPAR).

This patch mostly just moves code around, but the device node addition
and removal API is slightly modified. In this way, of_add_node and
of_remove_node are now responsible only for safely updating the device
tree and global list, without all the other stuff like proc entries
etc. of_add_node and of_remove_node have been renamed to
of_attach_node and of_detach_node, respectively.
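To make the narrowed responsibility concrete, here is a user-space sketch of the list bookkeeping that attach and detach are left with. It is an illustration under assumptions, not the kernel code: struct sk_node keeps only the four link fields, the sk_ names are hypothetical, and the locking is reduced to comments where the kernel takes devtree_lock. sk_detach() also assumes the node is currently attached.

```c
#include <stddef.h>

/* Reduced stand-in for struct device_node: link fields only. */
struct sk_node {
	struct sk_node *parent;
	struct sk_node *child;		/* head of the parent's child list */
	struct sk_node *sibling;	/* next child of the same parent */
	struct sk_node *allnext;	/* next node on the global list */
};

static struct sk_node *sk_allnodes;	/* head of the global list */

/* Push np onto its parent's child list and onto the global list. */
static void sk_attach(struct sk_node *np)
{
	/* kernel: write_lock(&devtree_lock) */
	np->sibling = np->parent->child;
	np->allnext = sk_allnodes;
	np->parent->child = np;
	sk_allnodes = np;
	/* kernel: write_unlock(&devtree_lock) */
}

/* Unlink np from both lists; np must already be attached. */
static void sk_detach(struct sk_node *np)
{
	struct sk_node *parent = np->parent;
	struct sk_node *prev;

	/* kernel: write_lock(&devtree_lock) */
	if (sk_allnodes == np) {
		sk_allnodes = np->allnext;
	} else {
		for (prev = sk_allnodes; prev->allnext != np;
		     prev = prev->allnext)
			;
		prev->allnext = np->allnext;
	}

	if (parent->child == np) {
		parent->child = np->sibling;
	} else {
		for (prev = parent->child; prev->sibling != np;
		     prev = prev->sibling)
			;
		prev->sibling = np->sibling;
	}
	/* kernel: write_unlock(&devtree_lock) */
}
```

Nothing here touches proc entries, reference counts, or memory ownership; in the description above those concerns move to the caller in the new file.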
This also adds the definitions and api for a notifier chain which is meant to be used by code that must act upon device node addition or removal. Patches to migrate code to the notifier api follow in this series. Signed-off-by: Nathan Lynch arch/ppc64/kernel/Makefile | 2 arch/ppc64/kernel/pSeries_reconfig.c | 446 +++++++++++++++ arch/ppc64/kernel/proc_ppc64.c | 249 -------- arch/ppc64/kernel/prom.c | 156 ----- include/asm-ppc64/pSeries_reconfig.h | 25 include/asm-ppc64/prom.h | 4 6 files changed, 487 insertions(+), 395 deletions(-) Index: linux-2.6.11-bk10/arch/ppc64/kernel/Makefile =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/Makefile 2005-03-14 21:28:14.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/Makefile 2005-03-14 22:06:42.000000000 +0000 @@ -31,7 +31,7 @@ obj-$(CONFIG_PPC_ISERIES) += iSeries_irq obj-$(CONFIG_PPC_MULTIPLATFORM) += nvram.o i8259.o prom_init.o prom.o mpic.o obj-$(CONFIG_PPC_PSERIES) += pSeries_pci.o pSeries_lpar.o pSeries_hvCall.o \ - pSeries_nvram.o rtasd.o ras.o \ + pSeries_nvram.o rtasd.o ras.o pSeries_reconfig.o \ xics.o rtas.o pSeries_setup.o pSeries_iommu.o obj-$(CONFIG_EEH) += eeh.o Index: linux-2.6.11-bk10/arch/ppc64/kernel/proc_ppc64.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/proc_ppc64.c 2005-03-14 21:28:14.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/proc_ppc64.c 2005-03-14 22:06:42.000000000 +0000 @@ -41,20 +41,6 @@ static struct file_operations page_map_f .mmap = page_map_mmap }; -#ifdef CONFIG_PPC_PSERIES -/* routines for /proc/ppc64/ofdt */ -static ssize_t ofdt_write(struct file *, const char __user *, size_t, loff_t *); -static void proc_ppc64_create_ofdt(void); -static int do_remove_node(char *); -static int do_add_node(char *, size_t); -static void release_prop_list(const struct property *); -static struct property *new_property(const char *, const int, const unsigned 
char *, struct property *); -static char * parse_next_property(char *, char *, char **, int *, unsigned char**); -static struct file_operations ofdt_fops = { - .write = ofdt_write -}; -#endif - /* * Create the ppc64 and ppc64/rtas directories early. This allows us to * assume that they have been previously created in drivers. @@ -92,11 +78,6 @@ static int __init proc_ppc64_init(void) pde->size = PAGE_SIZE; pde->proc_fops = &page_map_fops; -#ifdef CONFIG_PPC_PSERIES - if ((systemcfg->platform & PLATFORM_PSERIES)) - proc_ppc64_create_ofdt(); -#endif - return 0; } __initcall(proc_ppc64_init); @@ -145,233 +126,3 @@ static int page_map_mmap( struct file *f return 0; } -#ifdef CONFIG_PPC_PSERIES -/* create /proc/ppc64/ofdt write-only by root */ -static void proc_ppc64_create_ofdt(void) -{ - struct proc_dir_entry *ent; - - ent = create_proc_entry("ppc64/ofdt", S_IWUSR, NULL); - if (ent) { - ent->nlink = 1; - ent->data = NULL; - ent->size = 0; - ent->proc_fops = &ofdt_fops; - } -} - -/** - * ofdt_write - perform operations on the Open Firmware device tree - * - * @file: not used - * @buf: command and arguments - * @count: size of the command buffer - * @off: not used - * - * Operations supported at this time are addition and removal of - * whole nodes along with their properties. Operations on individual - * properties are not implemented (yet). - */ -static ssize_t ofdt_write(struct file *file, const char __user *buf, size_t count, - loff_t *off) -{ - int rv = 0; - char *kbuf; - char *tmp; - - if (!(kbuf = kmalloc(count + 1, GFP_KERNEL))) { - rv = -ENOMEM; - goto out; - } - if (copy_from_user(kbuf, buf, count)) { - rv = -EFAULT; - goto out; - } - - kbuf[count] = '\0'; - - tmp = strchr(kbuf, ' '); - if (!tmp) { - rv = -EINVAL; - goto out; - } - *tmp = '\0'; - tmp++; - - if (!strcmp(kbuf, "add_node")) - rv = do_add_node(tmp, count - (tmp - kbuf)); - else if (!strcmp(kbuf, "remove_node")) - rv = do_remove_node(tmp); - else - rv = -EINVAL; -out: - kfree(kbuf); - return rv ? 
rv : count; -} - -static int do_remove_node(char *buf) -{ - struct device_node *node; - int rv = -ENODEV; - - if ((node = of_find_node_by_path(buf))) - rv = of_remove_node(node); - - of_node_put(node); - return rv; -} - -static int do_add_node(char *buf, size_t bufsize) -{ - char *path, *end, *name; - struct device_node *np; - struct property *prop = NULL; - unsigned char* value; - int length, rv = 0; - - end = buf + bufsize; - path = buf; - buf = strchr(buf, ' '); - if (!buf) - return -EINVAL; - *buf = '\0'; - buf++; - - if ((np = of_find_node_by_path(path))) { - of_node_put(np); - return -EINVAL; - } - - /* rv = build_prop_list(tmp, bufsize - (tmp - buf), &proplist); */ - while (buf < end && - (buf = parse_next_property(buf, end, &name, &length, &value))) { - struct property *last = prop; - - prop = new_property(name, length, value, last); - if (!prop) { - rv = -ENOMEM; - prop = last; - goto out; - } - } - if (!buf) { - rv = -EINVAL; - goto out; - } - - rv = of_add_node(path, prop); - -out: - if (rv) - release_prop_list(prop); - return rv; -} - -static struct property *new_property(const char *name, const int length, - const unsigned char *value, struct property *last) -{ - struct property *new = kmalloc(sizeof(*new), GFP_KERNEL); - - if (!new) - return NULL; - memset(new, 0, sizeof(*new)); - - if (!(new->name = kmalloc(strlen(name) + 1, GFP_KERNEL))) - goto cleanup; - if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) - goto cleanup; - - strcpy(new->name, name); - memcpy(new->value, value, length); - *(((char *)new->value) + length) = 0; - new->length = length; - new->next = last; - return new; - -cleanup: - if (new->name) - kfree(new->name); - if (new->value) - kfree(new->value); - kfree(new); - return NULL; -} - -/** - * parse_next_property - process the next property from raw input buffer - * @buf: input buffer, must be nul-terminated - * @end: end of the input buffer + 1, for validation - * @name: return value; set to property name in buf - * @length: 
return value; set to length of value - * @value: return value; set to the property value in buf - * - * Note that the caller must make copies of the name and value returned, - * this function does no allocation or copying of the data. Return value - * is set to the next name in buf, or NULL on error. - */ -static char * parse_next_property(char *buf, char *end, char **name, int *length, - unsigned char **value) -{ - char *tmp; - - *name = buf; - - tmp = strchr(buf, ' '); - if (!tmp) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - *tmp = '\0'; - - if (++tmp >= end) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - - /* now we're on the length */ - *length = -1; - *length = simple_strtoul(tmp, &tmp, 10); - if (*length == -1) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - if (*tmp != ' ' || ++tmp >= end) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - - /* now we're on the value */ - *value = tmp; - tmp += *length; - if (tmp > end) { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - else if (tmp < end && *tmp != ' ' && *tmp != '\0') { - printk(KERN_ERR "property parse failed in %s at line %d\n", - __FUNCTION__, __LINE__); - return NULL; - } - tmp++; - - /* and now we should be on the next name, or the end */ - return tmp; -} - -static void release_prop_list(const struct property *prop) -{ - struct property *next; - for (; prop; prop = next) { - next = prop->next; - kfree(prop->name); - kfree(prop->value); - kfree(prop); - } - -} -#endif /* defined(CONFIG_PPC_PSERIES) */ Index: linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_reconfig.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 
linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_reconfig.c 2005-03-14 22:16:09.000000000 +0000 @@ -0,0 +1,446 @@ +/* + * pSeries_reconfig.c - support for dynamic reconfiguration (including PCI + * Hotplug and Dynamic Logical Partitioning on RPA platforms). + * + * Copyright (C) 2005 Nathan Lynch + * Copyright (C) 2005 IBM Corporation + * + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + */ + +#include +#include +#include +#include + +#include +#include +#include + + + +/* + * Routines for "runtime" addition and removal of device tree nodes. + */ +#ifdef CONFIG_PROC_DEVICETREE +/* + * Add a node to /proc/device-tree. + */ +static void add_node_proc_entries(struct device_node *np) +{ + struct proc_dir_entry *ent; + + ent = proc_mkdir(strrchr(np->full_name, '/') + 1, np->parent->pde); + if (ent) + proc_device_tree_add_node(np, ent); +} + +static void remove_node_proc_entries(struct device_node *np) +{ + struct property *pp = np->properties; + struct device_node *parent = np->parent; + + while (pp) { + remove_proc_entry(pp->name, np->pde); + pp = pp->next; + } + + /* Assuming that symlinks have the same parent directory as + * np->pde. + */ + if (np->name_link) + remove_proc_entry(np->name_link->name, parent->pde); + if (np->addr_link) + remove_proc_entry(np->addr_link->name, parent->pde); + if (np->pde) + remove_proc_entry(np->pde->name, parent->pde); +} +#else /* !CONFIG_PROC_DEVICETREE */ +static void add_node_proc_entries(struct device_node *np) +{ + return; +} + +static void remove_node_proc_entries(struct device_node *np) +{ + return; +} +#endif /* CONFIG_PROC_DEVICETREE */ + +/** + * derive_parent - basically like dirname(1) + * @path: the full_name of a node to be added to the tree + * + * Returns the node which should be the parent of the node + * described by path. 
E.g., for path = "/foo/bar", returns + * the node with full_name = "/foo". + */ +static struct device_node *derive_parent(const char *path) +{ + struct device_node *parent = NULL; + char *parent_path = "/"; + size_t parent_path_len = strrchr(path, '/') - path + 1; + + /* reject if path is "/" */ + if (!strcmp(path, "/")) + return ERR_PTR(-EINVAL); + + if (strrchr(path, '/') != path) { + parent_path = kmalloc(parent_path_len, GFP_KERNEL); + if (!parent_path) + return ERR_PTR(-ENOMEM); + strlcpy(parent_path, path, parent_path_len); + } + parent = of_find_node_by_path(parent_path); + if (!parent) + return ERR_PTR(-EINVAL); + if (strcmp(parent_path, "/")) + kfree(parent_path); + return parent; +} + +static struct notifier_block *pSeries_reconfig_chain; + +int pSeries_reconfig_notifier_register(struct notifier_block *nb) +{ + return notifier_chain_register(&pSeries_reconfig_chain, nb); +} + +void pSeries_reconfig_notifier_unregister(struct notifier_block *nb) +{ + notifier_chain_unregister(&pSeries_reconfig_chain, nb); +} + +static int pSeries_reconfig_add_node(const char *path, struct property *proplist) +{ + struct device_node *np; + int err = -ENOMEM; + + np = kcalloc(1, sizeof(*np), GFP_KERNEL); + if (!np) + goto out_err; + + np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL); + if (!np->full_name) + goto out_err; + + strcpy(np->full_name, path); + + np->properties = proplist; + OF_MARK_DYNAMIC(np); + kref_init(&np->kref); + + np->parent = derive_parent(path); + if (IS_ERR(np->parent)) { + err = PTR_ERR(np->parent); + goto out_err; + } + + err = notifier_call_chain(&pSeries_reconfig_chain, + PSERIES_RECONFIG_ADD, np); + if (err == NOTIFY_BAD) { + printk(KERN_ERR "Failed to add device node %s\n", path); + err = -ENOMEM; /* For now, safe to assume kmalloc failure */ + goto out_err; + } + + of_attach_node(np); + + add_node_proc_entries(np); + + of_node_put(np->parent); + + return 0; + +out_err: + if (np) { + of_node_put(np->parent); + kfree(np->full_name); + 
kfree(np); + } + return err; +} + +/* + * Prepare an OF node for removal from system + * XXX move this to pSeries_iommu.c + */ +static void of_cleanup_node(struct device_node *np) +{ + if (np->iommu_table && get_property(np, "ibm,dma-window", NULL)) + iommu_free_table(np); +} + +static int pSeries_reconfig_remove_node(struct device_node *np) +{ + struct device_node *parent, *child; + + parent = of_get_parent(np); + if (!parent) + return -EINVAL; + + if ((child = of_get_next_child(np, NULL))) { + of_node_put(child); + return -EBUSY; + } + + of_cleanup_node(np); + + remove_node_proc_entries(np); + + notifier_call_chain(&pSeries_reconfig_chain, + PSERIES_RECONFIG_REMOVE, np); + of_detach_node(np); + + of_node_put(parent); + of_node_put(np); /* Must decrement the refcount */ + return 0; +} + +/* + * /proc/ppc64/ofdt - yucky binary interface for adding and removing + * OF device nodes. Should be deprecated as soon as we get an + * in-kernel wrapper for the RTAS ibm,configure-connector call. + */ + +static void release_prop_list(const struct property *prop) +{ + struct property *next; + for (; prop; prop = next) { + next = prop->next; + kfree(prop->name); + kfree(prop->value); + kfree(prop); + } + +} + +/** + * parse_next_property - process the next property from raw input buffer + * @buf: input buffer, must be nul-terminated + * @end: end of the input buffer + 1, for validation + * @name: return value; set to property name in buf + * @length: return value; set to length of value + * @value: return value; set to the property value in buf + * + * Note that the caller must make copies of the name and value returned, + * this function does no allocation or copying of the data. Return value + * is set to the next name in buf, or NULL on error. 
+ */ +static char * parse_next_property(char *buf, char *end, char **name, int *length, + unsigned char **value) +{ + char *tmp; + + *name = buf; + + tmp = strchr(buf, ' '); + if (!tmp) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + *tmp = '\0'; + + if (++tmp >= end) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + + /* now we're on the length */ + *length = -1; + *length = simple_strtoul(tmp, &tmp, 10); + if (*length == -1) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + if (*tmp != ' ' || ++tmp >= end) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + + /* now we're on the value */ + *value = tmp; + tmp += *length; + if (tmp > end) { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + else if (tmp < end && *tmp != ' ' && *tmp != '\0') { + printk(KERN_ERR "property parse failed in %s at line %d\n", + __FUNCTION__, __LINE__); + return NULL; + } + tmp++; + + /* and now we should be on the next name, or the end */ + return tmp; +} + +static struct property *new_property(const char *name, const int length, + const unsigned char *value, struct property *last) +{ + struct property *new = kmalloc(sizeof(*new), GFP_KERNEL); + + if (!new) + return NULL; + memset(new, 0, sizeof(*new)); + + if (!(new->name = kmalloc(strlen(name) + 1, GFP_KERNEL))) + goto cleanup; + if (!(new->value = kmalloc(length + 1, GFP_KERNEL))) + goto cleanup; + + strcpy(new->name, name); + memcpy(new->value, value, length); + *(((char *)new->value) + length) = 0; + new->length = length; + new->next = last; + return new; + +cleanup: + if (new->name) + kfree(new->name); + if (new->value) + kfree(new->value); + kfree(new); + return NULL; +} + +static int do_add_node(char *buf, size_t bufsize) +{ + 
char *path, *end, *name; + struct device_node *np; + struct property *prop = NULL; + unsigned char* value; + int length, rv = 0; + + end = buf + bufsize; + path = buf; + buf = strchr(buf, ' '); + if (!buf) + return -EINVAL; + *buf = '\0'; + buf++; + + if ((np = of_find_node_by_path(path))) { + of_node_put(np); + return -EINVAL; + } + + /* rv = build_prop_list(tmp, bufsize - (tmp - buf), &proplist); */ + while (buf < end && + (buf = parse_next_property(buf, end, &name, &length, &value))) { + struct property *last = prop; + + prop = new_property(name, length, value, last); + if (!prop) { + rv = -ENOMEM; + prop = last; + goto out; + } + } + if (!buf) { + rv = -EINVAL; + goto out; + } + + rv = pSeries_reconfig_add_node(path, prop); + +out: + if (rv) + release_prop_list(prop); + return rv; +} + +static int do_remove_node(char *buf) +{ + struct device_node *node; + int rv = -ENODEV; + + if ((node = of_find_node_by_path(buf))) + rv = pSeries_reconfig_remove_node(node); + + of_node_put(node); + return rv; +} + +/** + * ofdt_write - perform operations on the Open Firmware device tree + * + * @file: not used + * @buf: command and arguments + * @count: size of the command buffer + * @off: not used + * + * Operations supported at this time are addition and removal of + * whole nodes along with their properties. Operations on individual + * properties are not implemented (yet). 
+ */ +static ssize_t ofdt_write(struct file *file, const char __user *buf, size_t count, + loff_t *off) +{ + int rv = 0; + char *kbuf; + char *tmp; + + if (!(kbuf = kmalloc(count + 1, GFP_KERNEL))) { + rv = -ENOMEM; + goto out; + } + if (copy_from_user(kbuf, buf, count)) { + rv = -EFAULT; + goto out; + } + + kbuf[count] = '\0'; + + tmp = strchr(kbuf, ' '); + if (!tmp) { + rv = -EINVAL; + goto out; + } + *tmp = '\0'; + tmp++; + + if (!strcmp(kbuf, "add_node")) + rv = do_add_node(tmp, count - (tmp - kbuf)); + else if (!strcmp(kbuf, "remove_node")) + rv = do_remove_node(tmp); + else + rv = -EINVAL; +out: + kfree(kbuf); + return rv ? rv : count; +} + +static struct file_operations ofdt_fops = { + .write = ofdt_write +}; + +/* create /proc/ppc64/ofdt write-only by root */ +static int proc_ppc64_create_ofdt(void) +{ + struct proc_dir_entry *ent; + + if (!(systemcfg->platform & PLATFORM_PSERIES)) + return 0; + + ent = create_proc_entry("ppc64/ofdt", S_IWUSR, NULL); + if (ent) { + ent->nlink = 1; + ent->data = NULL; + ent->size = 0; + ent->proc_fops = &ofdt_fops; + } + + return 0; +} +__initcall(proc_ppc64_create_ofdt); Index: linux-2.6.11-bk10/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/prom.c 2005-03-14 21:54:08.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/prom.c 2005-03-14 22:15:45.000000000 +0000 @@ -27,7 +27,6 @@ #include #include #include -#include #include #include #include @@ -1575,84 +1574,6 @@ void of_node_put(struct device_node *nod } EXPORT_SYMBOL(of_node_put); -/** - * derive_parent - basically like dirname(1) - * @path: the full_name of a node to be added to the tree - * - * Returns the node which should be the parent of the node - * described by path. E.g., for path = "/foo/bar", returns - * the node with full_name = "/foo". 
- */ -static struct device_node *derive_parent(const char *path) -{ - struct device_node *parent = NULL; - char *parent_path = "/"; - size_t parent_path_len = strrchr(path, '/') - path + 1; - - /* reject if path is "/" */ - if (!strcmp(path, "/")) - return NULL; - - if (strrchr(path, '/') != path) { - parent_path = kmalloc(parent_path_len, GFP_KERNEL); - if (!parent_path) - return NULL; - strlcpy(parent_path, path, parent_path_len); - } - parent = of_find_node_by_path(parent_path); - if (strcmp(parent_path, "/")) - kfree(parent_path); - return parent; -} - -/* - * Routines for "runtime" addition and removal of device tree nodes. - */ -#ifdef CONFIG_PROC_DEVICETREE -/* - * Add a node to /proc/device-tree. - */ -static void add_node_proc_entries(struct device_node *np) -{ - struct proc_dir_entry *ent; - - ent = proc_mkdir(strrchr(np->full_name, '/') + 1, np->parent->pde); - if (ent) - proc_device_tree_add_node(np, ent); -} - -static void remove_node_proc_entries(struct device_node *np) -{ - struct property *pp = np->properties; - struct device_node *parent = np->parent; - - while (pp) { - remove_proc_entry(pp->name, np->pde); - pp = pp->next; - } - - /* Assuming that symlinks have the same parent directory as - * np->pde. 
- */ - if (np->name_link) - remove_proc_entry(np->name_link->name, parent->pde); - if (np->addr_link) - remove_proc_entry(np->addr_link->name, parent->pde); - if (np->pde) - remove_proc_entry(np->pde->name, parent->pde); -} -#else /* !CONFIG_PROC_DEVICETREE */ -static void add_node_proc_entries(struct device_node *np) -{ - return; -} - -static void remove_node_proc_entries(struct device_node *np) -{ - return; -} -#endif /* CONFIG_PROC_DEVICETREE */ - /* * Fix up the uninitialized fields in a new device node: * name, type, n_addrs, addrs, n_intrs, intrs, and pci-specific fields @@ -1710,43 +1631,18 @@ out: } /* - * Given a path and a property list, construct an OF device node, add - * it to the device tree and global list, and place it in - * /proc/device-tree. This function may sleep. + * Plug a device node into the tree and global list. */ -int of_add_node(const char *path, struct property *proplist) +void of_attach_node(struct device_node *np) { - struct device_node *np; - int err = 0; - - np = kmalloc(sizeof(struct device_node), GFP_KERNEL); - if (!np) - return -ENOMEM; - - memset(np, 0, sizeof(*np)); - - np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL); - if (!np->full_name) { - kfree(np); - return -ENOMEM; - } - strcpy(np->full_name, path); - - np->properties = proplist; - OF_MARK_DYNAMIC(np); - kref_init(&np->kref); - of_node_get(np); - np->parent = derive_parent(path); - if (!np->parent) { - kfree(np); - return -EINVAL; /* could also be ENOMEM, though */ - } + int err; + /* This use of finish_node will be moved to a notifier so + * the error code can be used. 
+ */ err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0); - if (err < 0) { - kfree(np); - return err; - } + if (err < 0) + return; write_lock(&devtree_lock); np->sibling = np->parent->child; @@ -1754,21 +1650,6 @@ int of_add_node(const char *path, struct np->parent->child = np; allnodes = np; write_unlock(&devtree_lock); - - add_node_proc_entries(np); - - of_node_put(np->parent); - of_node_put(np); - return 0; -} - -/* - * Prepare an OF node for removal from system - */ -static void of_cleanup_node(struct device_node *np) -{ - if (np->iommu_table && get_property(np, "ibm,dma-window", NULL)) - iommu_free_table(np); } /* @@ -1776,23 +1657,14 @@ static void of_cleanup_node(struct devic * a reference to the node. The memory associated with the node * is not freed until its refcount goes to zero. */ -int of_remove_node(struct device_node *np) +void of_detach_node(const struct device_node *np) { - struct device_node *parent, *child; + struct device_node *parent; - parent = of_get_parent(np); - if (!parent) - return -EINVAL; - - if ((child = of_get_next_child(np, NULL))) { - of_node_put(child); - return -EBUSY; - } + write_lock(&devtree_lock); - of_cleanup_node(np); + parent = np->parent; - write_lock(&devtree_lock); - remove_node_proc_entries(np); if (allnodes == np) allnodes = np->allnext; else { @@ -1814,10 +1686,8 @@ int of_remove_node(struct device_node *n ; prevsib->sibling = np->sibling; } + write_unlock(&devtree_lock); - of_node_put(parent); - of_node_put(np); /* Must decrement the refcount */ - return 0; } /* Index: linux-2.6.11-bk10/include/asm-ppc64/prom.h =================================================================== --- linux-2.6.11-bk10.orig/include/asm-ppc64/prom.h 2005-03-14 21:28:20.000000000 +0000 +++ linux-2.6.11-bk10/include/asm-ppc64/prom.h 2005-03-14 22:15:17.000000000 +0000 @@ -209,8 +209,8 @@ extern struct device_node *of_node_get(s extern void of_node_put(struct device_node *node); /* For updating the device tree at runtime */ 
-extern int of_add_node(const char *path, struct property *proplist); -extern int of_remove_node(struct device_node *np); +extern void of_attach_node(struct device_node *); +extern void of_detach_node(const struct device_node *); /* Other Prototypes */ extern unsigned long prom_init(unsigned long, unsigned long, unsigned long, Index: linux-2.6.11-bk10/include/asm-ppc64/pSeries_reconfig.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.11-bk10/include/asm-ppc64/pSeries_reconfig.h 2005-03-14 22:06:42.000000000 +0000 @@ -0,0 +1,25 @@ +#ifndef _PPC64_PSERIES_RECONFIG_H +#define _PPC64_PSERIES_RECONFIG_H + +#include + +/* + * Use this API if your code needs to know about OF device nodes being + * added or removed on pSeries systems. + */ + +#define PSERIES_RECONFIG_ADD 0x0001 +#define PSERIES_RECONFIG_REMOVE 0x0002 + +#ifdef CONFIG_PPC_PSERIES +extern int pSeries_reconfig_notifier_register(struct notifier_block *); +extern void pSeries_reconfig_notifier_unregister(struct notifier_block *); +#else /* !CONFIG_PPC_PSERIES */ +static inline int pSeries_reconfig_notifier_register(struct notifier_block *nb) +{ + return 0; +} +static inline void pSeries_reconfig_notifier_unregister(struct notifier_block *nb) { } +#endif /* CONFIG_PPC_PSERIES */ + +#endif /* _PPC64_PSERIES_RECONFIG_H */ From ntl at pobox.com Tue Mar 15 13:49:43 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:49:43 -0600 (CST) Subject: [PATCH 4/8] prom.c: use pSeries reconfig notifier In-Reply-To: <20050315024923.11665.85622.82498@otto> References: <20050315024923.11665.85622.82498@otto> Message-ID: <20050315024943.11665.41759.40955@otto> Use the pSeries_reconfig notifier list to fix up a device node which is about to be added. 
Signed-off-by: Nathan Lynch arch/ppc64/kernel/prom.c | 40 ++++++++++++++++++++------- 1 files changed, 31 insertions(+), 9 deletions(-) Index: linux-2.6.11-bk10/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/prom.c 2005-03-14 22:15:45.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/prom.c 2005-03-14 22:28:19.000000000 +0000 @@ -52,6 +52,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) @@ -1635,15 +1636,6 @@ out: */ void of_attach_node(struct device_node *np) { - int err; - - /* This use of finish_node will be moved to a notifier so - * the error code can be used. - */ - err = finish_node(np, NULL, of_finish_dynamic_node, 0, 0, 0); - if (err < 0) - return; - write_lock(&devtree_lock); np->sibling = np->parent->child; np->allnext = allnodes; @@ -1690,6 +1682,36 @@ void of_detach_node(const struct device_ write_unlock(&devtree_lock); } +static int prom_reconfig_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + int err; + + switch (action) { + case PSERIES_RECONFIG_ADD: + err = finish_node(node, NULL, of_finish_dynamic_node, 0, 0, 0); + if (err < 0) { + printk(KERN_ERR "finish_node returned %d\n", err); + err = NOTIFY_BAD; + } + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block prom_reconfig_nb = { + .notifier_call = prom_reconfig_notifier, + .priority = 10, /* This one needs to run first */ +}; + +static int __init prom_reconfig_setup(void) +{ + return pSeries_reconfig_notifier_register(&prom_reconfig_nb); +} +__initcall(prom_reconfig_setup); + /* * Find a property with a given name for a given node * and return the value.
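For readers unfamiliar with notifier chains, the effect of giving the prom.c notifier a priority of 10 can be sketched in userland C. This is an illustration only, not the kernel implementation: every sketch_ name is hypothetical, and the return-code values are assumed to mirror NOTIFY_DONE, NOTIFY_OK and NOTIFY_BAD. A block with a higher priority is placed earlier in the chain, and a return value with the stop bit set ends the walk, so a failed fixup keeps later callbacks from seeing the node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's notifier return codes. */
#define SKETCH_NOTIFY_DONE	0x0000
#define SKETCH_NOTIFY_OK	0x0001
#define SKETCH_NOTIFY_STOP_MASK	0x8000
#define SKETCH_NOTIFY_BAD	(SKETCH_NOTIFY_STOP_MASK | 0x0002)
#define SKETCH_RECONFIG_ADD	0x0001

struct sketch_notifier_block {
	int (*notifier_call)(struct sketch_notifier_block *nb,
			     unsigned long action, void *data);
	struct sketch_notifier_block *next;
	int priority;
};

/* Records the order in which callbacks ran during the last demo. */
int sketch_order[4];
int sketch_calls;

/* Insert nb so that blocks with a higher priority come first. */
void sketch_chain_register(struct sketch_notifier_block **head,
			   struct sketch_notifier_block *nb)
{
	assert(nb != NULL && nb->notifier_call != NULL);
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

/* Walk the chain; stop early when a callback sets the stop bit. */
int sketch_call_chain(struct sketch_notifier_block *head,
		      unsigned long action, void *data)
{
	int ret = SKETCH_NOTIFY_DONE;

	while (head) {
		ret = head->notifier_call(head, action, data);
		if (ret & SKETCH_NOTIFY_STOP_MASK)
			break;
		head = head->next;
	}
	return ret;
}

/* Plays the role of the fixup notifier: data points to a "fail" flag. */
int sketch_fixup(struct sketch_notifier_block *nb, unsigned long action,
		 void *data)
{
	(void)nb;
	if (action != SKETCH_RECONFIG_ADD)
		return SKETCH_NOTIFY_DONE;
	sketch_order[sketch_calls++] = 1;
	return *(int *)data ? SKETCH_NOTIFY_BAD : SKETCH_NOTIFY_OK;
}

/* Plays the role of a later consumer that needs a finished node. */
int sketch_consumer(struct sketch_notifier_block *nb, unsigned long action,
		    void *data)
{
	(void)nb;
	(void)data;
	if (action != SKETCH_RECONFIG_ADD)
		return SKETCH_NOTIFY_DONE;
	sketch_order[sketch_calls++] = 2;
	return SKETCH_NOTIFY_OK;
}

/* Register the consumer first, then the fixup; return callbacks run. */
int sketch_demo(int fail)
{
	struct sketch_notifier_block fixup = {
		.notifier_call = sketch_fixup, .priority = 10 };
	struct sketch_notifier_block consumer = {
		.notifier_call = sketch_consumer, .priority = 0 };
	struct sketch_notifier_block *chain = NULL;

	sketch_calls = 0;
	sketch_chain_register(&chain, &consumer);
	sketch_chain_register(&chain, &fixup);
	sketch_call_chain(chain, SKETCH_RECONFIG_ADD, &fail);
	return sketch_calls;
}
```

Calling sketch_demo(0) runs the fixup before the consumer even though the consumer registered first; sketch_demo(1) runs only the fixup, because the stop bit ends the walk.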
From ntl at pobox.com Tue Mar 15 13:49:49 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:49:49 -0600 (CST) Subject: [PATCH 5/8] pci_dn.c: use pSeries reconfig notifier In-Reply-To: <20050315024923.11665.85622.82498@otto> References: <20050315024923.11665.85622.82498@otto> Message-ID: <20050315024949.11665.81330.76171@otto> Use the pSeries_reconfig notifier list to handle newly added pci device nodes. Signed-off-by: Nathan Lynch arch/ppc64/kernel/pci_dn.c | 22 ++++++++++++++++++++++ arch/ppc64/kernel/prom.c | 14 -------------- 2 files changed, 22 insertions(+), 14 deletions(-) Index: linux-2.6.11-bk10/arch/ppc64/kernel/pci_dn.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/pci_dn.c 2005-03-14 21:28:14.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/pci_dn.c 2005-03-14 22:29:03.000000000 +0000 @@ -27,6 +27,7 @@ #include #include #include +#include #include "pci.h" @@ -161,6 +162,25 @@ struct device_node *fetch_dev_dn(struct } EXPORT_SYMBOL(fetch_dev_dn); +static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + struct device_node *np = node; + int err = NOTIFY_OK; + + switch (action) { + case PSERIES_RECONFIG_ADD: + update_dn_pci_info(np, np->parent->phb); + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block pci_dn_reconfig_nb = { + .notifier_call = pci_dn_reconfig_notifier, +}; /* * Actually initialize the phbs. @@ -173,4 +193,6 @@ void __init pci_devs_phb_init(void) /* This must be done first so the device nodes have valid pci info! 
*/ list_for_each_entry_safe(phb, tmp, &hose_list, list_node) pci_devs_phb_init_dynamic(phb); + + pSeries_reconfig_notifier_register(&pci_dn_reconfig_nb); } Index: linux-2.6.11-bk10/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/prom.c 2005-03-14 22:28:19.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/prom.c 2005-03-14 22:29:03.000000000 +0000 @@ -1591,7 +1591,6 @@ static int of_finish_dynamic_node(struct int unused3, int unused4) { struct device_node *parent = of_get_parent(node); - u32 *regs; int err = 0; phandle *ibm_phandle; @@ -1613,19 +1612,6 @@ static int of_finish_dynamic_node(struct if ((ibm_phandle = (unsigned int *)get_property(node, "ibm,phandle", NULL))) node->linux_phandle = *ibm_phandle; - /* now do the rough equivalent of update_dn_pci_info, this - * probably is not correct for phb's, but should work for - * IOAs and slots. - */ - - node->phb = parent->phb; - - regs = (u32 *)get_property(node, "reg", NULL); - if (regs) { - node->busno = (regs[0] >> 16) & 0xff; - node->devfn = (regs[0] >> 8) & 0xff; - } - out: of_node_put(parent); return err; From ntl at pobox.com Tue Mar 15 13:49:54 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:49:54 -0600 (CST) Subject: [PATCH 6/8] pSeries_iommu.c: use pSeries reconfig notifier In-Reply-To: <20050315024923.11665.85622.82498@otto> References: <20050315024923.11665.85622.82498@otto> Message-ID: <20050315024954.11665.81666.16106@otto> Use the pSeries_reconfig notifier chain for tearing down the iommu table when a device node is removed. 
Signed-off-by: Nathan Lynch arch/ppc64/kernel/pSeries_iommu.c | 25 +++++++++++++++ arch/ppc64/kernel/pSeries_reconfig.c | 12 ------- 2 files changed, 25 insertions(+), 12 deletions(-) Index: linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_iommu.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/pSeries_iommu.c 2005-03-13 02:51:53.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_iommu.c 2005-03-14 22:29:30.000000000 +0000 @@ -43,6 +43,7 @@ #include #include #include +#include #include #include "pci.h" @@ -455,6 +456,28 @@ static void iommu_dev_setup_pSeries(stru } } +static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + int err = NOTIFY_OK; + struct device_node *np = node; + + switch (action) { + case PSERIES_RECONFIG_REMOVE: + if (np->iommu_table && + get_property(np, "ibm,dma-window", NULL)) + iommu_free_table(np); + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block iommu_reconfig_nb = { + .notifier_call = iommu_reconfig_notifier, +}; + static void iommu_bus_setup_null(struct pci_bus *b) { } static void iommu_dev_setup_null(struct pci_dev *d) { } @@ -487,6 +510,8 @@ void iommu_init_early_pSeries(void) ppc_md.iommu_dev_setup = iommu_dev_setup_pSeries; + pSeries_reconfig_notifier_register(&iommu_reconfig_nb); + pci_iommu_init(); } Index: linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_reconfig.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/pSeries_reconfig.c 2005-03-14 22:16:09.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_reconfig.c 2005-03-14 22:29:30.000000000 +0000 @@ -164,16 +164,6 @@ out_err: return err; } -/* - * Prepare an OF node for removal from system - * XXX move this to pSeries_iommu.c - */ -static void of_cleanup_node(struct device_node *np) -{ - if (np->iommu_table && get_property(np, 
"ibm,dma-window", NULL)) - iommu_free_table(np); -} - static int pSeries_reconfig_remove_node(struct device_node *np) { struct device_node *parent, *child; @@ -187,8 +177,6 @@ static int pSeries_reconfig_remove_node( return -EBUSY; } - of_cleanup_node(np); - remove_node_proc_entries(np); notifier_call_chain(&pSeries_reconfig_chain, From ntl at pobox.com Tue Mar 15 13:49:59 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:49:59 -0600 (CST) Subject: [PATCH 7/8] pSeries_smp.c: use pSeries reconfig notifier for cpu DLPAR In-Reply-To: <20050315024923.11665.85622.82498@otto> References: <20050315024923.11665.85622.82498@otto> Message-ID: <20050315024959.11665.79221.22369@otto> Use the pSeries_reconfig notifier API to handle processor addition and removal on pSeries LPAR. This is the "right" way to do it, as opposed to setting cpu_present_map = cpu_possible_map at boot (this is fixed in a following patch). Signed-off-by: Nathan Lynch arch/ppc64/kernel/pSeries_smp.c | 126 ++++++++++++++++++++ 1 files changed, 126 insertions(+) Index: linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linux-2.6.11-bk10.orig/arch/ppc64/kernel/pSeries_smp.c 2005-03-14 21:28:14.000000000 +0000 +++ linux-2.6.11-bk10/arch/ppc64/kernel/pSeries_smp.c 2005-03-14 22:29:53.000000000 +0000 @@ -44,6 +44,7 @@ #include #include #include +#include #include "mpic.h" @@ -213,6 +214,127 @@ static inline int __devinit smp_startup_ } return 1; } + +/* + * Update cpu_present_map and paca(s) for a new cpu node. The wrinkle + * here is that a cpu device node may represent up to two logical cpus + * in the SMT case. We must honor the assumption in other code that + * the logical ids for sibling SMT threads x and y are adjacent, such + * that x^1 == y and y^1 == x. 
+ */ +static int pSeries_add_processor(struct device_node *np) +{ + unsigned int cpu; + cpumask_t candidate_map, tmp = CPU_MASK_NONE; + int err = -ENOSPC, len, nthreads, i; + u32 *intserv; + + intserv = (u32 *)get_property(np, "ibm,ppc-interrupt-server#s", &len); + if (!intserv) + return 0; + + nthreads = len / sizeof(u32); + for (i = 0; i < nthreads; i++) + cpu_set(i, tmp); + + lock_cpu_hotplug(); + + BUG_ON(!cpus_subset(cpu_present_map, cpu_possible_map)); + + /* Get a bitmap of unoccupied slots. */ + cpus_xor(candidate_map, cpu_possible_map, cpu_present_map); + if (cpus_empty(candidate_map)) { + /* If we get here, it most likely means that NR_CPUS is + * less than the partition's max processors setting. + */ + printk(KERN_ERR "Cannot add cpu %s; this system configuration" + " supports %d logical cpus.\n", np->full_name, + cpus_weight(cpu_possible_map)); + goto out_unlock; + } + + while (!cpus_empty(tmp)) + if (cpus_subset(tmp, candidate_map)) + /* Found a range where we can insert the new cpu(s) */ + break; + else + cpus_shift_left(tmp, tmp, nthreads); + + if (cpus_empty(tmp)) { + printk(KERN_ERR "Unable to find space in cpu_present_map for" + " processor %s with %d thread(s)\n", np->name, + nthreads); + goto out_unlock; + } + + for_each_cpu_mask(cpu, tmp) { + BUG_ON(cpu_isset(cpu, cpu_present_map)); + cpu_set(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, *intserv++); + } + err = 0; +out_unlock: + unlock_cpu_hotplug(); + return err; +} + +/* + * Update the present map for a cpu node which is going away, and set + * the hard id in the paca(s) to -1 to be consistent with boot time + * convention for non-present cpus. 
+ */ +static void pSeries_remove_processor(struct device_node *np) +{ + unsigned int cpu; + int len, nthreads, i; + u32 *intserv; + + intserv = (u32 *)get_property(np, "ibm,ppc-interrupt-server#s", &len); + if (!intserv) + return; + + nthreads = len / sizeof(u32); + + lock_cpu_hotplug(); + for (i = 0; i < nthreads; i++) { + for_each_present_cpu(cpu) { + if (get_hard_smp_processor_id(cpu) != intserv[i]) + continue; + BUG_ON(cpu_online(cpu)); + cpu_clear(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, -1); + break; + } + if (cpu == NR_CPUS) + printk(KERN_WARNING "Could not find cpu to remove " + "with physical id 0x%x\n", intserv[i]); + } + unlock_cpu_hotplug(); +} + +static int pSeries_smp_notifier(struct notifier_block *nb, unsigned long action, void *node) +{ + int err = NOTIFY_OK; + + switch (action) { + case PSERIES_RECONFIG_ADD: + if (pSeries_add_processor(node)) + err = NOTIFY_BAD; + break; + case PSERIES_RECONFIG_REMOVE: + pSeries_remove_processor(node); + break; + default: + err = NOTIFY_DONE; + break; + } + return err; +} + +static struct notifier_block pSeries_smp_nb = { + .notifier_call = pSeries_smp_notifier, +}; + #else /* ... 
CONFIG_HOTPLUG_CPU */ static inline int __devinit smp_startup_cpu(unsigned int lcpu) { @@ -336,6 +458,10 @@ void __init smp_init_pSeries(void) #ifdef CONFIG_HOTPLUG_CPU smp_ops->cpu_disable = pSeries_cpu_disable; smp_ops->cpu_die = pSeries_cpu_die; + + /* Processors can be added/removed only on LPAR */ + if (systemcfg->platform == PLATFORM_PSERIES_LPAR) + pSeries_reconfig_notifier_register(&pSeries_smp_nb); #endif /* Start secondary threads on SMT systems; primary threads From ntl at pobox.com Tue Mar 15 13:50:04 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 14 Mar 2005 20:50:04 -0600 (CST) Subject: [PATCH 8/8] make cpu hotplug play well with maxcpus and smt-enabled In-Reply-To: <20050315024923.11665.85622.82498@otto> References: <20050315024923.11665.85622.82498@otto> Message-ID: <20050315025004.11665.99923.55129@otto> This patch allows you to boot a pSeries system with maxcpus=x or smt-enabled=off (or both) and bring up the offline cpus later from userspace, assuming the kernel was built with CONFIG_HOTPLUG_CPU=y. - Record cpus which were started from OF in a cpu map and use that instead of system_state to decide how to start a cpu in smp_startup_cpu. - Change the smp bootup logic slightly so that the path for bringing up secondary threads is exactly the same as hotplugging a cpu later from userspace. - Add a new function to smp_ops - cpu_bootable. This is implemented only by pSeries to filter out secondary threads during boot with smt-enabled=off. Another way this could be done is to change the kick_cpu member to return int and we can check for this case in smp_pSeries_kick_cpu. - Remove the games we play with cpu_present_map and the hard_smp_processor_id to handle smt-enabled=off, since they're now unnecessary. - Remove find_physical_cpu_to_start; assigning threads to logical slots should be done at bootup and at DLPAR time, not during a cpu online operation. 
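The cpu_bootable idea from the list above can be sketched in userland C as follows. This is a simplified model under stated assumptions, not the patch itself: the sketch_ names, the two-state boot enum and the global flag are hypothetical stand-ins for system_state and smt_enabled_at_boot, and the CPU_FTR_SMT check is left out. The hook is optional, so platforms that do not set it accept every cpu.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for system_state. */
enum sketch_system_state { SKETCH_BOOTING, SKETCH_RUNNING };

struct sketch_smp_ops {
	int (*cpu_bootable)(unsigned int nr);	/* optional, may be NULL */
};

enum sketch_system_state sketch_state = SKETCH_BOOTING;
int sketch_smt_enabled_at_boot = 0;	/* models smt-enabled=off */

/*
 * During boot with SMT disabled, refuse odd-numbered logical cpus,
 * which are assumed to be secondary threads.
 */
int sketch_pseries_cpu_bootable(unsigned int nr)
{
	if (sketch_state < SKETCH_RUNNING &&
	    !sketch_smt_enabled_at_boot && nr % 2 != 0)
		return 0;
	return 1;
}

const struct sketch_smp_ops sketch_pseries_ops = {
	.cpu_bootable = sketch_pseries_cpu_bootable,
};

/* A platform that does not implement the hook at all. */
const struct sketch_smp_ops sketch_other_ops = {
	.cpu_bootable = NULL,
};

/* Models the check at the top of __cpu_up(): 0 means "go ahead". */
int sketch_cpu_up(const struct sketch_smp_ops *ops, unsigned int cpu)
{
	assert(ops != NULL);
	if (ops->cpu_bootable && !ops->cpu_bootable(cpu))
		return -EINVAL;
	return 0;
}
```

With these defaults, cpu 1 is refused while booting and accepted once sketch_state is SKETCH_RUNNING, which corresponds to bringing the secondary thread online later from userspace.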
One caveat: you need up-to-date firmware on Power5 for the maxcpus option to work on systems with more than one processor. Otherwise interrupts get misrouted, typically resulting in hangs or "unable to find root filesystem" problems. Tested on Power5 with and without CONFIG_HOTPLUG_CPU and with various combinations of the maxcpus= and smt-enabled= parameters. arch/ppc64/kernel/pSeries_smp.c | 183 +++++++++++++++------------------------- arch/ppc64/kernel/setup.c | 12 -- arch/ppc64/kernel/smp.c | 13 -- include/asm-ppc64/machdep.h | 1 4 files changed, 78 insertions(+), 131 deletions(-) Signed-off-by: Nathan Lynch Index: linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/pSeries_smp.c 2005-03-09 20:31:06.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/pSeries_smp.c 2005-03-09 20:32:55.000000000 +0000 @@ -54,8 +54,16 @@ #define DBG(fmt...) #endif +/* + * The primary thread of each non-boot processor is recorded here before + * smp init. + */ +static cpumask_t of_spin_map; + extern void pSeries_secondary_smp_init(unsigned long); +#ifdef CONFIG_HOTPLUG_CPU + /* Get state of physical CPU. * Return codes: * 0 - The processor is in the RTAS stopped state @@ -82,9 +90,6 @@ static int query_cpu_stopped(unsigned in return cpu_status; } - -#ifdef CONFIG_HOTPLUG_CPU - int pSeries_cpu_disable(void) { systemcfg->processorCount--; @@ -123,98 +128,6 @@ void pSeries_cpu_die(unsigned int cpu) paca[cpu].cpu_start = 0; } -/* Search all cpu device nodes for an offline logical cpu. If a - * device node has a "ibm,my-drc-index" property (meaning this is an - * LPAR), paranoid-check whether we own the cpu. For each "thread" - * of a cpu, if it is offline and has the same hw index as before, - * grab that in preference. 
- */ -static unsigned int find_physical_cpu_to_start(unsigned int old_hwindex) -{ - struct device_node *np = NULL; - unsigned int best = -1U; - - while ((np = of_find_node_by_type(np, "cpu"))) { - int nr_threads, len; - u32 *index = (u32 *)get_property(np, "ibm,my-drc-index", NULL); - u32 *tid = (u32 *) - get_property(np, "ibm,ppc-interrupt-server#s", &len); - - if (!tid) - tid = (u32 *)get_property(np, "reg", &len); - - if (!tid) - continue; - - /* If there is a drc-index, make sure that we own - * the cpu. - */ - if (index) { - int state; - int rc = rtas_get_sensor(9003, *index, &state); - if (rc < 0 || state != 1) - continue; - } - - nr_threads = len / sizeof(u32); - - while (nr_threads--) { - if (0 == query_cpu_stopped(tid[nr_threads])) { - best = tid[nr_threads]; - if (best == old_hwindex) - goto out; - } - } - } -out: - of_node_put(np); - return best; -} - -/** - * smp_startup_cpu() - start the given cpu - * - * At boot time, there is nothing to do. At run-time, call RTAS with - * the appropriate start location, if the cpu is in the RTAS stopped - * state. - * - * Returns: - * 0 - failure - * 1 - success - */ -static inline int __devinit smp_startup_cpu(unsigned int lcpu) -{ - int status; - unsigned long start_here = __pa((u32)*((unsigned long *) - pSeries_secondary_smp_init)); - unsigned int pcpu; - - /* At boot time the cpus are already spinning in hold - * loops, so nothing to do. */ - if (system_state < SYSTEM_RUNNING) - return 1; - - pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu)); - if (pcpu == -1U) { - printk(KERN_INFO "No more cpus available, failing\n"); - return 0; - } - - /* Fixup atomic count: it exited inside IRQ handler. */ - paca[lcpu].__current->thread_info->preempt_count = 0; - - /* At boot this is done in prom.c. 
*/ - paca[lcpu].hw_cpu_id = pcpu; - - status = rtas_call(rtas_token("start-cpu"), 3, 1, NULL, - pcpu, start_here, lcpu); - if (status != 0) { - printk(KERN_ERR "start-cpu failed: %i\n", status); - return 0; - } - return 1; -} - /* * Update cpu_present_map and paca(s) for a new cpu node. The wrinkle * here is that a cpu device node may represent up to two logical cpus @@ -335,12 +248,43 @@ static struct notifier_block pSeries_smp .notifier_call = pSeries_smp_notifier, }; -#else /* ... CONFIG_HOTPLUG_CPU */ +#endif /* CONFIG_HOTPLUG_CPU */ + +/** + * smp_startup_cpu() - start the given cpu + * + * At boot time, there is nothing to do for primary threads which were + * started from Open Firmware. For anything else, call RTAS with the + * appropriate start location. + * + * Returns: + * 0 - failure + * 1 - success + */ static inline int __devinit smp_startup_cpu(unsigned int lcpu) { + int status; + unsigned long start_here = __pa((u32)*((unsigned long *) + pSeries_secondary_smp_init)); + unsigned int pcpu; + + if (cpu_isset(lcpu, of_spin_map)) + /* Already started by OF and sitting in spin loop */ + return 1; + + pcpu = get_hard_smp_processor_id(lcpu); + + /* Fixup atomic count: it exited inside IRQ handler. */ + paca[lcpu].__current->thread_info->preempt_count = 0; + + status = rtas_call(rtas_token("start-cpu"), 3, 1, NULL, + pcpu, start_here, lcpu); + if (status != 0) { + printk(KERN_ERR "start-cpu failed: %i\n", status); + return 0; + } return 1; } -#endif /* CONFIG_HOTPLUG_CPU */ static inline void smp_xics_do_message(int cpu, int msg) { @@ -380,6 +324,8 @@ static void __devinit smp_xics_setup_cpu if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) vpa_init(cpu); + cpu_clear(cpu, of_spin_map); + /* * Put the calling processor into the GIQ. 
This is really only * necessary from a secondary thread as the OF start-cpu interface @@ -429,6 +375,20 @@ static void __devinit smp_pSeries_kick_c paca[nr].cpu_start = 1; } +static int smp_pSeries_cpu_bootable(unsigned int nr) +{ + /* Special case - we inhibit secondary thread startup + * during boot if the user requests it. Odd-numbered + * cpus are assumed to be secondary threads. + */ + if (system_state < SYSTEM_RUNNING && + cur_cpu_spec->cpu_features & CPU_FTR_SMT && + !smt_enabled_at_boot && nr % 2 != 0) + return 0; + + return 1; +} + static struct smp_ops_t pSeries_mpic_smp_ops = { .message_pass = smp_mpic_message_pass, .probe = smp_mpic_probe, @@ -441,12 +401,13 @@ static struct smp_ops_t pSeries_xics_smp .probe = smp_xics_probe, .kick_cpu = smp_pSeries_kick_cpu, .setup_cpu = smp_xics_setup_cpu, + .cpu_bootable = smp_pSeries_cpu_bootable, }; /* This is called very early */ void __init smp_init_pSeries(void) { - int ret, i; + int i; DBG(" -> smp_init_pSeries()\n"); @@ -464,20 +425,20 @@ void __init smp_init_pSeries(void) pSeries_reconfig_notifier_register(&pSeries_smp_nb); #endif - /* Start secondary threads on SMT systems; primary threads - * are already in the running state. - */ - for_each_present_cpu(i) { - if (query_cpu_stopped(get_hard_smp_processor_id(i)) == 0) { - printk("%16.16x : starting thread\n", i); - DBG("%16.16x : starting thread\n", i); - rtas_call(rtas_token("start-cpu"), 3, 1, &ret, - get_hard_smp_processor_id(i), - __pa((u32)*((unsigned long *) - pSeries_secondary_smp_init)), - i); + /* Mark threads which are still spinning in hold loops. */ + if (cur_cpu_spec->cpu_features & CPU_FTR_SMT) + for_each_present_cpu(i) { + if (i % 2 == 0) + /* + * Even-numbered logical cpus correspond to + * primary threads. 
+ */ + cpu_set(i, of_spin_map); } - } + else + of_spin_map = cpu_present_map; + + cpu_clear(boot_cpuid, of_spin_map); /* Non-lpar has additional take/give timebase */ if (rtas_token("freeze-time-base") != RTAS_UNKNOWN_SERVICE) { Index: linux-2.6.11-bk5/include/asm-ppc64/machdep.h =================================================================== --- linux-2.6.11-bk5.orig/include/asm-ppc64/machdep.h 2005-03-09 20:30:34.000000000 +0000 +++ linux-2.6.11-bk5/include/asm-ppc64/machdep.h 2005-03-09 20:32:55.000000000 +0000 @@ -33,6 +33,7 @@ struct smp_ops_t { int (*cpu_enable)(unsigned int nr); int (*cpu_disable)(void); void (*cpu_die)(unsigned int nr); + int (*cpu_bootable)(unsigned int nr); }; #endif Index: linux-2.6.11-bk5/arch/ppc64/kernel/smp.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/smp.c 2005-03-09 20:30:34.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/smp.c 2005-03-09 20:32:55.000000000 +0000 @@ -490,9 +490,8 @@ int __devinit __cpu_up(unsigned int cpu) if (!cpu_enable(cpu)) return 0; - /* At boot, don't bother with non-present cpus -JSCHOPP */ - if (system_state < SYSTEM_RUNNING && !cpu_present(cpu)) - return -ENOENT; + if (smp_ops->cpu_bootable && !smp_ops->cpu_bootable(cpu)) + return -EINVAL; paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock; @@ -606,14 +605,6 @@ void __init smp_cpus_done(unsigned int m smp_ops->setup_cpu(boot_cpuid); set_cpus_allowed(current, old_mask); - - /* - * We know at boot the maximum number of cpus we can add to - * a partition and set cpu_possible_map accordingly. cpu_present_map - * needs to match for the hotplug code to allow us to hot add - * any offline cpus. 
- */ - cpu_present_map = cpu_possible_map; } #ifdef CONFIG_HOTPLUG_CPU Index: linux-2.6.11-bk5/arch/ppc64/kernel/setup.c =================================================================== --- linux-2.6.11-bk5.orig/arch/ppc64/kernel/setup.c 2005-03-09 20:30:34.000000000 +0000 +++ linux-2.6.11-bk5/arch/ppc64/kernel/setup.c 2005-03-09 20:32:55.000000000 +0000 @@ -269,15 +269,9 @@ static void __init setup_cpu_maps(void) nthreads = len / sizeof(u32); for (j = 0; j < nthreads && cpu < NR_CPUS; j++) { - /* - * Only spin up secondary threads if SMT is enabled. - * We must leave space in the logical map for the - * threads. - */ - if (j == 0 || smt_enabled_at_boot) { - cpu_set(cpu, cpu_present_map); - set_hard_smp_processor_id(cpu, intserv[j]); - } + cpu_set(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, intserv[j]); + if (intserv[j] == boot_cpuid_phys) swap_cpuid = cpu; cpu_set(cpu, cpu_possible_map); From sfr at canb.auug.org.au Tue Mar 15 14:34:12 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 15 Mar 2005 14:34:12 +1100 Subject: [PATCH] PPC64 iSeries: cleanup viopath Message-ID: <20050315143412.0c60690a.sfr@canb.auug.org.au> Hi Andrew, Since you brought this file to my attention, I figured I might as well do some simple cleanups. This patch does: - single bit int bitfields are a bit suspect and Andrew pointed out recently that they are probably slower to access than ints - get rid of some more studly caps - define the semaphore and the atomic in struct alloc_parms rather than pointers to them since we just allocate them on the stack anyway. - one small white space cleanup - use the HvLpIndexInvalid constant instead of its value Built and booted on iSeries (which is the only place it is used).
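As background for the first item, here is a small sketch of why a one-bit signed bitfield is suspect. It is an illustration, not code from viopath.c: the sketch_ names are hypothetical, and the field is declared with an explicit signed int because the signedness of a plain int bitfield is implementation-defined. A one-bit signed field has only a sign bit, so it can represent 0 and -1 but never 1, and a test such as (field == 1) is therefore never true.

```c
#include <assert.h>

/* One bit, and that bit is the sign bit: values are 0 and -1 only. */
struct sketch_status_bitfield {
	signed int is_open:1;
};

/* The replacement layout: a full int holds 1 without surprises. */
struct sketch_status_int {
	int is_open;
};

/* Store value in the bitfield and report whether it reads back as 1. */
int sketch_bitfield_reads_one(int value)
{
	struct sketch_status_bitfield s;

	s.is_open = value;	/* value 1 does not fit in the field */
	return s.is_open == 1;
}

/* Same round trip through a plain int member. */
int sketch_int_reads_one(int value)
{
	struct sketch_status_int s;

	s.is_open = value;
	return s.is_open == 1;
}

/* Both layouts still agree on zero versus non-zero. */
void sketch_bitfield_check(void)
{
	struct sketch_status_bitfield s;

	s.is_open = 0;
	assert(!s.is_open);
}
```

Code that only tests the flag for zero or non-zero works with either layout; code that compares against 1 works only with the plain int.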
Signed-off-by: Stephen Rothwell -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff -ruNp linus/arch/ppc64/kernel/viopath.c linus-cleanup.1/arch/ppc64/kernel/viopath.c --- linus/arch/ppc64/kernel/viopath.c 2005-03-13 04:07:42.000000000 +1100 +++ linus-cleanup.1/arch/ppc64/kernel/viopath.c 2005-03-15 14:02:48.000000000 +1100 @@ -42,6 +42,7 @@ #include #include +#include #include #include #include @@ -56,8 +57,8 @@ * But this allows for other support in the future. */ static struct viopathStatus { - int isOpen:1; /* Did we open the path? */ - int isActive:1; /* Do we have a mon msg outstanding */ + int isOpen; /* Did we open the path? */ + int isActive; /* Do we have a mon msg outstanding */ int users[VIO_MAX_SUBTYPES]; HvLpInstanceId mSourceInst; HvLpInstanceId mTargetInst; @@ -81,10 +82,10 @@ static void handleMonitorEvent(struct Hv * blocks on the semaphore and the handler posts the semaphore. However, * if system_state is not SYSTEM_RUNNING, then wait_atomic is used ... */ -struct doneAllocParms_t { - struct semaphore *sem; +struct alloc_parms { + struct semaphore sem; int number; - atomic_t *wait_atomic; + atomic_t wait_atomic; int used_wait_atomic; }; @@ -97,9 +98,9 @@ static u8 viomonseq = 22; /* Our hosting logical partition. We get this at startup * time, and different modules access this variable directly. */ -HvLpIndex viopath_hostLp = 0xff; /* HvLpIndexInvalid */ +HvLpIndex viopath_hostLp = HvLpIndexInvalid; EXPORT_SYMBOL(viopath_hostLp); -HvLpIndex viopath_ourLp = 0xff; +HvLpIndex viopath_ourLp = HvLpIndexInvalid; EXPORT_SYMBOL(viopath_ourLp); /* For each kind of incoming event we set a pointer to a @@ -200,7 +201,7 @@ EXPORT_SYMBOL(viopath_isactive); /* * We cache the source and target instance ids for each - * partition. + * partition. 
*/ HvLpInstanceId viopath_sourceinst(HvLpIndex lp) { @@ -450,36 +451,33 @@ static void vio_handleEvent(struct HvLpE static void viopath_donealloc(void *parm, int number) { - struct doneAllocParms_t *parmsp = (struct doneAllocParms_t *)parm; + struct alloc_parms *parmsp = parm; parmsp->number = number; if (parmsp->used_wait_atomic) - atomic_set(parmsp->wait_atomic, 0); + atomic_set(&parmsp->wait_atomic, 0); else - up(parmsp->sem); + up(&parmsp->sem); } static int allocateEvents(HvLpIndex remoteLp, int numEvents) { - struct doneAllocParms_t parms; - DECLARE_MUTEX_LOCKED(Semaphore); - atomic_t wait_atomic; + struct alloc_parms parms; if (system_state != SYSTEM_RUNNING) { parms.used_wait_atomic = 1; - atomic_set(&wait_atomic, 1); - parms.wait_atomic = &wait_atomic; + atomic_set(&parms.wait_atomic, 1); } else { parms.used_wait_atomic = 0; - parms.sem = &Semaphore; + init_MUTEX_LOCKED(&parms.sem); } mf_allocate_lp_events(remoteLp, HvLpEvent_Type_VirtualIo, 250, /* It would be nice to put a real number here! 
*/ numEvents, &viopath_donealloc, &parms); if (system_state != SYSTEM_RUNNING) { - while (atomic_read(&wait_atomic)) + while (atomic_read(&parms.wait_atomic)) mb(); } else - down(&Semaphore); + down(&parms.sem); return parms.number; } @@ -558,8 +556,7 @@ int viopath_close(HvLpIndex remoteLp, in unsigned long flags; int i; int numOpen; - struct doneAllocParms_t doneAllocParms; - DECLARE_MUTEX_LOCKED(Semaphore); + struct alloc_parms parms; if ((remoteLp >= HvMaxArchitectedLps) || (remoteLp == HvLpIndexInvalid)) return -EINVAL; @@ -580,11 +577,11 @@ int viopath_close(HvLpIndex remoteLp, in spin_unlock_irqrestore(&statuslock, flags); - doneAllocParms.used_wait_atomic = 0; - doneAllocParms.sem = &Semaphore; + parms.used_wait_atomic = 0; + init_MUTEX_LOCKED(&parms.sem); mf_deallocate_lp_events(remoteLp, HvLpEvent_Type_VirtualIo, - numReq, &viopath_donealloc, &doneAllocParms); - down(&Semaphore); + numReq, &viopath_donealloc, &parms); + down(&parms.sem); spin_lock_irqsave(&statuslock, flags); for (i = 0, numOpen = 0; i < VIO_MAX_SUBTYPES; i++) From sfr at canb.auug.org.au Tue Mar 15 15:34:46 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 15 Mar 2005 15:34:46 +1100 Subject: [PATCH] PPC64 iSeries: cleanup iSeries_setup Message-ID: <20050315153446.4404919f.sfr@canb.auug.org.au> Hi Andrew, This patch does some trivial cleanups on iSeries_setup.[ch]: - eliminate warning about iommu_init_early_iSeries not being declared - remove trailing whitespace - change some functions to static - remove defunct function declarations Built and booted on iSeries.
Signed-off-by: Stephen Rothwell -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff -ruNp linus-cleanup.1/arch/ppc64/kernel/iSeries_setup.c linus-cleanup.2/arch/ppc64/kernel/iSeries_setup.c --- linus-cleanup.1/arch/ppc64/kernel/iSeries_setup.c 2005-03-06 07:08:24.000000000 +1100 +++ linus-cleanup.2/arch/ppc64/kernel/iSeries_setup.c 2005-03-15 15:23:35.000000000 +1100 @@ -15,7 +15,7 @@ * as published by the Free Software Foundation; either version * 2 of the License, or (at your option) any later version. */ - + #undef DEBUG #include @@ -39,6 +39,7 @@ #include #include #include +#include #include #include "iSeries_setup.h" @@ -57,6 +58,7 @@ #include #include #include +#include extern void hvlog(char *fmt, ...); @@ -72,7 +74,6 @@ extern void ppcdbg_initialize(void); static void build_iSeries_Memory_Map(void); static void setup_iSeries_cache_sizes(void); static void iSeries_bolt_kernel(unsigned long saddr, unsigned long eaddr); -extern void iSeries_setup_arch(void); extern void iSeries_pci_final_fixup(void); /* Global Variables */ @@ -108,8 +109,8 @@ struct MemoryBlock { * and return the number of physical blocks and fill in the array of * block data. 
*/ -unsigned long iSeries_process_Condor_mainstore_vpd(struct MemoryBlock *mb_array, - unsigned long max_entries) +static unsigned long iSeries_process_Condor_mainstore_vpd( + struct MemoryBlock *mb_array, unsigned long max_entries) { unsigned long holeFirstChunk, holeSizeChunks; unsigned long numMemoryBlocks = 1; @@ -154,7 +155,7 @@ unsigned long iSeries_process_Condor_mai #define MaxSegmentAdrRangeBlocks 128 #define MaxAreaRangeBlocks 4 -unsigned long iSeries_process_Regatta_mainstore_vpd( +static unsigned long iSeries_process_Regatta_mainstore_vpd( struct MemoryBlock *mb_array, unsigned long max_entries) { struct IoHriMainStoreSegment5 *msVpdP = @@ -246,7 +247,7 @@ unsigned long iSeries_process_Regatta_ma printk(" Bitmap range: %016lx - %016lx\n" " Absolute range: %016lx - %016lx\n", mb_array[i].logicalStart, - mb_array[i].logicalEnd, + mb_array[i].logicalEnd, mb_array[i].absStart, mb_array[i].absEnd); mb_array[i].absStart = addr_to_chunk(mb_array[i].absStart & 0x000fffffffffffff); @@ -261,7 +262,7 @@ unsigned long iSeries_process_Regatta_ma return numSegmentBlocks; } -unsigned long iSeries_process_mainstore_vpd(struct MemoryBlock *mb_array, +static unsigned long iSeries_process_mainstore_vpd(struct MemoryBlock *mb_array, unsigned long max_entries) { unsigned long i; @@ -302,7 +303,7 @@ static void __init iSeries_parse_cmdline *p = 0; } -/*static*/ void __init iSeries_init_early(void) +static void __init iSeries_init_early(void) { DBG(" -> iSeries_init_early()\n"); @@ -355,7 +356,7 @@ static void __init iSeries_parse_cmdline #ifdef CONFIG_SMP smp_init_iSeries(); #endif - if (itLpNaca.xPirEnvironMode == 0) + if (itLpNaca.xPirEnvironMode == 0) piranha_simulator = 1; /* Associate Lp Event Queue 0 with processor 0 */ @@ -385,21 +386,21 @@ static void __init iSeries_parse_cmdline /* * The iSeries may have very large memories ( > 128 GB ) and a partition * may get memory in "chunks" that may be anywhere in the 2**52 real - * address space. The chunks are 256K in size. 
To map this to the - * memory model Linux expects, the AS/400 specific code builds a + * address space. The chunks are 256K in size. To map this to the + * memory model Linux expects, the AS/400 specific code builds a * translation table to translate what Linux thinks are "physical" - * addresses to the actual real addresses. This allows us to make + * addresses to the actual real addresses. This allows us to make * it appear to Linux that we have contiguous memory starting at * physical address zero while in fact this could be far from the truth. - * To avoid confusion, I'll let the words physical and/or real address - * apply to the Linux addresses while I'll use "absolute address" to + * To avoid confusion, I'll let the words physical and/or real address + * apply to the Linux addresses while I'll use "absolute address" to * refer to the actual hardware real address. * - * build_iSeries_Memory_Map gets information from the Hypervisor and + * build_iSeries_Memory_Map gets information from the Hypervisor and * looks at the Main Store VPD to determine the absolute addresses * of the memory that has been assigned to our partition and builds * a table used to translate Linux's physical addresses to these - * absolute addresses. Absolute addresses are needed when + * absolute addresses. Absolute addresses are needed when * communicating with the hypervisor (e.g. to build HPT entries) */ @@ -428,13 +429,13 @@ static void __init build_iSeries_Memory_ * otherwise, it might not be returned by PLIC as the first * chunks */ - + loadAreaFirstChunk = (u32)addr_to_chunk(itLpNaca.xLoadAreaAddr); loadAreaSize = itLpNaca.xLoadAreaChunks; /* - * Only add the pages already mapped here. - * Otherwise we might add the hpt pages + * Only add the pages already mapped here. 
+ * Otherwise we might add the hpt pages * The rest of the pages of the load area * aren't in the HPT yet and can still * be assigned an arbitrary physical address @@ -446,7 +447,7 @@ static void __init build_iSeries_Memory_ /* * TODO Do we need to do something if the HPT is in the 64MB load area? - * This would be required if the itLpNaca.xLoadAreaChunks includes + * This would be required if the itLpNaca.xLoadAreaChunks includes * the HPT size */ @@ -454,11 +455,11 @@ static void __init build_iSeries_Memory_ " absolute addr = %016lx\n", chunk_to_addr(loadAreaFirstChunk)); printk("Load area size %dK\n", loadAreaSize * 256); - + for (nextPhysChunk = 0; nextPhysChunk < loadAreaSize; ++nextPhysChunk) msChunks.abs[nextPhysChunk] = loadAreaFirstChunk + nextPhysChunk; - + /* * Get absolute address of our HPT and remember it so * we won't map it to any physical address @@ -475,7 +476,7 @@ static void __init build_iSeries_Memory_ num_ptegs = hptSizePages * (PAGE_SIZE / (sizeof(HPTE) * HPTES_PER_GROUP)); htab_hash_mask = num_ptegs - 1; - + /* * The actual hashed page table is in the hypervisor, * we have no direct access @@ -533,9 +534,9 @@ static void __init build_iSeries_Memory_ } /* - * main store size (in chunks) is + * main store size (in chunks) is * totalChunks - hptSizeChunks - * which should be equal to + * which should be equal to * nextPhysChunk */ systemcfg->physicalMemorySize = chunk_to_addr(nextPhysChunk); @@ -650,7 +651,7 @@ extern unsigned long ppc_tb_freq; /* * Document me. 
*/ -void __init iSeries_setup_arch(void) +static void __init iSeries_setup_arch(void) { void *eventStack; unsigned procIx = get_paca()->lppaca.dyn_hv_phys_proc_index; @@ -669,14 +670,14 @@ void __init iSeries_setup_arch(void) */ eventStack = alloc_bootmem_pages(LpEventStackSize); memset(eventStack, 0, LpEventStackSize); - + /* Invoke the hypervisor to initialize the event stack */ HvCallEvent_setLpEventStack(0, eventStack, LpEventStackSize); /* Initialize fields in our Lp Event Queue */ xItLpQueue.xSlicEventStackPtr = (char *)eventStack; xItLpQueue.xSlicCurEventPtr = (char *)eventStack; - xItLpQueue.xSlicLastValidEventPtr = (char *)eventStack + + xItLpQueue.xSlicLastValidEventPtr = (char *)eventStack + (LpEventStackSize - LpEventMaxSize); xItLpQueue.xIndex = 0; @@ -694,7 +695,7 @@ void __init iSeries_setup_arch(void) tbFreqMhzHundreths = (tbFreqHz / 10000) - (tbFreqMhz * 100); ppc_tb_freq = tbFreqHz; - printk("Max logical processors = %d\n", + printk("Max logical processors = %d\n", itVpdAreas.xSlicMaxLogicalProcs); printk("Max physical processors = %d\n", itVpdAreas.xSlicMaxPhysicalProcs); @@ -706,7 +707,7 @@ void __init iSeries_setup_arch(void) printk("Processor version = %x\n", systemcfg->processor); } -void iSeries_get_cpuinfo(struct seq_file *m) +static void iSeries_get_cpuinfo(struct seq_file *m) { seq_printf(m, "machine\t\t: 64-bit iSeries Logical Partition\n"); } @@ -715,7 +716,7 @@ void iSeries_get_cpuinfo(struct seq_file * Document me. * and Implement me. */ -int iSeries_get_irq(struct pt_regs *regs) +static int iSeries_get_irq(struct pt_regs *regs) { /* -2 means ignore this interrupt */ return -2; @@ -724,7 +725,7 @@ int iSeries_get_irq(struct pt_regs *regs /* * Document me. */ -void iSeries_restart(char *cmd) +static void iSeries_restart(char *cmd) { mf_reboot(); } @@ -732,7 +733,7 @@ void iSeries_restart(char *cmd) /* * Document me. 
*/ -void iSeries_power_off(void) +static void iSeries_power_off(void) { mf_power_off(); } @@ -740,14 +741,11 @@ void iSeries_power_off(void) /* * Document me. */ -void iSeries_halt(void) +static void iSeries_halt(void) { mf_power_off(); } -/* JDH Hack */ -unsigned long jdh_time = 0; - extern void setup_default_decr(void); /* @@ -758,17 +756,17 @@ extern void setup_default_decr(void); * and sets up the kernel timer decrementer based on that value. * */ -void __init iSeries_calibrate_decr(void) +static void __init iSeries_calibrate_decr(void) { unsigned long cyclesPerUsec; struct div_result divres; - + /* Compute decrementer (and TB) frequency in cycles/sec */ cyclesPerUsec = ppc_tb_freq / 1000000; /* * Set the amount to refresh the decrementer by. This - * is the number of decrementer ticks it takes for + * is the number of decrementer ticks it takes for * 1/HZ seconds. */ tb_ticks_per_jiffy = ppc_tb_freq / HZ; @@ -793,7 +791,7 @@ void __init iSeries_calibrate_decr(void) setup_default_decr(); } -void __init iSeries_progress(char * st, unsigned short code) +static void __init iSeries_progress(char * st, unsigned short code) { printk("Progress: [%04x] - %s\n", (unsigned)code, st); if (!piranha_simulator && mf_initialized) { @@ -825,7 +823,7 @@ static void __init iSeries_fixup_klimit( } } -int __init iSeries_src_init(void) +static int __init iSeries_src_init(void) { /* clear the progress line */ ppc_md.progress(" ", 0xffff); diff -ruNp linus-cleanup.1/arch/ppc64/kernel/iSeries_setup.h linus-cleanup.2/arch/ppc64/kernel/iSeries_setup.h --- linus-cleanup.1/arch/ppc64/kernel/iSeries_setup.h 2004-09-24 15:23:06.000000000 +1000 +++ linus-cleanup.2/arch/ppc64/kernel/iSeries_setup.h 2005-03-15 15:22:05.000000000 +1100 @@ -19,19 +19,8 @@ #ifndef __ISERIES_SETUP_H__ #define __ISERIES_SETUP_H__ -extern void iSeries_setup_arch(void); -extern void iSeries_setup_residual(struct seq_file *m, int cpu_id); -extern void iSeries_get_cpuinfo(struct seq_file *m); -extern void 
iSeries_init_IRQ(void); -extern int iSeries_get_irq(struct pt_regs *regs); -extern void iSeries_restart(char *cmd); -extern void iSeries_power_off(void); -extern void iSeries_halt(void); -extern void iSeries_time_init(void); extern void iSeries_get_boot_time(struct rtc_time *tm); extern int iSeries_set_rtc_time(struct rtc_time *tm); extern void iSeries_get_rtc_time(struct rtc_time *tm); -extern void iSeries_calibrate_decr(void); -extern void iSeries_progress( char *, unsigned short ); #endif /* __ISERIES_SETUP_H__ */

From benh at kernel.crashing.org Tue Mar 15 16:32:20 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 15 Mar 2005 16:32:20 +1100
Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery)
In-Reply-To: <20050314181420.GD498@austin.ibm.com>
References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com>
Message-ID: <1110864741.29124.68.camel@gaston>

> Is there a long-term philosophy for the Linux kernel on a question like
> this? That is, when should changes add callbacks to structures,
> as opposed to notifier-chain based events? The callback is a bit
> simpler, and maybe a tiny bit faster, but it's less flexible in the
> long run (e.g. anyone can listen for the events, but only device
> drivers can get callbacks). Comments, please?

OK, let me propose what I think is a proper API that is simple enough on the driver side; whatever complexity there is lives in the platform policy.
That should cover all the needs we discussed so far.

I think we need a callback in pci_driver, as I explained all along, with very simple semantics:

	int (*error_handler)(struct pci_dev *dev, int message);

At first, message will be one of:

1) PCIERR_ERROR_DETECTED

Error detected. This is sent once after an error has been detected. At this point, the device might not be accessible anymore, depending on the platform (the slot will be isolated on ppc64). The driver may already have "noticed" the error because of a failing IO, but this is the proper "synchronisation point": it gives the driver a chance to clean up and wait for pending work (timers, etc.) to complete. It can take semaphores, schedule, and so on; anything but touch the device. Within this function and after it returns, the driver shouldn't do any new IOs. Called in task context. This is sort of a "quiesce" point. See the note about interrupts at the end of this doc.

Result codes:
	- PCIERR_RESULT_CAN_RECOVER: Return this if you think you might be able to recover the HW by just banging IOs, or if you want to be given a chance to extract some diagnostic information (see below).
	- PCIERR_RESULT_NEED_RESET: Return this if you think you can't recover unless the slot is reset.
	- PCIERR_RESULT_DISCONNECT: Return this if you think you won't recover at all (this will detach the driver? or just leave it dangling? to be decided).

So at this point, we have called PCIERR_ERROR_DETECTED for all drivers on the segment that had the error. On ppc64, the slot is isolated. What happens now typically depends on the results from the drivers. If all drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would re-enable IOs on the slot (or do nothing special if the platform doesn't isolate slots) and call 2). If not, and we can reset slots, we go to 4). If neither, we have a dead slot. If it's a hotplug slot, we might "simulate" a reset by triggering a HW unplug/replug, though.
2) PCIERR_ERROR_RECOVER

This is the "early recovery" call. IOs are allowed again, but DMA is not (hrm... to be discussed, I prefer not), with some restrictions. This is NOT a callback for the driver to start operations again, only to peek/poke at the device, extract diagnostic information if any, and possibly do things like trigger a device-local reset, but not restart operations. This is sent if all drivers on a segment agree that they can try to recover. If the platform can't just re-enable IOs without a slot reset, it doesn't call this callback and goes directly to 4).

All IOs should be done _synchronously_ from within this callback. Errors triggered by them will be returned via the normal pci_check_whatever() API. No new PCIERR_ERROR_DETECTED callback will be issued due to an error happening here, though such an error might cause IOs to be re-blocked for the whole segment (thus invalidating the recovery of other devices on the same segment).

Result codes:
	- PCIERR_RESULT_RECOVERED: Return this if you think your device is fully functional and you are ready to start your normal driver job again. There is no guarantee that, because you returned this, you'll actually be allowed to proceed: another driver on the same segment might have failed and thus triggered a slot reset on platforms that support it.
	- PCIERR_RESULT_NEED_RESET: Return this if you think your device is not recoverable in its current state and you need a slot reset to proceed.
	- PCIERR_RESULT_DISCONNECT: Same as above. Total failure, no recovery even after reset; driver dead. (To be defined more precisely.)

3) PCIERR_ERROR_RESTART

This is called if all drivers on the segment have returned PCIERR_RESULT_RECOVERED from the previous callback. It basically tells the driver to restart activity; everything is back and running. No result code is taken into account here. If a new error happens, it will start a new error handling process.
4) PCIERR_ERROR_RESET

This is called after the slot has been reset (and the PCI BARs re-configured by the platform). As for PCIERR_ERROR_RESTART, drivers here are just supposed to re-init the hardware and restart operations. However, a driver can still return a critical failure from here in case it just can't get its device back from reset. There is just nothing we can do about it, though.

Result codes:
	- PCIERR_RESULT_DISCONNECT: Same as above.

That's it. I think this covers all the possibilities. The way those callbacks are called is platform policy. A platform with no slot reset capability, for example, may want to just "ignore" drivers that can't recover (disconnect them) and try to let other cards on the same segment recover. Keep in mind that in most real-life cases, though, there will be only one driver per segment.

Now, there is a note about interrupts. If you get an interrupt and your device is dead or has been isolated, there is a problem :) After much thinking, I decided to leave that to the platform. That is, the recovery API only specifies that:

	- There is no guarantee that interrupt delivery can proceed from any device on the segment, starting from the error detection and until the restart callback is sent, at which point interrupts are expected to be fully operational.
	- There is no guarantee that interrupt delivery is stopped. That is, a driver that gets an interrupt after detecting an error, or that detects an error within the interrupt handler such that it prevents proper ack'ing of the interrupt (and thus removal of the source), should just return IRQ_NONE (i.e. not handled). It's up to the platform to deal with that condition, typically by masking the irq source for the duration of the error handling. It is expected that the platform "knows" which interrupts are routed to error-management capable slots and can deal with temporarily disabling that irq number during error processing (this isn't terribly complex).
That means some IRQ latency for other devices sharing the interrupt, but there is simply no other way. High-end platforms aren't supposed to share interrupts between many devices anyway :)

Comments welcome. Linas, I'll have a try at coding something up in the coming days unless you beat me to it.

Ben.

From hollis at penguinppc.org Wed Mar 16 01:32:27 2005
From: hollis at penguinppc.org (Hollis Blanchard)
Date: Tue, 15 Mar 2005 08:32:27 -0600
Subject: [PATCH] PPC64 iSeries: cleanup viopath
In-Reply-To: <20050315143412.0c60690a.sfr@canb.auug.org.au>
References: <20050315143412.0c60690a.sfr@canb.auug.org.au>
Message-ID: <0961a209ce72bb9f2a01b163aa6e6fbd@penguinppc.org>

On Mar 14, 2005, at 9:34 PM, Stephen Rothwell wrote:
>
> Since you brought this file to my attention, I figured I might as well
> do some simple cleanups. This patch does:
> - single bit int bitfields are a bit suspect and Andrew pointed
>   out recently that they are probably slower to access than ints

> --- linus/arch/ppc64/kernel/viopath.c 2005-03-13 04:07:42.000000000 +1100
> +++ linus-cleanup.1/arch/ppc64/kernel/viopath.c 2005-03-15 14:02:48.000000000 +1100
> @@ -56,8 +57,8 @@
>   * But this allows for other support in the future.
>   */
>  static struct viopathStatus {
> -	int isOpen:1; /* Did we open the path? */
> -	int isActive:1; /* Do we have a mon msg outstanding */
> +	int isOpen; /* Did we open the path? */
> +	int isActive; /* Do we have a mon msg outstanding */
>  	int users[VIO_MAX_SUBTYPES];
>  	HvLpInstanceId mSourceInst;
>  	HvLpInstanceId mTargetInst;

Why not use a byte instead of a full int (reordering the members for alignment)?
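
For reference, the three layouts under discussion can be compared with a small user-space sketch. This is only an illustration under stated assumptions, not code from the patch: the struct and member names are made up, N_SUBTYPES is a hypothetical stand-in for VIO_MAX_SUBTYPES, and inst_id assumes HvLpInstanceId is a 16-bit type. Only the members visible in the quoted hunk are modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define N_SUBTYPES 8      /* hypothetical stand-in for VIO_MAX_SUBTYPES */
typedef uint16_t inst_id; /* assumption: stand-in for HvLpInstanceId */

/* Before the patch: two single-bit int bitfields sharing one storage unit. */
struct status_bitfield {
	int is_open:1;
	int is_active:1;
	int users[N_SUBTYPES];
	inst_id source_inst;
	inst_id target_inst;
};

/* After the patch: each flag occupies a full int. */
struct status_int {
	int is_open;
	int is_active;
	int users[N_SUBTYPES];
	inst_id source_inst;
	inst_id target_inst;
};

/* The suggestion: byte-sized flags, placed after the wider members. */
struct status_byte {
	int users[N_SUBTYPES];
	inst_id source_inst;
	inst_id target_inst;
	unsigned char is_open;
	unsigned char is_active;
};

/*
 * Store v in a single-bit int bitfield and read it back. Whether a plain
 * int bitfield is signed is implementation-defined; where it is signed,
 * a stored 1 typically reads back as -1, which is why such fields are
 * "a bit suspect".
 */
int bitfield_roundtrip(int v)
{
	struct status_bitfield s = { 0 };

	s.is_open = v;
	return s.is_open;
}

/* Print the size of each layout for the current compiler and ABI. */
void print_layout_sizes(void)
{
	/* Sanity check: byte flags should never cost more than int flags. */
	assert(sizeof(struct status_byte) <= sizeof(struct status_int));

	printf("bitfield flags: %zu bytes\n", sizeof(struct status_bitfield));
	printf("int flags:      %zu bytes\n", sizeof(struct status_int));
	printf("byte flags:     %zu bytes\n", sizeof(struct status_byte));
}
```

Calling print_layout_sizes() from a small main() shows the numbers for a given compiler. The sizes depend on the ABI's padding rules, so they are worth measuring rather than assuming; the sketch says nothing about access speed, which was the other half of the original argument.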
--
Hollis Blanchard
IBM Linux Technology Center

From olh at suse.de Wed Mar 16 02:12:59 2005
From: olh at suse.de (Olaf Hering)
Date: Tue, 15 Mar 2005 16:12:59 +0100
Subject: [PATCH] enable DEBUG via config option
In-Reply-To: <16948.46307.76370.206088@cargo.ozlabs.ibm.com>
References: <20050211105453.GA31718@suse.de> <16948.46307.76370.206088@cargo.ozlabs.ibm.com>
Message-ID: <20050315151259.GC22412@suse.de>

On Mon, Mar 14, Paul Mackerras wrote:

> Olaf Hering writes:
>
> > It's always boring to edit each file and turn the #undef DEBUG into
> > #define DEBUG. This patch makes it a simple config option.
> > Now the question is, how verbose will the boot be when all the printk
> > are enabled? appears to be ok so far on a p620.
>
> Having it as a config option seems to be of use only to a few kernel
> developers. Why don't you just edit the Makefile and add -DDEBUG to
> the CFLAGS when you want to do that?

This series of patches changes all DEBUG_FOO to DEBUG (except NUMA_DEBUG) and removes all the remaining #define DEBUG or #undef DEBUG.

Compile-tested on all 5 configs in arch/ppc64, with and without -DDEBUG.

ppc64-undef-debug-LPARCFG_DEBUG.patch
ppc64-undef-debug-module-DEBUGP.patch
ppc64-undef-debug-nvram-DEBUG_NVRAM.patch
ppc64-undef-debug-pmac_feature-DEBUG_FEATURE.patch
ppc64-undef-debug-prom_init-DEBUG_PROM.patch
ppc64-undef-debug-rtasd-DEBUG.patch
ppc64-undef-debug-scanlog-DEBUG.patch
ppc64-undef-debug-signal-DEBUG_SIG.patch
ppc64-undef-debug-time-DEBUG_PPC_ADJTIMEX.patch
ppc64-undef-debug-vdso-__DEBUG.patch
ppc64-undef-debug.patch

-------------- next part --------------
Index: linux-2.6.11-olh/arch/ppc64/kernel/lparcfg.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/lparcfg.c +++ linux-2.6.11-olh/arch/ppc64/kernel/lparcfg.c @@ -38,8 +38,6 @@ #define MODULE_VERS "1.6" #define MODULE_NAME "lparcfg" -/* #define LPARCFG_DEBUG */ - /* find a better place for this function...
*/ void log_plpar_hcall_return(unsigned long rc, char *tag) { @@ -274,7 +272,7 @@ static void parse_system_parameter_strin __FILE__, __FUNCTION__, __LINE__); return; } -#ifdef LPARCFG_DEBUG +#ifdef DEBUG printk(KERN_INFO "success calling get-system-parameter \n"); #endif splpar_strlen = local_buffer[0] * 16 + local_buffer[1]; @@ -328,7 +326,7 @@ static int lparcfg_count_active_processo int count = 0; while ((cpus_dn = of_find_node_by_type(cpus_dn, "cpu"))) { -#ifdef LPARCFG_DEBUG +#ifdef DEBUG printk(KERN_ERR "cpus_dn %p \n", cpus_dn); #endif count++; -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/module.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/module.c +++ linux-2.6.11-olh/arch/ppc64/kernel/module.c @@ -30,10 +30,10 @@ Using a magic allocator which places modules within 32MB solves this, and makes other things simpler. Anton? --RR. */ -#if 0 -#define DEBUGP printk +#ifdef DEBUG +#define DBG printk #else -#define DEBUGP(fmt , ...) +#define DBG(fmt , ...) #endif /* There's actually a third entry here, but it's unused */ @@ -124,8 +124,8 @@ static unsigned long get_stubs_size(cons /* Every relocated section... */ for (i = 1; i < hdr->e_shnum; i++) { if (sechdrs[i].sh_type == SHT_RELA) { - DEBUGP("Found relocations in section %u\n", i); - DEBUGP("Ptr: %p. Number: %lu\n", + DBG("Found relocations in section %u\n", i); + DBG("Ptr: %p. 
Number: %lu\n", (void *)sechdrs[i].sh_addr, sechdrs[i].sh_size / sizeof(Elf64_Rela)); relocs += count_relocs((void *)sechdrs[i].sh_addr, @@ -134,7 +134,7 @@ static unsigned long get_stubs_size(cons } } - DEBUGP("Looks like a total of %lu stubs, max\n", relocs); + DBG("Looks like a total of %lu stubs, max\n", relocs); return relocs * sizeof(struct ppc64_stub_entry); } @@ -246,7 +246,7 @@ static inline int create_stub(Elf64_Shdr me->name, (void *)reladdr, (void *)my_r2); return 0; } - DEBUGP("Stub %p get data from reladdr %li\n", entry, reladdr); + DBG("Stub %p get data from reladdr %li\n", entry, reladdr); *loc1 = PPC_HA(reladdr); *loc2 = PPC_LO(reladdr); @@ -307,7 +307,7 @@ int apply_relocate_add(Elf64_Shdr *sechd unsigned long *location; unsigned long value; - DEBUGP("Applying ADD relocate section %u to %u\n", relsec, + DBG("Applying ADD relocate section %u to %u\n", relsec, sechdrs[relsec].sh_info); for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rela); i++) { /* This is where to make the change */ @@ -317,7 +317,7 @@ int apply_relocate_add(Elf64_Shdr *sechd sym = (Elf64_Sym *)sechdrs[symindex].sh_addr + ELF64_R_SYM(rela[i].r_info); - DEBUGP("RELOC at %p: %li-type as %s (%lu) + %li\n", + DBG("RELOC at %p: %li-type as %s (%lu) + %li\n", location, (long)ELF64_R_TYPE(rela[i].r_info), strtab + sym->st_name, (unsigned long)sym->st_value, (long)rela[i].r_addend); -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/nvram.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/nvram.c +++ linux-2.6.11-olh/arch/ppc64/kernel/nvram.c @@ -33,7 +33,7 @@ #include #include -#undef DEBUG_NVRAM +#undef DEBUG static int nvram_scan_partitions(void); static int nvram_setup_partition(void); @@ -200,7 +200,7 @@ static struct miscdevice nvram_dev = { }; -#ifdef DEBUG_NVRAM +#ifdef DEBUG static void nvram_print_partitions(char * label) { struct list_head * p; @@ -591,7 +591,7 @@ static int __init 
nvram_init(void) printk(KERN_WARNING "nvram_init: Could not find nvram partition" " for nvram buffered error logging.\n"); -#ifdef DEBUG_NVRAM +#ifdef DEBUG nvram_print_partitions("NVRAM Partitions"); #endif -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_feature.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_feature.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_feature.c @@ -41,9 +41,9 @@ #include #include -#undef DEBUG_FEATURE +#undef DEBUG -#ifdef DEBUG_FEATURE +#ifdef DEBUG #define DBG(fmt...) printk(KERN_DEBUG fmt) #else #define DBG(fmt...) -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/prom_init.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/prom_init.c +++ linux-2.6.11-olh/arch/ppc64/kernel/prom_init.c @@ -15,7 +15,7 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG_PROM +#undef DEBUG #include #include @@ -106,7 +106,7 @@ extern const struct linux_logo logo_linu __asm__ __volatile__(".long " BUG_ILLEGAL_INSTR); \ } while (0) -#ifdef DEBUG_PROM +#ifdef DEBUG #define prom_debug(x...) prom_printf(x) #else #define prom_debug(x...) 
@@ -642,11 +642,11 @@ static void __init prom_init_mem(void) p = RELOC(regbuf); endp = p + (plen / sizeof(cell_t)); -#ifdef DEBUG_PROM +#ifdef DEBUG memset(path, 0, PROM_SCRATCH_SIZE); call_prom("package-to-path", 3, 1, node, path, PROM_SCRATCH_SIZE-1); prom_debug(" node %s :\n", path); -#endif /* DEBUG_PROM */ +#endif /* DEBUG */ while ((endp - p) >= (_prom->root_addr_cells + _prom->root_size_cells)) { unsigned long base, size; @@ -1514,7 +1514,7 @@ static void __init flatten_device_tree(v reserve_mem(RELOC(dt_header_start), hdr->totalsize); memcpy(rsvmap, RELOC(mem_reserve_map), sizeof(mem_reserve_map)); -#ifdef DEBUG_PROM +#ifdef DEBUG { int i; prom_printf("reserved memory map:\n"); -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/rtasd.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/rtasd.c +++ linux-2.6.11-olh/arch/ppc64/kernel/rtasd.c @@ -28,10 +28,10 @@ #include #include -#if 0 -#define DEBUG(A...) printk(KERN_ERR A) +#ifdef DEBUG +#define DBG(A...) printk(KERN_ERR A) #else -#define DEBUG(A...) +#define DBG(A...) 
#endif static DEFINE_SPINLOCK(rtasd_log_lock); @@ -194,7 +194,7 @@ void pSeries_log_error(char *buf, unsign unsigned long s; int len = 0; - DEBUG("logging event\n"); + DBG("logging event\n"); if (buf == NULL) return; @@ -369,7 +369,7 @@ static int get_eventscan_parms(void) return -1; } rtas_event_scan_rate = *ip; - DEBUG("rtas-event-scan-rate %d\n", rtas_event_scan_rate); + DBG("rtas-event-scan-rate %d\n", rtas_event_scan_rate); /* Make room for the sequence number */ rtas_error_log_max = rtas_get_error_log_max(); @@ -419,7 +419,7 @@ static int rtasd(void *unused) printk(KERN_ERR "RTAS daemon started\n"); - DEBUG("will sleep for %d jiffies\n", (HZ*60/rtas_event_scan_rate) / 2); + DBG("will sleep for %d jiffies\n", (HZ*60/rtas_event_scan_rate) / 2); /* See if we have any error stored in NVRAM */ memset(logdata, 0, rtas_error_log_max); @@ -438,9 +438,9 @@ static int rtasd(void *unused) /* First pass. */ lock_cpu_hotplug(); for_each_online_cpu(cpu) { - DEBUG("scheduling on %d\n", cpu); + DBG("scheduling on %d\n", cpu); set_cpus_allowed(current, cpumask_of_cpu(cpu)); - DEBUG("watchdog scheduled on cpu %d\n", smp_processor_id()); + DBG("watchdog scheduled on cpu %d\n", smp_processor_id()); do_event_scan(event_scan); set_current_state(TASK_INTERRUPTIBLE); @@ -449,9 +449,9 @@ static int rtasd(void *unused) unlock_cpu_hotplug(); if (surveillance_timeout != -1) { - DEBUG("enabling surveillance\n"); + DBG("enabling surveillance\n"); enable_surveillance(surveillance_timeout); - DEBUG("surveillance enabled\n"); + DBG("surveillance enabled\n"); } lock_cpu_hotplug(); -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/scanlog.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/scanlog.c +++ linux-2.6.11-olh/arch/ppc64/kernel/scanlog.c @@ -37,7 +37,7 @@ #define SCANLOG_HWERROR -1 #define SCANLOG_CONTINUE 1 -#define DEBUG(A...) 
do { if (scanlog_debug) printk(KERN_ERR "scanlog: " A); } while (0) +#define DBG(A...) do { if (scanlog_debug) printk(KERN_ERR "scanlog: " A); } while (0) static int scanlog_debug; static unsigned int ibm_scan_log_dump; /* RTAS token */ @@ -85,14 +85,14 @@ static ssize_t scanlog_read(struct file memcpy(data, rtas_data_buf, RTAS_DATA_BUF_SIZE); spin_unlock(&rtas_data_buf_lock); - DEBUG("status=%d, data[0]=%x, data[1]=%x, data[2]=%x\n", + DBG("status=%d, data[0]=%x, data[1]=%x, data[2]=%x\n", status, data[0], data[1], data[2]); switch (status) { case SCANLOG_COMPLETE: - DEBUG("hit eof\n"); + DBG("hit eof\n"); return 0; case SCANLOG_HWERROR: - DEBUG("hardware error reading scan log data\n"); + DBG("hardware error reading scan log data\n"); return -EIO; case SCANLOG_CONTINUE: /* We may or may not have data yet */ @@ -143,9 +143,9 @@ static ssize_t scanlog_write(struct file if (buf) { if (strncmp(stkbuf, "reset", 5) == 0) { - DEBUG("reset scanlog\n"); + DBG("reset scanlog\n"); status = rtas_call(ibm_scan_log_dump, 2, 1, NULL, 0, 0); - DEBUG("rtas returns %d\n", status); + DBG("rtas returns %d\n", status); } else if (strncmp(stkbuf, "debugon", 7) == 0) { printk(KERN_ERR "scanlog: debug on\n"); scanlog_debug = 1; -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/signal.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/signal.c +++ linux-2.6.11-olh/arch/ppc64/kernel/signal.c @@ -38,7 +38,7 @@ #include #include -#define DEBUG_SIG 0 +#define DEBUG 0 #define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP))) @@ -383,7 +383,7 @@ int sys_rt_sigreturn(unsigned long r3, u return regs->result; badframe: -#if DEBUG_SIG +#if DEBUG printk("badframe in sys_rt_sigreturn, regs=%p uc=%p &uc->uc_mcontext=%p\n", regs, uc, &uc->uc_mcontext); #endif @@ -465,7 +465,7 @@ static int setup_rt_frame(int signr, str return 1; badframe: -#if DEBUG_SIG +#if DEBUG printk("badframe in setup_rt_frame, 
regs=%p frame=%p newsp=%lx\n", regs, frame, newsp); #endif Index: linux-2.6.11-olh/arch/ppc64/kernel/signal32.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/signal32.c +++ linux-2.6.11-olh/arch/ppc64/kernel/signal32.c @@ -33,7 +33,7 @@ #include #include -#define DEBUG_SIG 0 +#define DEBUG 0 #define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP))) @@ -684,7 +684,7 @@ static int handle_rt_signal32(unsigned l return 1; badframe: -#if DEBUG_SIG +#if DEBUG printk("badframe in handle_rt_signal, regs=%p frame=%p newsp=%lx\n", regs, frame, newsp); #endif @@ -857,7 +857,7 @@ static int handle_signal32(unsigned long return 1; badframe: -#if DEBUG_SIG +#if DEBUG printk("badframe in handle_signal, regs=%p frame=%x newsp=%x\n", regs, frame, *newspp); #endif -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/time.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/time.c +++ linux-2.6.11-olh/arch/ppc64/kernel/time.c @@ -546,7 +546,7 @@ void __init time_init(void) * adjust the frequency. 
*/ -/* #define DEBUG_PPC_ADJTIMEX 1 */ +#undef DEBUG void ppc_adjtimex(void) { @@ -576,7 +576,7 @@ void ppc_adjtimex(void) /* If there is a single shot time adjustment in progress */ if ( time_adjust ) { -#ifdef DEBUG_PPC_ADJTIMEX +#ifdef DEBUG printk("ppc_adjtimex: "); if ( adjusting_time == 0 ) printk("starting "); @@ -599,7 +599,7 @@ void ppc_adjtimex(void) singleshot_ppm = -singleshot_ppm; } else { -#ifdef DEBUG_PPC_ADJTIMEX +#ifdef DEBUG if ( adjusting_time ) printk("ppc_adjtimex: ending single shot time_adjust\n"); #endif @@ -620,7 +620,7 @@ void ppc_adjtimex(void) new_tb_ticks_per_sec = tb_ticks_per_sec - tb_ticks_per_sec_delta; } -#ifdef DEBUG_PPC_ADJTIMEX +#ifdef DEBUG printk("ppc_adjtimex: ltemp = %ld, time_freq = %ld, singleshot_ppm = %ld\n", ltemp, time_freq, singleshot_ppm); printk("ppc_adjtimex: tb_ticks_per_sec - base = %ld new = %ld\n", tb_ticks_per_sec, new_tb_ticks_per_sec); #endif -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/vdso.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/vdso.c +++ linux-2.6.11-olh/arch/ppc64/kernel/vdso.c @@ -109,7 +109,7 @@ struct lib64_elfinfo }; -#ifdef __DEBUG +#ifdef DEBUG static void dump_one_vdso_page(struct page *pg, struct page *upg) { printk("kpg: %p (c:%d,f:%08lx)", __va(page_to_pfn(pg) << PAGE_SHIFT), -------------- next part -------------- Index: linux-2.6.11-olh/arch/ppc64/kernel/prom_init.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/prom_init.c +++ linux-2.6.11-olh/arch/ppc64/kernel/prom_init.c @@ -15,8 +15,6 @@ * 2 of the License, or (at your option) any later version. 
*/ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/eeh.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/eeh.c +++ linux-2.6.11-olh/arch/ppc64/kernel/eeh.c @@ -35,8 +35,6 @@ #include #include "pci.h" -#undef DEBUG - /** Overview: * EEH, or "Extended Error Handling" is a PCI bridge technology for * dealing with PCI bus errors that can't be dealt with within the Index: linux-2.6.11-olh/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pSeries_smp.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pSeries_smp.c @@ -12,8 +12,6 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/lmb.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/lmb.c +++ linux-2.6.11-olh/arch/ppc64/kernel/lmb.c @@ -22,8 +22,6 @@ struct lmb lmb; -#undef DEBUG - void lmb_dump_all(void) { #ifdef DEBUG Index: linux-2.6.11-olh/arch/ppc64/kernel/pSeries_setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pSeries_setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pSeries_setup.c @@ -16,8 +16,6 @@ * bootup setup stuff.. 
*/ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/idle_power4.S =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/idle_power4.S +++ linux-2.6.11-olh/arch/ppc64/kernel/idle_power4.S @@ -22,8 +22,6 @@ #include #include -#undef DEBUG - .text /* Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_smp.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_smp.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_smp.c @@ -22,8 +22,6 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/smp.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/smp.c +++ linux-2.6.11-olh/arch/ppc64/kernel/smp.c @@ -15,8 +15,6 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/nvram.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/nvram.c +++ linux-2.6.11-olh/arch/ppc64/kernel/nvram.c @@ -33,8 +33,6 @@ #include #include -#undef DEBUG - static int nvram_scan_partitions(void); static int nvram_setup_partition(void); static int nvram_create_os_partition(void); Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_feature.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_feature.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_feature.c @@ -41,8 +41,6 @@ #include #include -#undef DEBUG - #ifdef DEBUG #define DBG(fmt...) 
printk(KERN_DEBUG fmt) #else Index: linux-2.6.11-olh/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/prom.c +++ linux-2.6.11-olh/arch/ppc64/kernel/prom.c @@ -15,8 +15,6 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/setup.c @@ -10,8 +10,6 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/boot/main.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/boot/main.c +++ linux-2.6.11-olh/arch/ppc64/boot/main.c @@ -73,8 +73,6 @@ void *stdin; void *stdout; void *stderr; -#undef DEBUG - static unsigned long claim_base = PROG_START; static unsigned long try_claim(unsigned long size) Index: linux-2.6.11-olh/arch/ppc64/kernel/vdso.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/vdso.c +++ linux-2.6.11-olh/arch/ppc64/kernel/vdso.c @@ -36,8 +36,6 @@ #include #include -#undef DEBUG - #ifdef DEBUG #define DBG(fmt...) printk(fmt) #else Index: linux-2.6.11-olh/arch/ppc64/mm/hash_utils.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/mm/hash_utils.c +++ linux-2.6.11-olh/arch/ppc64/mm/hash_utils.c @@ -18,8 +18,6 @@ * 2 of the License, or (at your option) any later version. 
*/ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_low_i2c.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_low_i2c.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_low_i2c.c @@ -16,8 +16,6 @@ * properties parser */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/mpic.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/mpic.c +++ linux-2.6.11-olh/arch/ppc64/kernel/mpic.c @@ -12,8 +12,6 @@ * for more details. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/maple_time.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/maple_time.c +++ linux-2.6.11-olh/arch/ppc64/kernel/maple_time.c @@ -11,8 +11,6 @@ * */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_setup.c @@ -23,8 +23,6 @@ * bootup setup stuff.. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/pci.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pci.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pci.c @@ -11,8 +11,6 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/iSeries_setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/iSeries_setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/iSeries_setup.c @@ -16,8 +16,6 @@ * 2 of the License, or (at your option) any later version. 
*/ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_time.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_time.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_time.c @@ -32,8 +32,6 @@ #include #include -#undef DEBUG - #ifdef DEBUG #define DBG(x...) printk(x) #else Index: linux-2.6.11-olh/arch/ppc64/kernel/iSeries_smp.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/iSeries_smp.c +++ linux-2.6.11-olh/arch/ppc64/kernel/iSeries_smp.c @@ -12,8 +12,6 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_pci.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_pci.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_pci.c @@ -31,8 +31,6 @@ #include "pci.h" #include "pmac.h" -#define DEBUG - #ifdef DEBUG #define DBG(x...) 
printk(x) #else Index: linux-2.6.11-olh/arch/ppc64/kernel/ras.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/ras.c +++ linux-2.6.11-olh/arch/ppc64/kernel/ras.c @@ -74,8 +74,6 @@ static irqreturn_t ras_epow_interrupt(in static irqreturn_t ras_error_interrupt(int irq, void *dev_id, struct pt_regs * regs); -/* #define DEBUG */ - static void request_ras_irqs(struct device_node *np, char *propname, irqreturn_t (*handler)(int, void *, struct pt_regs *), const char *name) Index: linux-2.6.11-olh/arch/ppc64/kernel/maple_setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/maple_setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/maple_setup.c @@ -11,8 +11,6 @@ * */ -#define DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/pmac_nvram.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pmac_nvram.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pmac_nvram.c @@ -29,8 +29,6 @@ #include #include -#define DEBUG - #ifdef DEBUG #define DBG(x...) printk(x) #else Index: linux-2.6.11-olh/arch/ppc64/kernel/maple_pci.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/maple_pci.c +++ linux-2.6.11-olh/arch/ppc64/kernel/maple_pci.c @@ -8,8 +8,6 @@ * 2 of the License, or (at your option) any later version. 
*/ -#define DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/pSeries_lpar.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/pSeries_lpar.c +++ linux-2.6.11-olh/arch/ppc64/kernel/pSeries_lpar.c @@ -19,8 +19,6 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ -#define DEBUG - #include #include #include Index: linux-2.6.11-olh/arch/ppc64/kernel/signal.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/signal.c +++ linux-2.6.11-olh/arch/ppc64/kernel/signal.c @@ -38,8 +38,6 @@ #include #include -#define DEBUG 0 - #define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP))) #ifndef MIN Index: linux-2.6.11-olh/arch/ppc64/kernel/time.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/time.c +++ linux-2.6.11-olh/arch/ppc64/kernel/time.c @@ -546,8 +546,6 @@ void __init time_init(void) * adjust the frequency. 
*/ -#undef DEBUG - void ppc_adjtimex(void) { unsigned long den, new_tb_ticks_per_sec, tb_ticks, old_xsec, new_tb_to_xs, new_xsec, new_stamp_xsec; Index: linux-2.6.11-olh/arch/ppc64/kernel/signal32.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/signal32.c +++ linux-2.6.11-olh/arch/ppc64/kernel/signal32.c @@ -33,8 +33,6 @@ #include #include -#define DEBUG 0 - #define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP))) #define GP_REGS_SIZE32 min(sizeof(elf_gregset_t32), sizeof(struct pt_regs32)) From sfr at canb.auug.org.au Wed Mar 16 02:53:39 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Wed, 16 Mar 2005 02:53:39 +1100 Subject: [PATCH] PPC64 iSeries: cleanup viopath In-Reply-To: <0961a209ce72bb9f2a01b163aa6e6fbd@penguinppc.org> References: <20050315143412.0c60690a.sfr@canb.auug.org.au> <0961a209ce72bb9f2a01b163aa6e6fbd@penguinppc.org> Message-ID: <20050316025339.318fc246.sfr@canb.auug.org.au> On Tue, 15 Mar 2005 08:32:27 -0600 Hollis Blanchard wrote: > > On Mar 14, 2005, at 9:34 PM, Stephen Rothwell wrote: > > > > Since you brought this file to my attention, I figured I might as well > > do > > some simple cleanups. This patch does: > > - single bit int bitfields are a bit suspect and Anndrew pointed > > out recently that they are probably slower to access than ints > > > --- linus/arch/ppc64/kernel/viopath.c 2005-03-13 04:07:42.000000000 > > +1100 > > +++ linus-cleanup.1/arch/ppc64/kernel/viopath.c 2005-03-15 > > 14:02:48.000000000 +1100 > > @@ -56,8 +57,8 @@ > > * But this allows for other support in the future. > > */ > > static struct viopathStatus { > > - int isOpen:1; /* Did we open the path? */ > > - int isActive:1; /* Do we have a mon msg outstanding */ > > + int isOpen; /* Did we open the path? 
*/
> > + int isActive; /* Do we have a mon msg outstanding */
> > int users[VIO_MAX_SUBTYPES];
> > HvLpInstanceId mSourceInst;
> > HvLpInstanceId mTargetInst;
> >
> Why not use a byte instead of a full int (reordering the members for
> alignment)?

Because "classical" booleans are ints.

Because I don't know the relative speed of accessing single byte variables.

Because it was easy.

Because we only allocate 32 of these structures. Changing them really
only adds four bytes per structure. I guess using bytes and rearranging
the structure could actually save 4 bytes per structure.

I originally changed them to unsigned int single bit bitfields, but
changed my mind - would that be better?

It really makes little difference, I was just trying to get rid of the
silly signed single bit bitfields ...
-- 
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From linas at austin.ibm.com Wed Mar 16 04:43:10 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Tue, 15 Mar 2005 11:43:10 -0600
Subject: [PATCH] PPC64 iSeries: cleanup viopath
In-Reply-To: <20050316025339.318fc246.sfr@canb.auug.org.au>
References: <20050315143412.0c60690a.sfr@canb.auug.org.au> <0961a209ce72bb9f2a01b163aa6e6fbd@penguinppc.org> <20050316025339.318fc246.sfr@canb.auug.org.au>
Message-ID: <20050315174310.GH498@austin.ibm.com>

On Wed, Mar 16, 2005 at 02:53:39AM +1100, Stephen Rothwell was heard to remark:
> On Tue, 15 Mar 2005 08:32:27 -0600 Hollis Blanchard wrote:
> >
> > Why not use a byte instead of a full int (reordering the members for
> > alignment)?
>
> Because "classical" booleans are ints.
>
> Because I don't know the relative speed of accessing single byte variables.
>
> Because it was easy.
> > Because we only allocate 32 of these structures. Changing them really > only adds four bytes per structure. I guess using bytes and rearranging > the structure could actually save 4 bytes per structure. FWIW, keep in mind that a cache miss due to large structures not fitting is a zillion times more expensive than byte-aligning in the cpu (even if byte operands had a cpu perf overhead, which I don't think they do on ppc). > It really makes little difference, Yep. So my apologies for making you read this email. --linas From flar at allandria.com Wed Mar 16 04:49:30 2005 From: flar at allandria.com (Brad Boyer) Date: Tue, 15 Mar 2005 09:49:30 -0800 Subject: [PATCH] PPC64 iSeries: cleanup viopath In-Reply-To: <20050315174310.GH498@austin.ibm.com> References: <20050315143412.0c60690a.sfr@canb.auug.org.au> <0961a209ce72bb9f2a01b163aa6e6fbd@penguinppc.org> <20050316025339.318fc246.sfr@canb.auug.org.au> <20050315174310.GH498@austin.ibm.com> Message-ID: <20050315174929.GC10301@pants.nu> On Tue, Mar 15, 2005 at 11:43:10AM -0600, Linas Vepstas wrote: > FWIW, keep in mind that a cache miss due to large structures not fitting > is a zillion times more expensive than byte-aligning in the cpu > (even if byte operands had a cpu perf overhead, which I don't think > they do on ppc). Actually, there is a small overhead to bytes if you make them signed. That's why char is unsigned by default on ppc. Brad Boyer flar at allandria.com From jschopp at austin.ibm.com Wed Mar 16 05:15:17 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Tue, 15 Mar 2005 12:15:17 -0600 Subject: [PATCH][RFC] unlikely spinlocks In-Reply-To: <16949.23966.756568.902508@cargo.ozlabs.ibm.com> References: <20050302163412.0fa52c4b.moilanen@austin.ibm.com> <16949.23966.756568.902508@cargo.ozlabs.ibm.com> Message-ID: <42372635.2070705@austin.ibm.com> >>On our raw spinlocks, we currently have an attempt at the lock, and if >>we do not get it we enter a spin loop. 
This spinloop will likely >>continue for awhile, and we pridict likely. >> >>Shouldn't we predict that we will get out of the loop so our next >>instructions are already prefetched. Even when we miss because the lock >>is still held, it won't matter since we are waiting anyways. > > > Possibly the best thing is not to put a static prediction on it at > all, and let the machine's dynamic branch prediction decide which path > to predict? It is better to predict you will get out of the loop than to let the machine predict it. If we are wrong and go back into the loop we have all the time in the world and have lost nothing with the wrong prediction. We could predict wrong 5000 times in a row, get it right on try 5001, and still come out ahead. From olh at suse.de Wed Mar 16 08:02:46 2005 From: olh at suse.de (Olaf Hering) Date: Tue, 15 Mar 2005 22:02:46 +0100 Subject: [PATCH] allow xmon=on,off,early Message-ID: <20050315210246.GA24477@suse.de> allow 'xmon' or 'xmon=early' to enter xmon very early during boot. allow 'xmon=on' to just enable it, or 'xmon=off' to disable it. Signed-off-by: Olaf Hering Index: linux-2.6.11-olh/arch/ppc64/kernel/setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/setup.c @@ -1365,6 +1365,12 @@ EXPORT_SYMBOL(check_legacy_ioport); static int __init early_xmon(char *p) { /* ensure xmon is enabled */ + if (p) { + if (strncmp(p, "on", 2) == 0) + xmon_init(); + if (strncmp(p, "early", 5)) + return 0; + } xmon_init(); debugger(NULL); From olh at suse.de Wed Mar 16 08:26:56 2005 From: olh at suse.de (Olaf Hering) Date: Tue, 15 Mar 2005 22:26:56 +0100 Subject: [PATCH] CONFIG_PM for ppc64, to allow sysrq o Message-ID: <20050315212656.GA24563@suse.de> For some weird reason, sysrq o is hidden behind CONFIG_PM. Why? One can power off just fine without that. Can pm_sysrq_init be moved to a better place? 
I think it used to be in sysrq.c in 2.4. Too bad, with this patch radeonfb fails to compile. Index: linux-2.6.11-olh/arch/ppc64/Kconfig =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/Kconfig +++ linux-2.6.11-olh/arch/ppc64/Kconfig @@ -350,6 +350,8 @@ config CMDLINE endmenu +source "kernel/power/Kconfig" + source "drivers/Kconfig" source "fs/Kconfig" From linas at austin.ibm.com Wed Mar 16 08:44:13 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Tue, 15 Mar 2005 15:44:13 -0600 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) In-Reply-To: <1110864741.29124.68.camel@gaston> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com> <1110864741.29124.68.camel@gaston> Message-ID: <20050315214413.GI498@austin.ibm.com> On Tue, Mar 15, 2005 at 04:32:20PM +1100, Benjamin Herrenschmidt was heard to remark: > > Ok, let's propose what i think is a proper API and simple enough on the > driver side, ... > That > should cover all the needs we discussed so far: > > I think we need a callback in pci_driver, as I explained all along, with > a very simple semantic: > > int (*error_handler)(struct pci_dev *dev, int message); How about enum instead of int? that allows static type checking by the compiler. > 1) PCIERR_ERROR_DETECTED Elsewhere in the kernel, enums seem to be lowercase ... > Comments welcome. Linas, I'll give a try at coding something up in the > upcoming days unless you beat me to it. Looks good to me. This is a minor tweak on what I currently have, so I'll take a shot at it. Unfortunately I just lost my regular devel machine. And still haven't debugged the symbios recovery. 
It might take a few days, but what you wrote looks eminently workable. Also, I have not completely read (or understood) what Long Nguyen just sent in... and haven't heard him make any remarks. Sounds to me like his AER patch is a pcie-specific version of what we are talking about. It would be nice to hear for Long about his thoughts on this. --linas From moilanen at austin.ibm.com Wed Mar 16 08:51:35 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Tue, 15 Mar 2005 15:51:35 -0600 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <16950.3484.416343.832453@cargo.ozlabs.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> <16949.25552.640180.677985@cargo.ozlabs.ibm.com> <20050314155125.68dcff70.moilanen@austin.ibm.com> <16950.3484.416343.832453@cargo.ozlabs.ibm.com> Message-ID: <20050315155135.11b942ef.moilanen@austin.ibm.com> On Tue, 15 Mar 2005 09:18:04 +1100 Paul Mackerras wrote: > Jake Moilanen writes: > > > > I don't think I can push that upstream. What happens if you leave > > > that out? > > > > The bss and the plt are in the same segment, and plt obviously needs to > > be executable. > > Yes... what I was asking was "do things actually break if you leave > that out, or does the binfmt_elf loader honour the 'x' permission on > the PT_LOAD entry for the data/bss region, meaning that it all just > works anyway?" It does not work w/o the sys_mprotect. It will hang in one of the first few binaries. I believe the problem is that the last PT_LOAD entry does not have the correct size, and we only mmap up to the sbss. The .sbss, .plt, and .bss do not get mmapped with the section. Here is /bin/bash on SLES 9: Section Headers: [Nr] Name Type Addr Off Size ES Flg Lk Inf Al [ 0] NULL 00000000 000000 000000 00 0 0 0 [ 1] .interp PROGBITS 10000174 000174 00000d 00 A 0 0 1 ... ... 
  [19] .data             PROGBITS        1008ca80 07ca80 001b34 00  WA  0   0  4
  [20] .eh_frame         PROGBITS        1008e5b4 07e5b4 0000b4 00   A  0   0  4
  [21] .got2             PROGBITS        1008e668 07e668 000010 00  WA  0   0  1
  [22] .dynamic          DYNAMIC         1008e678 07e678 0000e8 08  WA  6   0  4
  [23] .ctors            PROGBITS        1008e760 07e760 000008 00  WA  0   0  4
  [24] .dtors            PROGBITS        1008e768 07e768 000008 00  WA  0   0  4
  [25] .jcr              PROGBITS        1008e770 07e770 000004 00  WA  0   0  4
  [26] .got              PROGBITS        1008e774 07e774 000014 04 WAX  0   0  4
  [27] .sdata            PROGBITS        1008e788 07e788 0000d4 00  WA  0   0  4
  [28] .sbss             NOBITS          1008e860 07e860 000704 00  WA  0   0  8
  [29] .plt              NOBITS          1008ef64 07e860 000aa4 00 WAX  0   0  4
  [30] .bss              NOBITS          1008fa10 07e868 0062f0 00  WA  0   0 16

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  PHDR           0x000034 0x10000034 0x10000034 0x00120 0x00120 R E 0x4
  INTERP         0x000174 0x10000174 0x10000174 0x0000d 0x0000d R   0x1
      [Requesting program interpreter: /lib/ld.so.1]
  LOAD           0x000000 0x10000000 0x10000000 0x7ca80 0x7ca80 R E 0x10000
  LOAD           0x07ca80 0x1008ca80 0x1008ca80 0x01ddc 0x09280 RWE 0x10000
  DYNAMIC        0x07e678 0x1008e678 0x1008e678 0x000e8 0x000e8 RW  0x4
  NOTE           0x000184 0x10000184 0x10000184 0x00020 0x00020 R   0x4
  NOTE           0x0001a4 0x100001a4 0x100001a4 0x00018 0x00018 R   0x4
  GNU_EH_FRAME   0x07ca54 0x1007ca54 0x1007ca54 0x0002c 0x0002c R   0x4
  STACK          0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.ABI-tag .note.SuSE .hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .text text.unlikely text.hot .fini .rodata .eh_frame_hdr
   03     .data .eh_frame .got2 .dynamic .ctors .dtors .jcr .got .sdata .sbss .plt .bss
   04     .dynamic
   05     .note.ABI-tag
   06     .note.SuSE
   07     .eh_frame_hdr
   08

In the program headers section, the FileSiz for the last PT_LOAD is
0x1ddc. If we go back to the Section Headers and look at .data it is at
0x1008ca80. So the segment should end at 0x1008e85c. We round up for
alignment and we get 0x1008e860 or .sbss.
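
As a quick check of the arithmetic above, here is a minimal C sketch. The helper names (file_backed_end, align_up) are made up for illustration and are not kernel functions, and the 8-byte alignment is an assumption taken from the Al column of .sbss, not a claim about what the ELF loader actually uses.

```c
#include <stdint.h>

/* Hypothetical helpers, named here for illustration only. */

/* First address past the file-backed part of a PT_LOAD segment. */
uint64_t file_backed_end(uint64_t vaddr, uint64_t filesz)
{
	return vaddr + filesz;
}

/* Round addr up to a power-of-two alignment. */
uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}
```

With the values from the last PT_LOAD entry, file_backed_end(0x1008ca80, 0x01ddc) gives 0x1008e85c, and align_up of that to 8 bytes gives 0x1008e860, the start of .sbss.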
The sbss, plt, and bss are not mmapped. So the sys_mprotect is used to pick it up. Did I miss something to explain this? Can you think of another way to fix it? Jake From olh at suse.de Wed Mar 16 09:03:32 2005 From: olh at suse.de (Olaf Hering) Date: Tue, 15 Mar 2005 23:03:32 +0100 Subject: [PATCH] CONFIG_PM for ppc64, to allow sysrq o In-Reply-To: <20050315212656.GA24563@suse.de> References: <20050315212656.GA24563@suse.de> Message-ID: <20050315220332.GA24708@suse.de> On Tue, Mar 15, Olaf Hering wrote: > > For some weird reason, sysrq o is hidden behind CONFIG_PM. > Why? One can power off just fine without that. Can pm_sysrq_init be > moved to a better place? I think it used to be in sysrq.c in 2.4. > > Too bad, with this patch radeonfb fails to compile. After disabling radeon and this additional change, sysrq o powers off. Just dont type too fast over hvc (ctrl o o) ;) Index: linux-2.6.11-olh/arch/ppc64/kernel/setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/setup.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include @@ -710,6 +711,8 @@ void machine_halt(void) EXPORT_SYMBOL(machine_halt); +void (*pm_power_off)(void) = machine_power_off; + unsigned long ppc_proc_freq; unsigned long ppc_tb_freq; From tom.l.nguyen at intel.com Wed Mar 16 09:14:57 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Tue, 15 Mar 2005 14:14:57 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) Message-ID: On Tuesday, March 15, 2005 1:44 PM Linas Vepstas wrote: > Also, I have not completely read (or understood) what Long Nguyen > just sent in... and haven't heard him make any remarks. Sounds > to me like his AER patch is a pcie-specific version of what we are > talking about. It would be nice to hear for Long about his thoughts on > this. I apologize for taking it too long to respond. 
To give you some PCI Express AER context in short, general terms: a PCI
Express component that detects an error sends an error message to the
Root Port. The Root Port processes this error message internally and
generates an interrupt signal at the Root Port. The AER driver's ISR,
which services this interrupt signal, determines the error based on the
error information logged by the Root Port device in the Root Error
Status Register and the Error Source Identification Registers. Once the
error is identified, the AER driver uses the AER aware callback handle
as defined below:

struct pcie_aer_handle {
	/*
	 * Notify the PCI Express device driver of an error sent by its device.
	 * Also, obtain error information from the driver to identify what
	 * error type and what severity.
	 */
	int (*notify) (unsigned short requestor_id, union aer_error *error);

	/*
	 * Obtain TLP header log, which may be logged along with certain
	 * uncorrectable error.
	 */
	int (*get_header) (unsigned short requestor_id, union aer_error *error,
		struct header_log_regs *log);

	/*
	 * Notify the driver to abort any existing transactions, prepare for
	 * uncorrectable fatal error recovery. This occurs only if the PCI
	 * Express Port implements a link reset in its hardware.
	 */
	int (*link_rec_prepare) (unsigned short requestor_id);

	/*
	 * Notify the driver when link reset is completed and active.
	 */
	int (*link_rec_restart) (unsigned short requestor_id);

	/*
	 * Notify the driver performing link reset. This occurs only if this PCI
	 * Express Port implements a link reset in its hardware.
	 */
	int (*link_reset) (unsigned short requestor_id);
};

to coordinate with the PCI Express AER aware driver to determine more
precisely the error type and severity, so that the PCI Express AER Root
driver can log and report the error to the user. If the error type is
uncorrectable and the error severity is fatal, the hardware link is no
longer reliable. Returning the link to a reliable state requires an
implementation of link reset on the PCI Express port.
If this condition is met, the PCI Express AER driver does a link reset.
Otherwise, PCI Express AER Root driver assumes users own an error
policy.

However, we acknowledge that LKML inputs prefer a generic interface for
error handling. I like the current proposal of three callback functions
in pci_driver.

	void (*frozen) (struct pci_dev *);	/* called when dev is first frozen */
	void (*thawed) (struct pci_dev *);	/* called after card is reset */
	void (*perm_failure) (struct pci_dev *);	/* called if card is dead */

I can use frozen to replace link_rec_prepare and thawed to replace
link_rec_restart. In addition, I would prefer to add two other functions
as below:

	int (*notify) (struct pci_dev *, union error_src *);	/* notify driver of correctable/uncorrectable error occurred on its device */
	int (*reset) (struct pci_dev *);	/* called to reset a downstream bus(s) if PCI Express Port supports link reset */

with the error_src data structure defined as below:

union error_src {
	unsigned int type;	/* AER_CORRECTABLE|AER_UNCORRECTABLE */
	struct {
		unsigned int type;	/* AER_CORRECTABLE|AER_NONFATAL|AER_FATAL */
		unsigned int flags;	/* TLP log valid and whether reset is supported or not */
		unsigned int status;	/* Particular Error Status */
		struct header_log_regs *log;	/* PCI Express TLP Header log */
	} pcie_aer;
};

Please let us know what you think?
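
For illustration only, a rough user-space sketch of how a driver might react to the three callbacks quoted above (frozen, thawed, perm_failure). Nothing here is an existing kernel interface: struct pci_dev is a stand-in, and struct err_callbacks, the mydrv_* functions and the state enum are hypothetical names. The state transitions are an assumption about sensible driver behaviour, not something the proposal specifies.

```c
/* Hypothetical driver states; not part of the proposal. */
enum mydrv_state { MYDRV_RUNNING, MYDRV_FROZEN, MYDRV_DEAD };

/* Stand-in for the kernel's struct pci_dev; only what the sketch needs. */
struct pci_dev {
	const char *name;
	int state;	/* one of enum mydrv_state */
};

/* The three callbacks as proposed in the thread (not a merged interface). */
struct err_callbacks {
	void (*frozen)(struct pci_dev *);
	void (*thawed)(struct pci_dev *);
	void (*perm_failure)(struct pci_dev *);
};

void mydrv_frozen(struct pci_dev *dev)
{
	/* Device is isolated: stop issuing I/O until told otherwise. */
	dev->state = MYDRV_FROZEN;
}

void mydrv_thawed(struct pci_dev *dev)
{
	/* Device was reset: resume, but only if we were merely frozen. */
	if (dev->state == MYDRV_FROZEN)
		dev->state = MYDRV_RUNNING;
}

void mydrv_perm_failure(struct pci_dev *dev)
{
	/* Recovery failed: give up on the device for good. */
	dev->state = MYDRV_DEAD;
}

const struct err_callbacks mydrv_err_callbacks = {
	.frozen = mydrv_frozen,
	.thawed = mydrv_thawed,
	.perm_failure = mydrv_perm_failure,
};
```

In this sketch a thawed call after perm_failure is deliberately ignored, so a dead device stays dead.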
Thanks, Long From amodra at bigpond.net.au Wed Mar 16 09:48:36 2005 From: amodra at bigpond.net.au (Alan Modra) Date: Wed, 16 Mar 2005 09:18:36 +1030 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <20050315155135.11b942ef.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> <16949.25552.640180.677985@cargo.ozlabs.ibm.com> <20050314155125.68dcff70.moilanen@austin.ibm.com> <16950.3484.416343.832453@cargo.ozlabs.ibm.com> <20050315155135.11b942ef.moilanen@austin.ibm.com> Message-ID: <20050315224836.GD21148@bubble.modra.org> On Tue, Mar 15, 2005 at 03:51:35PM -0600, Jake Moilanen wrote: > I believe the problem is that the last PT_LOAD entry does not have the > correct size, and we only mmap up to the sbss. The .sbss, .plt, and > .bss do not get mmapped with the section. Huh? .sbss, .plt and .bss have no file contents, so of course p_filesz doesn't cover them. 
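
To make the FileSiz/MemSiz distinction concrete, here is a minimal sketch using the last PT_LOAD entry from the /bin/bash dump quoted earlier in the thread. The struct and function names (load_segment, zero_fill_size, in_zero_fill_tail) are hypothetical and simplified; they are not the kernel's or libelf's types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one PT_LOAD program header. */
struct load_segment {
	uint64_t vaddr;		/* p_vaddr */
	uint64_t filesz;	/* p_filesz: bytes present in the file */
	uint64_t memsz;		/* p_memsz: bytes occupied in memory */
};

/* Size of the tail that has no file contents and is zero-filled. */
uint64_t zero_fill_size(const struct load_segment *seg)
{
	return seg->memsz - seg->filesz;
}

/* True if [addr, addr + size) lies in the segment but past its file contents. */
bool in_zero_fill_tail(const struct load_segment *seg,
		       uint64_t addr, uint64_t size)
{
	uint64_t file_end = seg->vaddr + seg->filesz;
	uint64_t mem_end = seg->vaddr + seg->memsz;

	return addr >= file_end && addr + size <= mem_end;
}

/* The last PT_LOAD entry from the /bin/bash dump. */
const struct load_segment bash_data_segment = {
	0x1008ca80, 0x01ddc, 0x09280
};
```

With these numbers, .sbss, .plt and .bss all fall in the zero-filled tail between VirtAddr + FileSiz and VirtAddr + MemSiz, while .sdata is still within the file-backed part.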
-- Alan Modra IBM OzLabs - Linux Technology Centre From moilanen at austin.ibm.com Wed Mar 16 10:17:04 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Tue, 15 Mar 2005 17:17:04 -0600 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <20050315224836.GD21148@bubble.modra.org> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> <16949.25552.640180.677985@cargo.ozlabs.ibm.com> <20050314155125.68dcff70.moilanen@austin.ibm.com> <16950.3484.416343.832453@cargo.ozlabs.ibm.com> <20050315155135.11b942ef.moilanen@austin.ibm.com> <20050315224836.GD21148@bubble.modra.org> Message-ID: <20050315171704.08f3057a.moilanen@austin.ibm.com> On Wed, 16 Mar 2005 09:18:36 +1030 Alan Modra wrote: > On Tue, Mar 15, 2005 at 03:51:35PM -0600, Jake Moilanen wrote: > > I believe the problem is that the last PT_LOAD entry does not have the > > correct size, and we only mmap up to the sbss. The .sbss, .plt, and > > .bss do not get mmapped with the section. > > Huh? .sbss, .plt and .bss have no file contents, so of course p_filesz > doesn't cover them. You're right, those shouldn't be mmapped. set_brk() is called on sbss, plt and bss. There needs to be some method to set execute permission on those pieces as well. Currently it has no concept of what permission should be set. Jake From benh at kernel.crashing.org Wed Mar 16 10:27:48 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 16 Mar 2005 10:27:48 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) In-Reply-To: References: Message-ID: <1110929268.25201.31.camel@gaston> On Tue, 2005-03-15 at 14:14 -0800, Nguyen, Tom L wrote: > However, we acknowledge that LKML inputs prefer a generic interface for > error handling. I like the current proposal of three callback functions > in pci_driver. 
> > void (*frozen) (struct pci_dev *); /* called when dev is first > frozen */ > void (*thawed) (struct pci_dev *); /* called after card is reset */ > void (*perm_failure) (struct pci_dev *); /* called if card is > dead */ > > I can use frozen to replace link_rec_prepare and thawed to replace > link_rec_restart. In addition, I prefer to add two other functions as > below: Please, look at my mail describing a different interface. I think your model can fit. However, one thing you need to do is what I call "synchronous error detection" as well. That is, you need a way to implement the proposed clear_errors() do IOs check_errors() (Sorry, I don't have the exact terminology proposed by Seto in mind) API. The callback to the driver would happen a bit later on as it would be delayed to a work queue, the above allows more immediate bail out from various code path on error. My proposed callback API doesn't provide error details. I think that we want to define a specific opaque format for this, to be returned by the above check_errors(), and eventually by an equivalent that can be called from the callbacks as well. Ben. From benh at kernel.crashing.org Wed Mar 16 11:04:44 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 16 Mar 2005 11:04:44 +1100 Subject: [PATCH] CONFIG_PM for ppc64, to allow sysrq o In-Reply-To: <20050315212656.GA24563@suse.de> References: <20050315212656.GA24563@suse.de> Message-ID: <1110931484.25201.60.camel@gaston> On Tue, 2005-03-15 at 22:26 +0100, Olaf Hering wrote: > For some weird reason, sysrq o is hidden behind CONFIG_PM. > Why? One can power off just fine without that. Can pm_sysrq_init be > moved to a better place? I think it used to be in sysrq.c in 2.4. > > Too bad, with this patch radeonfb fails to compile. Hehe :) ppc64 isn't yet ready for CONFIG_PM, though I have some hacks-in-progress ... Ben. 
From paulus at samba.org Wed Mar 16 14:14:04 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 16 Mar 2005 14:14:04 +1100 Subject: [PATCH] allow xmon=on,off,early In-Reply-To: <20050315210246.GA24477@suse.de> References: <20050315210246.GA24477@suse.de> Message-ID: <16951.42108.932360.666387@cargo.ozlabs.ibm.com> Olaf Hering writes: > allow 'xmon' or 'xmon=early' to enter xmon very early during boot. > allow 'xmon=on' to just enable it, or 'xmon=off' to disable it. > > Signed-off-by: Olaf Hering > > Index: linux-2.6.11-olh/arch/ppc64/kernel/setup.c > =================================================================== > --- linux-2.6.11-olh.orig/arch/ppc64/kernel/setup.c > +++ linux-2.6.11-olh/arch/ppc64/kernel/setup.c > @@ -1365,6 +1365,12 @@ EXPORT_SYMBOL(check_legacy_ioport); > static int __init early_xmon(char *p) > { > /* ensure xmon is enabled */ > + if (p) { > + if (strncmp(p, "on", 2) == 0) > + xmon_init(); > + if (strncmp(p, "early", 5)) > + return 0; > + } Where does this handle xmon=off? Paul. From paulus at samba.org Wed Mar 16 17:10:57 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 16 Mar 2005 17:10:57 +1100 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <20050315155135.11b942ef.moilanen@austin.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> <16949.25552.640180.677985@cargo.ozlabs.ibm.com> <20050314155125.68dcff70.moilanen@austin.ibm.com> <16950.3484.416343.832453@cargo.ozlabs.ibm.com> <20050315155135.11b942ef.moilanen@austin.ibm.com> Message-ID: <16951.52721.139394.592636@cargo.ozlabs.ibm.com> Jake Moilanen writes: > It does not work w/o the sys_mprotect. It will hang in one of the first > few binaries. Hmmm, what distro is this with? 
I just tried a kernel with the patch below on a SLES9 install and a Debian install and it came up and ran just fine in both cases. Paul. diff -urN linux-2.5/arch/ppc64/kernel/head.S test/arch/ppc64/kernel/head.S --- linux-2.5/arch/ppc64/kernel/head.S 2005-03-07 10:46:38.000000000 +1100 +++ test/arch/ppc64/kernel/head.S 2005-03-15 17:14:44.000000000 +1100 @@ -950,11 +950,12 @@ * accessing a userspace segment (even from the kernel). We assume * kernel addresses always have the high bit set. */ - rlwinm r4,r4,32-23,29,29 /* DSISR_STORE -> _PAGE_RW */ + rlwinm r4,r4,32-25+9,31-9,31-9 /* DSISR_STORE -> _PAGE_RW */ rotldi r0,r3,15 /* Move high bit into MSR_PR posn */ orc r0,r12,r0 /* MSR_PR | ~high_bit */ rlwimi r4,r0,32-13,30,30 /* becomes _PAGE_USER access bit */ ori r4,r4,1 /* add _PAGE_PRESENT */ + rlwimi r4,r5,22+2,31-2,31-2 /* Set _PAGE_EXEC if trap is 0x400 */ /* * On iSeries, we soft-disable interrupts here, then diff -urN linux-2.5/arch/ppc64/kernel/iSeries_htab.c test/arch/ppc64/kernel/iSeries_htab.c --- linux-2.5/arch/ppc64/kernel/iSeries_htab.c 2004-09-21 17:22:33.000000000 +1000 +++ test/arch/ppc64/kernel/iSeries_htab.c 2005-03-15 17:15:36.000000000 +1100 @@ -144,6 +144,10 @@ HvCallHpt_get(&hpte, slot); if ((hpte.dw0.dw0.avpn == avpn) && (hpte.dw0.dw0.v)) { + /* + * Hypervisor expects bits as NPPP, which is + * different from how they are mapped in our PP. 
+ */ HvCallHpt_setPp(slot, (newpp & 0x3) | ((newpp & 0x4) << 1)); iSeries_hunlock(slot); return 0; diff -urN linux-2.5/arch/ppc64/kernel/iSeries_setup.c test/arch/ppc64/kernel/iSeries_setup.c --- linux-2.5/arch/ppc64/kernel/iSeries_setup.c 2005-03-07 10:46:38.000000000 +1100 +++ test/arch/ppc64/kernel/iSeries_setup.c 2005-03-15 16:55:05.000000000 +1100 @@ -633,6 +633,10 @@ unsigned long vpn = va >> PAGE_SHIFT; unsigned long slot = HvCallHpt_findValid(&hpte, vpn); + /* Make non-kernel text non-executable */ + if (!in_kernel_text(ea)) + mode_rw |= HW_NO_EXEC; + if (hpte.dw0.dw0.v) { /* HPTE exists, so just bolt it */ HvCallHpt_setSwBits(slot, 0x10, 0); diff -urN linux-2.5/arch/ppc64/kernel/module.c test/arch/ppc64/kernel/module.c --- linux-2.5/arch/ppc64/kernel/module.c 2004-05-10 21:25:58.000000000 +1000 +++ test/arch/ppc64/kernel/module.c 2005-03-15 16:55:05.000000000 +1100 @@ -102,7 +102,8 @@ { if (size == 0) return NULL; - return vmalloc(size); + + return vmalloc_exec(size); } /* Free memory returned from module_alloc */ diff -urN linux-2.5/arch/ppc64/kernel/pSeries_lpar.c test/arch/ppc64/kernel/pSeries_lpar.c --- linux-2.5/arch/ppc64/kernel/pSeries_lpar.c 2005-03-07 10:46:38.000000000 +1100 +++ test/arch/ppc64/kernel/pSeries_lpar.c 2005-03-15 16:55:02.000000000 +1100 @@ -470,7 +470,7 @@ slot = pSeries_lpar_hpte_find(vpn); BUG_ON(slot == -1); - flags = newpp & 3; + flags = newpp & 7; lpar_rc = plpar_pte_protect(flags, slot, 0); BUG_ON(lpar_rc != H_Success); diff -urN linux-2.5/arch/ppc64/mm/fault.c test/arch/ppc64/mm/fault.c --- linux-2.5/arch/ppc64/mm/fault.c 2005-01-04 10:49:20.000000000 +1100 +++ test/arch/ppc64/mm/fault.c 2005-03-15 17:13:05.000000000 +1100 @@ -91,8 +91,9 @@ struct mm_struct *mm = current->mm; siginfo_t info; unsigned long code = SEGV_MAPERR; - unsigned long is_write = error_code & 0x02000000; + unsigned long is_write = error_code & DSISR_ISSTORE; unsigned long trap = TRAP(regs); + unsigned long is_exec = trap == 0x400; BUG_ON((trap == 0x380) 
|| (trap == 0x480)); @@ -109,7 +110,7 @@ if (!user_mode(regs) && (address >= TASK_SIZE)) return SIGSEGV; - if (error_code & 0x00400000) { + if (error_code & DSISR_DABRMATCH) { if (notify_die(DIE_DABR_MATCH, "dabr_match", regs, error_code, 11, SIGSEGV) == NOTIFY_STOP) return 0; @@ -199,16 +200,19 @@ good_area: code = SEGV_ACCERR; + if (is_exec) { + /* protection fault */ + if (error_code & DSISR_PROTFAULT) + goto bad_area; + if (!(vma->vm_flags & VM_EXEC)) + goto bad_area; /* a write */ - if (is_write) { + } else if (is_write) { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; /* a read */ } else { - /* protection fault */ - if (error_code & 0x08000000) - goto bad_area; - if (!(vma->vm_flags & (VM_READ | VM_EXEC))) + if (!(vma->vm_flags & VM_READ)) goto bad_area; } @@ -251,6 +255,12 @@ return 0; } + if (trap == 0x400 && (error_code & DSISR_PROTFAULT) + && printk_ratelimit()) + printk(KERN_CRIT "kernel tried to execute NX-protected" + " page (%lx) - exploit attempt? (uid: %d)\n", + address, current->uid); + return SIGSEGV; /* diff -urN linux-2.5/arch/ppc64/mm/hash_low.S test/arch/ppc64/mm/hash_low.S --- linux-2.5/arch/ppc64/mm/hash_low.S 2005-01-05 13:48:02.000000000 +1100 +++ test/arch/ppc64/mm/hash_low.S 2005-03-15 16:55:02.000000000 +1100 @@ -89,7 +89,7 @@ /* Prepare new PTE value (turn access RW into DIRTY, then * add BUSY,HASHPTE and ACCESSED) */ - rlwinm r30,r4,5,24,24 /* _PAGE_RW -> _PAGE_DIRTY */ + rlwinm r30,r4,32-9+7,31-7,31-7 /* _PAGE_RW -> _PAGE_DIRTY */ or r30,r30,r31 ori r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE /* Write the linux PTE atomically (setting busy) */ @@ -112,11 +112,11 @@ rldicl r5,r5,0,25 /* vsid & 0x0000007fffffffff */ rldicl r0,r3,64-12,48 /* (ea >> 12) & 0xffff */ xor r28,r5,r0 - - /* Convert linux PTE bits into HW equivalents - */ - andi. r3,r30,0x1fa /* Get basic set of flags */ - rlwinm r0,r30,32-2+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ + + /* Convert linux PTE bits into HW equivalents */ + andi. 
r3,r30,0x1fe /* Get basic set of flags */ + xori r3,r3,HW_NO_EXEC /* _PAGE_EXEC -> NOEXEC */ + rlwinm r0,r30,32-9+1,30,30 /* _PAGE_RW -> _PAGE_USER (r0) */ rlwinm r4,r30,32-7+1,30,30 /* _PAGE_DIRTY -> _PAGE_USER (r4) */ and r0,r0,r4 /* _PAGE_RW & _PAGE_DIRTY -> r0 bit 30 */ andc r0,r30,r0 /* r0 = pte & ~r0 */ diff -urN linux-2.5/arch/ppc64/mm/hash_utils.c test/arch/ppc64/mm/hash_utils.c --- linux-2.5/arch/ppc64/mm/hash_utils.c 2005-03-07 10:46:38.000000000 +1100 +++ test/arch/ppc64/mm/hash_utils.c 2005-03-15 17:20:35.000000000 +1100 @@ -51,6 +51,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) @@ -95,6 +96,7 @@ { unsigned long addr; unsigned int step; + unsigned long tmp_mode; if (large) step = 16*MB; @@ -112,6 +114,13 @@ else vpn = va >> PAGE_SHIFT; + + tmp_mode = mode; + + /* Make non-kernel text non-executable */ + if (!in_kernel_text(addr)) + tmp_mode = mode | HW_NO_EXEC; + hash = hpt_hash(vpn, large); hpteg = ((hash & htab_hash_mask) * HPTES_PER_GROUP); @@ -120,12 +129,12 @@ if (systemcfg->platform & PLATFORM_LPAR) ret = pSeries_lpar_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); else #endif /* CONFIG_PPC_PSERIES */ ret = native_hpte_insert(hpteg, va, virt_to_abs(addr) >> PAGE_SHIFT, - 0, mode, 1, large); + 0, tmp_mode, 1, large); if (ret == -1) { ppc64_terminate_msg(0x20, "create_pte_mapping"); @@ -238,8 +247,6 @@ { struct page *page; -#define PPC64_HWNOEXEC (1 << 2) - if (!pfn_valid(pte_pfn(pte))) return pp; @@ -251,7 +258,7 @@ __flush_dcache_icache(page_address(page)); set_bit(PG_arch_1, &page->flags); } else - pp |= PPC64_HWNOEXEC; + pp |= HW_NO_EXEC; } return pp; } diff -urN linux-2.5/arch/ppc64/mm/hugetlbpage.c test/arch/ppc64/mm/hugetlbpage.c --- linux-2.5/arch/ppc64/mm/hugetlbpage.c 2005-03-07 14:01:43.000000000 +1100 +++ test/arch/ppc64/mm/hugetlbpage.c 2005-03-15 17:27:33.000000000 +1100 @@ -782,7 +782,6 @@ { pte_t *ptep; unsigned long va, vpn; - int 
is_write; pte_t old_pte, new_pte; unsigned long hpteflags, prpn; long slot; @@ -809,8 +808,7 @@ * Check the user's access rights to the page. If access should be * prevented then send the problem up to do_page_fault. */ - is_write = access & _PAGE_RW; - if (unlikely(is_write && !(pte_val(*ptep) & _PAGE_RW))) + if (unlikely(access & ~pte_val(*ptep))) goto out; /* * At this point, we have a pte (old_pte) which can be used to build @@ -829,6 +827,8 @@ new_pte = old_pte; hpteflags = 0x2 | (! (pte_val(new_pte) & _PAGE_RW)); + /* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */ + hpteflags |= ((pte_val(new_pte) & _PAGE_EXEC) ? 0 : HW_NO_EXEC); /* Check if pte already has an hpte (case 2) */ if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) { diff -urN linux-2.5/include/asm-ppc64/elf.h test/include/asm-ppc64/elf.h --- linux-2.5/include/asm-ppc64/elf.h 2005-03-07 10:46:39.000000000 +1100 +++ test/include/asm-ppc64/elf.h 2005-03-15 16:55:02.000000000 +1100 @@ -226,6 +226,13 @@ else if (current->personality != PER_LINUX32) \ set_personality(PER_LINUX); \ } while (0) + +/* + * An executable for which elf_read_implies_exec() returns TRUE will + * have the READ_IMPLIES_EXEC personality flag set automatically. 
+ */ +#define elf_read_implies_exec(ex, have_pt_gnu_stack) (!(have_pt_gnu_stack)) + #endif /* diff -urN linux-2.5/include/asm-ppc64/page.h test/include/asm-ppc64/page.h --- linux-2.5/include/asm-ppc64/page.h 2005-03-07 10:46:39.000000000 +1100 +++ test/include/asm-ppc64/page.h 2005-03-15 16:55:02.000000000 +1100 @@ -235,8 +235,25 @@ #define virt_addr_valid(kaddr) pfn_valid(__pa(kaddr) >> PAGE_SHIFT) -#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \ +#define VM_DATA_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | \ VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) +#define VM_STACK_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_STACK_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | VM_EXEC | \ + VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC) + +#define VM_DATA_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? \ + VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64) + +#define VM_STACK_DEFAULT_FLAGS \ + (test_thread_flag(TIF_32BIT) ? 
\ + VM_STACK_DEFAULT_FLAGS32 : VM_STACK_DEFAULT_FLAGS64) + #endif /* __KERNEL__ */ #endif /* _PPC64_PAGE_H */ diff -urN linux-2.5/include/asm-ppc64/pgtable.h test/include/asm-ppc64/pgtable.h --- linux-2.5/include/asm-ppc64/pgtable.h 2005-03-07 14:01:44.000000000 +1100 +++ test/include/asm-ppc64/pgtable.h 2005-03-15 17:41:14.000000000 +1100 @@ -82,14 +82,14 @@ #define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ #define _PAGE_USER 0x0002 /* matches one of the PP bits */ #define _PAGE_FILE 0x0002 /* (!present only) software: pte holds file offset */ -#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_EXEC 0x0004 /* No execute on POWER4 and newer (we invert) */ #define _PAGE_GUARDED 0x0008 #define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ #define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ #define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ #define _PAGE_DIRTY 0x0080 /* C: page changed */ #define _PAGE_ACCESSED 0x0100 /* R: page referenced */ -#define _PAGE_EXEC 0x0200 /* software: i-cache coherence required */ +#define _PAGE_RW 0x0200 /* software: user write access allowed */ #define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ @@ -118,29 +118,38 @@ #define PAGE_KERNEL __pgprot(_PAGE_BASE | _PAGE_WRENABLE) #define PAGE_KERNEL_CI __pgprot(_PAGE_PRESENT | _PAGE_ACCESSED | \ _PAGE_WRENABLE | _PAGE_NO_CACHE | _PAGE_GUARDED) +#define PAGE_KERNEL_EXEC __pgprot(_PAGE_BASE | _PAGE_WRENABLE | _PAGE_EXEC) /* - * The PowerPC can only do execute protection on a segment (256MB) basis, - * not on a page basis. So we consider execute permission the same as read. + * This bit in a hardware PTE indicates that the page is *not* executable. 
+ */ +#define HW_NO_EXEC _PAGE_EXEC + +/* + * POWER4 and newer have per page execute protection, older chips can only + * do this on a segment (256MB) basis. + * * Also, write permissions imply read permissions. * This is the closest we can get.. + * + * Note due to the way vm flags are laid out, the bits are XWR */ #define __P000 PAGE_NONE -#define __P001 PAGE_READONLY_X +#define __P001 PAGE_READONLY #define __P010 PAGE_COPY #define __P011 PAGE_COPY_X #define __P100 PAGE_READONLY #define __P101 PAGE_READONLY_X -#define __P110 PAGE_COPY +#define __P110 PAGE_COPY_X #define __P111 PAGE_COPY_X #define __S000 PAGE_NONE -#define __S001 PAGE_READONLY_X +#define __S001 PAGE_READONLY #define __S010 PAGE_SHARED -#define __S011 PAGE_SHARED_X -#define __S100 PAGE_READONLY +#define __S011 PAGE_SHARED +#define __S100 PAGE_READONLY_X #define __S101 PAGE_READONLY_X -#define __S110 PAGE_SHARED +#define __S110 PAGE_SHARED_X #define __S111 PAGE_SHARED_X #ifndef __ASSEMBLY__ @@ -438,7 +447,7 @@ static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty) { unsigned long bits = pte_val(entry) & - (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW); + (_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC); unsigned long old, tmp; __asm__ __volatile__( diff -urN linux-2.5/include/asm-ppc64/processor.h test/include/asm-ppc64/processor.h --- linux-2.5/include/asm-ppc64/processor.h 2005-03-07 10:46:39.000000000 +1100 +++ test/include/asm-ppc64/processor.h 2005-03-15 17:08:21.000000000 +1100 @@ -173,6 +173,11 @@ #define SPRN_DEC 0x016 /* Decrement Register */ #define SPRN_DMISS 0x3D0 /* Data TLB Miss Register */ #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */ +#define DSISR_NOHPTE 0x40000000 /* no translation found */ +#define DSISR_PROTFAULT 0x08000000 /* protection fault */ +#define DSISR_ISSTORE 0x02000000 /* access was a store */ +#define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */ +#define DSISR_NOSEGMENT 0x00200000 /* STAB/SLB miss */ #define 
SPRN_EAR 0x11A /* External Address Register */ #define SPRN_ESR 0x3D4 /* Exception Syndrome Register */ #define ESR_IMCP 0x80000000 /* Instr. Machine Check - Protection */ diff -urN linux-2.5/include/asm-ppc64/sections.h test/include/asm-ppc64/sections.h --- linux-2.5/include/asm-ppc64/sections.h 2004-02-12 14:57:14.000000000 +1100 +++ test/include/asm-ppc64/sections.h 2005-03-15 16:55:05.000000000 +1100 @@ -17,4 +17,13 @@ #define __openfirmware #define __openfirmwaredata + +static inline int in_kernel_text(unsigned long addr) +{ + if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end) + return 1; + + return 0; +} + #endif From olh at suse.de Wed Mar 16 17:40:48 2005 From: olh at suse.de (Olaf Hering) Date: Wed, 16 Mar 2005 07:40:48 +0100 Subject: [PATCH] allow xmon=on,off,early In-Reply-To: <16951.42108.932360.666387@cargo.ozlabs.ibm.com> References: <20050315210246.GA24477@suse.de> <16951.42108.932360.666387@cargo.ozlabs.ibm.com> Message-ID: <20050316064048.GA29079@suse.de> On Wed, Mar 16, Paul Mackeras wrote: > > + if (strncmp(p, "early", 5)) > > + return 0; > > + } > > Where does this handle xmon=off? Here? Just keep it simple. From tom.l.nguyen at intel.com Thu Mar 17 03:43:25 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Wed, 16 Mar 2005 08:43:25 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) Message-ID: On Tue, Mar 15, 2005 at 04:32:20PM, Benjamin Herrenschmidt wrote: > > Ok, let's propose what i think is a proper API and simple enough on the > driver side, ... > That > should cover all the needs we discussed so far: > > I think we need a callback in pci_driver, as I explained all along, with > a very simple semantic: > > int (*error_handler)(struct pci_dev *dev, int message); This API does not support PCI Express AER precise errors. I prefer to have param int message being replaced by union error_src structure as below to include PCI Express AER precise errors. 
union error_src { int message; /* This is for PCI errors */ struct { /* This is for PCI Express precise errors */ int type; unsigned int flags; unsigned int status; struct header_log_regs *log; }pcie_aer; }; Please let me know what you think. Thanks, Long From hollis at penguinppc.org Thu Mar 17 02:12:51 2005 From: hollis at penguinppc.org (Hollis Blanchard) Date: Wed, 16 Mar 2005 09:12:51 -0600 Subject: [PATCH] PPC64 iSeries: cleanup viopath In-Reply-To: <20050316025339.318fc246.sfr@canb.auug.org.au> References: <20050315143412.0c60690a.sfr@canb.auug.org.au> <0961a209ce72bb9f2a01b163aa6e6fbd@penguinppc.org> <20050316025339.318fc246.sfr@canb.auug.org.au> Message-ID: <303a387c46a384eb8afa7cce8c7e3225@penguinppc.org> On Mar 15, 2005, at 9:53 AM, Stephen Rothwell wrote: > On Tue, 15 Mar 2005 08:32:27 -0600 Hollis Blanchard > wrote: >> >> On Mar 14, 2005, at 9:34 PM, Stephen Rothwell wrote: >>> >>> Since you brought this file to my attention, I figured I might as >>> well >>> do >>> some simple cleanups. This patch does: >>> - single bit int bitfields are a bit suspect and Andrew pointed >>> out recently that they are probably slower to access than ints >> >>> --- linus/arch/ppc64/kernel/viopath.c 2005-03-13 04:07:42.000000000 >>> +1100 >>> +++ linus-cleanup.1/arch/ppc64/kernel/viopath.c 2005-03-15 >>> 14:02:48.000000000 +1100 >>> @@ -56,8 +57,8 @@ >>> * But this allows for other support in the future. >>> */ >>> static struct viopathStatus { >>> - int isOpen:1; /* Did we open the path? */ >>> - int isActive:1; /* Do we have a mon msg outstanding */ >>> + int isOpen; /* Did we open the path? */ >>> + int isActive; /* Do we have a mon msg outstanding */ >>> int users[VIO_MAX_SUBTYPES]; >>> HvLpInstanceId mSourceInst; >>> HvLpInstanceId mTargetInst; >> >> Why not use a byte instead of a full int (reordering the members for >> alignment)? > > Because "classical" booleans are ints. > > Because I don't know the relative speed of accessing single byte > variables. 
I didn't see the original observation that bitfields are slow. If the argument was that loading a bitfield requires a load then mask, then you'll be happy to find that PPC has word, halfword, and byte load instructions. So loading a byte (unsigned, as Brad pointed out) should be just as fast as loading a word. > It really makes little difference, I was just trying to get rid of the > silly signed single bit bitfields ... I understand. I was half being nitpicky, and half wondering if there was an actual reason I was missing. -Hollis From kravetz at us.ibm.com Thu Mar 17 05:50:41 2005 From: kravetz at us.ibm.com (Mike Kravetz) Date: Wed, 16 Mar 2005 10:50:41 -0800 Subject: [PATCH] PPC64 NUMA memory fixup (another try) Message-ID: <20050316185041.GA5617@w-mikek2.ibm.com> Below is a new version of the patch that allows holes within nodes on ppc64 NUMA. I would appreciate it if someone familiar with OF device tree parsing could take a look at this part of the code. So far, I've gotten this wrong twice. Patch was tested in various configurations on a G5 and OpenPower 720. 
-- Signed-off-by: Mike Kravetz diff -Naupr linux-2.6.11.4/arch/ppc64/mm/numa.c linux-2.6.11.4.work/arch/ppc64/mm/numa.c --- linux-2.6.11.4/arch/ppc64/mm/numa.c 2005-03-16 00:09:31.000000000 +0000 +++ linux-2.6.11.4.work/arch/ppc64/mm/numa.c 2005-03-16 17:40:44.000000000 +0000 @@ -40,7 +40,6 @@ int nr_cpus_in_node[MAX_NUMNODES] = { [0 struct pglist_data *node_data[MAX_NUMNODES]; bootmem_data_t __initdata plat_node_bdata[MAX_NUMNODES]; -static unsigned long node0_io_hole_size; static int min_common_depth; /* @@ -49,7 +48,8 @@ static int min_common_depth; */ static struct { unsigned long node_start_pfn; - unsigned long node_spanned_pages; + unsigned long node_end_pfn; + unsigned long node_present_pages; } init_node_data[MAX_NUMNODES] __initdata; EXPORT_SYMBOL(node_data); @@ -186,14 +186,36 @@ static int __init find_min_common_depth( return depth; } -static unsigned long read_cell_ul(struct device_node *device, unsigned int **buf) +static int __init get_mem_addr_cells(void) +{ + struct device_node *memory = NULL; + int rc; + + memory = of_find_node_by_type(memory, "memory"); + if (!memory) + return 0; /* it won't matter */ + + rc = prom_n_addr_cells(memory); + return rc; +} + +static int __init get_mem_size_cells(void) +{ + struct device_node *memory = NULL; + int rc; + + memory = of_find_node_by_type(memory, "memory"); + if (!memory) + return 0; /* it won't matter */ + rc = prom_n_size_cells(memory); + return rc; +} + +static unsigned long read_n_cells(int n, unsigned int **buf) { - int i; unsigned long result = 0; - i = prom_n_size_cells(device); - /* bug on i>2 ?? 
*/ - while (i--) { + while (n--) { result = (result << 32) | **buf; (*buf)++; } @@ -267,6 +289,7 @@ static int __init parse_numa_properties( { struct device_node *cpu = NULL; struct device_node *memory = NULL; + int addr_cells, size_cells; int max_domain = 0; long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT; unsigned long i; @@ -313,6 +336,8 @@ static int __init parse_numa_properties( } } + addr_cells = get_mem_addr_cells(); + size_cells = get_mem_size_cells(); memory = NULL; while ((memory = of_find_node_by_type(memory, "memory")) != NULL) { unsigned long start; @@ -329,8 +354,8 @@ static int __init parse_numa_properties( ranges = memory->n_addrs; new_range: /* these are order-sensitive, and modify the buffer pointer */ - start = read_cell_ul(memory, &memcell_buf); - size = read_cell_ul(memory, &memcell_buf); + start = read_n_cells(addr_cells, &memcell_buf); + size = read_n_cells(size_cells, &memcell_buf); start = _ALIGN_DOWN(start, MEMORY_INCREMENT); size = _ALIGN_UP(size, MEMORY_INCREMENT); @@ -348,33 +373,31 @@ new_range: if (max_domain < numa_domain) max_domain = numa_domain; - /* - * For backwards compatibility, OF splits the first node - * into two regions (the first being 0-4GB). Check for - * this simple case and complain if there is a gap in - * memory + /* + * Initialize new node struct, or add to an existing one. 
*/ - if (init_node_data[numa_domain].node_spanned_pages) { - unsigned long shouldstart = - init_node_data[numa_domain].node_start_pfn + - init_node_data[numa_domain].node_spanned_pages; - if (shouldstart != (start / PAGE_SIZE)) { - /* Revert to non-numa for now */ - printk(KERN_ERR - "WARNING: Unexpected node layout: " - "region start %lx length %lx\n", - start, size); - printk(KERN_ERR "NUMA is disabled\n"); - goto err; - } - init_node_data[numa_domain].node_spanned_pages += + if (init_node_data[numa_domain].node_end_pfn) { + if ((start / PAGE_SIZE) < + init_node_data[numa_domain].node_start_pfn) + init_node_data[numa_domain].node_start_pfn = + start / PAGE_SIZE; + if (((start / PAGE_SIZE) + (size / PAGE_SIZE)) > + init_node_data[numa_domain].node_end_pfn) + init_node_data[numa_domain].node_end_pfn = + (start / PAGE_SIZE) + + (size / PAGE_SIZE); + + init_node_data[numa_domain].node_present_pages += size / PAGE_SIZE; } else { node_set_online(numa_domain); init_node_data[numa_domain].node_start_pfn = start / PAGE_SIZE; - init_node_data[numa_domain].node_spanned_pages = + init_node_data[numa_domain].node_end_pfn = + init_node_data[numa_domain].node_start_pfn + + size / PAGE_SIZE; + init_node_data[numa_domain].node_present_pages = size / PAGE_SIZE; } @@ -391,14 +414,6 @@ new_range: node_set_online(i); return 0; -err: - /* Something has gone wrong; revert any setup we've done */ - for_each_node(i) { - node_set_offline(i); - init_node_data[i].node_start_pfn = 0; - init_node_data[i].node_spanned_pages = 0; - } - return -1; } static void __init setup_nonnuma(void) @@ -426,12 +441,11 @@ static void __init setup_nonnuma(void) node_set_online(0); init_node_data[0].node_start_pfn = 0; - init_node_data[0].node_spanned_pages = lmb_end_of_DRAM() / PAGE_SIZE; + init_node_data[0].node_end_pfn = lmb_end_of_DRAM() / PAGE_SIZE; + init_node_data[0].node_present_pages = total_ram / PAGE_SIZE; for (i = 0 ; i < top_of_ram; i += MEMORY_INCREMENT) numa_memory_lookup_table[i >> 
MEMORY_INCREMENT_SHIFT] = 0; - - node0_io_hole_size = top_of_ram - total_ram; } static void __init dump_numa_topology(void) @@ -512,6 +526,8 @@ static unsigned long careful_allocation( void __init do_init_bootmem(void) { int nid; + int addr_cells, size_cells; + struct device_node *memory = NULL; static struct notifier_block ppc64_numa_nb = { .notifier_call = cpu_numa_callback, .priority = 1 /* Must run before sched domains notifier. */ @@ -535,7 +551,7 @@ void __init do_init_bootmem(void) unsigned long bootmap_pages; start_paddr = init_node_data[nid].node_start_pfn * PAGE_SIZE; - end_paddr = start_paddr + (init_node_data[nid].node_spanned_pages * PAGE_SIZE); + end_paddr = init_node_data[nid].node_end_pfn * PAGE_SIZE; /* Allocate the node structure node local if possible */ NODE_DATA(nid) = (struct pglist_data *)careful_allocation(nid, @@ -551,9 +567,9 @@ void __init do_init_bootmem(void) NODE_DATA(nid)->node_start_pfn = init_node_data[nid].node_start_pfn; NODE_DATA(nid)->node_spanned_pages = - init_node_data[nid].node_spanned_pages; + end_paddr - start_paddr; - if (init_node_data[nid].node_spanned_pages == 0) + if (NODE_DATA(nid)->node_spanned_pages == 0) continue; dbg("start_paddr = %lx\n", start_paddr); @@ -572,33 +588,55 @@ void __init do_init_bootmem(void) start_paddr >> PAGE_SHIFT, end_paddr >> PAGE_SHIFT); - for (i = 0; i < lmb.memory.cnt; i++) { - unsigned long physbase, size; - - physbase = lmb.memory.region[i].physbase; - size = lmb.memory.region[i].size; - - if (physbase < end_paddr && - (physbase+size) > start_paddr) { - /* overlaps */ - if (physbase < start_paddr) { - size -= start_paddr - physbase; - physbase = start_paddr; - } - - if (size > end_paddr - physbase) - size = end_paddr - physbase; + /* + * We need to do another scan of all memory sections to + * associate memory with the correct node. 
+ */ + addr_cells = get_mem_addr_cells(); + size_cells = get_mem_size_cells(); + memory = NULL; + while ((memory = of_find_node_by_type(memory, "memory")) != NULL) { + unsigned long mem_start, mem_size; + int numa_domain, ranges; + unsigned int *memcell_buf; + unsigned int len; + + memcell_buf = (unsigned int *)get_property(memory, "reg", &len); + if (!memcell_buf || len <= 0) + continue; - dbg("free_bootmem %lx %lx\n", physbase, size); - free_bootmem_node(NODE_DATA(nid), physbase, - size); + ranges = memory->n_addrs; /* ranges in cell */ +new_range: + mem_start = read_n_cells(addr_cells, &memcell_buf); + mem_size = read_n_cells(size_cells, &memcell_buf); + numa_domain = of_node_numa_domain(memory); + + if (numa_domain != nid) + continue; + + if (mem_start < end_paddr && + (mem_start+mem_size) > start_paddr) { + /* should be no overlaps ! */ + dbg("free_bootmem %lx %lx\n", mem_start, mem_size); + free_bootmem_node(NODE_DATA(nid), mem_start, + mem_size); } + + if (--ranges) /* process all ranges in cell */ + goto new_range; } + /* + * Mark reserved regions on this node + */ for (i = 0; i < lmb.reserved.cnt; i++) { unsigned long physbase = lmb.reserved.region[i].physbase; unsigned long size = lmb.reserved.region[i].size; + if (pa_to_nid(physbase) != nid && + pa_to_nid(physbase+size-1) != nid) + continue; + if (physbase < end_paddr && (physbase+size) > start_paddr) { /* overlaps */ @@ -632,13 +670,12 @@ void __init paging_init(void) unsigned long start_pfn; unsigned long end_pfn; - start_pfn = plat_node_bdata[nid].node_boot_start >> PAGE_SHIFT; - end_pfn = plat_node_bdata[nid].node_low_pfn; + start_pfn = init_node_data[nid].node_start_pfn; + end_pfn = init_node_data[nid].node_end_pfn; zones_size[ZONE_DMA] = end_pfn - start_pfn; - zholes_size[ZONE_DMA] = 0; - if (nid == 0) - zholes_size[ZONE_DMA] = node0_io_hole_size >> PAGE_SHIFT; + zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - + init_node_data[nid].node_present_pages; dbg("free_area_init node %d %lx %lx (hole: 
%lx)\n", nid, zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]); From ak at muc.de Thu Mar 17 07:43:00 2005 From: ak at muc.de (Andi Kleen) Date: 16 Mar 2005 21:43:00 +0100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) In-Reply-To: <1110864741.29124.68.camel@gaston> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com> <1110864741.29124.68.camel@gaston> Message-ID: <20050316204300.GA6251@muc.de> Hi Ben, I have not studied everything in detail, but from a quick browse it looks reasonable. I would make an own callback for each state though, otherwise the driver has to do a switch with subfunctions anyways. -Andi From paulus at samba.org Thu Mar 17 08:40:25 2005 From: paulus at samba.org (Paul Mackerras) Date: Thu, 17 Mar 2005 08:40:25 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) In-Reply-To: References: Message-ID: <16952.42953.985753.66211@cargo.ozlabs.ibm.com> Nguyen, Tom L writes: > This API does not support PCI Express AER precise errors. I prefer to > have param int message being replaced by union error_src structure as > below to include PCI Express AER precise errors. I think you are misunderstanding the purpose of the "message" parameter. It is not there to give you details of the error that occurred, it is there to tell the driver what stage of the recovery process we are up to. The details of the error would be reported through an io_check_error() or similar interface. Paul. 
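To make the distinction concrete, here is a minimal sketch in plain C of a handler that treats "message" as a recovery stage rather than as error detail. It is not kernel code: struct pci_dev is a stand-in, the numeric values of the PCIERR_* constants are assumed, and the EXAMPLE_* result codes and example_error_handler() are hypothetical names invented for illustration. Only the PCIERR_ERROR_* stage names and PCIERR_RESULT_NEED_RESET come from the thread.

```c
/* Stand-in for the kernel's struct pci_dev so the sketch builds on its own. */
struct pci_dev {
	int needs_reset;	/* driver knows the error was fatal */
	int quiesced;		/* driver has stopped issuing I/O */
};

/* Recovery stages passed in "message" (names from the thread, values assumed). */
enum {
	PCIERR_ERROR_DETECTED,
	PCIERR_ERROR_RECOVER,
	PCIERR_ERROR_RESET,
	PCIERR_ERROR_RESTART
};

/* Result codes: only PCIERR_RESULT_NEED_RESET is named in the thread. */
enum {
	EXAMPLE_RESULT_OK,		/* hypothetical */
	EXAMPLE_RESULT_CAN_RECOVER,	/* hypothetical */
	PCIERR_RESULT_NEED_RESET
};

/* One callback, dispatching on the stage of the recovery process. */
static int example_error_handler(struct pci_dev *dev, int message)
{
	switch (message) {
	case PCIERR_ERROR_DETECTED:
		dev->quiesced = 1;	/* stop touching the device */
		return dev->needs_reset ? PCIERR_RESULT_NEED_RESET
					: EXAMPLE_RESULT_CAN_RECOVER;
	case PCIERR_ERROR_RECOVER:
		/* try to recover using only I/O to the device */
		return EXAMPLE_RESULT_OK;
	case PCIERR_ERROR_RESET:
		dev->needs_reset = 0;	/* slot was reset, re-init the hardware */
		return EXAMPLE_RESULT_OK;
	case PCIERR_ERROR_RESTART:
		dev->quiesced = 0;	/* normal operation resumes */
		return EXAMPLE_RESULT_OK;
	}
	return PCIERR_RESULT_NEED_RESET;	/* unknown stage: be conservative */
}
```

In this sketch the details of the error never travel through "message"; a driver that wants them would ask for them separately, for example through io_check_error() or a similar interface as described above.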
From moilanen at austin.ibm.com Thu Mar 17 08:45:58 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 16 Mar 2005 15:45:58 -0600 Subject: [PATCH 1/2] No-exec support for ppc64 In-Reply-To: <16951.52721.139394.592636@cargo.ozlabs.ibm.com> References: <20050308165904.0ce07112.moilanen@austin.ibm.com> <20050308170826.13a2299e.moilanen@austin.ibm.com> <20050310032213.GB20789@austin.ibm.com> <20050310162513.74191caa.moilanen@austin.ibm.com> <16949.25552.640180.677985@cargo.ozlabs.ibm.com> <20050314155125.68dcff70.moilanen@austin.ibm.com> <16950.3484.416343.832453@cargo.ozlabs.ibm.com> <20050315155135.11b942ef.moilanen@austin.ibm.com> <16951.52721.139394.592636@cargo.ozlabs.ibm.com> Message-ID: <20050316154558.7c634a23.moilanen@austin.ibm.com> On Wed, 16 Mar 2005 17:10:57 +1100 Paul Mackerras wrote: > Jake Moilanen writes: > > > It does not work w/o the sys_mprotect. It will hang in one of the first > > few binaries. > > Hmmm, what distro is this with? I just tried a kernel with the patch > below on a SLES9 install and a Debian install and it came up and ran > just fine in both cases. I'm not sure that the patch you sent is actually doing protection correctly. To test I commented out this line: > +#define elf_read_implies_exec(ex, have_pt_gnu_stack) (!(have_pt_gnu_stack)) and then ran a non-pt_gnu_stack binary which should have executed on a non-exec segment, it did not segfault. > + * > + * Note due to the way vm flags are laid out, the bits are XWR > */ > #define __P000 PAGE_NONE > -#define __P001 PAGE_READONLY_X > +#define __P001 PAGE_READONLY > #define __P010 PAGE_COPY > #define __P011 PAGE_COPY_X > #define __P100 PAGE_READONLY > #define __P101 PAGE_READONLY_X > -#define __P110 PAGE_COPY > +#define __P110 PAGE_COPY_X > #define __P111 PAGE_COPY_X I think the problem was this hunk. __P011 should be PAGE_COPY and __P100 should be PAGE_READONLY_X. 
Here is a patch on top of the last patch you sent to fix this problem and take another crack at doing the sys_mprotect in a less hackish way. Signed-off-by: Jake Moilanen --- linux-2.6.11.4-paulus-moilanen/fs/binfmt_elf.c | 18 +++++++++---- linux-2.6.11.4-paulus-moilanen/include/asm-ppc64/pgtable.h | 4 +- 2 files changed, 15 insertions(+), 7 deletions(-) diff -puN fs/binfmt_elf.c~more-nx fs/binfmt_elf.c --- linux-2.6.11.4-paulus/fs/binfmt_elf.c~more-nx 2005-03-16 09:35:28 -06:00 +++ linux-2.6.11.4-paulus-moilanen/fs/binfmt_elf.c 2005-03-16 11:03:15 -06:00 @@ -88,7 +88,7 @@ static struct linux_binfmt elf_format = #define BAD_ADDR(x) ((unsigned long)(x) > TASK_SIZE) -static int set_brk(unsigned long start, unsigned long end) +static int set_brk(unsigned long start, unsigned long end, int prot) { start = ELF_PAGEALIGN(start); end = ELF_PAGEALIGN(end); @@ -99,6 +99,9 @@ static int set_brk(unsigned long start, up_write(&current->mm->mmap_sem); if (BAD_ADDR(addr)) return addr; + + sys_mprotect(start, end-start, prot); + } current->mm->start_brk = current->mm->brk = end; return 0; @@ -529,6 +532,7 @@ static int load_elf_binary(struct linux_ struct files_struct *files; int have_pt_gnu_stack, executable_stack = EXSTACK_DEFAULT; unsigned long def_flags = 0; + int bss_prot = 0; struct { struct elfhdr elf_ex; struct elfhdr interp_elf_ex; @@ -811,7 +815,7 @@ static int load_elf_binary(struct linux_ before this one. Map anonymous pages, if needed, and clear the area.
*/ retval = set_brk (elf_bss + load_bias, - elf_brk + load_bias); + elf_brk + load_bias, bss_prot); if (retval) { send_sig(SIGKILL, current, 0); goto out_free_dentry; @@ -883,15 +887,19 @@ static int load_elf_binary(struct linux_ k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz; - if (k > elf_bss) + if (k > elf_bss) { elf_bss = k; + bss_prot = elf_prot; + } if ((elf_ppnt->p_flags & PF_X) && end_code < k) end_code = k; if (end_data < k) end_data = k; k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz; - if (k > elf_brk) + if (k > elf_brk) { elf_brk = k; + bss_prot = elf_prot; + } } loc->elf_ex.e_entry += load_bias; @@ -907,7 +915,7 @@ static int load_elf_binary(struct linux_ * mapping in the interpreter, to make sure it doesn't wind * up getting placed where the bss needs to go. */ - retval = set_brk(elf_bss, elf_brk); + retval = set_brk(elf_bss, elf_brk, bss_prot); if (retval) { send_sig(SIGKILL, current, 0); goto out_free_dentry; diff -puN include/asm-ppc64/pgtable.h~more-nx include/asm-ppc64/pgtable.h --- linux-2.6.11.4-paulus/include/asm-ppc64/pgtable.h~more-nx 2005-03-16 09:35:44 -06:00 +++ linux-2.6.11.4-paulus-moilanen/include/asm-ppc64/pgtable.h 2005-03-16 09:35:53 -06:00 @@ -137,8 +137,8 @@ #define __P000 PAGE_NONE #define __P001 PAGE_READONLY #define __P010 PAGE_COPY -#define __P011 PAGE_COPY_X -#define __P100 PAGE_READONLY +#define __P011 PAGE_COPY +#define __P100 PAGE_READONLY_X #define __P101 PAGE_READONLY_X #define __P110 PAGE_COPY_X #define __P111 PAGE_COPY_X _ From agl at us.ibm.com Thu Mar 17 08:01:36 2005 From: agl at us.ibm.com (Adam Litke) Date: Wed, 16 Mar 2005 15:01:36 -0600 Subject: Hugepage COW In-Reply-To: <20050223070322.GF24473@localhost.localdomain> References: <1109085505.5217.28.camel@localhost.localdomain> <20050223070322.GF24473@localhost.localdomain> Message-ID: <1111006896.3635.24.camel@localhost.localdomain> On Wed, 2005-02-23 at 01:03, David Gibson wrote: > On Tue, Feb 22, 2005 at 07:18:25AM -0800, Adam Litke wrote: > > Hi David. 
I've been trying out your consolidation patch and everything > > seems to be working great for me too on ppc64. > > That's good to know. Did you test on anything other than ppc64? > That's where the patch really needs testing. I just want to report a testing ACK on i386 for both the consolidation patch and the COW patch. Is there anything holding these back from mainline? Let me know if I can do anything else to help out. Once my hugetlb tests make it onto ABAT, I'll be able to test x86_64 and ia64 as well. -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center From benh at kernel.crashing.org Thu Mar 17 09:40:28 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 17 Mar 2005 09:40:28 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) In-Reply-To: References: Message-ID: <1111012828.15509.23.camel@gaston> On Wed, 2005-03-16 at 08:43 -0800, Nguyen, Tom L wrote: > On Tue, Mar 15, 2005 at 04:32:20PM, Benjamin Herrenschmidt wrote: > > > > Ok, let's propose what i think is a proper API and simple enough on > the > > driver side, ... > > That > > should cover all the needs we discussed so far: > > > > I think we need a callback in pci_driver, as I explained all along, > with > > a very simple semantic: > > > > int (*error_handler)(struct pci_dev *dev, int message); > > This API does not support PCI Express AER precise errors. I prefer to > have param int message being replaced by union error_src structure as > below to include PCI Express AER precise errors. > > union error_src { > int message; /* This for PCI Error */ > struct { /* This for PCI Express > Precise Error */ > int type; > unsigned int flags; > unsigned int status; > struct header_log_regs *log; > }pcie_aer; > }; > > Please let me know what you think? Well, I did that on purpose to avoid messing with variable argument types or complexify the arguments of the callback. 
The way I see things, there are 2 ways the error details can be obtained by the driver, one is returning those from the clear ... check stuff described by Seto, and eventually a pci_get_last_error(pdev). In both cases, it was suggested that the error is returned as an "opaque" token with accessors to retrieve the class of error and other details. We have to deal with PCI legacy errors, PCIE-AER errors, plus other kinds of errors provided by various architectures/firmwares for things like DMA errors, etc... Now, if you feel that we should really pass the error token to the callback, then maybe we should go with that, though it only makes sense in the initial callback, not the 3 other ones. Ben. From benh at kernel.crashing.org Thu Mar 17 09:58:55 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 17 Mar 2005 09:58:55 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI Error Recovery) In-Reply-To: <20050316204300.GA6251@muc.de> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com> <1110864741.29124.68.camel@gaston> <20050316204300.GA6251@muc.de> Message-ID: <1111013936.15509.31.camel@gaston> On Wed, 2005-03-16 at 21:43 +0100, Andi Kleen wrote: > Hi Ben, > > I have not studied everything in detail, but from a quick browse it looks > reasonable. I would make an own callback for each state though, otherwise > the driver has to do a switch with subfunctions anyways. Well... I preferred not to bloat pci_driver for everybody, but I'm ok with the separate function approach in principle. We could even have pci_ops just have an optional pointer to a separate pci_err_ops in fact to avoid the bloat in the common case.
Again, my point in this definition wasn't the details like that, but the actual detailed semantic of each step that, I hope, will fit everybody. Ben. From olh at suse.de Thu Mar 17 10:03:52 2005 From: olh at suse.de (Olaf Hering) Date: Thu, 17 Mar 2005 00:03:52 +0100 Subject: [PATCH] allow xmon=on,off,early In-Reply-To: <16951.42108.932360.666387@cargo.ozlabs.ibm.com> References: <20050315210246.GA24477@suse.de> <16951.42108.932360.666387@cargo.ozlabs.ibm.com> Message-ID: <20050316230352.GA19259@suse.de> > Where does this handle xmon=off? Maybe you like this one better: allow 'xmon' or 'xmon=early' to enter xmon very early during boot. allow 'xmon=on' to just enable it, or 'xmon=off' to disable it. Signed-off-by: Olaf Hering Index: linux-2.6.11-olh/arch/ppc64/kernel/setup.c =================================================================== --- linux-2.6.11-olh.orig/arch/ppc64/kernel/setup.c +++ linux-2.6.11-olh/arch/ppc64/kernel/setup.c @@ -1365,6 +1365,12 @@ EXPORT_SYMBOL(check_legacy_ioport); static int __init early_xmon(char *p) { /* ensure xmon is enabled */ + if (p) { + if (strncmp(p, "on", 2) == 0) + xmon_init(); + if (strncmp(p, "early", 5) != 0) + return 0; + } xmon_init(); debugger(NULL); From tom.l.nguyen at intel.com Thu Mar 17 09:55:28 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Wed, 16 Mar 2005 14:55:28 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery) Message-ID: Tuesday, March 15, 2005 3:28 PM Benjamin Herrenschmidt wrote: > Please, look at my mail describing a different interface. I think your > model can fit. However, one thing you need to do is what I call > "synchronous error detection" as well. Seems like we need to define the roles of the Error Driver (in the case of PCI Express what I am referring to as the AER driver) and the device driver (e.g. driver for a PCI SCSI card). See the PCI Express HOW-TO for how I defined the roles. Can you provide insight into the roles for PCI? 
Need some scenarios to walk though that would validate any API usage model. We have gone through some to define the PCI Express AER interface. However, they were PCI Express specific. We need some PCI based error flows to understand the details of the flow so we can develop an interface compatible with both. Some specific comments regarding the API proposed as it relates to PCI Express: 1. With message = PCIERR_ERROR_DETECTED PCI Express has extensive error reporting information and requires more than just an "int message" to report all the information. The error_handler interface allows us to notify the driver but does not allow the driver to report comprehensive error information that the driver can gather from its device. PCI Express inherently builds in the severity of the errors so that a query of the device regarding recoverability/fatality is built into the error data the device sends to the AER driver. The AER driver then uses this data to guide the error recovery, because it owns error recovery communication with the hierarchy in question. 2. With message = PCIERR_ERROR_RECOVER In PCI Express the device driver programs the severity of the errors for its device. This programming allows the error information to embed the recoverability of the error in the error message information forwarded to the AER driver from the device. It is assumed for all errors when the device driver is notified by the AER driver the device driver will take recovery actions with its HW device as it deems appropriate based on the severity of the error. However, the device driver cannot take any action that affects the bus/link interface. Under PCI Express this is prevented because devices cannot reset upstream links. I need a better understanding of how PCI works in these scenarios so that we can come to a common API. Can you provide some common PCI error flows with the capabilities a device driver may have regarding error recovery. 3. 
With message = PCIERR_ERROR_RESTART Not necessary for PCI Express because it is a point to point protocol. However, we can overload it so PCI Express uses as a mechanism to communicate with all downstream devices affected by an upstream link error/reset. 4. What mechanism (message??) is used to perform the bus and/or link level reset? For PCI Express the reset is performed by the upstream port driver. My API takes this into account. Are you assuming the PCI device on the bus does the reset or will there be a PCI bus driver that will do the reset? How will the PCI error handling code initiate a reset? 5. Is the infrastructure being proposed for device level errors, PCI bus level errors, or PCI Express link level errors? The connection between the devices is what the errors are about. We think that the focus should be bus/link level errors. Do I misunderstand the usage model of this API somehow? Thanks, Long From tom.l.nguyen at intel.com Thu Mar 17 10:01:53 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Wed, 16 Mar 2005 15:01:53 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery) Message-ID: On Wednesday, March 16, 2005 1:40 PM Paul Mackerras wrote: >> This API does not support PCI Express AER precise errors. I prefer to >> have param int message being replaced by union error_src structure as >> below to include PCI Express AER precise errors. > >I think you are misunderstanding the purpose of the "message" >parameter. It is not there to give you details of the error that >occurred, it is there to tell the driver what stage of the recovery >process we are up to. The details of the error would be reported >through an io_check_error() or similar interface. How does an io_check_error() support PCI Express comprehensive error information? Would you please explain it to me? Do you think there is an overlap between error_handler and io_check_error usages when dealing with notifying the driver of an error occurred? 
Thanks, Long From benh at kernel.crashing.org Thu Mar 17 13:49:51 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 17 Mar 2005 13:49:51 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery) In-Reply-To: References: Message-ID: <1111027792.7192.173.camel@gaston> On Wed, 2005-03-16 at 15:01 -0800, Nguyen, Tom L wrote: > On Wednesday, March 16, 2005 1:40 PM Paul Mackerras wrote: > >> This API does not support PCI Express AER precise errors. I prefer to > >> have param int message being replaced by union error_src structure as > >> below to include PCI Express AER precise errors. > > > >I think you are misunderstanding the purpose of the "message" > >parameter. It is not there to give you details of the error that > >occurred, it is there to tell the driver what stage of the recovery > >process we are up to. The details of the error would be reported > >through an io_check_error() or similar interface. > > How does an io_check_error() support PCI Express comprehensive error > information? Would you please explain it to me? Do you think there is an > overlap between error_handler and io_check_error usages when dealing > with notifying the driver of an error occurred? Those are two different things. One is when you have a bunch of IOs, to be able to check whether an error occurred in there. Especially useful on a driver that is sort-of waiting for something or looping around something and is suddenly getting ff's. Also read closely my proposed API for the case where a driver tries to recover. That part must be "protected", that is an IO error done in that part must not lead to a new error message sent to the driver but should be reported synchronously to the caller. Whether the error information is "comprehensive" or not is unrelated to the discussion :) The idea, as I explained, is to provide an opaque error token with functions to eventually extract details. Ben.
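A rough sketch of the synchronous check described above, in plain C with a simulated device. Everything here is an assumption made for illustration: struct example_dev, example_readl() and example_io_check_error() are hypothetical stand-ins, not the proposed kernel interface. The only behaviour taken from the thread is that a driver can suddenly start reading ff's and then wants to ask whether an error really occurred.

```c
#include <stdint.h>

/* Simulated device: when isolated, every read returns all 1s. */
struct example_dev {
	int isolated;
	uint32_t reg;
};

/* Hypothetical MMIO read helper. */
static uint32_t example_readl(const struct example_dev *dev)
{
	return dev->isolated ? 0xffffffffu : dev->reg;
}

/* Hypothetical stand-in for io_check_error(): non-zero if an error is pending. */
static int example_io_check_error(const struct example_dev *dev)
{
	return dev->isolated;
}

/*
 * Read a status register. All 1s is only suspicious, not proof of an
 * error, so the driver confirms synchronously before giving up.
 * Returns 0 and stores the value on success, -1 if an error was detected.
 */
static int example_read_status(const struct example_dev *dev, uint32_t *out)
{
	uint32_t v = example_readl(dev);

	if (v == 0xffffffffu && example_io_check_error(dev))
		return -1;
	*out = v;
	return 0;
}
```

The point of the design is that a register which legitimately reads as all 1s is still accepted, because the read value alone is not treated as an error report.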
From paulus at samba.org Thu Mar 17 14:22:25 2005 From: paulus at samba.org (Paul Mackerras) Date: Thu, 17 Mar 2005 14:22:25 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery) In-Reply-To: References: Message-ID: <16952.63473.537572.160203@cargo.ozlabs.ibm.com> Nguyen, Tom L writes: > How does an io_check_error() support PCI Express comprehensive error > information? It doesn't. The io_check_error() would be a generic, cross-platform thing. You would have some PCI-Express-specific interfaces to get PCI-Express-specific error information. You are facing a similar tension to what we are facing - the tension between wanting drivers to be able to use all the facilities of the platform, but not wanting to make the driver so platform-specific that it is useless on any other platform (and you end up maintaining all the drivers for the platform). The ideal thing is to find a way to express the facilities of the different platforms we have in a sufficiently abstract or generic way that drivers can use a common error detection and recovery strategy across multiple platforms. That is what we are trying to do, and if we can succeed at that we will reduce the development and maintenance effort for everyone. That means, however, that the driver will not be dealing with the specific nitty-gritty platform-specific details of exactly what went wrong and where. If the driver really has to deal with those details then it becomes platform-specific. That may be acceptable for a few drivers for which the extra effort (including the ongoing maintenance effort) can be justified, but it is not practicable to maintain platform-specific versions of every driver. Paul. From benh at kernel.crashing.org Thu Mar 17 14:20:23 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 17 Mar 2005 14:20:23 +1100 Subject: PCI Error Recovery API Proposal. 
(WAS:: [PATCH/RFC] PCI ErrorRecovery) In-Reply-To: References: Message-ID: <1111029623.15509.199.camel@gaston> On Wed, 2005-03-16 at 14:55 -0800, Nguyen, Tom L wrote: > 1. With message = PCIERR_ERROR_DETECTED > PCI Express has extensive error reporting information and requires more > than just an "int message" to report all the information. The > error_handler interface allows us to notify the driver but does not > allow the driver to report comprehensive error information that the > driver can gather from its device. PCI Express inherently builds in the > severity of the errors so that a query of the device regarding > recoverability/fatality is built into the error data the device sends to > the AER driver. The AER driver then uses this data to guide the error > recovery, because it owns error recovery communication with the > hierarchy in question. As I explained already, I feel like this additional error information should be requested explicitly, but that isn't a strong feeling, we could perfectly well add an io error token (opaque error representation) to this callback. Since it seems that it's preferred to have several callbacks anyway rather than a switch/case on a message, the message argument will be gone, and that specific callback will take an io error token. > 2. With message = PCIERR_ERROR_RECOVER > In PCI Express the device driver programs the severity of the errors for > its device. This programming allows the error information to embed the > recoverability of the error in the error message information forwarded > to the AER driver from the device. It is assumed for all errors when > the device driver is notified by the AER driver the device driver will > take recovery actions with its HW device as it deems appropriate based > on the severity of the error. However, the device driver cannot take > any action that affects the bus/link interface. Under PCI Express this > is prevented because devices cannot reset upstream links.
> > I need a better understanding of how PCI works in these scenarios so > that we can come to a common API. Can you provide some common PCI error > flows with the capabilities a device driver may have regarding error > recovery. This is more than just how PCI works. It's also how IBM's EEH works and others. I'm trying to setup a model in which we can "fit" everybody. Basically, if you already know that no recovery is possible with just doing IOs to the chip, but a card reset will be possible, you can directly go to the step PCIERR_ERROR_RESET. Remember that we are trying to provide a driver-side API that isolates them of the underlying mecanisms and policies. If the driver knows (thanks to PCI Express stuff) that the error was fatal (because it knows it's on pci-express and extracted the relevant information out of the error token), then it could just return PCIERR_RESULT_NEED_RESET right away from the PCIERR_ERROR_DETECTED callback. It's up to the platform then to just give up if it decides it can't bring the link back, or to reset everything and then call PCIERR_ERROR_RESET. Basically, my model allows driver to deal with that basic 3 states (and 2 recovery mecanisms): - reporting the error to the driver, quiesce it - if possible give a chance to the driver to recover just by issuing IOs (that is the error wasn't fatal). That is a best try, it can fail. - if possible, reset the slot/bus and let drivers re-init the HW I'm confident the PCI Express mecanism can fit nicely into this model, but you are welcome to prove me wrong. In all above case, "if possible" is a mix of driver and platform knowledge. That is, the platform tries at best to serve the driver wishes (like going to state "recover" when the driver says it can try to recover, based on result code form "detect" callback), but may just decide it can't and reset. 
If the error information provides the "recoverability" information from the very beginning, then the driver just needs to return the right code from the "detect" callback based on it. > 3. With message = PCIERR_ERROR_RESTART > Not necessary for PCI Express because it is a point to point protocol. > However, we can overload it so PCI Express uses as a mechanism to > communicate with all downstream devices affected by an upstream link > error/reset. Among others... It is necessary because again, we are defining a model that matches everybody. So we want drivers to assume they have to wait before everybody has been notified, even if a specific implementation doesn't require it. On PCI Express, we could imagine just calling restart right away after recover, that isn't a problem. There may also be incidences with interrupt handling. Restart is the only point where interrupts are guaranteed to be properly operational (see my note). > What mechanism (message??) is used to perform the bus and/or link > level reset? For PCI Express the reset is performed by the upstream > port driver. My API takes this into account. Are you assuming the PCI > device on the bus does the reset or will there be a PCI bus driver that > will do the reset? How will the PCI error handling code initiate a > reset? The "caller", that is the error management framework. I'm defining the API at the driver level, not the implementation at the core level. For example, on IBM pSeries with PCI-Express, we will probably not have an AER driver. This will all be dealt with by the firmware, which will mimic that to the existing EEH error management. We'll have the same API to do the reset that we have today for resetting a slot. You may have noticed in general that I didn't define who is calling those callbacks either. It's all implicit that this is done by platform error management code. For example, on ppc64, even the recovery step requires action from the platform since the slot has been physically isolated.
After we have notified all drivers with the "error detected" callback, if we decide we can try the "recover" step (all drivers returned they could try it and we decided the error wasn't too fatal) we will call the firmware to re-enable IOs on the slot and call the "recover" step. If my model is implemented on top of classic PCI, the "core" would just clear SERR/PERR or any other error reporting facility on the host bridge and other bridges down the path if necessary and call recover. If the platform has no slot reset capability, the core would give up in any non-recoverable case. We could provide generic implementations for basic PCI (as described above) and for PCI-Express maybe, but the platform is the one to choose which implementation to use and to provide eventual alternative implementations based on that platform additional capabilities. > 5. Is the infrastructure being proposed for device level errors, PCI bus > level errors, or PCI Express link level errors? The connection between > the devices is what the errors are about. We think that the focus > should be bus/link level errors. Device level and link. It might be useful to provide a generic function to extract the level of the error from the error token, but not mandatory. > Do I misunderstand the usage model of this API somehow? Maybe :) Ben. From david at gibson.dropbear.id.au Thu Mar 17 14:48:44 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 17 Mar 2005 14:48:44 +1100 Subject: Hugepage COW In-Reply-To: <1111006896.3635.24.camel@localhost.localdomain> References: <1109085505.5217.28.camel@localhost.localdomain> <20050223070322.GF24473@localhost.localdomain> <1111006896.3635.24.camel@localhost.localdomain> Message-ID: <20050317034844.GD14048@localhost.localdomain> On Wed, Mar 16, 2005 at 03:01:36PM -0600, Adam Litke wrote: > On Wed, 2005-02-23 at 01:03, David Gibson wrote: > > On Tue, Feb 22, 2005 at 07:18:25AM -0800, Adam Litke wrote: > > > Hi David. 
I've been trying out your consolidation patch and everything > > > seems to be working great for me too on ppc64. > > > > That's good to know. Did you test on anything other than ppc64? > > That's where the patch really needs testing. > > I just want to report a testing ACK on i386 for both the consolidation > patch and the COW patch. Is there anything holding these back from > mainline? Let me know if I can do anything else to help out. Once my > hugetlb tests make it onto ABAT, I'll be able to test x86_64 and ia64 as > well. As far as the consolidation patch goes, lack of testing was the main objective reason for holding back. So if you could test on x86_64 and ia64 too, that would be great. wli had some objections to the patch when I first posted which I didn't and don't really understand, and from conversations with akpm, I'm certainly thinking of just re-sending it. COW will be a bit more of a political shitfight, I suspect. I'd like to at least hold off until the consolidation is merged, which makes the COW much easier. We'll also need to implement the necessary arch-hooks for COW on every platform. Speaking of which, did you implement the i386 hooks? I thought I only did COW for ppc64, so far, although on top of the consolidation patch the amount of arch code is vastly reduced. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist. NOT _the_ _other_ _way_ | _around_! http://www.ozlabs.org/people/dgibson From paulus at samba.org Thu Mar 17 14:51:49 2005 From: paulus at samba.org (Paul Mackerras) Date: Thu, 17 Mar 2005 14:51:49 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery) In-Reply-To: References: Message-ID: <16952.65237.676567.980812@cargo.ozlabs.ibm.com> Nguyen, Tom L writes: > We need some PCI > based error flows to understand the details of the flow so we can > develop an interface compatible with both. 
Here is a basic outline of what happens with EEH (Enhanced Error Handling) on IBM PPC64 platforms. This applies to PCI, PCI-X and PCI-Express devices. We have a PCI-PCI bridge per slot. The bridge (and the PCI fabric generally) look for errors such as address parity errors, out-of-bounds DMA accesses by the device, or anything that would normally cause SERR to be set. If such an error occurs, the bridge immediately isolates the device, meaning that writes by the CPU to the device are discarded, reads by the CPU are returned with all 1s data, and DMA accesses by the device are blocked. What happens at the driver level depends on whether the driver is EEH-aware or not. (This description is more what we would like to have rather than what is necessarily implemented at present). If the driver is not EEH-aware but is hot-plug capable, then the platform code will notice that reads from the device are returning all 1s and query firmware about the state of the slot. Firmware will indicate that the slot has been isolated. Platform code can obtain more specific information about the error from firmware and log it. Then, platform code will generate a hot-unplug event for the slot. After the driver has cleaned up and notified higher levels that its device has gone away, platform code will call firmware to reset and unisolate the slot, and then generate a hotplug event to tell the driver that it can use the device - but as far as the driver is concerned, it is a new device. If the driver is EEH-aware, then we use the API that Ben has proposed. Platform code can either reset the slot (by calling firmware) or not, depending on what the driver asks for, and also depending on any other information the platform code has available to it, such as specific information about the error that has occurred. Platform code then unisolates the slot and then informs the driver that it can reinitialize the device and restart any transfers that were in progress. 
Ben's API is aimed at supporting the code flows that we need for EEH as well as those needed for recovery from errors on PCI Express. Part of the reason for not just requiring the driver to do everything itself is that a slot isolation event can affect multiple drivers, because the card in the slot could have a PCI-PCI bridge with multiple devices behind it. Thus the recovery process potentially requires a degree of coordination between multiple drivers, and Ben's API addresses that. The same coordination could be required on PCI Express, if I understand correctly, because a fault on an upstream link could affect many devices downstream of that link. Paul. From maneesh at in.ibm.com Thu Mar 17 15:08:43 2005 From: maneesh at in.ibm.com (Maneesh Soni) Date: Thu, 17 Mar 2005 09:38:43 +0530 Subject: Fw: Re: [RFC/PATCH] Updated: ppc64: Add mem=X option Message-ID: <20050317040843.GA14462@in.ibm.com> Just re-sending, don't know why it didn't appear in the list for almost a day ----- Forwarded message from Maneesh Soni ----- Date: Wed, 16 Mar 2005 11:18:26 +0530 From: Maneesh Soni To: Michael Ellerman Cc: linuxppc64-dev at ozlabs.org Subject: Re: [RFC/PATCH] Updated: ppc64: Add mem=X option Reply-To: maneesh at in.ibm.com On Fri, Feb 25, 2005 at 07:14:08PM +1100, Michael Ellerman wrote: [..] > +unsigned long prom_memparse(const char *ptr, const char **retptr) > +{ > + unsigned long ret = prom_strtoul(ptr, retptr); > + > + switch (**retptr) { > + case 'G': > + case 'g': > + ret <<= 10; > + case 'M': > + case 'm': > + ret <<= 10; > + case 'K': > + case 'k': > + ret <<= 10; > + (*retptr)++; > + default: > + break; > + } > + return ret; > +} > I get following exception with the above switch statement in place. ======================================= Welcome to yaboot version 1.3.11.SuSE Enter "help" to get some basic usage information boot: t mem=512M Please wait, loading kernel... Elf64 kernel loaded... Loading ramdisk... 
ramdisk loaded at 04200000, size: 2616 Kbytes
OF stdout device is: /pci@400000000110/isa@3/serial@i3f8
klimit=0xc0000000006b0000 offset=0xbffffffffc600000
initrd_start=0x0000000004200000 initrd_end=0x000000000448e000
command line: root=/dev/sdd3 selinux=0 elevator=cfq splash=silent desktop profile=0 mem=512M
DEFAULT CATCH!, handler-entered=fff00700
Open Firmware exception handler entered from non-OF code

Client's Fix Pt Regs:
 00 0000000000000018 000000000291f7c0 0000000003fe15f8 0000000000000200
 04 000000000291fb58 0000000000000200 000000000000000a 000000000000001d
 08 000000000000004d c00000000002c050 0000000003e4faa5 c00000000002c050
 0c 2000000000000000 0000000000000000 0000000000000000 0000000000000000
 10 0000000000000000 0000000000000000 0000000003a00000 0000000003e4f798
 14 0000000000230000 000000000028e000 bffffffffc600000 0000000004200000
 18 0000000003e4fa50 000000000291f8f0 000000000000000d 000000000000000d
 1c 000000000000000c 0000000003dca588 000000000291fb58 000000000291f7c0
Special Regs:
 %IV: 00000700 %CR: 84004048 %XER: 20000000 %DSISR: 00000000
 %SRR0: c00000000002c050 %SRR1: 9000000000083000
 %LR: 0000000003a2c024 %CTR: c00000000002c050
 %DAR: 0000000000000000
PID = 18

ok
0 >
===========================================

It works fine if I replace the "switch" statement with the following "if" block:

	:
	:
	if ((**retptr == 'G') || (**retptr == 'g')) {
		ret <<= 10;
		ret <<= 10;
		ret <<= 10;
	}
	if ((**retptr == 'M') || (**retptr == 'm')) {
		ret <<= 10;
		ret <<= 10;
	}
	if ((**retptr == 'K') || (**retptr == 'k'))
		ret <<= 10;
	(*retptr)++;
	:
	:

I suspect this could be a compiler issue, but I am not sure exactly what it is. I am trying this on a SLES9 system with gcc version 3.3.3.

llm15:~ # gcc --version
gcc (GCC) 3.3.3 (SuSE Linux)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
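For reference, the suffix handling under discussion can also be written with a single shift count and no switch. The following is a minimal sketch, not the code from the patch: memparse_sketch is a hypothetical name, and the standard strtoul() stands in for prom_strtoul(), which is not available outside the prom code. Note that the "if" block as quoted above appears to advance *retptr unconditionally, whereas the original switch only advances it when a suffix is present; the sketch keeps the switch's behaviour.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in for prom_memparse(): parse a number with an
 * optional K/M/G suffix. Uses strtoul() as an assumption in place of
 * prom_strtoul().
 */
unsigned long memparse_sketch(const char *ptr, const char **retptr)
{
	char *end;
	unsigned long ret = strtoul(ptr, &end, 0);
	unsigned int shift = 0;

	if (*end == 'G' || *end == 'g')
		shift = 30;
	else if (*end == 'M' || *end == 'm')
		shift = 20;
	else if (*end == 'K' || *end == 'k')
		shift = 10;

	/* Only consume the suffix character when there actually is one. */
	if (shift) {
		ret <<= shift;
		end++;
	}

	*retptr = end;
	return ret;
}
```

With this shape, "512M" yields 512 << 20 with the return pointer at the end of the string, and a plain "100" leaves the pointer on whatever character follows the digits.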
Thanks Maneesh -- Maneesh Soni Linux Technology Center, IBM India Software Labs, Bangalore, India email: maneesh at in.ibm.com Phone: 91-80-25044990 ----- End forwarded message ----- From tom.l.nguyen at intel.com Thu Mar 17 11:00:09 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Wed, 16 Mar 2005 16:00:09 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery) Message-ID: On Wednesday, March 16, 2005 2:40 PM Benjamin Herrenschmidt wrote: > Now, if you feel that we should really pass the error token to the > callback, then maybe we should go with that, though it only have sense > in the initial callback, not the 3 other ones. I am glad you agree to update API, as below, to support PCI Express comprehensive error information. int (*error_handler)(struct pci_dev *dev, union error_src *); Please look at my other mail regarding the 3 other ones. Thanks, Long From michael at ellerman.id.au Thu Mar 17 16:31:27 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 17 Mar 2005 16:31:27 +1100 Subject: [RFC/PATCH] Updated: ppc64: Add mem=X option In-Reply-To: <16949.22863.622912.175918@cargo.ozlabs.ibm.com> References: <20050222192423.727023f7.michael@ellerman.id.au> <20050225191408.599c613d.michael@ellerman.id.au> <16949.22863.622912.175918@cargo.ozlabs.ibm.com> Message-ID: <200503171631.30135.michael@ellerman.id.au> On Mon, 14 Mar 2005 20:28, Paul Mackerras wrote: > Michael Ellerman writes: > > Here is an updated patch for adding support for the mem=X boot option. > > It gets rejects now that Mike Kravetz's NUMA patch has gone in. Care > to respin it? Hi Paulus, My patch works again now that Mike's has been reverted =D ! Although I notice he's posted an updated version just now. I've got to see what's going on with Maneesh's kernel (crashes in prom_memparse), and then I'll look at Mike's new patch. 
If Mike needs more time to get his NUMA changes sorted we could merge most of the mem=X code and leave the NUMA change until later (need to boot with numa=off), I'd like to test the NUMA change a bit more anyway. I'm not getting much work done being sick, so it might take me a few days. cheers! From benh at kernel.crashing.org Thu Mar 17 16:30:29 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 17 Mar 2005 16:30:29 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCI ErrorRecovery) In-Reply-To: References: Message-ID: <1111037429.7192.202.camel@gaston> On Wed, 2005-03-16 at 16:00 -0800, Nguyen, Tom L wrote: > On Wednesday, March 16, 2005 2:40 PM Benjamin Herrenschmidt wrote: > > Now, if you feel that we should really pass the error token to the > > callback, then maybe we should go with that, though it only have sense > > in the initial callback, not the 3 other ones. > > I am glad you agree to update API, as below, to support PCI Express > comprehensive error information. > > int (*error_handler)(struct pci_dev *dev, union error_src *); > > Please look at my other mail regarding the 3 other ones. Sure, the API is not burned in stone yet, that's why we are discussing it still :) Note that I would much prefer the second parameter to be an opaque token, with separate accessor functions to get to the content. Ben.
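The opaque-token idea mentioned above could look roughly like the following C sketch. None of these names exist in the proposal or in the kernel: pci_error_token, its fields and the two accessors are invented here purely to show the pattern, in which a driver receives only a pointer and reads the contents through functions, so the platform can change the layout later.

```c
#include <stddef.h>

struct pci_dev;          /* stand-in declaration; never dereferenced here */
struct pci_error_token;  /* hypothetical: drivers would see only this */

/* Hypothetical accessors a driver would call instead of touching fields. */
int pci_error_token_severity(const struct pci_error_token *tok);
unsigned int pci_error_token_source(const struct pci_error_token *tok);

/*
 * "Platform side" of the sketch: the layout is private to this part
 * and is an assumption made for illustration only.
 */
struct pci_error_token {
	int severity;            /* assumed: 0 = informational, >0 = needs recovery */
	unsigned int source_id;  /* assumed: identifies the reporting agent */
};

int pci_error_token_severity(const struct pci_error_token *tok)
{
	return tok ? tok->severity : -1;
}

unsigned int pci_error_token_source(const struct pci_error_token *tok)
{
	return tok ? tok->source_id : 0;
}

/*
 * Example driver callback using the token only through accessors.
 * Returns 1 if the driver would ask for recovery, 0 otherwise.
 */
int example_error_handler(struct pci_dev *dev,
			  const struct pci_error_token *tok)
{
	(void)dev;
	return pci_error_token_severity(tok) > 0;
}
```

In a real split the struct definition would live in platform code and drivers would only ever see the forward declaration and the accessor prototypes.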
From anton at samba.org Thu Mar 17 16:52:17 2005 From: anton at samba.org (Anton Blanchard) Date: Thu, 17 Mar 2005 16:52:17 +1100 Subject: Fw: Re: [RFC/PATCH] Updated: ppc64: Add mem=X option In-Reply-To: <20050317040843.GA14462@in.ibm.com> References: <20050317040843.GA14462@in.ibm.com> Message-ID: <20050317055217.GC10162@ozlabs.org> Hi, > +unsigned long prom_memparse(const char *ptr, const char **retptr) > +{ > + unsigned long ret = prom_strtoul(ptr, retptr); > + > + switch (**retptr) { > + case 'G': > + case 'g': > + ret <<= 10; > + case 'M': > + case 'm': > + ret <<= 10; > + case 'K': > + case 'k': > + ret <<= 10; > + (*retptr)++; > + default: > + break; > + } > + return ret; > +} > > I get following exception with the above switch statement in place. That makes me think gcc is using a particular switch statement optimisation. You create a table indexed by the switch values (or part of them). This table contains target addresses for the particular switch case. You throw it in the count register then do a bctr. Come to think of it, I'm not sure how this optimisation can be safe before we copy the kernel down. Anton From benh at kernel.crashing.org Thu Mar 17 17:00:30 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 17 Mar 2005 17:00:30 +1100 Subject: [PATCH] Thermal control for Xserve Message-ID: <1111039230.7144.205.camel@gaston> Hi! (This patch is for -mm only for now, until I have had a bit more testing on desktop G5s). This patch adds support for Xserve G5 to the thermal control driver. It also adds a few updates to the desktop G5 code based on changes Apple made to their own drivers.
Signed-off-by: Benjamin Herrenschmidt Index: linux-work/drivers/macintosh/therm_pm72.c =================================================================== --- linux-work.orig/drivers/macintosh/therm_pm72.c 2005-01-31 14:18:21.000000000 +1100 +++ linux-work/drivers/macintosh/therm_pm72.c 2005-03-10 14:12:44.000000000 +1100 @@ -47,8 +47,11 @@ * decisions, like slewing down CPUs * - Deal with fan and i2c failures in a better way * - Maybe do a generic PID based on params used for - * U3 and Drives ? - * - Add RackMac3,1 support (XServe g5) + * U3 and Drives ? Definitely need to factor code a bit + * bettter... also make sensor detection more robust using + * the device-tree to probe for them + * - Figure out how to get the slots consumption and set the + * slots fan accordingly * * History: * @@ -85,6 +88,13 @@ * - Add new CPU cooling algorithm for machines with liquid cooling * - Workaround for some PowerMac7,3 with empty "fan" node in the devtree * - Fix a signed/unsigned compare issue in some PID loops + * + * Mar. 
10, 2005 : 1.2 + * - Add basic support for Xserve G5 + * - Retreive pumps min/max from EEPROM image in device-tree (broken) + * - Use min/max macros here or there + * - Latest darwin updated U3H min fan speed to 20% PWM + * */ #include @@ -113,7 +123,7 @@ #include "therm_pm72.h" -#define VERSION "1.1" +#define VERSION "1.2b2" #undef DEBUG @@ -131,21 +141,26 @@ static struct of_device * of_dev; static struct i2c_adapter * u3_0; static struct i2c_adapter * u3_1; +static struct i2c_adapter * k2; static struct i2c_client * fcu; static struct cpu_pid_state cpu_state[2]; static struct basckside_pid_params backside_params; static struct backside_pid_state backside_state; static struct drives_pid_state drives_state; +static struct dimm_pid_state dimms_state; static int state; static int cpu_count; static int cpu_pid_type; static pid_t ctrl_task; static struct completion ctrl_complete; static int critical_state; +static int rackmac; +static s32 dimm_output_clamp; + static DECLARE_MUTEX(driver_lock); /* - * We have 2 types of CPU PID control. One is "split" old style control + * We have 3 types of CPU PID control. One is "split" old style control * for intake & exhaust fans, the other is "combined" control for both * CPUs that also deals with the pumps when present. To be "compatible" * with OS X at this point, we only use "COMBINED" on the machines that @@ -155,6 +170,7 @@ */ #define CPU_PID_TYPE_SPLIT 0 #define CPU_PID_TYPE_COMBINED 1 +#define CPU_PID_TYPE_RACKMAC 2 /* * This table describes all fans in the FCU. 
The "id" and "type" values @@ -177,7 +193,7 @@ struct fcu_fan_table fcu_fans[] = { [BACKSIDE_FAN_PWM_INDEX] = { - .loc = "BACKSIDE", + .loc = "BACKSIDE,SYS CTRLR FAN", .type = FCU_FAN_PWM, .id = BACKSIDE_FAN_PWM_DEFAULT_ID, }, @@ -187,7 +203,7 @@ .id = DRIVES_FAN_RPM_DEFAULT_ID, }, [SLOTS_FAN_PWM_INDEX] = { - .loc = "SLOT", + .loc = "SLOT,PCI FAN", .type = FCU_FAN_PWM, .id = SLOTS_FAN_PWM_DEFAULT_ID, }, @@ -224,6 +240,37 @@ .type = FCU_FAN_RPM, .id = FCU_FAN_ABSENT_ID, }, + /* Xserve fans */ + [CPU_A1_FAN_RPM_INDEX] = { + .loc = "CPU A 1", + .type = FCU_FAN_RPM, + .id = FCU_FAN_ABSENT_ID, + }, + [CPU_A2_FAN_RPM_INDEX] = { + .loc = "CPU A 2", + .type = FCU_FAN_RPM, + .id = FCU_FAN_ABSENT_ID, + }, + [CPU_A3_FAN_RPM_INDEX] = { + .loc = "CPU A 3", + .type = FCU_FAN_RPM, + .id = FCU_FAN_ABSENT_ID, + }, + [CPU_B1_FAN_RPM_INDEX] = { + .loc = "CPU B 1", + .type = FCU_FAN_RPM, + .id = FCU_FAN_ABSENT_ID, + }, + [CPU_B2_FAN_RPM_INDEX] = { + .loc = "CPU B 2", + .type = FCU_FAN_RPM, + .id = FCU_FAN_ABSENT_ID, + }, + [CPU_B3_FAN_RPM_INDEX] = { + .loc = "CPU B 3", + .type = FCU_FAN_RPM, + .id = FCU_FAN_ABSENT_ID, + }, }; /* @@ -251,7 +298,9 @@ struct i2c_client *clt; struct i2c_adapter *adap; - if (id & 0x100) + if (id & 0x200) + adap = k2; + else if (id & 0x100) adap = u3_1; else adap = u3_0; @@ -361,6 +410,31 @@ } } +static int read_lm87_reg(struct i2c_client * chip, int reg) +{ + int rc, tries = 0; + u8 buf; + + for (;;) { + /* Set address */ + buf = (u8)reg; + rc = i2c_master_send(chip, &buf, 1); + if (rc <= 0) + goto error; + rc = i2c_master_recv(chip, &buf, 1); + if (rc <= 0) + goto error; + return (int)buf; + error: + DBG("Error reading LM87, retrying...\n"); + if (++tries > 10) { + printk(KERN_ERR "therm_pm72: Error reading LM87 !\n"); + return -1; + } + msleep(10); + } +} + static int fan_read_reg(int reg, unsigned char *buf, int nb) { int tries, nr, nw; @@ -570,6 +644,38 @@ return 0; } +static void fetch_cpu_pumps_minmax(void) +{ + struct cpu_pid_state *state0 = 
&cpu_state[0]; + struct cpu_pid_state *state1 = &cpu_state[1]; + u16 pump_min = 0, pump_max = 0xffff; + u16 tmp[4]; + + /* Try to fetch pumps min/max infos from eeprom */ + + memcpy(&tmp, &state0->mpu.processor_part_num, 8); + if (tmp[0] != 0xffff && tmp[1] != 0xffff) { + pump_min = max(pump_min, tmp[0]); + pump_max = min(pump_max, tmp[1]); + } + if (tmp[2] != 0xffff && tmp[3] != 0xffff) { + pump_min = max(pump_min, tmp[2]); + pump_max = min(pump_max, tmp[3]); + } + + /* Double check the values, this _IS_ needed as the EEPROM on + * some dual 2.5Ghz G5s seem, at least, to have both min & max + * same to the same value ... (grrrr) + */ + if (pump_min == pump_max || pump_min == 0 || pump_max == 0xffff) { + pump_min = CPU_PUMP_OUTPUT_MIN; + pump_max = CPU_PUMP_OUTPUT_MAX; + } + + state0->pump_min = state1->pump_min = pump_min; + state0->pump_max = state1->pump_max = pump_max; +} + /* * Now, unfortunately, sysfs doesn't give us a nice void * we could * pass around to the attribute functions, so we don't really have @@ -611,6 +717,8 @@ BUILD_SHOW_FUNC_FIX(drives_temperature, drives_state.last_temp) BUILD_SHOW_FUNC_INT(drives_fan_rpm, drives_state.rpm) +BUILD_SHOW_FUNC_FIX(dimms_temperature, dimms_state.last_temp) + static DEVICE_ATTR(cpu0_temperature,S_IRUGO,show_cpu0_temperature,NULL); static DEVICE_ATTR(cpu0_voltage,S_IRUGO,show_cpu0_voltage,NULL); static DEVICE_ATTR(cpu0_current,S_IRUGO,show_cpu0_current,NULL); @@ -629,6 +737,8 @@ static DEVICE_ATTR(drives_temperature,S_IRUGO,show_drives_temperature,NULL); static DEVICE_ATTR(drives_fan_rpm,S_IRUGO,show_drives_fan_rpm,NULL); +static DEVICE_ATTR(dimms_temperature,S_IRUGO,show_dimms_temperature,NULL); + /* * CPUs fans control loop */ @@ -636,17 +746,21 @@ static int do_read_one_cpu_values(struct cpu_pid_state *state, s32 *temp, s32 *power) { s32 ltemp, volts, amps; - int rc = 0; + int index, rc = 0; /* Default (in case of error) */ *temp = state->cur_temp; *power = state->cur_power; - /* Read current fan status */ - if 
(state->index == 0) - rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED); + if (cpu_pid_type == CPU_PID_TYPE_RACKMAC) + index = (state->index == 0) ? + CPU_A1_FAN_RPM_INDEX : CPU_B1_FAN_RPM_INDEX; else - rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED); + index = (state->index == 0) ? + CPUA_EXHAUST_FAN_RPM_INDEX : CPUB_EXHAUST_FAN_RPM_INDEX; + + /* Read current fan status */ + rc = get_rpm_fan(index, !RPM_PID_USE_ACTUAL_SPEED); if (rc < 0) { /* XXX What do we do now ? Nothing for now, keep old value, but * return error upstream @@ -777,11 +891,6 @@ DBG(" sum: %d\n", (int)sum); state->rpm += (s32)sum; - - if (state->rpm < (int)state->mpu.rminn_exhaust_fan) - state->rpm = state->mpu.rminn_exhaust_fan; - if (state->rpm > (int)state->mpu.rmaxn_exhaust_fan) - state->rpm = state->mpu.rmaxn_exhaust_fan; } static void do_monitor_cpu_combined(void) @@ -823,28 +932,28 @@ if (state0->overtemp > 0) { state0->rpm = state0->mpu.rmaxn_exhaust_fan; state0->intake_rpm = intake = state0->mpu.rmaxn_intake_fan; - pump = CPU_PUMP_OUTPUT_MAX; + pump = state0->pump_min; goto do_set_fans; } /* Do the PID */ do_cpu_pid(state0, temp_combi, power_combi); + /* Range check */ + state0->rpm = max(state0->rpm, (int)state0->mpu.rminn_exhaust_fan); + state0->rpm = min(state0->rpm, (int)state0->mpu.rmaxn_exhaust_fan); + /* Calculate intake fan speed */ intake = (state0->rpm * CPU_INTAKE_SCALE) >> 16; - if (intake < (int)state0->mpu.rminn_intake_fan) - intake = state0->mpu.rminn_intake_fan; - if (intake > (int)state0->mpu.rmaxn_intake_fan) - intake = state0->mpu.rmaxn_intake_fan; + intake = max(intake, (int)state0->mpu.rminn_intake_fan); + intake = min(intake, (int)state0->mpu.rmaxn_intake_fan); state0->intake_rpm = intake; /* Calculate pump speed */ - pump = (state0->rpm * CPU_PUMP_OUTPUT_MAX) / + pump = (state0->rpm * state0->pump_max) / state0->mpu.rmaxn_exhaust_fan; - if (pump > CPU_PUMP_OUTPUT_MAX) - pump = CPU_PUMP_OUTPUT_MAX; - if (pump < 
CPU_PUMP_OUTPUT_MIN) - pump = CPU_PUMP_OUTPUT_MIN; + pump = min(pump, state0->pump_max); + pump = max(pump, state0->pump_min); do_set_fans: /* We copy values from state 0 to state 1 for /sysfs */ @@ -904,11 +1013,14 @@ /* Do the PID */ do_cpu_pid(state, temp, power); + /* Range check */ + state->rpm = max(state->rpm, (int)state->mpu.rminn_exhaust_fan); + state->rpm = min(state->rpm, (int)state->mpu.rmaxn_exhaust_fan); + + /* Calculate intake fan */ intake = (state->rpm * CPU_INTAKE_SCALE) >> 16; - if (intake < (int)state->mpu.rminn_intake_fan) - intake = state->mpu.rminn_intake_fan; - if (intake > (int)state->mpu.rmaxn_intake_fan) - intake = state->mpu.rmaxn_intake_fan; + intake = max(intake, (int)state->mpu.rminn_intake_fan); + intake = min(intake, (int)state->mpu.rmaxn_intake_fan); state->intake_rpm = intake; do_set_fans: @@ -929,6 +1041,67 @@ } } +static void do_monitor_cpu_rack(struct cpu_pid_state *state) +{ + s32 temp, power, fan_min; + int rc; + + /* Read current fan status */ + rc = do_read_one_cpu_values(state, &temp, &power); + if (rc < 0) { + /* XXX What do we do now ? */ + } + + /* Check tmax, increment overtemp if we are there. At tmax+8, we go + * full blown immediately and try to trigger a shutdown + */ + if (temp >= ((state->mpu.tmax + 8) << 16)) { + printk(KERN_WARNING "Warning ! 
CPU %d temperature way above maximum" + " (%d) !\n", + state->index, temp >> 16); + state->overtemp = CPU_MAX_OVERTEMP; + } else if (temp > (state->mpu.tmax << 16)) + state->overtemp++; + else + state->overtemp = 0; + if (state->overtemp >= CPU_MAX_OVERTEMP) + critical_state = 1; + if (state->overtemp > 0) { + state->rpm = state->intake_rpm = state->mpu.rmaxn_intake_fan; + goto do_set_fans; + } + + /* Do the PID */ + do_cpu_pid(state, temp, power); + + /* Check clamp from dimms */ + fan_min = dimm_output_clamp; + fan_min = max(fan_min, (int)state->mpu.rminn_intake_fan); + + state->rpm = max(state->rpm, (int)fan_min); + state->rpm = min(state->rpm, (int)state->mpu.rmaxn_intake_fan); + state->intake_rpm = state->rpm; + + do_set_fans: + DBG("** CPU %d RPM: %d overtemp: %d\n", + state->index, (int)state->rpm, state->overtemp); + + /* We should check for errors, shouldn't we ? But then, what + * do we do once the error occurs ? For FCU notified fan + * failures (-EFAULT) we probably want to notify userland + * some way... + */ + if (state->index == 0) { + set_rpm_fan(CPU_A1_FAN_RPM_INDEX, state->rpm); + set_rpm_fan(CPU_A2_FAN_RPM_INDEX, state->rpm); + set_rpm_fan(CPU_A3_FAN_RPM_INDEX, state->rpm); + } else { + set_rpm_fan(CPU_B1_FAN_RPM_INDEX, state->rpm); + set_rpm_fan(CPU_B2_FAN_RPM_INDEX, state->rpm); + set_rpm_fan(CPU_B3_FAN_RPM_INDEX, state->rpm); + } +} + /* * Initialize the state structure for one CPU control loop */ @@ -936,7 +1109,7 @@ { state->index = index; state->first = 1; - state->rpm = 1000; + state->rpm = (cpu_pid_type == CPU_PID_TYPE_RACKMAC) ? 
4000 : 1000; state->overtemp = 0; state->adc_config = 0x00; @@ -1012,13 +1185,13 @@ */ static void do_monitor_backside(struct backside_pid_state *state) { - s32 temp, integral, derivative; + s32 temp, integral, derivative, fan_min; s64 integ_p, deriv_p, prop_p, sum; int i, rc; if (--state->ticks != 0) return; - state->ticks = BACKSIDE_PID_INTERVAL; + state->ticks = backside_params.interval; DBG("backside:\n"); @@ -1059,7 +1232,7 @@ integral = 0; for (i = 0; i < BACKSIDE_PID_HISTORY_SIZE; i++) integral += state->error_history[i]; - integral *= BACKSIDE_PID_INTERVAL; + integral *= backside_params.interval; DBG(" integral: %08x\n", integral); integ_p = ((s64)backside_params.G_r) * (s64)integral; DBG(" integ_p: %d\n", (int)(integ_p >> 36)); @@ -1069,7 +1242,7 @@ derivative = state->error_history[state->cur_sample] - state->error_history[(state->cur_sample + BACKSIDE_PID_HISTORY_SIZE - 1) % BACKSIDE_PID_HISTORY_SIZE]; - derivative /= BACKSIDE_PID_INTERVAL; + derivative /= backside_params.interval; deriv_p = ((s64)backside_params.G_d) * (s64)derivative; DBG(" deriv_p: %d\n", (int)(deriv_p >> 36)); sum += deriv_p; @@ -1083,11 +1256,17 @@ sum >>= 36; DBG(" sum: %d\n", (int)sum); - state->pwm += (s32)sum; - if (state->pwm < backside_params.output_min) - state->pwm = backside_params.output_min; - if (state->pwm > backside_params.output_max) - state->pwm = backside_params.output_max; + if (backside_params.additive) + state->pwm += (s32)sum; + else + state->pwm = sum; + + /* Check for clamp */ + fan_min = (dimm_output_clamp * 100) / 14000; + fan_min = max(fan_min, backside_params.output_min); + + state->pwm = max(state->pwm, fan_min); + state->pwm = min(state->pwm, backside_params.output_max); DBG("** BACKSIDE PWM: %d\n", (int)state->pwm); set_pwm_fan(BACKSIDE_FAN_PWM_INDEX, state->pwm); @@ -1114,17 +1293,33 @@ of_node_put(u3); } - backside_params.G_p = BACKSIDE_PID_G_p; - backside_params.G_r = BACKSIDE_PID_G_r; - backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX; - if 
(u3h) { + if (rackmac) { + backside_params.G_d = BACKSIDE_PID_RACK_G_d; + backside_params.input_target = BACKSIDE_PID_RACK_INPUT_TARGET; + backside_params.output_min = BACKSIDE_PID_U3H_OUTPUT_MIN; + backside_params.interval = BACKSIDE_PID_RACK_INTERVAL; + backside_params.G_p = BACKSIDE_PID_RACK_G_p; + backside_params.G_r = BACKSIDE_PID_G_r; + backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX; + backside_params.additive = 0; + } else if (u3h) { backside_params.G_d = BACKSIDE_PID_U3H_G_d; backside_params.input_target = BACKSIDE_PID_U3H_INPUT_TARGET; backside_params.output_min = BACKSIDE_PID_U3H_OUTPUT_MIN; + backside_params.interval = BACKSIDE_PID_INTERVAL; + backside_params.G_p = BACKSIDE_PID_G_p; + backside_params.G_r = BACKSIDE_PID_G_r; + backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX; + backside_params.additive = 1; } else { backside_params.G_d = BACKSIDE_PID_U3_G_d; backside_params.input_target = BACKSIDE_PID_U3_INPUT_TARGET; backside_params.output_min = BACKSIDE_PID_U3_OUTPUT_MIN; + backside_params.interval = BACKSIDE_PID_INTERVAL; + backside_params.G_p = BACKSIDE_PID_G_p; + backside_params.G_r = BACKSIDE_PID_G_r; + backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX; + backside_params.additive = 1; } state->ticks = 1; @@ -1233,10 +1428,9 @@ DBG(" sum: %d\n", (int)sum); state->rpm += (s32)sum; - if (state->rpm < DRIVES_PID_OUTPUT_MIN) - state->rpm = DRIVES_PID_OUTPUT_MIN; - if (state->rpm > DRIVES_PID_OUTPUT_MAX) - state->rpm = DRIVES_PID_OUTPUT_MAX; + + state->rpm = max(state->rpm, DRIVES_PID_OUTPUT_MIN); + state->rpm = min(state->rpm, DRIVES_PID_OUTPUT_MAX); DBG("** DRIVES RPM: %d\n", (int)state->rpm); set_rpm_fan(DRIVES_FAN_RPM_INDEX, state->rpm); @@ -1276,6 +1470,126 @@ state->monitor = NULL; } +/* + * DIMMs temp control loop + */ +static void do_monitor_dimms(struct dimm_pid_state *state) +{ + s32 temp, integral, derivative, fan_min; + s64 integ_p, deriv_p, prop_p, sum; + int i; + + if (--state->ticks != 0) + return; + state->ticks = 
DIMM_PID_INTERVAL; + + DBG("DIMM:\n"); + + DBG(" current value: %d\n", state->output); + + temp = read_lm87_reg(state->monitor, LM87_INT_TEMP); + if (temp < 0) + return; + temp <<= 16; + state->last_temp = temp; + DBG(" temp: %d.%03d, target: %d.%03d\n", FIX32TOPRINT(temp), + FIX32TOPRINT(DIMM_PID_INPUT_TARGET)); + + /* Store temperature and error in history array */ + state->cur_sample = (state->cur_sample + 1) % DIMM_PID_HISTORY_SIZE; + state->sample_history[state->cur_sample] = temp; + state->error_history[state->cur_sample] = temp - DIMM_PID_INPUT_TARGET; + + /* If first loop, fill the history table */ + if (state->first) { + for (i = 0; i < (DIMM_PID_HISTORY_SIZE - 1); i++) { + state->cur_sample = (state->cur_sample + 1) % + DIMM_PID_HISTORY_SIZE; + state->sample_history[state->cur_sample] = temp; + state->error_history[state->cur_sample] = + temp - DIMM_PID_INPUT_TARGET; + } + state->first = 0; + } + + /* Calculate the integral term */ + sum = 0; + integral = 0; + for (i = 0; i < DIMM_PID_HISTORY_SIZE; i++) + integral += state->error_history[i]; + integral *= DIMM_PID_INTERVAL; + DBG(" integral: %08x\n", integral); + integ_p = ((s64)DIMM_PID_G_r) * (s64)integral; + DBG(" integ_p: %d\n", (int)(integ_p >> 36)); + sum += integ_p; + + /* Calculate the derivative term */ + derivative = state->error_history[state->cur_sample] - + state->error_history[(state->cur_sample + DIMM_PID_HISTORY_SIZE - 1) + % DIMM_PID_HISTORY_SIZE]; + derivative /= DIMM_PID_INTERVAL; + deriv_p = ((s64)DIMM_PID_G_d) * (s64)derivative; + DBG(" deriv_p: %d\n", (int)(deriv_p >> 36)); + sum += deriv_p; + + /* Calculate the proportional term */ + prop_p = ((s64)DIMM_PID_G_p) * (s64)(state->error_history[state->cur_sample]); + DBG(" prop_p: %d\n", (int)(prop_p >> 36)); + sum += prop_p; + + /* Scale sum */ + sum >>= 36; + + DBG(" sum: %d\n", (int)sum); + state->output = (s32)sum; + state->output = max(state->output, DIMM_PID_OUTPUT_MIN); + state->output = min(state->output, DIMM_PID_OUTPUT_MAX); + 
dimm_output_clamp = state->output; + + DBG("** DIMM clamp value: %d\n", (int)state->output); + + /* Backside PID is only every 5 seconds, force backside fan clamping now */ + fan_min = (dimm_output_clamp * 100) / 14000; + fan_min = max(fan_min, backside_params.output_min); + if (backside_state.pwm < fan_min) { + backside_state.pwm = fan_min; + DBG(" -> applying clamp to backside fan now: %d !\n", fan_min); + set_pwm_fan(BACKSIDE_FAN_PWM_INDEX, fan_min); + } +} + +/* + * Initialize the state structure for the DIMM temp control loop + */ +static int init_dimms_state(struct dimm_pid_state *state) +{ + state->ticks = 1; + state->first = 1; + state->output = 4000; + + state->monitor = attach_i2c_chip(XSERVE_DIMMS_LM87, "dimms_temp"); + if (state->monitor == NULL) + return -ENODEV; + + device_create_file(&of_dev->dev, &dev_attr_dimms_temperature); + + return 0; +} + +/* + * Dispose of the state data for the drives control loop + */ +static void dispose_dimms_state(struct dimm_pid_state *state) +{ + if (state->monitor == NULL) + return; + + device_remove_file(&of_dev->dev, &dev_attr_dimms_temperature); + + detach_i2c_chip(state->monitor); + state->monitor = NULL; +} + static int call_critical_overtemp(void) { char *argv[] = { critical_overtemp_path, NULL }; @@ -1321,15 +1635,29 @@ start = jiffies; down(&driver_lock); + + /* First, we always calculate the new DIMMs state on an Xserve */ + if (rackmac) + do_monitor_dimms(&dimms_state); + + /* Then, the CPUs */ if (cpu_pid_type == CPU_PID_TYPE_COMBINED) do_monitor_cpu_combined(); - else { + else if (cpu_pid_type == CPU_PID_TYPE_RACKMAC) { + do_monitor_cpu_rack(&cpu_state[0]); + if (cpu_state[1].monitor != NULL) + do_monitor_cpu_rack(&cpu_state[1]); + // better deal with UP + } else { do_monitor_cpu_split(&cpu_state[0]); if (cpu_state[1].monitor != NULL) do_monitor_cpu_split(&cpu_state[1]); + // better deal with UP } + /* Then, the rest */ do_monitor_backside(&backside_state); - do_monitor_drives(&drives_state); + if 
(!rackmac) + do_monitor_drives(&drives_state); up(&driver_lock); if (critical_state == 1) { @@ -1369,9 +1697,9 @@ { dispose_cpu_state(&cpu_state[0]); dispose_cpu_state(&cpu_state[1]); - dispose_backside_state(&backside_state); dispose_drives_state(&drives_state); + dispose_dimms_state(&dimms_state); } /* @@ -1395,7 +1723,9 @@ * the pumps, though that may not be the best way, that is good enough * for now */ - if (machine_is_compatible("PowerMac7,3") + if (rackmac) + cpu_pid_type = CPU_PID_TYPE_RACKMAC; + else if (machine_is_compatible("PowerMac7,3") && (cpu_count > 1) && fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID && fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID) { @@ -1409,11 +1739,16 @@ */ if (init_cpu_state(&cpu_state[0], 0)) goto fail; + if (cpu_pid_type == CPU_PID_TYPE_COMBINED) + fetch_cpu_pumps_minmax(); + if (cpu_count > 1 && init_cpu_state(&cpu_state[1], 1)) goto fail; if (init_backside_state(&backside_state)) goto fail; - if (init_drives_state(&drives_state)) + if (rackmac && init_dimms_state(&dimms_state)) + goto fail; + if (!rackmac && init_drives_state(&drives_state)) goto fail; DBG("all control loops up !\n"); @@ -1492,17 +1827,24 @@ /* Check if we are looking for one of these */ if (u3_0 == NULL && !strcmp(adapter->name, "u3 0")) { u3_0 = adapter; - DBG("found U3-0, creating control loops\n"); - if (create_control_loops()) - u3_0 = NULL; + DBG("found U3-0\n"); + if (k2 || !rackmac) + if (create_control_loops()) + u3_0 = NULL; } else if (u3_1 == NULL && !strcmp(adapter->name, "u3 1")) { u3_1 = adapter; DBG("found U3-1, attaching FCU\n"); if (attach_fcu()) u3_1 = NULL; + } else if (k2 == NULL && !strcmp(adapter->name, "mac-io 0")) { + k2 = adapter; + DBG("Found K2\n"); + if (u3_0 && rackmac) + if (create_control_loops()) + k2 = NULL; } /* We got all we need, start control loops */ - if (u3_0 != NULL && u3_1 != NULL) { + if (u3_0 != NULL && u3_1 != NULL && (k2 || !rackmac)) { DBG("everything up, starting control loops\n"); state = 
state_attached; start_control_loops(); @@ -1548,6 +1890,27 @@ return 0; } +static int fan_check_loc_match(const char *loc, int fan) +{ + char tmp[64]; + char *c, *e; + + strlcpy(tmp, fcu_fans[fan].loc, 64); + + c = tmp; + for (;;) { + e = strchr(c, ','); + if (e) + *e = 0; + if (strcmp(loc, c) == 0) + return 1; + if (e == NULL) + break; + c = e + 1; + } + return 0; +} + static void fcu_lookup_fans(struct device_node *fcu_node) { struct device_node *np = NULL; @@ -1589,7 +1952,7 @@ for (i = 0; i < FCU_FAN_COUNT; i++) { int fan_id; - if (strcmp(loc, fcu_fans[i].loc)) + if (!fan_check_loc_match(loc, i)) continue; DBG(" location match, index: %d\n", i); fcu_fans[i].id = FCU_FAN_ABSENT_ID; @@ -1671,8 +2034,11 @@ { struct device_node *np; + rackmac = machine_is_compatible("RackMac3,1"); + if (!machine_is_compatible("PowerMac7,2") && - !machine_is_compatible("PowerMac7,3")) + !machine_is_compatible("PowerMac7,3") && + !rackmac) return -ENODEV; printk(KERN_INFO "PowerMac G5 Thermal control driver %s\n", VERSION); @@ -1709,6 +2075,6 @@ module_exit(therm_pm72_exit); MODULE_AUTHOR("Benjamin Herrenschmidt "); -MODULE_DESCRIPTION("Driver for Apple's PowerMac7,2 G5 thermal control"); +MODULE_DESCRIPTION("Driver for Apple's PowerMac G5 thermal control"); MODULE_LICENSE("GPL"); Index: linux-work/drivers/macintosh/therm_pm72.h =================================================================== --- linux-work.orig/drivers/macintosh/therm_pm72.h 2005-01-31 14:18:21.000000000 +1100 +++ linux-work/drivers/macintosh/therm_pm72.h 2005-03-10 13:57:02.000000000 +1100 @@ -52,7 +52,7 @@ u16 rmaxn_intake_fan; /* 0x4e - Intake fan max RPM */ u16 rminn_exhaust_fan; /* 0x50 - Exhaust fan min RPM */ u16 rmaxn_exhaust_fan; /* 0x52 - Exhaust fan max RPM */ - u8 processor_part_num[8]; /* 0x54 - Processor part number */ + u8 processor_part_num[8]; /* 0x54 - Processor part number XX pumps min/max */ u32 processor_lot_num; /* 0x5c - Processor lot number */ u8 orig_card_sernum[0x10]; /* 0x60 - Card 
original serial number */ u8 curr_card_sernum[0x10]; /* 0x70 - Card current serial number */ @@ -94,19 +94,25 @@ * of the driver, though I would accept any clean patch * doing a better use of the device-tree without turning the * while i2c registration mecanism into a racy mess + * + * Note: Xserve changed this. We have some bits on the K2 bus, + * which I arbitrarily set to 0x200. Ultimately, we really want + * too lookup these in the device-tree though */ #define FAN_CTRLER_ID 0x15e #define SUPPLY_MONITOR_ID 0x58 #define SUPPLY_MONITORB_ID 0x5a #define DRIVES_DALLAS_ID 0x94 #define BACKSIDE_MAX_ID 0x98 +#define XSERVE_DIMMS_LM87 0x25a /* - * Some MAX6690 & DS1775 register definitions + * Some MAX6690, DS1775, LM87 register definitions */ #define MAX6690_INT_TEMP 0 #define MAX6690_EXT_TEMP 1 #define DS1775_TEMP 0 +#define LM87_INT_TEMP 0x27 /* * Scaling factors for the AD7417 ADC converters (except @@ -126,14 +132,18 @@ #define BACKSIDE_FAN_PWM_INDEX 0 #define BACKSIDE_PID_U3_G_d 0x02800000 #define BACKSIDE_PID_U3H_G_d 0x01400000 +#define BACKSIDE_PID_RACK_G_d 0x00500000 #define BACKSIDE_PID_G_p 0x00500000 +#define BACKSIDE_PID_RACK_G_p 0x0004cccc #define BACKSIDE_PID_G_r 0x00000000 #define BACKSIDE_PID_U3_INPUT_TARGET 0x00410000 #define BACKSIDE_PID_U3H_INPUT_TARGET 0x004b0000 +#define BACKSIDE_PID_RACK_INPUT_TARGET 0x00460000 #define BACKSIDE_PID_INTERVAL 5 +#define BACKSIDE_PID_RACK_INTERVAL 1 #define BACKSIDE_PID_OUTPUT_MAX 100 #define BACKSIDE_PID_U3_OUTPUT_MIN 20 -#define BACKSIDE_PID_U3H_OUTPUT_MIN 30 +#define BACKSIDE_PID_U3H_OUTPUT_MIN 20 #define BACKSIDE_PID_HISTORY_SIZE 2 struct basckside_pid_params @@ -144,6 +154,8 @@ s32 input_target; s32 output_min; s32 output_max; + s32 interval; + int additive; }; struct backside_pid_state @@ -188,25 +200,34 @@ #define SLOTS_FAN_PWM_INDEX 2 #define SLOTS_FAN_DEFAULT_PWM 50 /* Do better here ! 
*/ + /* - * IDs in Darwin for the sensors & fans - * - * CPU A AD7417_TEMP 10 (CPU A ambient temperature) - * CPU A AD7417_AD1 11 (CPU A diode temperature) - * CPU A AD7417_AD2 12 (CPU A 12V current) - * CPU A AD7417_AD3 13 (CPU A voltage) - * CPU A AD7417_AD4 14 (CPU A current) - * - * CPU A FAKE POWER 48 (I_V_inputs: 13, 14) - * - * CPU B AD7417_TEMP 15 (CPU B ambient temperature) - * CPU B AD7417_AD1 16 (CPU B diode temperature) - * CPU B AD7417_AD2 17 (CPU B 12V current) - * CPU B AD7417_AD3 18 (CPU B voltage) - * CPU B AD7417_AD4 19 (CPU B current) - * - * CPU B FAKE POWER 49 (I_V_inputs: 18, 19) + * PID factors for the Xserve DIMM control loop */ +#define DIMM_PID_G_d 0 +#define DIMM_PID_G_p 0 +#define DIMM_PID_G_r 0x6553600 +#define DIMM_PID_INPUT_TARGET 3276800 +#define DIMM_PID_INTERVAL 1 +#define DIMM_PID_OUTPUT_MAX 14000 +#define DIMM_PID_OUTPUT_MIN 4000 +#define DIMM_PID_HISTORY_SIZE 20 + +struct dimm_pid_state +{ + int ticks; + struct i2c_client * monitor; + s32 sample_history[DIMM_PID_HISTORY_SIZE]; + s32 error_history[DIMM_PID_HISTORY_SIZE]; + int cur_sample; + s32 last_temp; + int first; + int output; +}; + + + +/* Desktops */ #define CPUA_INTAKE_FAN_RPM_DEFAULT_ID 3 #define CPUA_EXHAUST_FAN_RPM_DEFAULT_ID 4 @@ -226,8 +247,17 @@ #define CPUA_PUMP_RPM_INDEX 7 #define CPUB_PUMP_RPM_INDEX 8 -#define CPU_PUMP_OUTPUT_MAX 3700 -#define CPU_PUMP_OUTPUT_MIN 1000 +#define CPU_PUMP_OUTPUT_MAX 3200 +#define CPU_PUMP_OUTPUT_MIN 1250 + +/* Xserve */ +#define CPU_A1_FAN_RPM_INDEX 9 +#define CPU_A2_FAN_RPM_INDEX 10 +#define CPU_A3_FAN_RPM_INDEX 11 +#define CPU_B1_FAN_RPM_INDEX 12 +#define CPU_B2_FAN_RPM_INDEX 13 +#define CPU_B3_FAN_RPM_INDEX 14 + struct cpu_pid_state { @@ -249,6 +279,8 @@ s32 last_power; int first; u8 adc_config; + s32 pump_min; + s32 pump_max; }; /* From michael at ellerman.id.au Thu Mar 17 17:36:24 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Thu, 17 Mar 2005 17:36:24 +1100 Subject: Fw: Re: [RFC/PATCH] Updated: ppc64: Add mem=X 
option
In-Reply-To: <20050317055217.GC10162@ozlabs.org>
References: <20050317040843.GA14462@in.ibm.com>
	<20050317055217.GC10162@ozlabs.org>
Message-ID: <200503171736.28232.michael@ellerman.id.au>

On Thu, 17 Mar 2005 16:52, Anton Blanchard wrote:
> > +unsigned long prom_memparse(const char *ptr, const char **retptr)
> > +{
> > +	unsigned long ret = prom_strtoul(ptr, retptr);
> > +
> > +	switch (**retptr) {
> > +	case 'G':
> > +	case 'g':
> > +		ret <<= 10;
> > +	case 'M':
> > +	case 'm':
> > +		ret <<= 10;
> > +	case 'K':
> > +	case 'k':
> > +		ret <<= 10;
> > +		(*retptr)++;
> > +	default:
> > +		break;
> > +	}
> > +	return ret;
> > +}
> >
> > I get following exception with the above switch statement in place.
>
> That makes me think gcc is using a particular switch statement
> optimisation. You create a table indexed by the switch values (or part of
> them). This table contains target addresses for the particular switch
> case. You throw it in the count register then do a bctr.
>
> Come to think of it, Im not sure how this optimisation can be safe
> before we copy the kernel down.

Thanks for the bug report Maneesh, you're right about the compiler.

Anton's right, GCC 3.3 uses a switch table whereas 3.4 doesn't. The bad
news is I can't find any way to turn it off: there doesn't seem to be an
individual option for it, and not even -O0 disables it.

The only other place we use a switch statement in prom_init.c is
prom_printf(), but that doesn't generate a switch table (although one
day it might?).

I guess I'll just change my code to use a couple of if's.

cheers
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050317/e8c1ffa4/attachment.pgp 

From greg at kroah.com  Thu Mar 17 18:52:04 2005
From: greg at kroah.com (Greg KH)
Date: Wed, 16 Mar 2005 23:52:04 -0800
Subject: PCI Error Recovery API Proposal.
	(WAS:: [PATCH/RFC] PCI Error Recovery)
In-Reply-To: <1110864741.29124.68.camel@gaston>
References: <20050223002409.GA10909@austin.ibm.com>
	<20050223174356.GH13081@kroah.com>
	<20050224011409.GE2088@austin.ibm.com>
	<421DDEF7.7080103@jp.fujitsu.com>
	<20050224231455.GH2088@austin.ibm.com>
	<421E9D16.3000606@jp.fujitsu.com>
	<20050312013251.GA2609@austin.ibm.com>
	<4235847F.3080705@jp.fujitsu.com>
	<20050314181420.GD498@austin.ibm.com>
	<1110864741.29124.68.camel@gaston>
Message-ID: <20050317075204.GA15881@kroah.com>

On Tue, Mar 15, 2005 at 04:32:20PM +1100, Benjamin Herrenschmidt wrote:
> Ok, let's propose what i think is a proper API and simple enough on the
> driver side, if complexity there is, it's in the platform policy. That
> should cover all the needs we discussed so far:

Looks like a sane proposal to me.

thanks,

greg k-h

From olh at suse.de  Thu Mar 17 20:21:04 2005
From: olh at suse.de (Olaf Hering)
Date: Thu, 17 Mar 2005 10:21:04 +0100
Subject: [PATCH] missing newline/carrige return in ppc64 zImage
Message-ID: <20050317092104.GA25105@suse.de>

Some eyecandy for zImage and zImage.initrd.
Most OF implementations do not print a newline after their last line of
output, so the "zImage starting..." message appears right after the last
number or netboot output.
zImage.initrd is also missing a carriage return after one of its
messages, which causes a staircase effect in the output.

Tested on JS20.
Signed-off-by: Olaf Hering

Index: linux-2.6.11-olh/arch/ppc64/boot/main.c
===================================================================
--- linux-2.6.11-olh.orig/arch/ppc64/boot/main.c
+++ linux-2.6.11-olh/arch/ppc64/boot/main.c
@@ -110,7 +110,7 @@ void start(unsigned long a1, unsigned lo
 	if (getprop(chosen_handle, "stdin", &stdin, sizeof(stdin)) != 4)
 		exit();
 
-	printf("zImage starting: loaded at 0x%x\n\r", (unsigned)_start);
+	printf("\n\rzImage starting: loaded at 0x%x\n\r", (unsigned)_start);
 
 	/*
 	 * Now we try to claim some memory for the kernel itself
@@ -149,7 +149,7 @@ void start(unsigned long a1, unsigned lo
 		printf("initial ramdisk moving 0x%lx <- 0x%lx (%lx bytes)\n\r",
 		       initrd.addr, (unsigned long)_initrd_start, initrd.size);
 		memmove((void *)initrd.addr, (void *)_initrd_start, initrd.size);
-		printf("initrd head: 0x%lx\n", *((u32 *)initrd.addr));
+		printf("initrd head: 0x%lx\n\r", *((u32 *)initrd.addr));
 	}
 
 	/* Eventually gunzip the kernel */

From michael at ellerman.id.au  Thu Mar 17 20:57:04 2005
From: michael at ellerman.id.au (Michael Ellerman)
Date: Thu, 17 Mar 2005 20:57:04 +1100
Subject: Why no bigphysarea in mainline?
Message-ID: <200503172057.06570.michael@ellerman.id.au>

Hi all,

Can anyone recall if there's ever been a discussion about merging the
bigphysarea patch (see below) into mainline? I couldn't find much of
interest on google.

I realise bigphysarea is a bit of a hack, but it's nowhere near as big
a hack as using mem=X to limit the kernel's memory and then using the
rest of memory for your device driver.

The reason I'm curious is because I've gotten several queries about the
mem=X option on PPC64 and whether it will support this hack, which it
won't.

If no one has any fundamental objections I think it'd be good to get
this merged into mainline so people start using it rather than mem=X
hacks. To that end please let me know what you think is wrong with the
patch as it stands (below).
cheers Nick Martin's version for 2.6.9 (which applies to 2.6.11): http://www.ussg.iu.edu/hypermail/linux/kernel/0411.1/2076.html And the guts of it: Index: 2.6.11-bigphysarea/mm/bigphysarea.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ 2.6.11-bigphysarea/mm/bigphysarea.c 2005-03-17 19:15:08.256421832 +1100 @@ -0,0 +1,353 @@ +/* linux/mm/bigphysarea.c, M. Welsh (mdw at xxxxxxxxxxxxxx) + * Copyright (c) 1996 by Matt Welsh. + * Extended by Roger Butenuth (butenuth at xxxxxxxxxxxxxxxx), October 1997 + * Extended for linux-2.1.121 till 2.4.0 (June 2000) + * by Pauline Middelink + * Extended for linux-2.6.9 (November 2004) + * by Nick Martin + * + * This is a set of routines which allow you to reserve a large (?) + * amount of physical memory at boot-time, which can be allocated/deallocated + * by drivers. This memory is intended to be used for devices such as + * video framegrabbers which need a lot of physical RAM (above the amount + * allocated by kmalloc). This is by no means efficient or recommended; + * to be used only in extreme circumstances. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static int get_info(char* buf, char**, off_t, int); + +typedef struct range_struct { + struct range_struct *next; + caddr_t base; /* base of allocated block */ + size_t size; /* size in bytes */ +} range_t; + +/* + * 0: nothing initialized + * 1: bigphysarea_pages initialized + * 2: free list initialized + */ +static int init_level = 0; +static int bigphysarea_pages = 0; +static caddr_t bigphysarea = 0; +static range_t *free_list = NULL; +static range_t *used_list = NULL; +static struct resource mem_resource = { "Bigphysarea", 0, 0, IORESOURCE_MEM|IORESOURCE_BUSY }; + +static +int __init bigphysarea_init(void) +{ + if (bigphysarea_pages == 0 || bigphysarea == 0) + return -EINVAL; + + /* create to /proc entry for it */ + if (!create_proc_info_entry("bigphysarea",0444,NULL,get_info)) { + // ohoh, no way to free the allocated memory! + // continue without proc support, it not fatal in itself +// free_bootmem((unsigned long)bigphysarea>>PAGE_SHIFT,bigphysarea_pages<next; + /* + * The free-list is sorted by address, search insertion point + * and insert block in free list. + */ + for (range_ptr = &free_list, prev = NULL; + *range_ptr != NULL; + prev = *range_ptr, range_ptr = &(*range_ptr)->next) + if ((*range_ptr)->base >= base) + break; + range->next = *range_ptr; + *range_ptr = range; + /* + * Concatenate free range with neighbors, if possible. + * Try for upper neighbor (next in list) first, then + * for lower neighbor (predecessor in list). 
+ */ + if (range->next != NULL && + range->base + range->size == range->next->base) { + next = range->next; + range->size += range->next->size; + range->next = next->next; + kfree(next); + } + if (prev != NULL && + prev->base + prev->size == range->base) { + prev->size += prev->next->size; + prev->next = range->next; + kfree(range); + } +} +EXPORT_SYMBOL(bigphysarea_free_pages); + +caddr_t bigphysarea_alloc(int size) +{ + int pages = (size + PAGE_SIZE - 1) / PAGE_SIZE; + + return bigphysarea_alloc_pages(pages, 1, GFP_KERNEL); +} +EXPORT_SYMBOL(bigphysarea_alloc); + + +void bigphysarea_free(caddr_t addr, int size) +{ + (void)size; + bigphysarea_free_pages(addr); +} +EXPORT_SYMBOL(bigphysarea_free); + +static +int get_info(char *buf, char **a, off_t b, int c) +{ + char *p = buf; + range_t *ptr; + int free_count, free_total, free_max; + int used_count, used_total, used_max; + + if (init_level == 1) + init2(GFP_KERNEL); + + free_count = 0; + free_total = 0; + free_max = 0; + for (ptr = free_list; ptr != NULL; ptr = ptr->next) { + free_count++; + free_total += ptr->size; + if (ptr->size > free_max) + free_max = ptr->size; + } + + used_count = 0; + used_total = 0; + used_max = 0; + for (ptr = used_list; ptr != NULL; ptr = ptr->next) { + used_count++; + used_total += ptr->size; + if (ptr->size > used_max) + used_max = ptr->size; + } + + if (bigphysarea_pages == 0) { + p += sprintf(p, "No big physical area allocated!\n"); + return p - buf; + } + + p += sprintf(p, "Big physical area, size %ld kB\n", + bigphysarea_pages * PAGE_SIZE / 1024); + p += sprintf(p, " free list: used list:\n"); + p += sprintf(p, "number of blocks: %8d %8d\n", + free_count, used_count); + p += sprintf(p, "size of largest block: %8d kB %8d kB\n", + free_max / 1024, used_max / 1024); + p += sprintf(p, "total: %8d kB %8d kB\n", + free_total / 1024, used_total /1024); + + return p - buf; +} -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050317/b5a29da0/attachment.pgp 

From haveblue at us.ibm.com  Fri Mar 18 01:35:32 2005
From: haveblue at us.ibm.com (Dave Hansen)
Date: Thu, 17 Mar 2005 06:35:32 -0800
Subject: Why no bigphysarea in mainline?
In-Reply-To: <200503172057.06570.michael@ellerman.id.au>
References: <200503172057.06570.michael@ellerman.id.au>
Message-ID: <1111070132.19021.31.camel@localhost>

On Thu, 2005-03-17 at 20:57 +1100, Michael Ellerman wrote:
> I realise bigphysarea is a bit of a hack, but it's nowhere near as
> big a hack as using mem=X to limit the kernel's memory and then using
> the rest of memory for your device driver.

Well, the fact that you can get away with that is a coincidence. What
if you have 4GB of RAM on an x86 machine, you do mem=3G, and you start
using that top GB of memory for your driver? You eventually write into
the PCI config space. Ooops. You get strange errors that way.

Doing mem= for drivers isn't just a hack, it's *WRONG*. It's a ticking
time bomb that magically happens to work on some systems. It will not
work consistently on a discontiguous memory system, or a memory hotplug
system.

> If no one has any fundamental objections I think it'd be good to get
> this merged into mainline so people start using it rather than mem=X
> hacks. To that end please let me know what you think is wrong with
> the patch as it stands (below).

Could you give some examples of drivers which are in the kernel that
could benefit from this patch? We don't tend to put things like this
in, unless they have actual users. We don't tend to change code for
out-of-tree users, either.

-- Dave

From tom.l.nguyen at intel.com  Fri Mar 18 05:33:12 2005
From: tom.l.nguyen at intel.com (Nguyen, Tom L)
Date: Thu, 17 Mar 2005 10:33:12 -0800
Subject: PCI Error Recovery API Proposal.
	(WAS:: [PATCH/RFC] PCIErrorRecovery)
Message-ID:

On Wednesday, March 16, 2005 7:52 PM Paul Mackerras wrote:
>> We need some PCI
>> based error flows to understand the details of the flow so we can
>> develop an interface compatible with both.
>
>Here is a basic outline of what happens with EEH (Enhanced Error
>Handling) on IBM PPC64 platforms. This applies to PCI, PCI-X and
>PCI-Express devices.

Is EEH a PCI-SIG specification? Are the EEH specs publicly available?

>We have a PCI-PCI bridge per slot. The bridge (and the PCI fabric
>generally) look for errors such as address parity errors,
>out-of-bounds DMA accesses by the device, or anything that would
>normally cause SERR to be set. If such an error occurs, the bridge
>immediately isolates the device, meaning that writes by the CPU to the
>device are discarded, reads by the CPU are returned with all 1s data,
>and DMA accesses by the device are blocked.

It seems that having a PCI-PCI bridge per slot is specific to the
hardware implementation, as is the bridge's ability to isolate the
slot.

>What happens at the driver level depends on whether the driver is
>EEH-aware or not. (This description is more what we would like to
>have rather than what is necessarily implemented at present).

The PCI Express AER driver uses a similar concept of determining
whether the driver is AER-aware, except that PCI Express AER is
independent of firmware support.

>If the driver is not EEH-aware but is hot-plug capable, then the
>platform code will notice that reads from the device are returning all
>1s and query firmware about the state of the slot. Firmware will
>indicate that the slot has been isolated. Platform code can obtain
>more specific information about the error from firmware and log it.
>Then, platform code will generate a hot-unplug event for the slot.
>After the driver has cleaned up and notified higher levels that its
>device has gone away, platform code will call firmware to reset and
>unisolate the slot, and then generate a hotplug event to tell the
>driver that it can use the device - but as far as the driver is
>concerned, it is a new device.

Where does the platform code reside and where does it log the error?

In PCI Express, if the driver is not AER-aware, the fatal error message
is reported by its upstream switch, and the AER driver obtains
comprehensive error information from that switch (much as EEH platform
code obtains error information from the firmware). Since the driver is
not AER-aware, the fatal error is reported to the user to make a policy
decision, because PCI Express does not have a hot-plug event for the
slot as the EEH platform does.

So it looks like the hot-plug capability of the driver is being used in
lieu of specific callbacks to freeze and thaw IO in the case of a
non-aware driver. If the driver does not support hot-plug then the
error is just logged. Do you leave the slot isolated or perform error
recovery anyway?

On a fatal error the interface is down. No matter what the driver
supports (AER aware, EEH aware, unaware) all IO is likely to fail.
Resetting a bus in a point-to-point environment like PCI Express or EEH
(as you describe) should have little adverse effect. The risk is that
the bus reset will cause a card reset, and the driver must know to
re-initialize the card. A link reset in PCI Express will not cause a
card reset. We assume the driver will reset its card if necessary.

>If the driver is EEH-aware, then we use the API that Ben has
>proposed. Platform code can either reset the slot (by calling
>firmware) or not, depending on what the driver asks for, and also
>depending on any other information the platform code has available to
>it, such as specific information about the error that has occurred.
>Platform code then unisolates the slot and then informs the driver
>that it can reinitialize the device and restart any transfers that
>were in progress.

In PCI Express the AER driver obtains fatal error information from the
upstream switch driver. We can use the same API with
message = PCIERR_ERROR_RECOVER to notify the endpoint driver, which may
be unaware of the fatal error reported by its upstream device. In most
cases the driver will respond with PCIERR_RESULT_NEED_RESET.

>Ben's API is aimed at supporting the code flows that we need for EEH
>as well as those needed for recovery from errors on PCI Express. Part
>of the reason for not just requiring the driver to do everything
>itself is that a slot isolation event can affect multiple drivers,
>because the card in the slot could have a PCI-PCI bridge with multiple
>devices behind it. Thus the recovery process potentially requires a
>degree of coordination between multiple drivers, and Ben's API
>addresses that. The same coordination could be required on PCI
>Express, if I understand correctly, because a fault on an upstream
>link could affect many devices downstream of that link.

Yes, the same case applies to PCI Express upstream links. So halting IO
is desired when other devices are affected.

Thanks,
Long

From msdemlei at cl.uni-heidelberg.de  Fri Mar 18 06:30:47 2005
From: msdemlei at cl.uni-heidelberg.de (Markus Demleitner)
Date: Thu, 17 Mar 2005 20:30:47 +0100
Subject: sungem on imac G5
Message-ID: <20050317193047.GA25015@tucana.cl.uni-heidelberg.de>

Hi all,

Sorry for lowering the discussion level, but I believe this is the most
appropriate mailing list for my woes.

My problem is that with the 2.6.11.2 kernel, patched with benh's
recent patches (the sungem driver has date
2005-03-10 23:04:36.000000000 +0000), networking on an iMac G5 works
like a charm when OF hasn't touched the network card.
If I boot from the network (yaboot, getting yaboot, the kernel, and
initrd via tftp) *or* have OF put its console on the network (i.e., as
long as OF touches the chip in a serious way), however, I get "GMAC PHY
not responding" when the chip is initialized by the kernel, more
exactly:

  sungem.c: v0.98 8/24/03 David S. Miller (davem at redhat.com)
  eth%d: GMAC PHY not responding !
    [ Just an aside: the blank before the bang should go away:-)
      I admit to not having tried to investigate why the %d is left in
      gp->dev->name, but since it's filled in later I figured that
      wouldn't be the problem ]
  eth0: Sun GEM (PCI) 10/100/1000BaseT Ethernet 00:0a:XXXXXXXX
  eht0: Found no PHY

Experimentally, I raised the delay in

  #ifdef CONFIG_PPC_PMAC
  	pmac_call_feature(PMAC_FTR_GMAC_PHY_RESET, gp->of_node, 0, 0);
  	msleep(20);
  #endif

to up to 2000 without really knowing what I was doing, but
(fortunately) that didn't help. I've checked that that code really is
executed.

Now, I don't even know what a PHY is, really, and thus this cry for
help: Is there any way to perform a really deep, deep reset on the
sungem device? Is it just that Apple chose some not-quite-known
PHY in the G5?

Thanks for any pointers,

Markus

From benh at kernel.crashing.org  Fri Mar 18 09:57:58 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 18 Mar 2005 09:57:58 +1100
Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)
In-Reply-To:
References:
Message-ID: <1111100278.25180.21.camel@gaston>

> On a fatal error the interface is down. No matter what the driver
> supports (AER aware, EEH aware, unaware) all IO is likely to fail.
> Resetting a bus in a point-to-point environment like PCI Express or EEH
> (as you describe) should have little adverse effect. The risk is the
> bus reset will cause a card reset and the driver must understand to
> re-initialize the card. A link reset in PCI Express will not cause a
> card reset. We assume the driver will reset its card if necessary.
Does the link side of PCIE provide a way to trigger a hard reset of
the rest of the card? If not, then it's dodgy, as there may be no way
to consistently "reset" the card if it's in a bad state.

I have to double check, but I suspect that IBM's implementation of
EEH-compliant PCIE will add a full hard reset, not just a link reset.

From flar at allandria.com  Fri Mar 18 10:08:37 2005
From: flar at allandria.com (Brad Boyer)
Date: Thu, 17 Mar 2005 15:08:37 -0800
Subject: sungem on imac G5
In-Reply-To: <20050317193047.GA25015@tucana.cl.uni-heidelberg.de>
References: <20050317193047.GA25015@tucana.cl.uni-heidelberg.de>
Message-ID: <20050317230837.GA21896@pants.nu>

On Thu, Mar 17, 2005 at 08:30:47PM +0100, Markus Demleitner wrote:
> Now, I don't even know what a PHY is, really, and thus this cry for
> help: Is there any way to perform a really deep, deep reset on the
> sungem device? Is it just that Apple chose some not-quite-known
> PHY in the G5?

This won't really help your bug, but I can explain the purpose of
a PHY chip. This chip implements the physical media layer, such as
10baseT, 100baseTX, etc. It performs low level signal handling,
auto-negotiation, bit encoding (for fiber), and other stuff related
to the actual signal going over the cable. Because of this, you can
buy the different parts of an ethernet card from separate vendors
depending on the features you need. For example, Apple puts chips
that auto-detect crossover cables in the PowerBook series these days.
They can use the same controller chip (i.e.: GEM) in boards that
have different feature sets by using a different PHY. The bulk of
the driver stays the same, but they end up with the desired features.
	Brad Boyer
	flar at allandria.com

From benh at kernel.crashing.org  Fri Mar 18 11:04:30 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 18 Mar 2005 11:04:30 +1100
Subject: sungem on imac G5
In-Reply-To: <20050317193047.GA25015@tucana.cl.uni-heidelberg.de>
References: <20050317193047.GA25015@tucana.cl.uni-heidelberg.de>
Message-ID: <1111104270.25180.28.camel@gaston>

On Thu, 2005-03-17 at 20:30 +0100, Markus Demleitner wrote:
> Hi all,

> Now, I don't even know what a PHY is, really, and thus this cry for
> help: Is there any way to perform a really deep, deep reset on the
> sungem device? Is it just that Apple chose some not-quite-known
> PHY in the G5?

There is some weird thing in the Shasta chipset relative to the PHY
reset. I think the ethernet and firewire PHYs are reset by the same
line, so we may be screwing up there.

/me checks Darwin code ...

Yes, there are all sorts of weird things in there, I'll cook a patch
and post it for testing.

Ben.

From benh at kernel.crashing.org  Fri Mar 18 11:21:31 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 18 Mar 2005 11:21:31 +1100
Subject: sungem on imac G5
In-Reply-To: <20050317193047.GA25015@tucana.cl.uni-heidelberg.de>
References: <20050317193047.GA25015@tucana.cl.uni-heidelberg.de>
Message-ID: <1111105291.3835.30.camel@gaston>

On Thu, 2005-03-17 at 20:30 +0100, Markus Demleitner wrote:
> Hi all,
>
> Sorry for lowering the discussion level, but I believe this is most
> appropriate mailing list for my woes.
>
> My problem is that with the 2.6.11.2 kernel, patched with benh's
> recent patches (the sungem driver has date
> 2005-03-10 23:04:36.000000000 +0000), networking on an iMac G5 works
> like a charm when OF hasn't touched the network card.

Can you try this patch?
Index: linux-work/arch/ppc64/kernel/pmac_feature.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pmac_feature.c	2005-03-15 11:56:46.000000000 +1100
+++ linux-work/arch/ppc64/kernel/pmac_feature.c	2005-03-18 11:20:21.000000000 +1100
@@ -220,6 +220,34 @@
 	return 0;
 }
 
+static long __pmac g5_eth_phy_reset(struct device_node* node, long param, long value)
+{
+	struct device_node *phy;
+	int need_reset;
+	unsigned long flags;
+
+	/*
+	 * We must not reset the combo PHYs, only the BCM5221 found in
+	 * the iMac G5.
+	 */
+	phy = of_get_next_child(node, NULL);
+	if (!phy)
+		return -ENODEV;
+	need_reset = device_is_compatible(phy, "B5221");
+	of_node_put(phy);
+	if (!need_reset)
+		return 0;
+
+	/* PHY reset is GPIO 29, not in device-tree unfortunately */
+	MACIO_OUT8(K2_GPIO_EXTINT_0 + 29,
+		   KEYLARGO_GPIO_OUTPUT_ENABLE | KEYLARGO_GPIO_OUTOUT_DATA);
+	/* Thankfully, this is now always called at a time when we can
+	 * schedule by sungem.
+	 */
+	msleep(10);
+	MACIO_OUT8(K2_GPIO_EXTINT_0 + 29, 0);
+}
+
 #ifdef CONFIG_SMP
 static long __pmac g5_reset_cpu(struct device_node* node, long param, long value)
 {
@@ -306,6 +334,7 @@
 	{ PMAC_FTR_ENABLE_MPIC,		g5_mpic_enable },
 	{ PMAC_FTR_READ_GPIO,		g5_read_gpio },
 	{ PMAC_FTR_WRITE_GPIO,		g5_write_gpio },
+	{ PMAC_FTR_GMAC_PHY_RESET,	g5_eth_phy_reset },
 #ifdef CONFIG_SMP
 	{ PMAC_FTR_RESET_CPU,		g5_reset_cpu },
 #endif /* CONFIG_SMP */

From daniel at osdl.org  Fri Mar 18 12:12:28 2005
From: daniel at osdl.org (Daniel McNeil)
Date: Thu, 17 Mar 2005 17:12:28 -0800
Subject: AIO panic on 2.6.11 on PPC64 caused by is_hugepage_only_range()
Message-ID: <1111108348.31932.43.camel@ibm-c.pdx.osdl.net>

When testing AIO on PPC64 (a power5 machine) running 2.6.11 with
CONFIG_HUGETLB_PAGE=y, I ran into a kernel panic when a process
exits that has done AIO (io_queue_init()) but has not done the
io_queue_release(). The exit_aio() code is cleaning up and
panicking when trying to free the aio ring buffer.
I tracked this down to is_hugepage_only_range() (include/asm-ppc64/page.h)
which is doing a touches_hugepage_low_range() which is checking
current->mm->context.htlb_segs. The problem is that exit_mm()
cleared tsk->mm before doing the mmput() which leads to the
exit_aio() and then the panic.

Looks like is_hugepage_only_range() is only used in ia64 and ppc64.

A possible fix is to change is_hugepage_only_range() to take an 'mm'
as a parameter as well as 'addr' and 'len' and then the ppc64 code
could change to use 'mm'.

It looks like it has been broken for quite a while.

Here's the stack trace:

cpu 0x2: Vector: 300 (Data Access) at [c0000001d1be7590]
    pc: c000000000092960: .unmap_region+0x17c/0x4a4
    lr: c000000000092bb0: .unmap_region+0x3cc/0x4a4
    sp: c0000001d1be7810
   msr: 8000000000009032
   dar: 298
 dsisr: 40000000
  current = 0xc000000001dd77b0
  paca    = 0xc000000000595c00
    pid   = 11336, comm = aiodio_readoff
[c0000001d1be78e0] c000000000093d08 .do_munmap+0x240/0x408
[c0000001d1be79b0] c0000000000d11b4 .aio_free_ring+0x10c/0x1d8
[c0000001d1be7a50] c0000000000d162c .__put_ioctx+0x84/0x120
[c0000001d1be7af0] c0000000000d3640 .exit_aio+0xf4/0x100
[c0000001d1be7b80] c00000000004dfd4 .mmput+0x80/0x15c
[c0000001d1be7c20] c000000000053648 .exit_mm+0x1b4/0x264
[c0000001d1be7cc0] c0000000000555ac .do_exit+0x10c/0xdb0
[c0000001d1be7d90] c0000000000562a8 .do_group_exit+0x58/0xd8
[c0000001d1be7e30] c00000000000d500 syscall_exit+0x0/0x18

Here's a program that produces the panic:
(compile using cc -o aiodio_read aiodio_read.c -laio).
--------------------------
#define _XOPEN_SOURCE 600
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <libaio.h>

int pagesize;
char *iobuf;
io_context_t myctx;
int aio_maxio = 4;

/*
 * do an AIO DIO read
 */
int do_aio_direct_read(int fd, char *iobuf, int offset, int size)
{
	struct iocb myiocb;
	struct iocb *iocbp = &myiocb;
	int ret;
	struct io_event e;
	struct stat s;

	io_prep_pread(&myiocb, fd, iobuf, size, offset);
	if ((ret = io_submit(myctx, 1, &iocbp)) != 1) {
		perror("io_submit");
		return ret;
	}

	ret = io_getevents(myctx, 1, 1, &e, 0);

	if (ret) {
		struct iocb *iocb = e.obj;
		int iosize = iocb->u.c.nbytes;
		char *buf = iocb->u.c.buf;
		long long loffset = iocb->u.c.offset;

		printf("AIO read of %d at offset %lld returned %d\n",
			iosize, loffset, e.res);
	}

	return ret;
}

int main(int argc, char *argv[])
{
	char *filename;
	int fd;
	int err;

	filename = "test.aio.file";
	fd = open(filename, O_RDWR|O_DIRECT|O_CREAT|O_TRUNC, 0666);

	pagesize = getpagesize();
	err = posix_memalign((void**) &iobuf, pagesize, pagesize);
	if (err) {
		fprintf(stderr, "Error allocating %d aligned bytes.\n", pagesize);
		exit(1);
	}
	err = write(fd, iobuf, pagesize);
	if (err != pagesize) {
		fprintf(stderr, "Error ret = %d writing %d bytes.\n", err, pagesize);
		perror("");
		exit(1);
	}

	/* Set up the AIO context but never call io_queue_release() */
	memset(&myctx, 0, sizeof(myctx));
	io_queue_init(aio_maxio, &myctx);
	err = do_aio_direct_read(fd, iobuf, 0, pagesize);
	close(fd);

	printf("This will panic on ppc64\n");
	return err;
}
--------------------------

Daniel

From tom.l.nguyen at intel.com  Fri Mar 18 05:53:46 2005
From: tom.l.nguyen at intel.com (Nguyen, Tom L)
Date: Thu, 17 Mar 2005 10:53:46 -0800
Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)
Message-ID:

On Wednesday, March 16, 2005 7:20 PM Benjamin Herrenschmidt wrote:
>> What mechanism (message??) is used to perform the bus and/or link
>> level reset? For PCI Express the reset is performed by the upstream
>> port driver.
My API takes this into account. Are you assuming the PCI >> device on the bus does the reset or will there be a PCI bus driver that >> will do the reset? How will the PCI error handling code initiate a >> reset? > >The "caller", that is the error management framework. I'm defining the >API at the driver level, not the implementation at the core level. > >For example, on IBM pSeries with PCI-Express, we will probably not have >an AER driver. This will all be dealt with by the firmware, which will mimic >that to the existing EEH error management. We'll have the same API to do >the reset that we have today for resetting a slot. We decided to implement PCI Express error handling based on the PCI Express specification in a platform independent manner. This allows any platform that implements PCI Express AER per the PCI SIG specification to take advantage of the advanced features, much like SHPC hot-plug or PCI Express hot-plug implementations. >You may have noticed in general that I didn't define who is >calling those callbacks either. It's all implicit that this is done by platform >error management code. For example, on ppc64, even the recovery step >requires action from the platform since the slot has been physically >isolated. After we have notified all drivers with the "error detected" >callback, if we decide we can try the "recover" step (all drivers >returned they could try it and we decided the error wasn't too fatal) we >will call the firmware to re-enable IOs on the slot and call the >"recover" step. For PCI Express the endpoint device driver can take recovery action on its own, depending on the nature of the error, so long as it does not affect the upstream device. This can include endpoint device resets. We expect the driver to do this upon error notification, if possible. In PCI Express, since the driver will have the most knowledge regarding the error, it will have the best ability to do device dependent recovery and IO retry.
If its recovery fails then the AER driver will ask the upstream device driver to perform the link reset. Since this is more of a side effect an explicit call to recover is not necessary. However, we understand and agree that it is needed to support the general error recovery cases for PCI. To support the AER driver calling an upstream device to initiate a reset of the link we need a specific callback since the driver doing the reset is not the driver who got the error. In the case of general PCI this could be useful if a PCI bus driver were available to support the callback for a bridge device. This would also support specific error recovery calls to reset an endpoint adapter. We need a call to request a driver to perform a reset on a link or device. Thanks, Long From benh at kernel.crashing.org Fri Mar 18 13:43:37 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 18 Mar 2005 13:43:37 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery) In-Reply-To: References: Message-ID: <1111113817.25180.79.camel@gaston> On Thu, 2005-03-17 at 10:53 -0800, Nguyen, Tom L wrote: > To support the AER driver calling an upstream device to initiate a reset > of the link we need a specific callback since the driver doing the reset > is not the driver who got the error. In the case of general PCI this > could be useful if a PCI bus driver were available to support the > callback for a bridge device. This would also support specific error > recovery calls to reset an endpoint adapter. We need a call to request > a driver to perform a reset on a link or device. That is quite implementation specific, it doesn't need to be part of the API (the way the general error management is implemented in PCIE could be completely done within the bus drivers I suppose). Again, I'm not trying to define or force a given implementation. I'm trying to define the driver-side API, that's all. 
I have difficulties following all of your previous explanations, I must admit. My point here is I'd like you to find out if the API can fit on the driver side, and if not, what would need to be changed. For example, we might want to distinguish between slot reset (full hard reset) and link reset, that sort of thing (thus adding a new state for link reset and a new return code for the others for requesting a link reset if possible; platforms that don't do it, like IBM EEH PCI, would just fall back to full reset). Again, the goal here is to have a way for drivers to be mostly bus agnostic (that is, not have to care if they are running on PCI, PCI-X, PCIE, with or without the IBM EEH mechanism, and whatever other mechanism another vendor might provide) and still implement basic error recovery. A driver _designed_ for a PCI-Express device that knows it's on PCI Express can perfectly use additional APIs to gather more error details, etc... but it would be nice to fit the "common needs" as much as possible in a common and _SIMPLE_ API. The simplicity here is a requirement, I'm very serious about it, because if it's not simple, drivers either won't implement it or won't get it right. Ben. From michael at ellerman.id.au Fri Mar 18 14:42:48 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 18 Mar 2005 14:42:48 +1100 Subject: Why no bigphysarea in mainline? In-Reply-To: <1111070132.19021.31.camel@localhost> References: <200503172057.06570.michael@ellerman.id.au> <1111070132.19021.31.camel@localhost> Message-ID: <200503181442.51830.michael@ellerman.id.au> On Fri, 18 Mar 2005 01:35, Dave Hansen wrote: > Doing mem= for drivers isn't just a hack, it's *WRONG*. It's a ticking > time bomb that magically happens to work on some systems. It will not > work consistently on a discontiguous memory system, or a memory hotplug > system. I couldn't agree more.
Problem is I've been asked to change the way mem=X works on PPC64 so that this hack will work, which is a horrible thought. > Could you give some examples of drivers which are in the kernel that > could benefit from this patch? We don't tend to put things like this > in, unless they have actual users. We don't tend to change code for > out-of-tree users, either. No I can't. I've been approached by several "vendors" asking about using mem=X hacks on PPC64, however I doubt any of them have code in-tree. I'll check though. cheers -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050318/437c9308/attachment.pgp From paulus at samba.org Fri Mar 18 14:22:35 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 18 Mar 2005 14:22:35 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery) In-Reply-To: References: Message-ID: <16954.18811.500766.244979@cargo.ozlabs.ibm.com> Nguyen, Tom L writes: > Is EEH a PCI-SIG specification? Is EEH specs available in public? No and no (not yet anyway). > It seems that a PCI-PCI bridge per slot is hardware implementation > specific. The fact that the PCI-PCI Bridge can isolate the slot is > hardware feature specific. Well, it's a common feature across all current IBM PPC64 machines. > PCI Express AER driver uses similar concept of determining whether the > driver is AER-aware or not except that PCI Express AER is independent > from firmware support. Don't worry about the firmware; the driver won't have to interact with firmware itself, that's the job of the ppc64-specific platform code. > Where does the platform code reside and where does it log the error? By platform code I meant the code under the arch directory that knows the details of the I/O topology of the machine, how to access the PCI host bridges, etc. 
How and where it logs the error is a platform policy; on IBM ppc64 machines we have an error log daemon for this purpose, which can do things like log the error to a file or send it to another machine. > In PCI Express if the driver is not AER-aware the fatal error message is > reported by its upstream switch, the AER driver obtains comprehensive > error information from the upstream switch (like EEH platform code > obtains error information from the firmware). Since the driver is not > AER-aware, the fatal error is reported to user to make a policy decision > since the PCI Express does not have a hot-plug event for the slot like > EEH platform. If there is a permanent failure of an upstream link, then maybe generating unplug events for the devices below it would be a useful thing to do. > So it looks like the hot-plug capability of the driver is being used in > lieu of specific callbacks to freeze and thaw IO in the case of a > non-aware driver. If the driver does not support hot-plug then the > error is just logged. Do you leave the slot isolated or perform error > recovery anyway? The choice is really to leave the slot isolated or to panic the system. Leaving the slot isolated risks having the driver loop in an interrupt routine or deliver bad data to userspace, so we currently panic the system. > On a fatal error the interface is down. No matter what the driver Which interface do you mean here? > supports (AER aware, EEH aware, unaware) all IO is likely to fail. > Resetting a bus in a point-to-point environment like PCI Express or EEH > (as you describe) should have little adverse effect. The risk is the > bus reset will cause a card reset and the driver must understand to > re-initialize the card. A link reset in PCI Express will not cause a > card reset. We assume the driver will reset its card if necessary. How will the driver reset its card? > In PCI Express the AER driver obtains fatal error information from the > upstream switch driver. 
We can use the same API with message = > PCIERR_ERROR_RECOVER to notify the endpoint driver, which is maybe > unaware of the fatal error reported by its upstream device. Mostly the > driver will respond with PCIERR_RESULT_NEED_RESET. Sounds fine. Paul. From paulus at samba.org Fri Mar 18 15:01:11 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 18 Mar 2005 15:01:11 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery) In-Reply-To: References: Message-ID: <16954.21127.536184.940537@cargo.ozlabs.ibm.com> Nguyen, Tom L writes: > We decided to implement PCI Express error handling based on the PCI > Express specification in a platform independent manner. This allows any > platform that implements PCI Express AER per the PCI SIG specification > to take advantage of the advanced features, much like SHPC hot-plug or > PCI Express hot-plug implementations. Does the PCI Express AER specification define an API for drivers? > For PCI Express the endpoint device driver can take recovery action on > its own, depending on the nature of the error so long as it does not > affect the upstream device. This can include endpoint device resets. Likewise, with EEH the device driver could take recovery action on its own. But we don't want to end up with multiple sets of recovery code in drivers, if possible. Also we want the recovery code to be as simple as possible, otherwise driver authors will get it wrong. > To support the AER driver calling an upstream device to initiate a reset > of the link we need a specific callback since the driver doing the reset > is not the driver who got the error. In the case of general PCI this I would see the AER driver as being included in the "platform" code. The AER driver would be closely involved in the recovery process. What is the state of a link during the time between when an error is detected and when a link reset is done? Is the link usable?
What happens if you try to do a MMIO read from a device downstream of the link? Regards, Paul. From msdemlei at cl.uni-heidelberg.de Sat Mar 19 01:24:49 2005 From: msdemlei at cl.uni-heidelberg.de (Markus Demleitner) Date: Fri, 18 Mar 2005 15:24:49 +0100 Subject: sungem on imac G5 In-Reply-To: <20050318010004.783B667A7D@ozlabs.org> References: <20050318010004.783B667A7D@ozlabs.org> Message-ID: <20050318142448.GA26406@tucana.cl.uni-heidelberg.de> On Fri, Mar 18, 2005 at 12:00:04PM +1100, benh wrote: > > My problem is that with the 2.6.11.2 kernel, patched with benh's > > recent patches (the sungem driver has date > > 2005-03-10 23:04:36.000000000 +0000), networking on an iMac G5 works > > like a charm when OF hasn't touched the network card. > > Can you try this patch ? > > Index: linux-work/arch/ppc64/kernel/pmac_feature.c > =================================================================== > --- linux-work.orig/arch/ppc64/kernel/pmac_feature.c 2005-03-15 11:56:46.000000000 +1100 > +++ linux-work/arch/ppc64/kernel/pmac_feature.c 2005-03-18 11:20:21.000000000 +1100 > @@ -220,6 +220,34 @@ > return 0; > } > > +static long __pmac g5_eth_phy_reset(struct device_node* node, long param, long value) > +{ > + struct device_node *phy; > + int need_reset; > + unsigned long flags; > + > + /* > + * We must not reset the combo PHYs, only the BCM5221 found in > + * the iMac G5. > + */ > + phy = of_get_next_child(node, NULL); > + if (!phy) > + return -ENODEV; > + need_reset = device_is_compatible(phy, "B5221"); > + of_node_put(phy); > + if (!need_reset) > + return 0; > + > + /* PHY reset is GPIO 29, not in device-tree unfortunately */ > + MACIO_OUT8(K2_GPIO_EXTINT_0 + 29, > + KEYLARGO_GPIO_OUTPUT_ENABLE | KEYLARGO_GPIO_OUTOUT_DATA); > + /* Thankfully, this is now always called at a time when we can > + * schedule by sungem. 
> + */ > + msleep(10); > + MACIO_OUT8(K2_GPIO_EXTINT_0 + 29, 0); > +} > + > #ifdef CONFIG_SMP > static long __pmac g5_reset_cpu(struct device_node* node, long param, long value) > { > @@ -306,6 +334,7 @@ > { PMAC_FTR_ENABLE_MPIC, g5_mpic_enable }, > { PMAC_FTR_READ_GPIO, g5_read_gpio }, > { PMAC_FTR_WRITE_GPIO, g5_write_gpio }, > + { PMAC_FTR_GMAC_PHY_RESET, g5_eth_phy_reset }, > #ifdef CONFIG_SMP > { PMAC_FTR_RESET_CPU, g5_reset_cpu }, > #endif /* CONFIG_SMP */ > Yup, that does the trick, thanks a lot. Hooray, I can netboot the beast! Minor issues: (a) The patch didn't apply cleanly, I had to fiddle in the second hunk manually (that one probably doesn't matter at all, I just wanted to mention it in case of a regression in the code I don't have) (b) MACIO_OUT8 uses macio, which g5_eth_phy_reset doesn't define. Fixed it by adding struct macio_chip* macio = &macio_chips[0]; to its local declarations. (c) There are two warnings remaining that I didn't care to fix (for now): arch/ppc64/kernel/pmac_feature.c:227: warning: unused variable `flags' arch/ppc64/kernel/pmac_feature.c:250: warning: control reaches end of non-void function (the line numbers are of course for my version) Markus PS: While I'm here, current progress report on thermal control: I'm prototyping a thermal control driver in userspace python right now, basically doing PID (though I'm convinced there just has to be a better control algorithm for this particular problem). Trouble is that I have one fan that doesn't seem to have an effect on the temperatures (harddisk fan? I don't think I can see the hard disk temperature, though I have one sensor I cannot interpret at all, plus there is one sensor OF reads directly through i2c, which may well be the hard disk one). I'll rip the machine open soon to see where the fans are (have been reluctant so far since it isn't my machine). 
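For reference, one step of a textbook PID loop can be sketched in C as follows. This is not Apple's algorithm and not the userspace prototype mentioned above; the struct, the gains and the sign convention (a positive error means "too hot", so a larger output means more fan) are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical controller state; the gains would have to come from
 * calibration data, none of which is given in this thread. */
struct pid_state {
	double kp, ki, kd;	/* proportional, integral, derivative gains */
	double integral;	/* accumulated error over time */
	double prev_error;	/* error seen on the previous step */
};

/* Run one control step and return the new fan drive value.
 * dt is the time since the previous step and must be positive. */
double pid_step(struct pid_state *s, double target, double measured, double dt)
{
	double error = measured - target;	/* positive when too hot */
	double derivative;

	assert(dt > 0.0);
	s->integral += error * dt;
	derivative = (error - s->prev_error) / dt;
	s->prev_error = error;

	return s->kp * error + s->ki * s->integral + s->kd * derivative;
}
```

A real fan driver would additionally clamp the output to the fan's valid range and limit the integral term so that it cannot wind up while the fan is saturated.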
Good news: The machine switches itself off if it overheats :-) From olh at suse.de Sat Mar 19 03:22:33 2005 From: olh at suse.de (Olaf Hering) Date: Fri, 18 Mar 2005 17:22:33 +0100 Subject: pmac_zilog: Trying to im_free nonexistent area Message-ID: <20050318162233.GA18490@suse.de> Ben, can you check the error handling in init_pmz() when pmz_register() register fails? It tries to iounmap ->control_reg, but I dont see how that can happen. Also, pmz_init_port() returns always 0, so pmz_probe() doesnt need to check for non-null. From tom.l.nguyen at intel.com Sat Mar 19 04:17:49 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Fri, 18 Mar 2005 09:17:49 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery) Message-ID: On Thursday, March 17, 2005 6:44 PM Benjamin Herrenschmidt wrote: >I have difficulties following all of your previous explanations, I must >admit. My point here is I'd like you to find out if the API can fit on >the driver side, and if not, what would need to be changed. In summary, we agreed that the API you propose should be: int (*error_handler)(struct pci_dev *dev, union error_src *); I believe this API works for most of PCI Express needs. The only addition PCI Express needs is a mechanism for the AER code to request a port bus driver to perform a downstream link reset when an error occurs on that downstream link. For example, you can add the PCIERR_ERROR_PORT_RESET message with the return is either PCIERR_RESULT_RECOVERED or PCIERR_RESULT_DISCONNECT to fit PCI Express needs. Thanks, Long From tom.l.nguyen at intel.com Sat Mar 19 04:24:02 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Fri, 18 Mar 2005 09:24:02 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery) Message-ID: On Thursday, March 17, 2005 8:01 PM Paul Mackerras wrote: > Does the PCI Express AER specification define an API for drivers? No. That is why we agree a general API that works for all platforms. 
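A minimal sketch of how drivers might implement the callback discussed in this thread, assuming the error_handler signature and the PCIERR_* names quoted above. The layouts of struct pci_dev and union error_src below are hypothetical stand-ins (the thread does not define them), the enum values are placeholders, and the two handler names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Message and result names as used in the thread; values are placeholders. */
enum pcierr_msg {
	PCIERR_ERROR_RECOVER,
	PCIERR_ERROR_PORT_RESET
};

enum pcierr_result {
	PCIERR_RESULT_RECOVERED,
	PCIERR_RESULT_NEED_RESET,
	PCIERR_RESULT_DISCONNECT
};

/* Hypothetical stand-ins for the real kernel types. */
struct pci_dev {
	int reset_works;	/* simulates whether a link reset succeeds */
	int link_up;		/* set once the downstream link is usable again */
};

union error_src {
	enum pcierr_msg msg;
};

/* Endpoint driver: it cannot repair a fatal link error on its own,
 * so it answers a recover request by asking for a reset. */
int endpoint_error_handler(struct pci_dev *dev, union error_src *src)
{
	assert(dev != NULL && src != NULL);
	if (src->msg == PCIERR_ERROR_RECOVER)
		return PCIERR_RESULT_NEED_RESET;
	return PCIERR_RESULT_DISCONNECT;
}

/* Port bus driver: the only one that touches the bridge, so it performs
 * the downstream link reset when asked to. */
int port_error_handler(struct pci_dev *dev, union error_src *src)
{
	assert(dev != NULL && src != NULL);
	if (src->msg == PCIERR_ERROR_PORT_RESET && dev->reset_works) {
		dev->link_up = 1;	/* stands in for the real link reset */
		return PCIERR_RESULT_RECOVERED;
	}
	return PCIERR_RESULT_DISCONNECT;
}
```

In this sketch the error management code would call the endpoint handler first and, on PCIERR_RESULT_NEED_RESET, send PCIERR_ERROR_PORT_RESET to the upstream port's handler, which keeps bridge register access inside the port bus driver.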
>Likewise, with EEH the device driver could take recovery action on its >own. But we don't want to end up with multiple sets of recovery code >in drivers, if possible. Also we want the recovery code to be as >simple as possible, otherwise driver authors will get it wrong. Drivers own their devices' register sets. Therefore, if there are any vendor unique actions that can be taken by the driver to recover, we expect the driver to do so. For example, the driver may see an "xyz" error for which there is a known erratum and a workaround that involves resetting some registers on the card. From our perspective we see drivers taking care of their own cards but the AER driver and your platform code will take care of the bus/link interfaces. >I would see the AER driver as being included in the "platform" code. >The AER driver would be closely involved in the recovery process. Our goal is to have the AER driver be part of the general code base because it is based on a PCI SIG specification that can be implemented across all architectures. >What is the state of a link during the time between when an error is >detected and when a link reset is done? Is the link usable? What >happens if you try to do a MMIO read from a device downstream of the >link? For a FATAL error the link is "unreliable". This means MMIO operations may or may not succeed. That is why the reset is performed by the upstream port driver. The interface to that is reliable. A reset of an upstream port will propagate to all downstream links. So we need an interface to the bus/port driver to request a reset on its downstream link. We don't want the AER driver writing port bus driver bridge control registers. We are trying to keep the ownership of the device's register reads/writes within the domain of the device's driver. In our case, that is the port bus driver.
Thanks, Long From grundler at parisc-linux.org Sat Mar 19 05:10:05 2005 From: grundler at parisc-linux.org (Grant Grundler) Date: Fri, 18 Mar 2005 11:10:05 -0700 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery) In-Reply-To: References: Message-ID: <20050318181005.GA30909@colo.lackof.org> On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote: > >Likewise, with EEH the device driver could take recovery action on its > >own. But we don't want to end up with multiple sets of recovery code > >in drivers, if possible. Also we want the recovery code to be as > >simple as possible, otherwise driver authors will get it wrong. > > Drivers own their devices register sets. Therefore if there are any > vendor unique actions that can be taken by the driver to recovery we > expect the driver to do so. ... All drivers also need to cleanup driver state if they can't simply recover (and restart pending IOs). ie they need to release DMA resources and return suitable errors for pending requests. > >I would see the AER driver as being included in the "platform" code. > >The AER driver would be be closely involved in the recovery process. > > Our goal is to have the AER driver be part of the general code base > because it is based on a PCI SIG specification that can be implemented > across all architectures. To the driver writer, it's all "platform" code. Folks who maintain PCI (and other) services differentiate between "generic" and "arch/platform" specific. Think first like a driver writer and then worry about if/how that can be divided between platform generic and platform/arch specific code. Even PCI-Express has *some* arch specific component. At a minimum each architecture has it's own chipset and firmware to deal with for PCI Express bus discovery and initialization. But driver writers don't have to worry about that and they shouldn't for error recovery either. > For a FATAL error the link is "unreliable". 
This means MMIO operations > may or may not succeed. That is why the reset is performed by the > upstream port driver. The interface to that is reliable. A reset of an > upstream port will propagate to all downstream links. So we need an > interface to the bus/port driver to request a reset on its downstream > link. We don't want the AER driver writing port bus driver bridge > control registers. We are trying to keep the ownership of the devices > register read/write within the domain of the devices driver. In our > case the port bus driver. A port bus driver does NOT sound like a normal device driver. If PCI Express defines a standard register set for a bridge device (like PCI COnfig space for PCI-PCI Bridges), then I don't see a problem with PCI-Express error handling code mucking with those registers. Look at how PCI-PCI bridges are supported today and which bits of code poke registers on PCI-PCI Bridges. hth, grant From tom.l.nguyen at intel.com Sat Mar 19 05:33:00 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Fri, 18 Mar 2005 10:33:00 -0800 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery) Message-ID: On Friday, March 18, 2005 10:10 AM Grant Grundler wrote: >A port bus driver does NOT sound like a normal device driver. >If PCI Express defines a standard register set for a bridge >device (like PCI COnfig space for PCI-PCI Bridges), then I >don't see a problem with PCI-Express error handling code mucking >with those registers. Look at how PCI-PCI bridges are supported >today and which bits of code poke registers on PCI-PCI Bridges. Please refer to PCIEBUS-HOWTO.txt for how port bus driver works. Thanks, Long From tom.l.nguyen at intel.com Sat Mar 19 06:29:43 2005 From: tom.l.nguyen at intel.com (Nguyen, Tom L) Date: Fri, 18 Mar 2005 11:29:43 -0800 Subject: PCI Error Recovery API Proposal. 
(WAS:: [PATCH/RFC]PCIErrorRecovery) Message-ID: On Thursday, March 17, 2005 2:58 PM Benjamin Herrenschmidt wrote: > Does the link side of PCIE provide a way to trigger a hard reset of the > rest of the card ? If not, then it's dodgy as there may be no way to > consistently "reset" the card if it's in a bad state. The PCI Express spec does not make it clear whether an in-band mechanism, called a hot-reset, triggers a hard reset of the rest of the card. I agree that if not, then it's dodgy. Thanks, Long From benh at kernel.crashing.org Sat Mar 19 10:13:02 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 19 Mar 2005 10:13:02 +1100 Subject: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery) In-Reply-To: <20050318181005.GA30909@colo.lackof.org> References: <20050318181005.GA30909@colo.lackof.org> Message-ID: <1111187582.1236.192.camel@gaston> On Fri, 2005-03-18 at 11:10 -0700, Grant Grundler wrote: > On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote: > > >Likewise, with EEH the device driver could take recovery action on its > > >own. But we don't want to end up with multiple sets of recovery code > > >in drivers, if possible. Also we want the recovery code to be as > > >simple as possible, otherwise driver authors will get it wrong. > > > > Drivers own their devices' register sets. Therefore if there are any > > vendor unique actions that can be taken by the driver to recover we > > expect the driver to do so. > ... > > All drivers also need to clean up driver state if they can't > simply recover (and restart pending IOs). ie they need to release > DMA resources and return suitable errors for pending requests. Additionally, in "real life", very few errors are caused by known errata. If the drivers know about the errata, they usually already work around them. Afaik, most of the errors are caused by transient conditions on the bus or the device, like a bit being flipped, or thermal conditions...
> To the driver writer, it's all "platform" code. > Folks who maintain PCI (and other) services differentiate between > "generic" and "arch/platform" specific. Think first like a driver > writer and then worry about if/how that can be divided between platform > generic and platform/arch specific code. > > Even PCI-Express has *some* arch specific component. At a minimum each > architecture has it's own chipset and firmware to deal with > for PCI Express bus discovery and initialization. But driver writers > don't have to worry about that and they shouldn't for error > recovery either. Exactly. A given platform could use Intel's code as-is, or may choose to do things differently while still showing the same interface to drivers. Eventually we may end up adding platform hooks to the generic PCIE code like we have in the PCI code if some platforms require them. From linas at austin.ibm.com Sat Mar 19 11:35:32 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Fri, 18 Mar 2005 18:35:32 -0600 Subject: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery) In-Reply-To: <1111187582.1236.192.camel@gaston> References: <20050318181005.GA30909@colo.lackof.org> <1111187582.1236.192.camel@gaston> Message-ID: <20050319003532.GS498@austin.ibm.com> On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark: > > Additionally, in "real life", very few errors are cause by known errata. > If the drivers know about the errata, they usually already work around > them. Afaik, most of the errors are caused by transcient conditions on > the bus or the device, like a bit beeing flipped, or thermal > conditions... Heh. Let me describe "real life" a bit more accurately. We've been running with pci error detection enabled here for the last two years. 
Based on this experience, the ballpark figures are: 90% of all detected errors were device driver bugs coupled to pci card hardware errata 9% poorly seated pci cards (remove/reseat will make problem go away) 1% transient/other. We've seen *EVERY* and I mean *EVERY* device driver that we've put under stress tests (e.g. peak i/o rates for > 72 hours, e.g. massive tcp/nfs traffic, massive disk i/o traffic, etc), *EVERY* driver tripped on an EEH error detect that was traced back to a device driver bug. Not to blame the drivers, a lot of these were related to pci card hardware/firmware bugs. For example, I think grepping for "split completion" and "NAPI" in the patches/errata for e100 and e1000 for the last year will reveal some of the stuff that was found. As far as I know, for every bug found, a patch made it into mainline. As a rule, it seems that finding these device driver bugs was very hard; we had some people work on these for months, and in the case of the e1000, we managed to get Intel engineers to fly out here and stare at PCI bus traces for a few days. (Thanks Intel!) Ditto for Emulex. For ipr, we had inhouse people. So overall, PCI error detection did have the expected effect (protecting the kernel from corruption, e.g. due to DMA's going to wild addresses), but I don't think anybody expected that the vast majority would be software/hardware bugs, instead of transient effects. What's ironic in all of this is that by adding error recovery, device driver bugs will be able to hide more effectively ... if there's a pci bus error due to a driver bug, the pci card will get rebooted, the kernel will burp for 3 seconds, and things will keep going, and most sysadmins won't notice or won't care. --linas From benh at kernel.crashing.org Sat Mar 19 12:24:07 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 19 Mar 2005 12:24:07 +1100 Subject: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal.
(WAS:: [PATCH/RFC]PCIErrorRecovery) In-Reply-To: <20050319003532.GS498@austin.ibm.com> References: <20050318181005.GA30909@colo.lackof.org> <1111187582.1236.192.camel@gaston> <20050319003532.GS498@austin.ibm.com> Message-ID: <1111195447.25179.205.camel@gaston> On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote: > On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark: > > > > Additionally, in "real life", very few errors are caused by known errata. > > If the drivers know about the errata, they usually already work around > > them. Afaik, most of the errors are caused by transient conditions on > > the bus or the device, like a bit being flipped, or thermal > > conditions... > > > Heh. Let me describe "real life" a bit more accurately. > > We've been running with pci error detection enabled here for the last > two years. Based on this experience, the ballpark figures are: > > 90% of all detected errors were device driver bugs coupled to > pci card hardware errata Well, this has been in-lab testing to fight driver bugs/errata on early release kernels; I'm talking about the context of a released solution with stable drivers/hw. > 9% poorly seated pci cards (remove/reseat will make problem go away) > > 1% transient/other. Ok. > We've seen *EVERY* and I mean *EVERY* device driver that we've put > under stress tests (e.g. peak i/o rates for > 72 hours, e.g. > massive tcp/nfs traffic, massive disk i/o traffic, etc), *EVERY* > driver tripped on an EEH error detect that was traced back to > a device driver bug. Not to blame the drivers, a lot of these > were related to pci card hardware/firmware bugs. For example, > I think grepping for "split completion" and "NAPI" in the > patches/errata for e100 and e1000 for the last year will reveal > some of the stuff that was found. As far as I know, > for every bug found, a patch made it into mainline. Yah, those are a pain.
But then, it isn't the context described by Nguyen where the driver "knows" about the errata and how to recover. It's the context of a bug where the driver does not know what's going on and/or doesn't have the proper workaround. My point was more that there are very few cases where a driver will have to do recovery of a PCI error in known cases where it actually expects an error to happen. > As a rule, it seems that finding these device driver bugs was > very hard; we had some people work on these for months, and in > the case of the e1000, we managed to get Intel engineers to fly > out here and stare at PCI bus traces for a few days. (Thanks Intel!) > Ditto for Emulex. For ipr, we had inhouse people. > > So overall, PCI error detection did have the expected effect > (protecting the kernel from corruption, e.g. due to DMA's going > to wild addresses), but I don't think anybody expected that the > vast majority would be software/hardware bugs, instead of transient > effects. > > What's ironic in all of this is that by adding error recovery, > device driver bugs will be able to hide more effectively ... > if there's a pci bus error due to a driver bug, the pci card > will get rebooted, the kernel will burp for 3 seconds, and > things will keep going, and most sysadmins won't notice or > won't care. Yes, but it will be logged at least, so we'll spot a lot of these during our tests. Ben.

From benh at kernel.crashing.org Sat Mar 19 23:08:54 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 19 Mar 2005 23:08:54 +1100
Subject: sungem on imac G5
In-Reply-To: <20050318142448.GA26406@tucana.cl.uni-heidelberg.de>
References: <20050318010004.783B667A7D@ozlabs.org> <20050318142448.GA26406@tucana.cl.uni-heidelberg.de>
Message-ID: <1111234134.3835.213.camel@gaston>

On Fri, 2005-03-18 at 15:24 +0100, Markus Demleitner wrote:
>
> (a) The patch didn't apply cleanly, I had to fiddle in the second
> hunk manually (that one probably doesn't matter at all, I just wanted
> to mention it in case of a regression in the code I don't have)

Yup, it's against current bk

> (b) MACIO_OUT8 uses macio, which g5_eth_phy_reset doesn't define. Fixed
> it by adding
> struct macio_chip* macio = &macio_chips[0];
> to its local declarations.

Yah, I missed that bit, I didn't have time to test compile :)

> (c) There are two warnings remaining that I didn't care to fix (for
> now):
> arch/ppc64/kernel/pmac_feature.c:227: warning: unused variable `flags'
> arch/ppc64/kernel/pmac_feature.c:250: warning: control reaches end of non-void function
> (the line numbers are of course for my version)

I first added a lock then figured it wasn't necessary, so the `flags'
variable can be removed. The function should also get a return 0, though
the lack of it is harmless as sungem isn't testing the return value.
I'll fix those.

Thanks for testing !

Markus
>
> PS: While I'm here, current progress report on thermal control: I'm
> prototyping a thermal control driver in userspace python right now,
> basically doing PID (though I'm convinced there just has to be a
> better control algorithm for this particular problem).

Apple uses PID everywhere. I ported their algorithm in my existing
driver, though I'm really not convinced it's the best way to go (Apple
stuff tends to oscillate etc..).
However, I don't have the proper
calibration info to do something else; all they give me in the cpuid
eeprom on G5s are the actual factors for the PID algorithm, so I decided
to just re-implement the same algorithm.

> Trouble is
> that I have one fan that doesn't seem to have an effect on the
> temperatures (harddisk fan? I don't think I can see the hard disk
> temperature, though I have one sensor I cannot interpret at all, plus
> there is one sensor OF reads directly through i2c, which may well be
> the hard disk one). I'll rip the machine open soon to see where the
> fans are (have been reluctant so far since it isn't my machine).
> Good news: The machine switches itself off if it overheats :-)

Ok, good luck ! Once I finally have access to one of these, I'll try
tracing the darwin kernel with remote gdb to figure out where the sensor
data actually come from.

Ben.

From mikpe at csd.uu.se Tue Mar 22 02:19:01 2005
From: mikpe at csd.uu.se (Mikael Pettersson)
Date: Mon, 21 Mar 2005 16:19:01 +0100 (MET)
Subject: [PATCH][2.6.12-rc1-mm1] fix compile error in ppc64 prom.c
Message-ID: <200503211519.j2LFJ1os021884@harpo.it.uu.se>

Compiling 2.6.12-rc1-mm1 for ppc64 fails with:

arch/ppc64/kernel/prom.c:1691: error: syntax error before 'prom_reconfig_notifier'
arch/ppc64/kernel/prom.c:1692: error: field name not in record or union initializer
arch/ppc64/kernel/prom.c:1692: error: (near initialization for 'prom_reconfig_nb')
arch/ppc64/kernel/prom.c:1692: warning: initialization makes pointer from integer without a cast
make[1]: *** [arch/ppc64/kernel/prom.o] Error 1
make: *** [arch/ppc64/kernel] Error 2

Fix: repair the obvious syntax error (missing "=").
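
[Editorial sketch, not part of the original message.] The construct at issue is a C99 designated initializer, where each ".member" designator must be followed by "=" and a value. The standalone example below illustrates that form under stated assumptions: demo_notifier_block, demo_notifier and demo_call are hypothetical names made up for illustration, and the struct is a simplified stand-in, not the kernel's struct notifier_block.

```c
#include <stddef.h>

/* Hypothetical stand-in for a notifier block; NOT the kernel's definition. */
struct demo_notifier_block {
	int (*notifier_call)(unsigned long action, void *data);
	int priority;
};

/* Hypothetical callback: echoes the action so the call is observable. */
static int demo_notifier(unsigned long action, void *data)
{
	(void)data;
	return (int)action;
}

/* Each designated member needs "=" between the designator and its value. */
static struct demo_notifier_block demo_nb = {
	.notifier_call = demo_notifier,
	.priority = 10,
};

/* Invoke the registered callback, roughly as a chain walker would. */
static int demo_call(struct demo_notifier_block *nb, unsigned long action)
{
	return nb->notifier_call ? nb->notifier_call(action, NULL) : -1;
}
```

Dropping the "=" after ".notifier_call" gives the form that the error output quoted above complains about.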
Signed-off-by: Mikael Pettersson --- linux-2.6.12-rc1-mm1/arch/ppc64/kernel/prom.c.~1~ 2005-03-21 14:48:51.000000000 +0100 +++ linux-2.6.12-rc1-mm1/arch/ppc64/kernel/prom.c 2005-03-21 15:14:19.000000000 +0100 @@ -1688,7 +1688,7 @@ static int prom_reconfig_notifier(struct } static struct notifier_block prom_reconfig_nb = { - .notifier_call prom_reconfig_notifier, + .notifier_call = prom_reconfig_notifier, .priority = 10, /* This one needs to run first */ }; From mikpe at csd.uu.se Tue Mar 22 02:18:22 2005 From: mikpe at csd.uu.se (Mikael Pettersson) Date: Mon, 21 Mar 2005 16:18:22 +0100 (MET) Subject: [PATCH][2.6.12-rc1] fix gcc4 compile error in ppc64 paca.h Message-ID: <200503211518.j2LFIMP6021847@harpo.it.uu.se> Compiling 2.6.12-rc1 or 2.6.12-rc1-mm1 for ppc64 with gcc4 fails with: In file included from include/asm/spinlock.h:20, from include/linux/spinlock.h:43, from include/linux/signal.h:5, from arch/ppc64/kernel/asm-offsets.c:17: include/asm/paca.h:25: error: array type has incomplete element type make[1]: *** [arch/ppc64/kernel/asm-offsets.s] Error 1 make: *** [arch/ppc64/kernel/asm-offsets.s] Error 2 This is an array-of-incomplete-type error. Fix: move array decl to after the struct decl. 
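
[Editorial sketch, not part of the original message.] The rule behind this error is that C requires an array's element type to be complete at the point of the array declaration, while a pointer to an incomplete struct type is fine. The example below is a minimal illustration with hypothetical names (demo_paca, demo_table, demo_later); it is not the kernel's paca_struct layout.

```c
/*
 * Writing "extern struct demo_paca demo_table[];" here, before the struct
 * is defined, is the pattern gcc4 rejects: the element type would still be
 * incomplete at that point.
 */
struct demo_paca {
	int cpu_id;
	unsigned long flags;
};

/* After the full definition the element type is complete, so this is fine. */
extern struct demo_paca demo_table[];

/* A pointer to a still-incomplete type is allowed, unlike an array of it. */
struct demo_later;
struct demo_later *demo_ptr;

/* The definition that satisfies the extern declaration above. */
struct demo_paca demo_table[2] = { { 0, 0 }, { 1, 0 } };

/* Small accessor so the table is actually used. */
static int demo_cpu_id(int idx)
{
	return demo_table[idx].cpu_id;
}
```

This mirrors the ordering the patch establishes: the struct definition first, the extern array declaration after it.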
Signed-off-by: Mikael Pettersson --- linux-2.6.12-rc1/include/asm-ppc64/paca.h.~1~ 2005-03-02 19:24:19.000000000 +0100 +++ linux-2.6.12-rc1/include/asm-ppc64/paca.h 2005-03-21 15:29:26.000000000 +0100 @@ -22,7 +22,6 @@ #include #include -extern struct paca_struct paca[]; register struct paca_struct *local_paca asm("r13"); #define get_paca() local_paca @@ -115,4 +114,6 @@ struct paca_struct { #endif }; +extern struct paca_struct paca[]; + #endif /* _PPC64_PACA_H */ From mikpe at csd.uu.se Tue Mar 22 02:19:55 2005 From: mikpe at csd.uu.se (Mikael Pettersson) Date: Mon, 21 Mar 2005 16:19:55 +0100 (MET) Subject: [PATCH][2.6.12-rc1-mm1] fix ppc64 linkage error on G5 Message-ID: <200503211519.j2LFJtWU021931@harpo.it.uu.se> When 2.6.12-rc1-mm1 is configured for a ppc64/G5, so CONFIG_PPC_PSERIES is disabled, linking of vmlinux fails with: arch/ppc64/kernel/built-in.o(.text+0x7de0): In function `.sys_call_table32': : undefined reference to `.ppc_rtas' arch/ppc64/kernel/built-in.o(.text+0x8668): In function `.sys_call_table': : undefined reference to `.ppc_rtas' make: *** [.tmp_vmlinux1] Error 1 This is because 2.6.12-rc1-mm1 contains the apparently broken patch: >--- linux-2.6.12-rc1/arch/ppc64/kernel/misc.S 2005-03-17 21:43:54.000000000 -0800 >+++ 25/arch/ppc64/kernel/misc.S 2005-03-21 01:07:42.000000000 -0800 >@@ -680,7 +680,7 @@ _GLOBAL(kernel_thread) > ld r30,-16(r1) > blr > >-#ifndef CONFIG_PPC_PSERIES /* hack hack hack */ >+#ifdef CONFIG_PPC_RTAS /* hack hack hack */ > #define ppc_rtas sys_ni_syscall > #endif PPC_PSERIES implies PPC_RTAS. It seems someone tried to clean up the condition but accidentally negated it: on PSERIES the system call will now go to sys_ni_syscall, and on !PSERIES linking will fail. Fix: negate the condition. 
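
[Editorial sketch, not part of the original message.] The intent of the conditional is "alias the entry point to the not-implemented stub only when the real implementation is absent". The userspace example below shows that corrected sense under stated assumptions: DEMO_HAVE_RTAS, demo_rtas, demo_ni_syscall and demo_sys_call_table are hypothetical names standing in for CONFIG_PPC_RTAS, ppc_rtas, sys_ni_syscall and the syscall table.

```c
#include <errno.h>

/* Hypothetical "not implemented" stub, in the spirit of sys_ni_syscall. */
static long demo_ni_syscall(void)
{
	return -ENOSYS;
}

/* The real handler exists only when the (hypothetical) feature is enabled. */
#ifdef DEMO_HAVE_RTAS
static long demo_rtas(void)
{
	return 0;
}
#endif

/*
 * Correct sense: fall back to the stub when the feature is NOT configured.
 * With #ifdef here instead, the enabled case would be redirected to the
 * stub and the disabled case would reference an undefined symbol.
 */
#ifndef DEMO_HAVE_RTAS
#define demo_rtas demo_ni_syscall
#endif

typedef long (*demo_syscall_fn)(void);

/* A one-entry table that always resolves, whichever way the switch is set. */
static demo_syscall_fn demo_sys_call_table[] = { demo_rtas };
```

Built without DEMO_HAVE_RTAS, the table entry resolves to the stub and returns -ENOSYS; built with -DDEMO_HAVE_RTAS it calls the real handler.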
Signed-off-by: Mikael Pettersson --- linux-2.6.12-rc1-mm1/arch/ppc64/kernel/misc.S.~1~ 2005-03-21 14:48:51.000000000 +0100 +++ linux-2.6.12-rc1-mm1/arch/ppc64/kernel/misc.S 2005-03-21 15:22:04.000000000 +0100 @@ -680,7 +680,7 @@ _GLOBAL(kernel_thread) ld r30,-16(r1) blr -#ifdef CONFIG_PPC_RTAS /* hack hack hack */ +#ifndef CONFIG_PPC_RTAS /* hack hack hack */ #define ppc_rtas sys_ni_syscall #endif From anton at samba.org Tue Mar 22 03:32:59 2005 From: anton at samba.org (Anton Blanchard) Date: Tue, 22 Mar 2005 03:32:59 +1100 Subject: [PATCH] ppc64: fix linkage error on G5 In-Reply-To: <200503211519.j2LFJtWU021931@harpo.it.uu.se> References: <200503211519.j2LFJtWU021931@harpo.it.uu.se> Message-ID: <20050321163259.GA12509@krispykreme> > When 2.6.12-rc1-mm1 is configured for a ppc64/G5, so CONFIG_PPC_PSERIES > is disabled, linking of vmlinux fails with: > > arch/ppc64/kernel/built-in.o(.text+0x7de0): In function `.sys_call_table32': > : undefined reference to `.ppc_rtas' > arch/ppc64/kernel/built-in.o(.text+0x8668): In function `.sys_call_table': > : undefined reference to `.ppc_rtas' > make: *** [.tmp_vmlinux1] Error 1 It turns out we are trying to fix this problem twice, we may as well remove the #define hack and use cond_syscall. -- Move the ppc64 specific cond_syscall(ppc_rtas) into sys_ni.c so that it takes effect. With this fixed we can remove the #define hack. Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/misc.S~fix_ppc_rtas arch/ppc64/kernel/misc.S --- foobar2/arch/ppc64/kernel/misc.S~fix_ppc_rtas 2005-03-22 02:41:53.819634410 +1100 +++ foobar2-anton/arch/ppc64/kernel/misc.S 2005-03-22 02:41:53.851631972 +1100 @@ -680,10 +680,6 @@ _GLOBAL(kernel_thread) ld r30,-16(r1) blr -#ifdef CONFIG_PPC_RTAS /* hack hack hack */ -#define ppc_rtas sys_ni_syscall -#endif - /* Why isn't this a) automatic, b) written in 'C'? 
*/ .balign 8 _GLOBAL(sys_call_table32) diff -puN arch/ppc64/kernel/syscalls.c~fix_ppc_rtas arch/ppc64/kernel/syscalls.c --- foobar2/arch/ppc64/kernel/syscalls.c~fix_ppc_rtas 2005-03-22 02:41:53.825633952 +1100 +++ foobar2-anton/arch/ppc64/kernel/syscalls.c 2005-03-22 02:41:53.852631895 +1100 @@ -256,6 +256,3 @@ void do_show_syscall_exit(unsigned long { printk(" -> %lx, current=%p cpu=%d\n", r3, current, smp_processor_id()); } - -/* Only exists on P-series. */ -cond_syscall(ppc_rtas); diff -puN kernel/sys_ni.c~fix_ppc_rtas kernel/sys_ni.c --- foobar2/kernel/sys_ni.c~fix_ppc_rtas 2005-03-22 02:41:53.829633648 +1100 +++ foobar2-anton/kernel/sys_ni.c 2005-03-22 02:41:53.853631819 +1100 @@ -83,3 +83,4 @@ cond_syscall(sys_pciconfig_write); cond_syscall(sys_pciconfig_iobase); cond_syscall(sys32_ipc); cond_syscall(sys32_sysctl); +cond_syscall(ppc_rtas); From ntl at pobox.com Tue Mar 22 03:45:17 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 21 Mar 2005 10:45:17 -0600 Subject: [PATCH][2.6.12-rc1-mm1] fix compile error in ppc64 prom.c In-Reply-To: <200503211519.j2LFJ1os021884@harpo.it.uu.se> References: <200503211519.j2LFJ1os021884@harpo.it.uu.se> Message-ID: <20050321164517.GB16469@otto> On Mon, Mar 21, 2005 at 04:19:01PM +0100, Mikael Pettersson wrote: > Compiling 2.6.12-rc1-mm1 for ppc64 fails with: > > arch/ppc64/kernel/prom.c:1691: error: syntax error before 'prom_reconfig_notifier' > arch/ppc64/kernel/prom.c:1692: error: field name not in record or union initializer > arch/ppc64/kernel/prom.c:1692: error: (near initialization for 'prom_reconfig_nb') > arch/ppc64/kernel/prom.c:1692: warning: initialization makes pointer from integer without a cast > make[1]: *** [arch/ppc64/kernel/prom.o] Error 1 > make: *** [arch/ppc64/kernel] Error 2 > > Fix: repair the obvious syntax error (missing "="). Thanks for the fix; the mistake was mine. 
Lest Andrew and Paulus think I'm sending untested patches, the compiler I'm using (gcc 3.3.3-hammer) does not give an error or even a warning. Sorry for the inconvenience; I'll have to upgrade to a less forgiving version of gcc. Nathan From anton at samba.org Tue Mar 22 04:13:49 2005 From: anton at samba.org (Anton Blanchard) Date: Tue, 22 Mar 2005 04:13:49 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall Message-ID: <20050321171349.GA23908@krispykreme> As with sparc64, the ppc64 version of semtimedop was incorrect - the timeout is in the fifth argument. I got caught copying again :) Dave: the sparc64 change only caught the second hunk, was that intended? Signed-off-by: Anton Blanchard ===== arch/ppc64/kernel/sys_ppc32.c 1.104 vs edited ===== --- 1.104/arch/ppc64/kernel/sys_ppc32.c 2005-01-26 08:50:22 +11:00 +++ edited/arch/ppc64/kernel/sys_ppc32.c 2005-03-21 11:09:32 +11:00 @@ -504,11 +504,11 @@ switch (call) { case SEMTIMEDOP: - if (third) + if (fifth) /* sign extend semid */ return compat_sys_semtimedop((int)first, compat_ptr(ptr), second, - compat_ptr(third)); + compat_ptr(fifth)); /* else fall through for normal semop() */ case SEMOP: /* struct sembuf is the same on 32 and 64bit :)) */ From agl at us.ibm.com Tue Mar 22 03:26:39 2005 From: agl at us.ibm.com (Adam Litke) Date: Mon, 21 Mar 2005 10:26:39 -0600 Subject: Hugepage COW In-Reply-To: <20050317034844.GD14048@localhost.localdomain> References: <1109085505.5217.28.camel@localhost.localdomain> <20050223070322.GF24473@localhost.localdomain> <1111006896.3635.24.camel@localhost.localdomain> <20050317034844.GD14048@localhost.localdomain> Message-ID: <1111422399.3635.51.camel@localhost.localdomain> On Wed, 2005-03-16 at 21:48, David Gibson wrote: > As far as the consolidation patch goes, lack of testing was the main > objective reason for holding back. So if you could test on x86_64 and > ia64 too, that would be great. 
wli had some objections to the patch
> when I first posted which I didn't and don't really understand, and
> from conversations with akpm, I'm certainly thinking of just
> re-sending it.

I am in the process of obtaining an x86_64 box to test these on. IA64
hardware could prove more difficult to find though.

>
> COW will be a bit more of a political shitfight, I suspect. I'd like
> to at least hold off until the consolidation is merged, which makes
> the COW much easier. We'll also need to implement the necessary
> arch-hooks for COW on every platform. Speaking of which, did you
> implement the i386 hooks? I thought I only did COW for ppc64, so far,
> although on top of the consolidation patch the amount of arch code is
> vastly reduced.

The version of your cow that I have has no hooks (not even for ppc64).
It applies on top of the consolidation patch. It is working fine on x86
but I am investigating a possible problem on ppc64.

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

From haveblue at us.ibm.com Tue Mar 22 02:56:10 2005
From: haveblue at us.ibm.com (Dave Hansen)
Date: Mon, 21 Mar 2005 07:56:10 -0800
Subject: CONFIG_NUMA and DISCONTIG inconsistencies
Message-ID: <1111420571.9648.88.camel@localhost>

First of all, what are the machines that require

CONFIG_DISCONTIGMEM=y
# CONFIG_NUMA is not set

That appears to be the ppc64 defconfig, and it doesn't make a lot of
sense to me. First of all, why do you need DISCONTIG without NUMA?
Should that even be allowed?

Also, I see this:

static inline int pa_to_nid(unsigned long pa)
{
	int nid;

	nid = numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT];
	...
	return nid;
}

in mmzone.h under #ifdef CONFIG_DISCONTIGMEM. Seems much more like a
NUMA thing to me. Two completely untested patches to fix this up
attached. Comments?

-- Dave

-------------- next part --------------
A non-text attachment was scrubbed...
Name: A1-sparse-prep-ppc64-remove-numa-debug.patch Type: text/x-patch Size: 2569 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050321/2786abc4/attachment.bin -------------- next part -------------- A non-text attachment was scrubbed... Name: A2-sparse-prep-discontig-alone.patch Type: text/x-patch Size: 3411 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050321/2786abc4/attachment-0001.bin From anton at samba.org Tue Mar 22 04:38:06 2005 From: anton at samba.org (Anton Blanchard) Date: Tue, 22 Mar 2005 04:38:06 +1100 Subject: [PATCH] ppc64: fix pseries hcall stubs Message-ID: <20050321173806.GB23908@krispykreme> Fix a number of bugs in our pseries hcall stubs: - store parameters in the area specified by the ABI, no need to create stack frames. - plpar_hcall_4out would corrupt r14 - merge multiple HVSC definitions Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/head.S~fix_pseries_hcalls arch/ppc64/kernel/head.S --- foobar2/arch/ppc64/kernel/head.S~fix_pseries_hcalls 2005-03-21 11:46:19.796654559 +1100 +++ foobar2-anton/arch/ppc64/kernel/head.S 2005-03-21 11:46:19.834651652 +1100 @@ -37,6 +37,7 @@ #include #include #include +#include #ifdef CONFIG_PPC_ISERIES #define DO_SOFT_DISABLE @@ -45,7 +46,6 @@ /* * hcall interface to pSeries LPAR */ -#define HVSC .long 0x44000022 #define H_SET_ASR 0x30 /* diff -puN arch/ppc64/kernel/pSeries_hvCall.S~fix_pseries_hcalls arch/ppc64/kernel/pSeries_hvCall.S --- foobar2/arch/ppc64/kernel/pSeries_hvCall.S~fix_pseries_hcalls 2005-03-21 11:46:19.801654177 +1100 +++ foobar2-anton/arch/ppc64/kernel/pSeries_hvCall.S 2005-03-21 11:47:44.485802897 +1100 @@ -1,7 +1,6 @@ /* * arch/ppc64/kernel/pSeries_hvCall.S * - * * This file contains the generic code to perform a call to the * pSeries LPAR hypervisor. * NOTE: this file will go away when we move to inline this work. 
@@ -11,133 +10,114 @@ * as published by the Free Software Foundation; either version * 2 of the License, or (at your option) any later version. */ -#include -#include -#include +#include #include -#include -#include #include -/* - * hcall interface to pSeries LPAR - */ -#define HVSC .long 0x44000022 - -/* long plpar_hcall(unsigned long opcode, R3 - unsigned long arg1, R4 - unsigned long arg2, R5 - unsigned long arg3, R6 - unsigned long arg4, R7 - unsigned long *out1, R8 - unsigned long *out2, R9 - unsigned long *out3); R10 - */ +#define STK_PARM(i) (48 + ((i)-3)*8) .text + +/* long plpar_hcall(unsigned long opcode, R3 + unsigned long arg1, R4 + unsigned long arg2, R5 + unsigned long arg3, R6 + unsigned long arg4, R7 + unsigned long *out1, R8 + unsigned long *out2, R9 + unsigned long *out3); R10 + */ _GLOBAL(plpar_hcall) mfcr r0 - std r0,-8(r1) - stdu r1,-32(r1) - std r8,-8(r1) /* Save out ptrs. */ - std r9,-16(r1) - std r10,-24(r1) - - HVSC /* invoke the hypervisor */ + std r8,STK_PARM(r8)(r1) /* Save out ptrs */ + std r9,STK_PARM(r9)(r1) + std r10,STK_PARM(r10)(r1) + + stw r0,8(r1) + + HVSC /* invoke the hypervisor */ + + lwz r0,8(r1) + + ld r8,STK_PARM(r8)(r1) /* Fetch r4-r6 ret args */ + ld r9,STK_PARM(r9)(r1) + ld r10,STK_PARM(r10)(r1) + std r4,0(r8) + std r5,0(r9) + std r6,0(r10) - ld r10,-8(r1) /* Fetch r4-r7 ret args. 
*/ - std r4,0(r10) - ld r10,-16(r1) - std r5,0(r10) - ld r10,-24(r1) - std r6,0(r10) - - ld r1,0(r1) - ld r0,-8(r1) mtcrf 0xff,r0 - blr /* return r3 = status */ + blr /* return r3 = status */ /* Simple interface with no output values (other than status) */ _GLOBAL(plpar_hcall_norets) mfcr r0 - std r0,-8(r1) - HVSC /* invoke the hypervisor */ - ld r0,-8(r1) - mtcrf 0xff,r0 - blr /* return r3 = status */ + stw r0,8(r1) + HVSC /* invoke the hypervisor */ -/* long plpar_hcall_8arg_2ret(unsigned long opcode, R3 - unsigned long arg1, R4 - unsigned long arg2, R5 - unsigned long arg3, R6 - unsigned long arg4, R7 - unsigned long arg5, R8 - unsigned long arg6, R9 - unsigned long arg7, R10 - unsigned long arg8, 112(R1) - unsigned long *out1); 120(R1) + lwz r0,8(r1) + mtcrf 0xff,r0 + blr /* return r3 = status */ - */ - .text +/* long plpar_hcall_8arg_2ret(unsigned long opcode, R3 + unsigned long arg1, R4 + unsigned long arg2, R5 + unsigned long arg3, R6 + unsigned long arg4, R7 + unsigned long arg5, R8 + unsigned long arg6, R9 + unsigned long arg7, R10 + unsigned long arg8, 112(R1) + unsigned long *out1); 120(R1) + */ _GLOBAL(plpar_hcall_8arg_2ret) mfcr r0 + ld r11,STK_PARM(r11)(r1) /* put arg8 in R11 */ + stw r0,8(r1) - ld r11, 112(r1) /* put arg8 and out1 in R11 and R12 */ - ld r12, 120(r1) - - std r0,-8(r1) - stdu r1,-32(r1) + HVSC /* invoke the hypervisor */ - std r12,-8(r1) /* Save out ptr */ - - HVSC /* invoke the hypervisor */ - - ld r10,-8(r1) /* Fetch r4 ret arg */ - std r4,0(r10) - - ld r1,0(r1) - ld r0,-8(r1) + lwz r0,8(r1) + ld r10,STK_PARM(r12)(r1) /* Fetch r4 ret arg */ + std r4,0(r10) mtcrf 0xff,r0 - blr /* return r3 = status */ + blr /* return r3 = status */ -/* long plpar_hcall_4out(unsigned long opcode, R3 - unsigned long arg1, R4 - unsigned long arg2, R5 - unsigned long arg3, R6 - unsigned long arg4, R7 - unsigned long *out1, (r4) R8 - unsigned long *out2, (r5) R9 - unsigned long *out3, (r6) R10 - unsigned long *out4); (r7) 112(R1). From Parameter save area. 
+/* long plpar_hcall_4out(unsigned long opcode, R3 + unsigned long arg1, R4 + unsigned long arg2, R5 + unsigned long arg3, R6 + unsigned long arg4, R7 + unsigned long *out1, R8 + unsigned long *out2, R9 + unsigned long *out3, R10 + unsigned long *out4); 112(R1) */ _GLOBAL(plpar_hcall_4out) mfcr r0 - std r0,-8(r1) - ld r14,112(r1) - stdu r1,-48(r1) - - std r8,32(r1) /* Save out ptrs. */ - std r9,24(r1) - std r10,16(r1) - std r14,8(r1) - - HVSC /* invoke the hypervisor */ + stw r0,8(r1) - ld r14,32(r1) /* Fetch r4-r7 ret args. */ - std r4,0(r14) - ld r14,24(r1) - std r5,0(r14) - ld r14,16(r1) - std r6,0(r14) - ld r14,8(r1) - std r7,0(r14) + std r8,STK_PARM(r8)(r1) /* Save out ptrs */ + std r9,STK_PARM(r9)(r1) + std r10,STK_PARM(r10)(r1) + + HVSC /* invoke the hypervisor */ + + lwz r0,8(r1) + + ld r8,STK_PARM(r8)(r1) /* Fetch r4-r7 ret args */ + ld r9,STK_PARM(r9)(r1) + ld r10,STK_PARM(r10)(r1) + ld r11,STK_PARM(r11)(r1) + std r4,0(r8) + std r5,0(r9) + std r6,0(r10) + std r7,0(r11) - ld r1,0(r1) - ld r0,-8(r1) mtcrf 0xff,r0 - blr /* return r3 = status */ + blr /* return r3 = status */ diff -puN include/asm-ppc64/hvcall.h~fix_pseries_hcalls include/asm-ppc64/hvcall.h --- foobar2/include/asm-ppc64/hvcall.h~fix_pseries_hcalls 2005-03-21 11:46:19.806653794 +1100 +++ foobar2-anton/include/asm-ppc64/hvcall.h 2005-03-21 11:46:19.829652035 +1100 @@ -1,6 +1,8 @@ #ifndef _PPC64_HVCALL_H #define _PPC64_HVCALL_H +#define HVSC .long 0x44000022 + #define H_Success 0 #define H_Busy 1 /* Hardware busy -- retry later */ #define H_Constrained 4 /* Resource request constrained to max allowed */ @@ -41,7 +43,7 @@ /* Flags */ #define H_LARGE_PAGE (1UL<<(63-16)) -#define H_EXACT (1UL<<(63-24)) /* Use exact PTE or return H_PTEG_FULL */ +#define H_EXACT (1UL<<(63-24)) /* Use exact PTE or return H_PTEG_FULL */ #define H_R_XLATE (1UL<<(63-25)) /* include a valid logical page num in the pte if the valid bit is set */ #define H_READ_4 (1UL<<(63-26)) /* Return 4 PTEs */ #define H_AVPN 
(1UL<<(63-32)) /* An avpn is provided as a sanity test */ @@ -54,8 +56,6 @@ #define H_PP1 (1UL<<(63-62)) #define H_PP2 (1UL<<(63-63)) - - /* pSeries hypervisor opcodes */ #define H_REMOVE 0x04 #define H_ENTER 0x08 @@ -108,6 +108,8 @@ #define H_FREE_VTERM 0x158 #define H_POLL_PENDING 0x1D8 +#ifndef __ASSEMBLY__ + /* plpar_hcall() -- Generic call interface using above opcodes * * The actual call interface is a hypervisor call instruction with @@ -125,8 +127,6 @@ long plpar_hcall(unsigned long opcode, unsigned long *out2, unsigned long *out3); -#define HVSC ".long 0x44000022\n" - /* Same as plpar_hcall but for those opcodes that return no values * other than status. Slightly more efficient. */ @@ -147,9 +147,6 @@ long plpar_hcall_8arg_2ret(unsigned long unsigned long arg7, unsigned long arg8, unsigned long *out1); - - - /* plpar_hcall_4out() * @@ -166,4 +163,5 @@ long plpar_hcall_4out(unsigned long opco unsigned long *out3, unsigned long *out4); +#endif /* __ASSEMBLY__ */ #endif /* _PPC64_HVCALL_H */ _ From anton at samba.org Tue Mar 22 04:48:23 2005 From: anton at samba.org (Anton Blanchard) Date: Tue, 22 Mar 2005 04:48:23 +1100 Subject: [PATCH] ppc64: fix gcc4 compile error in paca.h In-Reply-To: <200503211518.j2LFIMP6021847@harpo.it.uu.se> References: <200503211518.j2LFIMP6021847@harpo.it.uu.se> Message-ID: <20050321174823.GC23908@krispykreme> Thanks, looks good. Anton -- From: Mikael Pettersson Compiling 2.6.12-rc1 or 2.6.12-rc1-mm1 for ppc64 with gcc4 fails with: In file included from include/asm/spinlock.h:20, from include/linux/spinlock.h:43, from include/linux/signal.h:5, from arch/ppc64/kernel/asm-offsets.c:17: include/asm/paca.h:25: error: array type has incomplete element type make[1]: *** [arch/ppc64/kernel/asm-offsets.s] Error 1 make: *** [arch/ppc64/kernel/asm-offsets.s] Error 2 This is an array-of-incomplete-type error. Fix: move array decl to after the struct decl. 
Signed-off-by: Mikael Pettersson Signed-off-by: Anton Blanchard --- linux-2.6.12-rc1/include/asm-ppc64/paca.h.~1~ 2005-03-02 19:24:19.000000000 +0100 +++ linux-2.6.12-rc1/include/asm-ppc64/paca.h 2005-03-21 15:29:26.000000000 +0100 @@ -22,7 +22,6 @@ #include #include -extern struct paca_struct paca[]; register struct paca_struct *local_paca asm("r13"); #define get_paca() local_paca @@ -115,4 +114,6 @@ struct paca_struct { #endif }; +extern struct paca_struct paca[]; + #endif /* _PPC64_PACA_H */ From jschopp at austin.ibm.com Tue Mar 22 04:58:18 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 21 Mar 2005 11:58:18 -0600 Subject: CONFIG_NUMA and DISCONTIG inconsistencies In-Reply-To: <1111420571.9648.88.camel@localhost> References: <1111420571.9648.88.camel@localhost> Message-ID: <423F0B3A.8080705@austin.ibm.com> A wise man once told me to send 1 patch per email. It is good to see he ignores his own advice sometimes :) > First of all, what are the machines that require > > CONFIG_DISCONTIGMEM=y > # CONFIG_NUMA is not set > > That appears to be the ppc64 defconfig, and it doesn't make a lot of > sense to me. First of all, why do you need DISCONTIG without NUMA? > Should that even be allowed? Seems silly to me too. Somebody want to set Dave and me straight? First patch: DEBUG_NUMA seems to clutter things up without adding any value. This patch actually removes all of it despite being so small. Paul or Anton, could you apply and forward on to -mm? 
Acked-by: Joel Schopp > --- > > memhotplug-dave/arch/ppc64/mm/numa.c | 6 ------ > memhotplug-dave/include/asm-ppc64/mmzone.h | 17 +---------------- > memhotplug-dave/include/asm-ppc64/topology.h | 10 +--------- > 3 files changed, 2 insertions(+), 31 deletions(-) > > diff -puN include/asm-ppc64/mmzone.h~ppc64-remove-numa-debug include/asm-ppc64/mmzone.h > --- memhotplug/include/asm-ppc64/mmzone.h~ppc64-remove-numa-debug 2005-03-21 07:33:36.000000000 -0800 > +++ memhotplug-dave/include/asm-ppc64/mmzone.h 2005-03-21 07:34:12.000000000 -0800 > @@ -27,24 +27,9 @@ extern int nr_cpus_in_node[]; > #define MEMORY_INCREMENT_SHIFT 24 > #define MEMORY_INCREMENT (1UL << MEMORY_INCREMENT_SHIFT) > > -/* NUMA debugging, will not work on a DLPAR machine */ > -#undef DEBUG_NUMA > - > static inline int pa_to_nid(unsigned long pa) > { > - int nid; > - > - nid = numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT]; > - > -#ifdef DEBUG_NUMA > - /* the physical address passed in is not in the map for the system */ > - if (nid == -1) { > - printk("bad address: %lx\n", pa); > - BUG(); > - } > -#endif > - > - return nid; > + return numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT]; > } > > #define pfn_to_nid(pfn) pa_to_nid((pfn) << PAGE_SHIFT) > diff -puN include/linux/mmzone.h~ppc64-remove-numa-debug include/linux/mmzone.h > diff -puN include/asm-ppc64/page.h~ppc64-remove-numa-debug include/asm-ppc64/page.h > diff -puN include/linux/mm.h~ppc64-remove-numa-debug include/linux/mm.h > diff -puN arch/ppc64/mm/numa.c~ppc64-remove-numa-debug arch/ppc64/mm/numa.c > --- memhotplug/arch/ppc64/mm/numa.c~ppc64-remove-numa-debug 2005-03-21 07:33:36.000000000 -0800 > +++ memhotplug-dave/arch/ppc64/mm/numa.c 2005-03-21 07:34:42.000000000 -0800 > @@ -26,12 +26,6 @@ static int numa_enabled = 1; > static int numa_debug; > #define dbg(args...) 
if (numa_debug) { printk(KERN_INFO args); } > > -#ifdef DEBUG_NUMA > -#define ARRAY_INITIALISER -1 > -#else > -#define ARRAY_INITIALISER 0 > -#endif > - > int numa_cpu_lookup_table[NR_CPUS] = { [ 0 ... (NR_CPUS - 1)] = > ARRAY_INITIALISER}; > char *numa_memory_lookup_table; > diff -puN include/asm-ppc64/topology.h~ppc64-remove-numa-debug include/asm-ppc64/topology.h > --- memhotplug/include/asm-ppc64/topology.h~ppc64-remove-numa-debug 2005-03-21 07:35:03.000000000 -0800 > +++ memhotplug-dave/include/asm-ppc64/topology.h 2005-03-21 07:35:16.000000000 -0800 > @@ -8,15 +8,7 @@ > > static inline int cpu_to_node(int cpu) > { > - int node; > - > - node = numa_cpu_lookup_table[cpu]; > - > -#ifdef DEBUG_NUMA > - BUG_ON(node == -1); > -#endif > - > - return node; > + return numa_cpu_lookup_table[cpu]; > } > > #define parent_node(node) (node) > _ > > > ------------------------------------------------------------------------ > The second patch isn't quite ready for primetime yet. I'll fix it and repost today. > > > --- > > arch/ppc64/mm/numa.c | 0 > memhotplug-dave/include/asm-ppc64/mmzone.h | 38 ++++++----------------------- > memhotplug-dave/include/linux/mm.h | 2 - > 3 files changed, 10 insertions(+), 30 deletions(-) > > diff -puN include/asm-ppc64/mmzone.h~B-sparse-171-discontig-alone include/asm-ppc64/mmzone.h > --- memhotplug/include/asm-ppc64/mmzone.h~B-sparse-171-discontig-alone 2005-03-21 07:36:49.000000000 -0800 > +++ memhotplug-dave/include/asm-ppc64/mmzone.h 2005-03-21 07:44:36.000000000 -0800 > @@ -10,14 +10,8 @@ > #include > #include > > -#ifdef CONFIG_DISCONTIGMEM > - > +#ifdef CONFIG_NUMA > extern struct pglist_data *node_data[]; > - > -/* > - * Following are specific to this numa platform. 
> - */ > - > extern int numa_cpu_lookup_table[]; > extern char *numa_memory_lookup_table; > extern cpumask_t numa_cpumask_lookup_table[]; > @@ -31,32 +25,21 @@ static inline int pa_to_nid(unsigned lon > { > return numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT]; > } > - > -#define pfn_to_nid(pfn) pa_to_nid((pfn) << PAGE_SHIFT) > - > -/* > - * Return a pointer to the node data for node n. > - */ > #define NODE_DATA(nid) (node_data[nid]) > - > #define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn) > - > -/* > - * Following are macros that each numa implmentation must define. > - */ > - > -/* > - * Given a kernel address, find the home node of the underlying memory. > - */ > #define kvaddr_to_nid(kaddr) pa_to_nid(__pa(kaddr)) > +#define pfn_to_nid(pfn) pa_to_nid((pfn) << PAGE_SHIFT) > +#endif /* CONFIG_NUMA */ > + > +#define local_mapnr(kvaddr) \ > + ( (__pa(kvaddr) >> PAGE_SHIFT) - node_start_pfn(kvaddr_to_nid(kvaddr)) > +#define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn) If NUMA is turned on node_localnr() gets defined twice. 
> > #define node_mem_map(nid) (NODE_DATA(nid)->node_mem_map) > #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) > #define node_end_pfn(nid) (NODE_DATA(nid)->node_end_pfn) > > -#define local_mapnr(kvaddr) \ > - ( (__pa(kvaddr) >> PAGE_SHIFT) - node_start_pfn(kvaddr_to_nid(kvaddr)) > - > +#ifdef CONFIG_DISCONTIGMEM > /* Written this way to avoid evaluating arguments twice */ > #define discontigmem_pfn_to_page(pfn) \ > ({ \ > @@ -64,16 +47,13 @@ static inline int pa_to_nid(unsigned lon > (node_mem_map(pfn_to_nid(__tmp)) + \ > node_localnr(__tmp, pfn_to_nid(__tmp))); \ > }) > - > #define discontigmem_page_to_pfn(p) \ > ({ \ > struct page *__tmp = p; \ > (((__tmp) - page_zone(__tmp)->zone_mem_map) + \ > page_zone(__tmp)->zone_start_pfn); \ > }) > - > -/* XXX fix for discontiguous physical memory */ > #define discontigmem_pfn_valid(pfn) ((pfn) < num_physpages) > - > #endif /* CONFIG_DISCONTIGMEM */ > + > #endif /* _ASM_MMZONE_H_ */ > diff -puN include/linux/mmzone.h~B-sparse-171-discontig-alone include/linux/mmzone.h > diff -puN include/asm-ppc64/page.h~B-sparse-171-discontig-alone include/asm-ppc64/page.h > diff -puN include/linux/mm.h~B-sparse-171-discontig-alone include/linux/mm.h > --- memhotplug/include/linux/mm.h~B-sparse-171-discontig-alone 2005-03-21 07:36:49.000000000 -0800 > +++ memhotplug-dave/include/linux/mm.h 2005-03-21 07:36:49.000000000 -0800 > @@ -462,7 +462,7 @@ static inline void set_page_links(struct > set_page_node(page, node); > } > > -#ifndef CONFIG_DISCONTIGMEM > +#ifdef CONFIG_FLATMEM This needs to be pulled out separate since it depends on a as yet unposted patch. From davem at redhat.com Tue Mar 22 04:59:02 2005 From: davem at redhat.com (David S. 
Miller)
Date: Mon, 21 Mar 2005 09:59:02 -0800
Subject: [PATCH] ppc64: fix semtimedop compat syscall
In-Reply-To: <20050321171349.GA23908@krispykreme>
References: <20050321171349.GA23908@krispykreme>
Message-ID: <20050321095902.2ad6b6be.davem@redhat.com>

On Tue, 22 Mar 2005 04:13:49 +1100 Anton Blanchard wrote:

> Dave: the sparc64 change only caught the second hunk, was that intended?

No, it wasn't intentional. I fixed this in my tree, good spotting.

This compat_sys_ipc() probably is starting to become similar enough (at
least across ppc and sparc) to go to some common place. Don't ya
think? :-)

From olof at austin.ibm.com Tue Mar 22 06:01:00 2005
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 21 Mar 2005 13:01:00 -0600
Subject: CONFIG_NUMA and DISCONTIG inconsistencies
In-Reply-To: <1111420571.9648.88.camel@localhost>
References: <1111420571.9648.88.camel@localhost>
Message-ID: <20050321190100.GA28665@austin.ibm.com>

On Mon, Mar 21, 2005 at 07:56:10AM -0800, Dave Hansen wrote:
>
> in mmzone.h under #ifdef CONFIG_DISCONTIGMEM. Seems much more like a
> NUMA thing to me. Two completely untested patches to fix this up
> attached. Comments?

Yes. I suggest boot testing your patches before posting them; that way
you can catch embarrassing build errors and keep others from wasting
their time finding them.

> -/* NUMA debugging, will not work on a DLPAR machine */
> -#undef DEBUG_NUMA

Why are you removing the debug support under this patch? If you want to
remove it, please do it explicitly under a different patch.

> -#ifdef DEBUG_NUMA
> -#define ARRAY_INITIALISER -1
> -#else
> -#define ARRAY_INITIALISER 0
> -#endif
> -
>  int numa_cpu_lookup_table[NR_CPUS] = { [ 0 ... (NR_CPUS - 1)] =
> ARRAY_INITIALISER};

How do you expect this to build if you remove ARRAY_INITIALISER?
-Olof From haveblue at us.ibm.com Tue Mar 22 06:24:48 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Mon, 21 Mar 2005 11:24:48 -0800 Subject: CONFIG_NUMA and DISCONTIG inconsistencies In-Reply-To: <20050321190100.GA28665@austin.ibm.com> References: <1111420571.9648.88.camel@localhost> <20050321190100.GA28665@austin.ibm.com> Message-ID: <1111433088.5045.2.camel@localhost> On Mon, 2005-03-21 at 13:01 -0600, Olof Johansson wrote: > Why are you removing the debug support under this patch? It takes up valuable visual space in a header file where the #ifdefs are already confusing. It's also supposedly non-functional. The developers who get bugs for things like NUMA support also tend to have their own, much more comprehensive patches to do the same kinds of things. Basically, it got in my way when I wanted to make that file look better. > If you want to > remove it, please do it explicitly under a different patch. I actually attached two plain-text patches in that message. They were separated when I attached them :) > > -#ifdef DEBUG_NUMA > > -#define ARRAY_INITIALISER -1 > > -#else > > -#define ARRAY_INITIALISER 0 > > -#endif > > - > > int numa_cpu_lookup_table[NR_CPUS] = { [ 0 ... (NR_CPUS - 1)] = > > ARRAY_INITIALISER}; > > How do you expect this to build if you remove ARRAY_INITIALISER? Hehe. Minor detail. I think Joel is going to make a more presentable (read: compiling) version. I'm sure he'll get that part. 
-- Dave From mikpe at csd.uu.se Tue Mar 22 06:30:51 2005 From: mikpe at csd.uu.se (Mikael Pettersson) Date: Mon, 21 Mar 2005 20:30:51 +0100 (MET) Subject: [PATCH] ppc64: fix linkage error on G5 Message-ID: <200503211930.j2LJUp7S003701@harpo.it.uu.se> On Tue, 22 Mar 2005 03:32:59 +1100, Anton Blanchard wrote: >> When 2.6.12-rc1-mm1 is configured for a ppc64/G5, so CONFIG_PPC_PSERIES >> is disabled, linking of vmlinux fails with: >> >> arch/ppc64/kernel/built-in.o(.text+0x7de0): In function `.sys_call_table32': >> : undefined reference to `.ppc_rtas' >> arch/ppc64/kernel/built-in.o(.text+0x8668): In function `.sys_call_table': >> : undefined reference to `.ppc_rtas' >> make: *** [.tmp_vmlinux1] Error 1 > >It turns out we are trying to fix this problem twice, we may as well >remove the #define hack and use cond_syscall. > >-- > >Move the ppc64 specific cond_syscall(ppc_rtas) into sys_ni.c so that it >takes effect. With this fixed we can remove the #define hack. This worked fine. Thanks. /Mikael From olof at austin.ibm.com Tue Mar 22 07:12:41 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Mon, 21 Mar 2005 14:12:41 -0600 Subject: CONFIG_NUMA and DISCONTIG inconsistencies In-Reply-To: <1111433088.5045.2.camel@localhost> References: <1111420571.9648.88.camel@localhost> <20050321190100.GA28665@austin.ibm.com> <1111433088.5045.2.camel@localhost> Message-ID: <20050321201241.GB28665@austin.ibm.com> On Mon, Mar 21, 2005 at 11:24:48AM -0800, Dave Hansen wrote: > > If you want to > > remove it, please do it explicitly under a different patch. > > I actually attached two plain-text patches in that message. They were > separated when I attached them :) One patch per email, please. So, separate patch = separate email. And that means separate patch description, which would have avoided this chain of emails in the first place. 
Anyway, back to your original question: I think we have machines with discontiguous memory (due to I/O holes) without having NUMA support on the machines, i.e. some POWER3/RS64 boxes. Unfortunately I don't have such a machine with sufficient memory configured to take a look at. > > How do you expect this to build if you remove ARRAY_INITIALISER? > > Hehe. Minor detail. I think Joel is going to make a more presentable > (read: compiling) version. I'm sure he'll get that part. Joel just commented on your patch and acked it, with the same build error in it. I guess that got me confused. -Olof From haveblue at us.ibm.com Tue Mar 22 07:21:19 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Mon, 21 Mar 2005 12:21:19 -0800 Subject: CONFIG_NUMA and DISCONTIG inconsistencies In-Reply-To: <20050321201241.GB28665@austin.ibm.com> References: <1111420571.9648.88.camel@localhost> <20050321190100.GA28665@austin.ibm.com> <1111433088.5045.2.camel@localhost> <20050321201241.GB28665@austin.ibm.com> Message-ID: <1111436479.5045.12.camel@localhost> On Mon, 2005-03-21 at 14:12 -0600, Olof Johansson wrote: > I think we have machines with discontiguous memory (due to I/O holes) > without having NUMA support on the machines, i.e. some POWER3/RS64 > boxes. DISCONTIGMEM only provides discontiguous mem_maps between different zones. The mem_map[] inside of any single zone is contiguous. As I understand it, most of the 64-bit arches keep all of their memory in one zone, and they only separate memory into different zones when there are multiple NUMA nodes. I guess I don't understand what DISCONTIGMEM provides on !NUMA machines, even ones like the POWER3 machines. -- Dave From iamroot at ca.ibm.com Tue Mar 22 07:16:40 2005 From: iamroot at ca.ibm.com (Omkhar Arasaratnam) Date: Mon, 21 Mar 2005 15:16:40 -0500 Subject: [Fwd: Re: [BUG] 2.6.11- sym53c8xx Broken on pp64] Message-ID: <423F2BA8.9080503@ca.ibm.com> Sorry it took so long to move this over - can we further the investigation?
Omkhar -------------- next part -------------- An embedded message was scrubbed... From: Benjamin Herrenschmidt Subject: Re: [BUG] 2.6.11- sym53c8xx Broken on pp64 Date: Wed, 16 Mar 2005 10:38:40 +1100 Size: 3616 Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050321/fde65190/attachment.eml From iamroot at ca.ibm.com Tue Mar 22 08:01:19 2005 From: iamroot at ca.ibm.com (Omkhar Arasaratnam) Date: Mon, 21 Mar 2005 16:01:19 -0500 Subject: [BUG] 2.6.11- sym53c8xx Broken on pp64 Message-ID: <423F361F.8010609@ca.ibm.com> Sorry, I just realized that Thunderbird decided to forward the email as an attachment. So here is a summary: On bringup I see this on a p630: sym0: No NVRAM, ID 7, Fast-80 LVD, parity checking CACHE TEST FAILED: DMA error (dstat=0xa0) .sym0: CACHE INCORRECTLY CONFIGURED sym0: giving up ... No issue with 2.6.10 or 2.6.9 with the driver. Copying the sym2/ dir from 2.6.10 over to the 2.6.11.3 dir doesn't help (same error). Ideas? Omkhar From anton at samba.org Tue Mar 22 08:22:12 2005 From: anton at samba.org (Anton Blanchard) Date: Tue, 22 Mar 2005 08:22:12 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050321095902.2ad6b6be.davem@redhat.com> References: <20050321171349.GA23908@krispykreme> <20050321095902.2ad6b6be.davem@redhat.com> Message-ID: <20050321212212.GD23908@krispykreme> > No, it wasn't intentional. I fixed this in my tree, good spotting. > > This compat_sys_ipc() probably is starting to become > similar enough (at least across ppc and sparc) to go > to some common place. Don't ya think? :-) Yep I agree :) I was also wondering if we could make the compat layer sign extension code common, the ppc64 ones are written in C and most probably incomplete at the moment.
Anton From ntl at pobox.com Tue Mar 22 08:45:37 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 21 Mar 2005 15:45:37 -0600 Subject: ppc64 pSeries build broken in 2.6.12-rc1-bk1 Message-ID: <20050321214537.GC16469@otto> Hi- It seems that the "pSeries reconfig" patch series which Paul sent on March 17th to Andrew on my behalf was incompletely merged into bk? It looks like only the last two of the series are in -bk as of today (2.6.12-rc1-mm1 has them all). Subject lines and URLs for the series in question: [PATCH 1/8] PPC64 preliminary changes to OF fixup functions http://lkml.org/lkml/2005/3/17/41 [PATCH 2/8] PPC64 make OF node fixup code usable at runtime http://lkml.org/lkml/2005/3/17/40 [PATCH 3/8] PPC64 introduce pSeries_reconfig.[ch] http://lkml.org/lkml/2005/3/17/43 [PATCH 4/8] PPC64 prom.c: use pSeries reconfig notifier http://lkml.org/lkml/2005/3/17/56 [PATCH 5/8] PPC64 pci_dn.c: use pSeries reconfig notifier http://lkml.org/lkml/2005/3/17/57 [PATCH 6/8] PPC64 pSeries_iommu.c: use pSeries reconfig notifier http://lkml.org/lkml/2005/3/17/54 [PATCH 7/8] PPC64 use pSeries reconfig notifier for cpu DLPAR http://lkml.org/lkml/2005/3/17/44 [PATCH 8/8] PPC64 make cpu hotplug play well with maxcpus and smt-enabled http://lkml.org/lkml/2005/3/17/55 With 2.6.12-rc1-bk1, if CONFIG_PPC_PSERIES and CONFIG_SMP are enabled, the build errors out: CC arch/ppc64/kernel/pSeries_smp.o arch/ppc64/kernel/pSeries_smp.c:47:34: asm/pSeries_reconfig.h: No such file or directory If people would like something to use in the meantime, below is a throwaway patch to work around the breakage (not intended for inclusion). Disabling CONFIG_PPC_PSERIES should work around it also. 
The real fix is to either revert patches 7 and 8, or merge 1-6 :)

Nathan

Index: linux-2.6.12-rc1-bk1/arch/ppc64/kernel/pSeries_smp.c
===================================================================
--- linux-2.6.12-rc1-bk1.orig/arch/ppc64/kernel/pSeries_smp.c 2005-03-21 20:28:22.000000000 +0000
+++ linux-2.6.12-rc1-bk1/arch/ppc64/kernel/pSeries_smp.c 2005-03-21 21:00:16.000000000 +0000
@@ -44,7 +44,7 @@
 #include
 #include
 #include
-#include <asm/pSeries_reconfig.h>
+/* #include <asm/pSeries_reconfig.h> */

 #include "mpic.h"

@@ -135,6 +135,7 @@ void pSeries_cpu_die(unsigned int cpu)
  * the logical ids for sibling SMT threads x and y are adjacent, such
  * that x^1 == y and y^1 == x.
  */
+#if 0
 static int pSeries_add_processor(struct device_node *np)
 {
 	unsigned int cpu;
@@ -247,7 +248,7 @@ static int pSeries_smp_notifier(struct n
 static struct notifier_block pSeries_smp_nb = {
 	.notifier_call = pSeries_smp_notifier,
 };
-
+#endif /* 0 */
 #endif /* CONFIG_HOTPLUG_CPU */

 /**
@@ -421,9 +422,11 @@ void __init smp_init_pSeries(void)
 	smp_ops->cpu_die = pSeries_cpu_die;

 	/* Processors can be added/removed only on LPAR */
+#if 0
 	if (systemcfg->platform == PLATFORM_PSERIES_LPAR)
 		pSeries_reconfig_notifier_register(&pSeries_smp_nb);
 #endif
+#endif

 	/* Mark threads which are still spinning in hold loops. */
 	if (cur_cpu_spec->cpu_features & CPU_FTR_SMT)

From davem at redhat.com Tue Mar 22 08:50:31 2005
From: davem at redhat.com (David S. Miller)
Date: Mon, 21 Mar 2005 13:50:31 -0800
Subject: [PATCH] ppc64: fix semtimedop compat syscall
In-Reply-To: <20050321212212.GD23908@krispykreme>
References: <20050321171349.GA23908@krispykreme> <20050321095902.2ad6b6be.davem@redhat.com> <20050321212212.GD23908@krispykreme>
Message-ID: <20050321135031.6f35932a.davem@redhat.com>

On Tue, 22 Mar 2005 08:22:12 +1100 Anton Blanchard wrote:

> I was also wondering if we could make the compat layer sign extension
> code common, the ppc64 ones are written in c and most probably
> incomplete at the moment.
I do them in assembler on sparc64 to save a stack frame. See arch/sparc64/kernel/sys32.S That file is interesting because I am %99.999 certain that it is an exhaustive list of the system calls that require sign extension. From akpm at osdl.org Tue Mar 22 09:01:51 2005 From: akpm at osdl.org (Andrew Morton) Date: Mon, 21 Mar 2005 14:01:51 -0800 Subject: ppc64 pSeries build broken in 2.6.12-rc1-bk1 In-Reply-To: <20050321214537.GC16469@otto> References: <20050321214537.GC16469@otto> Message-ID: <20050321140151.4ebc3f7d.akpm@osdl.org> Nathan Lynch wrote: > > It seems that the "pSeries reconfig" patch series which Paul sent on > March 17th to Andrew on my behalf was incompletely merged into bk? Yes, I was getting a different build error which those three patches happened to fix, so I sent them in for 2.6.12-rc1. From olof at austin.ibm.com Tue Mar 22 09:14:33 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Mon, 21 Mar 2005 16:14:33 -0600 Subject: [BUG] 2.6.11- sym53c8xx Broken on pp64 In-Reply-To: <423F361F.8010609@ca.ibm.com> References: <423F361F.8010609@ca.ibm.com> Message-ID: <20050321221433.GD28665@austin.ibm.com> On Mon, Mar 21, 2005 at 04:01:19PM -0500, Omkhar Arasaratnam wrote: > Sorry I just realized that Thunderbird decided to forward the email as > an attachement. So here is a summary: > > On bringup I see this on a p630: > > sym0: No NVRAM, ID 7, Fast-80 LVD, parity checking > CACHE TEST FAILED: DMA error (dstat=0xa0) .sym0: CACHE INCORRECTLY > CONFIGURED > sym0: giving up ... > > No issue with 2.6.10 or 2.6.9 with the driver. Copying the sym2/ dir > from 2.6.10 over to the 2.6.11.3 dir doesn't help (same error). Ideas You said you're running on a p630. Are you running in partitioned or unpartitioned mode? You mentioned that 2.6.9 works, does 2.6.10 break or does that work as well? 
-Olof From paulus at samba.org Tue Mar 22 08:55:15 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 22 Mar 2005 08:55:15 +1100 Subject: [PATCH][2.6.12-rc1-mm1] fix compile error in ppc64 prom.c In-Reply-To: <200503211519.j2LFJ1os021884@harpo.it.uu.se> References: <200503211519.j2LFJ1os021884@harpo.it.uu.se> Message-ID: <16959.17091.901784.666693@cargo.ozlabs.ibm.com> Mikael Pettersson writes: > Compiling 2.6.12-rc1-mm1 for ppc64 fails with: > > arch/ppc64/kernel/prom.c:1691: error: syntax error before 'prom_reconfig_notifier' Currently prom.c is in a mess because Linus applied the last 2 of 8 patches from Nathan Lynch but not the first 6. :-P Paul. From ntl at pobox.com Tue Mar 22 09:56:26 2005 From: ntl at pobox.com (Nathan Lynch) Date: Mon, 21 Mar 2005 16:56:26 -0600 Subject: [PATCH][2.6.12-rc1-mm1] fix compile error in ppc64 prom.c In-Reply-To: <16959.17091.901784.666693@cargo.ozlabs.ibm.com> References: <200503211519.j2LFJ1os021884@harpo.it.uu.se> <16959.17091.901784.666693@cargo.ozlabs.ibm.com> Message-ID: <20050321225626.GD16469@otto> On Tue, Mar 22, 2005 at 08:55:15AM +1100, Paul Mackerras wrote: > Mikael Pettersson writes: > > > Compiling 2.6.12-rc1-mm1 for ppc64 fails with: > > > > arch/ppc64/kernel/prom.c:1691: error: syntax error before 'prom_reconfig_notifier' > > Currently prom.c is in a mess because Linus applied the last 2 of 8 > patches from Nathan Lynch but not the first 6. :-P Actually, this one is my fault, although unless I'm really missing something gcc 3.4.2 silently accepts the invalid syntax. All eight of the patches from the pSeries reconfig series are present in 2.6.12-rc1-mm1. The error Mikael reported is unrelated to the state of Linus' tree, and his patch is correct. (The mistake is present in both versions of the patches which I posted for review on linuxppc64-dev; nobody caught it.) 
Nathan From linas at austin.ibm.com Tue Mar 22 10:10:28 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 21 Mar 2005 17:10:28 -0600 Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)] In-Reply-To: <20050226063609.GC7036@colo.lackof.org> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <1109207532.5384.32.camel@gaston> <20050224013137.GF2088@austin.ibm.com> <20050226063609.GC7036@colo.lackof.org> Message-ID: <20050321231028.GV498@austin.ibm.com> Hi, There has been a running thread for a while on several mailing lists concerning PCI bus error recovery. Very briefly, some architectures have PCI error recovery mechanisms built into them (e.g. IBM PowerPC, also new PCI-Express chips from Intel (and other vendors) and possibly pa-risc and others). I've been trying to prototype error recovery. I currently have ethernet and the IPR scsi driver working, but I am having trouble with the symbios driver. I need help/advice ... On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark: > On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote: > > I also want to do the symbios driver... > > FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org. My current hardware will halt all i/o to/from the symbios controller upon detection of a PCI error. The recovery procedure that I am currently using is to call system firmware (aka 'bios') to raise and then lower the #RST pci signal line for 1/4 second, then wait 2 seconds for the PCI bus to settle, then restore the PCI config space registers (BARs, interrupt line, etc) to what they used to be. Then, I call sym_start_up() in an attempt to get the symbios card working again. And that's where I get stuck ... My assumption is that after the #RST, the symbios card will sit there, dumb and stupid, with no scripts running.
But sometimes I find that the card has done something to make the PCI error hardware trip again. Typically, this means that the card attempted to DMA to some address that it's not allowed to touch, or raised #SERR or possibly #PERR (I can't tell which). Sometimes, I get the PCI error while the card is sitting there idly after the #RST, but more often, I get the error in sym_chip_reset(), immediately after the OUTB (nc_istat, SRST); Any clue what this is about? Am I missing something? I'm rather perplexed at this point, any clues/hints/suggestions are welcome. --linas From linas at austin.ibm.com Tue Mar 22 10:40:31 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 21 Mar 2005 17:40:31 -0600 Subject: [Fwd: Re: [BUG] 2.6.11- sym53c8xx Broken on pp64] In-Reply-To: <423F2BA8.9080503@ca.ibm.com> References: <423F2BA8.9080503@ca.ibm.com> Message-ID: <20050321234031.GW498@austin.ibm.com> On Mon, Mar 21, 2005 at 03:16:40PM -0500, Omkhar Arasaratnam was heard to remark: > Sorry it took so long to move this over - can we further the investigation? FYI, when I had this problem (I think it's the same problem) I backed off to 2.6.11.2 from kernel.org and discovered the problem went away. --linas > On Tue, 2005-03-15 at 09:54 -0600, Omkhar Arasaratnam wrote: > > Benjamin Herrenschmidt wrote: > > > The 2.6.11.3 kernel with the 2.6.10 driver seems to fail with the same > > sym2 driver error - so I suppose it goes deeper than the driver itself. > > > > Let's move that to linuxppc64-dev and drop the CC-list. Last message on > this thread. > From jschopp at austin.ibm.com Tue Mar 22 10:52:21 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 21 Mar 2005 17:52:21 -0600 Subject: [RFC][PATCH] start seperating DISCONTIGMEM and NUMA In-Reply-To: <1111420571.9648.88.camel@localhost> References: <1111420571.9648.88.camel@localhost> Message-ID: <423F5E35.9040500@austin.ibm.com> Dave and I agree that CONFIG_NUMA and CONFIG_DISCONTIGMEM are often confused.
Yet try as we might it is an almost impossible task to detangle them in one go. But we need to start detangling them, for two major reasons:

1. You can currently configure CONFIG_DISCONTIGMEM without CONFIG_NUMA; in fact it is the default on ppc64.
2. CONFIG_SPARSEMEM will soon be an optional NUMA memory manager and will probably altogether replace CONFIG_DISCONTIGMEM sometime after that.

The attached patch splits up include/asm-ppc64/mmzone.h into a few groupings.

1. CONFIG_NUMA or CONFIG_DISCONTIGMEM. Some of this should be only one or the other, but right now is too difficult to disentangle properly.
2. CONFIG_NUMA: this is actually only needed when NUMA is really on and can be separated now.
3. CONFIG_DISCONTIGMEM: this is actually only needed when CONFIG_DISCONTIGMEM is on and can be separated now.
4. CONFIG_DISCONTIGMEM && !CONFIG_NUMA: this is for DISCONTIG to fake some of the numa stuff that isn't on.

The patch has been compiled and booted (on a pSeries LPAR) with DISCONTIGMEM off, DISCONTIGMEM on and NUMA off, and DISCONTIGMEM and NUMA on. If there are no objections I'd like this to go upstream.

Signed-off-by: Joel Schopp
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: discontig-nonuma.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050321/16b731a5/attachment.txt From haveblue at us.ibm.com Tue Mar 22 11:00:16 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Mon, 21 Mar 2005 16:00:16 -0800 Subject: [RFC][PATCH] start seperating DISCONTIGMEM and NUMA In-Reply-To: <423F5E35.9040500@austin.ibm.com> References: <1111420571.9648.88.camel@localhost> <423F5E35.9040500@austin.ibm.com> Message-ID: <1111449616.5045.55.camel@localhost> Joel, I think you're missing one of the main points: you don't have node-indexed data structures when !CONFIG_NUMA: > +/* Put things that are common to DISCONTIGMEM and NUMA here */ > +#if (CONFIG_DISCONTIGMEM) || (CONFIG_NUMA) > > extern struct pglist_data *node_data[]; That should only be '#ifdef CONFIG_NUMA'. > +#if (CONFIG_DISCONTIGMEM) && (!CONFIG_NUMA) > +#define NODE_DATA(nid) (node_data[0]) > +#define pa_to_nid(pa) 0 > +#define pfn_to_nid(pfn) 0 > +#define kvaddr_to_nid(kaddr) 0BTW, the patch I sent You don't need this special case. As I said before, there's no reason to even have node_data[] when !CONFIG_NUMA, just use contig_page_data. That should make your patch a ton smaller. -- Dave From david at gibson.dropbear.id.au Tue Mar 22 13:04:35 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Tue, 22 Mar 2005 13:04:35 +1100 Subject: Hugepage COW In-Reply-To: <1111422399.3635.51.camel@localhost.localdomain> References: <1109085505.5217.28.camel@localhost.localdomain> <20050223070322.GF24473@localhost.localdomain> <1111006896.3635.24.camel@localhost.localdomain> <20050317034844.GD14048@localhost.localdomain> <1111422399.3635.51.camel@localhost.localdomain> Message-ID: <20050322020435.GE23695@localhost.localdomain> On Mon, Mar 21, 2005 at 10:26:39AM -0600, Adam Litke wrote: > On Wed, 2005-03-16 at 21:48, David Gibson wrote: > > As far as the consolidation patch goes, lack of testing was the main > > objective reason for holding back. 
So if you could test on x86_64 and > > ia64 too, that would be great. wli had some objections to the patch > > when I first posted which I didn't and don't really understand, and > > from conversations with akpm, I'm certainly thinking of just > > re-sending it. > > I am in the process of obtaining an x86_64 box to test these on. IA64 > hardware could prove more difficult to find though. That's fine. Testing on three archs is still better than two, even if four would be better yet :) > > COW will be a bit more of a political shitfight, I suspect. I'd like > > to at least hold off until the consolidation is merged, which makes > > the COW much easier. We'll also need to implement the necessary > > arch-hooks for COW on every platform. Speaking of which, did you > > implement the i386 hooks? I thought I only did COW for ppc64, so far, > > although on top of the consolidation patch the amount of arch code is > > vastly reduced. > > The version of your cow that I have has no hooks (not even for ppc64). > It applies on top of the consolidate patch. It is working fine on x86 > but I am investigating a possible problem on ppc64. Ah, yes, just checked and I see now. It only needs arch hooks for those archs which define ARCH_HAS_SETCLEAR_HUGE_PTE (in the consolidation patch). Which I think is just sparc64, sh and sh64. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_!
http://www.ozlabs.org/people/dgibson From olof at austin.ibm.com Tue Mar 22 13:55:01 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Mon, 21 Mar 2005 20:55:01 -0600 Subject: [RFC][PATCH] start seperating DISCONTIGMEM and NUMA In-Reply-To: <423F5E35.9040500@austin.ibm.com> References: <1111420571.9648.88.camel@localhost> <423F5E35.9040500@austin.ibm.com> Message-ID: <20050322025501.GA5402@austin.ibm.com> On Mon, Mar 21, 2005 at 05:52:21PM -0600, Joel Schopp wrote: > -#ifdef CONFIG_DISCONTIGMEM > +/* Put things that are common to DISCONTIGMEM and NUMA here */ > +#if (CONFIG_DISCONTIGMEM) || (CONFIG_NUMA) Please use "#if defined(X) || defined(Y)" instead of the above syntax, the same goes for other locations. -Olof From sfr at canb.auug.org.au Tue Mar 22 14:22:11 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 22 Mar 2005 14:22:11 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050321135031.6f35932a.davem@redhat.com> References: <20050321171349.GA23908@krispykreme> <20050321095902.2ad6b6be.davem@redhat.com> <20050321212212.GD23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> Message-ID: <20050322142211.1b4337e4.sfr@canb.auug.org.au> On Mon, 21 Mar 2005 13:50:31 -0800 "David S. Miller" wrote: > > On Tue, 22 Mar 2005 08:22:12 +1100 > Anton Blanchard wrote: > > > I was also wondering if we could make the compat layer sign extension > > code common, the ppc64 ones are written in c and most probably > > incomplete at the moment. > > I do them in assembler on sparc64 to save a stack frame. > See arch/sparc64/kernel/sys32.S > > That file is interesting because I am %99.999 certain that > it is an exhaustive list of the system calls that require > sign extension. I will take the hint and have a look at compat_sys_ipc (unless someone beats me to it). -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050322/caa3c10c/attachment.pgp From iamroot at ca.ibm.com Tue Mar 22 16:07:36 2005 From: iamroot at ca.ibm.com (Omkhar Arasaratnam) Date: Tue, 22 Mar 2005 00:07:36 -0500 Subject: [BUG] 2.6.11- sym53c8xx Broken on pp64 In-Reply-To: <20050321221433.GD28665@austin.ibm.com> References: <423F361F.8010609@ca.ibm.com> <20050321221433.GD28665@austin.ibm.com> Message-ID: <423FA818.1040102@ca.ibm.com> Olof Johansson wrote: >On Mon, Mar 21, 2005 at 04:01:19PM -0500, Omkhar Arasaratnam wrote: > > >>Sorry I just realized that Thunderbird decided to forward the email as >>an attachement. So here is a summary: >> >>On bringup I see this on a p630: >> >>sym0: No NVRAM, ID 7, Fast-80 LVD, parity checking >>CACHE TEST FAILED: DMA error (dstat=0xa0) .sym0: CACHE INCORRECTLY >>CONFIGURED >>sym0: giving up ... >> >>No issue with 2.6.10 or 2.6.9 with the driver. Copying the sym2/ dir >>from 2.6.10 over to the 2.6.11.3 dir doesn't help (same error). Ideas >> >> > >You said you're running on a p630. Are you running in partitioned or >unpartitioned mode? You mentioned that 2.6.9 works, does 2.6.10 break or >does that work as well? > > >-Olof > > > 2.6.10 works just fine save an initrd error which *could* be my fault Omkhar From iamroot at ca.ibm.com Tue Mar 22 16:08:27 2005 From: iamroot at ca.ibm.com (Omkhar Arasaratnam) Date: Tue, 22 Mar 2005 00:08:27 -0500 Subject: [Fwd: Re: [BUG] 2.6.11- sym53c8xx Broken on pp64] In-Reply-To: <20050321234031.GW498@austin.ibm.com> References: <423F2BA8.9080503@ca.ibm.com> <20050321234031.GW498@austin.ibm.com> Message-ID: <423FA84B.2030602@ca.ibm.com> Linas Vepstas wrote: >On Mon, Mar 21, 2005 at 03:16:40PM -0500, Omkhar Arasaratnam was heard to remark: > > >>Sorry it took so long to move this over - can we further the investigation? 
>> >> > >FYI, when I had this problem (I think it's the same problem) I backed off >to 2.6.11.2 from kernel.org and discovered the problem went away. > >--linas > > > >>On Tue, 2005-03-15 at 09:54 -0600, Omkhar Arasaratnam wrote: >> >> >>>Benjamin Herrenschmidt wrote: >>> >>> >>>The 2.6.11.3 kernel with the 2.6.10 driver seems to fail with the same >>>sym2 driver error - so I suppose it goes deeper than the driver itself. >>> >>> >>> >>Let's move that to linuxppc64-dev and drop the CC-list. Last message on >>this thread. >> >> >> > > > I'll try 2.6.11.2 when I get a chance - however I'm more interested in getting this bug squashed rather than resorting to backleveled kernels. Omkhar From arnd at arndb.de Tue Mar 22 23:42:44 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 22 Mar 2005 13:42:44 +0100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050322142211.1b4337e4.sfr@canb.auug.org.au> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> Message-ID: <200503221342.46575.arnd@arndb.de> On Tuesday 22 March 2005 04:22, Stephen Rothwell wrote: > On Mon, 21 Mar 2005 13:50:31 -0800 "David S. Miller" wrote: > > Anton Blanchard wrote: > > > I was also wondering if we could make the compat layer sign extension > > > code common, the ppc64 ones are written in c and most probably > > > incomplete at the moment. One problem is that sign extension cannot be expressed in architecture-independent C code. On ppc64, we make the argument an unsigned int and then cast it to signed int, e.g.

	asmlinkage long sys_ssetmask(int newmask);
	asmlinkage long sys32_ssetmask(u32 newmask)
	{
		return sys_ssetmask((int) newmask);
	}

On s390, the code would instead need to clear the upper 32 bits; the existing assembly code could be expressed in C as:

	asmlinkage long sys_ssetmask(int newmask);
	asmlinkage long sys32_ssetmask(long newmask)
	{
		return sys_ssetmask((int) newmask);
	}

> > I do them in assembler on sparc64 to save a stack frame.
> > See arch/sparc64/kernel/sys32.S
> >
> > That file is interesting because I am %99.999 certain that
> > it is an exhaustive list of the system calls that require
> > sign extension.

Right, except for the s390 31-bit pointer extension problem. arch/s390/kernel/compat_wrapper.S does very similar stuff, but has to do it also for syscalls that have only unsigned or pointer arguments. The s390 file also handles clearing the upper 32 bits for all arguments, because the CPU does not do that automatically, unlike ppc64 or x86_64 (don't know about the others).

I have an old script that generates the s390 compat_wrapper.S file from a header file holding all C prototypes for the compat_sys_* functions. Maybe we can find a way to make that generic enough for all seven compat architectures.

> I will take the hint and have a look at compat_sys_ipc (unless someone
> beats me to it).

When I introduced ipc/compat.c, I left sys32_ipc alone because I could not figure out how to do it in a generic way. It should probably work with a prototype like

	asmlinkage long compat_sys_ipc(u32 call, u32 first, u32 second, u32 third, void __user *ptr);

The s390 wrapper would only zero-extend the first four arguments, while compat_sys_ipc would need to do proper compat_ptr() and sign-extension itself.

Arnd <><

From davem at davemloft.net Wed Mar 23 04:26:27 2005
From: davem at davemloft.net (David S.
Miller) Date: Tue, 22 Mar 2005 09:26:27 -0800 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <200503221342.46575.arnd@arndb.de> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> <200503221342.46575.arnd@arndb.de> Message-ID: <20050322092627.20d3898f.davem@davemloft.net> On Tue, 22 Mar 2005 13:42:44 +0100 Arnd Bergmann wrote: > Right, except for the s390 31-bit pointer extension problem. > arch/s390/kernel/compat_wrapper.S does very similar stuff, but > has to do it also for syscalls that have only unsigned or pointer > arguments. > The s390 file also handles clearing the upper 32 bits for all > arguments, because the CPU does not do that automatically, unlike > ppc64 or x86_64 (don't know about the others). Ok. The scheme used on sparc64 (and I was under the impression that a discussion last year left us with the decision that zero extending all args by default, then fixing up the differences was the way to go) is to zero extend all the registers at 32-bit syscall dispatch, then we have fixups for the sign-extension cases. Regardless, we can genericize this even with the differences. > I have an old script that generates the s390 compat_wrapper.S file > from a header file holding all C prototypes for the compat_sys_* > functions. Maybe we can find a way to make that generic enough > for all seven compat architectures. That's a great idea. This thing could spit out a set of macros that an arch-specific ASM stub could end up using. I'm not saying the sparc64 macros are the greatest, but it could be a starting point. 
So this magic header file your program outputs might look like:

#ifdef COMPAT_NEED_SIGN_EXTENSION
SIGN1(sys32_exit, sparc_exit, ARG0)
SIGN1(sys32_exit_group, sys_exit_group, ARG0)
SIGN1(sys32_wait4, compat_sys_wait4, ARG2)
SIGN1(sys32_creat, sys_creat, ARG1)
SIGN1(sys32_mknod, sys_mknod, ARG1)
SIGN1(sys32_perfctr, sys_perfctr, ARG0)
SIGN1(sys32_umount, sys_umount, ARG1)
...
#endif

etc. And then arch/${ARCH}/kernel/compat_wrapper.S looks like:

#define COMPAT_NEED_SIGN_EXTENSION
#define SIGN1(STUB,SYSCALL,REG1) ...
#define ARG0 %o0
#define ARG1 %o1
#define ARG2 %o2
#define ARG3 %o3
#define ARG4 %o4
#define ARG5 %o5
#include

Just an idea, maybe you can come up with something cleaner. :-)

From brking at us.ibm.com Wed Mar 23 04:38:36 2005
From: brking at us.ibm.com (Brian King)
Date: Tue, 22 Mar 2005 11:38:36 -0600
Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
In-Reply-To: <20050321231028.GV498@austin.ibm.com>
References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <1109207532.5384.32.camel@gaston> <20050224013137.GF2088@austin.ibm.com> <20050226063609.GC7036@colo.lackof.org> <20050321231028.GV498@austin.ibm.com>
Message-ID: <4240581C.1000906@us.ibm.com>

Linas Vepstas wrote:
> Hi,
>
> There has been a running thread for a while on several mailing lists
> concerning PCI bus error recovery. Very briefly, some architectures
> have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
> also new PCI-Express chips from Intel (and other vendors) and possibly
> pa-risc and others).
>
> I've been trying to prototype error recovery. I currently have
> ethernet and the IPR scsi driver working, but I am having trouble with
> the symbios driver. I need help/advice ...
>
> On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
>
>>On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
>>
>>>I also want to do the symbios driver...
>> >>FYI, Mathew Wilcox maintains the sym2 driver in cvs.parisc-linux.org. > > > > My current hardware will halt all i/o to/from the symbios controller > upon detection of a PCI error. The recovery proceedure that I am > currently using is to call system firmware (aka 'bios') to raise > and then lower the #RST pci signal line for 1/4 second, then wait 2 > seconds for the PCI bus to settle, then restore the PCI config space > registers (BARs, interrupt line, etc) to what they used to be. Then, > I call sym_start_up() in an attempt to get the symbios card working > again. And that's where I get stuck ... > > My assumption is that after the #RST, that the symbios card will sit > there, dumb and stupid, with no scripts running. But sometimes I find > that the card has done something to make the PCI error hardware trip > again. Typically, this means that the card attempted to DMA to some > address that its not allowed to touch, or raised #SERR or possibly > #PERR (I can't tell which). What config registers are you restoring? Is it possible symbios does not like something in your config restore? Another possiblity is that asserting PCI reset is not cleanly resetting the card. Does PCI reset force BIST to be run on these cards? You could try to manually run BIST on the card after the PCI reset to see if that helps, or you could try power cycling the slot instead of using PCI reset. -Brian > > Sometimes, I get the PCI error while the card is sitting there idly > after the #RST, but more often, I get the error in sym_chip_reset(), > immediately after the OUTB (nc_istat, SRST); > > Any clue what this is about? Am I missing something? I'm rather > perplexed at this point, any clues/hints/suggestions are welcome. 
> > --linas > > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo at vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Brian King eServer Storage I/O IBM Linux Technology Center From anton at samba.org Wed Mar 23 04:51:25 2005 From: anton at samba.org (Anton Blanchard) Date: Wed, 23 Mar 2005 04:51:25 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050322092627.20d3898f.davem@davemloft.net> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> <200503221342.46575.arnd@arndb.de> <20050322092627.20d3898f.davem@davemloft.net> Message-ID: <20050322175125.GB4980@krispykreme> Hi, > Ok. The scheme used on sparc64 (and I was under the impression > that a discussion last year left us with the decision that zero > extending all args by default, then fixing up the differences was > the way to go) is to zero extend all the registers at 32-bit syscall > dispatch, then we have fixups for the sign-extension cases. That was my understanding, on ppc64 we zero extend all 6 arguments. > > I have an old script that generates the s390 compat_wrapper.S file > > from a header file holding all C prototypes for the compat_sys_* > > functions. Maybe we can find a way to make that generic enough > > for all seven compat architectures. > > That's a great idea. This thing could spit out a set of macros > that an arch-specific ASM stub could end up using. I like the idea too. 
Anton From grundler at parisc-linux.org Wed Mar 23 04:57:28 2005 From: grundler at parisc-linux.org (Grant Grundler) Date: Tue, 22 Mar 2005 10:57:28 -0700 Subject: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)] In-Reply-To: <20050321231028.GV498@austin.ibm.com> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <1109207532.5384.32.camel@gaston> <20050224013137.GF2088@austin.ibm.com> <20050226063609.GC7036@colo.lackof.org> <20050321231028.GV498@austin.ibm.com> Message-ID: <20050322175728.GE12675@colo.lackof.org> On Mon, Mar 21, 2005 at 05:10:28PM -0600, Linas Vepstas wrote: > My current hardware will halt all i/o to/from the symbios controller > upon detection of a PCI error. The recovery proceedure that I am > currently using is to call system firmware (aka 'bios') to raise > and then lower the #RST pci signal line for 1/4 second, then wait 2 > seconds for the PCI bus to settle, then restore the PCI config space > registers (BARs, interrupt line, etc) to what they used to be. Then, > I call sym_start_up() in an attempt to get the symbios card working > again. And that's where I get stuck ... Does this process cause a SCSI bus reset? SCSI devices will continue *forever* to send status back to the host on IO's that have completed. At least that's what I remember from working on this 8 years ago. Issuing a SCSI "Bus Reset" or "Bus Device Reset" (BDR) will quiesce the devices. I'm asking because it's possible sym2 driver isn't expecting anything from any device at that point. BTW, when did sym2 get a chance to cleanup "pending" requests? You want everything moved back to the "queued" state or failed (flush pending IO so upper layers can retry if they want). > My assumption is that after the #RST, that the symbios card will sit > there, dumb and stupid, with no scripts running. But sometimes I find > that the card has done something to make the PCI error hardware trip > again. 
Typically, this means that the card attempted to DMA to some > address that it's not allowed to touch, or raised #SERR or possibly > #PERR (I can't tell which). PCI Reset typically only affects the PCI-facing parts of a chip, e.g. some LAN PHYs don't get reset and need to be manually reset. I'm skeptical sym2 will (or should) issue a SCSI Bus reset when PCI Reset is asserted. Think multi-initiator. > Sometimes, I get the PCI error while the card is sitting there idly > after the #RST, but more often, I get the error in sym_chip_reset(), > immediately after the OUTB (nc_istat, SRST); Oh? Is this the driver trying to issue SCSI Reset? > Any clue what this is about? Am I missing something? I'm rather > perplexed at this point, any clues/hints/suggestions are welcome. Sorry - I'm no expert on 53c8xx chips. Hope the above helps. grant From daniel at osdl.org Wed Mar 23 06:24:34 2005 From: daniel at osdl.org (Daniel McNeil) Date: Tue, 22 Mar 2005 11:24:34 -0800 Subject: [PATCH 2.6.11] AIO panic on PPC64 caused by is_hugepage_only_range() In-Reply-To: <20050321184113.0f5e2f6b.akpm@osdl.org> References: <1111108348.31932.43.camel@ibm-c.pdx.osdl.net> <20050321184113.0f5e2f6b.akpm@osdl.org> Message-ID: <1111519474.15956.40.camel@ibm-c.pdx.osdl.net> On Mon, 2005-03-21 at 18:41, Andrew Morton wrote: > Did we fix this yet? > Here's a patch against 2.6.11 that fixes the problem. It changes is_hugepage_only_range() to take mm as an argument and then changes the places that call it to pass 'mm'. It includes a change for ia64 which has not been compiled. It applies against the latest bk with some offset.
Signed-off-by: Daniel McNeil diff -urp linux-2.6.11.orig/arch/ppc64/mm/hugetlbpage.c linux-2.6.11/arch/ppc64/mm/hugetlbpage.c --- linux-2.6.11.orig/arch/ppc64/mm/hugetlbpage.c 2005-03-22 09:43:09.000000000 -0800 +++ linux-2.6.11/arch/ppc64/mm/hugetlbpage.c 2005-03-22 09:45:46.000000000 -0800 @@ -512,7 +512,7 @@ unsigned long arch_get_unmapped_area(str vma = find_vma(mm, addr); if (((TASK_SIZE - len) >= addr) && (!vma || (addr+len) <= vma->vm_start) - && !is_hugepage_only_range(addr,len)) + && !is_hugepage_only_range(mm, addr,len)) return addr; } start_addr = addr = mm->free_area_cache; @@ -522,7 +522,7 @@ full_search: while (TASK_SIZE - len >= addr) { BUG_ON(vma && (addr >= vma->vm_end)); - if (touches_hugepage_low_range(addr, len)) { + if (touches_hugepage_low_range(mm, addr, len)) { addr = ALIGN(addr+1, 1<= addr && (!vma || addr + len <= vma->vm_start) - && !is_hugepage_only_range(addr,len)) + && !is_hugepage_only_range(mm, addr,len)) return addr; } @@ -596,7 +596,7 @@ try_again: addr = (mm->free_area_cache - len) & PAGE_MASK; do { hugepage_recheck: - if (touches_hugepage_low_range(addr, len)) { + if (touches_hugepage_low_range(mm, addr, len)) { addr = (addr & ((~0) << SID_SHIFT)) - len; goto hugepage_recheck; } else if (touches_hugepage_high_range(addr, len)) { diff -urp linux-2.6.11.orig/include/asm-ia64/page.h linux-2.6.11/include/asm-ia64/page.h --- linux-2.6.11.orig/include/asm-ia64/page.h 2005-03-01 23:37:48.000000000 -0800 +++ linux-2.6.11/include/asm-ia64/page.h 2005-03-21 16:58:54.000000000 -0800 @@ -137,7 +137,7 @@ typedef union ia64_va { # define htlbpage_to_page(x) (((unsigned long) REGION_NUMBER(x) << 61) \ | (REGION_OFFSET(x) >> (HPAGE_SHIFT-PAGE_SHIFT))) # define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) -# define is_hugepage_only_range(addr, len) \ +# define is_hugepage_only_range(mm, addr, len) \ (REGION_NUMBER(addr) == REGION_HPAGE && \ REGION_NUMBER((addr)+(len)) == REGION_HPAGE) extern unsigned int hpage_shift; diff -urp 
linux-2.6.11.orig/include/asm-ppc64/page.h linux-2.6.11/include/asm-ppc64/page.h --- linux-2.6.11.orig/include/asm-ppc64/page.h 2005-03-01 23:37:30.000000000 -0800 +++ linux-2.6.11/include/asm-ppc64/page.h 2005-03-21 16:59:46.000000000 -0800 @@ -48,8 +48,8 @@ #define ARCH_HAS_HUGEPAGE_ONLY_RANGE #define ARCH_HAS_PREPARE_HUGEPAGE_RANGE -#define touches_hugepage_low_range(addr, len) \ - (LOW_ESID_MASK((addr), (len)) & current->mm->context.htlb_segs) +#define touches_hugepage_low_range(mm, addr, len) \ + (LOW_ESID_MASK((addr), (len)) & mm->context.htlb_segs) #define touches_hugepage_high_range(addr, len) \ (((addr) > (TASK_HPAGE_BASE-(len))) && ((addr) < TASK_HPAGE_END)) @@ -61,9 +61,9 @@ #define within_hugepage_high_range(addr, len) (((addr) >= TASK_HPAGE_BASE) \ && ((addr)+(len) <= TASK_HPAGE_END) && ((addr)+(len) >= (addr))) -#define is_hugepage_only_range(addr, len) \ +#define is_hugepage_only_range(mm, addr, len) \ (touches_hugepage_high_range((addr), (len)) || \ - touches_hugepage_low_range((addr), (len))) + touches_hugepage_low_range((mm), (addr), (len))) #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define in_hugepage_area(context, addr) \ diff -urp linux-2.6.11.orig/include/linux/hugetlb.h linux-2.6.11/include/linux/hugetlb.h --- linux-2.6.11.orig/include/linux/hugetlb.h 2005-03-21 16:50:21.000000000 -0800 +++ linux-2.6.11/include/linux/hugetlb.h 2005-03-22 09:41:24.000000000 -0800 @@ -36,7 +36,7 @@ extern const unsigned long hugetlb_zero, extern int sysctl_hugetlb_shm_group; #ifndef ARCH_HAS_HUGEPAGE_ONLY_RANGE -#define is_hugepage_only_range(addr, len) 0 +#define is_hugepage_only_range(mm, addr, len) 0 #define hugetlb_free_pgtables(tlb, prev, start, end) do { } while (0) #endif @@ -71,7 +71,7 @@ static inline unsigned long hugetlb_tota #define is_aligned_hugepage_range(addr, len) 0 #define prepare_hugepage_range(addr, len) (-EINVAL) #define pmd_huge(x) 0 -#define is_hugepage_only_range(addr, len) 0 +#define is_hugepage_only_range(mm, addr, len) 0 #define 
hugetlb_free_pgtables(tlb, prev, start, end) do { } while (0) #define alloc_huge_page() ({ NULL; }) #define free_huge_page(p) ({ (void)(p); BUG(); }) diff -urp linux-2.6.11.orig/mm/mmap.c linux-2.6.11/mm/mmap.c --- linux-2.6.11.orig/mm/mmap.c 2005-03-21 17:00:35.000000000 -0800 +++ linux-2.6.11/mm/mmap.c 2005-03-21 17:01:20.000000000 -0800 @@ -1334,7 +1334,7 @@ get_unmapped_area(struct file *file, uns * reserved hugepage range. For some archs like IA-64, * there is a separate region for hugepages. */ - ret = is_hugepage_only_range(addr, len); + ret = is_hugepage_only_range(current->mm, addr, len); } if (ret) return -EINVAL; @@ -1707,7 +1707,7 @@ static void unmap_region(struct mm_struc unmap_vmas(&tlb, mm, vma, start, end, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); - if (is_hugepage_only_range(start, end - start)) + if (is_hugepage_only_range(mm, start, end - start)) hugetlb_free_pgtables(tlb, prev, start, end); else free_pgtables(tlb, prev, start, end); From kenneth.w.chen at intel.com Wed Mar 23 06:59:48 2005 From: kenneth.w.chen at intel.com (Chen, Kenneth W) Date: Tue, 22 Mar 2005 11:59:48 -0800 Subject: [PATCH 2.6.11] AIO panic on PPC64 caused byis_hugepage_only_range() In-Reply-To: <1111519474.15956.40.camel@ibm-c.pdx.osdl.net> Message-ID: <200503221959.j2MJxmg17815@unix-os.sc.intel.com> On Mon, 2005-03-21 at 18:41, Andrew Morton wrote: > Did we fix this yet? > Daniel McNeil wrote on Tuesday, March 22, 2005 11:25 AM > Here's a patch against 2.6.11 that fixes the problem. > It changes is_hugepage_only_range() to take mm as an argument > and then changes the places that call it to pass 'mm'. > It includes a change for ia64 which has not been compiled. Just a sanity check, tested the patch on ia64. Nothing blows up and everything is working. 
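A minimal user-space sketch of why the patch above passes mm explicitly. This is not kernel code: struct mm_model, current_task and low_seg_mask() are hypothetical stand-ins for mm_struct, current and LOW_ESID_MASK(), and the 256MB segment size is assumed for illustration only. The thread does not spell out the exact failure mode, so the sketch shows just one thing: a check keyed on the current task's mm can disagree with a check keyed on the mm that is actually being operated on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_SHIFT   28   /* assumed 256MB segments (illustrative) */
#define NR_LOW_SEGS 16   /* segments below 4GB */

/* Hypothetical stand-ins for mm_struct and task_struct. */
struct mm_model   { unsigned int htlb_segs; };  /* one bit per hugepage segment */
struct task_model { struct mm_model *mm; };

static struct task_model *current_task;         /* models the kernel's "current" */

/* Bitmask of the low segments touched by [addr, addr + len); len must be > 0. */
static unsigned int low_seg_mask(uint64_t addr, uint64_t len)
{
    uint64_t seg  = addr >> SEG_SHIFT;
    uint64_t last = (addr + len - 1) >> SEG_SHIFT;
    unsigned int mask = 0;

    for (; seg <= last && seg < NR_LOW_SEGS; seg++)
        mask |= 1u << seg;
    return mask;
}

/* Pre-patch shape: silently consults the current task's address space. */
static int touches_low_range_implicit(uint64_t addr, uint64_t len)
{
    return (low_seg_mask(addr, len) & current_task->mm->htlb_segs) != 0;
}

/* Post-patch shape: the caller names the address space. */
static int touches_low_range(const struct mm_model *mm,
                             uint64_t addr, uint64_t len)
{
    return (low_seg_mask(addr, len) & mm->htlb_segs) != 0;
}

/* Returns 1 when the two variants disagree for the same address range. */
static int demo_mm_mismatch(void)
{
    struct mm_model caller_mm = { 0 };        /* no hugepage segments */
    struct mm_model target_mm = { 1u << 3 };  /* segment 3 holds hugepages */
    struct task_model task = { &caller_mm };
    int implicit, explicit_mm;

    current_task = &task;
    implicit    = touches_low_range_implicit(0x30000000ULL, 0x1000);
    explicit_mm = touches_low_range(&target_mm, 0x30000000ULL, 0x1000);
    current_task = NULL;                      /* do not leave a dangling pointer */

    assert(implicit == 0);
    assert(explicit_mm == 1);
    return implicit != explicit_mm;
}
```

Calling demo_mm_mismatch() from any test harness exercises both variants. In the implicit variant a NULL current_task->mm would be dereferenced, which is one plausible (but here unconfirmed) way such a macro could lead to a panic.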
From arnd at arndb.de Wed Mar 23 08:21:48 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 22 Mar 2005 22:21:48 +0100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050322092627.20d3898f.davem@davemloft.net> References: <20050321171349.GA23908@krispykreme> <200503221342.46575.arnd@arndb.de> <20050322092627.20d3898f.davem@davemloft.net> Message-ID: <200503222221.50367.arnd@arndb.de> On Tuesday 22 March 2005 18:26, David S. Miller wrote: > Ok. The scheme used on sparc64 (and I was under the impression > that a discussion last year left us with the decision that zero > extending all args by default, then fixing up the differences was > the way to go) is to zero extend all the registers at 32-bit syscall > dispatch, then we have fixups for the sign-extension cases. > > Regardless, we can genericize this even with the differences. Yes, I remember that discussion now. > > I have an old script that generates the s390 compat_wrapper.S file > > from a header file holding all C prototypes for the compat_sys_* > > functions. Maybe we can find a way to make that generic enough > > for all seven compat architectures. > > That's a great idea. This thing could spit out a set of macros > that an arch-specific ASM stub could end up using. Unfortunately, it turned out that the actual script got lost at some point when I cleaned up my disk, so I have now reimplemented it. The upside is that it is now architecture independent. > Just an idea, maybe you can come up with something > cleaner. :-) This is what I have now. AFAICS, this should work for all architectures, even without having to zero-extend all arguments.
The output currently looks like compat_s(sys32_exit, sys_exit) compat_u_p_u(sys32_read, sys32_read) compat_u_p_u(sys32_write, sys32_write) compat_p_s_s(sys32_open, sys_open) compat_u(sys32_close, sys_close) compat_p_s(sys32_creat, sys_creat) compat_p_p(sys32_link, sys_link) compat_p(sys32_unlink, sys_unlink) compat_p(sys32_chdir, sys_chdir) compat_p(sys32_time, sys_time) compat_p_s_u(sys32_mknod, sys_mknod) compat_p_u(sys32_chmod, sys_chmod) compat_p_u_u(sys32_lchown16, sys32_lchown16) compat_u_s_u(sys32_lseek, sys_lseek) compat_p_p_p_u_p(sys32_mount, sys32_mount) This means that every architecture needs to define macros for all combinations of signed, unsigned, pointer and u64 arguments that are used by any system call. There is a total of 58 different calling conventions, half of which can be eliminated for !s390 by providing a generic way to map pointers to unsigned, e.g. #define compat_p compat_u #define compat_s_p compat_s_u only compat_u, compat_u_u ... compat_u_u_u_u_u can be eliminated by always doing zero extension at system call entry and that might not be worth it. Arnd <>< -------------- next part -------------- A non-text attachment was scrubbed... Name: compat_decl.h Type: text/x-chdr Size: 13593 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050322/e770b4d3/attachment.h -------------- next part -------------- # remove all lines except prototypes /^\s*asmlinkage/!d # ignore return type s/^ *asmlinkage \+\(void\)\?\(unsigned \+\)\?\(long\)\? 
\+// # ignore any const s/const *//g # pointer become _p s/[a-zA-Z0-9_ ]\+ *\*/_p/g s/\b__sighandler_t/_p/g s/\bcaddr_t/_p/g # unsigned long long becomes _U s/unsigned \+long \+long/_U/g # unsigned becomes _u s/\baio_context_t/_u/g s/\bcompat_gid_t/_u/g s/\bcompat_size_t/_u/g s/\bcompat_uid_t/_u/g s/\bdev_t/_u/g s/\bgid_t/_u/g s/\bmode_t/_u/g s/\bqid_t/_u/g s/\bsize_t/_u/g s/\bu32/_u/g s/\buid_t/_u/g s/\bulong_ptr/_u/g s/\bunsigned \+int/_u/g s/\bunsigned \+long/_u/g s/\bunsigned/_u/g # signed becomes _s s/\bclockid_t/_s/g s/\bint/_s/g s/\blong/_s/g s/\boff_t/_s/g s/\bpid_t/_s/g s/\btimer_t/_u/g # remove commas and spaces between arguments s/[, ]//g # output real line s/\(^[a-z0-9_]*\)(\(.*\));/compat\2(\1, \1)/ # rename output function name in case of sys_* s/(sys_/(sys32_/ From paulus at samba.org Wed Mar 23 07:19:31 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 23 Mar 2005 07:19:31 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <200503221342.46575.arnd@arndb.de> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> <200503221342.46575.arnd@arndb.de> Message-ID: <16960.32211.960982.13262@cargo.ozlabs.ibm.com> Arnd Bergmann writes: > One problem is that sign extension can not be expressed in architecture > independent C code. On which architectures does (long)(int) x not give the desired result? Paul. 
From sfr at canb.auug.org.au Wed Mar 23 09:34:41 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Wed, 23 Mar 2005 09:34:41 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <16960.32211.960982.13262@cargo.ozlabs.ibm.com> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> <200503221342.46575.arnd@arndb.de> <16960.32211.960982.13262@cargo.ozlabs.ibm.com> Message-ID: <20050323093441.728fa8ca.sfr@canb.auug.org.au> On Wed, 23 Mar 2005 07:19:31 +1100 Paul Mackerras wrote: > > Arnd Bergmann writes: > > > One problem is that sign extension can not be expressed in architecture > > independent C code. > > On which architectures does (long)(int) x not give the desired result? Presumably for a u32 function argument x which has been zero extended? -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050323/f2be59c8/attachment.pgp From olof at austin.ibm.com Wed Mar 23 09:37:52 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 22 Mar 2005 16:37:52 -0600 Subject: [BUG] 2.6.11- sym53c8xx Broken on pp64 In-Reply-To: <423FA818.1040102@ca.ibm.com> References: <423F361F.8010609@ca.ibm.com> <20050321221433.GD28665@austin.ibm.com> <423FA818.1040102@ca.ibm.com> Message-ID: <20050322223752.GE26568@austin.ibm.com> On Tue, Mar 22, 2005 at 12:07:36AM -0500, Omkhar Arasaratnam wrote: > 2.6.10 works just fine save an initrd error which *could* be my fault 2.6.11, 2.6.11.5 and 2.6.12-rc1-bk1 all work here on a p630 in SMP (nonpartitioned) mode. You didn't answer if you're running in partitioned mode or not, unfortunately this system doesn't have an HMC to try partitioned mode with. 
There are some peculiarities with p630s in partitioned mode, they're one of the few machines that have dma-window properties at several levels in the tree for the partition that has the internal I/O assigned to it but the logic shouldn't have changed in 2.6.11. If that's your config, let me know. Also, what firmware version do you have? You can find out by looking at /proc/device-tree/openprom/ibm,fw-vernum-encoded. -Olof From paulus at samba.org Wed Mar 23 09:40:45 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 23 Mar 2005 09:40:45 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050323093441.728fa8ca.sfr@canb.auug.org.au> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> <200503221342.46575.arnd@arndb.de> <16960.32211.960982.13262@cargo.ozlabs.ibm.com> <20050323093441.728fa8ca.sfr@canb.auug.org.au> Message-ID: <16960.40685.698765.612497@cargo.ozlabs.ibm.com> Stephen Rothwell writes: > On Wed, 23 Mar 2005 07:19:31 +1100 Paul Mackerras wrote: > > > Arnd Bergmann writes: > > > > > One problem is that sign extension can not be expressed in architecture > > > independent C code. > > > > On which architectures does (long)(int) x not give the desired result? > > Presumably for a u32 function argument x which has been zero extended? Huh?? For a 64-bit architecture with 32-bit ints, (long)(int) x will sign-extend the bottom 32 bits of x to 64 bits. The top 32 bits don't matter, and in particular it doesn't matter if x has previously been zero-extended to 64 bits, or if x is a u32. Paul. 
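The claim above, that the cast sign-extends the bottom 32 bits regardless of what the top 32 bits hold, can be sketched in portable C. The helper name sign_extend32 is hypothetical, and fixed-width types are used in place of long and int so the sketch does not depend on the host's data model. One assumption is labeled in the code: converting an out-of-range value to int32_t is implementation-defined in ISO C, and the sketch relies on the modulo-2^32 wrap that GCC and Clang document.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper modelling "(long)(int) x" on a 64-bit machine with
 * 32-bit ints: keep the low 32 bits of a register value and sign-extend them.
 */
static int64_t sign_extend32(uint64_t reg)
{
    /*
     * uint32_t -> int32_t is implementation-defined for values above
     * INT32_MAX; GCC and Clang wrap modulo 2^32 (assumption of this sketch).
     */
    return (int64_t)(int32_t)(uint32_t)reg;
}

static void sign_extend32_demo(void)
{
    assert(sign_extend32(0x00000000FFFFFFFFULL) == -1); /* zero-extended -1 */
    assert(sign_extend32(0xFFFFFFFFFFFFFFFFULL) == -1); /* already sign-extended */
    assert(sign_extend32(0xDEADBEEF00000005ULL) == 5);  /* junk in the top half */
}
```

The three assertions cover a zero-extended value, an already sign-extended value, and a register with unrelated bits in the upper half; all give the same answer because only the low 32 bits are consulted.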
From sfr at canb.auug.org.au Wed Mar 23 09:58:22 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Wed, 23 Mar 2005 09:58:22 +1100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <16960.40685.698765.612497@cargo.ozlabs.ibm.com> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> <200503221342.46575.arnd@arndb.de> <16960.32211.960982.13262@cargo.ozlabs.ibm.com> <20050323093441.728fa8ca.sfr@canb.auug.org.au> <16960.40685.698765.612497@cargo.ozlabs.ibm.com> Message-ID: <20050323095822.5997fea5.sfr@canb.auug.org.au> On Wed, 23 Mar 2005 09:40:45 +1100 Paul Mackerras wrote: > > Stephen Rothwell writes: > > > On Wed, 23 Mar 2005 07:19:31 +1100 Paul Mackerras wrote: > > > > > Arnd Bergmann writes: > > > > > > > One problem is that sign extension can not be expressed in architecture > > > > independent C code. > > > > > > On which architectures does (long)(int) x not give the desired result? > > > > Presumably for a u32 function argument x which has been zero extended? > > Huh?? For a 64-bit architecture with 32-bit ints, (long)(int) x will > sign-extend the bottom 32 bits of x to 64 bits. The top 32 bits don't > matter, and in particular it doesn't matter if x has previously been > zero-extended to 64 bits, or if x is a u32. I was just being pedantic as this is the case we care about. It is quite possible that the compiler, upon seeing "int x" as a function argument, will assume that any value passed to the argument has been sign extended to the equivalent of long by any caller of the function (since ints as function arguments may actually be 64 bit registers). So the (long)(int) cast may be a no-op for this particular compiler, and if we have zero extended the value in assembler instead, the result will be incorrect. I am not saying that any compiler does do this, but it would still be internally consistent, no? (sfr hopes he is not making a fool of himself ...
:-)) -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050323/92207740/attachment.pgp From davem at davemloft.net Wed Mar 23 10:33:10 2005 From: davem at davemloft.net (David S. Miller) Date: Tue, 22 Mar 2005 15:33:10 -0800 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050323095822.5997fea5.sfr@canb.auug.org.au> References: <20050321171349.GA23908@krispykreme> <20050321135031.6f35932a.davem@redhat.com> <20050322142211.1b4337e4.sfr@canb.auug.org.au> <200503221342.46575.arnd@arndb.de> <16960.32211.960982.13262@cargo.ozlabs.ibm.com> <20050323093441.728fa8ca.sfr@canb.auug.org.au> <16960.40685.698765.612497@cargo.ozlabs.ibm.com> <20050323095822.5997fea5.sfr@canb.auug.org.au> Message-ID: <20050322153310.3ffdc616.davem@davemloft.net> On Wed, 23 Mar 2005 09:58:22 +1100 Stephen Rothwell wrote: > I was just being pedantic as this is the case we care about. It is quite > possible that the compiler upon seeing "int x" as an function argument > will assume that any value passed to the argument has been sign extended > to the equivalent of long by any caller of the function (since ints as > function arguments may actually be 64 bit registers). That is actually the case on sparc64. The ABI demands that all signed function arguments are sign extended to 64-bits by the caller. So it would only work if the argument was declared as unsigned. 
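Taken together with the zero-extend-at-dispatch scheme described earlier in the thread, the point about declaring the argument unsigned suggests stubs that receive unsigned 32-bit values and widen each one explicitly. The following is a rough model under stated assumptions, not any architecture's real stub: compat_widen, compat_fixup_args and the sign_mask encoding (bit n set means argument n is signed) are hypothetical names invented for illustration. Because the inputs are declared unsigned, the compiler has no grounds to assume the caller already sign-extended them, so the conversion has to be performed.

```c
#include <assert.h>
#include <stdint.h>

#define NARGS 6   /* six syscall arguments, as mentioned for ppc64 in the thread */

/* Widen one 32-bit compat argument to 64 bits (hypothetical helper). */
static int64_t compat_widen(uint32_t arg, int is_signed)
{
    /*
     * Unsigned case: value-preserving zero extension.
     * Signed case: uint32_t -> int32_t is implementation-defined for large
     * values; GCC and Clang wrap modulo 2^32 (assumption of this sketch).
     */
    return is_signed ? (int64_t)(int32_t)arg : (int64_t)arg;
}

/* Zero-extend every argument, sign-extending those selected by sign_mask. */
static void compat_fixup_args(const uint32_t in[NARGS], unsigned int sign_mask,
                              int64_t out[NARGS])
{
    int i;

    for (i = 0; i < NARGS; i++)
        out[i] = compat_widen(in[i], (int)((sign_mask >> i) & 1u));
}

static void compat_fixup_demo(void)
{
    const uint32_t in[NARGS] = { 0xFFFFFFFFu, 0xFFFFFFFFu, 7, 0, 0, 0 };
    int64_t out[NARGS];

    compat_fixup_args(in, 1u << 0, out);  /* only argument 0 is signed */
    assert(out[0] == -1);                 /* sign-extended */
    assert(out[1] == 4294967295LL);       /* zero-extended */
    assert(out[2] == 7);
}
```

A table of per-syscall masks is only one possible encoding; the SIGN1(stub, syscall, ARGn) macros proposed earlier in the thread express the same per-argument choice in assembler.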
From arnd at arndb.de Wed Mar 23 10:44:30 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Wed, 23 Mar 2005 00:44:30 +0100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <20050323095822.5997fea5.sfr@canb.auug.org.au> References: <20050321171349.GA23908@krispykreme> <16960.40685.698765.612497@cargo.ozlabs.ibm.com> <20050323095822.5997fea5.sfr@canb.auug.org.au> Message-ID: <200503230044.31986.arnd@arndb.de> On Tuesday 22 March 2005 23:58, Stephen Rothwell wrote: > I was just being pedantic as this is the case we care about. It is quite > possible that the compiler upon seeing "int x" as a function argument > will assume that any value passed to the argument has been sign extended > to the equivalent of long by any caller of the function (since ints as > function arguments may actually be 64 bit registers). So the (long)(int) > cast may be a no-op for this particular compiler and if we have zero > extended the value in assembler instead, the result will be incorrect. > > I am not saying that any compiler does do this, but it would still be > internally consistent, no? I think that some versions of gcc on s390 do exactly this, although I can't reproduce it right now (I only have gcc-2.95 at hand here). When you do a cast from unsigned to int, the compiler always has to do the sign-extend anyway. IIRC, the three architectures that I have looked at so far all behave slightly differently: - On s390x, a register containing an int uses the whole 64 bits, like a long int. When calling a function with int parameters, the _caller_ needs to do a sign-extend from the lower 32 bits of its local variable. - On x86_64, any 32 bit operation, even something trivial like mov eax,ebx clears the high 32 bits of the 64 bit register.
- On ppc64, the high order bits are ignored. Arnd <>< From david at gibson.dropbear.id.au Wed Mar 23 11:53:05 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 23 Mar 2005 11:53:05 +1100 Subject: [PATCH 2.6.11] AIO panic on PPC64 caused by is_hugepage_only_range() In-Reply-To: <1111519474.15956.40.camel@ibm-c.pdx.osdl.net> References: <1111108348.31932.43.camel@ibm-c.pdx.osdl.net> <20050321184113.0f5e2f6b.akpm@osdl.org> <1111519474.15956.40.camel@ibm-c.pdx.osdl.net> Message-ID: <20050323005305.GB29765@localhost.localdomain> On Tue, Mar 22, 2005 at 11:24:34AM -0800, Daniel McNeil wrote: > On Mon, 2005-03-21 at 18:41, Andrew Morton wrote: > > Did we fix this yet? > > > > Here's a patch against 2.6.11 that fixes the problem. > It changes is_hugepage_only_range() to take mm as an argument > and then changes the places that call it to pass 'mm'. > It includes a change for ia64 which has not been compiled. > It applies against the latest bk with some offset. > > Signed-off-by: Daniel McNeil Looks good to me. Acked-by: David Gibson -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From arnd at arndb.de Wed Mar 23 12:31:43 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Wed, 23 Mar 2005 02:31:43 +0100 Subject: [PATCH] ppc64: fix semtimedop compat syscall In-Reply-To: <200503222221.50367.arnd@arndb.de> References: <20050321171349.GA23908@krispykreme> <20050322092627.20d3898f.davem@davemloft.net> <200503222221.50367.arnd@arndb.de> Message-ID: <200503230231.45154.arnd@arndb.de> On Tuesday 22 March 2005 22:21, Arnd Bergmann wrote: > This is what I have now. AFAICS, this should work for all architectures, > even without having to zero-extend all arguments.
The output currently > looks like > > compat_s(sys32_exit, sys_exit) > compat_u_p_u(sys32_read, sys32_read) > compat_u_p_u(sys32_write, sys32_write) Ok, a different approach this time, now almost architecture independent. This won't work on s390 because it does not handle 31 bit pointers, but it makes it possible to autogenerate the function stubs that are used on ppc64 at the moment. This is still proof-of-concept code, to see if the idea makes sense. Arnd <>< -------------- next part -------------- A non-text attachment was scrubbed... Name: compat_wrapper.diff Type: text/x-diff Size: 18906 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050323/1a4bdcd6/attachment.diff From apw at us.ibm.com Wed Mar 23 13:52:14 2005 From: apw at us.ibm.com (Amos Waterland) Date: Tue, 22 Mar 2005 21:52:14 -0500 Subject: [patch] reloc_offset comment Message-ID: <20050323025214.GA25788@kvasir.austin.ibm.com> The code in reloc_offset is actually subtracting the address in the link register from the address calculated by the linker. Perhaps the extended mnemonic `sub' replaced an original `subf' and the comment just did not get updated. bl 1f 1: mflr r3 LOADADDR(r4,1b) sub r3,r4,r3 Signed-off-by: Amos Waterland ===== arch/ppc64/kernel/misc.S 1.98 vs edited ===== --- 1.98/arch/ppc64/kernel/misc.S 2005-01-14 14:56:04 -05:00 +++ edited/arch/ppc64/kernel/misc.S 2005-03-22 21:35:48 -05:00 @@ -32,7 +32,7 @@ .text /* - * Returns (address we're running at) - (address we were linked at) + * Returns (address we were linked at) - (address we are running at) * for use before the text and data are mapped to KERNELBASE. 
*/ From iamroot at ca.ibm.com Wed Mar 23 14:54:10 2005 From: iamroot at ca.ibm.com (Omkhar Arasaratnam) Date: Tue, 22 Mar 2005 22:54:10 -0500 Subject: [BUG] 2.6.11- sym53c8xx Broken on pp64 In-Reply-To: <20050322223752.GE26568@austin.ibm.com> References: <423F361F.8010609@ca.ibm.com> <20050321221433.GD28665@austin.ibm.com> <423FA818.1040102@ca.ibm.com> <20050322223752.GE26568@austin.ibm.com> Message-ID: <4240E862.4030908@ca.ibm.com> Olof Johansson wrote: >On Tue, Mar 22, 2005 at 12:07:36AM -0500, Omkhar Arasaratnam wrote: > > > >>2.6.10 works just fine save an initrd error which *could* be my fault >> >> > >2.6.11, 2.6.11.5 and 2.6.12-rc1-bk1 all work here on a p630 in SMP >(nonpartitioned) mode. You didn't answer if you're running in partitioned >mode or not, unfortunately this system doesn't have an HMC to try >partitioned mode with. > >There are some peculiarities with p630s in partitioned mode, they're one >of the few machines that have dma-window properties at several levels >in the tree for the partition that has the internal I/O assigned to it >but the logic shouldn't have changed in 2.6.11. If that's your config, >let me know. > >Also, what firmware version do you have? You can find out by looking at >/proc/device-tree/openprom/ibm,fw-vernum-encoded. > > >-Olof > > > Hey Olof, Sorry it is not in LPAR mode, firmware version is RG041029_d79e00_regatta Did you want me to send you my kernel-config? Omkhar From michael at ellerman.id.au Wed Mar 23 21:33:42 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Wed, 23 Mar 2005 21:33:42 +1100 Subject: [PATCH] ppc64: Make numa=off command line argument work again Message-ID: <200503232133.42514.michael@ellerman.id.au> Hi Andrew, Mike's patch "ppc64: NUMA memory fixup (another try)" broke the numa code when "numa=off" is specified on the kernel command line. The fix is to pretend everything is in node 0 when numa is disabled. Boot tested on pSeries LPAR with numa=off and numa=debug (ie. on). 
cheers Index: 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/mm/numa.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c @@ -609,7 +609,7 @@ void __init do_init_bootmem(void) new_range: mem_start = read_n_cells(addr_cells, &memcell_buf); mem_size = read_n_cells(size_cells, &memcell_buf); - numa_domain = of_node_numa_domain(memory); + numa_domain = numa_enabled ? of_node_numa_domain(memory) : 0; if (numa_domain != nid) continue; From michael at ellerman.id.au Wed Mar 23 22:19:48 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Wed, 23 Mar 2005 22:19:48 +1100 Subject: [PATCH] ppc64: Fix bounds checking in new numa code Message-ID: <200503232219.48702.michael@ellerman.id.au> Hi Mike, I've noticed what I think is an incorrect bounds check in your new numa code. Correct me if I'm wrong. Come to think of it, why do we even need this? The start_paddr/end_paddr was just computed from the memory regions we're now looking at, so I don't see how this can ever be false. Is there something you're trying to cater for? cheers Index: 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/mm/numa.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c @@ -614,8 +614,8 @@ new_range: if (numa_domain != nid) continue; - if (mem_start < end_paddr && - (mem_start+mem_size) > start_paddr) { + if (mem_start >= start_paddr && + (mem_start + mem_size) <= end_paddr) { /* should be no overlaps ! 
*/ dbg("free_bootmem %lx %lx\n", mem_start, mem_size); free_bootmem_node(NODE_DATA(nid), mem_start, From michael at ellerman.id.au Wed Mar 23 22:41:45 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Wed, 23 Mar 2005 22:41:45 +1100 Subject: [PATCH] ppc64: xmon: Add dl command to dump contents of __log_buf Message-ID: <200503232241.46254.michael@ellerman.id.au> Hi Paulus, As you know there's a small window during boot on PPC64 when things are being printk'ed but don't hit the console yet, although xmon works (??). There are also times when you get a box which is sitting in xmon but the console log has scrolled off. In both these cases it's sometimes useful to print the console log, but using xmon's 'd' command produces hard-to-read output. This patch adds a command 'dl' to xmon which simply looks up the __log_buf symbol and prints the contents in slightly more readable form. I'm not really sure what's going on here with the catch_memory_errors business, but this looked sensible going by other examples in xmon.c cheers Signed-off-by: Michael Ellerman Index: 2.6.12-rc1-mem-limit/arch/ppc64/xmon/xmon.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/xmon/xmon.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/xmon/xmon.c @@ -99,6 +99,7 @@ static int bsesc(void); static void dump(void); static void prdump(unsigned long, long); static int ppc_inst_dump(unsigned long, long, int); +static void dump_log_buf(void); void print_address(unsigned long); static void backtrace(struct pt_regs *); static void excprint(struct pt_regs *); @@ -175,6 +176,7 @@ Commands:\n\ di dump instructions\n\ df dump float values\n\ dd dump double values\n\ + dl dump the kernel log buffer\n\ e print exception information\n\ f flush cache\n\ la lookup symbol+offset of specified address\n\ @@ -1950,6 +1952,8 @@ dump(void) nidump = MAX_DUMP; adrs += ppc_inst_dump(adrs, nidump, 1); last_cmd = "di\n"; + } else if (c == 'l') { + dump_log_buf(); } 
else { scanhex(&ndump); if (ndump == 0) @@ -2046,6 +2050,50 @@ print_address(unsigned long addr) xmon_print_symbol(addr, "\t# ", ""); } +void +dump_log_buf(void) +{ + const unsigned long size = 128; + unsigned long i, end, addr; + unsigned char buf[size + 1]; + + addr = 0; + buf[size] = '\0'; + + if (setjmp(bus_error_jmp) != 0) { + printf("Unable to lookup symbol __log_buf!\n"); + return; + } + + catch_memory_errors = 1; + sync(); + addr = kallsyms_lookup_name("__log_buf"); + + if (! addr) + printf("Symbol __log_buf not found!\n"); + else { + end = addr + (1 << CONFIG_LOG_BUF_SHIFT); + while (addr < end) { + if (! mread(addr, buf, size)) { + printf("Can't read memory at address 0x%lx\n", addr); + break; + } + + printf("%s", buf); + + if (strlen(buf) < size) + break; + + addr += size; + } + } + + sync(); + /* wait a little while to see if we get a machine check */ + __delay(200); + catch_memory_errors = 0; +} + /* * Memory operations - move, set, print differences From michael at ellerman.id.au Wed Mar 23 23:11:10 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Wed, 23 Mar 2005 23:11:10 +1100 Subject: [PATCH] ppc64: Add mem=X option, updated NUMA support Message-ID: <200503232311.11113.michael@ellerman.id.au> Hi Mike, Here's an updated version of my mem=X patch with new NUMA code. Sorry it took so long I've been a bit crook. Can you test this on your 720 or whatever it was? And if anyone else has an interesting NUMA machine they can test it on I'd love to hear about it! This also includes a fix for the bug Maneesh found last week with GCC 3.3 generating a switch table in prom_init.c, which doesn't work. I've boot tested this on pSeries LPAR and I'll get it going on a G5, and Power3 tomorrow. Hopefully that'll go well and it'll boot on the 720, and then we can merge it, fingers x'ed! NB. This applies on top of the other two NUMA fixes I just sent. If they're no good let me know and I'll roll this doobie again without them. 
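A rough user-space sketch of the mem=X parsing this patch relies on, for illustration only. The sketch_ names are hypothetical, not the kernel's; it assumes the same approach as the patch's prom_memparse(): an if chain instead of a switch (the patch comment notes GCC may emit a jump table that fails while the code is not running at its link address), followed by rounding up to 16 MB.

```c
#include <stddef.h>

/* Parse decimal, octal (leading 0) or hex (leading 0x) digits. */
static unsigned long sketch_strtoul(const char *cp, const char **endp)
{
    unsigned long result = 0, base = 10;

    if (*cp == '0') {
        base = 8;
        cp++;
        if (*cp == 'x' || *cp == 'X') {
            base = 16;
            cp++;
        }
    }

    for (;;) {
        unsigned long value;
        char c = *cp;

        if (c >= '0' && c <= '9')
            value = (unsigned long)(c - '0');
        else if (c >= 'a' && c <= 'f')
            value = (unsigned long)(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F')
            value = (unsigned long)(c - 'A' + 10);
        else
            break;

        if (value >= base)  /* digit not valid in this base */
            break;

        result = result * base + value;
        cp++;
    }

    if (endp)
        *endp = cp;
    return result;
}

/* Apply an optional K/M/G suffix using if/else rather than a switch. */
static unsigned long sketch_memparse(const char *ptr, const char **retptr)
{
    const char *end;
    unsigned long ret = sketch_strtoul(ptr, &end);
    int shift = 0;

    if (*end == 'G' || *end == 'g')
        shift = 30;
    else if (*end == 'M' || *end == 'm')
        shift = 20;
    else if (*end == 'K' || *end == 'k')
        shift = 10;

    if (shift) {
        ret <<= shift;
        end++;
    }

    if (retptr)
        *retptr = end;
    return ret;
}

/* Round x up to a multiple of a; a must be a power of two. */
static unsigned long sketch_align_up(unsigned long x, unsigned long a)
{
    return (x + a - 1) & ~(a - 1);
}
```

For example, "mem=512M" would yield 512 << 20 bytes, and a limit of 0x1800000 would round up to 0x2000000 under the 16 MB alignment.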
cheers arch/ppc64/kernel/iSeries_setup.c | 38 +++++++--- arch/ppc64/kernel/lmb.c | 33 +++++++++ arch/ppc64/kernel/prom.c | 15 ++++ arch/ppc64/kernel/prom_init.c | 137 +++++++++++++++++++++++++++++++++++--- arch/ppc64/kernel/setup.c | 20 ++++- arch/ppc64/mm/hash_utils.c | 23 +++++- arch/ppc64/mm/numa.c | 55 ++++++++++++--- include/asm-ppc64/lmb.h | 1 8 files changed, 284 insertions(+), 38 deletions(-) Index: 2.6.12-rc1-mem-limit/arch/ppc64/kernel/setup.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/kernel/setup.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/kernel/setup.c @@ -642,12 +642,11 @@ void __init setup_system(void) early_console_initialized = 1; register_console(&udbg_console); -#endif /* !CONFIG_PPC_ISERIES */ - /* Save unparsed command line copy for /proc/cmdline */ strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); parse_early_param(); +#endif /* !CONFIG_PPC_ISERIES */ #if defined(CONFIG_SMP) && !defined(CONFIG_PPC_ISERIES) /* @@ -806,20 +805,31 @@ struct seq_operations cpuinfo_op = { .show = show_cpuinfo, }; -#if 0 /* XXX not currently used */ +/* + * These three variables are used to save values passed to us by prom_init() + * via the device tree. The TCE variables are needed because with a memory_limit + * in force we may need to explicitly map the TCE are at the top of RAM. + */ unsigned long memory_limit; +unsigned long tce_alloc_start; +unsigned long tce_alloc_end; +#ifdef CONFIG_PPC_ISERIES +/* + * On iSeries we just parse the mem=X option from the command line. 
+ * On pSeries it's a bit more complicated, see prom_init_mem() + */ static int __init early_parsemem(char *p) { if (!p) return 0; - memory_limit = memparse(p, &p); + memory_limit = ALIGN(memparse(p, &p), PAGE_SIZE); return 0; } early_param("mem", early_parsemem); -#endif +#endif /* CONFIG_PPC_ISERIES */ #ifdef CONFIG_PPC_MULTIPLATFORM static int __init set_preferred_console(void) Index: 2.6.12-rc1-mem-limit/arch/ppc64/kernel/lmb.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/kernel/lmb.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/kernel/lmb.c @@ -344,3 +344,36 @@ lmb_abs_to_phys(unsigned long aa) return pa; } + +/* + * Truncate the lmb list to memory_limit if it's set + * You must call lmb_analyze() after this. + */ +void __init lmb_apply_memory_limit(void) +{ + extern unsigned long memory_limit; + unsigned long i, limit; + struct lmb_region *mem = &(lmb.memory); + + if (! memory_limit) + return; + + limit = memory_limit; + for (i = 0; i < mem->cnt; i++) { + if (limit > mem->region[i].size) { + limit -= mem->region[i].size; + continue; + } + +#ifdef DEBUG + udbg_printf("lmb_truncate(): truncating at region %x\n", i); + udbg_printf("lmb_truncate(): total = %x\n", total); + udbg_printf("lmb_truncate(): size = %x\n", mem->region[i].size); + udbg_printf("lmb_truncate(): crop = %x\n", crop); +#endif + + mem->region[i].size = limit; + mem->cnt = i + 1; + break; + } +} Index: 2.6.12-rc1-mem-limit/include/asm-ppc64/lmb.h =================================================================== --- 2.6.12-rc1-mem-limit.orig/include/asm-ppc64/lmb.h +++ 2.6.12-rc1-mem-limit/include/asm-ppc64/lmb.h @@ -51,6 +51,7 @@ extern unsigned long __init lmb_alloc_ba extern unsigned long __init lmb_phys_mem_size(void); extern unsigned long __init lmb_end_of_DRAM(void); extern unsigned long __init lmb_abs_to_phys(unsigned long); +extern void __init lmb_apply_memory_limit(void); extern void lmb_dump_all(void); Index: 
2.6.12-rc1-mem-limit/arch/ppc64/kernel/iSeries_setup.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/kernel/iSeries_setup.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/kernel/iSeries_setup.c @@ -284,7 +284,7 @@ unsigned long iSeries_process_mainstore_ return mem_blocks; } -static void __init iSeries_parse_cmdline(void) +static void __init iSeries_get_cmdline(void) { char *p, *q; @@ -304,6 +304,8 @@ static void __init iSeries_parse_cmdline /*static*/ void __init iSeries_init_early(void) { + extern unsigned long memory_limit; + DBG(" -> iSeries_init_early()\n"); ppcdbg_initialize(); @@ -351,6 +353,29 @@ static void __init iSeries_parse_cmdline */ build_iSeries_Memory_Map(); + iSeries_get_cmdline(); + + /* Save unparsed command line copy for /proc/cmdline */ + strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); + + /* Parse early parameters, in particular mem=x */ + parse_early_param(); + + if (memory_limit) { + if (memory_limit > systemcfg->physicalMemorySize) + printk("Ignoring 'mem' option, value %lu is too large.\n", memory_limit); + else + systemcfg->physicalMemorySize = memory_limit; + } + + /* Bolt kernel mappings for all of memory (or just a bit if we've got a limit) */ + iSeries_bolt_kernel(0, systemcfg->physicalMemorySize); + + lmb_init(); + lmb_add(0, systemcfg->physicalMemorySize); + lmb_analyze(); /* ?? 
*/ + lmb_reserve(0, __pa(klimit)); + /* Initialize machine-dependency vectors */ #ifdef CONFIG_SMP smp_init_iSeries(); @@ -376,9 +401,6 @@ static void __init iSeries_parse_cmdline initrd_start = initrd_end = 0; #endif /* CONFIG_BLK_DEV_INITRD */ - - iSeries_parse_cmdline(); - DBG(" <- iSeries_init_early()\n"); } @@ -539,14 +561,6 @@ static void __init build_iSeries_Memory_ * nextPhysChunk */ systemcfg->physicalMemorySize = chunk_to_addr(nextPhysChunk); - - /* Bolt kernel mappings for all of memory */ - iSeries_bolt_kernel(0, systemcfg->physicalMemorySize); - - lmb_init(); - lmb_add(0, systemcfg->physicalMemorySize); - lmb_analyze(); /* ?? */ - lmb_reserve(0, __pa(klimit)); } /* Index: 2.6.12-rc1-mem-limit/arch/ppc64/kernel/prom.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/kernel/prom.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/kernel/prom.c @@ -878,6 +878,8 @@ static int __init early_init_dt_scan_cho const char *full_path, void *data) { u32 *prop; + u64 *prop64; + extern unsigned long memory_limit, tce_alloc_start, tce_alloc_end; if (strcmp(full_path, "/chosen") != 0) return 0; @@ -894,6 +896,18 @@ static int __init early_init_dt_scan_cho if (get_flat_dt_prop(node, "linux,iommu-force-on", NULL) != NULL) iommu_force_on = 1; + prop64 = (u64*)get_flat_dt_prop(node, "linux,memory-limit", NULL); + if (prop64) + memory_limit = *prop64; + + prop64 = (u64*)get_flat_dt_prop(node, "linux,tce-alloc-start", NULL); + if (prop64) + tce_alloc_start = *prop64; + + prop64 = (u64*)get_flat_dt_prop(node, "linux,tce-alloc-end", NULL); + if (prop64) + tce_alloc_end = *prop64; + #ifdef CONFIG_PPC_PSERIES /* To help early debugging via the front panel, we retreive a minimal * set of RTAS infos now if available @@ -1033,6 +1047,7 @@ void __init early_init_devtree(void *par lmb_init(); scan_flat_dt(early_init_dt_scan_root, NULL); scan_flat_dt(early_init_dt_scan_memory, NULL); + lmb_apply_memory_limit(); lmb_analyze(); 
systemcfg->physicalMemorySize = lmb_phys_mem_size(); lmb_reserve(0, __pa(klimit)); Index: 2.6.12-rc1-mem-limit/arch/ppc64/mm/hash_utils.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/mm/hash_utils.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/mm/hash_utils.c @@ -149,6 +149,8 @@ void __init htab_initialize(void) unsigned long pteg_count; unsigned long mode_rw; int i, use_largepages = 0; + unsigned long base = 0, size = 0; + extern unsigned long tce_alloc_start, tce_alloc_end; DBG(" -> htab_initialize()\n"); @@ -204,8 +206,6 @@ void __init htab_initialize(void) /* create bolted the linear mapping in the hash table */ for (i=0; i < lmb.memory.cnt; i++) { - unsigned long base, size; - base = lmb.memory.region[i].physbase + KERNELBASE; size = lmb.memory.region[i].size; @@ -234,6 +234,25 @@ void __init htab_initialize(void) #endif /* CONFIG_U3_DART */ create_pte_mapping(base, base + size, mode_rw, use_largepages); } + + /* + * If we have a memory_limit and we've allocated TCEs then we need to + * explicitly map the TCE area at the top of RAM. We also cope with the + * case that the TCEs start below memory_limit. + * tce_alloc_start/end are 16MB aligned so the mapping should work + * for either 4K or 16MB pages. + */ + if (tce_alloc_start) { + tce_alloc_start += KERNELBASE; + tce_alloc_end += KERNELBASE; + + if (base + size >= tce_alloc_start) + tce_alloc_start = base + size + 1; + + create_pte_mapping(tce_alloc_start, tce_alloc_end, + mode_rw, use_largepages); + } + DBG(" <- htab_initialize()\n"); } #undef KB Index: 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/mm/numa.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c @@ -285,6 +285,38 @@ static int cpu_numa_callback(struct noti return ret; } +/* + * Check and possibly modify a memory region to enforce the memory limit. 
+ * + * Returns the size the region should have to enforce the memory limit. + * This will either be the original value of size, a truncated value, + * or zero. If the returned value of size is 0 the region should be + * discarded as it lies wholy above the memory limit. + */ +static unsigned long __init numa_enforce_memory_limit(unsigned long start, unsigned long size) +{ + /* We use lmb_end_of_DRAM() in here instead of memory_limit because + * we've already adjusted it for the limit and it takes care of + * having memory holes below the limit. */ + extern unsigned long memory_limit; + + if (! memory_limit) + return size; + + if (start + size <= lmb_end_of_DRAM()) { + dbg("numa_enforce_memory_limit() size = %lx\n", size); + return size; + } + + if (start >= lmb_end_of_DRAM()) { + dbg("numa_enforce_memory_limit() size = %lx\n", (unsigned long)0); + return 0; + } + + dbg("numa_enforce_memory_limit() size = %lx\n", lmb_end_of_DRAM() - start); + return lmb_end_of_DRAM() - start; +} + static int __init parse_numa_properties(void) { struct device_node *cpu = NULL; @@ -373,6 +405,13 @@ new_range: if (max_domain < numa_domain) max_domain = numa_domain; + if (! (size = numa_enforce_memory_limit(start, size))) { + if (--ranges) + goto new_range; + else + continue; + } + /* * Initialize new node struct, or add to an existing one. */ @@ -405,8 +444,7 @@ new_range: numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = numa_domain; - ranges--; - if (ranges) + if (--ranges) goto new_range; } @@ -614,14 +652,16 @@ new_range: if (numa_domain != nid) continue; - if (mem_start >= start_paddr && - (mem_start + mem_size) <= end_paddr) { - /* should be no overlaps ! */ - dbg("free_bootmem %lx %lx\n", mem_start, mem_size); - free_bootmem_node(NODE_DATA(nid), mem_start, - mem_size); - } - + if ((mem_size = numa_enforce_memory_limit(mem_start, mem_size))) { + if (mem_start >= start_paddr && + (mem_start + mem_size) <= end_paddr) { + /* should be no overlaps ! 
*/ + dbg("free_bootmem %lx %lx\n", mem_start, mem_size); + free_bootmem_node(NODE_DATA(nid), mem_start, + mem_size); + } + } + if (--ranges) /* process all ranges in cell */ goto new_range; } Index: 2.6.12-rc1-mem-limit/arch/ppc64/kernel/prom_init.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/kernel/prom_init.c +++ 2.6.12-rc1-mem-limit/arch/ppc64/kernel/prom_init.c @@ -177,6 +177,10 @@ static int __initdata of_platform; static char __initdata prom_cmd_line[COMMAND_LINE_SIZE]; +static unsigned long __initdata prom_memory_limit; +static unsigned long __initdata prom_tce_alloc_start; +static unsigned long __initdata prom_tce_alloc_end; + static unsigned long __initdata alloc_top; static unsigned long __initdata alloc_top_high; static unsigned long __initdata alloc_bottom; @@ -384,10 +388,70 @@ static int __init prom_setprop(phandle n (u32)(unsigned long) value, (u32) valuelen); } +/* We can't use the standard versions because of RELOC headaches. */ +#define isxdigit(c) (('0' <= (c) && (c) <= '9') \ + || ('a' <= (c) && (c) <= 'f') \ + || ('A' <= (c) && (c) <= 'F')) + +#define isdigit(c) ('0' <= (c) && (c) <= '9') +#define islower(c) ('a' <= (c) && (c) <= 'z') +#define toupper(c) (islower(c) ? ((c) - 'a' + 'A') : (c)) + +unsigned long prom_strtoul(const char *cp, const char **endp) +{ + unsigned long result = 0, base = 10, value; + + if (*cp == '0') { + base = 8; + cp++; + if (toupper(*cp) == 'X') { + cp++; + base = 16; + } + } + + while (isxdigit(*cp) && + (value = isdigit(*cp) ? 
*cp - '0' : toupper(*cp) - 'A' + 10) < base) { + result = result * base + value; + cp++; + } + + if (endp) + *endp = cp; + + return result; +} + +unsigned long prom_memparse(const char *ptr, const char **retptr) +{ + unsigned long ret = prom_strtoul(ptr, retptr); + int shift = 0; + + /* + * We can't use a switch here because GCC *may* generate a + * jump table which won't work, because we're not running at + * the address we're linked at. + */ + if ('G' == **retptr || 'g' == **retptr) + shift = 30; + + if ('M' == **retptr || 'm' == **retptr) + shift = 20; + + if ('K' == **retptr || 'k' == **retptr) + shift = 10; + + if (shift) { + ret <<= shift; + (*retptr)++; + } + + return ret; +} /* * Early parsing of the command line passed to the kernel, used for - * the options that affect the iommu + * "mem=x" and the options that affect the iommu */ static void __init early_cmdline_parse(void) { @@ -418,6 +482,14 @@ static void __init early_cmdline_parse(v else if (!strncmp(opt, RELOC("force"), 5)) RELOC(iommu_force_on) = 1; } + + opt = strstr(RELOC(prom_cmd_line), RELOC("mem=")); + if (opt) { + opt += 4; + RELOC(prom_memory_limit) = prom_memparse(opt, (const char **)&opt); + /* Align to 16 MB == size of large page */ + RELOC(prom_memory_limit) = ALIGN(RELOC(prom_memory_limit), 0x1000000); + } } /* @@ -664,15 +736,7 @@ static void __init prom_init_mem(void) } } - /* Setup our top/bottom alloc points, that is top of RMO or top of - * segment 0 when running non-LPAR - */ - if ( RELOC(of_platform) == PLATFORM_PSERIES_LPAR ) - RELOC(alloc_top) = RELOC(rmo_top); - else - RELOC(alloc_top) = RELOC(rmo_top) = min(0x40000000ul, RELOC(ram_top)); RELOC(alloc_bottom) = PAGE_ALIGN(RELOC(klimit) - offset + 0x4000); - RELOC(alloc_top_high) = RELOC(ram_top); /* Check if we have an initrd after the kernel, if we do move our bottom * point to after it @@ -681,8 +745,41 @@ static void __init prom_init_mem(void) if (RELOC(prom_initrd_end) > RELOC(alloc_bottom)) RELOC(alloc_bottom) = 
PAGE_ALIGN(RELOC(prom_initrd_end)); } + + /* + * If prom_memory_limit is set we reduce the upper limits *except* for + * alloc_top_high. This must be the real top of RAM so we can put + * TCE's up there. + */ + + RELOC(alloc_top_high) = RELOC(ram_top); + + if (RELOC(prom_memory_limit)) { + if (RELOC(prom_memory_limit) <= RELOC(alloc_bottom)) { + prom_printf("Ignoring mem=%x <= alloc_bottom.\n", + RELOC(prom_memory_limit)); + RELOC(prom_memory_limit) = 0; + } else if (RELOC(prom_memory_limit) >= RELOC(ram_top)) { + prom_printf("Ignoring mem=%x >= ram_top.\n", + RELOC(prom_memory_limit)); + RELOC(prom_memory_limit) = 0; + } else { + RELOC(ram_top) = RELOC(prom_memory_limit); + RELOC(rmo_top) = min(RELOC(rmo_top), RELOC(prom_memory_limit)); + } + } + + /* + * Setup our top alloc point, that is top of RMO or top of + * segment 0 when running non-LPAR. + */ + if ( RELOC(of_platform) == PLATFORM_PSERIES_LPAR ) + RELOC(alloc_top) = RELOC(rmo_top); + else + RELOC(alloc_top) = RELOC(rmo_top) = min(0x40000000ul, RELOC(ram_top)); prom_printf("memory layout at init:\n"); + prom_printf(" memory_limit : %x (16 MB aligned)\n", RELOC(prom_memory_limit)); prom_printf(" alloc_bottom : %x\n", RELOC(alloc_bottom)); prom_printf(" alloc_top : %x\n", RELOC(alloc_top)); prom_printf(" alloc_top_hi : %x\n", RELOC(alloc_top_high)); @@ -871,6 +968,16 @@ static void __init prom_initialize_tce_t reserve_mem(local_alloc_bottom, local_alloc_top - local_alloc_bottom); + if (RELOC(prom_memory_limit)) { + /* + * We align the start to a 16MB boundary so we can map the TCE area + * using large pages if possible. The end should be the top of RAM + * so no need to align it. 
+ */ + RELOC(prom_tce_alloc_start) = _ALIGN_DOWN(local_alloc_bottom, 0x1000000); + RELOC(prom_tce_alloc_end) = local_alloc_top; + } + /* Flag the first invalid entry */ prom_debug("ending prom_initialize_tce_table\n"); } @@ -1684,9 +1791,21 @@ unsigned long __init prom_init(unsigned */ if (RELOC(ppc64_iommu_off)) prom_setprop(_prom->chosen, "linux,iommu-off", NULL, 0); + if (RELOC(iommu_force_on)) prom_setprop(_prom->chosen, "linux,iommu-force-on", NULL, 0); + if (RELOC(prom_memory_limit)) + prom_setprop(_prom->chosen, "linux,memory-limit", + PTRRELOC(&prom_memory_limit), sizeof(RELOC(prom_memory_limit))); + + if (RELOC(prom_tce_alloc_start)) { + prom_setprop(_prom->chosen, "linux,tce-alloc-start", + PTRRELOC(&prom_tce_alloc_start), sizeof(RELOC(prom_tce_alloc_start))); + prom_setprop(_prom->chosen, "linux,tce-alloc-end", + PTRRELOC(&prom_tce_alloc_end), sizeof(RELOC(prom_tce_alloc_end))); + } + /* * Now finally create the flattened device-tree */ From markw at osdl.org Thu Mar 24 03:05:30 2005 From: markw at osdl.org (Mark Wong) Date: Wed, 23 Mar 2005 08:05:30 -0800 Subject: ext3 journalling BUG on full filesystem Message-ID: <20050323160530.GA17716@osdl.org> Hi, I'm running 2.6.11 and I'm suspecting that a full ext3 filesystem is causing the following problem when performing some journaling operation. Let me know if I can provide more details: cpu 0x6: Vector: 700 (Program Check) at [c0000002e4f3f6d0] pc: c0000000000a5fbc: .submit_bh+0x64/0x1fc lr: c0000000000a62b4: .ll_rw_block+0x160/0x164 sp: c0000002e4f3f950 msr: 8000000000029032 current = 0xc00000038ff5c7c0 paca = 0xc000000000612000 pid = 1376, comm = kjournald kernel BUG in submit_bh at fs/buffer.c:2706! enter ? 
for help 6:mon> t [c0000002e4f3f9f0] c0000000000a62b4 .ll_rw_block+0x160/0x164 [c0000002e4f3fab0] c000000000151ac4 .journal_commit_transaction+0xd88/0x16d4 [c0000002e4f3fe30] c000000000155590 .kjournald+0x114/0x308 [c0000002e4f3ff90] c000000000013ec0 .kernel_thread+0x4c/0x6c -- Mark Wong - - markw at osdl.org Open Source Development Lab Inc - A non-profit corporation 12725 SW Millikan Way - Suite 400 - Beaverton, OR 97005 (503) 626-2455 (office) (503) 626-2436 (fax) http://developer.osdl.org/markw/ From olof at austin.ibm.com Thu Mar 24 03:17:28 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Wed, 23 Mar 2005 10:17:28 -0600 Subject: ext3 journalling BUG on full filesystem In-Reply-To: <20050323160530.GA17716@osdl.org> References: <20050323160530.GA17716@osdl.org> Message-ID: <20050323161728.GA23179@austin.ibm.com> On Wed, Mar 23, 2005 at 08:05:30AM -0800, Mark Wong wrote: > Hi, > > I'm running 2.6.11 and I'm suspecting that a full ext3 filesystem is > causing the following problem when performing some journaling > operation. Let me know if I can provide more details: Hi Mark, At least from first look at the information, that doesn't look like a ppc64-specific bug. You might be better off reporting this on linux-kernel. Is there anything else on the console before the XMON output? -Olof From markw at osdl.org Thu Mar 24 03:26:07 2005 From: markw at osdl.org (Mark Wong) Date: Wed, 23 Mar 2005 08:26:07 -0800 Subject: ext3 journalling BUG on full filesystem In-Reply-To: <20050323161728.GA23179@austin.ibm.com> References: <20050323160530.GA17716@osdl.org> <20050323161728.GA23179@austin.ibm.com> Message-ID: <20050323162607.GA18596@osdl.org> On Wed, Mar 23, 2005 at 10:17:28AM -0600, Olof Johansson wrote: > On Wed, Mar 23, 2005 at 08:05:30AM -0800, Mark Wong wrote: > > Hi, > > > > I'm running 2.6.11 and I'm suspecting that a full ext3 filesystem is > > causing the following problem when performing some journaling > > operation. 
Let me know if I can provide more details: > > Hi Mark, > > At least from first look at the information, that doesn't look like a > ppc64-specific bug. You might be better off reporting this on > linux-kernel. > > Is there anything else on the console before the XMON output? Hi Olof, No, I'm afraid not. I'll forward the information to the linux-kernel list. Thanks, Mark From kravetz at us.ibm.com Thu Mar 24 05:36:11 2005 From: kravetz at us.ibm.com (Mike kravetz) Date: Wed, 23 Mar 2005 10:36:11 -0800 Subject: [PATCH] ppc64: Fix bounds checking in new numa code In-Reply-To: <200503232219.48702.michael@ellerman.id.au> References: <200503232219.48702.michael@ellerman.id.au> Message-ID: <20050323183611.GA3986@w-mikek2.ibm.com> On Wed, Mar 23, 2005 at 10:19:48PM +1100, Michael Ellerman wrote: > > I've noticed what I think is an incorrect bounds check in your > new numa code. Correct me if I'm wrong. I incorrectly kept that check in from the current/existing code. You are correct in saying that it is not needed at all. I believe this check (and the corresponding start/end adjustment) was needed in the case of coalesced LMBs. In these cases, a single 'LMB' could span multiple nodes. This issue goes away when just looking at raw memory blocks. -- Mike From kravetz at us.ibm.com Thu Mar 24 08:54:13 2005 From: kravetz at us.ibm.com (Mike kravetz) Date: Wed, 23 Mar 2005 13:54:13 -0800 Subject: [PATCH] ppc64: Add mem=X option, updated NUMA support In-Reply-To: <200503232311.11113.michael@ellerman.id.au> References: <200503232311.11113.michael@ellerman.id.au> Message-ID: <20050323215413.GF3986@w-mikek2.ibm.com> On Wed, Mar 23, 2005 at 11:11:10PM +1100, Michael Ellerman wrote: > > Can you test this on your 720 or whatever it was? And if anyone else > has an interesting NUMA machine they can test it on I'd love to hear > about it! > I've tested this with various config options on my 720. Appears to work well on all that I tested. 
-- Mike From benh at kernel.crashing.org Thu Mar 24 17:57:08 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 24 Mar 2005 17:57:08 +1100 Subject: [PATCH] ppc64: Fix ethernet PHY reset on iMac G5 Message-ID: <1111647428.5570.25.camel@gaston> Hi ! On iMac G5, when netbooting (or causing any other ethernet activity from within the Open Firmware environment), the PHY is put into a low power state before booting the OS. The result is that Linux doesn't see it and networking doesn't work. This patch adds the ethernet PHY reset platform hook to pmac_feature.c on ppc64 (it already is commonly used on ppc32 as lots of macs have this same problem, so the hook definition is already there and sungem is already calling it). Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/kernel/pmac_feature.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/pmac_feature.c 2005-03-15 11:56:46.000000000 +1100 +++ linux-work/arch/ppc64/kernel/pmac_feature.c 2005-03-22 15:30:32.000000000 +1100 @@ -220,6 +220,36 @@ return 0; } +static long __pmac g5_eth_phy_reset(struct device_node* node, long param, long value) +{ + struct macio_chip* macio = &macio_chips[0]; + struct device_node *phy; + int need_reset; + + /* + * We must not reset the combo PHYs, only the BCM5221 found in + * the iMac G5. + */ + phy = of_get_next_child(node, NULL); + if (!phy) + return -ENODEV; + need_reset = device_is_compatible(phy, "B5221"); + of_node_put(phy); + if (!need_reset) + return 0; + + /* PHY reset is GPIO 29, not in device-tree unfortunately */ + MACIO_OUT8(K2_GPIO_EXTINT_0 + 29, + KEYLARGO_GPIO_OUTPUT_ENABLE | KEYLARGO_GPIO_OUTOUT_DATA); + /* Thankfully, this is now always called at a time when we can + * schedule by sungem. 
+ */ + msleep(10); + MACIO_OUT8(K2_GPIO_EXTINT_0 + 29, 0); + + return 0; +} + #ifdef CONFIG_SMP static long __pmac g5_reset_cpu(struct device_node* node, long param, long value) { @@ -306,6 +336,7 @@ { PMAC_FTR_ENABLE_MPIC, g5_mpic_enable }, { PMAC_FTR_READ_GPIO, g5_read_gpio }, { PMAC_FTR_WRITE_GPIO, g5_write_gpio }, + { PMAC_FTR_GMAC_PHY_RESET, g5_eth_phy_reset }, #ifdef CONFIG_SMP { PMAC_FTR_RESET_CPU, g5_reset_cpu }, #endif /* CONFIG_SMP */ From maneesh at in.ibm.com Fri Mar 25 01:20:33 2005 From: maneesh at in.ibm.com (Maneesh Soni) Date: Thu, 24 Mar 2005 19:50:33 +0530 Subject: [PATCH] ppc64: Add mem=X option, updated NUMA support In-Reply-To: <200503232311.11113.michael@ellerman.id.au> References: <200503232311.11113.michael@ellerman.id.au> Message-ID: <20050324142033.GA11276@in.ibm.com> On Wed, Mar 23, 2005 at 11:11:10PM +1100, Michael Ellerman wrote: > Hi Mike, > > Here's an updated version of my mem=X patch with new NUMA code. > Sorry it took so long I've been a bit crook. > > Can you test this on your 720 or whatever it was? And if anyone else > has an interesting NUMA machine they can test it on I'd love to hear > about it! > > This also includes a fix for the bug Maneesh found last week with > GCC 3.3 generating a switch table in prom_init.c, which doesn't > work. > I tried the patch and it works fine in my setup also. Thanks Maneesh -- Maneesh Soni Linux Technology Center, IBM India Software Labs, Bangalore, India email: maneesh at in.ibm.com Phone: 91-80-25044990 From olof at austin.ibm.com Fri Mar 25 04:07:09 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Thu, 24 Mar 2005 11:07:09 -0600 Subject: [PATCH] PPC64: Fix LPAR IOMMU setup code for p630 Message-ID: <20050324170709.GA32597@austin.ibm.com> Hi, Here's a fix to deal with p630 systems in LPAR mode. They're to date the only system that in some cases might lack a dma-window property for the bus, but contain an overriding property in the device node for the specific adapter/slot. 
This makes the device setup code a bit more complex since it needs to do some of the things that the bus setup code has already done. Signed-off-by: Olof Johansson Index: 2.6/arch/ppc64/kernel/pSeries_iommu.c =================================================================== --- 2.6.orig/arch/ppc64/kernel/pSeries_iommu.c 2005-03-23 13:06:34.000000000 -0600 +++ 2.6/arch/ppc64/kernel/pSeries_iommu.c 2005-03-23 13:08:50.000000000 -0600 @@ -401,6 +401,8 @@ static void iommu_bus_setup_pSeriesLP(st struct device_node *dn, *pdn; unsigned int *dma_window = NULL; + DBG("iommu_bus_setup_pSeriesLP, bus %p, bus->self %p\n", bus, bus->self); + dn = pci_bus_to_OF_node(bus); /* Find nearest ibm,dma-window, walking up the device tree */ @@ -455,6 +457,56 @@ static void iommu_dev_setup_pSeries(stru } } +static void iommu_dev_setup_pSeriesLP(struct pci_dev *dev) +{ + struct device_node *pdn, *dn; + struct iommu_table *tbl; + int *dma_window = NULL; + + DBG("iommu_dev_setup_pSeriesLP, dev %p (%s)\n", dev, dev->pretty_name); + + /* dev setup for LPAR is a little tricky, since the device tree might + * contain the dma-window properties per-device and not neccesarily + * for the bus. So we need to search upwards in the tree until we + * either hit a dma-window property, OR find a parent with a table + * already allocated. + */ + dn = pci_device_to_OF_node(dev); + + for (pdn = dn; pdn && !pdn->iommu_table; pdn = pdn->parent) { + dma_window = (unsigned int *)get_property(pdn, "ibm,dma-window", NULL); + if (dma_window) + break; + } + + /* Check for parent == NULL so we don't try to setup the empty EADS + * slots on POWER4 machines. + */ + if (dma_window == NULL || pdn->parent == NULL) { + /* Fall back to regular (non-LPAR) dev setup */ + DBG("No dma window for device, falling back to regular setup\n"); + iommu_dev_setup_pSeries(dev); + return; + } else { + DBG("Found DMA window, allocating table\n"); + } + + if (!pdn->iommu_table) { + /* iommu_table_setparms_lpar needs bussubno. 
*/ + pdn->bussubno = pdn->phb->bus->number; + + tbl = (struct iommu_table *)kmalloc(sizeof(struct iommu_table), + GFP_KERNEL); + + iommu_table_setparms_lpar(pdn->phb, pdn, tbl, dma_window); + + pdn->iommu_table = iommu_init_table(tbl); + } + + if (pdn != dn) + dn->iommu_table = pdn->iommu_table; +} + static void iommu_bus_setup_null(struct pci_bus *b) { } static void iommu_dev_setup_null(struct pci_dev *d) { } @@ -479,13 +531,14 @@ void iommu_init_early_pSeries(void) ppc_md.tce_free = tce_free_pSeriesLP; } ppc_md.iommu_bus_setup = iommu_bus_setup_pSeriesLP; + ppc_md.iommu_dev_setup = iommu_dev_setup_pSeriesLP; } else { ppc_md.tce_build = tce_build_pSeries; ppc_md.tce_free = tce_free_pSeries; ppc_md.iommu_bus_setup = iommu_bus_setup_pSeries; + ppc_md.iommu_dev_setup = iommu_dev_setup_pSeries; } - ppc_md.iommu_dev_setup = iommu_dev_setup_pSeries; pci_iommu_init(); } From jschopp at austin.ibm.com Fri Mar 25 10:04:49 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Thu, 24 Mar 2005 17:04:49 -0600 Subject: [PATCH] contig_page_data Message-ID: <42434791.2020707@austin.ibm.com> Trivial patch to remove our last direct reference to contig_page_data. This will make it that much easier to separate NUMA and DISCONTIG. Please forward on. Against 2.6.12-rc1 Signed-off-by: Joel Schopp -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: contig_page_data2.6.12-rc1.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050324/1e56fffe/attachment.txt From apw at us.ibm.com Fri Mar 25 11:54:38 2005 From: apw at us.ibm.com (Amos Waterland) Date: Thu, 24 Mar 2005 19:54:38 -0500 Subject: [patch] fix ppc64 zilog link error Message-ID: <20050325005438.GA6567@kvasir.austin.ibm.com> Simple fix so that this will not happen: LD .tmp_vmlinux1 drivers/built-in.o(.text+0x2c640): In function `pmz_attach': /home/apw/devel/percs-head/linux-2.6/drivers/serial/pmac_zilog.c:1560: undefined reference to `.macio_request_resources' drivers/built-in.o(.text+0x2c6f0): In function `pmz_detach': /home/apw/devel/percs-head/linux-2.6/drivers/serial/pmac_zilog.c:1583: undefined reference to `.macio_release_resources' drivers/built-in.o(.init.text+0x6938): In function `init_pmz': /home/apw/devel/percs-head/linux-2.6/drivers/serial/pmac_zilog.c:1916: undefined reference to `.macio_register_driver' drivers/built-in.o(.exit.text+0x2c): In function `exit_pmz': /home/apw/devel/percs-head/linux-2.6/drivers/serial/pmac_zilog.c:1924: undefined reference to `.macio_unregister_driver' Signed-off-by: Amos Waterland ===== drivers/serial/Kconfig 1.50 vs edited ===== --- 1.50/drivers/serial/Kconfig 2005-01-31 01:33:45 -05:00 +++ edited/drivers/serial/Kconfig 2005-03-24 17:44:56 -05:00 @@ -615,7 +615,7 @@ config SERIAL_PMACZILOG tristate "PowerMac z85c30 ESCC support" - depends on PPC_OF + depends on PPC_OF && PPC_PMAC select SERIAL_CORE help This driver supports the Zilog z85C30 serial ports found on From michael at ellerman.id.au Fri Mar 25 15:52:03 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 25 Mar 2005 15:52:03 +1100 Subject: [PATCH] ppc64: numa: Remove redundant and broken overlap check Message-ID: <200503251552.03931.michael@ellerman.id.au> Hi Andrew, The numa code used to have to handle the fact that memory regions (as reported by OF) had been coalesced in the lmb struct and so might overlap
node boundaries. Since Mike's patch went in this doesn't happen anymore, because we iterate over the memory regions from OF directly. This patch simply removes a check that catered for the overlapping case, which now "can't happen". The condition also happens to be broken in the current code, but in a painless way, so it's as good as removed already. I've boot tested this on pSeries. cheers Signed-off-by: Michael Ellerman Index: 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c =================================================================== --- 2.6.12-rc1-mem-limit.orig/arch/ppc64/mm/numa.c 2005-03-25 02:50:21.000000000 +1100 +++ 2.6.12-rc1-mem-limit/arch/ppc64/mm/numa.c 2005-03-25 03:51:40.000000000 +1100 @@ -614,13 +614,8 @@ if (numa_domain != nid) continue; - if (mem_start < end_paddr && - (mem_start+mem_size) > start_paddr) { - /* should be no overlaps ! */ - dbg("free_bootmem %lx %lx\n", mem_start, mem_size); - free_bootmem_node(NODE_DATA(nid), mem_start, - mem_size); - } + dbg("free_bootmem %lx %lx\n", mem_start, mem_size); + free_bootmem_node(NODE_DATA(nid), mem_start, mem_size); if (--ranges) /* process all ranges in cell */ goto new_range; From neg at brooktrout.com Sat Mar 26 09:50:14 2005 From: neg at brooktrout.com (Nathan Glasser) Date: Fri, 25 Mar 2005 22:50:14 -0000 Subject: DMA memory Message-ID: Hello, It was suggested that I send a message to this mailing list to attempt to get help with a problem I'm seeing. I'm using ppc64 (p630), kernel 2.4.x (RH 3.0 patch x). I'm working on a proprietary driver for a proprietary device. The device needs to access some host memory in order to perform a DMA transfer. It can only access 32-bits. I'm allocating memory using pci_alloc_consistent. I'm passing the "dma handle" to the device in the place where the bus address would usually go (I formerly used virt_to_bus for x86). 
It seems that after the device performs the DMA, any further access to MMIO board registers results in a system crash (such accesses work fine prior to the device DMA). Here is the panic message on the serial console. RTAS: 2 --------- RTAS event begin RTAS 0: 00000000 00000000 RTAS: 2 --------- RTAS event end Kernel panic: EEH: MMIO failure (2) on device:pci12e4,1000 /pci at 400000000111/pci at 2,6/pci12e4,1000 at 1 It was suggested to me that the DMA was to a bad address, and that this caused the device to be isolated. I didn't know the system could do that, but it makes sense to me. The amount allocated is PAGE_SIZE. The DMA size is normally 1K. The transfer start address can be at any multiple of 1K within the PAGE_SIZE-sized region. I did this 4 times, and got these addresses. cpu_addr: C0000001E0801000 dma handle: 40000000 cpu_addr: C0000001E0800000 dma handle: 40001000 cpu_addr: C0000001DB25F000 dma handle: 40002000 cpu_addr: C0000001DB25E000 dma handle: 40003000 Assuming you trust that these dma handle values are properly being sent to the device, does anyone have any suggestions for what I'm doing wrong? Is there some further translation I need to do of the dma handle before passing it to the device? Other drivers I looked at didn't seem to. Thanks, Nathan
From agl at us.ibm.com Tue Mar 29 03:22:31 2005 From: agl at us.ibm.com (Adam Litke) Date: Mon, 28 Mar 2005 11:22:31 -0600 Subject: Hugepage COW In-Reply-To: <20050317034844.GD14048@localhost.localdomain> References: <1109085505.5217.28.camel@localhost.localdomain> <20050223070322.GF24473@localhost.localdomain> <1111006896.3635.24.camel@localhost.localdomain> <20050317034844.GD14048@localhost.localdomain> Message-ID: <1112030551.12614.67.camel@localhost.localdomain> Just a trivial compile fix. Some architectures (i386) need asm/tlbflush.h to be included by the arch independant hugetlb header. Compile tested on ppc64 and i386. diff -purN 2.6.11-consolidate/include/linux/hugetlb.h 2.6.11-consolidate+compile/include/linux/hugetlb.h --- 2.6.11-consolidate/include/linux/hugetlb.h 2005-03-28 09:07:23.000000000 -0800 +++ 2.6.11-consolidate+compile/include/linux/hugetlb.h 2005-03-28 09:10:07.000000000 -0800 @@ -4,6 +4,7 @@ #ifdef CONFIG_HUGETLB_PAGE #include +#include struct ctl_table; -- Adam Litke - (agl at us.ibm.com) IBM Linux Technology Center From haveblue at us.ibm.com Tue Mar 29 03:41:55 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Mon, 28 Mar 2005 09:41:55 -0800 Subject: Hugepage COW In-Reply-To: <1112030551.12614.67.camel@localhost.localdomain> References: <1109085505.5217.28.camel@localhost.localdomain> <20050223070322.GF24473@localhost.localdomain> <1111006896.3635.24.camel@localhost.localdomain> <20050317034844.GD14048@localhost.localdomain> <1112030551.12614.67.camel@localhost.localdomain> Message-ID: <1112031716.2087.15.camel@localhost> On Mon, 2005-03-28 at 11:22 -0600, Adam Litke wrote: > Just a trivial compile fix. Some architectures (i386) need > asm/tlbflush.h to be included by the arch independant hugetlb header. Why? Which functions/variables/declarations are needed?
-- Dave From moilanen at austin.ibm.com Tue Mar 29 03:53:28 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Mon, 28 Mar 2005 11:53:28 -0600 Subject: DMA memory In-Reply-To: References: Message-ID: <20050328115328.79a3abf7.moilanen@austin.ibm.com> > RTAS: 2 --------- RTAS event begin > RTAS 0: 00000000 00000000 > RTAS: 2 --------- RTAS event end > Kernel panic: EEH: MMIO failure (2) on device:pci12e4,1000 /pci at 400000000111/pci at 2,6/pci12e4,1000 at 1 > > It was suggested to me that the DMA was to a bad address, and that this > caused the device to be isolated. I didn't know the system could do that, > but it makes sense to me. That's what it looks like. pSeries does isolation when a bad DMA is attempted through EEH. > Assuming you trust that these dma handle values are properly being sent to the > device, does anyone have any suggestions for what I'm doing wrong? Is there > some further translation I need to do of the dma handle before passing it to > the device? Other drivers I looked at didn't seem to. From what I can see on your setup, it looks good. Here's a couple things to check: - Passing the correct endianness on your DMA handle to the adapter. - If you had a real old version of firmware on the 630, there was a problem w/ PCI-2-PCI bridges that we hit on a 2.4 kernel (I can't remember the details). Turn on Linux Compatibility Mode from the Service Processor Menus. Here's a link: http://publib16.boulder.ibm.com/pseries/en_US/infocenter/base/hardware_docs/pdf/380606.pdf Thanks, Jake From neg at brooktrout.com Tue Mar 29 04:35:52 2005 From: neg at brooktrout.com (Nathan Glasser) Date: Mon, 28 Mar 2005 18:35:52 -0000 Subject: DMA memory In-Reply-To: Your message of Mon, 28 Mar 2005 11:53:28 -0600 Message-ID: >- Passing the correct endianness on your DMA handle to the adapter. I'm pretty sure I am. However, as a test I did one time deliberately do an extra reversal, just in case I was wrong.
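For reference, the byte-order question raised above can be illustrated outside the kernel. This is a minimal sketch, assuming a device that expects its 32-bit bus address in little-endian byte order while the ppc64 host is big-endian; the thread does not say which order this particular device wants. Both `swab32_sketch` and `handle_to_le32_bytes` are hypothetical helpers for illustration, not kernel API (a driver would normally use the kernel's own cpu_to_le32()).

```c
#include <stdint.h>

/* Hypothetical helper: unconditionally reverse the four bytes of v. */
uint32_t swab32_sketch(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) |
	       ((v & 0x0000ff00u) << 8)  |
	       ((v & 0x00ff0000u) >> 8)  |
	       ((v & 0xff000000u) >> 24);
}

/*
 * Hypothetical helper: lay out a DMA handle as little-endian bytes.
 * Because it works on values rather than on memory layout, the result
 * is the same on a big-endian or a little-endian host.
 */
void handle_to_le32_bytes(uint32_t handle, uint8_t out[4])
{
	out[0] = (uint8_t)(handle & 0xffu);
	out[1] = (uint8_t)((handle >> 8) & 0xffu);
	out[2] = (uint8_t)((handle >> 16) & 0xffu);
	out[3] = (uint8_t)((handle >> 24) & 0xffu);
}
```

An extra reversal, as tried above, simply undoes the first one: swab32_sketch(swab32_sketch(x)) == x for any x.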
>- If you had a real old version of firmware on the 630, there was a >problem w/ PCI-2-PCI bridges that we hit on a 2.4 kernel (I can't >remember the details). Turn on Linux Compatibility Mode from the >Service Processor Menus. Here's a link: I'll look into this. Thanks. - Nathan From neg at brooktrout.com Tue Mar 29 05:44:44 2005 From: neg at brooktrout.com (Nathan Glasser) Date: Mon, 28 Mar 2005 19:44:44 -0000 Subject: DMA memory In-Reply-To: Your message of Mon, 28 Mar 2005 11:53:28 -0600 Message-ID: It turns out that the P630 system is not running in LPAR mode, so the Linux compatibility mode is not available. Also, the firmware is pretty new. It is 3R41021 from 8/2004, and apparently the latest is 3R41029 from 12/2004. We are going to update it to the latest, just in case. Note also the fact that we were using RedHat, and not SUSE (which the document references). So likely this is not related to the DMA problem I'm seeing. Thanks, Nathan From neg at brooktrout.com Tue Mar 29 07:59:08 2005 From: neg at brooktrout.com (Nathan Glasser) Date: Mon, 28 Mar 2005 21:59:08 -0000 Subject: DMA memory In-Reply-To: Your message of Mon, 28 Mar 2005 11:53:28 -0600 Message-ID: Finally, the last thing to report is that after getting the system's firmware upgraded (to 3R041029), I tried a test again, and the system behaved in the usual way. - nathan From benh at kernel.crashing.org Tue Mar 29 15:29:27 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 29 Mar 2005 15:29:27 +1000 Subject: DMA memory In-Reply-To: References: Message-ID: <1112074167.12300.23.camel@gaston> On Mon, 2005-03-28 at 14:39 -0500, Nathan Glasser wrote: > It turns out that the P630 system is not running in LPAR mode, so > the Linux compatibility mode is not available. Ugh ? Linux should work in non-LPAR mode too ... Ben. 
From benh at kernel.crashing.org Tue Mar 29 15:31:12 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 29 Mar 2005 15:31:12 +1000 Subject: DMA memory In-Reply-To: <20050328115328.79a3abf7.moilanen@austin.ibm.com> References: <20050328115328.79a3abf7.moilanen@austin.ibm.com> Message-ID: <1112074272.12301.26.camel@gaston> > - If you had a real old version of firmware on the 630, there was a > problem w/ PCI-2-PCI bridges that we hit on a 2.4 kernel (I can't > remember the details). Turn on Linux Compatibility Mode from the > Service Processor Menus. Here's a link: > > http://publib16.boulder.ibm.com/pseries/en_US/infocenter/base/hardware_docs/pdf/380606.pdf Do you have some tech. data about the problem ? Ben. From benh at kernel.crashing.org Tue Mar 29 15:37:05 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 29 Mar 2005 15:37:05 +1000 Subject: DMA memory In-Reply-To: <1112074167.12300.23.camel@gaston> References: <1112074167.12300.23.camel@gaston> Message-ID: <1112074625.12301.31.camel@gaston> On Tue, 2005-03-29 at 15:29 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2005-03-28 at 14:39 -0500, Nathan Glasser wrote: > > It turns out that the P630 system is not running in LPAR mode, so > > the Linux compatibility mode is not available. > > Ugh ? Linux should work in non-LPAR mode too ... Ok, I saw Jake's message, I wonder what this stuff is though... Ben. From michael at ellerman.id.au Tue Mar 29 18:02:42 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 29 Mar 2005 18:02:42 +1000 Subject: [PATCH] ppc64: Add mem=X boot command line option Message-ID: <200503291802.43261.michael@ellerman.id.au> Hi Andrew, This patch adds the mem=X boot command line option for PPC64. On iSeries the user's mem=X value is aligned to PAGE_SIZE; on pSeries we align to 16 MB, which is the size of a large page.
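The rounding described above can be sketched in plain C. This is a minimal illustration, assuming a 4K base page size (the `SKETCH_PAGE_SIZE` value is an assumption, as are all names here) and using the same round-up arithmetic as the kernel's ALIGN() macro; it is not the patch code itself.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE   0x1000UL     /* assumption: 4K base pages */
#define SKETCH_LARGE_PAGE  0x1000000UL  /* 16 MB large page */

/*
 * Round x up to the next multiple of a, where a must be a power of
 * two. A value that is already aligned is returned unchanged.
 */
unsigned long align_up(unsigned long x, unsigned long a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}
```

With this rounding, a request of mem=500M on pSeries would become 512 MB, since 500 MB is not a multiple of 16 MB, while mem=512M would be left as is.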
The iSeries implementation is fairly straightforward, we declare mem=X as an early_param() and then in iSeries_init_early() we modify the systemcfg->physicalMemorySize based on that value. On pSeries the mem=X option is parsed in prom_init.c before the kernel proper starts, and is used to modify prom_init_mem()'s idea of memory. The mem=X value and computed tce_alloc_start/end values are saved by prom_init() into the device tree for later use by the kernel. The device tree properties are read by the kernel in early_dt_scan_chosen(), and used to modify the lmb structure in early_init_devtree(). That's the guts of it. On non-LPAR machines the tce_alloc_start/end values are read from the device tree and used in htab_initialize() to make sure the TCE table is mapped at the real top of RAM. If NUMA is enabled we also have to make changes in parse_numa_properties() and do_init_bootmem() to exclude memory regions above the memory limit, and truncate any region which straddles the limit. NB. This patch does not facilitate using mem=X to give drivers access to large regions of contiguous memory. Signed-off-by: Michael Ellerman --- This is almost identical to the patch I posted last week, with one minor change to make boot messages look the same on iSeries & pSeries, and a few little white space and comment tweaks. It has been tested on iSeries, pSeries, pSeries LPAR, G5 and Maple. Thanks to BenH, Anton, Olof, Stephen, Mike & Maneesh for their help. cheers!
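The step where the lmb structure is modified to honour the limit can be sketched in userspace. This is a minimal sketch under stated assumptions: `struct region` and `enforce_limit` are hypothetical stand-ins for the lmb region list, not the kernel's types, and the limit is treated as a total amount of memory to keep rather than as a highest address.

```c
#include <stddef.h>

/* Hypothetical stand-in for one entry of a memory region list. */
struct region {
	unsigned long base;
	unsigned long size;
};

/*
 * Keep only the first 'limit' bytes of memory, counted across the
 * regions in order. The region that crosses the limit is shortened
 * and every later region is dropped. Returns the new region count.
 * A limit of 0 means "no limit" and leaves the list untouched.
 */
size_t enforce_limit(struct region *r, size_t cnt, unsigned long limit)
{
	size_t i;

	if (limit == 0)
		return cnt;

	for (i = 0; i < cnt; i++) {
		if (limit > r[i].size) {
			/* This region fits entirely; consume it. */
			limit -= r[i].size;
			continue;
		}
		/* This region reaches the limit; truncate and stop. */
		r[i].size = limit;
		return i + 1;
	}
	return cnt;	/* limit exceeds total memory: nothing to do */
}
```

Because sizes are summed rather than addresses compared, holes between regions do not count against the limit in this sketch.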
arch/ppc64/kernel/iSeries_setup.c | 40 +++++++---- arch/ppc64/kernel/lmb.c | 26 +++++++ arch/ppc64/kernel/prom.c | 15 ++++ arch/ppc64/kernel/prom_init.c | 137 +++++++++++++++++++++++++++++++++++--- arch/ppc64/kernel/setup.c | 20 ++++- arch/ppc64/mm/hash_utils.c | 23 +++++- arch/ppc64/mm/numa.c | 46 +++++++++++- include/asm-ppc64/lmb.h | 1 8 files changed, 276 insertions(+), 32 deletions(-) Index: bk-current/arch/ppc64/kernel/setup.c =================================================================== --- bk-current.orig/arch/ppc64/kernel/setup.c +++ bk-current/arch/ppc64/kernel/setup.c @@ -636,12 +636,11 @@ void __init setup_system(void) early_console_initialized = 1; register_console(&udbg_console); -#endif /* !CONFIG_PPC_ISERIES */ - /* Save unparsed command line copy for /proc/cmdline */ strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); parse_early_param(); +#endif /* !CONFIG_PPC_ISERIES */ #if defined(CONFIG_SMP) && !defined(CONFIG_PPC_ISERIES) /* @@ -800,20 +799,31 @@ struct seq_operations cpuinfo_op = { .show = show_cpuinfo, }; -#if 0 /* XXX not currently used */ +/* + * These three variables are used to save values passed to us by prom_init() + * via the device tree. The TCE variables are needed because with a memory_limit + * in force we may need to explicitly map the TCE are at the top of RAM. + */ unsigned long memory_limit; +unsigned long tce_alloc_start; +unsigned long tce_alloc_end; +#ifdef CONFIG_PPC_ISERIES +/* + * On iSeries we just parse the mem=X option from the command line. 
+ * On pSeries it's a bit more complicated, see prom_init_mem() + */ static int __init early_parsemem(char *p) { if (!p) return 0; - memory_limit = memparse(p, &p); + memory_limit = ALIGN(memparse(p, &p), PAGE_SIZE); return 0; } early_param("mem", early_parsemem); -#endif +#endif /* CONFIG_PPC_ISERIES */ #ifdef CONFIG_PPC_MULTIPLATFORM static int __init set_preferred_console(void) Index: bk-current/arch/ppc64/kernel/lmb.c =================================================================== --- bk-current.orig/arch/ppc64/kernel/lmb.c +++ bk-current/arch/ppc64/kernel/lmb.c @@ -344,3 +344,29 @@ lmb_abs_to_phys(unsigned long aa) return pa; } + +/* + * Truncate the lmb list to memory_limit if it's set + * You must call lmb_analyze() after this. + */ +void __init lmb_enforce_memory_limit(void) +{ + extern unsigned long memory_limit; + unsigned long i, limit; + struct lmb_region *mem = &(lmb.memory); + + if (! memory_limit) + return; + + limit = memory_limit; + for (i = 0; i < mem->cnt; i++) { + if (limit > mem->region[i].size) { + limit -= mem->region[i].size; + continue; + } + + mem->region[i].size = limit; + mem->cnt = i + 1; + break; + } +} Index: bk-current/include/asm-ppc64/lmb.h =================================================================== --- bk-current.orig/include/asm-ppc64/lmb.h +++ bk-current/include/asm-ppc64/lmb.h @@ -51,6 +51,7 @@ extern unsigned long __init lmb_alloc_ba extern unsigned long __init lmb_phys_mem_size(void); extern unsigned long __init lmb_end_of_DRAM(void); extern unsigned long __init lmb_abs_to_phys(unsigned long); +extern void __init lmb_enforce_memory_limit(void); extern void lmb_dump_all(void); Index: bk-current/arch/ppc64/kernel/iSeries_setup.c =================================================================== --- bk-current.orig/arch/ppc64/kernel/iSeries_setup.c +++ bk-current/arch/ppc64/kernel/iSeries_setup.c @@ -285,7 +285,7 @@ static unsigned long iSeries_process_mai return mem_blocks; } -static void __init 
iSeries_parse_cmdline(void) +static void __init iSeries_get_cmdline(void) { char *p, *q; @@ -305,6 +305,8 @@ static void __init iSeries_parse_cmdline static void __init iSeries_init_early(void) { + extern unsigned long memory_limit; + DBG(" -> iSeries_init_early()\n"); ppcdbg_initialize(); @@ -352,6 +354,31 @@ static void __init iSeries_init_early(vo */ build_iSeries_Memory_Map(); + iSeries_get_cmdline(); + + /* Save unparsed command line copy for /proc/cmdline */ + strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); + + /* Parse early parameters, in particular mem=x */ + parse_early_param(); + + if (memory_limit) { + if (memory_limit < systemcfg->physicalMemorySize) + systemcfg->physicalMemorySize = memory_limit; + else { + printk("Ignoring mem=%lu >= ram_top.\n", memory_limit); + memory_limit = 0; + } + } + + /* Bolt kernel mappings for all of memory (or just a bit if we've got a limit) */ + iSeries_bolt_kernel(0, systemcfg->physicalMemorySize); + + lmb_init(); + lmb_add(0, systemcfg->physicalMemorySize); + lmb_analyze(); + lmb_reserve(0, __pa(klimit)); + /* Initialize machine-dependency vectors */ #ifdef CONFIG_SMP smp_init_iSeries(); @@ -377,9 +404,6 @@ static void __init iSeries_init_early(vo initrd_start = initrd_end = 0; #endif /* CONFIG_BLK_DEV_INITRD */ - - iSeries_parse_cmdline(); - DBG(" <- iSeries_init_early()\n"); } @@ -540,14 +564,6 @@ static void __init build_iSeries_Memory_ * nextPhysChunk */ systemcfg->physicalMemorySize = chunk_to_addr(nextPhysChunk); - - /* Bolt kernel mappings for all of memory */ - iSeries_bolt_kernel(0, systemcfg->physicalMemorySize); - - lmb_init(); - lmb_add(0, systemcfg->physicalMemorySize); - lmb_analyze(); /* ?? 
*/ - lmb_reserve(0, __pa(klimit)); } /* Index: bk-current/arch/ppc64/kernel/prom.c =================================================================== --- bk-current.orig/arch/ppc64/kernel/prom.c +++ bk-current/arch/ppc64/kernel/prom.c @@ -912,6 +912,8 @@ static int __init early_init_dt_scan_cho const char *full_path, void *data) { u32 *prop; + u64 *prop64; + extern unsigned long memory_limit, tce_alloc_start, tce_alloc_end; if (strcmp(full_path, "/chosen") != 0) return 0; @@ -928,6 +930,18 @@ static int __init early_init_dt_scan_cho if (get_flat_dt_prop(node, "linux,iommu-force-on", NULL) != NULL) iommu_force_on = 1; + prop64 = (u64*)get_flat_dt_prop(node, "linux,memory-limit", NULL); + if (prop64) + memory_limit = *prop64; + + prop64 = (u64*)get_flat_dt_prop(node, "linux,tce-alloc-start", NULL); + if (prop64) + tce_alloc_start = *prop64; + + prop64 = (u64*)get_flat_dt_prop(node, "linux,tce-alloc-end", NULL); + if (prop64) + tce_alloc_end = *prop64; + #ifdef CONFIG_PPC_RTAS /* To help early debugging via the front panel, we retreive a minimal * set of RTAS infos now if available @@ -1067,6 +1081,7 @@ void __init early_init_devtree(void *par lmb_init(); scan_flat_dt(early_init_dt_scan_root, NULL); scan_flat_dt(early_init_dt_scan_memory, NULL); + lmb_enforce_memory_limit(); lmb_analyze(); systemcfg->physicalMemorySize = lmb_phys_mem_size(); lmb_reserve(0, __pa(klimit)); Index: bk-current/arch/ppc64/mm/hash_utils.c =================================================================== --- bk-current.orig/arch/ppc64/mm/hash_utils.c +++ bk-current/arch/ppc64/mm/hash_utils.c @@ -149,6 +149,8 @@ void __init htab_initialize(void) unsigned long pteg_count; unsigned long mode_rw; int i, use_largepages = 0; + unsigned long base = 0, size = 0; + extern unsigned long tce_alloc_start, tce_alloc_end; DBG(" -> htab_initialize()\n"); @@ -204,8 +206,6 @@ void __init htab_initialize(void) /* create bolted the linear mapping in the hash table */ for (i=0; i < lmb.memory.cnt; i++) { - 
unsigned long base, size; - base = lmb.memory.region[i].physbase + KERNELBASE; size = lmb.memory.region[i].size; @@ -234,6 +234,25 @@ void __init htab_initialize(void) #endif /* CONFIG_U3_DART */ create_pte_mapping(base, base + size, mode_rw, use_largepages); } + + /* + * If we have a memory_limit and we've allocated TCEs then we need to + * explicitly map the TCE area at the top of RAM. We also cope with the + * case that the TCEs start below memory_limit. + * tce_alloc_start/end are 16MB aligned so the mapping should work + * for either 4K or 16MB pages. + */ + if (tce_alloc_start) { + tce_alloc_start += KERNELBASE; + tce_alloc_end += KERNELBASE; + + if (base + size >= tce_alloc_start) + tce_alloc_start = base + size + 1; + + create_pte_mapping(tce_alloc_start, tce_alloc_end, + mode_rw, use_largepages); + } + DBG(" <- htab_initialize()\n"); } #undef KB Index: bk-current/arch/ppc64/mm/numa.c =================================================================== --- bk-current.orig/arch/ppc64/mm/numa.c +++ bk-current/arch/ppc64/mm/numa.c @@ -285,6 +285,35 @@ static int cpu_numa_callback(struct noti return ret; } +/* + * Check and possibly modify a memory region to enforce the memory limit. + * + * Returns the size the region should have to enforce the memory limit. + * This will either be the original value of size, a truncated value, + * or zero. If the returned value of size is 0 the region should be + * discarded as it lies wholy above the memory limit. + */ +static unsigned long __init numa_enforce_memory_limit(unsigned long start, unsigned long size) +{ + /* + * We use lmb_end_of_DRAM() in here instead of memory_limit because + * we've already adjusted it for the limit and it takes care of + * having memory holes below the limit. + */ + extern unsigned long memory_limit; + + if (! 
memory_limit) + return size; + + if (start + size <= lmb_end_of_DRAM()) + return size; + + if (start >= lmb_end_of_DRAM()) + return 0; + + return lmb_end_of_DRAM() - start; +} + static int __init parse_numa_properties(void) { struct device_node *cpu = NULL; @@ -373,6 +402,13 @@ new_range: if (max_domain < numa_domain) max_domain = numa_domain; + if (! (size = numa_enforce_memory_limit(start, size))) { + if (--ranges) + goto new_range; + else + continue; + } + /* * Initialize new node struct, or add to an existing one. */ @@ -405,8 +441,7 @@ new_range: numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = numa_domain; - ranges--; - if (ranges) + if (--ranges) goto new_range; } @@ -614,8 +649,11 @@ new_range: if (numa_domain != nid) continue; - dbg("free_bootmem %lx %lx\n", mem_start, mem_size); - free_bootmem_node(NODE_DATA(nid), mem_start, mem_size); + mem_size = numa_enforce_memory_limit(mem_start, mem_size); + if (mem_size) { + dbg("free_bootmem %lx %lx\n", mem_start, mem_size); + free_bootmem_node(NODE_DATA(nid), mem_start, mem_size); + } if (--ranges) /* process all ranges in cell */ goto new_range; Index: bk-current/arch/ppc64/kernel/prom_init.c =================================================================== --- bk-current.orig/arch/ppc64/kernel/prom_init.c +++ bk-current/arch/ppc64/kernel/prom_init.c @@ -177,6 +177,10 @@ static int __initdata of_platform; static char __initdata prom_cmd_line[COMMAND_LINE_SIZE]; +static unsigned long __initdata prom_memory_limit; +static unsigned long __initdata prom_tce_alloc_start; +static unsigned long __initdata prom_tce_alloc_end; + static unsigned long __initdata alloc_top; static unsigned long __initdata alloc_top_high; static unsigned long __initdata alloc_bottom; @@ -384,10 +388,70 @@ static int __init prom_setprop(phandle n (u32)(unsigned long) value, (u32) valuelen); } +/* We can't use the standard versions because of RELOC headaches. 
*/ +#define isxdigit(c) (('0' <= (c) && (c) <= '9') \ + || ('a' <= (c) && (c) <= 'f') \ + || ('A' <= (c) && (c) <= 'F')) + +#define isdigit(c) ('0' <= (c) && (c) <= '9') +#define islower(c) ('a' <= (c) && (c) <= 'z') +#define toupper(c) (islower(c) ? ((c) - 'a' + 'A') : (c)) + +unsigned long prom_strtoul(const char *cp, const char **endp) +{ + unsigned long result = 0, base = 10, value; + + if (*cp == '0') { + base = 8; + cp++; + if (toupper(*cp) == 'X') { + cp++; + base = 16; + } + } + + while (isxdigit(*cp) && + (value = isdigit(*cp) ? *cp - '0' : toupper(*cp) - 'A' + 10) < base) { + result = result * base + value; + cp++; + } + + if (endp) + *endp = cp; + + return result; +} + +unsigned long prom_memparse(const char *ptr, const char **retptr) +{ + unsigned long ret = prom_strtoul(ptr, retptr); + int shift = 0; + + /* + * We can't use a switch here because GCC *may* generate a + * jump table which won't work, because we're not running at + * the address we're linked at. + */ + if ('G' == **retptr || 'g' == **retptr) + shift = 30; + + if ('M' == **retptr || 'm' == **retptr) + shift = 20; + + if ('K' == **retptr || 'k' == **retptr) + shift = 10; + + if (shift) { + ret <<= shift; + (*retptr)++; + } + + return ret; +} /* * Early parsing of the command line passed to the kernel, used for - * the options that affect the iommu + * "mem=x" and the options that affect the iommu */ static void __init early_cmdline_parse(void) { @@ -418,6 +482,14 @@ static void __init early_cmdline_parse(v else if (!strncmp(opt, RELOC("force"), 5)) RELOC(iommu_force_on) = 1; } + + opt = strstr(RELOC(prom_cmd_line), RELOC("mem=")); + if (opt) { + opt += 4; + RELOC(prom_memory_limit) = prom_memparse(opt, (const char **)&opt); + /* Align to 16 MB == size of large page */ + RELOC(prom_memory_limit) = ALIGN(RELOC(prom_memory_limit), 0x1000000); + } } /* @@ -664,15 +736,7 @@ static void __init prom_init_mem(void) } } - /* Setup our top/bottom alloc points, that is top of RMO or top of - * segment 
0 when running non-LPAR - */ - if ( RELOC(of_platform) == PLATFORM_PSERIES_LPAR ) - RELOC(alloc_top) = RELOC(rmo_top); - else - RELOC(alloc_top) = RELOC(rmo_top) = min(0x40000000ul, RELOC(ram_top)); RELOC(alloc_bottom) = PAGE_ALIGN(RELOC(klimit) - offset + 0x4000); - RELOC(alloc_top_high) = RELOC(ram_top); /* Check if we have an initrd after the kernel, if we do move our bottom * point to after it @@ -681,8 +745,41 @@ static void __init prom_init_mem(void) if (RELOC(prom_initrd_end) > RELOC(alloc_bottom)) RELOC(alloc_bottom) = PAGE_ALIGN(RELOC(prom_initrd_end)); } + + /* + * If prom_memory_limit is set we reduce the upper limits *except* for + * alloc_top_high. This must be the real top of RAM so we can put + * TCE's up there. + */ + + RELOC(alloc_top_high) = RELOC(ram_top); + + if (RELOC(prom_memory_limit)) { + if (RELOC(prom_memory_limit) <= RELOC(alloc_bottom)) { + prom_printf("Ignoring mem=%x <= alloc_bottom.\n", + RELOC(prom_memory_limit)); + RELOC(prom_memory_limit) = 0; + } else if (RELOC(prom_memory_limit) >= RELOC(ram_top)) { + prom_printf("Ignoring mem=%x >= ram_top.\n", + RELOC(prom_memory_limit)); + RELOC(prom_memory_limit) = 0; + } else { + RELOC(ram_top) = RELOC(prom_memory_limit); + RELOC(rmo_top) = min(RELOC(rmo_top), RELOC(prom_memory_limit)); + } + } + + /* + * Setup our top alloc point, that is top of RMO or top of + * segment 0 when running non-LPAR. 
+ */ + if ( RELOC(of_platform) == PLATFORM_PSERIES_LPAR ) + RELOC(alloc_top) = RELOC(rmo_top); + else + RELOC(alloc_top) = RELOC(rmo_top) = min(0x40000000ul, RELOC(ram_top)); prom_printf("memory layout at init:\n"); + prom_printf(" memory_limit : %x (16 MB aligned)\n", RELOC(prom_memory_limit)); prom_printf(" alloc_bottom : %x\n", RELOC(alloc_bottom)); prom_printf(" alloc_top : %x\n", RELOC(alloc_top)); prom_printf(" alloc_top_hi : %x\n", RELOC(alloc_top_high)); @@ -871,6 +968,16 @@ static void __init prom_initialize_tce_t reserve_mem(local_alloc_bottom, local_alloc_top - local_alloc_bottom); + if (RELOC(prom_memory_limit)) { + /* + * We align the start to a 16MB boundary so we can map the TCE area + * using large pages if possible. The end should be the top of RAM + * so no need to align it. + */ + RELOC(prom_tce_alloc_start) = _ALIGN_DOWN(local_alloc_bottom, 0x1000000); + RELOC(prom_tce_alloc_end) = local_alloc_top; + } + /* Flag the first invalid entry */ prom_debug("ending prom_initialize_tce_table\n"); } @@ -1684,9 +1791,21 @@ unsigned long __init prom_init(unsigned */ if (RELOC(ppc64_iommu_off)) prom_setprop(_prom->chosen, "linux,iommu-off", NULL, 0); + if (RELOC(iommu_force_on)) prom_setprop(_prom->chosen, "linux,iommu-force-on", NULL, 0); + if (RELOC(prom_memory_limit)) + prom_setprop(_prom->chosen, "linux,memory-limit", + PTRRELOC(&prom_memory_limit), sizeof(RELOC(prom_memory_limit))); + + if (RELOC(prom_tce_alloc_start)) { + prom_setprop(_prom->chosen, "linux,tce-alloc-start", + PTRRELOC(&prom_tce_alloc_start), sizeof(RELOC(prom_tce_alloc_start))); + prom_setprop(_prom->chosen, "linux,tce-alloc-end", + PTRRELOC(&prom_tce_alloc_end), sizeof(RELOC(prom_tce_alloc_end))); + } + /* * Now finally create the flattened device-tree */ From moilanen at austin.ibm.com Wed Mar 30 07:16:11 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Tue, 29 Mar 2005 15:16:11 -0600 Subject: DMA memory In-Reply-To: <1112074272.12301.26.camel@gaston> References: 
<20050328115328.79a3abf7.moilanen@austin.ibm.com> <1112074272.12301.26.camel@gaston> Message-ID: <20050329151611.6e121042.moilanen@austin.ibm.com> On Tue, 29 Mar 2005 15:31:12 +1000 Benjamin Herrenschmidt wrote: > > > - If you had a real old version of firmware on the 630, there was a > > problem w/ PCI-2-PCI bridges that we hit on a 2.4 kernel (I can't > > remember the details). Turn on Linux Compatibility Mode from the > > Service Processor Menus. Here's a link: > > > > http://publib16.boulder.ibm.com/pseries/en_US/infocenter/base/hardware_docs/pdf/380606.pdf > > > Do you have some tech. data about the problem ? Actually Olof just fixed this problem in 2.6. http://ozlabs.org/pipermail/linuxppc64-dev/2005-March/003666.html Jake From neg at brooktrout.com Wed Mar 30 07:56:03 2005 From: neg at brooktrout.com (Nathan Glasser) Date: Tue, 29 Mar 2005 21:56:03 -0000 Subject: DMA memory In-Reply-To: Your message of Tue, 29 Mar 2005 15:16:11 -0600 Message-ID: Since it seems that the system I am using is not in LPAR mode (or so I've been told), would it be correct that the fix isn't relevant to my situation? Thanks, Nathan From jschopp at austin.ibm.com Wed Mar 30 08:05:29 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Tue, 29 Mar 2005 16:05:29 -0600 Subject: [PATCH] ppc64 memory_present calls Message-ID: <4249D129.8040005@austin.ibm.com> i386 recently added memory_present() calls into mainline in anticipation of CONFIG_NONLINEAR, I figured we should follow suit. Currently the calls get #defined to nothing. Patch diffed against 2.6.12-rc1 Dave, I'll send you a copy rediffed against 2.6.12-rc1-mhp3 later today. Signed-off-by: Joel Schopp -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: memory_present.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050329/276f10de/attachment.txt From darren at kdi.ca Wed Mar 30 08:06:49 2005 From: darren at kdi.ca (Darren Critchley) Date: Tue, 29 Mar 2005 14:06:49 -0800 Subject: Hello - struggling to compile on PPC64 Message-ID: <4249D179.50306@kdi.ca> Hello, I am working on a conversion project going from IA32 (Intel Architecture) to PowerPC64; the target will most likely be iSeries. The current project is a complete distribution that is built on IA32 using modified build scripts based on Linux From Scratch (LFS) 5.1.1 (www.linuxfromscratch.org). Basically LFS allows one to build a basic distribution from scratch, and in fact we have used this build process to build an installable distro for IA32. The build script I use creates a new toolchain independent of the host system's glibc, gcc, binutils and other compiler tools. It then chroots itself to this toolchain and builds the entire distro in a build directory. The kernel that I use for that is a 2.4.29. After compilation, the build directory is then tarred up and placed on a bootable CD with a setup program, etc, effectively becoming a complete installable distro. (An excellent example of this build process can be found at www.ipcop.org; they use it to build a secure distro from scratch. Much of my build script was taken from there.) My objective is to make this build script/process work on power5 to produce a distro that can be loaded onto a power5. At this point in time I do not care if it is 32 or 64 bit, we just want to get something built on the power5 platform. To this end, I reserved a machine through IBM's Virtual Loaner Program (VLP). It is running RHEL3 with a 2.4.21 kernel. I have been told by IBM that I should be able to accomplish what I want on this box as long as I do not require SCSI support, which I don't. I uploaded the source and buildscript to the VLP box.
I have made some changes to the build script to recognize that it is running on PPC and that the target will be PPC. It builds the first few programs: sed-4.0.9, m4-1.4, bison-1.875, flex-2.5.4a, binutils-2.15.90.0.3, and when gcc-3.3.3 tries to build it ends with this error: /tools/powerpc64-unknown-linux-gnu/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/crti.o' is incompatible with powerpc:common64 output /tools/powerpc64-unknown-linux-gnu/bin/ld: warning: powerpc:common architecture of input file `/usr/lib/crtn.o' is incompatible with powerpc:common64 output /tools/powerpc64-unknown-linux-gnu/bin/ld: can not size stub section: Bad value /tools/powerpc64-unknown-linux-gnu/bin/ld: libgcc_s.so.1: Not enough room for program headers, try linking with -N /tools/powerpc64-unknown-linux-gnu/bin/ld: final link failed: Bad value collect2: ld returned 1 exit status I have looked up those errors and found nothing useful about them. What I am looking for is a really good how-to on developing something like this on PPC64. I found a toolchain (biarch) from penguinppc.org, but the included scripts seem to reference what look to be nightly CVS snapshots of glibc, binutils and gcc. Needless to say I have had no luck in getting that toolchain built either. It would seem that the PPC64 information on the LFS site is limited, mainly because not many people have a power5 sitting around to play on, so it may be some time before they have any useful information on their site. Does anyone know of anyone else who has attempted or completed anything like this? Any information or help would be greatly appreciated.
Thanks in advance Darren Critchley Kobelt Development Inc From jschopp at austin.ibm.com Wed Mar 30 08:13:57 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Tue, 29 Mar 2005 16:13:57 -0600 Subject: [PATCH] ppc64 memory_present calls In-Reply-To: <4249D129.8040005@austin.ibm.com> References: <4249D129.8040005@austin.ibm.com> Message-ID: <4249D325.8010606@austin.ibm.com> Joel Schopp wrote: > i386 recently added memory_present() calls into mainline in anticipation > of CONFIG_NONLINEAR, I figured we should follow suit. Currently the > calls get #defined to nothing. Err... by CONFIG_NONLINEAR I meant CONFIG_SPARSEMEM. CONFIG_NONLINEAR was so 2004, we've moved on to better config options now. From haveblue at us.ibm.com Wed Mar 30 08:16:40 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Tue, 29 Mar 2005 14:16:40 -0800 Subject: [PATCH] ppc64 memory_present calls In-Reply-To: <4249D129.8040005@austin.ibm.com> References: <4249D129.8040005@austin.ibm.com> Message-ID: <1112134600.27732.48.camel@localhost> On Tue, 2005-03-29 at 16:05 -0600, Joel Schopp wrote: > i386 recently added memory_present() calls into mainline in anticipation > of CONFIG_NONLINEAR, I figured we should follow suit. Currently the > calls get #defined to nothing. > > Patch diffed against 2.6.12-rc1 > Dave, I'll send you a copy rediffed against 2.6.12-rc1-mhp3 later today. What about the actual memory_present() discontig implementation? We'll need to push that ahead of sparsemem anyway, so we might as well do them at the same time. -- Dave From olof at austin.ibm.com Wed Mar 30 08:34:46 2005 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 29 Mar 2005 16:34:46 -0600 Subject: DMA memory In-Reply-To: References: Message-ID: <20050329223446.GA549@austin.ibm.com> On Tue, Mar 29, 2005 at 04:51:25PM -0500, Nathan Glasser wrote: > Since it seems that the system I am using is not in LPAR mode (or so I've been > told), would it be correct that the fix isn't relevant to my situation? Correct. 
And any version of RHEL3 should contain a similar fix already; the recent 2.6 fix was because I accidentally changed the behaviour when I cleaned up the 2.6 code a few weeks ago. -Olof From linas at austin.ibm.com Wed Mar 30 08:48:20 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Tue, 29 Mar 2005 16:48:20 -0600 Subject: Hello - struggling to compile on PPC64 In-Reply-To: <4249D179.50306@kdi.ca> References: <4249D179.50306@kdi.ca> Message-ID: <20050329224820.GA15596@austin.ibm.com> On Tue, Mar 29, 2005 at 02:06:49PM -0800, Darren Critchley was heard to remark: > > I uploaded the source and buildscript to the VLP box. I have made some > changes to the build script to recognize that it is running on PPC and > that the target will be PPC. > It builds the first few programs: sed-4.0.9, m4-1.4, bison-1.875, > flex-2.5.4a, binutils-2.15.90.0.3, and when gcc-3.3.3 tries to build it > ends with this error: > /tools/powerpc64-unknown-linux-gnu/bin/ld: warning: powerpc:common > architecture of input file `/usr/lib/crti.o' is incompatible with > powerpc:common64 output Well, (I am not a toolchain expert) but powerpc:common is the 32-bit environment, and powerpc:common64 is the 64-bit. Double check that /usr/lib/crti.o was rebuilt recently, instead of being something old lying around ... (I am not a toolchain expert) but, historically I have found that building the toolchain on any arch (not just i386) always required extensive hacking around of a rather ugly and hair-pulling sort. It's possible that it's gotten easier over the years, but ... maybe not. Are you *sure* you need to build your own compiler? Using the provided compiler & assembler might make life a lot easier for you...
--linas From jschopp at austin.ibm.com Wed Mar 30 08:53:31 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Tue, 29 Mar 2005 16:53:31 -0600 Subject: [PATCH] ppc64 memory_present calls In-Reply-To: <1112134600.27732.48.camel@localhost> References: <4249D129.8040005@austin.ibm.com> <1112134600.27732.48.camel@localhost> Message-ID: <4249DC6B.6030407@austin.ibm.com> > What about the actual memory_present() discontig implementation? We'll > need to push that ahead of sparsemem anyway, so we might as well do them > at the same time. DISCONTIG seems to work perfectly fine without it. Do you mean moving some existing discontig code into something called memory_present() or defining memory_present to {} for the discontig case? If it is the latter we seem pretty well covered by include/linux/mmzone.h From haveblue at us.ibm.com Wed Mar 30 08:57:36 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Tue, 29 Mar 2005 14:57:36 -0800 Subject: [PATCH] ppc64 memory_present calls In-Reply-To: <4249DC6B.6030407@austin.ibm.com> References: <4249D129.8040005@austin.ibm.com> <1112134600.27732.48.camel@localhost> <4249DC6B.6030407@austin.ibm.com> Message-ID: <1112137056.27732.53.camel@localhost> On Tue, 2005-03-29 at 16:53 -0600, Joel Schopp wrote: > > What about the actual memory_present() discontig implementation? We'll > > need to push that ahead of sparsemem anyway, so we might as well do them > > at the same time. > > DISCONTIG seems to work perfectly fine without it. Do you mean moving > some existing discontig code into something called memory_present() or > defining memory_present to {} for the discontig case? If it is the > latter we seem pretty well covered by include/linux/mmzone.h Oh, I guess that ppc64 doesn't have the subarch problem that ia32 does, and doesn't need to have the discontig setup broken out. I guess, if it works, it works. 
-- Dave From amodra at bigpond.net.au Wed Mar 30 09:23:12 2005 From: amodra at bigpond.net.au (Alan Modra) Date: Wed, 30 Mar 2005 08:53:12 +0930 Subject: Hello - struggling to compile on PPC64 In-Reply-To: <4249D179.50306@kdi.ca> References: <4249D179.50306@kdi.ca> Message-ID: <20050329232312.GP14407@bubble.modra.org> On Tue, Mar 29, 2005 at 02:06:49PM -0800, Darren Critchley wrote: > flex-2.5.4a, binutils-2.15.90.0.3, and when gcc-3.3.3 tries to build it > ends with this error: > /tools/powerpc64-unknown-linux-gnu/bin/ld: warning: powerpc:common > architecture of input file `/usr/lib/crti.o' is incompatible with > powerpc:common64 output You need a 64-bit glibc installed before compiling dynamic objects. -- Alan Modra IBM OzLabs - Linux Technology Centre From linas at austin.ibm.com Wed Mar 30 09:39:07 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Tue, 29 Mar 2005 17:39:07 -0600 Subject: DMA memory In-Reply-To: References: Message-ID: <20050329233907.GB15596@austin.ibm.com> On Fri, Mar 25, 2005 at 05:44:03PM -0500, Nathan Glasser was heard to remark: > Hello, > > I'm using ppc64 (p630), kernel 2.4.x (RH 3.0 patch x). > I'm working on a proprietary driver for a proprietary device. > > The device needs to access some host memory in order to perform > a DMA transfer. It can only access 32-bits. > > I'm allocating memory using pci_alloc_consistent. I'm passing > the "dma handle" to the device in the place where the bus address > would usually go (I formerly used virt_to_bus for x86). > > It seems that after the device performs the DMA, any further > access to MMIO board registers results in a system crash (such accesses > work fine prior to the device DMA). Here is the panic message on the > serial console. 
> > RTAS: 2 --------- RTAS event begin > RTAS 0: 00000000 00000000 > RTAS: 2 --------- RTAS event end > Kernel panic: EEH: MMIO failure (2) on device:pci12e4,1000 /pci@400000000111/pci@2,6/pci12e4,1000@1 > > It was suggested to me that the DMA was to a bad address, and that this > caused the device to be isolated. I didn't know the system could do that, > but it makes sense to me. The EEH MMIO failure will be triggered by a large variety of PCI error conditions: -- parity errors on data/address -- DMA to a bad address -- various PCI-X spec errors, including timed out split completions. -- low voltage on pci bus, poor electrical contacts. Judging from your description, you are probably looking at a bad DMA; but you can try reseating the PCI card anyway, just for good luck. The RTAS message is supposed to be a good bit longer; among other things it will sometimes contain a raw dump of the pci controller state. If I had that, I *might* be able to decode the details of what the pci controller didn't like (including the faulting address, if that's what it is.). I presume the truncated RTAS blob is due to some RH 3.0 bug; is there a chance you can try with a newer RH 3, or RH4 or kernel.org, so as to get the detailed report? --linas p.s. Nathan F, did you ever get an error decoder working for the pci chipsets? From paulus at samba.org Wed Mar 30 13:55:25 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 30 Mar 2005 13:55:25 +1000 Subject: prefetch on ppc64 In-Reply-To: <20050330034034.GA1752@IBM-BWN8ZTBWA01.austin.ibm.com> References: <20050330034034.GA1752@IBM-BWN8ZTBWA01.austin.ibm.com> Message-ID: <16970.9005.721117.942549@cargo.ozlabs.ibm.com> Serge E. Hallyn writes: > While investigating the inordinate performance impact one of my patches > seemed to be having, we tracked it down to two hlist_for_each_entry > loops, and finally to the prefetch instruction in the loop.
I would be interested to know what results you get if you leave the loops using hlist_for_each_entry but change prefetch() and prefetchw() to do the dcbt or dcbtst instruction only if the address is non-zero, like this: static inline void prefetch(const void *x) { if (x) __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x)); } static inline void prefetchw(const void *x) { if (x) __asm__ __volatile__ ("dcbtst 0,%0" : : "r" (x)); } It seems that doing a prefetch on a NULL pointer, while it doesn't cause a fault, does waste time looking for a translation of the zero address. Paul. From serue at us.ibm.com Wed Mar 30 13:40:34 2005 From: serue at us.ibm.com (Serge E. Hallyn) Date: Tue, 29 Mar 2005 21:40:34 -0600 Subject: prefetch on ppc64 Message-ID: <20050330034034.GA1752@IBM-BWN8ZTBWA01.austin.ibm.com> Hi, While investigating the inordinate performance impact one of my patches seemed to be having, we tracked it down to two hlist_for_each_entry loops, and finally to the prefetch instruction in the loop. The machine I'm testing on has 4 POWER5 1.5GHz CPUs and 16G RAM. I was mostly using dbench (v3.03) in runs of 50 and 100 on an ext2 system. Kernel was 2.6.11-rc5. I've not had much of a chance to test on x86, but the few tests I've run have shown that prefetch does improve performance there. From what I've seen this seems to be a ppc (perhaps ppc64) specific symptom. Following are two sets of interesting results on the ppc64 system. The first is on a stock 2.6.11-rc5 kernel. The actual stock kernel gave the following results for 100 runs of dbench: # elements: 100, mean 862.580380, variance 5.973441, std dev 2.444062 When I patched fs/dcache.c to replace the three hlist_for_each_entry{,_rcu} rules with manual loops, as shown in the attached file dcache-nohlist.patch, I got: # elements: 50, mean 881.804980, variance 10.695022, std dev 3.270325 The next set of results is based on 2.6.11-rc5 with the LSM stacking patches (from www.sf.net/projects/lsm-stacker).
I was understandably alarmed to find the original patched version gave me: # elements: 100, mean 797.654870, variance 7.503588, std dev 2.739268 The code which I determined to be responsible contained two list_for_each_entry loops. Replacing one with a manual loop gave me # elements: 50, mean 835.859980, variance 81.901719, std dev 9.049957 and replacing the second gave me # elements: 50, mean 846.541060, variance 17.095401, std dev 4.134658 Finally I followed Paul McKenney's suggestion and just commented out the ppc definition of prefetch altogether, which gave me: # elements: 50, mean 860.823880, variance 47.567428, std dev 6.896914 I am currently testing this same patch against a non-stacking kernel. thanks, -serge -------------- next part -------------- Index: linux-2.6.11-rc5-nostack/fs/dcache.c =================================================================== --- linux-2.6.11-rc5-nostack.orig/fs/dcache.c 2005-03-11 15:19:58.000000000 -0600 +++ linux-2.6.11-rc5-nostack/fs/dcache.c 2005-03-26 01:35:29.000000000 -0600 @@ -656,7 +656,7 @@ do { found = 0; spin_lock(&dcache_lock); - hlist_for_each(lp, head) { + for (lp=head->first; lp; lp = lp->next) { struct dentry *this = hlist_entry(lp, struct dentry, d_hash); if (!list_empty(&this->d_lru)) { dentry_stat.nr_unused--; @@ -1047,7 +1047,9 @@ rcu_read_lock(); - hlist_for_each_rcu(node, head) { + for (node=head->first; node; + ({ node = node->next; smp_read_barrier_depends(); })) + { struct dentry *dentry; struct qstr *qstr; @@ -1123,7 +1125,7 @@ spin_lock(&dcache_lock); base = d_hash(dparent, dentry->d_name.hash); - hlist_for_each(lhp,base) { + for (lhp=base->first; lhp; lhp = lhp->next) { /* hlist_for_each_rcu() not required for d_hash list * as it is parsed under dcache_lock */ From windenntw at gmail.com Wed Mar 30 15:38:10 2005 From: windenntw at gmail.com (Antonio Vargas) Date: Wed, 30 Mar 2005 07:38:10 +0200 Subject: prefetch on ppc64 In-Reply-To: <16970.9005.721117.942549@cargo.ozlabs.ibm.com> References:
<20050330034034.GA1752@IBM-BWN8ZTBWA01.austin.ibm.com> <16970.9005.721117.942549@cargo.ozlabs.ibm.com> Message-ID: <69304d110503292138620d4587@mail.gmail.com> On Wed, 30 Mar 2005 13:55:25 +1000, Paul Mackerras wrote: > Serge E. Hallyn writes: > > > While investigating the inordinate performance impact one of my patches > > seemed to be having, we tracked it down to two hlist_for_each_entry > > loops, and finally to the prefetch instruction in the loop. > > I would be interested to know what results you get if you leave the > loops using hlist_for_each_entry but change prefetch() and prefetchw() > to do the dcbt or dcbtst instruction only if the address is non-zero, > like this: > > static inline void prefetch(const void *x) > { > if (x) > __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x)); > } > > static inline void prefetchw(const void *x) > { > if (x) > __asm__ __volatile__ ("dcbtst 0,%0" : : "r" (x)); > } > > It seems that doing a prefetch on a NULL pointer, while it doesn't > cause a fault, does waste time looking for a translation of the zero > address. > > Paul. Don't know exactly about power5, but G5 processor is described on IBM docs as doing automatic whole-page prefetch read-ahead when detecting linear accesses. -- Greetz, Antonio Vargas aka winden of network http://wind.codepixel.com/ Las cosas no son lo que parecen, excepto cuando parecen lo que si son. 
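For anyone who wants to try the NULL-guard idea from this thread outside the kernel, here is a minimal user-space sketch. It assumes GCC or Clang and uses __builtin_prefetch as a portable stand-in for the ppc64 dcbt instruction, so it only illustrates the control flow and says nothing about POWER5 timing. The names struct node, guarded_prefetch() and sum_list() are hypothetical and are not taken from the kernel sources.

```c
#include <stddef.h>

/* Hypothetical singly linked node; stands in for an hlist entry. */
struct node {
    struct node *next;
    int value;
};

/*
 * Issue a prefetch hint only for non-NULL addresses.
 * __builtin_prefetch is a GCC/Clang builtin used here as a stand-in
 * for the dcbt instruction discussed in the thread (an assumption made
 * for portability, not the kernel's definition).
 */
static inline void guarded_prefetch(const void *x)
{
    if (x)
        __builtin_prefetch(x);
}

/* Walk the list, hinting the next node before using the current one. */
static int sum_list(const struct node *head)
{
    int sum = 0;
    const struct node *pos;

    for (pos = head; pos; pos = pos->next) {
        /* At the tail, pos->next is NULL and no hint is issued. */
        guarded_prefetch(pos->next);
        sum += pos->value;
    }
    return sum;
}
```

The guard costs one compare-and-branch per iteration, which is the trade-off being measured in the thread: it avoids the hint on the NULL tail pointer but adds work on every other node.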
From paulus at samba.org Wed Mar 30 16:00:55 2005 From: paulus at samba.org (Paul Mackerras) Date: Wed, 30 Mar 2005 16:00:55 +1000 Subject: prefetch on ppc64 In-Reply-To: <69304d110503292138620d4587@mail.gmail.com> References: <20050330034034.GA1752@IBM-BWN8ZTBWA01.austin.ibm.com> <16970.9005.721117.942549@cargo.ozlabs.ibm.com> <69304d110503292138620d4587@mail.gmail.com> Message-ID: <16970.16535.864623.323556@cargo.ozlabs.ibm.com> Antonio Vargas writes: > Don't know exactly about power5, but G5 processor is described on IBM > docs as doing automatic whole-page prefetch read-ahead when detecting > linear accesses. Sure, but linked lists would rarely be laid out linearly in memory. Paul. From serue at us.ibm.com Thu Mar 31 00:33:55 2005 From: serue at us.ibm.com (Serge E. Hallyn) Date: Wed, 30 Mar 2005 08:33:55 -0600 Subject: prefetch on ppc64 In-Reply-To: <16970.9005.721117.942549@cargo.ozlabs.ibm.com> References: <20050330034034.GA1752@IBM-BWN8ZTBWA01.austin.ibm.com> <16970.9005.721117.942549@cargo.ozlabs.ibm.com> Message-ID: <20050330143355.GA1692@IBM-BWN8ZTBWA01.austin.ibm.com> Quoting Paul Mackerras (paulus at samba.org): > Serge E. Hallyn writes: > > > While investigating the inordinate performance impact one of my patches > > seemed to be having, we tracked it down to two hlist_for_each_entry > > loops, and finally to the prefetch instruction in the loop. > > I would be interested to know what results you get if you leave the > loops using hlist_for_each_entry but change prefetch() and prefetchw() > to do the dcbt or dcbtst instruction only if the address is non-zero, > like this: > > static inline void prefetch(const void *x) > { > if (x) > __asm__ __volatile__ ("dcbt 0,%0" : : "r" (x)); > } > > static inline void prefetchw(const void *x) > { > if (x) > __asm__ __volatile__ ("dcbtst 0,%0" : : "r" (x)); > } > > It seems that doing a prefetch on a NULL pointer, while it doesn't > cause a fault, does waste time looking for a translation of the zero > address. 
Hi, Olof Johansson had suggested that earlier, except that his patch used if (unlikely(!x)) return; Performance was quite good, but not as good as having prefetch completely disabled. I got # elements: 50, mean 851.263680, variance 24.561146, std dev 4.955920 compared to 860.823880 stdev 6.896914 with prefetch disabled. thanks, -serge From darren at kdi.ca Thu Mar 31 01:53:14 2005 From: darren at kdi.ca (Darren Critchley) Date: Wed, 30 Mar 2005 07:53:14 -0800 Subject: Hello - struggling to compile on PPC64 In-Reply-To: <20050329232312.GP14407@bubble.modra.org> References: <4249D179.50306@kdi.ca> <20050329232312.GP14407@bubble.modra.org> Message-ID: <424ACB6A.3050400@kdi.ca> Alan Modra wrote: > On Tue, Mar 29, 2005 at 02:06:49PM -0800, Darren Critchley wrote: > >>flex-2.5.4a, binutils-2.15.90.0.3, and when gcc-3.3.3 tries to build it >>ends with this error: >>/tools/powerpc64-unknown-linux-gnu/bin/ld: warning: powerpc:common >>architecture of input file `/usr/lib/crti.o' is incompatible with >>powerpc:common64 output > > > You need a 64-bit glibc installed before compiling dynamic objects. > Ok, so will using the compiler option -m32 correct the situation? When I do that, I don't even get past the second program for compiling. Darren From nfont at austin.ibm.com Thu Mar 31 01:50:40 2005 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Wed, 30 Mar 2005 09:50:40 -0600 Subject: DMA memory In-Reply-To: <20050329233907.GB15596@austin.ibm.com> References: <20050329233907.GB15596@austin.ibm.com> Message-ID: <424ACAD0.9090201@austin.ibm.com> Linas Vepstas wrote: >p.s. Nathan F, did you ever get an error decoder working for the pci chipsets? > > The code is written but not tested in any way so I have not done anything with it yet. -Nathan F. 
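To put the two dbench means from the prefetch thread side by side, the arithmetic can be sketched as below. The constants are the means Serge quotes for the guarded prefetch and for prefetch disabled; relative_gap_percent() is a hypothetical helper, and the calculation makes no claim about statistical significance.

```c
/* Relative difference of `value` from `baseline`, in percent. */
static double relative_gap_percent(double baseline, double value)
{
    return (baseline - value) / baseline * 100.0;
}

/* dbench means quoted in the thread (50 runs each). */
static const double MEAN_PREFETCH_DISABLED = 860.823880;
static const double MEAN_PREFETCH_GUARDED  = 851.263680;
```

Calling relative_gap_percent(MEAN_PREFETCH_DISABLED, MEAN_PREFETCH_GUARDED) gives a gap of roughly 1.1 percent. The absolute difference, about 9.6, is larger than either reported standard deviation (4.96 and 6.90).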
From neg at brooktrout.com Thu Mar 31 03:30:40 2005 From: neg at brooktrout.com (Nathan Glasser) Date: Wed, 30 Mar 2005 17:30:40 -0000 Subject: DMA memory In-Reply-To: Your message of Tue, 29 Mar 2005 17:39:07 -0600 Message-ID: >I presume the truncated RTAS blob is due to some RH 3.0 bug; is there a >chance you can try with a newer RH 3, or RH4 or kernel.org, so as to get >the detailed report? The RH3.0 is a patch version already (2.4.21-20.EL). But no, there's no chance of trying anything else. >The RTAS message is supposed to be a good bit longer; among other things >it will sometimes contain a raw dump of the pci controller state. If I >had that, I *might* be able to decode the details of what the pci >controller didn't like (including the faulting address, if that's what >it is.). I had installed and enabled (apparently temporarily) some error logging thing I had been pointed to last week; it seemed to have caused some extra info to appear in /var/log/platform: RTAS: 1 -------- RTAS event begin -------- RTAS 0: 04440003 00000084 c6000008 16023600 RTAS 1: 20050325 00000000 00000000 00000000 RTAS 2: 00000000 00000000 00000000 00000000 RTAS 3: 49424d00 00000000 00503034 b1004699 RTAS 4: 04a0005d a009c0f5 00000000 00007701 RTAS 5: 00000000 00000000 00000000 00000000 RTAS 6: 01000000 00000000 42313030 34363939 RTAS 7: 20202020 20202020 20202020 20202020 RTAS 8: 20202020 20202020 00020000 RTAS: 1 -------- RTAS event end ---------- diagela: ----------------------------------------------------------------------- diagela: 03/25/2005 16:14:14 diagela: Automatic Error Log Analysis has detected a problem. diagela: diagela: The Service Request Number(s)/Probable Cause(s) diagela: (causes are listed in descending order of probability): diagela: diagela: 651-880: The CEC or SPCN reported an error. Report the SRN and the following reference and physical location codes to your service provider.
diagela: diagela: Location: n/a FRU: n/a Ref-Code: B1004699 diagela: diagela: Analysis of /var/log/platform sequence number: 1 RTAS: 1 -------- RTAS event begin -------- RTAS 0: 04440003 00000084 c6000008 17353000 RTAS 1: 20050325 00000000 00000000 00000000 RTAS 2: 00000000 00000000 00000000 00000000 RTAS 3: 49424d00 00000000 00503034 b1004699 RTAS 4: 04a0005d a009c0f5 00000000 00007701 RTAS 5: 00000000 00000000 00000000 00000000 RTAS 6: 01000000 00000000 42313030 34363939 RTAS 7: 20202020 20202020 20202020 20202020 RTAS 8: 20202020 20202020 00020000 RTAS: 1 -------- RTAS event end ---------- diagela: ----------------------------------------------------------------------- diagela: 03/25/2005 17:43:02 diagela: Automatic Error Log Analysis has detected a problem. diagela: diagela: The Service Request Number(s)/Probable Cause(s) diagela: (causes are listed in descending order of probability): diagela: diagela: 651-880: The CEC or SPCN reported an error. Report the SRN and the following reference and physical location codes to your service provider. diagela: diagela: Location: n/a FRU: n/a Ref-Code: B1004699 diagela: diagela: Analysis of /var/log/platform sequence number: 1 RTAS: 1 -------- RTAS event begin -------- RTAS 0: 00000000 00000000 RTAS: 1 -------- RTAS event end ---------- RTAS: 1 -------- RTAS event begin -------- RTAS 0: 00000000 00000000 RTAS: 1 -------- RTAS event end ---------- I had run some more tests this week which also caused crashes, but this is the entire file. 
Thanks, Nathan From linas at austin.ibm.com Thu Mar 31 05:26:53 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 30 Mar 2005 13:26:53 -0600 Subject: DMA memory In-Reply-To: References: Message-ID: <20050330192653.GD15596@austin.ibm.com> On Wed, Mar 30, 2005 at 12:25:50PM -0500, Nathan Glasser was heard to remark: > >I presume the truncated RTAS blob is due to some RH 3.0 bug; is there a > >chance you can try with a newer RH 3, or RH4 or kernel.org, so as to get > >the detailed report? > > The RH3.0 is a patch version already (2.4.21-20.EL). But no, there's no chance > of trying anything else. :( > >The RTAS message is supposed to be a good bit longer; among other things > >it will sometimes contain a raw dump of the pci controller state. If I > >had that, I *might* be able to decode the details of what the pci > >controller didn't like (including the faulting address, if that's what > >it is.). > > I had installed and enabled (apparently temporarily) some error logging thing > I had been pointed last week, it seemed to have caused some extra info to > appear in /var/log/platform: Hmm, even that is very short. On decode, it says: Residual error from previous boot Date/Time: 20050325 16023600 which is darned uninformative. There's some other bug that is causing the full error not to be logged. > I had run some more tests this week which also caused crashes, but this > is the entire file. You can avoid crashes by editing "arch/ppc64/kernel/eeh.c" and commenting out the call to eeh_panic(). That might help you with your debug efforts. The philosophy of panic'ing-on-error comes from the theory that it's better to panic than it is to corrupt data. Think "banking", a traditional IBM customer segment, to understand the origin of this theory. FWIW, new kernels include code that attempts to pci-reset the device after fielding one of these pci errors.
A generic all-architecture implementation for this is being discussed on LKML now, as the new PCI-E chipsets do something similar. Of course, resetting the hardware because there's a software bug won't help driver developers very much ... Even with the full RTAS message, fully decoded, figuring out what went wrong is hard. If you can't find the bug easily, I am afraid that you'll be reduced to staring at PCI bus analyzer traces, which is how many of these bugs are found ... :( Assuming that you have full access to the device you're coding for, your best bet is to add debug code to its firmware, and have it tell you where it plans to DMA to; compare that to where you thought it would be going. --linas From iamroot at ca.ibm.com Thu Mar 31 05:42:17 2005 From: iamroot at ca.ibm.com (Omkhar Arasaratnam) Date: Wed, 30 Mar 2005 14:42:17 -0500 Subject: [PATCH] PPC64: Fix LPAR IOMMU setup code for p630 In-Reply-To: <20050324170709.GA32597@austin.ibm.com> References: <20050324170709.GA32597@austin.ibm.com> Message-ID: <424B0119.3000102@ca.ibm.com> Olof Johansson wrote: >Hi, > >Here's a fix to deal with p630 systems in LPAR mode. They're to date >the only system that in some cases might lack a dma-window property >for the bus, but contain an overriding property in the device node for >the specific adapter/slot. This makes the device setup code a bit more >complex since it needs to do some of the things that the bus setup code >has already done.
> > >Signed-off-by: Olof Johansson > >Index: 2.6/arch/ppc64/kernel/pSeries_iommu.c >=================================================================== >--- 2.6.orig/arch/ppc64/kernel/pSeries_iommu.c 2005-03-23 13:06:34.000000000 -0600 >+++ 2.6/arch/ppc64/kernel/pSeries_iommu.c 2005-03-23 13:08:50.000000000 -0600 >@@ -401,6 +401,8 @@ static void iommu_bus_setup_pSeriesLP(st > struct device_node *dn, *pdn; > unsigned int *dma_window = NULL; > >+ DBG("iommu_bus_setup_pSeriesLP, bus %p, bus->self %p\n", bus, bus->self); >+ > dn = pci_bus_to_OF_node(bus); > > /* Find nearest ibm,dma-window, walking up the device tree */ >@@ -455,6 +457,56 @@ static void iommu_dev_setup_pSeries(stru > } > } > >+static void iommu_dev_setup_pSeriesLP(struct pci_dev *dev) >+{ >+ struct device_node *pdn, *dn; >+ struct iommu_table *tbl; >+ int *dma_window = NULL; >+ >+ DBG("iommu_dev_setup_pSeriesLP, dev %p (%s)\n", dev, dev->pretty_name); >+ >+ /* dev setup for LPAR is a little tricky, since the device tree might >+ * contain the dma-window properties per-device and not neccesarily >+ * for the bus. So we need to search upwards in the tree until we >+ * either hit a dma-window property, OR find a parent with a table >+ * already allocated. >+ */ >+ dn = pci_device_to_OF_node(dev); >+ >+ for (pdn = dn; pdn && !pdn->iommu_table; pdn = pdn->parent) { >+ dma_window = (unsigned int *)get_property(pdn, "ibm,dma-window", NULL); >+ if (dma_window) >+ break; >+ } >+ >+ /* Check for parent == NULL so we don't try to setup the empty EADS >+ * slots on POWER4 machines. >+ */ >+ if (dma_window == NULL || pdn->parent == NULL) { >+ /* Fall back to regular (non-LPAR) dev setup */ >+ DBG("No dma window for device, falling back to regular setup\n"); >+ iommu_dev_setup_pSeries(dev); >+ return; >+ } else { >+ DBG("Found DMA window, allocating table\n"); >+ } >+ >+ if (!pdn->iommu_table) { >+ /* iommu_table_setparms_lpar needs bussubno. 
*/ >+ pdn->bussubno = pdn->phb->bus->number; >+ >+ tbl = (struct iommu_table *)kmalloc(sizeof(struct iommu_table), >+ GFP_KERNEL); >+ >+ iommu_table_setparms_lpar(pdn->phb, pdn, tbl, dma_window); >+ >+ pdn->iommu_table = iommu_init_table(tbl); >+ } >+ >+ if (pdn != dn) >+ dn->iommu_table = pdn->iommu_table; >+} >+ > static void iommu_bus_setup_null(struct pci_bus *b) { } > static void iommu_dev_setup_null(struct pci_dev *d) { } > >@@ -479,13 +531,14 @@ void iommu_init_early_pSeries(void) > ppc_md.tce_free = tce_free_pSeriesLP; > } > ppc_md.iommu_bus_setup = iommu_bus_setup_pSeriesLP; >+ ppc_md.iommu_dev_setup = iommu_dev_setup_pSeriesLP; > } else { > ppc_md.tce_build = tce_build_pSeries; > ppc_md.tce_free = tce_free_pSeries; > ppc_md.iommu_bus_setup = iommu_bus_setup_pSeries; >+ ppc_md.iommu_dev_setup = iommu_dev_setup_pSeries; > } > >- ppc_md.iommu_dev_setup = iommu_dev_setup_pSeries; > > pci_iommu_init(); > } > > > Olof - everything works great - any word on when this will all merge upstream? From neg at brooktrout.com Thu Mar 31 05:47:11 2005 From: neg at brooktrout.com (Nathan Glasser) Date: Wed, 30 Mar 2005 19:47:11 -0000 Subject: DMA memory In-Reply-To: Your message of Wed, 30 Mar 2005 13:26:53 -0600 Message-ID: Linas, Thanks for your suggestions. Unfortunately, my access to the system has been time-limited, and is coming to an end. >You can avoid crashes by editing "arch/ppc64/kernel/eeh.c" and >commenting out the call to eeh_panic(). That might help you with your >debug efforts. I'm not sure if it would have been worth trying to rebuild the kernel for this, but in any case, there's no time now. >I am afraid that you'll be reduced to staring at PCI bus analyzer traces, >which is how many of these bugs are found ... :( This sort of hardware assistance was not available to me, and again time has run out for me.

>Assuming that you have full access to the device you're coding for, your best
>bet is to add debug code to its firmware, and have it tell you where it plans
>to DMA to; compare that to where you thought it would be going.

While I am very familiar with the device, I don't have control over
its operation in that way. While I could theoretically get someone to
do that, it would be a great deal of trouble, and again time has run out.

Unless someone comes up with a magic formula for me (:-)) I won't be
able to make use of these ideas.

Thanks,
Nathan

From olof at austin.ibm.com Thu Mar 31 05:48:01 2005
From: olof at austin.ibm.com (Olof Johansson)
Date: Wed, 30 Mar 2005 13:48:01 -0600
Subject: [PATCH] PPC64: Fix LPAR IOMMU setup code for p630
In-Reply-To: <424B0119.3000102@ca.ibm.com>
References: <20050324170709.GA32597@austin.ibm.com>
	<424B0119.3000102@ca.ibm.com>
Message-ID: <20050330194800.GA2639@austin.ibm.com>

On Wed, Mar 30, 2005 at 02:42:17PM -0500, Omkhar Arasaratnam wrote:

> Olaf - everything works great - any word on when this will all merge
> upstream?

It's already in mainline as of two days ago :-)

(taking off maintainer Cc:s)

http://linux.bkbits.net:8080/linux-2.5/cset@4248cb0a5sYVpmF0MKW6YM_snESiEA

-Olof

From anton at samba.org Thu Mar 31 07:25:55 2005
From: anton at samba.org (Anton Blanchard)
Date: Thu, 31 Mar 2005 07:25:55 +1000
Subject: [PATCH] Trim careful_alocation()
In-Reply-To: <20050330204207.GE3834@w-mikek2.ibm.com>
References: <20050330204207.GE3834@w-mikek2.ibm.com>
Message-ID: <20050330212554.GB15489@krispykreme>

Hi Mike,

> The following patch removes the call to __alloc_bootmem_node() in
> careful_allocation(). Note that careful_allocation is only called
> in two places:
> 1) to allocate the node pglist_data structure
> 2) to allocate the bootmem map for the node
> As such, calling __alloc_bootmem_node to allocate space for these
> items makes no sense.
> You can't use the bootmem allocator to create
> the initial data structures for the bootmem allocator.

On NUMA kernels we create NUMA nodes for all possible nodes. If that
node doesn't have any memory behind it then we will have to allocate from
another node. If the node we allocate from is a previous node then it
will have already been initialised by the bootmem allocator. Thus we
need to use bootmem in this case.

It's all a bit hairy but we need to support nodes with no memory behind
them.

Anton

From kravetz at us.ibm.com Thu Mar 31 06:42:07 2005
From: kravetz at us.ibm.com (Mike Kravetz)
Date: Wed, 30 Mar 2005 12:42:07 -0800
Subject: [PATCH] Trim careful_alocation()
Message-ID: <20050330204207.GE3834@w-mikek2.ibm.com>

The following patch removes the call to __alloc_bootmem_node() in
careful_allocation(). Note that careful_allocation is only called
in two places:
1) to allocate the node pglist_data structure
2) to allocate the bootmem map for the node
As such, calling __alloc_bootmem_node to allocate space for these
items makes no sense. You can't use the bootmem allocator to create
the initial data structures for the bootmem allocator.

-- 
Signed-off-by: Mike Kravetz

diff -Naupr linux-2.6.12-rc1/arch/ppc64/mm/numa.c linux-2.6.12-rc1.work/arch/ppc64/mm/numa.c
--- linux-2.6.12-rc1/arch/ppc64/mm/numa.c	2005-03-02 07:38:38.000000000 +0000
+++ linux-2.6.12-rc1.work/arch/ppc64/mm/numa.c	2005-03-30 19:12:12.000000000 +0000
@@ -469,13 +469,13 @@ static void __init dump_numa_topology(vo
 }

 /*
- * Allocate some memory, satisfying the lmb or bootmem allocator where
- * required. nid is the preferred node and end is the physical address of
- * the highest address in the node.
+ * Allocate some memory via the lmb allocator while trying to allocate space
+ * close to, but less than the end parameter. For 'node local' allocations
+ * end should be the highest physical address in the node.
  *
  * Returns the physical address of the memory.
  */
-static unsigned long careful_allocation(int nid, unsigned long size,
+static unsigned long careful_allocation(unsigned long size,
 				       unsigned long align, unsigned long end)
 {
 	unsigned long ret = lmb_alloc_base(size, align, end);
@@ -485,26 +485,8 @@ static unsigned long careful_allocation(
 		ret = lmb_alloc_base(size, align, lmb_end_of_DRAM());

 	if (!ret)
-		panic("numa.c: cannot allocate %lu bytes on node %d",
-		      size, nid);
-
-	/*
-	 * If the memory came from a previously allocated node, we must
-	 * retry with the bootmem allocator.
-	 */
-	if (pa_to_nid(ret) < nid) {
-		nid = pa_to_nid(ret);
-		ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(nid),
-				size, align, 0);
-
-		if (!ret)
-			panic("numa.c: cannot allocate %lu bytes on node %d",
-			      size, nid);
-
-		ret = virt_to_abs(ret);
-
-		dbg("alloc_bootmem %lx %lx\n", ret, size);
-	}
+		panic("numa.c: careful_allocation cannot allocate %lu bytes",
+		      size);

 	return ret;
 }
@@ -538,7 +520,7 @@ void __init do_init_bootmem(void)
 		end_paddr = start_paddr + (init_node_data[nid].node_spanned_pages * PAGE_SIZE);

 		/* Allocate the node structure node local if possible */
-		NODE_DATA(nid) = (struct pglist_data *)careful_allocation(nid,
+		NODE_DATA(nid) = (struct pglist_data *)careful_allocation(
 					sizeof(struct pglist_data),
 					SMP_CACHE_BYTES, end_paddr);
 		NODE_DATA(nid) = abs_to_virt(NODE_DATA(nid));
@@ -561,8 +543,7 @@ void __init do_init_bootmem(void)
 		bootmap_pages = bootmem_bootmap_pages((end_paddr - start_paddr) >> PAGE_SHIFT);

-		bootmem_paddr = careful_allocation(nid,
-				bootmap_pages << PAGE_SHIFT,
+		bootmem_paddr = careful_allocation(bootmap_pages << PAGE_SHIFT,
 				PAGE_SIZE, end_paddr);
 		memset(abs_to_virt(bootmem_paddr), 0,
 		       bootmap_pages << PAGE_SHIFT);

From kravetz at us.ibm.com Thu Mar 31 07:56:11 2005
From: kravetz at us.ibm.com (Mike Kravetz)
Date: Wed, 30 Mar 2005 13:56:11 -0800
Subject: [PATCH] Trim careful_alocation()
In-Reply-To: <20050330212554.GB15489@krispykreme>
References: <20050330204207.GE3834@w-mikek2.ibm.com>
	<20050330212554.GB15489@krispykreme>
Message-ID: <20050330215611.GF3834@w-mikek2.ibm.com>

On Thu, Mar 31, 2005 at 07:25:55AM +1000, Anton Blanchard wrote:
>
> On NUMA kernels we create NUMA nodes for all possible nodes. If that
> node doesn't have any memory behind it then we will have to allocate from
> another node. If the node we allocate from is a previous node then it
> will have already been initialised by the bootmem allocator. Thus we
> need to use bootmem in this case.
>
> It's all a bit hairy but we need to support nodes with no memory behind
> them.
>

You're absolutely right. I missed one/some of the subtle points.
Please disregard this patch.

-- 
Mike

From paulus at samba.org Thu Mar 31 08:17:03 2005
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 31 Mar 2005 08:17:03 +1000
Subject: [PATCH] PPC64: Fix LPAR IOMMU setup code for p630
In-Reply-To: <424B0119.3000102@ca.ibm.com>
References: <20050324170709.GA32597@austin.ibm.com>
	<424B0119.3000102@ca.ibm.com>
Message-ID: <16971.9567.973479.441417@cargo.ozlabs.ibm.com>

Omkhar Arasaratnam writes:

> Olaf - everything works great - any word on when this will all merge
> upstream?

It's already in Linus' BK.

Paul.

From sfr at canb.auug.org.au Thu Mar 31 21:46:24 2005
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Thu, 31 Mar 2005 21:46:24 +1000
Subject: service outage
Message-ID: <20050331214624.7e353733.sfr@canb.auug.org.au>

Hi all,

There will be a (hopefully short) service disruption tomorrow (Friday
April 1) at 1pm Canberra time (3 am UTC) to these mailing lists and the
ozlabs.org web site while our hosting ISP changes upstream service
provider.

-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/