From sfr at canb.auug.org.au Sat Sep 11 16:32:00 2004 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Sat, 11 Sep 2004 16:32:00 +1000 Subject: linuxppc64-dev mailing list In-Reply-To: <200409110049.i8B0nPJ0013638@supreme.pcug.org.au> References: <200409110049.i8B0nPJ0013638@supreme.pcug.org.au> Message-ID: <20040911163200.2ef3cd04.sfr@canb.auug.org.au> On Sat, 11 Sep 2004 10:49:25 +1000 (EST) Stephen Rothwell wrote: > > From: Hugh Blemings > > > > Chat with Anton this morning and Hollis this afternoon I suggest we go > > ahead and set up at least a temporary linuxppc64-dev mailing list on > > ozlabs.org > > > > Does this sound sane ? > > > > If so, Stephen could you oblige ? > > I will do this after lunch, OK? This is now done (as you will all have guessed). > The trick is how to publicize it ... Do we have access to the old subscriber list? Anyone want to volunteer to be list owner. Currently I am it and am more than willing to stay in this position. Can we revive the linuxppc.org domain, or at least get it hosted somewhere (DNS and web)? ozlabs.org is available for this. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040911/4d709008/attachment.pgp From anton at samba.org Sun Sep 12 12:28:31 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 12 Sep 2004 12:28:31 +1000 Subject: DEBUG_INFO Message-ID: <20040912022831.GG32755@krispykreme> Hi, # grep DEBUG_INFO .config CONFIG_DEBUG_INFO=y # ls -l vmlinux arch/ppc64/boot/zImage -rwxr-xr-x 1 anton anton 18428775 Sep 12 11:38 arch/ppc64/boot/zImage -rwxr-xr-x 1 anton anton 39353167 Sep 12 11:37 vmlinux We should at least strip the vmlinux we stuff into zImage; an 18MB zImage is pretty obnoxious, and it fails to load over the network on my 270: BOOTP S = 1 FILE: zImage.congo Load Addr=0x4000 Max Size=0xbfc000 FINAL File Size = 12566528 bytes. !20EE000B ! Does anyone with special Makefile powers want to have a go at fixing this? Bonus points for converting the zlib stuff to the generic /lib stuff like Tom Rini did on ppc32 :) It looks like yaboot can load a 40MB vmlinux OK, so I'd prefer to leave the vmlinux unstripped (so people can use gdb, addr2line, etc. against it). 
Anton From anton at samba.org Fri Sep 10 22:14:56 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:14:56 +1000 Subject: [PATCH] [ppc64] Remove unused ppc64_calibrate_delay In-Reply-To: <20040910121238.GG24408@krispykreme> References: <20040910121238.GG24408@krispykreme> Message-ID: <20040910121456.GH24408@krispykreme> - Remove ppc64_calibrate_delay, no longer used - Formatting fixups Signed-off-by: Anton Blanchard diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/setup.c foobar3/arch/ppc64/kernel/setup.c --- foobar2/arch/ppc64/kernel/setup.c 2004-09-10 19:42:54.402526581 +1000 +++ foobar3/arch/ppc64/kernel/setup.c 2004-09-10 19:38:55.392966432 +1000 @@ -51,7 +51,6 @@ extern unsigned long klimit; /* extern void *stab; */ extern HTAB htab_data; -extern unsigned long loops_per_jiffy; int have_of = 1; @@ -68,11 +67,11 @@ unsigned long r7); extern void fw_feature_init(void); -extern void iSeries_init_early( void ); -extern void pSeries_init_early( void ); +extern void iSeries_init_early(void); +extern void pSeries_init_early(void); extern void pSeriesLP_init_early(void); extern void pmac_init_early(void); -extern void mm_init_ppc64( void ); +extern void mm_init_ppc64(void); extern void pseries_secondary_smp_init(unsigned long); extern int idle_setup(void); extern void vpa_init(int cpu); @@ -263,10 +262,10 @@ #ifdef CONFIG_PPC_ISERIES /* pSeries systems are identified in prom.c via OF. */ - if ( itLpNaca.xLparInstalled == 1 ) + if (itLpNaca.xLparInstalled == 1) systemcfg->platform = PLATFORM_ISERIES_LPAR; #endif - + switch (systemcfg->platform) { #ifdef CONFIG_PPC_ISERIES case PLATFORM_ISERIES_LPAR: @@ -627,17 +626,6 @@ arch_initcall(ppc_init); -void __init ppc64_calibrate_delay(void) -{ - loops_per_jiffy = tb_ticks_per_jiffy; - - printk("Calibrating delay loop... 
%lu.%02lu BogoMips\n", - loops_per_jiffy/(500000/HZ), - loops_per_jiffy/(5000/HZ) % 100); -} - -extern void (*calibrate_delay)(void); - #ifdef CONFIG_IRQSTACKS static void __init irqstack_early_init(void) { @@ -693,7 +681,6 @@ extern int panic_timeout; extern void do_init_bootmem(void); - calibrate_delay = ppc64_calibrate_delay; ppc64_boot_msg(0x12, "Setup Arch"); From anton at samba.org Fri Sep 10 19:11:28 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:11:28 +1000 Subject: [PATCH] Enable NUMA API on ppc64 In-Reply-To: <20040910090943.GC24408@krispykreme> References: <20040910090458.GB24408@krispykreme> <20040910090943.GC24408@krispykreme> Message-ID: <20040910091128.GD24408@krispykreme> Plumb the NUMA API syscalls into ppc64. Also add some missing cond_syscalls so we still link with NUMA API disabled. Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/misc.S~numa_api arch/ppc64/kernel/misc.S --- foobar2/arch/ppc64/kernel/misc.S~numa_api 2004-09-10 18:32:34.385768254 +1000 +++ foobar2-anton/arch/ppc64/kernel/misc.S 2004-09-10 18:32:34.420765564 +1000 @@ -860,9 +860,9 @@ _GLOBAL(sys_call_table32) .llong .sys_ni_syscall /* 256 reserved for sys_debug_setcontext */ .llong .sys_ni_syscall /* 257 reserved for vserver */ .llong .sys_ni_syscall /* 258 reserved for new sys_remap_file_pages */ - .llong .sys_ni_syscall /* 259 reserved for new sys_mbind */ - .llong .sys_ni_syscall /* 260 reserved for new sys_get_mempolicy */ - .llong .sys_ni_syscall /* 261 reserved for new sys_set_mempolicy */ + .llong .compat_mbind + .llong .compat_get_mempolicy /* 260 */ + .llong .compat_set_mempolicy .llong .compat_sys_mq_open .llong .sys_mq_unlink .llong .compat_sys_mq_timedsend @@ -1132,9 +1132,9 @@ _GLOBAL(sys_call_table) .llong .sys_ni_syscall /* 256 reserved for sys_debug_setcontext */ .llong .sys_ni_syscall /* 257 reserved for vserver */ .llong .sys_ni_syscall /* 258 reserved for new sys_remap_file_pages */ - .llong .sys_ni_syscall /* 259 reserved 
for new sys_mbind */ - .llong .sys_ni_syscall /* 260 reserved for new sys_get_mempolicy */ - .llong .sys_ni_syscall /* 261 reserved for new sys_set_mempolicy */ + .llong .sys_mbind + .llong .sys_get_mempolicy /* 260 */ + .llong .sys_set_mempolicy .llong .sys_mq_open .llong .sys_mq_unlink .llong .sys_mq_timedsend diff -puN arch/ppc64/mm/numa.c~numa_api arch/ppc64/mm/numa.c diff -puN include/asm-ppc64/unistd.h~numa_api include/asm-ppc64/unistd.h --- foobar2/include/asm-ppc64/unistd.h~numa_api 2004-09-10 18:32:34.397767332 +1000 +++ foobar2-anton/include/asm-ppc64/unistd.h 2004-09-10 18:32:34.423765333 +1000 @@ -269,9 +269,9 @@ /* Number 256 is reserved for sys_debug_setcontext */ /* Number 257 is reserved for vserver */ /* Number 258 is reserved for new sys_remap_file_pages */ -/* Number 259 is reserved for new sys_mbind */ -/* Number 260 is reserved for new sys_get_mempolicy */ -/* Number 261 is reserved for new sys_set_mempolicy */ +#define __NR_mbind 259 +#define __NR_get_mempolicy 260 +#define __NR_set_mempolicy 261 #define __NR_mq_open 262 #define __NR_mq_unlink 263 #define __NR_mq_timedsend 264 diff -puN kernel/sys.c~fix_numa_api kernel/sys.c --- foobar2/kernel/sys.c~fix_numa_api 2004-09-10 18:59:26.757155478 +1000 +++ foobar2-anton/kernel/sys.c 2004-09-10 19:00:16.455837772 +1000 @@ -274,7 +274,9 @@ cond_syscall(compat_sys_mq_getsetattr) cond_syscall(sys_mbind) cond_syscall(sys_get_mempolicy) cond_syscall(sys_set_mempolicy) +cond_syscall(compat_mbind) cond_syscall(compat_get_mempolicy) +cond_syscall(compat_set_mempolicy) /* arch-specific weak syscall entries */ cond_syscall(sys_pciconfig_read) _ From anton at samba.org Fri Sep 10 19:09:43 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:09:43 +1000 Subject: [PATCH] RTAS error logs can appear twice in dmesg In-Reply-To: <20040910090458.GB24408@krispykreme> References: <20040910090458.GB24408@krispykreme> Message-ID: <20040910090943.GC24408@krispykreme> Ive started seeing rtas errors 
printed twice. Remove the second call to printk_log_rtas. Signed-off-by: Anton Blanchard ===== rtasd.c 1.30 vs edited ===== --- 1.30/arch/ppc64/kernel/rtasd.c Fri Sep 3 19:08:18 2004 +++ edited/rtasd.c Fri Sep 10 17:09:57 2004 @@ -216,12 +216,13 @@ if (!no_more_logging && !(err_type & ERR_FLAG_BOOT)) nvram_write_error_log(buf, len, err_type); - /* rtas errors can occur during boot, and we do want to capture + /* + * rtas errors can occur during boot, and we do want to capture * those somewhere, even if nvram isn't ready (why not?), and even - * if rtasd isn't ready. Put them into the boot log, at least. */ - if ((err_type & ERR_TYPE_MASK) == ERR_TYPE_RTAS_LOG) { + * if rtasd isn't ready. Put them into the boot log, at least. + */ + if ((err_type & ERR_TYPE_MASK) == ERR_TYPE_RTAS_LOG) printk_log_rtas(buf, len); - } /* Check to see if we need to or have stopped logging */ if (fatal || no_more_logging) { @@ -233,9 +234,6 @@ /* call type specific method for error */ switch (err_type & ERR_TYPE_MASK) { case ERR_TYPE_RTAS_LOG: - /* put into syslog and error_log file */ - printk_log_rtas(buf, len); - offset = rtas_error_log_buffer_max * ((rtas_log_start+rtas_log_size) & LOG_NUMBER_MASK); From anton at samba.org Fri Sep 10 22:23:37 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:23:37 +1000 Subject: [PATCH] [ppc64] Use early_param In-Reply-To: <20040910121941.GI24408@krispykreme> References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> Message-ID: <20040910122337.GJ24408@krispykreme> Make use of Rusty's early_param code. It's good stuff. We appear to be the first user :) Move vpa_init and idle_setup later in boot; we don't have to do them right up front in setup_system. 
Signed-off-by: Anton Blanchard diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/setup.c foobar3/arch/ppc64/kernel/setup.c --- foobar2/arch/ppc64/kernel/setup.c 2004-09-10 19:52:08.296273933 +1000 +++ foobar3/arch/ppc64/kernel/setup.c 2004-09-10 19:50:06.856308336 +1000 @@ -244,7 +244,21 @@ systemcfg->processorCount = num_present_cpus(); } + #endif /* !defined(CONFIG_PPC_ISERIES) && defined(CONFIG_SMP) */ + +#ifdef CONFIG_XMON +static int __init early_xmon(char *p) +{ + /* ensure xmon is enabled */ + xmon_init(); + debugger(0); + + return 0; +} +early_param("xmon", early_xmon); +#endif + /* * Do some initial setup of the system. The parameters are those which * were passed in from the bootloader. @@ -256,10 +270,6 @@ int ret, i; #endif -#ifdef CONFIG_XMON_DEFAULT - xmon_init(); -#endif - #ifdef CONFIG_PPC_ISERIES /* pSeries systems are identified in prom.c via OF. */ if (itLpNaca.xLparInstalled == 1) @@ -290,6 +300,9 @@ #endif /* CONFIG_PPC_PMAC */ } +#ifdef CONFIG_XMON_DEFAULT + xmon_init(); +#endif /* If we were passed an initrd, set the ROOT_DEV properly if the values * look sensible. If not, clear initrd reference. */ @@ -330,6 +343,11 @@ iSeries_parse_cmdline(); #endif + /* Save unparsed command line copy for /proc/cmdline */ + strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); + + parse_early_param(); + #ifdef CONFIG_SMP #ifndef CONFIG_PPC_ISERIES /* @@ -351,6 +369,10 @@ i); } } + + if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) + vpa_init(boot_cpuid); + #endif /* CONFIG_PPC_PSERIES */ #endif /* CONFIG_SMP */ @@ -380,15 +402,6 @@ printk("-----------------------------------------------------\n"); mm_init_ppc64(); - -#if defined(CONFIG_SMP) && defined(CONFIG_PPC_PSERIES) - if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) { - vpa_init(boot_cpuid); - } -#endif - - /* Select the correct idle loop for the platform. 
*/ - idle_setup(); } void machine_restart(char *cmd) @@ -512,30 +525,20 @@ .show = show_cpuinfo, }; -#endif +#if 0 /* XXX not currently used */ +unsigned long memory_limit; - /* Look for mem= option on command line */ - if (strstr(cmd_line, "mem=")) { - char *p, *q; - unsigned long maxmem = 0; - extern unsigned long __max_memory; - - for (q = cmd_line; (p = strstr(q, "mem=")) != 0; ) { - q = p + 4; - if (p > cmd_line && p[-1] != ' ') - continue; - maxmem = simple_strtoul(q, &q, 0); - if (*q == 'k' || *q == 'K') { - maxmem <<= 10; - ++q; - } else if (*q == 'm' || *q == 'M') { - maxmem <<= 20; - ++q; - } - } - __max_memory = maxmem; - } +static int __init early_parsemem(char *p) +{ + if (!p) + return 0; + + memory_limit = memparse(p, &p); + + return 0; } +early_param("mem", early_parsemem); +#endif #ifdef CONFIG_PPC_PSERIES static int __init set_preferred_console(void) @@ -681,16 +684,10 @@ extern int panic_timeout; extern void do_init_bootmem(void); - ppc64_boot_msg(0x12, "Setup Arch"); -#ifdef CONFIG_XMON - if (strstr(cmd_line, "xmon")) { - /* ensure xmon is enabled */ - xmon_init(); - debugger(0); - } -#endif /* CONFIG_XMON */ + *cmdline_p = cmd_line; + /* * Set cache line size based on type of cpu as a default. @@ -711,16 +708,15 @@ init_mm.end_data = (unsigned long) _edata; init_mm.brk = klimit; - /* Save unparsed command line copy for /proc/cmdline */ - strlcpy(saved_command_line, cmd_line, COMMAND_LINE_SIZE); - *cmdline_p = cmd_line; - irqstack_early_init(); emergency_stack_init(); /* set up the bootmem stuff with available memory */ do_init_bootmem(); + /* Select the correct idle loop for the platform. 
*/ + idle_setup(); + ppc_md.setup_arch(); paging_init(); diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/mm/numa.c foobar3/arch/ppc64/mm/numa.c --- foobar2/arch/ppc64/mm/numa.c 2004-09-10 19:52:05.108989721 +1000 +++ foobar3/arch/ppc64/mm/numa.c 2004-09-10 19:46:40.576232848 +1000 @@ -18,6 +18,8 @@ #include #include +static int numa_enabled = 1; + static int numa_debug; #define dbg(args...) if (numa_debug) { printk(KERN_INFO args); } @@ -189,10 +191,7 @@ long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT; unsigned long i; - if (strstr(saved_command_line, "numa=debug")) - numa_debug = 1; - - if (strstr(saved_command_line, "numa=off")) { + if (numa_enabled == 0) { printk(KERN_WARNING "NUMA disabled by user\n"); return -1; } @@ -587,3 +586,18 @@ start_pfn, zholes_size); } } + +static int __init early_numa(char *p) +{ + if (!p) + return 0; + + if (strstr(p, "off")) + numa_enabled = 0; + + if (strstr(p, "debug")) + numa_debug = 1; + + return 0; +} +early_param("numa", early_numa); From anton at samba.org Fri Sep 10 22:32:09 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:32:09 +1000 Subject: [PATCH] [ppc64] Enable POWER5 low power mode in idle loop In-Reply-To: <20040910122904.GK24408@krispykreme> References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> <20040910122337.GJ24408@krispykreme> <20040910122904.GK24408@krispykreme> Message-ID: <20040910123209.GL24408@krispykreme> Now that we understand (and have fixed) the problem with using low power mode in the idle loop, lets enable it. It should save a fair amount of power. (The problem was that our exceptions were inheriting the low power mode and so were executing at a fraction of the normal cpu issue rate. We fixed it by always bumping our priority to medium at the start of every exception). 
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/idle.c~enable_r31_in_idle arch/ppc64/kernel/idle.c --- foobar2/arch/ppc64/kernel/idle.c~enable_r31_in_idle 2004-09-10 20:58:19.402799782 +1000 +++ foobar2-anton/arch/ppc64/kernel/idle.c 2004-09-10 20:58:19.423798168 +1000 @@ -142,7 +142,12 @@ int default_idle(void) while (!need_resched() && !cpu_is_offline(cpu)) { barrier(); + /* + * Go into low thread priority and possibly + * low power mode. + */ HMT_low(); + HMT_very_low(); } HMT_medium(); @@ -184,18 +189,18 @@ int dedicated_idle(void) start_snooze = __get_tb() + *smt_snooze_delay * tb_ticks_per_usec; while (!need_resched() && !cpu_is_offline(cpu)) { - /* need_resched could be 1 or 0 at this - * point. If it is 0, set it to 0, so - * an IPI/Prod is sent. If it is 1, keep - * it that way & schedule work. + /* + * Go into low thread priority and possibly + * low power mode. */ + HMT_low(); + HMT_very_low(); + if (*smt_snooze_delay == 0 || - __get_tb() < start_snooze) { - HMT_low(); /* Low thread priority */ + __get_tb() < start_snooze) continue; - } - HMT_very_low(); /* Low power mode */ + HMT_medium(); if (!(ppaca->lppaca.xIdle)) { /* Indicate we are no longer polling for @@ -210,7 +215,6 @@ int dedicated_idle(void) break; } - /* DRENG: Go HMT_medium here ? */ local_irq_disable(); /* SMT dynamic mode. Cede will result _ From anton at samba.org Fri Sep 10 19:13:42 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:13:42 +1000 Subject: [PATCH] [ppc64] Give the kernel an OPD section In-Reply-To: <20040910091128.GD24408@krispykreme> References: <20040910090458.GB24408@krispykreme> <20040910090943.GC24408@krispykreme> <20040910091128.GD24408@krispykreme> Message-ID: <20040910091342.GE24408@krispykreme> From: Alan Modra Give the kernel an OPD section, required for recent ppc64 toolchains. 
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/vmlinux.lds.S~kernel-opd arch/ppc64/kernel/vmlinux.lds.S --- gr_work/arch/ppc64/kernel/vmlinux.lds.S~kernel-opd 2004-09-04 21:14:22.123514698 -0500 +++ gr_work-anton/arch/ppc64/kernel/vmlinux.lds.S 2004-09-04 21:14:22.133513110 -0500 @@ -117,10 +117,13 @@ SECTIONS .data : { *(.data .data.rel* .toc1) - *(.opd) *(.branch_lt) } + .opd : { + *(.opd) + } + .got : { __toc_start = .; *(.got) _ From anton at samba.org Fri Sep 10 22:19:41 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:19:41 +1000 Subject: [PATCH] [ppc64] Remove EEH command line device matching code In-Reply-To: <20040910121456.GH24408@krispykreme> References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> Message-ID: <20040910121941.GI24408@krispykreme> We have had reports of people attempting to disable EEH on POWER5 boxes. This is not supported, and the device will most likely not respond to config space reads/writes. Remove the IBM location matching code that was being used to disable devices as well as the global option. We already have the ability to ignore EEH errors via the panic_on_oops sysctl option; advanced users should make use of that instead. 
Signed-off-by: Anton Blanchard diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/eeh.c foobar3/arch/ppc64/kernel/eeh.c --- foobar2/arch/ppc64/kernel/eeh.c 2004-09-10 19:41:20.500660954 +1000 +++ foobar3/arch/ppc64/kernel/eeh.c 2004-09-10 19:41:15.745368932 +1000 @@ -48,9 +48,6 @@ static int ibm_slot_error_detail; static int eeh_subsystem_enabled; -#define EEH_MAX_OPTS 4096 -static char *eeh_opts; -static int eeh_opts_last; /* Buffer for reporting slot-error-detail rtas calls */ static unsigned char slot_errbuf[RTAS_ERROR_LOG_MAX]; @@ -62,10 +59,6 @@ static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); -static int eeh_check_opts_config(struct device_node *dn, int class_code, - int vendor_id, int device_id, - int default_state); - /** * The pci address cache subsystem. This subsystem places * PCI device address resources into a red-black tree, sorted @@ -497,7 +490,6 @@ struct eeh_early_enable_info { unsigned int buid_hi; unsigned int buid_lo; - int force_off; }; /* Enable eeh for the given device node. */ @@ -539,18 +531,8 @@ if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) enable = 0; - if (!eeh_check_opts_config(dn, *class_code, *vendor_id, *device_id, - enable)) { - if (enable) { - printk(KERN_WARNING "EEH: %s user requested to run " - "without EEH checking.\n", dn->full_name); - enable = 0; - } - } - - if (!enable || info->force_off) { + if (!enable) dn->eeh_mode |= EEH_MODE_NOCHECK; - } /* Ok... see if this device supports EEH. Some do, some don't, * and the only way to find out is to check each and every one. 
*/ @@ -604,15 +586,12 @@ { struct device_node *phb, *np; struct eeh_early_enable_info info; - char *eeh_force_off = strstr(saved_command_line, "eeh-force-off"); init_pci_config_tokens(); np = of_find_node_by_path("/rtas"); - if (np == NULL) { - printk(KERN_WARNING "EEH: RTAS not found !\n"); + if (np == NULL) return; - } ibm_set_eeh_option = rtas_token("ibm,set-eeh-option"); ibm_set_slot_reset = rtas_token("ibm,set-slot-reset"); @@ -632,13 +611,6 @@ eeh_error_buf_size = RTAS_ERROR_LOG_MAX; } - info.force_off = 0; - if (eeh_force_off) { - printk(KERN_WARNING "EEH: WARNING: PCI Enhanced I/O Error " - "Handling is user disabled\n"); - info.force_off = 1; - } - /* Enable EEH for all adapters. Note that eeh requires buid's */ for (phb = of_find_node_by_name(NULL, "pci"); phb; phb = of_find_node_by_name(phb, "pci")) { @@ -653,11 +625,10 @@ traverse_pci_devices(phb, early_enable_eeh, &info); } - if (eeh_subsystem_enabled) { + if (eeh_subsystem_enabled) printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n"); - } else { - printk(KERN_WARNING "EEH: disabled PCI Enhanced I/O Error Handling\n"); - } + else + printk(KERN_WARNING "EEH: No capable adapters found\n"); } /** @@ -816,129 +787,3 @@ return 0; } __initcall(eeh_init_proc); - -/* - * Test if "dev" should be configured on or off. - * This processes the options literally from left to right. - * This lets the user specify stupid combinations of options, - * but at least the result should be very predictable. 
- */ -static int eeh_check_opts_config(struct device_node *dn, - int class_code, int vendor_id, int device_id, - int default_state) -{ - char devname[32], classname[32]; - char *strs[8], *s; - int nstrs, i; - int ret = default_state; - - /* Build list of strings to match */ - nstrs = 0; - s = (char *)get_property(dn, "ibm,loc-code", NULL); - if (s) - strs[nstrs++] = s; - sprintf(devname, "dev%04x:%04x", vendor_id, device_id); - strs[nstrs++] = devname; - sprintf(classname, "class%04x", class_code); - strs[nstrs++] = classname; - strs[nstrs++] = ""; /* yes, this matches the empty string */ - - /* - * Now see if any string matches the eeh_opts list. - * The eeh_opts list entries start with + or -. - */ - for (s = eeh_opts; s && (s < (eeh_opts + eeh_opts_last)); - s += strlen(s)+1) { - for (i = 0; i < nstrs; i++) { - if (strcasecmp(strs[i], s+1) == 0) { - ret = (strs[i][0] == '+') ? 1 : 0; - } - } - } - return ret; -} - -/* - * Handle kernel eeh-on & eeh-off cmd line options for eeh. - * - * We support: - * eeh-off=loc1,loc2,loc3... - * - * and this option can be repeated so - * eeh-off=loc1,loc2 eeh-off=loc3 - * is the same as eeh-off=loc1,loc2,loc3 - * - * loc is an IBM location code that can be found in a manual or - * via openfirmware (or the Hardware Management Console). - * - * We also support these additional "loc" values: - * - * dev#:# vendor:device id in hex (e.g. dev1022:2000) - * class# class id in hex (e.g. class0200) - * - * If no location code is specified all devices are assumed - * so eeh-off means eeh by default is off. - */ - -/* - * This is implemented as a null separated list of strings. - * Each string looks like this: "+X" or "-X" - * where X is a loc code, vendor:device, class (as shown above) - * or empty which is used to indicate all. - * - * We interpret this option string list so that it will literally - * behave left-to-right even if some combinations don't make sense. 
- */ -static int __init eeh_parm(char *str, int state) -{ - char *s, *cur, *curend; - - if (!eeh_opts) { - eeh_opts = alloc_bootmem(EEH_MAX_OPTS); - eeh_opts[eeh_opts_last++] = '+'; /* default */ - eeh_opts[eeh_opts_last++] = '\0'; - } - if (*str == '\0') { - eeh_opts[eeh_opts_last++] = state ? '+' : '-'; - eeh_opts[eeh_opts_last++] = '\0'; - return 1; - } - if (*str == '=') - str++; - for (s = str; s && *s != '\0'; s = curend) { - cur = s; - /* ignore empties. Don't treat as "all-on" or "all-off" */ - while (*cur == ',') - cur++; - curend = strchr(cur, ','); - if (!curend) - curend = cur + strlen(cur); - if (*cur) { - int curlen = curend-cur; - if (eeh_opts_last + curlen > EEH_MAX_OPTS-2) { - printk(KERN_WARNING "EEH: sorry...too many " - "eeh cmd line options\n"); - return 1; - } - eeh_opts[eeh_opts_last++] = state ? '+' : '-'; - strncpy(eeh_opts+eeh_opts_last, cur, curlen); - eeh_opts_last += curlen; - eeh_opts[eeh_opts_last++] = '\0'; - } - } - - return 1; -} - -static int __init eehoff_parm(char *str) -{ - return eeh_parm(str, 0); -} - -static int __init eehon_parm(char *str) -{ - return eeh_parm(str, 1); -} - -__setup("eeh-off", eehoff_parm); -__setup("eeh-on", eehon_parm); From anton at samba.org Fri Sep 10 22:12:38 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:12:38 +1000 Subject: [PATCH] [ppc64] Clean up kernel command line code Message-ID: <20040910121238.GG24408@krispykreme> Clean up some of our command line code: - We were copying the command line out of the device tree twice, but the first time we forgot to add CONFIG_CMDLINE. Fix this and remove the second copy. - The command line birec code ran after we had done some command line parsing in prom.c. This had the opportunity to really confuse the user, with some options being parsed out of the device tree and the other out of birecs. Luckily we could find no user of the command line birecs, so remove them. 
- remove duplicate printing of kernel command line; - clean up iseries inits and create an iSeries_parse_cmdline. Signed-off-by: Anton Blanchard diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/chrp_setup.c foobar3/arch/ppc64/kernel/chrp_setup.c --- foobar2/arch/ppc64/kernel/chrp_setup.c 2004-09-10 19:33:10.910718416 +1000 +++ foobar3/arch/ppc64/kernel/chrp_setup.c 2004-09-10 19:29:44.431528121 +1000 @@ -140,8 +140,6 @@ ROOT_DEV = Root_SDA2; } - printk("Boot arguments: %s\n", cmd_line); - fwnmi_init(); #ifndef CONFIG_PPC_ISERIES diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/iSeries_setup.c foobar3/arch/ppc64/kernel/iSeries_setup.c --- foobar2/arch/ppc64/kernel/iSeries_setup.c 2004-09-10 19:33:10.918717801 +1000 +++ foobar3/arch/ppc64/kernel/iSeries_setup.c 2004-09-10 19:30:32.042107395 +1000 @@ -333,32 +333,31 @@ #endif if (itLpNaca.xPirEnvironMode == 0) piranha_simulator = 1; + + /* Associate Lp Event Queue 0 with processor 0 */ + HvCallEvent_setLpEventQueueInterruptProc(0, 0); + + mf_init(); + mf_initialized = 1; + mb(); } -void __init iSeries_init(unsigned long r3, unsigned long r4, unsigned long r5, - unsigned long r6, unsigned long r7) +void __init iSeries_parse_cmdline(void) { char *p, *q; - /* Associate Lp Event Queue 0 with processor 0 */ - HvCallEvent_setLpEventQueueInterruptProc(0, 0); - /* copy the command line parameter from the primary VSP */ HvCallEvent_dmaToSp(cmd_line, 2 * 64* 1024, 256, HvLpDma_Direction_RemoteToLocal); p = cmd_line; q = cmd_line + 255; - while( p < q ) { + while(p < q) { if (!*p || *p == '\n') break; ++p; } *p = 0; - - mf_init(); - mf_initialized = 1; - mb(); } /* diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/prom.c foobar3/arch/ppc64/kernel/prom.c --- foobar2/arch/ppc64/kernel/prom.c 2004-09-10 19:33:10.928717033 +1000 +++ foobar3/arch/ppc64/kernel/prom.c 2004-09-10 19:35:32.146053659 +1000 @@ -1707,6 +1707,9 @@ } RELOC(cmd_line[0]) = 0; +#ifdef 
CONFIG_CMDLINE + strlcpy(RELOC(cmd_line), CONFIG_CMDLINE, sizeof(cmd_line)); +#endif /* CONFIG_CMDLINE */ if ((long)_prom->chosen > 0) { prom_getprop(_prom->chosen, "bootargs", p, sizeof(cmd_line)); if (p != NULL && p[0] != 0) diff --exclude-from=/home/anton/dontdiff -urN foobar2/arch/ppc64/kernel/setup.c foobar3/arch/ppc64/kernel/setup.c --- foobar2/arch/ppc64/kernel/setup.c 2004-09-10 19:33:10.934716571 +1000 +++ foobar3/arch/ppc64/kernel/setup.c 2004-09-10 19:32:10.785258973 +1000 @@ -68,7 +68,6 @@ unsigned long r7); extern void fw_feature_init(void); -extern void iSeries_init( void ); extern void iSeries_init_early( void ); extern void pSeries_init_early( void ); extern void pSeriesLP_init_early(void); @@ -77,6 +76,7 @@ extern void pseries_secondary_smp_init(unsigned long); extern int idle_setup(void); extern void vpa_init(int cpu); +extern void iSeries_parse_cmdline(void); unsigned long decr_overclock = 1; unsigned long decr_overclock_proc0 = 1; @@ -87,10 +87,6 @@ unsigned char aux_device_present; -void parse_cmd_line(unsigned long r3, unsigned long r4, unsigned long r5, - unsigned long r6, unsigned long r7); -int parse_bootinfo(void); - #ifdef CONFIG_MAGIC_SYSRQ unsigned long SYSRQ_KEY; #endif /* CONFIG_MAGIC_SYSRQ */ @@ -282,19 +278,16 @@ case PLATFORM_PSERIES: fw_feature_init(); pSeries_init_early(); - parse_bootinfo(); break; case PLATFORM_PSERIES_LPAR: fw_feature_init(); pSeriesLP_init_early(); - parse_bootinfo(); break; #endif /* CONFIG_PPC_PSERIES */ #ifdef CONFIG_PPC_PMAC case PLATFORM_POWERMAC: pmac_init_early(); - parse_bootinfo(); #endif /* CONFIG_PPC_PMAC */ } @@ -334,6 +327,10 @@ } #endif /* CONFIG_PPC_PSERIES */ +#ifdef CONFIG_PPC_ISERIES + iSeries_parse_cmdline(); +#endif + #ifdef CONFIG_SMP #ifndef CONFIG_PPC_ISERIES /* @@ -393,18 +390,6 @@ /* Select the correct idle loop for the platform. 
*/ idle_setup(); - - switch (systemcfg->platform) { -#ifdef CONFIG_PPC_ISERIES - case PLATFORM_ISERIES_LPAR: - iSeries_init(); - break; -#endif - default: - /* The following relies on the device tree being */ - /* fully configured. */ - parse_cmd_line(r3, r4, r5, r6, r7); - } } void machine_restart(char *cmd) @@ -528,31 +513,6 @@ .show = show_cpuinfo, }; -/* - * Fetch the cmd_line from open firmware. - */ -void parse_cmd_line(unsigned long r3, unsigned long r4, unsigned long r5, - unsigned long r6, unsigned long r7) -{ - cmd_line[0] = 0; - -#ifdef CONFIG_CMDLINE - strlcpy(cmd_line, CONFIG_CMDLINE, sizeof(cmd_line)); -#endif /* CONFIG_CMDLINE */ - -#ifdef CONFIG_PPC_PSERIES - { - struct device_node *chosen; - - chosen = of_find_node_by_name(NULL, "chosen"); - if (chosen != NULL) { - char *p; - p = get_property(chosen, "bootargs", NULL); - if (p != NULL && p[0] != 0) - strlcpy(cmd_line, p, sizeof(cmd_line)); - of_node_put(chosen); - } - } #endif /* Look for mem= option on command line */ @@ -652,26 +612,6 @@ } console_initcall(set_preferred_console); - -int parse_bootinfo(void) -{ - struct bi_record *rec; - - rec = prom.bi_recs; - - if ( rec == NULL || rec->tag != BI_FIRST ) - return -1; - - for ( ; rec->tag != BI_LAST ; rec = bi_rec_next(rec) ) { - switch (rec->tag) { - case BI_CMD_LINE: - strlcpy(cmd_line, (void *)rec->data, sizeof(cmd_line)); - break; - } - } - - return 0; -} #endif int __init ppc_init(void) From anton at samba.org Fri Sep 10 19:04:58 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:04:58 +1000 Subject: [PATCH] remove SPINLINE config option Message-ID: <20040910090458.GB24408@krispykreme> After the spinlock rework, CONFIG_SPINLINE doesnt work and causes a compile error. Remove it for now. 
Anton Signed-off-by: Anton Blanchard diff -puN arch/ppc64/lib/locks.c~remove_spinline arch/ppc64/lib/locks.c --- foobar2/arch/ppc64/lib/locks.c~remove_spinline 2004-09-10 18:10:07.925966698 +1000 +++ foobar2-anton/arch/ppc64/lib/locks.c 2004-09-10 18:10:23.249023185 +1000 @@ -20,8 +20,6 @@ #include #include -#ifndef CONFIG_SPINLINE - /* waiting for a spinlock... */ #if defined(CONFIG_PPC_SPLPAR) || defined(CONFIG_PPC_ISERIES) @@ -95,5 +93,3 @@ void spin_unlock_wait(spinlock_t *lock) } EXPORT_SYMBOL(spin_unlock_wait); - -#endif /* CONFIG_SPINLINE */ diff -puN ./arch/ppc64/Kconfig.debug~remove_spinline ./arch/ppc64/Kconfig.debug --- foobar2/./arch/ppc64/Kconfig.debug~remove_spinline 2004-09-10 18:10:33.861789115 +1000 +++ foobar2-anton/./arch/ppc64/Kconfig.debug 2004-09-10 18:10:42.108471315 +1000 @@ -44,16 +44,6 @@ config IRQSTACKS for handling hard and soft interrupts. This can help avoid overflowing the process kernel stacks. -config SPINLINE - bool "Inline spinlock code at each call site" - depends on SMP && !PPC_SPLPAR && !PPC_ISERIES - help - Say Y if you want to have the code for acquiring spinlocks - and rwlocks inlined at each call site. This makes the kernel - somewhat bigger, but can be useful when profiling the kernel. - - If in doubt, say N. 
- config SCHEDSTATS bool "Collect scheduler statistics" depends on DEBUG_KERNEL && PROC_FS _ From anton at samba.org Fri Sep 10 22:36:49 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:36:49 +1000 Subject: [PATCH] [ppc64] Clean up idle loop code In-Reply-To: <20040910123209.GL24408@krispykreme> References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> <20040910122337.GJ24408@krispykreme> <20040910122904.GK24408@krispykreme> <20040910123209.GL24408@krispykreme> Message-ID: <20040910123649.GM24408@krispykreme> Clean up our idle loop code: - Remove a bunch of useless includes and make most functions static - There were places where we weren't disabling interrupts before checking need_resched and then calling the hypervisor to sleep our thread. We might race with an IPI and end up missing a reschedule. Disable interrupts around these regions to make them safe. - We forgot to turn off the polling flag when exiting the dedicated_idle idle loop. This could have resulted in all manner of problems, as other cpus would avoid sending IPIs to force reschedules. - Add a missing check for cpu_is_offline in the shared cpu idle loop. 
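[Editorial note] The second bullet above is the classic lost-wakeup race. The safe ordering used by the idle.c hunks that follow can be sketched in isolation (kernel-style pseudocode, not a standalone program; per the patch comments, cede_processor() returns with interrupts enabled):

```c
/* Sketch of the race-free cede sequence: the IPI that would set
 * TIF_NEED_RESCHED cannot slip in between the test and the cede,
 * because interrupts stay hard-disabled across both. */
local_irq_disable();
if (!need_resched())
        cede_processor();       /* returns with interrupts enabled */
else
        local_irq_enable();
```

Testing the flag with interrupts enabled and only then disabling them (or ceding directly) reintroduces the window the patch closes.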
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/idle.c~cleanup_idle arch/ppc64/kernel/idle.c --- foobar2/arch/ppc64/kernel/idle.c~cleanup_idle 2004-09-10 21:56:13.748876543 +1000 +++ foobar2-anton/arch/ppc64/kernel/idle.c 2004-09-10 22:02:54.115249090 +1000 @@ -16,28 +16,16 @@ */ #include -#include #include #include -#include #include -#include -#include -#include -#include -#include #include -#include -#include #include -#include #include #include -#include #include #include -#include #include #include @@ -45,11 +33,11 @@ extern long cede_processor(void); extern long poll_pending(void); extern void power4_idle(void); -int (*idle_loop)(void); +static int (*idle_loop)(void); #ifdef CONFIG_PPC_ISERIES -unsigned long maxYieldTime = 0; -unsigned long minYieldTime = 0xffffffffffffffffUL; +static unsigned long maxYieldTime = 0; +static unsigned long minYieldTime = 0xffffffffffffffffUL; static void yield_shared_processor(void) { @@ -80,7 +68,7 @@ static void yield_shared_processor(void) process_iSeries_events(); } -int iSeries_idle(void) +static int iSeries_idle(void) { struct paca_struct *lpaca; long oldval; @@ -91,13 +79,10 @@ int iSeries_idle(void) CTRL = mfspr(CTRLF); CTRL &= ~RUNLATCH; mtspr(CTRLT, CTRL); -#if 0 - init_idle(); -#endif lpaca = get_paca(); - for (;;) { + while (1) { if (lpaca->lppaca.xSharedProc) { if (ItLpQueue_isLpIntPending(lpaca->lpqueue_ptr)) process_iSeries_events(); @@ -125,11 +110,13 @@ int iSeries_idle(void) schedule(); } + return 0; } -#endif -int default_idle(void) +#else + +static int default_idle(void) { long oldval; unsigned int cpu = smp_processor_id(); @@ -164,8 +151,6 @@ int default_idle(void) return 0; } -#ifdef CONFIG_PPC_PSERIES - DECLARE_PER_CPU(unsigned long, smt_snooze_delay); int dedicated_idle(void) @@ -179,8 +164,10 @@ int dedicated_idle(void) ppaca = &paca[cpu ^ 1]; while (1) { - /* Indicate to the HV that we are idle. Now would be - * a good time to find other work to dispatch. 
*/ + /* + * Indicate to the HV that we are idle. Now would be + * a good time to find other work to dispatch. + */ lpaca->lppaca.xIdle = 1; oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED); @@ -203,21 +190,17 @@ int dedicated_idle(void) HMT_medium(); if (!(ppaca->lppaca.xIdle)) { - /* Indicate we are no longer polling for - * work, and then clear need_resched. If - * need_resched was 1, set it back to 1 - * and schedule work + local_irq_disable(); + + /* + * We are about to sleep the thread + * and so wont be polling any + * more. */ clear_thread_flag(TIF_POLLING_NRFLAG); - oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED); - if(oldval == 1) { - set_need_resched(); - break; - } - - local_irq_disable(); - /* SMT dynamic mode. Cede will result + /* + * SMT dynamic mode. Cede will result * in this thread going dormant, if the * partner thread is still doing work. * Thread wakes up if partner goes idle, @@ -225,15 +208,21 @@ int dedicated_idle(void) * occurs. Returning from the cede * enables external interrupts. */ - cede_processor(); + if (!need_resched()) + cede_processor(); + else + local_irq_enable(); } else { - /* Give the HV an opportunity at the + /* + * Give the HV an opportunity at the * processor, since we are not doing * any work. */ poll_pending(); } } + + clear_thread_flag(TIF_POLLING_NRFLAG); } else { set_need_resched(); } @@ -247,48 +236,49 @@ int dedicated_idle(void) return 0; } -int shared_idle(void) +static int shared_idle(void) { struct paca_struct *lpaca = get_paca(); + unsigned int cpu = smp_processor_id(); while (1) { - if (cpu_is_offline(smp_processor_id()) && - system_state == SYSTEM_RUNNING) - cpu_die(); - - /* Indicate to the HV that we are idle. Now would be - * a good time to find other work to dispatch. */ + /* + * Indicate to the HV that we are idle. Now would be + * a good time to find other work to dispatch. 
+ */ lpaca->lppaca.xIdle = 1; - if (!need_resched()) { - local_irq_disable(); - - /* + while (!need_resched() && !cpu_is_offline(cpu)) { + local_irq_disable(); + + /* * Yield the processor to the hypervisor. We return if * an external interrupt occurs (which are driven prior * to returning here) or if a prod occurs from another - * processor. When returning here, external interrupts + * processor. When returning here, external interrupts * are enabled. + * + * Check need_resched() again with interrupts disabled + * to avoid a race. */ - cede_processor(); + if (!need_resched()) + cede_processor(); + else + local_irq_enable(); } HMT_medium(); lpaca->lppaca.xIdle = 0; schedule(); + if (cpu_is_offline(smp_processor_id()) && + system_state == SYSTEM_RUNNING) + cpu_die(); } return 0; } -#endif - -int cpu_idle(void) -{ - idle_loop(); - return 0; -} -int native_idle(void) +static int powermac_idle(void) { while(1) { if (!need_resched()) @@ -298,6 +288,13 @@ int native_idle(void) } return 0; } +#endif + +int cpu_idle(void) +{ + idle_loop(); + return 0; +} int idle_setup(void) { @@ -318,8 +315,8 @@ int idle_setup(void) idle_loop = default_idle; } } else if (systemcfg->platform == PLATFORM_POWERMAC) { - printk("idle = native_idle\n"); - idle_loop = native_idle; + printk("idle = powermac_idle\n"); + idle_loop = powermac_idle; } else { printk("idle_setup: unknown platform, use default_idle\n"); idle_loop = default_idle; @@ -328,4 +325,3 @@ int idle_setup(void) return 1; } - _ From anton at samba.org Fri Sep 10 19:15:58 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 19:15:58 +1000 Subject: [PATCH] [ppc64] Use nm --synthetic where available In-Reply-To: <20040910091342.GE24408@krispykreme> References: <20040910090458.GB24408@krispykreme> <20040910090943.GC24408@krispykreme> <20040910091128.GD24408@krispykreme> <20040910091342.GE24408@krispykreme> Message-ID: <20040910091558.GF24408@krispykreme> On new toolchains we need to use nm --synthetic or we miss 
code symbols. Sam, I'm not thrilled about this patch but I'm not sure of an easier way. Any ideas? Signed-off-by: Anton Blanchard diff -puN arch/ppc64/Makefile~nm_synthetic arch/ppc64/Makefile --- gr_work/arch/ppc64/Makefile~nm_synthetic 2004-09-01 03:45:49.180788436 -0500 +++ gr_work-anton/arch/ppc64/Makefile 2004-09-01 03:46:31.467604301 -0500 @@ -22,6 +22,12 @@ LD := $(LD) -m elf64ppc CC := $(CC) -m64 endif +new_nm := $(shell if $(NM) --help 2>&1 | grep -- '--synthetic' > /dev/null; then echo y; else echo n; fi) + +ifeq ($(new_nm),y) +NM := $(NM) --synthetic +endif + CHECKFLAGS += -m64 -D__powerpc__=1 LDFLAGS := -m elf64ppc _ From anton at samba.org Fri Sep 10 22:29:04 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 10 Sep 2004 22:29:04 +1000 Subject: [PATCH] [ppc64] Restore smt-enabled=off kernel command line option In-Reply-To: <20040910122337.GJ24408@krispykreme> References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> <20040910122337.GJ24408@krispykreme> Message-ID: <20040910122904.GK24408@krispykreme> Restore the smt-enabled=off kernel command line functionality: - Remove the SMT_DYNAMIC state now that smt_snooze_delay allows for the same thing. - Remove the early prom.c parsing for the option and put it into an early_param instead. - In setup_cpu_maps, honour the smt-enabled setting. Note to Nathan: In order to allow cpu hotplug add of secondary threads after booting with smt-enabled=off, I had to initialise cpu_present_map to cpu_possible_map in smp_cpus_done. I'm not sure how you want to handle this, but it seems our present map currently does not allow cpus to be added into the partition that weren't there at boot (but were in the possible map). 
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/idle.c~cmdline-5 arch/ppc64/kernel/idle.c --- foobar2/arch/ppc64/kernel/idle.c~cmdline-5 2004-09-10 20:08:09.790873157 +1000 +++ foobar2-anton/arch/ppc64/kernel/idle.c 2004-09-10 20:08:09.850868545 +1000 @@ -197,12 +197,7 @@ int dedicated_idle(void) HMT_very_low(); /* Low power mode */ - /* If the SMT mode is system controlled & the - * partner thread is doing work, switch into - * ST mode. - */ - if((naca->smt_state == SMT_DYNAMIC) && - (!(ppaca->lppaca.xIdle))) { + if (!(ppaca->lppaca.xIdle)) { /* Indicate we are no longer polling for * work, and then clear need_resched. If * need_resched was 1, set it back to 1 diff -puN arch/ppc64/kernel/prom.c~cmdline-5 arch/ppc64/kernel/prom.c --- foobar2/arch/ppc64/kernel/prom.c~cmdline-5 2004-09-10 20:08:09.798872542 +1000 +++ foobar2-anton/arch/ppc64/kernel/prom.c 2004-09-10 20:08:09.858867930 +1000 @@ -918,11 +918,7 @@ static void __init prom_hold_cpus(unsign = (void *)virt_to_abs(&__secondary_hold_acknowledge); unsigned long secondary_hold = virt_to_abs(*PTRRELOC((unsigned long *)__secondary_hold)); - struct systemcfg *_systemcfg = RELOC(systemcfg); struct prom_t *_prom = PTRRELOC(&prom); -#ifdef CONFIG_SMP - struct naca_struct *_naca = RELOC(naca); -#endif prom_debug("prom_hold_cpus: start...\n"); prom_debug(" 1) spinloop = 0x%x\n", (unsigned long)spinloop); @@ -1003,18 +999,18 @@ static void __init prom_hold_cpus(unsign (*acknowledge == ((unsigned long)-1)); i++ ) ; if (*acknowledge == cpuid) { - prom_printf("... done\n"); + prom_printf(" done\n"); /* We have to get every CPU out of OF, * even if we never start it. */ if (cpuid >= NR_CPUS) goto next; } else { - prom_printf("... 
failed: %x\n", *acknowledge); + prom_printf(" failed: %x\n", *acknowledge); } } #ifdef CONFIG_SMP else - prom_printf("%x : booting cpu %s\n", cpuid, path); + prom_printf("%x : boot cpu %s\n", cpuid, path); #endif next: #ifdef CONFIG_SMP @@ -1023,13 +1019,6 @@ next: cpuid++; if (cpuid >= NR_CPUS) continue; - prom_printf("%x : preparing thread ... ", - interrupt_server[i]); - if (_naca->smt_state) { - prom_printf("available\n"); - } else { - prom_printf("not available\n"); - } } #endif cpuid++; @@ -1068,57 +1057,6 @@ next: prom_debug("prom_hold_cpus: end...\n"); } -static void __init smt_setup(void) -{ - char *p, *q; - char my_smt_enabled = SMT_DYNAMIC; - ihandle prom_options = 0; - char option[9]; - unsigned long offset = reloc_offset(); - struct naca_struct *_naca = RELOC(naca); - char found = 0; - - if (strstr(RELOC(cmd_line), RELOC("smt-enabled="))) { - for (q = RELOC(cmd_line); (p = strstr(q, RELOC("smt-enabled="))) != 0; ) { - q = p + 12; - if (p > RELOC(cmd_line) && p[-1] != ' ') - continue; - found = 1; - if (q[0] == 'o' && q[1] == 'f' && - q[2] == 'f' && (q[3] == ' ' || q[3] == '\0')) { - my_smt_enabled = SMT_OFF; - } else if (q[0]=='o' && q[1] == 'n' && - (q[2] == ' ' || q[2] == '\0')) { - my_smt_enabled = SMT_ON; - } else { - my_smt_enabled = SMT_DYNAMIC; - } - } - } - if (!found) { - prom_options = call_prom("finddevice", 1, 1, ADDR("/options")); - if (prom_options != (ihandle) -1) { - prom_getprop(prom_options, "ibm,smt-enabled", - option, sizeof(option)); - if (option[0] != 0) { - found = 1; - if (!strcmp(option, RELOC("off"))) - my_smt_enabled = SMT_OFF; - else if (!strcmp(option, RELOC("on"))) - my_smt_enabled = SMT_ON; - else - my_smt_enabled = SMT_DYNAMIC; - } - } - } - - if (!found ) - my_smt_enabled = SMT_DYNAMIC; /* default to on */ - - _naca->smt_state = my_smt_enabled; -} - - #ifdef CONFIG_BOOTX_TEXT /* This function will enable the early boot text when doing OF booting. 
This @@ -1730,8 +1668,6 @@ prom_init(unsigned long r3, unsigned lon /* Initialize some system info into the Naca early... */ prom_initialize_naca(); - smt_setup(); - /* If we are on an SMP machine, then we *MUST* do the * following, regardless of whether we have an SMP * kernel or not. diff -puN arch/ppc64/kernel/setup.c~cmdline-5 arch/ppc64/kernel/setup.c --- foobar2/arch/ppc64/kernel/setup.c~cmdline-5 2004-09-10 20:08:09.805872004 +1000 +++ foobar2-anton/arch/ppc64/kernel/setup.c 2004-09-10 20:26:50.139096936 +1000 @@ -152,6 +152,50 @@ void __init disable_early_printk(void) } #if !defined(CONFIG_PPC_ISERIES) && defined(CONFIG_SMP) + +static int smt_enabled_cmdline; + +/* Look for ibm,smt-enabled OF option */ +static void check_smt_enabled(void) +{ + struct device_node *dn; + char *smt_option; + + /* Allow the command line to overrule the OF option */ + if (smt_enabled_cmdline) + return; + + dn = of_find_node_by_path("/options"); + + if (dn) { + smt_option = (char *)get_property(dn, "ibm,smt-enabled", NULL); + + if (smt_option) { + if (!strcmp(smt_option, "on")) + smt_enabled_at_boot = 1; + else if (!strcmp(smt_option, "off")) + smt_enabled_at_boot = 0; + } + } +} + +/* Look for smt-enabled= cmdline option */ +static int __init early_smt_enabled(char *p) +{ + smt_enabled_cmdline = 1; + + if (!p) + return 0; + + if (!strcmp(p, "on") || !strcmp(p, "1")) + smt_enabled_at_boot = 1; + else if (!strcmp(p, "off") || !strcmp(p, "0")) + smt_enabled_at_boot = 0; + + return 0; +} +early_param("smt-enabled", early_smt_enabled); + /** * setup_cpu_maps - initialize the following cpu maps: * cpu_possible_map @@ -174,6 +218,8 @@ static void __init setup_cpu_maps(void) struct device_node *dn = NULL; int cpu = 0; + check_smt_enabled(); + while ((dn = of_find_node_by_type(dn, "cpu")) && cpu < NR_CPUS) { u32 *intserv; int j, len = sizeof(u32), nthreads; @@ -186,9 +232,16 @@ static void __init setup_cpu_maps(void) nthreads = len / sizeof(u32); for (j = 0; j < nthreads && cpu < 
NR_CPUS; j++) { + /* + * Only spin up secondary threads if SMT is enabled. + * We must leave space in the logical map for the + * threads. + */ + if (j == 0 || smt_enabled_at_boot) { + cpu_set(cpu, cpu_present_map); + set_hard_smp_processor_id(cpu, intserv[j]); + } cpu_set(cpu, cpu_possible_map); - cpu_set(cpu, cpu_present_map); - set_hard_smp_processor_id(cpu, intserv[j]); cpu++; } } diff -puN arch/ppc64/kernel/smp.c~cmdline-5 arch/ppc64/kernel/smp.c --- foobar2/arch/ppc64/kernel/smp.c~cmdline-5 2004-09-10 20:08:09.811871543 +1000 +++ foobar2-anton/arch/ppc64/kernel/smp.c 2004-09-10 20:48:26.959223351 +1000 @@ -74,6 +74,8 @@ void smp_call_function_interrupt(void); extern long register_vpa(unsigned long flags, unsigned long proc, unsigned long vpa); +int smt_enabled_at_boot = 1; + /* Low level assembly function used to backup CPU 0 state */ extern void __save_cpu_setup(void); @@ -942,4 +944,12 @@ void __init smp_cpus_done(unsigned int m smp_threads_ready = 1; set_cpus_allowed(current, old_mask); + + /* + * We know at boot the maximum number of cpus we can add to + * a partition and set cpu_possible_map accordingly. cpu_present_map + * needs to match for the hotplug code to allow us to hot add + * any offline cpus. 
+ */ + cpu_present_map = cpu_possible_map; } diff -puN include/asm-ppc64/memory.h~cmdline-5 include/asm-ppc64/memory.h --- foobar2/include/asm-ppc64/memory.h~cmdline-5 2004-09-10 20:08:09.817871081 +1000 +++ foobar2-anton/include/asm-ppc64/memory.h 2004-09-10 20:08:09.865867392 +1000 @@ -56,14 +56,4 @@ static inline void isync(void) #define HMT_MEDIUM_HIGH "\tor 5,5,5 # medium high priority\n" #define HMT_HIGH "\tor 3,3,3 # high priority\n" -/* - * Various operational modes for SMT - * Off : never run threaded - * On : always run threaded - * Dynamic: Allow the system to switch modes as needed - */ -#define SMT_OFF 0 -#define SMT_ON 1 -#define SMT_DYNAMIC 2 - #endif diff -puN include/asm-ppc64/naca.h~cmdline-5 include/asm-ppc64/naca.h --- foobar2/include/asm-ppc64/naca.h~cmdline-5 2004-09-10 20:08:09.823870620 +1000 +++ foobar2-anton/include/asm-ppc64/naca.h 2004-09-10 20:08:09.867867238 +1000 @@ -37,9 +37,6 @@ struct naca_struct { u32 dCacheL1LinesPerPage; /* L1 d-cache lines / page 0x64 */ u32 iCacheL1LogLineSize; /* L1 i-cache line size Log2 0x68 */ u32 iCacheL1LinesPerPage; /* L1 i-cache lines / page 0x6c */ - u8 smt_state; /* 0 = SMT off 0x70 */ - /* 1 = SMT on */ - /* 2 = SMT dynamic */ u8 resv0[15]; /* Reserved 0x71 - 0x7F */ }; diff -puN include/asm-ppc64/smp.h~cmdline-5 include/asm-ppc64/smp.h --- foobar2/include/asm-ppc64/smp.h~cmdline-5 2004-09-10 20:08:09.829870159 +1000 +++ foobar2-anton/include/asm-ppc64/smp.h 2004-09-10 20:08:09.868867161 +1000 @@ -65,6 +65,8 @@ extern int query_cpu_stopped(unsigned in #define set_hard_smp_processor_id(CPU, VAL) \ do { (paca[(CPU)].hw_cpu_id = (VAL)); } while (0) +extern int smt_enabled_at_boot; + #endif /* __ASSEMBLY__ */ #endif /* !(_PPC64_SMP_H) */ _ From david at gibson.dropbear.id.au Mon Sep 13 14:11:19 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 13 Sep 2004 14:11:19 +1000 Subject: [PPC64] Improved VSID allocation algorithm Message-ID: <20040913041119.GA5351@zax> Andrew, please apply. 
This patch has been tested both on SLB and segment table machines. This new approach is far from the final word in VSID/context allocation, but it's a noticeable improvement on the old method. Replace the VSID allocation algorithm. The new algorithm first generates a 36-bit "proto-VSID" (with 0xfffffffff reserved). For kernel addresses this is equal to the ESID (address >> 28), for user addresses it is: (context << 15) | (esid & 0x7fff) These are distinguishable from kernel proto-VSIDs because the top bit is clear. Proto-VSIDs with the top two bits equal to 0b10 are reserved for now. The proto-VSIDs are then scrambled into real VSIDs with the multiplicative hash: VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS where VSID_MULTIPLIER = 268435399 = 0xFFFFFC7 VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF This scramble is 1:1, because VSID_MULTIPLIER and VSID_MODULUS are co-prime since VSID_MULTIPLIER is prime (the largest 28-bit prime, in fact). This scheme has a number of advantages over the old one: - We now have VSIDs for every kernel address (i.e. everything above 0xC000000000000000), except the very top segment. That simplifies a number of things. - We allow for 15 significant bits of ESID for user addresses with 20 bits of context. i.e. 8T (43 bits) of address space for up to 1M contexts, significantly more than the old method (although we will need changes in the hash path and context allocation to take advantage of this). - Because we use a real multiplicative hash function, we have better and more robust hash scattering with this VSID algorithm (at least based on some initial results). Because the MODULUS is 2^n-1 we can use a trick to compute it efficiently without a divide or extra multiply. This makes the new algorithm barely slower than the old one. 
Index: working-2.6/include/asm-ppc64/mmu_context.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu_context.h 2004-08-25 10:37:27.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu_context.h 2004-09-09 16:18:05.847490304 +1000 @@ -34,7 +34,7 @@ } #define NO_CONTEXT 0 -#define FIRST_USER_CONTEXT 0x10 /* First 16 reserved for kernel */ +#define FIRST_USER_CONTEXT 1 #define LAST_USER_CONTEXT 0x8000 /* Same as PID_MAX for now... */ #define NUM_USER_CONTEXT (LAST_USER_CONTEXT-FIRST_USER_CONTEXT) @@ -181,46 +181,87 @@ local_irq_restore(flags); } -/* This is only valid for kernel (including vmalloc, imalloc and bolted) EA's +/* VSID allocation + * =============== + * + * We first generate a 36-bit "proto-VSID". For kernel addresses this + * is equal to the ESID, for user addresses it is: + * (context << 15) | (esid & 0x7fff) + * + * The two forms are distinguishable because the top bit is 0 for user + * addresses, whereas the top two bits are 1 for kernel addresses. + * Proto-VSIDs with the top two bits equal to 0b10 are reserved for + * now. + * + * The proto-VSIDs are then scrambled into real VSIDs with the + * multiplicative hash: + * + * VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS + * where VSID_MULTIPLIER = 268435399 = 0xFFFFFC7 + * VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF + * + * This scramble is only well defined for proto-VSIDs below + * 0xFFFFFFFFF, so both proto-VSID and actual VSID 0xFFFFFFFFF are + * reserved. VSID_MULTIPLIER is prime (the largest 28-bit prime, in + * fact), so in particular it is co-prime to VSID_MODULUS, making this + * a 1:1 scrambling function. Because the modulus is 2^n-1 we can + * compute it efficiently without a divide or extra multiply (see + * below). + * + * This scheme has several advantages over older methods: + * + * - We have VSIDs allocated for every kernel address + * (i.e. 
everything above 0xC000000000000000), except the very top + * segment, which simplifies several things. + * + * - We allow for 15 significant bits of ESID and 20 bits of + * context for user addresses. i.e. 8T (43 bits) of address space for + * up to 1M contexts (although the page table structure and context + * allocation will need changes to take advantage of this). + * + * - The scramble function gives robust scattering in the hash + * table (at least based on some initial results). The previous + * method was more susceptible to pathological cases giving excessive + * hash collisions. */ -static inline unsigned long -get_kernel_vsid( unsigned long ea ) -{ - unsigned long ordinal, vsid; - - ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | (ea >> 60); - vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK; - -#ifdef HTABSTRESS - /* For debug, this path creates a very poor vsid distribuition. - * A user program can access virtual addresses in the form - * 0x0yyyyxxxx000 where yyyy = xxxx to cause multiple mappings - * to hash to the same page table group. - */ - ordinal = ((ea >> 28) & 0x1fff) | (ea >> 44); - vsid = ordinal & VSID_MASK; -#endif /* HTABSTRESS */ - return vsid; -} +/* + * WARNING - If you change these you must make sure the asm + * implementations in slb_allocate(), do_stab_bolted and mmu.h + * (ASM_VSID_SCRAMBLE macro) are changed accordingly. + * + * You'll also need to change the precomputed VSID values in head.S + * which are used by the iSeries firmware. + */ + +static inline unsigned long vsid_scramble(unsigned long protovsid) +{ +#if 0 + /* The code below is equivalent to this function for arguments + * < 2^VSID_BITS, which is all this should ever be called + * with. However gcc is not clever enough to compute the + * modulus (2^n-1) without a second multiply. 
*/ + return ((protovsid * VSID_MULTIPLIER) % VSID_MODULUS); +#else /* 1 */ + unsigned long x; + + x = protovsid * VSID_MULTIPLIER; + x = (x >> VSID_BITS) + (x & VSID_MODULUS); + return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS; +#endif /* 1 */ +} -/* This is only valid for user EA's (user EA's do not exceed 2^41 (EADDR_SIZE)) - */ -static inline unsigned long -get_vsid( unsigned long context, unsigned long ea ) +/* This is only valid for addresses >= KERNELBASE */ +static inline unsigned long get_kernel_vsid(unsigned long ea) { - unsigned long ordinal, vsid; - - ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | context; - vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK; - -#ifdef HTABSTRESS - /* See comment above. */ - ordinal = ((ea >> 28) & 0x1fff) | (context << 16); - vsid = ordinal & VSID_MASK; -#endif /* HTABSTRESS */ + return vsid_scramble(ea >> SID_SHIFT); +} - return vsid; +/* This is only valid for user addresses (which are below 2^41) */ +static inline unsigned long get_vsid(unsigned long context, unsigned long ea) +{ + return vsid_scramble((context << USER_ESID_BITS) + | (ea >> SID_SHIFT)); } #endif /* __PPC64_MMU_CONTEXT_H */ Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2004-09-09 15:04:16.814447984 +1000 @@ -15,6 +15,7 @@ #include #include +#include #ifndef __ASSEMBLY__ @@ -215,12 +216,44 @@ #define SLB_VSID_KERNEL (SLB_VSID_KP|SLB_VSID_C) #define SLB_VSID_USER (SLB_VSID_KP|SLB_VSID_KS) -#define VSID_RANDOMIZER ASM_CONST(42470972311) -#define VSID_MASK 0xfffffffffUL -/* Because we never access addresses below KERNELBASE as kernel - * addresses, this VSID is never used for anything real, and will - * never have pages hashed into it */ -#define BAD_VSID ASM_CONST(0) +#define VSID_MULTIPLIER ASM_CONST(268435399) /* largest 28-bit prime */ +#define VSID_BITS 36 
+#define VSID_MODULUS ((1UL<= \ + * 2^36-1, then r3+1 has the 2^36 bit set. So, if r3+1 has \ + * the bit clear, r3 already has the answer we want, if it \ + * doesn't, the answer is the low 36 bits of r3+1. So in all \ + * cases the answer is the low 36 bits of (r3 + ((r3+1) >> 36))*/\ + addi rx,rt,1; \ + srdi rx,rx,VSID_BITS; /* extract 2^36 bit */ \ + add rt,rt,rx /* Block size masks */ #define BL_128K 0x000 Index: working-2.6/arch/ppc64/mm/slb_low.S =================================================================== --- working-2.6.orig/arch/ppc64/mm/slb_low.S 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/arch/ppc64/mm/slb_low.S 2004-09-09 15:04:16.815447832 +1000 @@ -68,19 +68,19 @@ srdi r3,r3,28 /* get esid */ cmpldi cr7,r9,0xc /* cmp KERNELBASE for later use */ - /* r9 = region, r3 = esid, cr7 = <>KERNELBASE */ - - rldicr. r11,r3,32,16 - bne- 8f /* invalid ea bits set */ - addi r11,r9,-1 - cmpldi r11,0xb - blt- 8f /* invalid region */ + rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ + oris r10,r10,SLB_ESID_V at h /* r10 |= SLB_ESID_V */ - /* r9 = region, r3 = esid, r10 = entry, cr7 = <>KERNELBASE */ + /* r3 = esid, r10 = esid_data, cr7 = <>KERNELBASE */ blt cr7,0f /* user or kernel? */ - /* kernel address */ + /* kernel address: proto-VSID = ESID */ + /* WARNING - MAGIC: we don't use the VSID 0xfffffffff, but + * this code will generate the protoVSID 0xfffffffff for the + * top segment. That's ok, the scramble below will translate + * it to VSID 0, which is reserved as a bad VSID - one which + * will never have any pages in it. */ li r11,SLB_VSID_KERNEL BEGIN_FTR_SECTION bne cr7,9f @@ -88,8 +88,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE) b 9f -0: /* user address */ +0: /* user address: proto-VSID = context<<15 | ESID */ li r11,SLB_VSID_USER + + srdi. 
r9,r3,13 + bne- 8f /* invalid ea bits set */ + #ifdef CONFIG_HUGETLB_PAGE BEGIN_FTR_SECTION /* check against the hugepage ranges */ @@ -111,33 +115,18 @@ #endif /* CONFIG_HUGETLB_PAGE */ 6: ld r9,PACACONTEXTID(r13) + rldimi r3,r9,USER_ESID_BITS,0 -9: /* r9 = "context", r3 = esid, r11 = flags, r10 = entry */ - - rldimi r9,r3,15,0 /* r9= VSID ordinal */ - -7: rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ - oris r10,r10,SLB_ESID_V at h /* r10 |= SLB_ESID_V */ - - /* r9 = ordinal, r3 = esid, r11 = flags, r10 = esid_data */ - - li r3,VSID_RANDOMIZER at higher - sldi r3,r3,32 - oris r3,r3,VSID_RANDOMIZER at h - ori r3,r3,VSID_RANDOMIZER at l - - mulld r9,r3,r9 /* r9 = ordinal * VSID_RANDOMIZER */ - clrldi r9,r9,28 /* r9 &= VSID_MASK */ - sldi r9,r9,SLB_VSID_SHIFT /* r9 <<= SLB_VSID_SHIFT */ - or r9,r9,r11 /* r9 |= flags */ +9: /* r3 = protovsid, r11 = flags, r10 = esid_data, cr7 = <>KERNELBASE */ + ASM_VSID_SCRAMBLE(r3,r9) - /* r9 = vsid_data, r10 = esid_data, cr7 = <>KERNELBASE */ + rldimi r11,r3,SLB_VSID_SHIFT,16 /* combine VSID and flags */ /* * No need for an isync before or after this slbmte. The exception * we enter with and the rfid we exit with are context synchronizing. 
*/ - slbmte r9,r10 + slbmte r11,r10 bgelr cr7 /* we're done for kernel addresses */ @@ -160,6 +149,6 @@ blr 8: /* invalid EA */ - li r9,0 /* 0 VSID ordinal -> BAD_VSID */ + li r3,0 /* BAD_VSID */ li r11,SLB_VSID_USER /* flags don't much matter */ - b 7b + b 9b Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2004-09-09 15:04:16.770454672 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2004-09-09 15:04:16.817447528 +1000 @@ -548,15 +548,15 @@ .llong 0 /* Reserved */ .llong 0 /* Reserved */ .llong 0 /* Reserved */ - .llong 0xc00000000 /* KERNELBASE ESID */ - .llong 0x6a99b4b14 /* KERNELBASE VSID */ + .llong (KERNELBASE>>SID_SHIFT) + .llong 0x40bffffd5 /* KERNELBASE VSID */ /* We have to list the bolted VMALLOC segment here, too, so that it * will be restored on shared processor switch */ - .llong 0xd00000000 /* VMALLOCBASE ESID */ - .llong 0x08d12e6ab /* VMALLOCBASE VSID */ + .llong (VMALLOCBASE>>SID_SHIFT) + .llong 0xb0cffffd1 /* VMALLOCBASE VSID */ .llong 8192 /* # pages to map (32 MB) */ .llong 0 /* Offset from start of loadarea to start of map */ - .llong 0x0006a99b4b140000 /* VPN of first page to map */ + .llong 0x40bffffd50000 /* VPN of first page to map */ . 
= 0x6100 @@ -1064,18 +1064,9 @@ rldimi r10,r11,7,52 /* r10 = first ste of the group */ /* Calculate VSID */ - /* (((ea >> 28) & 0x1fff) << 15) | (ea >> 60) */ - rldic r11,r11,15,36 - ori r11,r11,0xc - - /* VSID_RANDOMIZER */ - li r9,9 - sldi r9,r9,32 - oris r9,r9,58231 - ori r9,r9,39831 - - mulld r9,r11,r9 - rldic r9,r9,12,16 /* r9 = vsid << 12 */ + /* This is a kernel address, so protovsid = ESID */ + ASM_VSID_SCRAMBLE(r11, r9) + rldic r9,r11,12,16 /* r9 = vsid << 12 */ /* Search the primary group for a free entry */ 1: ld r11,0(r10) /* Test valid bit of the current ste */ Index: working-2.6/arch/ppc64/mm/stab.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/stab.c 2004-08-25 10:37:26.000000000 +1000 +++ working-2.6/arch/ppc64/mm/stab.c 2004-09-09 15:04:16.818447376 +1000 @@ -115,15 +115,11 @@ unsigned char stab_entry; unsigned long offset; - /* Check for invalid effective addresses. */ - if (!IS_VALID_EA(ea)) - return 1; - /* Kernel or user address? */ if (ea >= KERNELBASE) { vsid = get_kernel_vsid(ea); } else { - if (! mm) + if ((ea >= TASK_SIZE_USER64) || (! mm)) return 1; vsid = get_vsid(mm->context.id, ea); Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2004-09-09 15:29:13.949495840 +1000 @@ -45,10 +45,16 @@ PGD_INDEX_SIZE + PAGE_SHIFT) /* + * Size of EA range mapped by our pagetables. 
+ */ +#define PGTABLE_EA_BITS 41 +#define PGTABLE_EA_MASK ((1UL< physical */ #define KRANGE_START KERNELBASE -#define KRANGE_END (KRANGE_START + VALID_EA_BITS) +#define KRANGE_END (KRANGE_START + PGTABLE_EA_MASK) /* * Define the user address range */ #define USER_START (0UL) -#define USER_END (USER_START + VALID_EA_BITS) +#define USER_END (USER_START + PGTABLE_EA_MASK) /* Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2004-08-26 10:20:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2004-09-09 15:04:16.820447072 +1000 @@ -253,24 +253,24 @@ int local = 0; cpumask_t tmp; - /* Check for invalid addresses. */ - if (!IS_VALID_EA(ea)) - return 1; - switch (REGION_ID(ea)) { case USER_REGION_ID: user_region = 1; mm = current->mm; - if (mm == NULL) + if ((ea > USER_END) || (! mm)) return 1; vsid = get_vsid(mm->context.id, ea); break; case IO_REGION_ID: + if (ea > IMALLOC_END) + return 1; mm = &ioremap_mm; vsid = get_kernel_vsid(ea); break; case VMALLOC_REGION_ID: + if (ea > VMALLOC_END) + return 1; mm = &init_mm; vsid = get_kernel_vsid(ea); break; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2004-09-09 15:04:16.820447072 +1000 @@ -212,17 +212,6 @@ #define USER_REGION_ID (0UL) #define REGION_ID(X) (((unsigned long)(X))>>REGION_SHIFT) -/* - * Define valid/invalid EA bits (for all ranges) - */ -#define VALID_EA_BITS (0x000001ffffffffffUL) -#define INVALID_EA_BITS (~(REGION_MASK|VALID_EA_BITS)) - -#define IS_VALID_REGION_ID(x) \ - (((x) == USER_REGION_ID) || ((x) >= KERNEL_REGION_ID)) -#define IS_VALID_EA(x) \ - ((!((x) & INVALID_EA_BITS)) && IS_VALID_REGION_ID(REGION_ID(x))) - #define __bpn_to_ba(x) ((((unsigned long)(x))<> PAGE_SHIFT) -- David Gibson | For every 
complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From anton at samba.org Mon Sep 13 20:55:05 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 20:55:05 +1000 Subject: [PATCH] [ppc64] force_sigsegv fixes Message-ID: <20040913105505.GA14553@krispykreme> Replace do_exit() in 64bit signal code with force_sig/force_sigsegv where appropriate. Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/signal.c~signal_fixes arch/ppc64/kernel/signal.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/signal.c~signal_fixes 2004-09-13 19:53:00.173734784 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/signal.c 2004-09-13 19:53:07.350235795 +1000 @@ -371,7 +371,8 @@ badframe: printk("badframe in sys_rt_sigreturn, regs=%p uc=%p &uc->uc_mcontext=%p\n", regs, uc, &uc->uc_mcontext); #endif - do_exit(SIGSEGV); + force_sig(SIGSEGV, current); + return 0; } static void setup_rt_frame(int signr, struct k_sigaction *ka, siginfo_t *info, @@ -446,7 +447,7 @@ badframe: printk("badframe in setup_rt_frame, regs=%p frame=%p newsp=%lx\n", regs, frame, newsp); #endif - do_exit(SIGSEGV); + force_sigsegv(signr, current); } _ From anton at samba.org Mon Sep 13 20:56:56 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 20:56:56 +1000 Subject: [PATCH] [ppc64] powersave_nap sysctl In-Reply-To: <20040913105505.GA14553@krispykreme> References: <20040913105505.GA14553@krispykreme> Message-ID: <20040913105656.GB14553@krispykreme> Implement powersave_nap sysctl, like ppc32. This allows us to disable the nap function which is useful when profiling with oprofile (to get an accurate count of idle time). 
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/idle.c~powersave_nap arch/ppc64/kernel/idle.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/idle.c~powersave_nap 2004-09-13 19:51:24.809722022 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/idle.c 2004-09-13 19:51:24.835720023 +1000 @@ -20,6 +20,8 @@ #include #include #include +#include +#include #include #include @@ -296,6 +298,38 @@ int cpu_idle(void) return 0; } +int powersave_nap; + +#ifdef CONFIG_SYSCTL +/* + * Register the sysctl to set/clear powersave_nap. + */ +static ctl_table powersave_nap_ctl_table[]={ + { + .ctl_name = KERN_PPC_POWERSAVE_NAP, + .procname = "powersave-nap", + .data = &powersave_nap, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { 0, }, +}; +static ctl_table powersave_nap_sysctl_root[] = { + { 1, "kernel", NULL, 0, 0755, powersave_nap_ctl_table, }, + { 0,}, +}; + +static int __init +register_powersave_nap_sysctl(void) +{ + register_sysctl_table(powersave_nap_sysctl_root, 0); + + return 0; +} +__initcall(register_powersave_nap_sysctl); +#endif + int idle_setup(void) { #ifdef CONFIG_PPC_ISERIES diff -puN arch/ppc64/kernel/setup.c~powersave_nap arch/ppc64/kernel/setup.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/setup.c~powersave_nap 2004-09-13 19:51:24.815721561 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/setup.c 2004-09-13 19:51:24.837719870 +1000 @@ -82,8 +82,6 @@ unsigned long decr_overclock_proc0 = 1; unsigned long decr_overclock_set = 0; unsigned long decr_overclock_proc0_set = 0; -int powersave_nap; - unsigned char aux_device_present; #ifdef CONFIG_MAGIC_SYSRQ _ From anton at samba.org Mon Sep 13 21:10:24 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 21:10:24 +1000 Subject: [PATCH] [ppc64] Clean up asm/mmu.h In-Reply-To: <20040913110837.GD14553@krispykreme> References: <20040913105505.GA14553@krispykreme> <20040913105656.GB14553@krispykreme> <20040913110742.GC14553@krispykreme> <20040913110837.GD14553@krispykreme> Message-ID: 
<20040913111024.GE14553@krispykreme> Remove some old definitions that arent relevant to us. Signed-off-by: Anton Blanchard diff -puN include/asm-ppc64/mmu.h~rip_up_mmu_h include/asm-ppc64/mmu.h --- 2.6.9-rc1-mm5/include/asm-ppc64/mmu.h~rip_up_mmu_h 2004-09-13 19:51:29.477016885 +1000 +++ 2.6.9-rc1-mm5-anton/include/asm-ppc64/mmu.h 2004-09-13 19:51:29.499015194 +1000 @@ -107,18 +107,6 @@ typedef struct { extern HTAB htab_data; -void invalidate_hpte( unsigned long slot ); -long select_hpte_slot( unsigned long vpn ); -void create_valid_hpte( unsigned long slot, unsigned long vpn, - unsigned long prpn, unsigned hash, - void * ptep, unsigned hpteflags, - unsigned bolted ); - -#define PD_SHIFT (10+12) /* Page directory */ -#define PD_MASK 0x02FF -#define PT_SHIFT (12) /* Page Table */ -#define PT_MASK 0x02FF - #define LARGE_PAGE_SHIFT 24 static inline unsigned long hpt_hash(unsigned long vpn, int large) @@ -255,149 +243,4 @@ extern void htab_finish_init(void); srdi rx,rx,VSID_BITS; /* extract 2^36 bit */ \ add rt,rt,rx -/* Block size masks */ -#define BL_128K 0x000 -#define BL_256K 0x001 -#define BL_512K 0x003 -#define BL_1M 0x007 -#define BL_2M 0x00F -#define BL_4M 0x01F -#define BL_8M 0x03F -#define BL_16M 0x07F -#define BL_32M 0x0FF -#define BL_64M 0x1FF -#define BL_128M 0x3FF -#define BL_256M 0x7FF - -/* Used to set up SDR1 register */ -#define HASH_TABLE_SIZE_64K 0x00010000 -#define HASH_TABLE_SIZE_128K 0x00020000 -#define HASH_TABLE_SIZE_256K 0x00040000 -#define HASH_TABLE_SIZE_512K 0x00080000 -#define HASH_TABLE_SIZE_1M 0x00100000 -#define HASH_TABLE_SIZE_2M 0x00200000 -#define HASH_TABLE_SIZE_4M 0x00400000 -#define HASH_TABLE_MASK_64K 0x000 -#define HASH_TABLE_MASK_128K 0x001 -#define HASH_TABLE_MASK_256K 0x003 -#define HASH_TABLE_MASK_512K 0x007 -#define HASH_TABLE_MASK_1M 0x00F -#define HASH_TABLE_MASK_2M 0x01F -#define HASH_TABLE_MASK_4M 0x03F - -/* These are the Ks and Kp from the PowerPC books. For proper operation, - * Ks = 0, Kp = 1. 
- */ -#define MI_AP 786 -#define MI_Ks 0x80000000 /* Should not be set */ -#define MI_Kp 0x40000000 /* Should always be set */ - -/* The effective page number register. When read, contains the information - * about the last instruction TLB miss. When MI_RPN is written, bits in - * this register are used to create the TLB entry. - */ -#define MI_EPN 787 -#define MI_EPNMASK 0xfffff000 /* Effective page number for entry */ -#define MI_EVALID 0x00000200 /* Entry is valid */ -#define MI_ASIDMASK 0x0000000f /* ASID match value */ - /* Reset value is undefined */ - -/* A "level 1" or "segment" or whatever you want to call it register. - * For the instruction TLB, it contains bits that get loaded into the - * TLB entry when the MI_RPN is written. - */ -#define MI_TWC 789 -#define MI_APG 0x000001e0 /* Access protection group (0) */ -#define MI_GUARDED 0x00000010 /* Guarded storage */ -#define MI_PSMASK 0x0000000c /* Mask of page size bits */ -#define MI_PS8MEG 0x0000000c /* 8M page size */ -#define MI_PS512K 0x00000004 /* 512K page size */ -#define MI_PS4K_16K 0x00000000 /* 4K or 16K page size */ -#define MI_SVALID 0x00000001 /* Segment entry is valid */ - /* Reset value is undefined */ - -/* Real page number. Defined by the pte. Writing this register - * causes a TLB entry to be created for the instruction TLB, using - * additional information from the MI_EPN, and MI_TWC registers. - */ -#define MI_RPN 790 - -/* Define an RPN value for mapping kernel memory to large virtual - * pages for boot initialization. This has real page number of 0, - * large page size, shared page, cache enabled, and valid. - * Also mark all subpages valid and write access. 
- */ -#define MI_BOOTINIT 0x000001fd - -#define MD_CTR 792 /* Data TLB control register */ -#define MD_GPM 0x80000000 /* Set domain manager mode */ -#define MD_PPM 0x40000000 /* Set subpage protection */ -#define MD_CIDEF 0x20000000 /* Set cache inhibit when MMU dis */ -#define MD_WTDEF 0x10000000 /* Set writethrough when MMU dis */ -#define MD_RSV4I 0x08000000 /* Reserve 4 TLB entries */ -#define MD_TWAM 0x04000000 /* Use 4K page hardware assist */ -#define MD_PPCS 0x02000000 /* Use MI_RPN prob/priv state */ -#define MD_IDXMASK 0x00001f00 /* TLB index to be loaded */ -#define MD_RESETVAL 0x04000000 /* Value of register at reset */ - -#define M_CASID 793 /* Address space ID (context) to match */ -#define MC_ASIDMASK 0x0000000f /* Bits used for ASID value */ - - -/* These are the Ks and Kp from the PowerPC books. For proper operation, - * Ks = 0, Kp = 1. - */ -#define MD_AP 794 -#define MD_Ks 0x80000000 /* Should not be set */ -#define MD_Kp 0x40000000 /* Should always be set */ - -/* The effective page number register. When read, contains the information - * about the last instruction TLB miss. When MD_RPN is written, bits in - * this register are used to create the TLB entry. - */ -#define MD_EPN 795 -#define MD_EPNMASK 0xfffff000 /* Effective page number for entry */ -#define MD_EVALID 0x00000200 /* Entry is valid */ -#define MD_ASIDMASK 0x0000000f /* ASID match value */ - /* Reset value is undefined */ - -/* The pointer to the base address of the first level page table. - * During a software tablewalk, reading this register provides the address - * of the entry associated with MD_EPN. - */ -#define M_TWB 796 -#define M_L1TB 0xfffff000 /* Level 1 table base address */ -#define M_L1INDX 0x00000ffc /* Level 1 index, when read */ - /* Reset value is undefined */ - -/* A "level 1" or "segment" or whatever you want to call it register. - * For the data TLB, it contains bits that get loaded into the TLB entry - * when the MD_RPN is written. 
It is also provides the hardware assist - * for finding the PTE address during software tablewalk. - */ -#define MD_TWC 797 -#define MD_L2TB 0xfffff000 /* Level 2 table base address */ -#define MD_L2INDX 0xfffffe00 /* Level 2 index (*pte), when read */ -#define MD_APG 0x000001e0 /* Access protection group (0) */ -#define MD_GUARDED 0x00000010 /* Guarded storage */ -#define MD_PSMASK 0x0000000c /* Mask of page size bits */ -#define MD_PS8MEG 0x0000000c /* 8M page size */ -#define MD_PS512K 0x00000004 /* 512K page size */ -#define MD_PS4K_16K 0x00000000 /* 4K or 16K page size */ -#define MD_WT 0x00000002 /* Use writethrough page attribute */ -#define MD_SVALID 0x00000001 /* Segment entry is valid */ - /* Reset value is undefined */ - - -/* Real page number. Defined by the pte. Writing this register - * causes a TLB entry to be created for the data TLB, using - * additional information from the MD_EPN, and MD_TWC registers. - */ -#define MD_RPN 798 - -/* This is a temporary storage register that could be used to save - * a processor working register during a tablewalk. - */ -#define M_TW 799 - #endif /* _PPC64_MMU_H_ */ _ From anton at samba.org Mon Sep 13 21:08:37 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 21:08:37 +1000 Subject: [PATCH] [ppc64] iseries build fixes In-Reply-To: <20040913110742.GC14553@krispykreme> References: <20040913105505.GA14553@krispykreme> <20040913105656.GB14553@krispykreme> <20040913110742.GC14553@krispykreme> Message-ID: <20040913110837.GD14553@krispykreme> Fix one compile warning and one build warning on iseries. 
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/mm/init.c~fix_iseries arch/ppc64/mm/init.c --- 2.6.9-rc1-mm5/arch/ppc64/mm/init.c~fix_iseries 2004-09-13 19:51:27.220930624 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/mm/init.c 2004-09-13 19:51:27.246928626 +1000 @@ -534,7 +534,9 @@ arch_initcall(mmu_context_init); */ void __init mm_init_ppc64(void) { +#ifndef CONFIG_PPC_ISERIES unsigned long i; +#endif ppc64_boot_msg(0x100, "MM Init"); diff -puN include/asm-ppc64/page.h~fix_iseries include/asm-ppc64/page.h --- 2.6.9-rc1-mm5/include/asm-ppc64/page.h~fix_iseries 2004-09-13 19:51:27.225930240 +1000 +++ 2.6.9-rc1-mm5-anton/include/asm-ppc64/page.h 2004-09-13 19:51:27.248928472 +1000 @@ -201,9 +201,9 @@ extern int page_is_ram(unsigned long pfn /* to change! */ #define PAGE_OFFSET ASM_CONST(0xC000000000000000) #define KERNELBASE PAGE_OFFSET -#define VMALLOCBASE 0xD000000000000000UL -#define IOREGIONBASE 0xE000000000000000UL -#define EEHREGIONBASE 0xA000000000000000UL +#define VMALLOCBASE ASM_CONST(0xD000000000000000) +#define IOREGIONBASE ASM_CONST(0xE000000000000000) +#define EEHREGIONBASE ASM_CONST(0xA000000000000000) #define IO_REGION_ID (IOREGIONBASE>>REGION_SHIFT) #define EEH_REGION_ID (EEHREGIONBASE>>REGION_SHIFT) _ From anton at samba.org Mon Sep 13 21:11:27 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 21:11:27 +1000 Subject: [PATCH] [ppc64] Fix pseries build in -mm In-Reply-To: <20040913111024.GE14553@krispykreme> References: <20040913105505.GA14553@krispykreme> <20040913105656.GB14553@krispykreme> <20040913110742.GC14553@krispykreme> <20040913110837.GD14553@krispykreme> <20040913111024.GE14553@krispykreme> Message-ID: <20040913111127.GF14553@krispykreme> Looks like a list macro cleanup patch went in, resulting in two definitions of *dev. Remove one. 
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/pSeries_pci.c~fix_pseries arch/ppc64/kernel/pSeries_pci.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/pSeries_pci.c~fix_pseries 2004-09-13 19:58:29.941874428 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/pSeries_pci.c 2004-09-13 19:59:21.967773089 +1000 @@ -601,7 +601,6 @@ EXPORT_SYMBOL(pcibios_fixup_device_resou void __devinit pcibios_fixup_bus(struct pci_bus *bus) { struct pci_controller *hose = PCI_GET_PHB_PTR(bus); - struct pci_dev *dev; /* XXX or bus->parent? */ struct pci_dev *dev = bus->self; _ From anton at samba.org Mon Sep 13 21:07:42 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 13 Sep 2004 21:07:42 +1000 Subject: [PATCH] [ppc64] replace mmu_context_queue with idr allocator In-Reply-To: <20040913105656.GB14553@krispykreme> References: <20040913105505.GA14553@krispykreme> <20040913105656.GB14553@krispykreme> Message-ID: <20040913110742.GC14553@krispykreme> Replace the mmu_context_queue structure with the idr allocator. The mmu_context_queue allocation was quite large (~200kB) so on most machines we will have a reduction in usage. We might put a single entry cache on the front of this so we are more likely to reuse ppc64 MMU hashtable entries that are in the caches. 
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/mm/init.c~context_queue arch/ppc64/mm/init.c --- 2.6.9-rc1-mm5/arch/ppc64/mm/init.c~context_queue 2004-09-13 19:51:26.130749817 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/mm/init.c 2004-09-13 19:51:26.164747203 +1000 @@ -36,6 +36,7 @@ #include #include #include +#include #include #include @@ -62,8 +63,6 @@ #include #include - -struct mmu_context_queue_t mmu_context_queue; int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; static unsigned long phbs_io_bot = PHBS_IO_BASE; @@ -477,6 +476,59 @@ void free_initrd_mem(unsigned long start } #endif +static spinlock_t mmu_context_lock = SPIN_LOCK_UNLOCKED; +static DEFINE_IDR(mmu_context_idr); + +int init_new_context(struct task_struct *tsk, struct mm_struct *mm) +{ + int index; + int err; + +again: + if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL)) + return -ENOMEM; + + spin_lock(&mmu_context_lock); + err = idr_get_new(&mmu_context_idr, NULL, &index); + spin_unlock(&mmu_context_lock); + + if (err == -EAGAIN) + goto again; + else if (err) + return err; + + if (index > MAX_CONTEXT) { + idr_remove(&mmu_context_idr, index); + return -ENOMEM; + } + + mm->context.id = index; + + return 0; +} + +void destroy_context(struct mm_struct *mm) +{ + spin_lock(&mmu_context_lock); + idr_remove(&mmu_context_idr, mm->context.id); + spin_unlock(&mmu_context_lock); + + mm->context.id = NO_CONTEXT; +} + +static int __init mmu_context_init(void) +{ + int index; + + /* Reserve the first (invalid) context*/ + idr_pre_get(&mmu_context_idr, GFP_KERNEL); + idr_get_new(&mmu_context_idr, NULL, &index); + BUG_ON(0 != index); + + return 0; +} +arch_initcall(mmu_context_init); + /* * Do very early mm setup. */ @@ -486,17 +538,6 @@ void __init mm_init_ppc64(void) ppc64_boot_msg(0x100, "MM Init"); - /* Reserve all contexts < FIRST_USER_CONTEXT for kernel use. - * The range of contexts [FIRST_USER_CONTEXT, NUM_USER_CONTEXT) - * are stored on a stack/queue for easy allocation and deallocation. 
- */ - mmu_context_queue.lock = SPIN_LOCK_UNLOCKED; - mmu_context_queue.head = 0; - mmu_context_queue.tail = NUM_USER_CONTEXT-1; - mmu_context_queue.size = NUM_USER_CONTEXT; - for (i = 0; i < NUM_USER_CONTEXT; i++) - mmu_context_queue.elements[i] = i + FIRST_USER_CONTEXT; - /* This is the story of the IO hole... please, keep seated, * unfortunately, we are out of oxygen masks at the moment. * So we need some rough way to tell where your big IO hole diff -puN include/asm-ppc64/mmu_context.h~context_queue include/asm-ppc64/mmu_context.h --- 2.6.9-rc1-mm5/include/asm-ppc64/mmu_context.h~context_queue 2004-09-13 19:51:26.142748894 +1000 +++ 2.6.9-rc1-mm5-anton/include/asm-ppc64/mmu_context.h 2004-09-13 19:51:26.168746896 +1000 @@ -2,11 +2,9 @@ #define __PPC64_MMU_CONTEXT_H #include -#include #include #include #include -#include #include /* @@ -33,107 +31,15 @@ static inline int sched_find_first_bit(u return __ffs(b[2]) + 128; } -#define NO_CONTEXT 0 -#define FIRST_USER_CONTEXT 1 -#define LAST_USER_CONTEXT 0x8000 /* Same as PID_MAX for now... */ -#define NUM_USER_CONTEXT (LAST_USER_CONTEXT-FIRST_USER_CONTEXT) - -/* Choose whether we want to implement our context - * number allocator as a LIFO or FIFO queue. - */ -#if 1 -#define MMU_CONTEXT_LIFO -#else -#define MMU_CONTEXT_FIFO -#endif - -struct mmu_context_queue_t { - spinlock_t lock; - long head; - long tail; - long size; - mm_context_id_t elements[LAST_USER_CONTEXT]; -}; - -extern struct mmu_context_queue_t mmu_context_queue; - -static inline void -enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) +static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { } -/* - * The context number queue has underflowed. - * Meaning: we tried to push a context number that was freed - * back onto the context queue and the queue was already full. 
- */ -static inline void -mmu_context_underflow(void) -{ - printk(KERN_DEBUG "mmu_context_underflow\n"); - panic("mmu_context_underflow"); -} - -/* - * Set up the context for a new address space. - */ -static inline int -init_new_context(struct task_struct *tsk, struct mm_struct *mm) -{ - long head; - unsigned long flags; - /* This does the right thing across a fork (I hope) */ - - spin_lock_irqsave(&mmu_context_queue.lock, flags); - - if (mmu_context_queue.size <= 0) { - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); - return -ENOMEM; - } +#define NO_CONTEXT 0 +#define MAX_CONTEXT (0x100000-1) - head = mmu_context_queue.head; - mm->context.id = mmu_context_queue.elements[head]; - - head = (head < LAST_USER_CONTEXT-1) ? head+1 : 0; - mmu_context_queue.head = head; - mmu_context_queue.size--; - - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); - - return 0; -} - -/* - * We're finished using the context for an address space. - */ -static inline void -destroy_context(struct mm_struct *mm) -{ - long index; - unsigned long flags; - - spin_lock_irqsave(&mmu_context_queue.lock, flags); - - if (mmu_context_queue.size >= NUM_USER_CONTEXT) { - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); - mmu_context_underflow(); - } - -#ifdef MMU_CONTEXT_LIFO - index = mmu_context_queue.head; - index = (index > 0) ? index-1 : LAST_USER_CONTEXT-1; - mmu_context_queue.head = index; -#else - index = mmu_context_queue.tail; - index = (index < LAST_USER_CONTEXT-1) ? 
index+1 : 0; - mmu_context_queue.tail = index; -#endif - - mmu_context_queue.size++; - mmu_context_queue.elements[index] = mm->context.id; - - spin_unlock_irqrestore(&mmu_context_queue.lock, flags); -} +extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm); +extern void destroy_context(struct mm_struct *mm); extern void switch_stab(struct task_struct *tsk, struct mm_struct *mm); extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm); _ From jschopp at austin.ibm.com Tue Sep 14 01:56:54 2004 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 13 Sep 2004 10:56:54 -0500 Subject: [Fwd: [RFC][PATCH] flush_hash_page] Message-ID: <4145C346.6090102@austin.ibm.com> Resending since list was down last time I sent it. --------------------------------------------------- I'm very new to the ppc64 memory management code, please forgive my ignorance. I need to be able to use flush_hash_page on arbitrary ptes for memory remove. There is a comment in flush_hash_page about not supporting large ptes. It looks like most of that work has already been done, and all that is needed is the following patch. Am I missing something? -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: largepte.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040913/fc2227de/attachment.txt From anton at samba.org Tue Sep 14 02:27:46 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 14 Sep 2004 02:27:46 +1000 Subject: [PATCH] [ppc64] Restore smt-enabled=off kernel command line option In-Reply-To: <1095091735.12145.72.camel@biclops.private.network> References: <20040910121238.GG24408@krispykreme> <20040910121456.GH24408@krispykreme> <20040910121941.GI24408@krispykreme> <20040910122337.GJ24408@krispykreme> <20040910122904.GK24408@krispykreme> <1095091735.12145.72.camel@biclops.private.network> Message-ID: <20040913162746.GF12514@krispykreme> > Whoops, sorry, didn't mean to break smt-enabled=off. 
No problem, it gave me a chance to clean up some more code :) > As you mentioned, we do not have any code which updates the present map > when a cpu is added or removed. What we really need is cpu-specific > hooks in of_add_node/of_remove_node which will take care of that. I'll > send along a patch for that shortly. Ahh yes that makes sense. Can we initially set it up so if we boot ST or with maxcpus= that it's still possible to hotplug add the threads or cpus? Anton From linas at austin.ibm.com Tue Sep 14 03:09:46 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 13 Sep 2004 12:09:46 -0500 Subject: vDSO preliminary implementation In-Reply-To: <1094863417.2667.152.camel@gaston> References: <1093496594.2172.80.camel@gaston> <20040910211158.GX9645@austin.ibm.com> <1094863417.2667.152.camel@gaston> Message-ID: <20040913170946.GB9645@austin.ibm.com> On Sat, Sep 11, 2004 at 10:43:38AM +1000, Benjamin Herrenschmidt was heard to remark: > > > > > > Here's a first shot at implementing a vDSO for ppc32/ppc64. This is definitely > > > > What's vDSO ? Google was amazingly unhelpful in figuring this out. > > virtual .so, that is a library mapped by the kernel in userspace Let me re-phrase that: what's it good for? Is this a mechanism for sharing the text segment of a library between all users? Ye olde AIX had this feature; I've never thought about whether Linux does this or not; shared libs were loaded so that the text segment of a library appeared only once in 'real' memory, and was thus shared by the various apps. I'm not sure, I think in AIX even the "ptes" were shared too: the text was always loaded into the same segment (segment 0 iirc), so you wouldn't have tlb misses on things like libc. I've never thought about how Linux loads libraries, so excuse me on this newbie-sounding question. How does Linux load .so's today? Is there one copy per process, or are they shared?
--linas From olof at austin.ibm.com Tue Sep 14 03:20:08 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Mon, 13 Sep 2004 12:20:08 -0500 Subject: [Fwd: [RFC][PATCH] flush_hash_page] In-Reply-To: <4145C346.6090102@austin.ibm.com> References: <4145C346.6090102@austin.ibm.com> Message-ID: <4145D6C8.8000702@austin.ibm.com> Joel Schopp wrote: > Resending since list was down last time I sent it. > --------------------------------------------------- > > I'm very new to the ppc64 memory management code, please forgive my > ignorance. I need to be able to use flush_hash_page on arbitrary ptes > for memory remove. There is a comment in flush_hash_page about not > supporting large ptes. It looks like most of that work has already been > done, and all that is needed is the following patch. Am I missing > something? I think you might be missing something. You just changed the function to assume that all pages are large pages if the CPU supports them. This is true for kernel pages, but not for user ones. As a result, the wrong hash function will/might be used. You need to know if the page you're looking to flush is large or not. Right now there's no way to pass that down, thus the comment in the function. -Olof From haveblue at us.ibm.com Tue Sep 14 03:08:29 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Mon, 13 Sep 2004 10:08:29 -0700 Subject: [Fwd: [RFC][PATCH] flush_hash_page] In-Reply-To: <4145C346.6090102@austin.ibm.com> References: <4145C346.6090102@austin.ibm.com> Message-ID: <1095095309.3422.7.camel@localhost> On Mon, 2004-09-13 at 08:56, Joel Schopp wrote: > Resending since list was down last time I sent it. > --------------------------------------------------- > > I'm very new to the ppc64 memory management code, please forgive my > ignorance. I need to be able to use flush_hash_page on arbitrary ptes > for memory remove. There is a comment in flush_hash_page about not > supporting large ptes. 
It looks like most of that work has already been > done, and all that is needed is the following patch. Am I missing > something? > ________________________________________________________________________ > + if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) > + large = 1; > + First of all, if you do that, doesn't it assume that all ptes that are passed in are large pages? I don't think that's correct. Also, think about how huge pages are implemented in Linux. Do huge pages really even get Linux ptes, or just pmds that act like ptes? That reminds me. Anton, I don't see ppc64 setting up the Linux pagetable for the kernel mappings anywhere. Did I just miss them? -- Dave From linas at austin.ibm.com Tue Sep 14 06:05:39 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 13 Sep 2004 15:05:39 -0500 Subject: Resending: [PATCH] PPC64: New version of EEH notifier code Message-ID: <20040913200539.GD9645@austin.ibm.com> Resending: ----- The following addresses had permanent fatal errors ----- Paul, I picked up the eeh notifier call-chain patch from http://ozlabs.org/ppc64-patches/ patch 239, I believe. Because it doesn't apply cleanly any more, I whacked on it a bit to get it to apply; the result is below. I'd suggest sending this upstream, as soon as reasonable. It's not 'perfect', but it does provide a convenient base to do further work from.
--linas Signed-off-by: Linas Vepstas ===== arch/ppc64/kernel/eeh.c 1.30 vs edited ===== --- 1.30/arch/ppc64/kernel/eeh.c Thu Sep 2 15:22:27 2004 +++ edited/arch/ppc64/kernel/eeh.c Thu Sep 2 16:06:58 2004 @@ -17,29 +17,79 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include #include +#include +#include +#include #include #include -#include -#include #include -#include #include -#include -#include -#include +#include +#include #include #include -#include #include +#include #include "pci.h" #undef DEBUG +/** Overview: + * EEH, or "Extended Error Handling" is a PCI bridge technology for + * dealing with PCI bus errors that can't be dealt with within the + * usual PCI framework, except by check-stopping the CPU. Systems + * that are designed for high-availability/reliability cannot afford + * to crash due to a "mere" PCI error, thus the need for EEH. + * An EEH-capable bridge operates by converting a detected error + * into a "slot freeze", taking the PCI adapter off-line, making + * the slot behave, from the OS'es point of view, as if the slot + * were "empty": all reads return 0xff's and all writes are silently + * ignored. EEH slot isolation events can be triggered by parity + * errors on the address or data busses (e.g. during posted writes), + * which in turn might be caused by dust, vibration, humidity, + * radioactivity or plain-old failed hardware. + * + * Note, however, that one of the leading causes of EEH slot + * freeze events are buggy device drivers, buggy device microcode, + * or buggy device hardware. This is because any attempt by the + * device to bus-master data to a memory address that is not + * assigned to the device will trigger a slot freeze. (The idea + * is to prevent devices-gone-wild from corrupting system memory). + * Buggy hardware/drivers will have a miserable time co-existing + * with EEH. + * + * Ideally, a PCI device driver, when suspecting that an isolation + * event has occured (e.g. 
by reading 0xff's), will then ask EEH + * whether this is the case, and then take appropriate steps to + * reset the PCI slot, the PCI device, and then resume operations. + * However, until that day, the checking is done here, with the + * eeh_check_failure() routine embedded in the MMIO macros. If + * the slot is found to be isolated, an "EEH Event" is synthesized + * and sent out for processing. + */ + +/** Bus Unit ID macros; get low and hi 32-bits of the 64-bit BUID */ #define BUID_HI(buid) ((buid) >> 32) #define BUID_LO(buid) ((buid) & 0xffffffff) -#define CONFIG_ADDR(busno, devfn) \ - (((((busno) & 0xff) << 8) | ((devfn) & 0xf8)) << 8) + +/* EEH event workqueue setup. */ +static spinlock_t eeh_eventlist_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(eeh_eventlist); +static void eeh_event_handler(void *); +DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); + +static struct notifier_block *eeh_notifier_chain; + +/* + * If a device driver keeps reading an MMIO register in an interrupt + * handler after a slot isolation event has occurred, we assume it + * is broken and panic. This sets the threshold for how many read + * attempts we allow before panicking. + */ +#define EEH_MAX_FAILS 1000 +static atomic_t eeh_fail_count; /* RTAS tokens */ static int ibm_set_eeh_option; @@ -61,6 +111,7 @@ static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); +static DEFINE_PER_CPU(unsigned long, slot_resets); static int eeh_check_opts_config(struct device_node *dn, int class_code, int vendor_id, int device_id, @@ -71,7 +122,8 @@ * PCI device address resources into a red-black tree, sorted * according to the address range, so that given only an i/o * address, the corresponding PCI device can be **quickly** - * found. + * found. It is safe to perform an address lookup in an interrupt + * context; this ability is an important feature. 
* * Currently, the only customer of this code is the EEH subsystem; * thus, this code has been somewhat tailored to suit EEH better. @@ -340,6 +392,94 @@ #endif } +/* --------------------------------------------------------------- */ +/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ + +/** + * eeh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int eeh_register_notifier(struct notifier_block *nb) +{ + return notifier_chain_register(&eeh_notifier_chain, nb); +} + +/** + * eeh_unregister_notifier - Unregister to an EEH event notifier. + * @nb: notifier block to callback on events + */ +int eeh_unregister_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&eeh_notifier_chain, nb); +} + +/** + * eeh_panic - call panic() for an eeh event that cannot be handled. + * The philosophy of this routine is that it is better to panic and + * halt the OS than it is to risk possible data corruption by + * oblivious device drivers that don't know better. + * + * @dev pci device that had an eeh event + * @reset_state current reset state of the device slot + */ +static void eeh_panic(struct pci_dev *dev, int reset_state) +{ + /* + * XXX We should create a seperate sysctl for this. + * + * Since the panic_on_oops sysctl is used to halt the system + * in light of potential corruption, we can use it here. + */ + if (panic_on_oops) + panic("EEH: MMIO failure (%d) on device:%s %s\n", reset_state, + pci_name(dev), pci_pretty_name(dev)); + else { + __get_cpu_var(ignored_failures)++; + printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s %s\n", + reset_state, pci_name(dev), pci_pretty_name(dev)); + } +} + +/** + * eeh_event_handler - dispatch EEH events. The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. 
The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context. + * + * @dummy - unused + */ +static void eeh_event_handler(void *dummy) +{ + unsigned long flags; + struct eeh_event *event; + + while (1) { + spin_lock_irqsave(&eeh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&eeh_eventlist)) { + event = list_entry(eeh_eventlist.next, struct eeh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "EEH: MMIO failure (%d), notifying device " + "%s %s\n", event->reset_state, + pci_name(event->dev), pci_pretty_name(event->dev)); + + atomic_set(&eeh_fail_count, 0); + notifier_call_chain (&eeh_notifier_chain, + EEH_NOTIFY_FREEZE, event); + + __get_cpu_var(slot_resets)++; + + pci_dev_put(event->dev); + kfree(event); + } +} + /** * eeh_token_to_phys - convert EEH address token to phys address * @token i/o token, should be address in the form 0xA.... @@ -371,11 +511,11 @@ * * Check for an EEH failure for the given device node. Call this * routine if the result of a read was all 0xff's and you want to - * find out if this is due to an EEH slot freeze event. This routine + * find out if this is due to an EEH slot freeze. This routine * will query firmware for the EEH status. * * Returns 0 if there has not been an EEH error; otherwise returns - * an error code. + * a non-zero value and queues up a slot isolation event notification. * * It is safe to call this routine in an interrupt context. */ @@ -384,6 +524,8 @@ int ret; int rets[2]; unsigned long flags; + int rc, reset_state; + struct eeh_event *event; __get_cpu_var(total_mmio_ffs)++; @@ -402,6 +544,24 @@ if (!dn->eeh_config_addr) { return 0; } + + /* + * If we already have a pending isolation event for this + * slot, we know it's bad already, we don't need to check... 
+ */ + if (dn->eeh_mode & EEH_MODE_ISOLATED) { + atomic_inc(&eeh_fail_count); + if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { + /* re-read the slot reset state */ + rets[0] = -1; + rtas_call(ibm_read_slot_reset_state, 3, 3, rets, + dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid)); + eeh_panic(dev, rets[0]); + } + return 0; + } /* * Now test for an EEH failure. This is VERY expensive. @@ -414,45 +574,52 @@ dn->eeh_config_addr, BUID_HI(dn->phb->buid), BUID_LO(dn->phb->buid)); - if (ret == 0 && rets[1] == 1 && rets[0] >= 2) { - int log_event; - - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); - - log_event = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, dn->eeh_config_addr, - BUID_HI(dn->phb->buid), - BUID_LO(dn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - 1 /* Temporary Error */); - - if (log_event == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, - 1 /* Fatal */); - - spin_unlock_irqrestore(&slot_errbuf_lock, flags); - - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt - * the system in light of potential corruption, we - * can use it here. 
- */ - if (panic_on_oops) { - panic("EEH: MMIO failure (%d) on device:%s %s\n", - rets[0], dn->name, dn->full_name); - } else { - __get_cpu_var(ignored_failures)++; - printk(KERN_INFO "EEH: MMIO failure (%d) on device:%s %s\n", - rets[0], dn->name, dn->full_name); - } - } else { + if (!(ret == 0 && rets[1] == 1 && rets[0] >= 2)) { __get_cpu_var(false_positives)++; + return 0; } + /* prevent repeated reports of this failure */ + dn->eeh_mode |= EEH_MODE_ISOLATED; + + reset_state = rets[0]; + + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + 1 /* Temporary Error */); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); + + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + eeh_panic(dev, reset_state); + return 1; + } + + event->dev = dev; + event->dn = dn; + event->reset_state = reset_state; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&eeh_eventlist_lock, flags); + list_add(&event->list, &eeh_eventlist); + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + + /* Most EEH events are due to device driver bugs. Having + * a stack trace will help the device-driver authors figure + * out what happened. So print that out. 
*/ + dump_stack(); + schedule_work(&eeh_event_wq); + return 0; } @@ -768,11 +935,13 @@ { unsigned int cpu; unsigned long ffs = 0, positives = 0, failures = 0; + unsigned long resets = 0; for_each_cpu(cpu) { ffs += per_cpu(total_mmio_ffs, cpu); positives += per_cpu(false_positives, cpu); failures += per_cpu(ignored_failures, cpu); + resets += per_cpu(slot_resets, cpu); } if (0 == eeh_subsystem_enabled) { @@ -782,8 +951,11 @@ seq_printf(m, "EEH Subsystem is enabled\n"); seq_printf(m, "eeh_total_mmio_ffs=%ld\n" "eeh_false_positives=%ld\n" - "eeh_ignored_failures=%ld\n", - ffs, positives, failures); + "eeh_ignored_failures=%ld\n" + "eeh_slot_resets=%ld\n" + "eeh_fail_count=%d\n", + ffs, positives, failures, resets, + eeh_fail_count.counter); } return 0; ===== include/asm-ppc64/eeh.h 1.15 vs edited ===== --- 1.15/include/asm-ppc64/eeh.h Thu Sep 2 15:22:27 2004 +++ edited/include/asm-ppc64/eeh.h Thu Sep 2 15:38:32 2004 @@ -20,8 +20,10 @@ #ifndef _PPC64_EEH_H #define _PPC64_EEH_H -#include #include +#include +#include +#include struct pci_dev; struct device_node; @@ -41,6 +43,7 @@ /* Values for eeh_mode bits in device_node */ #define EEH_MODE_SUPPORTED (1<<0) #define EEH_MODE_NOCHECK (1<<1) +#define EEH_MODE_ISOLATED (1<<2) extern void __init eeh_init(void); unsigned long eeh_check_failure(void *token, unsigned long val); @@ -76,7 +79,28 @@ #define EEH_RELEASE_DMA 3 int eeh_set_option(struct pci_dev *dev, int options); -/* + +/** + * Notifier event flags. + */ +#define EEH_NOTIFY_FREEZE 1 + +/** EEH event -- structure holding pci slot data that describes + * a change in the isolation status of a PCI slot. A pointer + * to this struct is passed as the data pointer in a notify callback. + */ +struct eeh_event { + struct list_head list; + struct pci_dev *dev; + struct device_node *dn; + int reset_state; +}; + +/** Register to find out about EEH events. 
 */ +int eeh_register_notifier(struct notifier_block *nb); +int eeh_unregister_notifier(struct notifier_block *nb); + +/** * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. * * Order this macro for performance. From jschopp at austin.ibm.com Tue Sep 14 06:10:03 2004 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 13 Sep 2004 15:10:03 -0500 Subject: [Fwd: [RFC][PATCH] flush_hash_page] In-Reply-To: <1095095309.3422.7.camel@localhost> References: <4145C346.6090102@austin.ibm.com> <1095095309.3422.7.camel@localhost> Message-ID: <4145FE9B.5090801@austin.ibm.com> htab_initialize calls create_pte_mapping on every lmb, with large set to true if it is supported. create_pte_mapping calls pSeries_lpar_hpte_insert, which calls H_ENTER, which creates a hardware page table entry. This leads me to believe that all physical memory gets initialized with large ptes. I'm sure I'm wrong, but I just don't see the rest of the code. >>+ if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) >>+ large = 1; >>+ > > > First of all, if you do that, doesn't it assume that all ptes that are > passed in are large pages? I don't think that's correct. I have a hard time with that assumption myself, I was hoping by proposing the wrong answer somebody would be able to help me with the right one. > > Also, think about how huge pages are implemented in Linux. Do huge > pages really even get Linux ptes, or just pmds that act like ptes? We need to be clear in our language to distinguish between hardware page table entries and Linux page table entries. > > That reminds me. Anton, I don't see ppc64 setting up the Linux > pagetable for the kernel mappings anywhere. Did I just miss them? > > -- Dave > > From linas at austin.ibm.com Tue Sep 14 06:57:49 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 13 Sep 2004 15:57:49 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST Message-ID: <20040913205749.GE9645@austin.ibm.com> Forwarding ... 
Brian sent this patch while the list was down. The problem that spurs this patch was discussed a number of times on this mailing list. I like this patch; it seems to solve the problem with a minimum of fuss. I suspect this patch doesn't apply cleanly after other recent changes. Torvalds suggests using "Pirated-by:" when forwarding a patch such as this: http://www.ussg.iu.edu/hypermail/linux/kernel/0405.3/0226.html Signed-off-by: Linas Vepstas --linas ----- Forwarded message from brking at us.ibm.com ----- Subject: [PATCH 1/1] ppc64: Block config accesses during BIST Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters) have an exposure today in that they issue BIST to the adapter to reset the card. If, during the time it takes to complete BIST, userspace attempts to access PCI config space, the host bus bridge will master abort the access since the ipr adapter does not respond on the PCI bus for a brief period of time when running BIST. This master abort results in the host PCI bridge isolating that PCI device from the rest of the system, making the device unusable until Linux is rebooted. This patch is an attempt to close that exposure by introducing some blocking code in the arch specific PCI code. The intent is to have the ipr device driver invoke these routines to prevent userspace PCI accesses from occurring during this window. It has been tested by running BIST on an ipr adapter while running a script which looped reading the config space of that adapter through sysfs. Without the patch, an EEH error occurs. With the patch there is no EEH error. Tested on Power 5 and iSeries Power 4. 
Signed-off-by: Brian King --- linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/iSeries_pci.c | 127 +++++++++- linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/pSeries_pci.c | 103 +++++++- linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/prom.c | 1 linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h | 2 linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/pci.h | 6 linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/prom.h | 5 6 files changed, 226 insertions(+), 18 deletions(-) diff -puN include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/prom.h --- linux-2.6.9-rc1-bk8/include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/prom.h 2004-09-01 16:20:35.000000000 -0500 @@ -169,16 +169,21 @@ struct device_node { struct proc_dir_entry *addr_link; /* addr symlink */ atomic_t _users; /* reference count */ unsigned long _flags; + spinlock_t config_lock; }; /* flag descriptions */ #define OF_STALE 0 /* node is slated for deletion */ #define OF_DYNAMIC 1 /* node and properties were allocated via kmalloc */ +#define OF_NO_CFGIO 2 /* config space accesses should fail */ #define OF_IS_STALE(x) test_bit(OF_STALE, &x->_flags) #define OF_MARK_STALE(x) set_bit(OF_STALE, &x->_flags) #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags) #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags) +#define OF_IS_CFGIO_BLOCKED(x) test_bit(OF_NO_CFGIO, &x->_flags) +#define OF_UNBLOCK_CFGIO(x) clear_bit(OF_NO_CFGIO, &x->_flags) +#define OF_BLOCK_CFGIO(x) set_bit(OF_NO_CFGIO, &x->_flags) /* * Until 32-bit ppc can add proc_dir_entries to its device_node diff -puN arch/ppc64/kernel/prom.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/prom.c --- linux-2.6.9-rc1-bk8/arch/ppc64/kernel/prom.c~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/prom.c 2004-09-01 16:20:35.000000000 -0500 @@ -2959,6 +2959,7 @@ int 
of_add_node(const char *path, struct np->properties = proplist; OF_MARK_DYNAMIC(np); + spin_lock_init(&np->config_lock); of_node_get(np); np->parent = derive_parent(path); if (!np->parent) { diff -puN arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/pSeries_pci.c --- linux-2.6.9-rc1-bk8/arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/pSeries_pci.c 2004-09-01 16:20:35.000000000 -0500 @@ -61,15 +61,12 @@ static int s7a_workaround; extern unsigned long pci_probe_only; -static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +static int __rtas_read_config(struct device_node *dn, int where, int size, u32 *val) { int returnval = -1; unsigned long buid, addr; int ret; - if (!dn) - return -2; - addr = (dn->busno << 16) | (dn->devfn << 8) | where; buid = dn->phb->buid; if (buid) { @@ -82,6 +79,23 @@ static int rtas_read_config(struct devic return ret; } +static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +{ + unsigned long flags; + int ret = 0; + + if (!dn) + return -2; + + spin_lock_irqsave(&dn->config_lock, flags); + if (OF_IS_CFGIO_BLOCKED(dn)) + *val = -1; + else + ret = __rtas_read_config(dn, where, size, val); + spin_unlock_irqrestore(&dn->config_lock, flags); + return ret; +} + static int rtas_pci_read_config(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *val) @@ -100,14 +114,11 @@ static int rtas_pci_read_config(struct p return PCIBIOS_DEVICE_NOT_FOUND; } -static int rtas_write_config(struct device_node *dn, int where, int size, u32 val) +static int __rtas_write_config(struct device_node *dn, int where, int size, u32 val) { unsigned long buid, addr; int ret; - if (!dn) - return -2; - addr = (dn->busno << 16) | (dn->devfn << 8) | where; buid = dn->phb->buid; if (buid) { @@ -118,6 +129,21 @@ static int rtas_write_config(struct devi return ret; } +static int 
rtas_write_config(struct device_node *dn, int where, int size, u32 val) +{ + unsigned long flags; + int ret = 0; + + if (!dn) + return -2; + + spin_lock_irqsave(&dn->config_lock, flags); + if (!OF_IS_CFGIO_BLOCKED(dn)) + ret = __rtas_write_config(dn, where, size, val); + spin_unlock_irqrestore(&dn->config_lock, flags); + return ret; +} + static int rtas_pci_write_config(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 val) @@ -141,6 +167,67 @@ struct pci_ops rtas_pci_ops = { rtas_pci_write_config }; +/** + * pci_block_config_io - Block PCI config reads/writes + * @pdev: pci device struct + * + * This function blocks any PCI config accesses from occurring. + * Device drivers may call this prior to running BIST if the + * adapter cannot handle PCI config reads or writes when + * running BIST. When blocked, any writes will be ignored and + * treated as successful and any reads will return all 1's data. + * + * Return value: + * nothing + **/ +void pci_block_config_io(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + unsigned long flags; + + spin_lock_irqsave(&dn->config_lock, flags); + OF_BLOCK_CFGIO(dn); + spin_unlock_irqrestore(&dn->config_lock, flags); +} +EXPORT_SYMBOL(pci_block_config_io); + +/** + * pci_unblock_config_io - Unblock PCI config reads/writes + * @pdev: pci device struct + * + * This function allows PCI config accesses to resume. + * + * Return value: + * nothing + **/ +void pci_unblock_config_io(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + unsigned long flags; + + spin_lock_irqsave(&dn->config_lock, flags); + OF_UNBLOCK_CFGIO(dn); + spin_unlock_irqrestore(&dn->config_lock, flags); +} +EXPORT_SYMBOL(pci_unblock_config_io); + +/** + * pci_start_bist - Start BIST on a PCI device + * @pdev: pci device struct + * + * This function allows a device driver to start BIST + * when PCI config accesses are disabled. 
+ * + * Return value: + * config write return code (0 on success) + **/ +int pci_start_bist(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + return __rtas_write_config(dn, PCI_BIST, 1, PCI_BIST_START); +} +EXPORT_SYMBOL(pci_start_bist); + /****************************************************************** * pci_read_irq_line * diff -puN include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/pci.h --- linux-2.6.9-rc1-bk8/include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/pci.h 2004-09-01 16:20:35.000000000 -0500 @@ -233,6 +233,12 @@ extern int pci_read_irq_line(struct pci_ extern void pcibios_add_platform_entries(struct pci_dev *dev); +extern void pci_block_config_io(struct pci_dev *dev); + +extern void pci_unblock_config_io(struct pci_dev *dev); + +extern int pci_start_bist(struct pci_dev *dev); + #endif /* __KERNEL__ */ #endif /* __PPC64_PCI_H */ diff -puN include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/iSeries/iSeries_pci.h --- linux-2.6.9-rc1-bk8/include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h 2004-09-01 16:20:35.000000000 -0500 @@ -91,6 +91,7 @@ struct iSeries_Device_Node { int ReturnCode; /* Return Code Holder */ int IoRetry; /* Current Retry Count */ int Flags; /* Possible flags(disable/bist)*/ +#define ISERIES_CFGIO_BLOCKED 1 u16 Vendor; /* Vendor ID */ u8 LogicalSlot; /* Hv Slot Index for Tces */ struct iommu_table* iommu_table;/* Device TCE Table */ @@ -99,6 +100,7 @@ struct iSeries_Device_Node { u8 FrameId; /* iSeries spcn Frame Id */ char CardLocation[4];/* Char format of planar vpd */ char Location[20]; /* Frame 1, Card C10 */ + spinlock_t config_lock; }; /************************************************************************/ diff -puN 
arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/iSeries_pci.c --- linux-2.6.9-rc1-bk8/arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist 2004-09-01 16:20:35.000000000 -0500 +++ linux-2.6.9-rc1-bk8-bjking1/arch/ppc64/kernel/iSeries_pci.c 2004-09-01 16:20:35.000000000 -0500 @@ -131,6 +131,7 @@ static struct iSeries_Device_Node *build node->AgentId = AgentId; node->DevFn = PCI_DEVFN(ISERIES_ENCODE_DEVICE(AgentId), Function); node->IoRetry = 0; + spin_lock_init(&node->config_lock); iSeries_Get_Location_Code(node); PCIFR("Device 0x%02X.%2X, Node:0x%p ", ISERIES_BUS(node), ISERIES_DEVFUN(node), node); @@ -515,16 +516,12 @@ static u64 hv_cfg_write_func[4] = { /* * Read PCI config space */ -static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn, +static int __iSeries_pci_read_config(struct iSeries_Device_Node *node, int offset, int size, u32 *val) { - struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); u64 fn; struct HvCallPci_LoadReturn ret; - if (node == NULL) - return PCIBIOS_DEVICE_NOT_FOUND; - fn = hv_cfg_read_func[(size - 1) & 3]; HvCall3Ret16(fn, &ret, node->DsaAddr.DsaAddr, offset, 0); @@ -537,20 +534,36 @@ static int iSeries_pci_read_config(struc return 0; } +static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn, + int offset, int size, u32 *val) +{ + struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); + int ret = PCIBIOS_DEVICE_NOT_FOUND; + unsigned long flags; + + if (node) { + ret = 0; + spin_lock_irqsave(&node->config_lock, flags); + if (node->Flags & ISERIES_CFGIO_BLOCKED) + *val = -1; + else + ret = __iSeries_pci_read_config(node, offset, size, val); + spin_unlock_irqrestore(&node->config_lock, flags); + } + + return ret; +} + /* * Write PCI config space */ -static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn, +static int __iSeries_pci_write_config(struct iSeries_Device_Node *node, int offset, int size, u32 
val) { - struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); u64 fn; u64 ret; - if (node == NULL) - return PCIBIOS_DEVICE_NOT_FOUND; - fn = hv_cfg_write_func[(size - 1) & 3]; ret = HvCall4(fn, node->DsaAddr.DsaAddr, offset, val, 0); @@ -560,6 +573,23 @@ static int iSeries_pci_write_config(stru return 0; } +static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn, + int offset, int size, u32 val) +{ + struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); + int ret = PCIBIOS_DEVICE_NOT_FOUND; + unsigned long flags; + + if (node) { + spin_lock_irqsave(&node->config_lock, flags); + if (!(node->Flags & ISERIES_CFGIO_BLOCKED)) + ret = __iSeries_pci_write_config(node, offset, size, val); + spin_unlock_irqrestore(&node->config_lock, flags); + } + + return ret; +} + static struct pci_ops iSeries_pci_ops = { .read = iSeries_pci_read_config, .write = iSeries_pci_write_config @@ -820,3 +850,80 @@ void iSeries_Write_Long(u32 data, void * } while (CheckReturnCode("WWL", DevNode, rc) != 0); } EXPORT_SYMBOL(iSeries_Write_Long); + +/** + * pci_block_config_io - Block PCI config reads/writes + * @pdev: pci device struct + * + * This function blocks any PCI config accesses from occurring. + * Device drivers may call this prior to running BIST if the + * adapter cannot handle PCI config reads or writes when + * running BIST. When blocked, any writes will be ignored and + * treated as successful and any reads will return all 1's data. 
+ * + * Return value: + * nothing + **/ +void pci_block_config_io(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + unsigned long flags; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return; + + spin_lock_irqsave(&node->config_lock, flags); + node->Flags |= ISERIES_CFGIO_BLOCKED; + spin_unlock_irqrestore(&node->config_lock, flags); +} +EXPORT_SYMBOL(pci_block_config_io); + +/** + * pci_unblock_config_io - Unblock PCI config reads/writes + * @pdev: pci device struct + * + * This function allows PCI config accesses to resume. + * + * Return value: + * nothing + **/ +void pci_unblock_config_io(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + unsigned long flags; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return; + + spin_lock_irqsave(&node->config_lock, flags); + node->Flags &= ~ISERIES_CFGIO_BLOCKED; + spin_unlock_irqrestore(&node->config_lock, flags); +} +EXPORT_SYMBOL(pci_unblock_config_io); + +/** + * pci_start_bist - Start BIST on a PCI device + * @pdev: pci device struct + * + * This function allows a device driver to start BIST + * when PCI config accesses are disabled. + * + * Return value: + * config write return code, or PCIBIOS_DEVICE_NOT_FOUND + **/ +int pci_start_bist(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return PCIBIOS_DEVICE_NOT_FOUND; + + return __iSeries_pci_write_config(node, PCI_BIST, 1, PCI_BIST_START); +} +EXPORT_SYMBOL(pci_start_bist); _ ----- End forwarded message ----- From brking at us.ibm.com Tue Sep 14 07:05:39 2004 From: brking at us.ibm.com (Brian King) Date: Mon, 13 Sep 2004 16:05:39 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST In-Reply-To: <20040913205749.GE9645@austin.ibm.com> References: <20040913205749.GE9645@austin.ibm.com> Message-ID: <41460BA3.9070007@us.ibm.com> I'll be sending a patch that applies cleanly fairly soon. 
-Brian Linas Vepstas wrote: > Forwarding ... > > Brian sent this patch while the list was down. The problem that > spurs this patch was discussed a number of time on this mailing list. > I like this patch; it seems to solve the problem with a minimum of > fuss. > > I suspect this patch doesn't apply cleanly after other recent > changes. > > Torvalds suggests using "Pirated-by:" when forwarding a patch such as this: > http://www.ussg.iu.edu/hypermail/linux/kernel/0405.3/0226.html > > Signed-off-by: Linas Vepstas > > --linas > > ----- Forwarded message from brking at us.ibm.com ----- > > Subject: [PATCH 1/1] ppc64: Block config accesses during BIST > > Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters) > have an exposure today in that they issue BIST to the adapter to reset > the card. If, during the time it takes to complete BIST, userspace attempts > to access PCI config space, the host bus bridge will master abort the access > since the ipr adapter does not respond on the PCI bus for a brief period of > time when running BIST. This master abort results in the host PCI bridge > isolating that PCI device from the rest of the system, making the device > unusable until Linux is rebooted. This patch is an attempt to close that > exposure by introducing some blocking code in the arch specific PCI code. > The intent is to have the ipr device driver invoke these routines to > prevent userspace PCI accesses from occurring during this window. > > It has been tested by running BIST on an ipr adapter while running a > script which looped reading the config space of that adapter through sysfs. > Without the patch, an EEH error occurrs. With the patch there is no EEH > error. Tested on Power 5 and iSeries Power 4. 
reads/writes > + * @pdev: pci device struct > + * > + * This function blocks any PCI config accesses from occurring. > + * Device drivers may call this prior to running BIST if the > + * adapter cannot handle PCI config reads or writes when > + * running BIST. When blocked, any writes will be ignored and > + * treated as successful and any reads will return all 1's data. > + * > + * Return value: > + * nothing > + **/ > +void pci_block_config_io(struct pci_dev *pdev) > +{ > + struct iSeries_Device_Node *node; > + unsigned long flags; > + > + node = find_Device_Node(pdev->bus->number, pdev->devfn); > + > + if (node == NULL) > + return; > + > + spin_lock_irqsave(&node->config_lock, flags); > + node->Flags |= ISERIES_CFGIO_BLOCKED; > + spin_unlock_irqrestore(&node->config_lock, flags); > +} > +EXPORT_SYMBOL(pci_block_config_io); > + > +/** > + * pci_unblock_config_io - Unblock PCI config reads/writes > + * @pdev: pci device struct > + * > + * This function allows PCI config accesses to resume. > + * > + * Return value: > + * nothing > + **/ > +void pci_unblock_config_io(struct pci_dev *pdev) > +{ > + struct iSeries_Device_Node *node; > + unsigned long flags; > + > + node = find_Device_Node(pdev->bus->number, pdev->devfn); > + > + if (node == NULL) > + return; > + > + spin_lock_irqsave(&node->config_lock, flags); > + node->Flags &= ~ISERIES_CFGIO_BLOCKED; > + spin_unlock_irqrestore(&node->config_lock, flags); > +} > +EXPORT_SYMBOL(pci_unblock_config_io); > + > +/** > + * pci_start_bist - Start BIST on a PCI device > + * @pdev: pci device struct > + * > + * This function allows a device driver to start BIST > + * when PCI config accesses are disabled. 
> + * > + * Return value: > + * nothing > + **/ > +int pci_start_bist(struct pci_dev *pdev) > +{ > + struct iSeries_Device_Node *node; > + > + node = find_Device_Node(pdev->bus->number, pdev->devfn); > + > + if (node == NULL) > + return PCIBIOS_DEVICE_NOT_FOUND; > + > + return __iSeries_pci_write_config(node, PCI_BIST, 1, PCI_BIST_START); > +} > +EXPORT_SYMBOL(pci_start_bist); > _ > > > ----- End forwarded message ----- > -- Brian King eServer Storage I/O IBM Linux Technology Center From brking at us.ibm.com Tue Sep 14 07:53:52 2004 From: brking at us.ibm.com (brking at us.ibm.com) Date: Mon, 13 Sep 2004 16:53:52 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST Message-ID: <200409132153.i8DLrrrF053396@westrelay04.boulder.ibm.com> Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters) have an exposure today in that they issue BIST to the adapter to reset the card. If, during the time it takes to complete BIST, userspace attempts to access PCI config space, the host bus bridge will master abort the access since the ipr adapter does not respond on the PCI bus for a brief period of time when running BIST. This master abort results in the host PCI bridge isolating that PCI device from the rest of the system, making the device unusable until Linux is rebooted. This patch is an attempt to close that exposure by introducing some blocking code in the arch specific PCI code. The intent is to have the ipr device driver invoke these routines to prevent userspace PCI accesses from occurring during this window. It has been tested by running BIST on an ipr adapter while running a script which looped reading the config space of that adapter through sysfs. Without the patch, an EEH error occurs. With the patch there is no EEH error. Tested on Power 5 and iSeries Power 4.
Signed-off-by: Brian King --- linux-2.6.9-rc2-bjking1/arch/ppc64/kernel/iSeries_pci.c | 127 +++++++++- linux-2.6.9-rc2-bjking1/arch/ppc64/kernel/pSeries_pci.c | 101 +++++++ linux-2.6.9-rc2-bjking1/arch/ppc64/kernel/prom.c | 1 linux-2.6.9-rc2-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h | 2 linux-2.6.9-rc2-bjking1/include/asm-ppc64/pci.h | 6 linux-2.6.9-rc2-bjking1/include/asm-ppc64/prom.h | 5 6 files changed, 226 insertions(+), 16 deletions(-) diff -puN include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/prom.h --- linux-2.6.9-rc2/include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist 2004-09-13 09:54:32.000000000 -0500 +++ linux-2.6.9-rc2-bjking1/include/asm-ppc64/prom.h 2004-09-13 09:54:32.000000000 -0500 @@ -169,16 +169,21 @@ struct device_node { struct proc_dir_entry *addr_link; /* addr symlink */ atomic_t _users; /* reference count */ unsigned long _flags; + spinlock_t config_lock; }; /* flag descriptions */ #define OF_STALE 0 /* node is slated for deletion */ #define OF_DYNAMIC 1 /* node and properties were allocated via kmalloc */ +#define OF_NO_CFGIO 2 /* config space accesses should fail */ #define OF_IS_STALE(x) test_bit(OF_STALE, &x->_flags) #define OF_MARK_STALE(x) set_bit(OF_STALE, &x->_flags) #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags) #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags) +#define OF_IS_CFGIO_BLOCKED(x) test_bit(OF_NO_CFGIO, &x->_flags) +#define OF_UNBLOCK_CFGIO(x) clear_bit(OF_NO_CFGIO, &x->_flags) +#define OF_BLOCK_CFGIO(x) set_bit(OF_NO_CFGIO, &x->_flags) /* * Until 32-bit ppc can add proc_dir_entries to its device_node diff -puN arch/ppc64/kernel/prom.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/prom.c --- linux-2.6.9-rc2/arch/ppc64/kernel/prom.c~ppc64_block_cfg_io_during_bist 2004-09-13 09:54:32.000000000 -0500 +++ linux-2.6.9-rc2-bjking1/arch/ppc64/kernel/prom.c 2004-09-13 09:54:32.000000000 -0500 @@ -2959,6 +2959,7 @@ int of_add_node(const char *path, struct np->properties = 
proplist; OF_MARK_DYNAMIC(np); + spin_lock_init(&np->config_lock); of_node_get(np); np->parent = derive_parent(path); if (!np->parent) { diff -puN arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/pSeries_pci.c --- linux-2.6.9-rc2/arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist 2004-09-13 09:54:32.000000000 -0500 +++ linux-2.6.9-rc2-bjking1/arch/ppc64/kernel/pSeries_pci.c 2004-09-13 09:54:32.000000000 -0500 @@ -61,14 +61,12 @@ static int s7a_workaround; extern unsigned long pci_probe_only; -static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +static int __rtas_read_config(struct device_node *dn, int where, int size, u32 *val) { int returnval = -1; unsigned long buid, addr; int ret; - if (!dn) - return PCIBIOS_DEVICE_NOT_FOUND; if (where & (size - 1)) return PCIBIOS_BAD_REGISTER_NUMBER; @@ -92,6 +90,23 @@ static int rtas_read_config(struct devic return PCIBIOS_SUCCESSFUL; } +static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +{ + unsigned long flags; + int ret = 0; + + if (!dn) + return PCIBIOS_DEVICE_NOT_FOUND; + + spin_lock_irqsave(&dn->config_lock, flags); + if (OF_IS_CFGIO_BLOCKED(dn)) + *val = -1; + else + ret = __rtas_read_config(dn, where, size, val); + spin_unlock_irqrestore(&dn->config_lock, flags); + return ret; +} + static int rtas_pci_read_config(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *val) @@ -110,13 +125,11 @@ static int rtas_pci_read_config(struct p return PCIBIOS_DEVICE_NOT_FOUND; } -static int rtas_write_config(struct device_node *dn, int where, int size, u32 val) +static int __rtas_write_config(struct device_node *dn, int where, int size, u32 val) { unsigned long buid, addr; int ret; - if (!dn) - return PCIBIOS_DEVICE_NOT_FOUND; if (where & (size - 1)) return PCIBIOS_BAD_REGISTER_NUMBER; @@ -134,6 +147,21 @@ static int rtas_write_config(struct devi return PCIBIOS_SUCCESSFUL; } +static int rtas_write_config(struct 
device_node *dn, int where, int size, u32 val) +{ + unsigned long flags; + int ret = 0; + + if (!dn) + return PCIBIOS_DEVICE_NOT_FOUND; + + spin_lock_irqsave(&dn->config_lock, flags); + if (!OF_IS_CFGIO_BLOCKED(dn)) + ret = __rtas_write_config(dn, where, size, val); + spin_unlock_irqrestore(&dn->config_lock, flags); + return ret; +} + static int rtas_pci_write_config(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 val) @@ -157,6 +185,67 @@ struct pci_ops rtas_pci_ops = { rtas_pci_write_config }; +/** + * pci_block_config_io - Block PCI config reads/writes + * @pdev: pci device struct + * + * This function blocks any PCI config accesses from occurring. + * Device drivers may call this prior to running BIST if the + * adapter cannot handle PCI config reads or writes when + * running BIST. When blocked, any writes will be ignored and + * treated as successful and any reads will return all 1's data. + * + * Return value: + * nothing + **/ +void pci_block_config_io(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + unsigned long flags; + + spin_lock_irqsave(&dn->config_lock, flags); + OF_BLOCK_CFGIO(dn); + spin_unlock_irqrestore(&dn->config_lock, flags); +} +EXPORT_SYMBOL(pci_block_config_io); + +/** + * pci_unblock_config_io - Unblock PCI config reads/writes + * @pdev: pci device struct + * + * This function allows PCI config accesses to resume. + * + * Return value: + * nothing + **/ +void pci_unblock_config_io(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + unsigned long flags; + + spin_lock_irqsave(&dn->config_lock, flags); + OF_UNBLOCK_CFGIO(dn); + spin_unlock_irqrestore(&dn->config_lock, flags); +} +EXPORT_SYMBOL(pci_unblock_config_io); + +/** + * pci_start_bist - Start BIST on a PCI device + * @pdev: pci device struct + * + * This function allows a device driver to start BIST + * when PCI config accesses are disabled. 
+ * + * Return value: + * nothing + **/ +int pci_start_bist(struct pci_dev *pdev) +{ + struct device_node *dn = pci_device_to_OF_node(pdev); + return __rtas_write_config(dn, PCI_BIST, 1, PCI_BIST_START); +} +EXPORT_SYMBOL(pci_start_bist); + /****************************************************************** * pci_read_irq_line * diff -puN include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/pci.h --- linux-2.6.9-rc2/include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist 2004-09-13 09:54:32.000000000 -0500 +++ linux-2.6.9-rc2-bjking1/include/asm-ppc64/pci.h 2004-09-13 09:54:32.000000000 -0500 @@ -233,6 +233,12 @@ extern int pci_read_irq_line(struct pci_ extern void pcibios_add_platform_entries(struct pci_dev *dev); +extern void pci_block_config_io(struct pci_dev *dev); + +extern void pci_unblock_config_io(struct pci_dev *dev); + +extern int pci_start_bist(struct pci_dev *dev); + #endif /* __KERNEL__ */ #endif /* __PPC64_PCI_H */ diff -puN include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/iSeries/iSeries_pci.h --- linux-2.6.9-rc2/include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist 2004-09-13 09:54:32.000000000 -0500 +++ linux-2.6.9-rc2-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h 2004-09-13 09:54:32.000000000 -0500 @@ -91,6 +91,7 @@ struct iSeries_Device_Node { int ReturnCode; /* Return Code Holder */ int IoRetry; /* Current Retry Count */ int Flags; /* Possible flags(disable/bist)*/ +#define ISERIES_CFGIO_BLOCKED 1 u16 Vendor; /* Vendor ID */ u8 LogicalSlot; /* Hv Slot Index for Tces */ struct iommu_table* iommu_table;/* Device TCE Table */ @@ -99,6 +100,7 @@ struct iSeries_Device_Node { u8 FrameId; /* iSeries spcn Frame Id */ char CardLocation[4];/* Char format of planar vpd */ char Location[20]; /* Frame 1, Card C10 */ + spinlock_t config_lock; }; /************************************************************************/ diff -puN 
arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/iSeries_pci.c --- linux-2.6.9-rc2/arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist 2004-09-13 09:54:32.000000000 -0500 +++ linux-2.6.9-rc2-bjking1/arch/ppc64/kernel/iSeries_pci.c 2004-09-13 09:54:32.000000000 -0500 @@ -131,6 +131,7 @@ static struct iSeries_Device_Node *build node->AgentId = AgentId; node->DevFn = PCI_DEVFN(ISERIES_ENCODE_DEVICE(AgentId), Function); node->IoRetry = 0; + spin_lock_init(&node->config_lock); iSeries_Get_Location_Code(node); PCIFR("Device 0x%02X.%2X, Node:0x%p ", ISERIES_BUS(node), ISERIES_DEVFUN(node), node); @@ -508,16 +509,12 @@ static u64 hv_cfg_write_func[4] = { /* * Read PCI config space */ -static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn, +static int __iSeries_pci_read_config(struct iSeries_Device_Node *node, int offset, int size, u32 *val) { - struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); u64 fn; struct HvCallPci_LoadReturn ret; - if (node == NULL) - return PCIBIOS_DEVICE_NOT_FOUND; - fn = hv_cfg_read_func[(size - 1) & 3]; HvCall3Ret16(fn, &ret, node->DsaAddr.DsaAddr, offset, 0); @@ -530,20 +527,36 @@ static int iSeries_pci_read_config(struc return 0; } +static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn, + int offset, int size, u32 *val) +{ + struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); + int ret = PCIBIOS_DEVICE_NOT_FOUND; + unsigned long flags; + + if (node) { + ret = 0; + spin_lock_irqsave(&node->config_lock, flags); + if (node->Flags & ISERIES_CFGIO_BLOCKED) + *val = -1; + else + ret = __iSeries_pci_read_config(node, offset, size, val); + spin_unlock_irqrestore(&node->config_lock, flags); + } + + return ret; +} + /* * Write PCI config space */ -static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn, +static int __iSeries_pci_write_config(struct iSeries_Device_Node *node, int offset, int size, u32 val) { - 
struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); u64 fn; u64 ret; - if (node == NULL) - return PCIBIOS_DEVICE_NOT_FOUND; - fn = hv_cfg_write_func[(size - 1) & 3]; ret = HvCall4(fn, node->DsaAddr.DsaAddr, offset, val, 0); @@ -553,6 +566,23 @@ static int iSeries_pci_write_config(stru return 0; } +static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn, + int offset, int size, u32 val) +{ + struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn); + int ret = PCIBIOS_DEVICE_NOT_FOUND; + unsigned long flags; + + if (node) { + spin_lock_irqsave(&node->config_lock, flags); + if (!(node->Flags & ISERIES_CFGIO_BLOCKED)) + ret = __iSeries_pci_write_config(node, offset, size, val); + spin_unlock_irqrestore(&node->config_lock, flags); + } + + return ret; +} + static struct pci_ops iSeries_pci_ops = { .read = iSeries_pci_read_config, .write = iSeries_pci_write_config @@ -815,3 +845,80 @@ void iSeries_Write_Long(u32 data, volati } while (CheckReturnCode("WWL", DevNode, rc) != 0); } EXPORT_SYMBOL(iSeries_Write_Long); + +/** + * pci_block_config_io - Block PCI config reads/writes + * @pdev: pci device struct + * + * This function blocks any PCI config accesses from occurring. + * Device drivers may call this prior to running BIST if the + * adapter cannot handle PCI config reads or writes when + * running BIST. When blocked, any writes will be ignored and + * treated as successful and any reads will return all 1's data. 
+ * + * Return value: + * nothing + **/ +void pci_block_config_io(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + unsigned long flags; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return; + + spin_lock_irqsave(&node->config_lock, flags); + node->Flags |= ISERIES_CFGIO_BLOCKED; + spin_unlock_irqrestore(&node->config_lock, flags); +} +EXPORT_SYMBOL(pci_block_config_io); + +/** + * pci_unblock_config_io - Unblock PCI config reads/writes + * @pdev: pci device struct + * + * This function allows PCI config accesses to resume. + * + * Return value: + * nothing + **/ +void pci_unblock_config_io(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + unsigned long flags; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return; + + spin_lock_irqsave(&node->config_lock, flags); + node->Flags &= ~ISERIES_CFGIO_BLOCKED; + spin_unlock_irqrestore(&node->config_lock, flags); +} +EXPORT_SYMBOL(pci_unblock_config_io); + +/** + * pci_start_bist - Start BIST on a PCI device + * @pdev: pci device struct + * + * This function allows a device driver to start BIST + * when PCI config accesses are disabled. 
+ * + * Return value: + * nothing + **/ +int pci_start_bist(struct pci_dev *pdev) +{ + struct iSeries_Device_Node *node; + + node = find_Device_Node(pdev->bus->number, pdev->devfn); + + if (node == NULL) + return PCIBIOS_DEVICE_NOT_FOUND; + + return __iSeries_pci_write_config(node, PCI_BIST, 1, PCI_BIST_START); +} +EXPORT_SYMBOL(pci_start_bist); _ From jschopp at austin.ibm.com Tue Sep 14 09:01:22 2004 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 13 Sep 2004 18:01:22 -0500 Subject: [Fwd: [RFC][PATCH] flush_hash_page] In-Reply-To: <1095095309.3422.7.camel@localhost> References: <4145C346.6090102@austin.ibm.com> <1095095309.3422.7.camel@localhost> Message-ID: <414626C2.1080606@austin.ibm.com> > First of all, if you do that, doesn't it assume that all ptes that are > passed in are large pages? I don't think that's correct. > > Also, think about how huge pages are implemented in Linux. Do huge > pages really even get Linux ptes, or just pmds that act like ptes? It appears that large pages only have pmds (as evidenced by hugepte_t being an int and not a long among other things). It turns out there already is a flush_hash_hugepage that does what I am trying to do. Figuring out which to call isn't too bad either. if(in_hugepage_area(context, ea)) should do the trick. Might be nice to clean up flush_hash_page to totally remove the large variable and associated code, and add a comment about flush_hash_hugepage. Thanks for your patience while I learn this stuff. 
From benh at kernel.crashing.org Tue Sep 14 08:44:28 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 14 Sep 2004 08:44:28 +1000 Subject: vDSO preliminary implementation In-Reply-To: <20040913170946.GB9645@austin.ibm.com> References: <1093496594.2172.80.camel@gaston> <20040910211158.GX9645@austin.ibm.com> <1094863417.2667.152.camel@gaston> <20040913170946.GB9645@austin.ibm.com> Message-ID: <1095115467.24290.298.camel@gaston> On Tue, 2004-09-14 at 03:09, Linas Vepstas wrote: > On Sat, Sep 11, 2004 at 10:43:38AM +1000, Benjamin Herrenschmidt was heard to remark: > > > > > > > > Here's a first shot at implementing a vDSO for ppc32/ppc64. This is definitely > > > > > > What's vDSO ? Google was amazingly unhelpful in figuring this out. > > > > virtual .so, that is a library mapped by the kernel in userspace > > Let me re-phrase that: what's it good for? Is this a mechanism for > sharing the text segment of a library between all users? No, not at all, it's used for the kernel to provide some specific functions to userland, like the signal trampoline (moved out of the stack), or a userland implementation of gettimeofday, along with things that can optimized per-cpu (like memcpy) without the overhead of a syscall. > Ye olde AIX had this feature; I've never thought about whether Linux > does this or not; shared libs were loaded so that the text segment > of a library appeared only once in 'real' memory, and was thus shared > by the various apps. I'm not sure, I think in AIX even the "ptes" were > shared too: the text was always loaded into the same segment (segment 0 > iirc), so you wouldn't have tlb misses on things like libc. > > I've never thought about how Linux loads libraries, so excuse me on this > newbie-sounding question. How does Linux load .so's today? Is there > one copy per process, or are they shared? AIX does segment sharing, which goes further. 
Linux doesn't share at the low MMU level like AIX does, but the text is definitely shared (unless something triggers a copy-on-write). Ben. From ananth at in.ibm.com Tue Sep 14 16:56:27 2004 From: ananth at in.ibm.com (Ananth N Mavinakayanahalli) Date: Tue, 14 Sep 2004 11:56:27 +0500 Subject: linux-2.6.9-rc* ppc64 broken on UNI? Message-ID: <20040914065627.GA10113@in.ibm.com> Hi, I am using linux-2.6.9-rc* on an old Power3 box here and looks like the tree is broken for UNI (SMP works fine). I am not too familiar with the fpu stuff to figure out the issue myself. I get the following exception: The system is going down for reboot NOW!INIT: Sending pVector: 300 (Data Access) at [c00000003f617bb0] pc: c00000000000b8b0: copy_to_here+0xb0/0x16c lr: 00000000100419dc sp: c00000003f617e30 msr: a000000000003032 dar: 108 dsisr: 40000000 current = 0xc000000001b3d900 paca = 0xc000000000320000 pid = 1298, comm = bash enter ? for help mon> r R00 = 00000000100419dc R16 = 0000000080000000 R01 = c00000003f617e30 R17 = 0000000010010000 R02 = c0000000004273b0 R18 = 0000000000000000 R03 = c000000000428e68 R19 = 00000000ffffee5c R04 = c00000000ff81c60 R20 = 0000000010020000 R05 = 0000000000000000 R21 = 00000000ffff67b0 R06 = 00000000616e6400 R22 = 0000000010010000 R07 = fffffffffefefeff R23 = 00000000000055d0 R08 = 000000007f7f7f7f R24 = 0000000000000001 R09 = 0000000000000801 R25 = 0000000000000001 R10 = 0000000000000000 R26 = 00000000f800fa98 R11 = 7265677368657265 R27 = 00000000f800fa9c R12 = 200000000000d032 R28 = 000000001008e930 R13 = c000000000320000 R29 = 000000001008e930 R14 = 0000000000000000 R30 = 0000000000000001 R15 = 0000000010020000 R31 = 0000000000000001 pc = c00000000000b8b0 copy_to_here+0xb0/0x16c lr = 00000000100419dc msr = a000000000003032 cr = 88000422 ctr = 0000000110049db0 xer = 0000000020000000 trap = 300 mon> s Vector: 300 (Data Access) at [c00000003f617bb0] pc: c00000000000b8b0: copy_to_here+0xb0/0x16c lr: 00000000100419dc sp: c00000003f617e30 msr: 
a000000000003432 dar: 108 dsisr: 40000000 current = 0xc000000001b3d900 paca = 0xc000000000320000 pid = 1298, comm = bash enter ? for help mon> zr Has anyone seen a similar problem? Any ideas? Thanks, Ananth From brking at us.ibm.com Wed Sep 15 02:19:18 2004 From: brking at us.ibm.com (Brian King) Date: Tue, 14 Sep 2004 11:19:18 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST In-Reply-To: <1095177165.10019.42.camel@biclops.private.network> References: <200409132153.i8DLrrrF053396@westrelay04.boulder.ibm.com> <1095177165.10019.42.camel@biclops.private.network> Message-ID: <41471A06.8080109@us.ibm.com> Nathan Lynch wrote: > Is there a way to make an attempted config space read sleep until the > BIST is done? E.g. the device's config space is protected by a > semaphore which must be held during accesses or during BIST. Or is it > legal for the driver to access config space in interrupt context? It is legal for the driver to access config space in interrupt context. Additionally, the arch specific config space access routines get called with a global spinlock held irqsave in the core pci code. > Adding a spinlock to the (pSeries-only) device_node structure makes me > concerned about the potential for deadlock, since it's not clear to me > what the lock ordering rules should be with respect to the per-device > node config_lock and the global device tree lock. I don't see a problem. I would argue that the global device tree lock would need to be acquired before the per-device node lock. I don't see any code paths where the per-device node config_lock is held where it would try to acquire the global device tree lock. 
-- Brian King eServer Storage I/O IBM Linux Technology Center From nathanl at austin.ibm.com Wed Sep 15 01:52:45 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Tue, 14 Sep 2004 10:52:45 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST In-Reply-To: <200409132153.i8DLrrrF053396@westrelay04.boulder.ibm.com> References: <200409132153.i8DLrrrF053396@westrelay04.boulder.ibm.com> Message-ID: <1095177165.10019.42.camel@biclops.private.network> On Mon, 2004-09-13 at 16:53, brking at us.ibm.com wrote: > Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters) > have an exposure today in that they issue BIST to the adapter to reset > the card. If, during the time it takes to complete BIST, userspace attempts > to access PCI config space, the host bus bridge will master abort the access > since the ipr adapter does not respond on the PCI bus for a brief period of > time when running BIST. This master abort results in the host PCI bridge > isolating that PCI device from the rest of the system, making the device > unusable until Linux is rebooted. This patch is an attempt to close that > exposure by introducing some blocking code in the arch specific PCI code. > The intent is to have the ipr device driver invoke these routines to > prevent userspace PCI accesses from occurring during this window. Is there a way to make an attempted config space read sleep until the BIST is done? E.g. the device's config space is protected by a semaphore which must be held during accesses or during BIST. Or is it legal for the driver to access config space in interrupt context? Adding a spinlock to the (pSeries-only) device_node structure makes me concerned about the potential for deadlock, since it's not clear to me what the lock ordering rules should be with respect to the per-device node config_lock and the global device tree lock. 
Nathan From linas at austin.ibm.com Wed Sep 15 03:15:11 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Tue, 14 Sep 2004 12:15:11 -0500 Subject: linux-2.6.9-rc* ppc64 broken on UNI? In-Reply-To: <20040914065627.GA10113@in.ibm.com> References: <20040914065627.GA10113@in.ibm.com> Message-ID: <20040914171511.GH9645@austin.ibm.com> On Tue, Sep 14, 2004 at 11:56:27AM +0500, Ananth N Mavinakayanahalli was heard to remark: > Hi, > > I am using linux-2.6.9-rc* on an old Power3 box here and looks like > the tree is broken for UNI (SMP works fine). I am not too familiar > with the fpu stuff to figure out the issue myself. the fault is in arch/ppc64/kernel/head.S ... I don't have disassembly for copy_to_here+0xb0, but I'm guessing it's a null pointer deref in /* Disable FP for last_task_used_math */ ld r5,PT_REGS(r4) ld r4,_MSR-STACK_FRAME_OVERHEAD(r5) That is, PT_REGS(r4) is zero. I assume r4 is a valid value, but you should check that it is. You need to figure out why PT_REGS(r4) wasn't set at some earlier time. This happens somewhere in switch_to, I think. I'm guessing this affects power4 and 5 as well; it looks generic to me. Make sure that asm-offsets.c is new and up-to-date, and not old. --linas > I get the following exception: > > The system is going down for reboot NOW!INIT: Sending pVector: 300 (Data Access) at [c00000003f617bb0] > pc: c00000000000b8b0: copy_to_here+0xb0/0x16c > lr: 00000000100419dc > sp: c00000003f617e30 > msr: a000000000003032 > dar: 108 > dsisr: 40000000 > current = 0xc000000001b3d900 > paca = 0xc000000000320000 > pid = 1298, comm = bash > enter ?
for help > mon> r > R00 = 00000000100419dc R16 = 0000000080000000 > R01 = c00000003f617e30 R17 = 0000000010010000 > R02 = c0000000004273b0 R18 = 0000000000000000 > R03 = c000000000428e68 R19 = 00000000ffffee5c > R04 = c00000000ff81c60 R20 = 0000000010020000 > R05 = 0000000000000000 R21 = 00000000ffff67b0 > R06 = 00000000616e6400 R22 = 0000000010010000 > R07 = fffffffffefefeff R23 = 00000000000055d0 > R08 = 000000007f7f7f7f R24 = 0000000000000001 > R09 = 0000000000000801 R25 = 0000000000000001 > R10 = 0000000000000000 R26 = 00000000f800fa98 > R11 = 7265677368657265 R27 = 00000000f800fa9c > R12 = 200000000000d032 R28 = 000000001008e930 > R13 = c000000000320000 R29 = 000000001008e930 > R14 = 0000000000000000 R30 = 0000000000000001 > R15 = 0000000010020000 R31 = 0000000000000001 > pc = c00000000000b8b0 copy_to_here+0xb0/0x16c > lr = 00000000100419dc > msr = a000000000003032 cr = 88000422 > ctr = 0000000110049db0 xer = 0000000020000000 trap = 300 > mon> s > Vector: 300 (Data Access) at [c00000003f617bb0] > pc: c00000000000b8b0: copy_to_here+0xb0/0x16c > lr: 00000000100419dc > sp: c00000003f617e30 > msr: a000000000003432 > dar: 108 > dsisr: 40000000 > current = 0xc000000001b3d900 > paca = 0xc000000000320000 > pid = 1298, comm = bash > enter ? for help > mon> zr > > Has anyone seen a similar problem? Any ideas? 
> > Thanks, > Ananth > _______________________________________________ > Linuxppc64-dev mailing list > Linuxppc64-dev at ozlabs.org > https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev > From linas at austin.ibm.com Wed Sep 15 08:17:26 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Tue, 14 Sep 2004 17:17:26 -0500 Subject: [PATCH 1/1] ppc64: Block config accesses during BIST In-Reply-To: <41471A06.8080109@us.ibm.com> References: <200409132153.i8DLrrrF053396@westrelay04.boulder.ibm.com> <1095177165.10019.42.camel@biclops.private.network> <41471A06.8080109@us.ibm.com> Message-ID: <20040914221726.GI9645@austin.ibm.com> On Tue, Sep 14, 2004 at 11:19:18AM -0500, Brian King was heard to remark: > Nathan Lynch wrote: > >Is there a way to make an attempted config space read sleep until the > >BIST is done? E.g. the device's config space is protected by a > >semaphore which must be held during accesses or during BIST. Or is it > >legal for the driver to access config space in interrupt context? > > It is legal for the driver to access config space in interrupt context. > Additionally, the arch specific config space access routines get called > with a global spinlock held irqsave in the core pci code. In drivers/pci/access.c the lock is called "pci_lock"; it's declared static, and is thus apparently used *only* in this one file. This lock doesn't seem to be protecting any data that I can obviously see. My gut feel is that this lock should be moved to the arch-specific code, where the different arches could implement a global spinlock or something more granular if they wish. Anyone care to bring it up on LKML? Maybe they'd go for it ... --linas From benh at kernel.crashing.org Wed Sep 15 11:14:49 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 15 Sep 2004 11:14:49 +1000 Subject: linux-2.6.9-rc* ppc64 broken on UNI?
In-Reply-To: <20040914065627.GA10113@in.ibm.com> References: <20040914065627.GA10113@in.ibm.com> Message-ID: <1095209607.2543.419.camel@gaston> On Tue, 2004-09-14 at 16:56, Ananth N Mavinakayanahalli wrote: > Hi, > > I am using linux-2.6.9-rc* on an old Power3 box here and looks like > the tree is broken for UNI (SMP works fine). I am not too familiar > with the fpu stuff to figure out the issue myself. > > I get the following exception: Can you get a backtrace too please ? Ben. From david at gibson.dropbear.id.au Wed Sep 15 11:25:21 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 15 Sep 2004 11:25:21 +1000 Subject: [Fwd: [RFC][PATCH] flush_hash_page] In-Reply-To: <414626C2.1080606@austin.ibm.com> References: <4145C346.6090102@austin.ibm.com> <1095095309.3422.7.camel@localhost> <414626C2.1080606@austin.ibm.com> Message-ID: <20040915012521.GB17930@zax> On Mon, Sep 13, 2004 at 06:01:22PM -0500, Joel Schopp wrote: > >First of all, if you do that, doesn't it assume that all ptes that are > >passed in are large pages? I don't think that's correct. > > > >Also, think about how huge pages are implemented in Linux. Do huge > >pages really even get Linux ptes, or just pmds that act like ptes? > > It appears that large pages only have pmds (as evidenced by hugepte_t > being an int and not a long among other things). It turns out there > already is a flush_hash_hugepage that does what I am trying to do. > Figuring out which to call isn't too bad either. > if(in_hugepage_area(context, ea)) should do the trick. Yes, that should do it. > Might be nice to clean up flush_hash_page to totally remove the large > variable and associated code, and add a comment about flush_hash_hugepage. Yes indeed - I've thought about doing that several times. However, I've been thinking about reworking the hugepage stuff to use normal pte_t entries in its own set of pagetables, rather than special pmd_t entries. 
In that case it would probably make sense to make flush_hash_page() support hugepages as well. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From ananth at in.ibm.com Wed Sep 15 14:32:51 2004 From: ananth at in.ibm.com (Ananth N Mavinakayanahalli) Date: Wed, 15 Sep 2004 09:32:51 +0500 Subject: linux-2.6.9-rc* ppc64 broken on UNI? In-Reply-To: <1095209607.2543.419.camel@gaston> References: <20040914065627.GA10113@in.ibm.com> <1095209607.2543.419.camel@gaston> Message-ID: <20040915043251.GA12361@in.ibm.com> On Wed, Sep 15, 2004 at 11:14:49AM +1000, Benjamin Herrenschmidt wrote: > On Tue, 2004-09-14 at 16:56, Ananth N Mavinakayanahalli wrote: > > Hi, > > > > I am using linux-2.6.9-rc* on an old Power3 box here and looks like > > the tree is broken for UNI (SMP works fine). I am not too familiar > > with the fpu stuff to figure out the issue myself. > > > > I get the following exception: > > Can you get a backtrace too please ? > Here is what I got... 
mon> e Vector: 300 (Data Access) at [c00000003dc6bbb0] pc: c00000000000b8b0: copy_to_here+0xb0/0x16c lr: 00000080000d6dc8 sp: c00000003dc6be30 msr: a000000000003032 dar: 108 dsisr: 40000000 current = 0xc00000003da30140 paca = 0xc000000000345500 pid = 1482, comm = syslogd mon> r R00 = 00000080000d6dc8 R16 = 0000000000000000 R01 = c00000003dc6be30 R17 = 0000000000000000 R02 = c000000000453578 R18 = 000000000000001e R03 = c000000000454e68 R19 = 000001fffffff1e0 R04 = c00000000ffa1bd8 R20 = 000001fffffff260 R05 = 0000000000000000 R21 = 000001fffffff290 R06 = 0000000000000000 R22 = 00000080001bbaf0 R07 = 0000000000000000 R23 = 000000001001a3e0 R08 = 000000000000001e R24 = 0000000000000000 R09 = 0000000000000801 R25 = 000001ffffffe7f0 R10 = 0000000000000000 R26 = 0000000010007848 R11 = 7265677368657265 R27 = 000001ffffffe658 R12 = a00000000000d032 R28 = 000001ffffffe530 R13 = c000000000345500 R29 = 0000000000000000 R14 = 0000000010010000 R30 = 0000000000000014 R15 = 0000000010010000 R31 = 0000000010019c48 pc = c00000000000b8b0 copy_to_here+0xb0/0x16c lr = 00000080000d6dc8 msr = a000000000003032 cr = 88000488 ctr = 00000080000bd228 xer = 0000000020000000 trap = 300 mon> s S msr = a000000000001032 sprg0= 0000000000000000 pvr = 0000000000410104 sprg1= c000000000345500 dec = 0000000045dbaffe sprg2= a00000000000d032 sp = c00000003dc6b3e0 sprg3= c000000000345500 toc = c000000000453578 dar = 0000000000000108 srr0 = c00000000000ab80 srr1 = a000000000001032 mon> t [c00000003dc6be30] c00000000000fd80 syscall_exit+0x0/0x18 (unreliable) --- Exception: 801 (FPU Unavailable) at 00000080000d6dd4 SP (1ffffffe3d0) is in userspace mon> zr Hope it helps! 
Thanks, Ananth From david at gibson.dropbear.id.au Wed Sep 15 14:28:27 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 15 Sep 2004 14:28:27 +1000 Subject: Resending: [PATCH] PPC64: New version of EEH notifier code In-Reply-To: <20040913200539.GD9645@austin.ibm.com> References: <20040913200539.GD9645@austin.ibm.com> Message-ID: <20040915042827.GA19805@zax> On Mon, Sep 13, 2004 at 03:05:39PM -0500, Linas Vepstas wrote: > > Resending: > > > > > Paul, I picked up the eeh notifier call-chain patch from > http://ozlabs.org/ppc64-patches/ patch 239, I believe. > Because it doesn't apply cleanly any more, I whacked > on it a bit to get it to apply; the result is below. In fact this version doesn't apply any more either. Version below seems to fix that. Index: working-2.6/arch/ppc64/kernel/eeh.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/eeh.c 2004-09-15 10:53:53.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/eeh.c 2004-09-15 14:12:04.959680728 +1000 @@ -17,29 +17,79 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include #include +#include +#include +#include #include #include -#include -#include #include -#include #include -#include -#include -#include +#include +#include #include #include -#include #include +#include #include "pci.h" #undef DEBUG +/** Overview: + * EEH, or "Extended Error Handling" is a PCI bridge technology for + * dealing with PCI bus errors that can't be dealt with within the + * usual PCI framework, except by check-stopping the CPU. Systems + * that are designed for high-availability/reliability cannot afford + * to crash due to a "mere" PCI error, thus the need for EEH.
+ * An EEH-capable bridge operates by converting a detected error + * into a "slot freeze", taking the PCI adapter off-line, making + * the slot behave, from the OS's point of view, as if the slot + * were "empty": all reads return 0xff's and all writes are silently + * ignored. EEH slot isolation events can be triggered by parity + * errors on the address or data busses (e.g. during posted writes), + * which in turn might be caused by dust, vibration, humidity, + * radioactivity or plain-old failed hardware. + * + * Note, however, that one of the leading causes of EEH slot + * freeze events is buggy device drivers, buggy device microcode, + * or buggy device hardware. This is because any attempt by the + * device to bus-master data to a memory address that is not + * assigned to the device will trigger a slot freeze. (The idea + * is to prevent devices-gone-wild from corrupting system memory). + * Buggy hardware/drivers will have a miserable time co-existing + * with EEH. + * + * Ideally, a PCI device driver, when suspecting that an isolation + * event has occurred (e.g. by reading 0xff's), will then ask EEH + * whether this is the case, and then take appropriate steps to + * reset the PCI slot, the PCI device, and then resume operations. + * However, until that day, the checking is done here, with the + * eeh_check_failure() routine embedded in the MMIO macros. If + * the slot is found to be isolated, an "EEH Event" is synthesized + * and sent out for processing. + */ + +/** Bus Unit ID macros; get low and hi 32-bits of the 64-bit BUID */ #define BUID_HI(buid) ((buid) >> 32) #define BUID_LO(buid) ((buid) & 0xffffffff) -#define CONFIG_ADDR(busno, devfn) \ - (((((busno) & 0xff) << 8) | ((devfn) & 0xf8)) << 8) + +/* EEH event workqueue setup.
*/ +static spinlock_t eeh_eventlist_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(eeh_eventlist); +static void eeh_event_handler(void *); +DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); + +static struct notifier_block *eeh_notifier_chain; + +/* + * If a device driver keeps reading an MMIO register in an interrupt + * handler after a slot isolation event has occurred, we assume it + * is broken and panic. This sets the threshold for how many read + * attempts we allow before panicking. + */ +#define EEH_MAX_FAILS 1000 +static atomic_t eeh_fail_count; /* RTAS tokens */ static int ibm_set_eeh_option; @@ -58,13 +108,15 @@ static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); +static DEFINE_PER_CPU(unsigned long, slot_resets); /** * The pci address cache subsystem. This subsystem places * PCI device address resources into a red-black tree, sorted * according to the address range, so that given only an i/o * address, the corresponding PCI device can be **quickly** - * found. + * found. It is safe to perform an address lookup in an interrupt + * context; this ability is an important feature. * * Currently, the only customer of this code is the EEH subsystem; * thus, this code has been somewhat tailored to suit EEH better. @@ -333,6 +385,94 @@ #endif } +/* --------------------------------------------------------------- */ +/* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ + +/** + * eeh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int eeh_register_notifier(struct notifier_block *nb) +{ + return notifier_chain_register(&eeh_notifier_chain, nb); +} + +/** + * eeh_unregister_notifier - Unregister to an EEH event notifier. 
+ * @nb: notifier block to callback on events + */ +int eeh_unregister_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&eeh_notifier_chain, nb); +} + +/** + * eeh_panic - call panic() for an eeh event that cannot be handled. + * The philosophy of this routine is that it is better to panic and + * halt the OS than it is to risk possible data corruption by + * oblivious device drivers that don't know better. + * + * @dev pci device that had an eeh event + * @reset_state current reset state of the device slot + */ +static void eeh_panic(struct pci_dev *dev, int reset_state) +{ + /* + * XXX We should create a separate sysctl for this. + * + * Since the panic_on_oops sysctl is used to halt the system + * in light of potential corruption, we can use it here. + */ + if (panic_on_oops) + panic("EEH: MMIO failure (%d) on device:%s %s\n", reset_state, + pci_name(dev), pci_pretty_name(dev)); + else { + __get_cpu_var(ignored_failures)++; + printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s %s\n", + reset_state, pci_name(dev), pci_pretty_name(dev)); + } +} + +/** + * eeh_event_handler - dispatch EEH events. The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context.
+ * + * @dummy - unused + */ +static void eeh_event_handler(void *dummy) +{ + unsigned long flags; + struct eeh_event *event; + + while (1) { + spin_lock_irqsave(&eeh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&eeh_eventlist)) { + event = list_entry(eeh_eventlist.next, struct eeh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "EEH: MMIO failure (%d), notifying device " + "%s %s\n", event->reset_state, + pci_name(event->dev), pci_pretty_name(event->dev)); + + atomic_set(&eeh_fail_count, 0); + notifier_call_chain (&eeh_notifier_chain, + EEH_NOTIFY_FREEZE, event); + + __get_cpu_var(slot_resets)++; + + pci_dev_put(event->dev); + kfree(event); + } +} + /** * eeh_token_to_phys - convert EEH address token to phys address * @token i/o token, should be address in the form 0xA.... @@ -364,11 +504,11 @@ * * Check for an EEH failure for the given device node. Call this * routine if the result of a read was all 0xff's and you want to - * find out if this is due to an EEH slot freeze event. This routine + * find out if this is due to an EEH slot freeze. This routine * will query firmware for the EEH status. * * Returns 0 if there has not been an EEH error; otherwise returns - * an error code. + * a non-zero value and queues up a slot isolation event notification. * * It is safe to call this routine in an interrupt context. */ @@ -377,6 +517,8 @@ int ret; int rets[2]; unsigned long flags; + int rc, reset_state; + struct eeh_event *event; __get_cpu_var(total_mmio_ffs)++; @@ -395,6 +537,24 @@ if (!dn->eeh_config_addr) { return 0; } + + /* + * If we already have a pending isolation event for this + * slot, we know it's bad already, we don't need to check...
+ */ + if (dn->eeh_mode & EEH_MODE_ISOLATED) { + atomic_inc(&eeh_fail_count); + if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { + /* re-read the slot reset state */ + rets[0] = -1; + rtas_call(ibm_read_slot_reset_state, 3, 3, rets, + dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid)); + eeh_panic(dev, rets[0]); + } + return 0; + } /* * Now test for an EEH failure. This is VERY expensive. @@ -407,47 +567,54 @@ dn->eeh_config_addr, BUID_HI(dn->phb->buid), BUID_LO(dn->phb->buid)); - if (ret == 0 && rets[1] == 1 && (rets[0] == 2 || rets[0] == 4)) { - int log_event; - - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); - - log_event = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, dn->eeh_config_addr, - BUID_HI(dn->phb->buid), - BUID_LO(dn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - 1 /* Temporary Error */); - - if (log_event == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, - 1 /* Fatal */); - - spin_unlock_irqrestore(&slot_errbuf_lock, flags); - - printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", - rets[0], dn->name, dn->full_name); - WARN_ON(1); - - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt - * the system in light of potential corruption, we - * can use it here. 
- */ - if (panic_on_oops) { - panic("EEH: MMIO failure (%d) on device: %s %s\n", - rets[0], dn->name, dn->full_name); - } else { - __get_cpu_var(ignored_failures)++; - } - } else { + if (!(ret == 0 && rets[1] == 1 && (rets[0] == 2 || rets[0] == 4))) { __get_cpu_var(false_positives)++; + return 0; } + /* prevent repeated reports of this failure */ + dn->eeh_mode |= EEH_MODE_ISOLATED; + + reset_state = rets[0]; + + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + 1 /* Temporary Error */); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); + + printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", + rets[0], dn->name, dn->full_name); + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + eeh_panic(dev, reset_state); + return 1; + } + + event->dev = dev; + event->dn = dn; + event->reset_state = reset_state; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&eeh_eventlist_lock, flags); + list_add(&event->list, &eeh_eventlist); + spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + + /* Most EEH events are due to device driver bugs. Having + * a stack trace will help the device-driver authors figure + * out what happened. So print that out. 
*/ + dump_stack(); + schedule_work(&eeh_event_wq); + return 0; } @@ -863,11 +1030,13 @@ { unsigned int cpu; unsigned long ffs = 0, positives = 0, failures = 0; + unsigned long resets = 0; for_each_cpu(cpu) { ffs += per_cpu(total_mmio_ffs, cpu); positives += per_cpu(false_positives, cpu); failures += per_cpu(ignored_failures, cpu); + resets += per_cpu(slot_resets, cpu); } if (0 == eeh_subsystem_enabled) { @@ -877,8 +1046,11 @@ seq_printf(m, "EEH Subsystem is enabled\n"); seq_printf(m, "eeh_total_mmio_ffs=%ld\n" "eeh_false_positives=%ld\n" - "eeh_ignored_failures=%ld\n", - ffs, positives, failures); + "eeh_ignored_failures=%ld\n" + "eeh_slot_resets=%ld\n" + "eeh_fail_count=%d\n", + ffs, positives, failures, resets, + eeh_fail_count.counter); } return 0; Index: working-2.6/include/asm-ppc64/eeh.h =================================================================== --- working-2.6.orig/include/asm-ppc64/eeh.h 2004-09-13 10:30:38.000000000 +1000 +++ working-2.6/include/asm-ppc64/eeh.h 2004-09-15 14:12:04.960680576 +1000 @@ -20,8 +20,10 @@ #ifndef _PPC64_EEH_H #define _PPC64_EEH_H -#include #include +#include +#include +#include struct pci_dev; struct device_node; @@ -41,6 +43,7 @@ /* Values for eeh_mode bits in device_node */ #define EEH_MODE_SUPPORTED (1<<0) #define EEH_MODE_NOCHECK (1<<1) +#define EEH_MODE_ISOLATED (1<<2) extern void __init eeh_init(void); unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val); @@ -76,7 +79,28 @@ #define EEH_RELEASE_DMA 3 int eeh_set_option(struct pci_dev *dev, int options); -/* + +/** + * Notifier event flags. + */ +#define EEH_NOTIFY_FREEZE 1 + +/** EEH event -- structure holding pci slot data that describes + * a change in the isolation status of a PCI slot. A pointer + * to this struct is passed as the data pointer in a notify callback. 
+ */ +struct eeh_event { + struct list_head list; + struct pci_dev *dev; + struct device_node *dn; + int reset_state; +}; + +/** Register to find out about EEH events. */ +int eeh_register_notifier(struct notifier_block *nb); +int eeh_unregister_notifier(struct notifier_block *nb); + +/** * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. * * Order this macro for performance. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From anton at samba.org Wed Sep 15 22:25:00 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 15 Sep 2004 22:25:00 +1000 Subject: [PATCH] ppc64: Remove A() and AA() Message-ID: <20040915122500.GG5615@krispykreme> Remove the A() and AA() macros. Now we have compat_ptr we should be using that. Signed-off-by: Anton Blanchard diff -puN include/asm-ppc64/ppc32.h~remove_A_and_AA include/asm-ppc64/ppc32.h --- 2.6.9-rc1-mm5/include/asm-ppc64/ppc32.h~remove_A_and_AA 2004-09-14 15:46:08.566253864 +1000 +++ 2.6.9-rc1-mm5-anton/include/asm-ppc64/ppc32.h 2004-09-14 15:46:37.434377476 +1000 @@ -14,30 +14,6 @@ * 2 of the License, or (at your option) any later version. */ -/* Use this to get at 32-bit user passed pointers. */ -/* Things to consider: the low-level assembly stub does - srl x, 0, x for first four arguments, so if you have - pointer to something in the first four arguments, just - declare it as a pointer, not u32. On the other side, - arguments from 5th onwards should be declared as u32 - for pointers, and need AA() around each usage. - A() macro should be used for places where you e.g. - have some internal variable u32 and just want to get - rid of a compiler warning. AA() has to be used in - places where you want to convert a function argument - to 32bit pointer or when you e.g. access pt_regs - structure and want to consider 32bit registers only. 
- - - */ -#define A(__x) ((unsigned long)(__x)) -#define AA(__x) \ -({ unsigned long __ret; \ - __asm__ ("clrldi %0, %0, 32" \ - : "=r" (__ret) \ - : "0" (__x)); \ - __ret; \ -}) - /* These are here to support 32-bit syscalls on a 64-bit kernel. */ typedef struct compat_siginfo { diff -puN arch/ppc64/kernel/sys_ppc32.c~remove_A_and_AA arch/ppc64/kernel/sys_ppc32.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/sys_ppc32.c~remove_A_and_AA 2004-09-14 15:46:45.022847819 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/sys_ppc32.c 2004-09-14 18:34:51.658432705 +1000 @@ -72,7 +72,6 @@ #include #include #include -#include #include #include "pci.h" @@ -700,7 +699,7 @@ asmlinkage int sys32_pciconfig_read(u32 (unsigned long) dfn, (unsigned long) off, (unsigned long) len, - (unsigned char __user *)AA(ubuf)); + compat_ptr(ubuf)); } asmlinkage int sys32_pciconfig_write(u32 bus, u32 dfn, u32 off, u32 len, u32 ubuf) @@ -709,7 +708,7 @@ asmlinkage int sys32_pciconfig_write(u32 (unsigned long) dfn, (unsigned long) off, (unsigned long) len, - (unsigned char __user *)AA(ubuf)); + compat_ptr(ubuf)); } #define IOBASE_BRIDGE_NUMBER 0 @@ -1095,7 +1094,7 @@ struct __sysctl_args32 { u32 __unused[4]; }; -extern asmlinkage long sys32_sysctl(struct __sysctl_args32 __user *args) +asmlinkage long sys32_sysctl(struct __sysctl_args32 __user *args) { struct __sysctl_args32 tmp; int error; @@ -1114,19 +1113,20 @@ extern asmlinkage long sys32_sysctl(stru glibc's __sysctl uses rw memory for the structure anyway. 
*/ oldlenp = (size_t __user *)addr; - if (get_user(oldlen, (u32 __user *)A(tmp.oldlenp)) || + if (get_user(oldlen, (compat_size_t __user *)compat_ptr(tmp.oldlenp)) || put_user(oldlen, oldlenp)) return -EFAULT; } lock_kernel(); - error = do_sysctl((int __user *)A(tmp.name), tmp.nlen, (void __user *)A(tmp.oldval), - oldlenp, (void __user *)A(tmp.newval), tmp.newlen); + error = do_sysctl(compat_ptr(tmp.name), tmp.nlen, + compat_ptr(tmp.oldval), oldlenp, + compat_ptr(tmp.newval), tmp.newlen); unlock_kernel(); if (oldlenp) { if (!error) { if (get_user(oldlen, oldlenp) || - put_user(oldlen, (u32 __user *)A(tmp.oldlenp))) + put_user(oldlen, (compat_size_t __user *)compat_ptr(tmp.oldlenp))) error = -EFAULT; } copy_to_user(args->__unused, tmp.__unused, sizeof(tmp.__unused)); diff -puN arch/ppc64/kernel/ioctl32.c~remove_A_and_AA arch/ppc64/kernel/ioctl32.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/ioctl32.c~remove_A_and_AA 2004-09-14 18:33:57.830408888 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/ioctl32.c 2004-09-14 18:34:36.949325869 +1000 @@ -22,9 +22,7 @@ #define INCLUDES #include "compat_ioctl.c" -#include #include -#include #define CODE #include "compat_ioctl.c" diff -puN arch/ppc64/kernel/signal.c~remove_A_and_AA arch/ppc64/kernel/signal.c --- 2.6.9-rc1-mm5/arch/ppc64/kernel/signal.c~remove_A_and_AA 2004-09-14 18:34:03.085794964 +1000 +++ 2.6.9-rc1-mm5-anton/arch/ppc64/kernel/signal.c 2004-09-14 18:34:46.040943197 +1000 @@ -26,7 +26,6 @@ #include #include #include -#include #include #include #include _ From johnrose at austin.ibm.com Thu Sep 16 08:21:00 2004 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 15 Sep 2004 17:21:00 -0500 Subject: [PATCH] dynamic addition of PCI Host bridges Message-ID: <1095286859.4514.16.camel@sinatra.austin.ibm.com> The following patch implements the ppc64-specific bits for dynamic (DLPAR) addition of PCI Host Bridges. The entry point for this operation is init_phb_dynamic(), which will be called by the RPA DLPAR driver. 
Among the implementation details, the global number aka PCI domain for the newly added PHB is assigned using the same simple counter that assigns it at boot. This has two consequences. First, the PCI domain associated with a PHB will not persist across DLPAR remove and subsequent add. Second, stress tests that repeatedly add/remove PHBs might generate some large values for PCI domain. If we decide at a later point to hash an OF property to PCI domain value, this can be easily fixed up. Also, the linux,pci-domain property is not generated for the newly added PHBs at the moment. Because there doesn't seem to be an easy way to dynamically add single properties to the OFDT, and because the userspace dependency on this property is being questioned, I've ignored it for now. If we decide on a solution for this at a later point, it can also be easily fixed up. Comments welcome! Thanks- John Signed-off-by: John Rose diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Wed Sep 15 17:10:21 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Wed Sep 15 17:10:21 2004 @@ -41,6 +41,7 @@ #include #include #include +#include #include "open_pic.h" #include "pci.h" @@ -205,9 +206,9 @@ #define ISA_SPACE_MASK 0x1 #define ISA_SPACE_IO 0x1 -static void pci_process_ISA_OF_ranges(struct device_node *isa_node, - unsigned long phb_io_base_phys, - void * phb_io_base_virt) +static void __devinit pci_process_ISA_OF_ranges(struct device_node *isa_node, + unsigned long phb_io_base_phys, + void * phb_io_base_virt) { struct isa_range *range; unsigned long pci_addr; @@ -250,9 +251,8 @@ } } -static void __init pci_process_bridge_OF_ranges(struct pci_controller *hose, - struct device_node *dev, - int primary) +static void __devinit pci_process_bridge_OF_ranges(struct pci_controller *hose, + struct device_node *dev) { unsigned int *ranges; unsigned long size; @@ -261,7 +261,6 @@ struct resource *res; int np, na = prom_n_addr_cells(dev); unsigned 
long pci_addr, cpu_phys_addr; - struct device_node *isa_dn; np = na + 5; @@ -289,31 +288,10 @@ switch (ranges[0] >> 24) { case 1: /* I/O space */ hose->io_base_phys = cpu_phys_addr; - hose->io_base_virt = reserve_phb_iospace(size); - PPCDBG(PPCDBG_PHBINIT, - "phb%d io_base_phys 0x%lx io_base_virt 0x%lx\n", - hose->global_number, hose->io_base_phys, - (unsigned long) hose->io_base_virt); - - if (primary) { - pci_io_base = (unsigned long)hose->io_base_virt; - isa_dn = of_find_node_by_type(NULL, "isa"); - if (isa_dn) { - isa_io_base = pci_io_base; - pci_process_ISA_OF_ranges(isa_dn, - hose->io_base_phys, - hose->io_base_virt); - of_node_put(isa_dn); - /* Allow all IO */ - io_page_mask = -1; - } - } res = &hose->io_resource; res->flags = IORESOURCE_IO; res->start = pci_addr; - res->start += (unsigned long)hose->io_base_virt - - pci_io_base; break; case 2: /* memory space */ memno = 0; @@ -340,6 +318,55 @@ } } +static void __init pci_setup_phb_io(struct pci_controller *hose, int primary) +{ + unsigned long size = hose->pci_io_size; + unsigned long io_virt_offset; + struct resource *res; + struct device_node *isa_dn; + + hose->io_base_virt = reserve_phb_iospace(size); + PPCDBG(PPCDBG_PHBINIT, "phb%d io_base_phys 0x%lx io_base_virt 0x%lx\n", + hose->global_number, hose->io_base_phys, + (unsigned long) hose->io_base_virt); + + if (primary) { + pci_io_base = (unsigned long)hose->io_base_virt; + isa_dn = of_find_node_by_type(NULL, "isa"); + if (isa_dn) { + isa_io_base = pci_io_base; + pci_process_ISA_OF_ranges(isa_dn, hose->io_base_phys, + hose->io_base_virt); + of_node_put(isa_dn); + /* Allow all IO */ + io_page_mask = -1; + } + } + + io_virt_offset = (unsigned long)hose->io_base_virt - pci_io_base; + res = &hose->io_resource; + res->start += io_virt_offset; + res->end += io_virt_offset; +} + +static void pci_setup_phb_io_dynamic(struct pci_controller *hose) +{ + unsigned long size = hose->pci_io_size; + unsigned long io_virt_offset; + struct resource *res; + + 
hose->io_base_virt = __ioremap(hose->io_base_phys, size, + _PAGE_NO_CACHE); + PPCDBG(PPCDBG_PHBINIT, "phb%d io_base_phys 0x%lx io_base_virt 0x%lx\n", + hose->global_number, hose->io_base_phys, + (unsigned long) hose->io_base_virt); + + io_virt_offset = (unsigned long)hose->io_base_virt - pci_io_base; + res = &hose->io_resource; + res->start += io_virt_offset; + res->end += io_virt_offset; +} + static void python_countermeasures(unsigned long addr) { void *chip_regs; @@ -379,7 +406,7 @@ ibm_write_pci_config = rtas_token("ibm,write-pci-config"); } -unsigned long __init get_phb_buid (struct device_node *phb) +unsigned long __devinit get_phb_buid (struct device_node *phb) { int addr_cells; unsigned int *buid_vals; @@ -409,49 +436,86 @@ return buid; } -static struct pci_controller * __init alloc_phb(struct device_node *dev, - unsigned int addr_size_words) +enum phb_types get_phb_type(struct device_node *dev) { - struct pci_controller *phb; - unsigned int *ui_ptr = NULL, len; - struct reg_property64 reg_struct; - int *bus_range; + enum phb_types type; char *model; - enum phb_types phb_type; - struct property *of_prop; model = (char *)get_property(dev, "model", NULL); if (!model) { - printk(KERN_ERR "alloc_phb: phb has no model property\n"); + printk(KERN_ERR "%s: phb has no model property\n", + __FUNCTION__); model = ""; } + if (strstr(model, "Python")) { + type = phb_type_python; + } else if (strstr(model, "Speedwagon")) { + type = phb_type_speedwagon; + } else if (strstr(model, "Winnipeg")) { + type = phb_type_winnipeg; + } else { + printk(KERN_ERR "%s: unknown PHB %s\n", __FUNCTION__, model); + type = phb_type_unknown; + } + + return type; +} + +int get_phb_reg_prop(struct device_node *dev, unsigned int addr_size_words, + struct reg_property64 *reg) +{ + unsigned int *ui_ptr = NULL, len; + /* Found a PHB, now figure out where his registers are mapped. 
*/ ui_ptr = (unsigned int *) get_property(dev, "reg", &len); if (ui_ptr == NULL) { PPCDBG(PPCDBG_PHBINIT, "\tget reg failed.\n"); - return NULL; + return 1; } if (addr_size_words == 1) { - reg_struct.address = ((struct reg_property32 *)ui_ptr)->address; - reg_struct.size = ((struct reg_property32 *)ui_ptr)->size; + reg->address = ((struct reg_property32 *)ui_ptr)->address; + reg->size = ((struct reg_property32 *)ui_ptr)->size; } else { - reg_struct = *((struct reg_property64 *)ui_ptr); + *reg = *((struct reg_property64 *)ui_ptr); } - if (strstr(model, "Python")) { - phb_type = phb_type_python; - } else if (strstr(model, "Speedwagon")) { - phb_type = phb_type_speedwagon; - } else if (strstr(model, "Winnipeg")) { - phb_type = phb_type_winnipeg; - } else { - printk(KERN_ERR "alloc_phb: unknown PHB %s\n", model); - phb_type = phb_type_unknown; + return 0; +} + +int phb_set_bus_ranges(struct device_node *dev, struct pci_controller *phb) +{ + int *bus_range; + unsigned int len; + + bus_range = (int *) get_property(dev, "bus-range", &len); + if (bus_range == NULL || len < 2 * sizeof(int)) { + return 1; } + phb->first_busno = bus_range[0]; + phb->last_busno = bus_range[1]; + + return 0; +} + +struct pci_controller *alloc_phb(struct device_node *dev, + unsigned int addr_size_words) +{ + struct pci_controller *phb; + struct reg_property64 reg_struct; + enum phb_types phb_type; + struct property *of_prop; + int rc; + + phb_type = get_phb_type(dev); + + rc = get_phb_reg_prop(dev, addr_size_words, ®_struct); + if (rc) + return NULL; + phb = pci_alloc_pci_controller(phb_type); if (phb == NULL) return NULL; @@ -459,11 +523,9 @@ if (phb_type == phb_type_python) python_countermeasures(reg_struct.address); - bus_range = (int *) get_property(dev, "bus-range", &len); - if (bus_range == NULL || len < 2 * sizeof(int)) { - kfree(phb); - return NULL; - } + rc = phb_set_bus_ranges(dev, phb); + if (rc) + return NULL; of_prop = (struct property *)alloc_bootmem(sizeof(struct property) + 
sizeof(phb->global_number)); @@ -480,9 +542,6 @@ memcpy(of_prop->value, &phb->global_number, sizeof(phb->global_number)); prom_add_property(dev, of_prop); - phb->first_busno = bus_range[0]; - phb->last_busno = bus_range[1]; - phb->arch_data = dev; phb->ops = &rtas_pci_ops; @@ -491,6 +550,42 @@ return phb; } +struct pci_controller * __devinit alloc_phb_dynamic(struct device_node *dev, + unsigned int addr_size_words) +{ + struct pci_controller *phb; + struct reg_property64 reg_struct; + enum phb_types phb_type; + int rc; + + phb_type = get_phb_type(dev); + + rc = get_phb_reg_prop(dev, addr_size_words, ®_struct); + if (rc) + return NULL; + + phb = pci_alloc_phb_dynamic(phb_type); + if (phb == NULL) + return NULL; + + if (phb_type == phb_type_python) + python_countermeasures(reg_struct.address); + + rc = phb_set_bus_ranges(dev, phb); + if (rc) + return NULL; + + /* TODO: linux,pci-domain? */ + + phb->arch_data = dev; + phb->ops = &rtas_pci_ops; + + phb->buid = get_phb_buid(dev); + + return phb; +} +EXPORT_SYMBOL(alloc_phb_dynamic); + unsigned long __init find_and_init_phbs(void) { struct device_node *node; @@ -519,7 +614,8 @@ if (!phb) continue; - pci_process_bridge_OF_ranges(phb, node, index == 0); + pci_process_bridge_OF_ranges(phb, node); + pci_setup_phb_io(phb, index == 0); if (naca->interrupt_controller == IC_OPEN_PIC) { int addr = root_size_cells * (index + 2) - 1; @@ -535,6 +631,34 @@ return 0; } +struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) +{ + struct device_node *root = of_find_node_by_path("/"); + unsigned int root_size_cells = 0; + struct pci_controller *phb; + struct pci_bus *bus; + + root_size_cells = prom_n_size_cells(root); + + phb = alloc_phb_dynamic(dn, root_size_cells); + if (!phb) + return NULL; + + pci_process_bridge_OF_ranges(phb, dn); + + pci_setup_phb_io_dynamic(phb); + of_node_put(root); + + pci_devs_phb_init_dynamic(phb); + phb->last_busno = 0xff; + bus = pci_scan_bus(phb->first_busno, phb->ops, phb->arch_data); + 
phb->bus = bus; + phb->last_busno = bus->subordinate; + + return phb; +} +EXPORT_SYMBOL(init_phb_dynamic); + #if 0 void pcibios_name_device(struct pci_dev *dev) { @@ -610,7 +734,6 @@ if (!dev) { /* Root bus. */ - hose->bus = bus; bus->resource[0] = res = &hose->io_resource; if (!res->flags) @@ -743,7 +866,7 @@ } EXPORT_SYMBOL(remap_bus_range); -static void phbs_fixup_io(void) +static void phbs_remap_io(void) { struct pci_controller *hose, *tmp; @@ -762,7 +885,7 @@ while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) pci_read_irq_line(dev); - phbs_fixup_io(); + phbs_remap_io(); chrp_request_regions(); pci_fix_bus_sysdata(); if (!ppc64_iommu_off) @@ -788,6 +911,7 @@ } return NULL; } +EXPORT_SYMBOL(pci_find_hose_for_OF_device); /* * ppc64 can have multifunction devices that do not respond to function 0. diff -Nru a/arch/ppc64/kernel/pci.c b/arch/ppc64/kernel/pci.c --- a/arch/ppc64/kernel/pci.c Wed Sep 15 17:10:21 2004 +++ b/arch/ppc64/kernel/pci.c Wed Sep 15 17:10:21 2004 @@ -172,26 +172,11 @@ res->start = start; } -/* - * Allocate pci_controller(phb) initialized common variables. 
- */ -struct pci_controller * __init -pci_alloc_pci_controller(enum phb_types controller_type) +void +phb_set_model(struct pci_controller *hose, enum phb_types controller_type) { - struct pci_controller *hose; char *model; -#ifdef CONFIG_PPC_ISERIES - hose = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), GFP_KERNEL); -#else - hose = (struct pci_controller *)alloc_bootmem(sizeof(struct pci_controller)); -#endif - if(hose == NULL) { - printk(KERN_ERR "PCI: Allocate pci_controller failed.\n"); - return NULL; - } - memset(hose, 0, sizeof(struct pci_controller)); - switch(controller_type) { #ifdef CONFIG_PPC_ISERIES case phb_type_hypervisor: @@ -219,12 +204,65 @@ strcpy(hose->what,model); else memcpy(hose->what,model,7); - hose->type = controller_type; - hose->global_number = global_phb_number++; +} + +/* + * Allocate pci_controller(phb) initialized common variables. + */ +struct pci_controller * __init +pci_alloc_pci_controller(enum phb_types controller_type) +{ + struct pci_controller *hose; + +#ifdef CONFIG_PPC_ISERIES + hose = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), + GFP_KERNEL); +#else + hose = (struct pci_controller *)alloc_bootmem(sizeof(struct pci_controller)); +#endif + if (hose == NULL) { + printk(KERN_ERR "PCI: Allocate pci_controller failed.\n"); + return NULL; + } + memset(hose, 0, sizeof(struct pci_controller)); + + phb_set_model(hose, controller_type); + + hose->is_dynamic = 0; + hose->type = controller_type; + hose->global_number = global_phb_number++; + + list_add_tail(&hose->list_node, &hose_list); + + return hose; +} + +/* + * Dynamically allocate pci_controller(phb), initialize common variables.
+ */ +struct pci_controller * +pci_alloc_phb_dynamic(enum phb_types controller_type) +{ + struct pci_controller *hose; + + hose = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), + GFP_KERNEL); + if(hose == NULL) { + printk(KERN_ERR "PCI: Allocate pci_controller failed.\n"); + return NULL; + } + memset(hose, 0, sizeof(struct pci_controller)); + + phb_set_model(hose, controller_type); + + hose->is_dynamic = 1; + + hose->type = controller_type; + hose->global_number = global_phb_number++; list_add_tail(&hose->list_node, &hose_list); - return hose; + return hose; } static void __init pcibios_claim_one_bus(struct pci_bus *b) diff -Nru a/arch/ppc64/kernel/pci.h b/arch/ppc64/kernel/pci.h --- a/arch/ppc64/kernel/pci.h Wed Sep 15 17:10:21 2004 +++ b/arch/ppc64/kernel/pci.h Wed Sep 15 17:10:21 2004 @@ -15,6 +15,7 @@ extern unsigned long isa_io_base; extern struct pci_controller* pci_alloc_pci_controller(enum phb_types controller_type); +extern struct pci_controller* pci_alloc_phb_dynamic(enum phb_types controller_type); extern struct pci_controller* pci_find_hose_for_OF_device(struct device_node* node); extern struct list_head hose_list; @@ -36,6 +37,7 @@ void *data); void pci_devs_phb_init(void); +void pci_devs_phb_init_dynamic(struct pci_controller *phb); void pci_fix_bus_sysdata(void); struct device_node *fetch_dev_dn(struct pci_dev *dev); diff -Nru a/arch/ppc64/kernel/pci_dn.c b/arch/ppc64/kernel/pci_dn.c --- a/arch/ppc64/kernel/pci_dn.c Wed Sep 15 17:10:21 2004 +++ b/arch/ppc64/kernel/pci_dn.c Wed Sep 15 17:10:21 2004 @@ -42,7 +42,7 @@ * Traverse_func that inits the PCI fields of the device node. * NOTE: this *must* be done before read/write config to the device. 
*/ -static void * __init update_dn_pci_info(struct device_node *dn, void *data) +static void * __devinit update_dn_pci_info(struct device_node *dn, void *data) { struct pci_controller *phb = data; u32 *regs; @@ -193,6 +193,14 @@ traverse_all_pci_devices(update_dn_pci_info); } +void __devinit +pci_devs_phb_init_dynamic(struct pci_controller *phb) +{ + /* Update dn->phb ptrs for new phb and children devices */ + traverse_pci_devices((struct device_node *)phb->arch_data, + update_dn_pci_info, phb); + +} static void __init pci_fixup_bus_sysdata_list(struct list_head *bus_list) { diff -Nru a/include/asm-ppc64/pci-bridge.h b/include/asm-ppc64/pci-bridge.h --- a/include/asm-ppc64/pci-bridge.h Wed Sep 15 17:10:21 2004 +++ b/include/asm-ppc64/pci-bridge.h Wed Sep 15 17:10:21 2004 @@ -34,6 +34,7 @@ char what[8]; /* Eye catcher */ enum phb_types type; /* Type of hardware */ struct pci_bus *bus; + char is_dynamic; void *arch_data; struct list_head list_node; @@ -47,6 +48,7 @@ * the PCI memory space in the CPU bus space */ unsigned long pci_mem_offset; + unsigned long pci_io_size; struct pci_ops *ops; volatile unsigned int *cfg_addr; diff -Nru a/include/asm-ppc64/pci.h b/include/asm-ppc64/pci.h --- a/include/asm-ppc64/pci.h Wed Sep 15 17:10:21 2004 +++ b/include/asm-ppc64/pci.h Wed Sep 15 17:10:21 2004 @@ -229,6 +229,8 @@ extern void pcibios_fixup_device_resources(struct pci_dev *dev, struct pci_bus *bus); +extern struct pci_controller *init_phb_dynamic(struct device_node *dn); + extern int pci_read_irq_line(struct pci_dev *dev); extern void pcibios_add_platform_entries(struct pci_dev *dev); From anton at samba.org Fri Sep 17 03:59:42 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 17 Sep 2004 03:59:42 +1000 Subject: [PATCH] ppc64 Export probe_irq_mask Message-ID: <20040916175942.GC2825@krispykreme> yenta_socket wants probe_irq_mask, so export it. 
Signed-off-by: Anton Blanchard ===== arch/ppc64/kernel/irq.c 1.69 vs edited ===== --- 1.69/arch/ppc64/kernel/irq.c Tue Aug 31 17:57:57 2004 +++ edited/arch/ppc64/kernel/irq.c Fri Sep 17 03:02:13 2004 @@ -677,6 +677,8 @@ return 0; } +EXPORT_SYMBOL(probe_irq_mask); + void __init init_IRQ(void) { static int once = 0; From anton at samba.org Fri Sep 17 04:06:10 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 17 Sep 2004 04:06:10 +1000 Subject: [PATCH] ppc64 dont use state == SYSTEM_BOOTING Message-ID: <20040916180610.GD2825@krispykreme> From: Nathan Lynch Fedora has a patch which introduces a new system state during boot. Change system_state == SYSTEM_BOOTING to system_state < SYSTEM_RUNNING to match it. Signed-off-by: Anton Blanchard ===== arch/ppc64/kernel/smp.c 1.91 vs edited ===== --- 1.91/arch/ppc64/kernel/smp.c Tue Sep 14 10:23:14 2004 +++ edited/arch/ppc64/kernel/smp.c Fri Sep 17 02:55:45 2004 @@ -379,7 +379,7 @@ /* At boot time the cpus are already spinning in hold * loops, so nothing to do. */ - if (system_state == SYSTEM_BOOTING) + if (system_state < SYSTEM_RUNNING) return 1; pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu)); @@ -817,7 +817,7 @@ int c; /* At boot, don't bother with non-present cpus -JSCHOPP */ - if (system_state == SYSTEM_BOOTING && !cpu_present(cpu)) + if (system_state < SYSTEM_RUNNING && !cpu_present(cpu)) return -ENOENT; paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock; @@ -849,7 +849,7 @@ * use this value that I found through experimentation. 
* -- Cort */ - if (system_state == SYSTEM_BOOTING) + if (system_state < SYSTEM_RUNNING) for (c = 5000; c && !cpu_callin_map[cpu]; c--) udelay(100); #ifdef CONFIG_HOTPLUG_CPU From anton at samba.org Fri Sep 17 04:24:26 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 17 Sep 2004 04:24:26 +1000 Subject: [PATCH] ppc64 Fix hotplug CPU when building a pseries+pmac kernel Message-ID: <20040916182425.GE2825@krispykreme> When a pseries+pmac kernel is built, the rtas stop-self token wasn't being initialised. Since doing this will safely fail on pmac, remove the !CONFIG_PPC_PMAC restriction. Signed-off-by: Anton Blanchard ===== arch/ppc64/kernel/setup.c 1.76 vs edited ===== --- 1.76/arch/ppc64/kernel/setup.c Tue Sep 14 10:23:15 2004 +++ edited/arch/ppc64/kernel/setup.c Fri Sep 17 03:53:28 2004 @@ -429,9 +429,9 @@ #endif /* CONFIG_PPC_PSERIES */ #endif /* CONFIG_SMP */ -#if defined(CONFIG_HOTPLUG_CPU) && !defined(CONFIG_PPC_PMAC) +#if defined(CONFIG_HOTPLUG_CPU) rtas_stop_self_args.token = rtas_token("stop-self"); -#endif /* CONFIG_HOTPLUG_CPU && !CONFIG_PPC_PMAC */ +#endif /* CONFIG_HOTPLUG_CPU */ /* Finish initializing the hash table (do the dynamic * patching for the fast-path hashtable.S code) From anton at samba.org Fri Sep 17 04:26:44 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 17 Sep 2004 04:26:44 +1000 Subject: [PATCH] ppc64 Disable some drivers broken on 64bit Message-ID: <20040916182644.GF2825@krispykreme> The mace, bmac and dmasound drivers use virt_to_bus and so will not work on ppc64. Reflect this in the relevant Kconfig files.
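A note on the failure mode behind this change: on a 32-bit platform virt_to_bus() can get away with treating a CPU address as a bus address, but ppc64 kernel virtual addresses do not fit in 32 bits, so the implied narrowing silently drops the high bits. A user-space sketch of the problem (illustrative only; not the kernel's actual implementation):

```c
#include <stdint.h>

/* What a 32-bit-only driver effectively assumes: the CPU address of a
 * DMA buffer can be handed to the device as a 32-bit bus address. */
static uint32_t naive_virt_to_bus(uint64_t virt)
{
	return (uint32_t)virt; /* high 32 bits are silently dropped */
}

/* Returns 1 if the bus address still identifies the original buffer. */
static int round_trips(uint64_t virt)
{
	return (uint64_t)naive_virt_to_bus(virt) == virt;
}
```

An address below 4GB survives the round trip; a typical ppc64 kernel virtual address does not, which is why the drivers are restricted to PPC32 in Kconfig rather than patched.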
Signed-off-by: Anton Blanchard ===== drivers/net/Kconfig 1.91 vs edited ===== --- 1.91/drivers/net/Kconfig Fri Sep 3 19:08:21 2004 +++ edited/drivers/net/Kconfig Fri Sep 17 04:16:43 2004 @@ -200,7 +200,7 @@ config MACE tristate "MACE (Power Mac ethernet) support" - depends on NET_ETHERNET && PPC_PMAC + depends on NET_ETHERNET && PPC32 && PPC_PMAC select CRC32 help Power Macintoshes and clones with Ethernet built-in on the @@ -223,7 +223,7 @@ config BMAC tristate "BMAC (G3 ethernet) support" - depends on NET_ETHERNET && PPC_PMAC + depends on NET_ETHERNET && PPC32 && PPC_PMAC select CRC32 help Say Y for support of BMAC Ethernet interfaces. These are used on G3 ===== sound/oss/dmasound/Kconfig 1.5 vs edited ===== --- 1.5/sound/oss/dmasound/Kconfig Tue Dec 30 19:45:02 2003 +++ edited/sound/oss/dmasound/Kconfig Fri Sep 17 04:16:24 2004 @@ -14,7 +14,7 @@ config DMASOUND_PMAC tristate "PowerMac DMA sound support" - depends on PPC_PMAC && SOUND && I2C + depends on PPC32 && PPC_PMAC && SOUND && I2C select DMASOUND help If you want to use the internal audio of your PowerMac in Linux, From linas at austin.ibm.com Fri Sep 17 06:34:37 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Thu, 16 Sep 2004 15:34:37 -0500 Subject: [PATCH] dynamic addition of PCI Host bridges In-Reply-To: <1095286859.4514.16.camel@sinatra.austin.ibm.com> References: <1095286859.4514.16.camel@sinatra.austin.ibm.com> Message-ID: <20040916203436.GM9645@austin.ibm.com> On Wed, Sep 15, 2004 at 05:21:00PM -0500, John Rose was heard to remark: > The following patch implements the ppc64-specific bits for dynamic (DLPAR) > addition of PCI Host Bridges. The entry point for this operation is > init_phb_dynamic(), which will be called by the RPA DLPAR driver. > > Comments welcome! one nit-picky comment, one question. nit-pick: I noticed that pci_alloc_pci_controller() and pci_alloc_phb_dynamic() are almost identical, except that one has the flag is_dynamic=1, and the other not. 
why not merge them into one, and just pass "is_dynamic" as a flag? question: I presume 'remove" is in the works? Is part of the difficulty of implementing remove that pci_alloc_pci_controller() calls alloc_bootmem(), which makes freeing these things kind-of tricky? (i.e. since pci is getting set up before init_mem is done?) --linas From anton at samba.org Fri Sep 17 06:36:34 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 17 Sep 2004 06:36:34 +1000 Subject: [PATCH] ppc64: fix CONFIG_CMDLINE Message-ID: <20040916203634.GI2825@krispykreme> When I cleaned up our cmdline parsing, I missed a RELOC of CONFIG_CMDLINE itself. Without it we copy something random into cmd_line, but only when CONFIG_CMDLINE is enabled. Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/prom.c~fix_cmdline arch/ppc64/kernel/prom.c --- foobar3/arch/ppc64/kernel/prom.c~fix_cmdline 2004-09-16 16:20:43.696593190 +1000 +++ foobar3-anton/arch/ppc64/kernel/prom.c 2004-09-16 16:22:35.601831564 +1000 @@ -1646,7 +1646,7 @@ prom_init(unsigned long r3, unsigned lon RELOC(cmd_line[0]) = 0; #ifdef CONFIG_CMDLINE - strlcpy(RELOC(cmd_line), CONFIG_CMDLINE, sizeof(cmd_line)); + strlcpy(RELOC(cmd_line), RELOC(CONFIG_CMDLINE), sizeof(cmd_line)); #endif /* CONFIG_CMDLINE */ if ((long)_prom->chosen > 0) { prom_getprop(_prom->chosen, "bootargs", p, sizeof(cmd_line)); _ From johnrose at austin.ibm.com Fri Sep 17 06:56:03 2004 From: johnrose at austin.ibm.com (John Rose) Date: Thu, 16 Sep 2004 15:56:03 -0500 Subject: [PATCH] dynamic addition of PCI Host bridges In-Reply-To: <20040916203436.GM9645@austin.ibm.com> References: <1095286859.4514.16.camel@sinatra.austin.ibm.com> <20040916203436.GM9645@austin.ibm.com> Message-ID: <1095368163.13255.15.camel@sinatra.austin.ibm.com> Hi Linas, thanks for the comments. > nit-pick: > I noticed that pci_alloc_pci_controller() and pci_alloc_phb_dynamic() > are almost identical, except that one has the flag is_dynamic=1, > and the other not. 
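For readers wondering why the RELOC() matters: prom_init() runs before the kernel has been relocated, so it executes at a load address different from its link address, and every absolute address, including the address of the CONFIG_CMDLINE string literal, must have the load offset applied before being dereferenced. A user-space analogy (the names and the sample command line are illustrative, not kernel code):

```c
#include <stdint.h>
#include <string.h>

/* The image was linked at one base address but is running at another;
 * link-time pointers are off by load_offset until adjusted. */
static uintptr_t load_offset;

static const char *reloc(const char *linked_addr)
{
	return (const char *)((uintptr_t)linked_addr + load_offset);
}
```

Reading through the un-adjusted pointer instead fetches whatever happens to sit at the link-time address, i.e. "something random", which is exactly the reported bug.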
why not merge them into one, and just pass > "is_dynamic" as a flag? Knowing the high likelihood of code redundancy when doing something like this, I asked the PLTT whether my approach on these issues should be 1) adding a flag as you've described or 2) splitting into dynamic vs. init versions, and I remember Rusty (I think) answering 2. If you guys really agree with adding a flag for this particular function, I don't mind either way :) > question: > I presume "remove" is in the works? > Is part of the difficulty of implementing remove that > pci_alloc_pci_controller() calls alloc_bootmem(), which makes > freeing these things kind-of tricky? (i.e. since pci is getting > set up before init_mem is done?) > I previously posted the patch for remove: http://ozlabs.org/ppc64-patches/patch.pl?id=241 It hasn't been accepted yet. I suppose a third patch to be submitted would augment the remove function to do something like: if (phb->is_dynamic) kfree(phb); I'll post such a patch if some of these start getting accepted. :) > --linas > > From hollis at penguinppc.org Fri Sep 17 06:55:53 2004 From: hollis at penguinppc.org (Hollis Blanchard) Date: Thu, 16 Sep 2004 15:55:53 -0500 Subject: [PATCH] dynamic addition of PCI Host bridges In-Reply-To: <1095368163.13255.15.camel@sinatra.austin.ibm.com> References: <1095286859.4514.16.camel@sinatra.austin.ibm.com> <20040916203436.GM9645@austin.ibm.com> <1095368163.13255.15.camel@sinatra.austin.ibm.com> Message-ID: On Sep 16, 2004, at 3:56 PM, John Rose wrote: > Hi Linas, thanks for the comments. > >> nit-pick: >> I noticed that pci_alloc_pci_controller() and pci_alloc_phb_dynamic() >> are almost identical, except that one has the flag is_dynamic=1, >> and the other not. why not merge them into one, and just pass >> "is_dynamic" as a flag? > > Knowing the high likelihood of code redundancy when doing something > like > this, I asked the PLTT [... snip] What is the PLTT? Is that kinda like a mailing list?
;) -Hollis From johnrose at austin.ibm.com Fri Sep 17 07:13:29 2004 From: johnrose at austin.ibm.com (John Rose) Date: Thu, 16 Sep 2004 16:13:29 -0500 Subject: [PATCH] dynamic addition of PCI Host bridges In-Reply-To: References: <1095286859.4514.16.camel@sinatra.austin.ibm.com> <20040916203436.GM9645@austin.ibm.com> <1095368163.13255.15.camel@sinatra.austin.ibm.com> Message-ID: <1095369209.13255.21.camel@sinatra.austin.ibm.com> > What is the PLTT? Is that kinda like a mailing list? ;) Yep, except all the recipients are really smart. Ha. I think it's a list that includes Paul, Rusty, Ken, and other architects. Can't remember what it stands for. :) J From willschm at us.ibm.com Fri Sep 17 07:08:30 2004 From: willschm at us.ibm.com (Will Schmidt) Date: Thu, 16 Sep 2004 16:08:30 -0500 Subject: [PATCH] dynamic addition of PCI Host bridges In-Reply-To: <1095369209.13255.21.camel@sinatra.austin.ibm.com> Message-ID: probably is Power Linux Tech Board Will Schmidt - willschm at us.ibm.com Platform Enablement/Bringup Team Lead IBM Power Linux Development Rochester,MN From: John Rose, sent by linuxppc64-dev-bounces at ozlabs.org; To: Hollis Blanchard; cc: Raymond Harney/Rochester/IBM at IBMUS, Michael W Wortman/Austin/IBM at IBMUS, Paul Mackerras, Anton Blanchard, External List; Date: 09/16/2004 04:13 PM; Subject: Re: [PATCH] dynamic addition of PCI Host bridges > What is the PLTT? Is that kinda like a mailing list? ;) Yep, except all the recipients are really smart. Ha. I think it's a list that includes Paul, Rusty, Ken, and other architects. Can't remember what it stands for.
:) J _______________________________________________ Linuxppc64-dev mailing list Linuxppc64-dev at ozlabs.org https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev From akpm at osdl.org Fri Sep 17 07:59:37 2004 From: akpm at osdl.org (Andrew Morton) Date: Thu, 16 Sep 2004 14:59:37 -0700 Subject: [PATCH] ppc64 dont use state == SYSTEM_BOOTING In-Reply-To: <20040916180610.GD2825@krispykreme> References: <20040916180610.GD2825@krispykreme> Message-ID: <20040916145937.35f776e6.akpm@osdl.org> Anton Blanchard wrote: > > Fedora has a patch which introduces a new system state during boot. What patch is that? From linas at austin.ibm.com Fri Sep 17 09:06:47 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Thu, 16 Sep 2004 18:06:47 -0500 Subject: [PATCH] add syslog printing to xmon debugger. Message-ID: <20040916230647.GN9645@austin.ibm.com> Hi, This patch adds 'dmesg'/printk log buffer printing to xmon. I find this useful because crashes are almost always preceded by interesting printk's. This patch is simple & straightforward, except for one possibly controversial aspect: it embeds a small snippet in kernel/printk.c to return the location of the syslog. This is needed because kallsyms and even CONFIG_KALLSYMS_ALL is not enough to reveal the location of log_buf. This code is about 90% cut-n-paste of earlier code from Keith Owens. Andrew, Please apply at least the kernel/printk.c part of the patch, if you are feeling at all charitable.
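The fiddly part of the xmon patch that follows is walking printk's circular buffer: the log is described by a monotonically growing character count and a physical buffer that only retains the most recent bytes, so the logical window can wrap around the physical end (the SYSLOG_WRAP macro handles exactly that). A self-contained sketch of the extraction, simplified from the patch's line-at-a-time loop:

```c
#include <string.h>

/* Copy the last `logged` characters of a wrapped log into `out`.
 * `end` counts every character ever logged; the physical buffer only
 * retains the most recent `buf_len` of them. */
static unsigned long ring_extract(const char *buf, unsigned long buf_len,
                                  unsigned long end, unsigned long logged,
                                  char *out)
{
	unsigned long n = logged < buf_len ? logged : buf_len;
	unsigned long i;

	for (i = 0; i < n; i++)
		out[i] = buf[(end - n + i) % buf_len];
	out[n] = '\0';
	return n;
}
```

With an 8-byte buffer that has logged ten characters, only the final eight survive and the copy starts mid-buffer.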
Signed-off-by: Linas Vepstas -------------- next part -------------- ===== arch/ppc64/xmon/xmon.c 1.50 vs edited ===== --- 1.50/arch/ppc64/xmon/xmon.c Fri Jul 2 00:23:46 2004 +++ edited/arch/ppc64/xmon/xmon.c Thu Sep 16 16:49:23 2004 @@ -132,6 +132,7 @@ static void bootcmds(void); void dump_segments(void); static void symbol_lookup(void); +static void xmon_show_dmesg(void); static int emulate_step(struct pt_regs *regs, unsigned int instr); static void xmon_print_symbol(unsigned long address, const char *mid, const char *after); @@ -175,6 +176,7 @@ #endif "\ C checksum\n\ + D show dmesg (printk) buffer\n\ d dump bytes\n\ di dump instructions\n\ df dump float values\n\ @@ -911,6 +913,9 @@ case 'd': dump(); break; + case 'D': + xmon_show_dmesg(); + break; case 'l': symbol_lookup(); break; @@ -2452,6 +2457,58 @@ printf(" [%s]", modname); } printf("%s", after); +} + +extern void debugger_syslog_data(char *syslog_data[4]); +#define SYSLOG_WRAP(p) if (p < syslog_data[0]) p = syslog_data[1]-1; \ + else if (p >= syslog_data[1]) p = syslog_data[0]; + +static void xmon_show_dmesg(void) +{ + char *syslog_data[4], *start, *end, c; + int logsize; + + /* syslog_data[0,1] physical start, end+1. + * syslog_data[2,3] logical start, end+1. 
+ */ + debugger_syslog_data(syslog_data); + if (syslog_data[2] == syslog_data[3]) + return; + logsize = syslog_data[1] - syslog_data[0]; + start = syslog_data[0] + (syslog_data[2] - syslog_data[0]) % logsize; + end = syslog_data[0] + (syslog_data[3] - syslog_data[0]) % logsize; + + /* Do a line at a time (max 200 chars) to reduce overhead */ + c = '\0'; + while(1) { + char *p; + int chars = 0; + if (!*start) { + while (!*start) { + ++start; + SYSLOG_WRAP(start); + if (start == end) + break; + } + if (start == end) + break; + } + p = start; + while (*start && chars < 200) { + c = *start; + ++chars; + ++start; + SYSLOG_WRAP(start); + if (start == end || c == '\n') + break; + } + if (chars) + printf("%.*s", chars, p); + if (start == end) + break; + } + if (c != '\n') + printf("\n"); } static void debug_trace(void) ===== kernel/printk.c 1.41 vs edited ===== --- 1.41/kernel/printk.c Mon Aug 23 03:15:11 2004 +++ edited/kernel/printk.c Thu Sep 16 15:50:59 2004 @@ -376,6 +376,21 @@ return do_syslog(type, buf, len); } +#ifdef CONFIG_DEBUG_KERNEL +/* Its very handy to be able to view the syslog buffer during debug. + * But do_syslog() uses locks so it cannot be used during debugging. + * Instead, provide the start and end of the physical and logical logs. + * This is equivalent to do_syslog(3). + */ +void debugger_syslog_data(char *syslog_data[4]) +{ + syslog_data[0] = log_buf; + syslog_data[1] = log_buf + __LOG_BUF_LEN; + syslog_data[2] = log_buf + log_end - (logged_chars < __LOG_BUF_LEN ? 
logged_chars : __LOG_BUF_LEN); + syslog_data[3] = log_buf + log_end; +} +#endif /* CONFIG_DEBUG_KERNEL */ + /* * Call the console drivers on a range of log_buf */ From david at gibson.dropbear.id.au Fri Sep 17 11:13:20 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 17 Sep 2004 11:13:20 +1000 Subject: [PPC64] Remove LARGE_PAGE_SHIFT constant Message-ID: <20040917011320.GA6523@zax> Andrew, please apply: For historical reasons, ppc64 has ended up with two #defines for the size of a large (16M) page: LARGE_PAGE_SHIFT and HPAGE_SHIFT. This patch removes LARGE_PAGE_SHIFT in favour of the more widely used HPAGE_SHIFT. Signed-off-by: David Gibson Index: working-2.6/arch/ppc64/kernel/pSeries_htab.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/pSeries_htab.c 2004-08-09 09:51:38.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/pSeries_htab.c 2004-09-16 17:04:56.034685808 +1000 @@ -325,7 +325,7 @@ va = (vsid << 28) | (batch->addr[i] & 0x0fffffff); batch->vaddr[j] = va; if (large) - vpn = va >> LARGE_PAGE_SHIFT; + vpn = va >> HPAGE_SHIFT; else vpn = va >> PAGE_SHIFT; hash = hpt_hash(vpn, large); Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2004-09-16 16:58:48.170699448 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2004-09-16 17:04:44.306603944 +1000 @@ -119,8 +119,6 @@ #define PT_SHIFT (12) /* Page Table */ #define PT_MASK 0x02FF -#define LARGE_PAGE_SHIFT 24 - static inline unsigned long hpt_hash(unsigned long vpn, int large) { unsigned long vsid; Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2004-09-15 10:53:53.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2004-09-16 17:04:20.121686024 +1000 @@ -100,7 +100,7 @@ int ret; if (large) - vpn = va >> 
LARGE_PAGE_SHIFT; + vpn = va >> HPAGE_SHIFT; else vpn = va >> PAGE_SHIFT; @@ -332,7 +332,7 @@ va = (vsid << 28) | (ea & 0x0fffffff); if (large) - vpn = va >> LARGE_PAGE_SHIFT; + vpn = va >> HPAGE_SHIFT; else vpn = va >> PAGE_SHIFT; hash = hpt_hash(vpn, large); Index: working-2.6/arch/ppc64/mm/hugetlbpage.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c 2004-09-07 10:38:00.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hugetlbpage.c 2004-09-16 17:03:34.673672800 +1000 @@ -853,7 +853,7 @@ vsid = get_vsid(context.id, ea); va = (vsid << 28) | (ea & 0x0fffffff); - vpn = va >> LARGE_PAGE_SHIFT; + vpn = va >> HPAGE_SHIFT; hash = hpt_hash(vpn, 1); if (hugepte_val(pte) & _HUGEPAGE_SECONDARY) hash = ~hash; -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From arjanv at redhat.com Fri Sep 17 16:14:57 2004 From: arjanv at redhat.com (Arjan van de Ven) Date: Fri, 17 Sep 2004 08:14:57 +0200 Subject: [PATCH] ppc64 dont use state == SYSTEM_BOOTING In-Reply-To: <20040916145937.35f776e6.akpm@osdl.org> References: <20040916180610.GD2825@krispykreme> <20040916145937.35f776e6.akpm@osdl.org> Message-ID: <20040917061456.GA24937@devserv.devel.redhat.com> On Thu, Sep 16, 2004 at 02:59:37PM -0700, Andrew Morton wrote: > Anton Blanchard wrote: > > > > Fedora has a patch which introduces a new system state during boot. > > > > What patch is that? the patch you had for a while that introduced a state for "we can schedule now" needed for voluntary preempt -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040917/cccb7e53/attachment.pgp From benh at kernel.crashing.org Fri Sep 17 22:50:33 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 17 Sep 2004 22:50:33 +1000 Subject: [RFC] Monster boot cleanup patch #1 Message-ID: <1095425432.2214.112.camel@gaston> Hi ! I've been working on making kexec possible the past couple of weeks; to reach that goal, we need to significantly clean up the interactions between prom_init and the rest of the kernel in order to provide a usable entry point _after_ prom_init. This is the first product of that cleanup. It's the minimum necessary to start working on kexec; it's by no means doing all the cleanups that can be done, though it makes a lot of them easier, see below for some notes about this. It's also not "finished" in the sense that I don't expect it to be merged as-is. I want to work a little bit more on it and test it on more than just a POWER4 LPAR & a G5. First of all, prom.c is split in two: what is run in the OF environment now goes into prom_init.c, the rest is in prom.c. The root of the issue is to have prom_init() now be totally distinct from the kernel. No global of the kernel is written, at all. Actually, if it wasn't for some makefile munging and copying some asm functions it uses, prom_init.c could be a completely separate link entity. The only relationship between them is that prom_init() now calls the kernel entry point with a pointer to a block of memory, which contains a flattened version of the OF device tree and a map of reserved areas of physical memory (for various reasons, it's _much_ simpler to pass the map this way than trying to put it inside of the device-tree itself; there's some chicken & egg stuff going on there). The format of that stuff is quite flexible, but I'll let you look at the patch and form your own opinion.
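The flattened-tree idea described above can be made concrete with a toy example: a linear stream of begin-node / property / end-node records that a single forward pass can validate and count without any pointers, which is what makes it easy to hand from prom_init() to the kernel as one opaque memory block. This is only an illustration of the concept, not Ben's actual record format:

```c
/* Toy flattened device tree: a flat array of tagged records.
 * A real format would also carry property values, a string table
 * and alignment padding. */
enum fdt_tag { FDT_BEGIN_NODE, FDT_PROP, FDT_END_NODE, FDT_END };

struct fdt_token {
	enum fdt_tag tag;
	const char *name;
};

/* One linear pass: count the nodes, or return -1 if the begin/end
 * tags are unbalanced. */
static int fdt_count_nodes(const struct fdt_token *t)
{
	int depth = 0, nodes = 0;

	for (; t->tag != FDT_END; t++) {
		if (t->tag == FDT_BEGIN_NODE) {
			depth++;
			nodes++;
		} else if (t->tag == FDT_END_NODE) {
			if (--depth < 0)
				return -1;
		}
	}
	return depth == 0 ? nodes : -1;
}
```

The same single pass is all early_init_devtree() needs in order to pull out a handful of critical properties before the pointer-based struct device_node tree exists.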
Anything that prom_init() has to pass to the kernel is done via the device-tree. The result is that I ended up rewriting a good bunch of prom_init(), changing completely the way it manages & allocates memory (I expect things to be much more sane now). It's also a lot simpler! On the kernel side, the two calls to stab_initialize() and htab_initialize() that are done in real mode are replaced by a call to early_init() (which lives in setup.c). This new function calls early_init_devtree() (in prom.c), which uses a couple of small functions in there to do a linear parse of the flattened device-tree. Not much is done at this point, just enough to retrieve some critical properties from the device-tree, like the platform number, and build the LMBs. At this point, the hash size is calculated and we can move on to stab/htab_initialize (which are called from C code now). Later on, in setup_system(), the flattened device-tree is converted into the familiar pointer-based tree of struct device_node. I did some significant shuffling & cleanup of some of the things happening at this stage of the boot, but here again, I'll let you discover it by reading the patch. On the TODO or CAN-BE-DONE list, in no special order: - CONFIG_BOOTX_TEXT is not "fixed" yet - setup_system() should die completely. All of this can be done from setup_arch(). Also, we should call those switch/case based on platform and avoid having 3 or more init points in there. I think I'll just find a way to fill up ppc_md. for the platform at early_init() time, and do some calls through that to the init function(s) from setup_arch, trying to shrink that to one init function if possible. - iSeries needs to be ripped out of setup.c completely and have its own setup file. Too much ifdef mess & cruft that we don't want under our feet anymore - CONFIG_PPC_PSERIES must become what CONFIG_PPC_PMAC is, that is a boolean option for IBM-pSeries machine support on top of the "common" config.
A lot of code that is now in ifdef CONFIG_PPC_PSERIES should just be moved out of the ifdef's, since it's really just !ISERIES, and iSeries just moved to separate files. - Right now, I don't release the flattened device tree memory; I need to change a few things in the way it's extracted to be able to do that. That's all I have in mind at the moment. The patch is really too big for an email, you'll find it at: http://gate.crashing.org/~benh/ppc64-monster-boot-cleanup-1.diff Enjoy ! Ben. From arnd at arndb.de Sat Sep 18 01:32:51 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 17 Sep 2004 17:32:51 +0200 Subject: [RFC] Monster boot cleanup patch #1 In-Reply-To: <1095425432.2214.112.camel@gaston> References: <1095425432.2214.112.camel@gaston> Message-ID: <200409171732.56795.arnd@arndb.de> On Friday, 17 September 2004 14:50, Benjamin Herrenschmidt wrote: > I've been working on making kexec possible the past couple of weeks, > in order to real that goal, we need to cleanup significantly the > interactions between prom_init and the rest of the kernel in order > to provide a useable entry point _after_ prom_init. Nice. > typedef u32 ihandle32; > > -extern char *prom_display_paths[]; > -extern unsigned int prom_num_displays; > - > struct address_range { > unsigned long space; > unsigned long address; I just tried building and found that the removal of these globals from prom.c breaks compilation of drivers/video/offb.c. Do you have a replacement mechanism in place somewhere? Arnd <>< -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040917/cde2069f/attachment.pgp From kpfleming at starnetworks.us Sat Sep 18 01:48:29 2004 From: kpfleming at starnetworks.us (Kevin P. Fleming) Date: Fri, 17 Sep 2004 08:48:29 -0700 Subject: [USER] OpenPower 720 support?
Message-ID: <414B074D.5070409@starnetworks.us> (Dang, now I'm reposting because I used the wrong sender address. Doh!) (I'm posting here in spite of the warnings about this being a "developer only" list... I'm doing this because all the other lists are down/gone, and dwmw2 suggested I post here instead. Flame me if necessary ) I'm looking at buying an OpenPower 720 instead of a number of x86-64 boxes, for use as a web hosting/email hosting/offsite data storage/MySQL/etc. box. Initially I'd have 6 LPARs, growing to maybe as many as 20. I'm looking at a 2-way 1.65GHz box (4-way box with 2 CPUs), 4GB of RAM, FC HBAs, HMC, etc. My question, though, is that I'm not really keen on using SLES9, and I really want to use the 2.6 kernel so that rules out RHES. I am perfectly comfortable building my own distro to put on this system, but I'm concerned that things like LPAR support and other fancy bits are not present in the kernel.org kernel and I'd be losing access to the features I'm paying for. Can anyone give me some thoughts on whether using one of these machines with an in-house distro is a wise move? From kpfleming at backtobasicsmgmt.com Sat Sep 18 01:42:02 2004 From: kpfleming at backtobasicsmgmt.com (Kevin P. Fleming) Date: Fri, 17 Sep 2004 08:42:02 -0700 Subject: [USER] OpenPower 720 support? Message-ID: <414B05CA.4040307@backtobasicsmgmt.com> (I'm posting here in spite of the warnings about this being a "developer only" list... I'm doing this because all the other lists are down/gone, and dwmw2 suggested I post here instead. Flame me if necessary ) I'm looking at buying an OpenPower 720 instead of a number of x86-64 boxes, for use as a web hosting/email hosting/offsite data storage/MySQL/etc. box. Initially I'd have 6 LPARs, growing to maybe as many as 20. I'm looking at a 2-way 1.65GHz box (4-way box with 2 CPUs), 4GB of RAM, FC HBAs, HMC, etc. 
My question, though, is that I'm not really keen on using SLES9, and I really want to use the 2.6 kernel so that rules out RHES. I am perfectly comfortable building my own distro to put on this system, but I'm concerned that things like LPAR support and other fancy bits are not present in the kernel.org kernel and I'd be losing access to the features I'm paying for. Can anyone give me some thoughts on whether using one of these machines with an in-house distro is a wise move? From haveblue at us.ibm.com Sat Sep 18 02:31:39 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Fri, 17 Sep 2004 09:31:39 -0700 Subject: [USER] OpenPower 720 support? In-Reply-To: <414B05CA.4040307@backtobasicsmgmt.com> References: <414B05CA.4040307@backtobasicsmgmt.com> Message-ID: <1095438699.4623.12.camel@localhost> On Fri, 2004-09-17 at 08:42, Kevin P. Fleming wrote: > My question, though, is that I'm not really keen on using SLES9, and I > really want to use the 2.6 kernel so that rules out RHES. I am perfectly > comfortable building my own distro to put on this system, but I'm > concerned that things like LPAR support and other fancy bits are not > present in the kernel.org kernel and I'd be losing access to the > features I'm paying for. > > Can anyone give me some thoughts on whether using one of these machines > with an in-house distro is a wise move? Let me just say that the software that interfaces with the HMC is very powerful. So powerful, in fact, that it has a number of very intertwined components. I once knew a young, naive, Linux kernel developer who thought that he could do DLPAR operations on his Debian machine. He failed, and uses SLES9 for that development to this day. It can surely be done, he just wasn't clever or patient enough to figure it out. -- Dave From haveblue at us.ibm.com Sat Sep 18 02:32:46 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Fri, 17 Sep 2004 09:32:46 -0700 Subject: [USER] OpenPower 720 support? 
In-Reply-To: <414B074D.5070409@starnetworks.us> References: <414B074D.5070409@starnetworks.us> Message-ID: <1095438766.4623.14.camel@localhost> On Fri, 2004-09-17 at 08:48, Kevin P. Fleming wrote: > (Dang, now I'm reposting because I used the wrong sender address. Doh!) Fine, then! I'll send two messages! :) > (I'm posting here in spite of the warnings about this being a "developer > only" list... I'm doing this because all the other lists are down/gone, > and dwmw2 suggested I post here instead. Flame me if necessary ) > > I'm looking at buying an OpenPower 720 instead of a number of x86-64 > boxes, for use as a web hosting/email hosting/offsite data > storage/MySQL/etc. box. Initially I'd have 6 LPARs, growing to maybe as > many as 20. I'm looking at a 2-way 1.65GHz box (4-way box with 2 CPUs), > 4GB of RAM, FC HBAs, HMC, etc. > > My question, though, is that I'm not really keen on using SLES9, and I > really want to use the 2.6 kernel so that rules out RHES. I am perfectly > comfortable building my own distro to put on this system, but I'm > concerned that things like LPAR support and other fancy bits are not > present in the kernel.org kernel and I'd be losing access to the > features I'm paying for. > > Can anyone give me some thoughts on whether using one of these machines > with an in-house distro is a wise move? Let me just say that the software that interfaces with the HMC is very powerful. So powerful, in fact, that it has a number of very intertwined components. I once knew a young, naive, Linux kernel developer who thought that he could do DLPAR operations on his Debian machine. He failed, and uses SLES9 for that development to this day. It can surely be done, he just wasn't clever or patient enough to figure it out. -- Dave From kpfleming at starnetworks.us Sat Sep 18 03:01:18 2004 From: kpfleming at starnetworks.us (Kevin P. Fleming) Date: Fri, 17 Sep 2004 10:01:18 -0700 Subject: [USER] OpenPower 720 support? 
In-Reply-To: <1095438699.4623.12.camel@localhost> References: <414B05CA.4040307@backtobasicsmgmt.com> <1095438699.4623.12.camel@localhost> Message-ID: <414B185E.6080904@starnetworks.us> Dave Hansen wrote: > Let me just say that the software that interfaces with the HMC is very > powerful. So powerful, in fact, that it has a number of very > intertwined components. That does not surprise me in the least. > I once knew a young, naive, Linux kernel developer who thought that he > could do DLPAR operations on his Debian machine. He failed, and uses > SLES9 for that development to this day. It can surely be done, he just > wasn't clever or patient enough to figure it out. OK, how does your answer change if I _don't_ use DLPAR, but only static LPAR changes made through the HMC? Also, are you saying that the kernel/userspace bits of SLES9 to handle DLPAR are not open-source, or just hard to extract? From jschopp at austin.ibm.com Sat Sep 18 03:11:47 2004 From: jschopp at austin.ibm.com (Joel Schopp) Date: Fri, 17 Sep 2004 12:11:47 -0500 Subject: [USER] OpenPower 720 support? In-Reply-To: <1095438699.4623.12.camel@localhost> References: <414B05CA.4040307@backtobasicsmgmt.com> <1095438699.4623.12.camel@localhost> Message-ID: <414B1AD3.6000508@austin.ibm.com> >>My question, though, is that I'm not really keen on using SLES9, and I >>really want to use the 2.6 kernel so that rules out RHES. I am perfectly >>comfortable building my own distro to put on this system, but I'm >>concerned that things like LPAR support and other fancy bits are not >>present in the kernel.org kernel and I'd be losing access to the >>features I'm paying for. >> >>Can anyone give me some thoughts on whether using one of these machines >>with an in-house distro is a wise move? > An in-house distro seems like a whole lot of extra work. SLES9 is very nice. > > Let me just say that the software that interfaces with the HMC is very > powerful. 
So powerful, in fact, that it has a number of very > intertwined components. > > I once knew a young, naive, Linux kernel developer who thought that he > could do DLPAR operations on his Debian machine. He failed, and uses > SLES9 for that development to this day. It can surely be done, he just > wasn't clever or patient enough to figure it out. Now, to the best of my knowledge all of the features you want are in the mainline kernel.org kernel. In specific I know that logical partitioning, dynamic logical partitioning of cpus & io slots, shared processors, etc are. I'm even pretty sure (but don't quote me on this) the virtual scsi, virtual ethernet, and virtual console are too. But as Dave was saying, to partition the machine(s) you will need to buy an HMC (hardware management console). One HMC is enough for several machines. Some of the features need software to run on Linux to talk to that HMC. Features such as dynamic partitioning (moving things around without rebooting a partition) or virtual scsi. You don't need the software to use shared processors or to statically partition the machine between partition reboots. The software is no cost, but it is not yet open source. It officially supports SLES8, SLES9 and RHEL3. http://techsupport.services.ibm.com/server/lopdiags/suselinux/hmcserver To quote IBMs page. http://www-1.ibm.com/servers/eserver/openpower/hardware/720_options.html "If more than a single partition is needed on the server, the IBM Hardware Management Console (HMC) is required. This workstation is used solely for console functions. It is required to create, define and modify a partition, but is not required to run the partition. An HMC can act as a console for more than one server or partition. More than one HMC can be attached to a server or partition. The HMC is attached to an HMC-only port on the back of the server's system unit. It is ordered not as a feature code of the server, but as a 7310-CR2 or 7310-C03. 
CR2 is the rack-mounted HMC and C03 is the desk-side HMC. IBM ships a set of console functionality software with the HMC. This software cannot be modified by the client and thus is called firmware."

If you have further technical questions you can send them to me directly to avoid cluttering up what really should be a development list.

From kpfleming at starnetworks.us Sat Sep 18 03:29:05 2004
From: kpfleming at starnetworks.us (Kevin P. Fleming)
Date: Fri, 17 Sep 2004 10:29:05 -0700
Subject: [USER] OpenPower 720 support?
In-Reply-To: <20040917171619.GA9906@cs.umn.edu>
References: <414B05CA.4040307@backtobasicsmgmt.com> <20040917171619.GA9906@cs.umn.edu>
Message-ID: <414B1EE1.6010704@starnetworks.us>

Dave C Boutcher wrote:
> Couple of things. First all of the support you need should be in
> kernel.org, and starting with a Gentoo ppc64 build you should be able to
> bootstrap yourself up everything you want. You could probably hack your
> way through debian as well.

Good, that matches what Joel said as well.

> To do the kind of LPAR you want, you will need to buy an HMC (not sure
> of the price, but perhaps significant) and you will also need something
> that I think is not announced yet called the "Advanced Open
> Virtualization Something" that will let you do more than one partition
> per processor ("micropartitioning" in marketing-speak) and virtual I/O
> (virtual ethernets for example.)

Right, an HMC is on the quote list. If the advanced LPAR product is expensive/unavailable/unsupported on my distro, that would be an issue because I do need to do more than one partition per CPU.

> As Dave H said, to do DLPAR (i.e. adding/removing stuff while the OS is
> running) you will need some non-open-source tools from the IBM web site
> that are only "supported" on SLES or Red Hat. YMMV if you are running
> on Gentoo/debian/linux-from-scratch.
I don't personally think you will > miss any of those tools (with all due respect to those who do a > magnificent job of writing and delivering them.) This is what I suspected would be the case. > If when you get all of that, I have a document on setting up virtual > partitions which I keep saying I will put somewhere external (like > penguinppc64.org) but haven't got around to doing yet...you can ping me > via email and I'll send you a copy. If I end up buying a machine I'll take you up on that. > And finally, I think people on this list would be happy and interested > to help you. You might find yourself on a poster somewhere (I mean that > in a good sense...not on a post office wall.) That's one thing I've noticed on LKML; the crowd that does PPC64 (both IBM and non-IBM people) seem to be pretty reasonable to deal with :-) From haveblue at us.ibm.com Sat Sep 18 03:53:21 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Fri, 17 Sep 2004 10:53:21 -0700 Subject: [USER] OpenPower 720 support? In-Reply-To: <414B1AD3.6000508@austin.ibm.com> References: <414B05CA.4040307@backtobasicsmgmt.com> <1095438699.4623.12.camel@localhost> <414B1AD3.6000508@austin.ibm.com> Message-ID: <1095443601.3521.3.camel@localhost> On Fri, 2004-09-17 at 10:11, Joel Schopp wrote: > > I once knew a young, naive, Linux kernel developer who thought that he > > could do DLPAR operations on his Debian machine. He failed, and uses > > SLES9 for that development to this day. It can surely be done, he just > > wasn't clever or patient enough to figure it out. > > Now, to the best of my knowledge all of the features you want are in the > mainline kernel.org kernel. In specific I know that logical > partitioning, dynamic logical partitioning of cpus & io slots, shared > processors, etc are. I'm even pretty sure (but don't quote me on this) > the virtual scsi, virtual ethernet, and virtual console are too. DLPAR operations aren't done in the kernel. 
They're done from userspace via a closed, binary-only application that isn't very portable across distributions. -- Dave From sleddog at us.ibm.com Sat Sep 18 03:53:06 2004 From: sleddog at us.ibm.com (Dave C Boutcher) Date: Fri, 17 Sep 2004 12:53:06 -0500 Subject: [USER] OpenPower 720 support? In-Reply-To: <414B1EE1.6010704@starnetworks.us> References: <414B05CA.4040307@backtobasicsmgmt.com> <20040917171619.GA9906@cs.umn.edu> <414B1EE1.6010704@starnetworks.us> Message-ID: <20040917175306.GB9906@cs.umn.edu> On Fri, Sep 17, 2004 at 10:29:05AM -0700, Kevin P. Fleming wrote: > Dave C Boutcher wrote: > >To do the kind of LPAR you want, you will need to buy an HMC (not sure > >of the price, but perhaps significant) and you will also need something > >that I think is not announced yet called the "Advanced Open > >Virtualization Something" that will let you do more than one partition > >per processor ("micropartitioning" in marketing-speak) and virtual I/O > >(virtual ethernets for example.) > > Right, an HMC is on the quote list. If the advanced LPAR product is > expensive/unavailable/unsupported on my distro, that would be an issue > because I do need to do more than one partition per CPU. Advanced Open Virtualization Something is a "hardware feature" not a Linux product....i.e. you buy a key and micropartitioning and virtual I/O are enabled on your hardware. -- Dave Boutcher From kpfleming at starnetworks.us Sat Sep 18 03:56:54 2004 From: kpfleming at starnetworks.us (Kevin P. Fleming) Date: Fri, 17 Sep 2004 10:56:54 -0700 Subject: [USER] OpenPower 720 support? In-Reply-To: <20040917175306.GB9906@cs.umn.edu> References: <414B05CA.4040307@backtobasicsmgmt.com> <20040917171619.GA9906@cs.umn.edu> <414B1EE1.6010704@starnetworks.us> <20040917175306.GB9906@cs.umn.edu> Message-ID: <414B2566.4070601@starnetworks.us> Dave C Boutcher wrote: > Advanced Open Virtualization Something is a "hardware feature" not a > Linux product....i.e. 
you buy a key and micropartitioning and > virtual I/O are enabled on your hardware. Ahh, OK. Since I've asked for that to be in my quote, it should be there when they present it. From haveblue at us.ibm.com Sat Sep 18 04:02:53 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Fri, 17 Sep 2004 11:02:53 -0700 Subject: [USER] OpenPower 720 support? In-Reply-To: <414B185E.6080904@starnetworks.us> References: <414B05CA.4040307@backtobasicsmgmt.com> <1095438699.4623.12.camel@localhost> <414B185E.6080904@starnetworks.us> Message-ID: <1095444173.3521.6.camel@localhost> On Fri, 2004-09-17 at 10:01, Kevin P. Fleming wrote: > Dave Hansen wrote: > > I once knew a young, naive, Linux kernel developer who thought that he > > could do DLPAR operations on his Debian machine. He failed, and uses > > SLES9 for that development to this day. It can surely be done, he just > > wasn't clever or patient enough to figure it out. > > OK, how does your answer change if I _don't_ use DLPAR, but only static > LPAR changes made through the HMC? Then, you shouldn't have any problems that I can think of. At that point the HMC becomes a cold partition configurator and an expensive power switch :) > Also, are you saying that the kernel/userspace bits of SLES9 to handle > DLPAR are not open-source, or just hard to extract? Not open source. At all. -- Dave From haveblue at us.ibm.com Sat Sep 18 04:40:29 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Fri, 17 Sep 2004 11:40:29 -0700 Subject: [PPC64] Remove LARGE_PAGE_SHIFT constant In-Reply-To: <20040917170328.GB2179@logos.cnet> References: <20040917011320.GA6523@zax> <20040917170328.GB2179@logos.cnet> Message-ID: <1095446429.4088.3.camel@localhost> On Fri, 2004-09-17 at 10:03, Marcelo Tosatti wrote: > On Fri, Sep 17, 2004 at 11:13:20AM +1000, David Gibson wrote: > > Andrew, please apply: > > > > For historical reasons, ppc64 has ended up with two #defines for the > > size of a large (16M) page: LARGE_PAGE_SHIFT and HPAGE_SHIFT. 
This > > patch removes LARGE_PAGE_SHIFT in favour of the more widely used > > HPAGE_SHIFT. > > Nitpicking, "LARGE_PAGE_xxx" is used by x86/x86_64: > > #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) > #define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) > > Wouldnt it be nice to keep consistency between archs? Actually, if everybody makes sure to define PMD_SHIFT, we should be able to use common macros, right? -- Dave From marcelo.tosatti at cyclades.com Sat Sep 18 03:03:28 2004 From: marcelo.tosatti at cyclades.com (Marcelo Tosatti) Date: Fri, 17 Sep 2004 14:03:28 -0300 Subject: [PPC64] Remove LARGE_PAGE_SHIFT constant In-Reply-To: <20040917011320.GA6523@zax> References: <20040917011320.GA6523@zax> Message-ID: <20040917170328.GB2179@logos.cnet> On Fri, Sep 17, 2004 at 11:13:20AM +1000, David Gibson wrote: > Andrew, please apply: > > For historical reasons, ppc64 has ended up with two #defines for the > size of a large (16M) page: LARGE_PAGE_SHIFT and HPAGE_SHIFT. This > patch removes LARGE_PAGE_SHIFT in favour of the more widely used > HPAGE_SHIFT. Nitpicking, "LARGE_PAGE_xxx" is used by x86/x86_64: #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) #define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) Wouldnt it be nice to keep consistency between archs? 
From marcelo.tosatti at cyclades.com Sat Sep 18 03:33:34 2004 From: marcelo.tosatti at cyclades.com (Marcelo Tosatti) Date: Fri, 17 Sep 2004 14:33:34 -0300 Subject: [PPC64] Remove LARGE_PAGE_SHIFT constant In-Reply-To: <1095446429.4088.3.camel@localhost> References: <20040917011320.GA6523@zax> <20040917170328.GB2179@logos.cnet> <1095446429.4088.3.camel@localhost> Message-ID: <20040917173334.GC2179@logos.cnet> On Fri, Sep 17, 2004 at 11:40:29AM -0700, Dave Hansen wrote: > On Fri, 2004-09-17 at 10:03, Marcelo Tosatti wrote: > > On Fri, Sep 17, 2004 at 11:13:20AM +1000, David Gibson wrote: > > > Andrew, please apply: > > > > > > For historical reasons, ppc64 has ended up with two #defines for the > > > size of a large (16M) page: LARGE_PAGE_SHIFT and HPAGE_SHIFT. This > > > patch removes LARGE_PAGE_SHIFT in favour of the more widely used > > > HPAGE_SHIFT. > > > > Nitpicking, "LARGE_PAGE_xxx" is used by x86/x86_64: > > > > #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) > > #define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) > > > > Wouldnt it be nice to keep consistency between archs? > > Actually, if everybody makes sure to define PMD_SHIFT, we should be able > to use common macros, right? Yeap. From sleddog at us.ibm.com Sat Sep 18 03:16:19 2004 From: sleddog at us.ibm.com (Dave C Boutcher) Date: Fri, 17 Sep 2004 12:16:19 -0500 Subject: [USER] OpenPower 720 support? In-Reply-To: <414B05CA.4040307@backtobasicsmgmt.com> References: <414B05CA.4040307@backtobasicsmgmt.com> Message-ID: <20040917171619.GA9906@cs.umn.edu> On Fri, Sep 17, 2004 at 08:42:02AM -0700, Kevin P. Fleming wrote: > I'm looking at buying an OpenPower 720 instead of a number of x86-64 > boxes, for use as a web hosting/email hosting/offsite data > storage/MySQL/etc. box. Initially I'd have 6 LPARs, growing to maybe as > many as 20. I'm looking at a 2-way 1.65GHz box (4-way box with 2 CPUs), > 4GB of RAM, FC HBAs, HMC, etc. 
>
> My question, though, is that I'm not really keen on using SLES9, and I
> really want to use the 2.6 kernel so that rules out RHES. I am perfectly
> comfortable building my own distro to put on this system, but I'm
> concerned that things like LPAR support and other fancy bits are not
> present in the kernel.org kernel and I'd be losing access to the
> features I'm paying for.

Kevin,

Couple of things. First all of the support you need should be in kernel.org, and starting with a Gentoo ppc64 build you should be able to bootstrap yourself up everything you want. You could probably hack your way through debian as well.

To do the kind of LPAR you want, you will need to buy an HMC (not sure of the price, but perhaps significant) and you will also need something that I think is not announced yet called the "Advanced Open Virtualization Something" that will let you do more than one partition per processor ("micropartitioning" in marketing-speak) and virtual I/O (virtual ethernets for example.)

As Dave H said, to do DLPAR (i.e. adding/removing stuff while the OS is running) you will need some non-open-source tools from the IBM web site that are only "supported" on SLES or Red Hat. YMMV if you are running on Gentoo/debian/linux-from-scratch. I don't personally think you will miss any of those tools (with all due respect to those who do a magnificent job of writing and delivering them.)

If/when you get all of that, I have a document on setting up virtual partitions which I keep saying I will put somewhere external (like penguinppc64.org) but haven't got around to doing yet...you can ping me via email and I'll send you a copy.

And finally, I think people on this list would be happy and interested to help you. You might find yourself on a poster somewhere (I mean that in a good sense...not on a post office wall.)
-- Dave Boutcher From benh at kernel.crashing.org Sat Sep 18 09:58:06 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 18 Sep 2004 09:58:06 +1000 Subject: [RFC] Monster boot cleanup patch #1 In-Reply-To: <200409171732.56795.arnd@arndb.de> References: <1095425432.2214.112.camel@gaston> <200409171732.56795.arnd@arndb.de> Message-ID: <1095465485.3660.0.camel@gaston> On Sat, 2004-09-18 at 01:32, Arnd Bergmann wrote: > I just tried building and found that the removal of these > globals from prom.c breaks compilation of drivers/video/offb.c. > Do you have a replacement mechanism in place somewhere? No, I just forgot to include the offb patch in my monster patch, will be in the next drop Ben. From paulus at samba.org Sat Sep 18 07:22:47 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 17 Sep 2004 17:22:47 -0400 Subject: [USER] OpenPower 720 support? In-Reply-To: <1095443601.3521.3.camel@localhost> References: <414B05CA.4040307@backtobasicsmgmt.com> <1095438699.4623.12.camel@localhost> <414B1AD3.6000508@austin.ibm.com> <1095443601.3521.3.camel@localhost> Message-ID: <16715.21928.10040.902257@cargo.ozlabs.ibm.com> Dave Hansen writes: > DLPAR operations aren't done in the kernel. They're done from userspace > via a closed, binary-only application that isn't very portable across > distributions. Specifically, what differences are there that stop them being used on different distros? I thought we had got past that. Paul. From paulus at samba.org Sat Sep 18 07:23:44 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 17 Sep 2004 17:23:44 -0400 Subject: [PPC64] Remove LARGE_PAGE_SHIFT constant In-Reply-To: <1095446429.4088.3.camel@localhost> References: <20040917011320.GA6523@zax> <20040917170328.GB2179@logos.cnet> <1095446429.4088.3.camel@localhost> Message-ID: <16715.21984.137689.407420@cargo.ozlabs.ibm.com> Dave Hansen writes: > Actually, if everybody makes sure to define PMD_SHIFT, we should be able > to use common macros, right? 
No, because LARGE_PAGE_SHIFT != PMD_SHIFT on ppc64.

Paul.

From anton at samba.org Sat Sep 18 17:06:12 2004
From: anton at samba.org (Anton Blanchard)
Date: Sat, 18 Sep 2004 17:06:12 +1000
Subject: 2.6.9-rc2+BK hvc console oops
Message-ID: <20040918070612.GK2825@krispykreme>

Hi,

Just got this oops on current -BK.

Anton

INIT: Sending processes the KILL signal
cpu 0x0: Vector: 300 (Data Access) at [c00000000f86f460]
    pc: c0000000003fdb38: ._spin_lock_irqsave+0x30/0xbc
    lr: c000000000221c18: .hvc_hangup+0x38/0xd4
    sp: c00000000f86f6e0
   msr: 8000000000001032
   dar: 0
 dsisr: 40000000
  current = 0xc0000000011b6440
  paca    = 0xc00000000053b780
  pid = 992, comm = bash
enter ? for help
0:mon> t
[c00000000f86f770] c000000000221c18 .hvc_hangup+0x38/0xd4
[c00000000f86f810] c000000000209a74 .do_tty_hangup+0x5e0/0x620
[c00000000f86f8d0] c000000000209c60 .disassociate_ctty+0x1a8/0x214
[c00000000f86f960] c000000000054a04 .do_exit+0x774/0xec8
[c00000000f86fa30] c0000000000551e8 .do_group_exit+0x50/0xd0
[c00000000f86fad0] c00000000006174c .get_signal_to_deliver+0x268/0x3f8
[c00000000f86fb70] c0000000000272fc .do_signal32+0x4c/0x4b0
[c00000000f86fcd0] c000000000016eb8 .do_signal+0x354/0x5d0
[c00000000f86fe30] c000000000010e3c do_work+0x24/0x28
--- Exception: c00 (System Call) at 000000000fee5358
SP (ffffe080) is in userspace

From david at gibson.dropbear.id.au Sat Sep 18 09:43:57 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Sat, 18 Sep 2004 09:43:57 +1000
Subject: [PPC64] Remove LARGE_PAGE_SHIFT constant
In-Reply-To: <20040917173334.GC2179@logos.cnet>
References: <20040917011320.GA6523@zax> <20040917170328.GB2179@logos.cnet> <1095446429.4088.3.camel@localhost> <20040917173334.GC2179@logos.cnet>
Message-ID: <20040917234357.GA23252@zax>

On Fri, Sep 17, 2004 at 02:33:34PM -0300, Marcelo Tosatti wrote:
> On Fri, Sep 17, 2004 at 11:40:29AM -0700, Dave Hansen wrote:
> > On Fri, 2004-09-17 at 10:03, Marcelo Tosatti wrote:
> > > On Fri, Sep 17, 2004 at 11:13:20AM +1000, David Gibson
wrote: > > > > Andrew, please apply: > > > > > > > > For historical reasons, ppc64 has ended up with two #defines for the > > > > size of a large (16M) page: LARGE_PAGE_SHIFT and HPAGE_SHIFT. This > > > > patch removes LARGE_PAGE_SHIFT in favour of the more widely used > > > > HPAGE_SHIFT. > > > > > > Nitpicking, "LARGE_PAGE_xxx" is used by x86/x86_64: > > > > > > #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) > > > #define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) > > > > > > Wouldnt it be nice to keep consistency between archs? > > > > Actually, if everybody makes sure to define PMD_SHIFT, we should be able > > to use common macros, right? > > Yeap. No. The large page shift is not necessarily the same as the PMD_SHIFT. It's the same on x86, but it is less on sparc64 and sh, and greater on ppc64. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Sat Sep 18 09:46:56 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Sat, 18 Sep 2004 09:46:56 +1000 Subject: [PPC64] Remove LARGE_PAGE_SHIFT constant In-Reply-To: <20040917170328.GB2179@logos.cnet> References: <20040917011320.GA6523@zax> <20040917170328.GB2179@logos.cnet> Message-ID: <20040917234656.GB23252@zax> On Fri, Sep 17, 2004 at 02:03:28PM -0300, Marcelo Tosatti wrote: > On Fri, Sep 17, 2004 at 11:13:20AM +1000, David Gibson wrote: > > Andrew, please apply: > > > > For historical reasons, ppc64 has ended up with two #defines for the > > size of a large (16M) page: LARGE_PAGE_SHIFT and HPAGE_SHIFT. This > > patch removes LARGE_PAGE_SHIFT in favour of the more widely used > > HPAGE_SHIFT. > > Nitpicking, "LARGE_PAGE_xxx" is used by x86/x86_64: > > #define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1)) > #define LARGE_PAGE_SIZE (1UL << PMD_SHIFT) > > Wouldnt it be nice to keep consistency between archs? Hrm... they are indeed there. 
However *all* archs, including x86 and x86_64 have HPAGE_SHIFT et al - it's used in generic code, so x86 has the duplicate #defines as well. Actually.. I guess the distinction is that LARGE_PAGE_* refer to the hardware large page size, whereas HPAGE_SIZE refers to the software page size used for hugetlbfs. I think those are identical for all arches at the moment, but they wouldn't have to be. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Mon Sep 20 13:04:11 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 20 Sep 2004 13:04:11 +1000 Subject: RFC: PPC64 hugepage rework Message-ID: <20040920030411.GA31433@zax> The patch below reworks the ppc64 hugepage code. Instead of using specially marked pmd entries in the normal pagetables to represent hugepages, use normal pte_t entries, in a special set of pagetables used for hugepages only. Using pte_t instead of a special hugepte_t makes the code more similar to that for other architecturess, allowing more possibilities for consolidating the hugepage code. Using independent pagetables for the hugepages is also a prerequisite for moving the hugepages into their own region well outside the normal user address space. We probably want to do that, because of the restrictions imposed on us by the MMU: using a fixed hugepage range greatly simplifies the handling of the segment restrictions, but putting it within the normal address range could cause problems as we expand or move that range in the future. This patch has been given basic testing, but more is always good. Any comments? 
Index: working-2.6/include/asm-ppc64/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/pgtable.h	2004-09-15 10:53:53.000000000 +1000
+++ working-2.6/include/asm-ppc64/pgtable.h	2004-09-20 11:15:32.474738984 +1000
@@ -98,6 +98,7 @@
 #define _PAGE_BUSY	0x0800 /* software: PTE & hash are busy */
 #define _PAGE_SECONDARY	0x8000 /* software: HPTE is in secondary group */
 #define _PAGE_GROUP_IX	0x7000 /* software: HPTE index within group */
+#define _PAGE_HUGE	0x10000 /* 16MB page */
 /* Bits 0x7000 identify the index within an HPT Group */
 #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
 /* PAGE_MASK gives the right answer below, but only by accident */
@@ -157,19 +158,19 @@
 #endif /* __ASSEMBLY__ */

 /* shift to put page number into pte */
-#define PTE_SHIFT	(16)
+#define PTE_SHIFT	(17)

 /* We allow 2^41 bytes of real memory, so we need 29 bits in the PMD
  * to give the PTE page number. The bottom two bits are for flags.
  */
 #define PMD_TO_PTEPAGE_SHIFT	(2)

 #ifdef CONFIG_HUGETLB_PAGE

-#define _PMD_HUGEPAGE	0x00000001U
-#define HUGEPTE_BATCH_SIZE	(1<<(HPAGE_SHIFT-PMD_SHIFT))

 #ifndef __ASSEMBLY__
 int hash_huge_page(struct mm_struct *mm, unsigned long access,
 		   unsigned long ea, unsigned long vsid, int local);
+
+void hugetlb_mm_free_pgd(struct mm_struct *mm);
 #endif /* __ASSEMBLY__ */

 #define HAVE_ARCH_UNMAPPED_AREA
@@ -177,7 +178,7 @@
 #else

 #define hash_huge_page(mm,a,ea,vsid,local)	-1
-#define _PMD_HUGEPAGE	0
+#define hugetlb_mm_free_pgd(mm)	do {} while (0)

 #endif
@@ -213,10 +214,8 @@
 #define pmd_set(pmdp, ptep) \
 	(pmd_val(*(pmdp)) = (__ba_to_bpn(ptep) << PMD_TO_PTEPAGE_SHIFT))
 #define pmd_none(pmd)		(!pmd_val(pmd))
-#define pmd_hugepage(pmd)	(!!(pmd_val(pmd) & _PMD_HUGEPAGE))
-#define pmd_bad(pmd)		(((pmd_val(pmd)) == 0) || pmd_hugepage(pmd))
-#define pmd_present(pmd)	((!pmd_hugepage(pmd)) \
-				 && (pmd_val(pmd) & ~_PMD_HUGEPAGE) != 0)
+#define pmd_bad(pmd)		(pmd_val(pmd) == 0)
+#define pmd_present(pmd)	(pmd_val(pmd) != 0)
 #define pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
 #define pmd_page_kernel(pmd)	\
 	(__bpn_to_ba(pmd_val(pmd) >> PMD_TO_PTEPAGE_SHIFT))
@@ -269,6 +268,7 @@
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_HUGE;}

 static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
 static inline void pte_cache(pte_t pte)   { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -294,6 +294,8 @@
 	pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) {
 	pte_val(pte) |= _PAGE_ACCESSED; return pte; }
+static inline pte_t pte_mkhuge(pte_t pte) {
+	pte_val(pte) |= _PAGE_HUGE; return pte; }

 /* Atomic PTE updates */
 static inline unsigned long pte_update(pte_t *p, unsigned long clr)
@@ -464,6 +466,10 @@

 extern void paging_init(void);

+struct mmu_gather;
+void hugetlb_free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *prev,
+			   unsigned long start, unsigned long end);
+
 /*
  * This gets called at the end of handling a page fault, when
  * the kernel has put a new PTE into the page table for the process.
Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c	2004-09-20 10:12:50.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c	2004-09-20 12:54:16.026685200 +1000
@@ -27,116 +27,140 @@

 #include 

-/* HugePTE layout:
- *
- * 31 30 ... 15 14 13 12 10 9 8 7 6 5 4 3 2 1 0
- * PFN>>12..... - - - - - - HASH_IX.... 2ND HASH RW - HG=1
- */
+#define HUGEPGDIR_SHIFT		(HPAGE_SHIFT + PAGE_SHIFT - 3)
+#define HUGEPGDIR_SIZE		(1UL << HUGEPGDIR_SHIFT)
+#define HUGEPGDIR_MASK		(~(HUGEPGDIR_SIZE-1))
+
+#define HUGEPTE_INDEX_SIZE	9
+#define HUGEPGD_INDEX_SIZE	10
+
+#define PTRS_PER_HUGEPTE	(1 << HUGEPTE_INDEX_SIZE)
+#define PTRS_PER_HUGEPGD	(1 << HUGEPGD_INDEX_SIZE)

-#define HUGEPTE_SHIFT		15
-#define _HUGEPAGE_PFN		0xffff8000
-#define _HUGEPAGE_BAD		0x00007f00
-#define _HUGEPAGE_HASHPTE	0x00000008
-#define _HUGEPAGE_SECONDARY	0x00000010
-#define _HUGEPAGE_GROUP_IX	0x000000e0
-#define _HUGEPAGE_HPTEFLAGS	(_HUGEPAGE_HASHPTE | _HUGEPAGE_SECONDARY | \
-				 _HUGEPAGE_GROUP_IX)
-#define _HUGEPAGE_RW		0x00000004
-
-typedef struct {unsigned int val;} hugepte_t;
-#define hugepte_val(hugepte)	((hugepte).val)
-#define __hugepte(x)		((hugepte_t) { (x) } )
-#define hugepte_pfn(x)		\
-	((unsigned long)(hugepte_val(x)>>HUGEPTE_SHIFT) << HUGETLB_PAGE_ORDER)
-#define mk_hugepte(page,wr)	__hugepte( \
-	((page_to_pfn(page)>>HUGETLB_PAGE_ORDER) << HUGEPTE_SHIFT ) \
-	| (!!(wr) * _HUGEPAGE_RW) | _PMD_HUGEPAGE )
-
-#define hugepte_bad(x)	( !(hugepte_val(x) & _PMD_HUGEPAGE) || \
-			  (hugepte_val(x) & _HUGEPAGE_BAD) )
-#define hugepte_page(x)	pfn_to_page(hugepte_pfn(x))
-#define hugepte_none(x)	(!(hugepte_val(x) & _HUGEPAGE_PFN))
-
-
-static void flush_hash_hugepage(mm_context_t context, unsigned long ea,
-				hugepte_t pte, int local);
-
-static inline unsigned int hugepte_update(hugepte_t *p, unsigned int clr,
-					  unsigned int set)
-{
-	unsigned int old, tmp;
-
-	__asm__ __volatile__(
-	"1:	lwarx	%0,0,%3		# pte_update\n\
-	andc	%1,%0,%4 \n\
-	or	%1,%1,%5 \n\
-	stwcx.	%1,0,%3 \n\
-	bne-	1b"
-	: "=&r" (old), "=&r" (tmp), "=m" (*p)
-	: "r" (p), "r" (clr), "r" (set), "m" (*p)
-	: "cc" );
-	return old;
+static inline int hugepgd_index(unsigned long addr)
+{
+	return (addr & ~REGION_MASK) >> HUGEPGDIR_SHIFT;
 }

-static inline void set_hugepte(hugepte_t *ptep, hugepte_t pte)
+static pgd_t *hugepgd_offset(struct mm_struct *mm, unsigned long addr)
 {
-	hugepte_update(ptep, ~_HUGEPAGE_HPTEFLAGS,
-		       hugepte_val(pte) & ~_HUGEPAGE_HPTEFLAGS);
+	int index;
+
+	if (! mm->context.huge_pgdir)
+		return NULL;
+
+
+	index = hugepgd_index(addr);
+	BUG_ON(index >= PTRS_PER_HUGEPGD);
+	return mm->context.huge_pgdir + index;
 }

-static hugepte_t *hugepte_alloc(struct mm_struct *mm, unsigned long addr)
+static inline pte_t *hugepte_offset(pgd_t *dir, unsigned long addr)
 {
-	pgd_t *pgd;
-	pmd_t *pmd = NULL;
+	int index;

-	BUG_ON(!in_hugepage_area(mm->context, addr));
+	if (pgd_none(*dir))
+		return NULL;

-	pgd = pgd_offset(mm, addr);
-	pmd = pmd_alloc(mm, pgd, addr);
+	index = (addr >> HPAGE_SHIFT) % PTRS_PER_HUGEPTE;
+	return (pte_t *)pgd_page(*dir) + index;
+}

-	/* We shouldn't find a (normal) PTE page pointer here */
-	BUG_ON(!pmd_none(*pmd) && !pmd_hugepage(*pmd));
-
-	return (hugepte_t *)pmd;
+static pgd_t *hugepgd_alloc(struct mm_struct *mm, unsigned long addr)
+{
+	BUG_ON(! in_hugepage_area(mm->context, addr));
+
+	if (! mm->context.huge_pgdir) {
+		pgd_t *new;
+		spin_unlock(&mm->page_table_lock);
+		/* Don't use pgd_alloc(), because we want __GFP_REPEAT */
+		new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT);
+		BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE));
+		spin_lock(&mm->page_table_lock);
+
+		/*
+		 * Because we dropped the lock, we should re-check the
+		 * entry, as somebody else could have populated it..
+		 */
+		if (mm->context.huge_pgdir)
+			pgd_free(new);
+		else
+			mm->context.huge_pgdir = new;
+	}
+	return hugepgd_offset(mm, addr);
 }

-static hugepte_t *hugepte_offset(struct mm_struct *mm, unsigned long addr)
+static pte_t *hugepte_alloc(struct mm_struct *mm, pgd_t *dir,
+			    unsigned long addr)
 {
-	pgd_t *pgd;
-	pmd_t *pmd = NULL;
+	if (! pgd_present(*dir)) {
+		pte_t *new;

-	BUG_ON(!in_hugepage_area(mm->context, addr));
+		spin_unlock(&mm->page_table_lock);
+		new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT);
+		BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE));
+		spin_lock(&mm->page_table_lock);
+		/*
+		 * Because we dropped the lock, we should re-check the
+		 * entry, as somebody else could have populated it..
+		 */
+		if (pgd_present(*dir)) {
+			if (new)
+				kmem_cache_free(zero_cache, new);
+		} else {
+			struct page *ptepage;

-	pgd = pgd_offset(mm, addr);
-	if (pgd_none(*pgd))
-		return NULL;
+			if (! new)
+				return NULL;
+			ptepage = virt_to_page(new);
+			ptepage->mapping = (void *) mm;
+			ptepage->index = addr & HUGEPGDIR_MASK;
+			pgd_populate(mm, dir, new);
+		}
+	}

-	pmd = pmd_offset(pgd, addr);
+	return hugepte_offset(dir, addr);
+}
+
+static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgd;

-	/* We shouldn't find a (normal) PTE page pointer here */
-	BUG_ON(!pmd_none(*pmd) && !pmd_hugepage(*pmd));
+	pgd = hugepgd_offset(mm, addr);
+	if (!
pgd) + return NULL; - return (hugepte_t *)pmd; + return hugepte_offset(pgd, addr); } -static void setup_huge_pte(struct mm_struct *mm, struct page *page, - hugepte_t *ptep, int write_access) +static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) { - hugepte_t entry; - int i; + pgd_t *pgd; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); - entry = mk_hugepte(page, write_access); - for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) - set_hugepte(ptep+i, entry); + pgd = hugepgd_alloc(mm, addr); + if (! pgd) + return NULL; + + return hugepte_alloc(mm, pgd, addr); } -static void teardown_huge_pte(hugepte_t *ptep) +static void setup_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long addr, struct page *page, pte_t *ptep, + int write_access) { - int i; + pte_t entry; + + mm->rss += (HPAGE_SIZE / PAGE_SIZE); + if (write_access) { + entry = + pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); + } else { + entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot)); + } + entry = pte_mkyoung(entry); + entry = pte_mkhuge(entry); - for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) - pmd_clear((pmd_t *)(ptep+i)); + set_pte(ptep, entry); } /* @@ -267,60 +291,64 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma) { - hugepte_t *src_pte, *dst_pte, entry; + pte_t *src_pte, *dst_pte, entry; struct page *ptepage; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int err = -ENOMEM; + + spin_lock(&src->page_table_lock); while (addr < end) { BUG_ON(! in_hugepage_area(src->context, addr)); BUG_ON(! 
in_hugepage_area(dst->context, addr)); - dst_pte = hugepte_alloc(dst, addr); + dst_pte = huge_pte_alloc(dst, addr); if (!dst_pte) - return -ENOMEM; + goto out; - src_pte = hugepte_offset(src, addr); + src_pte = huge_pte_offset(src, addr); entry = *src_pte; - if ((addr % HPAGE_SIZE) == 0) { - /* This is the first hugepte in a batch */ - ptepage = hugepte_page(entry); - get_page(ptepage); - dst->rss += (HPAGE_SIZE / PAGE_SIZE); - } - set_hugepte(dst_pte, entry); + ptepage = pte_page(entry); + get_page(ptepage); + dst->rss += (HPAGE_SIZE / PAGE_SIZE); + set_pte(dst_pte, entry); - - addr += PMD_SIZE; + addr += HPAGE_SIZE; } - return 0; + + err = 0; + out: + spin_unlock(&src->page_table_lock); + return err; } -int -follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, - struct page **pages, struct vm_area_struct **vmas, - unsigned long *position, int *length, int i) +int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, + struct page **pages, struct vm_area_struct **vmas, + unsigned long *position, int *length, int i) { unsigned long vpfn, vaddr = *position; int remainder = *length; WARN_ON(!is_vm_hugetlb_page(vma)); + spin_lock(&mm->page_table_lock); + vpfn = vaddr/PAGE_SIZE; while (vaddr < vma->vm_end && remainder) { BUG_ON(!in_hugepage_area(mm->context, vaddr)); if (pages) { - hugepte_t *pte; + pte_t *pte; struct page *page; - pte = hugepte_offset(mm, vaddr); + pte = huge_pte_offset(mm, vaddr); /* hugetlb should be locked, and hence, prefaulted */ - WARN_ON(!pte || hugepte_none(*pte)); + WARN_ON(!pte || pte_none(*pte)); - page = &hugepte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)]; + page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)]; WARN_ON(!PageCompound(page)); @@ -340,32 +368,38 @@ *length = remainder; *position = vaddr; + spin_unlock(&mm->page_table_lock); + return i; } -struct page * -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write) +struct page *follow_huge_addr(struct mm_struct *mm, unsigned long 
address, + int write) { - return ERR_PTR(-EINVAL); + pte_t *ptep; + struct page *page; + + if (! in_hugepage_area(mm->context, address)) + return ERR_PTR(-EINVAL); + + ptep = huge_pte_offset(mm, address); + page = pte_page(*ptep); + if (page) + page += (address % HPAGE_SIZE) / PAGE_SIZE; + + return page; } int pmd_huge(pmd_t pmd) { - return pmd_hugepage(pmd); + return 0; } -struct page * -follow_huge_pmd(struct mm_struct *mm, unsigned long address, - pmd_t *pmd, int write) +struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address, + pmd_t *pmd, int write) { - struct page *page; - - BUG_ON(! pmd_hugepage(*pmd)); - - page = hugepte_page(*(hugepte_t *)pmd); - if (page) - page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT); - return page; + BUG(); + return NULL; } void unmap_hugepage_range(struct vm_area_struct *vma, @@ -373,46 +407,42 @@ { struct mm_struct *mm = vma->vm_mm; unsigned long addr; - hugepte_t *ptep; + pte_t *ptep; struct page *page; - int cpu; - int local = 0; cpumask_t tmp; WARN_ON(!is_vm_hugetlb_page(vma)); BUG_ON((start % HPAGE_SIZE) != 0); BUG_ON((end % HPAGE_SIZE) != 0); - /* XXX are there races with checking cpu_vm_mask? 
- Anton */ - cpu = get_cpu(); - tmp = cpumask_of_cpu(cpu); - if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) - local = 1; - for (addr = start; addr < end; addr += HPAGE_SIZE) { - hugepte_t pte; + pte_t pte; BUG_ON(!in_hugepage_area(mm->context, addr)); - ptep = hugepte_offset(mm, addr); - if (!ptep || hugepte_none(*ptep)) + ptep = huge_pte_offset(mm, addr); + if (!ptep || pte_none(*ptep)) continue; pte = *ptep; - page = hugepte_page(pte); - teardown_huge_pte(ptep); - - if (hugepte_val(pte) & _HUGEPAGE_HASHPTE) - flush_hash_hugepage(mm->context, addr, - pte, local); + page = pte_page(pte); + pte_clear(ptep); put_page(page); } - put_cpu(); - mm->rss -= (end - start) >> PAGE_SHIFT; } +void hugetlb_free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *prev, + unsigned long start, unsigned long end) +{ + /* Because the huge pgtables are only 2 level, they can take + * at most around 4M, much less than one hugepage which the + * process is presumably entitled to use. So we don't bother + * freeing up the pagetables on unmap, and wait until + * destroy_context() to clean up the lot. */ +} + int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma) { struct mm_struct *mm = current->mm; @@ -426,7 +456,7 @@ spin_lock(&mm->page_table_lock); for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) { unsigned long idx; - hugepte_t *pte = hugepte_alloc(mm, addr); + pte_t *pte = huge_pte_alloc(mm, addr); struct page *page; BUG_ON(!in_hugepage_area(mm->context, addr)); @@ -435,7 +465,7 @@ ret = -ENOMEM; goto out; } - if (!hugepte_none(*pte)) + if (! 
pte_none(*pte)) continue; idx = ((addr - vma->vm_start) >> HPAGE_SHIFT) @@ -462,7 +492,8 @@ goto out; } } - setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE); + setup_huge_pte(mm, vma, addr, page, pte, + vma->vm_flags & VM_WRITE); } out: spin_unlock(&mm->page_table_lock); @@ -716,20 +747,55 @@ } } +void hugetlb_mm_free_pgd(struct mm_struct *mm) +{ + int i; + pgd_t *pgdir; + + spin_lock(&mm->page_table_lock); + + pgdir = mm->context.huge_pgdir; + if (! pgdir) + return; + + mm->context.huge_pgdir = NULL; + + /* cleanup any hugepte pages leftover */ + for (i = 0; i < PTRS_PER_HUGEPGD; i++) { + pgd_t *pgd = pgdir + i; + + if (! pgd_none(*pgd)) { + pte_t *pte = (pte_t *)pgd_page(*pgd); + struct page *ptepage = virt_to_page(pte); + + ptepage->mapping = NULL; + + BUG_ON(memcmp(pte, empty_zero_page, PAGE_SIZE)); + kmem_cache_free(zero_cache, pte); + } + pgd_clear(pgd); + } + + BUG_ON(memcmp(pgdir, empty_zero_page, PAGE_SIZE)); + kmem_cache_free(zero_cache, pgdir); + + spin_unlock(&mm->page_table_lock); +} + int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local) { - hugepte_t *ptep; + pte_t *ptep; unsigned long va, vpn; int is_write; - hugepte_t old_pte, new_pte; - unsigned long hpteflags, prpn, flags; + pte_t old_pte, new_pte; + unsigned long hpteflags, prpn; long slot; + int err = 1; + + spin_lock(&mm->page_table_lock); - /* We have to find the first hugepte in the batch, since - * that's the one that will store the HPTE flags */ - ea &= HPAGE_MASK; - ptep = hugepte_offset(mm, ea); + ptep = huge_pte_offset(mm, ea); /* Search the Linux page table for a match with va */ va = (vsid << 28) | (ea & 0x0fffffff); @@ -739,19 +805,18 @@ * If no pte found or not present, send the problem up to * do_page_fault */ - if (unlikely(!ptep || hugepte_none(*ptep))) - return 1; + if (unlikely(!ptep || pte_none(*ptep))) + goto out; - BUG_ON(hugepte_bad(*ptep)); +/* BUG_ON(pte_bad(*ptep)); */ /* * Check the user's access rights to 
the page. If access should be * prevented then send the problem up to do_page_fault. */ is_write = access & _PAGE_RW; - if (unlikely(is_write && !(hugepte_val(*ptep) & _HUGEPAGE_RW))) - return 1; - + if (unlikely(is_write && !(pte_val(*ptep) & _PAGE_RW))) + goto out; /* * At this point, we have a pte (old_pte) which can be used to build * or update an HPTE. There are 2 cases: @@ -764,41 +829,40 @@ * page is currently not DIRTY. */ - spin_lock_irqsave(&mm->page_table_lock, flags); old_pte = *ptep; new_pte = old_pte; - hpteflags = 0x2 | (! (hugepte_val(new_pte) & _HUGEPAGE_RW)); + hpteflags = 0x2 | (! (pte_val(new_pte) & _PAGE_RW)); /* Check if pte already has an hpte (case 2) */ - if (unlikely(hugepte_val(old_pte) & _HUGEPAGE_HASHPTE)) { + if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) { /* There MIGHT be an HPTE for this pte */ unsigned long hash, slot; hash = hpt_hash(vpn, 1); - if (hugepte_val(old_pte) & _HUGEPAGE_SECONDARY) + if (pte_val(old_pte) & _PAGE_SECONDARY) hash = ~hash; slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; - slot += (hugepte_val(old_pte) & _HUGEPAGE_GROUP_IX) >> 5; + slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12; if (ppc_md.hpte_updatepp(slot, hpteflags, va, 1, local) == -1) - hugepte_val(old_pte) &= ~_HUGEPAGE_HPTEFLAGS; + pte_val(old_pte) &= ~_PAGE_HPTEFLAGS; } - if (likely(!(hugepte_val(old_pte) & _HUGEPAGE_HASHPTE))) { + if (likely(!(pte_val(old_pte) & _PAGE_HASHPTE))) { unsigned long hash = hpt_hash(vpn, 1); unsigned long hpte_group; - prpn = hugepte_pfn(old_pte); + prpn = pte_pfn(old_pte); repeat: hpte_group = ((hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL; /* Update the linux pte with the HPTE slot */ - hugepte_val(new_pte) &= ~_HUGEPAGE_HPTEFLAGS; - hugepte_val(new_pte) |= _HUGEPAGE_HASHPTE; + pte_val(new_pte) &= ~_PAGE_HPTEFLAGS; + pte_val(new_pte) |= _PAGE_HASHPTE; /* Add in WIMG bits */ /* XXX We should store these in the pte */ @@ -809,7 +873,7 @@ /* Primary is full, try the secondary */ if 
(unlikely(slot == -1)) { - hugepte_val(new_pte) |= _HUGEPAGE_SECONDARY; + pte_val(new_pte) |= _PAGE_SECONDARY; hpte_group = ((~hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL; slot = ppc_md.hpte_insert(hpte_group, va, prpn, @@ -826,39 +890,20 @@ if (unlikely(slot == -2)) panic("hash_huge_page: pte_insert failed\n"); - hugepte_val(new_pte) |= (slot<<5) & _HUGEPAGE_GROUP_IX; + pte_val(new_pte) |= (slot<<12) & _PAGE_GROUP_IX; /* * No need to use ldarx/stdcx here because all who * might be updating the pte will hold the - * page_table_lock or the hash_table_lock - * (we hold both) + * page_table_lock */ *ptep = new_pte; } - spin_unlock_irqrestore(&mm->page_table_lock, flags); - - return 0; -} - -static void flush_hash_hugepage(mm_context_t context, unsigned long ea, - hugepte_t pte, int local) -{ - unsigned long vsid, vpn, va, hash, slot; - - BUG_ON(hugepte_bad(pte)); - BUG_ON(!in_hugepage_area(context, ea)); - - vsid = get_vsid(context.id, ea); + err = 0; - va = (vsid << 28) | (ea & 0x0fffffff); - vpn = va >> HPAGE_SHIFT; - hash = hpt_hash(vpn, 1); - if (hugepte_val(pte) & _HUGEPAGE_SECONDARY) - hash = ~hash; - slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; - slot += (hugepte_val(pte) & _HUGEPAGE_GROUP_IX) >> 5; + out: + spin_unlock(&mm->page_table_lock); - ppc_md.hpte_invalidate(slot, va, 1, local); + return err; } Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2004-09-20 11:15:32.477738528 +1000 @@ -24,6 +24,7 @@ typedef struct { mm_context_id_t id; #ifdef CONFIG_HUGETLB_PAGE + pgd_t *huge_pgdir; u16 htlb_segs; /* bitmask */ #endif } mm_context_t; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2004-09-20 10:12:50.000000000 +1000 +++ 
working-2.6/include/asm-ppc64/page.h 2004-09-20 11:15:32.478738376 +1000 @@ -64,7 +64,6 @@ #define is_hugepage_only_range(addr, len) \ (touches_hugepage_high_range((addr), (len)) || \ touches_hugepage_low_range((addr), (len))) -#define hugetlb_free_pgtables free_pgtables #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define in_hugepage_area(context, addr) \ Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2004-09-20 11:15:32.479738224 +1000 @@ -484,6 +484,12 @@ int index; int err; +#ifdef CONFIG_HUGETLB_PAGE + /* We leave htlb_segs as it was, but for a fork, we need to + * clear the huge_pgdir. */ + mm->context.huge_pgdir = NULL; +#endif + again: if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL)) return -ENOMEM; @@ -514,6 +520,8 @@ spin_unlock(&mmu_context_lock); mm->context.id = NO_CONTEXT; + + hugetlb_mm_free_pgd(mm); } static int __init mmu_context_init(void) Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2004-09-20 11:15:32.480738072 +1000 @@ -321,9 +321,7 @@ int local) { unsigned long vsid, vpn, va, hash, secondary, slot; - - /* XXX fix for large ptes */ - unsigned long large = 0; + unsigned long huge = pte_huge(pte); if ((ea >= USER_START) && (ea <= USER_END)) vsid = get_vsid(context, ea); @@ -331,18 +329,18 @@ vsid = get_kernel_vsid(ea); va = (vsid << 28) | (ea & 0x0fffffff); - if (large) + if (huge) vpn = va >> HPAGE_SHIFT; else vpn = va >> PAGE_SHIFT; - hash = hpt_hash(vpn, large); + hash = hpt_hash(vpn, huge); secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15; if (secondary) hash = ~hash; slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; slot += (pte_val(pte) & _PAGE_GROUP_IX) >> 12; - 
ppc_md.hpte_invalidate(slot, va, large, local); + ppc_md.hpte_invalidate(slot, va, huge, local); } void flush_hash_range(unsigned long context, unsigned long number, int local) -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From anton at samba.org Mon Sep 20 19:40:16 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 20 Sep 2004 19:40:16 +1000 Subject: [PATCH] ppc64: User tasks must have a valid thread.regs Message-ID: <20040920094016.GM2825@krispykreme> There have been reports of problems running UP ppc64 kernels where the kernel would die in the floating point save/restore code. It turns out kernel threads that call exec (and so become user tasks) do not have a valid thread.regs. This means init (pid 1) does not, it also means anything called out of exec_usermodehelper does not. Once that task has forked (eg init), then the thread.regs in the new task is correctly set. On UP we do lazy save/restore of floating point regs. The SLES9 init is doing floating point (the debian version of init appears not to). The lack of thread.regs in init combined with the fact that it does floating point leads to our lazy FP save/restore code blowing up. There were other places where this problem exhibited itself in weird and interesting ways. If a task being exec'ed out of a kernel thread used more than 1MB of stack, it would be terminated due to the checks in arch/ppc64/mm/fault.c (looking for a valid thread.regs when extending the stack). We had a test case using the tux webserver that was failing due to this. Paul: does this change look OK to you? Since we zero all registers in ELF_PLAT_INIT, I removed the extra memset in start_thread32.
Signed-off-by: Anton Blanchard diff -puN arch/ppc64/kernel/process.c~fix_regs arch/ppc64/kernel/process.c --- foobar2/arch/ppc64/kernel/process.c~fix_regs 2004-09-19 23:51:56.894867391 +1000 +++ foobar2-anton/arch/ppc64/kernel/process.c 2004-09-20 00:17:44.391366279 +1000 @@ -397,11 +397,22 @@ void start_thread(struct pt_regs *regs, /* Check whether the e_entry function descriptor entries * need to be relocated before we can use them. */ - if ( load_addr != 0 ) { + if (load_addr != 0) { entry += load_addr; toc += load_addr; } + /* + * If we exec out of a kernel thread then thread.regs will not be + * set. Do it now. + */ + if (!current->thread.regs) { + unsigned long childregs = (unsigned long)current->thread_info + + THREAD_SIZE; + childregs -= sizeof(struct pt_regs); + current->thread.regs = childregs; + } + regs->nip = entry; regs->gpr[1] = sp; regs->gpr[2] = toc; diff -L process.c -puN /dev/null /dev/null diff -puN arch/ppc64/kernel/sys_ppc32.c~fix_regs arch/ppc64/kernel/sys_ppc32.c --- foobar2/arch/ppc64/kernel/sys_ppc32.c~fix_regs 2004-09-19 23:52:48.494666233 +1000 +++ foobar2-anton/arch/ppc64/kernel/sys_ppc32.c 2004-09-20 00:17:38.031932803 +1000 @@ -633,8 +633,24 @@ out: void start_thread32(struct pt_regs* regs, unsigned long nip, unsigned long sp) { set_fs(USER_DS); - memset(regs->gpr, 0, sizeof(regs->gpr)); - memset(®s->ctr, 0, 4 * sizeof(regs->ctr)); + + /* + * If we exec out of a kernel thread then thread.regs will not be + * set. Do it now. + */ + if (!current->thread.regs) { + unsigned long childregs = (unsigned long)current->thread_info + + THREAD_SIZE; + childregs -= sizeof(struct pt_regs); + current->thread.regs = childregs; + } + + /* + * ELF_PLAT_INIT already clears all registers but it also sets r2. + * So just clear r2 here. 
+ */ + regs->gpr[2] = 0; + regs->nip = nip; regs->gpr[1] = sp; regs->msr = MSR_USER32; _ From anton at samba.org Mon Sep 20 19:40:56 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 20 Sep 2004 19:40:56 +1000 Subject: linux-2.6.9-rc* ppc64 broken on UNI? In-Reply-To: <20040914065627.GA10113@in.ibm.com> References: <20040914065627.GA10113@in.ibm.com> Message-ID: <20040920094055.GN2825@krispykreme> Hi, > I am using linux-2.6.9-rc* on an old Power3 box here and looks like > the tree is broken for UNI (SMP works fine). I am not too familiar > with the fpu stuff to figure out the issue myself. > > I get the following exception: > > The system is going down for reboot NOW!INIT: Sending pVector: 300 (Data Access) at [c00000003f617bb0] > pc: c00000000000b8b0: copy_to_here+0xb0/0x16c > lr: 00000000100419dc > sp: c00000003f617e30 > msr: a000000000003032 > dar: 108 > dsisr: 40000000 > current = 0xc000000001b3d900 > paca = 0xc000000000320000 > pid = 1298, comm = bash > enter ? for help The bug causing this was pretty nasty, I just posted a fix for it to the list. Anton From paulus at samba.org Mon Sep 20 20:41:19 2004 From: paulus at samba.org (Paul Mackerras) Date: Mon, 20 Sep 2004 05:41:19 -0500 Subject: [PATCH] ppc64: User tasks must have a valid thread.regs In-Reply-To: <20040920094016.GM2825@krispykreme> References: <20040920094016.GM2825@krispykreme> Message-ID: <16718.46031.81845.48901@cargo.ozlabs.ibm.com> Anton Blanchard writes: > Paul: does this change look OK to you? Yes, looks fine. I think I would have just set current->thread.regs unconditionally rather than only if it was zero, but it doesn't matter. Paul. From anton at samba.org Mon Sep 20 22:57:28 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 20 Sep 2004 22:57:28 +1000 Subject: [USER] OpenPower 720 support? 
In-Reply-To: <414B185E.6080904@starnetworks.us> References: <414B05CA.4040307@backtobasicsmgmt.com> <1095438699.4623.12.camel@localhost> <414B185E.6080904@starnetworks.us> Message-ID: <20040920125728.GP2825@krispykreme> > OK, how does your answer change if I _don't_ use DLPAR, but only static > LPAR changes made through the HMC? Most of my testing is done on standard Linus kernels. We have some virtual IO only partitions (vscsi, veth) that boot a normal Linus kernel and the root partition consists of something I whipped up with Debian's debootstrap. make pSeries_defconfig should be good, apart from having to change veth and vscsi from modules to built in. We should probably make that the default. Anton From kpfleming at starnetworks.us Tue Sep 21 00:06:58 2004 From: kpfleming at starnetworks.us (Kevin P. Fleming) Date: Mon, 20 Sep 2004 07:06:58 -0700 Subject: [USER] OpenPower 720 support? In-Reply-To: <20040920125728.GP2825@krispykreme> References: <414B05CA.4040307@backtobasicsmgmt.com> <1095438699.4623.12.camel@localhost> <414B185E.6080904@starnetworks.us> <20040920125728.GP2825@krispykreme> Message-ID: <414EE402.9080206@starnetworks.us> Anton Blanchard wrote: > Most of my testing is done on standard Linus kernels. We have some > virtual IO only partitions (vscsi, veth) that boot a normal Linus kernel > and the root partition consists of something I whipped up with Debian's > debootstrap. That's what I was hoping to hear. It's likely that I'll end up using NFS root kernels for some (if not all) of my partitions on this system, with the local disks only used for booting and swap space. > make pSeries_defconfig should be good, apart from having to change veth > and vscsi from modules to built in. We should probably make that the > default. Thanks for the hints; I'm sure I'll wander through the entire config file anyway looking for bits I can turn on, and it's likely I won't have module support enabled anyway unless I absolutely have to have it.
At this point I'm thinking I'll have one partition whose sole purpose will be to bond the two built-in GbE NICs together, run 802.1q VLANs on top of the bonded device, and bridge that into the virtual switch inside the 720. That partition will probably run out of a ramfs root filesystem, since it won't need much :-) Can anyone point me to some docs that show how Virtual SCSI actually gets configured? I've read nearly everything I can find on IBM's site, but so far I can't find anything that tells me how the SCSI devices will appear to the "Virtual SCSI client" partitions. I'm hoping they can all see an assigned portion of the shared SCSI disks, not have to be assigned whole disks each. From jschopp at austin.ibm.com Tue Sep 21 02:30:24 2004 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 20 Sep 2004 11:30:24 -0500 Subject: RFC: PPC64 hugepage rework In-Reply-To: <20040920030411.GA31433@zax> References: <20040920030411.GA31433@zax> Message-ID: <414F05A0.6090409@austin.ibm.com> > The patch below reworks the ppc64 hugepage code. Instead of using > specially marked pmd entries in the normal pagetables to represent > hugepages, use normal pte_t entries, in a special set of pagetables > used for hugepages only. > > Using pte_t instead of a special hugepte_t makes the code more similar > to that for other architectures, allowing more possibilities for > consolidating the hugepage code. Excellent! > -static void flush_hash_hugepage(mm_context_t context, unsigned long ea, > - hugepte_t pte, int local); > - If you remove this you might fix up flush_hash_page so it works with huge pages. Or did you do that and I just missed it?
> -int > -follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > - struct page **pages, struct vm_area_struct **vmas, > - unsigned long *position, int *length, int i) > +int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > + struct page **pages, struct vm_area_struct **vmas, > + unsigned long *position, int *length, int i) > { This seems like unnecessary churn. > -struct page * > -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write) > +struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, > + int write) > { Again, seems like a little extra churn. > -struct page * > -follow_huge_pmd(struct mm_struct *mm, unsigned long address, > - pmd_t *pmd, int write) > +struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address, > + pmd_t *pmd, int write) Ditto. > { > - struct page *page; > - > - BUG_ON(! pmd_hugepage(*pmd)); > - > - page = hugepte_page(*(hugepte_t *)pmd); > - if (page) > - page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT); > - return page; > + BUG(); > + return NULL; > } Why not just remove the function and let the compiler catch any errors instead of catching them at runtime? From rsa at us.ibm.com Tue Sep 21 04:42:43 2004 From: rsa at us.ibm.com (Ryan Arnold) Date: Mon, 20 Sep 2004 13:42:43 -0500 Subject: 2.6.9-rc2+BK hvc console oops In-Reply-To: <20040918070612.GK2825@krispykreme> References: <20040918070612.GK2825@krispykreme> Message-ID: <1095705762.3294.624.camel@localhost> On Sat, 2004-09-18 at 02:06, Anton Blanchard wrote: > Hi, > > Just got this oops on current -BK. > > Anton > > INIT: Sending processes the KILL signal > cpu 0x0: Vector: 300 (Data Access) at [c00000000f86f460] > pc: c0000000003fdb38: ._spin_lock_irqsave+0x30/0xbc > lr: c000000000221c18: .hvc_hangup+0x38/0xd4 > sp: c00000000f86f6e0 > msr: 8000000000001032 Ugh, The problem here is the same as was seen in hvc_close() earlier. I should have been paying more attention. 
My naive fix for blocking ldisc writes during a tty close by setting tty->driver_data = NULL isn't going to cut it. It looks like hangup is being called while tty_release() is waiting for a previously called driver level close() operation to complete. The tty layer doesn't seem to care about the fact that release_dev() is blocking. I think I may get rid of the tty->driver_data = NULL fix and use an atomic write/read on a flag during the close operation to prevent such things as writes and hangups during close ops. Ryan S. Arnold IBM Linux Technology Center From linas at austin.ibm.com Tue Sep 21 07:13:33 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 20 Sep 2004 16:13:33 -0500 Subject: [PATCH] ppc64: User tasks must have a valid thread.regs In-Reply-To: <20040920094016.GM2825@krispykreme> References: <20040920094016.GM2825@krispykreme> Message-ID: <20040920211333.GA1872@austin.ibm.com> On Mon, Sep 20, 2004 at 07:40:16PM +1000, Anton Blanchard was heard to remark: > > There have been reports of problems running UP ppc64 kernels where the > kernel would die in the floating point save/restore code. > > It turns out kernel threads that call exec (and so become user tasks) do > not have a valid thread.regs. This means init (pid 1) does not, it also > means anything called out of exec_usermodehelper does not. Once that > task has forked (eg init), then the thread.regs in the new task is > correctly set. > > On UP we do lazy save/restore of floating point regs. The SLES9 init is > doing floating point (the debian version of init appears not to). The > lack of thread.regs in init combined with the fact that it does floating > point leads to our lazy FP save/restore code blowing up. OK, but why wasn't SMP affected? SMP doesn't have lazy save/restore, it should have hit the null regs pointer "sooner".
--linas From linas at austin.ibm.com Tue Sep 21 08:19:33 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 20 Sep 2004 17:19:33 -0500 Subject: [PATCH] [PPC64] [TRIVIAL] Janitor whitespace in pSeries_pci.c Message-ID: <20040920221933.GB1872@austin.ibm.com> Hi, This file mixes tabs with 8 spaces, leading to poor display if one's editor doesn't have tab-stops set to 8. Please apply. --linas Signed-off-by: Linas Vepstas From linas at austin.ibm.com Tue Sep 21 08:31:21 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Mon, 20 Sep 2004 17:31:21 -0500 Subject: [PATCH] [PPC64] [TRIVIAL] Janitor whitespace in pSeries_pci.c In-Reply-To: <20040920221933.GB1872@austin.ibm.com> References: <20040920221933.GB1872@austin.ibm.com> Message-ID: <20040920223121.GC1872@austin.ibm.com> Forgot to attach the actual patch. On Mon, Sep 20, 2004 at 05:19:33PM -0500, Linas Vepstas was heard to remark: > Hi, This file mixes tabs with 8 spaces, leading to poor display if one's editor doesn't have tab-stops set to 8. Please apply. 
--linas Signed-off-by: Linas Vepstas -------------- next part -------------- ===== pSeries_pci.c 1.44 vs edited ===== --- 1.44/arch/ppc64/kernel/pSeries_pci.c Mon Sep 13 19:23:15 2004 +++ edited/pSeries_pci.c Mon Sep 20 17:08:48 2004 @@ -170,7 +170,7 @@ u8 intpin; struct device_node *node; - pci_read_config_byte(pci_dev, PCI_INTERRUPT_PIN, &intpin); + pci_read_config_byte(pci_dev, PCI_INTERRUPT_PIN, &intpin); if (intpin == 0) { PPCDBG(PPCDBG_BUSWALK,"\tDevice: %s No Interrupt used by device.\n", pci_name(pci_dev)); @@ -206,8 +206,8 @@ #define ISA_SPACE_IO 0x1 static void pci_process_ISA_OF_ranges(struct device_node *isa_node, - unsigned long phb_io_base_phys, - void * phb_io_base_virt) + unsigned long phb_io_base_phys, + void * phb_io_base_virt) { struct isa_range *range; unsigned long pci_addr; @@ -304,8 +304,8 @@ hose->io_base_phys, hose->io_base_virt); of_node_put(isa_dn); - /* Allow all IO */ - io_page_mask = -1; + /* Allow all IO */ + io_page_mask = -1; } } @@ -466,7 +466,7 @@ } of_prop = (struct property *)alloc_bootmem(sizeof(struct property) + - sizeof(phb->global_number)); + sizeof(phb->global_number)); if (!of_prop) { kfree(phb); @@ -571,30 +571,30 @@ for (i = 0; i < PCI_NUM_RESOURCES; i++) { if (dev->resource[i].flags & IORESOURCE_IO) { unsigned long offset = (unsigned long)hose->io_base_virt - pci_io_base; - unsigned long start, end, mask; + unsigned long start, end, mask; - start = dev->resource[i].start += offset; - end = dev->resource[i].end += offset; + start = dev->resource[i].start += offset; + end = dev->resource[i].end += offset; - /* Need to allow IO access to pages that are in the - ISA range */ - if (start < MAX_ISA_PORT) { - if (end > MAX_ISA_PORT) - end = MAX_ISA_PORT; - - start >>= PAGE_SHIFT; - end >>= PAGE_SHIFT; - - /* get the range of pages for the map */ - mask = ((1 << (end+1))-1) ^ ((1 << start)-1); - io_page_mask |= mask; - } + /* Need to allow IO access to pages that are in the + ISA range */ + if (start < MAX_ISA_PORT) { + if 
(end > MAX_ISA_PORT) + end = MAX_ISA_PORT; + + start >>= PAGE_SHIFT; + end >>= PAGE_SHIFT; + + /* get the range of pages for the map */ + mask = ((1 << (end+1))-1) ^ ((1 << start)-1); + io_page_mask |= mask; + } } - else if (dev->resource[i].flags & IORESOURCE_MEM) { + else if (dev->resource[i].flags & IORESOURCE_MEM) { dev->resource[i].start += hose->pci_mem_offset; dev->resource[i].end += hose->pci_mem_offset; } - } + } } EXPORT_SYMBOL(pcibios_fixup_device_resources); @@ -796,21 +796,21 @@ int pcibios_scan_all_fns(struct pci_bus *bus, int devfn) { - struct device_node *busdn, *dn; + struct device_node *busdn, *dn; + + if (bus->self) + busdn = pci_device_to_OF_node(bus->self); + else + busdn = bus->sysdata; /* must be a phb */ - if (bus->self) - busdn = pci_device_to_OF_node(bus->self); - else - busdn = bus->sysdata; /* must be a phb */ - - /* - * Check to see if there is any of the 8 functions are in the - * device tree. If they are then we need to scan all the - * functions of this slot. - */ - for (dn = busdn->child; dn; dn = dn->sibling) - if ((dn->devfn >> 3) == (devfn >> 3)) - return 1; + /* + * Check to see if there is any of the 8 functions are in the + * device tree. If they are then we need to scan all the + * functions of this slot. + */ + for (dn = busdn->child; dn; dn = dn->sibling) + if ((dn->devfn >> 3) == (devfn >> 3)) + return 1; - return 0; + return 0; } From david at gibson.dropbear.id.au Tue Sep 21 10:37:25 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Tue, 21 Sep 2004 10:37:25 +1000 Subject: RFC: PPC64 hugepage rework In-Reply-To: <414F05A0.6090409@austin.ibm.com> References: <20040920030411.GA31433@zax> <414F05A0.6090409@austin.ibm.com> Message-ID: <20040921003725.GB11990@zax> On Mon, Sep 20, 2004 at 11:30:24AM -0500, Joel Schopp wrote: > >The patch below reworks the ppc64 hugepage code. 
Instead of using > >specially marked pmd entries in the normal pagetables to represent > >hugepages, use normal pte_t entries, in a special set of pagetables > >used for hugepages only. > > > >Using pte_t instead of a special hugepte_t makes the code more similar > >to that for other architectures, allowing more possibilities for > >consolidating the hugepage code. > > Excellent! > > > >-static void flush_hash_hugepage(mm_context_t context, unsigned long ea, > >- hugepte_t pte, int local); > >- > > If you remove this you might fix up flush_hash_page so it works with > huge pages. Or did you do that and I just missed it? Did that - hpte_update() and flush_hash_page() now understand huge pages. > >-int > >-follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > >- struct page **pages, struct vm_area_struct **vmas, > >- unsigned long *position, int *length, int i) > >+int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, > >+ struct page **pages, struct vm_area_struct **vmas, > >+ unsigned long *position, int *length, int i) > > { > > This seems like unnecessary churn. True enough, removed. > > { > >- struct page *page; > >- > >- BUG_ON(! pmd_hugepage(*pmd)); > >- > >- page = hugepte_page(*(hugepte_t *)pmd); > >- if (page) > >- page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT); > >- return page; > >+ BUG(); > >+ return NULL; > > } > > Why not just remove the function and let the compiler catch any errors > instead of catching them at runtime? That function is referenced from generic code. It should never be called, because our pmd_huge() always returns zero, but that's not inline, so I don't think the compiler can know that. Below is a new version of the patch which also includes some other changes to make the code more similar to other architectures, in preparation for more consolidation work: Rework the ppc64 hugepage code.
Instead of using specially marked pmd entries in the normal pagetables to represent hugepages, use normal pte_t entries, in a special set of pagetables used for hugepages only. Using pte_t instead of a special hugepte_t makes the code more similar to that for other architecturess, allowing more possibilities for consolidating the hugepage code. Using independent pagetables for the hugepages is also a prerequisite for moving the hugepages into their own region well outside the normal user address space. The restrictions imposed by the powerpc mmu's segment design mean we probably want to do that in the fairly near future. Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2004-09-15 10:53:53.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2004-09-20 14:15:57.000000000 +1000 @@ -98,6 +98,7 @@ #define _PAGE_BUSY 0x0800 /* software: PTE & hash are busy */ #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ #define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */ +#define _PAGE_HUGE 0x10000 /* 16MB page */ /* Bits 0x7000 identify the index within an HPT Group */ #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX) /* PAGE_MASK gives the right answer below, but only by accident */ @@ -157,19 +158,19 @@ #endif /* __ASSEMBLY__ */ /* shift to put page number into pte */ -#define PTE_SHIFT (16) +#define PTE_SHIFT (17) /* We allow 2^41 bytes of real memory, so we need 29 bits in the PMD * to give the PTE page number. The bottom two bits are for flags. 
*/ #define PMD_TO_PTEPAGE_SHIFT (2) #ifdef CONFIG_HUGETLB_PAGE -#define _PMD_HUGEPAGE 0x00000001U -#define HUGEPTE_BATCH_SIZE (1<<(HPAGE_SHIFT-PMD_SHIFT)) #ifndef __ASSEMBLY__ int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local); + +void hugetlb_mm_free_pgd(struct mm_struct *mm); #endif /* __ASSEMBLY__ */ #define HAVE_ARCH_UNMAPPED_AREA @@ -177,7 +178,7 @@ #else #define hash_huge_page(mm,a,ea,vsid,local) -1 -#define _PMD_HUGEPAGE 0 +#define hugetlb_mm_free_pgd(mm) do {} while (0) #endif @@ -213,10 +214,8 @@ #define pmd_set(pmdp, ptep) \ (pmd_val(*(pmdp)) = (__ba_to_bpn(ptep) << PMD_TO_PTEPAGE_SHIFT)) #define pmd_none(pmd) (!pmd_val(pmd)) -#define pmd_hugepage(pmd) (!!(pmd_val(pmd) & _PMD_HUGEPAGE)) -#define pmd_bad(pmd) (((pmd_val(pmd)) == 0) || pmd_hugepage(pmd)) -#define pmd_present(pmd) ((!pmd_hugepage(pmd)) \ - && (pmd_val(pmd) & ~_PMD_HUGEPAGE) != 0) +#define pmd_bad(pmd) (pmd_val(pmd) == 0) +#define pmd_present(pmd) (pmd_val(pmd) != 0) #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0) #define pmd_page_kernel(pmd) \ (__bpn_to_ba(pmd_val(pmd) >> PMD_TO_PTEPAGE_SHIFT)) @@ -269,6 +268,7 @@ static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;} static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;} static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;} +static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_HUGE;} static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; } static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; } @@ -294,6 +294,8 @@ pte_val(pte) |= _PAGE_DIRTY; return pte; } static inline pte_t pte_mkyoung(pte_t pte) { pte_val(pte) |= _PAGE_ACCESSED; return pte; } +static inline pte_t pte_mkhuge(pte_t pte) { + pte_val(pte) |= _PAGE_HUGE; return pte; } /* Atomic PTE updates */ static inline unsigned long pte_update(pte_t *p, unsigned long clr) @@ -464,6 +466,10 @@ extern void 
paging_init(void); +struct mmu_gather; +void hugetlb_free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *prev, + unsigned long start, unsigned long end); + /* * This gets called at the end of handling a page fault, when * the kernel has put a new PTE into the page table for the process. Index: working-2.6/arch/ppc64/mm/hugetlbpage.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hugetlbpage.c 2004-09-21 10:34:46.384789688 +1000 @@ -27,116 +27,143 @@ #include -/* HugePTE layout: - * - * 31 30 ... 15 14 13 12 10 9 8 7 6 5 4 3 2 1 0 - * PFN>>12..... - - - - - - HASH_IX.... 2ND HASH RW - HG=1 - */ +#define HUGEPGDIR_SHIFT (HPAGE_SHIFT + PAGE_SHIFT - 3) +#define HUGEPGDIR_SIZE (1UL << HUGEPGDIR_SHIFT) +#define HUGEPGDIR_MASK (~(HUGEPGDIR_SIZE-1)) + +#define HUGEPTE_INDEX_SIZE 9 +#define HUGEPGD_INDEX_SIZE 10 + +#define PTRS_PER_HUGEPTE (1 << HUGEPTE_INDEX_SIZE) +#define PTRS_PER_HUGEPGD (1 << HUGEPGD_INDEX_SIZE) -#define HUGEPTE_SHIFT 15 -#define _HUGEPAGE_PFN 0xffff8000 -#define _HUGEPAGE_BAD 0x00007f00 -#define _HUGEPAGE_HASHPTE 0x00000008 -#define _HUGEPAGE_SECONDARY 0x00000010 -#define _HUGEPAGE_GROUP_IX 0x000000e0 -#define _HUGEPAGE_HPTEFLAGS (_HUGEPAGE_HASHPTE | _HUGEPAGE_SECONDARY | \ - _HUGEPAGE_GROUP_IX) -#define _HUGEPAGE_RW 0x00000004 - -typedef struct {unsigned int val;} hugepte_t; -#define hugepte_val(hugepte) ((hugepte).val) -#define __hugepte(x) ((hugepte_t) { (x) } ) -#define hugepte_pfn(x) \ - ((unsigned long)(hugepte_val(x)>>HUGEPTE_SHIFT) << HUGETLB_PAGE_ORDER) -#define mk_hugepte(page,wr) __hugepte( \ - ((page_to_pfn(page)>>HUGETLB_PAGE_ORDER) << HUGEPTE_SHIFT ) \ - | (!!(wr) * _HUGEPAGE_RW) | _PMD_HUGEPAGE ) - -#define hugepte_bad(x) ( !(hugepte_val(x) & _PMD_HUGEPAGE) || \ - (hugepte_val(x) & _HUGEPAGE_BAD) ) -#define hugepte_page(x) pfn_to_page(hugepte_pfn(x)) -#define hugepte_none(x) (!(hugepte_val(x) & 
_HUGEPAGE_PFN)) - - -static void flush_hash_hugepage(mm_context_t context, unsigned long ea, - hugepte_t pte, int local); - -static inline unsigned int hugepte_update(hugepte_t *p, unsigned int clr, - unsigned int set) -{ - unsigned int old, tmp; - - __asm__ __volatile__( - "1: lwarx %0,0,%3 # pte_update\n\ - andc %1,%0,%4 \n\ - or %1,%1,%5 \n\ - stwcx. %1,0,%3 \n\ - bne- 1b" - : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p) - : "cc" ); - return old; +static inline int hugepgd_index(unsigned long addr) +{ + return (addr & ~REGION_MASK) >> HUGEPGDIR_SHIFT; } -static inline void set_hugepte(hugepte_t *ptep, hugepte_t pte) +static pgd_t *hugepgd_offset(struct mm_struct *mm, unsigned long addr) { - hugepte_update(ptep, ~_HUGEPAGE_HPTEFLAGS, - hugepte_val(pte) & ~_HUGEPAGE_HPTEFLAGS); + int index; + + if (! mm->context.huge_pgdir) + return NULL; + + + index = hugepgd_index(addr); + BUG_ON(index >= PTRS_PER_HUGEPGD); + return mm->context.huge_pgdir + index; } -static hugepte_t *hugepte_alloc(struct mm_struct *mm, unsigned long addr) +static inline pte_t *hugepte_offset(pgd_t *dir, unsigned long addr) { - pgd_t *pgd; - pmd_t *pmd = NULL; + int index; - BUG_ON(!in_hugepage_area(mm->context, addr)); + if (pgd_none(*dir)) + return NULL; - pgd = pgd_offset(mm, addr); - pmd = pmd_alloc(mm, pgd, addr); + index = (addr >> HPAGE_SHIFT) % PTRS_PER_HUGEPTE; + return (pte_t *)pgd_page(*dir) + index; +} - /* We shouldn't find a (normal) PTE page pointer here */ - BUG_ON(!pmd_none(*pmd) && !pmd_hugepage(*pmd)); - - return (hugepte_t *)pmd; +static pgd_t *hugepgd_alloc(struct mm_struct *mm, unsigned long addr) +{ + BUG_ON(! in_hugepage_area(mm->context, addr)); + + if (! 
mm->context.huge_pgdir) { + pgd_t *new; + spin_unlock(&mm->page_table_lock); + /* Don't use pgd_alloc(), because we want __GFP_REPEAT */ + new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT); + BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE)); + spin_lock(&mm->page_table_lock); + + /* + * Because we dropped the lock, we should re-check the + * entry, as somebody else could have populated it.. + */ + if (mm->context.huge_pgdir) + pgd_free(new); + else + mm->context.huge_pgdir = new; + } + return hugepgd_offset(mm, addr); } -static hugepte_t *hugepte_offset(struct mm_struct *mm, unsigned long addr) +static pte_t *hugepte_alloc(struct mm_struct *mm, pgd_t *dir, + unsigned long addr) { - pgd_t *pgd; - pmd_t *pmd = NULL; + if (! pgd_present(*dir)) { + pte_t *new; - BUG_ON(!in_hugepage_area(mm->context, addr)); + spin_unlock(&mm->page_table_lock); + new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT); + BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE)); + spin_lock(&mm->page_table_lock); + /* + * Because we dropped the lock, we should re-check the + * entry, as somebody else could have populated it.. + */ + if (pgd_present(*dir)) { + if (new) + kmem_cache_free(zero_cache, new); + } else { + struct page *ptepage; - pgd = pgd_offset(mm, addr); - if (pgd_none(*pgd)) - return NULL; + if (! new) + return NULL; + ptepage = virt_to_page(new); + ptepage->mapping = (void *) mm; + ptepage->index = addr & HUGEPGDIR_MASK; + pgd_populate(mm, dir, new); + } + } - pmd = pmd_offset(pgd, addr); + return hugepte_offset(dir, addr); +} + +static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) +{ + pgd_t *pgd; - /* We shouldn't find a (normal) PTE page pointer here */ - BUG_ON(!pmd_none(*pmd) && !pmd_hugepage(*pmd)); + BUG_ON(! in_hugepage_area(mm->context, addr)); - return (hugepte_t *)pmd; + pgd = hugepgd_offset(mm, addr); + if (! 
pgd) + return NULL; + + return hugepte_offset(pgd, addr); } -static void setup_huge_pte(struct mm_struct *mm, struct page *page, - hugepte_t *ptep, int write_access) +static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) { - hugepte_t entry; - int i; + pgd_t *pgd; - mm->rss += (HPAGE_SIZE / PAGE_SIZE); - entry = mk_hugepte(page, write_access); - for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) - set_hugepte(ptep+i, entry); + BUG_ON(! in_hugepage_area(mm->context, addr)); + + pgd = hugepgd_alloc(mm, addr); + if (! pgd) + return NULL; + + return hugepte_alloc(mm, pgd, addr); } -static void teardown_huge_pte(hugepte_t *ptep) +static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, + struct page *page, pte_t *ptep, int write_access) { - int i; + pte_t entry; - for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) - pmd_clear((pmd_t *)(ptep+i)); + mm->rss += (HPAGE_SIZE / PAGE_SIZE); + if (write_access) { + entry = + pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot))); + } else { + entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot)); + } + entry = pte_mkyoung(entry); + entry = pte_mkhuge(entry); + + set_pte(ptep, entry); } /* @@ -267,34 +294,31 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, struct vm_area_struct *vma) { - hugepte_t *src_pte, *dst_pte, entry; + pte_t *src_pte, *dst_pte, entry; struct page *ptepage; unsigned long addr = vma->vm_start; unsigned long end = vma->vm_end; + int err = -ENOMEM; while (addr < end) { - BUG_ON(! in_hugepage_area(src->context, addr)); - BUG_ON(! 
in_hugepage_area(dst->context, addr)); - - dst_pte = hugepte_alloc(dst, addr); + dst_pte = huge_pte_alloc(dst, addr); if (!dst_pte) - return -ENOMEM; + goto out; - src_pte = hugepte_offset(src, addr); + src_pte = huge_pte_offset(src, addr); entry = *src_pte; - if ((addr % HPAGE_SIZE) == 0) { - /* This is the first hugepte in a batch */ - ptepage = hugepte_page(entry); - get_page(ptepage); - dst->rss += (HPAGE_SIZE / PAGE_SIZE); - } - set_hugepte(dst_pte, entry); - + ptepage = pte_page(entry); + get_page(ptepage); + dst->rss += (HPAGE_SIZE / PAGE_SIZE); + set_pte(dst_pte, entry); - addr += PMD_SIZE; + addr += HPAGE_SIZE; } - return 0; + + err = 0; + out: + return err; } int @@ -309,18 +333,16 @@ vpfn = vaddr/PAGE_SIZE; while (vaddr < vma->vm_end && remainder) { - BUG_ON(!in_hugepage_area(mm->context, vaddr)); - if (pages) { - hugepte_t *pte; + pte_t *pte; struct page *page; - pte = hugepte_offset(mm, vaddr); + pte = huge_pte_offset(mm, vaddr); /* hugetlb should be locked, and hence, prefaulted */ - WARN_ON(!pte || hugepte_none(*pte)); + WARN_ON(!pte || pte_none(*pte)); - page = &hugepte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)]; + page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)]; WARN_ON(!PageCompound(page)); @@ -346,26 +368,31 @@ struct page * follow_huge_addr(struct mm_struct *mm, unsigned long address, int write) { - return ERR_PTR(-EINVAL); + pte_t *ptep; + struct page *page; + + if (! in_hugepage_area(mm->context, address)) + return ERR_PTR(-EINVAL); + + ptep = huge_pte_offset(mm, address); + page = pte_page(*ptep); + if (page) + page += (address % HPAGE_SIZE) / PAGE_SIZE; + + return page; } int pmd_huge(pmd_t pmd) { - return pmd_hugepage(pmd); + return 0; } struct page * follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write) { - struct page *page; - - BUG_ON(! 
pmd_hugepage(*pmd)); - - page = hugepte_page(*(hugepte_t *)pmd); - if (page) - page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT); - return page; + BUG(); + return NULL; } void unmap_hugepage_range(struct vm_area_struct *vma, @@ -373,44 +400,38 @@ { struct mm_struct *mm = vma->vm_mm; unsigned long addr; - hugepte_t *ptep; + pte_t *ptep; struct page *page; - int cpu; - int local = 0; - cpumask_t tmp; WARN_ON(!is_vm_hugetlb_page(vma)); BUG_ON((start % HPAGE_SIZE) != 0); BUG_ON((end % HPAGE_SIZE) != 0); - /* XXX are there races with checking cpu_vm_mask? - Anton */ - cpu = get_cpu(); - tmp = cpumask_of_cpu(cpu); - if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp)) - local = 1; - for (addr = start; addr < end; addr += HPAGE_SIZE) { - hugepte_t pte; - - BUG_ON(!in_hugepage_area(mm->context, addr)); + pte_t pte; - ptep = hugepte_offset(mm, addr); - if (!ptep || hugepte_none(*ptep)) + ptep = huge_pte_offset(mm, addr); + if (!ptep || pte_none(*ptep)) continue; pte = *ptep; - page = hugepte_page(pte); - teardown_huge_pte(ptep); - - if (hugepte_val(pte) & _HUGEPAGE_HASHPTE) - flush_hash_hugepage(mm->context, addr, - pte, local); + page = pte_page(pte); + pte_clear(ptep); put_page(page); } - put_cpu(); - mm->rss -= (end - start) >> PAGE_SHIFT; + flush_tlb_pending(); +} + +void hugetlb_free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *prev, + unsigned long start, unsigned long end) +{ + /* Because the huge pgtables are only 2 level, they can take + * at most around 4M, much less than one hugepage which the + * process is presumably entitled to use. So we don't bother + * freeing up the pagetables on unmap, and wait until + * destroy_context() to clean up the lot. 
*/ } int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma) @@ -426,16 +447,14 @@ spin_lock(&mm->page_table_lock); for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) { unsigned long idx; - hugepte_t *pte = hugepte_alloc(mm, addr); + pte_t *pte = huge_pte_alloc(mm, addr); struct page *page; - BUG_ON(!in_hugepage_area(mm->context, addr)); - if (!pte) { ret = -ENOMEM; goto out; } - if (!hugepte_none(*pte)) + if (! pte_none(*pte)) continue; idx = ((addr - vma->vm_start) >> HPAGE_SHIFT) @@ -462,7 +481,7 @@ goto out; } } - setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE); + set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE); } out: spin_unlock(&mm->page_table_lock); @@ -716,20 +735,55 @@ } } +void hugetlb_mm_free_pgd(struct mm_struct *mm) +{ + int i; + pgd_t *pgdir; + + spin_lock(&mm->page_table_lock); + + pgdir = mm->context.huge_pgdir; + if (! pgdir) + return; + + mm->context.huge_pgdir = NULL; + + /* cleanup any hugepte pages leftover */ + for (i = 0; i < PTRS_PER_HUGEPGD; i++) { + pgd_t *pgd = pgdir + i; + + if (! 
pgd_none(*pgd)) { + pte_t *pte = (pte_t *)pgd_page(*pgd); + struct page *ptepage = virt_to_page(pte); + + ptepage->mapping = NULL; + + BUG_ON(memcmp(pte, empty_zero_page, PAGE_SIZE)); + kmem_cache_free(zero_cache, pte); + } + pgd_clear(pgd); + } + + BUG_ON(memcmp(pgdir, empty_zero_page, PAGE_SIZE)); + kmem_cache_free(zero_cache, pgdir); + + spin_unlock(&mm->page_table_lock); +} + int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local) { - hugepte_t *ptep; + pte_t *ptep; unsigned long va, vpn; int is_write; - hugepte_t old_pte, new_pte; - unsigned long hpteflags, prpn, flags; + pte_t old_pte, new_pte; + unsigned long hpteflags, prpn; long slot; + int err = 1; + + spin_lock(&mm->page_table_lock); - /* We have to find the first hugepte in the batch, since - * that's the one that will store the HPTE flags */ - ea &= HPAGE_MASK; - ptep = hugepte_offset(mm, ea); + ptep = huge_pte_offset(mm, ea); /* Search the Linux page table for a match with va */ va = (vsid << 28) | (ea & 0x0fffffff); @@ -739,19 +793,18 @@ * If no pte found or not present, send the problem up to * do_page_fault */ - if (unlikely(!ptep || hugepte_none(*ptep))) - return 1; + if (unlikely(!ptep || pte_none(*ptep))) + goto out; - BUG_ON(hugepte_bad(*ptep)); +/* BUG_ON(pte_bad(*ptep)); */ /* * Check the user's access rights to the page. If access should be * prevented then send the problem up to do_page_fault. */ is_write = access & _PAGE_RW; - if (unlikely(is_write && !(hugepte_val(*ptep) & _HUGEPAGE_RW))) - return 1; - + if (unlikely(is_write && !(pte_val(*ptep) & _PAGE_RW))) + goto out; /* * At this point, we have a pte (old_pte) which can be used to build * or update an HPTE. There are 2 cases: @@ -764,41 +817,40 @@ * page is currently not DIRTY. */ - spin_lock_irqsave(&mm->page_table_lock, flags); old_pte = *ptep; new_pte = old_pte; - hpteflags = 0x2 | (! (hugepte_val(new_pte) & _HUGEPAGE_RW)); + hpteflags = 0x2 | (! 
(pte_val(new_pte) & _PAGE_RW)); /* Check if pte already has an hpte (case 2) */ - if (unlikely(hugepte_val(old_pte) & _HUGEPAGE_HASHPTE)) { + if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) { /* There MIGHT be an HPTE for this pte */ unsigned long hash, slot; hash = hpt_hash(vpn, 1); - if (hugepte_val(old_pte) & _HUGEPAGE_SECONDARY) + if (pte_val(old_pte) & _PAGE_SECONDARY) hash = ~hash; slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; - slot += (hugepte_val(old_pte) & _HUGEPAGE_GROUP_IX) >> 5; + slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12; if (ppc_md.hpte_updatepp(slot, hpteflags, va, 1, local) == -1) - hugepte_val(old_pte) &= ~_HUGEPAGE_HPTEFLAGS; + pte_val(old_pte) &= ~_PAGE_HPTEFLAGS; } - if (likely(!(hugepte_val(old_pte) & _HUGEPAGE_HASHPTE))) { + if (likely(!(pte_val(old_pte) & _PAGE_HASHPTE))) { unsigned long hash = hpt_hash(vpn, 1); unsigned long hpte_group; - prpn = hugepte_pfn(old_pte); + prpn = pte_pfn(old_pte); repeat: hpte_group = ((hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL; /* Update the linux pte with the HPTE slot */ - hugepte_val(new_pte) &= ~_HUGEPAGE_HPTEFLAGS; - hugepte_val(new_pte) |= _HUGEPAGE_HASHPTE; + pte_val(new_pte) &= ~_PAGE_HPTEFLAGS; + pte_val(new_pte) |= _PAGE_HASHPTE; /* Add in WIMG bits */ /* XXX We should store these in the pte */ @@ -809,7 +861,7 @@ /* Primary is full, try the secondary */ if (unlikely(slot == -1)) { - hugepte_val(new_pte) |= _HUGEPAGE_SECONDARY; + pte_val(new_pte) |= _PAGE_SECONDARY; hpte_group = ((~hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP) & ~0x7UL; slot = ppc_md.hpte_insert(hpte_group, va, prpn, @@ -826,39 +878,20 @@ if (unlikely(slot == -2)) panic("hash_huge_page: pte_insert failed\n"); - hugepte_val(new_pte) |= (slot<<5) & _HUGEPAGE_GROUP_IX; + pte_val(new_pte) |= (slot<<12) & _PAGE_GROUP_IX; /* * No need to use ldarx/stdcx here because all who * might be updating the pte will hold the - * page_table_lock or the hash_table_lock - * (we hold both) + * 
page_table_lock */ *ptep = new_pte; } - spin_unlock_irqrestore(&mm->page_table_lock, flags); - - return 0; -} - -static void flush_hash_hugepage(mm_context_t context, unsigned long ea, - hugepte_t pte, int local) -{ - unsigned long vsid, vpn, va, hash, slot; - - BUG_ON(hugepte_bad(pte)); - BUG_ON(!in_hugepage_area(context, ea)); - - vsid = get_vsid(context.id, ea); + err = 0; - va = (vsid << 28) | (ea & 0x0fffffff); - vpn = va >> HPAGE_SHIFT; - hash = hpt_hash(vpn, 1); - if (hugepte_val(pte) & _HUGEPAGE_SECONDARY) - hash = ~hash; - slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; - slot += (hugepte_val(pte) & _HUGEPAGE_GROUP_IX) >> 5; + out: + spin_unlock(&mm->page_table_lock); - ppc_md.hpte_invalidate(slot, va, 1, local); + return err; } Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2004-09-20 14:15:57.000000000 +1000 @@ -24,6 +24,7 @@ typedef struct { mm_context_id_t id; #ifdef CONFIG_HUGETLB_PAGE + pgd_t *huge_pgdir; u16 htlb_segs; /* bitmask */ #endif } mm_context_t; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2004-09-20 14:15:57.000000000 +1000 @@ -64,7 +64,6 @@ #define is_hugepage_only_range(addr, len) \ (touches_hugepage_high_range((addr), (len)) || \ touches_hugepage_low_range((addr), (len))) -#define hugetlb_free_pgtables free_pgtables #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA #define in_hugepage_area(context, addr) \ Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2004-09-20 14:15:57.000000000 +1000 @@ 
-484,6 +484,12 @@ int index; int err; +#ifdef CONFIG_HUGETLB_PAGE + /* We leave htlb_segs as it was, but for a fork, we need to + * clear the huge_pgdir. */ + mm->context.huge_pgdir = NULL; +#endif + again: if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL)) return -ENOMEM; @@ -514,6 +520,8 @@ spin_unlock(&mmu_context_lock); mm->context.id = NO_CONTEXT; + + hugetlb_mm_free_pgd(mm); } static int __init mmu_context_init(void) Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2004-09-20 10:12:50.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2004-09-20 14:15:57.000000000 +1000 @@ -321,9 +321,7 @@ int local) { unsigned long vsid, vpn, va, hash, secondary, slot; - - /* XXX fix for large ptes */ - unsigned long large = 0; + unsigned long huge = pte_huge(pte); if ((ea >= USER_START) && (ea <= USER_END)) vsid = get_vsid(context, ea); @@ -331,18 +329,18 @@ vsid = get_kernel_vsid(ea); va = (vsid << 28) | (ea & 0x0fffffff); - if (large) + if (huge) vpn = va >> HPAGE_SHIFT; else vpn = va >> PAGE_SHIFT; - hash = hpt_hash(vpn, large); + hash = hpt_hash(vpn, huge); secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15; if (secondary) hash = ~hash; slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; slot += (pte_val(pte) & _PAGE_GROUP_IX) >> 12; - ppc_md.hpte_invalidate(slot, va, large, local); + ppc_md.hpte_invalidate(slot, va, huge, local); } void flush_hash_range(unsigned long context, unsigned long number, int local) -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. 
http://www.ozlabs.org/people/dgibson From anton at samba.org Tue Sep 21 10:50:44 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 21 Sep 2004 10:50:44 +1000 Subject: [PATCH] ppc64: User tasks must have a valid thread.regs In-Reply-To: <20040920211333.GA1872@austin.ibm.com> References: <20040920094016.GM2825@krispykreme> <20040920211333.GA1872@austin.ibm.com> Message-ID: <20040921005044.GQ2825@krispykreme> Hi, > OK, but why wasn't SMP affected? SMP doesn't have lazy save/restore, > it should have hit the null regs pointer "sooner". Because the FPR space is allocated for all tasks in the thread struct: struct thread_struct { unsigned long ksp; /* Kernel stack pointer */ unsigned long ksp_vsid; struct pt_regs *regs; /* Pointer to saved register state */ mm_segment_t fs; /* for get_fs() validation */ double fpr[32]; /* Complete floating point set */ unsigned long fpscr; /* Floating point status (plus pad) */ unsigned long fpexc_mode; /* Floating-point exception mode */ unsigned long pad[3]; /* was saved_msr, saved_softe */ #ifdef CONFIG_ALTIVEC /* Complete AltiVec register set */ vector128 vr[32] __attribute((aligned(16))); /* AltiVec status */ vector128 vscr __attribute((aligned(16))); unsigned long vrsave; int used_vr; /* set if process has used altivec */ #endif /* CONFIG_ALTIVEC */ }; The SMP lazy FP stuff doesn't touch stuff in thread->regs, but the UP one does. It hits thread->regs.msr for the previous FP task, I think. Anton From anil_411 at yahoo.com Tue Sep 21 15:52:09 2004 From: anil_411 at yahoo.com (Anil Kumar Prasad) Date: Mon, 20 Sep 2004 22:52:09 -0700 (PDT) Subject: [PATCH] unexported flush_tlb in 2.6 ppc64 kernel In-Reply-To: <20040920094407.GO2825@krispykreme> Message-ID: <20040921055209.11075.qmail@web11506.mail.yahoo.com> > > I was trying to get my 2.4 module code working on > 2.6 > > ppc64 (2.6.5-7.97 SLES9). It seems tlb_flush > routines > > are no longer exported for module usage. Same is > the > > case with 2.6.8 > > ppc64 kernel.
> > > I could see tlb_flush routines still exported on > all > > other platforms. Is there any reason for > un-exporting > > it on ppc64? > > I can't think of a reason, but the code is new so we > probably haven't seen > this issue yet. > > Could you send me a patch which adds the exports? On > 2.6 you can put the > EXPORT_SYMBOL right next to where the thing is > defined. Attached is the patch. I had to #define tlb_flush_pending to __tlb_flush_pending and merge tlb_flush_pending's few-line body into __tlb_flush_pending, as exporting the per-cpu tlb_batch data didn't make sense to me. Thanks, Anil. __________________________________ Do you Yahoo!? Read only the mail you want - Yahoo! Mail SpamGuard. http://promotions.yahoo.com/new_mail -------------- next part -------------- A non-text attachment was scrubbed... Name: tlb_export.patch Type: application/octet-stream Size: 1917 bytes Desc: tlb_export.patch Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040920/b30c583f/attachment.obj From benh at kernel.crashing.org Wed Sep 22 20:39:58 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 22 Sep 2004 20:39:58 +1000 Subject: [PATCH] Monster cleanup patch #2 Message-ID: <1095849598.20684.23.camel@gaston> Hi ! This is the second version of the monster cleanup patch (I dropped the word "boot" as it cleans up a bit more than the boot code). Of course, we can still clean up further ;) But at least this gives us a "first" step that I intend to submit upstream as soon as I have fixed legacy iSeries (help welcome finding out what's wrong...) So this is close to the final patch (except for iSeries not booting, but I can't seem to find a way to output early debug messages on those; if some US folks who know those beasts would be willing to help, it would be much appreciated). What it does is: - Cut all ties between prom_init (code running in the context of the firmware at boot) and the rest of the kernel.
prom_init() and functions it uses are now in a separate file and no longer share any globals with the rest of the kernel. It's not yet a separate link entity but it could be made one. The only communication between prom_init() and the rest of the kernel goes through a block of memory passed by prom_init() which contains a flattened version of the device-tree and a map of reserved physical memory areas. Any other information is passed via properties added to the device-tree. I also did some major cleanup in the prom_init code, especially in the way it allocates memory; if you are interested, read the various comments in there. This work's main goal is to enable implementation of kexec, but it also helps the support of additional platform types in various other ways. - Rework the early kernel initialization (early_setup() is now called from the assembly in real mode before setup_system(), which itself gets rid of some of the ifdef's (not all yet) and switch/case; look at the comments in the code). The goal is to adapt to the prom_init changes, make the code more readable, and make addition of new platform types easier. - Cut the ties between pSeries and PowerMac. Now, the kernel config provides a choice between legacy iSeries and "multiplatform". The latter is a set of supported platforms, each of them being a boolean switch; currently defined are pSeries and PowerMac. You can enable both or just one of them. CONFIG_PPC_PSERIES is now specifically set for IBM pSeries support; you can build a PowerMac kernel without pSeries support if you wish. The main goal here is to simplify addition of new platform types. At this point, it's very difficult to split the patch into incremental changes; there are a _lot_ of interdependencies. I'll try to get it merged at once as soon as the iSeries problem is fixed.
NOTE: To apply the patch, you need first to: rm arch/ppc64/boot/addSystemMap.c rm include/asm-ppc64/bootx.h mv arch/ppc64/kernel/chrp_setup.c arch/ppc64/kernel/pSeries_setup.c mv arch/ppc64/kernel/pSeries_htab.c arch/ppc64/mm/hash_native.c I intentionally didn't include those changes in the patch, so you can use the "bk" equivalents and it does the right thing for bitkeeper users. After the patch is applied, if you use revision control, don't forget to check in those 2 new files: include/asm-ppc64/plpar_wrappers.h arch/ppc64/kernel/prom_init.c The patch itself is too big for the lists and is available at: http://gate.crashing.org/~benh/monster-cleanup-2.diff Ben.
From linas at austin.ibm.com Thu Sep 23 06:14:01 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 22 Sep 2004 15:14:01 -0500 Subject: Hotplug: crash in sys_clone()/do_fork() Message-ID: <20040922201401.GA18954@austin.ibm.com> Hi, I'm tripping over a race condition involving file handling. I can consistently crash here: TASK: c0000000fd6dc040[10448] 'ifdown' THREAD: c0000000f3310000 Call Trace: Sep 22 14:29:21 marulp1 kernel: [c0000000f3313b10] [c0000000000506c4] .copy_files+0x400/0x414 (unreliable) [c0000000f3313bd0] [c00000000005161c] .copy_process+0x660/0x12bc [c0000000f3313ce0] [c000000000052318] .do_fork+0xa0/0x25c [c0000000f3313dc0] [c0000000000159c8] .sys_clone+0x5c/0x74 [c0000000f3313e30] [c000000000010a88] .ppc_clone+0x8/0xc The problem seems to be that one of the file pointers is briefly set to (int32)-1 even on a 64-bit machine. The part of copy_process() that gets mashed by this is: for (i = open_files; i != 0; i--) { struct file *f = *old_fds++; if (f) get_file(f); <== derefs f, which is -1 *new_fds++ = f; } By inserting if(f==(void*)0xffffffffUL) printk ... I can find out that i=230 is the one with the problem and open_files=256! I haven't yet found who set the struct file * to -1.
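In toy form, the failure mode above looks like this (a self-contained model, not the kernel code): a poisoned slot holding the 32-bit -1 sentinel compares equal to the sentinel and must be skipped rather than dereferenced:

```c
#include <stddef.h>

/* Toy model of the fd-copy loop above (self-contained; not the kernel
 * code).  A poisoned slot holding the 32-bit -1 sentinel is detected
 * and nulled out instead of being dereferenced. */
struct file;  /* opaque stand-in for the kernel's struct file */

#define BAD_FDP ((struct file *)0xffffffffUL)  /* (int32)-1 as a pointer */

/* Copy n file pointers from src to dst, returning how many slots held
 * the bogus sentinel. */
int copy_fds(struct file **dst, struct file **src, int n)
{
    int bad = 0;
    int i;
    for (i = 0; i < n; i++) {
        struct file *f = src[i];
        if (f == BAD_FDP) {   /* same comparison as the printk probe */
            bad++;
            f = NULL;         /* never dereference a poisoned slot */
        }
        dst[i] = f;
    }
    return bad;
}
```

This only models the detection; the real fix, of course, is to find who writes the -1 in the first place.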
I'm generating this behaviour with a hotplug event that is causing ifdown and ifup to run simultaneously. (The device driver was shut down and restarted, causing simultaneous hotplug events). Although the above stack shows ifdown getting clobbered, I've also seen pci.agent be the process that suffers. The problem goes away if I insert a sleep of about half-a-second or more between the device driver shutdown and startup. Affected machine is a ppc64 power4 box. I've seen the problem for a long time (months?), including Monday's bk clone of bkbits of 2.6.9-rc2; waiting for this bug "to fix itself" doesn't seem to be working. --linas p.s. am I supposed to be using the OSDL bugzilla to report & track bugs like this?
From dsteklof at us.ibm.com Thu Sep 23 08:36:21 2004 From: dsteklof at us.ibm.com (Daniel Stekloff) Date: Wed, 22 Sep 2004 15:36:21 -0700 Subject: 32-bit or 64-bit default applications? Message-ID: <1095892581.3050.9.camel@localhost.localdomain> Hi! What should applications default to when built on a ppc64 system - 32 or 64-bit? Do they default to 32-bit unless explicitly changed to build 64-bit? I'm porting netdump, and the netdump server's ppc port currently defaults to being built 32-bit. The new server is meant to be platform-agnostic, so dumping from a ppc64 system to the 32-bit server works fine. I'm wondering, however, if I should build the netdump-server as 64-bit by default on a 64-bit system. I'm not sure what the policy is, if there's one. Thanks, Dan
From benh at kernel.crashing.org Thu Sep 23 17:48:50 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 23 Sep 2004 17:48:50 +1000 Subject: [PATCH] ppc64: monster cleanup Message-ID: <1095925729.21793.47.camel@gaston> Hi ! This is the long-awaited "monster cleanup" patch. It's huge and moves some files around, so to apply it, you need to do some file renaming first (I did that to avoid bloating the patch and so that Linus can do proper bk mv and not lose history).
Details about the patch content are later in this mail; I'll start with the details of applying it first ;) The patch should be merged asap. It's a bit difficult to maintain out of tree and any further patch after this one will be a pain to backport. Since 2.6.9 will be a milestone for some distros, it would be nasty if, for example, this went into early 2.6.10, with any further patch then being a pain to backport to 2.6.9. But then, I'll let you decide here :) It would be extremely difficult to break this patch into pieces. However, I have tested it on a wide variety of ppc64 machines (rs64 iSeries, POWER3 pSeries, POWER4 pSeries SMP, POWER4 pSeries LPAR, POWER5 pSeries LPAR with SMT, js20 blade, and of course Apple G5). In general, if the machine boots, it will work; there is little to no impact on code that is run after boot. To apply the patch, you need first to: rm arch/ppc64/boot/addSystemMap.c rm include/asm-ppc64/bootx.h mv arch/ppc64/kernel/chrp_setup.c arch/ppc64/kernel/pSeries_setup.c mv arch/ppc64/kernel/pSeries_htab.c arch/ppc64/mm/hash_native.c I intentionally didn't include those changes in the patch, so you can use the "bk" equivalents and it does the right thing for bitkeeper users. After the patch is applied, if you use revision control, don't forget to check in those 2 new files: include/asm-ppc64/plpar_wrappers.h arch/ppc64/kernel/prom_init.c The patch itself is too big for the lists and is available at: http://gate.crashing.org/~benh/monster-cleanup-3.diff (Let me know if you want it as a separate email). It should apply on a bk snapshot I did today. The description now: This is the third & hopefully final version of the monster cleanup patch. It does significant cleanups of the early boot code of the ppc64 kernel, and begins the long process of cleaning up & properly splitting the platform support.
It completely reworks the interface between the early code that is run in the firmware context (prom_init) and the rest of the kernel, in such a way that will make kexec or a static device-tree for embedded people possible. The early init code can eventually be moved to a separate link entity; it no longer touches any of the kernel globals; everything is passed via a single blob of data in memory containing a flattened version of the device-tree and a memory reserve map. While doing it, I also cut the ties between pSeries and PowerMac. Now, the kernel config provides a choice between legacy iSeries and "multiplatform". The latter is a set of various supported platforms, each of them being a boolean switch; currently defined are pSeries and PowerMac. You can enable both or just one of them. CONFIG_PPC_PSERIES is now specifically set for IBM pSeries support; you can build a PowerMac kernel without pSeries support if you wish. The main goal here is to simplify addition of new platform types. Regards, Ben.
From paulus at samba.org Fri Sep 24 01:39:44 2004 From: paulus at samba.org (Paul Mackerras) Date: Thu, 23 Sep 2004 08:39:44 -0700 Subject: 32-bit or 64-bit default applications? In-Reply-To: <1095892581.3050.9.camel@localhost.localdomain> References: <1095892581.3050.9.camel@localhost.localdomain> Message-ID: <16722.60992.597392.198738@cargo.ozlabs.ibm.com> Daniel Stekloff writes: > What should applications default to when built on a ppc64 system - 32 or > 64-bit? Do they default to 32-bit unless explicitly changed to build > 64-bit? Yes. The vast majority of programs don't need to be compiled as 64-bit, and come out a bit smaller when compiled as 32-bit (and many, but not all, run a little faster as 32-bit as well). Paul.
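Paul's size point comes down to the C data model: 32-bit userland is ILP32 (4-byte long and pointer) while 64-bit userland is LP64 (8-byte long and pointer), so pointer-heavy structures shrink when built 32-bit. A minimal host-side illustration (not ppc64-specific code) of checks that hold under either model:

```c
/* ILP32 vs LP64 illustration; these properties hold in both models. */

/* A pointer-heavy node: 8 bytes under ILP32, 16 under LP64. */
struct node { struct node *next; long value; };

/* True under both ILP32 and LP64: long and pointer share a width,
 * and int stays 4 bytes. */
int data_model_consistent(void)
{
    return sizeof(long) == sizeof(void *) && sizeof(int) == 4;
}

/* Size of struct node measured in pointer-widths: 2 either way,
 * which is exactly why the absolute size doubles under LP64. */
unsigned long node_size_in_words(void)
{
    return sizeof(struct node) / sizeof(void *);
}
```

The doubling of every pointer and long is where the "a bit smaller as 32-bit" effect comes from; code that never holds many pointers sees little difference.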
From linas at austin.ibm.com Fri Sep 24 02:25:32 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Thu, 23 Sep 2004 11:25:32 -0500 Subject: Hotplug: crash in sys_clone()/do_fork() In-Reply-To: <20040922201401.GA18954@austin.ibm.com> References: <20040922201401.GA18954@austin.ibm.com> Message-ID: <20040923162532.GC18954@austin.ibm.com> More info... On Wed, Sep 22, 2004 at 03:14:01PM -0500, Linas Vepstas was heard to remark: > > I'm tripping over a race condition involving file handling. > I can consistently crash here: > > TASK: c0000000fd6dc040[10448] 'ifdown' THREAD: c0000000f3310000 > Call Trace: Sep 22 14:29:21 marulp1 kernel: [c0000000f3313b10] [c0000000000506c4] .copy_files+0x400/0x414 (unreliable) > [c0000000f3313bd0] [c00000000005161c] .copy_process+0x660/0x12bc > [c0000000f3313ce0] [c000000000052318] .do_fork+0xa0/0x25c > [c0000000f3313dc0] [c0000000000159c8] .sys_clone+0x5c/0x74 > [c0000000f3313e30] [c000000000010a88] .ppc_clone+0x8/0xc > > The problem seems to be that one of the file pointers is briefly > set to (int32)-1 even on a 64-bit machine. The part of copy_process() > that gets mashed by this is: > > for (i = open_files; i != 0; i--) { > struct file *f = *old_fds++; > if (f) > get_file(f); <== derefs f, which is -1 > *new_fds++ = f; > } > > By inserting if(f==(void*)0xffffffffUL) printk ... > I can find out that i=230 is the one with the problem and open_files=256! > I haven't yet found who set struct file * to a -1. > > I'm generating this behaviour with a hotplug event that is causing ifdown > and ifup to run simultaneously. (The device driver was shut down and > restarted, causing simultaneous hotplug events). Although the above > stack shows ifdown getting clobbered, I've also seen pci.agent be > the process that suffers. > > The problem goes away if I insert a sleep of about half-a-second or more > between the device driver shutdown and startup. > > Affected machine is a ppc64 power4 box.
I've seen the problem > for a long time (months?), including Monday's bk clone of > bkbits of 2.6.9-rc2; waiting for this bug "to fix itself" doesn't > seem to be working. By adding appropriate printk's to copy_files(), it appears that many, many processes have a -1 in the highest non-zero fd pointer. The crash that I see seems to happen when there is another nearby fd that is open (nearby == within the same 8-bit bitmask), so that the "open_files" value is larger than or equal to the highest non-zero fd pointer (which is set to -1), thus leading to a crash. The "race" is that these high fd's normally close pretty quickly, and thus "open_files" usually has a much smaller value. --linas
From sonny at burdell.org Fri Sep 24 06:02:58 2004 From: sonny at burdell.org (Sonny Rao) Date: Thu, 23 Sep 2004 16:02:58 -0400 Subject: CONFIG_SPINLINE broken on 2.6.9-rc2 Message-ID: <20040923200258.GA20689@kevlar.burdell.org> Hi, I was trying to get a profile with inlined spinlocks on 2.6.9-rc2 and I noticed that my kernel build failed with several undefined references to spin_unlock_wait. After some grepping around, it looked like the prototype in include/asm-ppc64/spinlock.h for spin_unlock_wait might be incorrect and should be a preprocessor define (like the ones in all of the other architectures including ppc), and the kernel does build and run with this change. Is this correct?
Sonny
Here's the one-liner: Fix CONFIG_SPINLINE for ppc64 by adding a macro for spin_unlock_wait in include/asm-ppc64/spinlock.h --- linux-2.6.9-rc2/include/asm-ppc64/spinlock.h.original 2004-09-23 14:36:08.618964736 -0500 +++ linux-2.6.9-rc2/include/asm-ppc64/spinlock.h 2004-09-23 14:36:53.231971488 -0500 @@ -65,7 +65,7 @@ extern void __rw_yield(rwlock_t *lock); #define __rw_yield(x) barrier() #define SHARED_PROCESSOR 0 #endif -extern void spin_unlock_wait(spinlock_t *lock); +#define spin_unlock_wait(x) do { barrier(); } while(spin_is_locked(x)) /* * This returns the old value in the lock, so we succeeded
From linas at austin.ibm.com Fri Sep 24 07:19:14 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Thu, 23 Sep 2004 16:19:14 -0500 Subject: [PATCH] PPC64: xmon: add process-display command Message-ID: <20040923211914.GA22210@austin.ibm.com> Hi, The following patch adds process display to xmon. The current xmon does have a process display, but it has several problems: 1) it displays nothing if the logging level isn't set, and xmon doesn't currently support sysreq commands. 2) if it can be made to display, then it would display stacks too, which is waaaay too much if one has hundreds of processes. 3) it depends on code that uses spinlocks :( 4) it fails to validate any pointers it chases. The attached patch fixes all but 4): it prints a short-form process display with the little 'p' command, and a long form, with stacks, with the big 'P' command. Please forward and apply ...
Signed-off-by: Linas Vepstas --linas -------------- next part -------------- --- gidyap/arch/ppc64/xmon/xmon.c Mon Sep 13 19:23:15 2004 +++ edited/arch/ppc64/xmon/xmon.c Thu Sep 23 13:26:14 2004 @@ -193,6 +193,7 @@ Commands:\n\ mz zero a block of memory\n\ mi show information about memory allocation\n\ p show the task list\n\ + P show the task list and stacks\n\ r print registers\n\ s single step\n\ S print special registers\n\ @@ -881,6 +882,81 @@ static inline struct task_struct *younge return list_entry(p->sibling.next,struct task_struct,sibling); } +static void xmon_show_task(task_t * p, int prt_stacks) +{ + task_t *relative; + unsigned state; + unsigned long free = 0; + static const char *stat_nam[] = { "R", "S", "D", "T", "t", "Z", "X" }; + + printf("%-13.13s ", p->comm); + state = p->state ? __ffs(p->state) + 1 : 0; + if (state < ARRAY_SIZE(stat_nam)) + printf(stat_nam[state]); + else + printf("?"); +#if (BITS_PER_LONG == 32) + if (state == TASK_RUNNING) + printf(" running "); + else + printf(" %08lX ", thread_saved_pc(p)); +#else + if (state == TASK_RUNNING) + printf(" running task "); + else + printf(" %016lx ", thread_saved_pc(p)); +#endif +#ifdef CONFIG_DEBUG_STACK_USAGE + { + unsigned long * n = (unsigned long *) (p->thread_info+1); + while (!*n) + n++; + free = (unsigned long) n - (unsigned long)(p->thread_info+1); + } +#endif + printf("%5lu %5d %6d ", free, p->pid, p->parent->pid); + if ((relative = eldest_child(p))) + printf("%5d ", relative->pid); + else + printf(" "); + if ((relative = younger_sibling(p))) + printf("%7d", relative->pid); + else + printf(" "); + if ((relative = older_sibling(p))) + printf(" %5d", relative->pid); + else + printf(" "); + if (!p->mm) + printf(" (L-TLB)\n"); + else + printf(" (NOTLB)\n"); + + if (prt_stacks) + show_stack(p, NULL); +} + +static task_t *xmon_next_thread(const task_t *p) +{ + return pid_task(p->pids[PIDTYPE_TGID].pid_list.next, PIDTYPE_TGID); +} + +static void xmon_show_state(int prt_stacks) +{ + task_t 
*g, *p; +#if (BITS_PER_LONG == 32) + printf("\n" + " sibling\n"); + printf(" task PC pid father child younger older\n"); +#else + printf("\n" + " sibling\n"); + printf(" task PC pid father child younger older\n"); +#endif + do_each_thread(g, p) { + xmon_show_task(p, prt_stacks); + } while ((p = xmon_next_thread(p)) != g); +} /* Command interpreting routine */ static char *last_cmd; @@ -967,7 +1043,10 @@ cmds(struct pt_regs *excp) printf(help_string); break; case 'p': - show_state(); + xmon_show_state(0); + break; + case 'P': + xmon_show_state(1); break; case 'b': bpt_cmds();
From arnd at arndb.de Fri Sep 24 18:54:00 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 24 Sep 2004 10:54:00 +0200 Subject: [PATCH] PPC64: xmon: add process-display command In-Reply-To: <20040923211914.GA22210@austin.ibm.com> References: <20040923211914.GA22210@austin.ibm.com> Message-ID: <200409241054.03553.arnd@arndb.de> On Thursday, 23 September 2004 23:19, Linas Vepstas wrote: > + > +        if (prt_stacks) > +                show_stack(p, NULL); Can this give any useful data for running tasks? IIRC, you can only call this for sleeping tasks or for current. Arnd <>< -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040924/71ca1b82/attachment.pgp
From linas at austin.ibm.com Sat Sep 25 04:09:23 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Fri, 24 Sep 2004 13:09:23 -0500 Subject: [PATCH] PPC64: xmon: add process-display command In-Reply-To: <200409241054.03553.arnd@arndb.de> References: <20040923211914.GA22210@austin.ibm.com> <200409241054.03553.arnd@arndb.de> Message-ID: <20040924180923.GE22210@austin.ibm.com> On Fri, Sep 24, 2004 at 10:54:00AM +0200, Arnd Bergmann was heard to remark: > On Thursday, 23
September 2004 23:19, Linas Vepstas wrote: > > > + > > +        if (prt_stacks) > > +                show_stack(p, NULL); > > Can this give any useful data for running tasks? IIRC, you can only call > this for sleeping tasks or for current. Not sure I understand the question. All of the other CPUs are stopped while the system is in the debugger, and nothing is running except the debugger itself. Thus even the stack of a "running" process is frozen in time. It's useful to look at the stacks on other CPUs since they sometimes deadlock against each other. --linas
From sonny at burdell.org Sat Sep 25 05:48:26 2004 From: sonny at burdell.org (Sonny Rao) Date: Fri, 24 Sep 2004 15:48:26 -0400 Subject: CONFIG_SPINLINE broken on 2.6.9-rc2 In-Reply-To: <20040923200258.GA20689@kevlar.burdell.org> References: <20040923200258.GA20689@kevlar.burdell.org> Message-ID: <20040924194826.GA11806@kevlar.burdell.org> On Thu, Sep 23, 2004 at 04:02:58PM -0400, Sonny Rao wrote: > Hi, I was trying to get a profile with inlined spinlocks on 2.6.9-rc2 > and I noticed that my kernel build failed with several undefined > references to spin_unlock_wait. > > After some grepping around it looked like the prototype in > include/asm-ppc64/spinlock.h for spin_unlock_wait might be incorrect and > should be a preprocessor define (like the ones in all of the other > architectures including ppc), and the kernel does build and run with > this change. > > Is this correct? > > Sonny Apparently not quite correct. Though the kernel built and ran, the spinlocks didn't get inlined :-( If I have time, I'll try and track it down.
Sonny Rao -- LTC Kernel Performance From rsa at us.ibm.com Sat Sep 25 06:15:23 2004 From: rsa at us.ibm.com (Ryan Arnold) Date: Fri, 24 Sep 2004 15:15:23 -0500 Subject: 2.6.9-rc2+BK hvc console oops In-Reply-To: <20040918070612.GK2825@krispykreme> References: <20040918070612.GK2825@krispykreme> Message-ID: <1096056923.3355.81.camel@localhost.localdomain> On Sat, 2004-09-18 at 02:06, Anton Blanchard wrote: > Hi, > > Just got this oops on current -BK. > > Anton > > INIT: Sending processes the KILL signal > cpu 0x0: Vector: 300 (Data Access) at [c00000000f86f460] > pc: c0000000003fdb38: ._spin_lock_irqsave+0x30/0xbc > lr: c000000000221c18: .hvc_hangup+0x38/0xd4 Anton, I traded emails with Alan Cox on the LKML about some odd tty line discipline behavior exposed by this bug. Basically we found that there is a problem where late ldisc writes can follow a final tty close and another issue where a quick tty close followed by open->hangup allows hangup to be issued against a closed tty. My previous short-sighted patch simply set tty->driver_data = NULL but this has caused problems with these late calls oopsing when the functions expected driver_data to be a valid hvc_struct instance. This new patch should prevent late hangups and writes after final closes from doing anything nasty to the system if/when they occur. I now simply check the struct reference count and block writes and hangups when the hvc_struct ref count is 0 and I leave tty->driver_data alone in hvc_close(). Please test this patch out and verify that you can't reproduce and then I'll send it along to the LKML. Alan has both of these issues on his todo list, at which time we won't have to worry about these unexpected writes/hangups. Thanks for the patience, Ryan S. Arnold IBM Linux Technology Center -------------- next part -------------- A non-text attachment was scrubbed... 
Name: hvc_hangup_write_close.patch Type: text/x-patch Size: 1760 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040924/10622045/attachment.bin
From johnrose at austin.ibm.com Sat Sep 25 09:14:13 2004 From: johnrose at austin.ibm.com (John Rose) Date: Fri, 24 Sep 2004 18:14:13 -0500 Subject: when to mark reserved low memory pages Message-ID: <1096067653.28337.5.camel@sinatra.austin.ibm.com> We have a need to mark a certain range of low memory (alloc_bootmem) pages as reserved, similar to the way that kernel pages are marked reserved in mem_init(). The region in question is rtas_rmo_buf, if you're curious. So would mem_init() be the appropriate place to do this, and if so, should it also be done in the CONFIG_DISCONTIG case? Thanks- John Memory Mgmt. Ignoramus
From arnd at arndb.de Sat Sep 25 11:10:56 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Sat, 25 Sep 2004 03:10:56 +0200 Subject: [PATCH] PPC64: xmon: add process-display command In-Reply-To: <20040924180923.GE22210@austin.ibm.com> References: <20040923211914.GA22210@austin.ibm.com> <200409241054.03553.arnd@arndb.de> <20040924180923.GE22210@austin.ibm.com> Message-ID: <200409250311.00083.arnd@arndb.de> On Friday, 24 September 2004 20:09, Linas Vepstas wrote: > Not sure I understand the question. All of the other CPUs are stopped > while the system is in the debugger, and nothing is running except the > debugger itself. Thus even the stack of a "running" process is frozen > in time. > > It's useful to look at the stacks on other CPUs since they sometimes > deadlock against each other. No doubt that it's useful, but show_stack reads the stack pointer from p->thread.ksp, which contains the value of the stack pointer at the time when p was sleeping the last time.
Walking the stack down from there gives you a combination of the current call chain, the old call chain down from the ksp value, and some random local variables that happen to be in the place of old backchain pointers (are there backchain pointers on ppc?). AFAICS, what the code should do instead is something like: if (prt_stacks) { if (state != TASK_RUNNING) show_stack(p, NULL); else show_stack(p, get_stack_pointer_from_other_cpu( p->thread_info->cpu)); } I don't know if there is an easy way to implement get_stack_pointer_from_other_cpu() in xmon. Arnd <>< -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040925/8fd219d8/attachment.pgp
From benh at kernel.crashing.org Sat Sep 25 12:15:54 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 25 Sep 2004 12:15:54 +1000 Subject: when to mark reserved low memory pages In-Reply-To: <1096067653.28337.5.camel@sinatra.austin.ibm.com> References: <1096067653.28337.5.camel@sinatra.austin.ibm.com> Message-ID: <1096078554.475.8.camel@gaston> On Sat, 2004-09-25 at 09:14, John Rose wrote: > We have a need to mark a certain range of low memory (alloc_bootmem) > pages as reserved, similar to the way that kernel pages are marked > reserved in mem_init(). The region in question is rtas_rmo_buf, if > you're curious. > > So would mem_init() be the appropriate place to do this, and if so, > should it also be done in the CONFIG_DISCONTIG case? Why? Because they get freed when you mmap them and later quit the userland process? In this case, the option of marking them reserved might lead to an ever-increasing page count (or not); make very sure of that, as get_page() will increase the count but free_page() will not decrease it for a reserved page.
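In toy form, the imbalance I'm describing looks like this (a simplified model with illustrative names; the real kernel uses atomic counters and a page-flag bit):

```c
/* Toy model of the reserved-page refcount imbalance described above.
 * get_page() always bumps the count, but the free path skips reserved
 * pages, so a reserved page that is mapped and unmapped repeatedly
 * only ever counts up. */
struct toy_page { int count; int reserved; };

void toy_get_page(struct toy_page *p)
{
    p->count++;               /* taken on every mapping */
}

void toy_put_page(struct toy_page *p)
{
    if (!p->reserved)         /* free path ignores reserved pages */
        p->count--;
}
```

So a reserved page's count is monotonically increasing under map/unmap cycles, while a normal page returns to its starting count; that asymmetry is what needs checking before relying on the reserved bit.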
For similar cases, what I did rather than marking them reserved was to do a get_page() early so that the kernel always holds a reference. Ben.
From olh at suse.de Sun Sep 26 03:39:19 2004 From: olh at suse.de (Olaf Hering) Date: Sat, 25 Sep 2004 19:39:19 +0200 Subject: glibc fails in tst-timer4, __SI_TIMER needs its own case In-Reply-To: <1096078402.576.5.camel@gaston> References: <20040815215142.GA18611@suse.de> <1092805380.9536.178.camel@gaston> <20040818164436.GA26820@suse.de> <1096078402.576.5.camel@gaston> Message-ID: <20040925173919.GA25882@suse.de> On Sat, Sep 25, Benjamin Herrenschmidt wrote: > + err |= __put_user(s->si_value, &d->si_value); This doesn't work for some reason. If that value is 0x1234567800000000, the upper bits are ignored. That 0x12345678 is the &ev pointer in tst-timer4.c, passed to timer_create. ppc32_timer_create takes it as an int, see event.sigev_value.sival_int + err |= __put_user(s->si_int, &d->si_int); This patch seems to fix it for me; this is the debug output: clock_gettime returned timespec = { 1096132996, 951020000 } clock_getres returned timespec = { 0, 999848 } !!!
second timer_gettime timer_none returned it_interval 0.100984648 e54:/usr/src/packages/BUILD/glibc-2.3/rt$ dmesg copy_siginfo_to_user32(476) ld.so.1(3784):c1,j182037 si_ptr ffffe03000000000 s24 d20 d 0x00000000ffffd9f0 p 0x00000000ffffd9f0 00000021 00000000 fffffffe 02000001 00000000 ffffe030 00000000 00000000 00000021 00000000 0001fffe 00000000 02000001 00000000 ffffe030 00000000 00000000 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j182137 si_ptr 000000a300000000 s24 d20 d 0x00000000ffffd9f0 p 0x00000000ffffd9f0 00000022 00000000 fffffffe 03000002 00000000 000000a3 00000000 00000000 00000022 00000000 0001fffe 00000000 03000002 00000000 000000a3 00000000 00000000 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j183438 si_ptr ffffe03000000000 s24 d20 d 0x00000000ffffd9c0 p 0x00000000ffffd9c0 00000021 00000000 fffffffe 02000001 00000000 ffffe030 0ff90e5c ffffd9e0 00000021 00000000 0001fffe 00000000 02000001 00000000 ffffe030 00000000 00000005 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j183538 si_ptr 000000a300000000 s24 d20 d 0x00000000ffffd9c0 p 0x00000000ffffd9c0 00000022 00000000 fffffffe 03000002 00000000 000000a3 0ff90e5c ffffd9e0 00000022 00000000 0001fffe 00000000 03000002 00000000 000000a3 00000000 00000005 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j183839 si_ptr ffffe03000000000 s24 d20 d 0x00000000ffffd9c0 p 0x00000000ffffd9c0 00000021 00000000 fffffffe 02000001 00000000 ffffe030 0ff90e5c ffffd9e0 00000021 00000000 0001fffe 00000000 02000001 00000000 ffffe030 00000000 00000007 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j184039 si_ptr 000000a300000000 s24 d20 d 0x00000000ffffd9c0 p 0x00000000ffffd9c0 00000022 00000000 fffffffe 03000002 00000000 000000a3 0ff90e5c ffffd9e0 00000022 00000000 0001fffe 00000000 03000002 00000000 000000a3 00000000 00000007 00000000 00000000 00000000 copy_siginfo_to_user32(476) 
ld.so.1(3784):c1,j184240 si_ptr ffffe03000000000 s24 d20 d 0x00000000ffffd9c0 p 0x00000000ffffd9c0 00000021 00000000 fffffffe 02000001 00000000 ffffe030 0ff90e5c ffffd9e0 00000021 00000000 0001fffe 00000000 02000001 00000000 ffffe030 00000000 00000009 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j184540 si_ptr 000000a300000000 s24 d20 d 0x00000000ffffd9c0 p 0x00000000ffffd9c0 00000022 00000000 fffffffe 03000002 00000000 000000a3 0ff90e5c ffffd9e0 00000022 00000000 0001fffe 00000000 03000002 00000000 000000a3 00000000 00000009 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j184641 si_ptr ffffe03000000000 s24 d20 d 0x00000000ffffd9f0 p 0x00000000ffffd9f0 00000021 00000000 fffffffe 02000001 00000000 ffffe030 00000000 00000000 00000021 00000000 0001fffe 00000000 02000001 00000000 ffffe030 00000000 0000000b 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j185041 si_ptr 000000a300000000 s24 d20 d 0x00000000ffffd9f0 p 0x00000000ffffd9f0 00000022 00000000 fffffffe 03000002 00000000 000000a3 00000000 00000000 00000022 00000000 0001fffe 00000000 03000002 00000000 000000a3 00000000 0000000b 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j185042 si_ptr ffffe03000000000 s24 d20 d 0x00000000ffffd9f0 p 0x00000000ffffd9f0 00000021 00000000 fffffffe 02000001 00000000 ffffe030 00000000 00000000 00000021 00000000 0001fffe 00000000 02000001 00000000 ffffe030 00000000 0000000d 00000000 00000000 00000000 copy_siginfo_to_user32(476) ld.so.1(3784):c1,j185542 si_ptr 000000a300000000 s24 d20 d 0x00000000ffffd9f0 p 0x00000000ffffd9f0 00000022 00000000 fffffffe 03000002 00000000 000000a3 00000000 00000000 00000022 00000000 0001fffe 00000000 03000002 00000000 000000a3 00000000 0000000d 00000000 00000000 00000000 diff -purNX /tmp/kernel_exclude.txt ../linux-2.6.5.orig/Makefile ./Makefile --- ../linux-2.6.5.orig/Makefile 2004-09-25 13:38:59.000000000 +0000 +++ ./Makefile 2004-09-25 
14:49:12.660598245 +0000 @@ -753,10 +753,8 @@ _modinst_: @rm -rf $(MODLIB)/kernel @rm -f $(MODLIB)/source @mkdir -p $(MODLIB)/kernel - @ln -s $(srctree) $(MODLIB)/source @if [ ! $(objtree) -ef $(MODLIB)/build ]; then \ rm -f $(MODLIB)/build ; \ - ln -s $(objtree) $(MODLIB)/build ; \ fi $(Q)$(MAKE) -rR -f $(srctree)/scripts/Makefile.modinst diff -purNX /tmp/kernel_exclude.txt ../linux-2.6.5.orig/arch/ppc64/kernel/signal32.c ./arch/ppc64/kernel/signal32.c --- ../linux-2.6.5.orig/arch/ppc64/kernel/signal32.c 2004-09-25 13:38:38.000000000 +0000 +++ ./arch/ppc64/kernel/signal32.c 2004-09-25 17:15:43.164116677 +0000 @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -464,10 +465,25 @@ static long copy_siginfo_to_user32(compa &d->si_addr); break; case __SI_POLL >> 16: - case __SI_TIMER >> 16: err |= __put_user(s->si_band, &d->si_band); err |= __put_user(s->si_fd, &d->si_fd); break; + case __SI_TIMER >> 16: + { + int i; + unsigned char *p; + p=(unsigned char*)d; + olh("si_ptr %p s%u d%u d 0x%p p 0x%p",s->si_ptr,offsetof(siginfo_t,si_ptr),offsetof(compat_siginfo_t,si_ptr),d,p); + err |= __put_user(s->si_tid, &d->si_tid); + err |= __put_user(s->si_overrun, &d->si_overrun); + err |= __put_user(s->si_int, &d->si_int); + for(i=0;i<(8*sizeof(int));i=i+4)printk("%02x%02x%02x%02x ",p[i],p[i+1],p[i+2],p[i+3]); + printk("\n"); + p=(unsigned char*)s; + for(i=0;i<(12*sizeof(int));i=i+4)printk("%02x%02x%02x%02x ",p[i],p[i+1],p[i+2],p[i+3]); + printk("\n"); + break; + } case __SI_RT >> 16: /* This is not generated by the kernel as of now. 
*/ /* case __SI_MESGQ >> 16: */ err |= __put_user(s->si_int, &d->si_int); diff -purNX /tmp/kernel_exclude.txt ../linux-2.6.5.orig/include/asm-ppc64/ppc32.h ./include/asm-ppc64/ppc32.h --- ../linux-2.6.5.orig/include/asm-ppc64/ppc32.h 2004-04-04 03:36:57.000000000 +0000 +++ ./include/asm-ppc64/ppc32.h 2004-09-25 14:55:56.110531545 +0000 @@ -56,8 +56,10 @@ typedef struct compat_siginfo { /* POSIX.1b timers */ struct { - unsigned int _timer1; - unsigned int _timer2; + timer_t _tid; /* timer id */ + int _overrun; /* overrun count */ + compat_sigval_t _sigval; /* same as below */ + int _sys_private; /* not to be passed to user */ } _timer; /* POSIX.1b signals */ diff -purNX /tmp/kernel_exclude.txt ../linux-2.6.5.orig/include/linux/olh.h ./include/linux/olh.h --- ../linux-2.6.5.orig/include/linux/olh.h 1970-01-01 00:00:00.000000000 +0000 +++ ./include/linux/olh.h 2004-09-25 13:54:22.548576332 +0000 @@ -0,0 +1,8 @@ +#ifndef __LINUX_OLH_H +#define __LINUX_OLH_H +#define tolh(fmt) \ + printk(KERN_DEBUG "%s(%u) %s(%u):c%u,j%lu " fmt "\n",__FUNCTION__,__LINE__,current->comm,current->pid,smp_processor_id(),jiffies) +#define olh(fmt,args ...) \ + printk(KERN_DEBUG "%s(%u) %s(%u):c%u,j%lu " fmt "\n",__FUNCTION__,__LINE__,current->comm,current->pid,smp_processor_id(),jiffies,args) +#endif + -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG From sfr at canb.auug.org.au Mon Sep 27 15:28:36 2004 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Mon, 27 Sep 2004 15:28:36 +1000 Subject: [PATCH] PPC64: fix CONFIG check typo Message-ID: <20040927152836.624c503d.sfr@canb.auug.org.au> Hi Andrew, This should allow sys_rtas to work again on PPC64 pSeries. 
Signed-off-by: Stephen Rothwell -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff -ruN 2.6.9-rc2-bk12/arch/ppc64/kernel/misc.S 2.6.9-rc2-bk12.sfr.1/arch/ppc64/kernel/misc.S --- 2.6.9-rc2-bk12/arch/ppc64/kernel/misc.S 2004-09-27 12:10:57.000000000 +1000 +++ 2.6.9-rc2-bk12.sfr.1/arch/ppc64/kernel/misc.S 2004-09-27 15:11:39.000000000 +1000 @@ -687,7 +687,7 @@ ld r30,-16(r1) blr -#ifndef CONFIG_PPC_PSERIE /* hack hack hack */ +#ifndef CONFIG_PPC_PSERIES /* hack hack hack */ #define ppc_rtas sys_ni_syscall #endif -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040927/bee00397/attachment.pgp
From benh at kernel.crashing.org Mon Sep 27 19:12:32 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 27 Sep 2004 19:12:32 +1000 Subject: [PATCH] ppc64: Make the DART "iommu" code more generic Message-ID: <1096276351.1102.40.camel@gaston> Hi ! The "DART" iommu used on the Apple U3 chipset will appear in 3rd-party products soon, thus the code shouldn't be named "pmac_*" anymore. Also, the explicit config option is no longer needed; there is no reason to build a PowerMac kernel without the iommu support, and it can always be disabled at runtime with a cmdline option for testing or debugging. This patch does the appropriate renaming and makes the config option implicit & selected when pmac support is included.
(Patch looks big because of the file renaming) Signed-off-by: Benjamin Herrenschmidt diff -Nru a/arch/ppc64/Kconfig b/arch/ppc64/Kconfig --- a/arch/ppc64/Kconfig 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/Kconfig 2004-09-27 19:12:49 +10:00 @@ -78,6 +78,7 @@ bool " Apple G5 based machines" default y select ADB_PMU + select U3_DART config PPC bool @@ -109,16 +110,10 @@ processors, that is, which share physical processors between two or more partitions. -config PMAC_DART - bool "Enable DART/IOMMU on PowerMac (allow >2G of RAM)" - depends on PPC_PMAC - depends on EXPERIMENTAL +config U3_DART + bool + depends on PPC_MULTIPLATFORM default n - help - Enabling DART makes it possible to boot a PowerMac G5 with more - than 2GB of memory. Note that the code is very new and untested - at this time, so it has to be considered experimental. Enabling - this might result in data loss. config PPC_PMAC64 bool diff -Nru a/arch/ppc64/kernel/Makefile b/arch/ppc64/kernel/Makefile --- a/arch/ppc64/kernel/Makefile 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/kernel/Makefile 2004-09-27 19:12:49 +10:00 @@ -49,7 +49,7 @@ obj-$(CONFIG_PPC_PMAC) += pmac_setup.o pmac_feature.o pmac_pci.o \ pmac_time.o pmac_nvram.o pmac_low_i2c.o \ open_pic_u3.o -obj-$(CONFIG_PMAC_DART) += pmac_iommu.o +obj-$(CONFIG_U3_DART) += u3_iommu.o ifdef CONFIG_SMP obj-$(CONFIG_PPC_PMAC) += pmac_smp.o smp-tbsync.o diff -Nru a/arch/ppc64/kernel/pmac.h b/arch/ppc64/kernel/pmac.h --- a/arch/ppc64/kernel/pmac.h 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/kernel/pmac.h 2004-09-27 19:12:49 +10:00 @@ -29,6 +29,4 @@ extern void pmac_nvram_init(void); -extern void pmac_iommu_alloc(void); - #endif /* __PMAC_H__ */ diff -Nru a/arch/ppc64/kernel/pmac_iommu.c b/arch/ppc64/kernel/pmac_iommu.c --- a/arch/ppc64/kernel/pmac_iommu.c 2004-09-27 19:12:49 +10:00 +++ /dev/null Wed Dec 31 16:00:00 196900 @@ -1,321 +0,0 @@ -/* - * arch/ppc64/kernel/pmac_iommu.c - * - * Copyright (C) 2004 Olof Johansson , IBM Corporation - * - * Based on 
pSeries_iommu.c: - * Copyright (C) 2001 Mike Corrigan & Dave Engebretsen, IBM Corporation - * Copyright (C) 2004 Olof Johansson , IBM Corporation - * - * Dynamic DMA mapping support, PowerMac G5 (DART)-specific parts. - * - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License as published by - * the Free Software Foundation; either version 2 of the License, or - * (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * You should have received a copy of the GNU General Public License - * along with this program; if not, write to the Free Software - * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include "pci.h" - - -/* physical base of DART registers */ -#define DART_BASE 0xf8033000UL - -/* Offset from base to control register */ -#define DARTCNTL 0 -/* Offset from base to exception register */ -#define DARTEXCP 0x10 -/* Offset from base to TLB tag registers */ -#define DARTTAG 0x1000 - - -/* Control Register fields */ - -/* base address of table (pfn) */ -#define DARTCNTL_BASE_MASK 0xfffff -#define DARTCNTL_BASE_SHIFT 12 - -#define DARTCNTL_FLUSHTLB 0x400 -#define DARTCNTL_ENABLE 0x200 - -/* size of table in pages */ -#define DARTCNTL_SIZE_MASK 0x1ff -#define DARTCNTL_SIZE_SHIFT 0 - -/* DART table fields */ -#define DARTMAP_VALID 0x80000000 -#define DARTMAP_RPNMASK 0x00ffffff - -/* Physical base address and size of the DART table */ -unsigned long dart_tablebase; /* exported to htab_initialize */ -static unsigned long 
dart_tablesize; - -/* Virtual base address of the DART table */ -static u32 *dart_vbase; - -/* Mapped base address for the dart */ -static unsigned int *dart; - -/* Dummy val that entries are set to when unused */ -static unsigned int dart_emptyval; - -static struct iommu_table iommu_table_pmac; -static int dart_dirty; - -#define DBG(...) - -static inline void dart_tlb_invalidate_all(void) -{ - unsigned long l = 0; - unsigned int reg; - unsigned long limit; - - DBG("dart: flush\n"); - - /* To invalidate the DART, set the DARTCNTL_FLUSHTLB bit in the - * control register and wait for it to clear. - * - * Gotcha: Sometimes, the DART won't detect that the bit gets - * set. If so, clear it and set it again. - */ - - limit = 0; - -retry: - reg = in_be32((unsigned int *)dart+DARTCNTL); - reg |= DARTCNTL_FLUSHTLB; - out_be32((unsigned int *)dart+DARTCNTL, reg); - - l = 0; - while ((in_be32((unsigned int *)dart+DARTCNTL) & DARTCNTL_FLUSHTLB) && - l < (1L<it_base) + index; - - /* On pmac, all memory is contigous, so we can move this - * out of the loop. - */ - while (npages--) { - rpn = virt_to_abs(uaddr) >> PAGE_SHIFT; - - *(dp++) = DARTMAP_VALID | (rpn & DARTMAP_RPNMASK); - - rpn++; - uaddr += PAGE_SIZE; - } - - dart_dirty = 1; -} - - -static void dart_free_pmac(struct iommu_table *tbl, long index, long npages) -{ - unsigned int *dp; - - /* We don't worry about flushing the TLB cache. The only drawback of - * not doing it is that we won't catch buggy device drivers doing - * bad DMAs, but then no 32-bit architecture ever does either. 
- */ - - DBG("dart: free at: %lx, %lx\n", index, npages); - - dp = ((unsigned int *)tbl->it_base) + index; - - while (npages--) - *(dp++) = dart_emptyval; -} - - -static int dart_init(struct device_node *dart_node) -{ - unsigned int regword; - unsigned int i; - unsigned long tmp; - struct page *p; - - if (dart_tablebase == 0 || dart_tablesize == 0) { - printk(KERN_INFO "U3-DART: table not allocated, using direct DMA\n"); - return -ENODEV; - } - - /* Make sure nothing from the DART range remains in the CPU cache - * from a previous mapping that existed before the kernel took - * over - */ - flush_dcache_phys_range(dart_tablebase, dart_tablebase + dart_tablesize); - - /* Allocate a spare page to map all invalid DART pages. We need to do - * that to work around what looks like a problem with the HT bridge - * prefetching into invalid pages and corrupting data - */ - tmp = __get_free_pages(GFP_ATOMIC, 1); - if (tmp == 0) - panic("U3-DART: Cannot allocate spare page !"); - dart_emptyval = DARTMAP_VALID | - ((virt_to_abs(tmp) >> PAGE_SHIFT) & DARTMAP_RPNMASK); - - /* Map in DART registers. FIXME: Use device node to get base address */ - dart = ioremap(DART_BASE, 0x7000); - if (dart == NULL) - panic("U3-DART: Cannot map registers !"); - - /* Set initial control register contents: table base, - * table size and enable bit - */ - regword = DARTCNTL_ENABLE | - ((dart_tablebase >> PAGE_SHIFT) << DARTCNTL_BASE_SHIFT) | - (((dart_tablesize >> PAGE_SHIFT) & DARTCNTL_SIZE_MASK) - << DARTCNTL_SIZE_SHIFT); - p = virt_to_page(dart_tablebase); - dart_vbase = ioremap(virt_to_abs(dart_tablebase), dart_tablesize); - - /* Fill initial table */ - for (i = 0; i < dart_tablesize/4; i++) - dart_vbase[i] = dart_emptyval; - - /* Initialize DART with table base and enable it. 
*/ - out_be32((unsigned int *)dart, regword); - - /* Invalidate DART to get rid of possible stale TLBs */ - dart_tlb_invalidate_all(); - - iommu_table_pmac.it_busno = 0; - - /* Units of tce entries */ - iommu_table_pmac.it_offset = 0; - - /* Set the tce table size - measured in pages */ - iommu_table_pmac.it_size = dart_tablesize >> PAGE_SHIFT; - - /* Initialize the common IOMMU code */ - iommu_table_pmac.it_base = (unsigned long)dart_vbase; - iommu_table_pmac.it_index = 0; - iommu_table_pmac.it_blocksize = 1; - iommu_table_pmac.it_entrysize = sizeof(u32); - iommu_init_table(&iommu_table_pmac); - - /* Reserve the last page of the DART to avoid possible prefetch - * past the DART mapped area - */ - set_bit(iommu_table_pmac.it_mapsize - 1, iommu_table_pmac.it_map); - - printk(KERN_INFO "U3-DART IOMMU initialized\n"); - - return 0; -} - -void iommu_setup_pmac(void) -{ - struct pci_dev *dev = NULL; - struct device_node *dn; - - /* Find the DART in the device-tree */ - dn = of_find_compatible_node(NULL, "dart", "u3-dart"); - if (dn == NULL) - return; - - /* Setup low level TCE operations for the core IOMMU code */ - ppc_md.tce_build = dart_build_pmac; - ppc_md.tce_free = dart_free_pmac; - ppc_md.tce_flush = dart_flush; - - /* Initialize the DART HW */ - if (dart_init(dn)) - return; - - /* Setup pci_dma ops */ - pci_iommu_init(); - - /* We only have one iommu table on the mac for now, which makes - * things simple. Setup all PCI devices to point to this table - */ - while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { - /* We must use pci_device_to_OF_node() to make sure that - * we get the real "final" pointer to the device in the - * pci_dev sysdata and not the temporary PHB one - */ - struct device_node *dn = pci_device_to_OF_node(dev); - if (dn) - dn->iommu_table = &iommu_table_pmac; - } -} - -void __init pmac_iommu_alloc(void) -{ - /* Only reserve DART space if machine has more than 2GB of RAM - * or if requested with iommu=on on cmdline. 
- */ - if (lmb_end_of_DRAM() <= 0x80000000ull && - get_property(of_chosen, "linux,iommu-force-on", NULL) == NULL) - return; - - /* 512 pages (2MB) is max DART tablesize. */ - dart_tablesize = 1UL << 21; - /* 16MB (1 << 24) alignment. We allocate a full 16Mb chuck since we - * will blow up an entire large page anyway in the kernel mapping - */ - dart_tablebase = (unsigned long) - abs_to_virt(lmb_alloc_base(1UL<<24, 1UL<<24, 0x80000000L)); - - printk(KERN_INFO "U3-DART allocated at: %lx\n", dart_tablebase); -} diff -Nru a/arch/ppc64/kernel/pmac_pci.c b/arch/ppc64/kernel/pmac_pci.c --- a/arch/ppc64/kernel/pmac_pci.c 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/kernel/pmac_pci.c 2004-09-27 19:12:49 +10:00 @@ -664,9 +664,7 @@ pci_fix_bus_sysdata(); -#ifdef CONFIG_PMAC_DART - iommu_setup_pmac(); -#endif /* CONFIG_PMAC_DART */ + iommu_setup_u3(); } diff -Nru a/arch/ppc64/kernel/pmac_setup.c b/arch/ppc64/kernel/pmac_setup.c --- a/arch/ppc64/kernel/pmac_setup.c 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/kernel/pmac_setup.c 2004-09-27 19:12:49 +10:00 @@ -447,16 +447,6 @@ if (platform != PLATFORM_POWERMAC) return 0; -#ifdef CONFIG_PMAC_DART - /* - * On U3, the DART (iommu) must be allocated now since it - * has an impact on htab_initialize (due to the large page it - * occupies having to be broken up so the DART itself is not - * part of the cacheable linar mapping - */ - pmac_iommu_alloc(); -#endif /* CONFIG_PMAC_DART */ - return 1; } diff -Nru a/arch/ppc64/kernel/prom_init.c b/arch/ppc64/kernel/prom_init.c --- a/arch/ppc64/kernel/prom_init.c 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/kernel/prom_init.c 2004-09-27 19:12:49 +10:00 @@ -423,13 +423,6 @@ else if (!strncmp(opt, RELOC("force"), 5)) RELOC(iommu_force_on) = 1; } - -#ifndef CONFIG_PMAC_DART - if (RELOC(of_platform) == PLATFORM_POWERMAC) { - RELOC(ppc64_iommu_off) = 1; - prom_printf("DART disabled on PowerMac !\n"); - } -#endif } /* diff -Nru a/arch/ppc64/kernel/setup.c b/arch/ppc64/kernel/setup.c --- 
a/arch/ppc64/kernel/setup.c 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/kernel/setup.c 2004-09-27 19:12:49 +10:00 @@ -50,6 +50,7 @@ #include #include #include +#include #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) @@ -404,6 +405,16 @@ EARLY_DEBUG_INIT(); DBG("Found, Initializing memory management...\n"); + +#ifdef CONFIG_U3_DART + /* + * On U3, the DART (iommu) must be allocated now since it + * has an impact on htab_initialize (due to the large page it + * occupies having to be broken up so the DART itself is not + * part of the cacheable linar mapping + */ + alloc_u3_dart_table(); +#endif /* CONFIG_U3_DART */ /* * Initialize stab / SLB management diff -Nru a/arch/ppc64/kernel/u3_iommu.c b/arch/ppc64/kernel/u3_iommu.c --- /dev/null Wed Dec 31 16:00:00 196900 +++ b/arch/ppc64/kernel/u3_iommu.c 2004-09-27 19:12:49 +10:00 @@ -0,0 +1,321 @@ +/* + * arch/ppc64/kernel/u3_iommu.c + * + * Copyright (C) 2004 Olof Johansson , IBM Corporation + * + * Based on pSeries_iommu.c: + * Copyright (C) 2001 Mike Corrigan & Dave Engebretsen, IBM Corporation + * Copyright (C) 2004 Olof Johansson , IBM Corporation + * + * Dynamic DMA mapping support, Apple U3 & IBM CPC925 "DART" iommu. + * + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pci.h" + + +/* physical base of DART registers */ +#define DART_BASE 0xf8033000UL + +/* Offset from base to control register */ +#define DARTCNTL 0 +/* Offset from base to exception register */ +#define DARTEXCP 0x10 +/* Offset from base to TLB tag registers */ +#define DARTTAG 0x1000 + + +/* Control Register fields */ + +/* base address of table (pfn) */ +#define DARTCNTL_BASE_MASK 0xfffff +#define DARTCNTL_BASE_SHIFT 12 + +#define DARTCNTL_FLUSHTLB 0x400 +#define DARTCNTL_ENABLE 0x200 + +/* size of table in pages */ +#define DARTCNTL_SIZE_MASK 0x1ff +#define DARTCNTL_SIZE_SHIFT 0 + +/* DART table fields */ +#define DARTMAP_VALID 0x80000000 +#define DARTMAP_RPNMASK 0x00ffffff + +/* Physical base address and size of the DART table */ +unsigned long dart_tablebase; /* exported to htab_initialize */ +static unsigned long dart_tablesize; + +/* Virtual base address of the DART table */ +static u32 *dart_vbase; + +/* Mapped base address for the dart */ +static unsigned int *dart; + +/* Dummy val that entries are set to when unused */ +static unsigned int dart_emptyval; + +static struct iommu_table iommu_table_u3; +static int dart_dirty; + +#define DBG(...) + +static inline void dart_tlb_invalidate_all(void) +{ + unsigned long l = 0; + unsigned int reg; + unsigned long limit; + + DBG("dart: flush\n"); + + /* To invalidate the DART, set the DARTCNTL_FLUSHTLB bit in the + * control register and wait for it to clear. + * + * Gotcha: Sometimes, the DART won't detect that the bit gets + * set. If so, clear it and set it again. 
+ */ + + limit = 0; + +retry: + reg = in_be32((unsigned int *)dart+DARTCNTL); + reg |= DARTCNTL_FLUSHTLB; + out_be32((unsigned int *)dart+DARTCNTL, reg); + + l = 0; + while ((in_be32((unsigned int *)dart+DARTCNTL) & DARTCNTL_FLUSHTLB) && + l < (1L<it_base) + index; + + /* On U3, all memory is contigous, so we can move this + * out of the loop. + */ + while (npages--) { + rpn = virt_to_abs(uaddr) >> PAGE_SHIFT; + + *(dp++) = DARTMAP_VALID | (rpn & DARTMAP_RPNMASK); + + rpn++; + uaddr += PAGE_SIZE; + } + + dart_dirty = 1; +} + + +static void dart_free(struct iommu_table *tbl, long index, long npages) +{ + unsigned int *dp; + + /* We don't worry about flushing the TLB cache. The only drawback of + * not doing it is that we won't catch buggy device drivers doing + * bad DMAs, but then no 32-bit architecture ever does either. + */ + + DBG("dart: free at: %lx, %lx\n", index, npages); + + dp = ((unsigned int *)tbl->it_base) + index; + + while (npages--) + *(dp++) = dart_emptyval; +} + + +static int dart_init(struct device_node *dart_node) +{ + unsigned int regword; + unsigned int i; + unsigned long tmp; + struct page *p; + + if (dart_tablebase == 0 || dart_tablesize == 0) { + printk(KERN_INFO "U3-DART: table not allocated, using direct DMA\n"); + return -ENODEV; + } + + /* Make sure nothing from the DART range remains in the CPU cache + * from a previous mapping that existed before the kernel took + * over + */ + flush_dcache_phys_range(dart_tablebase, dart_tablebase + dart_tablesize); + + /* Allocate a spare page to map all invalid DART pages. We need to do + * that to work around what looks like a problem with the HT bridge + * prefetching into invalid pages and corrupting data + */ + tmp = __get_free_pages(GFP_ATOMIC, 1); + if (tmp == 0) + panic("U3-DART: Cannot allocate spare page !"); + dart_emptyval = DARTMAP_VALID | + ((virt_to_abs(tmp) >> PAGE_SHIFT) & DARTMAP_RPNMASK); + + /* Map in DART registers. 
FIXME: Use device node to get base address */ + dart = ioremap(DART_BASE, 0x7000); + if (dart == NULL) + panic("U3-DART: Cannot map registers !"); + + /* Set initial control register contents: table base, + * table size and enable bit + */ + regword = DARTCNTL_ENABLE | + ((dart_tablebase >> PAGE_SHIFT) << DARTCNTL_BASE_SHIFT) | + (((dart_tablesize >> PAGE_SHIFT) & DARTCNTL_SIZE_MASK) + << DARTCNTL_SIZE_SHIFT); + p = virt_to_page(dart_tablebase); + dart_vbase = ioremap(virt_to_abs(dart_tablebase), dart_tablesize); + + /* Fill initial table */ + for (i = 0; i < dart_tablesize/4; i++) + dart_vbase[i] = dart_emptyval; + + /* Initialize DART with table base and enable it. */ + out_be32((unsigned int *)dart, regword); + + /* Invalidate DART to get rid of possible stale TLBs */ + dart_tlb_invalidate_all(); + + iommu_table_u3.it_busno = 0; + + /* Units of tce entries */ + iommu_table_u3.it_offset = 0; + + /* Set the tce table size - measured in pages */ + iommu_table_u3.it_size = dart_tablesize >> PAGE_SHIFT; + + /* Initialize the common IOMMU code */ + iommu_table_u3.it_base = (unsigned long)dart_vbase; + iommu_table_u3.it_index = 0; + iommu_table_u3.it_blocksize = 1; + iommu_table_u3.it_entrysize = sizeof(u32); + iommu_init_table(&iommu_table_u3); + + /* Reserve the last page of the DART to avoid possible prefetch + * past the DART mapped area + */ + set_bit(iommu_table_u3.it_mapsize - 1, iommu_table_u3.it_map); + + printk(KERN_INFO "U3/CPC925 DART IOMMU initialized\n"); + + return 0; +} + +void iommu_setup_u3(void) +{ + struct pci_dev *dev = NULL; + struct device_node *dn; + + /* Find the DART in the device-tree */ + dn = of_find_compatible_node(NULL, "dart", "u3-dart"); + if (dn == NULL) + return; + + /* Setup low level TCE operations for the core IOMMU code */ + ppc_md.tce_build = dart_build; + ppc_md.tce_free = dart_free; + ppc_md.tce_flush = dart_flush; + + /* Initialize the DART HW */ + if (dart_init(dn)) + return; + + /* Setup pci_dma ops */ + pci_iommu_init(); + 
+ /* We only have one iommu table on the mac for now, which makes + * things simple. Setup all PCI devices to point to this table + */ + while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { + /* We must use pci_device_to_OF_node() to make sure that + * we get the real "final" pointer to the device in the + * pci_dev sysdata and not the temporary PHB one + */ + struct device_node *dn = pci_device_to_OF_node(dev); + if (dn) + dn->iommu_table = &iommu_table_u3; + } +} + +void __init alloc_u3_dart_table(void) +{ + /* Only reserve DART space if machine has more than 2GB of RAM + * or if requested with iommu=on on cmdline. + */ + if (lmb_end_of_DRAM() <= 0x80000000ull && + get_property(of_chosen, "linux,iommu-force-on", NULL) == NULL) + return; + + /* 512 pages (2MB) is max DART tablesize. */ + dart_tablesize = 1UL << 21; + /* 16MB (1 << 24) alignment. We allocate a full 16Mb chuck since we + * will blow up an entire large page anyway in the kernel mapping + */ + dart_tablebase = (unsigned long) + abs_to_virt(lmb_alloc_base(1UL<<24, 1UL<<24, 0x80000000L)); + + printk(KERN_INFO "U3-DART allocated at: %lx\n", dart_tablebase); +} diff -Nru a/arch/ppc64/mm/hash_utils.c b/arch/ppc64/mm/hash_utils.c --- a/arch/ppc64/mm/hash_utils.c 2004-09-27 19:12:49 +10:00 +++ b/arch/ppc64/mm/hash_utils.c 2004-09-27 19:12:49 +10:00 @@ -71,9 +71,9 @@ * */ -#ifdef CONFIG_PMAC_DART +#ifdef CONFIG_U3_DART extern unsigned long dart_tablebase; -#endif /* CONFIG_PMAC_DART */ +#endif /* CONFIG_U3_DART */ HTAB htab_data = {NULL, 0, 0, 0, 0}; @@ -203,7 +203,7 @@ DBG("creating mapping for region: %lx : %lx\n", base, size); -#ifdef CONFIG_PMAC_DART +#ifdef CONFIG_U3_DART /* Do not map the DART space. Fortunately, it will be aligned * in such a way that it will not cross two lmb regions and will * fit within a single 16Mb page. 
@@ -223,7 +223,7 @@ mode_rw, use_largepages); continue; } -#endif /* CONFIG_PMAC_DART */ +#endif /* CONFIG_U3_DART */ create_pte_mapping(base, base + size, mode_rw, use_largepages); } DBG(" <- htab_initialize()\n"); diff -Nru a/include/asm-ppc64/iommu.h b/include/asm-ppc64/iommu.h --- a/include/asm-ppc64/iommu.h 2004-09-27 19:12:49 +10:00 +++ b/include/asm-ppc64/iommu.h 2004-09-27 19:12:49 +10:00 @@ -108,7 +108,7 @@ /* Walks all buses and creates iommu tables */ extern void iommu_setup_pSeries(void); -extern void iommu_setup_pmac(void); +extern void iommu_setup_u3(void); /* Creates table for an individual device node */ extern void iommu_devnode_init(struct device_node *dn); @@ -154,6 +154,8 @@ extern void pci_iommu_init(void); extern void pci_dma_init_direct(void); + +extern void alloc_u3_dart_table(void); extern int ppc64_iommu_off; From johnrose at austin.ibm.com Tue Sep 28 01:11:57 2004 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 27 Sep 2004 10:11:57 -0500 Subject: when to mark reserved low memory pages In-Reply-To: <1096078554.475.8.camel@gaston> References: <1096067653.28337.5.camel@sinatra.austin.ibm.com> <1096078554.475.8.camel@gaston> Message-ID: <1096297917.18483.1.camel@sinatra.austin.ibm.com> On Fri, 2004-09-24 at 21:15, Benjamin Herrenschmidt wrote: > Why ? because they get freed when you mmap them and later quit the > userland process ? In this case, the option of marking them reserved > might lead to an ever increasing page count (or not), make very sure > of that, as get_page() will increase count but free_page() will not > decrease it for a reserved page. Ok retry :) If you were in a hypothetical situation where page count wasn't the issue, and you specifically needed the pages marked reserved, where would you do it? And how to handle the discontig case? 
Thanks- John From cchaney at us.ibm.com Tue Sep 28 01:18:25 2004 From: cchaney at us.ibm.com (Craig Chaney) Date: Mon, 27 Sep 2004 11:18:25 -0400 Subject: [PATCH] MSR_RI not cleared early enough in entry.S Message-ID: <20040927151825.GA1004@sage.ibm.com> Hi, This patch fixes a small hole in entry.S. In the section of code under the label syscall_exit_trace_cont, the kernel is reverting to its previous context. The kernel stack pointer is updated, MSR_RI is cleared, and then the rest of the context is restored leading up to the rfid instruction. An exception between the update of the kernel stack pointer and the clearing of MSR_RI can cause a problem. If r1 has been updated to point to userspace, this will trigger an error condition at the top of EXCEPTION_PROLOG_COMMON, and we get the "Bad kernel stack pointer" error. If I understand the use of MSR_RI correctly, we should delay the update of the kernel stack pointer until after the clearing of MSR_RI. I'm new to this, so please let me know if I've made any mistakes (not only in the patch itself of course, but also in the conventions of submitting a patch). Is submitting the patch here sufficient for it to make it upstream? Thanks, Craig Signed-off-by: Craig Chaney From cchaney at us.ibm.com Tue Sep 28 02:06:42 2004 From: cchaney at us.ibm.com (Craig Chaney) Date: Mon, 27 Sep 2004 12:06:42 -0400 Subject: [PATCH] MSR_RI not cleared early enough in entry.S In-Reply-To: <20040927151825.GA1004@sage.ibm.com> References: <20040927151825.GA1004@sage.ibm.com> Message-ID: <20040927160642.GA1715@sage.ibm.com> oops. Forgot to attach the patch. Thanks, Craig On Mon, Sep 27, 2004 at 11:18:25AM -0400, Craig Chaney wrote: > Hi, > > This patch fixes a small hole in entry.S. In the section of code under the > label syscall_exit_trace_cont, the kernel is reverting to its previous > context. The kernel stack pointer is updated, MSR_RI is cleared, and then the > rest of the context is restored leading up to the rfid instruction. 
> > An exception between the update of the kernel stack pointer and the clearing > of MSR_RI can cause a problem. If r1 has been updated to point to userspace, > this will trigger an error condition at the top of EXCEPTION_PROLOG_COMMON, > and we get the "Bad kernel stack pointer" error. > > If I understand the use of MSR_RI correctly, we should delay the update of the > kernel stack pointer until after the clearing of MSR_RI. > > I'm new to this, so please let me know if I've made any mistakes (not only in > the patch itself of course, but also in the conventions of submitting a patch). > Is submitting the patch here sufficient for it to make it upstream? > > Thanks, > Craig > > Signed-off-by: Craig Chaney > > _______________________________________________ > Linuxppc64-dev mailing list > Linuxppc64-dev at ozlabs.org > https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev -------------- next part -------------- diff -Naur clean/arch/ppc64/kernel/entry.S edited/arch/ppc64/kernel/entry.S --- clean/arch/ppc64/kernel/entry.S 2004-09-26 14:24:27.000000000 +0000 +++ edited/arch/ppc64/kernel/entry.S 2004-09-27 14:36:29.221308744 +0000 @@ -185,10 +185,10 @@ beq- 1f /* only restore r13 if */ ld r13,GPR13(r1) /* returning to usermode */ 1: ld r2,GPR2(r1) - ld r1,GPR1(r1) li r12,MSR_RI andc r10,r10,r12 mtmsrd r10,1 /* clear MSR.RI */ + ld r1,GPR1(r1) mtlr r4 mtcr r5 mtspr SRR0,r7 From jonmasters at gmail.com Tue Sep 28 02:07:16 2004 From: jonmasters at gmail.com (Jon Masters) Date: Mon, 27 Sep 2004 17:07:16 +0100 Subject: [PATCH] MSR_RI not cleared early enough in entry.S In-Reply-To: <20040927151825.GA1004@sage.ibm.com> References: <20040927151825.GA1004@sage.ibm.com> Message-ID: <35fb2e5904092709075f06af9@mail.gmail.com> On Mon, 27 Sep 2004 11:18:25 -0400, Craig Chaney wrote: > I'm new to this, so please let me know if I've made any mistakes (not only in > the patch itself of course, but also in the conventions of submitting a patch). 
> Is submitting the patch here sufficient for it to make it upstream? This might be due to my reading high volume lists with gmail but I cannot see any patch. Jon. From igor at cs.wisc.edu Tue Sep 28 05:47:15 2004 From: igor at cs.wisc.edu (Igor Grobman) Date: Mon, 27 Sep 2004 14:47:15 -0500 (CDT) Subject: mapping memory in 0xb space Message-ID: I would like to be able to remap memory into 0xb space with 2.4.21 kernel. I have code that works fine on power3 box, but fails miserably on my dual power4+ box. I am using btmalloc() from pmc.c with a modified range. Normal btmalloc() allocation works fine, but if I change the range to start with, e.g. 0xb00001FFFFF00000 (instead of 0xb00..), the kernel crashes when accessing the allocated page. For those of you unfamiliar with btmalloc() code, it finds a physical page using get_free_pages(), subtracts PAGE_OFFSET (i.e. 0xc00...) to form a physical address, then inserts the page into linux page tables, using the VSID calculated with get_kernel_vsid() and the physical address calculated above. It also inserts an HPTE into the hardware page table. It doesn't do anything with regards to allocating segments. I understand the segments should be faulted in. I looked at the code in head.S, and it appears that do_stab_bolted should be doing this. Yet, I am missing something, because the btmalloc() code does not in fact work for pages in the range I specified above. I am running this on 7029-6E3 (p615?) machine with two power4+ processors. I am using the kernel from Suse's 2.4.21-215 source package. Any ideas and attempts to un-confuse me are welcome. 
Thanks, Igor From benh at kernel.crashing.org Tue Sep 28 08:27:09 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 28 Sep 2004 08:27:09 +1000 Subject: when to mark reserved low memory pages In-Reply-To: <1096297917.18483.1.camel@sinatra.austin.ibm.com> References: <1096067653.28337.5.camel@sinatra.austin.ibm.com> <1096078554.475.8.camel@gaston> <1096297917.18483.1.camel@sinatra.austin.ibm.com> Message-ID: <1096324027.1052.46.camel@gaston> On Tue, 2004-09-28 at 01:11, John Rose wrote: > On Fri, 2004-09-24 at 21:15, Benjamin Herrenschmidt wrote: > > Why ? because they get freed when you mmap them and later quit the > > userland process ? In this case, the option of marking them reserved > > might lead to an ever increasing page count (or not), make very sure > > of that, as get_page() will increase count but free_page() will not > > decrease it for a reserved page. > > Ok retry :) If you were in a hypothetical situation where page count > wasn't the issue, and you specifically needed the pages marked reserved, > where would you do it? Some time right after mem_init() I suppose... > And how to handle the discontig case? Not too sure Ben. From johnrose at austin.ibm.com Tue Sep 28 08:35:44 2004 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 27 Sep 2004 17:35:44 -0500 Subject: when to mark reserved low memory pages In-Reply-To: <1096324476.25805.1.camel@sinatra.austin.ibm.com> References: <1096067653.28337.5.camel@sinatra.austin.ibm.com> <1096078554.475.8.camel@gaston> <1096297917.18483.1.camel@sinatra.austin.ibm.com> <1096324027.1052.46.camel@gaston> <1096324476.25805.1.camel@sinatra.austin.ibm.com> Message-ID: <1096324544.25805.3.camel@sinatra.austin.ibm.com> > I haven't had much > luck getting responses from Paul or Anton lately :( > Because they're busy/travelling/etc. 
Wow that sounded rude, my apologies, I know you guys are busy :) Thanks- John From johnrose at austin.ibm.com Tue Sep 28 08:34:37 2004 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 27 Sep 2004 17:34:37 -0500 Subject: when to mark reserved low memory pages In-Reply-To: <1096324027.1052.46.camel@gaston> References: <1096067653.28337.5.camel@sinatra.austin.ibm.com> <1096078554.475.8.camel@gaston> <1096297917.18483.1.camel@sinatra.austin.ibm.com> <1096324027.1052.46.camel@gaston> Message-ID: <1096324476.25805.1.camel@sinatra.austin.ibm.com> Hi Linda- You might consider posting your patch to the external list, in response to ben's note here, like you did in the bug text. I haven't had much luck getting responses from Paul or Anton lately :( Just a thought.. Thanks- John On Mon, 2004-09-27 at 17:27, Benjamin Herrenschmidt wrote: > On Tue, 2004-09-28 at 01:11, John Rose wrote: > > On Fri, 2004-09-24 at 21:15, Benjamin Herrenschmidt wrote: > > > Why ? because they get freed when you mmap them and later quit the > > > userland process ? In this case, the option of marking them reserved > > > might lead to an ever increasing page count (or not), make very sure > > > of that, as get_page() will increase count but free_page() will not > > > decrease it for a reserved page. > > > > Ok retry :) If you were in a hypothetical situation where page count > > wasn't the issue, and you specifically needed the pages marked reserved, > > where would you do it? > > Some time right after mem_init() I suppose... > > > And how to handle the discontig case? > > Not too sure > > Ben. > > > From david at gibson.dropbear.id.au Tue Sep 28 10:44:54 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Tue, 28 Sep 2004 10:44:54 +1000 Subject: mapping memory in 0xb space In-Reply-To: References: Message-ID: <20040928004454.GD8889@zax> On Mon, Sep 27, 2004 at 02:47:15PM -0500, Igor Grobman wrote: > I would like to be able to remap memory into 0xb space with 2.4.21 kernel. 
> I have code that works fine on power3 box, but fails miserably on my dual > power4+ box. I am using btmalloc() from pmc.c with a modified range. > Normal btmalloc() allocation works fine, but if I change the range to > start with, e.g. 0xb00001FFFFF00000 (instead of 0xb00..), the kernel > crashes when accessing the allocated page. > > For those of you unfamiliar with btmalloc() code, it finds a physical page > using get_free_pages(), subtracts PAGE_OFFSET (i.e. 0xc00...) to form a > physical address, then inserts the page into linux page tables, using the > VSID calculated with get_kernel_vsid() and the physical address > calculated above. It also inserts an HPTE into the hardware page table. > It doesn't do anything with regards to allocating segments. I understand > the segments should be faulted in. I looked at the code in head.S, and it > appears that do_stab_bolted should be doing this. Yet, I am missing something, > because the btmalloc() code does not in fact work for pages in the range I > specified above. > > I am running this on 7029-6E3 (p615?) machine with two power4+ processors. > I am using the kernel from Suse's 2.4.21-215 source package. > > Any ideas and attempts to un-confuse me are welcome. Err... this seems likely to cause trouble. Recent kernels don't even have VSIDs allocated for the 0xb... region. 2.4.21 will have some kernel VSIDs there, but there will probably be a bunch of places that assume addresses below 0xc... are user addresses and compute the vsids accordingly. Differences in the code for segment table and SLB machines are probably why it works (or seems to) on power3 but not power4. Also, unless you've created a new set, there are no Linux page tables for the 0xb region - the only linux page tables for the kernel are for the vmalloc (0xd) and ioremap (0xe) regions. Why on earth do you want to do this? -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. 
http://www.ozlabs.org/people/dgibson From paulus at samba.org Tue Sep 28 15:44:05 2004 From: paulus at samba.org (Paul Mackerras) Date: Tue, 28 Sep 2004 15:44:05 +1000 Subject: [PATCH] MSR_RI not cleared early enough in entry.S In-Reply-To: <20040927151825.GA1004@sage.ibm.com> References: <20040927151825.GA1004@sage.ibm.com> Message-ID: <16728.64037.205638.199425@cargo.ozlabs.ibm.com> Craig Chaney writes: > An exception between the update of the kernel stack pointer and the clearing > of MSR_RI can cause a problem. If r1 has been updated to point to userspace, > this will trigger an error condition at the top of EXCEPTION_PROLOG_COMMON, > and we get the "Bad kernel stack pointer" error. You are correct, but in fact we will get the same bad kernel stack pointer error whether RI is set or clear. What exception do you think we might get? Once we have successfully loaded r1 from the stack there should be no possibility of getting any exceptions until after the rfid, except for a machine check. If we get a machine check in these circumstances we're toast but that should be extremely unlikely. I think it is slightly more correct to clear RI before loading r1, but it won't make any difference to the actual outcome in any scenario that I can see. Paul. From lxie at us.ibm.com Tue Sep 28 15:50:27 2004 From: lxie at us.ibm.com (Linda Xie) Date: Tue, 28 Sep 2004 00:50:27 -0500 Subject: when to mark reserved low memory pages In-Reply-To: <1096324476.25805.1.camel@sinatra.austin.ibm.com> Message-ID: Below is an experimental patch I made to mark the RTAS pages as PG_reserved. Any comments/suggestions/corrections appreciated.
Thanks, Linda diff -purN linux-2.6.8/arch/ppc64/kernel/prom.c linux-2.6.8-linda/arch/ppc64/kernel/prom.c --- linux-2.6.8/arch/ppc64/kernel/prom.c 1970-05-10 00:19:10.198944464 -0400 +++ linux-2.6.8-linda/arch/ppc64/kernel/prom.c 1970-05-10 00:20:02.905919168 -0400 @@ -15,7 +15,7 @@ * 2 of the License, or (at your option) any later version. */ -#undef DEBUG_PROM +#define DEBUG_PROM #include #include @@ -646,6 +646,7 @@ static void __init prom_initialize_lmb(v static void __init prom_instantiate_rtas(void) { + unsigned long addr; unsigned long offset = reloc_offset(); struct prom_t *_prom = PTRRELOC(&prom); struct rtas_t *_rtas = PTRRELOC(&rtas); @@ -699,6 +700,15 @@ prom_instantiate_rtas(void) RELOC(rtas_rmo_buf) = lmb_alloc_base(RTAS_RMOBUF_MAX, PAGE_SIZE, rtas_region); + prom_printf("%s: Mark rtas_rmo_buf as PG_reserved, b=0x%x, e=0x%x\n", + __FUNCTION__, rtas_rmo_buf, rtas_rmo_buf + RTAS_RMOBUF_MAX); + for (addr = (unsigned long)__va(rtas_rmo_buf); + addr < PAGE_ALIGN((unsigned long)__va(rtas_rmo_buf) + RTAS_RMOBUF_MAX); + addr += PAGE_SIZE) { + SetPageReserved(virt_to_page(addr)); + } + prom_printf("%s: Mark rtas_rmo_buf as PG_reserved, b=0x%x, e=0x%x done!\n", + __FUNCTION__, rtas_rmo_buf, rtas_rmo_buf + RTAS_RMOBUF_MAX); } if (_rtas->entry <= 0) { John Rose ibm.com> cc: External List , Linda Xie/Austin/IBM at IBMUS Subject: Re: when to mark reserved low memory pages 09/27/2004 05:34 PM Hi Linda- You might consider posting your patch to the external list, in response to ben's note here, like you did in the bug text. I haven't had much luck getting responses from Paul or Anton lately :( Just a thought.. Thanks- John On Mon, 2004-09-27 at 17:27, Benjamin Herrenschmidt wrote: > On Tue, 2004-09-28 at 01:11, John Rose wrote: > > On Fri, 2004-09-24 at 21:15, Benjamin Herrenschmidt wrote: > > > Why ? because they get freed when you mmap them and later quit the > > > userland process ? 
In this case, the option of marking them reserved > > > might lead to an ever increasing page count (or not), make very sure > > > of that, as get_page() will increase count but free_page() will not > > > decrease it for a reserved page. > > > > Ok retry :) If you were in a hypothetical situation where page count > > wasn't the issue, and you specifically needed the pages marked reserved, > > where would you do it? > > Some time right after mem_init() I suppose... > > > And how to handle the discontig case? > > Not too sure > > Ben. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040928/76d22c50/attachment.htm From benh at kernel.crashing.org Tue Sep 28 16:41:02 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 28 Sep 2004 16:41:02 +1000 Subject: G5 and IO accesses Message-ID: <1096353661.22555.24.camel@gaston> Hi ! I've heard in the past that people had trouble getting PCI cards with IO ports (inX/outX) to work on G5s... looking at the setup code today, it looks like I had a bogus base address when setting it up, can somebody with such a card confirm that it works with this patch ?
===== arch/ppc64/kernel/pmac_pci.c 1.10 vs edited ===== --- 1.10/arch/ppc64/kernel/pmac_pci.c 2004-09-28 11:47:33 +10:00 +++ edited/arch/ppc64/kernel/pmac_pci.c 2004-09-28 16:41:24 +10:00 @@ -419,7 +419,7 @@ * properties or figuring out the U3 address space decoding logic and * then read it's configuration register (if any). */ - hose->io_base_phys = 0xf4000000 + 0x00400000; + hose->io_base_phys = 0xf4000000; hose->io_base_virt = ioremap(hose->io_base_phys, 0x00400000); isa_io_base = pci_io_base = (unsigned long) hose->io_base_virt; hose->io_resource.name = np->full_name; From cchaney at us.ibm.com Tue Sep 28 23:04:31 2004 From: cchaney at us.ibm.com (Craig Chaney) Date: Tue, 28 Sep 2004 09:04:31 -0400 Subject: [PATCH] MSR_RI not cleared early enough in entry.S In-Reply-To: <16728.64037.205638.199425@cargo.ozlabs.ibm.com> References: <20040927151825.GA1004@sage.ibm.com> <16728.64037.205638.199425@cargo.ozlabs.ibm.com> Message-ID: <20040928130431.GB7029@sage.raleigh.ibm.com> On Tue, Sep 28, 2004 at 03:44:05PM +1000, Paul Mackerras wrote: > What exception do you think we might get? Once we have successfully > loaded r1 from the stack there should be no possibility of getting any > exceptions until after the rfid, except for a machine check. If we > get a machine check in these circumstances we're toast but that should > be extremely unlikely. > > I think it is slightly more correct to clear RI before loading r1, but > it won't make any difference to the actual outcome in any scenario > that I can see. You're right, the situation that I encountered this in was artificial. I'm working with an instruction tracing tool that records execution paths through the kernel. This tool looks for mtmsr instructions that clear MSR_RI; when it finds one, it temporarily suspends single stepping. I was getting a single step exception after r1 was updated but before encountering the clear of MSR_RI. 
Thanks, Craig From johnrose at austin.ibm.com Wed Sep 29 02:15:11 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 28 Sep 2004 11:15:11 -0500 Subject: [PATCH] [resend] create pcibios_remove_root_bus() Message-ID: <1096388111.27834.14.camel@sinatra.austin.ibm.com> Hi Paul- Here's an updated version of this patch after Ben's changes. Please apply if there are no objections. The following patch creates pcibios_remove_root_bus(), which performs the ppc64-specific actions for removal of PCI Host Bridges. This call is invoked by the RPA DLPAR driver upon PHB removal. Thanks for your help- John diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Tue Sep 28 11:10:41 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Tue Sep 28 11:10:41 2004 @@ -472,6 +472,54 @@ remap_bus_range(hose->bus); } + +/* RPA-specific bits for removing PHBs */ +int pcibios_remove_root_bus(struct pci_controller *phb) +{ + struct pci_bus *b = phb->bus; + struct resource *res; + int rc, i; + + res = b->resource[0]; + if (!res->flags) { + printk(KERN_ERR "%s: no IO resource for PHB %s\n", __FUNCTION__, + b->name); + return 1; + } + + rc = unmap_bus_range(b); + if (rc) { + printk(KERN_ERR "%s: failed to unmap IO on bus %s\n", + __FUNCTION__, b->name); + return 1; + } + + if (release_resource(res)) { + printk(KERN_ERR "%s: failed to release IO on bus %s\n", + __FUNCTION__, b->name); + return 1; + } + + for (i = 1; i < 3; ++i) { + res = b->resource[i]; + if (!res->flags && i == 0) { + printk(KERN_ERR "%s: no MEM resource for PHB %s\n", + __FUNCTION__, b->name); + return 1; + } + if (res->flags && release_resource(res)) { + printk(KERN_ERR + "%s: failed to release IO %d on bus %s\n", + __FUNCTION__, i, b->name); + return 1; + } + } + + list_del(&phb->list_node); + return 0; +} +EXPORT_SYMBOL(pcibios_remove_root_bus); + static void __init pSeries_request_regions(void) { struct device_node *i8042; diff -Nru a/include/asm-ppc64/pci-bridge.h 
b/include/asm-ppc64/pci-bridge.h --- a/include/asm-ppc64/pci-bridge.h Tue Sep 28 11:10:41 2004 +++ b/include/asm-ppc64/pci-bridge.h Tue Sep 28 11:10:41 2004 @@ -90,6 +90,8 @@ extern void pci_process_bridge_OF_ranges(struct pci_controller *hose, struct device_node *dev, int primary); +extern int pcibios_remove_root_bus(struct pci_controller *phb); + /* Use this macro after the PCI bus walk for max performance when it * is known that sysdata is correct. */ From igor at cs.wisc.edu Wed Sep 29 04:52:16 2004 From: igor at cs.wisc.edu (Igor Grobman) Date: Tue, 28 Sep 2004 13:52:16 -0500 (CDT) Subject: mapping memory in 0xb space In-Reply-To: <20040928004454.GD8889@zax> References: <20040928004454.GD8889@zax> Message-ID: On Tue, 28 Sep 2004, David Gibson wrote: > > Err... this seems likely to cause trouble. So it has :-) > Recent kernels don't even > have VSIDs allocated for the 0xb... region. Looking at both 2.6.8 and 2.4.21, I don't see a difference in get_kernel_vsid() code. > 2.4.21 will have some > kernel VSIDs there, but there will probably be a bunch of places that > assume addresses below 0xc... are user addresses and compute the vsids > accordingly. Note that since I put in a bolted entry into the HPT, I assume I don't have to worry about page fault handling code. This leaves segments. Both DataAccess_common and DataAccessSLB_common call do_stab_bolted/do_slb_bolted when confronted with an address in 0xb region. Presumably, this will fault in the segments I am interested in. > Differences in the code for segment table and SLB > machines are probably why it works (or seems to) on power3 but not > power4. SLB machines are LPAR machines, correct? In that case, neither the power3 nor the power4 I am using are SLB machines. Also, I narrowed it down to working (or appearing to work) as long as the highest 5 bits of the page index (those that end up as partial index in the HPTE) are zero. This may just be a weird coincidence. 
> > Also, unless you've created a new set, there are no Linux page tables > for the 0xb region - the only linux page tables for the kernel are for > the vmalloc (0xd) and ioremap (0xe) regions. I am building page tables off the bolted_dir, which does exist in 2.4.21 for the 0xb region. But again, assuming I insert a bolted entry in HPT, do I even need to worry about linux page tables as such? > > Why on earth do you want to do this? Good question ;-). A long long time ago, I posted on this list and explained. Since then, I found what appeared to be a solution, except that it appears power4 breaks it. I am building a tool that allows dynamic splicing of code into a running kernel (see http://www.paradyn.org/html/kerninst.html). In order for this to work, I need to be able to overwrite a single instruction with a jump to spliced-in code. The target of the jump needs to be within the range (26 bits). Therefore, I have a choice of 0xbfff.. addresses with backward jumps from 0xc region, or the 0xff.. addresses for absolute jumps. I chose 0xbff.., because I found already-working code, originally written for the performance counter interface. Am I making more sense now? Thanks, Igor > On Mon, Sep 27, 2004 at 02:47:15PM -0500, Igor Grobman wrote: > > I would like to be able to remap memory into 0xb space with 2.4.21 kernel. > > I have code that works fine on power3 box, but fails miserably on my dual > > power4+ box. I am using btmalloc() from pmc.c with a modified range. > > Normal btmalloc() allocation works fine, but if I change the range to > > start with, e.g. 0xb00001FFFFF00000 (instead of 0xb00..), the kernel > > crashes when accessing the allocated page. > > > > For those of you unfamiliar with btmalloc() code, it finds a physical page > > using get_free_pages(), subtracts PAGE_OFFSET (i.e. 0xc00...) 
to form a > > physical address, then inserts the page into linux page tables, using the > > VSID calculated with get_kernel_vsid() and the physical address > > calculated above. It also inserts an HPTE into the hardware page table. > > It doesn't do anything with regards to allocating segments. I understand > > the segments should be faulted in. I looked at the code in head.S, and it > > appears that do_stab_bolted should be doing this. Yet, I am missing something, > > because the btmalloc() code does not in fact work for pages in the range I > > specified above. > > > > I am running this on 7029-6E3 (p615?) machine with two power4+ processors. > > I am using the kernel from Suse's 2.4.21-215 source package. > > > > Any ideas and attempts to un-confuse me are welcome. From johnrose at austin.ibm.com Wed Sep 29 06:25:40 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 28 Sep 2004 15:25:40 -0500 Subject: [PATCH] [resend] RPA dynamic addition/removal of PCI Host Bridges Message-ID: <1096403140.27834.39.camel@sinatra.austin.ibm.com> Hi Paul- Here's an updated version of the patch. This contains a couple of bug fixes and accounts for Benh's monster changes :) The following patch implements the ppc64-specific bits for dynamic (DLPAR) addition of PCI Host Bridges. The entry point for this operation is init_phb_dynamic(), which will be called by the RPA DLPAR driver. Among the implementation details, the global number aka PCI domain for the newly added PHB is assigned using the same simple counter that assigns it at boot. This has two consequences. First, the PCI domain associated with a PHB will not persist across DLPAR remove and subsequent add. Second, stress tests that repeatedly add/remove PHBs might generate some large values for PCI domain. If we decide at a later point to hash an OF property to PCI domain value, this can be easily fixed up. Also, the linux,pci-domain property is not generated for the newly added PHBs at the moment.
Because there doesn't seem to be an easy way to dynamically add single properties to the OFDT, and because the userspace dependency on this property is being questioned, I've ignored it for now. If we decide on a solution for this at a later point, it can also be easily fixed up. Please apply if appropriate. Thanks for your help- John Signed-off-by: John Rose diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Tue Sep 28 15:11:38 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Tue Sep 28 15:11:38 2004 @@ -190,7 +190,7 @@ ibm_write_pci_config = rtas_token("ibm,write-pci-config"); } -unsigned long __init get_phb_buid (struct device_node *phb) +unsigned long __devinit get_phb_buid (struct device_node *phb) { int addr_cells; unsigned int *buid_vals; @@ -220,48 +220,85 @@ return buid; } -static struct pci_controller * __init alloc_phb(struct device_node *dev, - unsigned int addr_size_words) +static enum phb_types get_phb_type(struct device_node *dev) { - struct pci_controller *phb; - unsigned int *ui_ptr = NULL, len; - struct reg_property64 reg_struct; - int *bus_range; + enum phb_types type; char *model; - enum phb_types phb_type; - struct property *of_prop; model = (char *)get_property(dev, "model", NULL); if (!model) { - printk(KERN_ERR "alloc_phb: phb has no model property\n"); + printk(KERN_ERR "%s: phb has no model property\n", + __FUNCTION__); model = ""; } + if (strstr(model, "Python")) { + type = phb_type_python; + } else if (strstr(model, "Speedwagon")) { + type = phb_type_speedwagon; + } else if (strstr(model, "Winnipeg")) { + type = phb_type_winnipeg; + } else { + printk(KERN_ERR "%s: unknown PHB %s\n", __FUNCTION__, model); + type = phb_type_unknown; + } + + return type; +} + +int get_phb_reg_prop(struct device_node *dev, unsigned int addr_size_words, + struct reg_property64 *reg) +{ + unsigned int *ui_ptr = NULL, len; + /* Found a PHB, now figure out where his registers are mapped. 
*/ ui_ptr = (unsigned int *) get_property(dev, "reg", &len); if (ui_ptr == NULL) { PPCDBG(PPCDBG_PHBINIT, "\tget reg failed.\n"); - return NULL; + return 1; } if (addr_size_words == 1) { - reg_struct.address = ((struct reg_property32 *)ui_ptr)->address; - reg_struct.size = ((struct reg_property32 *)ui_ptr)->size; + reg->address = ((struct reg_property32 *)ui_ptr)->address; + reg->size = ((struct reg_property32 *)ui_ptr)->size; } else { - reg_struct = *((struct reg_property64 *)ui_ptr); + *reg = *((struct reg_property64 *)ui_ptr); } - if (strstr(model, "Python")) { - phb_type = phb_type_python; - } else if (strstr(model, "Speedwagon")) { - phb_type = phb_type_speedwagon; - } else if (strstr(model, "Winnipeg")) { - phb_type = phb_type_winnipeg; - } else { - printk(KERN_ERR "alloc_phb: unknown PHB %s\n", model); - phb_type = phb_type_unknown; - } + return 0; +} + +int phb_set_bus_ranges(struct device_node *dev, struct pci_controller *phb) +{ + int *bus_range; + unsigned int len; + + bus_range = (int *) get_property(dev, "bus-range", &len); + if (bus_range == NULL || len < 2 * sizeof(int)) { + return 1; + } + + phb->first_busno = bus_range[0]; + phb->last_busno = bus_range[1]; + + return 0; +} + +static struct pci_controller *alloc_phb(struct device_node *dev, + unsigned int addr_size_words) +{ + struct pci_controller *phb; + struct reg_property64 reg_struct; + enum phb_types phb_type; + struct property *of_prop; + int rc; + + phb_type = get_phb_type(dev); + + rc = get_phb_reg_prop(dev, addr_size_words, ®_struct); + if (rc) + return NULL; phb = pci_alloc_pci_controller(phb_type); if (phb == NULL) @@ -270,11 +307,9 @@ if (phb_type == phb_type_python) python_countermeasures(reg_struct.address); - bus_range = (int *) get_property(dev, "bus-range", &len); - if (bus_range == NULL || len < 2 * sizeof(int)) { - kfree(phb); + rc = phb_set_bus_ranges(dev, phb); + if (rc) return NULL; - } of_prop = (struct property *)alloc_bootmem(sizeof(struct property) + 
sizeof(phb->global_number)); @@ -291,9 +326,6 @@ memcpy(of_prop->value, &phb->global_number, sizeof(phb->global_number)); prom_add_property(dev, of_prop); - phb->first_busno = bus_range[0]; - phb->last_busno = bus_range[1]; - phb->arch_data = dev; phb->ops = &rtas_pci_ops; @@ -302,6 +334,40 @@ return phb; } +static struct pci_controller * __devinit alloc_phb_dynamic(struct device_node *dev, unsigned int addr_size_words) +{ + struct pci_controller *phb; + struct reg_property64 reg_struct; + enum phb_types phb_type; + int rc; + + phb_type = get_phb_type(dev); + + rc = get_phb_reg_prop(dev, addr_size_words, ®_struct); + if (rc) + return NULL; + + phb = pci_alloc_phb_dynamic(phb_type); + if (phb == NULL) + return NULL; + + if (phb_type == phb_type_python) + python_countermeasures(reg_struct.address); + + rc = phb_set_bus_ranges(dev, phb); + if (rc) + return NULL; + + /* TODO: linux,pci-domain? */ + + phb->arch_data = dev; + phb->ops = &rtas_pci_ops; + + phb->buid = get_phb_buid(dev); + + return phb; +} + unsigned long __init find_and_init_phbs(void) { struct device_node *node; @@ -330,7 +396,8 @@ if (!phb) continue; - pci_process_bridge_OF_ranges(phb, node, index == 0); + pci_process_bridge_OF_ranges(phb, node); + pci_setup_phb_io(phb, index == 0); if (naca->interrupt_controller == IC_OPEN_PIC) { int addr = root_size_cells * (index + 2) - 1; @@ -346,6 +413,34 @@ return 0; } +struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) +{ + struct device_node *root = of_find_node_by_path("/"); + unsigned int root_size_cells = 0; + struct pci_controller *phb; + struct pci_bus *bus; + + root_size_cells = prom_n_size_cells(root); + + phb = alloc_phb_dynamic(dn, root_size_cells); + if (!phb) + return NULL; + + pci_process_bridge_OF_ranges(phb, dn); + + pci_setup_phb_io_dynamic(phb); + of_node_put(root); + + pci_devs_phb_init_dynamic(phb); + phb->last_busno = 0xff; + bus = pci_scan_bus(phb->first_busno, phb->ops, phb->arch_data); + phb->bus = bus; + 
phb->last_busno = bus->subordinate; + + return phb; +} +EXPORT_SYMBOL(init_phb_dynamic); + #if 0 void pcibios_name_device(struct pci_dev *dev) { @@ -464,7 +559,7 @@ } EXPORT_SYMBOL(remap_bus_range); -static void phbs_fixup_io(void) +static void phbs_remap_io(void) { struct pci_controller *hose, *tmp; @@ -472,7 +567,6 @@ remap_bus_range(hose->bus); } - /* RPA-specific bits for removing PHBs */ int pcibios_remove_root_bus(struct pci_controller *phb) { @@ -516,6 +610,9 @@ } list_del(&phb->list_node); + if (phb->is_dynamic) + kfree(phb); + return 0; } EXPORT_SYMBOL(pcibios_remove_root_bus); @@ -559,7 +656,7 @@ } } - phbs_fixup_io(); + phbs_remap_io(); pSeries_request_regions(); pci_fix_bus_sysdata(); diff -Nru a/arch/ppc64/kernel/pci.c b/arch/ppc64/kernel/pci.c --- a/arch/ppc64/kernel/pci.c Tue Sep 28 15:11:38 2004 +++ b/arch/ppc64/kernel/pci.c Tue Sep 28 15:11:38 2004 @@ -179,26 +179,11 @@ res->start = start; } -/* - * Allocate pci_controller(phb) initialized common variables. - */ -struct pci_controller * __init -pci_alloc_pci_controller(enum phb_types controller_type) +static void phb_set_model(struct pci_controller *hose, + enum phb_types controller_type) { - struct pci_controller *hose; char *model; -#ifdef CONFIG_PPC_ISERIES - hose = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), GFP_KERNEL); -#else - hose = (struct pci_controller *)alloc_bootmem(sizeof(struct pci_controller)); -#endif - if(hose == NULL) { - printk(KERN_ERR "PCI: Allocate pci_controller failed.\n"); - return NULL; - } - memset(hose, 0, sizeof(struct pci_controller)); - switch(controller_type) { #ifdef CONFIG_PPC_ISERIES case phb_type_hypervisor: @@ -226,12 +211,63 @@ strcpy(hose->what,model); else memcpy(hose->what,model,7); - hose->type = controller_type; - hose->global_number = global_phb_number++; +} +/* + * Allocate pci_controller(phb) initialized common variables. 
+ */ +struct pci_controller * __init +pci_alloc_pci_controller(enum phb_types controller_type) +{ + struct pci_controller *hose; + +#ifdef CONFIG_PPC_ISERIES + hose = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), + GFP_KERNEL); +#else + hose = (struct pci_controller *)alloc_bootmem(sizeof(struct pci_controller)); +#endif + if (hose == NULL) { + printk(KERN_ERR "PCI: Allocate pci_controller failed.\n"); + return NULL; + } + memset(hose, 0, sizeof(struct pci_controller)); + + phb_set_model(hose, controller_type); + + hose->is_dynamic = 0; + hose->type = controller_type; + hose->global_number = global_phb_number++; + + list_add_tail(&hose->list_node, &hose_list); + + return hose; +} + +/* + * Dymnamically allocate pci_controller(phb), initialize common variables. + */ +struct pci_controller * +pci_alloc_phb_dynamic(enum phb_types controller_type) +{ + struct pci_controller *hose; + + hose = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), + GFP_KERNEL); + if(hose == NULL) { + printk(KERN_ERR "PCI: Allocate pci_controller failed.\n"); + return NULL; + } + memset(hose, 0, sizeof(struct pci_controller)); + + phb_set_model(hose, controller_type); + + hose->is_dynamic = 1; + hose->type = controller_type; + hose->global_number = global_phb_number++; list_add_tail(&hose->list_node, &hose_list); - return hose; + return hose; } static void __init pcibios_claim_one_bus(struct pci_bus *b) @@ -534,7 +570,7 @@ #define ISA_SPACE_MASK 0x1 #define ISA_SPACE_IO 0x1 -static void pci_process_ISA_OF_ranges(struct device_node *isa_node, +static void __devinit pci_process_ISA_OF_ranges(struct device_node *isa_node, unsigned long phb_io_base_phys, void * phb_io_base_virt) { @@ -579,8 +615,8 @@ } } -void __init pci_process_bridge_OF_ranges(struct pci_controller *hose, - struct device_node *dev, int primary) +void __devinit pci_process_bridge_OF_ranges(struct pci_controller *hose, + struct device_node *dev) { unsigned int *ranges; unsigned long size; @@ -589,7 
+625,6 @@ struct resource *res; int np, na = prom_n_addr_cells(dev); unsigned long pci_addr, cpu_phys_addr; - struct device_node *isa_dn; np = na + 5; @@ -617,31 +652,11 @@ switch (ranges[0] >> 24) { case 1: /* I/O space */ hose->io_base_phys = cpu_phys_addr; - hose->io_base_virt = reserve_phb_iospace(size); - PPCDBG(PPCDBG_PHBINIT, - "phb%d io_base_phys 0x%lx io_base_virt 0x%lx\n", - hose->global_number, hose->io_base_phys, - (unsigned long) hose->io_base_virt); - - if (primary) { - pci_io_base = (unsigned long)hose->io_base_virt; - isa_dn = of_find_node_by_type(NULL, "isa"); - if (isa_dn) { - isa_io_base = pci_io_base; - pci_process_ISA_OF_ranges(isa_dn, - hose->io_base_phys, - hose->io_base_virt); - of_node_put(isa_dn); - /* Allow all IO */ - io_page_mask = -1; - } - } + hose->pci_io_size = size; res = &hose->io_resource; res->flags = IORESOURCE_IO; res->start = pci_addr; - res->start += (unsigned long)hose->io_base_virt - - pci_io_base; break; case 2: /* memory space */ memno = 0; @@ -666,6 +681,55 @@ } ranges += np; } +} + +void __init pci_setup_phb_io(struct pci_controller *hose, int primary) +{ + unsigned long size = hose->pci_io_size; + unsigned long io_virt_offset; + struct resource *res; + struct device_node *isa_dn; + + hose->io_base_virt = reserve_phb_iospace(size); + PPCDBG(PPCDBG_PHBINIT, "phb%d io_base_phys 0x%lx io_base_virt 0x%lx\n", + hose->global_number, hose->io_base_phys, + (unsigned long) hose->io_base_virt); + + if (primary) { + pci_io_base = (unsigned long)hose->io_base_virt; + isa_dn = of_find_node_by_type(NULL, "isa"); + if (isa_dn) { + isa_io_base = pci_io_base; + pci_process_ISA_OF_ranges(isa_dn, hose->io_base_phys, + hose->io_base_virt); + of_node_put(isa_dn); + /* Allow all IO */ + io_page_mask = -1; + } + } + + io_virt_offset = (unsigned long)hose->io_base_virt - pci_io_base; + res = &hose->io_resource; + res->start += io_virt_offset; + res->end += io_virt_offset; +} + +void __devinit pci_setup_phb_io_dynamic(struct pci_controller 
*hose) +{ + unsigned long size = hose->pci_io_size; + unsigned long io_virt_offset; + struct resource *res; + + hose->io_base_virt = __ioremap(hose->io_base_phys, size, + _PAGE_NO_CACHE); + PPCDBG(PPCDBG_PHBINIT, "phb%d io_base_phys 0x%lx io_base_virt 0x%lx\n", + hose->global_number, hose->io_base_phys, + (unsigned long) hose->io_base_virt); + + io_virt_offset = (unsigned long)hose->io_base_virt - pci_io_base; + res = &hose->io_resource; + res->start += io_virt_offset; + res->end += io_virt_offset; } /*********************************************************************** diff -Nru a/arch/ppc64/kernel/pci.h b/arch/ppc64/kernel/pci.h --- a/arch/ppc64/kernel/pci.h Tue Sep 28 15:11:38 2004 +++ b/arch/ppc64/kernel/pci.h Tue Sep 28 15:11:38 2004 @@ -15,7 +15,12 @@ extern unsigned long isa_io_base; extern struct pci_controller* pci_alloc_pci_controller(enum phb_types controller_type); +extern struct pci_controller* pci_alloc_phb_dynamic(enum phb_types controller_type); +extern void pci_setup_phb_io(struct pci_controller *hose, int primary); + extern struct pci_controller* pci_find_hose_for_OF_device(struct device_node* node); +extern void pci_setup_phb_io_dynamic(struct pci_controller *hose); + extern struct list_head hose_list; extern int global_phb_number; @@ -36,6 +41,7 @@ void *data); void pci_devs_phb_init(void); +void pci_devs_phb_init_dynamic(struct pci_controller *phb); void pci_fix_bus_sysdata(void); struct device_node *fetch_dev_dn(struct pci_dev *dev); diff -Nru a/arch/ppc64/kernel/pci_dn.c b/arch/ppc64/kernel/pci_dn.c --- a/arch/ppc64/kernel/pci_dn.c Tue Sep 28 15:11:38 2004 +++ b/arch/ppc64/kernel/pci_dn.c Tue Sep 28 15:11:38 2004 @@ -42,7 +42,7 @@ * Traverse_func that inits the PCI fields of the device node. * NOTE: this *must* be done before read/write config to the device. 
*/ -static void * __init update_dn_pci_info(struct device_node *dn, void *data) +static void * __devinit update_dn_pci_info(struct device_node *dn, void *data) { struct pci_controller *phb = data; u32 *regs; @@ -139,6 +139,12 @@ return NULL; } +void __devinit pci_devs_phb_init_dynamic(struct pci_controller *phb) +{ + /* Update dn->phb ptrs for new phb and children devices */ + traverse_pci_devices((struct device_node *)phb->arch_data, + update_dn_pci_info, phb); +} /* * Traversal func that looks for a value. diff -Nru a/include/asm-ppc64/pci-bridge.h b/include/asm-ppc64/pci-bridge.h --- a/include/asm-ppc64/pci-bridge.h Tue Sep 28 15:11:38 2004 +++ b/include/asm-ppc64/pci-bridge.h Tue Sep 28 15:11:38 2004 @@ -34,6 +34,7 @@ char what[8]; /* Eye catcher */ enum phb_types type; /* Type of hardware */ struct pci_bus *bus; + char is_dynamic; void *arch_data; struct list_head list_node; @@ -47,6 +48,7 @@ * the PCI memory space in the CPU bus space */ unsigned long pci_mem_offset; + unsigned long pci_io_size; struct pci_ops *ops; volatile unsigned int *cfg_addr; @@ -88,7 +90,7 @@ } extern void pci_process_bridge_OF_ranges(struct pci_controller *hose, - struct device_node *dev, int primary); + struct device_node *dev); extern int pcibios_remove_root_bus(struct pci_controller *phb); diff -Nru a/include/asm-ppc64/pci.h b/include/asm-ppc64/pci.h --- a/include/asm-ppc64/pci.h Tue Sep 28 15:11:38 2004 +++ b/include/asm-ppc64/pci.h Tue Sep 28 15:11:38 2004 @@ -229,6 +229,8 @@ extern void pcibios_fixup_device_resources(struct pci_dev *dev, struct pci_bus *bus); +extern struct pci_controller *init_phb_dynamic(struct device_node *dn); + extern int pci_read_irq_line(struct pci_dev *dev); extern void pcibios_add_platform_entries(struct pci_dev *dev); From johnrose at austin.ibm.com Wed Sep 29 06:29:17 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 28 Sep 2004 15:29:17 -0500 Subject: [PATCH] [resend] create pcibios_remove_root_bus() In-Reply-To: 
<1096388111.27834.14.camel@sinatra.austin.ibm.com> References: <1096388111.27834.14.camel@sinatra.austin.ibm.com> Message-ID: <1096403357.27834.43.camel@sinatra.austin.ibm.com> Hi Paul- Forgot the signed-off-by, sorry. :) Here's an updated version of this patch after Ben's changes. Please apply if there are no objections. The following patch creates pcibios_remove_root_bus(), which performs the ppc64-specific actions for removal of PCI Host Bridges. This call is invoked by the RPA DLPAR driver upon PHB removal. Thanks for your help- John Signed-off-by: John Rose diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Tue Sep 28 11:10:41 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Tue Sep 28 11:10:41 2004 @@ -472,6 +472,54 @@ remap_bus_range(hose->bus); } + +/* RPA-specific bits for removing PHBs */ +int pcibios_remove_root_bus(struct pci_controller *phb) +{ + struct pci_bus *b = phb->bus; + struct resource *res; + int rc, i; + + res = b->resource[0]; + if (!res->flags) { + printk(KERN_ERR "%s: no IO resource for PHB %s\n", __FUNCTION__, + b->name); + return 1; + } + + rc = unmap_bus_range(b); + if (rc) { + printk(KERN_ERR "%s: failed to unmap IO on bus %s\n", + __FUNCTION__, b->name); + return 1; + } + + if (release_resource(res)) { + printk(KERN_ERR "%s: failed to release IO on bus %s\n", + __FUNCTION__, b->name); + return 1; + } + + for (i = 1; i < 3; ++i) { + res = b->resource[i]; + if (!res->flags && i == 0) { + printk(KERN_ERR "%s: no MEM resource for PHB %s\n", + __FUNCTION__, b->name); + return 1; + } + if (res->flags && release_resource(res)) { + printk(KERN_ERR + "%s: failed to release IO %d on bus %s\n", + __FUNCTION__, i, b->name); + return 1; + } + } + + list_del(&phb->list_node); + return 0; +} +EXPORT_SYMBOL(pcibios_remove_root_bus); + static void __init pSeries_request_regions(void) { struct device_node *i8042; diff -Nru a/include/asm-ppc64/pci-bridge.h b/include/asm-ppc64/pci-bridge.h --- 
a/include/asm-ppc64/pci-bridge.h Tue Sep 28 11:10:41 2004 +++ b/include/asm-ppc64/pci-bridge.h Tue Sep 28 11:10:41 2004 @@ -90,6 +90,8 @@ extern void pci_process_bridge_OF_ranges(struct pci_controller *hose, struct device_node *dev, int primary); +extern int pcibios_remove_root_bus(struct pci_controller *phb); + /* Use this macro after the PCI bus walk for max performance when it * is known that sysdata is correct. */ From johnrose at austin.ibm.com Wed Sep 29 06:35:31 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 28 Sep 2004 15:35:31 -0500 Subject: [PATCH] [resend] RPA dynamic addition/removal of PCI Host Bridges In-Reply-To: <1096403140.27834.39.camel@sinatra.austin.ibm.com> References: <1096403140.27834.39.camel@sinatra.austin.ibm.com> Message-ID: <1096403731.583.1.camel@sinatra.austin.ibm.com> Please note that this patch only applies after the pcibios_remove_root_bus() patch. Last email of the day, I promise :) Thanks! From solrac at us.ibm.com Wed Sep 29 08:02:07 2004 From: solrac at us.ibm.com (Frank Carlos) Date: Tue, 28 Sep 2004 15:02:07 -0700 Subject: IBM Launches Zone for Power Architecture developers Message-ID: IBM Launches Zone for Power Architecture developers The Power Architecture Zone is dedicated to all things POWER and PowerPC, and to the individuals who work with them: microprocessor technologists, OEMs, embedded designers, POWER toolsmiths, electrical engineers, PowerPC software developers, administrators, university researchers -- in short, everybody who works with POWER or PowerPC or related technologies. It's not just about a faster chip, it's the fact that POWER technology is open to collaboration that's important. Join the Power Architecture global community and have open dialogues with other engineers on Power Architecture developments and important changes in the POWER industry.
Kind regards, Frank Carlos From david at gibson.dropbear.id.au Wed Sep 29 11:40:17 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 29 Sep 2004 11:40:17 +1000 Subject: mapping memory in 0xb space In-Reply-To: References: <20040928004454.GD8889@zax> Message-ID: <20040929014017.GC5470@zax> On Tue, Sep 28, 2004 at 01:52:16PM -0500, Igor Grobman wrote: > On Tue, 28 Sep 2004, David Gibson wrote: > > > > Err... this seems likely to cause trouble. > > So it has :-) > > > Recent kernels don't even > > have VSIDs allocated for the 0xb... region. > > Looking at both 2.6.8 and 2.4.21, I don't see a difference in > get_kernel_vsid() code. Ok, *very* recent kernels. The new VSID algorithm has gone into the BK tree since 2.6.8. > > 2.4.21 will have some > > kernel VSIDs there, but there will probably be a bunch of places that > > assume addresses below 0xc... are user addresses and compute the vsids > > accordingly. > > Note that since I put in a bolted entry into the HPT, I assume I don't > have to worry about page fault handling code. That sounds correct. > This leaves segments. Both > DataAccess_common and DataAccessSLB_common call > do_stab_bolted/do_slb_bolted when confronted with an address in 0xb > region. Oh, so it does. That, I think is a 2.4 thing, long gone in 2.6 (even before the SLB rewrite, I'm pretty sure do_slb_bolted was only called for 0xc addresses). > Presumably, this will fault in the segments I am interested in. Yes, actually, it should. Ok, I guess the problem is deeper than I thought. > > Differences in the code for segment table and SLB > > machines are probably why it works (or seems to) on power3 but not > > power4. > > SLB machines are LPAR machines, correct? Not necessarily. > In that case, neither the power3 > nor the power4 I am using are SLB machines. No, power4 is definitely SLB. 
> Also, I narrowed it down to > working (or appearing to work) as long as the highest 5 bits of the page > index (those that end up as partial index in the HPTE) are zero. This may > just be a weird coincidence. Could be. > > Also, unless you've created a new set, there are no Linux page tables > for the 0xb region - the only linux page tables for the kernel are for > the vmalloc (0xd) and ioremap (0xe) regions. > > I am building page tables off the bolted_dir, which does exist in 2.4.21 > for the 0xb region. But again, assuming I insert a bolted entry in HPT, > do I even need to worry about linux page tables as such? Oh, so they do. But yes, you shouldn't need to worry about them. > > Why on earth do you want to do this? > > Good question ;-). A long long time ago, I posted on this list and > explained. Since then, I found what appeared to be a solution, except > that it appears power4 breaks it. I am building a tool that allows > dynamic splicing of code into a running kernel (see > http://www.paradyn.org/html/kerninst.html). In order for this to work, I > need to be able to overwrite a single instruction with a jump to > spliced-in code. The target of the jump needs to be within the range (26 > bits). Therefore, I have a choice of 0xbfff.. addresses with backward > jumps from 0xc region, or the 0xff.. addresses for absolute jumps. I > chose 0xbff.., because I found already-working code, originally written > for the performance counter interface. Am I making more sense now? Aha! But this does actually explain the problem - there are only VSIDs assigned for the first 2^41 bytes of each region - so although there are vsids for 0xb000000000000000-0xb00001ffffffffff, there aren't any for 0xbff... addresses.
The code doesn't mask the ESID down to 13 bits as get_kernel_vsid() does, but it probably should - an overlarge ESID will cause collisions with VSIDs from entirely different address places, which would be a Bad Thing. Actually, you should be able to allow ESIDs of up to 21 bits there (36 bit VSID - 15 bits of "context"). But you will need to make sure get_kernel_vsid(), or whatever you're using to calculate the VAs for the hash HPTEs is updated to match - at the moment I think it will mask down to 13 bits. I'm not sure if that will get you sufficiently close to 0xc0... for your purposes. > > On Mon, Sep 27, 2004 at 02:47:15PM -0500, Igor Grobman wrote: > > > I would like to be able to remap memory into 0xb space with 2.4.21 > kernel. > > > I have code that works fine on power3 box, but fails miserably on my > dual > > > power4+ box. I am using btmalloc() from pmc.c with a modified range. > > > Normal btmalloc() allocation works fine, but if I change the range to > > > start with, e.g. 0xb00001FFFFF00000 (instead of 0xb00..), the kernel > > > crashes when accessing the allocated page. > > > > > > For those of you unfamiliar with btmalloc() code, it finds a physical > page > > > using get_free_pages(), subtracts PAGE_OFFSET (i.e. 0xc00...) to form > a > > > physical address, then inserts the page into linux page tables, using > the > > > VSID calculated with get_kernel_vsid() and the physical address > > > calculated above. It also inserts an HPTE into the hardware page > table. > > > It doesn't do anything with regards to allocating segments. I > understand > > > the segments should be faulted in. I looked at the code in head.S, > and it > > > appears that do_stab_bolted should be doing this. Yet, I am missing > someth$ > > > because the btmalloc() code does not in fact work for pages in the > range I > > > specified above. > > > > > > I am running this on 7029-6E3 (p615?) machine with two power4+ > processors. 
> > > I am using the kernel from Suse's 2.4.21-215 source package. > > > Any ideas and attempts to un-confuse me are welcome. > -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From jonmasters at gmail.com Wed Sep 29 11:51:01 2004 From: jonmasters at gmail.com (Jon Masters) Date: Wed, 29 Sep 2004 02:51:01 +0100 Subject: [Kernel-janitors] LinuxCPD - New mini project In-Reply-To: <20040929010532.82474.qmail@web41005.mail.yahoo.com> References: <20040929010532.82474.qmail@web41005.mail.yahoo.com> Message-ID: <35fb2e59040928185177ca0255@mail.gmail.com> On Tue, 28 Sep 2004 18:05:32 -0700 (PDT), Aaron Grothe wrote: > I've put a small project up at Sourceforge http://linuxcpd.sf.net > Looking at I'm happily surprised about > how much less duplicated code there appears to be in the 2.6 kernel series. Generally a good thing. It does reveal quite a few duplications between ppc and ppc64 trees which might be worth pursuing sometime. Jon. From igor at cs.wisc.edu Wed Sep 29 15:14:08 2004 From: igor at cs.wisc.edu (Igor Grobman) Date: Wed, 29 Sep 2004 00:14:08 -0500 (CDT) Subject: mapping memory in 0xb space In-Reply-To: <20040929014017.GC5470@zax> Message-ID: On Wed, 29 Sep 2004, David Gibson wrote: > On Tue, Sep 28, 2004 at 01:52:16PM -0500, Igor Grobman wrote: > > On Tue, 28 Sep 2004, David Gibson wrote: > > > > > Recent kernels don't even > > > have VSIDs allocated for the 0xb... region. > > > > Looking at both 2.6.8 and 2.4.21, I don't see a difference in > > get_kernel_vsid() code. > > Ok, *very* recent kernels. The new VSID algorithm has gone into the > BK tree since 2.6.8. From the description I read, I might be better off using 0xfff.. addresses with that algorithm. Not a big deal. > > > This leaves segments.
Both > > DataAccess_common and DataAccessSLB_common call > > do_stab_bolted/do_slb_bolted when confronted with an address in 0xb > > region. > > Oh, so it does. That, I think is a 2.4 thing, long gone in 2.6 (even > before the SLB rewrite, I'm pretty sure do_slb_bolted was only called > for 0xc addresses). In my 2.4.21 source, do_slb_bolted does get called for 0xb addresses. And thanks for letting me know about power4 being SLB. I was clueless on the issue. > > > Presumably, this will fault in the segments I am interested in. > > Yes, actually, it should. Ok, I guess the problem is deeper than I > thought. Or is it? > > > > Also, I narrowed it down to > > working (or appearing to work) as long as the highest 5 bits of the page > > index (those that end up as partial index in the HPTE) are zero. This may > > just be a weird coincidence. > > Could be. > > > > > Why on earth do you want to do this? > > > > Good question ;-). A long long time ago, I posted on this list and > > explained. Since then, I found what appeared to be a solution, except > > that it appears power4 breaks it. I am building a tool that allows > > dynamic splicing of code into a running kernel (see > > http://www.paradyn.org/html/kerninst.html). In order for this to work, I > > need to be able to overwrite a single instruction with a jump to > > spliced-in code. The target of the jump needs to be within the range (26 > > bits). Therefore, I have a choice of 0xbfff.. addresses with backward > > jumps from 0xc region, or the 0xff.. addresses for absolute jumps. I > > chose 0xbff.., because I found already-working code, originally written > > for the performance counter interface. Am I making more sense now? > > Aha! But this does actually explain the problem - there are only > VSIDs assigned for the first 2^41 bits of each region - so although > there are vsids for 0xb000000000000000-0xb00001ffffffffff, there > aren't any for 0xbff... addresses. 
Likewise the Linux pagetables only > cover a 41-bit address range, but that won't matter if you're creating > HPTEs directly. And this is why I avoided explaining fully in my first email :-). I'd like to solve one problem at a time. What I said in my initial email is accurate. Even within the valid VSID range, if the highest 5 bits of the page index are not zero, I get a crash on access (e.g. 0xb00001FFFFF00000, but works on 0xb00001FFF0000000). As for why I thought 0xbff would work, I reasoned that since the highest bits are masked out in get_kernel_vsid(), and since nobody else is using the 0xb region, it doesn't matter if I get a VSID that is the same as some other VSID in 0xb region. However, I did not consider the bug in do_slb_bolted that you describe below. > > You may have seen the comment in do_slb_bolted which claims to permit > a full 32-bits of ESID - it's wrong. The code doesn't mask the ESID > down to 13 bits as get_kernel_vsid() does, but it probably should - an > overlarge ESID will cause collisions with VSIDs from entirely > different address places, which would be a Bad Thing. This must be happening, although I would still like to know why it misbehaves even within the valid VSID range. > > Actually, you should be able to allow ESIDs of up to 21 bits there (36 > bit VSID - 15 bits of "context"). But you will need to make sure > get_kernel_vsid(), or whatever you're using to calculate the VAs for > the hash HPTEs is updated to match - at the moment I think it will > mask down to 13 bits. I'm not sure if that will get you sufficiently > close to 0xc0... for your purposes. No, it's not close enough--I really must have that very last segment. It sounds like I was simply getting lucky on the power3 machine. Without the mask, I must have been getting random pages, and happily overwriting them. Any ideas on how I might map that very last segment of 0xb, or for that matter the very last segment of 0xf ? 
It need not be pretty, but it cannot involve modifying the kernel source, though it can rely on whatever dirty tricks a kernel module might get away with. I don't want to modify the source, because I would like the tool to work on unmodified kernels. It's starting to sound like an impossible task (at least on non-recent kernels). I think I might go with a backup suboptimal solution, which involves extra jumps, but at least it might work. Thanks again, Igor > > > > On Mon, Sep 27, 2004 at 02:47:15PM -0500, Igor Grobman wrote: > > > > I would like to be able to remap memory into 0xb space with 2.4.21 > > kernel. > > > > I have code that works fine on power3 box, but fails miserably on my > > dual > > > > power4+ box. I am using btmalloc() from pmc.c with a modified range. > > > > Normal btmalloc() allocation works fine, but if I change the range to > > > > start with, e.g. 0xb00001FFFFF00000 (instead of 0xb00..), the kernel > > > > crashes when accessing the allocated page. > > > > > > > > For those of you unfamiliar with btmalloc() code, it finds a physical > > page > > > > using get_free_pages(), subtracts PAGE_OFFSET (i.e. 0xc00...) to form > > a > > > > physical address, then inserts the page into linux page tables, using > > the > > > > VSID calculated with get_kernel_vsid() and the physical address > > > > calculated above. It also inserts an HPTE into the hardware page > > table. > > > > It doesn't do anything with regards to allocating segments. I > > understand > > > > the segments should be faulted in. I looked at the code in head.S, > > and it > > > > appears that do_stab_bolted should be doing this. Yet, I am missing > > someth$ > > > > because the btmalloc() code does not in fact work for pages in the > > range I > > > > specified above. > > > > > > > > I am running this on 7029-6E3 (p615?) machine with two power4+ > > processors. > > > > I am using the kernel from Suse's 2.4.21-215 source package. 
> > > > > > > > Any ideas and attempts to un-confuse me are welcome. > > > > From sharada at in.ibm.com Wed Sep 29 20:17:00 2004 From: sharada at in.ibm.com (R Sharada) Date: Wed, 29 Sep 2004 15:47:00 +0530 Subject: reading files in /proc/device-tree Message-ID: <20040929101700.GA2623@in.ibm.com> Hello, I am looking for a way to be able to read data files for nodes under /proc/device-tree, from userspace. I do see a fs/proc/proc_devtree.c which has the kernel code to read properties from nodes under /proc/device-tree. However, I could not find out how this is being exported to the userspace. Basically, I have a requirement to be able to find out System RAM addressable ranges from userspace. I was told it could be obtained by reading the memory nodes's reg property from under /proc/device-tree. Any pointers or ideas will be helpful. Thanks and Regards, Sharada From arnd at arndb.de Wed Sep 29 23:24:00 2004 From: arnd at arndb.de (Arnd Bergmann) Date: Wed, 29 Sep 2004 15:24:00 +0200 Subject: [Kernel-janitors] LinuxCPD - New mini project In-Reply-To: <35fb2e59040928185177ca0255@mail.gmail.com> References: <20040929010532.82474.qmail@web41005.mail.yahoo.com> <35fb2e59040928185177ca0255@mail.gmail.com> Message-ID: <200409291524.04624.arnd@arndb.de> On Mittwoch, 29. September 2004 03:51, Jon Masters wrote: > On Tue, 28 Sep 2004 18:05:32 -0700 (PDT), Aaron Grothe > wrote: > > > I've put a small project up at Sourceforge http://linuxcpd.sf.net > > > Looking at I'm happily surprised about > > how much less duplicated code there appears to be in the 2.6 kernel series. Much of the improvements can probably be attributed to merging the 64 bit subarchitectures of mips and s390 into their respective 32 bit trees. > Generally a good thing. It does reveal quite a few duplications > between ppc and ppc64 trees which might be worth pursuing sometime. 
It may not be a good idea to follow the same route as mips and s390 for ppc64 and completely get rid of the 64 bit tree, because the ppc64 tree already doesn't need much of the legacy code in ppc. OTOH, it would be nice to share the include/asm tree in order to simplify life for multilib build environments. For arch/ppc*/, we could follow the approach of x86_64, where the files with identical functionality are simply built in the arch/i386 tree. See the patch below for a trivial example of this (these two files are already identical on ppc and ppc64). Unifying the xmon directory could be something more interesting. Arnd <>< diff -u -r1.2 Makefile --- ./arch/ppc64/lib/Makefile 7 Sep 2004 10:32:45 -0000 1.2 +++ ./arch/ppc64/lib/Makefile 29 Sep 2004 13:06:00 -0000 @@ -2,8 +2,11 @@ # Makefile for ppc64-specific library files.. # -lib-y := checksum.o dec_and_lock.o string.o strcase.o +lib-y := checksum.o string.o lib-y += copypage.o memcpy.o copyuser.o + +obj-y += ppclib.o +ppclib-y := $(addprefix ../../ppc/lib/,dec_and_lock.o strcase.o) # Lock primitives are defined as no-ops in include/linux/spinlock.h # for non-SMP configs. Don't build the real versions. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040929/e8c6a439/attachment.pgp From segher at kernel.crashing.org Thu Sep 30 01:26:22 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Wed, 29 Sep 2004 10:26:22 -0500 Subject: reading files in /proc/device-tree In-Reply-To: <20040929101700.GA2623@in.ibm.com> References: <20040929101700.GA2623@in.ibm.com> Message-ID: > I am looking for a way to be able to read data files for nodes under > /proc/device-tree, from userspace. > Basically, I have a requirement to be able to find out System RAM > addressable ranges from userspace. 
I was told it could be obtained by > reading > the memory nodes's reg property from under /proc/device-tree. Just read the file. segher at victim:~ $ xxd -g4 /proc/device-tree/memory/reg 00000000 00000000 00000000 80000000 ; 2GB at 0 00000001 00000000 00000000 80000000 ; 2GB at 4GB (comments are mine, data is faked). There might be more memory nodes; read them all and take the union. Also, the format of the entries is dependent on the #address-cells and #size-cells properties. Hope this helps, Segher (oh, and _do_ read the OF specs: ). From lxie at us.ibm.com Thu Sep 30 02:26:20 2004 From: lxie at us.ibm.com (Linda Xie) Date: Wed, 29 Sep 2004 17:26:20 +0100 Subject: Fw: when to mark reserved low memory pages Message-ID: Here is my final patch which will be submitted to redhat. -Linda diff -purN linux-2.6.8/arch/ppc64/mm/init.c linux-2.6.8-linda/arch/ppc64/mm/init.c --- linux-2.6.8/arch/ppc64/mm/init.c 1970-05-10 22:51:47.153000776 -0400 +++ linux-2.6.8-linda/arch/ppc64/mm/init.c 1970-05-10 23:36:33.774957608 -0400 @@ -706,7 +706,7 @@ void __init mem_init(void) int nid; #endif pg_data_t *pgdat; - unsigned long i; + unsigned long i, addr; struct page *page; unsigned long reservedpages = 0, codesize, initsize, datasize, bsssize; @@ -749,8 +749,22 @@ void __init mem_init(void) bsssize >> 10, initsize >> 10); + /* Mark the RTAS pages as PG_reserved */ + for (addr = (unsigned long)__va(rtas_rmo_buf); + addr < PAGE_ALIGN((unsigned long)__va(rtas_rmo_buf) + RTAS_RMOBUF_MAX); + addr += PAGE_SIZE) { + SetPageReserved(virt_to_page(addr)); + } + mem_init_done = 1; + #ifdef CONFIG_PPC_ISERIES iommu_vio_init(); #endif -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040929/20b3fe7d/attachment.htm From linas at austin.ibm.com Thu Sep 30 03:04:51 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 29 Sep 2004 12:04:51 -0500 Subject: [PATCH] PPC64: xmon: improve process-display command Message-ID: <20040929170451.GA11461@austin.ibm.com> Hi, The following patch adds process display to xmon. It supersedes the previous patch I sent in for this (http://ozlabs.org/ppc64-patches/patch.pl?id=306) The current xmon does have a process display, but it has several problems: 1) it doesn't display important kernel pointers needed for debugging. 2) it displays nothing if the logging level isn't set; and xmon doesn't currently support sysreq commands. 3) if it can be made to display, then it would display stacks too, which is waaaay too much if one has hundreds of processes. 4) it depends on code that uses spinlocks :( 5) it fails to validate any pointers it chases. The attached patch fixes all but 5): it prints a short-form process display with the little 'p' command, and a long form, with stacks, with the big 'P' command. Please forward and apply ... Signed-off-by: Linas Vepstas -------------- next part -------------- --- bk/arch/ppc64/xmon/xmon.c. 2004-09-29 11:10:43.000000000 -0500 +++ edited/arch/ppc64/xmon/xmon.c 2004-09-29 11:52:28.000000000 -0500 @@ -2,6 +2,9 @@ * Routines providing a simple monitor for use on the PowerMac. * * Copyright (C) 1996 Paul Mackerras. + * Copyright (C) 1999-2003 Silicon Graphics, Inc. All Rights Reserved. + * Copyright (C) 2000 Stephane Eranian + * Xscale (R) modifications copyright (C) 2003 Intel Corporation.
* * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License @@ -13,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -101,6 +105,7 @@ static void prdump(unsigned long, long); static int ppc_inst_dump(unsigned long, long, int); void print_address(unsigned long); static void backtrace(struct pt_regs *); +static void xmon_show_stack(unsigned long sp, unsigned long lr, unsigned long pc); static void excprint(struct pt_regs *); static void prregs(struct pt_regs *); static void memops(int); @@ -193,6 +198,7 @@ Commands:\n\ mz zero a block of memory\n\ mi show information about memory allocation\n\ p show the task list\n\ + P show the task list and stacks\n\ r print registers\n\ s single step\n\ S print special registers\n\ @@ -861,6 +867,63 @@ static int emulate_step(struct pt_regs * return 0; } +static inline int +xmon_process_cpu(const task_t *p) +{ + return p->thread_info->cpu; +} + +#define xmon_task_has_cpu(p) (task_curr(p)) + +static void +xmon_show_task(task_t *p) +{ + printf("0x%p %8d %8d %d %4d %c 0x%p %c%s\n", + (void *)p, p->pid, p->parent->pid, + xmon_task_has_cpu(p), xmon_process_cpu(p), + (p->state == 0) ? 'R' : + (p->state < 0) ? 'U' : + (p->state & TASK_UNINTERRUPTIBLE) ? 'D' : + (p->state & TASK_STOPPED || p->ptrace & PT_PTRACED) ? 'T' : + (p->state & TASK_ZOMBIE) ? 'Z' : + (p->state & TASK_INTERRUPTIBLE) ? 'S' : '?', + (void *)(&p->thread), + (p == current) ? 
'*': ' ', + p->comm); +} + +static task_t *xmon_next_thread(const task_t *p) +{ + return pid_task(p->pids[PIDTYPE_TGID].pid_list.next, PIDTYPE_TGID); +} + +static void +xmon_show_state(int prt_stacks) +{ + task_t *g, *p; + + printf("%-*s Pid Parent [*] cpu State %-*s Command\n", + (int)(2*sizeof(void *))+2, "Task Addr", + (int)(2*sizeof(void *))+2, "Thread"); + +#ifdef PER_CPU_RUNQUEUES_NO_LONGER_DECLARED_STATIC_IN_SCHED_C + /* Run the active tasks first */ + for (cpu = 0; cpu < NR_CPUS; ++cpu) + if (cpu_online(cpu)) { + p = cpu_curr(cpu); + xmon_show_task(p); + } +#endif + + /* Now the real tasks */ + do_each_thread(g, p) { + xmon_show_task(p); + if (prt_stacks) + xmon_show_stack(p->thread.ksp, 0, 0); + } while ((p = xmon_next_thread(p)) != g); +} + + /* Command interpreting routine */ static char *last_cmd; @@ -946,7 +1009,10 @@ cmds(struct pt_regs *excp) printf(help_string); break; case 'p': - show_state(); + xmon_show_state(0); + break; + case 'P': + xmon_show_state(1); break; case 'b': bpt_cmds(); From linas at austin.ibm.com Thu Sep 30 04:11:40 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 29 Sep 2004 13:11:40 -0500 Subject: Fw: when to mark reserved low memory pages In-Reply-To: References: Message-ID: <20040929181140.GD11461@austin.ibm.com> On Wed, Sep 29, 2004 at 05:26:20PM +0100, Linda Xie was heard to remark: > > Here is my final patch which will be submitted to redhat. You should also ask paulus to put the patch into mainline, right? For that, you should provide an explanation of what this fixes, (why it hadn't been fixed before?) and also a 'Signed-off-by' line. And to satisfy my curiosity, what's this thing used for? I can only see it alloced and user-space read/written via proc interfaces; why is it needed? And why does it have to be in low memory?
--linas From haveblue at us.ibm.com Thu Sep 30 04:12:37 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Wed, 29 Sep 2004 11:12:37 -0700 Subject: Fw: when to mark reserved low memory pages In-Reply-To: References: Message-ID: <1096481557.22404.68.camel@localhost> On Wed, 2004-09-29 at 09:26, Linda Xie wrote: > Here is my final patch which will be submitted to redhat. Might be nicer to do something like this instead of a whole bunch of __va() operations. page_to_pfn() is slightly lighter-weight than virt_to_page() (doesn't really matter at init-time, though). start_pfn = rtas_rmo_buf >> PAGE_SHIFT; end_pfn = (rtas_rmo_buf + RTAS_RMOBUF_MAX) >> PAGE_SHIFT; for (pfn = start_pfn; pfn < end_pfn; pfn++) SetPageReserved(page_to_pfn(pfn)); -- Dave Hansen haveblue at us.ibm.com From johnrose at austin.ibm.com Thu Sep 30 04:31:29 2004 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 29 Sep 2004 13:31:29 -0500 Subject: Fw: when to mark reserved low memory pages In-Reply-To: <1096481557.22404.68.camel@localhost> References: <1096481557.22404.68.camel@localhost> Message-ID: <1096482689.20321.47.camel@sinatra.austin.ibm.com> Hi Dave, thanks for the suggestion. One nit: > start_pfn = rtas_rmo_buf >> PAGE_SHIFT; > end_pfn = (rtas_rmo_buf + RTAS_RMOBUF_MAX) >> PAGE_SHIFT; > for (pfn = start_pfn; pfn < end_pfn; pfn++) > SetPageReserved(page_to_pfn(pfn)); Shouldn't the last line use pfn_to_page()? Thanks- John From haveblue at us.ibm.com Thu Sep 30 04:32:38 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Wed, 29 Sep 2004 11:32:38 -0700 Subject: Fw: when to mark reserved low memory pages In-Reply-To: <1096482689.20321.47.camel@sinatra.austin.ibm.com> References: <1096481557.22404.68.camel@localhost> <1096482689.20321.47.camel@sinatra.austin.ibm.com> Message-ID: <1096482758.22404.70.camel@localhost> On Wed, 2004-09-29 at 11:31, John Rose wrote: > Hi Dave, thanks for the suggestion. 
One nit: > > > start_pfn = rtas_rmo_buf >> PAGE_SHIFT; > > end_pfn = (rtas_rmo_buf + RTAS_RMOBUF_MAX) >> PAGE_SHIFT; > > for (pfn = start_pfn; pfn < end_pfn; pfn++) > > SetPageReserved(page_to_pfn(pfn)); > > Shouldn't the last line use pfn_to_page()? Yep. Didn't I say something *like* that? :) -- Dave Hansen haveblue at us.ibm.com From johnrose at austin.ibm.com Thu Sep 30 05:30:23 2004 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 29 Sep 2004 14:30:23 -0500 Subject: [PATCH] remove __ioremap_explicit() error message Message-ID: <1096486223.20321.69.camel@sinatra.austin.ibm.com> Here's a really long explanation for a really short patch! :) As an unfortunate side effect of runtime addition/removal of PCI Host Bridges, the RPA DLPAR driver can no longer depend on the success of ioremap_explicit() (and therefore remap_page_range()) for the case of DLPAR adding an I/O Slot. Without addressing this, an attempt to add the first child slot of a newly added PHB will fail when __ioremap_explicit() determines the mappings for that range to already exist. For a little context, __ioremap_explicit() creates mappings for the range of a newly added slot. Here's why these calls will be expected to fail in some cases. Keep in mind that at boot-time, the PPC64 kernel calls ioremap() for the entire range spanned by each PHB. Consider the following scenarios of DLPAR-adding an I/O slot. 1) Just after boot, one removes an I/O slot. At this point the range associated with the parent PHB is fragmented, and the child range for the slot in question is iounmap()'ed. One then re-adds the slot, at which point remap_page_range()/ioremap_explicit() restores the mappings that were previously removed. 2) One adds a new PHB, at which point the ppc64-specific addition ioremaps the entire PHB range. One then performs a DLPAR-add of a child slot of that PHB. At this point, mappings already exist for the range of the slot to be added. 
So remap_page_range()/ioremap_explicit() will fail at this point. The problem is, there's not a good way to distinguish between cases 1 and 2 from the perspective of the DLPAR driver. Because of that, I believe the correct solution to be: - Removal of relevant error prints from iounmap_explicit(), which is only used for DLPAR. - Removal of error code checks from the RPA driver Here's the first of these. Thanks- John diff -Nru a/arch/ppc64/mm/init.c b/arch/ppc64/mm/init.c --- a/arch/ppc64/mm/init.c Wed Sep 29 14:10:06 2004 +++ b/arch/ppc64/mm/init.c Wed Sep 29 14:10:06 2004 @@ -265,7 +265,7 @@ } else { area = im_get_area(ea, size, IM_REGION_UNUSED|IM_REGION_SUBSET); if (area == NULL) { - printk(KERN_ERR "could not obtain imalloc area for ea 0x%lx\n", ea); + /* Expected when PHB-dlpar is in play */ return 1; } if (ea != (unsigned long) area->addr) { From olh at suse.de Thu Sep 30 05:47:30 2004 From: olh at suse.de (Olaf Hering) Date: Wed, 29 Sep 2004 21:47:30 +0200 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040913041119.GA5351@zax> References: <20040913041119.GA5351@zax> Message-ID: <20040929194730.GA6292@suse.de> On Mon, Sep 13, David Gibson wrote: > Andrew, please apply. This patch has been tested both on SLB and > segment table machines. This new approach is far from the final word > in VSID/context allocation, but it's a noticeable improvement on the > old method. This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. Hangs after 'returning from prom_init', wants a power cycle. -- USB is for mice, FireWire is for men! 
sUse lINUX ag, nüRNBERG From johnrose at austin.ibm.com Thu Sep 30 08:22:54 2004 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 29 Sep 2004 17:22:54 -0500 Subject: [PATCH] remove __ioremap_explicit() error message In-Reply-To: <1096486223.20321.69.camel@sinatra.austin.ibm.com> References: <1096486223.20321.69.camel@sinatra.austin.ibm.com> Message-ID: <1096496574.25927.6.camel@sinatra.austin.ibm.com> Ack, why can't I ever remember: Signed-off-by: John Rose On Wed, 2004-09-29 at 14:30, John Rose wrote: > Here's a really long explanation for a really short patch! :) > > As an unfortunate side effect of runtime addition/removal of PCI Host Bridges, > the RPA DLPAR driver can no longer depend on the success of ioremap_explicit() > (and therefore remap_page_range()) for the case of DLPAR adding an I/O Slot. > > Without addressing this, an attempt to add the first child slot of a newly > added PHB will fail when __ioremap_explicit() determines the mappings for that > range to already exist. > > For a little context, __ioremap_explicit() creates mappings for the range of a > newly added slot. Here's why these calls will be expected to fail in some > cases. Keep in mind that at boot-time, the PPC64 kernel calls ioremap() for > the entire range spanned by each PHB. Consider the following scenarios of > DLPAR-adding an I/O slot. > > 1) Just after boot, one removes an I/O slot. At this point the range > associated with the parent PHB is fragmented, and the child range for the > slot in question is iounmap()'ed. One then re-adds the slot, at which point > remap_page_range()/ioremap_explicit() restores the mappings that were > previously removed. > > 2) One adds a new PHB, at which point the ppc64-specific addition ioremaps the > entire PHB range. One then performs a DLPAR-add of a child slot of that > PHB. At this point, mappings already exist for the range of the slot to > be added. So remap_page_range()/ioremap_explicit() will fail at this point.
> > The problem is, there's not a good way to distinguish between cases 1 and 2 > from the perspective of the DLPAR driver. Because of that, I believe the > correct solution to be: > > - Removal of relevant error prints from iounmap_explicit(), which is only used > for DLPAR. > - Removal of error code checks from the RPA driver > > Here's the first of these. > > Thanks- > John > > diff -Nru a/arch/ppc64/mm/init.c b/arch/ppc64/mm/init.c > --- a/arch/ppc64/mm/init.c Wed Sep 29 14:10:06 2004 > +++ b/arch/ppc64/mm/init.c Wed Sep 29 14:10:06 2004 > @@ -265,7 +265,7 @@ > } else { > area = im_get_area(ea, size, IM_REGION_UNUSED|IM_REGION_SUBSET); > if (area == NULL) { > - printk(KERN_ERR "could not obtain imalloc area for ea 0x%lx\n", ea); > + /* Expected when PHB-dlpar is in play */ > return 1; > } > if (ea != (unsigned long) area->addr) { From benh at kernel.crashing.org Thu Sep 30 09:08:09 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 30 Sep 2004 09:08:09 +1000 Subject: Fw: when to mark reserved low memory pages In-Reply-To: References: Message-ID: <1096499289.31806.17.camel@gaston> On Thu, 2004-09-30 at 02:26, Linda Xie wrote: > Here is my final patch which will be submitted to redhat. can you resend it not broken by your mailer ? that is not in html form and not wrapped ? thanks. Ben. 
> -Linda > > diff -purN linux-2.6.8/arch/ppc64/mm/init.c > linux-2.6.8-linda/arch/ppc64/mm/init.c > --- linux-2.6.8/arch/ppc64/mm/init.c 1970-05-10 22:51:47.153000776 > -0400 > +++ linux-2.6.8-linda/arch/ppc64/mm/init.c 1970-05-10 > 23:36:33.774957608 -0400 > @@ -706,7 +706,7 @@ void __init mem_init(void) > int nid; > #endif > pg_data_t *pgdat; > - unsigned long i; > + unsigned long i, addr; > struct page *page; > unsigned long reservedpages = 0, codesize, initsize, datasize, > bsssize; > > @@ -749,8 +749,22 @@ void __init mem_init(void) > bsssize >> 10, > initsize >> 10); > > + /* Mark the RTAS pages as PG_reserved */ > + for (addr = (unsigned long)__va(rtas_rmo_buf); > + addr < PAGE_ALIGN((unsigned long)__va(rtas_rmo_buf) + > RTAS_RMOBUF_MAX); > + addr += PAGE_SIZE) { > + SetPageReserved(virt_to_page(addr)); > + } > + > mem_init_done = 1; > > + > #ifdef CONFIG_PPC_ISERIES > iommu_vio_init(); > #endif -- Benjamin Herrenschmidt From david at gibson.dropbear.id.au Thu Sep 30 10:38:46 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 30 Sep 2004 10:38:46 +1000 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040929194730.GA6292@suse.de> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> Message-ID: <20040930003846.GB25001@zax> On Wed, Sep 29, 2004 at 09:47:30PM +0200, Olaf Hering wrote: > On Mon, Sep 13, David Gibson wrote: > > > Andrew, please apply. This patch has been tested both on SLB and > > segment table machines. This new approach is far from the final word > > in VSID/context allocation, but it's a noticeable improvement on the > > old method. > > This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. > Hangs after 'returning from prom_init', wants a power cycle. Have you isolated the problem to the VSID allocation patch? I think there may have been a number of ppc64 changes which went into 2.6.9-rc2-bk2. 
-- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From olh at suse.de Thu Sep 30 15:53:07 2004 From: olh at suse.de (Olaf Hering) Date: Thu, 30 Sep 2004 07:53:07 +0200 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040930003846.GB25001@zax> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> <20040930003846.GB25001@zax> Message-ID: <20040930055307.GA15291@suse.de> On Thu, Sep 30, David Gibson wrote: > On Wed, Sep 29, 2004 at 09:47:30PM +0200, Olaf Hering wrote: > > On Mon, Sep 13, David Gibson wrote: > > > > > Andrew, please apply. This patch has been tested both on SLB and > > > segment table machines. This new approach is far from the final word > > > in VSID/context allocation, but it's a noticeable improvement on the > > > old method. > > > > This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. > > Hangs after 'returning from prom_init', wants a power cycle. > > Have you isolated the problem to the VSID allocation patch? I think > there may have been a number of ppc64 changes which went into > 2.6.9-rc2-bk2. Yes, rc2 does not boot on power3 with that patch. -- USB is for mice, FireWire is for men! sUse lINUX ag, nüRNBERG From david at gibson.dropbear.id.au Thu Sep 30 16:20:48 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 30 Sep 2004 16:20:48 +1000 Subject: [PPC64] EEH checks mistakenly became no-ops Message-ID: <20040930062048.GA21889@zax> Andrew, please apply: Recent changes which removed the use of IO tokens for EEH enabled devices had a bug, which meant we now never do EEH checks at all. This patch corrects the problem. Unfortunately, it does mean we do EEH checks on pSeries whenever any IO returns all 1s.
Signed-off-by: David Gibson Index: working-2.6/include/asm-ppc64/eeh.h =================================================================== --- working-2.6.orig/include/asm-ppc64/eeh.h 2004-09-24 10:14:10.000000000 +1000 +++ working-2.6/include/asm-ppc64/eeh.h 2004-09-30 16:03:17.822907592 +1000 @@ -71,16 +71,10 @@ /* * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. * - * Order this macro for performance. - * If EEH is off for a device and it is a memory BAR, ioremap will - * map it to the IOREGION. In this case addr == vaddr and since these - * should be in registers we compare them first. Next we check for - * ff's which indicates a (very) possible failure. - * * If this macro yields TRUE, the caller relays to eeh_check_failure() * which does further tests out of line. */ -#define EEH_POSSIBLE_IO_ERROR(val, type) ((val) == (type)~0) +#define EEH_POSSIBLE_ERROR(val, type) ((val) == (type)~0) /* * Reads from a device which has been isolated by EEH will return @@ -89,21 +83,13 @@ */ #define EEH_IO_ERROR_VALUE(size) (~0U >> ((4 - (size)) * 8)) -/* - * The vaddr will equal the addr if EEH checking is disabled for - * this device. This is because eeh_ioremap() will not have - * remapped to 0xA0, and thus both vaddr and addr will be 0xE0... - */ -#define EEH_POSSIBLE_ERROR(addr, vaddr, val, type) \ - ((vaddr) != (addr) && EEH_POSSIBLE_IO_ERROR(val, type)) - /* * MMIO read/write operations with EEH support. 
*/ static inline u8 eeh_readb(const volatile void __iomem *addr) { volatile u8 *vaddr = (volatile u8 __force *) addr; u8 val = in_8(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u8)) + if (EEH_POSSIBLE_ERROR(val, u8)) return eeh_check_failure(addr, val); return val; } @@ -115,7 +101,7 @@ static inline u16 eeh_readw(const volatile void __iomem *addr) { volatile u16 *vaddr = (volatile u16 __force *) addr; u16 val = in_le16(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16)) + if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure(addr, val); return val; } @@ -126,7 +112,7 @@ static inline u16 eeh_raw_readw(const volatile void __iomem *addr) { volatile u16 *vaddr = (volatile u16 __force *) addr; u16 val = in_be16(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16)) + if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure(addr, val); return val; } @@ -138,7 +124,7 @@ static inline u32 eeh_readl(const volatile void __iomem *addr) { volatile u32 *vaddr = (volatile u32 __force *) addr; u32 val = in_le32(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32)) + if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure(addr, val); return val; } @@ -149,7 +135,7 @@ static inline u32 eeh_raw_readl(const volatile void __iomem *addr) { volatile u32 *vaddr = (volatile u32 __force *) addr; u32 val = in_be32(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32)) + if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure(addr, val); return val; } @@ -161,7 +147,7 @@ static inline u64 eeh_readq(const volatile void __iomem *addr) { volatile u64 *vaddr = (volatile u64 __force *) addr; u64 val = in_le64(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u64)) + if (EEH_POSSIBLE_ERROR(val, u64)) return eeh_check_failure(addr, val); return val; } @@ -172,7 +158,7 @@ static inline u64 eeh_raw_readq(const volatile void __iomem *addr) { volatile u64 *vaddr = (volatile u64 __force *) addr; u64 val = in_be64(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u64)) 
+ if (EEH_POSSIBLE_ERROR(val, u64)) return eeh_check_failure(addr, val); return val; } @@ -209,7 +195,7 @@ } static inline void eeh_memcpy_fromio(void *dest, const volatile void __iomem *src, unsigned long n) { void *vsrc = (void __force *) src; - void *vsrcsave = vsrc, *destsave = dest; + void *destsave = dest; const volatile void __iomem *srcsave = src; unsigned long nsave = n; @@ -240,8 +226,7 @@ * were copied. Check all four bytes. */ if ((nsave >= 4) && - (EEH_POSSIBLE_ERROR(srcsave, vsrcsave, (*((u32 *) destsave+nsave-4)), - u32))) { + (EEH_POSSIBLE_ERROR((*((u32 *) destsave+nsave-4)), u32))) { eeh_check_failure(srcsave, (*((u32 *) destsave+nsave-4))); } } @@ -281,7 +266,7 @@ if (!_IO_IS_VALID(port)) return ~0; val = in_8((u8 *)(port+pci_io_base)); - if (EEH_POSSIBLE_IO_ERROR(val, u8)) + if (EEH_POSSIBLE_ERROR(val, u8)) return eeh_check_failure((void __iomem *)(port), val); return val; } @@ -296,7 +281,7 @@ if (!_IO_IS_VALID(port)) return ~0; val = in_le16((u16 *)(port+pci_io_base)); - if (EEH_POSSIBLE_IO_ERROR(val, u16)) + if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure((void __iomem *)(port), val); return val; } @@ -311,7 +296,7 @@ if (!_IO_IS_VALID(port)) return ~0; val = in_le32((u32 *)(port+pci_io_base)); - if (EEH_POSSIBLE_IO_ERROR(val, u32)) + if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure((void __iomem *)(port), val); return val; } @@ -324,19 +309,19 @@ /* in-string eeh macros */ static inline void eeh_insb(unsigned long port, void * buf, int ns) { _insb((u8 *)(port+pci_io_base), buf, ns); - if (EEH_POSSIBLE_IO_ERROR((*(((u8*)buf)+ns-1)), u8)) + if (EEH_POSSIBLE_ERROR((*(((u8*)buf)+ns-1)), u8)) eeh_check_failure((void __iomem *)(port), *(u8*)buf); } static inline void eeh_insw_ns(unsigned long port, void * buf, int ns) { _insw_ns((u16 *)(port+pci_io_base), buf, ns); - if (EEH_POSSIBLE_IO_ERROR((*(((u16*)buf)+ns-1)), u16)) + if (EEH_POSSIBLE_ERROR((*(((u16*)buf)+ns-1)), u16)) eeh_check_failure((void __iomem *)(port), *(u16*)buf); 
} static inline void eeh_insl_ns(unsigned long port, void * buf, int nl) { _insl_ns((u32 *)(port+pci_io_base), buf, nl); - if (EEH_POSSIBLE_IO_ERROR((*(((u32*)buf)+nl-1)), u32)) + if (EEH_POSSIBLE_ERROR((*(((u32*)buf)+nl-1)), u32)) eeh_check_failure((void __iomem *)(port), *(u32*)buf); } -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Thu Sep 30 16:27:19 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 30 Sep 2004 16:27:19 +1000 Subject: [PPC64] EEH checks mistakenly became no-ops In-Reply-To: <20040930062048.GA21889@zax> References: <20040930062048.GA21889@zax> Message-ID: <20040930062719.GB21889@zax> On Thu, Sep 30, 2004 at 04:20:48PM +1000, David Gibson wrote: > Andrew, please apply: > > Recent changes which removed the use of IO tokens for EEH enabled > devices had a bug, which mean we now never do EEH checks at all. This > patch corrects the problem. Unfortunately, it does mean we do EEH > checks on pSeries whenever any IO returns all 1s. > > Signed-off-by: David Gibson Bother, forgot to refresh the patch before sending. Here's a version which sucks less. Recent changes which removed the use of IO tokens for EEH enabled devices had a bug, which mean we now never do EEH checks at all. This patch corrects the problem. Signed-off-by: David Gibson Index: working-2.6/include/asm-ppc64/eeh.h =================================================================== --- working-2.6.orig/include/asm-ppc64/eeh.h 2004-09-24 10:14:10.000000000 +1000 +++ working-2.6/include/asm-ppc64/eeh.h 2004-09-30 16:03:17.822907592 +1000 @@ -71,16 +71,10 @@ /* * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. * - * Order this macro for performance. - * If EEH is off for a device and it is a memory BAR, ioremap will - * map it to the IOREGION. 
In this case addr == vaddr and since these - * should be in registers we compare them first. Next we check for - * ff's which indicates a (very) possible failure. - * * If this macro yields TRUE, the caller relays to eeh_check_failure() * which does further tests out of line. */ -#define EEH_POSSIBLE_IO_ERROR(val, type) ((val) == (type)~0) +#define EEH_POSSIBLE_ERROR(val, type) ((val) == (type)~0) /* * Reads from a device which has been isolated by EEH will return @@ -89,21 +83,13 @@ */ #define EEH_IO_ERROR_VALUE(size) (~0U >> ((4 - (size)) * 8)) -/* - * The vaddr will equal the addr if EEH checking is disabled for - * this device. This is because eeh_ioremap() will not have - * remapped to 0xA0, and thus both vaddr and addr will be 0xE0... - */ -#define EEH_POSSIBLE_ERROR(addr, vaddr, val, type) \ - ((vaddr) != (addr) && EEH_POSSIBLE_IO_ERROR(val, type)) - /* * MMIO read/write operations with EEH support. */ static inline u8 eeh_readb(const volatile void __iomem *addr) { volatile u8 *vaddr = (volatile u8 __force *) addr; u8 val = in_8(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u8)) + if (EEH_POSSIBLE_ERROR(val, u8)) return eeh_check_failure(addr, val); return val; } @@ -115,7 +101,7 @@ static inline u16 eeh_readw(const volatile void __iomem *addr) { volatile u16 *vaddr = (volatile u16 __force *) addr; u16 val = in_le16(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16)) + if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure(addr, val); return val; } @@ -126,7 +112,7 @@ static inline u16 eeh_raw_readw(const volatile void __iomem *addr) { volatile u16 *vaddr = (volatile u16 __force *) addr; u16 val = in_be16(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16)) + if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure(addr, val); return val; } @@ -138,7 +124,7 @@ static inline u32 eeh_readl(const volatile void __iomem *addr) { volatile u32 *vaddr = (volatile u32 __force *) addr; u32 val = in_le32(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, 
val, u32)) + if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure(addr, val); return val; } @@ -149,7 +135,7 @@ static inline u32 eeh_raw_readl(const volatile void __iomem *addr) { volatile u32 *vaddr = (volatile u32 __force *) addr; u32 val = in_be32(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32)) + if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure(addr, val); return val; } @@ -161,7 +147,7 @@ static inline u64 eeh_readq(const volatile void __iomem *addr) { volatile u64 *vaddr = (volatile u64 __force *) addr; u64 val = in_le64(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u64)) + if (EEH_POSSIBLE_ERROR(val, u64)) return eeh_check_failure(addr, val); return val; } @@ -172,7 +158,7 @@ static inline u64 eeh_raw_readq(const volatile void __iomem *addr) { volatile u64 *vaddr = (volatile u64 __force *) addr; u64 val = in_be64(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u64)) + if (EEH_POSSIBLE_ERROR(val, u64)) return eeh_check_failure(addr, val); return val; } @@ -209,7 +195,7 @@ } static inline void eeh_memcpy_fromio(void *dest, const volatile void __iomem *src, unsigned long n) { void *vsrc = (void __force *) src; - void *vsrcsave = vsrc, *destsave = dest; + void *destsave = dest; const volatile void __iomem *srcsave = src; unsigned long nsave = n; @@ -240,8 +226,7 @@ * were copied. Check all four bytes. 
*/ if ((nsave >= 4) && - (EEH_POSSIBLE_ERROR(srcsave, vsrcsave, (*((u32 *) destsave+nsave-4)), - u32))) { + (EEH_POSSIBLE_ERROR((*((u32 *) destsave+nsave-4)), u32))) { eeh_check_failure(srcsave, (*((u32 *) destsave+nsave-4))); } } @@ -281,7 +266,7 @@ if (!_IO_IS_VALID(port)) return ~0; val = in_8((u8 *)(port+pci_io_base)); - if (EEH_POSSIBLE_IO_ERROR(val, u8)) + if (EEH_POSSIBLE_ERROR(val, u8)) return eeh_check_failure((void __iomem *)(port), val); return val; } @@ -296,7 +281,7 @@ if (!_IO_IS_VALID(port)) return ~0; val = in_le16((u16 *)(port+pci_io_base)); - if (EEH_POSSIBLE_IO_ERROR(val, u16)) + if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure((void __iomem *)(port), val); return val; } @@ -311,7 +296,7 @@ if (!_IO_IS_VALID(port)) return ~0; val = in_le32((u32 *)(port+pci_io_base)); - if (EEH_POSSIBLE_IO_ERROR(val, u32)) + if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure((void __iomem *)(port), val); return val; } @@ -324,19 +309,19 @@ /* in-string eeh macros */ static inline void eeh_insb(unsigned long port, void * buf, int ns) { _insb((u8 *)(port+pci_io_base), buf, ns); - if (EEH_POSSIBLE_IO_ERROR((*(((u8*)buf)+ns-1)), u8)) + if (EEH_POSSIBLE_ERROR((*(((u8*)buf)+ns-1)), u8)) eeh_check_failure((void __iomem *)(port), *(u8*)buf); } static inline void eeh_insw_ns(unsigned long port, void * buf, int ns) { _insw_ns((u16 *)(port+pci_io_base), buf, ns); - if (EEH_POSSIBLE_IO_ERROR((*(((u16*)buf)+ns-1)), u16)) + if (EEH_POSSIBLE_ERROR((*(((u16*)buf)+ns-1)), u16)) eeh_check_failure((void __iomem *)(port), *(u16*)buf); } static inline void eeh_insl_ns(unsigned long port, void * buf, int nl) { _insl_ns((u32 *)(port+pci_io_base), buf, nl); - if (EEH_POSSIBLE_IO_ERROR((*(((u32*)buf)+nl-1)), u32)) + if (EEH_POSSIBLE_ERROR((*(((u32*)buf)+nl-1)), u32)) eeh_check_failure((void __iomem *)(port), *(u32*)buf); } -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. 
http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Thu Sep 30 16:35:18 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 30 Sep 2004 16:35:18 +1000 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040930055307.GA15291@suse.de> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> <20040930003846.GB25001@zax> <20040930055307.GA15291@suse.de> Message-ID: <20040930063518.GD21889@zax> On Thu, Sep 30, 2004 at 07:53:07AM +0200, Olaf Hering wrote: > On Thu, Sep 30, David Gibson wrote: > > > On Wed, Sep 29, 2004 at 09:47:30PM +0200, Olaf Hering wrote: > > > On Mon, Sep 13, David Gibson wrote: > > > > > > > Andrew, please apply. This patch has been tested both on SLB and > > > > segment table machines. This new approach is far from the final word > > > > in VSID/context allocation, but it's a noticeable improvement on the > > > > old method. > > > > > > This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. > > > Hangs after 'returning from prom_init', wants a power cycle. > > > > Have you isolated the problem to the VSID allocation patch? I think > > there may have been a number of ppc64 changes which went into > > 2.6.9-rc2-bk2. > > Yes, rc2 does not boot on power3 with that patch. Hrm.. current BK works on the 270 here, so it's not all power3s. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From anton at samba.org Thu Sep 30 16:40:37 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 30 Sep 2004 16:40:37 +1000 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040929194730.GA6292@suse.de> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> Message-ID: <20040930064037.GA3167@krispykreme.ozlabs.ibm.com> > This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. 
> Hangs after 'returning from prom_init', wants a power cycle. How much memory do you have? We might be filling up a hpte bucket completely with certain amounts of memory. Anton From olh at suse.de Thu Sep 30 16:49:18 2004 From: olh at suse.de (Olaf Hering) Date: Thu, 30 Sep 2004 08:49:18 +0200 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040930064037.GA3167@krispykreme.ozlabs.ibm.com> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> <20040930064037.GA3167@krispykreme.ozlabs.ibm.com> Message-ID: <20040930064918.GA20357@suse.de> On Thu, Sep 30, Anton Blanchard wrote: > > > This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. > > Hangs after 'returning from prom_init', wants a power cycle. > > How much memory do you have? We might be filling up a hpte bucket > completely with certain amounts of memory. It has 4gig. -- USB is for mice, FireWire is for men! sUse lINUX ag, nüRNBERG From david at gibson.dropbear.id.au Thu Sep 30 17:01:51 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 30 Sep 2004 17:01:51 +1000 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040930064037.GA3167@krispykreme.ozlabs.ibm.com> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> <20040930064037.GA3167@krispykreme.ozlabs.ibm.com> Message-ID: <20040930070151.GG21889@zax> On Thu, Sep 30, 2004 at 04:40:37PM +1000, Anton Blanchard wrote: > > > This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. > > Hangs after 'returning from prom_init', wants a power cycle. > > How much memory do you have? We might be filling up a hpte bucket > completely with certain amounts of memory. Bugger, bugger, bugger bugger. That's it. Just ran 4GB linear mapping with 4k pages through my hash scattering simulator - max bucket load is 2 with the old algo and 16 with the new one. Well, we just found the first case where the new algorithm scatters significantly worse than the old one.
It would be something this dire, wouldn't it. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Thu Sep 30 18:05:10 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 30 Sep 2004 18:05:10 +1000 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040930070151.GG21889@zax> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> <20040930064037.GA3167@krispykreme.ozlabs.ibm.com> <20040930070151.GG21889@zax> Message-ID: <20040930080510.GH21889@zax> On Thu, Sep 30, 2004 at 05:01:51PM +1000, David Gibson wrote: > On Thu, Sep 30, 2004 at 04:40:37PM +1000, Anton Blanchard wrote: > > > > > This patch went into 2.6.9-rc2-bk2, and my p640 does not boot anymore. > > > Hangs after 'returning from prom_init', wants a power cycle. > > > > How much memory do you have? We might be filling up a hpte bucket > > completely with certain amounts of memory. > > Bugger, bugger, bugger bugger. That's it. Just ran 4GB linear > mapping with 4k pages through by hash scattering simulator - max > bucket load is 2 with the old algo and 16 with the new one. Well, we > just found the first case where the new algorithm scatters > significantly worse than the old one. It would be something this > dire, wouldn't it. Ok, after some more investigations, I think I've come to a theoretical understanding of why this is happening. The problem is that VSID_MULTIPLIER is too close to (1<<28), i.e. too close to all-1s. That means the differences between the VSIDs for the 16 segments in the linear mapping are all either in the high bits - where they get truncated by the size of the hash table - or in the low bits, where they get masked by the vpn component of the hash (it cycles through every possible value in the low 16 bits).
We crucially need the bits in the middle to be different - with an order 19 hash table, we only have three significant bits to play with.. Fortunately, I think it's not too hard to fix. Olaf, can you try changing VSID_MULTIPLIER in include/asm-ppc64/mmu.h to 200730139, instead of the current value. According to my hash simulator that should fix the problem for you (and work out to larger amounts of RAM, too). I'll push a patch for this tomorrow - the fact that this has come up suggests to me that I need to think a little deeper about the rationale for picking VSID_MULTIPLIER. NB, I'm assuming this is a pSeries machine we're talking about - just changing VSID_MULTIPLIER is not sufficient on an iSeries machine. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson From olh at suse.de Thu Sep 30 18:20:42 2004 From: olh at suse.de (Olaf Hering) Date: Thu, 30 Sep 2004 10:20:42 +0200 Subject: [PPC64] Improved VSID allocation algorithm In-Reply-To: <20040930080510.GH21889@zax> References: <20040913041119.GA5351@zax> <20040929194730.GA6292@suse.de> <20040930064037.GA3167@krispykreme.ozlabs.ibm.com> <20040930070151.GG21889@zax> <20040930080510.GH21889@zax> Message-ID: <20040930082042.GA27980@suse.de> On Thu, Sep 30, David Gibson wrote: > changing VSID_MULTIPLIER in include/asm-ppc64/mmu.h to 200730139, > instead of the current value. According to my hash simulator that > should fix the problem for you (and work out to larger amounts of RAM, > too). Yes, that number works, tested on rc2 + the vsid patch. -- USB is for mice, FireWire is for men! sUse lINUX ag, nüRNBERG From benh at kernel.crashing.org Thu Sep 30 18:22:53 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 30 Sep 2004 18:22:53 +1000 Subject: Why do we map PCI IO space so late ? Message-ID: <1096532573.32754.13.camel@gaston> Hi John !
I was going through some of the PCI setup code while working on some bringup stuff, and had an issue which was related to the way we do the ioremap'ing of the PCI IO space. So the current scenario is: - early (setup_arch() time basically), we ioremap_explicit the ISA space and that only - later (pcibios_fixup time), we scan all busses and ioremap_explicit their various IO spaces. I have two problems with that at the moment. First is, I'm annoyed that during the actual PCI probing, the IO space is not mapped. That means that any quirk that needs IO accesses to the device will not work. I wonder also in which conditions we might end up instantiating a PCI driver as early as the PCI probing and thus crash. Also, this is all after console_initcalls(), so that leaves a gap of code that runs with PCI IO space not mapped. So far, it ended up being mostly ok because our console uses legacy serial drivers that use the ISA space which happen to be mapped early, but that sounds fragile & bogus to me. (For the short story, I found that while working on a board for which the "isa" node didn't have a "ranges" property, so we failed to early map it, thus the serial driver would crash doing IO cycles). Why can't we do the ioremap_explicit right after setting up the PHBs ? The second thing that annoys me is that it seems we are also doing an ioremap_explicit for each p2p bridge IO space, aren't we ? I don't fully understand the logic here. Aren't those supposed to be fully enclosed by their parent PHB IO space, and thus mapped by those ? Thanks for enlightening me, Ben. From lxie at us.ibm.com Thu Sep 30 18:37:29 2004 From: lxie at us.ibm.com (Linda Xie) Date: Thu, 30 Sep 2004 03:37:29 -0500 Subject: Fw: when to mark reserved low memory pages In-Reply-To: <1096481557.22404.68.camel@localhost> Message-ID: The following is an updated version based on Dave's suggestion. Paul, This patch was tested on a Power 5 partition running redhat 4 (kernel src level 584).
Please send it to mainline if there are no objections. This patch was originally created to fix the problem that was found in PCI Hot Plug testing on redhat 4. The problem was that rtas config-connector call failed because RedHat actively disabled access to /dev/mem (for security/selinux reasons). Disabling access to /dev/mem prevents the RTAS pages from being mmaped. So we have to mark the RTAS pages as PG_reserved to allow mmap to pass. Paul already verified my previous version. He will review this one and push it to mainline, if it is acceptable. -Linda diff -purN linux-2.6.8/arch/ppc64/mm/init.c linux-2.6.8-linda/arch/ppc64/mm/init.c --- linux-2.6.8/arch/ppc64/mm/init.c 1970-05-10 22:51:47.153000776 -0400 +++ linux-2.6.8-linda/arch/ppc64/mm/init.c 1970-05-10 23:36:33.774957608 -0400 @@ -706,7 +706,7 @@ void __init mem_init(void) int nid; #endif pg_data_t *pgdat; - unsigned long i; + unsigned long i, pfn, start_pfn, end_pfn; struct page *page; unsigned long reservedpages = 0, codesize, initsize, datasize, bsssize; @@ -749,8 +749,22 @@ void __init mem_init(void) bsssize >> 10, initsize >> 10); + /* Mark the RTAS pages as PG_reserved */ + start_pfn = rtas_rmo_buf >> PAGE_SHIFT; + end_pfn = (rtas_rmo_buf + RTAS_RMOBUF_MAX) >> PAGE_SHIFT; + for (pfn = start_pfn; pfn < end_pfn; pfn++) { + SetPageReserved(pfn_to_page(pfn)); + } + mem_init_done = 1; + #ifdef CONFIG_PPC_ISERIES iommu_vio_init(); #endif Signed-off-by: Linda Xie haveblue at us.ltcfwd.linux.ibm.com To: Linda Xie/Austin/IBM at IBMUS cc: Benjamin Herrenschmidt , John Rose 09/29/2004 01:12 PM , linuxppc64-dev at ozlabs.org, Michael W Wortman/Austin/IBM at IBMUS Subject: Re: Fw: when to mark reserved low memory pages On Wed, 2004-09-29 at 09:26, Linda Xie wrote: > Here is my final patch which will be submitted to redhat. Might be nicer to do something like this instead of a whole bunch of __va() operations.
page_to_pfn() is slightly lighter-weight than virt_to_page() (doesn't really matter at init-time, though). start_pfn = rtas_rmo_buf >> PAGE_SHIFT; end_pfn = (rtas_rmo_buf + RTAS_RMOBUF_MAX) >> PAGE_SHIFT; for (pfn = start_pfn; pfn < end_pfn; pfn++) SetPageReserved(page_to_pfn(pfn)); -- Dave Hansen haveblue at us.ibm.com From benh at kernel.crashing.org Thu Sep 30 18:50:48 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 30 Sep 2004 18:50:48 +1000 Subject: [RFC][PATCH] Way for platforms to alter built-in serial ports Message-ID: <1096534248.32721.36.camel@gaston> Hi Russell ! I'm back with a finally usable fix for an oooooold problem, which is that the way the serial table is supposed to be hard coded in asm/serial.h makes it really nasty to deal with for platforms like ppc where a given kernel can boot machines with different setups or even no built-in serial port at all. (I currently have the 3 cases on ppc64: some normal ISA serial, some with different port/irq settings, and machines with no ports at all, all in the same kernel image).
The early_serial_setup() thing has never been practical to use, since it basically consists of "shoving" entries into the table. That is a bit ugly and requires knowledge of the table size if we want to remove all entries but N, that sort of thing; but the MAIN issue is that it's fundamentally incompatible with the driver being a module.

What I propose is a way for the arch to provide its own table, along with its size, via a function call. It's optional, based on a #ifdef defined by the arch in its asm/serial.h. The only remaining tricky point is that you used to size your static array of UARTs based on the size of the table. So with my patch, an arch that defines ARCH_HAS_GET_LEGACY_SERIAL_PORTS is expected to provide both the new get_legacy_serial_ports() function and a sensible definition of UART_NR. I hope one day we'll be able to convert 8250 to more dynamic allocation, though.

With this patch, I fix all my problems of properly detecting built-in serial ports _and_ not munging around with non-existent ports on machines with no legacy HW, with a single kernel image on ppc64 (and hopefully on ppc32 too). The ppc64 side of the patch is being tested on various hardware at the moment, and I will submit it if you accept this one.

Regards,
Ben.
Signed-off-by: Benjamin Herrenschmidt

diff -urN linux-2.5/drivers/serial/8250.c linux-maple/drivers/serial/8250.c
--- linux-2.5/drivers/serial/8250.c	2004-09-30 18:31:42.560719874 +1000
+++ linux-maple/drivers/serial/8250.c	2004-09-30 16:27:39.421941590 +1000
@@ -112,11 +112,18 @@
 #define SERIAL_PORT_DFNS
 #endif
 
+#ifndef ARCH_HAS_GET_LEGACY_SERIAL_PORTS
 static struct old_serial_port old_serial_port[] = {
 	SERIAL_PORT_DFNS /* defined in asm/serial.h */
 };
-
+static inline struct old_serial_port *get_legacy_serial_ports(unsigned int *count)
+{
+	*count = ARRAY_SIZE(old_serial_port);
+	return old_serial_port;
+}
 #define UART_NR	(ARRAY_SIZE(old_serial_port) + CONFIG_SERIAL_8250_NR_UARTS)
+#endif /* ARCH_HAS_GET_LEGACY_SERIAL_PORTS */
+
 #ifdef CONFIG_SERIAL_8250_RSA
 
@@ -1839,22 +1846,28 @@
 {
 	struct uart_8250_port *up;
 	static int first = 1;
+	struct old_serial_port *old_ports;
+	int count;
 	int i;
 
 	if (!first)
 		return;
 	first = 0;
 
-	for (i = 0, up = serial8250_ports; i < ARRAY_SIZE(old_serial_port);
+	old_ports = get_legacy_serial_ports(&count);
+	if (old_ports == NULL)
+		return;
+
+	for (i = 0, up = serial8250_ports; i < count;
 	     i++, up++) {
-		up->port.iobase   = old_serial_port[i].port;
-		up->port.irq      = irq_canonicalize(old_serial_port[i].irq);
-		up->port.uartclk  = old_serial_port[i].baud_base * 16;
-		up->port.flags    = old_serial_port[i].flags;
-		up->port.hub6     = old_serial_port[i].hub6;
-		up->port.membase  = old_serial_port[i].iomem_base;
-		up->port.iotype   = old_serial_port[i].io_type;
-		up->port.regshift = old_serial_port[i].iomem_reg_shift;
+		up->port.iobase   = old_ports[i].port;
+		up->port.irq      = irq_canonicalize(old_ports[i].irq);
+		up->port.uartclk  = old_ports[i].baud_base * 16;
+		up->port.flags    = old_ports[i].flags;
+		up->port.hub6     = old_ports[i].hub6;
+		up->port.membase  = old_ports[i].iomem_base;
+		up->port.iotype   = old_ports[i].io_type;
+		up->port.regshift = old_ports[i].iomem_reg_shift;
 		up->port.ops = &serial8250_pops;
 		if (share_irqs)
 			up->port.flags |= UPF_SHARE_IRQ;
diff -urN linux-2.5/drivers/serial/8250.h linux-maple/drivers/serial/8250.h
--- linux-2.5/drivers/serial/8250.h	2004-09-30 18:31:42.561719806 +1000
+++ linux-maple/drivers/serial/8250.h	2004-09-30 15:36:55.867448623 +1000
@@ -21,18 +21,6 @@
 void serial8250_suspend_port(int line);
 void serial8250_resume_port(int line);
 
-struct old_serial_port {
-	unsigned int uart;
-	unsigned int baud_base;
-	unsigned int port;
-	unsigned int irq;
-	unsigned int flags;
-	unsigned char hub6;
-	unsigned char io_type;
-	unsigned char *iomem_base;
-	unsigned short iomem_reg_shift;
-};
-
 /*
  * This replaces serial_uart_config in include/linux/serial.h
  */
diff -urN linux-2.5/include/linux/serial.h linux-maple/include/linux/serial.h
--- linux-2.5/include/linux/serial.h	2004-09-30 18:31:55.867785437 +1000
+++ linux-maple/include/linux/serial.h	2004-09-30 15:36:57.981697919 +1000
@@ -14,6 +14,21 @@
 #include
 
 /*
+ * Definition of a legacy serial port
+ */
+struct old_serial_port {
+	unsigned int uart;
+	unsigned int baud_base;
+	unsigned int port;
+	unsigned int irq;
+	unsigned int flags;
+	unsigned char hub6;
+	unsigned char io_type;
+	unsigned char *iomem_base;
+	unsigned short iomem_reg_shift;
+};
+
+/*
  * Counters of the input lines (CTS, DSR, RI, CD) interrupts
  */

From benh at kernel.crashing.org Thu Sep 30 22:20:53 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 30 Sep 2004 22:20:53 +1000
Subject: reading files in /proc/device-tree
In-Reply-To:
References: <20040929101700.GA2623@in.ibm.com>
Message-ID: <1096546849.3081.2.camel@gaston>

> Just read the file.
>
> segher at victim:~ $ xxd -g4 /proc/device-tree/memory/reg
>
> 00000000 00000000 00000000 80000000  ; 2GB at 0
> 00000001 00000000 00000000 80000000  ; 2GB at 4GB
>
> (comments are mine, data is faked).
>
> There might be more memory nodes; read them all and take
> the union.
>
> Also, the format of the entries is dependent on the
> #address-cells and #size-cells properties.

...
of the parent node :) Read the OF spec for more details.

Ben.