From nathanl at austin.ibm.com Sun Feb 1 11:08:00 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 31 Jan 2004 18:08:00 -0600
Subject: [PATCH] ioremap kmallocs inside spinlocks - review request
In-Reply-To: <1075393500.13228.3.camel@verve>
References: <1075226812.10285.17.camel@verve> <4016B046.2060103@austin.ibm.com> <1075393500.13228.3.camel@verve>
Message-ID: <401C4360.3050702@austin.ibm.com>

John Rose wrote:
> Anybody have any yea or nay thoughts on Nathan's idea and/or my patch?

For what it's worth, here's a patch which converts the spinlock in
imalloc.c to a semaphore. Tested with Anton's spinlock debugging patch.

This should be safe -- in i386 ioremap and friends use the vmalloc
subsystem, which can sleep. And we're already using kmalloc(GFP_KERNEL)
in this code, so it's not going to break anything which wasn't already
broken.

While your patch also seems to solve the problem, converting to the
semaphore seems easier to me. :)

Nathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: imalloc_sem.patch
Type: text/x-patch
Size: 1707 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040131/f1daabd1/attachment.bin

From nathanl at austin.ibm.com Sun Feb 1 11:35:41 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 31 Jan 2004 18:35:41 -0600
Subject: ioremap problems
In-Reply-To: <20040127105533.GL11236@krispykreme>
References: <20040127104640.GK11236@krispykreme> <20040127105533.GL11236@krispykreme>
Message-ID: <401C49DD.1050008@austin.ibm.com>

Hi Anton-

> Heres the spinlock sleep debugging patch. Give it a spin on a 2.6
> kernel (dont forget to enable the config option), I'll bet there are still
> 5 bugs in our drivers left to find.

$ patch -p1 --dry-run < might_sleep_warn.patch
arch/ppc64/Kconfig 1.39: 433 lines
patching file arch/ppc64/Kconfig
Hunk #1 succeeded at 339 with fuzz 2 (offset -71 lines).
...
The patch you posted didn't apply cleanly for me on latest 2.6 from
Ameslab; it inserted the Kconfig option in the middle of the vio stuff
for some reason. So I hand-patched the Kconfig; here's the patch I
generated if anyone wants it.

It would be nice to get this into Ameslab or mainline, it's very useful.
Thanks Anton.

Nathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spinlock_sleep_debug.patch
Type: text/x-patch
Size: 2664 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040131/8e9a3cf9/attachment.bin

From paulus at samba.org Sun Feb 1 12:07:04 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sun, 1 Feb 2004 12:07:04 +1100
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <1075480843.682.188.camel@magik>
References: <1075480843.682.188.camel@magik>
Message-ID: <16412.20792.302473.676404@cargo.ozlabs.ibm.com>

Jake Moilanen writes:

> Here is support for the rtas error-inject call.

Wouldn't it simplify things if you had a read_proc function rather
than having your ppc_rtas_errinjct_read function? As it is you must
use copy_to_user rather than memcpy there.

Paul.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Sun Feb 1 16:02:28 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 16:02:28 +1100
Subject: LPARCFG
Message-ID: <20040201050227.GB22694@krispykreme>

Hi,

Any reason we cant make lparcfg tristate (so we can compile it as a
module?)

Im keeping an eye on our kernel size (its getting huge), any chance
we get to cut it down is worth it.

Anton

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From anton at samba.org Sun Feb 1 16:10:58 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 16:10:58 +1100
Subject: [PATCH] ioremap kmallocs inside spinlocks - review request
In-Reply-To: <401C4360.3050702@austin.ibm.com>
References: <1075226812.10285.17.camel@verve> <4016B046.2060103@austin.ibm.com> <1075393500.13228.3.camel@verve> <401C4360.3050702@austin.ibm.com>
Message-ID: <20040201051058.GC22694@krispykreme>

> For what it's worth, here's a patch which converts the spinlock in
> imalloc.c to a semaphore. Tested with Anton's spinlock debugging patch.
>
> This should be safe -- in i386 ioremap and friends use the vmalloc
> subsystem, which can sleep. And we're already using kmalloc(GFP_KERNEL)
> in this code, so it's not going to break anything which wasn't already
> broken.
>
> While your patch also seems to solve the problem, converting to the
> semaphore seems easier to me. :)

Sorry for not getting back to you. Paul and I had a talk and came to the
same conclusion, ioremap can sleep so its safe to use a semaphore there.

BTW Every time I look at the imalloc code I get that feeling we should
be using the generic get_vm_area :)

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From akpm at osdl.org Sun Feb 1 17:30:42 2004
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 31 Jan 2004 22:30:42 -0800
Subject: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <4001BC69.8040300@cyberone.com.au>
References: <4001BAF8.9090203@cyberone.com.au> <4001BC69.8040300@cyberone.com.au>
Message-ID: <20040131223042.782ad9e2.akpm@osdl.org>

Nick Piggin wrote:
>
> This is the core sched domains patch.

I'm having a ton of trouble with this on the 4-way ppc64 box. Symptoms are
similar to memory corruption: gcc falls over with sig11, filenames
corrupted, etc. Often it fails to get through the initscripts without
userspace processes failing randomly.
One example:

cc -Wall -Wall -I../include -c -o search_path.o search_path.c
make[1]: B: Command not found
make[1]: *** [search_path.o] Error 127
make[1]: Leaving directory `/mnt/sdb5/ltp-full-20040108/lib'
make: *** [libltp.a] Error 2

and

cc -O -Wall -w -o test ./test.c
cc -c -Wall -w -o test_arch.o ./test.c
cc -Wall -w -o test_D ./test.c
make[4]: *** [test] Segmentation fault
make[4]: *** Deleting file `test'

I'm surprised that nobody else has noticed it.

The results of a binary search through my current patch queue:

local_bh_enable-warning-fix.patch                 OK
pnp-8250_pnp-fix.patch
pnp-resource-flags-reorganisation.patch
pnp-BIOS-workaround.patch
pnp-avoid-static-allocations.patch
pnp-move-ID-declarations.patch
pnp-file2alias-update.patch
pnp-update-matching-code.patch
pnp-additional-sysfs-info.patch
pnp-config-cleanup.patch                          OK
sched-find_busiest_node-resolution-fix.patch      OK
sched-domains.patch                               BAD
sched-clock-fixes.patch                           BAD
sched-build-fix.patch                             BAD
sched-sibling-map-to-cpumask.patch
p4-clockmod-sibling-map-fix.patch
p4-clockmod-more-than-two-siblings.patch
sched-domains-i386-ht.patch
sched-find_busiest_group-fix.patch                BAD
sched-domain-tweak.patch
sched-no-drop-balance.patch
sched-arch_init_sched_domains-fix.patch
sched-find_busiest_group-clarification.patch
sched-remove-noisy-printks.patch
sched-directed-migration.patch
sched-domain-debugging.patch
acpi-numa-printk-level-fixes.patch                BAD

points the finger at the core sched-domains patch.

But Anton says that he's using your scheduler patches without problems.

mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)

I tried a kernel compiled with -O1 and it failed in the same way, which
somewhat rules out a compiler bug.

Anyone have any suggestions as to a next step?

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From mbligh at aracnet.com Sun Feb 1 17:52:33 2004
From: mbligh at aracnet.com (Martin J. Bligh)
Date: Sat, 31 Jan 2004 22:52:33 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <20040131223042.782ad9e2.akpm@osdl.org>
References: <4001BAF8.9090203@cyberone.com.au><4001BC69.8040300@cyberone.com.au> <20040131223042.782ad9e2.akpm@osdl.org>
Message-ID: <46800000.1075618352@[10.10.2.4]>

> But Anton says that he's using your scheduler patches without problems.
>
> mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
> powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)
>
> I tried a kernel compiled with -O1 and it failed in the same way, which
> somewhat rules out a compiler bug.
>
> Anyone have any suggestions as to a next step?

Where'd your PPC toolchain come from? Anton tends to roll his own, IIRC.
What happens if Anton gives you a binary kernel, and you run that?

M.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From akpm at osdl.org Sun Feb 1 17:56:44 2004
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 31 Jan 2004 22:56:44 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <46800000.1075618352@[10.10.2.4]>
References: <4001BAF8.9090203@cyberone.com.au> <4001BC69.8040300@cyberone.com.au> <20040131223042.782ad9e2.akpm@osdl.org> <46800000.1075618352@[10.10.2.4]>
Message-ID: <20040131225644.3b160f38.akpm@osdl.org>

"Martin J. Bligh" wrote:
>
> > But Anton says that he's using your scheduler patches without problems.
> >
> > mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
> > powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)
> >
> > I tried a kernel compiled with -O1 and it failed in the same way, which
> > somewhat rules out a compiler bug.
> >
> > Anyone have any suggestions as to a next step?
>
> Where'd your PPC toolchain come from? Anton tends to roll his own, IIRC.

I begat it ages ago and I hoard it jealously, because its birth was so
painful.

> What happens if Anton gives you a binary kernel, and you run that?
Dunno, I'll send him a .config.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Sun Feb 1 17:57:23 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 17:57:23 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <20040131223042.782ad9e2.akpm@osdl.org>
References: <4001BAF8.9090203@cyberone.com.au> <4001BC69.8040300@cyberone.com.au> <20040131223042.782ad9e2.akpm@osdl.org>
Message-ID: <20040201065723.GD22694@krispykreme>

> I'm having a ton of trouble with this on the 4-way ppc64 box. Symptoms are
> similar to memory corruption: gcc falls over with sig11, filenames
> corrupted, etc. Often it fails to get through the initscripts without
> userspace processes failing randomly.

...

> points the finger at the core sched-domains patch.
>
> But Anton says that he's using your scheduler patches without problems.

Do you have the SMT scheduler enabled? Maybe its a bug with SMT off or
maybe my box wasnt stable and I didnt notice :)

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Sun Feb 1 18:36:27 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 18:36:27 +1100
Subject: RTAS error logging of EEH errors
Message-ID: <20040201073627.GE22694@krispykreme>

Hi,

I was reviewing patches to send onto Andrew Morton and noticed:

	if (ret == 0 && rets[1] == 1 && rets[0] >= 2) {
+		unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX];
+		unsigned long slot_err_ret;

Thats 2kB on the stack, it really wants to be kmalloced (GFP_ATOMIC
since it can be called in interrupt context) or statically defined.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From ricklind at us.ibm.com Sun Feb 1 20:02:40 2004
From: ricklind at us.ibm.com (Rick Lindsley)
Date: Sun, 01 Feb 2004 01:02:40 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: Your message of "Sun, 01 Feb 2004 17:57:23 +1100."
	<20040201065723.GD22694@krispykreme>
Message-ID: <200402010902.i1192mM06956@owlet.beaverton.ibm.com>

    Do you have the SMT scheduler enabled? Maybe its a bug with SMT off or
    maybe my box wasnt stable and I didnt notice :)

Good point -- I've been running standard regression tests against it
on x86 and seen nothing like this. But I haven't enabled SMT. I just
finished going through the code with Martin and another engineer last
week and am summarizing some thoughts, but there was no major fallovers
like this.

Rick

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From piggin at cyberone.com.au Sun Feb 1 21:24:40 2004
From: piggin at cyberone.com.au (Nick Piggin)
Date: Sun, 01 Feb 2004 21:24:40 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <200402010902.i1192mM06956@owlet.beaverton.ibm.com>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com>
Message-ID: <401CD3E8.7050208@cyberone.com.au>

Rick Lindsley wrote:

>    Do you have the SMT scheduler enabled? Maybe its a bug with SMT off or
>    maybe my box wasnt stable and I didnt notice :)
>
>Good point -- I've been running standard regression tests against it
>on x86 and seen nothing like this. But I haven't enabled SMT. I just
>finished going through the code with Martin and another engineer last
>week and am summarizing some thoughts, but there was no major fallovers
>like this.
>
>

SMT should be i386 only at this stage so it rules that out.

x86 is pretty stable, that I am sure of. I have run it on quite
a lot of boxes, SMP P4, NUMAQ, SMP P3s, UP, etc plus countless
people testing -mm.

It wouldn't surprise me if there is a problem with another
architecture... but then again as you say Anton is running it
OK so I dunno. Maybe it is a compiler bug?

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From piggin at cyberone.com.au Sun Feb 1 21:26:21 2004
From: piggin at cyberone.com.au (Nick Piggin)
Date: Sun, 01 Feb 2004 21:26:21 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <401CD3E8.7050208@cyberone.com.au>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com> <401CD3E8.7050208@cyberone.com.au>
Message-ID: <401CD44D.8040203@cyberone.com.au>

Nick Piggin wrote:

>
>
> Rick Lindsley wrote:
>
>>    Do you have the SMT scheduler enabled? Maybe its a bug with SMT
>> off or
>>    maybe my box wasnt stable and I didnt notice :)
>>
>> Good point -- I've been running standard regression tests against it
>> on x86 and seen nothing like this. But I haven't enabled SMT. I just
>> finished going through the code with Martin and another engineer last
>> week and am summarizing some thoughts, but there was no major fallovers
>> like this.
>>
>>
>
> SMT should be i386 only at this stage so it rules that out.
>
> x86 is pretty stable, that I am sure of. I have run it on quite
> a lot of boxes, SMP P4, NUMAQ, SMP P3s, UP, etc plus countless
> people testing -mm.
>
> It wouldn't surprise me if there is a problem with another
> architecture... but then again as you say Anton is running it
> OK so I dunno. Maybe it is a compiler bug?
>
>

Andrew, Anton, are you using CONFIG_PREEMPT?

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From akpm at osdl.org Sun Feb 1 21:34:08 2004
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 1 Feb 2004 02:34:08 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <401CD44D.8040203@cyberone.com.au>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com> <401CD3E8.7050208@cyberone.com.au> <401CD44D.8040203@cyberone.com.au>
Message-ID: <20040201023408.262205d6.akpm@osdl.org>

Nick Piggin wrote:
>
> Andrew, Anton, are you using CONFIG_PREEMPT?

nope. And we're not sure that Anton has tested the patch much yet.
It could be that the bug is happening for him too.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Sun Feb 1 21:51:24 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 21:51:24 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <401CD3E8.7050208@cyberone.com.au>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com> <401CD3E8.7050208@cyberone.com.au>
Message-ID: <20040201105124.GG22694@krispykreme>

> SMT should be i386 only at this stage so it rules that out.

It turns out my testing was done with SMT scheduler on but with a busted
topology. I just fixed it (cpumask_snprintf is broken on big endian, the
output had me confused) and its blowing up pretty soon in the slab cache.

Give me a bit, I'll narrow it down. I'll also do a boot with SMT
scheduelr off and verify its OK.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From boutcher at us.ibm.com Mon Feb 2 04:43:16 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Sun, 1 Feb 2004 11:43:16 -0600
Subject: LPARCFG
In-Reply-To: <20040201050227.GB22694@krispykreme>
Message-ID:

owner-linuxppc64-dev at lists.linuxppc.org wrote on 01/31/2004 11:02:28 PM:
> Any reason we cant make lparcfg tristate (so we can compile it as a
> module?)
>
> Im keeping an eye on our kernel size (its getting huge), any chance
> we get to cut it down is worth it.

I think that's a great idea. Especially since it has that stupid e2a
routine :-)

There are a couple of pieces of IBM software that want to find
/proc/lparcfg, but as long as they know they need to document the
dependency on the module, we should be fine.

Dave Boutcher
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From nathanl at austin.ibm.com Mon Feb 2 06:53:02 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sun, 01 Feb 2004 13:53:02 -0600
Subject: LPARCFG
In-Reply-To: <20040201050227.GB22694@krispykreme>
References: <20040201050227.GB22694@krispykreme>
Message-ID: <401D591E.9030503@austin.ibm.com>

Anton Blanchard wrote:
> Hi,
>
> Any reason we cant make lparcfg tristate (so we can compile it as a
> module?)

Building as a module was broken when the code was checked in to ameslab
2.6; I suggested turning it off. I think lparcfg uses unexported
symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out. Should any of
those be exported?

See this thread for the history:
http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html

Nathan

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Mon Feb 2 10:50:02 2004
From: anton at samba.org (Anton Blanchard)
Date: Mon, 2 Feb 2004 10:50:02 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <20040201105124.GG22694@krispykreme>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com> <401CD3E8.7050208@cyberone.com.au> <20040201105124.GG22694@krispykreme>
Message-ID: <20040201235001.GJ22694@krispykreme>

> Give me a bit, I'll narrow it down. I'll also do a boot with SMT
> scheduelr off and verify its OK.

Im seeing very early slab corruption. Back out sched-* and it boots OK.

sym.0014:02:01.0:0:0: tagged command queuing enabled, command queue depth 16.
sym.0014:02:01.0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
scsi: On host 12 channel 0 id 0 only 128 (max_scsi_report_luns) of 189483851 luns reported, try increasing max_scsi_report_luns.
scsi: host 12 channel 0 id 0 lun 0x5a5a5a5a5a5a5a5a has a LUN larger than currently supported.
scsi: host 12 channel 0 id 0 lun 0x5a5a5a5a5a5a5a5a has a LUN larger than currently supported.

Anton

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From david at gibson.dropbear.id.au Mon Feb 2 17:48:41 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 2 Feb 2004 17:48:41 +1100
Subject: [RFC] implicit hugetlb pages (mmu_context_to_struct)
In-Reply-To: <1073683778.1298.115.camel@agtpad>
References: <1073683188.1297.105.camel@agtpad> <1073683778.1298.115.camel@agtpad>
Message-ID: <20040202064841.GA31383@zax>

On Fri, Jan 09, 2004 at 01:29:38PM -0800, Adam Litke wrote:
>
> mmu_context_to_struct (2.6.0):
> This patch converts the mmu_context variable to a structure. It is
> needed for the dynamic address space resizing patch.

Ok, I've made a revised version of this patch. As well as handling
the various changes in ameslab since Adam's version, it uses a better
name for the mm_context fields, separates out the low_hpages flag, and
avoids putting extraneous info in the mmu_context_queue.

I'm slightly concerned about whether it has zero-impact in the
!CONFIG_HUGETLB_PAGE case. I think theoretically it could/should, but
there's a couple of cases where I don't know if gcc will be clever
enough to optimise everything away.

Anton, if you approve I can push this to ameslab and/or akpm.

mmu-context-struct (2.6.0):
This patch converts the mmu_context variable to a structure. It is
needed for the dynamic HTLB address space resizing patch.

Index: working-2.6/arch/ppc64/kernel/stab.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/stab.c	2004-02-02 17:04:37.000000000 +1100
+++ working-2.6/arch/ppc64/kernel/stab.c	2004-02-02 17:36:30.961032128 +1100
@@ -175,13 +175,13 @@
 	/* Kernel or user address? */
 	if (REGION_ID(ea) >= KERNEL_REGION_ID) {
 		vsid = get_kernel_vsid(ea);
-		context = REGION_ID(ea);
+		context = KERNEL_CONTEXT(ea);
 	} else {
 		if (!current->mm)
 			return 1;
 
 		context = current->mm->context;
-		vsid = get_vsid(context, ea);
+		vsid = get_vsid(context.id, ea);
 	}
 
 	esid = GET_ESID(ea);
@@ -214,7 +214,7 @@
 
 	if (!IS_VALID_EA(pc) || (REGION_ID(pc) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, pc);
+	vsid = get_vsid(mm->context.id, pc);
 	__ste_allocate(pc_esid, vsid);
 
 	if (pc_esid == stack_esid)
@@ -222,7 +222,7 @@
 
 	if (!IS_VALID_EA(stack) || (REGION_ID(stack) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, stack);
+	vsid = get_vsid(mm->context.id, stack);
 	__ste_allocate(stack_esid, vsid);
 
 	if (pc_esid == unmapped_base_esid || stack_esid == unmapped_base_esid)
@@ -231,7 +231,7 @@
 	if (!IS_VALID_EA(unmapped_base) ||
 	    (REGION_ID(unmapped_base) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, unmapped_base);
+	vsid = get_vsid(mm->context.id, unmapped_base);
 	__ste_allocate(unmapped_base_esid, vsid);
 
 	/* Order update */
@@ -396,14 +396,14 @@
 
 	/* Kernel or user address? */
 	if (REGION_ID(ea) >= KERNEL_REGION_ID) {
-		context = REGION_ID(ea);
+		context = KERNEL_CONTEXT(ea);
 		vsid = get_kernel_vsid(ea);
 	} else {
 		if (unlikely(!current->mm))
 			return 1;
 
 		context = current->mm->context;
-		vsid = get_vsid(context, ea);
+		vsid = get_vsid(context.id, ea);
 	}
 
 	esid = GET_ESID(ea);
@@ -434,7 +434,7 @@
 
 	if (!IS_VALID_EA(pc) || (REGION_ID(pc) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, pc);
+	vsid = get_vsid(mm->context.id, pc);
 	__slb_allocate(pc_esid, vsid, mm->context);
 
 	if (pc_esid == stack_esid)
@@ -442,7 +442,7 @@
 
 	if (!IS_VALID_EA(stack) || (REGION_ID(stack) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, stack);
+	vsid = get_vsid(mm->context.id, stack);
 	__slb_allocate(stack_esid, vsid, mm->context);
 
 	if (pc_esid == unmapped_base_esid || stack_esid == unmapped_base_esid)
@@ -451,7 +451,7 @@
 	if (!IS_VALID_EA(unmapped_base) ||
 	    (REGION_ID(unmapped_base) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, unmapped_base);
+	vsid = get_vsid(mm->context.id, unmapped_base);
 	__slb_allocate(unmapped_base_esid, vsid, mm->context);
 }
 
Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c	2004-02-02 10:44:47.000000000 +1100
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c	2004-02-02 17:36:30.964031672 +1100
@@ -244,7 +244,7 @@
 	struct vm_area_struct *vma;
 	unsigned long addr;
 
-	if (mm->context & CONTEXT_LOW_HPAGES)
+	if (mm->context.low_hpages)
 		return 0; /* The window is already open */
 
 	/* Check no VMAs are in the region */
@@ -281,7 +281,7 @@
 
 	/* FIXME: do we need to scan for PTEs too? */
 
-	mm->context |= CONTEXT_LOW_HPAGES;
+	mm->context.low_hpages = 1;
 
 	/* the context change must make it to memory before the slbia,
 	 * so that further SLB misses do the right thing. */
@@ -589,7 +589,6 @@
 	}
 }
 
-
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 					unsigned long len, unsigned long pgoff,
 					unsigned long flags)
@@ -778,7 +777,7 @@
 	BUG_ON(hugepte_bad(pte));
 	BUG_ON(!in_hugepage_area(context, ea));
 
-	vsid = get_vsid(context, ea);
+	vsid = get_vsid(context.id, ea);
 
 	va = (vsid << 28) | (ea & 0x0fffffff);
 	vpn = va >> LARGE_PAGE_SHIFT;
Index: working-2.6/arch/ppc64/mm/init.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/init.c	2004-01-22 10:53:08.000000000 +1100
+++ working-2.6/arch/ppc64/mm/init.c	2004-02-02 17:36:30.967031216 +1100
@@ -502,7 +502,7 @@
 		break;
 	case USER_REGION_ID:
 		pgd = pgd_offset( vma->vm_mm, vmaddr );
-		context = vma->vm_mm->context;
+		context = vma->vm_mm->context.id;
 
 		/* XXX are there races with checking cpu_vm_mask? - Anton */
 		tmp = cpumask_of_cpu(smp_processor_id());
@@ -554,7 +554,7 @@
 		break;
 	case USER_REGION_ID:
 		pgd = pgd_offset(mm, start);
-		context = mm->context;
+		context = mm->context.id;
 
 		/* XXX are there races with checking cpu_vm_mask? - Anton */
 		tmp = cpumask_of_cpu(smp_processor_id());
@@ -943,7 +943,7 @@
 	if (!ptep)
 		return;
 
-	vsid = get_vsid(vma->vm_mm->context, ea);
+	vsid = get_vsid(vma->vm_mm->context.id, ea);
 
 	tmp = cpumask_of_cpu(smp_processor_id());
 	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
Index: working-2.6/arch/ppc64/mm/hash_utils.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hash_utils.c	2004-01-22 10:53:08.000000000 +1100
+++ working-2.6/arch/ppc64/mm/hash_utils.c	2004-02-02 15:16:34.000000000 +1100
@@ -237,7 +237,7 @@
 		if (mm == NULL)
 			return 1;
 
-		vsid = get_vsid(mm->context, ea);
+		vsid = get_vsid(mm->context.id, ea);
 		break;
 	case IO_REGION_ID:
 		mm = &ioremap_mm;
Index: working-2.6/arch/ppc64/xmon/xmon.c
===================================================================
--- working-2.6.orig/arch/ppc64/xmon/xmon.c	2004-01-22 10:53:08.000000000 +1100
+++ working-2.6/arch/ppc64/xmon/xmon.c	2004-02-02 17:36:30.972030456 +1100
@@ -1952,7 +1952,7 @@
 	// if in user range, use the current task's page directory
 	else if ( ( ea >= USER_START ) && ( ea <= USER_END ) ) {
 		mm = current->mm;
-		vsid = get_vsid(mm->context, ea );
+		vsid = get_vsid(mm->context.id, ea );
 	}
 	pgdir = mm->pgd;
 	va = ( vsid << 28 ) | ( ea & 0x0fffffff );
Index: working-2.6/include/asm-ppc64/mmu.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu.h	2004-02-02 10:44:49.000000000 +1100
+++ working-2.6/include/asm-ppc64/mmu.h	2004-02-02 17:36:30.975030000 +1100
@@ -18,15 +18,25 @@
 
 #ifndef __ASSEMBLY__
 
-/* Default "unsigned long" context */
-typedef unsigned long mm_context_t;
+/* Time to allow for more things here */
+typedef unsigned long mm_context_id_t;
+typedef struct {
+	mm_context_id_t id;
+#ifdef CONFIG_HUGETLB_PAGE
+	int low_hpages;
+#endif
+} mm_context_t;
 
 #ifdef CONFIG_HUGETLB_PAGE
-#define CONTEXT_LOW_HPAGES	(1UL<<63)
+#define KERNEL_LOW_HPAGES	.low_hpages = 0,
 #else
-#define CONTEXT_LOW_HPAGES	0
+#define KERNEL_LOW_HPAGES
 #endif
 
+#define KERNEL_CONTEXT(ea) ({ \
+		mm_context_t ctx = { .id = REGION_ID(ea), KERNEL_LOW_HPAGES}; \
+		ctx; })
+
 /*
  * Hardware Segment Lookaside Buffer Entry
  * This structure has been padded out to two 64b doublewords (actual SLBE's are
Index: working-2.6/include/asm-ppc64/mmu_context.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu_context.h	2004-02-02 10:44:49.000000000 +1100
+++ working-2.6/include/asm-ppc64/mmu_context.h	2004-02-02 17:36:30.977029696 +1100
@@ -52,7 +52,7 @@
 	long head;
 	long tail;
 	long size;
-	mm_context_t elements[LAST_USER_CONTEXT];
+	mm_context_id_t elements[LAST_USER_CONTEXT];
 };
 
 extern struct mmu_context_queue_t mmu_context_queue;
@@ -83,7 +83,6 @@
 	long head;
 	unsigned long flags;
 	/* This does the right thing across a fork (I hope) */
-	unsigned long low_hpages = mm->context & CONTEXT_LOW_HPAGES;
 
 	spin_lock_irqsave(&mmu_context_queue.lock, flags);
 
@@ -93,8 +92,7 @@
 	}
 
 	head = mmu_context_queue.head;
-	mm->context = mmu_context_queue.elements[head];
-	mm->context |= low_hpages;
+	mm->context.id = mmu_context_queue.elements[head];
 
 	head = (head < LAST_USER_CONTEXT-1) ? head+1 : 0;
 	mmu_context_queue.head = head;
@@ -132,8 +130,7 @@
 #endif
 
 	mmu_context_queue.size++;
-	mmu_context_queue.elements[index] =
-		mm->context & ~CONTEXT_LOW_HPAGES;
+	mmu_context_queue.elements[index] = mm->context.id;
 
 	spin_unlock_irqrestore(&mmu_context_queue.lock, flags);
 }
@@ -210,8 +207,6 @@
 {
 	unsigned long ordinal, vsid;
 
-	context &= ~CONTEXT_LOW_HPAGES;
-
 	ordinal = (((ea >> 28) & 0x1fffff) * LAST_USER_CONTEXT) | context;
 	vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK;
 
Index: working-2.6/include/asm-ppc64/page.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/page.h	2004-01-09 11:11:12.000000000 +1100
+++ working-2.6/include/asm-ppc64/page.h	2004-02-02 17:36:30.979029392 +1100
@@ -32,6 +32,7 @@
 /* For 64-bit processes the hugepage range is 1T-1.5T */
 #define TASK_HPAGE_BASE 	(0x0000010000000000UL)
 #define TASK_HPAGE_END 	(0x0000018000000000UL)
+
 /* For 32-bit processes the hugepage range is 2-3G */
 #define TASK_HPAGE_BASE_32	(0x80000000UL)
 #define TASK_HPAGE_END_32	(0xc0000000UL)
@@ -39,7 +40,7 @@
 #define ARCH_HAS_HUGEPAGE_ONLY_RANGE
 #define is_hugepage_only_range(addr, len) \
 	( ((addr > (TASK_HPAGE_BASE-len)) && (addr < TASK_HPAGE_END)) || \
-	  ((current->mm->context & CONTEXT_LOW_HPAGES) && \
+	  (current->mm->context.low_hpages && \
 	   (addr > (TASK_HPAGE_BASE_32-len)) && (addr < TASK_HPAGE_END_32)) )
 #define hugetlb_free_pgtables free_pgtables
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
@@ -47,7 +48,7 @@
 #define in_hugepage_area(context, addr) \
 	((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \
 	 ((((addr) >= TASK_HPAGE_BASE) && ((addr) < TASK_HPAGE_END)) || \
-	  (((context) & CONTEXT_LOW_HPAGES) && \
+	  ((context).low_hpages && \
 	   (((addr) >= TASK_HPAGE_BASE_32) && ((addr) < TASK_HPAGE_END_32)))))
 
 #else /* !CONFIG_HUGETLB_PAGE */
Index: working-2.6/include/asm-ppc64/tlb.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/tlb.h	2004-01-22 10:53:21.000000000 +1100
+++ working-2.6/include/asm-ppc64/tlb.h	2004-02-02 17:36:30.980029240 +1100
@@ -65,7 +65,7 @@
 			if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask))
 				local = 1;
 
-			flush_hash_range(tlb->mm->context, i, local);
+			flush_hash_range(tlb->mm->context.id, i, local);
 			i = 0;
 		}
 	}
@@ -86,7 +86,7 @@
 	if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask))
 		local = 1;
 
-	flush_hash_range(tlb->mm->context, batch->index, local);
+	flush_hash_range(tlb->mm->context.id, batch->index, local);
 	batch->index = 0;
 
 	pte_free_finish();

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From nfont at austin.ibm.com Tue Feb 3 03:01:44 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Mon, 02 Feb 2004 10:01:44 -0600
Subject: RTAS error logging of EEH errors
In-Reply-To: <20040201073627.GE22694@krispykreme>
References: <20040201073627.GE22694@krispykreme>
Message-ID: <1075737704.8079.27.camel@mudbug.austin.ibm.com>

I ended up leaving this on the stack because we call panic() right after
using the buffer. The patch below changes this so that the buffer is
allocated. If this is preferred let me know and I will push the change
to Ameslab.

I don't think this should be a statically defined buffer, no sense in
keeping around a buffer this big that will only be used once right
before we die.

-Nathan Fontenot

On Sun, 2004-02-01 at 01:36, Anton Blanchard wrote:
> Hi,
>
> I was reviewing patches to send onto Andrew Morton and noticed:
>
> 	if (ret == 0 && rets[1] == 1 && rets[0] >= 2) {
> +		unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX];
> +		unsigned long slot_err_ret;
>
> Thats 2kB on the stack, it really wants to be kmalloced (GFP_ATOMIC
> since it can be called in interrupt context) or statically defined.
>
> Anton
>
--
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eeh_update.patch
Type: text/x-patch
Size: 1470 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040202/e964cd4a/attachment.bin

From johnrose at austin.ibm.com Tue Feb 3 04:14:39 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 02 Feb 2004 11:14:39 -0600
Subject: [PATCH] ioremap kmallocs inside spinlocks - review request
In-Reply-To: <20040201051058.GC22694@krispykreme>
References: <1075226812.10285.17.camel@verve> <4016B046.2060103@austin.ibm.com> <1075393500.13228.3.camel@verve> <401C4360.3050702@austin.ibm.com> <20040201051058.GC22694@krispykreme>
Message-ID: <1075742079.26785.16.camel@verve>

I like Nathan's patch quite a bit more than what I came up with :) I
was looking at the problem from the wrong angle I guess.

Thanks-
John

> Sorry for not getting back to you. Paul and I had a talk and came to the
> same conclusion, ioremap can sleep so its safe to use a semaphore there.
>
> BTW Every time I look at the imalloc code I get that feeling we should
> be using the generic get_vm_area :)
>
> Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From olof at austin.ibm.com Tue Feb 3 10:40:01 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 02 Feb 2004 17:40:01 -0600
Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
Message-ID: <401EDFD1.3010203@austin.ibm.com>

Hello all,

I've spent some time cleaning up the PCI DMA mapping code in 2.6. The
patch is large (117KB, 4000 lines), so I won't post it here.
It can be found at:

http://www.localnet.sh/patches/tce-rewrite-feb02.patch

Summary:

* Renamed include/asm-ppc64/pci_dma.h -> iommu.h
* Removed include/asm-ppc64/iSeries/iSeries_dma.h
* Broke the contents of old arch/ppc64/kernel/pci_dma.c into:
    arch/ppc64/kernel/pci_iommu.c
    arch/ppc64/kernel/pSeries_iommu.c
    arch/ppc64/kernel/iSeries_iommu.c

...plus a large number of renames of functions, structures and structure
members.

I've cleaned out the PPCDBG code from old pci_dma.c, since it was very
rarely useful due to the sheer volume of output and it cluttered the
code. Tracing can be added back in a limited fashion as the need comes
up in the future.

I also replaced the old buddy-style allocator for TCE ranges with a
simpler bitmap allocator. Time and benchmarking will tell if it's
efficient enough, but it's fairly well abstracted and can easily be
replaced.

Please take some time and look at the patch; the more eyes on it (and
comments/questions) the better.

Patch is against ameslab as of today. I will also add some statistics
gathering similar to Linas Vepstas' patch for the old TCE implementation,
but that'll be handled separately.
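
For readers who have not opened the patch, the general idea of a bitmap
range allocator can be sketched roughly as below. This is not the code
from the patch: the names (range_map, range_alloc, range_free), the table
size and the first-fit policy are all assumptions made for illustration,
and the sketch is plain userspace C with no locking.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical table size; a real TCE table would be far larger. */
#define TBL_ENTRIES   64
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS     ((TBL_ENTRIES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per table entry: 1 = in use, 0 = free. */
struct range_map {
	unsigned long bits[MAP_WORDS];
};

static int map_test(const struct range_map *m, size_t i)
{
	return (m->bits[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

static void map_set(struct range_map *m, size_t i)
{
	m->bits[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

static void map_clear(struct range_map *m, size_t i)
{
	m->bits[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}

/*
 * First-fit search for npages consecutive free entries.
 * Returns the index of the first entry, or -1 if no run is found.
 */
long range_alloc(struct range_map *m, size_t npages)
{
	size_t run = 0;

	if (npages == 0 || npages > TBL_ENTRIES)
		return -1;

	for (size_t i = 0; i < TBL_ENTRIES; i++) {
		if (map_test(m, i)) {
			run = 0;	/* run broken by a used entry */
			continue;
		}
		if (++run == npages) {
			size_t first = i + 1 - npages;

			for (size_t j = first; j <= i; j++)
				map_set(m, j);
			return (long)first;
		}
	}
	return -1;
}

/* Release npages entries starting at first. */
void range_free(struct range_map *m, size_t first, size_t npages)
{
	for (size_t j = first; j < first + npages && j < TBL_ENTRIES; j++)
		map_clear(m, j);
}
```

Compared with a buddy scheme, a bitmap like this has no power-of-two
rounding and no split/merge bookkeeping; the cost is a linear scan on
allocation, which is the part that benchmarking would have to judge.
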

 arch/ppc64/kernel/pci_dma.c               | 1477 ----------------------
 b/arch/ppc64/kdb/kdbasupport.c            |    2
 b/arch/ppc64/kernel/Makefile              |    6
 b/arch/ppc64/kernel/chrp_setup.c          |    2
 b/arch/ppc64/kernel/iSeries_iommu.c       |  242 ++++
 b/arch/ppc64/kernel/iSeries_pci.c         |    6
 b/arch/ppc64/kernel/mf.c                  |   18
 b/arch/ppc64/kernel/pSeries_iommu.c       |  294 +++++
 b/arch/ppc64/kernel/pSeries_lpar.c        |   86 -
 b/arch/ppc64/kernel/pSeries_pci.c         |    4
 b/arch/ppc64/kernel/pci.c                 |    2
 b/arch/ppc64/kernel/pci_dn.c              |    2
 b/arch/ppc64/kernel/pci_iommu.c           |  431 ++++++++
 b/arch/ppc64/kernel/ppc_ksyms.c           |   12
 b/arch/ppc64/kernel/prom.c                |   14
 b/arch/ppc64/kernel/vio.c                 |  138 +-
 b/arch/ppc64/kernel/viopath.c             |    6
 b/arch/ppc64/mm/init.c                    |    6
 b/drivers/block/viodasd.c                 |    6
 b/drivers/cdrom/viocd.c                   |   12
 b/drivers/net/ibmveth.c                   |    2
 b/drivers/net/iseries_veth.c              |    8
 b/drivers/scsi/iSeries_vscsi.c            |    4
 b/drivers/scsi/rpa_vscsi.c                |    2
 b/include/asm-ppc64/iSeries/iSeries_pci.h |    2
 b/include/asm-ppc64/iommu.h               |  158 +++
 b/include/asm-ppc64/machdep.h             |   12
 b/include/asm-ppc64/prom.h                |    4
 b/include/asm-ppc64/vio.h                 |    6
 include/asm-ppc64/iSeries/iSeries_dma.h   |   95 -
 include/asm-ppc64/pci_dma.h               |  100 --
 31 files changed, 1293 insertions(+), 1866 deletions(-)

Thanks,

-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From johnrose at austin.ibm.com Tue Feb 3 10:51:33 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 02 Feb 2004 17:51:33 -0600
Subject: [PATCH] nonempty bus->name for PHBs
Message-ID: <1075765893.26785.22.camel@verve>

Anyone have thoughts on this small addition to pcibios_fixup_bus()? The
bus->name field is currently empty for PHBs. Filling this in makes
debugging easier.

I also don't like the non-root level bus naming in pci_scan_bridge(), as
it doesn't include the PCI domain.

Thoughts?
John diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Mon Feb 2 17:47:17 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Mon Feb 2 17:47:17 2004 @@ -565,6 +565,7 @@ printk(KERN_ERR "Failed to request MEM" "on hose %d\n", 0 /* FIXME */); } + sprintf(bus->name, "PCI Host Bridge #%02x", pci_domain_nr(bus)); } else if (pci_probe_only && (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI) { /* This is a subordinate bridge */ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Feb 4 03:17:48 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 03 Feb 2004 10:17:48 -0600 Subject: [PATCH] nonempty bus->name for PHBs In-Reply-To: References: Message-ID: <1075825068.28337.11.camel@verve> Hi Olof- On Mon, 2004-02-02 at 23:14, olof at austin.ibm.com wrote: > I'm wondering if #%02x is a bad choice of format, since how should #10 be > parsed? Is it 10 or 16? It might be hard to tell when you're sitting there > reading the output. How about 0x%02x instead? Good point. I directly copied this from pci_scan_bridge(), but your suggestion makes things more clear. > > Also, how large is the buffer? Should you use snprintf just to be safe? Eh. The buffer is 48 chars long, and the other piece of code doesn't use snprintf(). Maybe I'm being careless, but... New patch attached! Thanks- John diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Tue Feb 3 10:16:39 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Tue Feb 3 10:16:39 2004 @@ -565,6 +565,8 @@ printk(KERN_ERR "Failed to request MEM" "on hose %d\n", 0 /* FIXME */); } + sprintf(bus->name, "PCI Host Bridge 0x%02x", + pci_domain_nr(bus)); } else if (pci_probe_only && (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI) { /* This is a subordinate bridge */ ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Wed Feb 4 03:27:30 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 03 Feb 2004 10:27:30 -0600
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <16412.20792.302473.676404@cargo.ozlabs.ibm.com>
References: <1075480843.682.188.camel@magik>
	<16412.20792.302473.676404@cargo.ozlabs.ibm.com>
Message-ID: <1075825650.1591.114.camel@magik>

> Wouldn't it simplify things if you had a read_proc function rather
> than having your ppc_rtas_errinjct_read function? As it is you must
> use copy_to_user rather than memcpy there.
>

Correct me if I'm wrong, but you can't use the file operations open and
release w/ read_proc. While it would simplify ppc_rtas_errinjct_read, to
do rtas error inject correctly, we need to have the open and close file
operation hooked. This will allow us to open and close the FW errinjct
facilities correctly.

One of the reasons the RPA architects designed error inject w/ this
"open token" in mind was to have only one error inject app to have
access to errinject at a time.

Thanks,
Jake

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From jdewand at redhat.com Wed Feb 4 04:54:47 2004
From: jdewand at redhat.com (Julie DeWandel)
Date: Tue, 03 Feb 2004 12:54:47 -0500
Subject: [2.4] [PATCH] hash_page rework, take 2
References: <40104461.5030804@austin.ibm.com>
Message-ID: <401FE067.80902@redhat.com>

Hi Olof,

The patch wasn't attached to your email so I included it along with
comments below (my comments preceded by "JSD:").

Thanks,
Julie

Olof Johansson wrote:

> Ok, so the previous approach of the hash_page rework had a few
> drawbacks as pointed out by Ben and others. Here's a new try, I'm
> looking for any feedback I can get on it!
>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hashpage.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040203/6c16856e/attachment.txt From jdewand at redhat.com Wed Feb 4 06:12:42 2004 From: jdewand at redhat.com (Julie DeWandel) Date: Tue, 03 Feb 2004 14:12:42 -0500 Subject: [2.4] [PATCH] hash_page rework, take 2 References: <40104461.5030804@austin.ibm.com> <401FE067.80902@redhat.com> Message-ID: <401FF2AA.8000903@redhat.com> Clarification/update on 2 points. > >- spin_unlock(&hash_table_lock[lock_slot].lock); >+out_unlock: >+ smp_wmb(); > >JSD: I believe an smp_mb__before_clear_bit() would be a better choice, or >JSD: at least an lwsync because this is an unlock. > >+ >+ pte_val(*ptep) &= ~_PAGE_BUSY; > >- return 0; >+ return ret; > } > > /* > JSD: I should have added that the clearing of the _PAGE_BUSY bit should be JSD: done using an atomic op (clear_bit() routine) >diff -pru -X /root/.diffexclude linux-2.4.21-6.EL-just_tools/arch/ppc64/mm/init.c hashpage/arch/ppc64/mm/init.c >--- linux-2.4.21-6.EL-just_tools/arch/ppc64/mm/init.c 2004-01-06 20:35:23.000000000 -0600 >+++ hashpage/arch/ppc64/mm/init.c 2004-01-18 18:11:14.000000000 -0600 > >+static inline void pte_free_sync(void) >+{ >+ unsigned long flags; >+ int i; >+ >+ /* All we need to know is that we can get the write lock if >+ * we wanted to, i.e. that no hash_page()s are holding it for reading. >+ * If none are reading, that means there's no currently executing >+ * hash_page() that might be working on one of the PTE's that will >+ * be deleted. Likewise, if there is a reader, we need to get the >+ * write lock to know when it releases the lock. >+ */ >+ >+ for (i = 0; i < smp_num_cpus; i++) >+ if (is_read_locked(&pte_hash_lock[i])) { >+ if(i == smp_processor_id()) >+ local_irq_save(flags); >+ >+ write_lock(&pte_hash_lock[i]); >+ write_unlock(&pte_hash_lock[i]); >+ >+ if(i == smp_processor_id()) >+ local_irq_restore(flags); >+ } >+} > >JSD: Two questions here. 
(1) Shouldn't interrupts be disabled for the >JSD: write_lock/unlock here regardless of what processor we are running >JSD: on? (2) I don't see how this code is preventing another processor >JSD: from grabbing the read_lock immediately after this processor has >JSD: checked to make sure it isn't held. > JSD: Better question for (1) is why are interrupts being disabled here? JSD: Can this routine be called from interrupt context? -- Julie DeWandel Red Hat, Inc. Tel (978) 692-3113 x23251 ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Wed Feb 4 11:32:10 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Tue, 3 Feb 2004 18:32:10 -0600 Subject: RTAS error logging of EEH errors In-Reply-To: <1075737704.8079.27.camel@mudbug.austin.ibm.com>; from nfont@austin.ibm.com on Mon, Feb 02, 2004 at 10:01:44AM -0600 References: <20040201073627.GE22694@krispykreme> <1075737704.8079.27.camel@mudbug.austin.ibm.com> Message-ID: <20040203183210.A27780@forte.austin.ibm.com> Please don't push this patch, It'll mess up my patch which I'm sending out in a few minutes. Anton, if this still really bugs you, let me know, I'll fix it. --linas On Mon, Feb 02, 2004 at 10:01:44AM -0600, Nathan Fontenot wrote: > I ended up leaving this on the stack because we call panic() right > after using the buffer. The patch below changes this so that the > buffer is allocated. If this is preferred let me know and I will > push the change to Ameslab. > > I don't think this should be a statically defined buffer, no sense > in keeping around a buffer this big that will only be used once > right before we die. 
> > -Nathan Fontenot > > On Sun, 2004-02-01 at 01:36, Anton Blanchard wrote: > > Hi, > > > > I was reviewing patches to send onto Andrew Morton and noticed: > > > > if (ret == 0 && rets[1] == 1 && rets[0] >= 2) { > > + unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX]; > > + unsigned long slot_err_ret; > > > > Thats 2kB on the stack, it really wants to be kmalloced (GFP_ATOMIC > > since it can be called in interrupt context) or statically defined. > > > > Anton > > > -- ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Wed Feb 4 11:34:59 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Tue, 3 Feb 2004 18:34:59 -0600 Subject: 2.6: PATCH for multiple EEH bugs Message-ID: <20040203183459.B27780@forte.austin.ibm.com> Patch for multiple EEH-related bugs. Please review this patch, & if appropriate, please apply. It should apply cleanly to the current ameslab tree (Feb 03 2004 2.6.2-rc3). (I could try to do a "bk push", but I'm not yet confident with my bk abilities). This patch fixes multiple EEH-related bugs: -- Fixes the eeh_check_failure() usage in an interrupt context. This routine is now safe to use in an interrupt. The fix was to build a cache of IO addresses and check that, instead of using the pci routines. -- Merges in Olof Johansson's sizeof patch when checking for failure -- Adds EEH tests to array/string reads -- Fixes bugs with address resolution (some i/o addresses were handled incorrectly, resulting in EEH errors slipping by undetected.) -- Adds EEH support to the PCI Hotplug system (so that devices that get added/removed get properly registered with the EEH subsystem.) -- Fixes improper use of /proc filesystem. -- Adds some misc statistics. Please note that the EEH subsystem will be undergoing a major revision in the not-to-distant future; this patch is a 'stopgap' to address the immediate concerns/issues until that time. --linas ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From linas at austin.ibm.com Wed Feb 4 11:44:41 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Tue, 3 Feb 2004 18:44:41 -0600 Subject: [PATCH][2.6] rtas error-inject support In-Reply-To: <1075493023.682.199.camel@magik>; from moilanen@austin.ibm.com on Fri, Jan 30, 2004 at 02:03:43PM -0600 References: <1075480843.682.188.camel@magik> <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com> <1075488969.682.192.camel@magik> <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com> <1075493023.682.199.camel@magik> Message-ID: <20040203184441.C27780@forte.austin.ibm.com> On Fri, Jan 30, 2004 at 02:03:43PM -0600, Jake Moilanen wrote: > > > > Whoops, your right. Good catch. This was leftover from the port from > > > 2.4. > > > > That statement was alarming... :) I only found one memcpy left in that > > file in 2.4, but I guess we're supposed to check for EFAULT: > > IIRC Linas went through a couple of months ago and fixed up > rtas-proc.c. The whole file was using memcpy instead of > copy_to/from_user(). Yep, and there's a bunch of lower priority bugs in there too, which I promised to fix, but haven't gotten a round tuit yet. --linas p.s. jake, thanks for this patch, I've been needing it ... ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From haveblue at us.ibm.com Wed Feb 4 12:07:44 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: 03 Feb 2004 17:07:44 -0800 Subject: [PATCH] kill lmb_add_io Message-ID: <1075856864.1449.205.camel@nighthawk> This function seems to be dead code now. I found it during a horribly confusing encounter with lmb.[ch] and CONFIG_MSCHUNKS. Does anyone know what an MSCHUNK is? It seems like it simplifies some of the cases. Does iSeries guarantee that the kernel's memory starts at 0 while pSeries doesn't? --dave -------------- next part -------------- A non-text attachment was scrubbed... 
Name: kill-lmb_add_io-2.6.1-0.patch
Type: text/x-patch
Size: 1516 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040203/655e5a6a/attachment.bin

From paulus at samba.org Wed Feb 4 12:17:25 2004
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 4 Feb 2004 12:17:25 +1100
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <1075501743.681.214.camel@magik>
References: <1075480843.682.188.camel@magik>
	<9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com>
	<1075488969.682.192.camel@magik>
	<264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com>
	<1075501743.681.214.camel@magik>
Message-ID: <16416.18469.3956.192501@cargo.ozlabs.ibm.com>

Jake Moilanen writes:

> Here's a patch w/ check for EFAULT.

In general, since this is a patch for 2.6 (according to the subject
line :), it would be better to do all this in userspace using the rtas
syscall that was added recently. But here are comments on the patch
anyway:

> +config RTAS_ERRINJCT
> +	bool "RTAS Errinject"

How about bool "RTAS Error injection facility" ?

> +static unsigned int open_token = 0;

No need to initialize things to 0, C does that by default.

> @@ -207,7 +221,8 @@
> void proc_rtas_init(void)
> {
> 	struct proc_dir_entry *entry;
> -
> +	int errinjct_token;
> +

I can't see any difference between the blank line that is deleted and
the one that is inserted (not even any whitespace). I wonder why diff
put that in? Or has your mailer deleted trailing whitespace?

> +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf,
> +		size_t count, loff_t *ppos)
> +{
> +
> +	char * ei_token;
> +	char * workspace = NULL;
> +	size_t max_len;
> +	int token_len;
> +	int rc;
> +
> +	/* Verify the errinjct token length */
> +	if (count < ERRINJCT_TOKEN_LEN) {
> +		max_len = count;
> +	} else {
> +		max_len = ERRINJCT_TOKEN_LEN;
> +	}
> +
> +	token_len = strnlen(buf, max_len);

That's a user pointer that you're using strnlen on. Ouch. Use
strnlen_user or copy_from_user.
In fact, do we need to check the string length at all? Would it matter
if there was a null in the buffer, and the value taken as a string was
shorter than we thought?

> +	token_len++; /* Add one for the null termination */
> +
> +	ei_token = (char *)kmalloc(token_len, GFP_KERNEL);
> +	if (!ei_token) {
> +		printk(KERN_WARNING "error: kmalloc failed\n");
> +		return -ENOMEM;
> +	}
> +
> +	strncpy(ei_token, buf, token_len);

Another access to the user buffer without using *user functions. Ouch.

> +int
> +rtas_errinjct(unsigned int open_token, char * ei_token, char * workspace, size_t workspace_size)
> +{
> +	struct errinjct_token * ei;
> +	int rtas_ei_token = -1;
> +	unsigned int time;
> +	int rc = 0;
> +	int i;
> +
> +	ei = ei_token_list;
> +	for (i = 0; i < MAX_ERRINJCT_TOKENS && ei->name; i++) {
> +		if (strcmp(ei_token, ei->name) == 0) {
> +			rtas_ei_token = ei->value;
> +			break;
> +		}
> +		ei++;
> +	}
> +	if (rtas_ei_token == -1) {
> +		return -EINVAL;
> +	}
> +
> +	spin_lock(&rtas_data_buf_lock);
> +
> +	while (1) {
> +		if (rc != RTAS_BUSY && workspace) {
> +			memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
> +			memcpy(rtas_data_buf, workspace, workspace_size);
> +		}

This worries me. We copy the workspace (the contents of which are
undefined since we just kmalloc'd it) to rtas_data_buf, but we never
copy rtas_data_buf back to the workspace after the rtas call. So how
can the contents of workspace ever be anything but undefined?

If we were doing this from userspace we could use the existing
facilities for getting memory below 4G and use that for the workspace
without any copying.
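
The copy-in/copy-out concern can be illustrated with a small userspace
sketch. Nothing below is taken from the patch: data_buf stands in for
rtas_data_buf, fake_fw_call() is a made-up placeholder for the firmware
call, and the locking and RTAS_BUSY retry loop are left out. It only
shows the round trip a bounce buffer needs if the caller is ever to see
what the call wrote.

```c
#include <stddef.h>
#include <string.h>

#define DATA_BUF_SIZE 64		/* stand-in for RTAS_DATA_BUF_SIZE */

static char data_buf[DATA_BUF_SIZE];	/* stand-in for rtas_data_buf */

/*
 * Hypothetical placeholder for the firmware call: it reads the request
 * from data_buf and overwrites the first two bytes with a reply.
 */
static int fake_fw_call(void)
{
	data_buf[0] = 'O';
	data_buf[1] = 'K';
	return 0;
}

/*
 * Pass a caller-owned workspace through the shared bounce buffer.
 * Returns -1 if the workspace does not fit, else the call's result.
 */
int call_with_workspace(char *workspace, size_t len)
{
	int rc;

	if (len > DATA_BUF_SIZE)
		return -1;

	/* Copy in: the caller's request goes into the shared buffer. */
	memset(data_buf, 0, DATA_BUF_SIZE);
	memcpy(data_buf, workspace, len);

	rc = fake_fw_call();

	/*
	 * Copy out: without this step the caller's workspace never
	 * reflects anything the call wrote into the shared buffer.
	 */
	memcpy(workspace, data_buf, len);

	return rc;
}
```

With memory the firmware can address directly, both memcpy() calls go
away, which is the attraction of the userspace approach mentioned above.
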
> +static int __init rtas_errinjct_init(void) > +{ > + char * token_array; > + char * end_array; > + int array_len = 0; > + int len; > + int i, j; > + > + token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", > + &array_len); > + end_array = token_array + array_len; > + for (i = 0, j = 0; i < MAX_ERRINJCT_TOKENS && token_array < end_array; i++) { > + > + len = strnlen(token_array, ERRINJCT_TOKEN_LEN) + 1; > + ei_token_list[i].name = (char *) kmalloc(len, GFP_KERNEL); > + if (!ei_token_list[i].name) { > + printk(KERN_WARNING "error: kmalloc failed\n"); Why can't we just store a pointer to the token name within the OF property value? Why do we have to make a copy of it? In fact, why do we need to parse the list here at all? We use the list for two things: matching the token name in the write function, and listing the tokens in the read function. In both cases we could just run through the ibm,errinjct-tokens (what do they have against vwls?) property value almost as easily as ei_token_list. > +/* Error inject defines */ > +#define ERRINJCT_TOKEN_LEN 24 /* Max length of an error inject token */ > +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. */ > +#define WORKSPACE_SIZE 1024 I worry about these too. 15 tokens sounds like future machines are going to going to exceed this limit quite easily. Also, is the workspace size limit there something that applies to all RTAS functions, or just to the error injection functions? If the latter, you should choose a name that indicates that. Overall, this looks to me like something that could be done just as well or better in userspace. Doing it in userspace would make it easy to avoid having an arbitrary limit on the number of tokens, for instance. Userspace could just read /proc/device-tree/rtas/ibm,errinjct-tokens to get the list and match against that directly. Paul. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From linas at austin.ibm.com Wed Feb 4 12:18:39 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Tue, 3 Feb 2004 19:18:39 -0600 Subject: 2.6: PATCH for multiple EEH bugs In-Reply-To: ; from boutcher@us.ibm.com on Tue, Feb 03, 2004 at 06:45:05PM -0600 References: <20040203183459.B27780@forte.austin.ibm.com> Message-ID: <20040203191839.D27780@forte.austin.ibm.com> Dohhh, On Tue, Feb 03, 2004 at 06:45:05PM -0600, David Boutcher wrote: > > > owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/03/2004 06:34:59 PM: > > Patch for multiple EEH-related bugs. Please review this patch, > > & if appropriate, please apply. It should apply cleanly to > > the current ameslab tree (Feb 03 2004 2.6.2-rc3). > > By the way....no patch :-) > > Dave Boutcher > IBM Linux Technology Center > -------------- next part -------------- ===== arch/ppc64/kernel/chrp_setup.c 1.49 vs edited ===== --- 1.49/arch/ppc64/kernel/chrp_setup.c Mon Jan 19 20:07:02 2004 +++ edited/arch/ppc64/kernel/chrp_setup.c Tue Feb 3 16:35:44 2004 @@ -71,6 +71,7 @@ extern void openpic_init_irq_desc(irq_desc_t *); extern void find_and_init_phbs(void); +extern void __init eeh_init(void); extern void pSeries_get_boot_time(struct rtc_time *rtc_time); extern void pSeries_get_rtc_time(struct rtc_time *rtc_time); ===== arch/ppc64/kernel/eeh.c 1.17 vs edited ===== --- 1.17/arch/ppc64/kernel/eeh.c Tue Feb 3 11:03:04 2004 +++ edited/arch/ppc64/kernel/eeh.c Tue Feb 3 16:38:19 2004 @@ -38,44 +38,68 @@ #define BUID_LO(buid) ((buid) & 0xffffffff) #define CONFIG_ADDR(busno, devfn) (((((busno) & 0xff) << 8) | ((devfn) & 0xf8)) << 8) -unsigned long eeh_total_mmio_ffs; -unsigned long eeh_false_positives; /* RTAS tokens */ static int ibm_set_eeh_option; static int ibm_set_slot_reset; static int ibm_read_slot_reset_state; -static int eeh_implemented; +static int eeh_subsystem_enabled; #define EEH_MAX_OPTS 4096 static char *eeh_opts; static int eeh_opts_last; -unsigned char 
slot_err_buf[RTAS_ERROR_LOG_MAX]; +/* System monitoring statistics */ +unsigned long eeh_total_mmio_ffs; +unsigned long eeh_false_positives; +unsigned long eeh_ignored_failures; pte_t *find_linux_pte(pgd_t *pgdir, unsigned long va); /* from htab.c */ static int eeh_check_opts_config(struct device_node *dn, int class_code, int vendor_id, int device_id, int default_state); -unsigned long eeh_token_to_phys(unsigned long token) + +/** + * eeh_token_to_phys - convert EEH address token to phys address + * @token i/o token, should be address in the form 0xA.... + * + * Converts EEH address tokens into physical addresses. Note that + * ths routine does *not* convert I/O BAR addresses (which start + * with 0xE...) to phys addresses! + */ +unsigned long +eeh_token_to_phys(unsigned long token) { + pte_t *ptep; + unsigned long pa, vaddr; if (REGION_ID(token) == EEH_REGION_ID) { - unsigned long vaddr = IO_TOKEN_TO_ADDR(token); - pte_t *ptep = find_linux_pte(ioremap_mm.pgd, vaddr); - unsigned long pa = pte_pfn(*ptep) << PAGE_SHIFT; - return pa | (vaddr & (PAGE_SIZE-1)); - } else + vaddr = IO_TOKEN_TO_ADDR(token); + } else { return token; + } + + ptep = find_linux_pte(ioremap_mm.pgd, vaddr); + pa = pte_pfn(*ptep) << PAGE_SHIFT; + return pa | (vaddr & (PAGE_SIZE-1)); } -/* Check for an eeh failure at the given token address. +/** + * eeh_check_failure - check if all 1's data is due to EEH slot freeze + * @token i/o token, should be address in the form 0xA.... + * @val value, should be all 1's (XXX why do we need this arg??) + * @who arbitrary ID, useful for debugging + * + * Check for an eeh failure at the given token address. * The given value has been read and it should be 1's (0xff, 0xffff or * 0xffffffff). * * Probe to determine if an error actually occurred. If not return val. * Otherwise panic. + * + * Note this routine might be called in an interrupt context ... 
*/ -unsigned long eeh_check_failure(void *token, unsigned long val) +unsigned long +eeh_check_failure(void *token, unsigned long val, int who) { unsigned long addr; struct pci_dev *dev; @@ -85,28 +109,27 @@ /* IO BAR access could get us here...or if we manually force EEH * operation on even if the hardware won't support it. */ - if (!eeh_implemented || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE) + if (!eeh_subsystem_enabled || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE) return val; - /* Finding the phys addr + pci device is quite expensive. - * However, the RTAS call is MUCH slower.... :( - */ + /* Finding the phys addr + pci device; this is pretty quick. */ addr = eeh_token_to_phys((unsigned long)token); - dev = pci_find_dev_by_addr(addr); - if (!dev) { - printk("EEH: no pci dev found for addr=0x%lx\n", addr); - return val; - } + dev = pci_get_device_by_addr(addr); + + if (!dev) return val; + dn = pci_device_to_OF_node(dev); if (!dn) { - printk("EEH: no pci dn found for addr=0x%lx\n", addr); + pci_dev_put (dev); return val; } /* Access to IO BARs might get this far and still not want checking. */ - if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || dn->eeh_mode & EEH_MODE_NOCHECK) + if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || + dn->eeh_mode & EEH_MODE_NOCHECK) { + pci_dev_put (dev); return val; - + } /* Now test for an EEH failure. This is VERY expensive. * Note that the eeh_config_addr may be a parent device @@ -119,6 +142,7 @@ dn->eeh_config_addr, BUID_HI(dn->phb->buid), BUID_LO(dn->phb->buid)); if (ret == 0 && rets[1] == 1 && rets[0] >= 2) { + unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX]; unsigned long slot_err_ret; memset(slot_err_buf, 0, RTAS_ERROR_LOG_MAX); @@ -139,23 +163,36 @@ * the system in light of potential corruption, we * can use it here. 
*/ - if (panic_on_oops) - panic("EEH: MMIO failure (%ld) on device:\n%s\n", - rets[0], pci_name(dev)); - else - printk("EEH: MMIO failure (%ld) on device:\n%s\n", - rets[0], pci_name(dev)); + if (panic_on_oops) { + panic("EEH: MMIO failure (%ld) on device:\n%s\n", rets[0], pci_name(dev)); + } else { + eeh_ignored_failures ++; + if (!in_interrupt()) { /* XXX this will be replaced by eehdaemon */ + printk(KERN_INFO "EEH: MMIO failure (%ld) on device:%s %s\n", + rets[0], pci_name(dev), pci_pretty_name(dev)); + } + } + } else { + eeh_false_positives++; } } - eeh_false_positives++; + pci_dev_put (dev); return val; /* good case */ - } struct eeh_early_enable_info { unsigned int buid_hi; unsigned int buid_lo; - int adapters_enabled; + + /* Handy-dandy statistics help us understand what's going on */ + int num_phbs_found; + int num_of_nodes; + int num_devices; + int num_devices_bad_status; + int num_devices_graphics; + int num_devices_w_eeh_parent; + int num_devices_w_eeh_disabled; + int num_adapters_enabled; }; /* Enable eeh for the given device node. */ @@ -170,12 +207,17 @@ u32 *regs; int enable; - if (status && strcmp(status, "ok") != 0) + info->num_devices ++; + if (status && strcmp(status, "ok") != 0) { + info->num_devices_bad_status ++; return NULL; /* ignore devices with bad status */ + } /* Weed out PHBs or other bad nodes. */ - if (!class_code || !vendor_id || !device_id) + if (!class_code || !vendor_id || !device_id) { + info->num_devices_bad_status ++; return NULL; + } /* Ignore known PHBs and EADs bridges */ if (*vendor_id == PCI_VENDOR_ID_IBM && @@ -191,28 +233,36 @@ * But there are a few cases like display devices that make sense. */ enable = 1; /* i.e. 
we will do checking */ - if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) + if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) { + printk (KERN_INFO "EEH: %s: display device, disabling EEH checking.\n", dn->full_name); + info->num_devices_graphics ++; enable = 0; + } if (!eeh_check_opts_config(dn, *class_code, *vendor_id, *device_id, enable)) { if (enable) { - printk(KERN_INFO "EEH: %s user requested to run without EEH.\n", dn->full_name); + printk(KERN_NOTICE "EEH: %s user requested to run without EEH.\n", dn->full_name); enable = 0; } } if (!enable) + { dn->eeh_mode = EEH_MODE_NOCHECK; + info->num_devices_w_eeh_disabled ++; + return NULL; + } /* This device may already have an EEH parent. */ if (dn->parent && (dn->parent->eeh_mode & EEH_MODE_SUPPORTED)) { /* Parent supports EEH. */ dn->eeh_mode |= EEH_MODE_SUPPORTED; dn->eeh_config_addr = dn->parent->eeh_config_addr; + info->num_devices_w_eeh_parent ++; return NULL; } - /* Ok..see if this device supports EEH. */ + /* Ok... see if this device supports EEH. */ regs = (u32 *)get_property(dn, "reg", 0); if (regs) { /* First register entry is addr (00BBSS00) */ @@ -221,16 +271,21 @@ regs[0], info->buid_hi, info->buid_lo, EEH_ENABLE); if (ret == 0) { - info->adapters_enabled++; + info->num_adapters_enabled++; dn->eeh_mode |= EEH_MODE_SUPPORTED; dn->eeh_config_addr = regs[0]; + printk (KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name); + } else { + printk (KERN_WARNING "EEH: %s: rtas_call failed.\n", dn->full_name); } + } else { + printk (KERN_WARNING "EEH: %s: unable to get reg property.\n", dn->full_name); } return NULL; } /* - * Initialize eeh by trying to enable it for all of the adapters in the system. + * Initialize EEH by trying to enable it for all of the adapters in the system. * As a side effect we can determine here if eeh is supported at all. * Note that we leave EEH on so failed config cycles won't cause a machine * check. 
If a user turns off EEH for a particular adapter they are really @@ -243,7 +298,7 @@ * The eeh-force-off/on option does literally what it says, so if Linux must * avoid enabling EEH this must be done. */ -void eeh_init(void) +void __init eeh_init(void) { struct device_node *phb; struct eeh_early_enable_info info; @@ -261,26 +316,37 @@ * of I/O macros even if we can't actually test for EEH failure. */ if (eeh_force_on > eeh_force_off) - eeh_implemented = 1; + eeh_subsystem_enabled = 1; else if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) return; if (eeh_force_off > eeh_force_on) { /* User is forcing EEH off. Be noisy if it is implemented. */ - if (eeh_implemented) + if (eeh_subsystem_enabled) printk(KERN_WARNING "EEH: WARNING: PCI Enhanced I/O Error Handling is user disabled\n"); - eeh_implemented = 0; + eeh_subsystem_enabled = 0; return; } - /* Enable EEH for all adapters. Note that eeh requires buid's */ - info.adapters_enabled = 0; + info.num_adapters_enabled = 0; + info.num_of_nodes = 0; + info.num_phbs_found = 0; + info.num_devices = 0; + info.num_devices_bad_status = 0; + info.num_devices_graphics = 0; + info.num_devices_w_eeh_parent = 0; + info.num_devices_w_eeh_disabled = 0; for (phb = of_find_node_by_name(NULL, "pci"); phb; phb = of_find_node_by_name(phb, "pci")) { + int len; - int *buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len); + int *buid_vals; + + info.num_of_nodes ++; + buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len); if (!buid_vals) continue; + info.num_phbs_found ++; if (len == sizeof(int)) { info.buid_lo = buid_vals[0]; info.buid_hi = 0; @@ -288,24 +354,88 @@ info.buid_hi = buid_vals[0]; info.buid_lo = buid_vals[1]; } else { - printk("EEH: odd ibm,fw-phb-id len returned: %d\n", len); + printk(KERN_INFO "EEH: odd ibm,fw-phb-id len returned: %d\n", len); continue; } traverse_pci_devices(phb, early_enable_eeh, NULL, &info); } - if (info.adapters_enabled) { + if (info.num_adapters_enabled) { printk(KERN_INFO "EEH: PCI Enhanced 
I/O Error Handling Enabled\n"); - eeh_implemented = 1; + eeh_subsystem_enabled = 1; + } + printk(KERN_INFO "EEH: num_of_nodes=%d\n", info.num_of_nodes); + printk(KERN_INFO "EEH: num_phbs_found=%d\n", info.num_phbs_found); + printk(KERN_INFO "EEH: num_devices=%d\n", info.num_devices); + printk(KERN_INFO "EEH: num_devices_bad_status=%d\n", info.num_devices_bad_status); + printk(KERN_INFO "EEH: num_devices_graphics=%d\n", info.num_devices_graphics); + printk(KERN_INFO "EEH: num_devices_w_eeh_parent=%d\n", info.num_devices_w_eeh_parent); + printk(KERN_INFO "EEH: num_devices_w_eeh_disabled=%d\n", info.num_devices_w_eeh_disabled); + printk(KERN_INFO "EEH: num_adapters_enabled=%d\n", info.num_adapters_enabled); +} + +/** + * eeh_add_device - perform EEH initialization for the indicated pci device + * @dev: pci device for which to set up EEH + * + * This routine can be used to perform EEH initialization for PCI + * devices that were added after system boot (e.g. hotplug, dlpar). + * Whether this actually enables EEH or not for this device depends + * on the type of the device, on earlier boot command-line + * arguments & etc. + */ +void +eeh_add_device (struct pci_dev *dev) +{ + struct device_node *dn; + struct pci_controller *phb; + struct eeh_early_enable_info info; + + if (!dev || !eeh_subsystem_enabled) return; + + printk (KERN_DEBUG "EEH: adding device %s %s\n", + pci_name (dev), pci_pretty_name(dev)); + dn = pci_device_to_OF_node(dev); + if (NULL == dn) return; + + phb = PCI_GET_PHB_PTR(dev); + if (NULL == phb || 0 == phb->buid) { + printk (KERN_WARNING "EEH: Expected buid but found none\n"); + return; } + + info.buid_hi = BUID_HI(phb->buid); + info.buid_lo = BUID_LO(phb->buid); + + early_enable_eeh(dn, &info); + pci_addr_cache_insert_device (dev); } +/** + * eeh_remove_device - undo EEH setup for the indicated pci device + * @dev: pci device to be removed + * + * This routine should be when a device is removed from a running + * system (e.g. by hotplug or dlpar). 
+ */
+void
+eeh_remove_device (struct pci_dev *dev)
+{
+	if (!dev || !eeh_subsystem_enabled) return;
+
+	/* Unregister the device with the EEH/PCI address search system */
+	printk (KERN_DEBUG "EEH: remove device %s %s\n",
+		pci_name (dev), pci_pretty_name(dev));
+	pci_addr_cache_remove_device (dev);
+
+}
 
-int eeh_set_option(struct pci_dev *dev, int option)
+int
+eeh_set_option(struct pci_dev *dev, int option)
 {
 	struct device_node *dn = pci_device_to_OF_node(dev);
 	struct pci_controller *phb = PCI_GET_PHB_PTR(dev);
 
-	if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_implemented)
+	if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_subsystem_enabled)
 		return -2;
 
 	return rtas_call(ibm_set_eeh_option, 4, 1, NULL,
@@ -316,7 +446,7 @@
 
 /* If EEH is implemented, find the PCI device using given phys addr
  * and check to see if eeh failure checking is disabled.
- * Remap the addr (trivially) to the EEH region if not.
+ * Remap the addr (trivially) to the EEH region if EEH checking enabled.
  * For addresses not known to PCI the vaddr is simply returned unchanged.
  */
 void *eeh_ioremap(unsigned long addr, void *vaddr)
@@ -324,28 +454,72 @@
 	struct pci_dev *dev;
 	struct device_node *dn;
 
-	if (!eeh_implemented)
+	if (!eeh_subsystem_enabled)
 		return vaddr;
-	dev = pci_find_dev_by_addr(addr);
+	dev = pci_get_device_by_addr(addr);
 	if (!dev)
 		return vaddr;
-	dn = pci_device_to_OF_node(dev);
-	if (!dn)
+
+	dn = pci_device_to_OF_node(dev);
+	if (!dn) {
+		pci_dev_put (dev);
 		return vaddr;
-	if (dn->eeh_mode & EEH_MODE_NOCHECK)
+	}
+	if (dn->eeh_mode & EEH_MODE_NOCHECK) {
+		pci_dev_put (dev);
 		return vaddr;
+	}
 
+	pci_dev_put (dev);
 	return (void *)IO_ADDR_TO_TOKEN(vaddr);
 }
 
 static int eeh_proc_falsepositive_read(char *page, char **start, off_t off,
 			 int count, int *eof, void *data)
 {
-	int len;
-	len = sprintf(page, "eeh_false_positives=%ld\n"
-		      "eeh_total_mmio_ffs=%ld\n",
-		      eeh_false_positives, eeh_total_mmio_ffs);
-	return len;
+	char *p, *buffer;
+#define EEH_PROC_BUFSZ 250
+	int n=0, bs=EEH_PROC_BUFSZ;
+
+	if (count < 0) return -EINVAL;
+
+	buffer = kmalloc (EEH_PROC_BUFSZ,GFP_KERNEL);
+	if (!buffer) return -ENOMEM;
+
+	p = buffer;
+
+	if (0 == eeh_subsystem_enabled) {
+		n += snprintf (p+n, bs-n, "EEH Subsystem is globally disabled\n");
+		n += snprintf(p+n, bs-n, "eeh_total_mmio_ffs=%ld\n",
+			eeh_total_mmio_ffs);
+	} else {
+		n += snprintf (p+n, bs-n, "EEH Subsystem is enabled\n");
+		n += snprintf(p+n, bs-n,
+			"eeh_total_mmio_ffs=%ld\n"
+			"eeh_false_positives=%ld\n"
+			"eeh_ignored_failures=%ld\n",
+			eeh_total_mmio_ffs,
+			eeh_false_positives,
+			eeh_ignored_failures);
+	}
+
+	/* Misc machinations of the proc file system */
+	if (off >= strlen(buffer)) {
+		*eof = 1;
+		kfree(buffer);
+		return 0;
+	}
+	if (n > strlen(buffer) - off)
+		n = strlen(buffer) - off;
+	if (n > count)
+		n = count;
+	else
+		*eof = 1;
+
+	memcpy(page, buffer + off, n);
+	*start = page;
+	kfree(buffer);
+	return n;
 }
 
 /* Implementation of /proc/ppc64/eeh
@@ -362,6 +536,12 @@
 	return 0;
 }
 
+static int __init eeh_init_late(void)
+{
+	eeh_init_proc ();
+	return 0;
+}
+
 /*
  * Test if "dev" should be configured on or off.
  * This processes the options literally from left to right.
@@ -456,7 +636,7 @@
 	if (*cur) {
 		int curlen = curend-cur;
 		if (eeh_opts_last + curlen > EEH_MAX_OPTS-2) {
-			printk(KERN_INFO "EEH: sorry...too many eeh cmd line options\n");
+			printk(KERN_WARNING "EEH: sorry...too many eeh cmd line options\n");
 			return 1;
 		}
 		eeh_opts[eeh_opts_last++] = state ? '+' : '-';
@@ -478,6 +658,6 @@
 	return eeh_parm(str, 1);
 }
 
-__initcall(eeh_init_proc);
+__initcall(eeh_init_late);
 __setup("eeh-off", eehoff_parm);
 __setup("eeh-on", eehon_parm);
===== arch/ppc64/kernel/pSeries_pci.c 1.34 vs edited =====
--- 1.34/arch/ppc64/kernel/pSeries_pci.c	Fri Jan 30 21:22:28 2004
+++ edited/arch/ppc64/kernel/pSeries_pci.c	Tue Feb 3 16:35:49 2004
@@ -530,7 +530,7 @@
 			dev->resource[i].start += hose->pci_mem_offset;
 			dev->resource[i].end += hose->pci_mem_offset;
 		}
-	}
+	}
 }
 EXPORT_SYMBOL(pcibios_fixup_device_resources);
 
===== arch/ppc64/kernel/pci.c 1.42 vs edited =====
--- 1.42/arch/ppc64/kernel/pci.c	Mon Jan 19 20:07:05 2004
+++ edited/arch/ppc64/kernel/pci.c	Tue Feb 3 16:39:53 2004
@@ -23,6 +23,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include
 
@@ -107,42 +109,264 @@
 	}
 }
 
-/* Given an mmio phys address, find a pci device that implements
- * this address. This is of course expensive, but only used
- * for device initialization or error paths.
- * For io BARs it is assumed the pci_io_base has already been added
- * into addr.
+/**
+ * The pci address cache subsystem. This subsystem places
+ * PCI device address resources into a red-black tree, sorted
+ * according to the address range, so that given only an i/o
+ * address, the corresponding PCI device can be **quickly**
+ * found.
  *
- * Bridges are ignored although they could be used to optimize the search.
+ * Currently, the only customer of this code is the EEH subsystem;
+ * thus, this code has been somewhat tailored to suit EEH better.
+ * In particular, the cache does *not* hold the addresses of devices
+ * for which EEH is not enabled.
+ *
+ * (Implementation Note: The RB tree seems to be better/faster
+ * than any hash algo I could think of for this problem, even
+ * with the penalty of slow pointer chases for d-cache misses).
  */
-struct pci_dev *pci_find_dev_by_addr(unsigned long addr)
+struct pci_io_addr_range
 {
-	struct pci_dev *dev = NULL;
+	struct rb_node rb_node;
+	unsigned long addr_lo;
+	unsigned long addr_hi;
+	struct pci_dev *pcidev;
+	unsigned int flags;
+};
+
+struct pci_io_addr_cache
+{
+	struct rb_root rb_root;
+	spinlock_t piar_lock;
+} pci_io_addr_cache_root;
+
+static inline struct pci_dev *
+__pci_get_device_by_addr (unsigned long addr)
+{
+	struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node;
+	while (n)
+	{
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+		if (addr < piar->addr_lo) {
+			n = n->rb_left;
+		} else
+		if (addr > piar->addr_hi) {
+			n = n->rb_right;
+		} else {
+			pci_dev_get (piar->pcidev);
+			return piar->pcidev;
+		}
+	}
+	return NULL;
+}
+
+/**
+ * pci_get_device_by_addr - Get device, given only address
+ * @addr: mmio (PIO) phys address or i/o port number
+ *
+ * Given an mmio phys address, or a port number, find a pci device
+ * that implements this address. Be sure to pci_dev_put the device
+ * when finished. I/O port numbers are assumed to be offset
+ * from zero (that is, they do *not* have pci_io_addr added in).
+ * It is safe to call this function within an interrupt.
+ */
+struct pci_dev *
+pci_get_device_by_addr (unsigned long addr)
+{
+	struct pci_dev *dev;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	dev = __pci_get_device_by_addr (addr);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+	return dev;
+}
+
+/* Handy-dandy debug print routine, does nothing more
+ * than print out the contents of our addr cache. */
+static void
+pci_addr_cache_print (struct pci_io_addr_cache *cache)
+{
+	struct rb_node *n;
+	n = rb_first (&cache->rb_root);
+	int cnt=0;
+	while (n) {
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+		printk (KERN_DEBUG "PCI: %s addr range %d [%lx -%lx]: %s %s\n",
+			(piar->flags & IORESOURCE_IO) ? "i/o" : "mem",
+			cnt,
+			piar->addr_lo, piar->addr_hi,
+			pci_name (piar->pcidev),
+			pci_pretty_name (piar->pcidev));
+		cnt ++;
+		n = rb_next (n);
+	}
+}
+
+/* Insert address range into the rb tree. */
+static inline struct pci_io_addr_range *
+pci_addr_cache_insert (struct pci_dev *dev,
+		unsigned long alo, unsigned long ahi, unsigned int flags)
+{
+	struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node;
+	struct rb_node * parent = NULL;
+	struct pci_io_addr_range *piar;
+
+	// Walk tree, find a place to insert into tree
+	while (*p) {
+		parent = *p;
+		piar = rb_entry (parent, struct pci_io_addr_range, rb_node);
+		if (alo < piar->addr_lo) {
+			p = &parent->rb_left;
+		} else if (ahi > piar->addr_hi) {
+			p = &parent->rb_right;
+		} else {
+			if (dev != piar->pcidev ||
+			    alo != piar->addr_lo || ahi != piar->addr_hi) {
+				printk (KERN_WARNING "PIAR: overlapping address range\n");
+			}
+			return piar;
+		}
+	}
+	piar = (struct pci_io_addr_range *) kmalloc (
+		sizeof(struct pci_io_addr_range), GFP_ATOMIC);
+
+	if (!piar) return NULL; // whoops
+
+	piar->addr_lo = alo;
+	piar->addr_hi = ahi;
+	piar->pcidev = dev;
+	piar->flags = flags;
+
+	rb_link_node (&piar->rb_node, parent, p);
+	rb_insert_color (&piar->rb_node, &pci_io_addr_cache_root.rb_root);
+	return piar;
+}
+
+inline void
+__pci_addr_cache_insert_device (struct pci_dev *dev)
+{
+	struct device_node *dn;
+	dn = pci_device_to_OF_node(dev);
+	if (!dn) {
+		printk(KERN_WARNING "PCI: no pci dn found for dev=%s %s\n",
+			pci_name(dev), pci_pretty_name(dev));
+		pci_dev_put (dev);
+		return;
+	}
+
+	// Skip any devices for which EEH is not enabled.
+	if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) ||
+	    dn->eeh_mode & EEH_MODE_NOCHECK) {
+		printk(KERN_INFO "PCI: skip building address cache for=%s %s\n",
+			pci_name(dev), pci_pretty_name(dev));
+		pci_dev_put (dev);
+		return;
+	}
+
+	// Walk resources on this device, poke them into the tree
 	int i;
-	unsigned long ioaddr;
+	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
+		unsigned long start = pci_resource_start(dev,i);
+		unsigned long end = pci_resource_end(dev,i);
+		unsigned int flags = pci_resource_flags(dev,i);
+
+		// We are interested only bus addresses, not dma or other stuff
+		if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) continue;
+		if (start == 0 || ~start == 0 || end == 0 || ~end == 0)
+			continue;
+		pci_addr_cache_insert (dev, start, end, flags);
+	}
+}
 
-	ioaddr = (addr > isa_io_base) ? addr - isa_io_base : 0;
+/**
+ * pci_addr_cache_insert_device - Add a device to the address cache
+ * @dev: PCI device whose I/O addresses we are interested in.
+ *
+ * In order to support the fast lookup of devices based on addresses,
+ * we maintain a cache of devices that can be quickly searched.
+ * This routine adds a device to that cache.
+ */
+void
+pci_addr_cache_insert_device (struct pci_dev *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	__pci_addr_cache_insert_device (dev);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+}
 
-	while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
-		if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
+static inline void
+__pci_addr_cache_remove_device (struct pci_dev *dev)
+{
+	struct rb_node *n;
+
+restart:
+	n = rb_first (&pci_io_addr_cache_root.rb_root);
+	while (n) {
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+
+		if (piar->pcidev == dev)
+		{
+			rb_erase (n, &pci_io_addr_cache_root.rb_root);
+			kfree (piar);
+			goto restart;
+		}
+		n = rb_next (n);
+	}
+	pci_dev_put (dev);
+}
+
+/**
+ * pci_addr_cache_remove_device - remove pci device from addr cache
+ * @dev: device to remove
+ *
+ * Remove a device from the addr-cache tree.
+ * This is potentially expensive, since it will walk
+ * the tree multiple times (once per resource).
+ * But so what; device removal doesn't need to be that fast.
+ */
+void
+pci_addr_cache_remove_device (struct pci_dev *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	__pci_addr_cache_remove_device (dev);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+}
+
+/**
+ * pci_addr_cache_build - Build a cache of I/O addresses
+ *
+ * Build a cache of pci i/o addresses. This cache will be used to
+ * find the pci device that corresponds to a given address.
+ * This routine scans all pci busses to build the cache.
+ * Must be run late in boot process, after the pci controllers
+ * have been scaned for devices (after all device resources are known).
+ */
+static __init void
+pci_addr_cache_build (void)
+{
+	struct pci_dev *dev = NULL;
+
+	spin_lock_init (&pci_io_addr_cache_root.piar_lock);
+
+	while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
+		// Ignore PCI bridges ( XXX why ??)
+		if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) {
+			pci_dev_put (dev);
 			continue;
-
-		for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
-			unsigned long start = pci_resource_start(dev,i);
-			unsigned long end = pci_resource_end(dev,i);
-			unsigned int flags = pci_resource_flags(dev,i);
-			if (start == 0 || ~start == 0 ||
-			    end == 0 || ~end == 0)
-				continue;
-			if ((flags & IORESOURCE_IO) &&
-			    (ioaddr >= start && ioaddr <= end))
-				return dev;
-			else if ((flags & IORESOURCE_MEM) &&
-				 (addr >= start && addr <= end))
-				return dev;
 		}
+		pci_addr_cache_insert_device (dev);
 	}
-	return NULL;
+
+	// Verify tree built up above, echo back the list of addrs.
+	pci_addr_cache_print (&pci_io_addr_cache_root);
 }
 
 void
@@ -343,6 +567,8 @@
 	printk("PCI: Probing PCI hardware done\n");
 	//ppc64_boot_msg(0x41, "PCI Done");
+
+	pci_addr_cache_build ();
 	return 0;
 }
 
===== arch/ppc64/kernel/pci.h 1.10 vs edited =====
--- 1.10/arch/ppc64/kernel/pci.h	Fri Sep 12 06:01:39 2003
+++ edited/arch/ppc64/kernel/pci.h	Tue Feb 3 16:35:50 2004
@@ -37,11 +37,15 @@
 void *traverse_pci_devices(struct device_node *start, traverse_func pre,
 		traverse_func post, void *data);
 void *traverse_all_pci_devices(traverse_func pre);
-struct pci_dev *pci_find_dev_by_addr(unsigned long addr);
 void pci_devs_phb_init(void);
 void pci_fix_bus_sysdata(void);
 struct device_node *fetch_dev_dn(struct pci_dev *dev);
 
 #define PCI_GET_PHB_PTR(dev) (((struct device_node *)(dev)->sysdata)->phb)
+
+/* PCI address cache management routines */
+struct pci_dev *pci_get_device_by_addr(unsigned long addr);
+void pci_addr_cache_insert_device (struct pci_dev *dev);
+void pci_addr_cache_remove_device (struct pci_dev *dev);
 
 #endif /* __PPC_KERNEL_PCI_H__ */
===== drivers/pci/hotplug/rpaphp_core.c 1.2 vs edited =====
--- 1.2/drivers/pci/hotplug/rpaphp_core.c	Tue Dec 9 11:03:38 2003
+++ edited/drivers/pci/hotplug/rpaphp_core.c	Tue Feb 3 16:35:51 2004
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include /* for eeh_add_device() */
 #include /* rtas_call */
 #include /* for pci_controller */
 #include "../pci.h" /* for pci_add_new_bus*/
@@ -512,6 +513,7 @@
 		}
 
 		dev = rpaphp_find_pci_dev(slot->dn->child);
+		eeh_add_device(dev);
 	} else {
 		/* slot is not enabled */
@@ -540,12 +542,12 @@
 		goto exit;
 	}
 
+	/* remove the device from the pci core */
+	eeh_remove_device(slot->dev);
+	pci_remove_bus_device(slot->dev);
 
-		/* remove the device from the pci core */
-		pci_remove_bus_device(slot->dev);
-
-		pci_dev_put(slot->dev);
-		slot->state = NOT_CONFIGURED;
+	pci_dev_put(slot->dev);
+	slot->state = NOT_CONFIGURED;
 
 	dbg("%s: adapter in slot[%s] unconfigured.\n", __FUNCTION__,
 	    slot->name);
===== include/asm-ppc64/eeh.h 1.6 vs edited =====
--- 1.6/include/asm-ppc64/eeh.h	Fri Sep 12 06:06:51 2003
+++ edited/include/asm-ppc64/eeh.h	Tue Feb 3 16:35:51 2004
@@ -45,22 +45,37 @@
 /* This is for profiling only */
 extern unsigned long eeh_total_mmio_ffs;
 
-void eeh_init(void);
-int eeh_get_state(unsigned long ea);
-unsigned long eeh_check_failure(void *token, unsigned long val);
+unsigned long eeh_check_failure(void *token, unsigned long val, int who);
 void *eeh_ioremap(unsigned long addr, void *vaddr);
 
+/**
+ * eeh_add_device - perform EEH initialization for the indicated pci device
+ * @dev: pci device for which to set up EEH
+ *
+ * This routine can be used to perform EEH initialization for PCI
+ * devices that were added after system boot (e.g. hotplug, dlpar).
+ * Whether this actually enables EEH or not for this device depends
+ * on the type of the device, on earlier boot command-line
+ * arguments & etc.
+ */
+void eeh_add_device(struct pci_dev *);
+
+/**
+ * eeh_remove_device - undo EEH setup for the indicated pci device
+ * @dev: pci device to be removed
+ *
+ * This routine should be when a device is removed from a running
+ * system (e.g. by hotplug or dlpar).
+ */
+void eeh_remove_device(struct pci_dev *);
+
+
 #define EEH_DISABLE 0
 #define EEH_ENABLE 1
 #define EEH_RELEASE_LOADSTORE 2
 #define EEH_RELEASE_DMA 3
 int eeh_set_option(struct pci_dev *dev, int options);
 
-/* Given a PCI device check if eeh should be configured or not.
- * This may look at firmware properties and/or kernel cmdline options.
- */
-int is_eeh_configured(struct pci_dev *dev);
-
 /* Translate a (possible) eeh token to a physical addr.
  * If "token" is not an eeh token it is simply returned under
  * the assumption that it is already a physical addr.
@@ -78,11 +93,16 @@
  * If this macro yields TRUE, the caller relays to eeh_check_failure()
  * which does further tests out of line.
  */
-/* #define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0) */
-/* #define EEH_POSSIBLE_ERROR(addr, vaddr, val) ((vaddr) != (addr) && EEH_POSSIBLE_IO_ERROR(val) */
 /* This version is rearranged to collect some profiling data */
-#define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0 && ++eeh_total_mmio_ffs)
-#define EEH_POSSIBLE_ERROR(addr, vaddr, val) (EEH_POSSIBLE_IO_ERROR(val) && (vaddr) != (addr))
+#define EEH_POSSIBLE_IO_ERROR(val, type) \
+		((val) == (type)~0 && ++eeh_total_mmio_ffs)
+
+/* The vaddr will equal the addr if EEH checking is disabled for
+ * this device. This is because eeh_ioremap() will not have
+ * remapped to 0xA0, and thus both vaddr and addr will be 0xE0...
+ */
+#define EEH_POSSIBLE_ERROR(addr, vaddr, val, type) \
+		(EEH_POSSIBLE_IO_ERROR(val, type) && (vaddr) != (addr))
 
 /*
  * MMIO read/write operations with EEH support.
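[Editorial aside on the new "type" argument of EEH_POSSIBLE_IO_ERROR() in the hunk above. The patch does not say why the argument was added. One plausible reason is C integer promotion: in the old form, ~(val) == 0, a u8 or u16 value is promoted to int before the complement is taken, so an all-ones byte never compares equal to zero. The userspace sketch below illustrates that effect under the assumption of a 32-bit int. The names OLD_POSSIBLE_IO_ERROR, NEW_POSSIBLE_IO_ERROR, old_check_* and new_check_* are made up for this illustration, and the eeh_total_mmio_ffs profiling counter is left out.]

```c
#include <assert.h>
#include <stdint.h>

/* Old form, as removed by the patch (profiling counter omitted).
 * A u8 or u16 operand is promoted to int before '~' is applied,
 * so ~0xff becomes 0xffffff00 rather than 0. */
#define OLD_POSSIBLE_IO_ERROR(val)       (~(val) == 0)

/* New form, as added by the patch (profiling counter omitted).
 * The all-ones pattern is truncated to the access width first. */
#define NEW_POSSIBLE_IO_ERROR(val, type) ((val) == (type)~0)

/* Hypothetical wrappers, one per access width. */
static int old_check_u8(uint8_t v)   { return OLD_POSSIBLE_IO_ERROR(v); }
static int old_check_u16(uint16_t v) { return OLD_POSSIBLE_IO_ERROR(v); }
static int old_check_u32(uint32_t v) { return OLD_POSSIBLE_IO_ERROR(v); }

static int new_check_u8(uint8_t v)   { return NEW_POSSIBLE_IO_ERROR(v, uint8_t); }
static int new_check_u16(uint16_t v) { return NEW_POSSIBLE_IO_ERROR(v, uint16_t); }
static int new_check_u32(uint32_t v) { return NEW_POSSIBLE_IO_ERROR(v, uint32_t); }

/* Returns 1 when the behaviour described in the comments above holds. */
static int promotion_demo(void)
{
	assert(sizeof(int) >= 4);	/* assumption: int is at least 32 bits */

	return !old_check_u8(0xff) && !old_check_u16(0xffff) &&
	       old_check_u32(0xffffffffu) &&
	       new_check_u8(0xff) && new_check_u16(0xffff) &&
	       new_check_u32(0xffffffffu);
}
```

[Under that assumption the old test can only fire for 32-bit reads, while the typed test fires for all three widths, which matches the per-width u8/u16/u32 arguments passed by the accessors in the hunks that follow.]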
@@ -101,8 +121,8 @@
 static inline u8 eeh_readb(void *addr) {
 	volatile u8 *vaddr = (volatile u8 *)IO_TOKEN_TO_ADDR(addr);
 	u8 val = in_8(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u8))
+		return eeh_check_failure(addr, val, 8);
 	return val;
 }
 static inline void eeh_writeb(u8 val, void *addr) {
@@ -112,25 +132,47 @@
 static inline u16 eeh_readw(void *addr) {
 	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
 	u16 val = in_le16(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16))
+		return eeh_check_failure(addr, val, 16);
 	return val;
 }
 static inline void eeh_writew(u16 val, void *addr) {
 	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
 	out_le16(vaddr, val);
 }
+static inline u16 eeh_raw_readw(void *addr) {
+	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
+	u16 val = in_be16(vaddr);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16))
+		return eeh_check_failure(addr, val, 17);
+	return val;
+}
+static inline void eeh_raw_writew(u16 val, void *addr) {
+	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
+	out_be16(vaddr, val);
+}
 static inline u32 eeh_readl(void *addr) {
 	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
 	u32 val = in_le32(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32))
+		return eeh_check_failure(addr, val, 32);
 	return val;
 }
 static inline void eeh_writel(u32 val, void *addr) {
 	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
 	out_le32(vaddr, val);
 }
+static inline u32 eeh_raw_readl(void *addr) {
+	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
+	u32 val = in_be32(vaddr);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32))
+		return eeh_check_failure(addr, val, 33);
+	return val;
+}
+static inline void eeh_raw_writel(u32 val, void *addr) {
+	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
+	out_be32(vaddr, val);
+}
 
 static inline void eeh_memset_io(void *addr, int c, unsigned long n) {
 	void *vaddr = (void *)IO_TOKEN_TO_ADDR(addr);
@@ -139,8 +181,14 @@
 static inline void eeh_memcpy_fromio(void *dest, void *src, unsigned long n) {
 	void *vsrc = (void *)IO_TOKEN_TO_ADDR(src);
 	memcpy(dest, vsrc, n);
-	/* look for ffff's here at dest[n] */
+	/* Look for ffff's here at dest[n]. Assume that at least 4 bytes
+	 * were copied. Check all four bytes.
+	 */
+	if ((n>=4) && (EEH_POSSIBLE_ERROR(src, vsrc, (*((u32 *) dest+n-4)), u32))) {
+		eeh_check_failure(src, (*((u32 *) dest+n-4)), 88);
+	}
 }
+
 static inline void eeh_memcpy_toio(void *dest, void *src, unsigned long n) {
 	void *vdest = (void *)IO_TOKEN_TO_ADDR(dest);
 	memcpy(vdest, src, n);
@@ -158,8 +206,8 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_8((u8 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u8))
+		return eeh_check_failure((void*)(port), val, -8);
 	return val;
 }
 
@@ -173,8 +221,8 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_le16((u16 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u16))
+		return eeh_check_failure((void*)(port), val, -16);
 	return val;
 }
 
@@ -188,14 +236,33 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_le32((u32 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u32))
+		return eeh_check_failure((void*)(port), val, -32);
 	return val;
 }
 
 static inline void eeh_outl(u32 val, unsigned long port) {
 	if (!_IO_IS_ISA(port) || _IO_HAS_ISA_BUS)
 		return out_le32((u32 *)(port+pci_io_base), val);
+}
+
+/* in-string eeh macros */
+static inline void eeh_insb(unsigned long port, void * buf, int ns) {
+	_insb((u8 *)(port+pci_io_base), buf, ns);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u8*)buf)+ns-1)), u8))
+		eeh_check_failure((void*)(port), *(u8*)buf, -9);
+}
+
+static inline void eeh_insw_ns(unsigned long port, void * buf, int ns) {
+	_insw_ns((u16 *)(port+pci_io_base), buf, ns);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u16*)buf)+ns-1)), u16))
+		eeh_check_failure((void*)(port), *(u16*)buf, -17);
+}
+
+static inline void eeh_insl_ns(unsigned long port, void * buf, int nl) {
+	_insl_ns((u32 *)(port+pci_io_base), buf, nl);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u32*)buf)+nl-1)), u32))
+		eeh_check_failure((void*)(port), *(u32*)buf, -33);
 }
 
 #endif /* _EEH_H */
===== include/asm-ppc64/io.h 1.11 vs edited =====
--- 1.11/include/asm-ppc64/io.h	Mon Jan 19 20:08:22 2004
+++ edited/include/asm-ppc64/io.h	Tue Feb 3 16:35:52 2004
@@ -49,6 +49,13 @@
 #define outb(data,addr) writeb(data,((unsigned long)(addr)))
 #define outw(data,addr) writew(data,((unsigned long)(addr)))
 #define outl(data,addr) writel(data,((unsigned long)(addr)))
+/*
+ * The *_ns versions below don't do byte-swapping.
+ * Neither do the standard versions now, these are just here
+ * for older code.
+ */
+#define insw_ns(port, buf, ns) _insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
+#define insl_ns(port, buf, nl) _insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
 #else
 #define readb(addr) eeh_readb((void*)(addr))
 #define readw(addr) eeh_readw((void*)(addr))
@@ -71,12 +78,16 @@
  * They are only used in practice for transferring buffers which
  * are arrays of bytes, and byte-swapping is not appropriate in
  * that case. - paulus
  */
-#define insb(port, buf, ns) _insb((u8 *)((port)+pci_io_base), (buf), (ns))
-#define outsb(port, buf, ns) _outsb((u8 *)((port)+pci_io_base), (buf), (ns))
-#define insw(port, buf, ns) _insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define outsw(port, buf, ns) _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define insl(port, buf, nl) _insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
-#define outsl(port, buf, nl) _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
+#define insb(port, buf, ns) eeh_insb((port), (buf), (ns))
+#define insw(port, buf, ns) eeh_insw_ns((port), (buf), (ns))
+#define insl(port, buf, nl) eeh_insl_ns((port), (buf), (nl))
+#define insw_ns(port, buf, ns) eeh_insw_ns((port), (buf), (ns))
+#define insl_ns(port, buf, nl) eeh_insl_ns((port), (buf), (nl))
+
+#define outsb(port, buf, ns) _outsb((u8 *)((port)+pci_io_base), (buf), (ns))
+#define outsw(port, buf, ns) _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
+#define outsl(port, buf, nl) _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
+
 #endif
 
 extern void _insb(volatile u8 *port, void *buf, int ns);
@@ -106,9 +117,7 @@
  * Neither do the standard versions now, these are just here
  * for older code.
  */
-#define insw_ns(port, buf, ns) _insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
 #define outsw_ns(port, buf, ns) _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define insl_ns(port, buf, nl) _insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
 #define outsl_ns(port, buf, nl) _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
 
 
@@ -177,6 +186,9 @@
 
 /*
  * 8, 16 and 32 bit, big and little endian I/O operations, with barrier.
+ * These routines do not perform EEH-related I/O address translation,
+ * and should not be used directly by device drivers. Use inb/readb
+ * instead.
  */
 static inline int in_8(volatile unsigned char *addr)
 {

From paulus at samba.org Wed Feb 4 13:22:27 2004
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 4 Feb 2004 13:22:27 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <1075856864.1449.205.camel@nighthawk>
References: <1075856864.1449.205.camel@nighthawk>
Message-ID: <16416.22371.237705.124944@cargo.ozlabs.ibm.com>

Dave Hansen writes:

> This function seems to be dead code now. I found it during a horribly
> confusing encounter with lmb.[ch] and CONFIG_MSCHUNKS.
>
> Does anyone know what an MSCHUNK is? It seems like it simplifies some
> of the cases. Does iSeries guarantee that the kernel's memory starts at
> 0 while pSeries doesn't?

A "Main Store chunk", I would think. With iSeries, the hypervisor
gives you memory in 256kB chunks (I think), which can be scattered
throughout physical memory. The CONFIG_MSCHUNKS stuff is a mapping
layer that is there to give the illusion to the main part of the
kernel that physical memory starts at 0 and is contiguous, when in
fact it's anything but.

A "LMB" is a "large memory block" AFAIK, and comes from the pSeries
side of things.

It does indeed seem that lmb_add_io is entirely unused. Does anyone
have any good reason for keeping it?

Paul.

From anton at samba.org Wed Feb 4 13:35:51 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 4 Feb 2004 13:35:51 +1100
Subject: RTAS error logging of EEH errors
In-Reply-To: <20040203183210.A27780@forte.austin.ibm.com>
References: <20040201073627.GE22694@krispykreme> <1075737704.8079.27.camel@mudbug.austin.ibm.com> <20040203183210.A27780@forte.austin.ibm.com>
Message-ID: <20040204023551.GD22694@krispykreme>

> Please don't push this patch, It'll mess up my patch which I'm sending
> out in a few minutes. Anton, if this still really bugs you, let me know,
> I'll fix it.

I think it's worth fixing, eventually we will be recovering from these
failures and we'll no doubt have forgotten about the stack usage.

Anton

From anton at samba.org Wed Feb 4 15:18:21 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 4 Feb 2004 15:18:21 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <1075856864.1449.205.camel@nighthawk>
References: <1075856864.1449.205.camel@nighthawk>
Message-ID: <20040204041820.GF22694@krispykreme>

Hi,

> This function seems to be dead code now. I found it during a horribly
> confusing encounter with lmb.[ch] and CONFIG_MSCHUNKS.

OK, you got me to look :)

The IO stuff was around from when we were using MSCHUNKS on pSeries. It
would make memory appear contiguous on machines with IO holes (POWER3
boxes). In the end it confused the hell out of Linux (from memory it was
the macros that wanted to know if a physical address was in an IO hole
or not).

I just went through and cleaned the lmb stuff up, it needs to be tested
on i and p, but I think it's a step in the right direction. It certainly
simplifies a lot of the code.

Thoughts?
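
[Editorial aside: the region bookkeeping that this cleanup touches rests on two small range tests, lmb_addrs_overlap() and lmb_addrs_adjacent(), which the patch moves from lmb.h into lmb.c. The userspace sketch below copies the logic of those two tests and adds a made-up helper, sketch_add(), to show how adjacent ranges might be merged into one region. The struct names, MAX_REGIONS and sketch_add() are assumptions for illustration only; this is not the kernel's lmb_add_region(), whose body is not shown in this message.]

```c
#include <assert.h>

#define MAX_REGIONS 8	/* hypothetical; the patch sets MAX_LMB_REGIONS to 128 */

struct region { unsigned long base, size; };
struct region_list { unsigned long cnt; struct region region[MAX_REGIONS]; };

/* Same test as lmb_addrs_overlap() in the patch. */
static int addrs_overlap(unsigned long base1, unsigned long size1,
			 unsigned long base2, unsigned long size2)
{
	return (base1 < (base2 + size2)) && (base2 < (base1 + size1));
}

/* Same convention as lmb_addrs_adjacent() in the patch:
 * 1 if range 2 starts where range 1 ends, -1 for the reverse, else 0. */
static int addrs_adjacent(unsigned long base1, unsigned long size1,
			  unsigned long base2, unsigned long size2)
{
	if (base2 == base1 + size1)
		return 1;
	else if (base1 == base2 + size2)
		return -1;
	return 0;
}

/* Hypothetical helper: add [base, base+size) to the list, growing an
 * existing region when the new range touches it.  Returns 0 on success,
 * -1 if the range overlaps an existing region or the list is full.
 * It does not re-merge when the new range bridges two regions. */
static int sketch_add(struct region_list *l, unsigned long base,
		      unsigned long size)
{
	unsigned long i;

	assert(size > 0);

	/* First pass: reject anything that overlaps an existing region. */
	for (i = 0; i < l->cnt; i++)
		if (addrs_overlap(l->region[i].base, l->region[i].size,
				  base, size))
			return -1;

	/* Second pass: merge into the first region the range touches. */
	for (i = 0; i < l->cnt; i++) {
		struct region *r = &l->region[i];
		int adj = addrs_adjacent(r->base, r->size, base, size);

		if (adj > 0) {		/* new range follows r */
			r->size += size;
			return 0;
		}
		if (adj < 0) {		/* new range precedes r */
			r->base = base;
			r->size += size;
			return 0;
		}
	}

	if (l->cnt == MAX_REGIONS)
		return -1;
	l->region[l->cnt].base = base;
	l->region[l->cnt].size = size;
	l->cnt++;
	return 0;
}
```

[For example, adding 0x1000..0x1fff and then 0x2000..0x2fff to an empty list leaves a single region of size 0x2000, while a later attempt to add a range starting at 0x2000 is rejected as overlapping.]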

Anton

--

- remove LMB_MEMORY_AREA, LMB_IO_AREA, we only allocate/reserve memory
  areas now
- remove lmb_property->type, lmb_region->iosize, lmb_region->lcd_size,
  no longer used
- bump number of regions to 128, we'll hit this limit sooner or later
  with our big boxes (if we have more than 64 PCI host bridges the
  reserved array will fill up for example)
- make all the lmb stuff __init
- no need to explicitly zero struct lmb lmb now we zero the BSS early
- we had two functions to dump the lmb array, kill one of them
- move the inline functions into lmb.c, they are only ever called from
  there

===== include/asm-ppc64/lmb.h 1.6 vs edited =====
--- 1.6/include/asm-ppc64/lmb.h	Fri Sep 13 21:24:37 2002
+++ edited/include/asm-ppc64/lmb.h	Wed Feb 4 15:10:52 2004
@@ -13,36 +13,29 @@
  * 2 of the License, or (at your option) any later version.
  */
 
-#include
+#include
 #include
 
 extern unsigned long reloc_offset(void);
 
-#define MAX_LMB_REGIONS 64
+#define MAX_LMB_REGIONS 128
 
 union lmb_reg_property {
 	struct reg_property32 addr32[MAX_LMB_REGIONS];
 	struct reg_property64 addr64[MAX_LMB_REGIONS];
 };
 
-#define LMB_MEMORY_AREA 1
-#define LMB_IO_AREA 2
-
 #define LMB_ALLOC_ANYWHERE 0
-#define LMB_ALLOC_FIRST4GBYTE (1UL<<32)
 
 struct lmb_property {
 	unsigned long base;
 	unsigned long physbase;
 	unsigned long size;
-	unsigned long type;
 };
 
 struct lmb_region {
 	unsigned long cnt;
 	unsigned long size;
-	unsigned long iosize;
-	unsigned long lcd_size; /* Least Common Denominator */
 	struct lmb_property region[MAX_LMB_REGIONS+1];
 };
 
@@ -53,63 +46,17 @@
 	struct lmb_region reserved;
 };
 
-extern struct lmb lmb;
-
-extern void lmb_init(void);
-extern void lmb_analyze(void);
-extern long lmb_add(unsigned long, unsigned long);
-#ifdef CONFIG_MSCHUNKS
-extern long lmb_add_io(unsigned long base, unsigned long size);
-#endif /* CONFIG_MSCHUNKS */
-extern long lmb_reserve(unsigned long, unsigned long);
-extern unsigned long lmb_alloc(unsigned long, unsigned long);
-extern unsigned long lmb_alloc_base(unsigned long, unsigned long, unsigned long);
-extern unsigned long lmb_phys_mem_size(void);
-extern unsigned long lmb_end_of_DRAM(void);
-extern unsigned long lmb_abs_to_phys(unsigned long);
-extern void lmb_dump(char *);
-
-static inline unsigned long
-lmb_addrs_overlap(unsigned long base1, unsigned long size1,
-		  unsigned long base2, unsigned long size2)
-{
-	return ((base1 < (base2+size2)) && (base2 < (base1+size1)));
-}
-
-static inline long
-lmb_regions_overlap(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
-{
-	unsigned long base1 = rgn->region[r1].base;
-	unsigned long size1 = rgn->region[r1].size;
-	unsigned long base2 = rgn->region[r2].base;
-	unsigned long size2 = rgn->region[r2].size;
-
-	return lmb_addrs_overlap(base1,size1,base2,size2);
-}
-
-static inline long
-lmb_addrs_adjacent(unsigned long base1, unsigned long size1,
-		   unsigned long base2, unsigned long size2)
-{
-	if ( base2 == base1 + size1 ) {
-		return 1;
-	} else if ( base1 == base2 + size2 ) {
-		return -1;
-	}
-	return 0;
-}
-
-static inline long
-lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
-{
-	unsigned long base1 = rgn->region[r1].base;
-	unsigned long size1 = rgn->region[r1].size;
-	unsigned long type1 = rgn->region[r1].type;
-	unsigned long base2 = rgn->region[r2].base;
-	unsigned long size2 = rgn->region[r2].size;
-	unsigned long type2 = rgn->region[r2].type;
+extern struct lmb lmb __initdata;
 
-	return (type1 == type2) && lmb_addrs_adjacent(base1,size1,base2,size2);
-}
+extern void __init lmb_init(void);
+extern void __init lmb_analyze(void);
+extern long __init lmb_add(unsigned long, unsigned long);
+extern long __init lmb_reserve(unsigned long, unsigned long);
+extern unsigned long __init lmb_alloc(unsigned long, unsigned long);
+extern unsigned long __init lmb_alloc_base(unsigned long, unsigned long,
+					   unsigned long);
+extern unsigned long __init lmb_phys_mem_size(void);
+extern unsigned long __init lmb_end_of_DRAM(void);
+extern unsigned long __init lmb_abs_to_phys(unsigned long);
 
 #endif /* _PPC64_LMB_H */
===== arch/ppc64/kernel/lmb.c 1.6 vs edited =====
--- 1.6/arch/ppc64/kernel/lmb.c	Tue Feb 25 20:38:45 2003
+++ edited/arch/ppc64/kernel/lmb.c	Wed Feb 4 15:10:44 2004
@@ -1,5 +1,4 @@
 /*
- *
  * Procedures for interfacing to Open Firmware.
  *
  * Peter Bergner, IBM Corp. June 2001.
@@ -13,46 +12,63 @@
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
-#include
 
-extern unsigned long klimit;
-extern unsigned long reloc_offset(void);
+struct lmb lmb __initdata;
+
+static unsigned long __init
+lmb_addrs_overlap(unsigned long base1, unsigned long size1,
+		  unsigned long base2, unsigned long size2)
+{
+	return ((base1 < (base2+size2)) && (base2 < (base1+size1)));
+}
 
+static long __init
+lmb_addrs_adjacent(unsigned long base1, unsigned long size1,
+		   unsigned long base2, unsigned long size2)
+{
+	if (base2 == base1 + size1)
+		return 1;
+	else if (base1 == base2 + size2)
+		return -1;
 
-static long lmb_add_region(struct lmb_region *, unsigned long, unsigned long, unsigned long);
+	return 0;
+}
 
-struct lmb lmb = {
-	0, 0,
-	{0,0,0,0,{{0,0,0}}},
-	{0,0,0,0,{{0,0,0}}}
-};
+static long __init
+lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
+{
+	unsigned long base1 = rgn->region[r1].base;
+	unsigned long size1 = rgn->region[r1].size;
+	unsigned long base2 = rgn->region[r2].base;
+	unsigned long size2 = rgn->region[r2].size;
 
+	return lmb_addrs_adjacent(base1, size1, base2, size2);
+}
 
 /* Assumption: base addr of region 1 < base addr of region 2 */
-static void
+static void __init
 lmb_coalesce_regions(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
 {
 	unsigned long i;
 
 	rgn->region[r1].size += rgn->region[r2].size;
-	for (i=r2; i < rgn->cnt-1 ;i++) {
+	for (i=r2; i < rgn->cnt-1; i++) {
 		rgn->region[i].base = rgn->region[i+1].base;
 		rgn->region[i].physbase = rgn->region[i+1].physbase;
 		rgn->region[i].size = rgn->region[i+1].size;
-		rgn->region[i].type = rgn->region[i+1].type;
 	}
 	rgn->cnt--;
 }
 
-
 /* This routine called with relocation disabled. */
-void
+void __init
 lmb_init(void)
 {
 	unsigned long offset = reloc_offset();
@@ -63,13 +79,11 @@
 	 */
 	_lmb->memory.region[0].base = 0;
 	_lmb->memory.region[0].size = 0;
-	_lmb->memory.region[0].type = LMB_MEMORY_AREA;
 	_lmb->memory.cnt = 1;
 
 	/* Ditto. */
 	_lmb->reserved.region[0].base = 0;
 	_lmb->reserved.region[0].size = 0;
-	_lmb->reserved.region[0].type = LMB_MEMORY_AREA;
 	_lmb->reserved.cnt = 1;
 }
 
@@ -89,12 +103,11 @@
 }
 
 /* This routine called with relocation disabled. */
-void
+void __init
 lmb_analyze(void)
 {
 	unsigned long i;
 	unsigned long mem_size = 0;
-	unsigned long io_size = 0;
 	unsigned long size_mask = 0;
 	unsigned long offset = reloc_offset();
 	struct lmb *_lmb = PTRRELOC(&lmb);
@@ -102,13 +115,9 @@
 	unsigned long physbase = 0;
 #endif
 
-	for (i=0; i < _lmb->memory.cnt ;i++) {
-		unsigned long lmb_type = _lmb->memory.region[i].type;
+	for (i=0; i < _lmb->memory.cnt; i++) {
 		unsigned long lmb_size;
 
-		if ( lmb_type != LMB_MEMORY_AREA )
-			continue;
-
 		lmb_size = _lmb->memory.region[i].size;
 
 #ifdef CONFIG_MSCHUNKS
@@ -121,84 +130,20 @@
 		size_mask |= lmb_size;
 	}
 
-#ifdef CONFIG_MSCHUNKS
-	for (i=0; i < _lmb->memory.cnt ;i++) {
-		unsigned long lmb_type = _lmb->memory.region[i].type;
-		unsigned long lmb_size;
-
-		if ( lmb_type != LMB_IO_AREA )
-			continue;
-
-		lmb_size = _lmb->memory.region[i].size;
-
-		_lmb->memory.region[i].physbase = physbase;
-		physbase += lmb_size;
-		io_size += lmb_size;
-		size_mask |= lmb_size;
-	}
-#endif /* CONFIG_MSCHUNKS */
-
 	_lmb->memory.size = mem_size;
-	_lmb->memory.iosize = io_size;
-	_lmb->memory.lcd_size = (1UL << cnt_trailing_zeros(size_mask));
-}
-
-/* This routine called with relocation disabled. */
-long
-lmb_add(unsigned long base, unsigned long size)
-{
-	unsigned long offset = reloc_offset();
-	struct lmb *_lmb = PTRRELOC(&lmb);
-	struct lmb_region *_rgn = &(_lmb->memory);
-
-	/* On pSeries LPAR systems, the first LMB is our RMO region.
*/ - if ( base == 0 ) - _lmb->rmo_size = size; - - return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA); - } -#ifdef CONFIG_MSCHUNKS /* This routine called with relocation disabled. */ -long -lmb_add_io(unsigned long base, unsigned long size) -{ - unsigned long offset = reloc_offset(); - struct lmb *_lmb = PTRRELOC(&lmb); - struct lmb_region *_rgn = &(_lmb->memory); - - return lmb_add_region(_rgn, base, size, LMB_IO_AREA); - -} -#endif /* CONFIG_MSCHUNKS */ - -long -lmb_reserve(unsigned long base, unsigned long size) -{ - unsigned long offset = reloc_offset(); - struct lmb *_lmb = PTRRELOC(&lmb); - struct lmb_region *_rgn = &(_lmb->reserved); - - return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA); -} - -/* This routine called with relocation disabled. */ -static long -lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size, - unsigned long type) +static long __init +lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size) { unsigned long i, coalesced = 0; long adjacent; /* First try and coalesce this LMB with another. */ - for (i=0; i < rgn->cnt ;i++) { + for (i=0; i < rgn->cnt; i++) { unsigned long rgnbase = rgn->region[i].base; unsigned long rgnsize = rgn->region[i].size; - unsigned long rgntype = rgn->region[i].type; - - if ( rgntype != type ) - continue; adjacent = lmb_addrs_adjacent(base,size,rgnbase,rgnsize); if ( adjacent > 0 ) { @@ -227,17 +172,15 @@ } /* Couldn't coalesce the LMB, so add it to the sorted table. 
*/ - for (i=rgn->cnt-1; i >= 0 ;i--) { + for (i=rgn->cnt-1; i >= 0; i--) { if (base < rgn->region[i].base) { rgn->region[i+1].base = rgn->region[i].base; rgn->region[i+1].physbase = rgn->region[i].physbase; rgn->region[i+1].size = rgn->region[i].size; - rgn->region[i+1].type = rgn->region[i].type; } else { rgn->region[i+1].base = base; rgn->region[i+1].physbase = lmb_abs_to_phys(base); rgn->region[i+1].size = size; - rgn->region[i+1].type = type; break; } } @@ -246,12 +189,38 @@ return 0; } -long +/* This routine called with relocation disabled. */ +long __init +lmb_add(unsigned long base, unsigned long size) +{ + unsigned long offset = reloc_offset(); + struct lmb *_lmb = PTRRELOC(&lmb); + struct lmb_region *_rgn = &(_lmb->memory); + + /* On pSeries LPAR systems, the first LMB is our RMO region. */ + if ( base == 0 ) + _lmb->rmo_size = size; + + return lmb_add_region(_rgn, base, size); + +} + +long __init +lmb_reserve(unsigned long base, unsigned long size) +{ + unsigned long offset = reloc_offset(); + struct lmb *_lmb = PTRRELOC(&lmb); + struct lmb_region *_rgn = &(_lmb->reserved); + + return lmb_add_region(_rgn, base, size); +} + +long __init lmb_overlaps_region(struct lmb_region *rgn, unsigned long base, unsigned long size) { unsigned long i; - for (i=0; i < rgn->cnt ;i++) { + for (i=0; i < rgn->cnt; i++) { unsigned long rgnbase = rgn->region[i].base; unsigned long rgnsize = rgn->region[i].size; if ( lmb_addrs_overlap(base,size,rgnbase,rgnsize) ) { @@ -262,13 +231,13 @@ return (i < rgn->cnt) ? 
i : -1; } -unsigned long +unsigned long __init lmb_alloc(unsigned long size, unsigned long align) { return lmb_alloc_base(size, align, LMB_ALLOC_ANYWHERE); } -unsigned long +unsigned long __init lmb_alloc_base(unsigned long size, unsigned long align, unsigned long max_addr) { long i, j; @@ -278,13 +247,9 @@ struct lmb_region *_mem = &(_lmb->memory); struct lmb_region *_rsv = &(_lmb->reserved); - for (i=_mem->cnt-1; i >= 0 ;i--) { + for (i=_mem->cnt-1; i >= 0; i--) { unsigned long lmbbase = _mem->region[i].base; unsigned long lmbsize = _mem->region[i].size; - unsigned long lmbtype = _mem->region[i].type; - - if ( lmbtype != LMB_MEMORY_AREA ) - continue; if ( max_addr == LMB_ALLOC_ANYWHERE ) base = _ALIGN_DOWN(lmbbase+lmbsize-size, align); @@ -305,12 +270,12 @@ if ( i < 0 ) return 0; - lmb_add_region(_rsv, base, size, LMB_MEMORY_AREA); + lmb_add_region(_rsv, base, size); return base; } -unsigned long +unsigned long __init lmb_phys_mem_size(void) { unsigned long offset = reloc_offset(); @@ -327,7 +292,7 @@ #endif /* CONFIG_MSCHUNKS */ } -unsigned long +unsigned long __init lmb_end_of_DRAM(void) { unsigned long offset = reloc_offset(); @@ -335,9 +300,7 @@ struct lmb_region *_mem = &(_lmb->memory); unsigned long idx; - for(idx=_mem->cnt-1; idx >= 0 ;idx--) { - if ( _mem->region[idx].type != LMB_MEMORY_AREA ) - continue; + for(idx=_mem->cnt-1; idx >= 0; idx--) { #ifdef CONFIG_MSCHUNKS return (_mem->region[idx].physbase + _mem->region[idx].size); #else @@ -348,8 +311,7 @@ return 0; } - -unsigned long +unsigned long __init lmb_abs_to_phys(unsigned long aa) { unsigned long i, pa = aa; @@ -357,7 +319,7 @@ struct lmb *_lmb = PTRRELOC(&lmb); struct lmb_region *_mem = &(_lmb->memory); - for (i=0; i < _mem->cnt ;i++) { + for (i=0; i < _mem->cnt; i++) { unsigned long lmbbase = _mem->region[i].base; unsigned long lmbsize = _mem->region[i].size; if ( lmb_addrs_overlap(aa,1,lmbbase,lmbsize) ) { @@ -367,48 +329,4 @@ } return pa; -} - -void -lmb_dump(char *str) -{ - unsigned long i; - 
- udbg_printf("\nlmb_dump: %s\n", str); - udbg_printf(" debug = %s\n", - (lmb.debug) ? "TRUE" : "FALSE"); - udbg_printf(" memory.cnt = %d\n", - lmb.memory.cnt); - udbg_printf(" memory.size = 0x%lx\n", - lmb.memory.size); - udbg_printf(" memory.lcd_size = 0x%lx\n", - lmb.memory.lcd_size); - for (i=0; i < lmb.memory.cnt ;i++) { - udbg_printf(" memory.region[%d].base = 0x%lx\n", - i, lmb.memory.region[i].base); - udbg_printf(" .physbase = 0x%lx\n", - lmb.memory.region[i].physbase); - udbg_printf(" .size = 0x%lx\n", - lmb.memory.region[i].size); - udbg_printf(" .type = 0x%lx\n", - lmb.memory.region[i].type); - } - - udbg_printf("\n"); - udbg_printf(" reserved.cnt = %d\n", - lmb.reserved.cnt); - udbg_printf(" reserved.size = 0x%lx\n", - lmb.reserved.size); - udbg_printf(" reserved.lcd_size = 0x%lx\n", - lmb.reserved.lcd_size); - for (i=0; i < lmb.reserved.cnt ;i++) { - udbg_printf(" reserved.region[%d].base = 0x%lx\n", - i, lmb.reserved.region[i].base); - udbg_printf(" .physbase = 0x%lx\n", - lmb.reserved.region[i].physbase); - udbg_printf(" .size = 0x%lx\n", - lmb.reserved.region[i].size); - udbg_printf(" .type = 0x%lx\n", - lmb.reserved.region[i].type); - } } ===== arch/ppc64/kernel/prom.c 1.52 vs edited ===== --- 1.52/arch/ppc64/kernel/prom.c Sun Feb 1 13:40:21 2004 +++ edited/arch/ppc64/kernel/prom.c Wed Feb 4 14:42:19 2004 @@ -668,9 +668,6 @@ prom_print(RELOC(" memory.size = 0x")); prom_print_hex(_lmb->memory.size); prom_print_nl(); - prom_print(RELOC(" memory.lcd_size = 0x")); - prom_print_hex(_lmb->memory.lcd_size); - prom_print_nl(); for (i=0; i < _lmb->memory.cnt ;i++) { prom_print(RELOC(" memory.region[0x")); prom_print_hex(i); @@ -683,9 +680,6 @@ prom_print(RELOC(" .size = 0x")); prom_print_hex(_lmb->memory.region[i].size); prom_print_nl(); - prom_print(RELOC(" .type = 0x")); - prom_print_hex(_lmb->memory.region[i].type); - prom_print_nl(); } prom_print_nl(); @@ -695,9 +689,6 @@ prom_print(RELOC(" reserved.size = 0x")); prom_print_hex(_lmb->reserved.size); 
prom_print_nl(); - prom_print(RELOC(" reserved.lcd_size = 0x")); - prom_print_hex(_lmb->reserved.lcd_size); - prom_print_nl(); for (i=0; i < _lmb->reserved.cnt ;i++) { prom_print(RELOC(" reserved.region[0x")); prom_print_hex(i); @@ -709,9 +700,6 @@ prom_print_nl(); prom_print(RELOC(" .size = 0x")); prom_print_hex(_lmb->reserved.region[i].size); - prom_print_nl(); - prom_print(RELOC(" .type = 0x")); - prom_print_hex(_lmb->reserved.region[i].type); prom_print_nl(); } } ===== arch/ppc64/mm/numa.c 1.16 vs edited ===== --- 1.16/arch/ppc64/mm/numa.c Tue Jan 20 13:07:09 2004 +++ edited/arch/ppc64/mm/numa.c Wed Feb 4 15:12:02 2004 @@ -257,10 +257,6 @@ for (i = 0; i < lmb.memory.cnt; i++) { unsigned long physbase, size; - unsigned long type = lmb.memory.region[i].type; - - if (type != LMB_MEMORY_AREA) - continue; physbase = lmb.memory.region[i].physbase; size = lmb.memory.region[i].size; ===== arch/ppc64/mm/init.c 1.55 vs edited ===== --- 1.55/arch/ppc64/mm/init.c Tue Jan 20 13:07:09 2004 +++ edited/arch/ppc64/mm/init.c Wed Feb 4 15:11:54 2004 @@ -702,10 +702,6 @@ /* add all physical memory to the bootmem map */ for (i=0; i < lmb.memory.cnt; i++) { unsigned long physbase, size; - unsigned long type = lmb.memory.region[i].type; - - if ( type != LMB_MEMORY_AREA ) - continue; physbase = lmb.memory.region[i].physbase; size = lmb.memory.region[i].size; @@ -746,11 +742,7 @@ for (i=0; i < lmb.memory.cnt; i++) { unsigned long physbase, size; - unsigned long type = lmb.memory.region[i].type; struct kcore_list *kcore_mem; - - if (type != LMB_MEMORY_AREA) - continue; physbase = lmb.memory.region[i].physbase; size = lmb.memory.region[i].size; ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From olof at austin.ibm.com Wed Feb 4 15:18:37 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Tue, 3 Feb 2004 22:18:37 -0600 (CST)
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <401FE067.80902@redhat.com>
Message-ID:

On Tue, 3 Feb 2004, Julie DeWandel wrote:

> Hi Olof,
>
> The patch wasn't attached to your email so I included it along with
> comments below (my comments preceded by "JSD:").

Ugh. This is the second time I've forgotten to attach a patch. Not sure
what is going on here...

Thanks for taking time to review! Comments follow each remark below, and
a new patch is attached (this time for real).

-Olof

+static inline void pSeries_unlock_hpte(HPTE *hptep)
+{
+	unsigned long *word = &hptep->dw0.dword0;
+
+	asm volatile("lwsync":::"memory");
+	clear_bit(HPTE_LOCK_BIT, word);
+}

JSD: Other places within the kernel do an smp_mb__before_clear_bit() when
JSD: clearing a bit representing a lock. Would that be a better choice here
JSD: (it resolves to a sync)?

I just copied the 2.6 code here. lwsync is cheaper on Power4 and beyond,
and on Power3/RS64 it equals a sync. The cheapest sufficient
synchronization is always to be preferred.

JSD: I should have added that the clearing of the _PAGE_BUSY bit should be
JSD: done using an atomic op (clear_bit() routine)

Not needed, since no one is modifying the contents of the word if the bit
is set. I.e., similar to spin_unlock().

JSD: The inline assembly code will not set _PAGE_BUSY if it finds that
JSD: the user's access rights doesn't allow access to the page.
JSD: So, if access_ok = 0, _PAGE_BUSY is not set, and the code may
JSD: branch to out_unlock where _PAGE_BUSY is unconditionally cleared.
JSD: It could be possible for another processor to coincidently have
JSD: set _PAGE_BUSY in the PTE but then have this processor clear it
JSD: before the other processor wanted it clear. Do you agree this can
JSD: happen?

Yes, it's a small window but it might happen. See below.

JSD: Furthermore, the "ea" condition check (above) might yield false and
JSD: the code then continue as though access_ok were true, modifying the
JSD: PTE without the _PAGE_BUSY bit set. Seems bad.

Right. The proper thing to do is to set _PAGE_BUSY always, and clear it if
access is denied. Fixed.

JSD: NOTE: at this point, the *ptep = new_pte just unlocked the PTE by
JSD: clearing _PAGE_BUSY. The code then goes on to clear it again, thereby
JSD: possibly rendering unsafe updates that some other processor might be
JSD: doing after it thought it had set _PAGE_BUSY. There is a small window
JSD: here.

Fixed as well. There's only a window at the second update of *ptep; in
the first one _PAGE_BUSY is still set.

-	spin_unlock(&hash_table_lock[lock_slot].lock);
+out_unlock:
+	smp_wmb();

JSD: I believe an smp_mb__before_clear_bit() would be a better choice, or
JSD: at least an lwsync because this is an unlock.

I'm not sure we really need any synchronization at all, since the only
memory that's protected by the _PAGE_BUSY bit is the word itself, so the
lock and contents change will be seen at the same time (or at least in the
right order). smp_wmb() is cheaper (eieio) than smp_mb() (sync), so either
the smp_wmb() should stay or it should be taken out altogether. I'd rather
err on the side of caution, so I'll keep it.

JSD: In the above two hunks of code, a pte_update is done which returns
JSD: the old pte value. This value is checked to determine if the hpte
JSD: should be invalidated. However, there is no lock held between the
JSD: time the pte value is read and the time the hpte is invalidated.
JSD: The hpte_invalidate routine doesn't check to make sure the va
JSD: passed in is really the one being invalidated in the slot -- it just
JSD: assumes the slot, etc are enough to locate it. So we might be
JSD: invalidating the wrong thing here. Probably a don't care, but thought
JSD: I'd ask.

Yes, this is a general behaviour though, since entries are not updated
whenever they're thrown out of the HPTE. We always risk invalidating an
entry that's been reused. In the grand scheme of things it's not a big
problem, since it should happen fairly rarely, and when it happens the
invalidated entry will be faulted right back in. 2.6 is the same.

	/* Invalidate the hpte. */
	hptep->dw0.dword0 = 0;

JSD: It would really be nice to add a comment here that the above
JSD: assignment statement is also unlocking the hpte as well.

Done.

JSD: Two questions here. (1) Shouldn't interrupts be disabled for the
JSD: write_lock/unlock here regardless of what processor we are running
JSD: on? (2) I don't see how this code is preventing another processor
JSD: from grabbing the read_lock immediately after this processor has
JSD: checked to make sure it isn't held.
JSD: Better question for (1) is why are interrupts being disabled here?
JSD: Can this routine be called from interrupt context?

Without disabling interrupts, there's a risk of deadlocks if the
processor gets interrupted and the interrupt handler causes a page fault
that needs to be resolved. Since the lock is held for writing, the handler
will wait forever when locking for reading. This is actually similar to
the original deadlock that this whole patch is meant to remove, but the
window is really small (just a few instructions) now. Likewise, an
interrupt on a different processor is not a problem, since forward
progress is still guaranteed on the processor holding it for writing, so
the reader will eventually get the lock.

And on the second question: This is the trick with using the rwlock.
There's no need to _prevent_ reading; all I needed to know is that all
readers that started before pte_free_sync() have completed. No one can get
a reference to a PTE after it's been freed, so the only risk is if
someone has walked the table right during pte_free() and is still holding
a reference to it.
hash_page will hold the lock for reading while this takes place, so all
we need to know is that we _could_ take the lock for writing (i.e. no
readers for the table). Even if another CPU comes in and traverses the
tree we're safe, since there's no way they can end up in that PTE (it's
been removed from the tree).

This is all an ad-hoc solution since there's no RCU in 2.4, so I needed
another lightweight synchronization method.

JSD: Since the pte_freelist_cur is a per-processor structure, I don't
JSD: think you need the batch->lock at all. What other thing could be
JSD: running on this same processor at the same time?

I thought there was a risk that pte_free() could be called from interrupt
context through pte_alloc(), but on closer examination it seems like it
shouldn't happen. If so, the locks can come out.

JSD: Why were these spinlocks changed to _irq? I noticed this change was
JSD: not present in the 2.6 code.

No, it's my mistake. It was part of the workaround patch that for some
reason got left in the patch I attached to the bug. They were not there in
the patch that I was supposed to have posted to the list.

JSD: Same question about the _irq addition. Doesn't seem necessary. I'm
JSD: probably missing something -- please explain.

Same answer as above; dirty patch.

JSD: Why was the UL (unsigned long) dropped from the bit definitions?

Sync with 2.6; at one time I used the definitions in assembly and the UL
syntax is illegal there. It's no longer needed.

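[Editor's sketch, not part of the posted patch.] The per-HPTE bit lock discussed above (test-and-set to acquire, a barrier followed by a plain clear to release) can be illustrated in userspace with C11 atomics. This is only a rough analogue under stated assumptions: LOCK_BIT, bit_lock(), bit_trylock() and bit_unlock() are hypothetical names, and the acquire/release orderings stand in for the kernel's test_and_set_bit() and the lwsync-before-clear_bit() sequence rather than reproducing them exactly.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for HPTE_LOCK_BIT (the patch uses bit 3). */
#define LOCK_BIT  3UL
#define LOCK_MASK (1UL << LOCK_BIT)

/* Try once to take the lock bit; returns 1 on success, 0 if already held. */
static int bit_trylock(_Atomic unsigned long *word)
{
	unsigned long old;

	/* fetch_or returns the previous value, like test_and_set_bit(). */
	old = atomic_fetch_or_explicit(word, LOCK_MASK, memory_order_acquire);
	return !(old & LOCK_MASK);
}

/* Spin until the lock bit is ours. */
static void bit_lock(_Atomic unsigned long *word)
{
	while (!bit_trylock(word)) {
		/* Wait with plain loads so we do not hammer the line with RMWs. */
		while (atomic_load_explicit(word, memory_order_relaxed) & LOCK_MASK)
			;
	}
}

/*
 * Release: earlier stores must be visible before the bit is seen clear.
 * The release ordering plays roughly the role of the lwsync in the patch.
 */
static void bit_unlock(_Atomic unsigned long *word)
{
	atomic_fetch_and_explicit(word, ~LOCK_MASK, memory_order_release);
}
```

The other bits of the word are left untouched by lock and unlock, which is what lets the lock live inside the first doubleword of the entry itself; it is also why writing zero to the whole word, as the invalidate path does, releases the lock as a side effect.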
-------------- next part -------------- ===== arch/ppc64/kernel/htab.c 1.11 vs edited ===== --- 1.11/arch/ppc64/kernel/htab.c Thu Dec 18 16:13:25 2003 +++ edited/arch/ppc64/kernel/htab.c Tue Feb 3 22:01:59 2004 @@ -48,6 +48,29 @@ #include #include +#define HPTE_LOCK_BIT 3 + +static inline void pSeries_lock_hpte(HPTE *hptep) +{ + unsigned long *word = &hptep->dw0.dword0; + + while (1) { + if (!test_and_set_bit(HPTE_LOCK_BIT, word)) + break; + while(test_bit(HPTE_LOCK_BIT, word)) + cpu_relax(); + } +} + +static inline void pSeries_unlock_hpte(HPTE *hptep) +{ + unsigned long *word = &hptep->dw0.dword0; + + asm volatile("lwsync":::"memory"); + clear_bit(HPTE_LOCK_BIT, word); +} + + /* * Note: pte --> Linux PTE * HPTE --> PowerPC Hashed Page Table Entry @@ -64,6 +87,7 @@ extern unsigned long _SDR1; extern unsigned long klimit; +extern rwlock_t pte_hash_lock[] __cacheline_aligned_in_smp; void make_pte(HPTE *htab, unsigned long va, unsigned long pa, int mode, unsigned long hash_mask, int large); @@ -320,51 +344,73 @@ unsigned long va, vpn; unsigned long newpp, prpn; unsigned long hpteflags, lock_slot; + unsigned long access_ok, tmp; long slot; pte_t old_pte, new_pte; + int ret = 0; /* Search the Linux page table for a match with va */ va = (vsid << 28) | (ea & 0x0fffffff); vpn = va >> PAGE_SHIFT; lock_slot = get_lock_slot(vpn); - /* Acquire the hash table lock to guarantee that the linux - * pte we fetch will not change + /* + * Check the user's access rights to the page. If access should be + * prevented then send the problem up to do_page_fault. */ - spin_lock(&hash_table_lock[lock_slot].lock); - + /* * Check the user's access rights to the page. If access should be * prevented then send the problem up to do_page_fault. */ -#ifdef CONFIG_SHARED_MEMORY_ADDRESSING + access |= _PAGE_PRESENT; - if (unlikely(access & ~(pte_val(*ptep)))) { + + /* We'll do access checking and _PAGE_BUSY setting in assembly, since + * it needs to be atomic. 
+ */ + + __asm__ __volatile__ ("\n + 1: ldarx %0,0,%3\n + # Check if PTE is busy\n + andi. %1,%0,%4\n + bne- 1b\n + ori %0,%0,%4\n + # Write the linux PTE atomically (setting busy)\n + stdcx. %0,0,%3\n + bne- 1b\n + # Check access rights (access & ~(pte_val(*ptep)))\n + andc. %1,%2,%0\n + bne- 2f\n + li %1,1\n + b 3f\n + 2: li %1,0\n + 3:" + : "=r" (old_pte), "=r" (access_ok) + : "r" (access), "r" (ptep), "i" (_PAGE_BUSY) + : "cr0", "memory"); + +#ifdef CONFIG_SHARED_MEMORY_ADDRESSING + if (unlikely(!access_ok)) { if(!(((ea >> SMALLOC_EA_SHIFT) == (SMALLOC_START >> SMALLOC_EA_SHIFT)) && ((current->thread.flags) & PPC_FLAG_SHARED))) { - spin_unlock(&hash_table_lock[lock_slot].lock); - return 1; + ret = 1; + goto out_unlock; } } #else - access |= _PAGE_PRESENT; - if (unlikely(access & ~(pte_val(*ptep)))) { - spin_unlock(&hash_table_lock[lock_slot].lock); - return 1; + if (unlikely(!access_ok)) { + ret = 1; + goto out_unlock; } #endif /* - * We have found a pte (which was present). - * The spinlocks prevent this status from changing - * The hash_table_lock prevents the _PAGE_HASHPTE status - * from changing (RPN, DIRTY and ACCESSED too) - * The page_table_lock prevents the pte from being - * invalidated or modified - */ - - /* + * We have found a proper pte. The hash_table_lock protects + * the pte from deallocation and the _PAGE_BUSY bit protects + * the contents of the PTE from changing. + * * At this point, we have a pte (old_pte) which can be used to build * or update an HPTE. 
There are 2 cases: * @@ -385,7 +431,7 @@ else pte_val(new_pte) |= _PAGE_ACCESSED; - newpp = computeHptePP(pte_val(new_pte)); + newpp = computeHptePP(pte_val(new_pte) & ~_PAGE_BUSY); /* Check if pte already has an hpte (case 2) */ if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) { @@ -400,12 +446,13 @@ slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12; - /* XXX fix large pte flag */ + /* XXX fix large pte flag */ if (ppc_md.hpte_updatepp(slot, secondary, newpp, va, 0) == -1) { pte_val(old_pte) &= ~_PAGE_HPTEFLAGS; } else { if (!pte_same(old_pte, new_pte)) { + /* _PAGE_BUSY is still set in new_pte */ *ptep = new_pte; } } @@ -425,12 +472,19 @@ pte_val(new_pte) |= ((slot<<12) & (_PAGE_GROUP_IX | _PAGE_SECONDARY)); + smp_wmb(); + /* _PAGE_BUSY is not set in new_pte */ *ptep = new_pte; + + return 0; } - spin_unlock(&hash_table_lock[lock_slot].lock); +out_unlock: + smp_wmb(); - return 0; + pte_val(*ptep) &= ~_PAGE_BUSY; + + return ret; } /* @@ -497,11 +551,14 @@ pgdir = mm->pgd; if (pgdir == NULL) return 1; - /* - * Lock the Linux page table to prevent mmap and kswapd - * from modifying entries while we search and update + /* The pte_hash_lock is used to block any PTE deallocations + * while we walk the tree and use the entry. While technically + * we both read and write the PTE entry while holding the read + * lock, the _PAGE_BUSY bit will block pte_update()s to the + * specific entry. 
*/ - spin_lock(&mm->page_table_lock); + + read_lock(&pte_hash_lock[smp_processor_id()]); ptep = find_linux_pte(pgdir, ea); /* @@ -514,8 +571,7 @@ /* If no pte, send the problem up to do_page_fault */ ret = 1; } - - spin_unlock(&mm->page_table_lock); + read_unlock(&pte_hash_lock[smp_processor_id()]); return ret; } @@ -540,8 +596,6 @@ lock_slot = get_lock_slot(vpn); hash = hpt_hash(vpn, large); - spin_lock_irqsave(&hash_table_lock[lock_slot].lock, flags); - pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15; if (secondary) hash = ~hash; @@ -551,8 +605,6 @@ if (pte_val(pte) & _PAGE_HASHPTE) { ppc_md.hpte_invalidate(slot, secondary, va, large, local); } - - spin_unlock_irqrestore(&hash_table_lock[lock_slot].lock, flags); } long plpar_pte_enter(unsigned long flags, @@ -787,6 +839,8 @@ avpn = vpn >> 11; + pSeries_lock_hpte(hptep); + dw0 = hptep->dw0.dw0; /* @@ -794,9 +848,13 @@ * the AVPN, hash group, and valid bits. By doing it this way, * it is common with the pSeries LPAR optimal path. */ - if (dw0.bolted) return; + if (dw0.bolted) { + pSeries_unlock_hpte(hptep); - /* Invalidate the hpte. */ + return; + } + + /* Invalidate the hpte. This clears the lock as well. 
*/ hptep->dw0.dword0 = 0; /* Invalidate the tlb */ @@ -875,6 +933,8 @@ avpn = vpn >> 11; + pSeries_lock_hpte(hptep); + dw0 = hptep->dw0.dw0; if ((dw0.avpn == avpn) && (dw0.v) && (dw0.h == secondary)) { @@ -900,10 +960,14 @@ hptep->dw0.dw0 = dw0; __asm__ __volatile__ ("ptesync" : : : "memory"); + + pSeries_unlock_hpte(hptep); return 0; } + pSeries_unlock_hpte(hptep); + return -1; } @@ -1062,9 +1126,11 @@ dw0 = hptep->dw0.dw0; if (!dw0.v) { /* retry with lock held */ + pSeries_lock_hpte(hptep); dw0 = hptep->dw0.dw0; if (!dw0.v) break; + pSeries_unlock_hpte(hptep); } hptep++; } @@ -1079,9 +1145,11 @@ dw0 = hptep->dw0.dw0; if (!dw0.v) { /* retry with lock held */ + pSeries_lock_hpte(hptep); dw0 = hptep->dw0.dw0; if (!dw0.v) break; + pSeries_unlock_hpte(hptep); } hptep++; } @@ -1304,9 +1372,11 @@ if (dw0.v && !dw0.bolted) { /* retry with lock held */ + pSeries_lock_hpte(hptep); dw0 = hptep->dw0.dw0; if (dw0.v && !dw0.bolted) break; + pSeries_unlock_hpte(hptep); } slot_offset++; ===== arch/ppc64/mm/init.c 1.8 vs edited ===== --- 1.8/arch/ppc64/mm/init.c Tue Jan 6 17:54:44 2004 +++ edited/arch/ppc64/mm/init.c Tue Feb 3 21:49:59 2004 @@ -104,16 +104,94 @@ */ mmu_gather_t mmu_gathers[NR_CPUS]; +/* PTE free batching structures. We need a lock since not all + * operations take place under page_table_lock. Keep it per-CPU + * to avoid bottlenecks. + */ + +struct pte_freelist_batch ____cacheline_aligned pte_freelist_cur[NR_CPUS] __cacheline_aligned_in_smp; +rwlock_t pte_hash_lock[NR_CPUS] __cacheline_aligned_in_smp = { [0 ... NR_CPUS-1] = RW_LOCK_UNLOCKED }; + +unsigned long pte_freelist_forced_free; + +static inline void pte_free_sync(void) +{ + unsigned long flags; + int i; + + /* All we need to know is that we can get the write lock if + * we wanted to, i.e. that no hash_page()s are holding it for reading. + * If none are reading, that means there's no currently executing + * hash_page() that might be working on one of the PTE's that will + * be deleted. 
Likewise, if there is a reader, we need to get the + * write lock to know when it releases the lock. + */ + + for (i = 0; i < smp_num_cpus; i++) + if (is_read_locked(&pte_hash_lock[i])) { + /* So we don't deadlock with a reader on current cpu */ + if(i == smp_processor_id()) + local_irq_save(flags); + + write_lock(&pte_hash_lock[i]); + write_unlock(&pte_hash_lock[i]); + + if(i == smp_processor_id()) + local_irq_restore(flags); + } +} + + +/* This is only called when we are critically out of memory + * (and fail to get a page in pte_free_tlb). + */ +void pte_free_now(pte_t *pte) +{ + pte_freelist_forced_free++; + + pte_free_sync(); + + pte_free_kernel(pte); +} + +/* Deallocates the pte-free batch after syncronizing with readers of + * any page tables. + */ +void pte_free_batch(void **batch, int size) +{ + unsigned int i; + + pte_free_sync(); + + for (i = 0; i < size; i++) + pte_free_kernel(batch[i]); + + free_page((unsigned long)batch); +} + + int do_check_pgt_cache(int low, int high) { int freed = 0; + struct pte_freelist_batch *batch; + + /* We use this function to push the current pte free batch to be + * deallocated, since do_check_pgt_cache() is callEd at the end of each + * free_one_pgd() and other parts of VM relies on all PTE's being + * properly freed upon return from that function. 
+ */ + + batch = &pte_freelist_cur[smp_processor_id()]; + + if(batch->entry) { + pte_free_batch(batch->entry, batch->index); + batch->entry = NULL; + } if (pgtable_cache_size > high) { do { if (pgd_quicklist) free_page((unsigned long)pgd_alloc_one_fast(0)), ++freed; - if (pmd_quicklist) - free_page((unsigned long)pmd_alloc_one_fast(0, 0)), ++freed; if (pte_quicklist) free_page((unsigned long)pte_alloc_one_fast(0, 0)), ++freed; } while (pgtable_cache_size > low); @@ -290,7 +368,9 @@ void local_flush_tlb_mm(struct mm_struct *mm) { - spin_lock(&mm->page_table_lock); + unsigned long flags; + + spin_lock_irqsave(&mm->page_table_lock, flags); if ( mm->map_count ) { struct vm_area_struct *mp; @@ -298,7 +378,7 @@ local_flush_tlb_range( mm, mp->vm_start, mp->vm_end ); } - spin_unlock(&mm->page_table_lock); + spin_unlock_irqrestore(&mm->page_table_lock, flags); } /* ===== include/asm-ppc64/pgalloc.h 1.2 vs edited ===== --- 1.2/include/asm-ppc64/pgalloc.h Tue Apr 9 06:31:08 2002 +++ edited/include/asm-ppc64/pgalloc.h Tue Feb 3 21:50:36 2004 @@ -15,7 +15,6 @@ #define quicklists get_paca() #define pgd_quicklist (quicklists->pgd_cache) -#define pmd_quicklist (quicklists->pmd_cache) #define pte_quicklist (quicklists->pte_cache) #define pgtable_cache_size (quicklists->pgtable_cache_sz) @@ -60,10 +59,10 @@ static inline pmd_t* pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) { - unsigned long *ret = (unsigned long *)pmd_quicklist; + unsigned long *ret = (unsigned long *)pte_quicklist; if (ret != NULL) { - pmd_quicklist = (unsigned long *)(*ret); + pte_quicklist = (unsigned long *)(*ret); ret[0] = 0; --pgtable_cache_size; } @@ -80,14 +79,6 @@ return pmd; } -static inline void -pmd_free (pmd_t *pmd) -{ - *(unsigned long *)pmd = (unsigned long) pmd_quicklist; - pmd_quicklist = (unsigned long *) pmd; - ++pgtable_cache_size; -} - #define pmd_populate(MM, PMD, PTE) pmd_set(PMD, PTE) static inline pte_t* @@ -115,12 +106,54 @@ } static inline void -pte_free (pte_t *pte) 
+pte_free_kernel (pte_t *pte) { *(unsigned long *)pte = (unsigned long) pte_quicklist; pte_quicklist = (unsigned long *) pte; ++pgtable_cache_size; } + + +/* Use the PTE functions for freeing PMD as well, since the same + * problem with tree traversals apply. Since pmd pointers are always + * virtual, no need for a page_address() translation. + */ + +#define pte_free(pte_page) __pte_free(pte_page) +#define pmd_free(pmd) __pte_free(pmd) + +struct pte_freelist_batch +{ + unsigned int index; + void **entry; +}; + +#define PTE_FREELIST_SIZE (PAGE_SIZE / sizeof(void *)) + +extern void pte_free_now(pte_t *pte); +extern void pte_free_batch(void **batch, int size); +extern struct ____cacheline_aligned pte_freelist_batch pte_freelist_cur[] __cacheline_aligned_in_smp; + +static inline void __pte_free(pte_t *pte) +{ + struct pte_freelist_batch *batchp = &pte_freelist_cur[smp_processor_id()]; + + if (batchp->entry == NULL) { + batchp->entry = (void **)__get_free_page(GFP_ATOMIC); + if (batchp->entry == NULL) { + pte_free_now(pte); + return; + } + batchp->index = 0; + } + + batchp->entry[batchp->index++] = pte; + if (batchp->index == PTE_FREELIST_SIZE) { + pte_free_batch(batchp->entry, batchp->index); + batchp->entry = NULL; + } +} + extern int do_check_pgt_cache(int, int); ===== include/asm-ppc64/pgtable.h 1.7 vs edited ===== --- 1.7/include/asm-ppc64/pgtable.h Mon Aug 25 23:47:52 2003 +++ edited/include/asm-ppc64/pgtable.h Tue Feb 3 21:33:22 2004 @@ -88,22 +88,22 @@ * Bits in a linux-style PTE. These match the bits in the * (hardware-defined) PowerPC PTE as closely as possible. 
*/ -#define _PAGE_PRESENT 0x001UL /* software: pte contains a translation */ -#define _PAGE_USER 0x002UL /* matches one of the PP bits */ -#define _PAGE_RW 0x004UL /* software: user write access allowed */ -#define _PAGE_GUARDED 0x008UL -#define _PAGE_COHERENT 0x010UL /* M: enforce memory coherence (SMP systems) */ -#define _PAGE_NO_CACHE 0x020UL /* I: cache inhibit */ -#define _PAGE_WRITETHRU 0x040UL /* W: cache write-through */ -#define _PAGE_DIRTY 0x080UL /* C: page changed */ -#define _PAGE_ACCESSED 0x100UL /* R: page referenced */ -#define _PAGE_HPTENOIX 0x200UL /* software: pte HPTE slot unknown */ -#define _PAGE_HASHPTE 0x400UL /* software: pte has an associated HPTE */ -#define _PAGE_EXEC 0x800UL /* software: i-cache coherence required */ -#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */ -#define _PAGE_GROUP_IX 0x7000UL /* software: HPTE index within group */ +#define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ +#define _PAGE_USER 0x0002 /* matches one of the PP bits */ +#define _PAGE_RW 0x0004 /* software: user write access allowed */ +#define _PAGE_GUARDED 0x0008 +#define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ +#define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ +#define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ +#define _PAGE_DIRTY 0x0080 /* C: page changed */ +#define _PAGE_ACCESSED 0x0100 /* R: page referenced */ +#define _PAGE_BUSY 0x0200 /* software: pte & hash are busy */ +#define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ +#define _PAGE_EXEC 0x0800 /* software: i-cache coherence required */ +#define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */ +#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ /* Bits 0x7000 identify the index within an HPT Group */ -#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_HPTENOIX | _PAGE_SECONDARY | _PAGE_GROUP_IX) +#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | 
_PAGE_SECONDARY | _PAGE_GROUP_IX) /* PAGE_MASK gives the right answer below, but only by accident */ /* It should be preserving the high 48 bits and then specifically */ /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ @@ -281,13 +281,15 @@ unsigned long old, tmp; __asm__ __volatile__("\n\ -1: ldarx %0,0,%3 \n\ +1: ldarx %0,0,%3 \n\ + andi. %1,%0,%7 # loop on _PAGE_BUSY set\n\ + bne- 1b \n\ andc %1,%0,%4 \n\ or %1,%1,%5 \n\ stdcx. %1,0,%3 \n\ bne- 1b" : "=&r" (old), "=&r" (tmp), "=m" (*p) - : "r" (p), "r" (clr), "r" (set), "m" (*p) + : "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY) : "cc" ); return old; } From olof at austin.ibm.com Wed Feb 4 15:52:11 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Tue, 3 Feb 2004 22:52:11 -0600 (CST) Subject: 2.6: PATCH for multiple EEH bugs In-Reply-To: <20040203183459.B27780@forte.austin.ibm.com> Message-ID: On Tue, 3 Feb 2004 linas at austin.ibm.com wrote: > Patch for multiple EEH-related bugs. Please review this patch, > & if appropriate, please apply. It should apply cleanly to > the current ameslab tree (Feb 03 2004 2.6.2-rc3). Linas, I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you made the diff against a current ameslab tree? -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Feb 4 20:40:59 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 4 Feb 2004 20:40:59 +1100 Subject: [PATCH] kill lmb_add_io In-Reply-To: <20040204041820.GF22694@krispykreme> References: <1075856864.1449.205.camel@nighthawk> <20040204041820.GF22694@krispykreme> Message-ID: <20040204094059.GF19011@krispykreme> Boots on a p630. Will get some help testing it on iSeries tomorrow (hi Stephen :) and will push it if there are no complaints. 
Anton > OK, you got me to look :) > > The IO stuff was around from when we were using MSCHUNKS on pSeries. It > would make memory appear contiguous on machines with IO holes (POWER3 > boxes). In the end it confused the hell out of Linux (from memory it was > the macros that wanted to know if a physical address was in an IO hole > or not). > > I just went through and cleaned the lmb stuff up, it needs to be tested > on i and p, but I think its a step in the right direction. It certainly > simplifies a lot of the code. > > Thoughts? > > Anton > > -- > > - remove LMB_MEMORY_AREA, LMB_IO_AREA, we only allocate/reserve memory > areas now > - remove lmb_property->type, lmb_region->iosize, lmb_region->lcd_size, > no longer used > - bump number of regions to 128, we'll hit this limit sooner or later > with our big boxes (if we have more than 64 PCI host bridges the > reserved array will fill up for example) > - make all the lmb stuff __init > - no need to explicitly zero struct lmb lmb now we zero the BSS early > - we had two functions to dump the lmb array, kill one of them > - move the inline functions into lmb.c, they are only ever called from > there > > ===== include/asm-ppc64/lmb.h 1.6 vs edited ===== > --- 1.6/include/asm-ppc64/lmb.h Fri Sep 13 21:24:37 2002 > +++ edited/include/asm-ppc64/lmb.h Wed Feb 4 15:10:52 2004 > @@ -13,36 +13,29 @@ > * 2 of the License, or (at your option) any later version. 
> */ > > -#include > +#include > #include > > extern unsigned long reloc_offset(void); > > -#define MAX_LMB_REGIONS 64 > +#define MAX_LMB_REGIONS 128 > > union lmb_reg_property { > struct reg_property32 addr32[MAX_LMB_REGIONS]; > struct reg_property64 addr64[MAX_LMB_REGIONS]; > }; > > -#define LMB_MEMORY_AREA 1 > -#define LMB_IO_AREA 2 > - > #define LMB_ALLOC_ANYWHERE 0 > -#define LMB_ALLOC_FIRST4GBYTE (1UL<<32) > > struct lmb_property { > unsigned long base; > unsigned long physbase; > unsigned long size; > - unsigned long type; > }; > > struct lmb_region { > unsigned long cnt; > unsigned long size; > - unsigned long iosize; > - unsigned long lcd_size; /* Least Common Denominator */ > struct lmb_property region[MAX_LMB_REGIONS+1]; > }; > > @@ -53,63 +46,17 @@ > struct lmb_region reserved; > }; > > -extern struct lmb lmb; > - > -extern void lmb_init(void); > -extern void lmb_analyze(void); > -extern long lmb_add(unsigned long, unsigned long); > -#ifdef CONFIG_MSCHUNKS > -extern long lmb_add_io(unsigned long base, unsigned long size); > -#endif /* CONFIG_MSCHUNKS */ > -extern long lmb_reserve(unsigned long, unsigned long); > -extern unsigned long lmb_alloc(unsigned long, unsigned long); > -extern unsigned long lmb_alloc_base(unsigned long, unsigned long, unsigned long); > -extern unsigned long lmb_phys_mem_size(void); > -extern unsigned long lmb_end_of_DRAM(void); > -extern unsigned long lmb_abs_to_phys(unsigned long); > -extern void lmb_dump(char *); > - > -static inline unsigned long > -lmb_addrs_overlap(unsigned long base1, unsigned long size1, > - unsigned long base2, unsigned long size2) > -{ > - return ((base1 < (base2+size2)) && (base2 < (base1+size1))); > -} > - > -static inline long > -lmb_regions_overlap(struct lmb_region *rgn, unsigned long r1, unsigned long r2) > -{ > - unsigned long base1 = rgn->region[r1].base; > - unsigned long size1 = rgn->region[r1].size; > - unsigned long base2 = rgn->region[r2].base; > - unsigned long size2 = rgn->region[r2].size; 
> - > - return lmb_addrs_overlap(base1,size1,base2,size2); > -} > - > -static inline long > -lmb_addrs_adjacent(unsigned long base1, unsigned long size1, > - unsigned long base2, unsigned long size2) > -{ > - if ( base2 == base1 + size1 ) { > - return 1; > - } else if ( base1 == base2 + size2 ) { > - return -1; > - } > - return 0; > -} > - > -static inline long > -lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2) > -{ > - unsigned long base1 = rgn->region[r1].base; > - unsigned long size1 = rgn->region[r1].size; > - unsigned long type1 = rgn->region[r1].type; > - unsigned long base2 = rgn->region[r2].base; > - unsigned long size2 = rgn->region[r2].size; > - unsigned long type2 = rgn->region[r2].type; > +extern struct lmb lmb __initdata; > > - return (type1 == type2) && lmb_addrs_adjacent(base1,size1,base2,size2); > -} > +extern void __init lmb_init(void); > +extern void __init lmb_analyze(void); > +extern long __init lmb_add(unsigned long, unsigned long); > +extern long __init lmb_reserve(unsigned long, unsigned long); > +extern unsigned long __init lmb_alloc(unsigned long, unsigned long); > +extern unsigned long __init lmb_alloc_base(unsigned long, unsigned long, > + unsigned long); > +extern unsigned long __init lmb_phys_mem_size(void); > +extern unsigned long __init lmb_end_of_DRAM(void); > +extern unsigned long __init lmb_abs_to_phys(unsigned long); > > #endif /* _PPC64_LMB_H */ > ===== arch/ppc64/kernel/lmb.c 1.6 vs edited ===== > --- 1.6/arch/ppc64/kernel/lmb.c Tue Feb 25 20:38:45 2003 > +++ edited/arch/ppc64/kernel/lmb.c Wed Feb 4 15:10:44 2004 > @@ -1,5 +1,4 @@ > /* > - * > * Procedures for interfacing to Open Firmware. > * > * Peter Bergner, IBM Corp. June 2001. 
> @@ -13,46 +12,63 @@ > > #include > #include > +#include > #include > #include > #include > #include > #include > #include > -#include > > -extern unsigned long klimit; > -extern unsigned long reloc_offset(void); > +struct lmb lmb __initdata; > + > +static unsigned long __init > +lmb_addrs_overlap(unsigned long base1, unsigned long size1, > + unsigned long base2, unsigned long size2) > +{ > + return ((base1 < (base2+size2)) && (base2 < (base1+size1))); > +} > > +static long __init > +lmb_addrs_adjacent(unsigned long base1, unsigned long size1, > + unsigned long base2, unsigned long size2) > +{ > + if (base2 == base1 + size1) > + return 1; > + else if (base1 == base2 + size2) > + return -1; > > -static long lmb_add_region(struct lmb_region *, unsigned long, unsigned long, unsigned long); > + return 0; > +} > > -struct lmb lmb = { > - 0, 0, > - {0,0,0,0,{{0,0,0}}}, > - {0,0,0,0,{{0,0,0}}} > -}; > +static long __init > +lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2) > +{ > + unsigned long base1 = rgn->region[r1].base; > + unsigned long size1 = rgn->region[r1].size; > + unsigned long base2 = rgn->region[r2].base; > + unsigned long size2 = rgn->region[r2].size; > > + return lmb_addrs_adjacent(base1, size1, base2, size2); > +} > > /* Assumption: base addr of region 1 < base addr of region 2 */ > -static void > +static void __init > lmb_coalesce_regions(struct lmb_region *rgn, unsigned long r1, unsigned long r2) > { > unsigned long i; > > rgn->region[r1].size += rgn->region[r2].size; > - for (i=r2; i < rgn->cnt-1 ;i++) { > + for (i=r2; i < rgn->cnt-1; i++) { > rgn->region[i].base = rgn->region[i+1].base; > rgn->region[i].physbase = rgn->region[i+1].physbase; > rgn->region[i].size = rgn->region[i+1].size; > - rgn->region[i].type = rgn->region[i+1].type; > } > rgn->cnt--; > } > > - > /* This routine called with relocation disabled. 
*/ > -void > +void __init > lmb_init(void) > { > unsigned long offset = reloc_offset(); > @@ -63,13 +79,11 @@ > */ > _lmb->memory.region[0].base = 0; > _lmb->memory.region[0].size = 0; > - _lmb->memory.region[0].type = LMB_MEMORY_AREA; > _lmb->memory.cnt = 1; > > /* Ditto. */ > _lmb->reserved.region[0].base = 0; > _lmb->reserved.region[0].size = 0; > - _lmb->reserved.region[0].type = LMB_MEMORY_AREA; > _lmb->reserved.cnt = 1; > } > > @@ -89,12 +103,11 @@ > } > > /* This routine called with relocation disabled. */ > -void > +void __init > lmb_analyze(void) > { > unsigned long i; > unsigned long mem_size = 0; > - unsigned long io_size = 0; > unsigned long size_mask = 0; > unsigned long offset = reloc_offset(); > struct lmb *_lmb = PTRRELOC(&lmb); > @@ -102,13 +115,9 @@ > unsigned long physbase = 0; > #endif > > - for (i=0; i < _lmb->memory.cnt ;i++) { > - unsigned long lmb_type = _lmb->memory.region[i].type; > + for (i=0; i < _lmb->memory.cnt; i++) { > unsigned long lmb_size; > > - if ( lmb_type != LMB_MEMORY_AREA ) > - continue; > - > lmb_size = _lmb->memory.region[i].size; > > #ifdef CONFIG_MSCHUNKS > @@ -121,84 +130,20 @@ > size_mask |= lmb_size; > } > > -#ifdef CONFIG_MSCHUNKS > - for (i=0; i < _lmb->memory.cnt ;i++) { > - unsigned long lmb_type = _lmb->memory.region[i].type; > - unsigned long lmb_size; > - > - if ( lmb_type != LMB_IO_AREA ) > - continue; > - > - lmb_size = _lmb->memory.region[i].size; > - > - _lmb->memory.region[i].physbase = physbase; > - physbase += lmb_size; > - io_size += lmb_size; > - size_mask |= lmb_size; > - } > -#endif /* CONFIG_MSCHUNKS */ > - > _lmb->memory.size = mem_size; > - _lmb->memory.iosize = io_size; > - _lmb->memory.lcd_size = (1UL << cnt_trailing_zeros(size_mask)); > -} > - > -/* This routine called with relocation disabled. 
*/ > -long > -lmb_add(unsigned long base, unsigned long size) > -{ > - unsigned long offset = reloc_offset(); > - struct lmb *_lmb = PTRRELOC(&lmb); > - struct lmb_region *_rgn = &(_lmb->memory); > - > - /* On pSeries LPAR systems, the first LMB is our RMO region. */ > - if ( base == 0 ) > - _lmb->rmo_size = size; > - > - return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA); > - > } > > -#ifdef CONFIG_MSCHUNKS > /* This routine called with relocation disabled. */ > -long > -lmb_add_io(unsigned long base, unsigned long size) > -{ > - unsigned long offset = reloc_offset(); > - struct lmb *_lmb = PTRRELOC(&lmb); > - struct lmb_region *_rgn = &(_lmb->memory); > - > - return lmb_add_region(_rgn, base, size, LMB_IO_AREA); > - > -} > -#endif /* CONFIG_MSCHUNKS */ > - > -long > -lmb_reserve(unsigned long base, unsigned long size) > -{ > - unsigned long offset = reloc_offset(); > - struct lmb *_lmb = PTRRELOC(&lmb); > - struct lmb_region *_rgn = &(_lmb->reserved); > - > - return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA); > -} > - > -/* This routine called with relocation disabled. */ > -static long > -lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size, > - unsigned long type) > +static long __init > +lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size) > { > unsigned long i, coalesced = 0; > long adjacent; > > /* First try and coalesce this LMB with another. */ > - for (i=0; i < rgn->cnt ;i++) { > + for (i=0; i < rgn->cnt; i++) { > unsigned long rgnbase = rgn->region[i].base; > unsigned long rgnsize = rgn->region[i].size; > - unsigned long rgntype = rgn->region[i].type; > - > - if ( rgntype != type ) > - continue; > > adjacent = lmb_addrs_adjacent(base,size,rgnbase,rgnsize); > if ( adjacent > 0 ) { > @@ -227,17 +172,15 @@ > } > > /* Couldn't coalesce the LMB, so add it to the sorted table. 
*/ > - for (i=rgn->cnt-1; i >= 0 ;i--) { > + for (i=rgn->cnt-1; i >= 0; i--) { > if (base < rgn->region[i].base) { > rgn->region[i+1].base = rgn->region[i].base; > rgn->region[i+1].physbase = rgn->region[i].physbase; > rgn->region[i+1].size = rgn->region[i].size; > - rgn->region[i+1].type = rgn->region[i].type; > } else { > rgn->region[i+1].base = base; > rgn->region[i+1].physbase = lmb_abs_to_phys(base); > rgn->region[i+1].size = size; > - rgn->region[i+1].type = type; > break; > } > } > @@ -246,12 +189,38 @@ > return 0; > } > > -long > +/* This routine called with relocation disabled. */ > +long __init > +lmb_add(unsigned long base, unsigned long size) > +{ > + unsigned long offset = reloc_offset(); > + struct lmb *_lmb = PTRRELOC(&lmb); > + struct lmb_region *_rgn = &(_lmb->memory); > + > + /* On pSeries LPAR systems, the first LMB is our RMO region. */ > + if ( base == 0 ) > + _lmb->rmo_size = size; > + > + return lmb_add_region(_rgn, base, size); > + > +} > + > +long __init > +lmb_reserve(unsigned long base, unsigned long size) > +{ > + unsigned long offset = reloc_offset(); > + struct lmb *_lmb = PTRRELOC(&lmb); > + struct lmb_region *_rgn = &(_lmb->reserved); > + > + return lmb_add_region(_rgn, base, size); > +} > + > +long __init > lmb_overlaps_region(struct lmb_region *rgn, unsigned long base, unsigned long size) > { > unsigned long i; > > - for (i=0; i < rgn->cnt ;i++) { > + for (i=0; i < rgn->cnt; i++) { > unsigned long rgnbase = rgn->region[i].base; > unsigned long rgnsize = rgn->region[i].size; > if ( lmb_addrs_overlap(base,size,rgnbase,rgnsize) ) { > @@ -262,13 +231,13 @@ > return (i < rgn->cnt) ? 
i : -1; > } > > -unsigned long > +unsigned long __init > lmb_alloc(unsigned long size, unsigned long align) > { > return lmb_alloc_base(size, align, LMB_ALLOC_ANYWHERE); > } > > -unsigned long > +unsigned long __init > lmb_alloc_base(unsigned long size, unsigned long align, unsigned long max_addr) > { > long i, j; > @@ -278,13 +247,9 @@ > struct lmb_region *_mem = &(_lmb->memory); > struct lmb_region *_rsv = &(_lmb->reserved); > > - for (i=_mem->cnt-1; i >= 0 ;i--) { > + for (i=_mem->cnt-1; i >= 0; i--) { > unsigned long lmbbase = _mem->region[i].base; > unsigned long lmbsize = _mem->region[i].size; > - unsigned long lmbtype = _mem->region[i].type; > - > - if ( lmbtype != LMB_MEMORY_AREA ) > - continue; > > if ( max_addr == LMB_ALLOC_ANYWHERE ) > base = _ALIGN_DOWN(lmbbase+lmbsize-size, align); > @@ -305,12 +270,12 @@ > if ( i < 0 ) > return 0; > > - lmb_add_region(_rsv, base, size, LMB_MEMORY_AREA); > + lmb_add_region(_rsv, base, size); > > return base; > } > > -unsigned long > +unsigned long __init > lmb_phys_mem_size(void) > { > unsigned long offset = reloc_offset(); > @@ -327,7 +292,7 @@ > #endif /* CONFIG_MSCHUNKS */ > } > > -unsigned long > +unsigned long __init > lmb_end_of_DRAM(void) > { > unsigned long offset = reloc_offset(); > @@ -335,9 +300,7 @@ > struct lmb_region *_mem = &(_lmb->memory); > unsigned long idx; > > - for(idx=_mem->cnt-1; idx >= 0 ;idx--) { > - if ( _mem->region[idx].type != LMB_MEMORY_AREA ) > - continue; > + for(idx=_mem->cnt-1; idx >= 0; idx--) { > #ifdef CONFIG_MSCHUNKS > return (_mem->region[idx].physbase + _mem->region[idx].size); > #else > @@ -348,8 +311,7 @@ > return 0; > } > > - > -unsigned long > +unsigned long __init > lmb_abs_to_phys(unsigned long aa) > { > unsigned long i, pa = aa; > @@ -357,7 +319,7 @@ > struct lmb *_lmb = PTRRELOC(&lmb); > struct lmb_region *_mem = &(_lmb->memory); > > - for (i=0; i < _mem->cnt ;i++) { > + for (i=0; i < _mem->cnt; i++) { > unsigned long lmbbase = _mem->region[i].base; > unsigned long 
lmbsize = _mem->region[i].size; > if ( lmb_addrs_overlap(aa,1,lmbbase,lmbsize) ) { > @@ -367,48 +329,4 @@ > } > > return pa; > -} > - > -void > -lmb_dump(char *str) > -{ > - unsigned long i; > - > - udbg_printf("\nlmb_dump: %s\n", str); > - udbg_printf(" debug = %s\n", > - (lmb.debug) ? "TRUE" : "FALSE"); > - udbg_printf(" memory.cnt = %d\n", > - lmb.memory.cnt); > - udbg_printf(" memory.size = 0x%lx\n", > - lmb.memory.size); > - udbg_printf(" memory.lcd_size = 0x%lx\n", > - lmb.memory.lcd_size); > - for (i=0; i < lmb.memory.cnt ;i++) { > - udbg_printf(" memory.region[%d].base = 0x%lx\n", > - i, lmb.memory.region[i].base); > - udbg_printf(" .physbase = 0x%lx\n", > - lmb.memory.region[i].physbase); > - udbg_printf(" .size = 0x%lx\n", > - lmb.memory.region[i].size); > - udbg_printf(" .type = 0x%lx\n", > - lmb.memory.region[i].type); > - } > - > - udbg_printf("\n"); > - udbg_printf(" reserved.cnt = %d\n", > - lmb.reserved.cnt); > - udbg_printf(" reserved.size = 0x%lx\n", > - lmb.reserved.size); > - udbg_printf(" reserved.lcd_size = 0x%lx\n", > - lmb.reserved.lcd_size); > - for (i=0; i < lmb.reserved.cnt ;i++) { > - udbg_printf(" reserved.region[%d].base = 0x%lx\n", > - i, lmb.reserved.region[i].base); > - udbg_printf(" .physbase = 0x%lx\n", > - lmb.reserved.region[i].physbase); > - udbg_printf(" .size = 0x%lx\n", > - lmb.reserved.region[i].size); > - udbg_printf(" .type = 0x%lx\n", > - lmb.reserved.region[i].type); > - } > } > ===== arch/ppc64/kernel/prom.c 1.52 vs edited ===== > --- 1.52/arch/ppc64/kernel/prom.c Sun Feb 1 13:40:21 2004 > +++ edited/arch/ppc64/kernel/prom.c Wed Feb 4 14:42:19 2004 > @@ -668,9 +668,6 @@ > prom_print(RELOC(" memory.size = 0x")); > prom_print_hex(_lmb->memory.size); > prom_print_nl(); > - prom_print(RELOC(" memory.lcd_size = 0x")); > - prom_print_hex(_lmb->memory.lcd_size); > - prom_print_nl(); > for (i=0; i < _lmb->memory.cnt ;i++) { > prom_print(RELOC(" memory.region[0x")); > prom_print_hex(i); > @@ -683,9 +680,6 @@ > 
prom_print(RELOC(" .size = 0x")); > prom_print_hex(_lmb->memory.region[i].size); > prom_print_nl(); > - prom_print(RELOC(" .type = 0x")); > - prom_print_hex(_lmb->memory.region[i].type); > - prom_print_nl(); > } > > prom_print_nl(); > @@ -695,9 +689,6 @@ > prom_print(RELOC(" reserved.size = 0x")); > prom_print_hex(_lmb->reserved.size); > prom_print_nl(); > - prom_print(RELOC(" reserved.lcd_size = 0x")); > - prom_print_hex(_lmb->reserved.lcd_size); > - prom_print_nl(); > for (i=0; i < _lmb->reserved.cnt ;i++) { > prom_print(RELOC(" reserved.region[0x")); > prom_print_hex(i); > @@ -709,9 +700,6 @@ > prom_print_nl(); > prom_print(RELOC(" .size = 0x")); > prom_print_hex(_lmb->reserved.region[i].size); > - prom_print_nl(); > - prom_print(RELOC(" .type = 0x")); > - prom_print_hex(_lmb->reserved.region[i].type); > prom_print_nl(); > } > } > ===== arch/ppc64/mm/numa.c 1.16 vs edited ===== > --- 1.16/arch/ppc64/mm/numa.c Tue Jan 20 13:07:09 2004 > +++ edited/arch/ppc64/mm/numa.c Wed Feb 4 15:12:02 2004 > @@ -257,10 +257,6 @@ > > for (i = 0; i < lmb.memory.cnt; i++) { > unsigned long physbase, size; > - unsigned long type = lmb.memory.region[i].type; > - > - if (type != LMB_MEMORY_AREA) > - continue; > > physbase = lmb.memory.region[i].physbase; > size = lmb.memory.region[i].size; > ===== arch/ppc64/mm/init.c 1.55 vs edited ===== > --- 1.55/arch/ppc64/mm/init.c Tue Jan 20 13:07:09 2004 > +++ edited/arch/ppc64/mm/init.c Wed Feb 4 15:11:54 2004 > @@ -702,10 +702,6 @@ > /* add all physical memory to the bootmem map */ > for (i=0; i < lmb.memory.cnt; i++) { > unsigned long physbase, size; > - unsigned long type = lmb.memory.region[i].type; > - > - if ( type != LMB_MEMORY_AREA ) > - continue; > > physbase = lmb.memory.region[i].physbase; > size = lmb.memory.region[i].size; > @@ -746,11 +742,7 @@ > > for (i=0; i < lmb.memory.cnt; i++) { > unsigned long physbase, size; > - unsigned long type = lmb.memory.region[i].type; > struct kcore_list *kcore_mem; > - > - if (type != 
LMB_MEMORY_AREA) > - continue; > > physbase = lmb.memory.region[i].physbase; > size = lmb.memory.region[i].size; > ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Feb 4 21:51:13 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 4 Feb 2004 21:51:13 +1100 Subject: ioremap problems In-Reply-To: <401C49DD.1050008@austin.ibm.com> References: <20040127104640.GK11236@krispykreme> <20040127105533.GL11236@krispykreme> <401C49DD.1050008@austin.ibm.com> Message-ID: <20040204105113.GI19011@krispykreme> > The patch you posted didn't apply cleanly for me on latest 2.6 from > Ameslab; it inserted the Kconfig option in the middle of the vio stuff > for some reason. So I hand-patched the Kconfig; here's the patch I > generated if anyone wants it. > > It would be nice to get this into Ameslab or mainline, it's very useful. Let me try it on Andrew once again but I'll merge it into ameslab regardless. Heres another simple patch based on the x86 version. It warns whenever we use more than 8kB of stack (out of 16kB). I set the threshold rather high since it doesnt catch the full extent of stack usage (it only catches the start of the last irq). Anton ===== arch/ppc64/Kconfig 1.31 vs edited ===== gr16b-anton/arch/ppc64/Kconfig | 4 ++++ gr16b-anton/arch/ppc64/kernel/irq.c | 15 +++++++++++++++ 2 files changed, 19 insertions(+) diff -puN arch/ppc64/Kconfig~debug_stackoverflow arch/ppc64/Kconfig --- gr16b/arch/ppc64/Kconfig~debug_stackoverflow 2004-01-22 01:18:46.234108336 +1100 +++ gr16b-anton/arch/ppc64/Kconfig 2004-01-22 01:18:46.242108222 +1100 @@ -312,6 +312,10 @@ config DEBUG_KERNEL Say Y here if you are developing drivers or trying to debug and identify kernel problems. 
+config DEBUG_STACKOVERFLOW + bool "Check for stack overflows" + depends on DEBUG_KERNEL + config DEBUG_SLAB bool "Debug memory allocations" depends on DEBUG_KERNEL diff -puN arch/ppc64/kernel/irq.c~debug_stackoverflow arch/ppc64/kernel/irq.c --- gr16b/arch/ppc64/kernel/irq.c~debug_stackoverflow 2004-01-22 01:18:46.237108293 +1100 +++ gr16b-anton/arch/ppc64/kernel/irq.c 2004-01-22 01:25:34.498313587 +1100 @@ -568,6 +568,21 @@ int do_IRQ(struct pt_regs *regs) irq_enter(); +#ifdef CONFIG_DEBUG_STACKOVERFLOW + /* Debugging check for stack overflow: is there less than 8KB free? */ + { + long sp; + + sp = (unsigned long)_get_SP() & (THREAD_SIZE-1); + + if (unlikely(sp < (sizeof(struct thread_info) + 8192))) { + printk("do_IRQ: stack overflow: %ld\n", + sp - sizeof(struct thread_info)); + dump_stack(); + } + } +#endif + lpaca = get_paca(); #ifdef CONFIG_SMP if (lpaca->xLpPaca.xIntDword.xFields.xIpiCnt) { _ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 5 03:03:03 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 4 Feb 2004 10:03:03 -0600 Subject: 2.6: PATCH for multiple EEH bugs In-Reply-To: ; from olof@austin.ibm.com on Tue, Feb 03, 2004 at 10:52:11PM -0600 References: <20040203183459.B27780@forte.austin.ibm.com> Message-ID: <20040204100303.E27780@forte.austin.ibm.com> On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote: > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote: > > > Patch for multiple EEH-related bugs. Please review this patch, > > & if appropriate, please apply. It should apply cleanly to > > the current ameslab tree (Feb 03 2004 2.6.2-rc3). > > Linas, > > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you > made the diff against a current ameslab tree? It applied to a fresh bk pull as of yesterday morning ... I am having a very hard time understanding how bk works, I'll check again. --linas ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 5 03:42:13 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 4 Feb 2004 10:42:13 -0600 Subject: 2.6: PATCH for multiple EEH bugs In-Reply-To: <20040204100303.E27780@forte.austin.ibm.com>; from linas@austin.ibm.com on Wed, Feb 04, 2004 at 10:03:03AM -0600 References: <20040203183459.B27780@forte.austin.ibm.com> <20040204100303.E27780@forte.austin.ibm.com> Message-ID: <20040204104213.F27780@forte.austin.ibm.com> On Wed, Feb 04, 2004 at 10:03:03AM -0600, linas at austin.ibm.com wrote: > > On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote: > > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote: > > > > > Patch for multiple EEH-related bugs. Please review this patch, > > > & if appropriate, please apply. It should apply cleanly to > > > the current ameslab tree (Feb 03 2004 2.6.2-rc3). > > > > Linas, > > > > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you > > made the diff against a current ameslab tree? > > It applied to a fresh bk pull as of yesterday morning ... > I am having a very hard time understanding how bk works, > I'll check again. I just did a fresh bk clone bk://source.scl.ameslab.gov/linux-2.5 Maybe I'm using the wrong tree? --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Thu Feb 5 04:03:45 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Wed, 04 Feb 2004 11:03:45 -0600 Subject: 2.6: PATCH for multiple EEH bugs In-Reply-To: <20040204104213.F27780@forte.austin.ibm.com> References: <20040203183459.B27780@forte.austin.ibm.com> <20040204100303.E27780@forte.austin.ibm.com> <20040204104213.F27780@forte.austin.ibm.com> Message-ID: <402125F1.2010608@austin.ibm.com> linas at austin.ibm.com wrote: > I just did a fresh bk clone bk://source.scl.ameslab.gov/linux-2.5 > >Maybe I'm using the wrong tree? 
> > That's really weird, since I just did the exact same thing and got failures: olof at olof tmp $ bk clone -q bk://source.scl.ameslab.gov/linux-2.5 olof at olof tmp $ cd linux-2.5/ olof at olof linux-2.5 $ patch -p1 < ~/eeh-bug-fixes-2.6.2.patch [...] arch/ppc64/kernel/eeh.c 1.17: 483 lines Get file arch/ppc64/kernel/eeh.c from SCCS with lock? [y] arch/ppc64/kernel/eeh.c 1.17 -> 1.18: 483 lines patching file arch/ppc64/kernel/eeh.c Hunk #3 succeeded at 142 with fuzz 1. Hunk #4 FAILED at 163. Hunk #7 FAILED at 271. 2 out of 15 hunks FAILED -- saving rejects to file arch/ppc64/kernel/eeh.c.rej [...] arch/ppc64/kernel/pci.c 1.42: 543 lines Get file arch/ppc64/kernel/pci.c from SCCS with lock? [y] arch/ppc64/kernel/pci.c 1.42 -> 1.43: 543 lines patching file arch/ppc64/kernel/pci.c Hunk #2 FAILED at 109. 1 out of 3 hunks FAILED -- saving rejects to file arch/ppc64/kernel/pci.c.rej [...] drivers/pci/hotplug/rpaphp_core.c 1.2: 1040 lines Get file drivers/pci/hotplug/rpaphp_core.c from SCCS with lock? [y] drivers/pci/hotplug/rpaphp_core.c 1.2 -> 1.3: 1040 lines patching file drivers/pci/hotplug/rpaphp_core.c Hunk #3 FAILED at 542. 1 out of 3 hunks FAILED -- saving rejects to file drivers/pci/hotplug/rpaphp_core.c.rej -Olof ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From kravetz at us.ibm.com Thu Feb 5 04:40:36 2004 From: kravetz at us.ibm.com (Mike Kravetz) Date: Wed, 4 Feb 2004 09:40:36 -0800 Subject: [PATCH] kill lmb_add_io In-Reply-To: <1075856864.1449.205.camel@nighthawk> References: <1075856864.1449.205.camel@nighthawk> Message-ID: <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> On the subject of lmb's ... I've got a pSeries 615. On this machine I only see one lmb of 4GB in size. This is what is dug out of the prom by 'prom_initialize_lmb' before any coalescing. I was somehow expecting more lmb's of a smaller size. Of course, I don't know the hardware specifics of this machine or how many DIMMs of what size it contains. 
Dave, what does the lmb layout on your 'bigger' machine look like? -- Mike ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 5 04:41:44 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 4 Feb 2004 11:41:44 -0600 Subject: [2.5][PATCH] Remove warnings from rtas-proc.c In-Reply-To: ; from olof@austin.ibm.com on Mon, Jan 26, 2004 at 09:20:00PM -0600 References: Message-ID: <20040204114143.H27780@forte.austin.ibm.com> On Mon, Jan 26, 2004 at 09:20:00PM -0600, olof at austin.ibm.com wrote: > > I'm tired of seeing the warnings go by every time I compile. It might be > valid C99 code, but my GCC still warns about it: > > arch/ppc64/kernel/rtas-proc.c: In function `ppc_rtas_poweron_read': > arch/ppc64/kernel/rtas-proc.c:294: warning: ISO C90 forbids mixed > declarations and code > ..and so on.. Not all compilers seem to generate this warning. Mine doesn't... although I have noticed that other people's compilers do ... ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From haveblue at us.ibm.com Thu Feb 5 04:47:23 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: 04 Feb 2004 09:47:23 -0800 Subject: [PATCH] kill lmb_add_io In-Reply-To: <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> References: <1075856864.1449.205.camel@nighthawk> <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> Message-ID: <1075916843.14153.2397.camel@nighthawk> On Wed, 2004-02-04 at 09:40, Mike Kravetz wrote: > On the subject of lmb's ... I've got a pSeries 615. On this machine > I only see one lmb of 4GB in size. This is what is dug out of the > prom by 'prom_initialize_lmb' before any coalescing. I was somehow > expecting more lmb's of a smaller size. Of course, I don't know the > hardware specifics of this machine or how many DIMMs of what size it > contains. > > Dave, what does the lmb layout on your 'bigger' machine look like? There were a bunch of them, I'd say around 30 or so. 
But, I'm booting a 64GB p650 in an 8-way LPAR, so the hypervisor might change the layout. --dave ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 5 05:02:52 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 4 Feb 2004 12:02:52 -0600 Subject: EEH as module [was Re: LPARCFG] In-Reply-To: <20040201050227.GB22694@krispykreme>; from anton@samba.org on Sun, Feb 01, 2004 at 04:02:28PM +1100 References: <20040201050227.GB22694@krispykreme> Message-ID: <20040204120252.J27780@forte.austin.ibm.com> On Sun, Feb 01, 2004 at 04:02:28PM +1100, Anton Blanchard wrote: > > Hi, > > Any reason we cant make xxx > tristate (so we can compile it as a > module?) > > Im keeping an eye on our kernel size (its getting huge), any chance > we get to cut it down is worth it. Since I'm planning to be whacking EEH in a big way, does anyone feel that it should be turned into a module? --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From jdewand at redhat.com Thu Feb 5 07:15:04 2004 From: jdewand at redhat.com (Julie DeWandel) Date: Wed, 04 Feb 2004 15:15:04 -0500 Subject: [2.4] [PATCH] hash_page rework, take 2 References: Message-ID: <402152C8.7080907@redhat.com> Hi Olof, Thank you for the explanations. In most cases, I agree but I still have one or two things I wanted to follow up on. I have added my comments to yours below. I looked over the new patch and am pretty happy with it. I had to modify it a bit (we took out quicklists in our distro) and I wanted to keep the UL in the #defines for the PTE bits. But please read through the patch, looking for my "JSD:" comments. Thanks! Julie olof at austin.ibm.com wrote: >- spin_unlock(&hash_table_lock[lock_slot].lock); >+out_unlock: >+ smp_wmb(); > >JSD: I believe an smp_mb__before_clear_bit() would be a better choice, or >JSD: at least an lwsync because this is an unlock. 
> >I'm not sure we really need any synchronization at all, since the only >memory that's protected by the _PAGE_BUSY bit is the word itself, so >the lock and contents change will be seen at the same time (or at least >in the right order). > >smp_wmb() is cheaper (eieio) than smp_mb() (sync), so either the >smp_wmb() should stay or it should be taken out altogether. I'd rather >err on the side of caution, so I'll keep it. > OK, I managed to convince myself that using an eieio is ok here. I was concerned that other processors might not have seen any of the stores that preceded the eieio instruction, since eieio is normally only used when dealing with device memory. lwsync ensures other processors have seen any stores to system memory at the point the lock is released. But the only stores that matter here are the hpte (and it is sync'd) and the pte, which carries the lock bit. So when another processor sees the pte contents without the lock bit set it will, by definition, be seeing the updated value as well. >JSD: Better question for (1) is why are interrupts being disabled here? >JSD: Can this routine be called from interrupt context? > >Without disabling interrupts, there's a risk for deadlocks if the >processor gets interrupted and the interrupt handler causes a page fault >that needs to be resolved. Since the lock is held for writing, the handler >will wait forever when locking for reading. This is actually similar to >the original deadlock that this whole patch is meant to remove, but the >window is really small (just a few instructions) now. Likewise an >interrupt on a different processor is not a problem since forward progress >is still guaranteed on the processor holding it for writing so the reader >will eventually get the lock. > So it is true that an interrupt handler can cause a page fault? Can you provide me with an example?
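To make the _PAGE_BUSY argument above concrete, here is a minimal user-space sketch, not the kernel code. It assumes C11 atomics as a stand-in for the ldarx/stdcx. loop, and the names PTE_BUSY, pte_trylock, pte_lock and pte_unlock_with are made up for illustration. What it tries to show is that the new contents and the cleared busy bit go out in one store to the same word, so an observer that sees the bit clear also sees the new contents of that word. The release ordering on that store is an assumption standing in for the smp_wmb(); its job is to order earlier stores (such as the hpte update) before the unlock becomes visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for _PAGE_BUSY from the patch (same bit value). */
#define PTE_BUSY UINT64_C(0x0200)

/* Try once to set the busy bit. On success, *seen holds the PTE value
 * as it was before the bit was set. Fails if the bit is already set or
 * if another CPU changed the word in between. */
static inline bool pte_trylock(_Atomic uint64_t *pte, uint64_t *seen)
{
	uint64_t old = atomic_load_explicit(pte, memory_order_relaxed);

	if (old & PTE_BUSY)
		return false;
	if (!atomic_compare_exchange_strong_explicit(pte, &old, old | PTE_BUSY,
						     memory_order_acquire,
						     memory_order_relaxed))
		return false;
	*seen = old;
	return true;
}

/* Spin until the busy bit can be taken; returns the pre-lock value. */
static inline uint64_t pte_lock(_Atomic uint64_t *pte)
{
	uint64_t old;

	while (!pte_trylock(pte, &old))
		;	/* spin, like the "bne- 1b" on a busy PTE */
	return old;
}

/* Publish the new contents and drop the busy bit in a single store.
 * Release ordering keeps earlier stores ahead of the unlock. */
static inline void pte_unlock_with(_Atomic uint64_t *pte, uint64_t new_val)
{
	atomic_store_explicit(pte, new_val & ~PTE_BUSY, memory_order_release);
}
```

A caller would take old = pte_lock(&pte), build the new value from old, and finish with pte_unlock_with(&pte, new_val); there is no separate "clear the bit" step that could be seen out of order with the contents of the word.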
>JSD: (2) I don't see how this code is preventing another processor >JSD: from grabbing the read_lock immediately after this processor has >JSD: checked to make sure it isn't held. > >And on the second question: This is the trick with using the rwlock, >there's no need to _prevent_ reading, all I needed to know is that all >readers that started before pte_free_sync() have completed: > >No one can get a reference to a PTE after it's been freed, so the only >risk is if someone has walked the table right during pte_free() and is >still holding a reference to it. hash_page will hold the lock for >reading while this takes place, so all we need to know is that we >_could_ take the lock for writing (i.e. no readers for the table). Even >if another CPU comes in and traverses the tree we're safe since there's >no way they can end up in that PTE (since it's been removed from the >tree). > >This is all an ad-hoc solution since there's no RCU in 2.4, so I needed >another light-weight synchronization method. > Let me see if I understand this. When someone wants to free a page pointed to by an entry in a 3rd level page table, they clear out the pte in the page table using pte_clear(). Then they call pte_free with the address of the page they are freeing up (not really a page table entry but the actual page address). This page address is added to the batch list. Later, the idle loop or process termination code calls do_check_pgt_cache which will free all the pages in the batch list. However, prior to freeing the pages, pte_free_batch() will call pte_free_sync(). pte_free_sync tries to ensure that the pte_hash_locks aren't held on any processor. If a processor is holding it, it could be the case that the processor is currently walking the page tables and might have loaded, for example, the address of a pmd that we have since cleared. Since it is still using the data in that page, we don't want to free it until they are done with it.
If we wait until they drop the lock, they are done and we also know any new reference will see the cleared value. Is this correct? >------------------------------------------------------------------------ > >===== arch/ppc64/kernel/htab.c 1.11 vs edited ===== >--- 1.11/arch/ppc64/kernel/htab.c Thu Dec 18 16:13:25 2003 >+++ edited/arch/ppc64/kernel/htab.c Tue Feb 3 22:01:59 2004 >@@ -48,6 +48,29 @@ > #include > #include > >+#define HPTE_LOCK_BIT 3 >+ >+static inline void pSeries_lock_hpte(HPTE *hptep) >+{ >+ unsigned long *word = &hptep->dw0.dword0; >+ >+ while (1) { >+ if (!test_and_set_bit(HPTE_LOCK_BIT, word)) >+ break; >+ while(test_bit(HPTE_LOCK_BIT, word)) >+ cpu_relax(); >+ } >+} >+ >+static inline void pSeries_unlock_hpte(HPTE *hptep) >+{ >+ unsigned long *word = &hptep->dw0.dword0; >+ >+ asm volatile("lwsync":::"memory"); >+ clear_bit(HPTE_LOCK_BIT, word); >+} >+ >+ > /* > * Note: pte --> Linux PTE > * HPTE --> PowerPC Hashed Page Table Entry >@@ -64,6 +87,7 @@ > > extern unsigned long _SDR1; > extern unsigned long klimit; >+extern rwlock_t pte_hash_lock[] __cacheline_aligned_in_smp; > > void make_pte(HPTE *htab, unsigned long va, unsigned long pa, > int mode, unsigned long hash_mask, int large); >@@ -320,51 +344,73 @@ > unsigned long va, vpn; > unsigned long newpp, prpn; > unsigned long hpteflags, lock_slot; >+ unsigned long access_ok, tmp; > long slot; > pte_t old_pte, new_pte; >+ int ret = 0; > > /* Search the Linux page table for a match with va */ > va = (vsid << 28) | (ea & 0x0fffffff); > vpn = va >> PAGE_SHIFT; > lock_slot = get_lock_slot(vpn); > >- /* Acquire the hash table lock to guarantee that the linux >- * pte we fetch will not change >+ /* >+ * Check the user's access rights to the page. If access should be >+ * prevented then send the problem up to do_page_fault. > */ >- spin_lock(&hash_table_lock[lock_slot].lock); >- >+ > /* > * Check the user's access rights to the page. 
If access should be > * prevented then send the problem up to do_page_fault. > */ >-#ifdef CONFIG_SHARED_MEMORY_ADDRESSING >+ > access |= _PAGE_PRESENT; >- if (unlikely(access & ~(pte_val(*ptep)))) { >+ >+ /* We'll do access checking and _PAGE_BUSY setting in assembly, since >+ * it needs to be atomic. >+ */ >+ >+ __asm__ __volatile__ ("\n >+ 1: ldarx %0,0,%3\n >+ # Check if PTE is busy\n >+ andi. %1,%0,%4\n >+ bne- 1b\n >+ ori %0,%0,%4\n >+ # Write the linux PTE atomically (setting busy)\n >+ stdcx. %0,0,%3\n >+ bne- 1b\n >+ # Check access rights (access & ~(pte_val(*ptep)))\n >+ andc. %1,%2,%0\n >+ bne- 2f\n >+ li %1,1\n >+ b 3f\n >+ 2: li %1,0\n >+ 3:" >+ : "=r" (old_pte), "=r" (access_ok) >+ : "r" (access), "r" (ptep), "i" (_PAGE_BUSY) >+ : "cr0", "memory"); >+ >+#ifdef CONFIG_SHARED_MEMORY_ADDRESSING >+ if (unlikely(!access_ok)) { > if(!(((ea >> SMALLOC_EA_SHIFT) == > (SMALLOC_START >> SMALLOC_EA_SHIFT)) && > ((current->thread.flags) & PPC_FLAG_SHARED))) { >- spin_unlock(&hash_table_lock[lock_slot].lock); >- return 1; >+ ret = 1; >+ goto out_unlock; > } > } > #else >- access |= _PAGE_PRESENT; >- if (unlikely(access & ~(pte_val(*ptep)))) { >- spin_unlock(&hash_table_lock[lock_slot].lock); >- return 1; >+ if (unlikely(!access_ok)) { >+ ret = 1; >+ goto out_unlock; > } > #endif > > /* >- * We have found a pte (which was present). >- * The spinlocks prevent this status from changing >- * The hash_table_lock prevents the _PAGE_HASHPTE status >- * from changing (RPN, DIRTY and ACCESSED too) >- * The page_table_lock prevents the pte from being >- * invalidated or modified >- */ >- >- /* >+ * We have found a proper pte. The hash_table_lock protects >+ * the pte from deallocation and the _PAGE_BUSY bit protects >+ * the contents of the PTE from changing. >+ * > * At this point, we have a pte (old_pte) which can be used to build > * or update an HPTE. 
There are 2 cases: > * >@@ -385,7 +431,7 @@ > else > pte_val(new_pte) |= _PAGE_ACCESSED; > >- newpp = computeHptePP(pte_val(new_pte)); >+ newpp = computeHptePP(pte_val(new_pte) & ~_PAGE_BUSY); > > /* Check if pte already has an hpte (case 2) */ > if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) { >@@ -400,12 +446,13 @@ > slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP; > slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12; > >- /* XXX fix large pte flag */ >+ /* XXX fix large pte flag */ > if (ppc_md.hpte_updatepp(slot, secondary, > newpp, va, 0) == -1) { > pte_val(old_pte) &= ~_PAGE_HPTEFLAGS; > } else { > if (!pte_same(old_pte, new_pte)) { >+ /* _PAGE_BUSY is still set in new_pte */ > *ptep = new_pte; > } > } >@@ -425,12 +472,19 @@ > pte_val(new_pte) |= ((slot<<12) & > (_PAGE_GROUP_IX | _PAGE_SECONDARY)); > >+ smp_wmb(); >+ /* _PAGE_BUSY is not set in new_pte */ > *ptep = new_pte; >+ >+ return 0; > } > >- spin_unlock(&hash_table_lock[lock_slot].lock); >+out_unlock: >+ smp_wmb(); > >- return 0; >+ pte_val(*ptep) &= ~_PAGE_BUSY; >+ >+ return ret; > } > > /* >@@ -497,11 +551,14 @@ > pgdir = mm->pgd; > if (pgdir == NULL) return 1; > >- /* >- * Lock the Linux page table to prevent mmap and kswapd >- * from modifying entries while we search and update >+ /* The pte_hash_lock is used to block any PTE deallocations >+ * while we walk the tree and use the entry. While technically >+ * we both read and write the PTE entry while holding the read >+ * lock, the _PAGE_BUSY bit will block pte_update()s to the >+ * specific entry. 
> */ >- spin_lock(&mm->page_table_lock); >+ >+ read_lock(&pte_hash_lock[smp_processor_id()]); > > ptep = find_linux_pte(pgdir, ea); > /* >@@ -514,8 +571,7 @@ > /* If no pte, send the problem up to do_page_fault */ > ret = 1; > } >- >- spin_unlock(&mm->page_table_lock); >+ read_unlock(&pte_hash_lock[smp_processor_id()]); > > return ret; > } >@@ -540,8 +596,6 @@ > lock_slot = get_lock_slot(vpn); > hash = hpt_hash(vpn, large); > >- spin_lock_irqsave(&hash_table_lock[lock_slot].lock, flags); >- > pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0)); > secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15; > if (secondary) hash = ~hash; >@@ -551,8 +605,6 @@ > if (pte_val(pte) & _PAGE_HASHPTE) { > ppc_md.hpte_invalidate(slot, secondary, va, large, local); > } >- >- spin_unlock_irqrestore(&hash_table_lock[lock_slot].lock, flags); > } > > long plpar_pte_enter(unsigned long flags, >@@ -787,6 +839,8 @@ > > avpn = vpn >> 11; > >+ pSeries_lock_hpte(hptep); >+ > dw0 = hptep->dw0.dw0; > > /* >@@ -794,9 +848,13 @@ > * the AVPN, hash group, and valid bits. By doing it this way, > * it is common with the pSeries LPAR optimal path. > */ >- if (dw0.bolted) return; >+ if (dw0.bolted) { >+ pSeries_unlock_hpte(hptep); > >- /* Invalidate the hpte. */ >+ return; >+ } >+ >+ /* Invalidate the hpte. This clears the lock as well. 
*/ > hptep->dw0.dword0 = 0; > > /* Invalidate the tlb */ >@@ -875,6 +933,8 @@ > > avpn = vpn >> 11; > >+ pSeries_lock_hpte(hptep); >+ > dw0 = hptep->dw0.dw0; > if ((dw0.avpn == avpn) && > (dw0.v) && (dw0.h == secondary)) { >@@ -900,10 +960,14 @@ > hptep->dw0.dw0 = dw0; > > __asm__ __volatile__ ("ptesync" : : : "memory"); >+ >+ pSeries_unlock_hpte(hptep); > > return 0; > } > >+ pSeries_unlock_hpte(hptep); >+ > return -1; > } > >@@ -1062,9 +1126,11 @@ > dw0 = hptep->dw0.dw0; > if (!dw0.v) { > /* retry with lock held */ >+ pSeries_lock_hpte(hptep); > dw0 = hptep->dw0.dw0; > if (!dw0.v) > break; >+ pSeries_unlock_hpte(hptep); > } > hptep++; > } >@@ -1079,9 +1145,11 @@ > dw0 = hptep->dw0.dw0; > if (!dw0.v) { > /* retry with lock held */ >+ pSeries_lock_hpte(hptep); > dw0 = hptep->dw0.dw0; > if (!dw0.v) > break; >+ pSeries_unlock_hpte(hptep); > } > hptep++; > } >@@ -1304,9 +1372,11 @@ > > if (dw0.v && !dw0.bolted) { > /* retry with lock held */ >+ pSeries_lock_hpte(hptep); > dw0 = hptep->dw0.dw0; > if (dw0.v && !dw0.bolted) > break; >+ pSeries_unlock_hpte(hptep); > } > > slot_offset++; >===== arch/ppc64/mm/init.c 1.8 vs edited ===== >--- 1.8/arch/ppc64/mm/init.c Tue Jan 6 17:54:44 2004 >+++ edited/arch/ppc64/mm/init.c Tue Feb 3 21:49:59 2004 >@@ -104,16 +104,94 @@ > */ > mmu_gather_t mmu_gathers[NR_CPUS]; > >+/* PTE free batching structures. We need a lock since not all >+ * operations take place under page_table_lock. Keep it per-CPU >+ * to avoid bottlenecks. >+ */ >+ >+struct pte_freelist_batch ____cacheline_aligned pte_freelist_cur[NR_CPUS] __cacheline_aligned_in_smp; >+rwlock_t pte_hash_lock[NR_CPUS] __cacheline_aligned_in_smp = { [0 ... NR_CPUS-1] = RW_LOCK_UNLOCKED }; >+ >+unsigned long pte_freelist_forced_free; >+ >+static inline void pte_free_sync(void) >+{ >+ unsigned long flags; >+ int i; >+ >+ /* All we need to know is that we can get the write lock if >+ * we wanted to, i.e. that no hash_page()s are holding it for reading. 
>+ * If none are reading, that means there's no currently executing >+ * hash_page() that might be working on one of the PTE's that will >+ * be deleted. Likewise, if there is a reader, we need to get the >+ * write lock to know when it releases the lock. >+ */ >+ >+ for (i = 0; i < smp_num_cpus; i++) >+ if (is_read_locked(&pte_hash_lock[i])) { >+ /* So we don't deadlock with a reader on current cpu */ >+ if(i == smp_processor_id()) >+ local_irq_save(flags); >+ >+ write_lock(&pte_hash_lock[i]); >+ write_unlock(&pte_hash_lock[i]); >+ >+ if(i == smp_processor_id()) >+ local_irq_restore(flags); >+ } >+} >+ >+ >+/* This is only called when we are critically out of memory >+ * (and fail to get a page in pte_free_tlb). >+ */ >+void pte_free_now(pte_t *pte) >+{ >+ pte_freelist_forced_free++; >+ >+ pte_free_sync(); >+ >+ pte_free_kernel(pte); >+} >+ >+/* Deallocates the pte-free batch after syncronizing with readers of >+ * any page tables. >+ */ >+void pte_free_batch(void **batch, int size) >+{ >+ unsigned int i; >+ >+ pte_free_sync(); >+ >+ for (i = 0; i < size; i++) >+ pte_free_kernel(batch[i]); >+ >+ free_page((unsigned long)batch); >+} >+ >+ > int do_check_pgt_cache(int low, int high) > { > int freed = 0; >+ struct pte_freelist_batch *batch; >+ >+ /* We use this function to push the current pte free batch to be >+ * deallocated, since do_check_pgt_cache() is callEd at the end of each >+ * free_one_pgd() and other parts of VM relies on all PTE's being >+ * properly freed upon return from that function. 
>+ */ >+ >+ batch = &pte_freelist_cur[smp_processor_id()]; >+ >+ if(batch->entry) { >+ pte_free_batch(batch->entry, batch->index); >+ batch->entry = NULL; >+ } > > if (pgtable_cache_size > high) { > do { > if (pgd_quicklist) > free_page((unsigned long)pgd_alloc_one_fast(0)), ++freed; >- if (pmd_quicklist) >- free_page((unsigned long)pmd_alloc_one_fast(0, 0)), ++freed; > if (pte_quicklist) > free_page((unsigned long)pte_alloc_one_fast(0, 0)), ++freed; > } while (pgtable_cache_size > low); >@@ -290,7 +368,9 @@ > void > local_flush_tlb_mm(struct mm_struct *mm) > { >- spin_lock(&mm->page_table_lock); >+ unsigned long flags; >+ >+ spin_lock_irqsave(&mm->page_table_lock, flags); > > if ( mm->map_count ) { > struct vm_area_struct *mp; > JSD: I believe you said the _irqsave wasn't needed here so this hunk and the next JSD: one can be removed. >@@ -298,7 +378,7 @@ > local_flush_tlb_range( mm, mp->vm_start, mp->vm_end ); > } > >- spin_unlock(&mm->page_table_lock); >+ spin_unlock_irqrestore(&mm->page_table_lock, flags); > } > > /* >===== include/asm-ppc64/pgalloc.h 1.2 vs edited ===== >--- 1.2/include/asm-ppc64/pgalloc.h Tue Apr 9 06:31:08 2002 >+++ edited/include/asm-ppc64/pgalloc.h Tue Feb 3 21:50:36 2004 >@@ -15,7 +15,6 @@ > #define quicklists get_paca() > > #define pgd_quicklist (quicklists->pgd_cache) >-#define pmd_quicklist (quicklists->pmd_cache) > #define pte_quicklist (quicklists->pte_cache) > #define pgtable_cache_size (quicklists->pgtable_cache_sz) > >@@ -60,10 +59,10 @@ > static inline pmd_t* > pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr) > { >- unsigned long *ret = (unsigned long *)pmd_quicklist; >+ unsigned long *ret = (unsigned long *)pte_quicklist; > > if (ret != NULL) { >- pmd_quicklist = (unsigned long *)(*ret); >+ pte_quicklist = (unsigned long *)(*ret); > ret[0] = 0; > --pgtable_cache_size; > } >@@ -80,14 +79,6 @@ > return pmd; > } > >-static inline void >-pmd_free (pmd_t *pmd) >-{ >- *(unsigned long *)pmd = (unsigned long) pmd_quicklist; 
>- pmd_quicklist = (unsigned long *) pmd; >- ++pgtable_cache_size; >-} >- > #define pmd_populate(MM, PMD, PTE) pmd_set(PMD, PTE) > > static inline pte_t* >@@ -115,12 +106,54 @@ > } > > static inline void >-pte_free (pte_t *pte) >+pte_free_kernel (pte_t *pte) > { > *(unsigned long *)pte = (unsigned long) pte_quicklist; > pte_quicklist = (unsigned long *) pte; > ++pgtable_cache_size; > } >+ >+ >+/* Use the PTE functions for freeing PMD as well, since the same >+ * problem with tree traversals apply. Since pmd pointers are always >+ * virtual, no need for a page_address() translation. >+ */ >+ >+#define pte_free(pte_page) __pte_free(pte_page) > JSD: Your original patch defined pte_free to be __pte_free(page_address(pte_page)) JSD: Is the page_address() wrapper no longer necessary? >+#define pmd_free(pmd) __pte_free(pmd) >+ >+struct pte_freelist_batch >+{ >+ unsigned int index; >+ void **entry; >+}; >+ >+#define PTE_FREELIST_SIZE (PAGE_SIZE / sizeof(void *)) >+ >+extern void pte_free_now(pte_t *pte); >+extern void pte_free_batch(void **batch, int size); >+extern struct ____cacheline_aligned pte_freelist_batch pte_freelist_cur[] __cacheline_aligned_in_smp; >+ >+static inline void __pte_free(pte_t *pte) >+{ >+ struct pte_freelist_batch *batchp = &pte_freelist_cur[smp_processor_id()]; >+ >+ if (batchp->entry == NULL) { >+ batchp->entry = (void **)__get_free_page(GFP_ATOMIC); >+ if (batchp->entry == NULL) { >+ pte_free_now(pte); >+ return; >+ } >+ batchp->index = 0; >+ } >+ >+ batchp->entry[batchp->index++] = pte; >+ if (batchp->index == PTE_FREELIST_SIZE) { >+ pte_free_batch(batchp->entry, batchp->index); >+ batchp->entry = NULL; >+ } >+} >+ > > extern int do_check_pgt_cache(int, int); > >===== include/asm-ppc64/pgtable.h 1.7 vs edited ===== >--- 1.7/include/asm-ppc64/pgtable.h Mon Aug 25 23:47:52 2003 >+++ edited/include/asm-ppc64/pgtable.h Tue Feb 3 21:33:22 2004 >@@ -88,22 +88,22 @@ > * Bits in a linux-style PTE. 
These match the bits in the > * (hardware-defined) PowerPC PTE as closely as possible. > */ >-#define _PAGE_PRESENT 0x001UL /* software: pte contains a translation */ >-#define _PAGE_USER 0x002UL /* matches one of the PP bits */ >-#define _PAGE_RW 0x004UL /* software: user write access allowed */ >-#define _PAGE_GUARDED 0x008UL >-#define _PAGE_COHERENT 0x010UL /* M: enforce memory coherence (SMP systems) */ >-#define _PAGE_NO_CACHE 0x020UL /* I: cache inhibit */ >-#define _PAGE_WRITETHRU 0x040UL /* W: cache write-through */ >-#define _PAGE_DIRTY 0x080UL /* C: page changed */ >-#define _PAGE_ACCESSED 0x100UL /* R: page referenced */ >-#define _PAGE_HPTENOIX 0x200UL /* software: pte HPTE slot unknown */ >-#define _PAGE_HASHPTE 0x400UL /* software: pte has an associated HPTE */ >-#define _PAGE_EXEC 0x800UL /* software: i-cache coherence required */ >-#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */ >-#define _PAGE_GROUP_IX 0x7000UL /* software: HPTE index within group */ >+#define _PAGE_PRESENT 0x0001 /* software: pte contains a translation */ >+#define _PAGE_USER 0x0002 /* matches one of the PP bits */ >+#define _PAGE_RW 0x0004 /* software: user write access allowed */ >+#define _PAGE_GUARDED 0x0008 >+#define _PAGE_COHERENT 0x0010 /* M: enforce memory coherence (SMP systems) */ >+#define _PAGE_NO_CACHE 0x0020 /* I: cache inhibit */ >+#define _PAGE_WRITETHRU 0x0040 /* W: cache write-through */ >+#define _PAGE_DIRTY 0x0080 /* C: page changed */ >+#define _PAGE_ACCESSED 0x0100 /* R: page referenced */ >+#define _PAGE_BUSY 0x0200 /* software: pte & hash are busy */ >+#define _PAGE_HASHPTE 0x0400 /* software: pte has an associated HPTE */ >+#define _PAGE_EXEC 0x0800 /* software: i-cache coherence required */ >+#define _PAGE_GROUP_IX 0x7000 /* software: HPTE index within group */ >+#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */ > /* Bits 0x7000 identify the index within an HPT Group */ >-#define _PAGE_HPTEFLAGS 
(_PAGE_HASHPTE | _PAGE_HPTENOIX | _PAGE_SECONDARY | _PAGE_GROUP_IX) >+#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX) > /* PAGE_MASK gives the right answer below, but only by accident */ > /* It should be preserving the high 48 bits and then specifically */ > /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */ >@@ -281,13 +281,15 @@ > unsigned long old, tmp; > > __asm__ __volatile__("\n\ >-1: ldarx %0,0,%3 \n\ >+1: ldarx %0,0,%3 \n\ >+ andi. %1,%0,%7 # loop on _PAGE_BUSY set\n\ >+ bne- 1b \n\ > andc %1,%0,%4 \n\ > or %1,%1,%5 \n\ > stdcx. %1,0,%3 \n\ > bne- 1b" > : "=&r" (old), "=&r" (tmp), "=m" (*p) >- : "r" (p), "r" (clr), "r" (set), "m" (*p) >+ : "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY) > : "cc" ); > return old; > } > > -- Julie DeWandel Red Hat, Inc. Tel (978) 692-3113 x23251 ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Thu Feb 5 07:19:05 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 04 Feb 2004 14:19:05 -0600 Subject: halt vs halt -p Message-ID: <1075925944.1421.41.camel@magik> Maybe I don't know the history, but why do we power-off on a 'halt'? According to the man page, we shouldn't power-off unless it's a 'halt -p' or a 'poweroff'. I checked x86 and it doesn't do anything on a halt. I put a patch below of what I would have expected the code to look like. 
Thanks, Jake diff -Nru a/arch/ppc64/kernel/iSeries_setup.c b/arch/ppc64/kernel/iSeries_setup.c --- a/arch/ppc64/kernel/iSeries_setup.c Wed Feb 4 13:37:46 2004 +++ b/arch/ppc64/kernel/iSeries_setup.c Wed Feb 4 13:37:46 2004 @@ -786,7 +786,6 @@ */ void iSeries_halt(void) { - mf_powerOff(); } /* JDH Hack */ diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Wed Feb 4 13:37:46 2004 +++ b/arch/ppc64/kernel/rtas.c Wed Feb 4 13:37:46 2004 @@ -417,7 +417,6 @@ { if (rtas_firmware_flash_list.next) rtas_flash_bypass_warning(); - rtas_power_off(); } unsigned long rtas_rmo_buf = 0; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Thu Feb 5 07:27:04 2004 From: boutcher at us.ibm.com (David Boutcher) Date: Wed, 4 Feb 2004 14:27:04 -0600 Subject: halt vs halt -p In-Reply-To: <1075925944.1421.41.camel@magik> Message-ID: Well, validate the iSeries path. On the iSeries, OS/400 kicks off a graceful shutdown of the Linux partition, and at the end it is supposed to end up powered off. I'm not sure if that ends up in iSeries_halt(void) or not. Dave Boutcher IBM Linux Technology Center Jake Moilanen Sent by: owner-linuxppc64-dev at lists.linuxppc.org 02/04/2004 02:19 PM To: PPC64 External List cc: Subject: halt vs halt -p Maybe I don't know the history, but why do we power-off on a 'halt'? According to the man page, we shouldn't power-off unless it's a 'halt -p' or a 'poweroff'. I checked x86 and it doesn't do anything on a halt. I put a patch below of what I would have expected the code to look like.
Thanks, Jake diff -Nru a/arch/ppc64/kernel/iSeries_setup.c b/arch/ppc64/kernel/iSeries_setup.c --- a/arch/ppc64/kernel/iSeries_setup.c Wed Feb 4 13:37:46 2004 +++ b/arch/ppc64/kernel/iSeries_setup.c Wed Feb 4 13:37:46 2004 @@ -786,7 +786,6 @@ */ void iSeries_halt(void) { - mf_powerOff(); } /* JDH Hack */ diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Wed Feb 4 13:37:46 2004 +++ b/arch/ppc64/kernel/rtas.c Wed Feb 4 13:37:46 2004 @@ -417,7 +417,6 @@ { if (rtas_firmware_flash_list.next) rtas_flash_bypass_warning(); - rtas_power_off(); } unsigned long rtas_rmo_buf = 0; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 5 07:28:53 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 4 Feb 2004 14:28:53 -0600 Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs In-Reply-To: ; from olof@austin.ibm.com on Tue, Feb 03, 2004 at 10:52:11PM -0600 References: <20040203183459.B27780@forte.austin.ibm.com> Message-ID: <20040204142853.A28220@forte.austin.ibm.com> On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote: > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote: > > > Patch for multiple EEH-related bugs. Please review this patch, > > & if appropriate, please apply. It should apply cleanly to > > the current ameslab tree (Feb 03 2004 2.6.2-rc3). > > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you > made the diff against a current ameslab tree? Right tree, bad email attachment. I don't know how it happened, but what I sent out had some trailing whitespace whacked. The attached patch should not have this problem. 
--linas -------------- next part -------------- ===== arch/ppc64/kernel/chrp_setup.c 1.49 vs edited ===== --- 1.49/arch/ppc64/kernel/chrp_setup.c Mon Jan 19 20:07:02 2004 +++ edited/arch/ppc64/kernel/chrp_setup.c Tue Feb 3 16:35:44 2004 @@ -71,6 +71,7 @@ extern void openpic_init_irq_desc(irq_desc_t *); extern void find_and_init_phbs(void); +extern void __init eeh_init(void); extern void pSeries_get_boot_time(struct rtc_time *rtc_time); extern void pSeries_get_rtc_time(struct rtc_time *rtc_time); ===== arch/ppc64/kernel/eeh.c 1.17 vs edited ===== --- 1.17/arch/ppc64/kernel/eeh.c Tue Feb 3 11:03:04 2004 +++ edited/arch/ppc64/kernel/eeh.c Tue Feb 3 16:38:19 2004 @@ -38,44 +38,68 @@ #define BUID_LO(buid) ((buid) & 0xffffffff) #define CONFIG_ADDR(busno, devfn) (((((busno) & 0xff) << 8) | ((devfn) & 0xf8)) << 8) -unsigned long eeh_total_mmio_ffs; -unsigned long eeh_false_positives; /* RTAS tokens */ static int ibm_set_eeh_option; static int ibm_set_slot_reset; static int ibm_read_slot_reset_state; -static int eeh_implemented; +static int eeh_subsystem_enabled; #define EEH_MAX_OPTS 4096 static char *eeh_opts; static int eeh_opts_last; -unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX]; +/* System monitoring statistics */ +unsigned long eeh_total_mmio_ffs; +unsigned long eeh_false_positives; +unsigned long eeh_ignored_failures; pte_t *find_linux_pte(pgd_t *pgdir, unsigned long va); /* from htab.c */ static int eeh_check_opts_config(struct device_node *dn, int class_code, int vendor_id, int device_id, int default_state); -unsigned long eeh_token_to_phys(unsigned long token) + +/** + * eeh_token_to_phys - convert EEH address token to phys address + * @token i/o token, should be address in the form 0xA.... + * + * Converts EEH address tokens into physical addresses. Note that + * ths routine does *not* convert I/O BAR addresses (which start + * with 0xE...) to phys addresses! 
+ */ +unsigned long +eeh_token_to_phys(unsigned long token) { + pte_t *ptep; + unsigned long pa, vaddr; if (REGION_ID(token) == EEH_REGION_ID) { - unsigned long vaddr = IO_TOKEN_TO_ADDR(token); - pte_t *ptep = find_linux_pte(ioremap_mm.pgd, vaddr); - unsigned long pa = pte_pfn(*ptep) << PAGE_SHIFT; - return pa | (vaddr & (PAGE_SIZE-1)); - } else + vaddr = IO_TOKEN_TO_ADDR(token); + } else { return token; + } + + ptep = find_linux_pte(ioremap_mm.pgd, vaddr); + pa = pte_pfn(*ptep) << PAGE_SHIFT; + return pa | (vaddr & (PAGE_SIZE-1)); } -/* Check for an eeh failure at the given token address. +/** + * eeh_check_failure - check if all 1's data is due to EEH slot freeze + * @token i/o token, should be address in the form 0xA.... + * @val value, should be all 1's (XXX why do we need this arg??) + * @who arbitrary ID, useful for debugging + * + * Check for an eeh failure at the given token address. * The given value has been read and it should be 1's (0xff, 0xffff or * 0xffffffff). * * Probe to determine if an error actually occurred. If not return val. * Otherwise panic. + * + * Note this routine might be called in an interrupt context ... */ -unsigned long eeh_check_failure(void *token, unsigned long val) +unsigned long +eeh_check_failure(void *token, unsigned long val, int who) { unsigned long addr; struct pci_dev *dev; @@ -85,28 +109,27 @@ /* IO BAR access could get us here...or if we manually force EEH * operation on even if the hardware won't support it. */ - if (!eeh_implemented || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE) + if (!eeh_subsystem_enabled || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE) return val; - /* Finding the phys addr + pci device is quite expensive. - * However, the RTAS call is MUCH slower.... :( - */ + /* Finding the phys addr + pci device; this is pretty quick. 
*/ addr = eeh_token_to_phys((unsigned long)token); - dev = pci_find_dev_by_addr(addr); - if (!dev) { - printk("EEH: no pci dev found for addr=0x%lx\n", addr); - return val; - } + dev = pci_get_device_by_addr(addr); + + if (!dev) return val; + dn = pci_device_to_OF_node(dev); if (!dn) { - printk("EEH: no pci dn found for addr=0x%lx\n", addr); + pci_dev_put (dev); return val; } /* Access to IO BARs might get this far and still not want checking. */ - if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || dn->eeh_mode & EEH_MODE_NOCHECK) + if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || + dn->eeh_mode & EEH_MODE_NOCHECK) { + pci_dev_put (dev); return val; - + } /* Now test for an EEH failure. This is VERY expensive. * Note that the eeh_config_addr may be a parent device @@ -119,6 +142,7 @@ dn->eeh_config_addr, BUID_HI(dn->phb->buid), BUID_LO(dn->phb->buid)); if (ret == 0 && rets[1] == 1 && rets[0] >= 2) { + unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX]; unsigned long slot_err_ret; memset(slot_err_buf, 0, RTAS_ERROR_LOG_MAX); @@ -139,23 +163,36 @@ * the system in light of potential corruption, we * can use it here. 
*/ - if (panic_on_oops) - panic("EEH: MMIO failure (%ld) on device:\n%s\n", - rets[0], pci_name(dev)); - else - printk("EEH: MMIO failure (%ld) on device:\n%s\n", - rets[0], pci_name(dev)); + if (panic_on_oops) { + panic("EEH: MMIO failure (%ld) on device:\n%s\n", rets[0], pci_name(dev)); + } else { + eeh_ignored_failures ++; + if (!in_interrupt()) { /* XXX this will be replaced by eehdaemon */ + printk(KERN_INFO "EEH: MMIO failure (%ld) on device:%s %s\n", + rets[0], pci_name(dev), pci_pretty_name(dev)); + } + } + } else { + eeh_false_positives++; } } - eeh_false_positives++; + pci_dev_put (dev); return val; /* good case */ - } struct eeh_early_enable_info { unsigned int buid_hi; unsigned int buid_lo; - int adapters_enabled; + + /* Handy-dandy statistics help us understand what's going on */ + int num_phbs_found; + int num_of_nodes; + int num_devices; + int num_devices_bad_status; + int num_devices_graphics; + int num_devices_w_eeh_parent; + int num_devices_w_eeh_disabled; + int num_adapters_enabled; }; /* Enable eeh for the given device node. */ @@ -170,12 +207,17 @@ u32 *regs; int enable; - if (status && strcmp(status, "ok") != 0) + info->num_devices ++; + if (status && strcmp(status, "ok") != 0) { + info->num_devices_bad_status ++; return NULL; /* ignore devices with bad status */ + } /* Weed out PHBs or other bad nodes. */ - if (!class_code || !vendor_id || !device_id) + if (!class_code || !vendor_id || !device_id) { + info->num_devices_bad_status ++; return NULL; + } /* Ignore known PHBs and EADs bridges */ if (*vendor_id == PCI_VENDOR_ID_IBM && @@ -191,28 +233,36 @@ * But there are a few cases like display devices that make sense. */ enable = 1; /* i.e. 
we will do checking */ - if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) + if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) { + printk (KERN_INFO "EEH: %s: display device, disabling EEH checking.\n", dn->full_name); + info->num_devices_graphics ++; enable = 0; + } if (!eeh_check_opts_config(dn, *class_code, *vendor_id, *device_id, enable)) { if (enable) { - printk(KERN_INFO "EEH: %s user requested to run without EEH.\n", dn->full_name); + printk(KERN_NOTICE "EEH: %s user requested to run without EEH.\n", dn->full_name); enable = 0; } } if (!enable) + { dn->eeh_mode = EEH_MODE_NOCHECK; + info->num_devices_w_eeh_disabled ++; + return NULL; + } /* This device may already have an EEH parent. */ if (dn->parent && (dn->parent->eeh_mode & EEH_MODE_SUPPORTED)) { /* Parent supports EEH. */ dn->eeh_mode |= EEH_MODE_SUPPORTED; dn->eeh_config_addr = dn->parent->eeh_config_addr; + info->num_devices_w_eeh_parent ++; return NULL; } - /* Ok..see if this device supports EEH. */ + /* Ok... see if this device supports EEH. */ regs = (u32 *)get_property(dn, "reg", 0); if (regs) { /* First register entry is addr (00BBSS00) */ @@ -221,16 +271,21 @@ regs[0], info->buid_hi, info->buid_lo, EEH_ENABLE); if (ret == 0) { - info->adapters_enabled++; + info->num_adapters_enabled++; dn->eeh_mode |= EEH_MODE_SUPPORTED; dn->eeh_config_addr = regs[0]; + printk (KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name); + } else { + printk (KERN_WARNING "EEH: %s: rtas_call failed.\n", dn->full_name); } + } else { + printk (KERN_WARNING "EEH: %s: unable to get reg property.\n", dn->full_name); } return NULL; } /* - * Initialize eeh by trying to enable it for all of the adapters in the system. + * Initialize EEH by trying to enable it for all of the adapters in the system. * As a side effect we can determine here if eeh is supported at all. * Note that we leave EEH on so failed config cycles won't cause a machine * check. 
If a user turns off EEH for a particular adapter they are really @@ -243,7 +298,7 @@ * The eeh-force-off/on option does literally what it says, so if Linux must * avoid enabling EEH this must be done. */ -void eeh_init(void) +void __init eeh_init(void) { struct device_node *phb; struct eeh_early_enable_info info; @@ -261,26 +316,37 @@ * of I/O macros even if we can't actually test for EEH failure. */ if (eeh_force_on > eeh_force_off) - eeh_implemented = 1; + eeh_subsystem_enabled = 1; else if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE) return; if (eeh_force_off > eeh_force_on) { /* User is forcing EEH off. Be noisy if it is implemented. */ - if (eeh_implemented) + if (eeh_subsystem_enabled) printk(KERN_WARNING "EEH: WARNING: PCI Enhanced I/O Error Handling is user disabled\n"); - eeh_implemented = 0; + eeh_subsystem_enabled = 0; return; } - /* Enable EEH for all adapters. Note that eeh requires buid's */ - info.adapters_enabled = 0; + info.num_adapters_enabled = 0; + info.num_of_nodes = 0; + info.num_phbs_found = 0; + info.num_devices = 0; + info.num_devices_bad_status = 0; + info.num_devices_graphics = 0; + info.num_devices_w_eeh_parent = 0; + info.num_devices_w_eeh_disabled = 0; for (phb = of_find_node_by_name(NULL, "pci"); phb; phb = of_find_node_by_name(phb, "pci")) { + int len; - int *buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len); + int *buid_vals; + + info.num_of_nodes ++; + buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len); if (!buid_vals) continue; + info.num_phbs_found ++; if (len == sizeof(int)) { info.buid_lo = buid_vals[0]; info.buid_hi = 0; @@ -288,24 +354,88 @@ info.buid_hi = buid_vals[0]; info.buid_lo = buid_vals[1]; } else { - printk("EEH: odd ibm,fw-phb-id len returned: %d\n", len); + printk(KERN_INFO "EEH: odd ibm,fw-phb-id len returned: %d\n", len); continue; } traverse_pci_devices(phb, early_enable_eeh, NULL, &info); } - if (info.adapters_enabled) { + if (info.num_adapters_enabled) { printk(KERN_INFO "EEH: PCI Enhanced 
I/O Error Handling Enabled\n"); - eeh_implemented = 1; + eeh_subsystem_enabled = 1; + } + printk(KERN_INFO "EEH: num_of_nodes=%d\n", info.num_of_nodes); + printk(KERN_INFO "EEH: num_phbs_found=%d\n", info.num_phbs_found); + printk(KERN_INFO "EEH: num_devices=%d\n", info.num_devices); + printk(KERN_INFO "EEH: num_devices_bad_status=%d\n", info.num_devices_bad_status); + printk(KERN_INFO "EEH: num_devices_graphics=%d\n", info.num_devices_graphics); + printk(KERN_INFO "EEH: num_devices_w_eeh_parent=%d\n", info.num_devices_w_eeh_parent); + printk(KERN_INFO "EEH: num_devices_w_eeh_disabled=%d\n", info.num_devices_w_eeh_disabled); + printk(KERN_INFO "EEH: num_adapters_enabled=%d\n", info.num_adapters_enabled); +} + +/** + * eeh_add_device - perform EEH initialization for the indicated pci device + * @dev: pci device for which to set up EEH + * + * This routine can be used to perform EEH initialization for PCI + * devices that were added after system boot (e.g. hotplug, dlpar). + * Whether this actually enables EEH or not for this device depends + * on the type of the device, on earlier boot command-line + * arguments & etc. + */ +void +eeh_add_device (struct pci_dev *dev) +{ + struct device_node *dn; + struct pci_controller *phb; + struct eeh_early_enable_info info; + + if (!dev || !eeh_subsystem_enabled) return; + + printk (KERN_DEBUG "EEH: adding device %s %s\n", + pci_name (dev), pci_pretty_name(dev)); + dn = pci_device_to_OF_node(dev); + if (NULL == dn) return; + + phb = PCI_GET_PHB_PTR(dev); + if (NULL == phb || 0 == phb->buid) { + printk (KERN_WARNING "EEH: Expected buid but found none\n"); + return; } + + info.buid_hi = BUID_HI(phb->buid); + info.buid_lo = BUID_LO(phb->buid); + + early_enable_eeh(dn, &info); + pci_addr_cache_insert_device (dev); } +/** + * eeh_remove_device - undo EEH setup for the indicated pci device + * @dev: pci device to be removed + * + * This routine should be when a device is removed from a running + * system (e.g. by hotplug or dlpar). 
+ */ +void +eeh_remove_device (struct pci_dev *dev) +{ + if (!dev || !eeh_subsystem_enabled) return; + + /* Unregister the device with the EEH/PCI address search system */ + printk (KERN_DEBUG "EEH: remove device %s %s\n", + pci_name (dev), pci_pretty_name(dev)); + pci_addr_cache_remove_device (dev); + +} -int eeh_set_option(struct pci_dev *dev, int option) +int +eeh_set_option(struct pci_dev *dev, int option) { struct device_node *dn = pci_device_to_OF_node(dev); struct pci_controller *phb = PCI_GET_PHB_PTR(dev); - if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_implemented) + if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_subsystem_enabled) return -2; return rtas_call(ibm_set_eeh_option, 4, 1, NULL, @@ -316,7 +446,7 @@ /* If EEH is implemented, find the PCI device using given phys addr * and check to see if eeh failure checking is disabled. - * Remap the addr (trivially) to the EEH region if not. + * Remap the addr (trivially) to the EEH region if EEH checking enabled. * For addresses not known to PCI the vaddr is simply returned unchanged. 
*/ void *eeh_ioremap(unsigned long addr, void *vaddr) @@ -324,28 +454,72 @@ struct pci_dev *dev; struct device_node *dn; - if (!eeh_implemented) + if (!eeh_subsystem_enabled) return vaddr; - dev = pci_find_dev_by_addr(addr); + dev = pci_get_device_by_addr(addr); if (!dev) return vaddr; - dn = pci_device_to_OF_node(dev); - if (!dn) + + dn = pci_device_to_OF_node(dev); + if (!dn) { + pci_dev_put (dev); return vaddr; - if (dn->eeh_mode & EEH_MODE_NOCHECK) + } + if (dn->eeh_mode & EEH_MODE_NOCHECK) { + pci_dev_put (dev); return vaddr; + } + pci_dev_put (dev); return (void *)IO_ADDR_TO_TOKEN(vaddr); } static int eeh_proc_falsepositive_read(char *page, char **start, off_t off, int count, int *eof, void *data) { - int len; - len = sprintf(page, "eeh_false_positives=%ld\n" - "eeh_total_mmio_ffs=%ld\n", - eeh_false_positives, eeh_total_mmio_ffs); - return len; + char *p, *buffer; +#define EEH_PROC_BUFSZ 250 + int n=0, bs=EEH_PROC_BUFSZ; + + if (count < 0) return -EINVAL; + + buffer = kmalloc (EEH_PROC_BUFSZ,GFP_KERNEL); + if (!buffer) return -ENOMEM; + + p = buffer; + + if (0 == eeh_subsystem_enabled) { + n += snprintf (p+n, bs-n, "EEH Subsystem is globally disabled\n"); + n += snprintf(p+n, bs-n, "eeh_total_mmio_ffs=%ld\n", + eeh_total_mmio_ffs); + } else { + n += snprintf (p+n, bs-n, "EEH Subsystem is enabled\n"); + n += snprintf(p+n, bs-n, + "eeh_total_mmio_ffs=%ld\n" + "eeh_false_positives=%ld\n" + "eeh_ignored_failures=%ld\n", + eeh_total_mmio_ffs, + eeh_false_positives, + eeh_ignored_failures); + } + + /* Misc machinations of the proc file system */ + if (off >= strlen(buffer)) { + *eof = 1; + kfree(buffer); + return 0; + } + if (n > strlen(buffer) - off) + n = strlen(buffer) - off; + if (n > count) + n = count; + else + *eof = 1; + + memcpy(page, buffer + off, n); + *start = page; + kfree(buffer); + return n; } /* Implementation of /proc/ppc64/eeh @@ -362,6 +536,12 @@ return 0; } +static int __init eeh_init_late(void) +{ + eeh_init_proc (); + return 0; +} + /* * Test 
if "dev" should be configured on or off. * This processes the options literally from left to right. @@ -456,7 +636,7 @@ if (*cur) { int curlen = curend-cur; if (eeh_opts_last + curlen > EEH_MAX_OPTS-2) { - printk(KERN_INFO "EEH: sorry...too many eeh cmd line options\n"); + printk(KERN_WARNING "EEH: sorry...too many eeh cmd line options\n"); return 1; } eeh_opts[eeh_opts_last++] = state ? '+' : '-'; @@ -478,6 +658,6 @@ return eeh_parm(str, 1); } -__initcall(eeh_init_proc); +__initcall(eeh_init_late); __setup("eeh-off", eehoff_parm); __setup("eeh-on", eehon_parm); ===== arch/ppc64/kernel/pSeries_pci.c 1.34 vs edited ===== --- 1.34/arch/ppc64/kernel/pSeries_pci.c Fri Jan 30 21:22:28 2004 +++ edited/arch/ppc64/kernel/pSeries_pci.c Tue Feb 3 16:35:49 2004 @@ -530,7 +530,7 @@ dev->resource[i].start += hose->pci_mem_offset; dev->resource[i].end += hose->pci_mem_offset; } - } + } } EXPORT_SYMBOL(pcibios_fixup_device_resources); ===== arch/ppc64/kernel/pci.c 1.42 vs edited ===== --- 1.42/arch/ppc64/kernel/pci.c Mon Jan 19 20:07:05 2004 +++ edited/arch/ppc64/kernel/pci.c Tue Feb 3 16:39:53 2004 @@ -23,6 +23,8 @@ #include #include #include +#include +#include #include #include @@ -107,42 +109,264 @@ } } -/* Given an mmio phys address, find a pci device that implements - * this address. This is of course expensive, but only used - * for device initialization or error paths. - * For io BARs it is assumed the pci_io_base has already been added - * into addr. +/** + * The pci address cache subsystem. This subsystem places + * PCI device address resources into a red-black tree, sorted + * according to the address range, so that given only an i/o + * address, the corresponding PCI device can be **quickly** + * found. * - * Bridges are ignored although they could be used to optimize the search. + * Currently, the only customer of this code is the EEH subsystem; + * thus, this code has been somewhat tailored to suit EEH better. 
+ * In particular, the cache does *not* hold the addresses of devices + * for which EEH is not enabled. + * + * (Implementation Note: The RB tree seems to be better/faster + * than any hash algo I could think of for this problem, even + * with the penalty of slow pointer chases for d-cache misses). */ -struct pci_dev *pci_find_dev_by_addr(unsigned long addr) +struct pci_io_addr_range { - struct pci_dev *dev = NULL; + struct rb_node rb_node; + unsigned long addr_lo; + unsigned long addr_hi; + struct pci_dev *pcidev; + unsigned int flags; +}; + +struct pci_io_addr_cache +{ + struct rb_root rb_root; + spinlock_t piar_lock; +} pci_io_addr_cache_root; + +static inline struct pci_dev * +__pci_get_device_by_addr (unsigned long addr) +{ + struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node; + while (n) + { + struct pci_io_addr_range *piar; + piar = rb_entry (n, struct pci_io_addr_range, rb_node); + if (addr < piar->addr_lo) { + n = n->rb_left; + } else + if (addr > piar->addr_hi) { + n = n->rb_right; + } else { + pci_dev_get (piar->pcidev); + return piar->pcidev; + } + } + return NULL; +} + +/** + * pci_get_device_by_addr - Get device, given only address + * @addr: mmio (PIO) phys address or i/o port number + * + * Given an mmio phys address, or a port number, find a pci device + * that implements this address. Be sure to pci_dev_put the device + * when finished. I/O port numbers are assumed to be offset + * from zero (that is, they do *not* have pci_io_addr added in). + * It is safe to call this function within an interrupt. + */ +struct pci_dev * +pci_get_device_by_addr (unsigned long addr) +{ + struct pci_dev *dev; + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + dev = __pci_get_device_by_addr (addr); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); + return dev; +} + +/* Handy-dandy debug print routine, does nothing more + * than print out the contents of our addr cache. 
*/ +static void +pci_addr_cache_print (struct pci_io_addr_cache *cache) +{ + struct rb_node *n; + n = rb_first (&cache->rb_root); + int cnt=0; + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry (n, struct pci_io_addr_range, rb_node); + printk (KERN_DEBUG "PCI: %s addr range %d [%lx -%lx]: %s %s\n", + (piar->flags & IORESOURCE_IO) ? "i/o" : "mem", + cnt, + piar->addr_lo, piar->addr_hi, + pci_name (piar->pcidev), + pci_pretty_name (piar->pcidev)); + cnt ++; + n = rb_next (n); + } +} + +/* Insert address range into the rb tree. */ +static inline struct pci_io_addr_range * +pci_addr_cache_insert (struct pci_dev *dev, + unsigned long alo, unsigned long ahi, unsigned int flags) +{ + struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node; + struct rb_node * parent = NULL; + struct pci_io_addr_range *piar; + + // Walk tree, find a place to insert into tree + while (*p) { + parent = *p; + piar = rb_entry (parent, struct pci_io_addr_range, rb_node); + if (alo < piar->addr_lo) { + p = &parent->rb_left; + } else if (ahi > piar->addr_hi) { + p = &parent->rb_right; + } else { + if (dev != piar->pcidev || + alo != piar->addr_lo || ahi != piar->addr_hi) { + printk (KERN_WARNING "PIAR: overlapping address range\n"); + } + return piar; + } + } + piar = (struct pci_io_addr_range *) kmalloc ( + sizeof(struct pci_io_addr_range), GFP_ATOMIC); + + if (!piar) return NULL; // whoops + + piar->addr_lo = alo; + piar->addr_hi = ahi; + piar->pcidev = dev; + piar->flags = flags; + + rb_link_node (&piar->rb_node, parent, p); + rb_insert_color (&piar->rb_node, &pci_io_addr_cache_root.rb_root); + return piar; +} + +inline void +__pci_addr_cache_insert_device (struct pci_dev *dev) +{ + struct device_node *dn; + dn = pci_device_to_OF_node(dev); + if (!dn) { + printk(KERN_WARNING "PCI: no pci dn found for dev=%s %s\n", + pci_name(dev), pci_pretty_name(dev)); + pci_dev_put (dev); + return; + } + + // Skip any devices for which EEH is not enabled. 
+ if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || + dn->eeh_mode & EEH_MODE_NOCHECK) { + printk(KERN_INFO "PCI: skip building address cache for=%s %s\n", + pci_name(dev), pci_pretty_name(dev)); + pci_dev_put (dev); + return; + } + + // Walk resources on this device, poke them into the tree int i; - unsigned long ioaddr; + for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { + unsigned long start = pci_resource_start(dev,i); + unsigned long end = pci_resource_end(dev,i); + unsigned int flags = pci_resource_flags(dev,i); + + // We are interested only bus addresses, not dma or other stuff + if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) continue; + if (start == 0 || ~start == 0 || end == 0 || ~end == 0) + continue; + pci_addr_cache_insert (dev, start, end, flags); + } +} - ioaddr = (addr > isa_io_base) ? addr - isa_io_base : 0; +/** + * pci_addr_cache_insert_device - Add a device to the address cache + * @dev: PCI device whose I/O addresses we are interested in. + * + * In order to support the fast lookup of devices based on addresses, + * we maintain a cache of devices that can be quickly searched. + * This routine adds a device to that cache. 
+ */ +void +pci_addr_cache_insert_device (struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_insert_device (dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} - while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { - if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) +static inline void +__pci_addr_cache_remove_device (struct pci_dev *dev) +{ + struct rb_node *n; + +restart: + n = rb_first (&pci_io_addr_cache_root.rb_root); + while (n) { + struct pci_io_addr_range *piar; + piar = rb_entry (n, struct pci_io_addr_range, rb_node); + + if (piar->pcidev == dev) + { + rb_erase (n, &pci_io_addr_cache_root.rb_root); + kfree (piar); + goto restart; + } + n = rb_next (n); + } + pci_dev_put (dev); +} + +/** + * pci_addr_cache_remove_device - remove pci device from addr cache + * @dev: device to remove + * + * Remove a device from the addr-cache tree. + * This is potentially expensive, since it will walk + * the tree multiple times (once per resource). + * But so what; device removal doesn't need to be that fast. + */ +void +pci_addr_cache_remove_device (struct pci_dev *dev) +{ + unsigned long flags; + + spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags); + __pci_addr_cache_remove_device (dev); + spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags); +} + +/** + * pci_addr_cache_build - Build a cache of I/O addresses + * + * Build a cache of pci i/o addresses. This cache will be used to + * find the pci device that corresponds to a given address. + * This routine scans all pci busses to build the cache. + * Must be run late in boot process, after the pci controllers + * have been scaned for devices (after all device resources are known). 
+ */ +static __init void +pci_addr_cache_build (void) +{ + struct pci_dev *dev = NULL; + + spin_lock_init (&pci_io_addr_cache_root.piar_lock); + + while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { + // Ignore PCI bridges ( XXX why ??) + if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) { + pci_dev_put (dev); continue; - - for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { - unsigned long start = pci_resource_start(dev,i); - unsigned long end = pci_resource_end(dev,i); - unsigned int flags = pci_resource_flags(dev,i); - if (start == 0 || ~start == 0 || - end == 0 || ~end == 0) - continue; - if ((flags & IORESOURCE_IO) && - (ioaddr >= start && ioaddr <= end)) - return dev; - else if ((flags & IORESOURCE_MEM) && - (addr >= start && addr <= end)) - return dev; } + pci_addr_cache_insert_device (dev); } - return NULL; + + // Verify tree built up above, echo back the list of addrs. + pci_addr_cache_print (&pci_io_addr_cache_root); } void @@ -343,6 +567,8 @@ printk("PCI: Probing PCI hardware done\n"); //ppc64_boot_msg(0x41, "PCI Done"); + + pci_addr_cache_build (); return 0; } ===== arch/ppc64/kernel/pci.h 1.10 vs edited ===== --- 1.10/arch/ppc64/kernel/pci.h Fri Sep 12 06:01:39 2003 +++ edited/arch/ppc64/kernel/pci.h Tue Feb 3 16:35:50 2004 @@ -37,11 +37,15 @@ void *traverse_pci_devices(struct device_node *start, traverse_func pre, traverse_func post, void *data); void *traverse_all_pci_devices(traverse_func pre); -struct pci_dev *pci_find_dev_by_addr(unsigned long addr); void pci_devs_phb_init(void); void pci_fix_bus_sysdata(void); struct device_node *fetch_dev_dn(struct pci_dev *dev); #define PCI_GET_PHB_PTR(dev) (((struct device_node *)(dev)->sysdata)->phb) + +/* PCI address cache management routines */ +struct pci_dev *pci_get_device_by_addr(unsigned long addr); +void pci_addr_cache_insert_device (struct pci_dev *dev); +void pci_addr_cache_remove_device (struct pci_dev *dev); #endif /* __PPC_KERNEL_PCI_H__ */ ===== drivers/pci/hotplug/rpaphp_core.c 1.2 
vs edited ===== --- 1.2/drivers/pci/hotplug/rpaphp_core.c Tue Dec 9 11:03:38 2003 +++ edited/drivers/pci/hotplug/rpaphp_core.c Tue Feb 3 16:35:51 2004 @@ -30,6 +30,7 @@ #include #include #include +#include /* for eeh_add_device() */ #include /* rtas_call */ #include /* for pci_controller */ #include "../pci.h" /* for pci_add_new_bus*/ @@ -512,6 +513,7 @@ } dev = rpaphp_find_pci_dev(slot->dn->child); + eeh_add_device(dev); } else { /* slot is not enabled */ @@ -540,12 +542,12 @@ goto exit; } + /* remove the device from the pci core */ + eeh_remove_device(slot->dev); + pci_remove_bus_device(slot->dev); - /* remove the device from the pci core */ - pci_remove_bus_device(slot->dev); - - pci_dev_put(slot->dev); - slot->state = NOT_CONFIGURED; + pci_dev_put(slot->dev); + slot->state = NOT_CONFIGURED; dbg("%s: adapter in slot[%s] unconfigured.\n", __FUNCTION__, slot->name); ===== include/asm-ppc64/eeh.h 1.6 vs edited ===== --- 1.6/include/asm-ppc64/eeh.h Fri Sep 12 06:06:51 2003 +++ edited/include/asm-ppc64/eeh.h Tue Feb 3 16:35:51 2004 @@ -45,22 +45,37 @@ /* This is for profiling only */ extern unsigned long eeh_total_mmio_ffs; -void eeh_init(void); -int eeh_get_state(unsigned long ea); -unsigned long eeh_check_failure(void *token, unsigned long val); +unsigned long eeh_check_failure(void *token, unsigned long val, int who); void *eeh_ioremap(unsigned long addr, void *vaddr); +/** + * eeh_add_device - perform EEH initialization for the indicated pci device + * @dev: pci device for which to set up EEH + * + * This routine can be used to perform EEH initialization for PCI + * devices that were added after system boot (e.g. hotplug, dlpar). + * Whether this actually enables EEH or not for this device depends + * on the type of the device, on earlier boot command-line + * arguments & etc. 
+ */ +void eeh_add_device(struct pci_dev *); + +/** + * eeh_remove_device - undo EEH setup for the indicated pci device + * @dev: pci device to be removed + * + * This routine should be when a device is removed from a running + * system (e.g. by hotplug or dlpar). + */ +void eeh_remove_device(struct pci_dev *); + + #define EEH_DISABLE 0 #define EEH_ENABLE 1 #define EEH_RELEASE_LOADSTORE 2 #define EEH_RELEASE_DMA 3 int eeh_set_option(struct pci_dev *dev, int options); -/* Given a PCI device check if eeh should be configured or not. - * This may look at firmware properties and/or kernel cmdline options. - */ -int is_eeh_configured(struct pci_dev *dev); - /* Translate a (possible) eeh token to a physical addr. * If "token" is not an eeh token it is simply returned under * the assumption that it is already a physical addr. @@ -78,11 +93,16 @@ * If this macro yields TRUE, the caller relays to eeh_check_failure() * which does further tests out of line. */ -/* #define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0) */ -/* #define EEH_POSSIBLE_ERROR(addr, vaddr, val) ((vaddr) != (addr) && EEH_POSSIBLE_IO_ERROR(val) */ /* This version is rearranged to collect some profiling data */ -#define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0 && ++eeh_total_mmio_ffs) -#define EEH_POSSIBLE_ERROR(addr, vaddr, val) (EEH_POSSIBLE_IO_ERROR(val) && (vaddr) != (addr)) +#define EEH_POSSIBLE_IO_ERROR(val, type) \ + ((val) == (type)~0 && ++eeh_total_mmio_ffs) + +/* The vaddr will equal the addr if EEH checking is disabled for + * this device. This is because eeh_ioremap() will not have + * remapped to 0xA0, and thus both vaddr and addr will be 0xE0... + */ +#define EEH_POSSIBLE_ERROR(addr, vaddr, val, type) \ + (EEH_POSSIBLE_IO_ERROR(val, type) && (vaddr) != (addr)) /* * MMIO read/write operations with EEH support. 
@@ -101,8 +121,8 @@ static inline u8 eeh_readb(void *addr) { volatile u8 *vaddr = (volatile u8 *)IO_TOKEN_TO_ADDR(addr); u8 val = in_8(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val)) - return eeh_check_failure(addr, val); + if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u8)) + return eeh_check_failure(addr, val, 8); return val; } static inline void eeh_writeb(u8 val, void *addr) { @@ -112,25 +132,47 @@ static inline u16 eeh_readw(void *addr) { volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr); u16 val = in_le16(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val)) - return eeh_check_failure(addr, val); + if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16)) + return eeh_check_failure(addr, val, 16); return val; } static inline void eeh_writew(u16 val, void *addr) { volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr); out_le16(vaddr, val); } +static inline u16 eeh_raw_readw(void *addr) { + volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr); + u16 val = in_be16(vaddr); + if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16)) + return eeh_check_failure(addr, val, 17); + return val; +} +static inline void eeh_raw_writew(u16 val, void *addr) { + volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr); + out_be16(vaddr, val); +} static inline u32 eeh_readl(void *addr) { volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr); u32 val = in_le32(vaddr); - if (EEH_POSSIBLE_ERROR(addr, vaddr, val)) - return eeh_check_failure(addr, val); + if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32)) + return eeh_check_failure(addr, val, 32); return val; } static inline void eeh_writel(u32 val, void *addr) { volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr); out_le32(vaddr, val); } +static inline u32 eeh_raw_readl(void *addr) { + volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr); + u32 val = in_be32(vaddr); + if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32)) + return eeh_check_failure(addr, val, 33); + return val; +} +static inline void 
eeh_raw_writel(u32 val, void *addr) { + volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr); + out_be32(vaddr, val); +} static inline void eeh_memset_io(void *addr, int c, unsigned long n) { void *vaddr = (void *)IO_TOKEN_TO_ADDR(addr); @@ -139,8 +181,14 @@ static inline void eeh_memcpy_fromio(void *dest, void *src, unsigned long n) { void *vsrc = (void *)IO_TOKEN_TO_ADDR(src); memcpy(dest, vsrc, n); - /* look for ffff's here at dest[n] */ + /* Look for ffff's here at dest[n]. Assume that at least 4 bytes + * were copied. Check all four bytes. + */ + if ((n>=4) && (EEH_POSSIBLE_ERROR(src, vsrc, (*((u32 *) dest+n-4)), u32))) { + eeh_check_failure(src, (*((u32 *) dest+n-4)), 88); + } } + static inline void eeh_memcpy_toio(void *dest, void *src, unsigned long n) { void *vdest = (void *)IO_TOKEN_TO_ADDR(dest); memcpy(vdest, src, n); @@ -158,8 +206,8 @@ if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS) return ~0; val = in_8((u8 *)(port+pci_io_base)); - if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val)) - return eeh_check_failure((void*)(port+pci_io_base), val); + if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u8)) + return eeh_check_failure((void*)(port), val, -8); return val; } @@ -173,8 +221,8 @@ if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS) return ~0; val = in_le16((u16 *)(port+pci_io_base)); - if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val)) - return eeh_check_failure((void*)(port+pci_io_base), val); + if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u16)) + return eeh_check_failure((void*)(port), val, -16); return val; } @@ -188,14 +236,33 @@ if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS) return ~0; val = in_le32((u32 *)(port+pci_io_base)); - if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val)) - return eeh_check_failure((void*)(port+pci_io_base), val); + if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u32)) + return eeh_check_failure((void*)(port), val, -32); return val; } static inline void eeh_outl(u32 val, unsigned long port) { if (!_IO_IS_ISA(port) 
|| _IO_HAS_ISA_BUS) return out_le32((u32 *)(port+pci_io_base), val); +} + +/* in-string eeh macros */ +static inline void eeh_insb(unsigned long port, void * buf, int ns) { + _insb((u8 *)(port+pci_io_base), buf, ns); + if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u8*)buf)+ns-1)), u8)) + eeh_check_failure((void*)(port), *(u8*)buf, -9); +} + +static inline void eeh_insw_ns(unsigned long port, void * buf, int ns) { + _insw_ns((u16 *)(port+pci_io_base), buf, ns); + if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u16*)buf)+ns-1)), u16)) + eeh_check_failure((void*)(port), *(u16*)buf, -17); +} + +static inline void eeh_insl_ns(unsigned long port, void * buf, int nl) { + _insl_ns((u32 *)(port+pci_io_base), buf, nl); + if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u32*)buf)+nl-1)), u32)) + eeh_check_failure((void*)(port), *(u32*)buf, -33); } #endif /* _EEH_H */ ===== include/asm-ppc64/io.h 1.11 vs edited ===== --- 1.11/include/asm-ppc64/io.h Mon Jan 19 20:08:22 2004 +++ edited/include/asm-ppc64/io.h Tue Feb 3 16:35:52 2004 @@ -49,6 +49,13 @@ #define outb(data,addr) writeb(data,((unsigned long)(addr))) #define outw(data,addr) writew(data,((unsigned long)(addr))) #define outl(data,addr) writel(data,((unsigned long)(addr))) +/* + * The *_ns versions below don't do byte-swapping. + * Neither do the standard versions now, these are just here + * for older code. + */ +#define insw_ns(port, buf, ns) _insw_ns((u16 *)((port)+pci_io_base), (buf), (ns)) +#define insl_ns(port, buf, nl) _insl_ns((u32 *)((port)+pci_io_base), (buf), (nl)) #else #define readb(addr) eeh_readb((void*)(addr)) #define readw(addr) eeh_readw((void*)(addr)) @@ -71,12 +78,16 @@ * They are only used in practice for transferring buffers which * are arrays of bytes, and byte-swapping is not appropriate in * that case. 
- paulus */ -#define insb(port, buf, ns) _insb((u8 *)((port)+pci_io_base), (buf), (ns)) -#define outsb(port, buf, ns) _outsb((u8 *)((port)+pci_io_base), (buf), (ns)) -#define insw(port, buf, ns) _insw_ns((u16 *)((port)+pci_io_base), (buf), (ns)) -#define outsw(port, buf, ns) _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns)) -#define insl(port, buf, nl) _insl_ns((u32 *)((port)+pci_io_base), (buf), (nl)) -#define outsl(port, buf, nl) _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl)) +#define insb(port, buf, ns) eeh_insb((port), (buf), (ns)) +#define insw(port, buf, ns) eeh_insw_ns((port), (buf), (ns)) +#define insl(port, buf, nl) eeh_insl_ns((port), (buf), (nl)) +#define insw_ns(port, buf, ns) eeh_insw_ns((port), (buf), (ns)) +#define insl_ns(port, buf, nl) eeh_insl_ns((port), (buf), (nl)) + +#define outsb(port, buf, ns) _outsb((u8 *)((port)+pci_io_base), (buf), (ns)) +#define outsw(port, buf, ns) _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns)) +#define outsl(port, buf, nl) _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl)) + #endif extern void _insb(volatile u8 *port, void *buf, int ns); @@ -106,9 +117,7 @@ * Neither do the standard versions now, these are just here * for older code. */ -#define insw_ns(port, buf, ns) _insw_ns((u16 *)((port)+pci_io_base), (buf), (ns)) #define outsw_ns(port, buf, ns) _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns)) -#define insl_ns(port, buf, nl) _insl_ns((u32 *)((port)+pci_io_base), (buf), (nl)) #define outsl_ns(port, buf, nl) _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl)) @@ -177,6 +186,9 @@ /* * 8, 16 and 32 bit, big and little endian I/O operations, with barrier. + * These routines do not perform EEH-related I/O address translation, + * and should not be used directly by device drivers. Use inb/readb + * instead. 
 */
 static inline int in_8(volatile unsigned char *addr)
 {

From linas at austin.ibm.com Thu Feb 5 07:37:48 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 14:37:48 -0600
Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <20040204142853.A28220@forte.austin.ibm.com>; from linas@austin.ibm.com on Wed, Feb 04, 2004 at 02:28:53PM -0600
References: <20040203183459.B27780@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com>
Message-ID: <20040204143748.M27780@forte.austin.ibm.com>

On Wed, Feb 04, 2004 at 02:28:53PM -0600, linas at austin.ibm.com wrote:
> On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
> > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
> >
> > > Patch for multiple EEH-related bugs. Please review this patch,
> > > & if appropriate, please apply. It should apply cleanly to
> > > the current ameslab tree (Feb 03 2004 2.6.2-rc3).
> >
> > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
> > made the diff against a current ameslab tree?
>
> Right tree, bad email attachment.
>
> I don't know how it happened, but what I sent out had some trailing
> whitespace whacked. The attached patch should not have this problem.

Arghhh! Not my day. The ppc64 mailing list manager or one of the
mail gateways seems to be removing whitespace. When I send the
patch to myself, it's OK, but when I send it on the list, it gets
mangled ... see attached.

--linas
-------------- next part --------------
the last 9 lines should have 1-9 whitespaces followed by newline

From haveblue at us.ibm.com Thu Feb 5 08:31:49 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: 04 Feb 2004 13:31:49 -0800
Subject: lmb_phys_mem_size() vs. lmb_end_of_DRAM()
Message-ID: <1075930308.27981.863.camel@nighthawk>

It looks to me like lmb_phys_mem_size() will return the largest valid
physical address on the system while lmb_end_of_DRAM() is used just for
usable RAM.
lmb_phys_mem_size() determines how big the kernel's ZONE_DMA is and
lmb_end_of_DRAM() is used to figure out everything else like bootmem
sizes and the zone start/end_pfn values.

Can we just abolish the use of lmb_end_of_DRAM(), and bootmem reserve
the I/O areas? We'll probably have some unused struct pages, but that
should be just about the only side-effect, and it keeps us from having
the confusing distinction. That's what we do on x86 at least.

--dave

From benh at kernel.crashing.org Thu Feb 5 10:07:08 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 05 Feb 2004 10:07:08 +1100
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <402152C8.7080907@redhat.com>
References: <402152C8.7080907@redhat.com>
Message-ID: <1075936028.4019.27.camel@gaston>

Hi Julie !

> OK, I managed to convince myself that using an eieio is ok here. I was
> concerned that other processors might not have seen any of the stores
> that preceded the eieio instruction, since eieio is normally only used
> when dealing with device memory. lwsync ensures other processors have
> seen any stores to system memory at the point the lock is released. But
> the only stores that matter here are the hpte (and it is sync'd) and the
> pte and it has the lock bit. So when another processor sees the pte
> contents without the lock bit set it will, by default, be seeing the
> updated value as well.

eieio enforces store ordering on cacheable accesses too, which is all we
should need at this point.

> So it is true that an interrupt handler can cause a page fault? Can you
> provide me with an example?

Not really a "page fault" in the Linux sense, but rather a hash miss, yes.
Typically, a driver accessing ioremap'ed IO space or a module running
vmalloc'ed memory can trigger a hash miss.
With my 2.6 implementation, there shouldn't be a problem as only hash_page will set PAGE_BUSY and this is done with interrupts off, so it can't be re-entered on the same CPU. > Let me see if I understand this. When someone wants to free a page > pointed to by an entry in a 3rd level page table, they clear out the pte > in the page table using pte_clear(). Then they call pte_free with the > address of the page they are freeing up (not really a page table entry > but the actual page address). This page address is added to the batch > list. Later, the idle loop or process termination code calls > do_check_pgt_cache which will free all the pages in the batch list. Yes. What we need is to make sure no CPU was walking the page tables when the pte_clear occurred; that is, that no CPU is still using the PTEs in the page we are about to get rid of. This basically means we must make sure that no CPU that was in hash_page at the time of the pte_clear is still in that function. Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 5 11:37:22 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 4 Feb 2004 18:37:22 -0600 Subject: Root Drive Mirroring and LVM. In-Reply-To: <16417.31540.78193.699721@notabene.cse.unsw.edu.au>; from neilb@cse.unsw.edu.au on Thu, Feb 05, 2004 at 10:07:32AM +1100 References: <20040204122317.K27780@forte.austin.ibm.com> <20040204164946.P27780@forte.austin.ibm.com> <16417.31540.78193.699721@notabene.cse.unsw.edu.au> Message-ID: <20040204183722.R27780@forte.austin.ibm.com> On Thu, Feb 05, 2004 at 10:07:32AM +1100, Neil Brown wrote: > On Wednesday February 4, linas at austin.ibm.com wrote: > > On Wed, Feb 04, 2004 at 05:32:01PM -0500, Mark Hahn wrote: > > > > So what's wrong with autodetect, again? > > > > > > my guess is: in-kernel autodetect is the problem. > > > out-of-kernel detection can be much smarter, > > > and can be more easily tested/replaced.
> > > > Hm, yes, that makes sense. > > > Good :-) > > Just to flesh out my thoughts a bit more: > > If the root filesystem is on an MD array, then I see the process of > assembling that md array as quite similar to the process of finding > the device for the root filesystem. > > We don't expect the kernel, or anyone else, to scan all devices > looking for something that looks like a root filesystem, and loading > that. Rather we tell the kernel or boot loader exactly where to find > the root filesystem. And if the root filesystem moves, we get to > explicitly tell the boot loader where it is (root=/dev/hdc1 or > whatever). Well, this is actually one of my more bitter complaints. I'd much much rather have some symbolic disk label, and have the kernel scan *all* devices for that. I can defend this for both low-end and high-end machines. On the low end, i.e. PCs and PC servers: I've dealt (multiple times) with disk and/or controller failure. If you have more than 3 disks, it can get very confusing as to which cable is which. Add to that that some BIOSes allow you to enable/disable controllers, which causes devices to be renumbered. E.g. I have a mobo with two 'plain vanilla' ide connectors and two '100MHz' 80-wire ide connectors. Which one is /dev/hda and which is not depends on the BIOS settings. Worse: if I plug in a 3rd party ide controller, then the numbering becomes insane:

/dev/hda: mobo 33MHz controller
/dev/hdc: mobo 33MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller
/dev/hdi: mobo 100MHz controller
/dev/hdk: mobo 100MHz controller

But if I disable the 33MHz mobo connector in the BIOS:

/dev/hda: mobo 100MHz controller
/dev/hdc: mobo 100MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller

Since some types of disk failures will lock up the kernel during boot, you spend a lot of time plugging and unplugging disks and rebooting. It gets confusing after a while.
To add to the confusion: If you RAID-1 mirror the root partition, but then mount it as /dev/hdk (and not /dev/md0) during a rescue operation (because rescue disks often don't have RAID on them), and then edit /etc/fstab to try to match the new config... you end up with two root partitions (each half of the mirror pair) with different /etc/fstab's, on different cables/controllers, ... after fiddling with that, it's a nightmare to figure out which is which, what's mounted how and where. I also had a similar experience on a machine with 30-odd scsi disks. We installed a new kernel on /dev/sdp, rebooted ... and what confusion, since /dev/sdp was now /dev/sdg. Worse, since the bootloader was unaware of this difference ... So then we made a change, and rebooted... ... again, much confusion till we worked it out. As a cure-all, I started fiddling with ext2 disk labels. However, an ext2 disk label, when written on a RAID-1 device, will identify *three* things: /dev/md0 (for example), and both halves of the mirror: /dev/hda1 and /dev/hde1 all have the same label. reiserfs doesn't have ext2 labels.... I wanted to fool with other partition table schemes (non-DOS partition tables) until I realized most PC rescue disks wouldn't be able to get you out of that jam. So then I thought about using LVM to label my disks (i.e. use LVM for only one reason, and no other reason: to be able to assign logical names, and not physical names). But think about it... it's scary, LVM is not widespread, somewhat buggy, and standard debian/redhat/suse rescue diskettes won't have it ... etc. > Assembling the root device should be handled the same way. We tell > the boot loader/kernel where to expect it, but can over-ride that if > we need to: > md=0,/dev/hdc1,/dev/hde4 > > All other md arrays can, and so should, be assembled by code running > out of the root filesystem. *If* you mounted the "right" root filesystem.
If you have multiple copies around, and edit fstab on one but not the others, and then recable and reboot ... > This could be some program that > assembles anything it finds after scanning all devices, or something > a bit more focused, but it should be controllable by the sysadmin. At least once I managed to mount /usr as /var because of the confusion, and upon reboot the init.d scripts spewed /var/lock crud into my poor /usr filesystem ... I want to know that /dev/mdwhatever is /var *before* I mount it, not after. > It is true that in-kernel auto-detect can be controlled by fiddling > with partition types, but the problem is that it runs *before* the > root filesystem is mounted and so could conceivably confuse the > assembly of the root device (if e.g. you plugged in some other device > that also claimed to be part of /dev/md0, and it got scanned before > your real root device). Well, yes, that too, I suppose. But as you see, explicitly specifying it at the boot prompt works only if you type in the right thing ... --linas cc.ing ppc64 because although not an architecture issue, it is a sysadmin issue on enterprise-class machines. cc.ing hot-plug because this is a cold-plug issue. Note you get similar crud for multiple ethernet cards ... ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Thu Feb 5 11:47:39 2004 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 04 Feb 2004 18:47:39 -0600 Subject: [PATCH] RPA PCI Hotplug - Remove Adapter Config at DLPAR add time Message-ID: <1075942059.3026.96.camel@verve> Patch below fixes: https://bugzilla.linux.ibm.com/show_bug.cgi?id=6136 Details in bug if you're interested, comments welcome. 
Thanks- John

diff -Nru a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c
--- a/drivers/pci/hotplug/rpaphp_core.c	Wed Feb 4 18:42:00 2004
+++ b/drivers/pci/hotplug/rpaphp_core.c	Wed Feb 4 18:42:00 2004
@@ -814,29 +814,16 @@
 			}
 			slot->dev = rpaphp_find_adapter_pdev(slot);
-
-			if (!slot->dev && slot_name) {
-				/* adapter being added doesn't have pci_dev yet */
-				slot->dev = rpaphp_config_adapter(slot);
-				if (!slot->dev) {
-					err("%s: add new adapter device for slot[%s] failed\n",
-						__FUNCTION__, slot->name);
-					kfree(slot->hotplug_slot->info);
-					kfree(slot->hotplug_slot->name);
-					kfree(slot->hotplug_slot);
-					kfree(slot);
-					pci_dev_put(slot->bridge);
-					continue;
-
-				}
-			}
-
 			if(slot->dev) {
 				slot->state = CONFIGURED;
 				pci_dev_get(slot->dev);
 			}
-			else
+			else {
+				/* DLPAR add as opposed to
+				 * boot time */
 				slot->state = NOT_CONFIGURED;
+			}
+
 		}
 		dbg("%s registering slot:path[%s] index[%x], name[%s] pdomain[%x] type[%d]\n",
 			__FUNCTION__, dn->full_name, slot->index, slot->name,

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Feb 5 12:25:11 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 5 Feb 2004 12:25:11 +1100 Subject: lmb_phys_mem_size() vs. lmb_end_of_DRAM() In-Reply-To: <1075930308.27981.863.camel@nighthawk> References: <1075930308.27981.863.camel@nighthawk> Message-ID: <20040205012511.GM19011@krispykreme> > It looks to me like lmb_phys_mem_size() will return the largest valid > physical address on the system while lmb_end_of_DRAM() is used just for > usable RAM. > > lmb_phys_mem_size() determines how big the kernel's ZONE_DMA is and > lmb_end_of_DRAM() is used to figure out everything else like bootmem > sizes and the zone start/end_pfn values. > > Can we just abolish the use of lmb_end_of_DRAM(), and bootmem reserve > the I/O areas? We'll probably have some unused struct pages, but that > should be just about the only side-effect, and it keeps from having the > confusing distinction.
That's what we do on x86 at least. On POWER4 and newer the IO hole sits above memory, so __pa(lmb_end_of_DRAM()) == lmb_phys_mem_size(). The only place we care about this distinction is POWER3. It is similar to x86, 270 has 3G MEM - 1 G IO - 15G MEM. The nighthawk is worse, it has 1G MEM - 3G IO - 63G MEM. x86 doesnt have struct pages backing IO regions does it? On ppc64 we need to create a mem_map array big enough to hit the top of RAM, so it seems to me that we have to use lmb_end_of_DRAM(). So we currently do have struct pages for the IO region, we should probably reserve them. sparc64 passes in the size of the hole to free_area_init_node, I guess we should be doing this too. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From bugzilla at watkins-home.com Thu Feb 5 12:31:01 2004 From: bugzilla at watkins-home.com (Guy) Date: Wed, 4 Feb 2004 20:31:01 -0500 Subject: Root Drive Mirroring and LVM. In-Reply-To: <20040204183722.R27780@forte.austin.ibm.com> Message-ID: <200402050129.i151T2i30440@dns1.watkins-home.com> I upgraded my firmware on a 2940U2W. That changed the order my SCSI buses were scanned. This changed the boot order of my disks. I had to disable the bios on the 2940U2W so it would not attempt to boot from the disks on that bus. My MB has 2 SCSI buses and I have 2 SCSI cards. So, anything that could have prevented this would be good. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Feb 6 00:36:50 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 6 Feb 2004 00:36:50 +1100 Subject: [PATCH] kill lmb_add_io In-Reply-To: <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> References: <1075856864.1449.205.camel@nighthawk> <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> Message-ID: <20040205133650.GV19011@krispykreme> Hi, > On the subject of lmb's ... I've got a pSeries 615. On this machine > I only see one lmb of 4GB in size. This is what is dug out of the > prom by 'prom_initialize_lmb' before any coalescing. I was somehow > expecting more lmb's of a smaller size.
Of course, I don't know the > hardware specifics of this machine or how many DIMMs of what size it > contains. > > Dave, what does the lmb layout on your 'bigger' machine look like? You only start caring about it when we talk partitioning and hot swap memory. The reality is your 615s memory is most probably contiguous. I seem to remember firmware were doing a 32bit backwards compatible thing and splitting memory into at least two blocks, one of 4GB or less. Not sure if they do that any more. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Feb 6 00:37:18 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 6 Feb 2004 00:37:18 +1100 Subject: [PATCH] kill lmb_add_io In-Reply-To: <20040204094059.GF19011@krispykreme> References: <1075856864.1449.205.camel@nighthawk> <20040204041820.GF22694@krispykreme> <20040204094059.GF19011@krispykreme> Message-ID: <20040205133718.GW19011@krispykreme> > Boots on a p630. Will get some help testing it on iSeries tomorrow > (hi Stephen :) and will push it if there are no complaints. Boots on iSeries (thanks to Stephen for testing) and its merged into ameslab. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Feb 6 03:04:31 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 6 Feb 2004 03:04:31 +1100 Subject: EEH as module [was Re: LPARCFG] In-Reply-To: <20040204120252.J27780@forte.austin.ibm.com> References: <20040201050227.GB22694@krispykreme> <20040204120252.J27780@forte.austin.ibm.com> Message-ID: <20040205160431.GY19011@krispykreme> > Since I'm planning to be whacking EEH in a big way, does anyone feel > that it should be turned into a module? Hmm I hadnt thought about that. Its fairly core and could get messy trying to modularise it. I am looking forwards to EEH being whacked around however :) Anton ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Fri Feb 6 03:07:10 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 6 Feb 2004 03:07:10 +1100 Subject: LPARCFG In-Reply-To: <401D591E.9030503@austin.ibm.com> References: <20040201050227.GB22694@krispykreme> <401D591E.9030503@austin.ibm.com> Message-ID: <20040205160710.GZ19011@krispykreme> > Building as a module was broken when the code was checked in to ameslab > 2.6; I suggested turning it off. I think lparcfg uses unexported > symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out. Should any of > those be exported? > > See this thread for the history: > http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html Ahh yeah, I have such a short memory. Actually I enabled it as a module and built all modules and didnt get any warnings. Either we have everything exported now, Im not getting undefined symbol warnings any more or else Im going blind. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From kravetz at us.ibm.com Fri Feb 6 04:15:22 2004 From: kravetz at us.ibm.com (Mike Kravetz) Date: Thu, 5 Feb 2004 09:15:22 -0800 Subject: [PATCH] kill lmb_add_io In-Reply-To: <20040205133650.GV19011@krispykreme> References: <1075856864.1449.205.camel@nighthawk> <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> <20040205133650.GV19011@krispykreme> Message-ID: <20040205171522.GB4983@w-mikek2.beaverton.ibm.com> On Fri, Feb 06, 2004 at 12:36:50AM +1100, Anton Blanchard wrote: > > > > Dave, what does the lmb layout on your 'bigger' machine look like? > > You only start caring about it when we talk partitioning and hot swap > memory. Exactly, and this is an area I am interested in pursuing. Hence, the interest in Dave's machine. I haven't gone through enough pSeries architecture documentation yet to know what to expect. One would think that a smaller lmb size would be better for hot swap memory. At least in the 'remove' side of the equation. 
> The reality is your 615s memory is most probably contiguous. Yup. I think I'll try to put together a 'hack' to make the memory layout on my 615 look like one of the bigger machines with dynamic LPAR capabilities. In other words, multiple (smaller) possibly non-contiguous lmb's. -- Mike ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Fri Feb 6 04:46:51 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Thu, 05 Feb 2004 11:46:51 -0600 Subject: [2.4] [PATCH] hash_page rework, take 2 In-Reply-To: <402152C8.7080907@redhat.com> References: <402152C8.7080907@redhat.com> Message-ID: <4022818B.90902@austin.ibm.com> Julie DeWandel wrote: > Hi Olof, > > Thank you for the explanations. In most cases, I agree but I still have > one or two things I wanted to follow up on. I have added my comments to > yours below. See below. > olof at austin.ibm.com wrote: >> JSD: Better question for (1) is why are interrupts being disabled here? >> JSD: Can this routine be called from interrupt context? >> >> Without disabling interrupts, there's a risk for deadlocks if the >> processor gets interrupted and the interrupt handler causes a page fault >> that needs to be resolved Since the lock is held for writing, the handler >> will wait forever when locking for reading. This is actually similar to >> the original deadlock that this whole patch is meant to remove, but the >> window is really small (just a few instructions) now. Likewise an >> interrupt on a different processor is not a problem since forward >> progress >> is still guaranteed on the processor holding it for writing so the reader >> will eventually get the lock. >> > So it is true that an interrupt handler can cause a page fault? Can you > provide me with an example? Ben explained this pretty well yesterday. What we saw was mostly in drivers loaded as modules when they happened to interrupt a processor currently executing in vmalloc(). 
Stacks could look like: c0000000c961f170 c00000000049bc30 __ex_table ?kernel? 0x3c30 c0000000c961f200 c00000000000b544 .do_hash_page_DSI ?kernel? 0x10 c0000000c961f4f0 c0000000c961f580 xfrm_policy_list_Rsmp_cdacf85e ?? 0xc8d6dd28 c0000000c961f580 d000000000055f64 .tux_data_ready ?tux? 0x78 c0000000c961f620 c000000000294680 .tcp_data_queue ?kernel? 0x7d0 c0000000c961f6e0 c000000000295b10 .tcp_rcv_established ?kernel? 0x2b8 c0000000c961f7a0 c0000000002a1288 .tcp_v4_do_rcv ?kernel? 0x1b0 c0000000c961f830 c0000000002a1a48 .tcp_v4_rcv ?kernel? 0x7ac c0000000c961f8d0 c00000000027b034 .ip_local_deliver_finish ?kernel? 0x14c c0000000c961f960 c00000000027aba4 .ip_local_deliver ?kernel? 0x60 c0000000c961f9e0 c00000000027b4cc .ip_rcv_finish ?kernel? 0x34c c0000000c961fa80 c00000000027ae04 .ip_rcv ?kernel? 0x20c c0000000c961fb20 c00000000025a26c .netif_receive_skb ?kernel? 0x240 c0000000c961fbc0 c00000000017bb7c .e1000_clean_rx_irq ?kernel? 0x394 c0000000c961fcd0 c00000000017b378 .e1000_clean ?kernel? 0x7c c0000000c961fd90 c00000000025a7a0 .net_rx_action ?kernel? 0x15c c0000000c961fe50 c000000000066c40 .do_softirq ?kernel? 0x1b4 c0000000c961ff00 c000000000011fec .do_IRQ ?kernel? 0x164 c0000000c961ff90 c00000000000ae20 HardwareInterrupt_entry ?kernel? 0x38 c00000055c3f6a80 c000000000179324 .e1000_xmit_frame ?kernel? 0x1e8 c00000055c3f6d70 c00000000000aaa8 DataAccessSLB_common ?kernel? 0x108 c00000055c3f6e30 c000000000090d2c .__vmalloc ?kernel? 0x1a4 c00000055c3f6f00 c0000000000d284c .alloc_fd_array ?kernel? 0x4c c00000055c3f6f80 c0000000000d297c .expand_fd_array ?kernel? 0x9c c00000055c3f7030 c00000000005a6f0 .copy_files ?kernel? 0x214 c00000055c3f70f0 c00000000005adbc .copy_process ?kernel? 0x440 c00000055c3f71c0 c00000000005b838 .do_fork ?kernel? 0x40 c00000055c3f7280 c000000000014100 .sys_clone ?kernel? 0x9c c00000055c3f7320 c000000000029944 .sys32_clone ?kernel? 
0x28 c00000055c3f7390 c00000000000fe48 ret_from_syscall_1 exception: c00 (System Call) regs c00000055c3f7400 c000000000017020 .arch_kernel_thread ?kernel? 0x24 c00000055c3f76f0 c0000005599b9000 xfrm_policy_list_Rsmp_cdacf85e ?? 0x591077a8 c00000055c3f7780 d000000000073610 .start_external_cgi ?tux? 0x2c c00000055c3f7800 d00000000007368c .query_extcgi ?tux? 0x18 c00000055c3f7880 d0000000000602e0 .http_process_message ?tux? 0x2e4 c00000055c3f7910 d0000000000571fc .tux_schedule_atom ?tux? 0x40 c00000055c3f7990 d0000000000587d8 .process_requests ?tux? 0x14c c00000055c3f7a30 d0000000000672d8 .event_loop ?tux? 0x1a0 c00000055c3f7ad0 d00000000006977c .__sys_tux ?tux? 0x3c8 c00000055c3f7bc0 c00000000024de0c .sys_tux ?kernel? 0x17c c00000055c3f7c60 c00000000000fe48 ret_from_syscall_1 >> This is all an ad-hoc solution since there's no RCU in 2.4, so I needed >> another light-weight syncronization method. >> > Let me see if I understand this. When someone wants to free a page > pointed to by an entry in a 3rd level page table, they clear out the pte > in the page table using pte_clear(). Then they call pte_free with the > address of the page they are freeing up (not really a page table entry > but the actual page address). This page address is added to the batch > list. Later, the idle loop or process termination code calls > do_check_pgt_cache which will free all the pages in the batch list. > > However, prior to freeing the pages, pte_free_batch() will call > pte_free_sync(). pte_free_sync tries to ensure that the pte_hash_locks > aren't held on any processor. If a processor is holding it, it could be > the case that the processor is currently walking the page tables and > might have loaded, for example, the address of a pmd that we have since > cleared. Since it is still using the data in that page, we don't want to > free it until they are done with it. If we wait until they drop the > lock, they are done and we also know any new reference will see the > cleared value. 
> > Is this correct? Yes. >> void >> local_flush_tlb_mm(struct mm_struct *mm) >> { >> - spin_lock(&mm->page_table_lock); >> + unsigned long flags; >> + >> + spin_lock_irqsave(&mm->page_table_lock, flags); >> >> if ( mm->map_count ) { >> struct vm_area_struct *mp; >> > JSD: I believe you said the _irqsave wasn't needed here so this hunk and > the next JSD: one can be removed. Grmbl. I'll make sure it's gone before I push a fix. >> +/* Use the PTE functions for freeing PMD as well, since the same >> + * problem with tree traversals apply. Since pmd pointers are always >> + * virtual, no need for a page_address() translation. >> + */ >> + +#define pte_free(pte_page) __pte_free(pte_page) >> > JSD: Your original patch defined pte_free to be > __pte_free(page_address(pte_page)) > JSD: Is the page_address() wrapper no longer necessary? This is a difference between the RedHat and ames/mainline trees due to quicklists. In mainline/ames, pte's are just put on free lists, so the translation isn't needed there. You still need it in your tree. -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Fri Feb 6 06:01:26 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Thu, 05 Feb 2004 13:01:26 -0600 Subject: [PATCH][2.6] rtas error-inject support In-Reply-To: <16416.18469.3956.192501@cargo.ozlabs.ibm.com> References: <1075480843.682.188.camel@magik> <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com> <1075488969.682.192.camel@magik> <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com> <1075501743.681.214.camel@magik> <16416.18469.3956.192501@cargo.ozlabs.ibm.com> Message-ID: <1076007685.1023.1209.camel@magik> > In general, since this is a patch for 2.6 (according to the subject > line :), it would be better to do all this in userspace using the rtas > syscall that was added recently. But here are comments on the patch > anyway: I mentioned this to you on IRC, but the main reason I was trying to keep the same interface was because this functionality is for test teams. The test teams already have tools and test suites that use this interface in 2.4. I would hate to break them when they will probably need this interface in the very near future. > > +config RTAS_ERRINJCT > > + bool "RTAS Errinject" > > How about bool "RTAS Error injection facility" ? Agreed. > > +static unsigned int open_token = 0; > > No need to initialize things to 0, C does that by default. Done. Normally I try following the style of the particular file. If this is bothering you, you probably want to clean up a couple of other variables in the file. > > @@ -207,7 +221,8 @@ > > void proc_rtas_init(void) > > { > > struct proc_dir_entry *entry; > > - > > + int errinjct_token; > > + > > I can't see any difference between the blank line that is deleted and > the one that is inserted (not even any whitespace). I wonder why diff > put that in? Or has your mailer deleted trailing whitespace? Not sure either. > > That's a user pointer that you're using strnlen on. Ouch. Use > strnlen_user or copy_from_user.
In fact, do we need to check the > string length at all? Would it matter if there was a null in the > buffer, and the value taken as a string was shorter than we thought? strnlen taken out. I don't remember what the thought was for having this in there. It was over a year ago when this was originally written. > > Another access to the user buffer without using *user functions. > Ouch. Fixed. > > This worries me. We copy the workspace (the contents of which are > undefined since we just kmalloc'd it) to rtas_data_buf, but we never > copy rtas_data_buf back to the workspace after the rtas call. So how > can the contents of workspace ever be anything but undefined? I'm not sure I'm following you here. The workspace is passed in (from ppc_rtas_errinjct_write). The workspace information is just for RTAS to know the specifics of the error that is being injected (or NULL, depending on whether that particular injection uses the workspace). > Why can't we just store a pointer to the token name within the OF > property value? Why do we have to make a copy of it? IIRC the original thought was to keep the string parsing to one place and only do it once. > In fact, why do we need to parse the list here at all? We use the > list for two things: matching the token name in the write function, > and listing the tokens in the read function. In both cases we could > just run through the ibm,errinjct-tokens (what do they have against > vwls?) property value almost as easily as ei_token_list. Changed it to do this parsing. > > +/* Error inject defines */ > > +#define ERRINJCT_TOKEN_LEN 24 /* Max length of an error inject token */ > > +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. */ > > +#define WORKSPACE_SIZE 1024 > > I worry about these too. 15 tokens sounds like future machines are > going to exceed this limit quite easily. Also, is the > workspace size limit there something that applies to all RTAS > functions, or just to the error injection functions?
If the latter, > you should choose a name that indicates that. This was taken directly out of the RPA. If the RPA changes in the future, it's just a matter of changing a #define. A workspace is used on other rtas calls, but none of them we have implemented (ibm,get-indices, ibm,get-vpd, etc...). Until that day, I changed this to ERRINJCT_WORKSPACE_SIZE. > Overall, this looks to me like something that could be done just as > well or better in userspace. Doing it in userspace would make it > easy to avoid having an arbitrary limit on the number of tokens, for > instance. Userspace could just read > /proc/device-tree/rtas/ibm,errinjct-tokens to get the list and match > against that directly. This could probably done just as well in user land. The problem is that some of these test teams need to start testing very soon. Their resources are thin, I doubt they'll have time to rewrite interfaces. Why don't we use this interface for now and depricate it and tell the test teams that the interface will change in the 2.7 time frame. Jake -------------- next part -------------- # This is a BitKeeper generated patch for the following project: # Project Name: Linux kernel tree # This patch format is intended for GNU patch command version 2.5 or higher. 
# This patch includes the following deltas: # ChangeSet 1.1415 -> 1.1416 # arch/ppc64/kernel/rtas.c 1.22 -> 1.23 # arch/ppc64/defconfig 1.42 -> 1.43 # arch/ppc64/kernel/rtas-proc.c 1.13 -> 1.14 # arch/ppc64/Kconfig 1.39 -> 1.40 # include/asm-ppc64/rtas.h 1.17 -> 1.18 # # The following is the BitKeeper ChangeSet Log # -------------------------------------------- # 04/02/05 moilanen at zippy.ltc.austin.ibm.com 1.1416 # RTAS Error Inject support # -------------------------------------------- # diff -Nru a/arch/ppc64/Kconfig b/arch/ppc64/Kconfig --- a/arch/ppc64/Kconfig Thu Feb 5 12:47:44 2004 +++ b/arch/ppc64/Kconfig Thu Feb 5 12:47:44 2004 @@ -165,6 +165,14 @@ Provide system capacity information via human readable = pairs through a /proc/ppc64/lparcfg interface. +config RTAS_ERRINJCT + bool "RTAS Error inject facility" + depends on PPC_RTAS + help + Provide ability to inject errors into hardware for the purpose + of testing hardware error code path. Do not use on production + machine. + endmenu diff -Nru a/arch/ppc64/defconfig b/arch/ppc64/defconfig --- a/arch/ppc64/defconfig Thu Feb 5 12:47:44 2004 +++ b/arch/ppc64/defconfig Thu Feb 5 12:47:44 2004 @@ -65,6 +65,7 @@ CONFIG_RTAS_FLASH=m CONFIG_SCANLOG=m CONFIG_LPARCFG=y +# CONFIG_RTAS_ERRINJCT is not set # # General setup diff -Nru a/arch/ppc64/kernel/rtas-proc.c b/arch/ppc64/kernel/rtas-proc.c --- a/arch/ppc64/kernel/rtas-proc.c Thu Feb 5 12:47:44 2004 +++ b/arch/ppc64/kernel/rtas-proc.c Thu Feb 5 12:47:44 2004 @@ -126,6 +126,9 @@ static unsigned long rtas_tone_frequency = 1000; static unsigned long rtas_tone_volume = 0; +#ifdef CONFIG_RTAS_ERRINJCT +static unsigned int open_token; +#endif /* ****************STRUCTS******************************************* */ struct individual_sensor { @@ -165,6 +168,14 @@ size_t count, loff_t *ppos); static ssize_t ppc_rtas_rmo_buf_read(struct file *file, char *buf, size_t count, loff_t *ppos); +#ifdef CONFIG_RTAS_ERRINJCT +static int ppc_rtas_errinjct_open(struct inode *inode, 
struct file *file); +static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file); +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf, + size_t count, loff_t *ppos); +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf, + size_t count, loff_t *ppos); +#endif struct file_operations ppc_rtas_poweron_operations = { .read = ppc_rtas_poweron_read, @@ -189,6 +200,15 @@ .write = ppc_rtas_tone_volume_write }; +#ifdef CONFIG_RTAS_ERRINJCT +struct file_operations ppc_rtas_errinjct_operations = { + .open = ppc_rtas_errinjct_open, + .read = ppc_rtas_errinjct_read, + .write = ppc_rtas_errinjct_write, + .release = ppc_rtas_errinjct_release +}; +#endif + static struct file_operations ppc_rtas_rmo_buf_ops = { .read = ppc_rtas_rmo_buf_read, }; @@ -244,6 +264,13 @@ entry = create_proc_entry("rmo_buffer", S_IRUSR, proc_ppc64.rtas); if (entry) entry->proc_fops = &ppc_rtas_rmo_buf_ops; + +#ifdef CONFIG_RTAS_ERRINJCT + if (rtas_token("ibm,errinjct") != RTAS_UNKNOWN_SERVICE) { + entry = create_proc_entry("errinjct",S_IWUSR|S_IRUGO, proc_ppc64.rtas); + if (entry) entry->proc_fops = &ppc_rtas_errinjct_operations; + } +#endif } /* ****************************************************************** */ @@ -932,6 +959,157 @@ *ppos += n; return n; } + +#ifdef CONFIG_RTAS_ERRINJCT +/* ****************************************************************** */ +/* ERRINJCT */ +/* ****************************************************************** */ +static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file) +{ + int rc; + + /* We will only allow one process to use error inject at a + time. 
Since errinjct is usually only used for testing, + this shouldn't be an issue */ + if (open_token) { + return -EAGAIN; + } + rc = rtas_errinjct_open(); + if (rc < 0) { + return -EIO; + } + open_token = rc; + + return 0; +} + +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf, + size_t count, loff_t *ppos) +{ + + char * ei_token; + char * workspace = NULL; + size_t max_len; + int token_len; + int rc; + + /* Verify the errinjct token length */ + if (count < ERRINJCT_TOKEN_LEN) { + max_len = count; + } else { + max_len = ERRINJCT_TOKEN_LEN; + } + + ei_token = (char *)kmalloc(max_len, GFP_KERNEL); + if (!ei_token) { + printk(KERN_WARNING "error: kmalloc failed\n"); + return -ENOMEM; + } + + token_len = strncpy_from_user(ei_token, buf, max_len); + if (token_len <= 0) { + kfree(ei_token); + return -EFAULT; + } + token_len++; + + if (count > token_len + ERRINJCT_WORKSPACE_SIZE) { + count = token_len + ERRINJCT_WORKSPACE_SIZE; + } + + buf += token_len; + + /* check if there is a workspace */ + if (count > token_len) { + /* Verify the workspace size */ + if ((count - token_len) > ERRINJCT_WORKSPACE_SIZE) { + max_len = ERRINJCT_WORKSPACE_SIZE; + } else { + max_len = count - token_len; + } + + workspace = (char *)kmalloc(max_len, GFP_KERNEL); + if (!workspace) { + printk(KERN_WARNING "error: failed kmalloc\n"); + kfree(ei_token); + return -ENOMEM; + } + if (copy_from_user(workspace, buf, max_len)) { + kfree(ei_token); + kfree(workspace); + return -EFAULT; + } + } + + rc = rtas_errinjct(open_token, ei_token, workspace, max_len); + + if (count > token_len) { + kfree(workspace); + } + kfree(ei_token); + + return rc < 0 ? 
rc : count; +} + +static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file) +{ + int rc; + + rc = rtas_errinjct_close(open_token); + if (rc) { + return rc; + } + open_token = 0; + return 0; +} + +static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf, + size_t count, loff_t *ppos) +{ + char * buffer; + char * token_array; + char * token_array_end; + int array_len; + int n = 0; + + buffer = (char *)kmalloc(MAX_ERRINJCT_TOKENS * (ERRINJCT_TOKEN_LEN+1), + GFP_KERNEL); + if (!buffer) { + printk(KERN_ERR "error: kmalloc failed\n"); + return -ENOMEM; + } + + token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", &array_len); + token_array_end = token_array + array_len; + + while (token_array < token_array_end) { + n += sprintf(buffer+n, token_array); + token_array += strlen(token_array) + 1; + n += sprintf(buffer+n, " %d\n", *(int *)token_array); + token_array += sizeof(int); + } + + if (*ppos >= strlen(buffer)) { + kfree(buffer); + return 0; + } + if (n > strlen(buffer) - *ppos) + n = strlen(buffer) - *ppos; + + if (n > count) + n = count; + + if (copy_to_user(buf, buffer + *ppos, n)) { + kfree(buffer); + return -EFAULT; + } + + *ppos += n; + + kfree(buffer); + return n; +} +#endif /* CONFIG_RTAS_ERRINJCT */ #define RMO_READ_BUF_MAX 30 diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Thu Feb 5 12:47:44 2004 +++ b/arch/ppc64/kernel/rtas.c Thu Feb 5 12:47:44 2004 @@ -225,6 +225,10 @@ int order = status - 9900; unsigned long ms; + if (status < RTAS_EXTENDED_DELAY_MIN || + status > RTAS_EXTENDED_DELAY_MAX) + return 0; + if (order < 0) order = 0; /* RTC depends on this for -2 clock busy */ else if (order > 5) @@ -463,6 +467,127 @@ return 0; } +#ifdef CONFIG_RTAS_ERRINJCT +int +rtas_errinjct_open(void) +{ + u32 ret[2]; + int open_token; + int rc; + unsigned int time; + + + while (1) { + /* + * The rc and open_token values are backwards due to a + * misprint in the RPA. 
+ */ + open_token = rtas_call(rtas_token("ibm,open-errinjct"), 0, 2, (void *) &ret); + rc = ret[0]; + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + udelay(time * 1000); + continue; + } + + if (rc < 0) { + printk(KERN_WARNING "error: ibm,open-errinjct failed (%d)\n", rc); + return rc; + } + + return open_token; + } +} + +int +rtas_errinjct(unsigned int open_token, char * ei_token, char * workspace, size_t workspace_size) +{ + char * token_array; + char * token_array_end; + int array_len; + int rtas_ei_token = -1; + unsigned int time; + int rc = 0; + + token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", &array_len); + token_array_end = token_array + array_len; + + while (token_array <= token_array_end) { + if (strcmp(token_array, ei_token) == 0) { + rtas_ei_token = *(int *)(token_array + strlen(token_array) + 1); + break; + } + token_array += strlen(token_array) + 1; + token_array += sizeof(int); + } + + if (rtas_ei_token == -1) { + return -EINVAL; + } + + spin_lock(&rtas_data_buf_lock); + + while (1) { + if (rc != RTAS_BUSY && workspace) { + memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE); + memcpy(rtas_data_buf, workspace, workspace_size); + } + + rc = rtas_call(rtas_token("ibm,errinjct"), 3, 1, NULL, + rtas_ei_token, open_token, __pa(rtas_data_buf)); + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + spin_unlock(&rtas_data_buf_lock); + udelay(time * 1000); + spin_lock(&rtas_data_buf_lock); + continue; + } + + if (rc != 0) { + printk(KERN_WARNING "error: ibm,errinjct failed (%d)\n", rc); + } + + spin_unlock(&rtas_data_buf_lock); + + return rc; + } +} + +int +rtas_errinjct_close(unsigned int open_token) +{ + int rc; + unsigned int time; + + while (1) { + rc = rtas_call(rtas_token("ibm,close-errinjct"), 1, 1, NULL, open_token); + + if (rc == RTAS_BUSY) { + continue; + } + + if ((time = rtas_extended_busy_delay_time(rc))) { + udelay(time * 1000); + 
continue; + } + + if (rc != 0) { + printk(KERN_WARNING "error: ibm,close-errinjct failed (%d)\n", rc); + } + + return rc; + } +} + +#endif EXPORT_SYMBOL(rtas_firmware_flash_list); EXPORT_SYMBOL(rtas_token); diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h --- a/include/asm-ppc64/rtas.h Thu Feb 5 12:47:44 2004 +++ b/include/asm-ppc64/rtas.h Thu Feb 5 12:47:44 2004 @@ -22,6 +22,11 @@ /* Buffer size for ppc_rtas system call. */ #define RTAS_RMOBUF_MAX (64 * 1024) +/* Error inject defines */ +#define ERRINJCT_TOKEN_LEN 24 /* Max length of an error inject token */ +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. */ +#define ERRINJCT_WORKSPACE_SIZE 1024 + /* RTAS return codes */ #define RTAS_BUSY -2 /* RTAS Return Status - Busy */ #define RTAS_EXTENDED_DELAY_MIN 9900 @@ -178,6 +183,9 @@ extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); extern int rtas_set_indicator(int indicator, int index, int new_value); +extern int rtas_errinjct_open(void); +extern int rtas_errinjct(unsigned int, char *, char *, size_t); +extern int rtas_errinjct_close(unsigned int); /* Given an RTAS status code of 9900..9905 compute the hinted delay */ unsigned int rtas_extended_busy_delay_time(int status); From linas at austin.ibm.com Fri Feb 6 07:26:43 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Thu, 5 Feb 2004 14:26:43 -0600 Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs In-Reply-To: <20040204142853.A28220@forte.austin.ibm.com>; from linas@austin.ibm.com on Wed, Feb 04, 2004 at 02:28:53PM -0600 References: <20040203183459.B27780@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com> Message-ID: <20040205142643.T27780@forte.austin.ibm.com> OK, Fifth time's a charm ... base64 encoding the patch helps prevent the mail gateways from mangling it, but then its too big for the mailing list manager. 
You can get the patch from
http://www-124.ibm.com/linux/patches/?patch_id=1344

To repeat the original note:

Patch for multiple EEH-related bugs. Please review this patch,
& if appropriate, please apply. It should apply cleanly to
the current ameslab tree (Feb 03 2004 2.6.2-rc3).

This patch fixes multiple EEH-related bugs:
-- Fixes the eeh_check_failure() usage in an interrupt context.
   This routine is now safe to use in an interrupt. The fix was
   to build a cache of IO addresses and check that, instead of
   using the pci routines.
-- Merges in Olof Johansson's sizeof patch when checking for failure
-- Adds EEH tests to array/string reads
-- Fixes bugs with address resolution (some i/o addresses were
   handled incorrectly, resulting in EEH errors slipping by
   undetected.)
-- Adds EEH support to the PCI Hotplug system (so that devices
   that get added/removed get properly registered with the EEH
   subsystem.)
-- Fixes improper use of /proc filesystem.
-- Adds some misc statistics.

Please note that the EEH subsystem will be undergoing a major revision
in the not-too-distant future; this patch is a 'stopgap' to address
the immediate concerns/issues until that time.

--linas

On Wed, Feb 04, 2004 at 02:28:53PM -0600, linas at austin.ibm.com wrote:
> On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
> > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
> >
> > > Patch for multiple EEH-related bugs. Please review this patch,
> > > & if appropriate, please apply. It should apply cleanly to
> > > the current ameslab tree (Feb 03 2004 2.6.2-rc3).
> >
> > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
> > made the diff against a current ameslab tree?
>
> Right tree, bad email attachment.
>
> I don't know how it happened, but what I sent out had some trailing
> whitespace whacked. The attached patch should not have this problem.

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Fri Feb 6 09:10:33 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Thu, 05 Feb 2004 16:10:33 -0600
Subject: [PATCH][2.6] Nested Interrupt support
In-Reply-To: <16400.37866.867318.95501@cargo.ozlabs.ibm.com>
References: <1074094346.2389.42.camel@magik>
  <20040119042022.GA20834@krispykreme>
  <1074781690.23288.571.camel@magik>
  <16400.37866.867318.95501@cargo.ozlabs.ibm.com>
Message-ID: <1076019033.1023.1266.camel@magik>

> > Here's the patch using per cpu data for the irq stack.
>
> Can't we find somewhere on the kernel stack to stash this? Could we
> use regs->softe maybe?

Do you have any better idea how you would want this? I still think
Anton's idea of storing this in the per-cpu data makes more sense since
it's the priority stack of each CPU.

Jake

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From johnrose at austin.ibm.com Fri Feb 6 11:20:47 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 05 Feb 2004 18:20:47 -0600
Subject: PCI Probe Question
Message-ID: <1076026847.4798.43.camel@verve>

My question involves pci_read_bases(), which reads resource info from
the config space of an adapter. I'm using an e100.

At boot, and at hotplug-enable time, pci_read_bases() reads 0xfc00 for
the base of the adapter's only "I/O" resource region.

After DLPAR removing and adding back the parent bus of the card,
pci_read_bases() reads a different value (0x1000) for the base of the
same region. It reads the same size as at boot, and all the other
regions are identical to their boot-time bases and sizes.

How can the base address of the only I/O region change like this? I
thought that was a static property of an adapter. Despite this
difference, the card still works.

Thoughts?
John

BTW - the addr and buid values are identical to those at boot, so the
rtas read pci config call is passing identical inputs in both cases.

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From willschm at us.ibm.com Fri Feb 6 15:24:22 2004 From: willschm at us.ibm.com (Will Schmidt) Date: Thu, 5 Feb 2004 22:24:22 -0600 Subject: LPARCFG In-Reply-To: <20040205160710.GZ19011@krispykreme> Message-ID: > > Building as a module was broken when the code was checked in to ameslab > > 2.6; I suggested turning it off. I think lparcfg uses unexported > > symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out. Should any of > > those be exported? > > > > See this thread for the history: > > http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html > > Ahh yeah, I have such a short memory. > > Actually I enabled it as a module and built all modules and didnt get > any warnings. Either we have everything exported now, Im not getting > undefined symbol warnings any more or else Im going blind. I've got some more updates for this code, will try to get a patch onto this list tomorrow. (still need to forward port to current and check on iSeries).. I couldnt build it as a module without exporting a few symbols, but my tree is about a week old, so i'm probably missing those fixes. -Will willschm at us.ibm.com Linux on PowerPC-64 Development IBM Rochester ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sat Feb 7 02:11:55 2004 From: anton at samba.org (Anton Blanchard) Date: Sat, 7 Feb 2004 02:11:55 +1100 Subject: [PATCH] kill lmb_add_io In-Reply-To: <20040205171522.GB4983@w-mikek2.beaverton.ibm.com> References: <1075856864.1449.205.camel@nighthawk> <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> <20040205133650.GV19011@krispykreme> <20040205171522.GB4983@w-mikek2.beaverton.ibm.com> Message-ID: <20040206151155.GK19011@krispykreme> > Exactly, and this is an area I am interested in pursuing. Hence, the > interest in Dave's machine. I haven't gone through enough pSeries > architecture documentation yet to know what to expect. 
One would think > that a smaller lmb size would be better for hot swap memory. At least > in the 'remove' side of the equation. A good start would be to tar up a device tree of such a machine. You can untar it and use lsprop to look around. Actually it would be worth having a repository of device trees of various machines and OF versions somewhere. We often need to know if some random version of OF on some machine has a particular property. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Sat Feb 7 03:28:49 2004 From: johnrose at austin.ibm.com (John Rose) Date: Fri, 06 Feb 2004 10:28:49 -0600 Subject: PCI Probe Question In-Reply-To: <1076026847.4798.43.camel@verve> References: <1076026847.4798.43.camel@verve> Message-ID: <1076084929.6881.2.camel@verve> > How can the base address of the only I/O region change like this? I > thought that was a static property of an adapter. Despite this > difference, the card still works. So apparently the bases of these regions can be changed by firmware and/or OS. The sizes won't change, but the bases might. So this isn't a bug :) Thanks to Mike Lyons for the pci knowledge. John ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From jimix at watson.ibm.com Sat Feb 7 06:12:44 2004 From: jimix at watson.ibm.com (Jimi Xenidis) Date: Fri, 6 Feb 2004 14:12:44 -0500 Subject: missing RELOC()s in prom.c? 
Message-ID: <16419.59180.461757.753005@kitch0.watson.ibm.com> Fellow coder (cc'd) found the following issue -JX ===== /scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c 1.46 vs edited ===== --- 1.46/arch/ppc64/kernel/prom.c Tue Dec 9 11:45:05 2003 +++ edited//scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c Fri Feb 6 13:35:47 2004 @@ -1143,9 +1143,9 @@ sizeof(option)); if (option[0] != 0) { found = 1; - if (!strcmp(option, "off")) + if (!strcmp(option, RELOC("off"))) my_smt_enabled = SMT_OFF; - else if (!strcmp(option, "on")) + else if (!strcmp(option, RELOC("on"))) my_smt_enabled = SMT_ON; else my_smt_enabled = SMT_DYNAMIC; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From haveblue at us.ibm.com Sat Feb 7 06:37:45 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: 06 Feb 2004 11:37:45 -0800 Subject: [PATCH] get CROSS32 from environment Message-ID: <1076096265.5716.340.camel@nighthawk> I'm doing some x86->ppc64 cross compiles. The top-level CROSS_COMPILE comes out of the environment just fine, but the 32-bit compiler is just set up to be empty. This patch gets it out of the environment, if present. --dave -------------- next part -------------- --- linux-2.6.1-clean/arch/ppc64/boot/Makefile 2004-01-08 22:59:56.000000000 -0800 +++ linux-2.6.1-memhotplug/arch/ppc64/boot/Makefile 2004-02-04 15:02:17.000000000 -0800 @@ -20,7 +20,7 @@ # CROSS32_COMPILE is setup as a prefix just like CROSS_COMPILE # in the toplevel makefile. -CROSS32_COMPILE = +CROSS32_COMPILE ?= #CROSS32_COMPILE = /usr/local/ppc/bin/powerpc-linux- BOOTCC := $(CROSS32_COMPILE)gcc From jschopp at austin.ibm.com Sat Feb 7 09:53:32 2004 From: jschopp at austin.ibm.com (jschopp at austin.ibm.com) Date: Fri, 6 Feb 2004 16:53:32 -0600 (CST) Subject: missing RELOC()s in prom.c? In-Reply-To: <16419.59180.461757.753005@kitch0.watson.ibm.com> Message-ID: Good catch. I think this is dead on the money. I'll push it to ameslab unless somebody beats me to it. 
-JOel On Fri, 6 Feb 2004, Jimi Xenidis wrote: > > Fellow coder (cc'd) found the following issue > -JX > > > ===== /scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c 1.46 vs edited ===== > --- 1.46/arch/ppc64/kernel/prom.c Tue Dec 9 11:45:05 2003 > +++ edited//scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c Fri Feb 6 13:35:47 2004 > @@ -1143,9 +1143,9 @@ > sizeof(option)); > if (option[0] != 0) { > found = 1; > - if (!strcmp(option, "off")) > + if (!strcmp(option, RELOC("off"))) > my_smt_enabled = SMT_OFF; > - else if (!strcmp(option, "on")) > + else if (!strcmp(option, RELOC("on"))) > my_smt_enabled = SMT_ON; > else > my_smt_enabled = SMT_DYNAMIC; > > > ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Sat Feb 7 12:42:27 2004 From: paulus at samba.org (Paul Mackerras) Date: Sat, 7 Feb 2004 12:42:27 +1100 Subject: missing RELOC()s in prom.c? In-Reply-To: <16419.59180.461757.753005@kitch0.watson.ibm.com> References: <16419.59180.461757.753005@kitch0.watson.ibm.com> Message-ID: <16420.17027.764354.90415@cargo.ozlabs.ibm.com> Jimi Xenidis writes: > Fellow coder (cc'd) found the following issue > -JX Hmmm, good catch, but it's not obvious to me why smt_setup() has to be done at prom_init time. I don't see anything in there that has to be done then - we are just looking at the command line and the device tree and setting some fields in the naca. That could be done more easily (without the RELOCs) once we have set up the MMU, surely? We use _naca->smt_state in prom_hold_cpus, it is true, but that only affects the setting of cpu_available_map and cpu_present_at_boot, and the values of those bitmaps doesn't affect any calls to OF. Does anyone have a reason why smt_setup has to be called from prom_init? Paul. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Sat Feb 7 12:57:28 2004 From: anton at samba.org (Anton Blanchard) Date: Sat, 7 Feb 2004 12:57:28 +1100 Subject: missing RELOC()s in prom.c? In-Reply-To: <16420.17027.764354.90415@cargo.ozlabs.ibm.com> References: <16419.59180.461757.753005@kitch0.watson.ibm.com> <16420.17027.764354.90415@cargo.ozlabs.ibm.com> Message-ID: <20040207015727.GO19011@krispykreme> > Does anyone have a reason why smt_setup has to be called from > prom_init? Good point. We need the paca and prom.c police, someone who will come knocking any time code gets added to either :) (for the benefit of others, the reason for using percpu data rather than the paca is that we can do node local allocation of per cpu memory one day) Speaking of things in the paca, does anyone know why we a) have an option to overclock the decr and b) have an option to overclock the decr on cpu0 separately? If its an issue with iseries servicing events, does the current HZ=1000 solve this? Anton (looking to remove another field from the paca) ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Sat Feb 7 14:20:46 2004 From: boutcher at us.ibm.com (David Boutcher) Date: Fri, 6 Feb 2004 21:20:46 -0600 Subject: missing RELOC()s in prom.c? In-Reply-To: <20040207015727.GO19011@krispykreme> Message-ID: Anton Blanchard wrote on 02/06/2004 07:57:28 PM: > Speaking of things in the paca, does anyone know why we a) have an > option to overclock the decr and b) have an option to overclock the > decr on cpu0 separately? If its an issue with iseries servicing > events, does the current HZ=1000 solve this? Cause on legacy iSeries there are no real interrupts. There are only lp events put on a queue. And we check the queue on each decrementer. And there's an option somewhere or other (search for spread_lp_events) for whether we check the queue only with cpu0 or with all cpus.. 
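For readers following along, the polling scheme described above can be
sketched roughly as follows. This is only an illustration in plain C, not
the iSeries code: the queue layout, the function names and the
spread_events flag are all hypothetical stand-ins, and locking is left out
entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 8                 /* hypothetical queue depth */

/* Hypothetical stand-in for the LP event queue. */
struct lp_queue {
    int    events[QUEUE_LEN];
    size_t head;                    /* next slot to read */
    size_t count;                   /* events currently queued */
};

/* Hypothetical stand-in for the boot option: when false, only cpu0 polls. */
static bool spread_events;

/* Queue an event; returns false if the queue is full. */
static bool lp_queue_push(struct lp_queue *q, int ev)
{
    if (q->count == QUEUE_LEN)
        return false;
    q->events[(q->head + q->count) % QUEUE_LEN] = ev;
    q->count++;
    return true;
}

/* Drain every pending event; returns how many were handled. */
static int lp_queue_drain(struct lp_queue *q)
{
    int handled = 0;

    while (q->count > 0) {
        /* A real handler would dispatch q->events[q->head] here. */
        q->head = (q->head + 1) % QUEUE_LEN;
        q->count--;
        handled++;
    }
    return handled;
}

/* Called from each CPU's decrementer tick; returns events handled. */
static int decrementer_tick(struct lp_queue *q, int cpu)
{
    assert(cpu >= 0);
    if (cpu != 0 && !spread_events)
        return 0;                   /* by default only cpu0 polls */
    return lp_queue_drain(q);
}
```

Since events are only noticed when a tick fires, the worst-case latency in
this model is one decrementer interval, which is why a faster decrementer
shortens I/O latency without changing anything else.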
By the way, at one point there was a cool option to stagger the decrementers so that even with overclocking, multiple CPUs would check the lp event queues at staggered intervals within the decrementer intervals. Overclocking the decrementer doesn't do anything for jiffies, but it does reduce the latency on I/O events and is only useful on legacy iSeries. As to whether HZ helps, It's too late at night, and Hugh made me drink too much beer to contemplate that. Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sat Feb 7 20:40:35 2004 From: anton at samba.org (Anton Blanchard) Date: Sat, 7 Feb 2004 20:40:35 +1100 Subject: missing RELOC()s in prom.c? In-Reply-To: References: <20040207015727.GO19011@krispykreme> Message-ID: <20040207094035.GR19011@krispykreme> > Cause on legacy iSeries there are no real interrupts. There are only lp > events put on a queue. And we check the queue on each decrementer. And > there's an option somewhere or other (search for spread_lp_events) for > whether we check the queue only with cpu0 or with all cpus.. > > By the way, at one point there was a cool option to stagger the > decrementers so that even with overclocking, multiple CPUs would check the > lp event queues at staggered intervals within the decrementer intervals. In fact we are currently staggering the decrementers in 2.6. It was done to avoid all cpus hitting some global locks in the timer code. The global lock has been removed and x86 has moved back to triggering all timer irqs at the same time. Considering that staggered interrupts should help iseries, we might continue to do it on ppc64. Also I notice there is a spread_events boot option, should we make this the default? > As to whether HZ helps, It's too late at night, and Hugh made me drink too > much beer to contemplate that. 
Watch out for Hugh :) My theory is if our usual overclocking of the decr in 2.4 ends up the same frequency as base 2.6 (1000Hz) then we can remove the option. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From jimix at watson.ibm.com Sun Feb 8 02:52:55 2004 From: jimix at watson.ibm.com (Jimi Xenidis) Date: Sat, 7 Feb 2004 10:52:55 -0500 Subject: missing RELOC()s in prom.c? In-Reply-To: <16420.17027.764354.90415@cargo.ozlabs.ibm.com> References: <16419.59180.461757.753005@kitch0.watson.ibm.com> <16420.17027.764354.90415@cargo.ozlabs.ibm.com> Message-ID: <16421.2519.579220.101653@kitch0.watson.ibm.com> >>>>> "PM" == Paul Mackerras writes: PM> Jimi Xenidis writes: >> Fellow coder (cc'd) found the following issue >> -JX PM> Hmmm, good catch, but it's not obvious to me why smt_setup() has to be PM> done at prom_init time. we wondered this as well ;-) -JX ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From mostrows at watson.ibm.com Sun Feb 8 09:40:38 2004 From: mostrows at watson.ibm.com (Michal Ostrowski) Date: Sat, 07 Feb 2004 17:40:38 -0500 Subject: G5 timebase_frequency Message-ID: <1076193638.10104.3326.camel@brick.watson.ibm.com> I've noticed that timebase_frequency on a G5 appears to be defined by OF to be 33.3333MHz. OTOH, the specs for the 970 claim that timebase has a frequency of 1/8 that of the CPU clock rate. So, on a 2.0GHz processor timebase would be 250MHz. Which one of these, if any is correct? My guess would be to say 250MHz, but my experiments show that a clock based on timebase, assuming that rate, appears to be about 10% slow. -- Michal Ostrowski ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sun Feb 8 17:45:19 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 08 Feb 2004 17:45:19 +1100 Subject: G5 timebase_frequency In-Reply-To: <1076193638.10104.3326.camel@brick.watson.ibm.com> References: <1076193638.10104.3326.camel@brick.watson.ibm.com> Message-ID: <1076222702.887.81.camel@gaston> On Sun, 2004-02-08 at 09:40, Michal Ostrowski wrote: > I've noticed that timebase_frequency on a G5 appears to be defined by OF > to be 33.3333MHz. OTOH, the specs for the 970 claim that timebase has a > frequency of 1/8 that of the CPU clock rate. So, on a 2.0GHz processor > timebase would be 250MHz. > > Which one of these, if any is correct? > > My guess would be to say 250MHz, but my experiments show that a clock > based on timebase, assuming that rate, appears to be about 10% slow. 10% Only ? :) It's really 33Mhz as OF says. AFAIK, the HID0[19] bit is set by the firmware causing the timebase to be clocked on the rising edge of the TBEN input, thus the timebase is externally clocked. This allows a stable timebase when slewing the cpu & bus clocks. Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Feb 8 18:46:13 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 8 Feb 2004 18:46:13 +1100 Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code In-Reply-To: <401EDFD1.3010203@austin.ibm.com> References: <401EDFD1.3010203@austin.ibm.com> Message-ID: <20040208074613.GB19011@krispykreme> Hi Olof, > I've spent some time cleaning up the PCI DMA mapping code in 2.6. The > patch is large (117KB, 4000 lines), so I won't post it here. It can be > found at: Very nice! Ive thrown this onto a few machines here and am stressing it with random IO benchmarks. Its something we desperately needed done. > I also replaced the old buddy-style allocator for TCE ranges with a > simpler bitmap allocator. 
Time and benchmarking will tell if it's > efficient enough, but it's fairly well abstracted and can easily be > replaced Agreed. I suspect (as with our SLB allocation code) we will only know once the big IO benchmarks have beaten on it. We should get Jose, Rick and Nancy onto it as soon as possible. Some things to think about: - Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for 64bit slot) - Likelihood of large (multiple page) non SG requests. The e1000 comes to mind here, it has an MTU of 16kB so could do a pci_map_single of that size. - Peak TCE usage. After chasing emulex TCE starvation you guys would know the figures for this better than I. - Whether virtual merging makes sense. Virtual merging will place more pressure on our TCE allocation code (because we will end up asking for much more high order TCE allocations). Its also more complex, Id prefer to avoid it unless we do see a performance advantage. - We currently allocate a 2GB window in PCI space for TCEs. This is 4MB worth of TCE tables. Unfortunately we have to allocate an 8MB window on POWER4 boxes because firmware sets up some chip inits to cover the TCE region. If we allocate less and let normal memory get into this region, our performance grinds to a halt. (Its to do with the way TCE coherency is done on POWER4). Allocating a 2GB region unconditionally is also wrong, I have a nighthawk node that has a 3GB IO hole, and yes there is PCI memory allocated at 1GB and above (see below). We get away with it by luck with the current code but its going to hit when we switch to your new code. If we allocate 4MB on POWER3, 8MB on POWER4 and check the OF properties to get the correct range we should be good. - False sharing between the CPU and host bridge. We store a few things to a TCE cacheline (eg for an SG list) then initiate IO. The IO device requests the first address, the host bridge realises it must do a TCE lookup. It then caches this cacheline. 
  Meantime the cpu is setting up another request. It stores to the same
  cacheline which forces the cacheline in the host bridge to be flushed.
  It still hasn't completed the first sg list, so it has to refetch it.

  I think the answer here is to allocate an SG list within a cacheline
  then move onto the next cacheline for the next request. As suggested
  by davem we should convert the network drivers over to using SG lists.

- TCE table bypass, DAC.

Anton

  Bus 0, device 12, function 0:
    SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 3).
      IRQ 17.
      Master Capable. Latency=74. Min Gnt=17.Max Lat=64.
      I/O at 0x7ffc00 [0x7ffcff].
      Non-prefetchable 32 bit memory at 0x40101000 [0x401010ff].
      Non-prefetchable 32 bit memory at 0x40100000 [0x40100fff].

From olof at austin.ibm.com Mon Feb 9 13:17:33 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Sun, 8 Feb 2004 20:17:33 -0600 (CST)
Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
In-Reply-To: <20040208074613.GB19011@krispykreme>
Message-ID:

Anton,

Thanks for the feedback. Comments to a few of the items are below.

-Olof

On Sun, 8 Feb 2004, Anton Blanchard wrote:

> > I also replaced the old buddy-style allocator for TCE ranges with a
> > simpler bitmap allocator. Time and benchmarking will tell if it's
> > efficient enough, but it's fairly well abstracted and can easily be
> > replaced
>
> Agreed. I suspect (as with our SLB allocation code) we will only know
> once the big IO benchmarks have beaten on it. We should get Jose, Rick
> and Nancy onto it as soon as possible.
>
> Some things to think about:
>
> - Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
>   64bit slot)

Not sure I follow this one. Are you saying we need to guarantee a minimum?
On SMP, we just carve up the space reserved by the PHB in equal chunks per
slot, on LPAR we just use the ibm,dma-window property to size the space.
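
As a quick check on the window sizes quoted in this thread, the relation between a DMA window and its TCE table can be sketched as below. This is only an illustration: it assumes 4KB IO pages and 8-byte TCE entries, neither of which is stated in the thread, although both are consistent with the "2GB window is 4MB worth of TCE tables" figure. `tce_table_bytes` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for this sketch only: 4KB IO pages, 8-byte TCE entries. */
#define TCE_PAGE_SHIFT 12
#define TCE_ENTRY_SIZE 8

/* Hypothetical helper: bytes of TCE table needed to map a DMA window. */
static uint64_t tce_table_bytes(uint64_t window_bytes)
{
    /* The window must be a whole number of IO pages. */
    assert((window_bytes & ((1ULL << TCE_PAGE_SHIFT) - 1)) == 0);

    /* One TCE entry per IO page in the window. */
    return (window_bytes >> TCE_PAGE_SHIFT) * TCE_ENTRY_SIZE;
}
```

Under these assumptions a 2GB window needs 4MB of table, and the 128MB and 256MB per-slot minimums mentioned above need 256KB and 512KB respectively.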
> - Likelihood of large (multiple page) non SG requests. The e1000 comes
>   to mind here, it has an MTU of 16kB so could do a pci_map_single of
>   that size.

Yes, the behaviour here would be interesting. It'll still only be 4-page
allocations so I don't expect any trouble. Allocations closer to 16 pages
would be more likely to fail due to fragmentation.

> - Peak TCE usage. After chasing emulex TCE starvation you guys would
>   know the figures for this better than I.

Good point. I'll follow up on this.

[...

> - We currently allocate a 2GB window in PCI space for TCEs. This is 4MB
>   worth of TCE tables. Unfortunately we have to allocate an 8MB window
>   on POWER4 boxes because firmware sets up some chip inits to cover the
>   TCE region. If we allocate less and let normal memory get into this
>   region, our performance grinds to a halt. (It's to do with the way
>   TCE coherency is done on POWER4).
>
>   Allocating a 2GB region unconditionally is also wrong, I have a
>   nighthawk node that has a 3GB IO hole, and yes there is PCI memory
>   allocated at 1GB and above (see below). We get away with it by luck with
>   the current code but it's going to hit when we switch to your new code.
>
>   If we allocate 4MB on POWER3, 8MB on POWER4 and check the OF
>   properties to get the correct range we should be good.

Can we blindly base this on architecture, i.e. all POWER4-based systems
will be fine with 8MB and the other way around?

> - False sharing between the CPU and host bridge. We store a few things
>   to a TCE cacheline (eg for an SG list) then initiate IO. The IO device
>   requests the first address, the host bridge realises it must do a TCE
>   lookup. It then caches this cacheline.
>
>   Meantime the cpu is setting up another request. It stores to the same
>   cacheline which forces the cacheline in the host bridge to be flushed.
>   It still hasn't completed the first sg list, so it has to refetch it.
>
>   I think the answer here is to allocate an SG list within a cacheline
>   then move onto the next cacheline for the next request. As suggested
>   by davem we should convert the network drivers over to using SG lists.

There's room for improvement here, but I did a simple first implementation
that will try to honor the PHB cachelines when allocating: The "next" hint
will always be bumped up to a new cacheline, and SG list allocations will
use their own hint instead of the next pointer.

The result should be that SG lists are packed in cache lines, while other
allocations should always move between lines. The downside is when the
table gets full, but the load should be evenly spread between cache lines
at least. I.e. while we'll have sharing, it should get evenly distributed
in round-robin fashion. The next-allocation hint is explicitly NOT updated
at free time to accommodate this.

> - TCE table bypass, DAC.

Yes, table bypass will be useful for machines with less than 2GB memory as
well (G5s in particular). Ben has a ppc_pci_md (or similar) with function
pointers to the pci_*map* functions; we might need something like that for
boot/runtime selection of allocation behaviour.

-Olof

From anton at samba.org Tue Feb 10 01:47:28 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 10 Feb 2004 01:47:28 +1100
Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
In-Reply-To:
References: <20040208074613.GB19011@krispykreme>
Message-ID: <20040209144728.GE19011@krispykreme>

> > - Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
> >   64bit slot)
>
> Not sure I follow this one. Are you saying we need to guarantee a minimum?
> On SMP, we just carve up the space reserved by the PHB in equal chunks per
> slot, on LPAR we just use the ibm,dma-window property to size the space.

Just looking at worst case scenarios. eg how would an emulex survive
with only 256MB of TCE space.
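
The hint behaviour Olof describes earlier in this message (bump the "next" hint to a fresh cacheline after each allocation, and leave it alone on free) might look roughly like the sketch below. It is not the code from the patch, which is not shown in this thread. The table size, the 16-entries-per-cacheline figure, and every name here (`tce_map`, `tce_alloc`, `tce_free`) are assumptions made for illustration, and the separate hint for SG list allocations is left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TBL_ENTRIES    256  /* hypothetical table size, in TCE entries */
#define ENTRIES_PER_CL 16   /* assumed: 128-byte line / 8-byte entries */

struct tce_map {
    unsigned char used[TBL_ENTRIES]; /* one byte per entry for clarity;
                                        a real bitmap would pack bits */
    size_t next;                     /* search hint for next allocation */
};

static void tce_map_init(struct tce_map *m)
{
    memset(m, 0, sizeof(*m));
}

/* Find npages contiguous free entries at or after start; -1 if none. */
static long find_free_range(const struct tce_map *m, size_t start,
                            size_t npages)
{
    size_t run = 0;

    for (size_t i = start; i < TBL_ENTRIES; i++) {
        run = m->used[i] ? 0 : run + 1;
        if (run == npages)
            return (long)(i + 1 - npages);
    }
    return -1;
}

/* Allocate npages contiguous entries; returns first index or -1. */
static long tce_alloc(struct tce_map *m, size_t npages)
{
    long idx;

    if (npages == 0 || npages > TBL_ENTRIES)
        return -1;

    idx = find_free_range(m, m->next, npages);
    if (idx < 0)
        idx = find_free_range(m, 0, npages); /* wrap and retry once */
    if (idx < 0)
        return -1;

    memset(&m->used[idx], 1, npages);

    /* Bump the hint to the start of the next cacheline so the next
     * request does not share a line with this one. */
    m->next = ((size_t)idx + npages + ENTRIES_PER_CL - 1)
              / ENTRIES_PER_CL * ENTRIES_PER_CL;
    if (m->next >= TBL_ENTRIES)
        m->next = 0;

    return idx;
}

/* Free a range. The hint is deliberately not moved back, so new
 * allocations keep rotating through the table. */
static void tce_free(struct tce_map *m, size_t idx, size_t npages)
{
    assert(idx + npages <= TBL_ENTRIES);
    memset(&m->used[idx], 0, npages);
}
```

With an empty table, successive allocations of 1, 4 and 1 entries land at indexes 0, 16 and 32, and freeing entry 0 does not pull the next allocation back to it. Only once the hint wraps do freed entries near the start get reused, which is the round-robin spread described above.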
> Yes, the behaviour here would be interesting. It'll still only be 4-page
> allocations so I don't expect any trouble. Allocations closer to 16 pages
> would be more likely to fail due to fragmentation.

Agreed. I haven't seen a device that wanted to do really large
allocations. They might do large consistent allocations but that's not so
bad, we can take our time to allocate those and can in the end fail them
(they are usually during driver init).

The only other thing to keep in mind is we enable physical merging on
scatter gather lists, if we were really unlucky we could end up with a
very large element in the SG list.

>
> Can we blindly base this on architecture, i.e. all POWER4-based systems
> will be fine with 8MB and the other way around?

Assuming we are happy with 2GB on pre POWER4 machines we can go with
that. Then in tce table init we check the OF properties and put a limit
on the POWER4 tce table and perhaps the POWER3/RS64 one (eg on nighthawk).

> There's room for improvement here, but I did a simple first implementation
> that will try to honor the PHB cachelines when allocating:

Yep, I'm just dumping all the ideas that have come up over the last year.
I like simple and would prefer not to complicate the allocator unless
we have to.

Anton

From hollisb at us.ibm.com Tue Feb 10 09:31:22 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Mon, 9 Feb 2004 16:31:22 -0600
Subject: extreme RTAS printks
Message-ID:

I have here a boot log in which 280 of the 450 lines are "RTAS" hex
dumps. This is getting really ridiculous.

Regardless of how important these messages are (very likely they're
reporting "hey! this partition was killed unexpectedly!" which is
worthless to me), I would much rather see English from a daemon than hex
in my boot log.

Could we *please* kill printk_log_rtas()? This data is available via
/proc files; there is no need to spam our boot logs with it.
--
Hollis Blanchard
IBM Linux Technology Center

From johnrose at austin.ibm.com Tue Feb 10 10:18:13 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 09 Feb 2004 17:18:13 -0600
Subject: [PATCH] RPA DLPAR/PCIHP cleanups
Message-ID: <1076368692.18739.199.camel@verve>

Hi Linda, World-

I'd like to push the changes below to Ameslab tomorrow morning, and
generate our final submission against the mainline tree from this.

The patch:
- Fixes whitespace misuse
- Removes some debug prints (which you removed in your VIO code anyway)
- Fixes a hotplug bug
- Adds a semaphore to the DLPAR interface to protect against multiple
  users

If there are no objections, will push tomorrow.

Thanks-
John

diff -Nru a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c --- a/drivers/pci/hotplug/rpadlpar_core.c Mon Feb 9 17:05:49 2004 +++ b/drivers/pci/hotplug/rpadlpar_core.c Mon Feb 9 17:05:49 2004 @@ -3,7 +3,7 @@ * * John Rose * Linda Xie - * + * * October 2003 * * Copyright (C) 2003 IBM.
@@ -16,6 +16,7 @@ #include #include #include +#include #include "../pci.h" #include "rpaphp.h" #include "rpadlpar.h" @@ -23,6 +24,8 @@ #define MODULE_VERSION "1.0" #define MODULE_NAME "rpadlpar_io" +static DECLARE_MUTEX(rpadlpar_sem); + static inline int is_hotplug_capable(struct device_node *dn) { unsigned char *ptr = get_property(dn, "ibm,fw-pci-hot-plug-ctrl", NULL); @@ -175,7 +178,7 @@ struct pci_bus *secondary_bus; if (!bridge_dev) { - printk(KERN_ERR "%s: %s() unexpected null device\n", + printk(KERN_ERR "%s: %s() unexpected null device\n", MODULE_NAME, __FUNCTION__); return 1; } @@ -183,7 +186,7 @@ secondary_bus = bridge_dev->subordinate; if (unmap_bus_range(secondary_bus)) { - printk(KERN_ERR "%s: failed to unmap bus range\n", + printk(KERN_ERR "%s: failed to unmap bus range\n", __FUNCTION__); return 1; } @@ -203,36 +206,47 @@ * 0 Success * -ENODEV Not a valid drc_name * -EINVAL Slot already added + * -ERESTARTSYS Signalled before obtaining lock * -EIO Internal PCI Error */ int dlpar_add_slot(char *drc_name) { struct device_node *dn = find_php_slot_node(drc_name); struct pci_dev *dev; + int rc = 0; + + if (down_interruptible(&rpadlpar_sem)) + return -ERESTARTSYS; - if (!dn) - return -ENODEV; + if (!dn) { + rc = -ENODEV; + goto exit; + } /* Check for existing hotplug slot */ - if (find_slot(drc_name)) - return -EINVAL; + if (find_slot(drc_name)) { + rc = -EINVAL; + goto exit; + } /* Add pci bus */ dev = dlpar_pci_add_bus(dn); if (!dev) { printk(KERN_ERR "%s: unable to add bus %s\n", __FUNCTION__, drc_name); - return -EIO; + rc = -EIO; + goto exit; } /* Add hotplug slot for new bus */ if (rpaphp_add_slot(drc_name)) { printk(KERN_ERR "%s: unable to add hotplug slot %s\n", __FUNCTION__, drc_name); - return -EIO; + rc = -EIO; } - - return 0; +exit: + up(&rpadlpar_sem); + return rc; } /** @@ -245,6 +259,7 @@ * 0 Success * -ENODEV Not a valid drc_name * -EINVAL Slot already removed + * -ERESTARTSYS Signalled before obtaining lock * -EIO Internal PCI Error */ int 
dlpar_remove_slot(char *drc_name) @@ -252,35 +267,46 @@ struct device_node *dn = find_php_slot_node(drc_name); struct slot *slot; struct pci_dev *bridge_dev; + int rc = 0; + + if (down_interruptible(&rpadlpar_sem)) + return -ERESTARTSYS; - if (!dn) - return -ENODEV; + if (!dn) { + rc = -ENODEV; + goto exit; + } - if (!(slot = find_slot(drc_name))) - return -EINVAL; + if (!(slot = find_slot(drc_name))) { + rc = -EINVAL; + goto exit; + } bridge_dev = slot->bridge; if (!bridge_dev) { printk(KERN_ERR "%s: %s(): unexpected null bridge device\n", MODULE_NAME, __FUNCTION__); - return -EIO; + rc = -EIO; + goto exit; } /* Remove hotplug slot */ if (rpaphp_remove_slot(slot)) { printk(KERN_ERR "%s: %s(): unable to remove hotplug slot %s\n", MODULE_NAME, __FUNCTION__, drc_name); - return -EIO; + rc = -EIO; + goto exit; } /* Remove pci bus */ if (dlpar_pci_remove_bus(bridge_dev)) { printk(KERN_ERR "%s: %s() unable to remove pci bus %s\n", MODULE_NAME, __FUNCTION__, drc_name); - return -EIO; + rc = -EIO; } - - return 0; +exit: + up(&rpadlpar_sem); + return rc; } static inline int is_dlpar_capable(void) diff -Nru a/drivers/pci/hotplug/rpadlpar_sysfs.c b/drivers/pci/hotplug/rpadlpar_sysfs.c --- a/drivers/pci/hotplug/rpadlpar_sysfs.c Mon Feb 9 17:05:49 2004 +++ b/drivers/pci/hotplug/rpadlpar_sysfs.c Mon Feb 9 17:05:49 2004 @@ -116,7 +116,7 @@ static void dlpar_io_release(struct kobject *kobj) { /* noop */ - return; + return; } struct kobj_type ktype_dlpar_io = { diff -Nru a/drivers/pci/hotplug/rpaphp.h b/drivers/pci/hotplug/rpaphp.h --- a/drivers/pci/hotplug/rpaphp.h Mon Feb 9 17:05:49 2004 +++ b/drivers/pci/hotplug/rpaphp.h Mon Feb 9 17:05:49 2004 @@ -47,8 +47,8 @@ #define ERR_SENSE_USE -9002 /* No DR operation will succeed, slot is unusable */ /* Sensor values from rtas_get-sensor */ -#define EMPTY 0 /* No card in slot */ -#define PRESENT 1 /* Card in slot */ +#define EMPTY 0 /* No card in slot */ +#define PRESENT 1 /* Card in slot */ #if !defined(CONFIG_HOTPLUG_PCI_MODULE) 
#define MY_NAME "rpaphp" @@ -81,11 +81,11 @@ */ struct slot { u32 magic; - int state; - u32 index; - u32 type; - u32 power_domain; - char *name; + int state; + u32 index; + u32 type; + u32 power_domain; + char *name; struct device_node *dn;/* slot's device_node in OFDT */ /* dn has phb info */ struct pci_dev *bridge;/* slot's pci_dev in pci_devices */ diff -Nru a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c --- a/drivers/pci/hotplug/rpaphp_core.c Mon Feb 9 17:05:49 2004 +++ b/drivers/pci/hotplug/rpaphp_core.c Mon Feb 9 17:05:49 2004 @@ -38,8 +38,8 @@ #include "pci_hotplug.h" -static int debug; -static struct semaphore rpaphp_sem; +static int debug; +static struct semaphore rpaphp_sem; static int rpaphp_debug; static LIST_HEAD (rpaphp_slot_head); static int num_slots = 0; @@ -79,13 +79,13 @@ { int rc; - rc = rtas_get_sensor(DR_ENTITY_SENSE, index, state); - - if (rc) { + rc = rtas_get_sensor(DR_ENTITY_SENSE, index, state); + + if (rc) { if (rc == NEED_POWER || rc == PWR_ONLY) { - dbg("%s: slot must be power up to get sensor-state\n", + dbg("%s: slot must be power up to get sensor-state\n", __FUNCTION__); - } else if (rc == ERR_SENSE_USE) + } else if (rc == ERR_SENSE_USE) info("%s: slot is unusable\n", __FUNCTION__); else err("%s failed to get sensor state\n", __FUNCTION__); } @@ -95,8 +95,8 @@ static struct pci_dev *rpaphp_find_bridge_pdev(struct slot *slot) { struct pci_dev *retval_dev = NULL; - - retval_dev = rpaphp_find_pci_dev(slot->dn); + + retval_dev = rpaphp_find_pci_dev(slot->dn); return retval_dev; } @@ -105,7 +105,7 @@ { struct pci_dev * retval_dev = NULL; - retval_dev = rpaphp_find_pci_dev(slot->dn->child); + retval_dev = rpaphp_find_pci_dev(slot->dn->child); return retval_dev; } @@ -118,7 +118,7 @@ dbg("%s - slot == NULL\n", function); return -1; } - + if (!slot->hotplug_slot) { dbg("%s - slot->hotplug_slot == NULL!\n", function); return -1; @@ -137,7 +137,7 @@ slot = (struct slot *)hotplug_slot->private; if 
(slot_paranoia_check(slot, function)) - return NULL; + return NULL; return slot; } @@ -145,16 +145,12 @@ { int rc; - dbg("Entry %s: status=%d\n", __FUNCTION__, status); - /* status: LED_OFF or LED_ON */ rc = rtas_set_indicator(DR_INDICATOR, slot->index, status); if (rc) - err("slot(%s) set attention-status(%d) failed! rc=0x%x\n", - slot->name, status, rc); - - dbg("Exit %s, rc=0x%x\n", __FUNCTION__, rc); - + err("slot(%s) set attention-status(%d) failed! rc=0x%x\n", + slot->name, status, rc); + return rc; } @@ -162,12 +158,12 @@ { int rc; - rc = rtas_get_power_level(slot->power_domain, (int *)value); - if (rc) - err("failed to get power-level for slot(%s), rc=0x%x\n", + rc = rtas_get_power_level(slot->power_domain, (int *)value); + if (rc) + err("failed to get power-level for slot(%s), rc=0x%x\n", slot->name, rc); - - return rc; + + return rc; } static int rpaphp_get_attention_status(struct slot *slot) @@ -191,8 +187,6 @@ if (slot == NULL) return -ENODEV; - dbg("%s - Entry: slot[%s] value[0x%x]\n", - __FUNCTION__, slot->name, value); down(&rpaphp_sem); switch (value) { case 0: @@ -213,8 +207,7 @@ } up(&rpaphp_sem); - - dbg("%s - Exit: rc[%d]\n", __FUNCTION__, retval); + return retval; } @@ -229,7 +222,7 @@ { int retval; struct slot *slot = get_slot(hotplug_slot, __FUNCTION__); - + if (slot == NULL) return -ENODEV; @@ -254,21 +247,16 @@ return -ENODEV; - dbg("%s - Entry: slot[%s]\n", - __FUNCTION__, slot->name); - down(&rpaphp_sem); *value = rpaphp_get_attention_status(slot); up(&rpaphp_sem); - dbg("%s - Exit: value[0x%x] rc[%d]\n", - __FUNCTION__, *value, retval); return retval; } /* * get_adapter_status - get the status of a slot - * + * * 0-- slot is empty * 1-- adapter is configured * 2-- adapter is not configured @@ -278,32 +266,30 @@ { int state, rc; - dbg("Entry %s\n", __FUNCTION__); + *value = NOT_VALID; - *value = NOT_VALID; + rc = rpaphp_get_sensor_state(slot->index, &state); - rc = rpaphp_get_sensor_state(slot->index, &state); - - if (rc) - goto exit; + 
if (rc) + goto exit; if (state == PRESENT) { dbg("slot is occupied\n"); - + if (!is_init) /* at run-time slot->state can be changed by */ /* config/unconfig adapter */ *value = slot->state; else { - if (!slot->dn->child) - dbg("%s: %s is not valid OFDT node\n", + if (!slot->dn->child) + dbg("%s: %s is not valid OFDT node\n", __FUNCTION__, slot->dn->full_name); - else - if (rpaphp_find_pci_dev(slot->dn->child)) + else + if (rpaphp_find_pci_dev(slot->dn->child)) *value = CONFIGURED; else { dbg("%s: can't find pdev of adapter in slot[%s]\n", __FUNCTION__, slot->name); - *value = NOT_CONFIGURED; + *value = NOT_CONFIGURED; } } } @@ -312,10 +298,8 @@ dbg("slot is empty\n"); *value = state; } - -exit: dbg("Exit %s slot[%s] has adapter-status %d rtas call's rc=0x%x\n", - __FUNCTION__, slot->name, *value, rc); +exit: return rc; } @@ -344,14 +328,11 @@ if (slot == NULL) return -ENODEV; - - dbg("%s - Entry: slot->name[%s] slot->type[%d]\n", - __FUNCTION__, slot->name, slot->type); - down(&rpaphp_sem); + down(&rpaphp_sem); switch (slot->type) { - case 1: + case 1: case 2: case 3: case 4: @@ -378,10 +359,10 @@ default: *value = PCI_SPEED_UNKNOWN; break; - + } - up(&rpaphp_sem); + up(&rpaphp_sem); return 0; } @@ -400,7 +381,7 @@ return 0; } -/* +/* * rpaphp_validate_slot - make sure the name of the slot matches * the location code , if the slots is not * empty. 
@@ -409,11 +390,8 @@ { struct device_node *dn; int retval = 0; - - dbg("Entry %s: (name: %s index: 0x%x\n", - __FUNCTION__, slot_name, slot_index); - for(dn = find_all_nodes(); dn; dn = dn->next) { + for(dn = find_all_nodes(); dn; dn = dn->next) { int *index; unsigned char *loc_code; @@ -423,30 +401,25 @@ if (index && *index == slot_index) { char *slash, tmp_str[128]; - loc_code = get_property(dn, "ibm,loc-code", NULL); + loc_code = get_property(dn, "ibm,loc-code", NULL); if (!loc_code) { retval = -1; goto exit; } - dbg("%s: name=%s loc-code=%s index=0x%x\n", - __FUNCTION__, slot_name, loc_code, slot_index); - strcpy(tmp_str, loc_code); slash = strrchr(tmp_str, '/'); if (slash) { *slash = '\0'; } - if (strcmp(slot_name, tmp_str)) + if (strcmp(slot_name, tmp_str)) retval = -1; - goto exit; + goto exit; } } exit: - dbg("Exit %s with retval=%d\n", __FUNCTION__, retval); - return retval; } @@ -454,8 +427,6 @@ static void rpaphp_fixup_new_devices(struct pci_bus *bus) { struct pci_dev *dev; - - dbg("Enter rpaphp_fixup_new_devices()\n"); list_for_each_entry(dev, &bus->devices, bus_list) { /* @@ -467,30 +438,26 @@ pcibios_fixup_device_resources(dev, bus); pci_read_irq_line(dev); for (i = 0; i < PCI_NUM_RESOURCES; i++) { - struct resource *r = &dev->resource[i]; - if (r->parent || !r->start || !r->flags) - continue; - rpaphp_claim_resource(dev, i); - } - + struct resource *r = &dev->resource[i]; + if (r->parent || !r->start || !r->flags) + continue; + rpaphp_claim_resource(dev, i); + } } } } -static struct pci_dev *rpaphp_config_adapter(struct slot *slot) +static struct pci_dev *rpaphp_config_adapter(struct slot *slot) { struct pci_bus *pci_bus; struct device_node *dn; int num; struct pci_dev *dev = NULL; - dbg("Entry %s: slot[%s]\n", - __FUNCTION__, slot->name); - if (slot->bridge) { - + pci_bus = slot->bridge->subordinate; - + if (!pci_bus) { err("%s: can't find bus structure\n", __FUNCTION__); goto exit; @@ -498,14 +465,12 @@ for (dn = slot->dn->child; dn; dn = 
dn->sibling) { dbg("child dn's devfn=[%x]\n", dn->devfn); - num = pci_scan_slot(pci_bus, + num = pci_scan_slot(pci_bus, PCI_DEVFN(PCI_SLOT(dn->devfn), 0)); dbg("pci_scan_slot return num=%d\n", num); if (num) { - dbg("%s: calling rpaphp_fixup_new_devices()\n", - __FUNCTION__); rpaphp_fixup_new_devices(pci_bus); pci_bus_add_devices(pci_bus); } @@ -518,42 +483,37 @@ err("slot doesn't have pci_dev structure\n"); dev = NULL; goto exit; - } + } -exit: +exit: dbg("Exit %s: pci_dev %s\n", __FUNCTION__, dev? "found":"not found"); return dev; } -static int rpaphp_unconfig_adapter(struct slot *slot) +static int rpaphp_unconfig_adapter(struct slot *slot) { int retval = 0; - dbg("Entry %s: slot[%s]\n", - __FUNCTION__, slot->name); if (!slot->dev) { info("%s: no card in slot[%s]\n", __FUNCTION__, slot->name); retval = -EINVAL; - goto exit; + goto exit; } + /* remove the device from the pci core */ + pci_remove_bus_device(slot->dev); - /* remove the device from the pci core */ - pci_remove_bus_device(slot->dev); + pci_dev_put(slot->dev); + slot->state = NOT_CONFIGURED; - pci_dev_put(slot->dev); - slot->state = NOT_CONFIGURED; - dbg("%s: adapter in slot[%s] unconfigured.\n", __FUNCTION__, slot->name); - -exit: - dbg("Exit %s, rc=0x%x\n", __FUNCTION__, retval); +exit: return retval; - + } /* free up the memory user be a slot */ @@ -561,39 +521,32 @@ static void rpaphp_release_slot(struct hotplug_slot *hotplug_slot) { struct slot *slot = get_slot(hotplug_slot, __FUNCTION__); - + if (slot == NULL) return; - dbg("%s - Entry: slot[%s]\n", - __FUNCTION__, slot->name); kfree(slot->hotplug_slot->info); kfree(slot->hotplug_slot->name); kfree(slot->hotplug_slot); pci_dev_put(slot->bridge); pci_dev_put(slot->dev); kfree(slot); - dbg("%s - Exit\n", __FUNCTION__); } int rpaphp_remove_slot(struct slot *slot) { int retval = 0; - dbg("%s - Entry: slot[%s]\n", - __FUNCTION__, slot->name); - sysfs_remove_link(slot->hotplug_slot->kobj.parent, - slot->bridge->slot_name); - + slot->bridge->slot_name); 
+ list_del(&slot->rpaphp_slot_list); retval = pci_hp_deregister(slot->hotplug_slot); if (retval) err("Problem unregistering a slot %s\n", slot->name); num_slots--; - dbg("%s - Exit: rc[%d]\n", __FUNCTION__, retval); - return retval; + return retval; } static int is_php_dn(struct device_node *dn, int **indexes, int **names, int **types, int **power_domains) @@ -604,7 +557,7 @@ /* &names[1] contains NULL terminated slot names */ *names = (int *)get_property(dn, "ibm,drc-names", NULL); - if (!*names) + if (!*names) return(0); /* &types[1] contains NULL terminated slot types */ @@ -615,7 +568,7 @@ /* power_domains[1...n] are the slot power domains */ *power_domains = (int *)get_property(dn, "ibm,drc-power-domains", NULL); - if (!*power_domains) + if (!*power_domains) return(0); if (!get_property(dn, "ibm,fw-pci-hot-plug-ctrl", NULL)) @@ -629,17 +582,17 @@ struct slot *slot; slot = kmalloc(sizeof(struct slot), GFP_KERNEL); - if (!slot) + if (!slot) return (NULL); memset(slot, 0, sizeof(struct slot)); - slot->hotplug_slot = kmalloc(sizeof(struct hotplug_slot), + slot->hotplug_slot = kmalloc(sizeof(struct hotplug_slot), GFP_KERNEL); if (!slot->hotplug_slot) { kfree(slot); return (NULL); - } + } memset(slot->hotplug_slot, 0, sizeof(struct hotplug_slot)); - slot->hotplug_slot->info = kmalloc(sizeof(struct hotplug_slot_info), + slot->hotplug_slot->info = kmalloc(sizeof(struct hotplug_slot_info), GFP_KERNEL); if (!slot->hotplug_slot->info) { kfree(slot->hotplug_slot); @@ -659,17 +612,14 @@ static int setup_hotplug_slot_info(struct slot *slot) { - dbg("%s Initilize the slot info structure ...\n", - __FUNCTION__); - - rpaphp_get_power_status(slot, - &slot->hotplug_slot->info->power_status); + rpaphp_get_power_status(slot, + &slot->hotplug_slot->info->power_status); rpaphp_get_adapter_status(slot, 1, - &slot->hotplug_slot->info->adapter_status); + &slot->hotplug_slot->info->adapter_status); if (slot->hotplug_slot->info->adapter_status == NOT_VALID) { - dbg("%s: NOT_VALID: skip 
dn->full_name=%s\n", + dbg("%s: NOT_VALID: skip dn->full_name=%s\n", __FUNCTION__, slot->dn->full_name); kfree(slot->hotplug_slot->info); kfree(slot->hotplug_slot->name); @@ -682,7 +632,7 @@ static int register_slot(struct slot *slot) { - int retval; + int retval; retval = pci_hp_register(slot->hotplug_slot); if (retval) { @@ -692,7 +642,7 @@ } /* create symlink between slot->name and it's bus_id */ dbg("%s: sysfs_create_link: %s --> %s\n", __FUNCTION__, - slot->bridge->slot_name, slot->name); + slot->bridge->slot_name, slot->name); retval = sysfs_create_link(slot->hotplug_slot->kobj.parent, &slot->hotplug_slot->kobj, slot->bridge->slot_name); @@ -702,12 +652,12 @@ return (retval); } /* add slot to our internal list */ - dbg("%s adding slot[%s] to rpaphp_slot_list\n", + dbg("%s adding slot[%s] to rpaphp_slot_list\n", __FUNCTION__, slot->name); list_add(&slot->rpaphp_slot_list, &rpaphp_slot_head); - info("Slot [%s] (bus_id=%s) registered\n", + info("Slot [%s] (bus_id=%s) registered\n", slot->name, slot->bridge->slot_name); return (0); } @@ -721,38 +671,33 @@ struct slot *slot; int retval = 0; int i; - struct device_node *dn; - int *indexes, *names, *types, *power_domains; - char *name, *type; - - dbg("Entry %s: %s\n", __FUNCTION__, - slot_name? 
slot_name: "init"); + struct device_node *dn; + int *indexes, *names, *types, *power_domains; + char *name, *type; for (dn = find_all_nodes(); dn; dn = dn->next) { if (dn->name != 0 && strcmp(dn->name, "pci") == 0) { if (!is_php_dn(dn, &indexes, &names, &types, &power_domains)) continue; - + dbg("%s : found device_node in OFDT full_name=%s, name=%s\n", __FUNCTION__, dn->full_name, dn->name); name = (char *)&names[1]; type = (char *)&types[1]; - - dbg("%s: indexes=%d\n", __FUNCTION__, indexes[0]); - for (i = 0; i < indexes[0]; - i++, + for (i = 0; i < indexes[0]; + i++, name += (strlen(name) + 1), type += (strlen(type) + 1)) { dbg("%s: name[%s] index[%x]\n", __FUNCTION__, name, indexes[i+1]); - if (slot_name && strcmp(slot_name, name)) + if (slot_name && strcmp(slot_name, name)) continue; - + if (rpaphp_validate_slot(name, indexes[i + 1])) { dbg("%s: slot(%s, 0x%x) is invalid.\n", __FUNCTION__, name, indexes[i+ 1]); @@ -767,7 +712,7 @@ slot->name = slot->hotplug_slot->name; slot->index = indexes[i + 1]; strcpy(slot->name, name); - slot->type = simple_strtoul(type, NULL, 10); + slot->type = simple_strtoul(type, NULL, 10); if (slot->type < 1 || slot->type > 16) slot->type = 0; @@ -779,7 +724,7 @@ slot->dn = dn; /* - * Initilize the slot info structure with some known + * Initilize the slot info structure with some known * good values. 
*/ if (setup_hotplug_slot_info(slot)) @@ -787,15 +732,15 @@ slot->bridge = rpaphp_find_bridge_pdev(slot); if (!slot->bridge && slot_name) { /* slot being added doesn't have pci_dev yet*/ - dbg("%s: no pci_dev for bridge dn %s\n", + dbg("%s: no pci_dev for bridge dn %s\n", __FUNCTION__, slot_name); - kfree(slot->hotplug_slot->info); - kfree(slot->hotplug_slot->name); - kfree(slot->hotplug_slot); - kfree(slot); + kfree(slot->hotplug_slot->info); + kfree(slot->hotplug_slot->name); + kfree(slot->hotplug_slot); + kfree(slot); continue; } - + /* find slot's pci_dev if it's not empty*/ if (slot->hotplug_slot->info->adapter_status == EMPTY) { slot->state = EMPTY; /* slot is empty */ @@ -812,49 +757,35 @@ continue; } - - slot->dev = rpaphp_find_adapter_pdev(slot); - - if (!slot->dev && slot_name) { - /* adapter being added doesn't have pci_dev yet */ - slot->dev = rpaphp_config_adapter(slot); - if (!slot->dev) { - err("%s: add new adapter device for slot[%s] failed\n", - __FUNCTION__, slot->name); - kfree(slot->hotplug_slot->info); - kfree(slot->hotplug_slot->name); - kfree(slot->hotplug_slot); - kfree(slot); - pci_dev_put(slot->bridge); - continue; - - } - } + slot->dev = rpaphp_find_adapter_pdev(slot); if(slot->dev) { slot->state = CONFIGURED; pci_dev_get(slot->dev); } - else + else { + /* DLPAR add as opposed to + * boot time */ slot->state = NOT_CONFIGURED; + } } dbg("%s registering slot:path[%s] index[%x], name[%s] pdomain[%x] type[%d]\n", - __FUNCTION__, dn->full_name, slot->index, slot->name, - slot->power_domain, slot->type); + __FUNCTION__, dn->full_name, slot->index, slot->name, + slot->power_domain, slot->type); if ((retval = register_slot(slot))) goto exit; num_slots++; - - if (slot_name) + + if (slot_name) goto exit; }/* for indexes */ }/* "pci" */ }/* find_all_nodes */ exit: - dbg("%s - Exit: num_slots=%d rc[%d]\n", + dbg("%s - Exit: num_slots=%d rc[%d]\n", __FUNCTION__, num_slots, retval); return retval; } @@ -867,12 +798,8 @@ { int retval = 0; - dbg("Entry 
%s\n", __FUNCTION__); - retval = rpaphp_add_slot(NULL); - dbg("Exit %s with retval=%d\n", __FUNCTION__, retval); - return retval; } @@ -881,17 +808,12 @@ { int retval = 0; - dbg("Entry %s\n", __FUNCTION__); - init_MUTEX(&rpaphp_sem); - + /* initialize internal data structure etc. */ retval = init_slots(); if (!num_slots) retval = -ENODEV; - - dbg("Exit %s with retval=%d, num_slots=%d\n", - __FUNCTION__, retval, num_slots); return retval; } @@ -904,12 +826,12 @@ /* * Unregister all of our slots with the pci_hotplug subsystem, * and free up all memory that we had allocated. - * memory will be freed in release_slot callback. + * memory will be freed in release_slot callback. */ list_for_each_safe (tmp, n, &rpaphp_slot_head) { slot = list_entry(tmp, struct slot, rpaphp_slot_list); - sysfs_remove_link(slot->hotplug_slot->kobj.parent, + sysfs_remove_link(slot->hotplug_slot->kobj.parent, slot->bridge->slot_name); list_del(&slot->rpaphp_slot_list); pci_hp_deregister(slot->hotplug_slot); @@ -923,7 +845,6 @@ { int retval = 0; - dbg("Entry %s\n", __FUNCTION__); info(DRIVER_DESC " version: " DRIVER_VERSION "\n"); rpaphp_debug = debug; @@ -931,7 +852,6 @@ /* read all the PRA info from the system */ retval = init_rpa(); - dbg("Exit %s with retval=%d\n", __FUNCTION__, retval); return retval; } @@ -951,21 +871,24 @@ if (slot == NULL) return -ENODEV; - dbg("%s - Entry: slot[%s]\n", - __FUNCTION__, slot->name); - + if (slot->state == CONFIGURED) { + dbg("%s: %s is already enabled\n", + __FUNCTION__, slot->name); + goto exit; + } + dbg("ENABLING SLOT %s\n", slot->name); down(&rpaphp_sem); - retval = rpaphp_get_sensor_state(slot->index, &state); - - if (retval) - goto exit; + retval = rpaphp_get_sensor_state(slot->index, &state); + + if (retval) + goto exit; dbg("%s: sensor state[%d]\n", __FUNCTION__, state); - /* if slot is not empty, enable the adapter */ + /* if slot is not empty, enable the adapter */ if (state == PRESENT) { dbg("%s : slot[%s] is occupid.\n", __FUNCTION__, 
slot->name); @@ -984,7 +907,7 @@ } } - else if (state == EMPTY) { + else if (state == EMPTY) { dbg("%s : slot[%s] is empty\n", __FUNCTION__, slot->name); slot->state = EMPTY; } @@ -993,30 +916,26 @@ slot->state = NOT_VALID; retval = -EINVAL; } - -exit: + +exit: if (slot->state != NOT_VALID) rpaphp_set_attention_status(slot, LED_ON); else rpaphp_set_attention_status(slot, LED_ID); up(&rpaphp_sem); - dbg("%s - Exit: rc[%d]\n", __FUNCTION__, retval); - - return retval; + + return retval; } static int disable_slot(struct hotplug_slot *hotplug_slot) { int retval; struct slot *slot = get_slot(hotplug_slot, __FUNCTION__); - if (slot == NULL) return -ENODEV; - - dbg("%s - Entry: slot[%s]\n", - __FUNCTION__, slot->name); + dbg("DISABLING SLOT %s\n", slot->name); down(&rpaphp_sem); @@ -1024,13 +943,12 @@ rpaphp_set_attention_status(slot, LED_ID); retval = rpaphp_unconfig_adapter(slot); - + rpaphp_set_attention_status(slot, LED_OFF); up(&rpaphp_sem); - dbg("%s - Exit: rc[%d]\n", __FUNCTION__, retval); - return retval; + return retval; } module_init(rpaphp_init); diff -Nru a/drivers/pci/hotplug/rpaphp_pci.c b/drivers/pci/hotplug/rpaphp_pci.c --- a/drivers/pci/hotplug/rpaphp_pci.c Mon Feb 9 17:05:49 2004 +++ b/drivers/pci/hotplug/rpaphp_pci.c Mon Feb 9 17:05:49 2004 @@ -32,12 +32,12 @@ struct pci_dev *retval_dev = NULL, *dev = NULL; while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { - if(!dev->bus) + if(!dev->bus) continue; - - if (dev->devfn != dn->devfn) + + if (dev->devfn != dn->devfn) continue; - + if (dn->phb->global_number == pci_domain_nr(dev->bus) && dn->busno == dev->bus->number) { retval_dev = dev; @@ -46,9 +46,9 @@ } return retval_dev; - + } - + int rpaphp_claim_resource(struct pci_dev *dev, int resource) { struct resource *res = &dev->resource[resource]; @@ -63,9 +63,9 @@ if (err) { err("PCI: %s region %d of %s %s [%lx:%lx]\n", - root ? 
"Address space collision on" : - "No parent found for", - resource, dtype, pci_name(dev), res->start, res->end); + root ? "Address space collision on" : + "No parent found for", + resource, dtype, pci_name(dev), res->start, res->end); } return err; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Tue Feb 10 14:01:55 2004 From: boutcher at us.ibm.com (David Boutcher) Date: Mon, 9 Feb 2004 21:01:55 -0600 Subject: extreme RTAS printks In-Reply-To: Message-ID: owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/09/2004 04:31:22 PM: > Could we *please* kill printk_log_rtas()? This data is available via > /proc files; there is no need to spam our boot logs with it. Agreed!!!!! Or at least provide the decoder ring so we (and later, end users) know what they say! Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Tue Feb 10 14:59:04 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Mon, 9 Feb 2004 21:59:04 -0600 (CST) Subject: extreme RTAS printks In-Reply-To: Message-ID: On Mon, 9 Feb 2004, Hollis Blanchard wrote: > Could we *please* kill printk_log_rtas()? This data is available via > /proc files; there is no need to spam our boot logs with it. I think killing it completely is a bad idea. There are times when a machine has bad hardware, but good enough to boot halfway up. Getting the error messages then could be very helpful. Likewise, at runtime the RTAS messages are also useful since they'd show up on the console (and in the dmesg output in kdb). But, as you said, at boot time there's normally limited use for them. If they're killed, I want to see a command line option to enable them if needed. I also don't think that post-boot output should be removed at all. 
-Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Tue Feb 10 15:00:43 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Mon, 09 Feb 2004 22:00:43 -0600 Subject: extreme RTAS printks In-Reply-To: References: Message-ID: <4028576B.40205@austin.ibm.com> Hollis Blanchard wrote: > Could we *please* kill printk_log_rtas()? This data is available via > /proc files; there is no need to spam our boot logs with it. How about making it a config option? Patch attached. I'm not sure the help text is 100% correct - do RAS tools depend on the messages being in dmesg output or do they look at the /proc files? Another idea is changing the log level of these messages from KERN_ERR to something lower priority like KERN_INFO or KERN_DEBUG. At least that way the console doesn't get spammed in default configurations. Nathan -------------- next part -------------- A non-text attachment was scrubbed... Name: rtas_verbose.patch Type: text/x-patch Size: 1067 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040209/4772b6c2/attachment.bin From boutcher at us.ibm.com Wed Feb 11 04:18:00 2004 From: boutcher at us.ibm.com (David Boutcher) Date: Tue, 10 Feb 2004 11:18:00 -0600 Subject: extreme RTAS printks In-Reply-To: Message-ID: owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/09/2004 09:59:04 PM: > I think killing it completely is a bad idea. There are times when a > machine has bad hardware, but good enough to boot halfway up. Getting the > error messages then could be very helpful. Likewise, at runtime the RTAS > messages are also useful since they'd show up on the console (and in the > dmesg output in kdb). OK, but is there any way to make them more meaningful than a hex dump? 
No user is going to take any action on a hex dump, and there is no way of differentiating bad hardware from some mild informational log. I know the kernel doesn't decode these things, but can you at least get a severity or a classification or something? Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 11 04:18:11 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 10 Feb 2004 11:18:11 -0600 Subject: extreme RTAS printks In-Reply-To: References: Message-ID: <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com> On Feb 9, 2004, at 9:59 PM, olof at austin.ibm.com wrote: > On Mon, 9 Feb 2004, Hollis Blanchard wrote: > >> Could we *please* kill printk_log_rtas()? This data is available via >> /proc files; there is no need to spam our boot logs with it. > > I think killing it completely is a bad idea. There are times when a > machine has bad hardware, but good enough to boot halfway up. Getting > the > error messages then could be very helpful. Doesn't the service processor log such hardware errors for exactly this reason? > Likewise, at runtime the RTAS > messages are also useful since they'd show up on the console (and in > the > dmesg output in kdb). So you're saying you *do* want to see hex dumps? Printing 64 lines of hex to a console you're actually trying to use I think is much worse even than getting it at boot time. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Feb 11 04:18:57 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 10 Feb 2004 11:18:57 -0600 Subject: extreme RTAS printks In-Reply-To: References: Message-ID: <40291281.1000905@austin.ibm.com> David Boutcher wrote: > OK, but is there any way to make them more meaningful than a hex dump?
No > user is going to take any action on a hex dump, and there is no way of > differentiating bad hardware from some mild informational log. I know the > kernel doesn't decode these things, but can you at least get a severity or > a classification or something? Hmm, there's a userspace package that's used to parse all that info. I don't know enough about the internal binary format, but I'd think that there are severity levels to it. Maybe we can have a threshold for what gets printed at boot and not? Mike, got any insight to share? :-) -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Feb 11 04:25:44 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 10 Feb 2004 11:25:44 -0600 Subject: extreme RTAS printks In-Reply-To: <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com> References: <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com> Message-ID: <40291418.7050003@austin.ibm.com> Hollis Blanchard wrote: >> I think killing it completely is a bad idea. There are times when a >> machine has bad hardware, but good enough to boot halfway up. Getting the >> error messages then could be very helpful. > > Doesn't the service processor log such hardware errors for exactly this > reason? I thought the SP only logged checkstops and other severe errors. Or does it log ECC parity errors/corrections and other "minor" problems too? >> Likewise, at runtime the RTAS >> messages are also useful since they'd show up on the console (and in the >> dmesg output in kdb). > > > So you're saying you *do* want to see hex dumps? Printing 64 lines of > hex to a console you're actually trying to use I think is much worse > even than getting it at boot time. If you're getting the RTAS messages on the console, the machine is likely going to be unstable anyway.
Would you prefer it to be quiet and not warn at all? Also, the data needs to be accessible from a debugger, be it via dmesg or in other ways. Hex is much better than nothing at all. -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From strosake at austin.ibm.com Wed Feb 11 09:30:50 2004 From: strosake at austin.ibm.com (Mike Strosaker) Date: Tue, 10 Feb 2004 16:30:50 -0600 Subject: extreme RTAS printks In-Reply-To: References: Message-ID: <40295B9A.2040101@austin.ibm.com> Olof Johansson wrote: > Hmm, there's a userspace package that's used to parse all that info. I > don't know enough about the internal binary format, but I'd think that > there's severity levels to it. Maybe we can have a threshold for what > gets printed at boot and not? It would require a fairly significant amount of parsing inside the kernel to determine if the message is worthy of being printed. There's no one severity field that makes thresholding easy. Also, the parsing code would need to be updated to reflect any new error log formats in the future, which is why it's better done in userspace. Nathan Lynch wrote: > How about making it a config option? Patch attached. I'm not sure the > help text is 100% correct - do RAS tools depend on the messages being > in dmesg output or do they look at the /proc files? > > Another idea is changing the log level of these messages from KERN_ERR > to something lower priority like KERN_INFO or KERN_DEBUG. At least > that way the console doesn't get spammed in default configurations. The userspace RAS tools look at the /proc file; there's no harm from that perspective in either of the above solutions. I'm all for doing either, but the latter (KERN_INFO or KERN_DEBUG) sounds like a particularly good compromise to me... 
that way the messages are still there in case of emergency. Thanks, Mike Strosaker ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Wed Feb 11 10:28:24 2004 From: boutcher at us.ibm.com (David Boutcher) Date: Tue, 10 Feb 2004 17:28:24 -0600 Subject: extreme RTAS printks In-Reply-To: <40295B9A.2040101@austin.ibm.com> Message-ID: owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/10/2004 04:30:50 PM: > The userspace RAS tools look at the /proc file; there's no harm from > that perspective in either of the above solutions. I'm all for doing > either, but the latter (KERN_INFO or KERN_DEBUG) sounds like a > particularly good compromise to me... that way the messages are still > there in case of emergency. So we are going to document the format of the hex dump so that it is useful to people? If not, I'm back to wondering exactly WHO the large kernel messages are useful for. Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Feb 11 10:30:59 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 10 Feb 2004 17:30:59 -0600 Subject: extreme RTAS printks In-Reply-To: References: Message-ID: <402969B3.1080401@austin.ibm.com> David Boutcher wrote: > So we are going to document the format of the hex dump so that it is useful > to people? If not, I'm back to wondering exactly WHO the large kernel > messages are useful for. Doesn't look like it. I was of the impression that the RAS tools used the /var/log/messages dump. If they don't, then there's no use in printing it at all (_as long as_ there's a way to get to it from a debugger, and/or as long as the last ones are dumped right before a panic). -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Thu Feb 12 01:22:35 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 11 Feb 2004 08:22:35 -0600 Subject: extreme RTAS printks In-Reply-To: <402969B3.1080401@austin.ibm.com> References: <402969B3.1080401@austin.ibm.com> Message-ID: <1076509355.10309.40.camel@DYN279927END.austin.ibm.com> On Tue, 2004-02-10 at 17:30, Olof Johansson wrote: > David Boutcher wrote: > > > So we are going to document the format of the hex dump so that it is useful > > to people? If not, I'm back to wondering exactly WHO the large kernel > > messages are useful for. > > > Doesn't look like it. I was of the impression that the RAS tools used > the /var/log/messages dump. If they don't, then there's no use in > printing it at all (_as long as_ there's a way to get to it from a > debugger, and/or as long as the last ones are dumped right before a panic). Putting them both in /proc and in /var/log/messages was an interim solution until all boxes have rtas_errd and diagela. These errors need to be saved for a CE to diagnose the problem in the field; otherwise we have no first failure data to analyze. These messages should only be seen in two cases: either there is an error that needs to be reported from a real failure, or NVRAM is not being cleared of the error because rtas_errd is not installed on the machine and it's showing up on every boot. The error logs are going to 2k, so there could be a lot of messages printed in the future. I think it's a good compromise to move the log level to KERN_INFO. That way the data is still stored for a CE, and the messages won't annoy Hollis. :) Thanks, Jake ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 12 04:10:39 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 11 Feb 2004 11:10:39 -0600 Subject: extreme RTAS printks In-Reply-To: <40291418.7050003@austin.ibm.com>; from olof@austin.ibm.com on Tue, Feb 10, 2004 at 11:25:44AM -0600 References: <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com> <40291418.7050003@austin.ibm.com> Message-ID: <20040211111039.B58152@forte.austin.ibm.com> On Tue, Feb 10, 2004 at 11:25:44AM -0600, Olof Johansson wrote: > > I thought the SP only logged checkstops and other severe errors. Or does > it log ECC parity errors/corrections and other "minor" problems too? It's reported minor/ridiculous warnings in the past ... > If you're getting the RTAS messages on the console, the machine is > likely going to be unstable anyway. Would you prefer it to be quiet and > not warn at all? The preference that Hollis & the majority express is for an English-language message. When I asked about this ages ago, the answer I got back was: "The logic to decode these messages is complex and is getting more complex every day with new additions. This logic is too big to fit in the kernel and should be a userland process/daemon (closed source, at that) instead." I dunno, I'm not sure I buy the "too complex" story. Sure, there's a lot of data in the hex dump, but a simple two-sentence English-language summary sure would be nice, *especially* during boot of a failing machine. And that amount of code would not be big or complex. > Also, the data needs to be accessible from a debugger, be it via dmesg > or in other ways. Hex is much better than nothing at all. Well, a quick extension to add rtas error decode to kdb sure would be cute! I don't think it's hard at all, not the basics. --linas ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Feb 12 04:34:23 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 11 Feb 2004 11:34:23 -0600 Subject: extreme RTAS printks In-Reply-To: ; from boutcher@us.ibm.com on Tue, Feb 10, 2004 at 05:28:24PM -0600 References: <40295B9A.2040101@austin.ibm.com> Message-ID: <20040211113422.C58152@forte.austin.ibm.com> On Tue, Feb 10, 2004 at 05:28:24PM -0600, David Boutcher wrote: > > owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/10/2004 04:30:50 PM: > > The userspace RAS tools look at the /proc file; there's no harm from > > that perspective in either of the above solutions. I'm all for doing > > either, but the latter (KERN_INFO or KERN_DEBUG) sounds like a > > particularly good compromise to me... that way the messages are still > > there in case of emergency. > > So we are going to document the format of the hex dump so that it is useful > to people? If not, I'm back to wondering exactly WHO the large kernel > messages are useful for. The format is documented in the RPA. I heard a rumour last summer that there was a move afoot to make the rpa open to the general public but I don't know whether that panned out. --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From brking at us.ibm.com Thu Feb 12 10:29:19 2004 From: brking at us.ibm.com (Brian King) Date: Wed, 11 Feb 2004 17:29:19 -0600 Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs References: <20040203183459.B27780@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com> <20040205142643.T27780@forte.austin.ibm.com> Message-ID: <402ABACF.3090809@us.ibm.com> I just want to express the importance of this patch. The 2.6 ipr driver requires it, since it regularly hits the eeh_check_failure bug. Please apply. -Brian linas at austin.ibm.com wrote: > OK, > > Fifth time's a charm ... 
> > base64 encoding the patch helps prevent the mail gateways from mangling it, > but then its too big for the mailing list manager. You can ftp the patch > > http://www-124.ibm.com/linux/patches/?patch_id=1344 > > To repeat the original note: > > Patch for multiple EEH-related bugs. Please review this patch, > & if appropriate, please apply. It should apply cleanly to > the current ameslab tree (Feb 03 2004 2.6.2-rc3). > > This patch fixes multiple EEH-related bugs: > > -- Fixes the eeh_check_failure() usage in an interrupt context. > This routine is now safe to use in an interrupt. The fix was to > build a cache of IO addresses and check that, instead of using > the pci routines. > -- Merges in Olof Johansson's sizeof patch when checking for failure > -- Adds EEH tests to array/string reads > -- Fixes bugs with address resolution (some i/o addresses were handled > incorrectly, resulting in EEH errors slipping by undetected.) > -- Adds EEH support to the PCI Hotplug system (so that devices that > get added/removed get properly registered with the EEH subsystem.) > -- Fixes improper use of /proc filesystem. > -- Adds some misc statistics. > > Please note that the EEH subsystem will be undergoing a major revision > in the not-to-distant future; this patch is a 'stopgap' to address the > immediate concerns/issues until that time. > > --linas > > > On Wed, Feb 04, 2004 at 02:28:53PM -0600, linas at austin.ibm.com wrote: > >>On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote: >> >>>On Tue, 3 Feb 2004 linas at austin.ibm.com wrote: >>> >>> >>>>Patch for multiple EEH-related bugs. Please review this patch, >>>>& if appropriate, please apply. It should apply cleanly to >>>>the current ameslab tree (Feb 03 2004 2.6.2-rc3). >>> >>>I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you >>>made the diff against a current ameslab tree? >> >>Right tree, bad email attachment. 
>> >>I don't know how it happened, but what I sent out had some trailing >>whitespace whacked. The attached patch should not have this problem. > > > > -- Brian King eServer Storage I/O IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Fri Feb 13 02:09:07 2004 From: olh at suse.de (Olaf Hering) Date: Thu, 12 Feb 2004 16:09:07 +0100 Subject: autoconsole In-Reply-To: <20040115011808.GD27924@krispykreme> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> Message-ID: <20040212150907.GA13059@suse.de> On Thu, Jan 15, Anton Blanchard wrote: > > > I haven't had time to check it out yet, but Sparc pushed an > > add_preferred_console() to 2.5 a couple weeks ago. See > > arch/sparc/kernel/setup.c set_preferred_console() ; it's a bit cleaner > > looking than what you've posted here. :) > > Agreed, how does this look? I could only compile test it, I dont have a > machine to run on at the moment. have you tested it? console=tty1 doesnt work, console output still goes straight to ttyS0. cmd_line is probably not yet set, my cmdline contains alot of stuff, but only the first word was printed with printk("%s(%u) cmd_line is %s\n",__FUNCTION__,__LINE__,cmd_line); in set_preferred_console(). So strstr(cmd_line, "console=") does not trigger. Does it just not work for me? -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Fri Feb 13 02:55:53 2004 From: olh at suse.de (Olaf Hering) Date: Thu, 12 Feb 2004 16:55:53 +0100 Subject: bogus check in sg_add() Message-ID: <20040212155553.GA11426@suse.de> what is up with this check in ameslab? Still needed, what does it fix? 
diff -purN linux-2.5/drivers/scsi/sg.c linuxppc64-2.5/drivers/scsi/sg.c --- linux-2.5/drivers/scsi/sg.c 2004-02-06 08:21:23.000000000 +0000 +++ linuxppc64-2.5/drivers/scsi/sg.c 2004-02-10 08:37:26.000000000 +0000 @@ -1343,6 +1343,9 @@ sg_add(struct class_device *cl_dev) struct cdev * cdev = NULL; int k, error; + if (scsidp->type == 255) + return 0; + disk = alloc_disk(1); if (!disk) return -ENOMEM; -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From brking at us.ibm.com Fri Feb 13 04:19:06 2004 From: brking at us.ibm.com (Brian King) Date: Thu, 12 Feb 2004 11:19:06 -0600 Subject: bogus check in sg_add() References: <20040212155553.GA11426@suse.de> Message-ID: <402BB58A.7090208@us.ibm.com> Someone please remove this check. The ipr driver reports a device of type 255, and wants an sg started for it. -Brian Olaf Hering wrote: > what is up with this check in ameslab? Still needed, what does it fix? > > > diff -purN linux-2.5/drivers/scsi/sg.c linuxppc64-2.5/drivers/scsi/sg.c > --- linux-2.5/drivers/scsi/sg.c 2004-02-06 08:21:23.000000000 +0000 > +++ linuxppc64-2.5/drivers/scsi/sg.c 2004-02-10 08:37:26.000000000 +0000 > @@ -1343,6 +1343,9 @@ sg_add(struct class_device *cl_dev) > struct cdev * cdev = NULL; > int k, error; > > + if (scsidp->type == 255) > + return 0; > + > disk = alloc_disk(1); > if (!disk) > return -ENOMEM; > > -- > USB is for mice, FireWire is for men! > > sUse lINUX ag, nÜRNBERG > > > -- Brian King eServer Storage I/O IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Fri Feb 13 07:55:14 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Thu, 12 Feb 2004 14:55:14 -0600 Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon Message-ID: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> Here's a patch to fix hardware breakpoints and add support for LPAR systems in xmon. On an SMP system, the breakpoints appeared to have been working for hitting the breakpoint. When you exited xmon to continue, you would instead hit the same breakpoint again. There were a couple of errors in the check to see if xmon needed to single step over the instruction. The other problem I was seeing was when a breakpoint was cleared on one CPU, unless the other CPUs were stopped as well, they would not clear their dabr until they hit xmon. Thanks, Jake -------------- next part -------------- A non-text attachment was scrubbed... Name: linux-2.6-xmon-dabr-fix-1.patch Type: text/x-patch Size: 4794 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040212/85c8f2d6/attachment.bin From olof at austin.ibm.com Fri Feb 13 08:00:58 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Thu, 12 Feb 2004 15:00:58 -0600 Subject: [2.4] [PATCH] hash_page rework, take 2 In-Reply-To: <40104461.5030804@austin.ibm.com> References: <40104461.5030804@austin.ibm.com> Message-ID: <402BE98A.9090304@austin.ibm.com> After feedback from Julie and Ben, here's a revised mainline patch. Consider this last call for comments before I push. :-) Changes: * Add an smp_mb() in pte_free_sync(), to make sure that any 0-writes to PTE/PMDs are seen on other processors before the free. * Make sure "newpp" is never more than 3 bits. This saves us from crashing older iSeries hypervisors. Shouldn't be a problem since we don't do aging in 2.4, but I prefer the more conservative approach. * Add an isync after the _PAGE_BUSY lock, to avoid out-of-order execution.
* Don't invalidate/update/validate a HPTE in hpte_updatepp unless the pp bits have changed. This avoids a pileup of faults caused by other processors faulting during the period when the HPTE is invalid and/or several processors faulting at the same time and resolving the same fault. * Other logic fixes based on discussions on this list, mostly dealing with timing windows during which _PAGE_BUSY was cleared inappropriately. -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: hash_page-rework-feb12 Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040212/a4abb01e/attachment.txt From johnrose at us.ibm.com Fri Feb 13 08:13:57 2004 From: johnrose at us.ibm.com (John H Rose) Date: Thu, 12 Feb 2004 15:13:57 -0600 Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon Message-ID: I might be confused, but don't these qualify as hardware watchpoints rather than breakpoints? Regardless, the patch looks good :) Thanks- John ----------------------- John Rose pSeries Linux Development johnrose at austin.ibm.com Office: 512-838-0298 Tieline: 678-0298 Jake Moilanen @lists.linuxppc.org on 02/12/2004 02:55:14 PM Sent by: owner-linuxppc64-dev at lists.linuxppc.org To: linuxppc64-dev at lists.linuxppc.org cc: Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon Here's a patch to fix hardware breakpoints and add support for LPAR systems in xmon. On an SMP system, the breakpoints appeared to have been working for hitting the breakpoint. When you exited xmon to continue, you would instead hit the same breakpoint again. There were a couple errors for the check to see if xmon needed to single step over the instruction. 
The other problem I was seeing was when breakpoint was cleared on one CPU, unless the other CPUS were stopped as well, they would not clear their dabr until they hit xmon. Thanks, Jake ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Fri Feb 13 08:20:02 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 13 Feb 2004 08:20:02 +1100 Subject: [2.4] [PATCH] hash_page rework, take 2 In-Reply-To: <402BE98A.9090304@austin.ibm.com> References: <40104461.5030804@austin.ibm.com> <402BE98A.9090304@austin.ibm.com> Message-ID: <1076620802.12434.39.camel@gaston> On Fri, 2004-02-13 at 08:00, Olof Johansson wrote: > After feedback from Julie and Ben, here's a revised mainline patch. > Consider this last call for comments before I push. :-) > > > Changes: > > * Add a smb_mb() in pte_free_sync(), to make sure that any 0-writes to > PTE/PMDs are seen on other processors before the free. Actually, what matters is before the is_locked(), we act both as a write barrier for the 0 and a read barrier for is_locked(). That said... I wonder if the implementation of is_locked() shouldn't have a rmb() by default after all ... > * Make sure "newpp" is never more than 3 bits. This saves us from > crashing older iSeries hypervisor. Shouldn't be a problem since we don't > do aging in 2.4, but I prefer the more conservative approach. Not only crashing older HVs, but crashing the kernel with newer HVs ;) > * Add an isync after the _PAGE_BUSY lock, to avoid out-of-order execution. > > * Don't invalidate/update/validate a HPTE in hpte_updatepp unless the pp > bits have changed. This avoids a pileup of faults caused by other > processors faulting during the period when the HPTE is invalid and/or > several processors faulting at the same time and resolving the same fault. > > * Other logic fixes based on discussions on this list, mostly dealing > with timing windows during which _PAGE_BUSY was cleared inappropriately. 
Did you spot any case that could have affected the 2.6 version? Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Fri Feb 13 08:26:50 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Thu, 12 Feb 2004 15:26:50 -0600 Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon In-Reply-To: References: Message-ID: <1076621210.10309.93.camel@DYN279927END.austin.ibm.com> On Thu, 2004-02-12 at 15:13, John H Rose wrote: > I might be confused, but don't these qualify as hardware watchpoints rather > than breakpoints? Regardless, the patch looks good :) The correct name is "Data Access Breakpoints". But depending on who or what you are asking they'll go by different names. I think it's more of an AIXism to call them watchpoints. I was following the xmon naming convention of calling them hardware breakpoints. Thanks, Jake ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Fri Feb 13 08:34:16 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 13 Feb 2004 08:34:16 +1100 Subject: bogus check in sg_add() In-Reply-To: <402BB58A.7090208@us.ibm.com> References: <20040212155553.GA11426@suse.de> <402BB58A.7090208@us.ibm.com> Message-ID: <16427.61784.266371.410005@cargo.ozlabs.ibm.com> Brian King writes: > > Someone please remove this check. The ipr driver reports a device of > type 255, and wants an sg started for it. Done. I always like reducing the diffs between ameslab and the official trees. :) For interest, the check was added by Todd Inglett on 14 Jan 2003 with this changeset comment: Ignore host devices which are erroneously created by scsi_get_host_dev(). Mike Anderson is looking into it. Paul. ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From olh at suse.de Fri Feb 13 08:40:33 2004 From: olh at suse.de (Olaf Hering) Date: Thu, 12 Feb 2004 22:40:33 +0100 Subject: bogus check in sg_add() In-Reply-To: <16427.61784.266371.410005@cargo.ozlabs.ibm.com> References: <20040212155553.GA11426@suse.de> <402BB58A.7090208@us.ibm.com> <16427.61784.266371.410005@cargo.ozlabs.ibm.com> Message-ID: <20040212214033.GB30422@suse.de> On Fri, Feb 13, Paul Mackerras wrote: > Brian King writes: > > > > Someone please remove this check. The ipr driver reports a device of > > type 255, and wants an sg started for it. > > Done. I always like reducing the diffs between ameslab and the > official trees. :) There are also lots of mb() in e100 and e1000, stuff for jgarzik? -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Fri Feb 13 08:42:56 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 13 Feb 2004 08:42:56 +1100 Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon In-Reply-To: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> Message-ID: <16427.62304.699990.252289@cargo.ozlabs.ibm.com> Jake Moilanen writes: > Here's a patch to fix hardware breakpoints and add support for LPAR > systems in xmon. > > On an SMP system, the breakpoints appeared to have been working for > hitting the breakpoint. When you exited xmon to continue, you would > instead hit the same breakpoint again. There were a couple errors for > the check to see if xmon needed to single step over the instruction. > > The other problem I was seeing was when breakpoint was cleared on one > CPU, unless the other CPUS were stopped as well, they would not clear > their dabr until they hit xmon. Hmmm... We're all hacking on xmon these days, it seems.
Anton was making some changes to xmon just yesterday, and I am about to rework the xmon entry/exit and breakpoint insertion/removal to make it work properly on SMP systems. Don't push for now, and I'll talk to Anton today about merging your changes in with ours. Thanks, Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From sleddog at us.ibm.com Fri Feb 13 08:51:00 2004 From: sleddog at us.ibm.com (Dave Boutcher) Date: Thu, 12 Feb 2004 15:51:00 -0600 Subject: support for dma-mapping Message-ID: I wonder what you all (especially paulus and sfr) think of the following: Currently ppc64 uses the asm-generic version... ===== dma-mapping.h 1.1 vs edited ===== --- 1.1/include/asm-ppc64/dma-mapping.h Sat Dec 21 22:36:58 2002 +++ edited/dma-mapping.h Thu Feb 12 15:38:59 2004 @@ -1 +1,157 @@ -#include +/* Copyright (C) 2004 IBM + * + * Implements the generic device dma API + */ + +#ifndef _ASM_DMA_MAPPING_H +#define _ASM_DMA_MAPPING_H + +/* Include the busses we support */ +#include +#include +/* need struct page definitions */ +#include + +static inline int +dma_supported(struct device *dev, u64 mask) +{ + if (dev->bus == &pci_bus_type) return pci_dma_supported(to_pci_dev(dev), mask); + if (dev->bus == &vio_bus_type) return vio_dma_supported(to_vio_dev(dev), mask); + BUG(); +} + +static inline int +dma_set_mask(struct device *dev, u64 dma_mask) +{ + if (dev->bus == &pci_bus_type) return pci_set_dma_mask(to_pci_dev(dev), dma_mask); + if (dev->bus == &vio_bus_type) return vio_set_dma_mask(to_vio_dev(dev), dma_mask); + BUG(); +} + +static inline void * +dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, + int flag) +{ + if (dev->bus == &pci_bus_type) return pci_alloc_consistent(to_pci_dev(dev), size, dma_handle); + if (dev->bus == &vio_bus_type) return vio_alloc_consistent(to_vio_dev(dev), size, dma_handle); + BUG(); +} + +static inline void +dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, + dma_addr_t 
dma_handle)
+{
+	if (dev->bus == &pci_bus_type) pci_free_consistent(to_pci_dev(dev), size, cpu_addr, dma_handle);
+	else if (dev->bus == &vio_bus_type) vio_free_consistent(to_vio_dev(dev), size, cpu_addr, dma_handle);
+	else BUG();
+}
+
+static inline dma_addr_t
+dma_map_single(struct device *dev, void *cpu_addr, size_t size,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) return pci_map_single(to_pci_dev(dev), cpu_addr, size, (int)direction);
+	if (dev->bus == &vio_bus_type) return vio_map_single(to_vio_dev(dev), cpu_addr, size, (int)direction);
+	BUG();
+}
+
+static inline void
+dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_unmap_single(to_pci_dev(dev), dma_addr, size, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_unmap_single(to_vio_dev(dev), dma_addr, size, (int)direction);
+	else BUG();
+}
+
+static inline dma_addr_t
+dma_map_page(struct device *dev, struct page *page,
+	unsigned long offset, size_t size,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) return pci_map_page(to_pci_dev(dev), page, offset, size, (int)direction);
+	if (dev->bus == &vio_bus_type) return vio_map_page(to_vio_dev(dev), page, offset, size, (int)direction);
+	BUG();
+}
+
+
+static inline void
+dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_unmap_page(to_pci_dev(dev), dma_address, size, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_unmap_page(to_vio_dev(dev), dma_address, size, (int)direction);
+	else BUG();
+}
+
+static inline int
+dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) return pci_map_sg(to_pci_dev(dev), sg, nents, (int)direction);
+	if (dev->bus == &vio_bus_type) return vio_map_sg(to_vio_dev(dev), sg, nents, (int)direction);
+	BUG();
+}
+
+static inline void
+dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_unmap_sg(to_pci_dev(dev), sg, nhwentries, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_unmap_sg(to_vio_dev(dev), sg, nhwentries, (int)direction);
+	else BUG();
+}
+
+static inline void
+dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_dma_sync_single(to_pci_dev(dev), dma_handle, size, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_dma_sync_single(to_vio_dev(dev), dma_handle, size, (int)direction);
+	else BUG();
+}
+
+static inline void
+dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems,
+	enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_dma_sync_sg(to_pci_dev(dev), sg, nelems, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_dma_sync_sg(to_vio_dev(dev), sg, nelems, (int)direction);
+	else BUG();
+}
+
+/* Now for the API extensions over the pci_ one */
+
+#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
+#define dma_free_noncoherent(d, s, v, h) dma_free_coherent(d, s, v, h)
+#define dma_is_consistent(d) (1)
+
+static inline int
+dma_get_cache_alignment(void)
+{
+	/* no easy way to get cache size on all processors, so return
+	 * the maximum possible, to be safe */
+	return (1 << L1_CACHE_SHIFT_MAX);
+}
+
+static inline void
+dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
+	unsigned long offset, size_t size,
+	enum dma_data_direction direction)
+{
+	/* just sync everything, that's all the pci API can do */
+	dma_sync_single(dev, dma_handle, offset+size, direction);
+}
+
+static inline void
+dma_cache_sync(void *vaddr, size_t size,
+	enum dma_data_direction direction)
+{
+	/* could define this in terms of the dma_cache ...
operations,
+	 * but if you get this on a platform, you should convert the platform
+	 * to using the generic device DMA API */
+	BUG();
+}
+
+#endif
+

Dave B

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From olof at austin.ibm.com Fri Feb 13 09:18:25 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 12 Feb 2004 16:18:25 -0600
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <1076620802.12434.39.camel@gaston>
References: <40104461.5030804@austin.ibm.com> <402BE98A.9090304@austin.ibm.com> <1076620802.12434.39.camel@gaston>
Message-ID: <402BFBB1.5000302@austin.ibm.com>

Benjamin Herrenschmidt wrote:
>>* Add a smp_mb() in pte_free_sync(), to make sure that any 0-writes to
>>PTE/PMDs are seen on other processors before the free.
>
>
> Actually, what matters is before the is_locked(), we act both
> as a write barrier for the 0 and a read barrier for is_locked().
>
> That said... I wonder if the implementation of is_locked() shouldn't
> have a rmb() by default after all ...

I can't even find any other users of is_read_locked in the ppc64 code. I
guess it should be fixed for future reference though. :-)

As for the memory barrier: Since smp_mb() (sync) is "larger" than
smp_rmb() (lwsync), we should be fine to keep it outside the loop:

* Another CPU has taken the read lock, seeing the old PTE value: mb()
will make us see the read lock.

* Another CPU will shortly take the read lock: Either the mb() will make
them see the new PTE value, or we will see their read lock.

Does the above make sense?

>>* Make sure "newpp" is never more than 3 bits. This saves us from
>>crashing older iSeries hypervisor. Shouldn't be a problem since we don't
>>do aging in 2.4, but I prefer the more conservative approach.
>
> Not only crashing older HVs, but crashing the kernel with newer HVs ;)

Oops.
Either way, we shouldn't be exposed more now than before on 2.4 since
none of the pp code was really changed, and no flags besides _PAGE_BUSY
were redefined.

>>* Other logic fixes based on discussions on this list, mostly dealing
>>with timing windows during which _PAGE_BUSY was cleared inappropriately.
>
>
> Did you spot any case that could have affected the 2.6 version?

I didn't look much at it yet, but there's no isync after the loop at the
top of __hash_page (add one right before "Step 2"). I can supply a patch,
but it's pretty obvious where it should go...

-Olof

--
Olof Johansson                  Office: 4F005/905
pSeries Linux Development       IBM Systems Group
Email: olof at austin.ibm.com    Phone: 512-838-9858
All opinions are my own and not those of IBM

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From benh at kernel.crashing.org Fri Feb 13 09:25:12 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 13 Feb 2004 09:25:12 +1100
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <402BFBB1.5000302@austin.ibm.com>
References: <40104461.5030804@austin.ibm.com> <402BE98A.9090304@austin.ibm.com> <1076620802.12434.39.camel@gaston> <402BFBB1.5000302@austin.ibm.com>
Message-ID: <1076624712.13813.48.camel@gaston>

> I can't even find any other users of is_read_locked in the ppc64 code. I
> guess it should be fixed for future reference though. :-)
>
> As for the memory barrier: Since smp_mb() (sync) is "larger" than
> smp_rmb() (lwsync), we should be fine to keep it outside the loop:

Sure, the code is fine, I was correcting your comments :)

> I didn't look much at it yet, but there's no isync after the loop at the
> top of __hash_page (add one right before "Step 2"). I can supply a patch,
> but it's pretty obvious where it should go...

I did already, it's in Linus' tree.

Ben.

** Sent via the linuxppc64-dev mail list.
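
The two cases Olof lists in this thread (either the other CPU sees the
cleared PTE, or we see its read lock) are the usual guarantee of a full
barrier on both sides of a store-then-load pair. A minimal userspace
sketch of that pairing follows, using C11 atomics rather than the kernel
primitives. Everything in it (model_pte, model_readers, the function
names) is hypothetical and only models the ordering argument; it is not
the pte_free_sync() code under discussion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define OLD_PTE 0x1234UL

/* Hypothetical stand-ins: one PTE word and a reader count for the lock. */
static _Atomic unsigned long model_pte = OLD_PTE;
static atomic_int model_readers = 0;

/*
 * Freeing side: write 0 to the PTE, full barrier, then look at the lock.
 * Returns true if a reader was observed, i.e. the free has to wait.
 */
static bool clear_pte_then_check_lock(void)
{
    atomic_store_explicit(&model_pte, 0UL, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* plays the role of smp_mb() */
    return atomic_load_explicit(&model_readers, memory_order_relaxed) > 0;
}

/*
 * Walking side: take the read lock, full barrier, then read the PTE.
 * With a full fence on both sides, at least one side observes the
 * other's store: either this returns 0, or the freeing side sees
 * model_readers > 0.
 */
static unsigned long take_lock_then_read_pte(void)
{
    atomic_fetch_add_explicit(&model_readers, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&model_pte, memory_order_relaxed);
}

/* Put the model back into its initial state between experiments. */
static void reset_model(void)
{
    atomic_store(&model_pte, OLD_PTE);
    atomic_store(&model_readers, 0);
}
```

Run single-threaded in either order, the sketch shows the two outcomes
from the list: walker first means the freeing side sees the lock,
freeing side first means the walker reads a zero PTE. The interesting
case is the concurrent one, where the fences rule out both sides
missing each other.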
See http://lists.linuxppc.org/

From olof at austin.ibm.com Fri Feb 13 09:30:23 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 12 Feb 2004 16:30:23 -0600
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <1076624712.13813.48.camel@gaston>
References: <40104461.5030804@austin.ibm.com> <402BE98A.9090304@austin.ibm.com> <1076620802.12434.39.camel@gaston> <402BFBB1.5000302@austin.ibm.com> <1076624712.13813.48.camel@gaston>
Message-ID: <402BFE7F.4030907@austin.ibm.com>

Benjamin Herrenschmidt wrote:
>>I can't even find any other users of is_read_locked in the ppc64 code. I
>>guess it should be fixed for future reference though. :-)
>>
>>As for the memory barrier: Since smp_mb() (sync) is "larger" than
>>smp_rmb() (lwsync), we should be fine to keep it outside the loop:
>
>
> Sure, the code is fine, I was correcting your comments :)

Thanks. I'll fix them before any push.

>>I didn't look much at it yet, but there's no isync after the loop at the
>>top of __hash_page (add one right before "Step 2"). I can supply a patch,
>>but it's pretty obvious where it should go...
>
> I did already, it's in Linus' tree.

Ok, my ames tree that I checked with might have been slightly out of date.

-Olof

--
Olof Johansson                  Office: 4F005/905
pSeries Linux Development       IBM Systems Group
Email: olof at austin.ibm.com    Phone: 512-838-9858
All opinions are my own and not those of IBM

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Fri Feb 13 10:00:45 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 13 Feb 2004 10:00:45 +1100
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <16427.62304.699990.252289@cargo.ozlabs.ibm.com>
References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> <16427.62304.699990.252289@cargo.ozlabs.ibm.com>
Message-ID: <20040212230045.GJ25922@krispykreme>

> Hmmm... We're all hacking on xmon these days, it seems.
Anton was > making some changes to xmon just yesterday, and I am about to rework > the xmon entry/exit and breakpoint insertion/removal to make it work > properly on SMP systems. Don't push for now, and I'll talk to Anton > today about merging your changes in with ours. For the benefit of others on the list, heres what Ive got: - recover from bad SPR read/write (we get a program check) - remove some old code (bat and segment register stuff) - update the help text to match reality - add a "press ? for help" when xmon first appears to make rusty happy - protect against flushing bad parts of memory from Milton - dont print iseries specific stuff on pseries in SPR dump (S) - add code to dump the segment table or SLB - remove a number of functions that wouldnt work on LPAR Im trying to make sure xmon is solid so am interested in any way someone can lock up xmon once this patch goes in. Anton ===== arch/ppc64/kernel/traps.c 1.25 vs edited ===== --- 1.25/arch/ppc64/kernel/traps.c Tue Jan 20 13:07:09 2004 +++ edited/arch/ppc64/kernel/traps.c Thu Feb 12 15:54:09 2004 @@ -372,6 +372,13 @@ { siginfo_t info; +#ifdef CONFIG_DEBUG_KERNEL + if (debugger_fault_handler) { + debugger_fault_handler(regs); + return; + } +#endif + if (regs->msr & 0x100000) { /* IEEE FP exception */ ===== arch/ppc64/xmon/privinst.h 1.2 vs edited ===== --- 1.2/arch/ppc64/xmon/privinst.h Mon Jun 10 12:37:26 2002 +++ edited/arch/ppc64/xmon/privinst.h Thu Feb 12 16:18:53 2004 @@ -43,39 +43,12 @@ GSETSPR(275, sprg3) GSETSPR(282, ear) GSETSPR(287, pvr) -GSETSPR(528, bat0u) -GSETSPR(529, bat0l) -GSETSPR(530, bat1u) -GSETSPR(531, bat1l) -GSETSPR(532, bat2u) -GSETSPR(533, bat2l) -GSETSPR(534, bat3u) -GSETSPR(535, bat3l) GSETSPR(1008, hid0) GSETSPR(1009, hid1) GSETSPR(1010, iabr) GSETSPR(1013, dabr) GSETSPR(1023, pir) -static inline int get_sr(int n) -{ - int ret; - -#if 0 - // DRENG does not assemble - asm (" mfsrin %0,%1" : "=r" (ret) : "r" (n << 28)); -#endif - return ret; -} - -static inline void set_sr(int 
n, int val) -{ -#if 0 - // DRENG does not assemble - asm ("mtsrin %0,%1" : : "r" (val), "r" (n << 28)); -#endif -} - static inline void store_inst(void *p) { asm volatile ("dcbst 0,%0; sync; icbi 0,%0; isync" : : "r" (p)); @@ -90,4 +63,3 @@ { asm volatile ("dcbi 0,%0; icbi 0,%0" : : "r" (p)); } - ===== arch/ppc64/xmon/xmon.c 1.33 vs edited ===== --- 1.33/arch/ppc64/xmon/xmon.c Sat Feb 7 14:17:23 2004 +++ edited/arch/ppc64/xmon/xmon.c Thu Feb 12 16:48:23 2004 @@ -115,10 +115,7 @@ #endif /* CONFIG_SMP */ static void csum(void); static void bootcmds(void); -static void mem_translate(void); -static void mem_check(void); -static void mem_find_real(void); -static void mem_find_vsid(void); +void dump_segments(void); static void debug_trace(void); @@ -149,7 +146,15 @@ b show breakpoints\n\ bd set data breakpoint\n\ bi set instruction breakpoint\n\ - bc clear breakpoint\n\ + bc clear breakpoint\n" +#ifdef CONFIG_SMP + "\ + c print cpus stopped in xmon\n\ + ci send xmon interrupt to all other cpus\n\ + c# try to switch to cpu number h (in hex)\n" +#endif + "\ + C checksum\n\ d dump bytes\n\ di dump instructions\n\ df dump float values\n\ @@ -162,7 +167,6 @@ md compare two blocks of memory\n\ ml locate a block of memory\n\ mz zero a block of memory\n\ - mx translation information for an effective address\n\ mi show information about memory allocation\n\ p show the task list\n\ r print registers\n\ @@ -171,7 +175,14 @@ t print backtrace\n\ T Enable/Disable PPCDBG flags\n\ x exit monitor\n\ -"; + u dump segment table or SLB\n\ + ? help\n" +#ifndef CONFIG_PPC_ISERIES + "\ + zr reboot\n\ + zh halt\n" +#endif +; static int xmon_trace[NR_CPUS]; #define SSTEP 1 /* stepping because of 's' command */ @@ -308,6 +319,7 @@ #endif /* CONFIG_SMP */ remove_bpts(); disable_surveillance(); + printf("press ? 
for help "); cmd = cmds(excp); if (cmd == 's') { xmon_trace[smp_processor_id()] = SSTEP; @@ -332,17 +344,6 @@ set_msrd(msr); /* restore interrupt enable */ } -void -xmon_irq(int irq, void *d, struct pt_regs *regs) -{ - unsigned long flags; - local_save_flags(flags); - local_irq_disable(); - printf("Keyboard interrupt\n"); - xmon(regs); - local_irq_restore(flags); -} - int xmon_bpt(struct pt_regs *regs) { @@ -524,18 +525,6 @@ case 'z': memzcan(); break; - case 'x': - mem_translate(); - break; - case 'c': - mem_check(); - break; - case 'f': - mem_find_real(); - break; - case 'e': - mem_find_vsid(); - break; case 'i': show_mem(); break; @@ -587,11 +576,16 @@ cpu_cmd(); break; #endif /* CONFIG_SMP */ +#ifndef CONFIG_PPC_ISERIES case 'z': bootcmds(); +#endif case 'T': debug_trace(); break; + case 'u': + dump_segments(); + break; default: printf("Unrecognized command: "); do { @@ -1056,14 +1050,23 @@ termch = 0; nflush = 1; scanhex(&nflush); - nflush = (nflush + 31) / 32; - if (cmd != 'i') { - for (; nflush > 0; --nflush, adrs += 0x20) - cflush((void *) adrs); - } else { - for (; nflush > 0; --nflush, adrs += 0x20) - cinval((void *) adrs); + nflush = (nflush + L1_CACHE_BYTES - 1) / L1_CACHE_BYTES; + if( setjmp(bus_error_jmp) == 0 ) { + debugger_fault_handler = handle_fault; + sync(); + + if (cmd != 'i') { + for (; nflush > 0; --nflush, adrs += L1_CACHE_BYTES) + cflush((void *) adrs); + } else { + for (; nflush > 0; --nflush, adrs += L1_CACHE_BYTES) + cinval((void *) adrs); + } + sync(); + /* wait a little while to see if we get a machine check */ + __delay(200); } + debugger_fault_handler = 0; } unsigned long @@ -1072,6 +1075,7 @@ unsigned int instrs[2]; unsigned long (*code)(void); unsigned long opd[3]; + unsigned long ret = -1UL; instrs[0] = 0x7c6002a6 + ((n & 0x1F) << 16) + ((n & 0x3e0) << 6); instrs[1] = 0x4e800020; @@ -1082,7 +1086,22 @@ store_inst(instrs+1); code = (unsigned long (*)(void)) opd; - return code(); + if (setjmp(bus_error_jmp) == 0) { + 
debugger_fault_handler = handle_fault; + sync(); + + ret = code(); + + sync(); + /* wait a little while to see if we get a machine check */ + __delay(200); + } else { + printf("*** Error reading spr %x\n", n); + } + + debugger_fault_handler = 0; + + return ret; } void @@ -1101,7 +1120,20 @@ store_inst(instrs+1); code = (unsigned long (*)(unsigned long)) opd; - code(val); + if (setjmp(bus_error_jmp) == 0) { + debugger_fault_handler = handle_fault; + sync(); + + code(val); + + sync(); + /* wait a little while to see if we get a machine check */ + __delay(200); + } else { + printf("*** Error writing spr %x\n", n); + } + + debugger_fault_handler = 0; } static unsigned long regno; @@ -1111,11 +1143,14 @@ void super_regs() { - int i, cmd; + int cmd; unsigned long val; - struct paca_struct* ptrPaca = NULL; - struct ItLpPaca* ptrLpPaca = NULL; - struct ItLpRegSave* ptrLpRegSave = NULL; +#ifdef CONFIG_PPC_ISERIES + int i; + struct paca_struct *ptrPaca = NULL; + struct ItLpPaca *ptrLpPaca = NULL; + struct ItLpRegSave *ptrLpRegSave = NULL; +#endif cmd = skipbl(); if (cmd == '\n') { @@ -1129,10 +1164,7 @@ printf("sp = %.16lx sprg3= %.16lx\n", sp, get_sprg3()); printf("toc = %.16lx dar = %.16lx\n", toc, get_dar()); printf("srr0 = %.16lx srr1 = %.16lx\n", get_srr0(), get_srr1()); - printf("asr = %.16lx\n", mfasr()); - for (i = 0; i < 8; ++i) - printf("sr%.2ld = %.16lx sr%.2ld = %.16lx\n", i, get_sr(i), i+8, get_sr(i+8)); - +#ifdef CONFIG_PPC_ISERIES // Dump out relevant Paca data areas. 
printf("Paca: \n"); ptrPaca = get_paca(); @@ -1148,7 +1180,8 @@ printf(" Saved Sprg0=%.16lx Saved Sprg1=%.16lx \n", ptrLpRegSave->xSPRG0, ptrLpRegSave->xSPRG0); printf(" Saved Sprg2=%.16lx Saved Sprg3=%.16lx \n", ptrLpRegSave->xSPRG2, ptrLpRegSave->xSPRG3); printf(" Saved Msr =%.16lx Saved Nia =%.16lx \n", ptrLpRegSave->xMSR, ptrLpRegSave->xNIA); - +#endif + return; } @@ -1162,11 +1195,6 @@ case 'r': printf("spr %lx = %lx\n", regno, read_spr(regno)); break; - case 's': - val = get_sr(regno); - scanhex(&val); - set_sr(regno, val); - break; case 'm': val = get_msr(); scanhex(&val); @@ -1923,240 +1951,8 @@ } } -void -mem_translate() -{ - int c; - unsigned long ea, va, vsid, vpn, page, hpteg_slot_primary, hpteg_slot_secondary, primary_hash, i, *steg, esid, stabl; - HPTE * hpte; - struct mm_struct * mm; - pte_t *ptep = NULL; - void * pgdir; - - c = inchar(); - if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n') - termch = c; - scanhex((void *)&ea); - - if ((ea >= KRANGE_START) && (ea <= (KRANGE_START + (1UL<<60)))) { - ptep = 0; - vsid = get_kernel_vsid(ea); - va = ( vsid << 28 ) | ( ea & 0x0fffffff ); - } else { - // if in vmalloc range, use the vmalloc page directory - if ( ( ea >= VMALLOC_START ) && ( ea <= VMALLOC_END ) ) { - mm = &init_mm; - vsid = get_kernel_vsid( ea ); - } - // if in ioremap range, use the ioremap page directory - else if ( ( ea >= IMALLOC_START ) && ( ea <= IMALLOC_END ) ) { - mm = &ioremap_mm; - vsid = get_kernel_vsid( ea ); - } - // if in user range, use the current task's page directory - else if ( ( ea >= USER_START ) && ( ea <= USER_END ) ) { - mm = current->mm; - vsid = get_vsid(mm->context, ea ); - } - pgdir = mm->pgd; - va = ( vsid << 28 ) | ( ea & 0x0fffffff ); - ptep = find_linux_pte( pgdir, ea ); - } - - vpn = ((vsid << 28) | (((ea) & 0xFFFF000))) >> 12; - page = vpn & 0xffff; - esid = (ea >> 28) & 0xFFFFFFFFF; - - // Search the primary group for an available slot - primary_hash = ( vsid & 0x7fffffffff ) ^ page; - 
hpteg_slot_primary = ( primary_hash & htab_data.htab_hash_mask ) * HPTES_PER_GROUP; - hpteg_slot_secondary = ( ~primary_hash & htab_data.htab_hash_mask ) * HPTES_PER_GROUP; - - printf("ea : %.16lx\n", ea); - printf("esid : %.16lx\n", esid); - printf("vsid : %.16lx\n", vsid); - - printf("\nSoftware Page Table\n-------------------\n"); - printf("ptep : %.16lx\n", ((unsigned long *)ptep)); - if(ptep) { - printf("*ptep : %.16lx\n", *((unsigned long *)ptep)); - } - - hpte = htab_data.htab + hpteg_slot_primary; - printf("\nHardware Page Table\n-------------------\n"); - printf("htab base : %.16lx\n", htab_data.htab); - printf("slot primary : %.16lx\n", hpteg_slot_primary); - printf("slot secondary : %.16lx\n", hpteg_slot_secondary); - printf("\nPrimary Group\n"); - for (i=0; i<8; ++i) { - if ( hpte->dw0.dw0.v != 0 ) { - printf("%d: (hpte)%.16lx %.16lx\n", i, hpte->dw0.dword0, hpte->dw1.dword1); - printf(" vsid: %.13lx api: %.2lx hash: %.1lx\n", - (hpte->dw0.dw0.avpn)>>5, - (hpte->dw0.dw0.avpn) & 0x1f, - (hpte->dw0.dw0.h)); - printf(" rpn: %.13lx \n", (hpte->dw1.dw1.rpn)); - printf(" pp: %.1lx \n", - ((hpte->dw1.dw1.pp0)<<2)|(hpte->dw1.dw1.pp)); - printf(" wimgn: %.2lx reference: %.1lx change: %.1lx\n", - ((hpte->dw1.dw1.w)<<4)| - ((hpte->dw1.dw1.i)<<3)| - ((hpte->dw1.dw1.m)<<2)| - ((hpte->dw1.dw1.g)<<1)| - ((hpte->dw1.dw1.n)<<0), - hpte->dw1.dw1.r, hpte->dw1.dw1.c); - } - hpte++; - } - - printf("\nSecondary Group\n"); - // Search the secondary group - hpte = htab_data.htab + hpteg_slot_secondary; - for (i=0; i<8; ++i) { - if(hpte->dw0.dw0.v) { - printf("%d: (hpte)%.16lx %.16lx\n", i, hpte->dw0.dword0, hpte->dw1.dword1); - printf(" vsid: %.13lx api: %.2lx hash: %.1lx\n", - (hpte->dw0.dw0.avpn)>>5, - (hpte->dw0.dw0.avpn) & 0x1f, - (hpte->dw0.dw0.h)); - printf(" rpn: %.13lx \n", (hpte->dw1.dw1.rpn)); - printf(" pp: %.1lx \n", - ((hpte->dw1.dw1.pp0)<<2)|(hpte->dw1.dw1.pp)); - printf(" wimgn: %.2lx reference: %.1lx change: %.1lx\n", - ((hpte->dw1.dw1.w)<<4)| - 
((hpte->dw1.dw1.i)<<3)| - ((hpte->dw1.dw1.m)<<2)| - ((hpte->dw1.dw1.g)<<1)| - ((hpte->dw1.dw1.n)<<0), - hpte->dw1.dw1.r, hpte->dw1.dw1.c); - } - hpte++; - } - - printf("\nHardware Segment Table\n-----------------------\n"); - stabl = (unsigned long)(KERNELBASE+(_ASR&0xFFFFFFFFFFFFFFFE)); - steg = (unsigned long *)((stabl) | ((esid & 0x1f) << 7)); - - printf("stab base : %.16lx\n", stabl); - printf("slot : %.16lx\n", steg); - - for (i=0; i<8; ++i) { - printf("%d: (ste) %.16lx %.16lx\n", i, - *((unsigned long *)(steg+i*2)),*((unsigned long *)(steg+i*2+1)) ); - } -} - -void mem_check() -{ - unsigned long htab_size_bytes; - unsigned long htab_end; - unsigned long last_rpn; - HPTE *hpte1, *hpte2; - - htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG - htab_end = (unsigned long)htab_data.htab + htab_size_bytes; - // last_rpn = (naca->physicalMemorySize-1) >> PAGE_SHIFT; - last_rpn = 0xfffff; - - printf("\nHardware Page Table Check\n-------------------\n"); - printf("htab base : %.16lx\n", htab_data.htab); - printf("htab size : %.16lx\n", htab_size_bytes); - -#if 1 - for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) { - if ( hpte1->dw0.dw0.v != 0 ) { - if ( hpte1->dw1.dw1.rpn <= last_rpn ) { - for(hpte2 = hpte1+1; hpte2 < (HPTE *)htab_end; hpte2++) { - if ( hpte2->dw0.dw0.v != 0 ) { - if(hpte1->dw1.dw1.rpn == hpte2->dw1.dw1.rpn) { - printf(" Duplicate rpn: %.13lx \n", (hpte1->dw1.dw1.rpn)); - printf(" hpte1: %16.16lx *hpte1: %16.16lx %16.16lx\n", - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); - printf(" hpte2: %16.16lx *hpte2: %16.16lx %16.16lx\n", - hpte2, hpte2->dw0.dword0, hpte2->dw1.dword1); - } - } - } - } else { - printf(" Bogus rpn: %.13lx \n", (hpte1->dw1.dw1.rpn)); - printf(" hpte: %16.16lx *hpte: %16.16lx %16.16lx\n", - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); - } - } - } -#endif - printf("\nDone -------------------\n"); -} - -void mem_find_real() +static void debug_trace(void) { - unsigned long htab_size_bytes; - unsigned long 
htab_end; - unsigned long last_rpn; - HPTE *hpte1; - unsigned long pa, rpn; - int c; - - c = inchar(); - if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n') - termch = c; - scanhex((void *)&pa); - rpn = pa >> 12; - - htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG - htab_end = (unsigned long)htab_data.htab + htab_size_bytes; - // last_rpn = (naca->physicalMemorySize-1) >> PAGE_SHIFT; - last_rpn = 0xfffff; - - printf("\nMem Find RPN\n-------------------\n"); - printf("htab base : %.16lx\n", htab_data.htab); - printf("htab size : %.16lx\n", htab_size_bytes); - - for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) { - if ( hpte1->dw0.dw0.v != 0 ) { - if ( hpte1->dw1.dw1.rpn == rpn ) { - printf(" Found rpn: %.13lx \n", (hpte1->dw1.dw1.rpn)); - printf(" hpte: %16.16lx *hpte1: %16.16lx %16.16lx\n", - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); - } - } - } - printf("\nDone -------------------\n"); -} - -void mem_find_vsid() -{ - unsigned long htab_size_bytes; - unsigned long htab_end; - HPTE *hpte1; - unsigned long vsid; - int c; - - c = inchar(); - if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n') - termch = c; - scanhex((void *)&vsid); - - htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG - htab_end = (unsigned long)htab_data.htab + htab_size_bytes; - - printf("\nMem Find VSID\n-------------------\n"); - printf("htab base : %.16lx\n", htab_data.htab); - printf("htab size : %.16lx\n", htab_size_bytes); - - for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) { - if ( hpte1->dw0.dw0.v != 0 ) { - if ( ((hpte1->dw0.dw0.avpn)>>5) == vsid ) { - printf(" Found vsid: %.16lx \n", ((hpte1->dw0.dw0.avpn) >> 5)); - printf(" hpte: %16.16lx *hpte1: %16.16lx %16.16lx\n", - hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1); - } - } - } - printf("\nDone -------------------\n"); -} - -static void debug_trace(void) { unsigned long val, cmd, on; cmd = skipbl(); @@ -2202,4 +1998,48 @@ } cmd = skipbl(); } +} + +static void 
dump_slb(void)
+{
+	int i;
+	unsigned long tmp;
+
+	printf("SLB contents of cpu %d\n", smp_processor_id());
+
+	for (i = 0; i < naca->slb_size; i++) {
+		asm volatile("slbmfee %0,%1" : "=r" (tmp) : "r" (i));
+		printf("%02d %016lx ", i, tmp);
+
+		asm volatile("slbmfev %0,%1" : "=r" (tmp) : "r" (i));
+		printf("%016lx\n", tmp);
+	}
+}
+
+static void dump_stab(void)
+{
+	int i;
+	unsigned long *tmp = (unsigned long *)get_paca()->xStab_data.virt;
+
+	printf("Segment table contents of cpu %d\n", smp_processor_id());
+
+	for (i = 0; i < PAGE_SIZE/16; i++) {
+		unsigned long a, b;
+
+		a = *tmp++;
+		b = *tmp++;
+
+		if (a || b) {
+			printf("%03d %016lx ", i, a);
+			printf("%016lx\n", b);
+		}
+	}
+}
+
+void dump_segments(void)
+{
+	if (cur_cpu_spec->cpu_features & CPU_FTR_SLB)
+		dump_slb();
+	else
+		dump_stab();
 }

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Fri Feb 13 17:24:50 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 13 Feb 2004 17:24:50 +1100
Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <402ABACF.3090809@us.ibm.com>
References: <20040203183459.B27780@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com> <20040205142643.T27780@forte.austin.ibm.com> <402ABACF.3090809@us.ibm.com>
Message-ID: <20040213062450.GL25922@krispykreme>

> I just want to express the importance of this patch. The 2.6 ipr driver
> requires it, since it regularly hits the eeh_check_failure bug. Please
> apply.

Yep, it's in the queue.

Linas, I'd like to remove the verbose messages at boot, it's very
verbose and it will result in the Hollis police coming after you.

Anton

PCI: skip building address cache for=0000:00:0c.0 Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE]
PCI: skip building address cache for=0000:00:0d.0 Alteon Networks Inc. AceNIC Gigabit Ethernet
PCI: skip building address cache for=0000:00:0e.0 Alteon Networks Inc.
AceNIC Gigabit Ethernet (#2)
PCI: skip building address cache for=0000:00:0f.0 Matrox Graphics, Inc. MGA G200
PCI: skip building address cache for=0000:00:10.0 Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (#2)
PCI: skip building address cache for=0000:00:11.0 LSI Logic / Symbios Logic 53c896
PCI: skip building address cache for=0000:00:11.1 LSI Logic / Symbios Logic 53c896 (#2)
PCI: skip building address cache for=0001:40:0b.0 Intel Corp. 82546EB Gigabit Ethernet Controller (Copper)
PCI: skip building address cache for=0001:40:0b.1 Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) (#2)
PCI: skip building address cache for=0001:40:0c.0 Alteon Networks Inc. AceNIC Gigabit Ethernet (#3)
SCSI subsystem initialized
matroxfb: Matrox MGA-G200 (PCI) detected
matroxfb: BIOS on your Matrox device does

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From moilanen at austin.ibm.com Sat Feb 14 01:00:05 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Fri, 13 Feb 2004 08:00:05 -0600
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <20040212230045.GJ25922@krispykreme>
References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> <16427.62304.699990.252289@cargo.ozlabs.ibm.com> <20040212230045.GJ25922@krispykreme>
Message-ID: <1076680805.10309.145.camel@DYN279927END.austin.ibm.com>

> Im trying to make sure xmon is solid so am interested in any way someone
> can lock up xmon once this patch goes in.

I have seen a number of instances during bringup where one of the CPUs
will hang in an rtas call (due to a FW DSI), and then xmon would come in,
call disable_surveillance() and hang on the rtas lock.

I would suggest removing the disable_surveillance() call since there are
no boxes that I'm aware of that enable surveillance by default. Maybe
make it a separate command to disable surveillance, similar to KDB.

Thanks,
Jake

** Sent via the linuxppc64-dev mail list.
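
For readers following the xmon thread: the recovery pattern in Anton's
patch (arm a fault handler, setjmp() before the risky access, let the
handler longjmp() back if it faults) can be sketched in plain userspace
C. This is only a model under stated assumptions. fault_jmp and
fault_handler stand in for bus_error_jmp and debugger_fault_handler from
the patch, while risky_read(), guarded_read(), last_read_faulted and the
rule that ids above 0x3ff "fault" are invented for illustration.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf fault_jmp;                    /* models bus_error_jmp */
static void (*fault_handler)(void) = NULL;   /* models debugger_fault_handler */
static int last_read_faulted = 0;            /* hypothetical status flag */

/* Handler: abandon the faulting access and resume at the setjmp(). */
static void handle_fault(void)
{
    longjmp(fault_jmp, 1);
}

/*
 * Hypothetical risky access. In the kernel the trap path would call
 * the handler; here ids above 0x3ff simply pretend to fault.
 */
static unsigned long risky_read(unsigned int n)
{
    if (n > 0x3ff && fault_handler)
        fault_handler();
    return 0x1000UL + n;
}

/* Guarded read: returns -1UL and sets last_read_faulted on a fault. */
static unsigned long guarded_read(unsigned int n)
{
    volatile unsigned long ret = -1UL;

    last_read_faulted = 0;
    if (setjmp(fault_jmp) == 0) {
        fault_handler = handle_fault;   /* arm before touching anything */
        ret = risky_read(n);
    } else {
        last_read_faulted = 1;          /* we got here via longjmp() */
    }
    fault_handler = NULL;               /* always disarm on the way out */
    return ret;
}
```

The point of disarming on both paths is the same as in the patch: a
stale handler left installed would hijack an unrelated fault later.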
See http://lists.linuxppc.org/

From hollisb at us.ibm.com Sat Feb 14 02:05:22 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 13 Feb 2004 09:05:22 -0600
Subject: hotplug devices require interrupts?
Message-ID: <0BC11108-5E36-11D8-B629-000A95A0560C@us.ibm.com>

A patch from Linda yesterday pointed out that of_finish_dynamic_node()
(prom.c) requires an "interrupts" property for hotplugged device nodes.
Why is that? Until very recently the vty nodes didn't have that... the
fact that they do now only coincidentally means we didn't shoot
ourselves in the foot (unless there are still other interrupt-less
devices).

Certainly lacking interrupts is not an ENODEV error in the
non-dynamic finish_node/finish_node_interrupts case; why is it for hotplug?

--
Hollis Blanchard
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From lxiep at us.ibm.com Sat Feb 14 03:04:34 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Fri, 13 Feb 2004 10:04:34 -0600
Subject: hotplug devices require interrupts?
References: <0BC11108-5E36-11D8-B629-000A95A0560C@us.ibm.com>
Message-ID: <402CF592.9050209@us.ltcfwd.linux.ibm.com>

Hollis Blanchard wrote:
> A patch from Linda yesterday pointed out that of_finish_dynamic_node()
> (prom.c) requires an "interrupts" property for hotplugged device nodes.
> Why is that? Until very recently the vty nodes didn't have that... the
> fact that they do now only coincidentally means we didn't shoot
> ourselves in the foot (unless there are still other interrupt-less
> devices).
>
> Certainly lacking interrupts is not an ENODEV error in the
> non-dynamic finish_node/finish_node_interrupts case; why is it for hotplug?
>

Nathan & Hollis,

Sorry for the confusion. The "interrupts" property is for most PCI
I/O devices. I will submit another patch that will fix the problem.

Thanks,
Linda

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From lxiep at us.ibm.com Sat Feb 14 03:43:49 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Fri, 13 Feb 2004 10:43:49 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
Message-ID: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>

The attached patch fixes OF device tree update code so that it supports
Hotplug and DLPAR PCI multifunction cards.

Comments are welcome.

Thanks,

Linda
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: OFDT_update.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040213/42da4b42/attachment.txt

From hollisb at us.ibm.com Sat Feb 14 04:15:49 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 13 Feb 2004 11:15:49 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
Message-ID: <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com>

On Feb 13, 2004, at 10:43 AM, Linda Xie wrote:
> The attached patch fixes OF device tree update code so that it supports
> Hotplug and DLPAR PCI multifunction cards.

I'm still unclear on the requirement to have an "interrupts" property.
In this patch, all non-PCI devices without interrupts will be treated as
errors. Why? If the new node doesn't have the property, just skip the
interrupt code (bringing you to what you've called "fixup_pci"). It
doesn't even have to be a goto: break the interrupt code into its own
function and then:

	ints = (unsigned int *) get_property(node, "interrupts", &intlen);
	if (ints)
		of_finish_dynamic_interrupts(...)

	node->phb = parent->phb;
	regs = (u32 *)get_property(node, "reg", 0);

> @@ -2929,8 +2930,12 @@
>
>  	ints = (unsigned int *) get_property(node, "interrupts", &intlen);
>  	if (!ints) {
> -		err = -ENODEV;
> -		goto out;
> +		char *name = get_property(node, "name", NULL);
> +
> +		/* hot-pluggable cards don't have "interrupts" */
> +		if (name && !strcmp(name, "pci") && !get_property(node, "built-in",
> NULL))
> +			goto fixup_pci;
> +		goto out;
>  	}
>
>  	intrcells = prom_n_intr_cells(node);

--
Hollis Blanchard
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From lxiep at us.ibm.com Sat Feb 14 05:00:28 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Fri, 13 Feb 2004 12:00:28 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com>
Message-ID: <402D10BC.3000404@us.ltcfwd.linux.ibm.com>

Hollis Blanchard wrote:

> On Feb 13, 2004, at 10:43 AM, Linda Xie wrote:
> I'm still unclear on the requirement to have an "interrupts" property.
> In this patch, all non-PCI devices without interrupts will be treated as
> errors.

They are not treated as errors, the patch just skips some lines that
are not needed for the devices.

If the new node doesn't have the property, just skip the
> interrupt code (bringing you to what you've called "fixup_pci"). It
> doesn't even have to be a goto: break the interrupt code into its own
> function and then:
>
> 	ints = (unsigned int *) get_property(node, "interrupts", &intlen);
> 	if (ints)
> 		of_finish_dynamic_interrupts(...)
>
> 	node->phb = parent->phb;
> 	regs = (u32 *)get_property(node, "reg", 0);
>

Because I didn't want to change the "if(!ints)" logic that was there
before. BTW, I think that yours looks better.

Nathan,

Any thoughts?

** Sent via the linuxppc64-dev mail list.
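
The control flow Hollis suggests in this thread (parse interrupts in a helper only when the property exists, then always run the common fixups) can be sketched outside the kernel roughly as follows. This is a minimal illustration, not the prom.c code: struct property_sketch, struct node_sketch, find_prop() and finish_dynamic_interrupts() are hypothetical stand-ins, and the real get_property() and device_node differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a device tree property. */
struct property_sketch {
	const char *name;
	const void *value;
	int length;		/* length of value in bytes */
};

/* Hypothetical stand-in for the kernel's struct device_node. */
struct node_sketch {
	const struct property_sketch *props;
	int nprops;
	int n_intr_cells;	/* filled in by the interrupt helper */
	int fixups_done;	/* set once the common fixups have run */
};

/* Simplified property lookup; NOT the real get_property() signature. */
static const void *find_prop(const struct node_sketch *np,
			     const char *name, int *lenp)
{
	for (int i = 0; i < np->nprops; i++) {
		if (strcmp(np->props[i].name, name) == 0) {
			if (lenp)
				*lenp = np->props[i].length;
			return np->props[i].value;
		}
	}
	return NULL;
}

/* Interrupt handling split into its own helper (hypothetical name).
 * Here it only counts the cells in the property. */
static int finish_dynamic_interrupts(struct node_sketch *np,
				     const unsigned int *ints, int intlen)
{
	(void)ints;
	np->n_intr_cells = intlen / (int)sizeof(unsigned int);
	return 0;
}

static int finish_dynamic_node(struct node_sketch *np)
{
	int intlen = 0;
	const unsigned int *ints = find_prop(np, "interrupts", &intlen);

	/* A missing "interrupts" property is not an error: skip the helper. */
	if (ints) {
		int err = finish_dynamic_interrupts(np, ints, intlen);
		if (err)
			return err;
	}

	/* Common fixups run for every node, with or without interrupts. */
	np->fixups_done = 1;
	return 0;
}
```

With this shape no goto or special-case on the node name is needed; a node without "interrupts" simply falls through to the common fixups.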
See http://lists.linuxppc.org/ From willschm at northrelay02.pok.ibm.com Sat Feb 14 05:23:03 2004 From: willschm at northrelay02.pok.ibm.com (will schmidt) Date: Fri, 13 Feb 2004 10:23:03 -0800 Subject: LPARCFG In-Reply-To: <200402060425.i164PRJ60614@news.rchland.ibm.com> References: <200402060425.i164PRJ60614@news.rchland.ibm.com> Message-ID: <200402131620.i1DGKTOX103590@northrelay02.pok.ibm.com> This patch includes: - more of the function calls needed/requested for the licence manager folks. - change config option for lparcfg back to tristate - exported those pesky symbols needed via lparcfg as a module. (with kallsyms exporting all symbols, this issue might be masked depending upon your kernel config) - change H_SET_PURR to H_PURR i realise i've also got some whitespace in the patch, i'll clean those out before i push anything up to ames. Will Schmidt wrote: >>>Building as a module was broken when the code was checked in to ameslab >>>2.6; I suggested turning it off. I think lparcfg uses unexported >>>symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out. Should any > > of > >>>those be exported? >>> >>>See this thread for the history: >>>http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html >> >>Ahh yeah, I have such a short memory. >> >>Actually I enabled it as a module and built all modules and didnt get >>any warnings. Either we have everything exported now, Im not getting >>undefined symbol warnings any more or else Im going blind. > > > I've got some more updates for this code, will try to get a patch onto this > list tomorrow. (still need to forward port to current and check on > iSeries).. > I couldnt build it as a module without exporting a few symbols, but my > tree is about a week old, so i'm probably missing those fixes. > > > -Will > > willschm at us.ibm.com > Linux on PowerPC-64 Development > IBM Rochester > > -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: lparcfg.0212.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040213/0282971c/attachment.txt From olh at suse.de Sat Feb 14 05:42:08 2004 From: olh at suse.de (Olaf Hering) Date: Fri, 13 Feb 2004 19:42:08 +0100 Subject: [PATCH] PCI_DMA_NONE undeclared in asm/vio.h Message-ID: <20040213184208.GA24365@suse.de> asm/vio.h requires PCI_DMA_NONE, so include it either in the drivers or directly in asm/vio.h --- ./drivers/scsi/ibmvscsi/rpa_vscsi.c~ 2004-02-13 18:56:02.000000000 +0100 +++ ./drivers/scsi/ibmvscsi/rpa_vscsi.c 2004-02-13 19:41:13.000000000 +0100 @@ -28,6 +28,7 @@ */ #include +#include #include #include #include --- ./drivers/scsi/ibmvscsi/ibmvscsi.c~ 2004-02-13 18:56:02.000000000 +0100 +++ ./drivers/scsi/ibmvscsi/ibmvscsi.c 2004-02-13 19:38:14.000000000 +0100 @@ -63,6 +63,7 @@ */ #include +#include #include #include "ibmvscsi.h" -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Sat Feb 14 05:48:52 2004 From: olh at suse.de (Olaf Hering) Date: Fri, 13 Feb 2004 19:48:52 +0100 Subject: vio_unmap_sg and vio_bus_type not exported Message-ID: <20040213184852.GA9470@suse.de> I get two unresolved symbols after the recent changes, current ameslab 2.6 kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From hollisb at us.ibm.com Sat Feb 14 06:16:41 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Fri, 13 Feb 2004 13:16:41 -0600 Subject: vio_unmap_sg and vio_bus_type not exported In-Reply-To: <20040213184852.GA9470@suse.de> References: <20040213184852.GA9470@suse.de> Message-ID: <2791A786-5E59-11D8-B629-000A95A0560C@us.ibm.com> On Feb 13, 2004, at 12:48 PM, Olaf Hering wrote: > > I get two unresolved symbols after the recent changes, > current ameslab 2.6 > > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type This must be because of the new dma-mapping.h. As long as vio_bus_type is referenced in that header, it will have to be exported. And the hacks multiply... -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From boutcher at us.ibm.com Sat Feb 14 07:07:31 2004 From: boutcher at us.ibm.com (David Boutcher) Date: Fri, 13 Feb 2004 14:07:31 -0600 Subject: vio_unmap_sg and vio_bus_type not exported In-Reply-To: <2791A786-5E59-11D8-B629-000A95A0560C@us.ibm.com> Message-ID: owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/13/2004 01:16:41 PM: > On Feb 13, 2004, at 12:48 PM, Olaf Hering wrote: > > > > I get two unresolved symbols after the recent changes, current > > ameslab 2.6 > > > > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg > > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type > > This must be because of the new dma-mapping.h. As long as vio_bus_type > is referenced in that header, it will have to be exported. And the > hacks multiply... There was a one hour window last night where I had a bug....does this still happen with a current bk pull? Dave Boutcher IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olh at suse.de Sat Feb 14 07:46:12 2004 From: olh at suse.de (Olaf Hering) Date: Fri, 13 Feb 2004 21:46:12 +0100 Subject: likely() disappeared Message-ID: <20040213204612.GA28320@suse.de> likely() from linux/compiler.h disappeared into a #ifdef __KERNEL__. arch/ppc64/boot/prom.c calls do_div() in number(). what is the correct fix, other than not using the 'make all' target? -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Sat Feb 14 08:11:31 2004 From: olh at suse.de (Olaf Hering) Date: Fri, 13 Feb 2004 22:11:31 +0100 Subject: vio_unmap_sg and vio_bus_type not exported In-Reply-To: References: <2791A786-5E59-11D8-B629-000A95A0560C@us.ibm.com> Message-ID: <20040213211131.GA19947@suse.de> On Fri, Feb 13, David Boutcher wrote: > > > > > > owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/13/2004 01:16:41 PM: > > On Feb 13, 2004, at 12:48 PM, Olaf Hering wrote: > > > > > > I get two unresolved symbols after the recent changes, > > > current ameslab 2.6 > > > > > > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg > > > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type > > > > This must be because of the new dma-mapping.h. As long as vio_bus_type > > is referenced in that header, it will have to be exported. And the > > hacks multiply... > > There was a one hour window last night where I had a bug....does this still > happen with a current bk pull? yes, current bk. -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Sat Feb 14 08:18:51 2004 From: olh at suse.de (Olaf Hering) Date: Fri, 13 Feb 2004 22:18:51 +0100 Subject: g5 requires ADB_PMU Message-ID: <20040213211851.GA20600@suse.de> the kernel doesnt link without PMU. 
--- ../linux-2.6.2/arch/ppc64/Kconfig 2004-02-13 17:56:05.000000000 +0000 +++ ./arch/ppc64/Kconfig 2004-02-13 21:14:45.000000000 +0000 @@ -87,6 +87,7 @@ config PPC_PMAC depends on PPC_PSERIES bool "Apple PowerMac G5 support" + select ADB_PMU config PPC_PMAC64 bool -- USB is for mice, FireWire is for men! sUse lINUX ag, n?RNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sat Feb 14 10:04:33 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 14 Feb 2004 10:04:33 +1100 Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards In-Reply-To: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> Message-ID: <1076713472.8000.83.camel@gaston> On Sat, 2004-02-14 at 03:43, Linda Xie wrote: > The attached patch fixes OF device tree update code so that it supports > Hotplug and DLPAR PCI multifunction cards. > > Comments are welcome. Can't this code be moved out of prom.c ? It's time to think about cleaning up that pile of mess that is prom.c, ultimately, things that have to run in the context of the firmware environement early during boot, should be completely split from things that are used later on during normal kernel usage. And low level device-tree accessors split from higher level things like this hotplug stuff. We won't do that right away, but starting to slowly move things away as they are modified is a good way to make things simpler for anton and I when it's time to do the big cleanup. Ben. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Sat Feb 14 11:09:53 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Fri, 13 Feb 2004 18:09:53 -0600 Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards In-Reply-To: <1076713472.8000.83.camel@gaston> References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <1076713472.8000.83.camel@gaston> Message-ID: <402D6751.6080502@austin.ibm.com> Benjamin Herrenschmidt wrote: > Can't this code be moved out of prom.c ? Yes, the of_find accessors etc. could be moved to their own file, probably without too much pain. > It's time to think about cleaning up that pile of mess that is prom.c, > ultimately, things that have to run in the context of the firmware > environement early during boot, should be completely split from things > that are used later on during normal kernel usage. > > And low level device-tree accessors split from higher level things like > this hotplug stuff. The hotplug stuff (of_add_node, of_remove_node) necessarily uses the same lock as the accessors, but we could take more care to avoid building it on platforms which don't need it. I believe those functions should be compiled in only when building for pSeries with CONFIG_HOTPLUG=y. Nathan ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Sat Feb 14 19:59:37 2004 From: olh at suse.de (Olaf Hering) Date: Sat, 14 Feb 2004 09:59:37 +0100 Subject: likely() disappeared In-Reply-To: <20040213204612.GA28320@suse.de> References: <20040213204612.GA28320@suse.de> Message-ID: <20040214085937.GA1426@suse.de> On Fri, Feb 13, Olaf Hering wrote: > > likely() from linux/compiler.h disappeared into a #ifdef __KERNEL__. > arch/ppc64/boot/prom.c calls do_div() in number(). > what is the correct fix, other than not using the 'make all' target? 
its already in ameslab:

diff -purN linux-2.5/arch/ppc64/boot/prom.c linuxppc64-2.5/arch/ppc64/boot/prom.c
--- linux-2.5/arch/ppc64/boot/prom.c	2003-07-09 17:20:23.000000000 +0000
+++ linuxppc64-2.5/arch/ppc64/boot/prom.c	2004-02-12 12:05:57.000000000 +0000
@@ -11,9 +11,6 @@
 #include
 #include
 
-#define BITS_PER_LONG 32
-#include
-
 int (*prom)(void *);
 
 void *chosen_handle;
@@ -28,6 +25,9 @@ void chrpboot(int a1, int a2, void *prom
 
 void printk(char *fmt, ...);
 
+/* there is no convenient header to get this from... -- paulus */
+extern unsigned long strlen(const char *);
+
 int
 write(void *handle, void *ptr, int nb)
 {
@@ -352,7 +352,7 @@ static int skip_atoi(const char **s)
 #define SPECIAL	32		/* 0x */
 #define LARGE	64		/* use 'ABCDEF' instead of 'abcdef' */
 
-static char * number(char * str, long long num, int base, int size, int precision, int type)
+static char * number(char * str, long num, int base, int size, int precision, int type)
 {
 	char c,sign,tmp[66];
 	const char *digits="0123456789abcdefghijklmnopqrstuvwxyz";
@@ -388,8 +388,10 @@ static char * number(char * str, long lo
 	i = 0;
 	if (num == 0)
 		tmp[i++]='0';
-	else while (num != 0)
-		tmp[i++] = digits[do_div(num,base)];
+	else while (num != 0) {
+		tmp[i++] = digits[num % base];
+		num /= base;
+	}
 	if (i > precision)
 		precision = i;
 	size -= precision;
@@ -424,7 +426,7 @@ int sprintf(char * buf, const char *fmt,
 int vsprintf(char *buf, const char *fmt, va_list args)
 {
 	int len;
-	unsigned long long num;
+	unsigned long num;
 	int i, base;
 	char * str;
 	const char *s;
@@ -575,9 +577,7 @@ int vsprintf(char *buf, const char *fmt,
 				--fmt;
 			continue;
 		}
-		if (qualifier == 'L')
-			num = va_arg(args, long long);
-		else if (qualifier == 'l') {
+		if (qualifier == 'l') {
 			num = va_arg(args, unsigned long);
 			if (flags & SIGN)
 				num = (signed long) num;
--
USB is for mice, FireWire is for men!

sUse lINUX ag, n?RNBERG

** Sent via the linuxppc64-dev mail list.
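
The core of the patch above is replacing do_div() with plain % and / on a native-width integer, so the boot wrapper no longer needs the kernel-only headers. A minimal standalone sketch of that digit loop follows; format_unsigned() is a hypothetical name and the buffer handling is an assumption made for illustration, not the boot code's number() function.

```c
#include <assert.h>
#include <stddef.h>

/* Convert num to text in the given base using only % and /.
 * Writes at most outsz-1 characters plus a terminating NUL and
 * returns the number of characters written. */
static size_t format_unsigned(char *out, size_t outsz,
			      unsigned long num, unsigned int base)
{
	static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
	char tmp[66];		/* enough for 64 binary digits */
	size_t i = 0, n = 0;

	assert(base >= 2 && base <= 36);

	if (num == 0)
		tmp[i++] = '0';
	else while (num != 0) {
		/* same two steps as the patched loop */
		tmp[i++] = digits[num % base];
		num /= base;
	}

	/* Digits come out least significant first, so copy them reversed. */
	while (i > 0 && n + 1 < outsz)
		out[n++] = tmp[--i];
	if (outsz > 0)
		out[n] = '\0';
	return n;
}
```

Calling format_unsigned(buf, sizeof(buf), 255, 16) with a large enough buffer yields the string "ff".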
See http://lists.linuxppc.org/

From paulus at samba.org Sat Feb 14 20:24:31 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 14 Feb 2004 20:24:31 +1100
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
Message-ID: <16429.59727.123989.862372@cargo.ozlabs.ibm.com>

Linda Xie writes:

> The attached patch fixes OF device tree update code so that it supports
> Hotplug and DLPAR PCI multifunction cards.
>
> Comments are welcome.

This looks to me like it would benefit from having parts of it such as
the interrupt property parsing code split out into subroutines.

Paul.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Sun Feb 15 00:57:48 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 15 Feb 2004 00:57:48 +1100
Subject: likely() disappeared
In-Reply-To: <20040214085937.GA1426@suse.de>
References: <20040213204612.GA28320@suse.de> <20040214085937.GA1426@suse.de>
Message-ID: <20040214135748.GF9910@krispykreme>

> its already in ameslab:

Thanks, I forwarded paulus' patch onto Linus and akpm.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From benh at kernel.crashing.org Sun Feb 15 13:48:18 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 15 Feb 2004 13:48:18 +1100
Subject: People with ATAPI problems: possible fix
Message-ID: <1076813298.853.2.camel@gaston>

Hi !

Recently, there have been lots of reports of ATAPI issues, typically
with CD/DVD burners, where DMA would be disabled for no obvious
reasons, with the driver spitting a message about a timeout waiting
for dbdma command stop.

The problem is that our driver doesn't deal with buffer underruns
due to the drive not sending as much data as requested. This patch
is an attempt at fixing this. Please let me know if it helps.

Ben.
===== drivers/ide/ppc/pmac.c 1.50 vs edited ===== --- 1.50/drivers/ide/ppc/pmac.c Sat Feb 14 19:29:16 2004 +++ edited/drivers/ide/ppc/pmac.c Sun Feb 15 13:47:12 2004 @@ -55,7 +55,7 @@ #define IDE_PMAC_DEBUG -#define DMA_WAIT_TIMEOUT 100 +#define DMA_WAIT_TIMEOUT 50 typedef struct pmac_ide_hwif { unsigned long regbase; @@ -2032,8 +2032,11 @@ dstat = readl(&dma->status); writel(((RUN|WAKE|DEAD) << 16), &dma->control); pmac_ide_destroy_dmatable(drive); - /* verify good dma status */ - return (dstat & (RUN|DEAD|ACTIVE)) != RUN; + /* verify good dma status. we don't check for ACTIVE beeing 0. We should... + * in theory, but with ATAPI decices doing buffer underruns, that would + * cause us to disable DMA, which isn't what we want + */ + return (dstat & (RUN|DEAD)) != RUN; } /* @@ -2047,7 +2050,7 @@ { pmac_ide_hwif_t* pmif = (pmac_ide_hwif_t *)HWIF(drive)->hwif_data; volatile struct dbdma_regs *dma; - unsigned long status; + unsigned long status, timeout; if (pmif == NULL) return 0; @@ -2063,17 +2066,8 @@ * - The dbdma fifo hasn't yet finished flushing to * to system memory when the disk interrupt occurs. * - * The trick here is to increment drive->waiting_for_dma, - * and return as if no interrupt occurred. If the counter - * reach a certain timeout value, we then return 1. If - * we really got the interrupt, it will happen right away - * again. - * Apple's solution here may be more elegant. They issue - * a DMA channel interrupt (a separate irq line) via a DBDMA - * NOP command just before the STOP, and wait for both the - * disk and DBDMA interrupts to have completed. */ - + /* If ACTIVE is cleared, the STOP command have passed and * transfer is complete. 
*/ @@ -2085,15 +2079,26 @@ called while not waiting\n", HWIF(drive)->index); /* If dbdma didn't execute the STOP command yet, the - * active bit is still set */ - drive->waiting_for_dma++; - if (drive->waiting_for_dma >= DMA_WAIT_TIMEOUT) { - printk(KERN_WARNING "ide%d, timeout waiting \ - for dbdma command stop\n", HWIF(drive)->index); - return 1; - } - udelay(5); - return 0; + * active bit is still set. We consider that we aren't + * sharing interrupts (which is hopefully the case with + * those controllers) and so we just try to flush the + * channel for pending data in the fifo + */ + udelay(1); + writel((FLUSH << 16) | FLUSH, &dma->control); + timeout = 0; + for (;;) { + udelay(1); + status = readl(&dma->status); + if ((status & FLUSH) == 0) + break; + if (++timeout > 100) { + printk(KERN_WARNING "ide%d, ide_dma_test_irq \ + timeout flushing channel\n", HWIF(drive)->index); + break; + } + } + return 1; } static int __pmac ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Tue Feb 17 05:57:15 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Mon, 16 Feb 2004 12:57:15 -0600 Subject: open_pic.c without CONFIG_SMP Message-ID: <4031128B.8050602@us.ibm.com> I don't know if this patch is correct, but something is needed here. -- Hollis Blanchard IBM Linux Technology Center -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: openpic-UP.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040216/9b2124bc/attachment.txt From olh at suse.de Tue Feb 17 07:49:32 2004 From: olh at suse.de (Olaf Hering) Date: Mon, 16 Feb 2004 21:49:32 +0100 Subject: [PATCH] unbreaking pseries nvram driver Message-ID: <20040216204932.GA3926@suse.de> rtas_token returns an int, the rtas_call arguments are also an int. The error code returned by rtas_token is also an int. So, why the unsigned int? 
--- /tmp/linuxppc64-2.5/arch/ppc64/kernel/pSeries_nvram.c	2004-02-12 03:47:53.000000000 +0000
+++ ./arch/ppc64/kernel/pSeries_nvram.c	2004-02-16 20:45:12.000000000 +0000
@@ -29,7 +29,7 @@
 #include
 
 static unsigned int nvram_size;
-static unsigned int nvram_fetch, nvram_store;
+static int nvram_fetch, nvram_store;
 static char nvram_buf[NVRW_CNT];	/* assume this is in the first 4GB */
 static spinlock_t nvram_lock = SPIN_LOCK_UNLOCKED;
 
@@ -41,7 +41,7 @@ static ssize_t pSeries_nvram_read(char *
 	unsigned long flags;
 	char *p = buf;
 
-	if (nvram_size == 0 || nvram_fetch)
+	if (nvram_size == 0 || nvram_fetch == RTAS_UNKNOWN_SERVICE)
 		return -ENODEV;
 
 	if (*index >= nvram_size)
@@ -83,7 +83,7 @@ static ssize_t pSeries_nvram_write(char
 	unsigned long flags;
 	const char *p = buf;
 
-	if (nvram_size == 0 || nvram_store)
+	if (nvram_size == 0 || nvram_store == RTAS_UNKNOWN_SERVICE)
 		return -ENODEV;
 
 	if (*index >= nvram_size)
--
USB is for mice, FireWire is for men!

sUse lINUX ag, n?RNBERG

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From anton at samba.org Tue Feb 17 15:35:27 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 17 Feb 2004 15:35:27 +1100
Subject: KDB in ameslab
Message-ID: <20040217043527.GC25491@krispykreme>

Hi,

Linus had a nasty problem debugging what should have been a simple
problem but became a nightmare due to an incorrect debug hook. In
response to this I have cleaned up our debug hooks, it should be much
harder to screw up.

A side effect of this is that KDB is probably broken. I started looking
into fixing it however I noticed it looks out of date. Does someone
have the urge to update it?

Anton

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From segher at kernel.crashing.org Tue Feb 17 18:19:06 2004 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Tue, 17 Feb 2004 08:19:06 +0100 Subject: open_pic.c without CONFIG_SMP In-Reply-To: <4031128B.8050602@us.ibm.com> References: <4031128B.8050602@us.ibm.com> Message-ID: <924B9C8A-6119-11D8-BF03-000A95A4DC02@kernel.crashing.org> > I don't know if this patch is correct, but something is needed here. Why? OpenPIC has IPIs whether or not the system is SMP, i.e., it's a feature of the interrupt controller itself, not of the system. Of course it may not be _necessary_ to initialize this stuff, but it won't hurt, either. Segher ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Tue Feb 17 18:19:25 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 17 Feb 2004 18:19:25 +1100 Subject: autoconsole In-Reply-To: <20040212150907.GA13059@suse.de> References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040212150907.GA13059@suse.de> Message-ID: <20040217071925.GE25491@krispykreme> > have you tested it? console=tty1 doesnt work, console output still goes > straight to ttyS0. cmd_line is probably not yet set, my cmdline contains > alot of stuff, but only the first word was printed with > > printk("%s(%u) cmd_line is %s\n",__FUNCTION__,__LINE__,cmd_line); > > in set_preferred_console(). So strstr(cmd_line, "console=") does not > trigger. Does it just not work for me? Yeah we are parsing cmd_line after its been tokenised. Ive converted it to use saved_command_line (and found what looks like another case in eeh init). Its in ameslab and I'll send it upstream after 2.6.3 is released. Anton ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Tue Feb 17 18:35:10 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 17 Feb 2004 18:35:10 +1100 Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon In-Reply-To: <1076680805.10309.145.camel@DYN279927END.austin.ibm.com> References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> <16427.62304.699990.252289@cargo.ozlabs.ibm.com> <20040212230045.GJ25922@krispykreme> <1076680805.10309.145.camel@DYN279927END.austin.ibm.com> Message-ID: <20040217073510.GF25491@krispykreme> > I have seen a number of instances during bringup where one of the CPUs > will hang in a rtas call (due to a FW DSI), and then xmon would come in > and disable_surveillance and hang on the rtas lock. I would suggest > removing the disable_surveillance() call since there are no boxes that > I'm aware of that enable surveillance by default. Maybe make it a > separate command to disable surveillance similar to KDB. Interesting, I enable it on all our machines and would have thought we should be doing it automatically. From a RAS perspective we want to reboot the machine if it locks up. Maybe we can have a spin_trylock rtas call and whinge if the lock was already taken. Once we switch to stopping all cpus on xmon entry we want to do this to avoid deadlocks anyway. Anton ** Sent via the linuxppc64-dev mail list. 
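
Anton's idea above (try the rtas lock from the debugger and complain instead of spinning when it is already held) can be sketched in portable C as follows. This is only an illustration under stated assumptions: the C11 atomic_flag stands in for the kernel spinlock, and rtas_lock_sketch, sketch_trylock(), sketch_unlock() and debugger_rtas_call() are hypothetical names, not kernel or xmon functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the rtas lock; the kernel uses a spinlock_t. */
static atomic_flag rtas_lock_sketch = ATOMIC_FLAG_INIT;

/* Counts how many (pretend) firmware calls were actually made. */
static int rtas_calls_made;

/* Returns true if the lock was free and is now held by the caller. */
static bool sketch_trylock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set_explicit(lock, memory_order_acquire);
}

static void sketch_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Debugger-side call: never blocks. Returns 0 if the call was made,
 * -1 if it was skipped because another CPU holds the lock. */
static int debugger_rtas_call(void)
{
	if (!sketch_trylock(&rtas_lock_sketch)) {
		fprintf(stderr, "rtas lock already held, skipping call\n");
		return -1;
	}
	rtas_calls_made++;	/* the firmware call would go here */
	sketch_unlock(&rtas_lock_sketch);
	return 0;
}
```

The point of the pattern is that a CPU hung inside firmware with the lock held can no longer take the debugger down with it; the debugger loses one optional call and keeps running.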
See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Wed Feb 18 00:46:34 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Tue, 17 Feb 2004 07:46:34 -0600 Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon In-Reply-To: <20040217073510.GF25491@krispykreme> References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> <16427.62304.699990.252289@cargo.ozlabs.ibm.com> <20040212230045.GJ25922@krispykreme> <1076680805.10309.145.camel@DYN279927END.austin.ibm.com> <20040217073510.GF25491@krispykreme> Message-ID: <1077025594.4441.67.camel@DYN279927END.austin.ibm.com> > Interesting, I enable it on all our machines and would have thought we > should be doing it automatically. From a RAS perspective we want to > reboot the machine if it locks up. I believe it was enabled by default on 2.4, but on 2.6 I don't think it enables by the OS unless the surveillance= boot option is specified. IIRC FW has it off by default as well on any GA'd system. Thanks, Jake ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 18 08:01:37 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 17 Feb 2004 15:01:37 -0600 Subject: bk pushed to deleted files? Message-ID: <798E3D07-618C-11D8-8A98-000A95A0560C@us.ibm.com> Hi, I just pushed some changes to include/asm-ppc64/hvconsole.h, arch/ppc64/kernel/hvconsole.c, and drivers/char/hvc_console.c. Unfortunately this is what I see: drivers/char/hvc_console.c: 2 deltas BitKeeper/deleted/.del-hvconsole.c: 3 deltas BitKeeper/deleted/.del-hvconsole.h: 3 deltas ... despite the fact that neither of those two files have ever been deleted, Of course my changes were not applied to the correct (undeleted) files. Help? -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 18 09:44:38 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 17 Feb 2004 16:44:38 -0600 Subject: bk pushed to deleted files? In-Reply-To: <798E3D07-618C-11D8-8A98-000A95A0560C@us.ibm.com> References: <798E3D07-618C-11D8-8A98-000A95A0560C@us.ibm.com> Message-ID: On Feb 17, 2004, at 3:01 PM, Hollis Blanchard wrote: > > Hi, I just pushed some changes to include/asm-ppc64/hvconsole.h, > arch/ppc64/kernel/hvconsole.c, and drivers/char/hvc_console.c. > Unfortunately this is what I see: > drivers/char/hvc_console.c: 2 deltas > BitKeeper/deleted/.del-hvconsole.c: 3 deltas > BitKeeper/deleted/.del-hvconsole.h: 3 deltas > ... despite the fact that neither of those two files have ever been > deleted, Of course my changes were not applied to the correct > (undeleted) files. The files were deleted at ameslab recently when there was a merge conflict from upstream, and my tree was from before that deletion, so bk correctly noted that my diffs were to a deleted version of the file. I've just pushed the lost changes again, which unfortunately won't be associated with the not-lost changes in a Changeset. Who knew merging could be so hazardous... -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Wed Feb 18 10:37:11 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Tue, 17 Feb 2004 17:37:11 -0600 Subject: [PATCH] (2.6) put prom.c on a diet Message-ID: <4032A5A7.4040700@austin.ibm.com> Hi- This patch separates the Open Firmware "client" code which runs early in the boot process from code which is used to access and manipulate the kernel's copy of the Open Firmware device tree. The former remains in prom.c; the latter is placed in a new file, of_devtree.c. 
I tried to be fairly aggressive about pulling everything out of prom.c that could conceivably be used during normal system functioning after the system has booted. This has been compile-tested and booted on a Power4 lpar. I believe I've avoided breaking any builds, please point out anything I've missed. Feel free to suggest a different name for the new file. Also, please help me get the author credits right :) Nathan arch/ppc64/kernel/Makefile | 2 arch/ppc64/kernel/of_devtree.c | 1025 ++++++++++++++++++++++++++++++++++++++++ arch/ppc64/kernel/prom.c | 1030 ----------------------------------------- include/asm-ppc64/prom.h | 24 4 files changed, 1049 insertions(+), 1032 deletions(-) -------------- next part -------------- A non-text attachment was scrubbed... Name: split_prom.c.patch.bz2 Type: application/x-bzip Size: 9924 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040217/f2e74a29/attachment.bin From anton at samba.org Wed Feb 18 11:03:54 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 18 Feb 2004 11:03:54 +1100 Subject: [PATCH] (2.6) put prom.c on a diet In-Reply-To: <4032A5A7.4040700@austin.ibm.com> References: <4032A5A7.4040700@austin.ibm.com> Message-ID: <20040218000353.GC22534@krispykreme> Hi, > This patch separates the Open Firmware "client" code which runs early in > the boot process from code which is used to access and manipulate the > kernel's copy of the Open Firmware device tree. The former remains in > prom.c; the latter is placed in a new file, of_devtree.c. Very nice! Assuming no one has pending stuff that touches prom.c I think we should merge this ASAP. Anton ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From benh at kernel.crashing.org Wed Feb 18 11:04:06 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 18 Feb 2004 11:04:06 +1100
Subject: [PATCH] (2.6) put prom.c on a diet
In-Reply-To: <4032A5A7.4040700@austin.ibm.com>
References: <4032A5A7.4040700@austin.ibm.com>
Message-ID: <1077062646.1080.66.camel@gaston>

On Wed, 2004-02-18 at 10:37, Nathan Lynch wrote:
> Hi-
>
> This patch separates the Open Firmware "client" code which runs early in
> the boot process from code which is used to access and manipulate the
> kernel's copy of the Open Firmware device tree. The former remains in
> prom.c; the latter is placed in a new file, of_devtree.c.
>
> I tried to be fairly aggressive about pulling everything out of prom.c
> that could conceivably be used during normal system functioning after
> the system has booted.
>
> This has been compile-tested and booted on a Power4 lpar. I believe I've
> avoided breaking any builds, please point out anything I've missed. Feel free
> to suggest a different name for the new file. Also, please help me get the
> author credits right :)

While you are at it, can you unify the function definition "style" ?
That is, replace

int
myfunc(...)

with

int myfunc(...)

Whatever the respective merits of each version, the "Linux" way is
the second form ;)

The credits are probably scattered all over the place. I'd name paulus
of course, myself, probably Dave Engebretsen and Peter Bergner too...

Ben.

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Wed Feb 18 11:59:32 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Tue, 17 Feb 2004 18:59:32 -0600 Subject: [PATCH] (2.6) put prom.c on a diet In-Reply-To: <1077062646.1080.66.camel@gaston> References: <4032A5A7.4040700@austin.ibm.com> <1077062646.1080.66.camel@gaston> Message-ID: <4032B8F4.3090905@austin.ibm.com> Benjamin Herrenschmidt wrote: > While you are at it, can you unify the function definition "style" ? Alright, I've done this. > The credits are probably scattered all over the place. I'd name paulus > of course, myself, probably dave engebretsen and peter bergner too... Ok, I've tried to do this right. I also updated the patch to latest available Ameslab code. Nathan -------------- next part -------------- A non-text attachment was scrubbed... Name: split_prom.c.patch.bz2 Type: application/x-bzip Size: 10339 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040217/3dc0e321/attachment.bin From hollisb at us.ibm.com Thu Feb 19 10:49:23 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Wed, 18 Feb 2004 17:49:23 -0600 Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards In-Reply-To: <4033DFB7.1020603@ltcfwd.linux.ibm.com> References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com> <4033DFB7.1020603@ltcfwd.linux.ibm.com> Message-ID: <13C60E86-626D-11D8-ACBB-000A95A0560C@us.ibm.com> On Feb 18, 2004, at 3:57 PM, Linda Xie wrote: > > Ben, Please let me know your plan for pushing Nathan's split_prom > patch, > so I can recut OFDT_update patch against of_devtree.c. of_devtree.c is already present in ameslab linux-2.5. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Thu Feb 19 13:27:40 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 19 Feb 2004 13:27:40 +1100 Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards In-Reply-To: <4033DFB7.1020603@ltcfwd.linux.ibm.com> References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com> <4033DFB7.1020603@ltcfwd.linux.ibm.com> Message-ID: <1077157659.20788.190.camel@gaston> On Thu, 2004-02-19 at 08:57, Linda Xie wrote: > Hi, > > Attached is the updated OFDT_update patch. > > Comments are welcome. > > > Ben, > Please let me know your plan for pushing Nathan's split_prom patch, > so I can recut OFDT_update patch against of_devtree.c. Well... somebody pushed it already so ... Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Fri Feb 20 03:34:12 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Thu, 19 Feb 2004 10:34:12 -0600 Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards In-Reply-To: <4034DEA9.4070004@ltcfwd.linux.ibm.com> References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com> <4033DFB7.1020603@ltcfwd.linux.ibm.com> <13C60E86-626D-11D8-ACBB-000A95A0560C@us.ibm.com> <4034DEA9.4070004@ltcfwd.linux.ibm.com> Message-ID: <730B4500-62F9-11D8-BBC4-000A95A0560C@us.ibm.com> On Feb 19, 2004, at 10:04 AM, Linda Xie wrote: > Hollis Blanchard wrote: >> >> of_devtree.c is already present in ameslab linux-2.5. > > the one is in ameslab you are talking about is of_device.c, Am I right? Yup, my mistake. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From stefan at nocrew.org Fri Feb 20 08:12:51 2004
From: stefan at nocrew.org (Stefan Berndtsson)
Date: Thu, 19 Feb 2004 22:12:51 +0100
Subject: People with ATAPI problems: possible fix
In-Reply-To: <1076813298.853.2.camel@gaston> (Benjamin Herrenschmidt's message of "Sun, 15 Feb 2004 13:48:18 +1100")
References: <1076813298.853.2.camel@gaston>
Message-ID: <8765e2srho.fsf@hades.nocrew.org>

Benjamin Herrenschmidt writes:

> Hi !
>
> Recently, there have been lots of reports of ATAPI issues, typically
> with CD/DVD burners, where DMA would be disabled for no obvious reasons,
> with the driver spitting a message about a timeout waiting for dbdma
> command stop.
>
> The problem is that our driver doesn't deal with buffer underruns due
> to the drive not sending as much data as requested.
>
> This patch is an attempt at fixing this. Please let me know if it
> helps.

First impression is that it seems to work. It no longer claims to kill DMA
when running growisofs, and it is still active when writing is done.

Thanks.

/Stefan

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From benh at kernel.crashing.org Fri Feb 20 08:49:57 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 20 Feb 2004 08:49:57 +1100
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <4034DEA9.4070004@ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com> <4033DFB7.1020603@ltcfwd.linux.ibm.com> <13C60E86-626D-11D8-ACBB-000A95A0560C@us.ibm.com> <4034DEA9.4070004@ltcfwd.linux.ibm.com>
Message-ID: <1077227397.20789.554.camel@gaston>

> of_devtree.c is NOT in ameslab linux-2.5. Nathan's split_prom.c.patch
> (posted on 2/17) generates of_devtree.c (new file) by pulling everything
> out of prom.c that could conceivably be used after the system has booted.
> I checked ameslab tree this morning, his patch is not there yet.
> > the one is in ameslab you are talking about is of_device.c, Am I right? Oh right, somebody told me it was and I didn't double check. Anton, do you want Nathan patch in now ? Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Fri Feb 20 09:05:50 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 20 Feb 2004 09:05:50 +1100 Subject: People with ATAPI problems: possible fix In-Reply-To: <8765e2srho.fsf@hades.nocrew.org> References: <1076813298.853.2.camel@gaston> <8765e2srho.fsf@hades.nocrew.org> Message-ID: <1077228350.20781.587.camel@gaston> > First impression is that it seems to work. It no longer claims to kill DMA > when running growisofs, and it is still active when writing is done. And the actual datas on the written media are ok ? Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From stefan at nocrew.org Fri Feb 20 10:21:02 2004 From: stefan at nocrew.org (Stefan Berndtsson) Date: Fri, 20 Feb 2004 00:21:02 +0100 Subject: People with ATAPI problems: possible fix In-Reply-To: <1077228350.20781.587.camel@gaston> (Benjamin Herrenschmidt's message of "Fri, 20 Feb 2004 09:05:50 +1100") References: <1076813298.853.2.camel@gaston> <8765e2srho.fsf@hades.nocrew.org> <1077228350.20781.587.camel@gaston> Message-ID: <87y8qyr6zl.fsf@hades.nocrew.org> Benjamin Herrenschmidt writes: >> First impression is that it seems to work. It no longer claims to kill DMA >> when running growisofs, and it is still active when writing is done. > > And the actual datas on the written media are ok ? At least the parts of the movie I looked at worked in my dvd player, so I'd say it is. /Stefan ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From rsa at us.ibm.com Tue Feb 24 02:00:31 2004 From: rsa at us.ibm.com (Ryan Arnold) Date: 23 Feb 2004 09:00:31 -0600 Subject: New driver (hvcs) review request Message-ID: <1077548434.933.15.camel@SigurRos.rchland.ibm.com> Greetings, I have a new driver that I'd like some comments on. The following is the driver description from the source file hvcs.c. * This is the device driver for the IBM Hypervisor Virtual Console * Server, "hvcs". The IBM hvcs provides a TTY interface to allow * Linux user space applications access to the system consoles of * partitioned RPA supported operating systems (Linux and AIX) * running on the same partitioned IBM POWER architecture eServer. * Physical hardware consoles per partition do not exist on these * platforms and system consoles are interacted with through * hypervisor interfaces utilized by this driver. I've included with this email a link to the patch that I plan on checking in to the Ameslab 2.6 linux trees as soon as the community feels it is in decent enough shape to do so. I would appreciate any and all comments that could be given regarding this patch because I also plan on submitting it to the LKML soon. This patch would probably not apply cleanly to the current Ameslab source because there are some other recent patches that it may need to be resolved with. http://www-124.ibm.com/linux/patches/?patch_id=1377 For those interested, I would appreciate some comments on the questions/concerns I have outlined in the "TODO" section of driver comments. This is a basic working driver that has not had extensive testing done against it. Thanks, Ryan S. Arnold IBM Linux Technology Center IBM Rochester, MN. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From linas at austin.ibm.com Tue Feb 24 06:46:39 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Mon, 23 Feb 2004 13:46:39 -0600 Subject: KDB in ameslab In-Reply-To: <20040217043527.GC25491@krispykreme>; from anton@samba.org on Tue, Feb 17, 2004 at 03:35:27PM +1100 References: <20040217043527.GC25491@krispykreme> Message-ID: <20040223134639.A74832@forte.austin.ibm.com> On Tue, Feb 17, 2004 at 03:35:27PM +1100, Anton Blanchard wrote: > > Hi, > > Linus had a nasty problem debugging what should have been a simple > problem but became a nightmare due to an incorrect debug hook. In > response to this I have cleaned up our debug hooks, it should be much > harder to screw up. > > A side effect of this is that KDB is probably broken. I started looking > into fixing it however I noticed it looks out of date. Does someone have > the urge to update it? I don't have the urge to update it but I'm motivated (today) to fix it up enough to work. Do you want a patch on the mailing list, or just a bk push? --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Tue Feb 24 08:21:01 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Mon, 23 Feb 2004 15:21:01 -0600 Subject: [PATCH] Re: KDB in ameslab In-Reply-To: <20040223134639.A74832@forte.austin.ibm.com>; from linas@austin.ibm.com on Mon, Feb 23, 2004 at 01:46:39PM -0600 References: <20040217043527.GC25491@krispykreme> <20040223134639.A74832@forte.austin.ibm.com> Message-ID: <20040223152101.B74832@forte.austin.ibm.com> Hi, On Mon, Feb 23, 2004 at 01:46:39PM -0600, linas at austin.ibm.com wrote: > > On Tue, Feb 17, 2004 at 03:35:27PM +1100, Anton Blanchard wrote: > > > > Hi, > > > > Linus had a nasty problem debugging what should have been a simple > > problem but became a nightmare due to an incorrect debug hook. In > > response to this I have cleaned up our debug hooks, it should be much > > harder to screw up. 
> >
> > A side effect of this is that KDB is probably broken. I started looking
> > into fixing it however I noticed it looks out of date. Does someone have
> > the urge to update it?
>
> I don't have the urge to update it but I'm motivated (today) to fix it
> up enough to work. Do you want a patch on the mailing list, or just
> a bk push?

The attachments below are the only changes I needed to make to get kdb
to compile & run with today's (2.6.3) ameslab bk tree.

The file include/linux/dis-asm.h seems to be missing from the ameslab
tree, I'm not sure why. I append it below. It's a copy of this file
from the 2.4 trees (which seem to be identical).

Please apply this patch & file to ameslab.

--linas

p.s. where is the "newest" KDB ? I found
ftp://oss.sgi.com/projects/kdb/download/v4.3
but googling KDB is such a sad experience that I'm not
convinced that something newer isn't hiding somewhere.

-------------- next part --------------
===== kdbasupport.c 1.5 vs edited =====
--- 1.5/arch/ppc64/kdb/kdbasupport.c	Thu Jan 22 00:11:58 2004
+++ edited/kdbasupport.c	Mon Feb 23 13:49:24 2004
@@ -732,7 +732,6 @@
 	asm volatile("sync; isync");
 }
 
-extern void (*debugger_fault_handler)(struct pt_regs *);
 extern void longjmp(u_int *, int);
 
 unsigned long
@@ -2028,12 +2027,12 @@
 	kdb_map_scc();		/* map sysrq key */
 #endif
 
-	debugger = kdb_debugger;
-	debugger_bpt = kdb_debugger_bpt;
-	debugger_sstep = kdb_debugger_sstep;
-	debugger_iabr_match = kdb_debugger_iabr_match;
-	debugger_dabr_match = kdb_debugger_dabr_match;
-	debugger_fault_handler = NULL; /* this guy is normally off. */
+	__debugger = kdb_debugger;
+	__debugger_bpt = kdb_debugger_bpt;
+	__debugger_sstep = kdb_debugger_sstep;
+	__debugger_iabr_match = kdb_debugger_iabr_match;
+	__debugger_dabr_match = kdb_debugger_dabr_match;
+	__debugger_fault_handler = NULL; /* this guy is normally off. */
 	/* = kdb_debugger_fault_handler; */
 
 	kdba_enable_lbr();
-------------- next part --------------
/* Interface between the opcode library and its callers.
   Written by Cygnus Support, 1993.

   The opcode library (libopcodes.a) provides instruction decoders for
   a large variety of instruction sets, callable with an identical
   interface, for making instruction-processing programs more independent
   of the instruction set being processed.
*/ /* Hacked by Scott Lurndal at SGI (02/1999) for linux kernel debugger */ /* Upgraded to cygnus CVS Keith Owens 30 Oct 2000 */ #ifndef DIS_ASM_H #define DIS_ASM_H #ifdef __cplusplus extern "C" { #endif /* * Misc definitions */ #ifndef PARAMS #define PARAMS(x) x #endif #define PTR void * #define FILE int #if !defined(NULL) #define NULL 0 #endif #define abort() dis_abort(__LINE__) static inline void dis_abort(int line) { panic("Aborting disassembler @ line %d\n", line); } #include #include #define xstrdup(string) ({ char *res = kdb_strdup(string, GFP_ATOMIC); if (!res) BUG(); res; }) #define xmalloc(size) ({ void *res = kmalloc(size, GFP_ATOMIC); if (!res) BUG(); res; }) #define free(address) kfree(address) #include typedef int (*fprintf_ftype) PARAMS((PTR, const char*, ...)); enum dis_insn_type { dis_noninsn, /* Not a valid instruction */ dis_nonbranch, /* Not a branch instruction */ dis_branch, /* Unconditional branch */ dis_condbranch, /* Conditional branch */ dis_jsr, /* Jump to subroutine */ dis_condjsr, /* Conditional jump to subroutine */ dis_dref, /* Data reference instruction */ dis_dref2 /* Two data references in instruction */ }; /* This struct is passed into the instruction decoding routine, and is passed back out into each callback. The various fields are used for conveying information from your main routine into your callbacks, for passing information into the instruction decoders (such as the addresses of the callback functions), or for passing information back from the instruction decoders to their callers. It must be initialized before it is first passed; this can be done by hand, or using one of the initialization macros below. */ typedef struct disassemble_info { fprintf_ftype fprintf_func; fprintf_ftype fprintf_dummy; PTR stream; PTR application_data; /* Target description. We could replace this with a pointer to the bfd, but that would require one. 
There currently isn't any such requirement so to avoid introducing one we record these explicitly. */ /* The bfd_flavour. This can be bfd_target_unknown_flavour. */ enum bfd_flavour flavour; /* The bfd_arch value. */ enum bfd_architecture arch; /* The bfd_mach value. */ unsigned long mach; /* Endianness (for bi-endian cpus). Mono-endian cpus can ignore this. */ enum bfd_endian endian; /* An array of pointers to symbols either at the location being disassembled or at the start of the function being disassembled. The array is sorted so that the first symbol is intended to be the one used. The others are present for any misc. purposes. This is not set reliably, but if it is not NULL, it is correct. */ asymbol **symbols; /* Number of symbols in array. */ int num_symbols; /* For use by the disassembler. The top 16 bits are reserved for public use (and are documented here). The bottom 16 bits are for the internal use of the disassembler. */ unsigned long flags; #define INSN_HAS_RELOC 0x80000000 PTR private_data; /* Function used to get bytes to disassemble. MEMADDR is the address of the stuff to be disassembled, MYADDR is the address to put the bytes in, and LENGTH is the number of bytes to read. INFO is a pointer to this struct. Returns an errno value or 0 for success. */ int (*read_memory_func) PARAMS ((bfd_vma memaddr, bfd_byte *myaddr, unsigned int length, struct disassemble_info *info)); /* Function which should be called if we get an error that we can't recover from. STATUS is the errno value from read_memory_func and MEMADDR is the address that we were trying to read. INFO is a pointer to this struct. */ void (*memory_error_func) PARAMS ((int status, bfd_vma memaddr, struct disassemble_info *info)); /* Function called to print ADDR. */ void (*print_address_func) PARAMS ((bfd_vma addr, struct disassemble_info *info)); /* Function called to determine if there is a symbol at the given ADDR. If there is, the function returns 1, otherwise it returns 0. 
This is used by ports which support an overlay manager where the overlay number is held in the top part of an address. In some circumstances we want to include the overlay number in the address, (normally because there is a symbol associated with that address), but sometimes we want to mask out the overlay bits. */ int (* symbol_at_address_func) PARAMS ((bfd_vma addr, struct disassemble_info * info)); /* These are for buffer_read_memory. */ bfd_byte *buffer; bfd_vma buffer_vma; unsigned int buffer_length; /* This variable may be set by the instruction decoder. It suggests the number of bytes objdump should display on a single line. If the instruction decoder sets this, it should always set it to the same value in order to get reasonable looking output. */ int bytes_per_line; /* the next two variables control the way objdump displays the raw data */ /* For example, if bytes_per_line is 8 and bytes_per_chunk is 4, the */ /* output will look like this: 00: 00000000 00000000 with the chunks displayed according to "display_endian". */ int bytes_per_chunk; enum bfd_endian display_endian; /* Number of octets per incremented target address Normally one, but some DSPs have byte sizes of 16 or 32 bits */ unsigned int octets_per_byte; /* Results from instruction decoders. Not all decoders yet support this information. This info is set each time an instruction is decoded, and is only valid for the last such instruction. To determine whether this decoder supports this information, set insn_info_valid to 0, decode an instruction, then check it. */ char insn_info_valid; /* Branch info has been set. */ char branch_delay_insns; /* How many sequential insn's will run before a branch takes effect. (0 = normal) */ char data_size; /* Size of data reference in insn, in bytes */ enum dis_insn_type insn_type; /* Type of instruction */ bfd_vma target; /* Target address of branch or dref, if known; zero if unknown. 
*/ bfd_vma target2; /* Second target address for dref2 */ /* Command line options specific to the target disassembler. */ char * disassembler_options; } disassemble_info; /* Standard disassemblers. Disassemble one instruction at the given target address. Return number of bytes processed. */ typedef int (*disassembler_ftype) PARAMS((bfd_vma, disassemble_info *)); extern int print_insn_big_mips PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_little_mips PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_i386_att PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_i386_intel PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_ia64 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_i370 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_m68hc11 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_m68hc12 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_m68k PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_z8001 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_z8002 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_h8300 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_h8300h PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_h8300s PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_h8500 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_alpha PARAMS ((bfd_vma, disassemble_info*)); extern disassembler_ftype arc_get_disassembler PARAMS ((int, int)); extern int print_insn_big_arm PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_little_arm PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_sparc PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_big_a29k PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_little_a29k PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_i860 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_i960 PARAMS ((bfd_vma, 
disassemble_info*)); extern int print_insn_sh PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_shl PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_hppa PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_fr30 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_m32r PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_m88k PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_mcore PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_mn10200 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_mn10300 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_ns32k PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_big_powerpc PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_little_powerpc PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_rs6000 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_w65 PARAMS ((bfd_vma, disassemble_info*)); extern disassembler_ftype cris_get_disassembler PARAMS ((bfd *)); extern int print_insn_d10v PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_d30v PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_v850 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_tic30 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_vax PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_tic54x PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_tic80 PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_pj PARAMS ((bfd_vma, disassemble_info*)); extern int print_insn_avr PARAMS ((bfd_vma, disassemble_info*)); extern void print_arm_disassembler_options PARAMS ((FILE *)); extern void parse_arm_disassembler_option PARAMS ((char *)); extern int get_arm_regname_num_options PARAMS ((void)); extern int set_arm_regname_option PARAMS ((int)); extern int get_arm_regnames PARAMS ((int, const char **, const char **, const char ***)); /* Fetch the disassembler for a given BFD, if that 
support is available. */ extern disassembler_ftype disassembler PARAMS ((bfd *)); /* Document any target specific options available from the disassembler. */ extern void disassembler_usage PARAMS ((FILE *)); /* This block of definitions is for particular callers who read instructions into a buffer before calling the instruction decoder. */ /* Here is a function which callers may wish to use for read_memory_func. It gets bytes from a buffer. */ extern int buffer_read_memory PARAMS ((bfd_vma, bfd_byte *, unsigned int, struct disassemble_info *)); /* This function goes with buffer_read_memory. It prints a message using info->fprintf_func and info->stream. */ extern void perror_memory PARAMS ((int, bfd_vma, struct disassemble_info *)); /* Just print the address in hex. This is included for completeness even though both GDB and objdump provide their own (to print symbolic addresses). */ extern void generic_print_address PARAMS ((bfd_vma, struct disassemble_info *)); /* Always true. */ extern int generic_symbol_at_address PARAMS ((bfd_vma, struct disassemble_info *)); /* Macro to initialize a disassemble_info struct. This should be called by all applications creating such a struct. */ #define INIT_DISASSEMBLE_INFO(INFO, STREAM, FPRINTF_FUNC) \ (INFO).flavour = bfd_target_unknown_flavour, \ (INFO).arch = bfd_arch_unknown, \ (INFO).mach = 0, \ (INFO).endian = BFD_ENDIAN_UNKNOWN, \ (INFO).octets_per_byte = 1, \ INIT_DISASSEMBLE_INFO_NO_ARCH(INFO, STREAM, FPRINTF_FUNC) /* Call this macro to initialize only the internal variables for the disassembler. Architecture dependent things such as byte order, or machine variant are not touched by this macro. This makes things much easier for GDB which must initialize these things separately. 
*/ #define INIT_DISASSEMBLE_INFO_NO_ARCH(INFO, STREAM, FPRINTF_FUNC) \ (INFO).fprintf_func = (fprintf_ftype)(FPRINTF_FUNC), \ (INFO).stream = (PTR)(STREAM), \ (INFO).symbols = NULL, \ (INFO).num_symbols = 0, \ (INFO).buffer = NULL, \ (INFO).buffer_vma = 0, \ (INFO).buffer_length = 0, \ (INFO).read_memory_func = buffer_read_memory, \ (INFO).memory_error_func = perror_memory, \ (INFO).print_address_func = generic_print_address, \ (INFO).symbol_at_address_func = generic_symbol_at_address, \ (INFO).flags = 0, \ (INFO).bytes_per_line = 0, \ (INFO).bytes_per_chunk = 0, \ (INFO).display_endian = BFD_ENDIAN_UNKNOWN, \ (INFO).insn_info_valid = 0 #ifdef __cplusplus }; #endif #endif /* ! defined (DIS_ASM_H) */ From olof at austin.ibm.com Tue Feb 24 08:27:21 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Mon, 23 Feb 2004 15:27:21 -0600 Subject: [PATCH] Re: KDB in ameslab In-Reply-To: <20040223152101.B74832@forte.austin.ibm.com> References: <20040217043527.GC25491@krispykreme> <20040223134639.A74832@forte.austin.ibm.com> <20040223152101.B74832@forte.austin.ibm.com> Message-ID: <403A7039.6060605@austin.ibm.com> linas at austin.ibm.com wrote: > p.s. where is the "newest" KDB ? I found > ftp://oss.sgi.com/projects/kdb/download/v4.3 > but googling KDB is such a sad experience that I'm not > convinced that something newer isn't hiding somewhere. 4.3 seems to be current version. Keith Owens recently announced a port of it to kernel 2.6.3. -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From anton at samba.org Tue Feb 24 08:43:17 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 24 Feb 2004 08:43:17 +1100
Subject: KDB in ameslab
In-Reply-To: <20040223134639.A74832@forte.austin.ibm.com>
References: <20040217043527.GC25491@krispykreme> <20040223134639.A74832@forte.austin.ibm.com>
Message-ID: <20040223214317.GU5801@krispykreme>

> I don't have the urge to update it but I'm motivated (today) to fix it
> up enough to work. Do you want a patch on the mailing list, or just
> a bk push?

I'm OK with a push, it probably just requires fixes to the code that
initialises the debugger hooks.

The xmon stuff can now be enabled/disabled at runtime (at this stage you
have to use sysrq to do it), it would be nice if kdb could too. That would
mean we could have both debuggers compiled in and we can enable either at
runtime. With a bit of effort xmon could become a module too.

Anton

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From anton at samba.org Tue Feb 24 08:56:11 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 24 Feb 2004 08:56:11 +1100
Subject: CONFIG_PAGEALLOC_DEBUG
Message-ID: <20040223215611.GV5801@krispykreme>

Hi,

Here's a first stab at CONFIG_PAGEALLOC_DEBUG. It's a useful debug feature
where you unmap unused pages, catching use-after-free bugs etc.

It only works on pseries SMP at the moment, we really need to rework how
we do it. The current updateboltedpp hooks aren't good enough because they
only write protect but still allow reading. At the moment I just turn
off the valid bit and leave the entry there, but that won't work on LPAR.
I think we will have to remove the bolted entry completely and reinsert
it.
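
[Illustration, not part of the patch.] The point about write protection
not being enough can be tried out in userspace. The sketch below is only
an analogy under stated assumptions: it uses mmap() and
mprotect(PROT_NONE) in place of the kernel's hash table updates, and the
helper name use_after_free_is_caught() is made up for this example. A
page is "allocated", written once, then "freed" by revoking all access,
so that even a stale read faults instead of silently returning old data.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on strict glibc builds */
#endif
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Jump target used to recover from the deliberate fault. */
static sigjmp_buf fault_jmp;

static void on_fault(int sig)
{
	(void)sig;
	siglongjmp(fault_jmp, 1);
}

/*
 * Hypothetical helper (not from the patch): "allocate" one page, "free" it
 * by revoking all access with mprotect(PROT_NONE), then read it again.
 * Returns 1 if the stale read faulted, 0 if it went through unnoticed,
 * -1 if the setup itself failed.
 */
int use_after_free_is_caught(void)
{
	struct sigaction sa, old_segv, old_bus;
	long page_size = sysconf(_SC_PAGESIZE);
	volatile unsigned char *page;
	volatile int caught = 0;

	if (page_size <= 0)
		return -1;

	page = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		return -1;
	page[0] = '1';			/* legitimate use while "allocated" */

	/* The "free": read access goes away as well as write access. */
	if (mprotect((void *)page, (size_t)page_size, PROT_NONE) != 0) {
		munmap((void *)page, (size_t)page_size);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_fault;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, &old_segv);
	sigaction(SIGBUS, &sa, &old_bus);	/* some systems raise SIGBUS */

	if (sigsetjmp(fault_jmp, 1) == 0) {
		unsigned char c = page[0];	/* the use after free */
		(void)c;
	} else {
		caught = 1;			/* we got here via the handler */
	}

	/* Restore the previous handlers and release the page. */
	sigaction(SIGSEGV, &old_segv, NULL);
	sigaction(SIGBUS, &old_bus, NULL);
	munmap((void *)page, (size_t)page_size);
	return caught;
}
```

Calling use_after_free_is_caught() from a small main() should report the
stale read as caught on a typical Linux system. Had the page only been
made read-only (PROT_READ), the same read would have gone through, which
is the weakness described above for the updateboltedpp hooks.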
You might have to tune how the slab cache interacts (for maximum coverage
you pretty much want all allocations, even small ones, to end up on their
own page, and you don't want any of the slab caches to be operating).

Anton

---

 foobar-anton/arch/ppc64/Kconfig               |    8 ++++++++
 foobar-anton/arch/ppc64/kernel/iSeries_htab.c |    7 ++++---
 foobar-anton/arch/ppc64/kernel/idle.c         |   11 +++++++++++
 foobar-anton/arch/ppc64/kernel/pSeries_htab.c |   25 +++++++++++--------------
 foobar-anton/arch/ppc64/kernel/pSeries_lpar.c |   11 ++++-------
 foobar-anton/arch/ppc64/mm/hash_utils.c       |   11 +++++++++++
 foobar-anton/arch/ppc64/mm/init.c             |    2 +-
 foobar-anton/include/asm-ppc64/cacheflush.h   |    5 +++++
 foobar-anton/include/asm-ppc64/cputable.h     |    7 +++++--
 foobar-anton/include/asm-ppc64/machdep.h      |    4 ++--
 mm/slab.c                                     |    0
 11 files changed, 62 insertions(+), 29 deletions(-)

diff -puN arch/ppc64/Kconfig~ppc64-config_pagealloc_debug arch/ppc64/Kconfig
--- foobar/arch/ppc64/Kconfig~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.922539797 +1100
+++ foobar-anton/arch/ppc64/Kconfig	2004-02-21 13:58:15.996534209 +1100
@@ -401,6 +401,14 @@ config DEBUG_SPINLOCK_SLEEP
 	  If you say Y here, various routines which may sleep will become very
 	  noisy if they are called with a spinlock held.
 
+config DEBUG_PAGEALLOC
+	bool "Page alloc debugging"
+	depends on DEBUG_KERNEL
+	help
+	  Unmap pages from the kernel linear mapping after free_pages().
+	  This results in a large slowdown, but helps to find certain types
+	  of memory corruptions.
+
 endmenu
 
 source "security/Kconfig"
diff -puN arch/ppc64/kernel/iSeries_htab.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/iSeries_htab.c
--- foobar/arch/ppc64/kernel/iSeries_htab.c~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.928539344 +1100
+++ foobar-anton/arch/ppc64/kernel/iSeries_htab.c	2004-02-21 13:58:15.997534133 +1100
@@ -167,7 +167,7 @@ static long iSeries_hpte_find(unsigned l
  *
  * No need to lock here because we should be the only user.
*/ -static void iSeries_hpte_updateboltedpp(unsigned long newpp, unsigned long ea) +static void iSeries_hpte_updatevalid(unsigned long valid, unsigned long ea) { unsigned long vsid,va,vpn; long slot; @@ -176,8 +176,9 @@ static void iSeries_hpte_updateboltedpp( va = (vsid << 28) | (ea & 0x0fffffff); vpn = va >> PAGE_SHIFT; slot = iSeries_hpte_find(vpn); - if (slot == -1) - panic("updateboltedpp: Could not find page to bolt\n"); + BUG_ON(slot == -1); + + /* XXX FIXME */ HvCallHpt_setPp(slot, newpp); } diff -puN arch/ppc64/kernel/idle.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/idle.c --- foobar/arch/ppc64/kernel/idle.c~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.934538891 +1100 +++ foobar-anton/arch/ppc64/kernel/idle.c 2004-02-21 14:49:51.761292296 +1100 @@ -132,6 +132,17 @@ int default_idle(void) { long oldval; +#if 0 + struct page *tmp = alloc_pages(GFP_KERNEL, 0); + unsigned char *foo = __va(page_to_pfn(tmp) << PAGE_SHIFT); + foo[0] = '1'; + free_pages(foo, 0); + + printk("use after free: %p\n", foo); + printk("%c\n", foo[0]); + printk("passed\n"); +#endif + while (1) { oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED); diff -puN arch/ppc64/kernel/pSeries_htab.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/pSeries_htab.c --- foobar/arch/ppc64/kernel/pSeries_htab.c~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.939538513 +1100 +++ foobar-anton/arch/ppc64/kernel/pSeries_htab.c 2004-02-21 15:07:47.212994956 +1100 @@ -60,11 +60,11 @@ long pSeries_hpte_insert(unsigned long h for (i = 0; i < HPTES_PER_GROUP; i++) { dw0 = hptep->dw0.dw0; - if (!dw0.v) { + if (!dw0.v && !dw0.bolted) { /* retry with lock held */ pSeries_lock_hpte(hptep); dw0 = hptep->dw0.dw0; - if (!dw0.v) + if (!dw0.v && !dw0.bolted) break; pSeries_unlock_hpte(hptep); } @@ -177,7 +177,7 @@ static long pSeries_hpte_find(unsigned l hptep = htab_data.htab + slot; dw0 = hptep->dw0.dw0; - if ((dw0.avpn == (vpn >> 11)) && dw0.v && + if ((dw0.avpn == (vpn >> 11)) && dw0.bolted && (dw0.h == j)) 
{ /* HPTE matches */ if (j) @@ -230,14 +230,12 @@ static long pSeries_hpte_updatepp(unsign } /* - * Update the page protection bits. Intended to be used to create - * guard pages for kernel data structures on pages which are bolted - * in the HPT. Assumes pages being operated on will not be stolen. - * Does not work on large pages. + * Change the valid bit on bolted pages. Used by debugging code such + * as CONFIG_PAGEALLOC_DEBUG to cause accesses on certain pages to fault. * - * No need to lock here because we should be the only user. + * We assume the caller provides any locking. */ -static void pSeries_hpte_updateboltedpp(unsigned long newpp, unsigned long ea) +static void pSeries_hpte_updatevalid(unsigned long valid, unsigned long ea) { unsigned long vsid, va, vpn, flags; long slot; @@ -248,11 +246,10 @@ static void pSeries_hpte_updateboltedpp( vpn = va >> PAGE_SHIFT; slot = pSeries_hpte_find(vpn); - if (slot == -1) - panic("could not find page to bolt\n"); - hptep = htab_data.htab + slot; + BUG_ON(slot == -1); - set_pp_bit(newpp, hptep); + hptep = htab_data.htab + slot; + hptep->dw0.dw0.v = valid; /* Ensure it is out of the tlb too */ spin_lock_irqsave(&pSeries_tlbie_lock, flags); @@ -376,7 +373,7 @@ void hpte_init_pSeries(void) ppc_md.hpte_invalidate = pSeries_hpte_invalidate; ppc_md.hpte_updatepp = pSeries_hpte_updatepp; - ppc_md.hpte_updateboltedpp = pSeries_hpte_updateboltedpp; + ppc_md.hpte_updatevalid = pSeries_hpte_updatevalid; ppc_md.hpte_insert = pSeries_hpte_insert; ppc_md.hpte_remove = pSeries_hpte_remove; diff -puN arch/ppc64/kernel/pSeries_lpar.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/pSeries_lpar.c --- foobar/arch/ppc64/kernel/pSeries_lpar.c~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.945538060 +1100 +++ foobar-anton/arch/ppc64/kernel/pSeries_lpar.c 2004-02-21 13:58:16.003533680 +1100 @@ -487,8 +487,7 @@ static long pSeries_lpar_hpte_find(unsig return -1; } -static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp, - unsigned 
long ea) +static void pSeries_lpar_hpte_updatevalid(unsigned long valid, unsigned long ea) { unsigned long lpar_rc; unsigned long vsid, va, vpn, flags; @@ -499,11 +498,9 @@ static void pSeries_lpar_hpte_updatebolt vpn = va >> PAGE_SHIFT; slot = pSeries_lpar_hpte_find(vpn); - if (slot == -1) - panic("updateboltedpp: Could not find page to bolt\n"); + BUG_ON(slot == -1); - flags = newpp & 3; - lpar_rc = plpar_pte_protect(flags, slot, 0); + /* XXX FIXME */ if (lpar_rc != H_Success) panic("Bad return code from pte bolted protect rc = %lx\n", @@ -555,7 +552,7 @@ void pSeries_lpar_mm_init(void) { ppc_md.hpte_invalidate = pSeries_lpar_hpte_invalidate; ppc_md.hpte_updatepp = pSeries_lpar_hpte_updatepp; - ppc_md.hpte_updateboltedpp = pSeries_lpar_hpte_updateboltedpp; + ppc_md.hpte_updatevalid = pSeries_lpar_hpte_updatevalid; ppc_md.hpte_insert = pSeries_lpar_hpte_insert; ppc_md.hpte_remove = pSeries_lpar_hpte_remove; ppc_md.flush_hash_range = pSeries_lpar_flush_hash_range; diff -puN arch/ppc64/mm/hash_utils.c~ppc64-config_pagealloc_debug arch/ppc64/mm/hash_utils.c --- foobar/arch/ppc64/mm/hash_utils.c~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.951537607 +1100 +++ foobar-anton/arch/ppc64/mm/hash_utils.c 2004-02-21 13:58:16.005533529 +1100 @@ -357,3 +357,14 @@ void __init htab_finish_init(void) make_bl(htab_call_hpte_remove, ppc_md.hpte_remove); make_bl(htab_call_hpte_updatepp, ppc_md.hpte_updatepp); } + +#ifdef CONFIG_DEBUG_PAGEALLOC +void kernel_map_pages(struct page *page, int numpages, int enable) +{ + int i; + + for (i = 0; i < numpages; i++) + ppc_md.hpte_updatevalid(enable, + (unsigned long)page_address(page) + PAGE_SIZE * i); +} +#endif diff -puN arch/ppc64/mm/init.c~ppc64-config_pagealloc_debug arch/ppc64/mm/init.c --- foobar/arch/ppc64/mm/init.c~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.958537079 +1100 +++ foobar-anton/arch/ppc64/mm/init.c 2004-02-21 14:51:33.299823789 +1100 @@ -666,7 +666,7 @@ void __init mm_init_ppc64(void) for (index = 0; index < 
NR_CPUS; index++) { lpaca = &paca[index]; guard_page = ((unsigned long)lpaca) + 0x1000; - ppc_md.hpte_updateboltedpp(PP_RXRX, guard_page); + ppc_md.hpte_updatevalid(0, guard_page); } ppc64_boot_msg(0x100, "MM Init Done"); diff -puN include/asm-ppc64/cacheflush.h~ppc64-config_pagealloc_debug include/asm-ppc64/cacheflush.h --- foobar/include/asm-ppc64/cacheflush.h~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.963536701 +1100 +++ foobar-anton/include/asm-ppc64/cacheflush.h 2004-02-21 13:58:16.009533227 +1100 @@ -32,4 +32,9 @@ do { memcpy(dst, src, len); \ extern void __flush_dcache_icache(void *page_va); +#ifdef CONFIG_DEBUG_PAGEALLOC +/* internal debugging function */ +void kernel_map_pages(struct page *page, int numpages, int enable); +#endif + #endif /* _PPC64_CACHEFLUSH_H */ diff -puN include/asm-ppc64/cputable.h~ppc64-config_pagealloc_debug include/asm-ppc64/cputable.h --- foobar/include/asm-ppc64/cputable.h~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.969536248 +1100 +++ foobar-anton/include/asm-ppc64/cputable.h 2004-02-21 13:58:16.010533152 +1100 @@ -139,8 +139,11 @@ extern firmware_feature_t firmware_featu CPU_FTR_TLBIEL | CPU_FTR_NOEXECUTE | \ CPU_FTR_NODSISRALIGN) -/* iSeries doesn't support large pages */ -#ifdef CONFIG_PPC_ISERIES +/* + * iSeries doesn't support large pages and we cant use large pages when + * page alloc debug is enabled + */ +#if defined(CONFIG_PPC_ISERIES) || defined(CONFIG_DEBUG_PAGEALLOC) #define CPU_FTR_PPCAS_ARCH_V2 (CPU_FTR_PPCAS_ARCH_V2_BASE) #else #define CPU_FTR_PPCAS_ARCH_V2 (CPU_FTR_PPCAS_ARCH_V2_BASE | CPU_FTR_16M_PAGE) diff -puN include/asm-ppc64/machdep.h~ppc64-config_pagealloc_debug include/asm-ppc64/machdep.h --- foobar/include/asm-ppc64/machdep.h~ppc64-config_pagealloc_debug 2004-02-21 13:58:15.975535795 +1100 +++ foobar-anton/include/asm-ppc64/machdep.h 2004-02-21 13:58:16.011533076 +1100 @@ -40,8 +40,8 @@ struct machdep_calls { unsigned long va, int large, int local); - void (*hpte_updateboltedpp)(unsigned long 
newpp, - unsigned long ea); + void (*hpte_updatevalid)(unsigned long valid, + unsigned long ea); long (*hpte_insert)(unsigned long hpte_group, unsigned long va, unsigned long prpn, diff -puN mm/slab.c~ppc64-config_pagealloc_debug mm/slab.c diff -puN -L arch/ppc64/mm/ash_utils.c /dev/null /dev/null _ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Tue Feb 24 11:41:02 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Mon, 23 Feb 2004 18:41:02 -0600 (CST) Subject: TCE changes pushed... Message-ID: I just pushed a big changeset to ameslab, consisting of the TCE rewrite. While I have tried to build it in all ways imaginable, there's still a risk I missed a driver that needed changes to build. Also, for those of you maintaining/developing the VIO drivers, the following changes should be noted: * is no more. Use the global instead. * Likewise, has been renamed to . If your driver can't find NO_TCE (or other defines), check your includes. * There's been a bunch of renames. TceTable is now iommu_table, and the tce_table structure members have been renamed accordingly. Most things have a 1-1 mapping, so it's just a matter of figuring out the renames. Let me know if anything is unclear (or if there's any other build breaks or strange behaviour). Thanks, Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From marco.killijian at laas.fr Tue Feb 24 21:04:11 2004 From: marco.killijian at laas.fr (Marc-Olivier Killijian) Date: Tue, 24 Feb 2004 11:04:11 +0100 Subject: People with ATAPI problems: possible fix In-Reply-To: <1076813298.853.2.camel@gaston> References: <1076813298.853.2.camel@gaston> Message-ID: <1077617051.1054.15.camel@tsfmok> Hi, I used to have the problem described by Branden Robinson in a previous mail on DebianPPC: a Samsung CDRW/DVD that would neither eject nor play discs, nor do anything else. I patched a 2.4.24-ben1 kernel with your code and now the CD features are working fine. I haven't tested burning CDs or playing DVDs yet. Cheers, Marco On Sun, 15/02/2004 at 03:48, Benjamin Herrenschmidt wrote: > > Recently, there have been lots of reports of ATAPI issues, typically > with CD/DVD burners, where DMA would be disabled for no obvious reasons, > with the driver spitting a message about a timeout waiting for dbdma > command stop. > > The problem is that our driver doesn't deal with buffer underruns due > to the drive not sending as much data as requested. > > This patch is an attempt at fixing this. Please let me know if it > helps. [deleted] ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 25 02:57:19 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 24 Feb 2004 09:57:19 -0600 Subject: [PATCH] rpaphp driver changes for vio and multifunction cards support-please review In-Reply-To: <403AC039.3020903@ltcfwd.linux.ibm.com> References: <403AC039.3020903@ltcfwd.linux.ibm.com> Message-ID: <20697AC2-66E2-11D8-B1E2-000A95A0560C@us.ibm.com> On Feb 23, 2004, at 9:08 PM, Linda Xie wrote: > Any comments would be much appreciated. In register_slot(), what is this code doing?
> case VIO_DEV: > /* create symlink between slot->name and it's uni-address > */ > vio_uni_addr = strchr(slot->dn->full_name, '@'); > if (!vio_uni_addr) > return (1); > retval = sysfs_create_link(slot->hotplug_slot->kobj.parent, > &slot->hotplug_slot->kobj, > vio_uni_addr); Can you show me tree output? I don't know what these sysfs directories look like. Whatever this symlink is, do we *need* it? If not I'd say take it out, as userland will start depending on it and then we'll be stuck with it forever. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 25 03:15:19 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 24 Feb 2004 10:15:19 -0600 Subject: [PATCH] rpaphp driver changes for vio and multifunction cards support-please review In-Reply-To: <403AC039.3020903@ltcfwd.linux.ibm.com> References: <403AC039.3020903@ltcfwd.linux.ibm.com> Message-ID: rpaphp_get_vio_adapter_status() is extern in rpaphp.h but inline in rpaphp_vio.c. It can't be inlined because it's called from rpaphp_core.c. Other functions, like setup_vio_hotplug_slot_info(), are marked inline (in rpaphp.h), are not static (in rpaphp_vio.c), but aren't used by anyone else. All such functions (in all these files) should be made static (and removed from rpaphp.h, since nobody else needs to know about them). -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olh at suse.de Wed Feb 25 05:21:20 2004 From: olh at suse.de (Olaf Hering) Date: Tue, 24 Feb 2004 19:21:20 +0100 Subject: 2.6.3: Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 Message-ID: <20040224182120.GA4026@suse.de> This is with the ameslab tree as of Sunday evening, on a 6way p660. 
papaya:~ # Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 in_atomic():0, irqs_disabled():1 cpu 2: Vector: 300 (Data Access) at [c0000000864c7980] pc: c000000000091dd0 (.kmem_cache_alloc+0x54/0xc0) lr: c000000000091e2c (.kmem_cache_alloc+0xb0/0xc0) sp: c0000000864c7c00 msr: a000000000001032 dar: 2777d6650 dsisr: 40000000 current = 0xc000000021306d80 paca = 0xc000000000444000 pid = 6280, comm = ld64.so.1 press ? for help 2:mon> Call Trace: [c000000000044718] .do_page_fault+0x10c/0x514 [c00000000000aa94] stab_bolted_user_return+0x118/0x11c [c000000000091e2c] .kmem_cache_alloc+0xb0/0xc0 [c00000000020ce80] .idr_pre_get+0x58/0x144 [c000000000079644] .sys_timer_create+0x10c/0x624 [c00000000000f014] .ret_from_syscall_1+0x0/0xa4 NETDEV WATCHDOG: eth1: transmit timed out eth1: transmit timed out, status 06f3, resetting. sym0:4:0: ABORT operation started. -- USB is for mice, FireWire is for men! sUse lINUX ag, nÜRNBERG ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Feb 25 07:35:11 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 24 Feb 2004 14:35:11 -0600 Subject: [PATCH] xmon command for symbol lookups Message-ID: <403BB57F.7010200@austin.ibm.com> One of the most crippling drawbacks (for me) with xmon is that you need a System.map to lookup all addresses in when you're digging through disassembly. The attached patch adds a new command, "n
" that will show the corresponding symbol in the same way that the "t" command does. It also fixes another thing that's irritated me before: xmon doesn't take values in 0x form, only . 3:mon> t c00000007e4978b0 c00000000004d894 .xmon+0x15c/0x358 c00000007e497a90 c00000000004d2e0 .sysrq_handle_xmon+0x5c/0x64 c00000007e497b20 c00000000024d7c4 .__handle_sysrq_nolock+0xe0/0x184 c00000007e497bd0 c00000000024d6c0 .handle_sysrq+0x70/0x94 c00000007e497c60 c000000000102dd0 .write_sysrq_trigger+0x88/0xac c00000007e497cf0 c0000000000b4464 .vfs_write+0x10c/0x164 c00000007e497d90 c0000000000b45a0 .sys_write+0x50/0x94 c00000007e497e30 c0000000000119bc ret_from_syscall_1 exception: c00 (System Call) regs c00000007e497ea0 000000000ff27b2c 3:mon> di c00000000024d6c0 c00000000024d6c0 60000000 nop c00000000024d6c4 38210090 addi r1,r1,144 c00000000024d6c8 e8010010 ld r0,16(r1) c00000000024d6cc ebe1fff8 ld r31,-8(r1) c00000000024d6d0 eb81ffe0 ld r28,-32(r1) c00000000024d6d4 eba1ffe8 ld r29,-24(r1) c00000000024d6d8 ebc1fff0 ld r30,-16(r1) c00000000024d6dc 7c0803a6 mtlr r0 c00000000024d6e0 4bfffeac b 0xc00000000024d58c c00000000024d6e4 fbc1fff0 std r30,-16(r1) c00000000024d6e8 ebc2c5d8 ld r30,-14888(r2) c00000000024d6ec 7c0802a6 mflr r0 c00000000024d6f0 fb41ffd0 std r26,-48(r1) c00000000024d6f4 fb61ffd8 std r27,-40(r1) c00000000024d6f8 7cba2b78 mr r26,r5 c00000000024d6fc 7c9b2378 mr r27,r4 3:mon> n 0xc00000000024d58c c00000000024d58c: .__sysrq_unlock_table+0x0/0x30 3:mon> Another enhancement would be to simply print the symbol as a comment behind the address when doing disasm, but that would make the output wider than 80 characters. Does anyone have problems with that or should I add that as well? Thanks, Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: xmon-n-cmd Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040224/0d049fe1/attachment.txt From hollisb at us.ibm.com Wed Feb 25 09:31:13 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 24 Feb 2004 16:31:13 -0600 Subject: [PATCH] rpadlpar changes for DLPAR VIO devices -- please review In-Reply-To: <403BC5DA.4010705@ltcfwd.linux.ibm.com> References: <403BC5DA.4010705@ltcfwd.linux.ibm.com> Message-ID: <270FFA96-6719-11D8-8F7A-000A95A0560C@us.ibm.com> On Feb 24, 2004, at 3:44 PM, Linda Xie wrote: > + if (strstr(drc_name, "-V")) > + dn = find_php_slot_vio_node(drc_name); > + else > + dn = find_php_slot_pci_node(drc_name); I'm not sure this is safe as a canonical test. Maybe sprintf("%s-V%i-D%i", ...)? Is this namespace defined and documented somewhere? I just don't feel comfortable with a two-character test determining it one way or the other. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Feb 25 09:46:54 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 25 Feb 2004 09:46:54 +1100 Subject: [PATCH] xmon command for symbol lookups In-Reply-To: <403BB57F.7010200@austin.ibm.com> References: <403BB57F.7010200@austin.ibm.com> Message-ID: <1077662814.965.26.camel@gaston> On Wed, 2004-02-25 at 07:35, Olof Johansson wrote: > One of the most crippling drawbacks (for me) with xmon is that you need > a System.map to lookup all addresses in when you're digging through > disassembly. > > The attached patch adds a new command, "n
" that will show the > corresponding symbol in the same way that the "t" command does. > > It also fixes another thing that's irritated me before: xmon doesn't > take values in 0x form, only . Hrm... well... I have this on ppc32 already, but used different commmands: la and ls (lookup address and lookup symbol), also I have added the ability to use a symbol (preceded by the $) every time you can enter a number (like you can use % with a register name). What about making xmon in sync ? :) Another _real cool_ feature is a dmesg in xmon btw :) > 3:mon> t > c00000007e4978b0 c00000000004d894 .xmon+0x15c/0x358 > c00000007e497a90 c00000000004d2e0 .sysrq_handle_xmon+0x5c/0x64 > c00000007e497b20 c00000000024d7c4 .__handle_sysrq_nolock+0xe0/0x184 > c00000007e497bd0 c00000000024d6c0 .handle_sysrq+0x70/0x94 > c00000007e497c60 c000000000102dd0 .write_sysrq_trigger+0x88/0xac > c00000007e497cf0 c0000000000b4464 .vfs_write+0x10c/0x164 > c00000007e497d90 c0000000000b45a0 .sys_write+0x50/0x94 > c00000007e497e30 c0000000000119bc ret_from_syscall_1 > exception: c00 (System Call) regs c00000007e497ea0 > 000000000ff27b2c > > 3:mon> di c00000000024d6c0 > c00000000024d6c0 60000000 nop > c00000000024d6c4 38210090 addi r1,r1,144 > c00000000024d6c8 e8010010 ld r0,16(r1) > c00000000024d6cc ebe1fff8 ld r31,-8(r1) > c00000000024d6d0 eb81ffe0 ld r28,-32(r1) > c00000000024d6d4 eba1ffe8 ld r29,-24(r1) > c00000000024d6d8 ebc1fff0 ld r30,-16(r1) > c00000000024d6dc 7c0803a6 mtlr r0 > c00000000024d6e0 4bfffeac b 0xc00000000024d58c > c00000000024d6e4 fbc1fff0 std r30,-16(r1) > c00000000024d6e8 ebc2c5d8 ld r30,-14888(r2) > c00000000024d6ec 7c0802a6 mflr r0 > c00000000024d6f0 fb41ffd0 std r26,-48(r1) > c00000000024d6f4 fb61ffd8 std r27,-40(r1) > c00000000024d6f8 7cba2b78 mr r26,r5 > c00000000024d6fc 7c9b2378 mr r27,r4 > 3:mon> n 0xc00000000024d58c > c00000000024d58c: .__sysrq_unlock_table+0x0/0x30 > 3:mon> > > > > > > Another enhancement would be to simply print the symbol as a comment > behind the 
address when doing disasm, but that would make the output > wider than 80 characters. Does anyone have problems with that or should > I add that as well? > > > Thanks, > > Olof > > -- > Olof Johansson Office: 4F005/905 > pSeries Linux Development IBM Systems Group > Email: olof at austin.ibm.com Phone: 512-838-9858 > All opinions are my own and not those of IBM > > ______________________________________________________________________ > ===== arch/ppc64/xmon/xmon.c 1.35 vs edited ===== > --- 1.35/arch/ppc64/xmon/xmon.c Sun Feb 15 15:23:37 2004 > +++ edited/arch/ppc64/xmon/xmon.c Tue Feb 24 14:25:34 2004 > @@ -83,6 +83,7 @@ > static void dump(void); > static void prdump(unsigned long, long); > static int ppc_inst_dump(unsigned long, long); > +static void lookupsymbol(void); > void print_address(unsigned long); > static int getsp(void); > static void backtrace(struct pt_regs *); > @@ -167,6 +168,7 @@ > ml locate a block of memory\n\ > mz zero a block of memory\n\ > mi show information about memory allocation\n\ > + n lookup address -> symbol name\n\ > p show the task list\n\ > r print registers\n\ > s single step\n\ > @@ -537,6 +539,9 @@ > case 'd': > dump(); > break; > + case 'n': > + lookupsymbol(); > + break; > case 'r': > if (excp != NULL) > prregs(excp); /* print regs */ > @@ -1644,6 +1649,23 @@ > printf("0x%lx", addr); > } > > + > +void > +lookupsymbol(void) > +{ > + int c; > + > + c = inchar(); > + if (c == '\n') > + termch = c; > + scanhex((void *)&adrs); > + if( termch != '\n') > + termch = 0; > + printf("%016lx: ", adrs); > + xmon_print_symbol("%s\n", adrs); > +} > + > + > /* > * Memory operations - move, set, print differences > */ > @@ -1820,6 +1842,14 @@ > } > printf("invalid register name '%%%s'\n", regname); > return 0; > + } > + > + /* skip leading "0x" if any */ > + > + if (c == '0') { > + c = inchar(); > + if (c == 'x') > + c = inchar(); > } > > d = hexdigit(c); -- Benjamin Herrenschmidt ** Sent via the linuxppc64-dev mail list. 
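As an aside on the "0x" handling in the quoted hunk, the following is a stand-alone sketch of hex parsing with an optional prefix. It is not xmon code: parse_hex and hex_value are hypothetical names, and it reads from a string rather than through inchar().

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Hypothetical stand-in for a scanhex()-style routine: parse hex digits
 * from s into *out, accepting an optional "0x" prefix.
 * Returns 1 on success, 0 if no digits were found.
 */
int parse_hex(const char *s, unsigned long *out)
{
	unsigned long v = 0;
	int digits = 0;

	/* Strip "0x" only when a hex digit follows it. */
	if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X') &&
	    hex_value(s[2]) >= 0)
		s += 2;

	while (hex_value(*s) >= 0) {
		v = (v << 4) + (unsigned long)hex_value(*s);
		s++;
		digits++;
	}

	if (digits == 0)
		return 0;
	*out = v;
	return 1;
}
```

The sketch strips the prefix only when a hex digit follows, so a lone "0" still parses as zero. Whether the quoted hunk behaves the same way for a lone "0" depends on the rest of scanhex(), which is not shown in the patch context.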
See http://lists.linuxppc.org/ From rsa at us.ibm.com Wed Feb 25 09:52:05 2004 From: rsa at us.ibm.com (Ryan Arnold) Date: 24 Feb 2004 16:52:05 -0600 Subject: hvcs driver (with 80 character columns + revisions) revised :WAS Re: New driver (hvcs) review request In-Reply-To: <1077577919.5940.90.camel@gaston> References: <1077548434.933.15.camel@SigurRos.rchland.ibm.com> <1077577919.5940.90.camel@gaston> Message-ID: <1077663127.21201.5.camel@SigurRos.rchland.ibm.com> On Mon, 2004-02-23 at 17:11, Benjamin Herrenschmidt wrote: > Hi Ryan. Before somebody dives into the code per-se, could you > reformat the driver properly ? Normally, linux code is supposed > to fit in 80 columns. We can accept exceptions, but not a whole > driver using more than 132 cols :) For the 22" monitor challenged here is a revised version of this driver which has columns of no greater than 80 characters. Additionally, this driver contains some first pass revisions that should make the driver more readable thanks to Dave Hansen and Ben H. http://www-124.ibm.com/linux/patches/misc/hvcs-20040224.diff Thanks Ryan S. Arnold IBM LTC, Rochester MN. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Wed Feb 25 09:55:03 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Tue, 24 Feb 2004 16:55:03 -0600 Subject: [PATCH] xmon command for symbol lookups In-Reply-To: <1077662814.965.26.camel@gaston> References: <403BB57F.7010200@austin.ibm.com> <1077662814.965.26.camel@gaston> Message-ID: <403BD647.5060509@austin.ibm.com> Benjamin Herrenschmidt wrote: > Hrm... well... I have this on ppc32 already, but used different > commmands: la and ls (lookup address and lookup symbol), also I > have added the ability to use a symbol (preceded by the $) every > time you can enter a number (like you can use % with a register > name). What about making xmon in sync ? :) Ah, I was reinventing the wheel! I'll bring that stuff over from ppc32 instead then. 
> Another _real cool_ feature is a dmesg in xmon btw :) Yeah, I can look at that next. -Olof -- Olof Johansson Office: 4F005/905 pSeries Linux Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Feb 25 10:05:06 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 25 Feb 2004 10:05:06 +1100 Subject: [PATCH] xmon command for symbol lookups In-Reply-To: <403BD647.5060509@austin.ibm.com> References: <403BB57F.7010200@austin.ibm.com> <1077662814.965.26.camel@gaston> <403BD647.5060509@austin.ibm.com> Message-ID: <20040224230506.GE5801@krispykreme> > Ah, I was reinventing the wheel! I'll bring that stuff over from ppc32 > instead then. Not completely, ppc32 xmon doesnt use kallsyms. I do like the idea of doing lookups in disassembly too, Im not sure if ppc32 does that. > > Another _real cool_ feature is a dmesg in xmon btw :) > Yeah, I can look at that next. kdb adds all these nasty hooks into kernel/printk.c. I wonder if we cant make the relevant symbols not static so xmon, kdb, kgdb etc can get at them. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Feb 25 10:06:30 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 25 Feb 2004 10:06:30 +1100 Subject: [PATCH] xmon command for symbol lookups In-Reply-To: <20040224230506.GE5801@krispykreme> References: <403BB57F.7010200@austin.ibm.com> <1077662814.965.26.camel@gaston> <403BD647.5060509@austin.ibm.com> <20040224230506.GE5801@krispykreme> Message-ID: <1077663989.1105.28.camel@gaston> On Wed, 2004-02-25 at 10:05, Anton Blanchard wrote: > > Ah, I was reinventing the wheel! I'll bring that stuff over from ppc32 > > instead then. > > Not completely, ppc32 xmon doesnt use kallsyms. 
I do like the idea of > doing lookups in disassembly too, Im not sure if ppc32 does that. No, it doesn't and that's a good feature. > > > Another _real cool_ feature is a dmesg in xmon btw :) > > Yeah, I can look at that next. > > kdb adds all these nasty hooks into kernel/printk.c. I wonder if we cant > make the relevant symbols not static so xmon, kdb, kgdb etc can get at > them. Yah, no need for hook, not even for non-static crap btw, we can use kallsyms to locate the symbols :) Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 25 10:20:52 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 24 Feb 2004 17:20:52 -0600 Subject: revised hvcs driver In-Reply-To: <1077663127.21201.5.camel@SigurRos.rchland.ibm.com> References: <1077548434.933.15.camel@SigurRos.rchland.ibm.com> <1077577919.5940.90.camel@gaston> <1077663127.21201.5.camel@SigurRos.rchland.ibm.com> Message-ID: <16F915D3-6720-11D8-8F7A-000A95A0560C@us.ibm.com> On Feb 24, 2004, at 4:52 PM, Ryan Arnold wrote: > > http://www-124.ibm.com/linux/patches/misc/hvcs-20040224.diff One more comment: don't add things to hvconsole.h unless you want/need other code to be able to use it (which is the case for the prototypes currently in hvconsole.h). Your constants, struct, and prototypes probably don't fall into that category. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Wed Feb 25 10:40:17 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 25 Feb 2004 10:40:17 +1100 Subject: [PATCH] xmon command for symbol lookups In-Reply-To: <1077663989.1105.28.camel@gaston> References: <403BB57F.7010200@austin.ibm.com> <1077662814.965.26.camel@gaston> <403BD647.5060509@austin.ibm.com> <20040224230506.GE5801@krispykreme> <1077663989.1105.28.camel@gaston> Message-ID: <1077666016.1128.30.camel@gaston> On Wed, 2004-02-25 at 10:06, Benjamin Herrenschmidt wrote: > > Not completely, ppc32 xmon doesnt use kallsyms. I do like the idea of > > doing lookups in disassembly too, Im not sure if ppc32 does that. > > No, it doesn't and that's a good feature. I mean that is a good feature to add :) Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 25 11:23:13 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 24 Feb 2004 18:23:13 -0600 Subject: [ppc64 patch] rename virtual IO sysfs directory Message-ID: This names the virtual IO devices sysfs directory /sys/devices/vio, to match /sys/bus/vio. Linus, please apply. ===== arch/ppc64/kernel/vio.c 1.14 vs edited ===== --- 1.14/arch/ppc64/kernel/vio.c Tue Feb 24 16:00:14 2004 +++ edited/arch/ppc64/kernel/vio.c Tue Feb 24 18:30:39 2004 @@ -149,7 +149,7 @@ return 1; } memset(vio_bus_device, 0, sizeof(struct vio_dev)); - strcpy(vio_bus_device->dev.bus_id, "vdevice"); + strcpy(vio_bus_device->dev.bus_id, "vio"); err = device_register(&vio_bus_device->dev); if (err) { -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Feb 25 11:38:14 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 24 Feb 2004 18:38:14 -0600 Subject: [ppc64 patch] rename virtual IO sysfs directory Message-ID: [Bad address book first time.] 
This names the virtual IO devices sysfs directory /sys/devices/vio, to match /sys/bus/vio. Linus, please apply. ===== arch/ppc64/kernel/vio.c 1.14 vs edited ===== --- 1.14/arch/ppc64/kernel/vio.c Tue Feb 24 16:00:14 2004 +++ edited/arch/ppc64/kernel/vio.c Tue Feb 24 18:30:39 2004 @@ -149,7 +149,7 @@ return 1; } memset(vio_bus_device, 0, sizeof(struct vio_dev)); - strcpy(vio_bus_device->dev.bus_id, "vdevice"); + strcpy(vio_bus_device->dev.bus_id, "vio"); err = device_register(&vio_bus_device->dev); if (err) { -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Wed Feb 25 20:21:23 2004 From: anton at samba.org (Anton Blanchard) Date: Wed, 25 Feb 2004 20:21:23 +1100 Subject: [PATCH] ppc64 procfs cleanup Message-ID: <20040225092123.GH5801@krispykreme> Hi, Olaf pointed me at an issue with our current procfs code and modules. I decided to go through all of our code and clean things up. Patch is at http://samba.org/~anton/fixup_procfs.patch - Use initcalls everywhere. This allowed us to remove the iseries proc callback interface - Kill proc_pmc.c. Most of it wasnt used (and we are planning to export the PMCs via sysfs). The few things left were iseries specific so they got moved into iSeries_proc.c - Kill pmc.c. We dont use those statistics and the ones that are left can be gained via PMCs. - Create /proc/iSeries and /proc/ppc64 very early. This means we no longer have to call proc_ppc64_init in all the drivers, we can assume its there. - Fix some error return cases in rtas-proc.c and rtas-flash - Dont even try some pseries specific drivers on mac. Im planning to merge this tomorrow unless there are any objections. Anton ** Sent via the linuxppc64-dev mail list. 
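The "create the directories very early" point in the procfs cleanup above relies on initcalls running level by level. Here is a small user-space model of that ordering, offered as a sketch only: the level names mirror the kernel's core, subsys and device initcall levels, but the table, the fake_* functions and run_initcalls() are all hypothetical and greatly simplified.

```c
#include <stddef.h>

/* Simplified subset of initcall levels, earliest first. */
enum level { LEVEL_CORE, LEVEL_SUBSYS, LEVEL_DEVICE, LEVEL_COUNT };

typedef int (*initcall_t)(void);

struct initcall {
	enum level level;
	initcall_t fn;
};

static int proc_dir_ready;	/* models "/proc/ppc64 exists" */
int driver_saw_dir;		/* what the driver observed at init time */

/* Hypothetical early init: creates the shared directory. */
static int fake_proc_create(void)
{
	proc_dir_ready = 1;
	return 0;
}

/* Hypothetical driver init: assumes the directory is already there. */
static int fake_driver_init(void)
{
	driver_saw_dir = proc_dir_ready;
	return proc_dir_ready ? 0 : -1;
}

/* Deliberately listed out of order; the level decides when each runs. */
static const struct initcall calls[] = {
	{ LEVEL_DEVICE, fake_driver_init },
	{ LEVEL_CORE,   fake_proc_create },
};

/* Run every call level by level; returns the number of failed calls. */
int run_initcalls(void)
{
	int failures = 0;
	int lvl;
	size_t i;

	for (lvl = 0; lvl < LEVEL_COUNT; lvl++)
		for (i = 0; i < sizeof(calls) / sizeof(calls[0]); i++)
			if ((int)calls[i].level == lvl && calls[i].fn() != 0)
				failures++;
	return failures;
}
```

In this model the driver init succeeds only because the directory creation sits at an earlier level. If both were registered at the same level, the outcome would depend on their position in the table.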
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Wed Feb 25 21:19:31 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Wed, 25 Feb 2004 04:19:31 -0600 Subject: [PATCH] ppc64 procfs cleanup In-Reply-To: <20040225092123.GH5801@krispykreme> References: <20040225092123.GH5801@krispykreme> Message-ID: <403C76B3.9000108@austin.ibm.com> > - Create /proc/iSeries and /proc/ppc64 very early. This means we no > longer have to call proc_ppc64_init in all the drivers, we can > assume its there. +__initcall(proc_ppc64_init); What guarantee is there that proc_ppc64_init will run before the init routines for scanlog, rtas, etc? Perhaps subsys_initcall(proc_ppc64_init) would be better? Nathan ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Feb 26 00:49:08 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 26 Feb 2004 00:49:08 +1100 Subject: [PATCH] ppc64 procfs cleanup In-Reply-To: <403C76B3.9000108@austin.ibm.com> References: <20040225092123.GH5801@krispykreme> <403C76B3.9000108@austin.ibm.com> Message-ID: <20040225134908.GJ5801@krispykreme> Hi Nathan Are you trying to rival me for worst hours kept? :) > +__initcall(proc_ppc64_init); > > What guarantee is there that proc_ppc64_init will run before the init > routines for scanlog, rtas, etc? Perhaps > subsys_initcall(proc_ppc64_init) would be better? Ill see your subsys_initcall and raise you a core_initcall. The stuff in proc_ppc64_init shouldnt have any dependencies, I broke off the important bits into proc_ppc64_create which is a core_initcall. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Feb 26 01:55:14 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 26 Feb 2004 01:55:14 +1100 Subject: thrashing 2.6 ameslab Message-ID: <20040225145514.GL5801@krispykreme> Hi, I just got the following WARN_ON when pounding the machine with bash-shared-mappings. Its part of the tlb flush rework. 
Paul originally had something in there to recognise the mm had changed and to fix it up silently. I changed it to WARN_ON thinking it shouldn't occur normally (an mm changing in the middle of a batch). However, looking at this trace I guess it can. Looks like we encountered memory pressure in copy_page_range and ended up scanning pages doing page_referenced. The problem is, copy_page_range is in the middle of a tlb batch... I'll restore the silent flush in arch/ppc64/mm/tlb.c. Anton
Badness in hpte_update at arch/ppc64/mm/tlb.c:71
Call Trace:
[c0000000000a32cc] .page_referenced+0x10c/0x25c
[c000000000095a14] .refill_inactive_zone+0xa7c/0xb2c
[c000000000095b60] .shrink_zone+0x9c/0xd4
[c000000000095d04] .shrink_caches+0x16c/0x194
[c000000000095e28] .try_to_free_pages+0xfc/0x224
[c00000000008a29c] .__alloc_pages+0x254/0x438
[c00000000008a4b4] .__get_free_pages+0x34/0x80
[c00000000008f10c] .cache_grow+0x158/0x528
[c00000000008f7c4] .cache_alloc_refill+0x2e8/0x398
[c00000000008fc7c] .kmem_cache_alloc+0xac/0xc0
[c00000000009cec0] .__pmd_alloc+0x60/0x14c
[c000000000098b4c] .copy_page_range+0x690/0x768
[c000000000058948] .copy_mm+0x62c/0x730
[c000000000059920] .copy_process+0x71c/0xfd4
[c00000000005a21c] .do_fork+0x44/0x210
[c000000000016ba8] .sys_fork+0x28/0x40
[c0000000000117d4] .ret_from_syscall_1+0x0/0xa4
** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Thu Feb 26 04:21:52 2004 From: johnrose at austin.ibm.com (John Rose) Date: Wed, 25 Feb 2004 11:21:52 -0600 Subject: [PATCH] ppc64 procfs cleanup In-Reply-To: References: Message-ID: <1077729711.10733.12.camel@verve.austin.ibm.com> Hi Anton- From rtas_flash.c: + if (rtas_token("ibm,update-flash-64-and-reboot") == + RTAS_UNKNOWN_SERVICE) { + printk(KERN_ERR "rtas_flash: no firmware flash support\n"); + return 1; Can we not add this?
:) The current module init creates the three /proc files regardless, and handles the case of "function not supported" with a certain return code upon /proc file read. This allows the userland tool to distinguish between the error cases of "You don't have the module compiled in/loaded" and "You don't have the firmware functionality." -static inline struct proc_dir_entry * create_flash_pde(const char *filename, - struct file_operations *fops) +static struct proc_dir_entry *create_flash_pde(const char *filename, + struct file_operations *fops) For my own education, could you explain when it's appropriate to inline? I was under the impression that functions that could be macros were good candidates. Thanks- John ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Thu Feb 26 04:41:49 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Wed, 25 Feb 2004 11:41:49 -0600 Subject: [PATCH] ppc64 procfs cleanup In-Reply-To: <1077729711.10733.12.camel@verve.austin.ibm.com> References: <1077729711.10733.12.camel@verve.austin.ibm.com> Message-ID: On Feb 25, 2004, at 11:21 AM, John Rose wrote: > For my own education, could you explain when it's appropriate to > inline? I was under the impression that functions that could be macros > were good candidates. It's been observed that too much is being inlined, to the point that we're bigger and slower because we're eating icache. So recent wisdom has been to let the compiler do it except in specific cases that can demonstrate a performance improvement. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olof at austin.ibm.com Thu Feb 26 04:58:36 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Wed, 25 Feb 2004 11:58:36 -0600 (CST) Subject: [PATCH] ppc64 procfs cleanup In-Reply-To: <1077729711.10733.12.camel@verve.austin.ibm.com> Message-ID: On Wed, 25 Feb 2004, John Rose wrote: > -static inline struct proc_dir_entry * create_flash_pde(const char *filename, > - struct file_operations *fops) > +static struct proc_dir_entry *create_flash_pde(const char *filename, > + struct file_operations *fops) > > For my own education, could you explain when it's appropriate to > inline? I was under the impression that functions that could be macros > were good candidates. Inlining should only be done where taking the additional function call adds significant overhead and it's called often enough to impact system performance. In addition to the binary bloat, inlining makes debugging painful since the code will be inserted in the caller and following the flow in disassembly is hard. -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From hollisb at us.ibm.com Thu Feb 26 04:59:42 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Wed, 25 Feb 2004 11:59:42 -0600 Subject: [PATCH] rpadlpar changes for DLPAR VIO devices -- please review In-Reply-To: <403CDE23.4040007@ltcfwd.linux.ibm.com> References: <403BC5DA.4010705@ltcfwd.linux.ibm.com> <270FFA96-6719-11D8-8F7A-000A95A0560C@us.ibm.com> <403CDE23.4040007@ltcfwd.linux.ibm.com> Message-ID: <63620E9A-67BC-11D8-B826-000A95A0560C@us.ibm.com> On Feb 25, 2004, at 11:40 AM, Linda Xie wrote: > Hollis Blanchard wrote: > >> On Feb 24, 2004, at 3:44 PM, Linda Xie wrote: >> >>> + if (strstr(drc_name, "-V")) >>> + dn = find_php_slot_vio_node(drc_name); >>> + else >>> + dn = find_php_slot_pci_node(drc_name); >> >> I'm not sure this is safe as a canonical test. Maybe >> sprintf("%s-V%i-D%i", ...)? Is this namespace defined and documented >> somewhere? I just don't feel comfortable with a two-character test >> determining it one way or the other. > > I agreed [ event with sprintf("%s-V%i-D%i", ...)]. Because the format > can be > changed by FW at anytime, It looks like we have to search OFDT twice > (the worst case) for a given > drc-name: > Call find_php_slot_vio_node(drc_name) first, if return value is NULL, > then > call find_php_slot_pci_node(drc_name). I had trouble understanding that first part, but I agree that it makes sense to search both for vdevice and pci devices for a given location code, rather than try to parse the location code yourself. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Thu Feb 26 09:32:41 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Wed, 25 Feb 2004 16:32:41 -0600 (CST) Subject: Adding kallsyms_lookupname() Message-ID: Rusty, Attached patch adds a kallsyms_lookupname() function for lookups of a symbol name to an address. 
I've attempted to be somewhat efficient and skip all "stems" where the base part of the name doesn't match. I also added a loop through module symbols to try finding the symbol there in case it's not found in the kernel table. That part is not as efficient, but that's OK. Furthermore, it's intentionally not exported as a symbol for module use, since it can be used to circumvent other symbol export restrictions. Please consider for upstream inclusion. Thanks, -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- ===== include/linux/kallsyms.h 1.3 vs edited ===== --- 1.3/include/linux/kallsyms.h Wed Dec 25 21:46:20 2002 +++ edited/include/linux/kallsyms.h Tue Feb 24 21:59:20 2004 @@ -8,6 +8,9 @@ #include #ifdef CONFIG_KALLSYMS +/* Lookup the address of a symbol. Returns 0 if not found. */ +unsigned long kallsyms_lookupname(char *name); + /* Lookup an address. modname is set to NULL if it's in the kernel. */ const char *kallsyms_lookup(unsigned long addr, unsigned long *symbolsize, @@ -18,6 +21,11 @@ extern void __print_symbol(const char *fmt, unsigned long address); #else /* !CONFIG_KALLSYMS */ + +static inline const unsigned long kallsyms_lookupname(unsigned long addr) +{ + return 0; +} static inline const char *kallsyms_lookup(unsigned long addr, unsigned long *symbolsize, ===== kernel/kallsyms.c 1.14 vs edited ===== --- 1.14/kernel/kallsyms.c Sun Aug 31 18:14:13 2003 +++ edited/kernel/kallsyms.c Wed Feb 25 16:29:42 2004 @@ -37,6 +37,58 @@ return 0; } +/* Lookup the address of a symbol. Returns 0 if not found. */ +unsigned long kallsyms_lookupname(char *name) +{ + unsigned long i; + char namebuf[128]; + char *knames = kallsyms_names; + unsigned int namelen = strlen(name); + unsigned long val; + char type; + + /* This kernel should never had been booted. 
*/ + BUG_ON(!kallsyms_addresses); + + namebuf[0] = 0; + + for (i = 0; i < kallsyms_num_syms; i++) { + unsigned prefix = *knames++; + unsigned len = strlen(knames); + + /* Skip over as long as prefix at 0 doesn't match */ + if (!prefix && len <= namelen && + strncmp(knames, name, len)) { + do { + knames += len + 1; + prefix = *knames++; + len = strlen(knames); + i++; + } while (prefix && i < kallsyms_num_syms); + } + + strncpy(namebuf + prefix, knames, 127 - prefix); + + if (prefix + len == namelen && + !strncmp(namebuf, name, namelen)) + return kallsyms_addresses[i]; + knames += len + 1; + } + + /* If not found above, try looking up the name in modules. + * This isn't all that efficient, but performance isn't critical + * here. + */ + i = 0; + while (module_get_kallsym(i++, &val, &type, namebuf)) { + namebuf[127] = 0; /* Just in case */ + if (!strcmp(namebuf, name)) + return val; + } + + return 0; +} + /* Lookup an address. modname is set to NULL if it's in the kernel. */ const char *kallsyms_lookup(unsigned long addr, unsigned long *symbolsize, From olof at austin.ibm.com Thu Feb 26 09:46:05 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Wed, 25 Feb 2004 16:46:05 -0600 (CST) Subject: xmon support for symbol lookup Message-ID: Attached patch adds symbol lookup functions to xmon, similar to what Ben added to ppc32 but not requiring a linked-in system.map. It requires the kallsyms_lookupname patch (see previous post), so I won't push this until I've heard back that it's accepted upstream. Commands added are "la " and "ls
". The syntax $ can also be used to specify addresses, like on ppc32: 0:mon> di $.sysrq_handle_xmon c000000000046b5c 7c0802a6 mflr r0 c000000000046b60 fba1ffe8 std r29,-24(r1) c000000000046b64 7c9d2378 mr r29,r4 c000000000046b68 f8010010 std r0,16(r1) c000000000046b6c f821ff71 stdu r1,-144(r1) c000000000046b70 48004691 bl 0xc00000000004b200 # .xmon_init+0x0 (also notice the added comments in the disasm output with symbols) -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- ===== arch/ppc64/xmon/ppc-dis.c 1.1 vs edited ===== --- 1.1/arch/ppc64/xmon/ppc-dis.c Thu Feb 14 06:14:36 2002 +++ edited/arch/ppc64/xmon/ppc-dis.c Wed Feb 25 11:22:49 2004 @@ -18,6 +18,7 @@ along with this file; see the file COPYING. If not, write to the Free Software Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ +#include #include "nonstdio.h" #include "ansidecl.h" #include "ppc.h" @@ -61,6 +62,7 @@ int invalid; int need_comma; int need_paren; + unsigned long addr; table_op = PPC_OP (opcode->opcode); if (op < table_op) @@ -93,6 +95,7 @@ /* Now extract and print the operands. 
*/ need_comma = 0; need_paren = 0; + addr = 0; for (opindex = opcode->operands; *opindex != 0; opindex++) { long value; @@ -134,9 +137,10 @@ fprintf(out, "r%ld", value); else if ((operand->flags & PPC_OPERAND_FPR) != 0) fprintf(out, "f%ld", value); - else if ((operand->flags & PPC_OPERAND_RELATIVE) != 0) + else if ((operand->flags & PPC_OPERAND_RELATIVE) != 0) { print_address (memaddr + value); - else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0) + addr = memaddr + value; + } else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0) print_address (value & 0xffffffff); else if ((operand->flags & PPC_OPERAND_CR) == 0 || (dialect & PPC_OPCODE_PPC) == 0) @@ -178,6 +182,21 @@ need_paren = 1; } } + + if (addr) { + char namebuf[128]; + const char *name; + char *modname; + long size, offset; + + name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf); + if (name) { + if(modname) + fprintf(out, "\t# [%s]%s+0x%lx", modname, name, offset); + else + fprintf(out, "\t# %s+0x%lx", name, offset); + } + } /* We have found and printed an instruction; return. 
*/ return 4; ===== arch/ppc64/xmon/xmon.c 1.34 vs edited ===== --- 1.34/arch/ppc64/xmon/xmon.c Sat Feb 14 06:40:59 2004 +++ edited/arch/ppc64/xmon/xmon.c Wed Feb 25 11:38:19 2004 @@ -115,6 +115,7 @@ static void csum(void); static void bootcmds(void); void dump_segments(void); +void symbol_lookup(void); static void debug_trace(void); @@ -160,6 +161,8 @@ dd dump double values\n\ e print exception information\n\ f flush cache\n\ + la lookup address\n\ + ls lookup symbol\n\ m examine/change memory\n\ mm move a block of memory\n\ ms set a block of memory\n\ @@ -537,6 +540,9 @@ case 'd': dump(); break; + case 'l': + symbol_lookup(); + break; case 'r': if (excp != NULL) prregs(excp); /* print regs */ @@ -1142,6 +1148,7 @@ extern char exc_prolog; extern char dec_exc; + void super_regs() { @@ -1644,6 +1651,7 @@ printf("0x%lx", addr); } + /* * Memory operations - move, set, print differences */ @@ -1822,8 +1830,40 @@ return 0; } + /* skip leading "0x" if any */ + + if (c == '0') { + c = inchar(); + if (c == 'x') + c = inchar(); + } + + if (c == '0') { + c = inchar(); + if (c == 'x') + c = inchar(); + } else if (c == '$') { + static char symname[64]; + int i; + for (i=0; i<63; i++) { + c = inchar(); + if (isspace(c)) { + termch = c; + break; + } + symname[i] = c; + } + symname[i++] = 0; + *vp = kallsyms_lookupname(symname); + if (!(*vp)) { + printf("unknown symbol '%s'\n", symname); + return 0; + } + return 1; + } + d = hexdigit(c); - if( d == EOF ){ + if (d == EOF) { termch = c; return 0; } @@ -1832,7 +1872,7 @@ v = (v << 4) + d; c = inchar(); d = hexdigit(c); - } while( d != EOF ); + } while (d != EOF); termch = c; *vp = v; return 1; @@ -1907,13 +1947,48 @@ lineptr = str; } + +void +symbol_lookup(void) +{ + int type = inchar(); + unsigned long addr; + static char tmp[64]; + + switch (type) { + case 'a': + if (scanhex(&addr)) { + printf("%lx: ", addr); + xmon_print_symbol("%s\n", addr); + } + termch = 0; + break; + case 's': + getstring(tmp, 64); + if (setjmp(bus_error_jmp) 
== 0) { + __debugger_fault_handler = handle_fault; + sync(); + addr = kallsyms_lookupname(tmp); + if (addr) + printf("%s: %lx\n", tmp, addr); + else + printf("Symbol '%s' not found.\n", tmp); + sync(); + } + __debugger_fault_handler = 0; + termch = 0; + break; + } +} + + /* xmon version of __print_symbol */ void __xmon_print_symbol(const char *fmt, unsigned long address) { char *modname; const char *name; unsigned long offset, size; - char namebuf[128]; + static char namebuf[128]; if (setjmp(bus_error_jmp) == 0) { __debugger_fault_handler = handle_fault; From mjanders at us.ibm.com Thu Feb 26 10:05:27 2004 From: mjanders at us.ibm.com (Michael Anderson) Date: Wed, 25 Feb 2004 17:05:27 -0600 Subject: PCI Interrupts with CONFIG_PPC_ISERIES in linux 2.6.3 Message-ID: It appears that I am not getting interrupts on my iseries machine. I compiled as module the ibmsis, ipr, olympic and icom drivers as modules. The ibmsis, ipr and olympic drivers hung when modprobe was attempted. The icom driver did modprobe successfully, no interrupts are needed for icom to install, though when I attempt to transmit data, which does require interrupts, no interrupts were received and transmit operations hung. I loaded this same driver on a simular pseries install and the icom driver did receive interrupts so this appears to be an iseries issue. Any known problems in this area? I've seen this behavior before on the 2.4 kernel last fall. ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Thu Feb 26 10:05:39 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 26 Feb 2004 10:05:39 +1100 Subject: [PATCH] ppc64 procfs cleanup In-Reply-To: <1077729711.10733.12.camel@verve.austin.ibm.com> References: <1077729711.10733.12.camel@verve.austin.ibm.com> Message-ID: <20040225230539.GM5801@krispykreme> > + if (rtas_token("ibm,update-flash-64-and-reboot") == > + RTAS_UNKNOWN_SERVICE) { > + printk(KERN_ERR "rtas_flash: no firmware flash support\n"); > + return 1; > > Can we not add this? :) The current module init creates the three /proc > files regardless, and handles the case of "function not supported" with > a certain return code upon /proc file read. Im thinking Linus and his G5 here :) We have a bunch of stuff thats pseries specific that G5 should never know about. I just copied how scanlog handles this (doesnt load if you dont have the required RTAS methods). Im open to other suggestions, I guess we could do a platform & PSERIES check. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Feb 26 10:10:55 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 26 Feb 2004 10:10:55 +1100 Subject: PCI Interrupts with CONFIG_PPC_ISERIES in linux 2.6.3 In-Reply-To: References: Message-ID: <20040225231055.GN5801@krispykreme> Hi Mike, > It appears that I am not getting interrupts on my iseries machine. I > compiled as module the ibmsis, ipr, olympic and icom drivers as modules. > The ibmsis, ipr and olympic drivers hung when modprobe was attempted. The > icom driver did modprobe successfully, no interrupts are needed for icom to > install, though when I attempt to transmit data, which does require > interrupts, no interrupts were received and transmit operations hung. I > loaded this same driver on a simular pseries install and the icom driver > did receive interrupts so this appears to be an iseries issue. > > Any known problems in this area? 
I've seen this behavior before on the 2.4 > kernel last fall. Yep I noticed it on our iseries box here, there is a problem with ameslab at the moment. Linus' tree (pauls large IRQ patch got merged yesterday) should work. We'll be looking into the ameslab issue today. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Thu Feb 26 11:38:50 2004 From: paulus at samba.org (Paul Mackerras) Date: Thu, 26 Feb 2004 11:38:50 +1100 Subject: xmon support for symbol lookup In-Reply-To: References: Message-ID: <16445.16410.436976.639488@cargo.ozlabs.ibm.com> Hi Olof, > Attached patch adds symbol lookup functions to xmon, similar to what Ben > added to ppc32 but not requiring a linked-in system.map. Nice :) > It requires the kallsyms_lookupname patch (see previous post), so I won't > push this until I've heard back that it's accepted upstream. > > Commands added are "la " and "ls
". The syntax $ > can also be used to specify addresses, like on ppc32: Hopefully you mean "la
" and "ls ", at least that is what the code seems to implement. > ===== arch/ppc64/xmon/ppc-dis.c 1.1 vs edited ===== > --- 1.1/arch/ppc64/xmon/ppc-dis.c Thu Feb 14 06:14:36 2002 > +++ edited/arch/ppc64/xmon/ppc-dis.c Wed Feb 25 11:22:49 2004 Someone needs to update ppc-dis.c and ppc-opc.c with a more recent version from the BFD library. The one we have seems to be missing some power4 instructions, and also seems to have a 32-bit assumption here: > + } else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0) > print_address (value & 0xffffffff); I don't like this bit, though: > + if (addr) { > + char namebuf[128]; > + const char *name; > + char *modname; > + long size, offset; > + > + name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf); > + if (name) { > + if(modname) > + fprintf(out, "\t# [%s]%s+0x%lx", modname, name, offset); > + else > + fprintf(out, "\t# %s+0x%lx", name, offset); > + } > + } This should be put in the print_address function. (With correct indentation. :) If we minimize the changes we make to ppc_dis.c, it makes it easier to update it from the BFD version as we go along. And here it looks like you are sometimes using 8 spaces to indent, and sometimes a tab: > +void > +symbol_lookup(void) > +{ > + int type = inchar(); > + unsigned long addr; > + static char tmp[64]; > + > + switch (type) { > + case 'a': > + if (scanhex(&addr)) { It would be good if you could use tabs everywhere. Regards, Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Thu Feb 26 12:59:00 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Wed, 25 Feb 2004 19:59:00 -0600 (CST) Subject: xmon support for symbol lookup In-Reply-To: <16445.16410.436976.639488@cargo.ozlabs.ibm.com> Message-ID: Paul, Thanks for your feedback, see below. On Thu, 26 Feb 2004, Paul Mackerras wrote: > Hopefully you mean "la
" and "ls ", at least that is > what the code seems to implement. Ack, yes. Helptext has been updated to clarify too. Whitespace weirdness was partially because I was tring to follow the weird existing style of ppc-dis.c. The space/tab-issue was cut-n-paste garbage. All that has been fixed. I also consolidated some of the static char arrays to use one global instead to save some BSS. Actually, I can push everything but the "ls" functionality soonish if it'll take a while to get the kallsyms change into mainline. -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM > > > ===== arch/ppc64/xmon/ppc-dis.c 1.1 vs edited ===== > > --- 1.1/arch/ppc64/xmon/ppc-dis.c Thu Feb 14 06:14:36 2002 > > +++ edited/arch/ppc64/xmon/ppc-dis.c Wed Feb 25 11:22:49 2004 > > Someone needs to update ppc-dis.c and ppc-opc.c with a more recent > version from the BFD library. The one we have seems to be missing > some power4 instructions, and also seems to have a 32-bit assumption > here: > > > + } else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0) > > print_address (value & 0xffffffff); > > I don't like this bit, though: > > > + if (addr) { > > + char namebuf[128]; > > + const char *name; > > + char *modname; > > + long size, offset; > > + > > + name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf); > > + if (name) { > > + if(modname) > > + fprintf(out, "\t# [%s]%s+0x%lx", modname, name, offset); > > + else > > + fprintf(out, "\t# %s+0x%lx", name, offset); > > + } > > + } > > This should be put in the print_address function. (With correct > indentation. :) If we minimize the changes we make to ppc_dis.c, it > makes it easier to update it from the BFD version as we go along. 
> > And here it looks like you are sometimes using 8 spaces to indent, and > sometimes a tab: > > > +void > > +symbol_lookup(void) > > +{ > > + int type = inchar(); > > + unsigned long addr; > > + static char tmp[64]; > > + > > + switch (type) { > > + case 'a': > > + if (scanhex(&addr)) { > > It would be good if you could use tabs everywhere. > > Regards, > Paul. > -------------- next part -------------- ===== arch/ppc64/xmon/xmon.c 1.34 vs edited ===== --- 1.34/arch/ppc64/xmon/xmon.c Sat Feb 14 06:40:59 2004 +++ edited/arch/ppc64/xmon/xmon.c Wed Feb 25 19:53:15 2004 @@ -50,6 +50,7 @@ static unsigned long nidump = 16; static unsigned long ncsum = 4096; static int termch; +static char tmpstr[128]; static u_int bus_error_jmp[100]; #define setjmp xmon_setjmp @@ -115,6 +116,7 @@ static void csum(void); static void bootcmds(void); void dump_segments(void); +void symbol_lookup(void); static void debug_trace(void); @@ -160,6 +162,8 @@ dd dump double values\n\ e print exception information\n\ f flush cache\n\ + la lookup symbol+offset of specified address\n\ + ls lookup address of specified symbol\n\ m examine/change memory\n\ mm move a block of memory\n\ ms set a block of memory\n\ @@ -537,6 +541,9 @@ case 'd': dump(); break; + case 'l': + symbol_lookup(); + break; case 'r': if (excp != NULL) prregs(excp); /* print regs */ @@ -1641,9 +1648,22 @@ void print_address(unsigned long addr) { - printf("0x%lx", addr); + const char *name; + char *modname; + long size, offset; + + name = kallsyms_lookup(addr, &size, &offset, &modname, tmpstr); + + if (name) { + if (modname) + printf("0x%lx\t# [%s]%s+0x%lx", addr, modname, name, offset); + else + printf("0x%lx\t# %s+0x%lx", addr, name, offset); + } else + printf("0x%lx", addr); } + /* * Memory operations - move, set, print differences */ @@ -1822,8 +1842,33 @@ return 0; } + /* skip leading "0x" if any */ + + if (c == '0') { + c = inchar(); + if (c == 'x') + c = inchar(); + } else if (c == '$') { + int i; + for (i=0; i<63; i++) { + c 
= inchar(); + if (isspace(c)) { + termch = c; + break; + } + tmpstr[i] = c; + } + tmpstr[i++] = 0; + *vp = kallsyms_lookupname(tmpstr); + if (!(*vp)) { + printf("unknown symbol '%s'\n", tmpstr); + return 0; + } + return 1; + } + d = hexdigit(c); - if( d == EOF ){ + if (d == EOF) { termch = c; return 0; } @@ -1832,7 +1877,7 @@ v = (v << 4) + d; c = inchar(); d = hexdigit(c); - } while( d != EOF ); + } while (d != EOF); termch = c; *vp = v; return 1; @@ -1907,19 +1952,53 @@ lineptr = str; } + +void +symbol_lookup(void) +{ + int type = inchar(); + unsigned long addr; + static char tmp[64]; + + switch (type) { + case 'a': + if (scanhex(&addr)) { + printf("%lx: ", addr); + xmon_print_symbol("%s\n", addr); + } + termch = 0; + break; + case 's': + getstring(tmp, 64); + if (setjmp(bus_error_jmp) == 0) { + __debugger_fault_handler = handle_fault; + sync(); + addr = kallsyms_lookupname(tmp); + if (addr) + printf("%s: %lx\n", tmp, addr); + else + printf("Symbol '%s' not found.\n", tmp); + sync(); + } + __debugger_fault_handler = 0; + termch = 0; + break; + } +} + + /* xmon version of __print_symbol */ void __xmon_print_symbol(const char *fmt, unsigned long address) { char *modname; const char *name; unsigned long offset, size; - char namebuf[128]; if (setjmp(bus_error_jmp) == 0) { __debugger_fault_handler = handle_fault; sync(); name = kallsyms_lookup(address, &size, &offset, &modname, - namebuf); + tmpstr); sync(); /* wait a little while to see if we get a machine check */ __delay(200); From rusty at au1.ibm.com Thu Feb 26 17:51:08 2004 From: rusty at au1.ibm.com (Rusty Russell) Date: Thu, 26 Feb 2004 17:51:08 +1100 Subject: Adding kallsyms_lookupname() In-Reply-To: Your message of "Wed, 25 Feb 2004 16:32:41 MDT." Message-ID: <20040226225409.91D7C17DD8@ozlabs.au.ibm.com> In message you write: > Rusty, > > Attached patch adds a kallsyms_lookupname() function for lookups of a > symbol name to an address. 
OK, I simplified it a bit, and gave it some prototype love: > +unsigned long kallsyms_lookupname(char *name); .... > +static inline const unsigned long kallsyms_lookupname(unsigned long addr) How's this: Name: kallsyms_lookupname() Function For Debuggers Author: Olof Johansson, Rusty Russell Status: Experimental Attached patch adds a kallsyms_lookupname() function for lookups of a symbol name to an address. It's intentionally not exported as a symbol for module use, since it can be used to circumvent other symbol export restrictions. diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/include/linux/kallsyms.h .32130-linux-2.6.3-bk7.updated/include/linux/kallsyms.h --- .32130-linux-2.6.3-bk7/include/linux/kallsyms.h 2003-09-22 09:47:17.000000000 +1000 +++ .32130-linux-2.6.3-bk7.updated/include/linux/kallsyms.h 2004-02-26 13:57:39.000000000 +1100 @@ -8,6 +8,9 @@ #include #ifdef CONFIG_KALLSYMS +/* Lookup the address for a symbol. Returns 0 if not found. */ +unsigned long kallsyms_lookup_name(const char *name); + /* Lookup an address. modname is set to NULL if it's in the kernel. 
*/ const char *kallsyms_lookup(unsigned long addr, unsigned long *symbolsize, @@ -19,6 +22,11 @@ extern void __print_symbol(const char *f #else /* !CONFIG_KALLSYMS */ +static inline unsigned long kallsyms_lookup_name(const char *name) +{ + return 0; +} + static inline const char *kallsyms_lookup(unsigned long addr, unsigned long *symbolsize, unsigned long *offset, diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/include/linux/module.h .32130-linux-2.6.3-bk7.updated/include/linux/module.h --- .32130-linux-2.6.3-bk7/include/linux/module.h 2004-02-04 15:39:14.000000000 +1100 +++ .32130-linux-2.6.3-bk7.updated/include/linux/module.h 2004-02-26 15:20:29.000000000 +1100 @@ -282,6 +282,10 @@ struct module *module_get_kallsym(unsign unsigned long *value, char *type, char namebuf[128]); + +/* Look for this name: can be of form module:name. */ +unsigned long module_kallsyms_lookup_name(const char *name); + int is_exported(const char *name, const struct module *mod); extern void __module_put_and_exit(struct module *mod, long code) @@ -434,6 +438,11 @@ static inline struct module *module_get_ return NULL; } +static inline unsigned long module_kallsyms_lookup_name(const char *name) +{ + return 0; +} + static inline int is_exported(const char *name, const struct module *mod) { return 0; diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/kernel/kallsyms.c .32130-linux-2.6.3-bk7.updated/kernel/kallsyms.c --- .32130-linux-2.6.3-bk7/kernel/kallsyms.c 2003-09-22 10:28:13.000000000 +1000 +++ .32130-linux-2.6.3-bk7.updated/kernel/kallsyms.c 2004-02-26 14:22:37.000000000 +1100 @@ -37,6 +37,25 @@ static inline int is_kernel_text(unsigne return 0; } +/* Lookup the address for this symbol. Returns 0 if not found. 
*/ +unsigned long kallsyms_lookup_name(const char *name) +{ + char namebuf[128]; + unsigned long i; + char *knames; + + for (i = 0, knames = kallsyms_names; i < kallsyms_num_syms; i++) { + unsigned prefix = *knames++; + + strlcpy(namebuf + prefix, knames, 127 - prefix); + if (strcmp(namebuf, name) == 0) + return kallsyms_addresses[i]; + + knames += strlen(knames) + 1; + } + return module_kallsyms_lookup_name(name); +} + /* Lookup an address. modname is set to NULL if it's in the kernel. */ const char *kallsyms_lookup(unsigned long addr, unsigned long *symbolsize, diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/kernel/module.c .32130-linux-2.6.3-bk7.updated/kernel/module.c --- .32130-linux-2.6.3-bk7/kernel/module.c 2004-02-26 11:53:26.000000000 +1100 +++ .32130-linux-2.6.3-bk7.updated/kernel/module.c 2004-02-26 14:40:25.000000000 +1100 @@ -1892,6 +1892,37 @@ struct module *module_get_kallsym(unsign up(&module_mutex); return NULL; } + +static unsigned long mod_find_symname(struct module *mod, const char *name) +{ + unsigned int i; + + for (i = 0; i < mod->num_symtab; i++) + if (strcmp(name, mod->strtab+mod->symtab[i].st_name) == 0) + return mod->symtab[i].st_value; + return 0; +} + +/* Look for this name: can be of form module:name. */ +unsigned long module_kallsyms_lookup_name(const char *name) +{ + struct module *mod; + char *colon; + unsigned long ret = 0; + + /* Don't lock: we're in enough trouble already. */ + if ((colon = strchr(name, ':')) != NULL) { + *colon = '\0'; + if ((mod = find_module(name)) != NULL) + ret = mod_find_symname(mod, colon+1); + *colon = ':'; + } else { + list_for_each_entry(mod, &modules, list) + if ((ret = mod_find_symname(mod, name)) != 0) + break; + } + return ret; +} #endif /* CONFIG_KALLSYMS */ /* Called by the /proc file system to return a list of modules. */ -- Anyone who quotes me in their sig is an idiot. -- Rusty Russell. 
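For anyone following the thread without the kernel source at hand: judging from the loops in both versions of the patch, the name table they walk appears to be prefix-compressed. Each entry is one byte giving how many leading characters are shared with the previous name, followed by the NUL-terminated remainder. Below is a minimal user-space sketch of that walk, assuming exactly that layout. The names demo_names, demo_addresses, demo_num_syms and demo_lookup_name are hypothetical stand-ins for illustration, not kernel symbols, and the table contents are made up.

```c
#include <stddef.h>
#include <string.h>

#define NAME_BUF 128

/*
 * Hypothetical stand-in for kallsyms_names: each entry is one prefix
 * byte (characters shared with the previous name) followed by the
 * NUL-terminated suffix.  Decoded, this table reads:
 *   sys_fork, sys_open, sys_read, do_fork
 */
static const char demo_names[] =
	"\0" "sys_fork\0"
	"\4" "open\0"
	"\4" "read\0"
	"\0" "do_fork";

/* Hypothetical stand-in for kallsyms_addresses (made-up values). */
static const unsigned long demo_addresses[] = {
	0x1000, 0x1100, 0x1200, 0x2000
};
static const unsigned long demo_num_syms = 4;

/* Return the address recorded for name, or 0 if it is not in the table. */
static unsigned long demo_lookup_name(const char *name)
{
	char namebuf[NAME_BUF];
	const char *p = demo_names;
	unsigned long i;

	namebuf[0] = '\0';
	for (i = 0; i < demo_num_syms; i++) {
		unsigned prefix = (unsigned char)*p++;
		size_t len = strlen(p);

		/* Give up rather than truncate an over-long name. */
		if (prefix + len >= NAME_BUF)
			return 0;

		/* Keep the first 'prefix' chars of the previous name. */
		memcpy(namebuf + prefix, p, len + 1);

		if (strcmp(namebuf, name) == 0)
			return demo_addresses[i];

		p += len + 1;	/* step over the suffix and its NUL */
	}
	return 0;
}
```

With this table, demo_lookup_name("sys_read") rebuilds the name from the four shared characters "sys_" plus the stored suffix "read" before comparing. Every entry has to be decoded in order because each one depends on the previous name, which is why the simplified loop does not try to skip ahead.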
** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Feb 27 02:58:42 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 27 Feb 2004 02:58:42 +1100 Subject: PCI Interrupts with CONFIG_PPC_ISERIES in linux 2.6.3 In-Reply-To: <20040225231055.GN5801@krispykreme> References: <20040225231055.GN5801@krispykreme> Message-ID: <20040226155842.GR5801@krispykreme> > Linus' tree (pauls large IRQ patch got merged yesterday) should work. We'll > be looking into the ameslab issue today. Paul has merged this into ameslab and iseries irqs should work again. I tested this with a pcnet32 card in our iseries and it worked fine. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From akpm at osdl.org Fri Feb 27 10:06:51 2004 From: akpm at osdl.org (Andrew Morton) Date: Thu, 26 Feb 2004 15:06:51 -0800 Subject: Adding kallsyms_lookupname() In-Reply-To: <20040226225409.91D7C17DD8@ozlabs.au.ibm.com> References: <20040226225409.91D7C17DD8@ozlabs.au.ibm.com> Message-ID: <20040226150651.66d45f91.akpm@osdl.org> Rusty Russell wrote: > > Name: kallsyms_lookupname() Function For Debuggers What uses this? Whatever it is, I'd prefer to merge both caller and callee please. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Fri Feb 27 10:23:42 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Thu, 26 Feb 2004 17:23:42 -0600 (CST) Subject: Adding kallsyms_lookupname() In-Reply-To: <20040226150651.66d45f91.akpm@osdl.org> Message-ID: On Thu, 26 Feb 2004, Andrew Morton wrote: > Rusty Russell wrote: > > > > Name: kallsyms_lookupname() Function For Debuggers > > What uses this? > > Whatever it is, I'd prefer to merge both caller and callee please. Attached is the corresponding patch to xmon on ppc64. Ben said he might backport these changes to ppc32 as well. 
-Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- ===== arch/ppc64/xmon/xmon.c 1.34 vs edited ===== --- 1.34/arch/ppc64/xmon/xmon.c Sat Feb 14 06:40:59 2004 +++ edited/arch/ppc64/xmon/xmon.c Wed Feb 25 19:53:15 2004 @@ -50,6 +50,7 @@ static unsigned long nidump = 16; static unsigned long ncsum = 4096; static int termch; +static char tmpstr[128]; static u_int bus_error_jmp[100]; #define setjmp xmon_setjmp @@ -115,6 +116,7 @@ static void csum(void); static void bootcmds(void); void dump_segments(void); +void symbol_lookup(void); static void debug_trace(void); @@ -160,6 +162,8 @@ dd dump double values\n\ e print exception information\n\ f flush cache\n\ + la lookup symbol+offset of specified address\n\ + ls lookup address of specified symbol\n\ m examine/change memory\n\ mm move a block of memory\n\ ms set a block of memory\n\ @@ -537,6 +541,9 @@ case 'd': dump(); break; + case 'l': + symbol_lookup(); + break; case 'r': if (excp != NULL) prregs(excp); /* print regs */ @@ -1641,9 +1648,22 @@ void print_address(unsigned long addr) { - printf("0x%lx", addr); + const char *name; + char *modname; + long size, offset; + + name = kallsyms_lookup(addr, &size, &offset, &modname, tmpstr); + + if (name) { + if (modname) + printf("0x%lx\t# %s:%s+0x%lx", addr, modname, name, offset); + else + printf("0x%lx\t# %s+0x%lx", addr, name, offset); + } else + printf("0x%lx", addr); } + /* * Memory operations - move, set, print differences */ @@ -1822,8 +1842,33 @@ return 0; } + /* skip leading "0x" if any */ + + if (c == '0') { + c = inchar(); + if (c == 'x') + c = inchar(); + } else if (c == '$') { + int i; + for (i=0; i<63; i++) { + c = inchar(); + if (isspace(c)) { + termch = c; + break; + } + tmpstr[i] = c; + } + tmpstr[i++] = 0; + *vp = kallsyms_lookupname(tmpstr); + if (!(*vp)) { + printf("unknown symbol '%s'\n", 
tmpstr); + return 0; + } + return 1; + } + d = hexdigit(c); - if( d == EOF ){ + if (d == EOF) { termch = c; return 0; } @@ -1832,7 +1877,7 @@ v = (v << 4) + d; c = inchar(); d = hexdigit(c); - } while( d != EOF ); + } while (d != EOF); termch = c; *vp = v; return 1; @@ -1907,19 +1952,53 @@ lineptr = str; } + +void +symbol_lookup(void) +{ + int type = inchar(); + unsigned long addr; + static char tmp[64]; + + switch (type) { + case 'a': + if (scanhex(&addr)) { + printf("%lx: ", addr); + xmon_print_symbol("%s\n", addr); + } + termch = 0; + break; + case 's': + getstring(tmp, 64); + if (setjmp(bus_error_jmp) == 0) { + __debugger_fault_handler = handle_fault; + sync(); + addr = kallsyms_lookupname(tmp); + if (addr) + printf("%s: %lx\n", tmp, addr); + else + printf("Symbol '%s' not found.\n", tmp); + sync(); + } + __debugger_fault_handler = 0; + termch = 0; + break; + } +} + + /* xmon version of __print_symbol */ void __xmon_print_symbol(const char *fmt, unsigned long address) { char *modname; const char *name; unsigned long offset, size; - char namebuf[128]; if (setjmp(bus_error_jmp) == 0) { __debugger_fault_handler = handle_fault; sync(); name = kallsyms_lookup(address, &size, &offset, &modname, - namebuf); + tmpstr); sync(); /* wait a little while to see if we get a machine check */ __delay(200); From akpm at osdl.org Fri Feb 27 10:37:36 2004 From: akpm at osdl.org (Andrew Morton) Date: Thu, 26 Feb 2004 15:37:36 -0800 Subject: Adding kallsyms_lookupname() In-Reply-To: References: <20040226150651.66d45f91.akpm@osdl.org> Message-ID: <20040226153736.74fecb3b.akpm@osdl.org> olof at austin.ibm.com wrote: > > On Thu, 26 Feb 2004, Andrew Morton wrote: > > > Rusty Russell wrote: > > > > > > Name: kallsyms_lookupname() Function For Debuggers > > > > What uses this? > > > > Whatever it is, I'd prefer to merge both caller and callee please. > > Attached is the corresponding patch to xmon on ppc64. OK, thanks. Is this ready to be merged? 
> +void symbol_lookup(void); Should I make this static? ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From strosake at austin.ibm.com Fri Feb 27 11:02:55 2004 From: strosake at austin.ibm.com (Mike Strosaker) Date: Thu, 26 Feb 2004 18:02:55 -0600 Subject: [PATCH] (2.6) os-term call upon kernel panic Message-ID: <403E892F.3080001@austin.ibm.com> This patch will cause the os-term RTAS call to be invoked after a kernel panic. The call notifies the platform that the OS is terminating normal operation, which causes the service processor (on systems so equipped) to perform some pre-defined actions (like a call home). The os-term routine is given the lowest priority on panic_notifier_list so that it will be the last routine invoked, since it may not return. Comments welcome. Thanks, Mike diff -Nru a/arch/ppc64/kernel/chrp_setup.c b/arch/ppc64/kernel/chrp_setup.c --- a/arch/ppc64/kernel/chrp_setup.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/chrp_setup.c Thu Feb 26 16:54:18 2004 @@ -268,6 +268,7 @@ ppc_md.restart = rtas_restart; ppc_md.power_off = rtas_power_off; ppc_md.halt = rtas_halt; + ppc_md.panic = rtas_os_term; ppc_md.get_boot_time = pSeries_get_boot_time; ppc_md.get_rtc_time = pSeries_get_rtc_time; diff -Nru a/arch/ppc64/kernel/iSeries_setup.c b/arch/ppc64/kernel/iSeries_setup.c --- a/arch/ppc64/kernel/iSeries_setup.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/iSeries_setup.c Thu Feb 26 16:54:18 2004 @@ -323,6 +323,7 @@ ppc_md.restart = iSeries_restart; ppc_md.power_off = iSeries_power_off; ppc_md.halt = iSeries_halt; + ppc_md.panic = iSeries_panic; ppc_md.get_boot_time = iSeries_get_boot_time; ppc_md.set_rtc_time = iSeries_set_rtc_time; @@ -790,6 +791,14 @@ void iSeries_halt(void) { mf_powerOff(); +} + +/* + * Document me. 
+ */ +void iSeries_panic(void) +{ + mf_reboot(); } /* JDH Hack */ diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/rtas.c Thu Feb 26 16:54:18 2004 @@ -420,6 +420,17 @@ rtas_power_off(); } +void +rtas_os_term(void) +{ + long status; + char *str = "OS panic"; + + status = rtas_call(rtas_token("ibm,os-term"), 1, 1, NULL, __pa(str)); + if (status != 0) + printk(KERN_EMERG "ibm,os-term call failed %ld\n", status); +} + unsigned long rtas_rmo_buf = 0; asmlinkage int ppc_rtas(struct rtas_args __user *uargs) diff -Nru a/arch/ppc64/kernel/setup.c b/arch/ppc64/kernel/setup.c --- a/arch/ppc64/kernel/setup.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/setup.c Thu Feb 26 16:54:18 2004 @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -93,6 +94,12 @@ struct machdep_calls ppc_md; +static int ppc64_panic_event(struct notifier_block *, unsigned long, void *); +static struct notifier_block ppc64_panic_block = { + notifier_call: ppc64_panic_event, + priority: INT_MIN /* may not return; must be done last */ +}; + /* * Perhaps we can put the pmac screen_info[] here * on pmac as well so we don't need the ifdef's. 
@@ -316,6 +323,13 @@ EXPORT_SYMBOL(machine_halt); +static int ppc64_panic_event(struct notifier_block *this, + unsigned long event, void *ptr) +{ + ppc_md.panic(); /* May not return */ + return NOTIFY_DONE; +} + unsigned long ppc_proc_freq; unsigned long ppc_tb_freq; @@ -610,8 +624,9 @@ dcache_bsize = systemcfg->dCacheL1LineSize; icache_bsize = systemcfg->iCacheL1LineSize; - /* reboot on panic */ - panic_timeout = 180; + /* do not reboot on panic */ + panic_timeout = 0; + notifier_chain_register(&panic_notifier_list, &ppc64_panic_block); init_mm.start_code = PAGE_OFFSET; init_mm.end_code = (unsigned long) _etext; diff -Nru a/include/asm-ppc64/machdep.h b/include/asm-ppc64/machdep.h --- a/include/asm-ppc64/machdep.h Thu Feb 26 16:54:18 2004 +++ b/include/asm-ppc64/machdep.h Thu Feb 26 16:54:18 2004 @@ -82,6 +82,7 @@ void (*restart)(char *cmd); void (*power_off)(void); void (*halt)(void); + void (*panic)(void); int (*set_rtc_time)(struct rtc_time *); void (*get_rtc_time)(struct rtc_time *); diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h --- a/include/asm-ppc64/rtas.h Thu Feb 26 16:54:18 2004 +++ b/include/asm-ppc64/rtas.h Thu Feb 26 16:54:18 2004 @@ -175,6 +175,7 @@ extern void rtas_restart(char *cmd); extern void rtas_power_off(void); extern void rtas_halt(void); +extern void rtas_os_term(void); extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); extern int rtas_set_indicator(int indicator, int index, int new_value); ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Fri Feb 27 11:15:37 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Thu, 26 Feb 2004 18:15:37 -0600 (CST) Subject: Adding kallsyms_lookupname() In-Reply-To: <20040226153736.74fecb3b.akpm@osdl.org> Message-ID: On Thu, 26 Feb 2004, Andrew Morton wrote: > OK, thanks. Is this ready to be merged? 
The attached new patch is good to go -- Rusty renamed the function and I didn't notice at first. It's been built and tested together now. > > +void symbol_lookup(void); > > Should I make this static? Yep, done in the attached patch as well. Thanks, -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- ===== arch/ppc64/xmon/xmon.c 1.34 vs edited ===== --- 1.34/arch/ppc64/xmon/xmon.c Sat Feb 14 06:40:59 2004 +++ edited/arch/ppc64/xmon/xmon.c Wed Feb 25 19:53:15 2004 @@ -50,6 +50,7 @@ static unsigned long nidump = 16; static unsigned long ncsum = 4096; static int termch; +static char tmpstr[128]; static u_int bus_error_jmp[100]; #define setjmp xmon_setjmp @@ -115,6 +116,7 @@ static void csum(void); static void bootcmds(void); void dump_segments(void); +static void symbol_lookup(void); static void debug_trace(void); @@ -160,6 +162,8 @@ dd dump double values\n\ e print exception information\n\ f flush cache\n\ + la lookup symbol+offset of specified address\n\ + ls lookup address of specified symbol\n\ m examine/change memory\n\ mm move a block of memory\n\ ms set a block of memory\n\ @@ -537,6 +541,9 @@ case 'd': dump(); break; + case 'l': + symbol_lookup(); + break; case 'r': if (excp != NULL) prregs(excp); /* print regs */ @@ -1641,9 +1648,22 @@ void print_address(unsigned long addr) { - printf("0x%lx", addr); + const char *name; + char *modname; + long size, offset; + + name = kallsyms_lookup(addr, &size, &offset, &modname, tmpstr); + + if (name) { + if (modname) + printf("0x%lx\t# %s:%s+0x%lx", addr, modname, name, offset); + else + printf("0x%lx\t# %s+0x%lx", addr, name, offset); + } else + printf("0x%lx", addr); } + /* * Memory operations - move, set, print differences */ @@ -1822,8 +1842,33 @@ return 0; } + /* skip leading "0x" if any */ + + if (c == '0') { + c = inchar(); + if (c == 'x') + c = 
inchar(); + } else if (c == '$') { + int i; + for (i=0; i<63; i++) { + c = inchar(); + if (isspace(c)) { + termch = c; + break; + } + tmpstr[i] = c; + } + tmpstr[i++] = 0; + *vp = kallsyms_lookup_name(tmpstr); + if (!(*vp)) { + printf("unknown symbol '%s'\n", tmpstr); + return 0; + } + return 1; + } + d = hexdigit(c); - if( d == EOF ){ + if (d == EOF) { termch = c; return 0; } @@ -1832,7 +1877,7 @@ v = (v << 4) + d; c = inchar(); d = hexdigit(c); - } while( d != EOF ); + } while (d != EOF); termch = c; *vp = v; return 1; @@ -1907,19 +1952,53 @@ lineptr = str; } + +static void +symbol_lookup(void) +{ + int type = inchar(); + unsigned long addr; + static char tmp[64]; + + switch (type) { + case 'a': + if (scanhex(&addr)) { + printf("%lx: ", addr); + xmon_print_symbol("%s\n", addr); + } + termch = 0; + break; + case 's': + getstring(tmp, 64); + if (setjmp(bus_error_jmp) == 0) { + __debugger_fault_handler = handle_fault; + sync(); + addr = kallsyms_lookup_name(tmp); + if (addr) + printf("%s: %lx\n", tmp, addr); + else + printf("Symbol '%s' not found.\n", tmp); + sync(); + } + __debugger_fault_handler = 0; + termch = 0; + break; + } +} + + /* xmon version of __print_symbol */ void __xmon_print_symbol(const char *fmt, unsigned long address) { char *modname; const char *name; unsigned long offset, size; - char namebuf[128]; if (setjmp(bus_error_jmp) == 0) { __debugger_fault_handler = handle_fault; sync(); name = kallsyms_lookup(address, &size, &offset, &modname, - namebuf); + tmpstr); sync(); /* wait a little while to see if we get a machine check */ __delay(200); From johnrose at austin.ibm.com Fri Feb 27 11:22:36 2004 From: johnrose at austin.ibm.com (John Rose) Date: Thu, 26 Feb 2004 18:22:36 -0600 Subject: [PATCH] RTAS syscall NULL ptr deref (2.6) Message-ID: <1077841356.14211.17.camel@verve.austin.ibm.com> The patch below fixes a NULL ptr deref in the RTAS syscall on 2.6. 
I pushed it already, but send comments if you want :) Thanks- John diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Thu Feb 26 16:30:25 2004 +++ b/arch/ppc64/kernel/rtas.c Thu Feb 26 16:30:25 2004 @@ -426,6 +426,7 @@ { struct rtas_args args; unsigned long flags; + int nargs; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -433,14 +434,15 @@ if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0) return -EFAULT; - if (args.nargs > ARRAY_SIZE(args.args) + nargs = args.nargs; + if (nargs > ARRAY_SIZE(args.args) || args.nret > ARRAY_SIZE(args.args) - || args.nargs + args.nret > ARRAY_SIZE(args.args)) + || nargs + args.nret > ARRAY_SIZE(args.args)) return -EINVAL; /* Copy in args. */ if (copy_from_user(args.args, uargs->args, - args.nargs * sizeof(rtas_arg_t)) != 0) + nargs * sizeof(rtas_arg_t)) != 0) return -EFAULT; spin_lock_irqsave(&rtas.lock, flags); @@ -449,14 +451,15 @@ enter_rtas((void *)__pa((unsigned long)&get_paca()->xRtas)); args = get_paca()->xRtas; + args.rets = (rtas_arg_t *)&(args.args[nargs]); if (args.rets[0] == -1) log_rtas_error(&args); spin_unlock_irqrestore(&rtas.lock, flags); /* Copy out args. */ - if (copy_to_user(uargs->args + args.nargs, - args.args + args.nargs, + if (copy_to_user(uargs->args + nargs, + args.args + nargs, args.nret * sizeof(rtas_arg_t)) != 0) return -EFAULT; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From strosake at austin.ibm.com Fri Feb 27 11:24:52 2004 From: strosake at austin.ibm.com (Mike Strosaker) Date: Thu, 26 Feb 2004 18:24:52 -0600 Subject: (resend) [PATCH] (2.6) os-term call upon kernel panic Message-ID: <403E8E54.2000009@austin.ibm.com> Resend... tabs got converted to spaces on the last one. 
Thanks, Mike diff -Nru a/arch/ppc64/kernel/chrp_setup.c b/arch/ppc64/kernel/chrp_setup.c --- a/arch/ppc64/kernel/chrp_setup.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/chrp_setup.c Thu Feb 26 16:54:18 2004 @@ -268,6 +268,7 @@ ppc_md.restart = rtas_restart; ppc_md.power_off = rtas_power_off; ppc_md.halt = rtas_halt; + ppc_md.panic = rtas_os_term; ppc_md.get_boot_time = pSeries_get_boot_time; ppc_md.get_rtc_time = pSeries_get_rtc_time; diff -Nru a/arch/ppc64/kernel/iSeries_setup.c b/arch/ppc64/kernel/iSeries_setup.c --- a/arch/ppc64/kernel/iSeries_setup.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/iSeries_setup.c Thu Feb 26 16:54:18 2004 @@ -323,6 +323,7 @@ ppc_md.restart = iSeries_restart; ppc_md.power_off = iSeries_power_off; ppc_md.halt = iSeries_halt; + ppc_md.panic = iSeries_panic; ppc_md.get_boot_time = iSeries_get_boot_time; ppc_md.set_rtc_time = iSeries_set_rtc_time; @@ -790,6 +791,14 @@ void iSeries_halt(void) { mf_powerOff(); +} + +/* + * Document me. + */ +void iSeries_panic(void) +{ + mf_reboot(); } /* JDH Hack */ diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c --- a/arch/ppc64/kernel/rtas.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/rtas.c Thu Feb 26 16:54:18 2004 @@ -420,6 +420,17 @@ rtas_power_off(); } +void +rtas_os_term(void) +{ + long status; + char *str = "OS panic"; + + status = rtas_call(rtas_token("ibm,os-term"), 1, 1, NULL, __pa(str)); + if (status != 0) + printk(KERN_EMERG "ibm,os-term call failed %ld\n", status); +} + unsigned long rtas_rmo_buf = 0; asmlinkage int ppc_rtas(struct rtas_args __user *uargs) diff -Nru a/arch/ppc64/kernel/setup.c b/arch/ppc64/kernel/setup.c --- a/arch/ppc64/kernel/setup.c Thu Feb 26 16:54:18 2004 +++ b/arch/ppc64/kernel/setup.c Thu Feb 26 16:54:18 2004 @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -93,6 +94,12 @@ struct machdep_calls ppc_md; +static int ppc64_panic_event(struct notifier_block *, unsigned long, void *); +static struct 
notifier_block ppc64_panic_block = { + notifier_call: ppc64_panic_event, + priority: INT_MIN /* may not return; must be done last */ +}; + /* * Perhaps we can put the pmac screen_info[] here * on pmac as well so we don't need the ifdef's. @@ -316,6 +323,13 @@ EXPORT_SYMBOL(machine_halt); +static int ppc64_panic_event(struct notifier_block *this, + unsigned long event, void *ptr) +{ + ppc_md.panic(); /* May not return */ + return NOTIFY_DONE; +} + unsigned long ppc_proc_freq; unsigned long ppc_tb_freq; @@ -610,8 +624,9 @@ dcache_bsize = systemcfg->dCacheL1LineSize; icache_bsize = systemcfg->iCacheL1LineSize; - /* reboot on panic */ - panic_timeout = 180; + /* do not reboot on panic */ + panic_timeout = 0; + notifier_chain_register(&panic_notifier_list, &ppc64_panic_block); init_mm.start_code = PAGE_OFFSET; init_mm.end_code = (unsigned long) _etext; diff -Nru a/include/asm-ppc64/machdep.h b/include/asm-ppc64/machdep.h --- a/include/asm-ppc64/machdep.h Thu Feb 26 16:54:18 2004 +++ b/include/asm-ppc64/machdep.h Thu Feb 26 16:54:18 2004 @@ -82,6 +82,7 @@ void (*restart)(char *cmd); void (*power_off)(void); void (*halt)(void); + void (*panic)(void); int (*set_rtc_time)(struct rtc_time *); void (*get_rtc_time)(struct rtc_time *); diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h --- a/include/asm-ppc64/rtas.h Thu Feb 26 16:54:18 2004 +++ b/include/asm-ppc64/rtas.h Thu Feb 26 16:54:18 2004 @@ -175,6 +175,7 @@ extern void rtas_restart(char *cmd); extern void rtas_power_off(void); extern void rtas_halt(void); +extern void rtas_os_term(void); extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); extern int rtas_set_indicator(int indicator, int index, int new_value); ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From olof at austin.ibm.com Fri Feb 27 12:15:23 2004 From: olof at austin.ibm.com (olof at austin.ibm.com) Date: Thu, 26 Feb 2004 19:15:23 -0600 (CST) Subject: (resend) [PATCH] (2.6) os-term call upon kernel panic In-Reply-To: <403E8E54.2000009@austin.ibm.com> Message-ID: A couple of comments: * on iSeries, it'll result in an instant reboot, maybe not desirable * no support for G5, will result in NULL branch. Panic will panic. :) * if the RTAS os-term call for some reason fails (can it?), the system won't reboot since panic_timeout is 0. Attached patch resolves those issues by not setting ppc_md.panic on pmac and iSeries, and only registers the notifier in case it's set. The timeout is kept at 180 even though it won't be used on a successful panic on pSeries. I've tried building it, but I didn't test it much. -Olof Olof Johansson Office: 4E002/905 Linux on Power Development IBM Systems Group Email: olof at austin.ibm.com Phone: 512-838-9858 All opinions are my own and not those of IBM -------------- next part -------------- ===== arch/ppc64/kernel/chrp_setup.c 1.56 vs edited ===== --- 1.56/arch/ppc64/kernel/chrp_setup.c Thu Feb 26 04:56:03 2004 +++ edited/arch/ppc64/kernel/chrp_setup.c Thu Feb 26 18:47:17 2004 @@ -266,6 +266,7 @@ ppc_md.restart = rtas_restart; ppc_md.power_off = rtas_power_off; ppc_md.halt = rtas_halt; + ppc_md.panic = rtas_os_term; ppc_md.get_boot_time = pSeries_get_boot_time; ppc_md.get_rtc_time = pSeries_get_rtc_time; ===== arch/ppc64/kernel/rtas.c 1.24 vs edited ===== --- 1.24/arch/ppc64/kernel/rtas.c Tue Feb 24 22:28:40 2004 +++ edited/arch/ppc64/kernel/rtas.c Thu Feb 26 18:48:28 2004 @@ -420,6 +420,18 @@ rtas_power_off(); } +void +rtas_os_term(void) +{ + long status; + char *str = "OS panic"; + + status = rtas_call(rtas_token("ibm,os-term"), 1, 1, NULL, __pa(str)); + if (status != 0) + printk(KERN_EMERG "ibm,os-term call failed %ld\n", status); +} + + unsigned long rtas_rmo_buf = 0; asmlinkage int 
ppc_rtas(struct rtas_args __user *uargs) ===== arch/ppc64/kernel/setup.c 1.61 vs edited ===== --- 1.61/arch/ppc64/kernel/setup.c Tue Feb 24 21:57:17 2004 +++ edited/arch/ppc64/kernel/setup.c Thu Feb 26 19:02:59 2004 @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -93,6 +94,13 @@ struct machdep_calls ppc_md; +static int ppc64_panic_event(struct notifier_block *, unsigned long, void *); + +static struct notifier_block ppc64_panic_block = { + notifier_call: ppc64_panic_event, + priority: INT_MIN /* may not return; must be done last */ +}; + /* * Perhaps we can put the pmac screen_info[] here * on pmac as well so we don't need the ifdef's. @@ -319,6 +327,14 @@ unsigned long ppc_proc_freq; unsigned long ppc_tb_freq; +static int ppc64_panic_event(struct notifier_block *this, + unsigned long event, void *ptr) +{ + ppc_md.panic(); /* May not return */ + return NOTIFY_DONE; +} + + #ifdef CONFIG_SMP DEFINE_PER_CPU(unsigned int, pvr); #endif @@ -610,8 +626,11 @@ dcache_bsize = systemcfg->dCacheL1LineSize; icache_bsize = systemcfg->iCacheL1LineSize; /* reboot on panic */ panic_timeout = 180; + + if (ppc_md.panic) + notifier_chain_register(&panic_notifier_list, &ppc64_panic_block); init_mm.start_code = PAGE_OFFSET; init_mm.end_code = (unsigned long) _etext; ===== include/asm-ppc64/machdep.h 1.34 vs edited ===== --- 1.34/include/asm-ppc64/machdep.h Thu Feb 26 04:56:03 2004 +++ edited/include/asm-ppc64/machdep.h Thu Feb 26 18:47:18 2004 @@ -81,6 +81,7 @@ void (*restart)(char *cmd); void (*power_off)(void); void (*halt)(void); + void (*panic)(void); int (*set_rtc_time)(struct rtc_time *); void (*get_rtc_time)(struct rtc_time *); ===== include/asm-ppc64/rtas.h 1.17 vs edited ===== --- 1.17/include/asm-ppc64/rtas.h Mon Jan 19 20:08:24 2004 +++ edited/include/asm-ppc64/rtas.h Thu Feb 26 18:56:39 2004 @@ -175,6 +175,7 @@ extern void rtas_restart(char *cmd); extern void rtas_power_off(void); extern void rtas_halt(void); +extern void 
rtas_os_term(void); extern int rtas_get_sensor(int sensor, int index, int *state); extern int rtas_get_power_level(int powerdomain, int *level); extern int rtas_set_indicator(int indicator, int index, int new_value); From benh at kernel.crashing.org Fri Feb 27 13:35:54 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 27 Feb 2004 13:35:54 +1100 Subject: [PATCH] RTAS syscall NULL ptr deref (2.6) In-Reply-To: <1077841356.14211.17.camel@verve.austin.ibm.com> References: <1077841356.14211.17.camel@verve.austin.ibm.com> Message-ID: <1077849354.22397.161.camel@gaston> On Fri, 2004-02-27 at 11:22, John Rose wrote: > The patch below fixes a NULL ptr deref in the RTAS syscall on 2.6. I > pushed it already, but send comments if you want :) Can you quickly explain how the code could do a NULL ptr deref in the first place? (and how that's fixed). The patch looks fine but I don't see how it fixes a NULL ptr :) And Linus is rather picky about patch descriptions not matching actual content... Ben. > Thanks- > John > > diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c > --- a/arch/ppc64/kernel/rtas.c Thu Feb 26 16:30:25 2004 > +++ b/arch/ppc64/kernel/rtas.c Thu Feb 26 16:30:25 2004 > @@ -426,6 +426,7 @@ > { > struct rtas_args args; > unsigned long flags; > + int nargs; > > if (!capable(CAP_SYS_ADMIN)) > return -EPERM; > @@ -433,14 +434,15 @@ > if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0) > return -EFAULT; > > - if (args.nargs > ARRAY_SIZE(args.args) > + nargs = args.nargs; > + if (nargs > ARRAY_SIZE(args.args) > || args.nret > ARRAY_SIZE(args.args) > - || args.nargs + args.nret > ARRAY_SIZE(args.args)) > + || nargs + args.nret > ARRAY_SIZE(args.args)) > return -EINVAL; > > /* Copy in args.
*/ > if (copy_from_user(args.args, uargs->args, > - args.nargs * sizeof(rtas_arg_t)) != 0) > + nargs * sizeof(rtas_arg_t)) != 0) > return -EFAULT; > > spin_lock_irqsave(&rtas.lock, flags); > @@ -449,14 +451,15 @@ > enter_rtas((void *)__pa((unsigned long)&get_paca()->xRtas)); > args = get_paca()->xRtas; > > + args.rets = (rtas_arg_t *)&(args.args[nargs]); > if (args.rets[0] == -1) > log_rtas_error(&args); > > spin_unlock_irqrestore(&rtas.lock, flags); > > /* Copy out args. */ > - if (copy_to_user(uargs->args + args.nargs, > - args.args + args.nargs, > + if (copy_to_user(uargs->args + nargs, > + args.args + nargs, > args.nret * sizeof(rtas_arg_t)) != 0) > return -EFAULT; > > -- Benjamin Herrenschmidt ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From strosake at austin.ibm.com Fri Feb 27 16:42:08 2004 From: strosake at austin.ibm.com (Mike Strosaker) Date: Thu, 26 Feb 2004 23:42:08 -0600 Subject: (resend) [PATCH] (2.6) os-term call upon kernel panic In-Reply-To: References: Message-ID: <403ED8B0.4080003@austin.ibm.com> olof at austin.ibm.com wrote: > A couple of comments: > > * on iSeries, it'll result in an instant reboot, maybe not desirable > * no support for G5, will result in NULL branch. Panic will panic. :) > * if the RTAS os-term call for some reason fails (can it?), the system > won't reboot since panic_timeout is 0. > > Attached patch resolves those issues by not setting ppc_md.panic on pmac > and iSeries, and only registers the notifier in case it's set. The timeout > is kept at 180 even though it won't be used on a successful panic on > pSeries. > > I've tried building it, but I didn't test it much. Thanks for the updates, Olof. I tested your patch on a p650 LPAR, and it works well. It's probably better to keep panic_timeout at 180, as your patch does, because it's definitely possible for os-term to return. Thanks, Mike ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Sat Feb 28 02:23:14 2004 From: anton at samba.org (Anton Blanchard) Date: Sat, 28 Feb 2004 02:23:14 +1100 Subject: 2.6 viodasd and IDE emulation Message-ID: <20040227152313.GN5801@krispykreme> Hi, IDE emulation on iSeries virtual disks was removed from 2.6 recently (it had no chance of being merged upstream), so if you are having trouble finding the root filesystem on your iSeries, that is probably why. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Sat Feb 28 03:16:09 2004 From: johnrose at austin.ibm.com (John Rose) Date: Fri, 27 Feb 2004 10:16:09 -0600 Subject: [PATCH] RTAS syscall NULL ptr deref (2.6) In-Reply-To: <1077849354.22397.161.camel@gaston> References: <1077841356.14211.17.camel@verve.austin.ibm.com> <1077849354.22397.161.camel@gaston> Message-ID: <1077898569.17961.11.camel@verve.austin.ibm.com> Hi Ben- > Can you quickly explain how the code could do a NULL ptr deref in > the first place? (and how that's fixed). Heh sure. The rets member of the rtas_args structure is an int pointer into the args member, which is an int array. Initially, I didn't set the "rets" ptr in this syscall, because I didn't need it in the function, and it wouldn't be useful to userspace when copied out. The following lines were more recently added to log hardware errors: + if (args.rets[0] == -1) + log_rtas_error(&args); Since rets was unassigned in this case, we're reading at a bad address. The following line fixes the problem: + args.rets = (rtas_arg_t *)&(args.args[nargs]); Thanks- John ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Sat Feb 28 05:12:42 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Fri, 27 Feb 2004 12:12:42 -0600 Subject: [ppc64 patch] virtual IO bus updates Message-ID: <403F889A.9020001@us.ibm.com> Hi Linus, please apply these two sysfs-related patches.
The first makes GregKH happy by removing the device name from the device.bus_id field (and replacing it with a "name" sysfs attribute). The second renames the parent device from "vdevice" to "vio", making the /sys/bus and /sys/devices hierarchies consistent. -- Hollis Blanchard IBM Linux Technology Center -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: vio-sysfs-parent.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040227/8509194f/attachment.txt -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: vio-name-attr.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040227/8509194f/attachment-0001.txt From benh at kernel.crashing.org Sat Feb 28 09:22:49 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 28 Feb 2004 09:22:49 +1100 Subject: [PATCH] RTAS syscall NULL ptr deref (2.6) In-Reply-To: <1077898569.17961.11.camel@verve.austin.ibm.com> References: <1077841356.14211.17.camel@verve.austin.ibm.com> <1077849354.22397.161.camel@gaston> <1077898569.17961.11.camel@verve.austin.ibm.com> Message-ID: <1077920569.22962.19.camel@gaston> On Sat, 2004-02-28 at 03:16, John Rose wrote: > Hi Ben- > > > Can you quickly explain how the code could do a NULL ptr deref in > > the first place? (and how that's fixed). > > Heh sure. The rets member of the rtas_args structure is an int pointer > into the args member, which is an int array. Initially, I didn't set > the "rets" ptr in this syscall, because I didn't need it in the > function, and it wouldn't be useful to userspace when copied out. Ok, makes more sense now, thanks. Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/
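John's explanation in this thread (rets is a pointer into the args array and has to be aimed at the first return slot before anything reads rets[0]) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct demo_args, DEMO_MAX_ARGS and both demo_* helpers are hypothetical stand-ins, and the layout is merely modelled on the fields named in the patch (token, nargs, nret, args, rets), not copied from the kernel header.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ARGS 16	/* assumed array size, for illustration only */

/* Hypothetical stand-in for the argument block discussed above. */
struct demo_args {
	uint32_t token;
	uint32_t nargs;
	uint32_t nret;
	uint32_t args[DEMO_MAX_ARGS];
	uint32_t *rets;		/* must point at &args[nargs] before use */
};

/*
 * Validate the counts, then aim rets at the first return slot.
 * Returns 0 on success, -1 if the counts do not fit in args[].
 */
int demo_set_rets(struct demo_args *a)
{
	if (a->nargs > DEMO_MAX_ARGS || a->nret > DEMO_MAX_ARGS ||
	    a->nargs + a->nret > DEMO_MAX_ARGS)
		return -1;
	a->rets = &a->args[a->nargs];
	return 0;
}

/* True if the first return word holds the -1 status checked in the patch. */
int demo_call_failed(const struct demo_args *a)
{
	return a->nret >= 1 && a->rets != NULL &&
	       a->rets[0] == (uint32_t)-1;
}
```

The ordering matters: the bounds check runs first, so rets never points past the end of args[] when it is dereferenced, and the status test refuses to read through a pointer that was never set.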