From nathanl at austin.ibm.com  Sun Feb  1 11:08:00 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 31 Jan 2004 18:08:00 -0600
Subject: [PATCH] ioremap kmallocs inside spinlocks - review request
In-Reply-To: <1075393500.13228.3.camel@verve>
References: <1075226812.10285.17.camel@verve>	 <4016B046.2060103@austin.ibm.com> <1075393500.13228.3.camel@verve>
Message-ID: <401C4360.3050702@austin.ibm.com>

John Rose wrote:
> Anybody have any yea or nay thoughts on Nathan's idea and/or my patch?

For what it's worth, here's a patch which converts the spinlock in
imalloc.c to a semaphore.  Tested with Anton's spinlock debugging patch.

This should be safe -- on i386, ioremap and friends use the vmalloc
subsystem, which can sleep.  And we're already using kmalloc(GFP_KERNEL)
in this code, so it's not going to break anything which wasn't already
broken.

While your patch also seems to solve the problem, converting to the
semaphore seems easier to me. :)
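
To make the reasoning concrete, here is a small userspace model (not kernel
code) of why an allocation that may sleep is a problem under a spinlock but
fine under a semaphore.  Every name in it is hypothetical; the real check is
done by the spinlock sleep debugging patch, not by anything shown here.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace model only: none of these names are kernel APIs. */
static int model_atomic_depth;	/* > 0 while a modelled spinlock is held */
static int model_warnings;	/* count of "sleep while atomic" hits */

static void model_spin_lock(void)
{
	model_atomic_depth++;
}

static void model_spin_unlock(void)
{
	assert(model_atomic_depth > 0);
	model_atomic_depth--;
}

/* A semaphore holder is allowed to sleep, so the depth is left alone. */
static void model_down(void) { }
static void model_up(void) { }

/* Stand-in for an allocation that may sleep, such as kmalloc(GFP_KERNEL). */
static void model_sleeping_alloc(void)
{
	if (model_atomic_depth > 0) {
		model_warnings++;
		fprintf(stderr, "model: sleeping call with spinlock held\n");
	}
}

/* Old shape: allocate under a spinlock.  Returns the warnings raised. */
int model_alloc_under_spinlock(void)
{
	int before = model_warnings;

	model_spin_lock();
	model_sleeping_alloc();
	model_spin_unlock();
	return model_warnings - before;
}

/* New shape: allocate under a semaphore.  Returns the warnings raised. */
int model_alloc_under_semaphore(void)
{
	int before = model_warnings;

	model_down();
	model_sleeping_alloc();
	model_up();
	return model_warnings - before;
}
```

In this model the spinlock variant raises one warning and the semaphore
variant raises none, which is the behaviour the conversion relies on.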

Nathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: imalloc_sem.patch
Type: text/x-patch
Size: 1707 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040131/f1daabd1/attachment.bin 

From nathanl at austin.ibm.com  Sun Feb  1 11:35:41 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 31 Jan 2004 18:35:41 -0600
Subject: ioremap problems
In-Reply-To: <20040127105533.GL11236@krispykreme>
References: <20040127104640.GK11236@krispykreme> <20040127105533.GL11236@krispykreme>
Message-ID: <401C49DD.1050008@austin.ibm.com>

Hi Anton-

> Here's the spinlock sleep debugging patch. Give it a spin on a 2.6
> kernel (don't forget to enable the config option), I'll bet there are still
> 5 bugs in our drivers left to find.

$ patch -p1 --dry-run < might_sleep_warn.patch
arch/ppc64/Kconfig 1.39: 433 lines
patching file arch/ppc64/Kconfig
Hunk #1 succeeded at 339 with fuzz 2 (offset -71 lines).
...

The patch you posted didn't apply cleanly for me on latest 2.6 from
Ameslab; it inserted the Kconfig option in the middle of the vio stuff
for some reason.  So I hand-patched the Kconfig; here's the patch I
generated if anyone wants it.

It would be nice to get this into Ameslab or mainline; it's very useful.
Thanks Anton.

Nathan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spinlock_sleep_debug.patch
Type: text/x-patch
Size: 2664 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040131/8e9a3cf9/attachment.bin 

From paulus at samba.org  Sun Feb  1 12:07:04 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sun, 1 Feb 2004 12:07:04 +1100
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <1075480843.682.188.camel@magik>
References: <1075480843.682.188.camel@magik>
Message-ID: <16412.20792.302473.676404@cargo.ozlabs.ibm.com>


Jake Moilanen writes:

> Here is support for the rtas error-inject call.

Wouldn't it simplify things if you had a read_proc function rather
than having your ppc_rtas_errinjct_read function?  As it is you must
use copy_to_user rather than memcpy there.
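
For reference, a userspace model of what a read_proc-style function looks
like.  The parameter list follows the 2.6 read_proc convention; the function
name, the reported text and MODEL_PAGE_SIZE are made up for illustration.
The point is that the function fills a kernel-side page buffer with ordinary
string routines and leaves the copy to user space to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size for this model */

/*
 * Fill "page" with the text, then report which slice [off, off + count)
 * the caller should hand on.  Returns the number of bytes in that slice.
 */
int model_errinjct_read_proc(char *page, char **start, off_t off,
			     int count, int *eof, void *data)
{
	const char *status = data;	/* hypothetical text to report */
	int len;

	assert(page && start && eof && status);

	len = snprintf(page, MODEL_PAGE_SIZE, "%s\n", status);
	if (len >= MODEL_PAGE_SIZE)
		len = MODEL_PAGE_SIZE - 1;	/* snprintf truncated the text */

	if (off >= len) {
		*eof = 1;	/* nothing left past this offset */
		return 0;
	}

	*start = page + off;
	len -= off;
	if (len <= count)
		*eof = 1;	/* the remainder fits in this read */
	else
		len = count;	/* partial read, more to come */
	return len;
}
```

No user-space pointer appears anywhere in the function, so there is nothing
for memcpy to get wrong.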

Paul.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Sun Feb  1 16:02:28 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 16:02:28 +1100
Subject: LPARCFG
Message-ID: <20040201050227.GB22694@krispykreme>


Hi,

Any reason we can't make lparcfg tristate (so we can compile it as a
module)?

I'm keeping an eye on our kernel size (it's getting huge); any chance
we get to cut it down is worth it.

Anton


From anton at samba.org  Sun Feb  1 16:10:58 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 16:10:58 +1100
Subject: [PATCH] ioremap kmallocs inside spinlocks - review request
In-Reply-To: <401C4360.3050702@austin.ibm.com>
References: <1075226812.10285.17.camel@verve> <4016B046.2060103@austin.ibm.com> <1075393500.13228.3.camel@verve> <401C4360.3050702@austin.ibm.com>
Message-ID: <20040201051058.GC22694@krispykreme>


> For what it's worth, here's a patch which converts the spinlock in
> imalloc.c to a semaphore.  Tested with Anton's spinlock debugging patch.
>
> This should be safe -- in i386 ioremap and friends use the vmalloc
> subsystem, which can sleep.  And we're already using kmalloc(GFP_KERNEL)
> in this code, so it's not going to break anything which wasn't already
> broken.
>
> While your patch also seems to solve the problem, converting to the
> semaphore seems easier to me. :)

Sorry for not getting back to you. Paul and I had a talk and came to the
same conclusion: ioremap can sleep, so it's safe to use a semaphore there.

BTW Every time I look at the imalloc code I get that feeling we should
be using the generic get_vm_area :)

Anton


From akpm at osdl.org  Sun Feb  1 17:30:42 2004
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 31 Jan 2004 22:30:42 -0800
Subject: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <4001BC69.8040300@cyberone.com.au>
References: <4001BAF8.9090203@cyberone.com.au>
	<4001BC69.8040300@cyberone.com.au>
Message-ID: <20040131223042.782ad9e2.akpm@osdl.org>


Nick Piggin <piggin at cyberone.com.au> wrote:
>
> This is the core sched domains patch.

I'm having a ton of trouble with this on the 4-way ppc64 box.  Symptoms are
similar to memory corruption: gcc falls over with sig11, filenames
corrupted, etc.  Often it fails to get through the initscripts without
userspace processes failing randomly.

One example:

 cc -Wall  -Wall -I../include    -c -o search_path.o search_path.c
make[1]: B<lots of random binary garbage>: Command not found
make[1]: *** [search_path.o] Error 127
make[1]: Leaving directory `/mnt/sdb5/ltp-full-20040108/lib'
make: *** [libltp.a] Error 2

and

cc -O -Wall  -w -o  test ./test.c
cc -c -Wall  -w -o  test_arch.o ./test.c
cc -Wall  -w -o  test_D ./test.c
make[4]: *** [test] Segmentation fault
make[4]: *** Deleting file `test'


I'm surprised that nobody else has noticed it.  The results of a binary
search through my current patch queue:

local_bh_enable-warning-fix.patch			OK
pnp-8250_pnp-fix.patch
pnp-resource-flags-reorganisation.patch
pnp-BIOS-workaround.patch
pnp-avoid-static-allocations.patch
pnp-move-ID-declarations.patch
pnp-file2alias-update.patch
pnp-update-matching-code.patch
pnp-additional-sysfs-info.patch
pnp-config-cleanup.patch				OK
sched-find_busiest_node-resolution-fix.patch		OK
sched-domains.patch					BAD
sched-clock-fixes.patch					BAD
sched-build-fix.patch					BAD
sched-sibling-map-to-cpumask.patch
p4-clockmod-sibling-map-fix.patch
p4-clockmod-more-than-two-siblings.patch
sched-domains-i386-ht.patch
sched-find_busiest_group-fix.patch			BAD
sched-domain-tweak.patch
sched-no-drop-balance.patch
sched-arch_init_sched_domains-fix.patch
sched-find_busiest_group-clarification.patch
sched-remove-noisy-printks.patch
sched-directed-migration.patch
sched-domain-debugging.patch
acpi-numa-printk-level-fixes.patch			BAD

points the finger at the core sched-domains patch.
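
The search itself can be sketched as follows.  This is a generic bisection,
not the script actually used; the callback and function names are
hypothetical, and it assumes that once a bad patch is applied every later
point in the queue also tests bad.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical callback: nonzero if a kernel built with patches 0..n
 * of the queue applied shows the failure.
 */
typedef int (*model_is_bad_fn)(size_t n);

/*
 * Return the index of the first patch whose inclusion makes the kernel
 * bad, or "count" if no point in the queue tests bad.
 */
size_t model_first_bad_patch(size_t count, model_is_bad_fn is_bad)
{
	size_t lo = 0, hi = count;	/* the answer lies in [lo, hi] */

	assert(is_bad != NULL);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (is_bad(mid))
			hi = mid;	/* first bad patch is at or before mid */
		else
			lo = mid + 1;	/* everything up to mid is fine */
	}
	return lo;
}

/*
 * Model of the table above: 27 patches, and the 12th entry
 * (sched-domains.patch, index 11) is the first one marked BAD.
 */
int model_queue_is_bad(size_t n)
{
	return n >= 11;
}
```

With 27 entries this needs about five builds, which is roughly the number of
OK/BAD marks near the boundary in the table.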

But Anton says that he's using your scheduler patches without problems.

mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)

I tried a kernel compiled with -O1 and it failed in the same way, which
somewhat rules out a compiler bug.

Anyone have any suggestions as to a next step?



From mbligh at aracnet.com  Sun Feb  1 17:52:33 2004
From: mbligh at aracnet.com (Martin J. Bligh)
Date: Sat, 31 Jan 2004 22:52:33 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <20040131223042.782ad9e2.akpm@osdl.org>
References: <4001BAF8.9090203@cyberone.com.au><4001BC69.8040300@cyberone.com.au> <20040131223042.782ad9e2.akpm@osdl.org>
Message-ID: <46800000.1075618352@[10.10.2.4]>


> But Anton says that he's using your scheduler patches without problems.
>
> mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
> powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)
>
> I tried a kernel compiled with -O1 and it failed in the same way, which
> somewhat rules out a compiler bug.
>
> Anyone have any suggestions as to a next step?

Where'd your PPC toolchain come from? Anton tends to roll his own, IIRC.
What happens if Anton gives you a binary kernel, and you run that?

M.



From akpm at osdl.org  Sun Feb  1 17:56:44 2004
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 31 Jan 2004 22:56:44 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <46800000.1075618352@[10.10.2.4]>
References: <4001BAF8.9090203@cyberone.com.au>
	<4001BC69.8040300@cyberone.com.au>
	<20040131223042.782ad9e2.akpm@osdl.org>
	<46800000.1075618352@[10.10.2.4]>
Message-ID: <20040131225644.3b160f38.akpm@osdl.org>


"Martin J. Bligh" <mbligh at aracnet.com> wrote:
>
> > But Anton says that he's using your scheduler patches without problems.
> >
> > mnm:/usr/src/25-power4> /usr/local/ppc64/bin/powerpc-linux-gcc --version
> > powerpc-linux-gcc (GCC) 3.2.3 20030211 (prerelease)
> >
> > I tried a kernel compiled with -O1 and it failed in the same way, which
> > somewhat rules out a compiler bug.
> >
> > Anyone have any suggestions as to a next step?
>
> Where'd your PPC toolchain come from? Anton tends to roll his own, IIRC.

I begat it ages ago and I hoard it jealously, because its birth was so
painful.

> What happens if Anton gives you a binary kernel, and you run that?

Dunno, I'll send him a .config.


From anton at samba.org  Sun Feb  1 17:57:23 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 17:57:23 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <20040131223042.782ad9e2.akpm@osdl.org>
References: <4001BAF8.9090203@cyberone.com.au> <4001BC69.8040300@cyberone.com.au> <20040131223042.782ad9e2.akpm@osdl.org>
Message-ID: <20040201065723.GD22694@krispykreme>


> I'm having a ton of trouble with this on the 4-way ppc64 box.  Symptoms are
> similar to memory corruption: gcc falls over with sig11, filenames
> corrupted, etc.  Often it fails to get through the initscripts without
> userspace processes failing randomly.

...

> points the finger at the core sched-domains patch.
>
> But Anton says that he's using your scheduler patches without problems.

Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
maybe my box wasn't stable and I didn't notice :)

Anton


From anton at samba.org  Sun Feb  1 18:36:27 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 18:36:27 +1100
Subject: RTAS error logging of EEH errors
Message-ID: <20040201073627.GE22694@krispykreme>


Hi,

I was reviewing patches to send onto Andrew Morton and noticed:

                if (ret == 0 && rets[1] == 1 && rets[0] >= 2) {
+                       unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX];
+                       unsigned long   slot_err_ret;

That's 2kB on the stack; it really wants to be kmalloced (GFP_ATOMIC
since it can be called in interrupt context) or statically defined.
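
A rough userspace sketch of the allocated variant, with malloc standing in
for kmalloc(..., GFP_ATOMIC).  The function name and MODEL_ERROR_LOG_MAX are
hypothetical; the 2048 is only taken from the 2kB figure above.  The part
that matters is that an atomic allocation can fail, so the caller has to
cope with getting no buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_ERROR_LOG_MAX 2048	/* assumed from the "2kB" figure */

/*
 * Copy up to MODEL_ERROR_LOG_MAX bytes of "log" into a heap buffer
 * instead of a large on-stack array.  Returns the number of bytes
 * captured, or -1 if the allocation failed.  The caller frees *out.
 */
int model_capture_slot_error(const unsigned char *log, size_t log_len,
			     unsigned char **out)
{
	unsigned char *buf;

	assert(log && out);

	buf = malloc(MODEL_ERROR_LOG_MAX);	/* kernel: kmalloc, GFP_ATOMIC */
	if (!buf)
		return -1;	/* no buffer: skip logging, do not crash */

	if (log_len > MODEL_ERROR_LOG_MAX)
		log_len = MODEL_ERROR_LOG_MAX;
	memcpy(buf, log, log_len);

	*out = buf;
	return (int)log_len;
}
```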

Anton


From ricklind at us.ibm.com  Sun Feb  1 20:02:40 2004
From: ricklind at us.ibm.com (Rick Lindsley)
Date: Sun, 01 Feb 2004 01:02:40 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain 
In-Reply-To: Your message of "Sun, 01 Feb 2004 17:57:23 +1100."
             <20040201065723.GD22694@krispykreme> 
Message-ID: <200402010902.i1192mM06956@owlet.beaverton.ibm.com>


    Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
    maybe my box wasn't stable and I didn't notice :)

Good point --  I've been running standard regression tests against it
on x86 and seen nothing like this.  But I haven't enabled SMT.  I just
finished going through the code with Martin and another engineer last
week and am summarizing some thoughts, but there were no major fallovers
like this.

Rick


From piggin at cyberone.com.au  Sun Feb  1 21:24:40 2004
From: piggin at cyberone.com.au (Nick Piggin)
Date: Sun, 01 Feb 2004 21:24:40 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <200402010902.i1192mM06956@owlet.beaverton.ibm.com>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com>
Message-ID: <401CD3E8.7050208@cyberone.com.au>


Rick Lindsley wrote:

>    Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
>    maybe my box wasn't stable and I didn't notice :)
>
>Good point --  I've been running standard regression tests against it
>on x86 and seen nothing like this.  But I haven't enabled SMT.  I just
>finished going through the code with Martin and another engineer last
>week and am summarizing some thoughts, but there were no major fallovers
>like this.
>
>

SMT should be i386 only at this stage so it rules that out.

x86 is pretty stable, that I am sure of. I have run it on quite
a lot of boxes, SMP P4, NUMAQ, SMP P3s, UP, etc plus countless
people testing -mm.

It wouldn't surprise me if there is a problem with another
architecture... but then again as you say Anton is running it
OK so I dunno. Maybe it is a compiler bug?



From piggin at cyberone.com.au  Sun Feb  1 21:26:21 2004
From: piggin at cyberone.com.au (Nick Piggin)
Date: Sun, 01 Feb 2004 21:26:21 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <401CD3E8.7050208@cyberone.com.au>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com> <401CD3E8.7050208@cyberone.com.au>
Message-ID: <401CD44D.8040203@cyberone.com.au>


Nick Piggin wrote:

>
>
> Rick Lindsley wrote:
>
>>    Do you have the SMT scheduler enabled? Maybe it's a bug with SMT off or
>>    maybe my box wasn't stable and I didn't notice :)
>>
>> Good point --  I've been running standard regression tests against it
>> on x86 and seen nothing like this.  But I haven't enabled SMT.  I just
>> finished going through the code with Martin and another engineer last
>> week and am summarizing some thoughts, but there were no major fallovers
>> like this.
>>
>>
>
> SMT should be i386 only at this stage so it rules that out.
>
> x86 is pretty stable, that I am sure of. I have run it on quite
> a lot of boxes, SMP P4, NUMAQ, SMP P3s, UP, etc plus countless
> people testing -mm.
>
> It wouldn't surprise me if there is a problem with another
> architecture... but then again as you say Anton is running it
> OK so I dunno. Maybe it is a compiler bug?
>
>

Andrew, Anton, are you using CONFIG_PREEMPT?



From akpm at osdl.org  Sun Feb  1 21:34:08 2004
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 1 Feb 2004 02:34:08 -0800
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <401CD44D.8040203@cyberone.com.au>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com>
	<401CD3E8.7050208@cyberone.com.au>
	<401CD44D.8040203@cyberone.com.au>
Message-ID: <20040201023408.262205d6.akpm@osdl.org>


Nick Piggin <piggin at cyberone.com.au> wrote:
>
>  Andrew, Anton, are you using CONFIG_PREEMPT?

nope.

And we're not sure that Anton has tested the patch much yet.  It could be
that the bug is happening for him too.



From anton at samba.org  Sun Feb  1 21:51:24 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 Feb 2004 21:51:24 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <401CD3E8.7050208@cyberone.com.au>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com> <401CD3E8.7050208@cyberone.com.au>
Message-ID: <20040201105124.GG22694@krispykreme>


> SMT should be i386 only at this stage so it rules that out.

It turns out my testing was done with the SMT scheduler on but with a busted
topology. I just fixed it (cpumask_snprintf is broken on big endian,
the output had me confused) and it's blowing up pretty soon in the slab
cache.

Give me a bit, I'll narrow it down. I'll also do a boot with the SMT
scheduler off and verify it's OK.

Anton


From boutcher at us.ibm.com  Mon Feb  2 04:43:16 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Sun, 1 Feb 2004 11:43:16 -0600
Subject: LPARCFG
In-Reply-To: <20040201050227.GB22694@krispykreme>
Message-ID: <OFE049E96C.C60D4B6A-ON86256E2D.0061119C-86256E2D.00615894@us.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 01/31/2004 11:02:28 PM:
> Any reason we can't make lparcfg tristate (so we can compile it as a
> module)?
>
> I'm keeping an eye on our kernel size (it's getting huge); any chance
> we get to cut it down is worth it.

I think that's a great idea.  Especially since it has that stupid e2a
routine :-)  There are a couple of pieces of IBM software that want to find
/proc/lparcfg, but as long as they know they need to document the
dependency on the module, we should be fine.

Dave Boutcher
IBM Linux Technology Center



From nathanl at austin.ibm.com  Mon Feb  2 06:53:02 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sun, 01 Feb 2004 13:53:02 -0600
Subject: LPARCFG
In-Reply-To: <20040201050227.GB22694@krispykreme>
References: <20040201050227.GB22694@krispykreme>
Message-ID: <401D591E.9030503@austin.ibm.com>


Anton Blanchard wrote:
> Hi,
>
> Any reason we can't make lparcfg tristate (so we can compile it as a
> module)?

Building as a module was broken when the code was checked in to ameslab
2.6; I suggested turning it off.  I think lparcfg uses unexported
symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out.  Should any of
those be exported?

See this thread for the history:
http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html

Nathan


From anton at samba.org  Mon Feb  2 10:50:02 2004
From: anton at samba.org (Anton Blanchard)
Date: Mon, 2 Feb 2004 10:50:02 +1100
Subject: [Lse-tech] Re: [PATCH 2/4] 2.6.1-mm2: sched-domain
In-Reply-To: <20040201105124.GG22694@krispykreme>
References: <200402010902.i1192mM06956@owlet.beaverton.ibm.com> <401CD3E8.7050208@cyberone.com.au> <20040201105124.GG22694@krispykreme>
Message-ID: <20040201235001.GJ22694@krispykreme>


> Give me a bit, I'll narrow it down. I'll also do a boot with the SMT
> scheduler off and verify it's OK.

I'm seeing very early slab corruption. Back out sched-* and it boots OK.

sym.0014:02:01.0:0:0: tagged command queuing enabled, command queue depth 16.
sym.0014:02:01.0:0: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
scsi: On host 12 channel 0 id 0 only 128 (max_scsi_report_luns) of 189483851 luns reported, try increasing max_scsi_report_luns.
scsi: host 12 channel 0 id 0 lun 0x5a5a5a5a5a5a5a5a has a LUN larger than currently supported.
scsi: host 12 channel 0 id 0 lun 0x5a5a5a5a5a5a5a5a has a LUN larger than currently supported.
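
The numbers in that log are consistent with the scan code reading a buffer
that still holds slab debug fill.  As far as I know 0x5a is the byte slab
debugging writes into objects that have been allocated but not yet
initialised, and a REPORT LUNS list length is a byte count with 8 bytes per
LUN.  A small userspace check of that arithmetic (all names are made up for
this sketch):

```c
#include <stdint.h>

/* Assumed slab debug fill byte for allocated, not yet written memory. */
#define MODEL_POISON_INUSE 0x5a

/* Nonzero if every byte of the 64-bit value is the fill byte. */
int model_is_all_poison(uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++) {
		if (((v >> (8 * i)) & 0xff) != MODEL_POISON_INUSE)
			return 0;
	}
	return 1;
}

/* A REPORT LUNS list length is in bytes; each LUN entry is 8 bytes. */
uint32_t model_lun_count(uint32_t list_length_bytes)
{
	return list_length_bytes / 8;
}
```

Feeding 0x5a5a5a5a in as the list length gives 189483851, the LUN count in
the log, and the reported LUN value is the fill byte repeated eight times.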

Anton


From david at gibson.dropbear.id.au  Mon Feb  2 17:48:41 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 2 Feb 2004 17:48:41 +1100
Subject: [RFC] implicit hugetlb pages (mmu_context_to_struct)
In-Reply-To: <1073683778.1298.115.camel@agtpad>
References: <1073683188.1297.105.camel@agtpad> <1073683778.1298.115.camel@agtpad>
Message-ID: <20040202064841.GA31383@zax>


On Fri, Jan 09, 2004 at 01:29:38PM -0800, Adam Litke wrote:
>
> mmu_context_to_struct (2.6.0):
>    This patch converts the mmu_context variable to a structure.  It is
> needed for the dynamic address space resizing patch.

Ok, I've made a revised version of this patch.  As well as handling
the various changes in ameslab since Adam's version, it uses a better
name for the mm_context fields, separates out the low_hpages flag, and
avoids putting extraneous info in the mmu_context_queue.

I'm slightly concerned about whether it has zero impact in the
!CONFIG_HUGETLB_PAGE case.  I think theoretically it could/should, but
there are a couple of cases where I don't know if gcc will be clever
enough to optimise everything away.
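
As a rough illustration of the change, here is a userspace model of the two
encodings.  The model_ names are hypothetical and unsigned long long stands
in for the 64-bit unsigned long on ppc64.  It shows the flag moving out of
bit 63 into its own field, and checks that the struct without the hugetlb
field takes no more storage than the bare id.  It says nothing about what
code gcc generates when such a struct is passed by value.

```c
/* Old encoding (model): the flag lives in bit 63 of the context word. */
#define MODEL_CONTEXT_LOW_HPAGES (1ULL << 63)
typedef unsigned long long model_old_context_t;

/* New encoding (model): id and flag are separate fields. */
typedef struct {
	unsigned long long id;
	int low_hpages;
} model_new_context_t;

/* Shape of the struct when the hugetlb field is compiled out. */
typedef struct {
	unsigned long long id;
} model_plain_context_t;

/* Old style: every user of the id has to mask the flag bit off first. */
unsigned long long model_old_id(model_old_context_t c)
{
	return c & ~MODEL_CONTEXT_LOW_HPAGES;
}

/* New style: the id is simply a field, no masking needed. */
unsigned long long model_new_id(model_new_context_t c)
{
	return c.id;
}

/* Nonzero if the single-field wrapper adds no storage over the id. */
int model_plain_context_is_free(void)
{
	return sizeof(model_plain_context_t) == sizeof(unsigned long long);
}
```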

Anton, if you approve I can push this to ameslab and/or akpm.

mmu-context-struct (2.6.0):
   This patch converts the mmu_context variable to a structure.  It is
needed for the dynamic HTLB address space resizing patch.

Index: working-2.6/arch/ppc64/kernel/stab.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/stab.c	2004-02-02 17:04:37.000000000 +1100
+++ working-2.6/arch/ppc64/kernel/stab.c	2004-02-02 17:36:30.961032128 +1100
@@ -175,13 +175,13 @@
 	/* Kernel or user address? */
 	if (REGION_ID(ea) >= KERNEL_REGION_ID) {
 		vsid = get_kernel_vsid(ea);
-		context = REGION_ID(ea);
+		context = KERNEL_CONTEXT(ea);
 	} else {
 		if (!current->mm)
 			return 1;

 		context = current->mm->context;
-		vsid = get_vsid(context, ea);
+		vsid = get_vsid(context.id, ea);
 	}

 	esid = GET_ESID(ea);
@@ -214,7 +214,7 @@

 	if (!IS_VALID_EA(pc) || (REGION_ID(pc) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, pc);
+	vsid = get_vsid(mm->context.id, pc);
 	__ste_allocate(pc_esid, vsid);

 	if (pc_esid == stack_esid)
@@ -222,7 +222,7 @@

 	if (!IS_VALID_EA(stack) || (REGION_ID(stack) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, stack);
+	vsid = get_vsid(mm->context.id, stack);
 	__ste_allocate(stack_esid, vsid);

 	if (pc_esid == unmapped_base_esid || stack_esid == unmapped_base_esid)
@@ -231,7 +231,7 @@
 	if (!IS_VALID_EA(unmapped_base) ||
 	    (REGION_ID(unmapped_base) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, unmapped_base);
+	vsid = get_vsid(mm->context.id, unmapped_base);
 	__ste_allocate(unmapped_base_esid, vsid);

 	/* Order update */
@@ -396,14 +396,14 @@

 	/* Kernel or user address? */
 	if (REGION_ID(ea) >= KERNEL_REGION_ID) {
-		context = REGION_ID(ea);
+		context = KERNEL_CONTEXT(ea);
 		vsid = get_kernel_vsid(ea);
 	} else {
 		if (unlikely(!current->mm))
 			return 1;

 		context = current->mm->context;
-		vsid = get_vsid(context, ea);
+		vsid = get_vsid(context.id, ea);
 	}

 	esid = GET_ESID(ea);
@@ -434,7 +434,7 @@

 	if (!IS_VALID_EA(pc) || (REGION_ID(pc) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, pc);
+	vsid = get_vsid(mm->context.id, pc);
 	__slb_allocate(pc_esid, vsid, mm->context);

 	if (pc_esid == stack_esid)
@@ -442,7 +442,7 @@

 	if (!IS_VALID_EA(stack) || (REGION_ID(stack) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, stack);
+	vsid = get_vsid(mm->context.id, stack);
 	__slb_allocate(stack_esid, vsid, mm->context);

 	if (pc_esid == unmapped_base_esid || stack_esid == unmapped_base_esid)
@@ -451,7 +451,7 @@
 	if (!IS_VALID_EA(unmapped_base) ||
 	    (REGION_ID(unmapped_base) >= KERNEL_REGION_ID))
 		return;
-	vsid = get_vsid(mm->context, unmapped_base);
+	vsid = get_vsid(mm->context.id, unmapped_base);
 	__slb_allocate(unmapped_base_esid, vsid, mm->context);
 }

Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c	2004-02-02 10:44:47.000000000 +1100
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c	2004-02-02 17:36:30.964031672 +1100
@@ -244,7 +244,7 @@
 	struct vm_area_struct *vma;
 	unsigned long addr;

-	if (mm->context & CONTEXT_LOW_HPAGES)
+	if (mm->context.low_hpages)
 		return 0; /* The window is already open */

 	/* Check no VMAs are in the region */
@@ -281,7 +281,7 @@

 	/* FIXME: do we need to scan for PTEs too? */

-	mm->context |= CONTEXT_LOW_HPAGES;
+	mm->context.low_hpages = 1;

 	/* the context change must make it to memory before the slbia,
 	 * so that further SLB misses do the right thing. */
@@ -589,7 +589,6 @@
 	}
 }

-
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 					unsigned long len, unsigned long pgoff,
 					unsigned long flags)
@@ -778,7 +777,7 @@
 	BUG_ON(hugepte_bad(pte));
 	BUG_ON(!in_hugepage_area(context, ea));

-	vsid = get_vsid(context, ea);
+	vsid = get_vsid(context.id, ea);

 	va = (vsid << 28) | (ea & 0x0fffffff);
 	vpn = va >> LARGE_PAGE_SHIFT;
Index: working-2.6/arch/ppc64/mm/init.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/init.c	2004-01-22 10:53:08.000000000 +1100
+++ working-2.6/arch/ppc64/mm/init.c	2004-02-02 17:36:30.967031216 +1100
@@ -502,7 +502,7 @@
 		break;
 	case USER_REGION_ID:
 		pgd = pgd_offset( vma->vm_mm, vmaddr );
-		context = vma->vm_mm->context;
+		context = vma->vm_mm->context.id;

 		/* XXX are there races with checking cpu_vm_mask? - Anton */
 		tmp = cpumask_of_cpu(smp_processor_id());
@@ -554,7 +554,7 @@
 		break;
 	case USER_REGION_ID:
 		pgd = pgd_offset(mm, start);
-		context = mm->context;
+		context = mm->context.id;

 		/* XXX are there races with checking cpu_vm_mask? - Anton */
 		tmp = cpumask_of_cpu(smp_processor_id());
@@ -943,7 +943,7 @@
 	if (!ptep)
 		return;

-	vsid = get_vsid(vma->vm_mm->context, ea);
+	vsid = get_vsid(vma->vm_mm->context.id, ea);

 	tmp = cpumask_of_cpu(smp_processor_id());
 	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
Index: working-2.6/arch/ppc64/mm/hash_utils.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hash_utils.c	2004-01-22 10:53:08.000000000 +1100
+++ working-2.6/arch/ppc64/mm/hash_utils.c	2004-02-02 15:16:34.000000000 +1100
@@ -237,7 +237,7 @@
 		if (mm == NULL)
 			return 1;

-		vsid = get_vsid(mm->context, ea);
+		vsid = get_vsid(mm->context.id, ea);
 		break;
 	case IO_REGION_ID:
 		mm = &ioremap_mm;
Index: working-2.6/arch/ppc64/xmon/xmon.c
===================================================================
--- working-2.6.orig/arch/ppc64/xmon/xmon.c	2004-01-22 10:53:08.000000000 +1100
+++ working-2.6/arch/ppc64/xmon/xmon.c	2004-02-02 17:36:30.972030456 +1100
@@ -1952,7 +1952,7 @@
 		// if in user range, use the current task's page directory
 		else if ( ( ea >= USER_START ) && ( ea <= USER_END ) ) {
 			mm = current->mm;
-			vsid = get_vsid(mm->context, ea );
+			vsid = get_vsid(mm->context.id, ea );
 		}
 		pgdir = mm->pgd;
 		va = ( vsid << 28 ) | ( ea & 0x0fffffff );
Index: working-2.6/include/asm-ppc64/mmu.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu.h	2004-02-02 10:44:49.000000000 +1100
+++ working-2.6/include/asm-ppc64/mmu.h	2004-02-02 17:36:30.975030000 +1100
@@ -18,15 +18,25 @@

 #ifndef __ASSEMBLY__

-/* Default "unsigned long" context */
-typedef unsigned long mm_context_t;
+/* Time to allow for more things here */
+typedef unsigned long mm_context_id_t;
+typedef struct {
+	mm_context_id_t id;
+#ifdef CONFIG_HUGETLB_PAGE
+	int low_hpages;
+#endif
+} mm_context_t;

 #ifdef CONFIG_HUGETLB_PAGE
-#define CONTEXT_LOW_HPAGES	(1UL<<63)
+#define KERNEL_LOW_HPAGES	.low_hpages = 0,
 #else
-#define CONTEXT_LOW_HPAGES	0
+#define KERNEL_LOW_HPAGES
 #endif

+#define KERNEL_CONTEXT(ea) ({ \
+		mm_context_t ctx = { .id = REGION_ID(ea), KERNEL_LOW_HPAGES}; \
+		ctx; })
+
 /*
  * Hardware Segment Lookaside Buffer Entry
  * This structure has been padded out to two 64b doublewords (actual SLBE's are
Index: working-2.6/include/asm-ppc64/mmu_context.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu_context.h	2004-02-02 10:44:49.000000000 +1100
+++ working-2.6/include/asm-ppc64/mmu_context.h	2004-02-02 17:36:30.977029696 +1100
@@ -52,7 +52,7 @@
 	long head;
 	long tail;
 	long size;
-	mm_context_t elements[LAST_USER_CONTEXT];
+	mm_context_id_t elements[LAST_USER_CONTEXT];
 };

 extern struct mmu_context_queue_t mmu_context_queue;
@@ -83,7 +83,6 @@
 	long head;
 	unsigned long flags;
 	/* This does the right thing across a fork (I hope) */
-	unsigned long low_hpages = mm->context & CONTEXT_LOW_HPAGES;

 	spin_lock_irqsave(&mmu_context_queue.lock, flags);

@@ -93,8 +92,7 @@
 	}

 	head = mmu_context_queue.head;
-	mm->context = mmu_context_queue.elements[head];
-	mm->context |= low_hpages;
+	mm->context.id = mmu_context_queue.elements[head];

 	head = (head < LAST_USER_CONTEXT-1) ? head+1 : 0;
 	mmu_context_queue.head = head;
@@ -132,8 +130,7 @@
 #endif

 	mmu_context_queue.size++;
-	mmu_context_queue.elements[index] =
-		mm->context & ~CONTEXT_LOW_HPAGES;
+	mmu_context_queue.elements[index] = mm->context.id;

 	spin_unlock_irqrestore(&mmu_context_queue.lock, flags);
 }
@@ -210,8 +207,6 @@
 {
 	unsigned long ordinal, vsid;

-	context &= ~CONTEXT_LOW_HPAGES;
-
 	ordinal = (((ea >> 28) & 0x1fffff) * LAST_USER_CONTEXT) | context;
 	vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK;

Index: working-2.6/include/asm-ppc64/page.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/page.h	2004-01-09 11:11:12.000000000 +1100
+++ working-2.6/include/asm-ppc64/page.h	2004-02-02 17:36:30.979029392 +1100
@@ -32,6 +32,7 @@
 /* For 64-bit processes the hugepage range is 1T-1.5T */
 #define TASK_HPAGE_BASE 	(0x0000010000000000UL)
 #define TASK_HPAGE_END 	(0x0000018000000000UL)
+
 /* For 32-bit processes the hugepage range is 2-3G */
 #define TASK_HPAGE_BASE_32	(0x80000000UL)
 #define TASK_HPAGE_END_32	(0xc0000000UL)
@@ -39,7 +40,7 @@
 #define ARCH_HAS_HUGEPAGE_ONLY_RANGE
 #define is_hugepage_only_range(addr, len) \
 	( ((addr > (TASK_HPAGE_BASE-len)) && (addr < TASK_HPAGE_END)) || \
-	  ((current->mm->context & CONTEXT_LOW_HPAGES) && \
+	  (current->mm->context.low_hpages && \
 	   (addr > (TASK_HPAGE_BASE_32-len)) && (addr < TASK_HPAGE_END_32)) )
 #define hugetlb_free_pgtables free_pgtables
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
@@ -47,7 +48,7 @@
 #define in_hugepage_area(context, addr) \
 	((cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) && \
 	 ((((addr) >= TASK_HPAGE_BASE) && ((addr) < TASK_HPAGE_END)) || \
-	  (((context) & CONTEXT_LOW_HPAGES) && \
+	  ((context).low_hpages && \
 	   (((addr) >= TASK_HPAGE_BASE_32) && ((addr) < TASK_HPAGE_END_32)))))

 #else /* !CONFIG_HUGETLB_PAGE */
Index: working-2.6/include/asm-ppc64/tlb.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/tlb.h	2004-01-22 10:53:21.000000000 +1100
+++ working-2.6/include/asm-ppc64/tlb.h	2004-02-02 17:36:30.980029240 +1100
@@ -65,7 +65,7 @@
 				if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask))
 					local = 1;

-				flush_hash_range(tlb->mm->context, i, local);
+				flush_hash_range(tlb->mm->context.id, i, local);
 				i = 0;
 			}
 		}
@@ -86,7 +86,7 @@
 	if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask))
 		local = 1;

-	flush_hash_range(tlb->mm->context, batch->index, local);
+	flush_hash_range(tlb->mm->context.id, batch->index, local);
 	batch->index = 0;

 	pte_free_finish();

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From nfont at austin.ibm.com  Tue Feb  3 03:01:44 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Mon, 02 Feb 2004 10:01:44 -0600
Subject: RTAS error logging of EEH errors
In-Reply-To: <20040201073627.GE22694@krispykreme>
References: <20040201073627.GE22694@krispykreme>
Message-ID: <1075737704.8079.27.camel@mudbug.austin.ibm.com>

I ended up leaving this on the stack because we call panic() right
after using the buffer.  The patch below changes this so that the
buffer is allocated.  If this is preferred let me know and I will
push the change to Ameslab.

I don't think this should be a statically defined buffer, no sense
in keeping around a buffer this big that will only be used once
right before we die.

-Nathan Fontenot

On Sun, 2004-02-01 at 01:36, Anton Blanchard wrote:
> Hi,
>
> I was reviewing patches to send onto Andrew Morton and noticed:
>
>                 if (ret == 0 && rets[1] == 1 && rets[0] >= 2) {
> +                       unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX];
> +                       unsigned long   slot_err_ret;
>
> That's 2kB on the stack, it really wants to be kmalloced (GFP_ATOMIC
> since it can be called in interrupt context) or statically defined.
>
> Anton
>
--
-------------- next part --------------
A non-text attachment was scrubbed...
Name: eeh_update.patch
Type: text/x-patch
Size: 1470 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040202/e964cd4a/attachment.bin 

From johnrose at austin.ibm.com  Tue Feb  3 04:14:39 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 02 Feb 2004 11:14:39 -0600
Subject: [PATCH] ioremap kmallocs inside spinlocks - review request
In-Reply-To: <20040201051058.GC22694@krispykreme>
References: <1075226812.10285.17.camel@verve>
	 <4016B046.2060103@austin.ibm.com> <1075393500.13228.3.camel@verve>
	 <401C4360.3050702@austin.ibm.com>  <20040201051058.GC22694@krispykreme>
Message-ID: <1075742079.26785.16.camel@verve>


I like Nathan's patch quite a bit more than what I came up with :)  I
was looking at the problem from the wrong angle I guess.

Thanks-
John

> Sorry for not getting back to you. Paul and I had a talk and came to the
> same conclusion, ioremap can sleep so it's safe to use a semaphore there.
>
> BTW Every time I look at the imalloc code I get that feeling we should
> be using the generic get_vm_area :)
>
> Anton



From olof at austin.ibm.com  Tue Feb  3 10:40:01 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 02 Feb 2004 17:40:01 -0600
Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
Message-ID: <401EDFD1.3010203@austin.ibm.com>


Hello all,

I've spent some time cleaning up the PCI DMA mapping code in 2.6. The
patch is large (117KB, 4000 lines), so I won't post it here. It can be
found at:

http://www.localnet.sh/patches/tce-rewrite-feb02.patch

Summary:

* Renamed include/asm-ppc64/pci_dma.h   -> iommu.h

* Removed include/asm-ppc64/iSeries/iSeries_dma.h

* Broke the contents of old arch/ppc64/kernel/pci_dma.c into:
    arch/ppc64/kernel/pci_iommu.c
    arch/ppc64/kernel/pSeries_iommu.c
    arch/ppc64/kernel/iSeries_iommu.c

...plus a large number of renames of functions, structures and structure
members. I've cleaned out the PPCDBG code from old pci_dma.c, since it
was very rarely useful due to the sheer volume of output and it
cluttered the code. Tracing can be added back in a limited fashion as
the need comes up in the future.

I also replaced the old buddy-style allocator for TCE ranges with a
simpler bitmap allocator. Time and benchmarking will tell if it's
efficient enough, but it's fairly well abstracted and can easily be replaced.
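
To illustrate the idea (this is not code from the patch): a bitmap
allocator keeps one bit per table entry and hands out the first run of
clear bits that is long enough. A minimal userspace sketch, assuming a
first-fit search; the names range_alloc and range_free are hypothetical:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Return 1 if entry i is marked in-use, 0 if it is free. */
static int test_bit_at(const unsigned long *map, size_t i)
{
	return (map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

/* Mark entries [start, start + n) as in-use (value = 1) or free (0). */
static void assign_bits(unsigned long *map, size_t start, size_t n, int value)
{
	for (size_t i = start; i < start + n; i++) {
		unsigned long mask = 1UL << (i % BITS_PER_WORD);

		if (value)
			map[i / BITS_PER_WORD] |= mask;
		else
			map[i / BITS_PER_WORD] &= ~mask;
	}
}

/* First fit: claim the first run of n free entries among nbits.
 * Returns the index of the first entry, or -1 if no run is available. */
long range_alloc(unsigned long *map, size_t nbits, size_t n)
{
	size_t run = 0;

	assert(n > 0);
	for (size_t i = 0; i < nbits; i++) {
		run = test_bit_at(map, i) ? 0 : run + 1;
		if (run == n) {
			size_t start = i + 1 - n;

			assign_bits(map, start, n, 1);
			return (long)start;
		}
	}
	return -1;
}

/* Release a range previously returned by range_alloc(). */
void range_free(unsigned long *map, size_t start, size_t n)
{
	assign_bits(map, start, n, 0);
}
```

The linear scan is the simplest possible search; a real allocator would
likely add locking and a search hint, which this sketch leaves out.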

Please take some time and look at the patch, the more eyes on it (and
comments/questions) the better. Patch is against ameslab as of today.

I will also add some statistics gathering similar to Linas Vepstas'
patch for the old TCE implementation, but that'll be handled separately.


  arch/ppc64/kernel/pci_dma.c               | 1477 ----------------------
  b/arch/ppc64/kdb/kdbasupport.c            |    2
  b/arch/ppc64/kernel/Makefile              |    6
  b/arch/ppc64/kernel/chrp_setup.c          |    2
  b/arch/ppc64/kernel/iSeries_iommu.c       |  242 ++++
  b/arch/ppc64/kernel/iSeries_pci.c         |    6
  b/arch/ppc64/kernel/mf.c                  |   18
  b/arch/ppc64/kernel/pSeries_iommu.c       |  294 +++++
  b/arch/ppc64/kernel/pSeries_lpar.c        |   86 -
  b/arch/ppc64/kernel/pSeries_pci.c         |    4
  b/arch/ppc64/kernel/pci.c                 |    2
  b/arch/ppc64/kernel/pci_dn.c              |    2
  b/arch/ppc64/kernel/pci_iommu.c           |  431 ++++++++
  b/arch/ppc64/kernel/ppc_ksyms.c           |   12
  b/arch/ppc64/kernel/prom.c                |   14
  b/arch/ppc64/kernel/vio.c                 |  138 +-
  b/arch/ppc64/kernel/viopath.c             |    6
  b/arch/ppc64/mm/init.c                    |    6
  b/drivers/block/viodasd.c                 |    6
  b/drivers/cdrom/viocd.c                   |   12
  b/drivers/net/ibmveth.c                   |    2
  b/drivers/net/iseries_veth.c              |    8
  b/drivers/scsi/iSeries_vscsi.c            |    4
  b/drivers/scsi/rpa_vscsi.c                |    2
  b/include/asm-ppc64/iSeries/iSeries_pci.h |    2
  b/include/asm-ppc64/iommu.h               |  158 +++
  b/include/asm-ppc64/machdep.h             |   12
  b/include/asm-ppc64/prom.h                |    4
  b/include/asm-ppc64/vio.h                 |    6
  include/asm-ppc64/iSeries/iSeries_dma.h   |   95 -
  include/asm-ppc64/pci_dma.h               |  100 --
  31 files changed, 1293 insertions(+), 1866 deletions(-)


Thanks,

-Olof


--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From johnrose at austin.ibm.com  Tue Feb  3 10:51:33 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 02 Feb 2004 17:51:33 -0600
Subject: [PATCH] nonempty bus->name for PHBs
Message-ID: <1075765893.26785.22.camel@verve>


Anyone have thoughts on this small addition to pcibios_fixup_bus()?  The
bus->name field is currently empty for PHBs.  Filling this in makes
debugging easier.  I also don't like the non-root level bus naming in
pci_scan_bridge(), as it doesn't include the PCI domain.

Thoughts?
John

diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c
--- a/arch/ppc64/kernel/pSeries_pci.c	Mon Feb  2 17:47:17 2004
+++ b/arch/ppc64/kernel/pSeries_pci.c	Mon Feb  2 17:47:17 2004
@@ -565,6 +565,7 @@
 				printk(KERN_ERR "Failed to request MEM"
 						"on hose %d\n", 0 /* FIXME */);
 		}
+		sprintf(bus->name, "PCI Host Bridge #%02x", pci_domain_nr(bus));
 	} else if (pci_probe_only &&
 		   (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI) {
 		/* This is a subordinate bridge */



From johnrose at austin.ibm.com  Wed Feb  4 03:17:48 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 03 Feb 2004 10:17:48 -0600
Subject: [PATCH] nonempty bus->name for PHBs
In-Reply-To: <Pine.A41.4.44.0402022311290.24564-100000@forte.austin.ibm.com>
References: <Pine.A41.4.44.0402022311290.24564-100000@forte.austin.ibm.com>
Message-ID: <1075825068.28337.11.camel@verve>


Hi Olof-

On Mon, 2004-02-02 at 23:14, olof at austin.ibm.com wrote:
> I'm wondering if #%02x is a bad choice of format, since how should #10 be
> parsed? Is it 10 or 16? It might be hard to tell when you're sitting there
> reading the output. How about 0x%02x instead?

Good point.  I directly copied this from pci_scan_bridge(), but your
suggestion makes things more clear.

>
> Also, how large is the buffer? Should you use snprintf just to be safe?

Eh.  The buffer is 48 chars long, and the other piece of code doesn't
use snprintf().  Maybe I'm being careless, but...

New patch attached!

Thanks-
John

diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c
--- a/arch/ppc64/kernel/pSeries_pci.c	Tue Feb  3 10:16:39 2004
+++ b/arch/ppc64/kernel/pSeries_pci.c	Tue Feb  3 10:16:39 2004
@@ -565,6 +565,8 @@
 				printk(KERN_ERR "Failed to request MEM"
 						"on hose %d\n", 0 /* FIXME */);
 		}
+		sprintf(bus->name, "PCI Host Bridge 0x%02x",
+				pci_domain_nr(bus));
 	} else if (pci_probe_only &&
 		   (dev->class >> 8) == PCI_CLASS_BRIDGE_PCI) {
 		/* This is a subordinate bridge */
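
For comparison, a bounded variant of the same formatting call might look
like the sketch below. It is plain userspace C, not part of the patch;
format_phb_name and BUS_NAME_LEN are hypothetical names, and the 48-byte
size is the figure quoted in the message above.

```c
#include <assert.h>
#include <stdio.h>

#define BUS_NAME_LEN 48	/* buffer size quoted in the thread */

/* Format the host bridge name into a buffer of len bytes.
 * Returns 0 on success, -1 if the name did not fit and was truncated. */
int format_phb_name(char *name, size_t len, int domain)
{
	int n;

	assert(len > 0);
	n = snprintf(name, len, "PCI Host Bridge 0x%02x", domain);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a 48-byte buffer the string above always fits, so the bound only
matters if the format or the buffer size changes later.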



From moilanen at austin.ibm.com  Wed Feb  4 03:27:30 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 03 Feb 2004 10:27:30 -0600
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <16412.20792.302473.676404@cargo.ozlabs.ibm.com>
References: <1075480843.682.188.camel@magik>
	 <16412.20792.302473.676404@cargo.ozlabs.ibm.com>
Message-ID: <1075825650.1591.114.camel@magik>


> Wouldn't it simplify things if you had a read_proc function rather
> than having your ppc_rtas_errinjct_read function?  As it is you must
> use copy_to_user rather than memcpy there.
>

Correct me if I'm wrong, but you can't use the file operations open and
release w/ read_proc.  While it would simplify ppc_rtas_errinjct_read,
to do rtas error inject correctly, we need to have the open and close
file operation hooked.  This will allow us to open and close the FW
errinjct facilities correctly.

One of the reasons the RPA architects designed error inject w/ this
"open token" in mind was so that only one error inject app can have
access to errinject at a time.

Thanks,
Jake



From jdewand at redhat.com  Wed Feb  4 04:54:47 2004
From: jdewand at redhat.com (Julie DeWandel)
Date: Tue, 03 Feb 2004 12:54:47 -0500
Subject: [2.4] [PATCH] hash_page rework, take 2
References: <40104461.5030804@austin.ibm.com>
Message-ID: <401FE067.80902@redhat.com>

Hi Olof,

The patch wasn't attached to your email so I included it along with
comments below (my comments preceded by "JSD:").

Thanks,
Julie

Olof Johansson wrote:

> Ok, so the previous approach of the hash_page rework had a few
> drawbacks as pointed out by Ben and others. Here's a new try, I'm
> looking for any feedback I can get on it!
>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hashpage.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040203/6c16856e/attachment.txt 

From jdewand at redhat.com  Wed Feb  4 06:12:42 2004
From: jdewand at redhat.com (Julie DeWandel)
Date: Tue, 03 Feb 2004 14:12:42 -0500
Subject: [2.4] [PATCH] hash_page rework, take 2
References: <40104461.5030804@austin.ibm.com> <401FE067.80902@redhat.com>
Message-ID: <401FF2AA.8000903@redhat.com>


Clarification/update on 2 points.

>
>-	spin_unlock(&hash_table_lock[lock_slot].lock);
>+out_unlock:
>+	smp_wmb();
>
>JSD: I believe an smp_mb__before_clear_bit() would be a better choice, or
>JSD: at least an lwsync because this is an unlock.
>
>+
>+	pte_val(*ptep) &= ~_PAGE_BUSY;
>
>-	return 0;
>+	return ret;
> }
>
> /*
>
JSD: I should have added that the clearing of the _PAGE_BUSY bit should be
JSD: done using an atomic op (clear_bit() routine)
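
As a rough userspace analogue of that suggestion (C11 atomics, not the
kernel's clear_bit() or barrier macros), the unlock could be sketched as
a single atomic read-modify-write with release ordering. PAGE_BUSY and
pte_unlock are hypothetical stand-ins, and the bit value is made up:

```c
#include <assert.h>
#include <stdatomic.h>

#define PAGE_BUSY 0x0800UL	/* hypothetical bit value, for illustration only */

/* Atomically clear the busy bit. Release ordering makes all earlier
 * stores visible before another thread can observe the bit as clear.
 * Returns the value the word held before the bit was cleared. */
unsigned long pte_unlock(_Atomic unsigned long *pte)
{
	unsigned long old;

	old = atomic_fetch_and_explicit(pte, ~PAGE_BUSY, memory_order_release);
	assert(old & PAGE_BUSY);	/* unlocking a PTE that was not busy is a bug */
	return old;
}
```

A plain "&=" is a separate load and store, so a concurrent update to
another bit in the same word could be lost; the atomic form avoids that.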

>diff -pru -X /root/.diffexclude linux-2.4.21-6.EL-just_tools/arch/ppc64/mm/init.c hashpage/arch/ppc64/mm/init.c
>--- linux-2.4.21-6.EL-just_tools/arch/ppc64/mm/init.c	2004-01-06 20:35:23.000000000 -0600
>+++ hashpage/arch/ppc64/mm/init.c	2004-01-18 18:11:14.000000000 -0600
>
>+static inline void pte_free_sync(void)
>+{
>+	unsigned long flags;
>+	int i;
>+
>+	/* All we need to know is that we can get the write lock if
>+	 * we wanted to, i.e. that no hash_page()s are holding it for reading.
>+	 * If none are reading, that means there's no currently executing
>+	 * hash_page() that might be working on one of the PTE's that will
>+	 * be deleted. Likewise, if there is a reader, we need to get the
>+	 * write lock to know when it releases the lock.
>+	 */
>+
>+	for (i = 0; i < smp_num_cpus; i++)
>+		if (is_read_locked(&pte_hash_lock[i])) {
>+			if(i == smp_processor_id())
>+				local_irq_save(flags);
>+
>+			write_lock(&pte_hash_lock[i]);
>+			write_unlock(&pte_hash_lock[i]);
>+
>+			if(i == smp_processor_id())
>+				local_irq_restore(flags);
>+		}
>+}
>
>JSD: Two questions here. (1) Shouldn't interrupts be disabled for the
>JSD: write_lock/unlock here regardless of what processor we are running
>JSD: on? (2) I don't see how this code is preventing another processor
>JSD: from grabbing the read_lock immediately after this processor has
>JSD: checked to make sure it isn't held.
>
JSD: Better question for (1) is why are interrupts being disabled here?
JSD: Can this routine be called from interrupt context?


--
Julie DeWandel <jdewand at redhat.com>
Red Hat, Inc.
Tel (978) 692-3113 x23251



From linas at austin.ibm.com  Wed Feb  4 11:32:10 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Tue, 3 Feb 2004 18:32:10 -0600
Subject: RTAS error logging of EEH errors
In-Reply-To: <1075737704.8079.27.camel@mudbug.austin.ibm.com>; from nfont@austin.ibm.com on Mon, Feb 02, 2004 at 10:01:44AM -0600
References: <20040201073627.GE22694@krispykreme> <1075737704.8079.27.camel@mudbug.austin.ibm.com>
Message-ID: <20040203183210.A27780@forte.austin.ibm.com>


Please don't push this patch; it'll mess up my patch, which I'm sending
out in a few minutes.  Anton, if this still really bugs you, let me know,
I'll fix it.

--linas

On Mon, Feb 02, 2004 at 10:01:44AM -0600, Nathan Fontenot wrote:
> I ended up leaving this on the stack because we call panic() right
> after using the buffer.  The patch below changes this so that the
> buffer is allocated.  If this is preferred let me know and I will
> push the change to Ameslab.
>
> I don't think this should be a statically defined buffer, no sense
> in keeping around a buffer this big that will only be used once
> right before we die.
>
> -Nathan Fontenot
>
> On Sun, 2004-02-01 at 01:36, Anton Blanchard wrote:
> > Hi,
> >
> > I was reviewing patches to send onto Andrew Morton and noticed:
> >
> >                 if (ret == 0 && rets[1] == 1 && rets[0] >= 2) {
> > +                       unsigned char slot_err_buf[RTAS_ERROR_LOG_MAX];
> > +                       unsigned long   slot_err_ret;
> >
> > That's 2kB on the stack, it really wants to be kmalloced (GFP_ATOMIC
> > since it can be called in interrupt context) or statically defined.
> >
> > Anton
> >
> --



From linas at austin.ibm.com  Wed Feb  4 11:34:59 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Tue, 3 Feb 2004 18:34:59 -0600
Subject: 2.6: PATCH for multiple EEH bugs
Message-ID: <20040203183459.B27780@forte.austin.ibm.com>


Patch for multiple EEH-related bugs.  Please review this patch,
& if appropriate, please apply.  It should apply cleanly to
the current ameslab tree (Feb 03 2004  2.6.2-rc3).

(I could try to do a "bk push", but I'm not yet confident with
my bk abilities).

This patch fixes multiple EEH-related bugs:

-- Fixes the eeh_check_failure() usage in an interrupt context.
   This routine is now safe to use in an interrupt. The fix was to
   build a cache of IO addresses and check that, instead of using
   the pci routines.
-- Merges in Olof Johansson's sizeof patch when checking for failure
-- Adds EEH tests to array/string reads
-- Fixes bugs with address resolution (some i/o addresses were handled
   incorrectly, resulting in EEH errors slipping by undetected.)
-- Adds EEH support to the PCI Hotplug system (so that devices that
   get added/removed get properly registered with the EEH subsystem.)
-- Fixes improper use of /proc filesystem.
-- Adds some misc statistics.

Please note that the EEH subsystem will be undergoing a major revision
in the not-too-distant future; this patch is a 'stopgap' to address the
immediate concerns/issues until that time.

--linas



From linas at austin.ibm.com  Wed Feb  4 11:44:41 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Tue, 3 Feb 2004 18:44:41 -0600
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <1075493023.682.199.camel@magik>; from moilanen@austin.ibm.com on Fri, Jan 30, 2004 at 02:03:43PM -0600
References: <1075480843.682.188.camel@magik> <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com> <1075488969.682.192.camel@magik> <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com> <1075493023.682.199.camel@magik>
Message-ID: <20040203184441.C27780@forte.austin.ibm.com>


On Fri, Jan 30, 2004 at 02:03:43PM -0600, Jake Moilanen wrote:
>
> > > Whoops, you're right.  Good catch.  This was leftover from the port from
> > > 2.4.
> >
> > That statement was alarming... :) I only found one memcpy left in that
> > file in 2.4, but I guess we're supposed to check for EFAULT:
>
> IIRC Linas went through a couple of months ago and fixed up
> rtas-proc.c.  The whole file was using memcpy instead of
> copy_to/from_user().

Yep, and there's a bunch of lower priority bugs in there too,
which I promised to fix, but haven't gotten a round tuit yet.

--linas
p.s. jake, thanks for this patch, I've been needing it ...



From haveblue at us.ibm.com  Wed Feb  4 12:07:44 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: 03 Feb 2004 17:07:44 -0800
Subject: [PATCH] kill lmb_add_io
Message-ID: <1075856864.1449.205.camel@nighthawk>

This function seems to be dead code now.  I found it during a horribly
confusing encounter with lmb.[ch] and CONFIG_MSCHUNKS.

Does anyone know what an MSCHUNK is?  It seems like it simplifies some
of the cases.  Does iSeries guarantee that the kernel's memory starts at
0 while pSeries doesn't?

--dave
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kill-lmb_add_io-2.6.1-0.patch
Type: text/x-patch
Size: 1516 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040203/655e5a6a/attachment.bin 

From paulus at samba.org  Wed Feb  4 12:17:25 2004
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 4 Feb 2004 12:17:25 +1100
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <1075501743.681.214.camel@magik>
References: <1075480843.682.188.camel@magik>
	<9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com>
	<1075488969.682.192.camel@magik>
	<264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com>
	<1075501743.681.214.camel@magik>
Message-ID: <16416.18469.3956.192501@cargo.ozlabs.ibm.com>


Jake Moilanen writes:

> Here's a patch w/ check for EFAULT.

In general, since this is a patch for 2.6 (according to the subject
line :), it would be better to do all this in userspace using the rtas
syscall that was added recently.  But here are comments on the patch
anyway:


> +config RTAS_ERRINJCT
> +	bool "RTAS Errinject"

How about bool "RTAS Error injection facility" ?

> +static unsigned int open_token = 0;

No need to initialize things to 0, C does that by default.
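
A small illustration of that rule in standard C: objects with static
storage duration (file-scope or "static" variables) start out as zero
or a null pointer without an explicit initializer. workspace_ptr and
read_open_token are hypothetical additions for the example:

```c
#include <assert.h>
#include <stddef.h>

static unsigned int open_token;	/* no initializer: starts as 0 */
static char *workspace_ptr;	/* hypothetical: starts as a null pointer */

/* Return the current token value; 0 until something assigns it. */
unsigned int read_open_token(void)
{
	assert(workspace_ptr == NULL);	/* nothing has assigned it yet */
	return open_token;
}
```

This does not apply to automatic (stack) variables, which are
indeterminate until assigned.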

> @@ -207,7 +221,8 @@
>  void proc_rtas_init(void)
>  {
>  	struct proc_dir_entry *entry;
> -
> + 	int errinjct_token;
> +

I can't see any difference between the blank line that is deleted and
the one that is inserted (not even any whitespace).  I wonder why diff
put that in?  Or has your mailer deleted trailing whitespace?

> +static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf,
> +				       size_t count, loff_t *ppos)
> +{
> +
> +	char * ei_token;
> +	char * workspace = NULL;
> +	size_t max_len;
> +	int token_len;
> +	int rc;
> +
> +	/* Verify the errinjct token length */
> +	if (count < ERRINJCT_TOKEN_LEN) {
> +		max_len = count;
> +	} else {
> +		max_len = ERRINJCT_TOKEN_LEN;
> +	}
> +
> +	token_len = strnlen(buf, max_len);

That's a user pointer that you're using strnlen on.  Ouch.  Use
strnlen_user or copy_from_user.  In fact, do we need to check the
string length at all?  Would it matter if there was a null in the
buffer, and the value taken as a string was shorter than we thought?

> +	token_len++; /* Add one for the null termination */
> +
> +	ei_token = (char *)kmalloc(token_len, GFP_KERNEL);
> +	if (!ei_token) {
> +		printk(KERN_WARNING "error: kmalloc failed\n");
> +		return -ENOMEM;
> +	}
> +
> +	strncpy(ei_token, buf, token_len);

Another access to the user buffer without using *user functions.
Ouch.
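
One way to read the two comments above is: copy a bounded number of
bytes across the user/kernel boundary first, then terminate the local
copy, and only ever run string functions on that copy. The sketch below
is a userspace simulation of that pattern, not kernel code:
fake_copy_from_user is a stand-in for the kernel's copy_from_user(),
malloc stands in for kmalloc, and fetch_token is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define ERRINJCT_TOKEN_LEN 24	/* max token length, from the patch */

/* Stand-in for copy_from_user(): returns the number of bytes NOT copied.
 * In userspace it cannot fault, so it always returns 0. */
static unsigned long fake_copy_from_user(void *to, const void *from,
					 unsigned long n)
{
	memcpy(to, from, n);
	return 0;
}

/* Copy at most ERRINJCT_TOKEN_LEN bytes and always NUL-terminate.
 * On success *out owns an allocated string; returns 0 or a negative errno. */
int fetch_token(const char *ubuf, size_t count, char **out)
{
	size_t len = count < ERRINJCT_TOKEN_LEN ? count : ERRINJCT_TOKEN_LEN;
	char *tok = malloc(len + 1);

	assert(out != NULL);
	if (!tok)
		return -ENOMEM;
	if (fake_copy_from_user(tok, ubuf, len)) {
		free(tok);
		return -EFAULT;
	}
	tok[len] = '\0';	/* the local copy is now safe for strcmp() etc. */
	*out = tok;
	return 0;
}
```

If the user data contains an embedded NUL, the resulting string is
simply shorter, which matches the question raised above about whether
the length check is needed at all.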

> +int
> +rtas_errinjct(unsigned int open_token, char * ei_token, char * workspace, size_t workspace_size)
> +{
> +	struct errinjct_token * ei;
> +	int rtas_ei_token = -1;
> +	unsigned int time;
> +	int rc = 0;
> + 	int i;
> +
> + 	ei = ei_token_list;
> + 	for (i = 0; i < MAX_ERRINJCT_TOKENS && ei->name; i++) {
> + 		if (strcmp(ei_token, ei->name) == 0) {
> + 			rtas_ei_token = ei->value;
> + 			break;
> + 		}
> + 		ei++;
> + 	}
> + 	if (rtas_ei_token == -1) {
> + 		return -EINVAL;
> + 	}
> +
> + 	spin_lock(&rtas_data_buf_lock);
> +
> +	while (1) {
> +		if (rc != RTAS_BUSY && workspace) {
> +			memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
> +			memcpy(rtas_data_buf, workspace, workspace_size);
> +		}

This worries me.  We copy the workspace (the contents of which are
undefined since we just kmalloc'd it) to rtas_data_buf, but we never
copy rtas_data_buf back to the workspace after the rtas call.  So how
can the contents of workspace ever be anything but undefined?

If we were doing this from userspace we could use the existing
facilities for getting memory below 4G and use that for the workspace
without any copying.

> +static int __init rtas_errinjct_init(void)
> +{
> +	char * token_array;
> + 	char * end_array;
> + 	int array_len = 0;
> + 	int len;
> + 	int i, j;
> +
> + 	token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens",
> + 					    &array_len);
> + 	end_array = token_array + array_len;
> + 	for (i = 0, j = 0; i < MAX_ERRINJCT_TOKENS && token_array < end_array; i++) {
> +
> + 		len = strnlen(token_array, ERRINJCT_TOKEN_LEN) + 1;
> + 		ei_token_list[i].name = (char *) kmalloc(len, GFP_KERNEL);
> + 		if (!ei_token_list[i].name) {
> + 			printk(KERN_WARNING "error: kmalloc failed\n");

Why can't we just store a pointer to the token name within the OF
property value?  Why do we have to make a copy of it?

In fact, why do we need to parse the list here at all?  We use the
list for two things: matching the token name in the write function,
and listing the tokens in the read function.  In both cases we could
just run through the ibm,errinjct-tokens (what do they have against
vwls?) property value almost as easily as ei_token_list.
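
A sketch of what matching directly against the property value could
look like. The layout is an assumption on my part: a packed list of
pairs, each a NUL-terminated name followed by a 4-byte big-endian
value. find_token is a hypothetical helper, written as plain userspace
C rather than against the kernel's device-tree API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up name in a packed list of (NUL-terminated name, 4-byte
 * big-endian value) pairs. Returns the value, or -1 if the name is
 * not present or an entry is truncated. */
long find_token(const unsigned char *prop, size_t len, const char *name)
{
	const unsigned char *p = prop;
	const unsigned char *end = prop + len;

	assert(prop != NULL && name != NULL);
	while (p < end) {
		const unsigned char *nul = memchr(p, '\0', (size_t)(end - p));
		const unsigned char *v;

		if (!nul || (size_t)(end - nul - 1) < 4)
			return -1;	/* name or value runs off the end */
		v = nul + 1;
		if (strcmp((const char *)p, name) == 0)
			return (long)(((unsigned long)v[0] << 24) |
				      ((unsigned long)v[1] << 16) |
				      ((unsigned long)v[2] << 8) |
				      (unsigned long)v[3]);
		p = v + 4;	/* skip the value, move to the next name */
	}
	return -1;
}
```

Because nothing is copied, there is no fixed-size table and therefore
no arbitrary limit on the number of tokens.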

> +/* Error inject defines */
> +#define ERRINJCT_TOKEN_LEN 24  /* Max length of an error inject token */
> +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. */
> +#define WORKSPACE_SIZE 1024

I worry about these too.  15 tokens sounds like a limit that future
machines are going to exceed quite easily.  Also, is the
workspace size limit there something that applies to all RTAS
functions, or just to the error injection functions?  If the latter,
you should choose a name that indicates that.

Overall, this looks to me like something that could be done just as
well or better in userspace.  Doing it in userspace would make it
easy to avoid having an arbitrary limit on the number of tokens, for
instance.  Userspace could just read
/proc/device-tree/rtas/ibm,errinjct-tokens to get the list and match
against that directly.

Paul.


From linas at austin.ibm.com  Wed Feb  4 12:18:39 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Tue, 3 Feb 2004 19:18:39 -0600
Subject: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <OF444DAAFD.B48E986E-ON86256E30.00041431-86256E30.000420B8@us.ibm.com>; from boutcher@us.ibm.com on Tue, Feb 03, 2004 at 06:45:05PM -0600
References: <20040203183459.B27780@forte.austin.ibm.com> <OF444DAAFD.B48E986E-ON86256E30.00041431-86256E30.000420B8@us.ibm.com>
Message-ID: <20040203191839.D27780@forte.austin.ibm.com>


Dohhh,


On Tue, Feb 03, 2004 at 06:45:05PM -0600, David Boutcher wrote:
>
>
> owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/03/2004 06:34:59 PM:
> > Patch for multiple EEH-related bugs.  Please review this patch,
> > & if appropriate, please apply.  It should apply cleanly to
> > the current ameslab tree (Feb 03 2004  2.6.2-rc3).
>
> By the way....no patch :-)
>
> Dave Boutcher
> IBM Linux Technology Center
>
-------------- next part --------------
===== arch/ppc64/kernel/chrp_setup.c 1.49 vs edited =====
--- 1.49/arch/ppc64/kernel/chrp_setup.c	Mon Jan 19 20:07:02 2004
+++ edited/arch/ppc64/kernel/chrp_setup.c	Tue Feb  3 16:35:44 2004
@@ -71,6 +71,7 @@
 extern void openpic_init_irq_desc(irq_desc_t *);

 extern void find_and_init_phbs(void);
+extern void __init eeh_init(void);

 extern void pSeries_get_boot_time(struct rtc_time *rtc_time);
 extern void pSeries_get_rtc_time(struct rtc_time *rtc_time);
===== arch/ppc64/kernel/eeh.c 1.17 vs edited =====
--- 1.17/arch/ppc64/kernel/eeh.c	Tue Feb  3 11:03:04 2004
+++ edited/arch/ppc64/kernel/eeh.c	Tue Feb  3 16:38:19 2004
@@ -38,44 +38,68 @@
 #define BUID_LO(buid) ((buid) & 0xffffffff)
 #define CONFIG_ADDR(busno, devfn) (((((busno) & 0xff) << 8) | ((devfn) & 0xf8)) << 8)

-unsigned long eeh_total_mmio_ffs;
-unsigned long eeh_false_positives;
 /* RTAS tokens */
 static int ibm_set_eeh_option;
 static int ibm_set_slot_reset;
 static int ibm_read_slot_reset_state;

-static int eeh_implemented;
+static int eeh_subsystem_enabled;
 #define EEH_MAX_OPTS 4096
 static char *eeh_opts;
 static int eeh_opts_last;

-unsigned char	slot_err_buf[RTAS_ERROR_LOG_MAX];
+/* System monitoring statistics */
+unsigned long eeh_total_mmio_ffs;
+unsigned long eeh_false_positives;
+unsigned long eeh_ignored_failures;

 pte_t *find_linux_pte(pgd_t *pgdir, unsigned long va);	/* from htab.c */
 static int eeh_check_opts_config(struct device_node *dn,
 				 int class_code, int vendor_id, int device_id,
 				 int default_state);

-unsigned long eeh_token_to_phys(unsigned long token)
+
+/**
+ * eeh_token_to_phys - convert EEH address token to phys address
+ * @token i/o token, should be address in the form 0xA....
+ *
+ * Converts EEH address tokens into physical addresses.  Note that
+ * this routine does *not* convert I/O BAR addresses (which start
+ * with 0xE...) to phys addresses!
+ */
+unsigned long
+eeh_token_to_phys(unsigned long token)
 {
+	pte_t *ptep;
+	unsigned long pa, vaddr;
 	if (REGION_ID(token) == EEH_REGION_ID) {
-		unsigned long vaddr = IO_TOKEN_TO_ADDR(token);
-		pte_t *ptep = find_linux_pte(ioremap_mm.pgd, vaddr);
-		unsigned long pa = pte_pfn(*ptep) << PAGE_SHIFT;
-		return pa | (vaddr & (PAGE_SIZE-1));
-	} else
+		vaddr = IO_TOKEN_TO_ADDR(token);
+	} else {
 		return token;
+	}
+
+	ptep = find_linux_pte(ioremap_mm.pgd, vaddr);
+	pa = pte_pfn(*ptep) << PAGE_SHIFT;
+	return pa | (vaddr & (PAGE_SIZE-1));
 }

-/* Check for an eeh failure at the given token address.
+/**
+ * eeh_check_failure - check if all 1's data is due to EEH slot freeze
+ * @token i/o token, should be address in the form 0xA....
+ * @val value, should be all 1's (XXX why do we need this arg??)
+ * @who arbitrary ID, useful for debugging
+ *
+ * Check for an eeh failure at the given token address.
  * The given value has been read and it should be 1's (0xff, 0xffff or
  * 0xffffffff).
  *
  * Probe to determine if an error actually occurred.  If not return val.
  * Otherwise panic.
+ *
+ * Note this routine might be called in an interrupt context ...
  */
-unsigned long eeh_check_failure(void *token, unsigned long val)
+unsigned long
+eeh_check_failure(void *token, unsigned long val, int who)
 {
 	unsigned long addr;
 	struct pci_dev *dev;
@@ -85,28 +109,27 @@
 	/* IO BAR access could get us here...or if we manually force EEH
 	 * operation on even if the hardware won't support it.
 	 */
-	if (!eeh_implemented || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE)
+	if (!eeh_subsystem_enabled || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE)
 		return val;

-	/* Finding the phys addr + pci device is quite expensive.
-	 * However, the RTAS call is MUCH slower.... :(
-	 */
+	/* Finding the phys addr + pci device; this is pretty quick. */
 	addr = eeh_token_to_phys((unsigned long)token);
-	dev = pci_find_dev_by_addr(addr);
-	if (!dev) {
-		printk("EEH: no pci dev found for addr=0x%lx\n", addr);
-		return val;
-	}
+	dev = pci_get_device_by_addr(addr);
+
+	if (!dev) return val;
+
 	dn = pci_device_to_OF_node(dev);
 	if (!dn) {
-		printk("EEH: no pci dn found for addr=0x%lx\n", addr);
+		pci_dev_put (dev);
 		return val;
 	}

 	/* Access to IO BARs might get this far and still not want checking. */
-	if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || dn->eeh_mode & EEH_MODE_NOCHECK)
+	if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) ||
+			 dn->eeh_mode & EEH_MODE_NOCHECK) {
+		pci_dev_put (dev);
 		return val;
-
+	}

 	/* Now test for an EEH failure.  This is VERY expensive.
 	 * Note that the eeh_config_addr may be a parent device
@@ -119,6 +142,7 @@
 				dn->eeh_config_addr, BUID_HI(dn->phb->buid),
 				BUID_LO(dn->phb->buid));
 		if (ret == 0 && rets[1] == 1 && rets[0] >= 2) {
+			unsigned char	slot_err_buf[RTAS_ERROR_LOG_MAX];
 			unsigned long	slot_err_ret;

 			memset(slot_err_buf, 0, RTAS_ERROR_LOG_MAX);
@@ -139,23 +163,36 @@
 			 * the system in light of potential corruption, we
 			 * can use it here.
 			 */
-			if (panic_on_oops)
-				panic("EEH: MMIO failure (%ld) on device:\n%s\n",
-				      rets[0], pci_name(dev));
-			else
-				printk("EEH: MMIO failure (%ld) on device:\n%s\n",
-				       rets[0], pci_name(dev));
+			if (panic_on_oops) {
+				panic("EEH: MMIO failure (%ld) on device:\n%s\n", rets[0], pci_name(dev));
+			} else {
+				eeh_ignored_failures ++;
+				if (!in_interrupt()) {  /* XXX this will be replaced by eehdaemon */
+					printk(KERN_INFO "EEH: MMIO failure (%ld) on device:%s %s\n",
+						rets[0], pci_name(dev), pci_pretty_name(dev));
+				}
+			}
+		} else {
+			eeh_false_positives++;
 		}
 	}
-	eeh_false_positives++;
+	pci_dev_put (dev);
 	return val;	/* good case */
-
 }

 struct eeh_early_enable_info {
 	unsigned int buid_hi;
 	unsigned int buid_lo;
-	int adapters_enabled;
+
+	/* Handy-dandy statistics help us understand what's going on */
+	int num_phbs_found;
+	int num_of_nodes;
+	int num_devices;
+	int num_devices_bad_status;
+	int num_devices_graphics;
+	int num_devices_w_eeh_parent;
+	int num_devices_w_eeh_disabled;
+	int num_adapters_enabled;
 };

 /* Enable eeh for the given device node. */
@@ -170,12 +207,17 @@
 	u32 *regs;
 	int enable;

-	if (status && strcmp(status, "ok") != 0)
+	info->num_devices ++;
+	if (status && strcmp(status, "ok") != 0) {
+		info->num_devices_bad_status ++;
 		return NULL;	/* ignore devices with bad status */
+	}

 	/* Weed out PHBs or other bad nodes. */
-	if (!class_code || !vendor_id || !device_id)
+	if (!class_code || !vendor_id || !device_id) {
+		info->num_devices_bad_status ++;
 		return NULL;
+	}

 	/* Ignore known PHBs and EADs bridges */
 	if (*vendor_id == PCI_VENDOR_ID_IBM &&
@@ -191,28 +233,36 @@
 	 * But there are a few cases like display devices that make sense.
 	 */
 	enable = 1;	/* i.e. we will do checking */
-	if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY)
+	if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) {
+		printk (KERN_INFO "EEH: %s: display device, disabling EEH checking.\n", dn->full_name);
+		info->num_devices_graphics ++;
 		enable = 0;
+	}

 	if (!eeh_check_opts_config(dn, *class_code, *vendor_id, *device_id, enable)) {
 		if (enable) {
-			printk(KERN_INFO "EEH: %s user requested to run without EEH.\n", dn->full_name);
+			printk(KERN_NOTICE "EEH: %s user requested to run without EEH.\n", dn->full_name);
 			enable = 0;
 		}
 	}

 	if (!enable)
+	{
 		dn->eeh_mode = EEH_MODE_NOCHECK;
+		info->num_devices_w_eeh_disabled ++;
+		return NULL;
+	}

 	/* This device may already have an EEH parent. */
 	if (dn->parent && (dn->parent->eeh_mode & EEH_MODE_SUPPORTED)) {
 		/* Parent supports EEH. */
 		dn->eeh_mode |= EEH_MODE_SUPPORTED;
 		dn->eeh_config_addr = dn->parent->eeh_config_addr;
+		info->num_devices_w_eeh_parent ++;
 		return NULL;
 	}

-	/* Ok..see if this device supports EEH. */
+	/* Ok... see if this device supports EEH. */
 	regs = (u32 *)get_property(dn, "reg", 0);
 	if (regs) {
 		/* First register entry is addr (00BBSS00)  */
@@ -221,16 +271,21 @@
 				regs[0], info->buid_hi, info->buid_lo,
 				EEH_ENABLE);
 		if (ret == 0) {
-			info->adapters_enabled++;
+			info->num_adapters_enabled++;
 			dn->eeh_mode |= EEH_MODE_SUPPORTED;
 			dn->eeh_config_addr = regs[0];
+			printk (KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name);
+		} else {
+			printk (KERN_WARNING "EEH: %s: rtas_call failed.\n", dn->full_name);
 		}
+	} else {
+		printk (KERN_WARNING "EEH: %s: unable to get reg property.\n", dn->full_name);
 	}
 	return NULL;
 }

 /*
- * Initialize eeh by trying to enable it for all of the adapters in the system.
+ * Initialize EEH by trying to enable it for all of the adapters in the system.
  * As a side effect we can determine here if eeh is supported at all.
  * Note that we leave EEH on so failed config cycles won't cause a machine
  * check.  If a user turns off EEH for a particular adapter they are really
@@ -243,7 +298,7 @@
  * The eeh-force-off/on option does literally what it says, so if Linux must
  * avoid enabling EEH this must be done.
  */
-void eeh_init(void)
+void __init eeh_init(void)
 {
 	struct device_node *phb;
 	struct eeh_early_enable_info info;
@@ -261,26 +316,37 @@
 	 * of I/O macros even if we can't actually test for EEH failure.
 	 */
 	if (eeh_force_on > eeh_force_off)
-		eeh_implemented = 1;
+		eeh_subsystem_enabled = 1;
 	else if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE)
 		return;

 	if (eeh_force_off > eeh_force_on) {
 		/* User is forcing EEH off.  Be noisy if it is implemented. */
-		if (eeh_implemented)
+		if (eeh_subsystem_enabled)
 			printk(KERN_WARNING "EEH: WARNING: PCI Enhanced I/O Error Handling is user disabled\n");
-		eeh_implemented = 0;
+		eeh_subsystem_enabled = 0;
 		return;
 	}

-
 	/* Enable EEH for all adapters.  Note that eeh requires buid's */
-	info.adapters_enabled = 0;
+	info.num_adapters_enabled = 0;
+	info.num_of_nodes = 0;
+	info.num_phbs_found = 0;
+	info.num_devices = 0;
+	info.num_devices_bad_status = 0;
+	info.num_devices_graphics = 0;
+	info.num_devices_w_eeh_parent = 0;
+	info.num_devices_w_eeh_disabled = 0;
 	for (phb = of_find_node_by_name(NULL, "pci"); phb; phb = of_find_node_by_name(phb, "pci")) {
+
 		int len;
-		int *buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len);
+		int *buid_vals;
+
+		info.num_of_nodes ++;
+		buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len);
 		if (!buid_vals)
 			continue;
+		info.num_phbs_found ++;
 		if (len == sizeof(int)) {
 			info.buid_lo = buid_vals[0];
 			info.buid_hi = 0;
@@ -288,24 +354,88 @@
 			info.buid_hi = buid_vals[0];
 			info.buid_lo = buid_vals[1];
 		} else {
-			printk("EEH: odd ibm,fw-phb-id len returned: %d\n", len);
+			printk(KERN_INFO "EEH: odd ibm,fw-phb-id len returned: %d\n", len);
 			continue;
 		}
 		traverse_pci_devices(phb, early_enable_eeh, NULL, &info);
 	}
-	if (info.adapters_enabled) {
+	if (info.num_adapters_enabled) {
 		printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n");
-		eeh_implemented = 1;
+		eeh_subsystem_enabled = 1;
+	}
+	printk(KERN_INFO "EEH: num_of_nodes=%d\n", info.num_of_nodes);
+	printk(KERN_INFO "EEH: num_phbs_found=%d\n", info.num_phbs_found);
+	printk(KERN_INFO "EEH: num_devices=%d\n", info.num_devices);
+	printk(KERN_INFO "EEH: num_devices_bad_status=%d\n", info.num_devices_bad_status);
+	printk(KERN_INFO "EEH: num_devices_graphics=%d\n", info.num_devices_graphics);
+	printk(KERN_INFO "EEH: num_devices_w_eeh_parent=%d\n", info.num_devices_w_eeh_parent);
+	printk(KERN_INFO "EEH: num_devices_w_eeh_disabled=%d\n", info.num_devices_w_eeh_disabled);
+	printk(KERN_INFO "EEH: num_adapters_enabled=%d\n", info.num_adapters_enabled);
+}
+
+/**
+ * eeh_add_device - perform EEH initialization for the indicated pci device
+ * @dev: pci device for which to set up EEH
+ *
+ * This routine can be used to perform EEH initialization for PCI
+ * devices that were added after system boot (e.g. hotplug, dlpar).
+ * Whether this actually enables EEH or not for this device depends
+ * on the type of the device, on earlier boot command-line
+ * arguments, etc.
+ */
+void
+eeh_add_device (struct pci_dev *dev)
+{
+	struct device_node *dn;
+	struct pci_controller *phb;
+	struct eeh_early_enable_info info;
+
+	if (!dev || !eeh_subsystem_enabled) return;
+
+	printk (KERN_DEBUG "EEH: adding device %s %s\n",
+					                 pci_name (dev), pci_pretty_name(dev));
+	dn = pci_device_to_OF_node(dev);
+	if (NULL == dn) return;
+
+	phb = PCI_GET_PHB_PTR(dev);
+	if (NULL == phb || 0 == phb->buid) {
+		printk (KERN_WARNING "EEH: Expected buid but found none\n");
+		return;
 	}
+
+	info.buid_hi = BUID_HI(phb->buid);
+	info.buid_lo = BUID_LO(phb->buid);
+
+	early_enable_eeh(dn, &info);
+	pci_addr_cache_insert_device (dev);
 }

+/**
+ * eeh_remove_device - undo EEH setup for the indicated pci device
+ * @dev: pci device to be removed
+ *
+ * This routine should be called when a device is removed from a running
+ * system (e.g. by hotplug or dlpar).
+ */
+void
+eeh_remove_device (struct pci_dev *dev)
+{
+	if (!dev || !eeh_subsystem_enabled) return;
+
+	/* Unregister the device with the EEH/PCI address search system */
+	printk (KERN_DEBUG "EEH: remove device %s %s\n",
+					                 pci_name (dev), pci_pretty_name(dev));
+	pci_addr_cache_remove_device (dev);
+
+}

-int eeh_set_option(struct pci_dev *dev, int option)
+int
+eeh_set_option(struct pci_dev *dev, int option)
 {
 	struct device_node *dn = pci_device_to_OF_node(dev);
 	struct pci_controller *phb = PCI_GET_PHB_PTR(dev);

-	if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_implemented)
+	if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_subsystem_enabled)
 		return -2;

 	return rtas_call(ibm_set_eeh_option, 4, 1, NULL,
@@ -316,7 +446,7 @@

 /* If EEH is implemented, find the PCI device using given phys addr
  * and check to see if eeh failure checking is disabled.
- * Remap the addr (trivially) to the EEH region if not.
+ * Remap the addr (trivially) to the EEH region if EEH checking is enabled.
  * For addresses not known to PCI the vaddr is simply returned unchanged.
  */
 void *eeh_ioremap(unsigned long addr, void *vaddr)
@@ -324,28 +454,72 @@
 	struct pci_dev *dev;
 	struct device_node *dn;

-	if (!eeh_implemented)
+	if (!eeh_subsystem_enabled)
 		return vaddr;
-	dev = pci_find_dev_by_addr(addr);
+	dev = pci_get_device_by_addr(addr);
 	if (!dev)
 		return vaddr;
-	dn = pci_device_to_OF_node(dev);
-	if (!dn)
+
+	dn = pci_device_to_OF_node(dev);
+	if (!dn) {
+		pci_dev_put (dev);
 		return vaddr;
-	if (dn->eeh_mode & EEH_MODE_NOCHECK)
+	}
+	if (dn->eeh_mode & EEH_MODE_NOCHECK) {
+		pci_dev_put (dev);
 		return vaddr;
+	}

+	pci_dev_put (dev);
 	return (void *)IO_ADDR_TO_TOKEN(vaddr);
 }

 static int eeh_proc_falsepositive_read(char *page, char **start, off_t off,
 			 int count, int *eof, void *data)
 {
-	int len;
-	len = sprintf(page, "eeh_false_positives=%ld\n"
-		      "eeh_total_mmio_ffs=%ld\n",
-		      eeh_false_positives, eeh_total_mmio_ffs);
-	return len;
+	char *p, *buffer;
+#define EEH_PROC_BUFSZ 250
+	int n=0, bs=EEH_PROC_BUFSZ;
+
+	if (count < 0) return -EINVAL;
+
+	buffer = kmalloc (EEH_PROC_BUFSZ,GFP_KERNEL);
+	if (!buffer) return -ENOMEM;
+
+	p = buffer;
+
+	if (0 == eeh_subsystem_enabled) {
+		n += snprintf (p+n, bs-n, "EEH Subsystem is globally disabled\n");
+		n += snprintf(p+n, bs-n, "eeh_total_mmio_ffs=%ld\n",
+		      	eeh_total_mmio_ffs);
+	} else {
+		n += snprintf (p+n, bs-n, "EEH Subsystem is enabled\n");
+		n += snprintf(p+n, bs-n,
+		      "eeh_total_mmio_ffs=%ld\n"
+				"eeh_false_positives=%ld\n"
+				"eeh_ignored_failures=%ld\n",
+		      eeh_total_mmio_ffs,
+		      eeh_false_positives,
+				eeh_ignored_failures);
+	}
+
+	/* Misc machinations of the proc file system */
+	if (off >= strlen(buffer)) {
+		*eof = 1;
+		kfree(buffer);
+		return 0;
+	}
+	if (n > strlen(buffer) - off)
+		n = strlen(buffer) - off;
+	if (n > count)
+		n = count;
+	else
+		*eof = 1;
+
+	memcpy(page, buffer + off, n);
+	*start = page;
+	kfree(buffer);
+	return n;
 }

 /* Implementation of /proc/ppc64/eeh
@@ -362,6 +536,12 @@
 	return 0;
 }

+static int __init eeh_init_late(void)
+{
+	eeh_init_proc ();
+	return 0;
+}
+
 /*
  * Test if "dev" should be configured on or off.
  * This processes the options literally from left to right.
@@ -456,7 +636,7 @@
 		if (*cur) {
 			int curlen = curend-cur;
 			if (eeh_opts_last + curlen > EEH_MAX_OPTS-2) {
-				printk(KERN_INFO "EEH: sorry...too many eeh cmd line options\n");
+				printk(KERN_WARNING "EEH: sorry...too many eeh cmd line options\n");
 				return 1;
 			}
 			eeh_opts[eeh_opts_last++] = state ? '+' : '-';
@@ -478,6 +658,6 @@
 	return eeh_parm(str, 1);
 }

-__initcall(eeh_init_proc);
+__initcall(eeh_init_late);
 __setup("eeh-off", eehoff_parm);
 __setup("eeh-on", eehon_parm);
===== arch/ppc64/kernel/pSeries_pci.c 1.34 vs edited =====
--- 1.34/arch/ppc64/kernel/pSeries_pci.c	Fri Jan 30 21:22:28 2004
+++ edited/arch/ppc64/kernel/pSeries_pci.c	Tue Feb  3 16:35:49 2004
@@ -530,7 +530,7 @@
 			dev->resource[i].start += hose->pci_mem_offset;
 			dev->resource[i].end += hose->pci_mem_offset;
 		}
-        }
+	}
 }
 EXPORT_SYMBOL(pcibios_fixup_device_resources);

===== arch/ppc64/kernel/pci.c 1.42 vs edited =====
--- 1.42/arch/ppc64/kernel/pci.c	Mon Jan 19 20:07:05 2004
+++ edited/arch/ppc64/kernel/pci.c	Tue Feb  3 16:39:53 2004
@@ -23,6 +23,8 @@
 #include <linux/bootmem.h>
 #include <linux/module.h>
 #include <linux/mm.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>

 #include <asm/processor.h>
 #include <asm/io.h>
@@ -107,42 +109,264 @@
 	}
 }

-/* Given an mmio phys address, find a pci device that implements
- * this address.  This is of course expensive, but only used
- * for device initialization or error paths.
- * For io BARs it is assumed the pci_io_base has already been added
- * into addr.
+/**
+ * The pci address cache subsystem.  This subsystem places
+ * PCI device address resources into a red-black tree, sorted
+ * according to the address range, so that given only an i/o
+ * address, the corresponding PCI device can be **quickly**
+ * found.
  *
- * Bridges are ignored although they could be used to optimize the search.
+ * Currently, the only customer of this code is the EEH subsystem;
+ * thus, this code has been somewhat tailored to suit EEH better.
+ * In particular, the cache does *not* hold the addresses of devices
+ * for which EEH is not enabled.
+ *
+ * (Implementation Note: The RB tree seems to be better/faster
+ * than any hash algo I could think of for this problem, even
+ * with the penalty of slow pointer chases for d-cache misses).
  */
-struct pci_dev *pci_find_dev_by_addr(unsigned long addr)
+struct pci_io_addr_range
 {
-	struct pci_dev *dev = NULL;
+	struct rb_node rb_node;
+	unsigned long addr_lo;
+	unsigned long addr_hi;
+	struct pci_dev *pcidev;
+	unsigned int flags;
+};
+
+struct pci_io_addr_cache
+{
+	struct rb_root rb_root;
+	spinlock_t piar_lock;
+} pci_io_addr_cache_root;
+
+static inline struct pci_dev *
+__pci_get_device_by_addr (unsigned long addr)
+{
+	struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node;
+	while (n)
+	{
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+		if (addr < piar->addr_lo) {
+			n = n->rb_left;
+		} else
+		if (addr > piar->addr_hi) {
+			n = n->rb_right;
+		} else {
+			pci_dev_get (piar->pcidev);
+			return piar->pcidev;
+		}
+	}
+	return NULL;
+}
+
+/**
+ * pci_get_device_by_addr - Get device, given only address
+ * @addr: mmio (PIO) phys address or i/o port number
+ *
+ * Given an mmio phys address, or a port number, find a pci device
+ * that implements this address.  Be sure to pci_dev_put the device
+ * when finished.  I/O port numbers are assumed to be offset
+ * from zero (that is, they do *not* have pci_io_base added in).
+ * It is safe to call this function within an interrupt.
+ */
+struct pci_dev *
+pci_get_device_by_addr (unsigned long addr)
+{
+	struct pci_dev *dev;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	dev = __pci_get_device_by_addr (addr);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+	return dev;
+}
+
+/* Handy-dandy debug print routine, does nothing more
+ * than print out the contents of our addr cache. */
+static void
+pci_addr_cache_print (struct pci_io_addr_cache *cache)
+{
+	struct rb_node *n;
+	n = rb_first (&cache->rb_root);
+	int cnt=0;
+	while (n) {
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+		printk (KERN_DEBUG "PCI: %s addr range %d [%lx -%lx]: %s %s\n",
+			(piar->flags & IORESOURCE_IO) ? "i/o" : "mem",
+			cnt,
+			piar->addr_lo, piar->addr_hi,
+			pci_name (piar->pcidev),
+			pci_pretty_name (piar->pcidev));
+		cnt ++;
+		n = rb_next (n);
+	}
+}
+
+/* Insert address range into the rb tree. */
+static inline struct pci_io_addr_range *
+pci_addr_cache_insert (struct pci_dev *dev,
+                unsigned long alo, unsigned long ahi, unsigned int flags)
+{
+	struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node;
+	struct rb_node * parent = NULL;
+	struct pci_io_addr_range *piar;
+
+	// Walk tree, find a place to insert into tree
+	while (*p) {
+		parent = *p;
+		piar = rb_entry (parent, struct pci_io_addr_range, rb_node);
+		if (alo < piar->addr_lo) {
+			p = &parent->rb_left;
+		} else if (ahi > piar->addr_hi) {
+			p = &parent->rb_right;
+		} else {
+			if (dev != piar->pcidev ||
+			    alo != piar->addr_lo || ahi != piar->addr_hi) {
+				printk (KERN_WARNING "PIAR: overlapping address range\n");
+			}
+			return piar;
+		}
+	}
+	piar = (struct pci_io_addr_range *) kmalloc (
+			sizeof(struct pci_io_addr_range),  GFP_ATOMIC);
+
+	if (!piar) return NULL;  // whoops
+
+	piar->addr_lo = alo;
+	piar->addr_hi = ahi;
+	piar->pcidev = dev;
+	piar->flags = flags;
+
+	rb_link_node (&piar->rb_node, parent, p);
+	rb_insert_color (&piar->rb_node, &pci_io_addr_cache_root.rb_root);
+	return piar;
+}
+
+inline void
+__pci_addr_cache_insert_device (struct pci_dev *dev)
+{
+	struct device_node *dn;
+	dn = pci_device_to_OF_node(dev);
+	if (!dn) {
+		printk(KERN_WARNING "PCI: no pci dn found for dev=%s %s\n",
+			pci_name(dev), pci_pretty_name(dev));
+		pci_dev_put (dev);
+		return;
+	}
+
+	// Skip any devices for which EEH is not enabled.
+	if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) ||
+	      dn->eeh_mode & EEH_MODE_NOCHECK) {
+		printk(KERN_INFO "PCI: skip building address cache for=%s %s\n",
+			pci_name(dev), pci_pretty_name(dev));
+		pci_dev_put (dev);
+		return;
+	}
+
+	// Walk resources on this device, poke them into the tree
 	int i;
-	unsigned long ioaddr;
+	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
+		unsigned long start = pci_resource_start(dev,i);
+		unsigned long end = pci_resource_end(dev,i);
+		unsigned int flags = pci_resource_flags(dev,i);
+
+		// We are interested only in bus addresses, not dma or other stuff
+		if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) continue;
+		if (start == 0 || ~start == 0 || end == 0 || ~end == 0)
+			 continue;
+		pci_addr_cache_insert (dev, start, end, flags);
+	}
+}

-	ioaddr = (addr > isa_io_base) ? addr - isa_io_base : 0;
+/**
+ * pci_addr_cache_insert_device - Add a device to the address cache
+ * @dev: PCI device whose I/O addresses we are interested in.
+ *
+ * In order to support the fast lookup of devices based on addresses,
+ * we maintain a cache of devices that can be quickly searched.
+ * This routine adds a device to that cache.
+ */
+void
+pci_addr_cache_insert_device (struct pci_dev *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	__pci_addr_cache_insert_device (dev);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+}

-	while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
-		if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
+static inline void
+__pci_addr_cache_remove_device (struct pci_dev *dev)
+{
+	struct rb_node *n;
+
+restart:
+	n = rb_first (&pci_io_addr_cache_root.rb_root);
+	while (n) {
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+
+		if (piar->pcidev == dev)
+		{
+			rb_erase (n, &pci_io_addr_cache_root.rb_root);
+			kfree (piar);
+			goto restart;
+		}
+		n = rb_next (n);
+	}
+	pci_dev_put (dev);
+}
+
+/**
+ * pci_addr_cache_remove_device - remove pci device from addr cache
+ * @dev: device to remove
+ *
+ * Remove a device from the addr-cache tree.
+ * This is potentially expensive, since it will walk
+ * the tree multiple times (once per resource).
+ * But so what; device removal doesn't need to be that fast.
+ */
+void
+pci_addr_cache_remove_device (struct pci_dev *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	__pci_addr_cache_remove_device (dev);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+}
+
+/**
+ * pci_addr_cache_build - Build a cache of I/O addresses
+ *
+ * Build a cache of pci i/o addresses.  This cache will be used to
+ * find the pci device that corresponds to a given address.
+ * This routine scans all pci busses to build the cache.
+ * Must be run late in boot process, after the pci controllers
+ * have been scanned for devices (after all device resources are known).
+ */
+static __init void
+pci_addr_cache_build (void)
+{
+	struct pci_dev *dev = NULL;
+
+	spin_lock_init (&pci_io_addr_cache_root.piar_lock);
+
+	while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
+		// Ignore PCI bridges ( XXX why ??)
+		if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) {
+			pci_dev_put (dev);
 			continue;
-
-		for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
-			unsigned long start = pci_resource_start(dev,i);
-			unsigned long end = pci_resource_end(dev,i);
-			unsigned int flags = pci_resource_flags(dev,i);
-			if (start == 0 || ~start == 0 ||
-			    end == 0 || ~end == 0)
-				continue;
-			if ((flags & IORESOURCE_IO) &&
-			    (ioaddr >= start && ioaddr <= end))
-				return dev;
-			else if ((flags & IORESOURCE_MEM) &&
-				 (addr >= start && addr <= end))
-				return dev;
 		}
+		pci_addr_cache_insert_device (dev);
 	}
-	return NULL;
+
+	// Verify tree built up above, echo back the list of addrs.
+	pci_addr_cache_print (&pci_io_addr_cache_root);
 }

 void
@@ -343,6 +567,8 @@

 	printk("PCI: Probing PCI hardware done\n");
 	//ppc64_boot_msg(0x41, "PCI Done");
+
+	pci_addr_cache_build ();

 	return 0;
 }
===== arch/ppc64/kernel/pci.h 1.10 vs edited =====
--- 1.10/arch/ppc64/kernel/pci.h	Fri Sep 12 06:01:39 2003
+++ edited/arch/ppc64/kernel/pci.h	Tue Feb  3 16:35:50 2004
@@ -37,11 +37,15 @@
 void *traverse_pci_devices(struct device_node *start, traverse_func pre, traverse_func post, void *data);
 void *traverse_all_pci_devices(traverse_func pre);

-struct pci_dev *pci_find_dev_by_addr(unsigned long addr);
 void pci_devs_phb_init(void);
 void pci_fix_bus_sysdata(void);
 struct device_node *fetch_dev_dn(struct pci_dev *dev);

 #define PCI_GET_PHB_PTR(dev)    (((struct device_node *)(dev)->sysdata)->phb)
+
+/* PCI address cache management routines */
+struct pci_dev *pci_get_device_by_addr(unsigned long addr);
+void pci_addr_cache_insert_device (struct pci_dev *dev);
+void pci_addr_cache_remove_device (struct pci_dev *dev);

 #endif /* __PPC_KERNEL_PCI_H__ */
===== drivers/pci/hotplug/rpaphp_core.c 1.2 vs edited =====
--- 1.2/drivers/pci/hotplug/rpaphp_core.c	Tue Dec  9 11:03:38 2003
+++ edited/drivers/pci/hotplug/rpaphp_core.c	Tue Feb  3 16:35:51 2004
@@ -30,6 +30,7 @@
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
+#include <asm/eeh.h>       /* for eeh_add_device() */
 #include <asm/rtas.h>		/* rtas_call */
 #include <asm/pci-bridge.h>	/* for pci_controller */
 #include "../pci.h"		/* for pci_add_new_bus*/
@@ -512,6 +513,7 @@
 		}

 		dev = rpaphp_find_pci_dev(slot->dn->child);
+		eeh_add_device(dev);
 	}
 	else {
 		/* slot is not enabled */
@@ -540,12 +542,12 @@
 		goto exit;
 	}

+	/* remove the device from the pci core */
+	eeh_remove_device(slot->dev);
+	pci_remove_bus_device(slot->dev);

-        /* remove the device from the pci core */
-        pci_remove_bus_device(slot->dev);
-
-        pci_dev_put(slot->dev);
-        slot->state = NOT_CONFIGURED;
+	pci_dev_put(slot->dev);
+	slot->state = NOT_CONFIGURED;

 	dbg("%s: adapter in slot[%s] unconfigured.\n", __FUNCTION__, slot->name);

===== include/asm-ppc64/eeh.h 1.6 vs edited =====
--- 1.6/include/asm-ppc64/eeh.h	Fri Sep 12 06:06:51 2003
+++ edited/include/asm-ppc64/eeh.h	Tue Feb  3 16:35:51 2004
@@ -45,22 +45,37 @@
 /* This is for profiling only */
 extern unsigned long eeh_total_mmio_ffs;

-void eeh_init(void);
-int eeh_get_state(unsigned long ea);
-unsigned long eeh_check_failure(void *token, unsigned long val);
+unsigned long eeh_check_failure(void *token, unsigned long val, int who);
 void *eeh_ioremap(unsigned long addr, void *vaddr);

+/**
+ * eeh_add_device - perform EEH initialization for the indicated pci device
+ * @dev: pci device for which to set up EEH
+ *
+ * This routine can be used to perform EEH initialization for PCI
+ * devices that were added after system boot (e.g. hotplug, dlpar).
+ * Whether this actually enables EEH or not for this device depends
+ * on the type of the device, on earlier boot command-line
+ * arguments & etc.
+ */
+void   eeh_add_device(struct pci_dev *);
+
+/**
+ * eeh_remove_device - undo EEH setup for the indicated pci device
+ * @dev: pci device to be removed
+ *
+ * This routine should be called when a device is removed from a running
+ * system (e.g. by hotplug or dlpar).
+ */
+void   eeh_remove_device(struct pci_dev *);
+
+
 #define EEH_DISABLE		0
 #define EEH_ENABLE		1
 #define EEH_RELEASE_LOADSTORE	2
 #define EEH_RELEASE_DMA		3
 int eeh_set_option(struct pci_dev *dev, int options);

-/* Given a PCI device check if eeh should be configured or not.
- * This may look at firmware properties and/or kernel cmdline options.
- */
-int is_eeh_configured(struct pci_dev *dev);
-
 /* Translate a (possible) eeh token to a physical addr.
  * If "token" is not an eeh token it is simply returned under
  * the assumption that it is already a physical addr.
@@ -78,11 +93,16 @@
  * If this macro yields TRUE, the caller relays to eeh_check_failure()
  * which does further tests out of line.
  */
-/* #define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0) */
-/* #define EEH_POSSIBLE_ERROR(addr, vaddr, val) ((vaddr) != (addr) && EEH_POSSIBLE_IO_ERROR(val) */
 /* This version is rearranged to collect some profiling data */
-#define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0 && ++eeh_total_mmio_ffs)
-#define EEH_POSSIBLE_ERROR(addr, vaddr, val) (EEH_POSSIBLE_IO_ERROR(val) && (vaddr) != (addr))
+#define EEH_POSSIBLE_IO_ERROR(val, type)				\
+		((val) == (type)~0 && ++eeh_total_mmio_ffs)
+
+/* The vaddr will equal the addr if EEH checking is disabled for
+ * this device.  This is because eeh_ioremap() will not have
+ * remapped to 0xA0, and thus both vaddr and addr will be 0xE0...
+ */
+#define EEH_POSSIBLE_ERROR(addr, vaddr, val, type)			\
+		(EEH_POSSIBLE_IO_ERROR(val, type) && (vaddr) != (addr))

 /*
  * MMIO read/write operations with EEH support.
@@ -101,8 +121,8 @@
 static inline u8 eeh_readb(void *addr) {
 	volatile u8 *vaddr = (volatile u8 *)IO_TOKEN_TO_ADDR(addr);
 	u8 val = in_8(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u8))
+		return eeh_check_failure(addr, val, 8);
 	return val;
 }
 static inline void eeh_writeb(u8 val, void *addr) {
@@ -112,25 +132,47 @@
 static inline u16 eeh_readw(void *addr) {
 	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
 	u16 val = in_le16(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16))
+		return eeh_check_failure(addr, val, 16);
 	return val;
 }
 static inline void eeh_writew(u16 val, void *addr) {
 	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
 	out_le16(vaddr, val);
 }
+static inline u16 eeh_raw_readw(void *addr) {
+	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
+	u16 val = in_be16(vaddr);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16))
+		return eeh_check_failure(addr, val, 17);
+	return val;
+}
+static inline void eeh_raw_writew(u16 val, void *addr) {
+	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
+	out_be16(vaddr, val);
+}
 static inline u32 eeh_readl(void *addr) {
 	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
 	u32 val = in_le32(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32))
+		return eeh_check_failure(addr, val, 32);
 	return val;
 }
 static inline void eeh_writel(u32 val, void *addr) {
 	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
 	out_le32(vaddr, val);
 }
+static inline u32 eeh_raw_readl(void *addr) {
+	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
+	u32 val = in_be32(vaddr);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32))
+		return eeh_check_failure(addr, val, 33);
+	return val;
+}
+static inline void eeh_raw_writel(u32 val, void *addr) {
+	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
+	out_be32(vaddr, val);
+}

 static inline void eeh_memset_io(void *addr, int c, unsigned long n) {
 	void *vaddr = (void *)IO_TOKEN_TO_ADDR(addr);
@@ -139,8 +181,14 @@
 static inline void eeh_memcpy_fromio(void *dest, void *src, unsigned long n) {
 	void *vsrc = (void *)IO_TOKEN_TO_ADDR(src);
 	memcpy(dest, vsrc, n);
-	/* look for ffff's here at dest[n] */
+	/* Look for ffff's here at dest[n].  Assume that at least 4 bytes
+	 * were copied. Check all four bytes.
+	 */
+	if ((n>=4) && (EEH_POSSIBLE_ERROR(src, vsrc, (*((u32 *) ((u8 *) dest+n-4))), u32))) {
+		eeh_check_failure(src, (*((u32 *) ((u8 *) dest+n-4))), 88);
+	}
 }
+
 static inline void eeh_memcpy_toio(void *dest, void *src, unsigned long n) {
 	void *vdest = (void *)IO_TOKEN_TO_ADDR(dest);
 	memcpy(vdest, src, n);
@@ -158,8 +206,8 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_8((u8 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u8))
+		return eeh_check_failure((void*)(port), val, -8);
 	return val;
 }

@@ -173,8 +221,8 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_le16((u16 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u16))
+		return eeh_check_failure((void*)(port), val, -16);
 	return val;
 }

@@ -188,14 +236,33 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_le32((u32 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u32))
+		return eeh_check_failure((void*)(port), val, -32);
 	return val;
 }

 static inline void eeh_outl(u32 val, unsigned long port) {
 	if (!_IO_IS_ISA(port) || _IO_HAS_ISA_BUS)
 		return out_le32((u32 *)(port+pci_io_base), val);
+}
+
+/* in-string eeh macros */
+static inline void eeh_insb(unsigned long port, void * buf, int ns) {
+	_insb((u8 *)(port+pci_io_base), buf, ns);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u8*)buf)+ns-1)), u8))
+		eeh_check_failure((void*)(port), *(u8*)buf, -9);
+}
+
+static inline void eeh_insw_ns(unsigned long port, void * buf, int ns) {
+	_insw_ns((u16 *)(port+pci_io_base), buf, ns);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u16*)buf)+ns-1)), u16))
+		eeh_check_failure((void*)(port), *(u16*)buf, -17);
+}
+
+static inline void eeh_insl_ns(unsigned long port, void * buf, int nl) {
+	_insl_ns((u32 *)(port+pci_io_base), buf, nl);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u32*)buf)+nl-1)), u32))
+		eeh_check_failure((void*)(port), *(u32*)buf, -33);
 }

 #endif /* _EEH_H */
===== include/asm-ppc64/io.h 1.11 vs edited =====
--- 1.11/include/asm-ppc64/io.h	Mon Jan 19 20:08:22 2004
+++ edited/include/asm-ppc64/io.h	Tue Feb  3 16:35:52 2004
@@ -49,6 +49,13 @@
 #define outb(data,addr)		writeb(data,((unsigned long)(addr)))
 #define outw(data,addr)		writew(data,((unsigned long)(addr)))
 #define outl(data,addr)		writel(data,((unsigned long)(addr)))
+/*
+ * The *_ns versions below don't do byte-swapping.
+ * Neither do the standard versions now, these are just here
+ * for older code.
+ */
+#define insw_ns(port, buf, ns)	_insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
+#define insl_ns(port, buf, nl)	_insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
 #else
 #define readb(addr)		eeh_readb((void*)(addr))
 #define readw(addr)		eeh_readw((void*)(addr))
@@ -71,12 +78,16 @@
  * They are only used in practice for transferring buffers which
  * are arrays of bytes, and byte-swapping is not appropriate in
  * that case.  - paulus */
-#define insb(port, buf, ns)	_insb((u8 *)((port)+pci_io_base), (buf), (ns))
-#define outsb(port, buf, ns)	_outsb((u8 *)((port)+pci_io_base), (buf), (ns))
-#define insw(port, buf, ns)	_insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define outsw(port, buf, ns)	_outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define insl(port, buf, nl)	_insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
-#define outsl(port, buf, nl)	_outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
+#define insb(port, buf, ns)	eeh_insb((port), (buf), (ns))
+#define insw(port, buf, ns)	eeh_insw_ns((port), (buf), (ns))
+#define insl(port, buf, nl)	eeh_insl_ns((port), (buf), (nl))
+#define insw_ns(port, buf, ns)	eeh_insw_ns((port), (buf), (ns))
+#define insl_ns(port, buf, nl)	eeh_insl_ns((port), (buf), (nl))
+
+#define outsb(port, buf, ns)  _outsb((u8 *)((port)+pci_io_base), (buf), (ns))
+#define outsw(port, buf, ns)  _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
+#define outsl(port, buf, nl)  _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
+
 #endif

 extern void _insb(volatile u8 *port, void *buf, int ns);
@@ -106,9 +117,7 @@
  * Neither do the standard versions now, these are just here
  * for older code.
  */
-#define insw_ns(port, buf, ns)	_insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
 #define outsw_ns(port, buf, ns)	_outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define insl_ns(port, buf, nl)	_insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
 #define outsl_ns(port, buf, nl)	_outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))


@@ -177,6 +186,9 @@

 /*
  * 8, 16 and 32 bit, big and little endian I/O operations, with barrier.
+ * These routines do not perform EEH-related I/O address translation,
+ * and should not be used directly by device drivers.  Use inb/readb
+ * instead.
  */
 static inline int in_8(volatile unsigned char *addr)
 {
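
A note on the EEH_POSSIBLE_IO_ERROR() change in the eeh.h hunk above, offered
as a hedged sketch: the patch does not say why the macro gained a type
argument, but one plausible reason is C integer promotion.  With a u8 or u16
value, ~(val) is computed on an int, so it can never be 0 and the old test
would not fire for 8- and 16-bit reads.  The standalone code below uses the
<stdint.h> types in place of the kernel's u8/u16/u32; the macro and function
names are invented for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: the operand is promoted to int before '~' is applied. */
#define ALL_ONES_OLD(val)       (~(val) == 0)
/* New form, as in the patch: compare against all-ones of the access width. */
#define ALL_ONES_NEW(val, type) ((val) == (type)~0)

int old_test_u8(uint8_t v)   { return ALL_ONES_OLD(v); }
int old_test_u16(uint16_t v) { return ALL_ONES_OLD(v); }
int old_test_u32(uint32_t v) { return ALL_ONES_OLD(v); }

int new_test_u8(uint8_t v)   { return ALL_ONES_NEW(v, uint8_t); }
int new_test_u16(uint16_t v) { return ALL_ONES_NEW(v, uint16_t); }
int new_test_u32(uint32_t v) { return ALL_ONES_NEW(v, uint32_t); }

/* Self-check, assuming a 32-bit int as on ppc64 and most other platforms. */
void check_all_ones_macros(void)
{
	/* Old form misses all-ones for the narrow types... */
	assert(!old_test_u8(0xff));
	assert(!old_test_u16(0xffff));
	/* ...and only works when the value is already int-sized. */
	assert(old_test_u32(0xffffffffu));

	/* New form detects all-ones at every width. */
	assert(new_test_u8(0xff));
	assert(new_test_u16(0xffff));
	assert(new_test_u32(0xffffffffu));

	/* And it does not fire on ordinary data. */
	assert(!new_test_u8(0x7f));
}
```

Calling check_all_ones_macros() from any small test program exercises all
three widths; under the old form only the 32-bit case is detected.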

From paulus at samba.org  Wed Feb  4 13:22:27 2004
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 4 Feb 2004 13:22:27 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <1075856864.1449.205.camel@nighthawk>
References: <1075856864.1449.205.camel@nighthawk>
Message-ID: <16416.22371.237705.124944@cargo.ozlabs.ibm.com>


Dave Hansen writes:

> This function seems to be dead code now.  I found it during a horribly
> confusing encounter with lmb.[ch] and CONFIG_MSCHUNKS.
>
> Does anyone know what an MSCHUNK is?  It seems like it simplifies some
> of the cases.  Does iSeries guarantee that the kernel's memory starts at
> 0 while pSeries doesn't?

A "Main Store chunk", I would think.

With iSeries, the hypervisor gives you memory in 256kB chunks (I
think), which can be scattered throughout physical memory.  The
CONFIG_MSCHUNKS stuff is a mapping layer that is there to give the
illusion to the main part of the kernel that physical memory starts at
0 and is contiguous, when in fact it's anything but.

An "LMB" is a "large memory block" AFAIK, and comes from the pSeries
side of things.
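
To make the mapping-layer idea concrete, here is a minimal user-space
sketch, not the kernel's actual CONFIG_MSCHUNKS code. The names
chunk_map, add_chunk and phys_to_abs are hypothetical, and the 256kB
chunk size is only the guess given above. It shows how a table of
scattered chunks can present a contiguous "physical" address space that
starts at 0.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes: the chunk size is only the guess from the mail. */
#define CHUNK_SHIFT 18UL                    /* 256kB chunks (assumption) */
#define CHUNK_SIZE  (1UL << CHUNK_SHIFT)
#define MAX_CHUNKS  8                       /* hypothetical table size */

/* chunk_map[n] holds the absolute address of the n-th chunk we were given. */
static unsigned long chunk_map[MAX_CHUNKS];
static size_t chunk_count;

/* Record the next chunk; it becomes the next slice of "physical" memory. */
static int add_chunk(unsigned long abs_base)
{
	if (chunk_count >= MAX_CHUNKS || (abs_base & (CHUNK_SIZE - 1)))
		return -1;      /* table full or base not chunk aligned */
	chunk_map[chunk_count++] = abs_base;
	return 0;
}

/* Translate a contiguous "physical" address into the real absolute one. */
static unsigned long phys_to_abs(unsigned long pa)
{
	unsigned long idx = pa >> CHUNK_SHIFT;

	assert(idx < chunk_count);
	return chunk_map[idx] | (pa & (CHUNK_SIZE - 1));
}
```

With chunks added at absolute 0x4000000 and then 0x1000000, physical
address 0x10 maps into the first chunk and CHUNK_SIZE + 0x20 into the
second, even though the second chunk sits lower in absolute memory.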

It does indeed seem that lmb_add_io is entirely unused.  Does anyone
have any good reason for keeping it?

Paul.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Wed Feb  4 13:35:51 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 4 Feb 2004 13:35:51 +1100
Subject: RTAS error logging of EEH errors
In-Reply-To: <20040203183210.A27780@forte.austin.ibm.com>
References: <20040201073627.GE22694@krispykreme> <1075737704.8079.27.camel@mudbug.austin.ibm.com> <20040203183210.A27780@forte.austin.ibm.com>
Message-ID: <20040204023551.GD22694@krispykreme>


> Please don't push this patch, It'll mess up my patch which I'm sending
> out in a few minutes.  Anton, if this still really bugs you, let me know,
> I'll fix it.

I think it's worth fixing; eventually we will be recovering from these
failures and we'll no doubt have forgotten about the stack usage.

Anton


From anton at samba.org  Wed Feb  4 15:18:21 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 4 Feb 2004 15:18:21 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <1075856864.1449.205.camel@nighthawk>
References: <1075856864.1449.205.camel@nighthawk>
Message-ID: <20040204041820.GF22694@krispykreme>


Hi,

> This function seems to be dead code now.  I found it during a horribly
> confusing encounter with lmb.[ch] and CONFIG_MSCHUNKS.

OK, you got me to look :)

The IO stuff was around from when we were using MSCHUNKS on pSeries. It
would make memory appear contiguous on machines with IO holes (POWER3
boxes). In the end it confused the hell out of Linux (from memory it was
the macros that wanted to know if a physical address was in an IO hole
or not).

I just went through and cleaned the lmb stuff up. It needs to be tested
on iSeries and pSeries, but I think it's a step in the right direction.
It certainly simplifies a lot of the code.

Thoughts?

Anton

--

- remove LMB_MEMORY_AREA, LMB_IO_AREA, we only allocate/reserve memory
  areas now
- remove lmb_property->type, lmb_region->iosize, lmb_region->lcd_size,
  no longer used
- bump number of regions to 128, we'll hit this limit sooner or later
  with our big boxes (if we have more than 64 PCI host bridges the
  reserved array will fill up for example)
- make all the lmb stuff __init
- no need to explicitly zero struct lmb lmb now that we zero the BSS early
- we had two functions to dump the lmb array, kill one of them
- move the inline functions into lmb.c, they are only ever called from
  there

===== include/asm-ppc64/lmb.h 1.6 vs edited =====
--- 1.6/include/asm-ppc64/lmb.h	Fri Sep 13 21:24:37 2002
+++ edited/include/asm-ppc64/lmb.h	Wed Feb  4 15:10:52 2004
@@ -13,36 +13,29 @@
  * 2 of the License, or (at your option) any later version.
  */

-#include <linux/config.h>
+#include <linux/init.h>
 #include <asm/prom.h>

 extern unsigned long reloc_offset(void);

-#define MAX_LMB_REGIONS 64
+#define MAX_LMB_REGIONS 128

 union lmb_reg_property {
 	struct reg_property32 addr32[MAX_LMB_REGIONS];
 	struct reg_property64 addr64[MAX_LMB_REGIONS];
 };

-#define LMB_MEMORY_AREA	1
-#define LMB_IO_AREA	2
-
 #define LMB_ALLOC_ANYWHERE	0
-#define LMB_ALLOC_FIRST4GBYTE	(1UL<<32)

 struct lmb_property {
 	unsigned long base;
 	unsigned long physbase;
 	unsigned long size;
-	unsigned long type;
 };

 struct lmb_region {
 	unsigned long cnt;
 	unsigned long size;
-	unsigned long iosize;
-	unsigned long lcd_size;		/* Least Common Denominator */
 	struct lmb_property region[MAX_LMB_REGIONS+1];
 };

@@ -53,63 +46,17 @@
 	struct lmb_region reserved;
 };

-extern struct lmb lmb;
-
-extern void lmb_init(void);
-extern void lmb_analyze(void);
-extern long lmb_add(unsigned long, unsigned long);
-#ifdef CONFIG_MSCHUNKS
-extern long lmb_add_io(unsigned long base, unsigned long size);
-#endif /* CONFIG_MSCHUNKS */
-extern long lmb_reserve(unsigned long, unsigned long);
-extern unsigned long lmb_alloc(unsigned long, unsigned long);
-extern unsigned long lmb_alloc_base(unsigned long, unsigned long, unsigned long);
-extern unsigned long lmb_phys_mem_size(void);
-extern unsigned long lmb_end_of_DRAM(void);
-extern unsigned long lmb_abs_to_phys(unsigned long);
-extern void lmb_dump(char *);
-
-static inline unsigned long
-lmb_addrs_overlap(unsigned long base1, unsigned long size1,
-                  unsigned long base2, unsigned long size2)
-{
-        return ((base1 < (base2+size2)) && (base2 < (base1+size1)));
-}
-
-static inline long
-lmb_regions_overlap(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
-{
-	unsigned long base1 = rgn->region[r1].base;
-        unsigned long size1 = rgn->region[r1].size;
-	unsigned long base2 = rgn->region[r2].base;
-        unsigned long size2 = rgn->region[r2].size;
-
-	return lmb_addrs_overlap(base1,size1,base2,size2);
-}
-
-static inline long
-lmb_addrs_adjacent(unsigned long base1, unsigned long size1,
-		   unsigned long base2, unsigned long size2)
-{
-	if ( base2 == base1 + size1 ) {
-		return 1;
-	} else if ( base1 == base2 + size2 ) {
-		return -1;
-	}
-	return 0;
-}
-
-static inline long
-lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
-{
-	unsigned long base1 = rgn->region[r1].base;
-        unsigned long size1 = rgn->region[r1].size;
-        unsigned long type1 = rgn->region[r1].type;
-	unsigned long base2 = rgn->region[r2].base;
-        unsigned long size2 = rgn->region[r2].size;
-        unsigned long type2 = rgn->region[r2].type;
+extern struct lmb lmb __initdata;

-	return (type1 == type2) && lmb_addrs_adjacent(base1,size1,base2,size2);
-}
+extern void __init lmb_init(void);
+extern void __init lmb_analyze(void);
+extern long __init lmb_add(unsigned long, unsigned long);
+extern long __init lmb_reserve(unsigned long, unsigned long);
+extern unsigned long __init lmb_alloc(unsigned long, unsigned long);
+extern unsigned long __init lmb_alloc_base(unsigned long, unsigned long,
+					   unsigned long);
+extern unsigned long __init lmb_phys_mem_size(void);
+extern unsigned long __init lmb_end_of_DRAM(void);
+extern unsigned long __init lmb_abs_to_phys(unsigned long);

 #endif /* _PPC64_LMB_H */
===== arch/ppc64/kernel/lmb.c 1.6 vs edited =====
--- 1.6/arch/ppc64/kernel/lmb.c	Tue Feb 25 20:38:45 2003
+++ edited/arch/ppc64/kernel/lmb.c	Wed Feb  4 15:10:44 2004
@@ -1,5 +1,4 @@
 /*
- *
  * Procedures for interfacing to Open Firmware.
  *
  * Peter Bergner, IBM Corp.	June 2001.
@@ -13,46 +12,63 @@

 #include <linux/config.h>
 #include <linux/kernel.h>
+#include <linux/init.h>
 #include <asm/types.h>
 #include <asm/page.h>
 #include <asm/prom.h>
 #include <asm/lmb.h>
 #include <asm/abs_addr.h>
 #include <asm/bitops.h>
-#include <asm/udbg.h>

-extern unsigned long klimit;
-extern unsigned long reloc_offset(void);
+struct lmb lmb __initdata;
+
+static unsigned long __init
+lmb_addrs_overlap(unsigned long base1, unsigned long size1,
+                  unsigned long base2, unsigned long size2)
+{
+	return ((base1 < (base2+size2)) && (base2 < (base1+size1)));
+}

+static long __init
+lmb_addrs_adjacent(unsigned long base1, unsigned long size1,
+		   unsigned long base2, unsigned long size2)
+{
+	if (base2 == base1 + size1)
+		return 1;
+	else if (base1 == base2 + size2)
+		return -1;

-static long lmb_add_region(struct lmb_region *, unsigned long, unsigned long, unsigned long);
+	return 0;
+}

-struct lmb lmb = {
-	0, 0,
-	{0,0,0,0,{{0,0,0}}},
-	{0,0,0,0,{{0,0,0}}}
-};
+static long __init
+lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
+{
+	unsigned long base1 = rgn->region[r1].base;
+	unsigned long size1 = rgn->region[r1].size;
+	unsigned long base2 = rgn->region[r2].base;
+	unsigned long size2 = rgn->region[r2].size;

+	return lmb_addrs_adjacent(base1, size1, base2, size2);
+}

 /* Assumption: base addr of region 1 < base addr of region 2 */
-static void
+static void __init
 lmb_coalesce_regions(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
 {
 	unsigned long i;

 	rgn->region[r1].size += rgn->region[r2].size;
-	for (i=r2; i < rgn->cnt-1 ;i++) {
+	for (i=r2; i < rgn->cnt-1; i++) {
 		rgn->region[i].base = rgn->region[i+1].base;
 		rgn->region[i].physbase = rgn->region[i+1].physbase;
 		rgn->region[i].size = rgn->region[i+1].size;
-		rgn->region[i].type = rgn->region[i+1].type;
 	}
 	rgn->cnt--;
 }

-
 /* This routine called with relocation disabled. */
-void
+void __init
 lmb_init(void)
 {
 	unsigned long offset = reloc_offset();
@@ -63,13 +79,11 @@
 	 */
 	_lmb->memory.region[0].base = 0;
 	_lmb->memory.region[0].size = 0;
-	_lmb->memory.region[0].type = LMB_MEMORY_AREA;
 	_lmb->memory.cnt = 1;

 	/* Ditto. */
 	_lmb->reserved.region[0].base = 0;
 	_lmb->reserved.region[0].size = 0;
-	_lmb->reserved.region[0].type = LMB_MEMORY_AREA;
 	_lmb->reserved.cnt = 1;
 }

@@ -89,12 +103,11 @@
 }

 /* This routine called with relocation disabled. */
-void
+void __init
 lmb_analyze(void)
 {
 	unsigned long i;
 	unsigned long mem_size = 0;
-	unsigned long io_size = 0;
 	unsigned long size_mask = 0;
 	unsigned long offset = reloc_offset();
 	struct lmb *_lmb = PTRRELOC(&lmb);
@@ -102,13 +115,9 @@
 	unsigned long physbase = 0;
 #endif

-	for (i=0; i < _lmb->memory.cnt ;i++) {
-		unsigned long lmb_type = _lmb->memory.region[i].type;
+	for (i=0; i < _lmb->memory.cnt; i++) {
 		unsigned long lmb_size;

-		if ( lmb_type != LMB_MEMORY_AREA )
-			continue;
-
 		lmb_size = _lmb->memory.region[i].size;

 #ifdef CONFIG_MSCHUNKS
@@ -121,84 +130,20 @@
 		size_mask |= lmb_size;
 	}

-#ifdef CONFIG_MSCHUNKS
-	for (i=0; i < _lmb->memory.cnt ;i++) {
-		unsigned long lmb_type = _lmb->memory.region[i].type;
-		unsigned long lmb_size;
-
-		if ( lmb_type != LMB_IO_AREA )
-			continue;
-
-		lmb_size = _lmb->memory.region[i].size;
-
-		_lmb->memory.region[i].physbase = physbase;
-		physbase += lmb_size;
-		io_size += lmb_size;
-		size_mask |= lmb_size;
-	}
-#endif /* CONFIG_MSCHUNKS */
-
 	_lmb->memory.size = mem_size;
-	_lmb->memory.iosize = io_size;
-	_lmb->memory.lcd_size = (1UL << cnt_trailing_zeros(size_mask));
-}
-
-/* This routine called with relocation disabled. */
-long
-lmb_add(unsigned long base, unsigned long size)
-{
-	unsigned long offset = reloc_offset();
-	struct lmb *_lmb = PTRRELOC(&lmb);
-	struct lmb_region *_rgn = &(_lmb->memory);
-
-	/* On pSeries LPAR systems, the first LMB is our RMO region. */
-	if ( base == 0 )
-		_lmb->rmo_size = size;
-
-	return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA);
-
 }

-#ifdef CONFIG_MSCHUNKS
 /* This routine called with relocation disabled. */
-long
-lmb_add_io(unsigned long base, unsigned long size)
-{
-	unsigned long offset = reloc_offset();
-	struct lmb *_lmb = PTRRELOC(&lmb);
-	struct lmb_region *_rgn = &(_lmb->memory);
-
-	return lmb_add_region(_rgn, base, size, LMB_IO_AREA);
-
-}
-#endif /* CONFIG_MSCHUNKS */
-
-long
-lmb_reserve(unsigned long base, unsigned long size)
-{
-	unsigned long offset = reloc_offset();
-	struct lmb *_lmb = PTRRELOC(&lmb);
-	struct lmb_region *_rgn = &(_lmb->reserved);
-
-	return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA);
-}
-
-/* This routine called with relocation disabled. */
-static long
-lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size,
-		unsigned long type)
+static long __init
+lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size)
 {
 	unsigned long i, coalesced = 0;
 	long adjacent;

 	/* First try and coalesce this LMB with another. */
-	for (i=0; i < rgn->cnt ;i++) {
+	for (i=0; i < rgn->cnt; i++) {
 		unsigned long rgnbase = rgn->region[i].base;
 		unsigned long rgnsize = rgn->region[i].size;
-		unsigned long rgntype = rgn->region[i].type;
-
-		if ( rgntype != type )
-			continue;

 		adjacent = lmb_addrs_adjacent(base,size,rgnbase,rgnsize);
 		if ( adjacent > 0 ) {
@@ -227,17 +172,15 @@
 	}

 	/* Couldn't coalesce the LMB, so add it to the sorted table. */
-	for (i=rgn->cnt-1; i >= 0 ;i--) {
+	for (i=rgn->cnt-1; i >= 0; i--) {
 		if (base < rgn->region[i].base) {
 			rgn->region[i+1].base = rgn->region[i].base;
 			rgn->region[i+1].physbase = rgn->region[i].physbase;
 			rgn->region[i+1].size = rgn->region[i].size;
-			rgn->region[i+1].type = rgn->region[i].type;
 		}  else {
 			rgn->region[i+1].base = base;
 			rgn->region[i+1].physbase = lmb_abs_to_phys(base);
 			rgn->region[i+1].size = size;
-			rgn->region[i+1].type = type;
 			break;
 		}
 	}
@@ -246,12 +189,38 @@
 	return 0;
 }

-long
+/* This routine called with relocation disabled. */
+long __init
+lmb_add(unsigned long base, unsigned long size)
+{
+	unsigned long offset = reloc_offset();
+	struct lmb *_lmb = PTRRELOC(&lmb);
+	struct lmb_region *_rgn = &(_lmb->memory);
+
+	/* On pSeries LPAR systems, the first LMB is our RMO region. */
+	if ( base == 0 )
+		_lmb->rmo_size = size;
+
+	return lmb_add_region(_rgn, base, size);
+
+}
+
+long __init
+lmb_reserve(unsigned long base, unsigned long size)
+{
+	unsigned long offset = reloc_offset();
+	struct lmb *_lmb = PTRRELOC(&lmb);
+	struct lmb_region *_rgn = &(_lmb->reserved);
+
+	return lmb_add_region(_rgn, base, size);
+}
+
+long __init
 lmb_overlaps_region(struct lmb_region *rgn, unsigned long base, unsigned long size)
 {
 	unsigned long i;

-	for (i=0; i < rgn->cnt ;i++) {
+	for (i=0; i < rgn->cnt; i++) {
 		unsigned long rgnbase = rgn->region[i].base;
 		unsigned long rgnsize = rgn->region[i].size;
 		if ( lmb_addrs_overlap(base,size,rgnbase,rgnsize) ) {
@@ -262,13 +231,13 @@
 	return (i < rgn->cnt) ? i : -1;
 }

-unsigned long
+unsigned long __init
 lmb_alloc(unsigned long size, unsigned long align)
 {
 	return lmb_alloc_base(size, align, LMB_ALLOC_ANYWHERE);
 }

-unsigned long
+unsigned long __init
 lmb_alloc_base(unsigned long size, unsigned long align, unsigned long max_addr)
 {
 	long i, j;
@@ -278,13 +247,9 @@
 	struct lmb_region *_mem = &(_lmb->memory);
 	struct lmb_region *_rsv = &(_lmb->reserved);

-	for (i=_mem->cnt-1; i >= 0 ;i--) {
+	for (i=_mem->cnt-1; i >= 0; i--) {
 		unsigned long lmbbase = _mem->region[i].base;
 		unsigned long lmbsize = _mem->region[i].size;
-		unsigned long lmbtype = _mem->region[i].type;
-
-		if ( lmbtype != LMB_MEMORY_AREA )
-			continue;

 		if ( max_addr == LMB_ALLOC_ANYWHERE )
 			base = _ALIGN_DOWN(lmbbase+lmbsize-size, align);
@@ -305,12 +270,12 @@
 	if ( i < 0 )
 		return 0;

-	lmb_add_region(_rsv, base, size, LMB_MEMORY_AREA);
+	lmb_add_region(_rsv, base, size);

 	return base;
 }

-unsigned long
+unsigned long __init
 lmb_phys_mem_size(void)
 {
 	unsigned long offset = reloc_offset();
@@ -327,7 +292,7 @@
 #endif /* CONFIG_MSCHUNKS */
 }

-unsigned long
+unsigned long __init
 lmb_end_of_DRAM(void)
 {
 	unsigned long offset = reloc_offset();
@@ -335,9 +300,7 @@
 	struct lmb_region *_mem = &(_lmb->memory);
 	unsigned long idx;

-	for(idx=_mem->cnt-1; idx >= 0 ;idx--) {
-		if ( _mem->region[idx].type != LMB_MEMORY_AREA )
-			continue;
+	for(idx=_mem->cnt-1; idx >= 0; idx--) {
 #ifdef CONFIG_MSCHUNKS
 		return (_mem->region[idx].physbase + _mem->region[idx].size);
 #else
@@ -348,8 +311,7 @@
 	return 0;
 }

-
-unsigned long
+unsigned long __init
 lmb_abs_to_phys(unsigned long aa)
 {
 	unsigned long i, pa = aa;
@@ -357,7 +319,7 @@
 	struct lmb *_lmb = PTRRELOC(&lmb);
 	struct lmb_region *_mem = &(_lmb->memory);

-	for (i=0; i < _mem->cnt ;i++) {
+	for (i=0; i < _mem->cnt; i++) {
 		unsigned long lmbbase = _mem->region[i].base;
 		unsigned long lmbsize = _mem->region[i].size;
 		if ( lmb_addrs_overlap(aa,1,lmbbase,lmbsize) ) {
@@ -367,48 +329,4 @@
 	}

 	return pa;
-}
-
-void
-lmb_dump(char *str)
-{
-	unsigned long i;
-
-	udbg_printf("\nlmb_dump: %s\n", str);
-	udbg_printf("    debug                       = %s\n",
-		(lmb.debug) ? "TRUE" : "FALSE");
-	udbg_printf("    memory.cnt                  = %d\n",
-		lmb.memory.cnt);
-	udbg_printf("    memory.size                 = 0x%lx\n",
-		lmb.memory.size);
-	udbg_printf("    memory.lcd_size             = 0x%lx\n",
-		lmb.memory.lcd_size);
-	for (i=0; i < lmb.memory.cnt ;i++) {
-		udbg_printf("    memory.region[%d].base       = 0x%lx\n",
-			i, lmb.memory.region[i].base);
-		udbg_printf("                      .physbase = 0x%lx\n",
-			lmb.memory.region[i].physbase);
-		udbg_printf("                      .size     = 0x%lx\n",
-			lmb.memory.region[i].size);
-		udbg_printf("                      .type     = 0x%lx\n",
-			lmb.memory.region[i].type);
-	}
-
-	udbg_printf("\n");
-	udbg_printf("    reserved.cnt                = %d\n",
-		lmb.reserved.cnt);
-	udbg_printf("    reserved.size               = 0x%lx\n",
-		lmb.reserved.size);
-	udbg_printf("    reserved.lcd_size           = 0x%lx\n",
-		lmb.reserved.lcd_size);
-	for (i=0; i < lmb.reserved.cnt ;i++) {
-		udbg_printf("    reserved.region[%d].base     = 0x%lx\n",
-			i, lmb.reserved.region[i].base);
-		udbg_printf("                      .physbase = 0x%lx\n",
-			lmb.reserved.region[i].physbase);
-		udbg_printf("                      .size     = 0x%lx\n",
-			lmb.reserved.region[i].size);
-		udbg_printf("                      .type     = 0x%lx\n",
-			lmb.reserved.region[i].type);
-	}
 }
===== arch/ppc64/kernel/prom.c 1.52 vs edited =====
--- 1.52/arch/ppc64/kernel/prom.c	Sun Feb  1 13:40:21 2004
+++ edited/arch/ppc64/kernel/prom.c	Wed Feb  4 14:42:19 2004
@@ -668,9 +668,6 @@
         prom_print(RELOC("    memory.size                 = 0x"));
         prom_print_hex(_lmb->memory.size);
 	prom_print_nl();
-        prom_print(RELOC("    memory.lcd_size             = 0x"));
-        prom_print_hex(_lmb->memory.lcd_size);
-	prom_print_nl();
         for (i=0; i < _lmb->memory.cnt ;i++) {
                 prom_print(RELOC("    memory.region[0x"));
 		prom_print_hex(i);
@@ -683,9 +680,6 @@
                 prom_print(RELOC("                      .size     = 0x"));
                 prom_print_hex(_lmb->memory.region[i].size);
 		prom_print_nl();
-                prom_print(RELOC("                      .type     = 0x"));
-                prom_print_hex(_lmb->memory.region[i].type);
-		prom_print_nl();
         }

 	prom_print_nl();
@@ -695,9 +689,6 @@
         prom_print(RELOC("    reserved.size                 = 0x"));
         prom_print_hex(_lmb->reserved.size);
 	prom_print_nl();
-        prom_print(RELOC("    reserved.lcd_size             = 0x"));
-        prom_print_hex(_lmb->reserved.lcd_size);
-	prom_print_nl();
         for (i=0; i < _lmb->reserved.cnt ;i++) {
                 prom_print(RELOC("    reserved.region[0x"));
 		prom_print_hex(i);
@@ -709,9 +700,6 @@
 		prom_print_nl();
                 prom_print(RELOC("                      .size     = 0x"));
                 prom_print_hex(_lmb->reserved.region[i].size);
-		prom_print_nl();
-                prom_print(RELOC("                      .type     = 0x"));
-                prom_print_hex(_lmb->reserved.region[i].type);
 		prom_print_nl();
         }
 }
===== arch/ppc64/mm/numa.c 1.16 vs edited =====
--- 1.16/arch/ppc64/mm/numa.c	Tue Jan 20 13:07:09 2004
+++ edited/arch/ppc64/mm/numa.c	Wed Feb  4 15:12:02 2004
@@ -257,10 +257,6 @@

 		for (i = 0; i < lmb.memory.cnt; i++) {
 			unsigned long physbase, size;
-			unsigned long type = lmb.memory.region[i].type;
-
-			if (type != LMB_MEMORY_AREA)
-				continue;

 			physbase = lmb.memory.region[i].physbase;
 			size = lmb.memory.region[i].size;
===== arch/ppc64/mm/init.c 1.55 vs edited =====
--- 1.55/arch/ppc64/mm/init.c	Tue Jan 20 13:07:09 2004
+++ edited/arch/ppc64/mm/init.c	Wed Feb  4 15:11:54 2004
@@ -702,10 +702,6 @@
 	/* add all physical memory to the bootmem map */
 	for (i=0; i < lmb.memory.cnt; i++) {
 		unsigned long physbase, size;
-		unsigned long type = lmb.memory.region[i].type;
-
-		if ( type != LMB_MEMORY_AREA )
-			continue;

 		physbase = lmb.memory.region[i].physbase;
 		size = lmb.memory.region[i].size;
@@ -746,11 +742,7 @@

 	for (i=0; i < lmb.memory.cnt; i++) {
 		unsigned long physbase, size;
-		unsigned long type = lmb.memory.region[i].type;
 		struct kcore_list *kcore_mem;
-
-		if (type != LMB_MEMORY_AREA)
-			continue;

 		physbase = lmb.memory.region[i].physbase;
 		size = lmb.memory.region[i].size;


From olof at austin.ibm.com  Wed Feb  4 15:18:37 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Tue, 3 Feb 2004 22:18:37 -0600 (CST)
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <401FE067.80902@redhat.com>
Message-ID: <Pine.A41.4.44.0402032206230.85750-200000@forte.austin.ibm.com>

On Tue, 3 Feb 2004, Julie DeWandel wrote:

> Hi Olof,
>
> The patch wasn't attached to your email so I included it along with
> comments below (my comments preceded by "JSD:").

Ugh. This is the second time I've forgotten to attach a patch. Not sure
what is going on here...

Thanks for taking the time to review! My comments follow each remark
below, and a new patch is attached (this time for real).


-Olof


+static inline void pSeries_unlock_hpte(HPTE *hptep)
+{
+	unsigned long *word = &hptep->dw0.dword0;
+
+	asm volatile("lwsync":::"memory");
+	clear_bit(HPTE_LOCK_BIT, word);
+}

JSD: Other places within the kernel do an smp_mb__before_clear_bit() when
JSD: clearing a bit representing a lock. Would that be a better choice here
JSD: (it resolves to a sync)?

I just copied the 2.6 code here. lwsync is cheaper on Power4 and
beyond, and on Power3/RS64 it equals a sync. The cheapest sufficient
synchronization is always to be preferred.
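
As an illustration of the lock/unlock pairing discussed here, a minimal
user-space sketch using C11 atomics rather than the kernel's bitops.
The names bit_lock and bit_unlock and the mask value are assumptions
made for the example. The release ordering on the unlock plays the role
of the lwsync before clear_bit(); on POWER, compilers typically
implement a release with lwsync.

```c
#include <assert.h>
#include <stdatomic.h>

#define LOCK_BIT  3UL                 /* same index as HPTE_LOCK_BIT above */
#define LOCK_MASK (1UL << LOCK_BIT)   /* assumed mask for this sketch */

/* Take the lock bit, spinning read-only while somebody else holds it. */
static void bit_lock(_Atomic unsigned long *word)
{
	for (;;) {
		/* Acquire: later accesses cannot move above the lock. */
		if (!(atomic_fetch_or_explicit(word, LOCK_MASK,
					       memory_order_acquire) & LOCK_MASK))
			return;
		while (atomic_load_explicit(word, memory_order_relaxed) & LOCK_MASK)
			;       /* wait until the bit looks free, then retry */
	}
}

/* Drop the lock bit; earlier stores become visible before the clear. */
static void bit_unlock(_Atomic unsigned long *word)
{
	assert(atomic_load_explicit(word, memory_order_relaxed) & LOCK_MASK);
	atomic_fetch_and_explicit(word, ~LOCK_MASK, memory_order_release);
}
```

The other bits of the word are left untouched by both operations, which
is what lets the lock live inside the first doubleword of the entry it
protects.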

JSD: I should have added that the clearing of the _PAGE_BUSY bit should be
JSD: done using an atomic op (clear_bit() routine)

Not needed, since no one is modifying the contents of the word if the
bit is set, i.e. it is similar to spin_unlock().

JSD: The inline assembly code will not set _PAGE_BUSY if it finds that
JSD: the user's access rights don't allow access to the page.
JSD: So, if access_ok = 0, _PAGE_BUSY is not set, and the code may
JSD: branch to out_unlock where _PAGE_BUSY is unconditionally cleared.
JSD: It could be possible for another processor to coincidentally have
JSD: set _PAGE_BUSY in the PTE but then have this processor clear it
JSD: before the other processor wanted it clear. Do you agree this can
JSD: happen?

Yes, it's a small window but it might happen. See below.

JSD: Furthermore, the "ea" condition check (above) might yield false and
JSD: the code then continue as though access_ok were true, modifying the
JSD: PTE without the _PAGE_BUSY bit set. Seems bad.

Right. The proper thing to do is to set _PAGE_BUSY always, and clear it
if access is denied. Fixed.
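
A rough C11 rendering of that "always set busy, then check access"
sequence, for illustration only. The patch does this with ldarx/stdcx.
in inline assembly; the compare-and-swap loop, the function names and
the PAGE_BUSY value below are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>

#define PAGE_BUSY 0x0800UL   /* hypothetical bit value for this sketch */

/*
 * Wait until the PTE is not busy, set the busy bit atomically, and only
 * then check the access rights.  Returns 1 if access is allowed.  The
 * busy bit is set on return in both cases, so the caller always clears it.
 */
static int pte_lock_and_check(_Atomic unsigned long *ptep,
			      unsigned long access, unsigned long *old)
{
	unsigned long cur = atomic_load_explicit(ptep, memory_order_relaxed);

	for (;;) {
		if (cur & PAGE_BUSY) {          /* someone else owns it */
			cur = atomic_load_explicit(ptep, memory_order_relaxed);
			continue;
		}
		if (atomic_compare_exchange_weak_explicit(ptep, &cur,
				cur | PAGE_BUSY,
				memory_order_acquire, memory_order_relaxed))
			break;                  /* we set the busy bit */
	}
	*old = cur | PAGE_BUSY;         /* as in the asm, old value has busy set */
	return (access & ~cur) == 0;
}

/* Clear the busy bit; valid on every exit path since it is always set. */
static void pte_unlock(_Atomic unsigned long *ptep)
{
	assert(atomic_load_explicit(ptep, memory_order_relaxed) & PAGE_BUSY);
	atomic_fetch_and_explicit(ptep, ~PAGE_BUSY, memory_order_release);
}
```

Because the bit is set before the rights are examined, the
unconditional clear on the error path can no longer wipe out a busy bit
that belongs to another processor.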

JSD: NOTE: at this point, the *ptep = new_pte just unlocked the PTE by
JSD: clearing _PAGE_BUSY. The code then goes on to clear it again, thereby
JSD: possibly rendering unsafe updates that some other processor might be
JSD: doing after it thought it had set _PAGE_BUSY. There is a small window
JSD: here.

Fixed as well. There's only a window at the second update of *ptep;
in the first one _PAGE_BUSY is still set.

-       spin_unlock(&hash_table_lock[lock_slot].lock);
+out_unlock:
+       smp_wmb();

JSD: I believe an smp_mb__before_clear_bit() would be a better choice, or
JSD: at least an lwsync because this is an unlock.

I'm not sure we really need any synchronization at all, since the only
memory that's protected by the _PAGE_BUSY bit is the word itself, so
the lock and contents change will be seen at the same time (or at least
in the right order).

smp_wmb() is cheaper (eieio) than smp_mb() (sync), so either the
smp_wmb() should stay or it should be taken out altogether. I'd rather
err on the side of caution, so I'll keep it.

JSD: In the above two hunks of code, a pte_update is done which returns
JSD: the old pte value. This value is checked to determine if the hpte
JSD: should be invalidated. However, there is no lock held between the
JSD: time the pte value is read and the time the hpte is invalidated.
JSD: The hpte_invalidate routine doesn't check to make sure the va
JSD: passed in is really the one being invalidated in the slot -- it just
JSD: assumes the slot, etc are enough to locate it. So we might be
JSD: invalidating the wrong thing here. Probably a don't care, but thought
JSD: I'd ask.

Yes, this is a general behaviour though, since entries are not updated
whenever they're thrown out of the HPTE. We always risk invalidating
an entry that's been reused. In the grand scheme of things it's not a
big problem, since it should happen fairly rarely, and when it happens the
invalidated entry will be faulted right back in. 2.6 is the same.

 	/* Invalidate the hpte. */
 	hptep->dw0.dword0 = 0;

JSD: It would really be nice to add a comment here that the above
JSD: assignment statement is also unlocking the hpte as well.

Done.

JSD: Two questions here. (1) Shouldn't interrupts be disabled for the
JSD: write_lock/unlock here regardless of what processor we are running
JSD: on? (2) I don't see how this code is preventing another processor
JSD: from grabbing the read_lock immediately after this processor has
JSD: checked to make sure it isn't held.

JSD: Better question for (1) is why are interrupts being disabled here?
JSD: Can this routine be called from interrupt context?

Without disabling interrupts, there's a risk of deadlock if the
processor gets interrupted and the interrupt handler causes a page fault
that needs to be resolved. Since the lock is held for writing, the handler
will wait forever when locking for reading. This is actually similar to
the original deadlock that this whole patch is meant to remove, but the
window is really small (just a few instructions) now. An interrupt on a
different processor, by contrast, is not a problem, since forward progress
is still guaranteed on the processor holding it for writing, so the reader
will eventually get the lock.

And on the second question: this is the trick with using the rwlock.
There's no need to _prevent_ reading; all I needed to know is that all
readers that started before pte_free_sync() have completed.

No one can get a reference to a PTE after it's been freed, so the only
risk is if someone has walked the table right during pte_free() and is
still holding a reference to it.  hash_page will hold the lock for
reading while this takes place, so all we need to know is that we
_could_ take the lock for writing (i.e. no readers for the table). Even
if another CPU comes in and traverses the tree we're safe since there's
no way they can end up in that PTE (since it's been removed from the
tree).

This is all an ad-hoc solution since there's no RCU in 2.4, so I needed
another light-weight synchronization method.
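
The "wait until all earlier readers are gone" idea can be sketched in
user space as follows. This is a simplification under stated
assumptions: it uses a per-CPU reader counter with C11 atomics instead
of the per-CPU rwlock in the patch, and NR_CPUS, walk_begin, walk_end,
walkers_quiesced and free_sync are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

#define NR_CPUS 4   /* hypothetical CPU count for the sketch */

/* One counter per CPU: nonzero while that CPU is walking the page table. */
static _Atomic int readers[NR_CPUS];

/* Reader side, standing in for read_lock()/read_unlock() in hash_page. */
static void walk_begin(int cpu)
{
	atomic_fetch_add(&readers[cpu], 1);
}

static void walk_end(int cpu)
{
	int prev = atomic_fetch_sub(&readers[cpu], 1);

	assert(prev > 0);       /* must pair with walk_begin() */
}

/* Returns 1 if no CPU is in the middle of a walk right now. */
static int walkers_quiesced(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (atomic_load(&readers[cpu]) != 0)
			return 0;
	return 1;
}

/*
 * Call after the PTE page has been unlinked from the tree.  New walkers
 * cannot find the page, so once every CPU has been seen idle at least
 * once, no stale reference can remain and the page may be reused.
 */
static void free_sync(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		while (atomic_load(&readers[cpu]) != 0)
			;       /* wait for walkers that started earlier */
}
```

As with the rwlock version, nothing stops new readers from starting
while free_sync() runs; the only requirement is that each CPU was
observed with no walk in progress after the page was unlinked.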

JSD: Since the pte_freelist_cur is a per-processor structure, I don't
JSD: think you need the batch->lock at all. What other thing could be
JSD: running on this same processor at the same time?

I thought there was a risk that pte_free() could be called from interrupt
context through pte_alloc(), but on closer examination it seems
like it shouldn't happen. If so, the locks can come out.

JSD: Why were these spinlocks changed to _irq? I noticed this change was
JSD: not present in the 2.6 code.

No, it's my mistake. It was part of the workaround patch that for some reason
got left in the patch I attached to the bug. They were not there in the patch
that I was supposed to have posted to the list.

JSD: Same question about the _irq addition. Doesn't seem necessary. I'm
JSD: probably missing something -- please explain.

Same answer as above; dirty patch.

JSD: Why was the UL (unsigned long) dropped from the bit definitions?

Sync with 2.6. At one time I used the definitions in assembly, and the UL
syntax is illegal there. It's no longer needed.

-------------- next part --------------
===== arch/ppc64/kernel/htab.c 1.11 vs edited =====
--- 1.11/arch/ppc64/kernel/htab.c	Thu Dec 18 16:13:25 2003
+++ edited/arch/ppc64/kernel/htab.c	Tue Feb  3 22:01:59 2004
@@ -48,6 +48,29 @@
 #include <asm/iSeries/HvCallHpt.h>
 #include <asm/cputable.h>
 
+#define HPTE_LOCK_BIT 3
+
+static inline void pSeries_lock_hpte(HPTE *hptep)
+{
+	unsigned long *word = &hptep->dw0.dword0;
+
+	while (1) {
+		if (!test_and_set_bit(HPTE_LOCK_BIT, word))
+			break;
+		while(test_bit(HPTE_LOCK_BIT, word))
+			cpu_relax();
+	}
+}
+
+static inline void pSeries_unlock_hpte(HPTE *hptep)
+{
+	unsigned long *word = &hptep->dw0.dword0;
+
+	asm volatile("lwsync":::"memory");
+	clear_bit(HPTE_LOCK_BIT, word);
+}
+
+
 /*
  * Note:  pte   --> Linux PTE
  *        HPTE  --> PowerPC Hashed Page Table Entry
@@ -64,6 +87,7 @@
 
 extern unsigned long _SDR1;
 extern unsigned long klimit;
+extern rwlock_t pte_hash_lock[] __cacheline_aligned_in_smp;
 
 void make_pte(HPTE *htab, unsigned long va, unsigned long pa,
 	      int mode, unsigned long hash_mask, int large);
@@ -320,51 +344,73 @@
 	unsigned long va, vpn;
 	unsigned long newpp, prpn;
 	unsigned long hpteflags, lock_slot;
+	unsigned long access_ok, tmp;
 	long slot;
 	pte_t old_pte, new_pte;
+	int ret = 0;
 
 	/* Search the Linux page table for a match with va */
 	va = (vsid << 28) | (ea & 0x0fffffff);
 	vpn = va >> PAGE_SHIFT;
 	lock_slot = get_lock_slot(vpn); 
 
-	/* Acquire the hash table lock to guarantee that the linux
-	 * pte we fetch will not change
+	/* 
+	 * Check the user's access rights to the page.  If access should be
+	 * prevented then send the problem up to do_page_fault.
 	 */
-	spin_lock(&hash_table_lock[lock_slot].lock);
-	
+
 	/* 
 	 * Check the user's access rights to the page.  If access should be
 	 * prevented then send the problem up to do_page_fault.
 	 */
-#ifdef CONFIG_SHARED_MEMORY_ADDRESSING
+
 	access |= _PAGE_PRESENT;
-	if (unlikely(access & ~(pte_val(*ptep)))) {
+
+	/* We'll do access checking and _PAGE_BUSY setting in assembly, since
+	 * it needs to be atomic. 
+	 */
+
+	__asm__ __volatile__ ("\n
+	1:	ldarx	%0,0,%3\n
+		# Check if PTE is busy\n
+		andi.	%1,%0,%4\n
+		bne-	1b\n
+		ori	%0,%0,%4\n
+		# Write the linux PTE atomically (setting busy)\n
+		stdcx.	%0,0,%3\n
+		bne-	1b\n
+		# Check access rights (access & ~(pte_val(*ptep)))\n
+		andc.	%1,%2,%0\n
+		bne-	2f\n
+		li      %1,1\n
+		b	3f\n
+	2:      li      %1,0\n
+	3:"
+	: "=r" (old_pte), "=r" (access_ok)
+	: "r" (access), "r" (ptep), "i" (_PAGE_BUSY)
+        : "cr0", "memory");
+
+#ifdef CONFIG_SHARED_MEMORY_ADDRESSING
+	if (unlikely(!access_ok)) {
 		if(!(((ea >> SMALLOC_EA_SHIFT) == 
 		      (SMALLOC_START >> SMALLOC_EA_SHIFT)) &&
 		     ((current->thread.flags) & PPC_FLAG_SHARED))) {
-			spin_unlock(&hash_table_lock[lock_slot].lock);
-			return 1;
+			ret = 1;
+			goto out_unlock;
 		}
 	}
 #else
-	access |= _PAGE_PRESENT;
-	if (unlikely(access & ~(pte_val(*ptep)))) {
-		spin_unlock(&hash_table_lock[lock_slot].lock);
-		return 1;
+	if (unlikely(!access_ok)) {
+		ret = 1;
+		goto out_unlock;
 	}
 #endif
 
 	/* 
-	 * We have found a pte (which was present).
-	 * The spinlocks prevent this status from changing
-	 * The hash_table_lock prevents the _PAGE_HASHPTE status
-	 * from changing (RPN, DIRTY and ACCESSED too)
-	 * The page_table_lock prevents the pte from being 
-	 * invalidated or modified
-	 */
-
-	/*
+	 * We have found a proper pte. The hash_table_lock protects
+	 * the pte from deallocation and the _PAGE_BUSY bit protects
+	 * the contents of the PTE from changing.
+	 *
 	 * At this point, we have a pte (old_pte) which can be used to build
 	 * or update an HPTE. There are 2 cases:
 	 *
@@ -385,7 +431,7 @@
 	else
 		pte_val(new_pte) |= _PAGE_ACCESSED;
 
-	newpp = computeHptePP(pte_val(new_pte));
+	newpp = computeHptePP(pte_val(new_pte) & ~_PAGE_BUSY);
 	
 	/* Check if pte already has an hpte (case 2) */
 	if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) {
@@ -400,12 +446,13 @@
 		slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP;
 		slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12;
 
-		/* XXX fix large pte flag */
+	        /* XXX fix large pte flag */
 		if (ppc_md.hpte_updatepp(slot, secondary, 
 					 newpp, va, 0) == -1) {
 			pte_val(old_pte) &= ~_PAGE_HPTEFLAGS;
 		} else {
 			if (!pte_same(old_pte, new_pte)) {
+				/* _PAGE_BUSY is still set in new_pte */
 				*ptep = new_pte;
 			}
 		}
@@ -425,12 +472,19 @@
 		pte_val(new_pte) |= ((slot<<12) & 
 				     (_PAGE_GROUP_IX | _PAGE_SECONDARY));
 
+		smp_wmb();
+		/* _PAGE_BUSY is not set in new_pte */
 		*ptep = new_pte;
+
+		return 0;
 	}
 
-	spin_unlock(&hash_table_lock[lock_slot].lock);
+out_unlock:
+	smp_wmb();
 
-	return 0;
+	pte_val(*ptep) &= ~_PAGE_BUSY;
+
+	return ret;
 }
 
 /*
@@ -497,11 +551,14 @@
 	pgdir = mm->pgd;
 	if (pgdir == NULL) return 1;
 
-	/*
-	 * Lock the Linux page table to prevent mmap and kswapd
-	 * from modifying entries while we search and update
+	/* The pte_hash_lock is used to block any PTE deallocations
+	 * while we walk the tree and use the entry. While technically
+	 * we both read and write the PTE entry while holding the read
+	 * lock, the _PAGE_BUSY bit will block pte_update()s to the
+	 * specific entry.
 	 */
-	spin_lock(&mm->page_table_lock);
+	
+	read_lock(&pte_hash_lock[smp_processor_id()]);
 
 	ptep = find_linux_pte(pgdir, ea);
 	/*
@@ -514,8 +571,7 @@
 		/* If no pte, send the problem up to do_page_fault */
 		ret = 1;
 	}
-
-	spin_unlock(&mm->page_table_lock);
+	read_unlock(&pte_hash_lock[smp_processor_id()]);
 
 	return ret;
 }
@@ -540,8 +596,6 @@
 	lock_slot = get_lock_slot(vpn); 
 	hash = hpt_hash(vpn, large);
 
-	spin_lock_irqsave(&hash_table_lock[lock_slot].lock, flags);
-
 	pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0));
 	secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15;
 	if (secondary) hash = ~hash;
@@ -551,8 +605,6 @@
 	if (pte_val(pte) & _PAGE_HASHPTE) {
 		ppc_md.hpte_invalidate(slot, secondary, va, large, local);
 	}
-
-	spin_unlock_irqrestore(&hash_table_lock[lock_slot].lock, flags);
 }
 
 long plpar_pte_enter(unsigned long flags,
@@ -787,6 +839,8 @@
 
 	avpn = vpn >> 11;
 
+	pSeries_lock_hpte(hptep);
+
 	dw0 = hptep->dw0.dw0;
 
 	/*
@@ -794,9 +848,13 @@
 	 * the AVPN, hash group, and valid bits.  By doing it this way,
 	 * it is common with the pSeries LPAR optimal path.
 	 */
-	if (dw0.bolted) return;
+	if (dw0.bolted) {
+		pSeries_unlock_hpte(hptep);
 
-	/* Invalidate the hpte. */
+		return;
+	}
+
+	/* Invalidate the hpte. This clears the lock as well. */
 	hptep->dw0.dword0 = 0;
 
 	/* Invalidate the tlb */
@@ -875,6 +933,8 @@
 
 	avpn = vpn >> 11;
 
+	pSeries_lock_hpte(hptep);
+
 	dw0 = hptep->dw0.dw0;
 	if ((dw0.avpn == avpn) && 
 	    (dw0.v) && (dw0.h == secondary)) {
@@ -900,10 +960,14 @@
 		hptep->dw0.dw0 = dw0;
 		
 		__asm__ __volatile__ ("ptesync" : : : "memory");
+
+		pSeries_unlock_hpte(hptep);
 		
 		return 0;
 	}
 
+	pSeries_unlock_hpte(hptep);
+
 	return -1;
 }
 
@@ -1062,9 +1126,11 @@
 		dw0 = hptep->dw0.dw0;
 		if (!dw0.v) {
 			/* retry with lock held */
+			pSeries_lock_hpte(hptep);
 			dw0 = hptep->dw0.dw0;
 			if (!dw0.v)
 				break;
+			pSeries_unlock_hpte(hptep);
 		}
 		hptep++;
 	}
@@ -1079,9 +1145,11 @@
 			dw0 = hptep->dw0.dw0;
 			if (!dw0.v) {
 				/* retry with lock held */
+				pSeries_lock_hpte(hptep);
 				dw0 = hptep->dw0.dw0;
 				if (!dw0.v)
 					break;
+				pSeries_unlock_hpte(hptep);
 			}
 			hptep++;
 		}
@@ -1304,9 +1372,11 @@
 
 		if (dw0.v && !dw0.bolted) {
 			/* retry with lock held */
+			pSeries_lock_hpte(hptep);
 			dw0 = hptep->dw0.dw0;
 			if (dw0.v && !dw0.bolted)
 				break;
+			pSeries_unlock_hpte(hptep);
 		}
 
 		slot_offset++;
===== arch/ppc64/mm/init.c 1.8 vs edited =====
--- 1.8/arch/ppc64/mm/init.c	Tue Jan  6 17:54:44 2004
+++ edited/arch/ppc64/mm/init.c	Tue Feb  3 21:49:59 2004
@@ -104,16 +104,94 @@
  */
 mmu_gather_t     mmu_gathers[NR_CPUS];
 
+/* PTE free batching structures. We need a lock since not all
+ * operations take place under page_table_lock. Keep it per-CPU
+ * to avoid bottlenecks.
+ */
+
+struct pte_freelist_batch ____cacheline_aligned pte_freelist_cur[NR_CPUS] __cacheline_aligned_in_smp;
+rwlock_t pte_hash_lock[NR_CPUS] __cacheline_aligned_in_smp = { [0 ... NR_CPUS-1] = RW_LOCK_UNLOCKED };
+
+unsigned long pte_freelist_forced_free;
+
+static inline void pte_free_sync(void)
+{
+	unsigned long flags;
+	int i;
+
+	/* All we need to know is that we can get the write lock if
+	 * we wanted to, i.e. that no hash_page()s are holding it for reading.
+	 * If none are reading, that means there's no currently executing
+	 * hash_page() that might be working on one of the PTE's that will
+	 * be deleted. Likewise, if there is a reader, we need to get the
+	 * write lock to know when it releases the lock.
+	 */
+
+	for (i = 0; i < smp_num_cpus; i++)
+		if (is_read_locked(&pte_hash_lock[i])) {
+			/* So we don't deadlock with a reader on current cpu */
+			if(i == smp_processor_id())
+				local_irq_save(flags);
+
+			write_lock(&pte_hash_lock[i]);
+			write_unlock(&pte_hash_lock[i]);
+
+			if(i == smp_processor_id())
+				local_irq_restore(flags);
+		}
+}
+
+
+/* This is only called when we are critically out of memory
+ * (and fail to get a page in pte_free_tlb).
+ */
+void pte_free_now(pte_t *pte)
+{
+	pte_freelist_forced_free++;
+
+	pte_free_sync();
+
+	pte_free_kernel(pte);
+}
+
+/* Deallocates the pte-free batch after synchronizing with readers of
+ * any page tables.
+ */
+void pte_free_batch(void **batch, int size)
+{
+	unsigned int i;
+
+	pte_free_sync();
+
+	for (i = 0; i < size; i++)
+		pte_free_kernel(batch[i]);
+
+	free_page((unsigned long)batch);
+}
+
+
 int do_check_pgt_cache(int low, int high)
 {
 	int freed = 0;
+	struct pte_freelist_batch *batch;
+
+	/* We use this function to push the current pte free batch to be
+	 * deallocated, since do_check_pgt_cache() is called at the end of each
+	 * free_one_pgd() and other parts of the VM rely on all PTEs being
+	 * properly freed upon return from that function.
+	 */
+
+	batch = &pte_freelist_cur[smp_processor_id()];
+
+	if(batch->entry) {
+		pte_free_batch(batch->entry, batch->index);
+		batch->entry = NULL;
+	}
 
 	if (pgtable_cache_size > high) {
 		do {
 			if (pgd_quicklist)
 				free_page((unsigned long)pgd_alloc_one_fast(0)), ++freed;
-			if (pmd_quicklist)
-				free_page((unsigned long)pmd_alloc_one_fast(0, 0)), ++freed;
 			if (pte_quicklist)
 				free_page((unsigned long)pte_alloc_one_fast(0, 0)), ++freed;
 		} while (pgtable_cache_size > low);
@@ -290,7 +368,9 @@
 void
 local_flush_tlb_mm(struct mm_struct *mm)
 {
-	spin_lock(&mm->page_table_lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mm->page_table_lock, flags);
 
 	if ( mm->map_count ) {
 		struct vm_area_struct *mp;
@@ -298,7 +378,7 @@
 			local_flush_tlb_range( mm, mp->vm_start, mp->vm_end );
 	}
 
-	spin_unlock(&mm->page_table_lock);
+	spin_unlock_irqrestore(&mm->page_table_lock, flags);
 }
 
 /*
===== include/asm-ppc64/pgalloc.h 1.2 vs edited =====
--- 1.2/include/asm-ppc64/pgalloc.h	Tue Apr  9 06:31:08 2002
+++ edited/include/asm-ppc64/pgalloc.h	Tue Feb  3 21:50:36 2004
@@ -15,7 +15,6 @@
 #define quicklists      get_paca()
 
 #define pgd_quicklist 		(quicklists->pgd_cache)
-#define pmd_quicklist 		(quicklists->pmd_cache)
 #define pte_quicklist 		(quicklists->pte_cache)
 #define pgtable_cache_size 	(quicklists->pgtable_cache_sz)
 
@@ -60,10 +59,10 @@
 static inline pmd_t*
 pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
 {
-	unsigned long *ret = (unsigned long *)pmd_quicklist;
+	unsigned long *ret = (unsigned long *)pte_quicklist;
 
 	if (ret != NULL) {
-		pmd_quicklist = (unsigned long *)(*ret);
+		pte_quicklist = (unsigned long *)(*ret);
 		ret[0] = 0;
 		--pgtable_cache_size;
 	}
@@ -80,14 +79,6 @@
 	return pmd;
 }
 
-static inline void
-pmd_free (pmd_t *pmd)
-{
-	*(unsigned long *)pmd = (unsigned long) pmd_quicklist;
-	pmd_quicklist = (unsigned long *) pmd;
-	++pgtable_cache_size;
-}
-
 #define pmd_populate(MM, PMD, PTE)	pmd_set(PMD, PTE)
 
 static inline pte_t*
@@ -115,12 +106,54 @@
 }
 
 static inline void
-pte_free (pte_t *pte)
+pte_free_kernel (pte_t *pte)
 {
 	*(unsigned long *)pte = (unsigned long) pte_quicklist;
 	pte_quicklist = (unsigned long *) pte;
 	++pgtable_cache_size;
 }
+
+
+/* Use the PTE functions for freeing PMDs as well, since the same
+ * problem with tree traversals applies. Since pmd pointers are always
+ * virtual, no need for a page_address() translation.
+ */
+ 
+#define pte_free(pte_page)      __pte_free(pte_page)
+#define pmd_free(pmd)           __pte_free(pmd)
+ 
+struct pte_freelist_batch
+{
+	unsigned int	index;
+	void	      **entry;
+};
+ 
+#define PTE_FREELIST_SIZE	(PAGE_SIZE / sizeof(void *))
+ 
+extern void pte_free_now(pte_t *pte);
+extern void pte_free_batch(void **batch, int size);
+extern struct ____cacheline_aligned pte_freelist_batch pte_freelist_cur[] __cacheline_aligned_in_smp;
+ 
+static inline void __pte_free(pte_t *pte)
+{
+	struct pte_freelist_batch *batchp = &pte_freelist_cur[smp_processor_id()];
+ 
+	if (batchp->entry == NULL) {
+		batchp->entry = (void **)__get_free_page(GFP_ATOMIC);
+		if (batchp->entry == NULL) {
+			pte_free_now(pte);
+			return;
+		}
+		batchp->index = 0;
+	}
+ 
+	batchp->entry[batchp->index++] = pte;
+	if (batchp->index == PTE_FREELIST_SIZE) {
+		pte_free_batch(batchp->entry, batchp->index);
+		batchp->entry = NULL;
+	}
+}
+
 
 extern int do_check_pgt_cache(int, int);
 
===== include/asm-ppc64/pgtable.h 1.7 vs edited =====
--- 1.7/include/asm-ppc64/pgtable.h	Mon Aug 25 23:47:52 2003
+++ edited/include/asm-ppc64/pgtable.h	Tue Feb  3 21:33:22 2004
@@ -88,22 +88,22 @@
  * Bits in a linux-style PTE.  These match the bits in the
  * (hardware-defined) PowerPC PTE as closely as possible.
  */
-#define _PAGE_PRESENT	0x001UL	/* software: pte contains a translation */
-#define _PAGE_USER	0x002UL	/* matches one of the PP bits */
-#define _PAGE_RW	0x004UL	/* software: user write access allowed */
-#define _PAGE_GUARDED	0x008UL
-#define _PAGE_COHERENT	0x010UL	/* M: enforce memory coherence (SMP systems) */
-#define _PAGE_NO_CACHE	0x020UL	/* I: cache inhibit */
-#define _PAGE_WRITETHRU	0x040UL	/* W: cache write-through */
-#define _PAGE_DIRTY	0x080UL	/* C: page changed */
-#define _PAGE_ACCESSED	0x100UL	/* R: page referenced */
-#define _PAGE_HPTENOIX	0x200UL /* software: pte HPTE slot unknown */
-#define _PAGE_HASHPTE	0x400UL	/* software: pte has an associated HPTE */
-#define _PAGE_EXEC	0x800UL	/* software: i-cache coherence required */
-#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */
-#define _PAGE_GROUP_IX  0x7000UL /* software: HPTE index within group */
+#define _PAGE_PRESENT	0x0001 /* software: pte contains a translation */
+#define _PAGE_USER	0x0002 /* matches one of the PP bits */
+#define _PAGE_RW	0x0004 /* software: user write access allowed */
+#define _PAGE_GUARDED	0x0008
+#define _PAGE_COHERENT	0x0010 /* M: enforce memory coherence (SMP systems) */
+#define _PAGE_NO_CACHE	0x0020 /* I: cache inhibit */
+#define _PAGE_WRITETHRU	0x0040 /* W: cache write-through */
+#define _PAGE_DIRTY	0x0080 /* C: page changed */
+#define _PAGE_ACCESSED	0x0100 /* R: page referenced */
+#define _PAGE_BUSY	0x0200 /* software: pte & hash are busy */
+#define _PAGE_HASHPTE	0x0400 /* software: pte has an associated HPTE */
+#define _PAGE_EXEC	0x0800 /* software: i-cache coherence required */
+#define _PAGE_GROUP_IX  0x7000 /* software: HPTE index within group */
+#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */
 /* Bits 0x7000 identify the index within an HPT Group */
-#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_HPTENOIX | _PAGE_SECONDARY | _PAGE_GROUP_IX)
+#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
 /* PAGE_MASK gives the right answer below, but only by accident */
 /* It should be preserving the high 48 bits and then specifically */
 /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */
@@ -281,13 +281,15 @@
 	unsigned long old, tmp;
 
 	__asm__ __volatile__("\n\
-1:	ldarx	%0,0,%3	\n\
+1:	ldarx	%0,0,%3 \n\
+        andi.   %1,%0,%7 # loop on _PAGE_BUSY set\n\
+        bne-    1b \n\
 	andc	%1,%0,%4 \n\
 	or	%1,%1,%5 \n\
 	stdcx.	%1,0,%3 \n\
 	bne-	1b"
 	: "=&r" (old), "=&r" (tmp), "=m" (*p)
-	: "r" (p), "r" (clr), "r" (set), "m" (*p)
+	: "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY)
 	: "cc" );
 	return old;
 }
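
For anyone following the locking change without a PowerPC manual at hand,
the _PAGE_BUSY handshake that the ldarx/stdcx. loops above implement can be
sketched in portable C11. This is only an illustration under assumptions:
the helper names pte_lock_busy(), pte_unlock_busy() and pte_access_ok() are
hypothetical and do not appear in the patch, the PAGE_* constants simply
mirror the values in pgtable.h, and C11 atomics stand in for the assembly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Values mirror the _PAGE_* bits in the patch; the names are illustrative. */
#define PAGE_PRESENT UINT64_C(0x0001)
#define PAGE_RW      UINT64_C(0x0004)
#define PAGE_BUSY    UINT64_C(0x0200)

/*
 * Spin while the busy bit is set, then set it atomically.
 * Returns the PTE value observed just before the busy bit was set.
 */
static uint64_t pte_lock_busy(_Atomic uint64_t *ptep)
{
	uint64_t old = atomic_load_explicit(ptep, memory_order_relaxed);

	for (;;) {
		if (old & PAGE_BUSY) {
			/* Someone else owns the entry: reload and retry. */
			old = atomic_load_explicit(ptep, memory_order_relaxed);
			continue;
		}
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(ptep, &old,
				old | PAGE_BUSY,
				memory_order_acquire, memory_order_relaxed))
			return old;
	}
}

/* Clear the busy bit; release ordering publishes earlier PTE updates. */
static void pte_unlock_busy(_Atomic uint64_t *ptep)
{
	atomic_fetch_and_explicit(ptep, ~PAGE_BUSY, memory_order_release);
}

/* Same test as "access & ~pte_val(*ptep)": nonzero means access is allowed. */
static int pte_access_ok(uint64_t pte, uint64_t access)
{
	return ((access | PAGE_PRESENT) & ~pte) == 0;
}
```

A caller would take the bit with pte_lock_busy(), check pte_access_ok() on
the returned value, and always finish with pte_unlock_busy(), much as
hash_page() funnels its error paths through out_unlock.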

From olof at austin.ibm.com  Wed Feb  4 15:52:11 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Tue, 3 Feb 2004 22:52:11 -0600 (CST)
Subject: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <20040203183459.B27780@forte.austin.ibm.com>
Message-ID: <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com>


On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:

> Patch for multiple EEH-related bugs.  Please review this patch,
> & if appropriate, please apply.  It should apply cleanly to
> the current ameslab tree (Feb 03 2004  2.6.2-rc3).

Linas,

I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
made the diff against a current ameslab tree?


-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Wed Feb  4 20:40:59 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 4 Feb 2004 20:40:59 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <20040204041820.GF22694@krispykreme>
References: <1075856864.1449.205.camel@nighthawk> <20040204041820.GF22694@krispykreme>
Message-ID: <20040204094059.GF19011@krispykreme>


Boots on a p630. Will get some help testing it on iSeries tomorrow
(hi Stephen :) and will push it if there are no complaints.

Anton

> OK, you got me to look :)
>
> The IO stuff was around from when we were using MSCHUNKS on pSeries. It
> would make memory appear contiguous on machines with IO holes (POWER3
> boxes). In the end it confused the hell out of Linux (from memory it was
> the macros that wanted to know if a physical address was in an IO hole
> or not).
>
> I just went through and cleaned the lmb stuff up, it needs to be tested
> on i and p, but I think it's a step in the right direction. It certainly
> simplifies a lot of the code.
>
> Thoughts?
>
> Anton
>
> --
>
> - remove LMB_MEMORY_AREA, LMB_IO_AREA, we only allocate/reserve memory
>   areas now
> - remove lmb_property->type, lmb_region->iosize, lmb_region->lcd_size,
>   no longer used
> - bump number of regions to 128, we'll hit this limit sooner or later
>   with our big boxes (if we have more than 64 PCI host bridges the
>   reserved array will fill up for example)
> - make all the lmb stuff __init
> - no need to explicitly zero struct lmb lmb now that we zero the BSS early
> - we had two functions to dump the lmb array, kill one of them
> - move the inline functions into lmb.c, they are only ever called from
>   there
>
> ===== include/asm-ppc64/lmb.h 1.6 vs edited =====
> --- 1.6/include/asm-ppc64/lmb.h	Fri Sep 13 21:24:37 2002
> +++ edited/include/asm-ppc64/lmb.h	Wed Feb  4 15:10:52 2004
> @@ -13,36 +13,29 @@
>   * 2 of the License, or (at your option) any later version.
>   */
>
> -#include <linux/config.h>
> +#include <linux/init.h>
>  #include <asm/prom.h>
>
>  extern unsigned long reloc_offset(void);
>
> -#define MAX_LMB_REGIONS 64
> +#define MAX_LMB_REGIONS 128
>
>  union lmb_reg_property {
>  	struct reg_property32 addr32[MAX_LMB_REGIONS];
>  	struct reg_property64 addr64[MAX_LMB_REGIONS];
>  };
>
> -#define LMB_MEMORY_AREA	1
> -#define LMB_IO_AREA	2
> -
>  #define LMB_ALLOC_ANYWHERE	0
> -#define LMB_ALLOC_FIRST4GBYTE	(1UL<<32)
>
>  struct lmb_property {
>  	unsigned long base;
>  	unsigned long physbase;
>  	unsigned long size;
> -	unsigned long type;
>  };
>
>  struct lmb_region {
>  	unsigned long cnt;
>  	unsigned long size;
> -	unsigned long iosize;
> -	unsigned long lcd_size;		/* Least Common Denominator */
>  	struct lmb_property region[MAX_LMB_REGIONS+1];
>  };
>
> @@ -53,63 +46,17 @@
>  	struct lmb_region reserved;
>  };
>
> -extern struct lmb lmb;
> -
> -extern void lmb_init(void);
> -extern void lmb_analyze(void);
> -extern long lmb_add(unsigned long, unsigned long);
> -#ifdef CONFIG_MSCHUNKS
> -extern long lmb_add_io(unsigned long base, unsigned long size);
> -#endif /* CONFIG_MSCHUNKS */
> -extern long lmb_reserve(unsigned long, unsigned long);
> -extern unsigned long lmb_alloc(unsigned long, unsigned long);
> -extern unsigned long lmb_alloc_base(unsigned long, unsigned long, unsigned long);
> -extern unsigned long lmb_phys_mem_size(void);
> -extern unsigned long lmb_end_of_DRAM(void);
> -extern unsigned long lmb_abs_to_phys(unsigned long);
> -extern void lmb_dump(char *);
> -
> -static inline unsigned long
> -lmb_addrs_overlap(unsigned long base1, unsigned long size1,
> -                  unsigned long base2, unsigned long size2)
> -{
> -        return ((base1 < (base2+size2)) && (base2 < (base1+size1)));
> -}
> -
> -static inline long
> -lmb_regions_overlap(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
> -{
> -	unsigned long base1 = rgn->region[r1].base;
> -        unsigned long size1 = rgn->region[r1].size;
> -	unsigned long base2 = rgn->region[r2].base;
> -        unsigned long size2 = rgn->region[r2].size;
> -
> -	return lmb_addrs_overlap(base1,size1,base2,size2);
> -}
> -
> -static inline long
> -lmb_addrs_adjacent(unsigned long base1, unsigned long size1,
> -		   unsigned long base2, unsigned long size2)
> -{
> -	if ( base2 == base1 + size1 ) {
> -		return 1;
> -	} else if ( base1 == base2 + size2 ) {
> -		return -1;
> -	}
> -	return 0;
> -}
> -
> -static inline long
> -lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
> -{
> -	unsigned long base1 = rgn->region[r1].base;
> -        unsigned long size1 = rgn->region[r1].size;
> -        unsigned long type1 = rgn->region[r1].type;
> -	unsigned long base2 = rgn->region[r2].base;
> -        unsigned long size2 = rgn->region[r2].size;
> -        unsigned long type2 = rgn->region[r2].type;
> +extern struct lmb lmb __initdata;
>
> -	return (type1 == type2) && lmb_addrs_adjacent(base1,size1,base2,size2);
> -}
> +extern void __init lmb_init(void);
> +extern void __init lmb_analyze(void);
> +extern long __init lmb_add(unsigned long, unsigned long);
> +extern long __init lmb_reserve(unsigned long, unsigned long);
> +extern unsigned long __init lmb_alloc(unsigned long, unsigned long);
> +extern unsigned long __init lmb_alloc_base(unsigned long, unsigned long,
> +					   unsigned long);
> +extern unsigned long __init lmb_phys_mem_size(void);
> +extern unsigned long __init lmb_end_of_DRAM(void);
> +extern unsigned long __init lmb_abs_to_phys(unsigned long);
>
>  #endif /* _PPC64_LMB_H */
> ===== arch/ppc64/kernel/lmb.c 1.6 vs edited =====
> --- 1.6/arch/ppc64/kernel/lmb.c	Tue Feb 25 20:38:45 2003
> +++ edited/arch/ppc64/kernel/lmb.c	Wed Feb  4 15:10:44 2004
> @@ -1,5 +1,4 @@
>  /*
> - *
>   * Procedures for interfacing to Open Firmware.
>   *
>   * Peter Bergner, IBM Corp.	June 2001.
> @@ -13,46 +12,63 @@
>
>  #include <linux/config.h>
>  #include <linux/kernel.h>
> +#include <linux/init.h>
>  #include <asm/types.h>
>  #include <asm/page.h>
>  #include <asm/prom.h>
>  #include <asm/lmb.h>
>  #include <asm/abs_addr.h>
>  #include <asm/bitops.h>
> -#include <asm/udbg.h>
>
> -extern unsigned long klimit;
> -extern unsigned long reloc_offset(void);
> +struct lmb lmb __initdata;
> +
> +static unsigned long __init
> +lmb_addrs_overlap(unsigned long base1, unsigned long size1,
> +                  unsigned long base2, unsigned long size2)
> +{
> +	return ((base1 < (base2+size2)) && (base2 < (base1+size1)));
> +}
>
> +static long __init
> +lmb_addrs_adjacent(unsigned long base1, unsigned long size1,
> +		   unsigned long base2, unsigned long size2)
> +{
> +	if (base2 == base1 + size1)
> +		return 1;
> +	else if (base1 == base2 + size2)
> +		return -1;
>
> -static long lmb_add_region(struct lmb_region *, unsigned long, unsigned long, unsigned long);
> +	return 0;
> +}
>
> -struct lmb lmb = {
> -	0, 0,
> -	{0,0,0,0,{{0,0,0}}},
> -	{0,0,0,0,{{0,0,0}}}
> -};
> +static long __init
> +lmb_regions_adjacent(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
> +{
> +	unsigned long base1 = rgn->region[r1].base;
> +	unsigned long size1 = rgn->region[r1].size;
> +	unsigned long base2 = rgn->region[r2].base;
> +	unsigned long size2 = rgn->region[r2].size;
>
> +	return lmb_addrs_adjacent(base1, size1, base2, size2);
> +}
>
>  /* Assumption: base addr of region 1 < base addr of region 2 */
> -static void
> +static void __init
>  lmb_coalesce_regions(struct lmb_region *rgn, unsigned long r1, unsigned long r2)
>  {
>  	unsigned long i;
>
>  	rgn->region[r1].size += rgn->region[r2].size;
> -	for (i=r2; i < rgn->cnt-1 ;i++) {
> +	for (i=r2; i < rgn->cnt-1; i++) {
>  		rgn->region[i].base = rgn->region[i+1].base;
>  		rgn->region[i].physbase = rgn->region[i+1].physbase;
>  		rgn->region[i].size = rgn->region[i+1].size;
> -		rgn->region[i].type = rgn->region[i+1].type;
>  	}
>  	rgn->cnt--;
>  }
>
> -
>  /* This routine called with relocation disabled. */
> -void
> +void __init
>  lmb_init(void)
>  {
>  	unsigned long offset = reloc_offset();
> @@ -63,13 +79,11 @@
>  	 */
>  	_lmb->memory.region[0].base = 0;
>  	_lmb->memory.region[0].size = 0;
> -	_lmb->memory.region[0].type = LMB_MEMORY_AREA;
>  	_lmb->memory.cnt = 1;
>
>  	/* Ditto. */
>  	_lmb->reserved.region[0].base = 0;
>  	_lmb->reserved.region[0].size = 0;
> -	_lmb->reserved.region[0].type = LMB_MEMORY_AREA;
>  	_lmb->reserved.cnt = 1;
>  }
>
> @@ -89,12 +103,11 @@
>  }
>
>  /* This routine called with relocation disabled. */
> -void
> +void __init
>  lmb_analyze(void)
>  {
>  	unsigned long i;
>  	unsigned long mem_size = 0;
> -	unsigned long io_size = 0;
>  	unsigned long size_mask = 0;
>  	unsigned long offset = reloc_offset();
>  	struct lmb *_lmb = PTRRELOC(&lmb);
> @@ -102,13 +115,9 @@
>  	unsigned long physbase = 0;
>  #endif
>
> -	for (i=0; i < _lmb->memory.cnt ;i++) {
> -		unsigned long lmb_type = _lmb->memory.region[i].type;
> +	for (i=0; i < _lmb->memory.cnt; i++) {
>  		unsigned long lmb_size;
>
> -		if ( lmb_type != LMB_MEMORY_AREA )
> -			continue;
> -
>  		lmb_size = _lmb->memory.region[i].size;
>
>  #ifdef CONFIG_MSCHUNKS
> @@ -121,84 +130,20 @@
>  		size_mask |= lmb_size;
>  	}
>
> -#ifdef CONFIG_MSCHUNKS
> -	for (i=0; i < _lmb->memory.cnt ;i++) {
> -		unsigned long lmb_type = _lmb->memory.region[i].type;
> -		unsigned long lmb_size;
> -
> -		if ( lmb_type != LMB_IO_AREA )
> -			continue;
> -
> -		lmb_size = _lmb->memory.region[i].size;
> -
> -		_lmb->memory.region[i].physbase = physbase;
> -		physbase += lmb_size;
> -		io_size += lmb_size;
> -		size_mask |= lmb_size;
> -	}
> -#endif /* CONFIG_MSCHUNKS */
> -
>  	_lmb->memory.size = mem_size;
> -	_lmb->memory.iosize = io_size;
> -	_lmb->memory.lcd_size = (1UL << cnt_trailing_zeros(size_mask));
> -}
> -
> -/* This routine called with relocation disabled. */
> -long
> -lmb_add(unsigned long base, unsigned long size)
> -{
> -	unsigned long offset = reloc_offset();
> -	struct lmb *_lmb = PTRRELOC(&lmb);
> -	struct lmb_region *_rgn = &(_lmb->memory);
> -
> -	/* On pSeries LPAR systems, the first LMB is our RMO region. */
> -	if ( base == 0 )
> -		_lmb->rmo_size = size;
> -
> -	return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA);
> -
>  }
>
> -#ifdef CONFIG_MSCHUNKS
>  /* This routine called with relocation disabled. */
> -long
> -lmb_add_io(unsigned long base, unsigned long size)
> -{
> -	unsigned long offset = reloc_offset();
> -	struct lmb *_lmb = PTRRELOC(&lmb);
> -	struct lmb_region *_rgn = &(_lmb->memory);
> -
> -	return lmb_add_region(_rgn, base, size, LMB_IO_AREA);
> -
> -}
> -#endif /* CONFIG_MSCHUNKS */
> -
> -long
> -lmb_reserve(unsigned long base, unsigned long size)
> -{
> -	unsigned long offset = reloc_offset();
> -	struct lmb *_lmb = PTRRELOC(&lmb);
> -	struct lmb_region *_rgn = &(_lmb->reserved);
> -
> -	return lmb_add_region(_rgn, base, size, LMB_MEMORY_AREA);
> -}
> -
> -/* This routine called with relocation disabled. */
> -static long
> -lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size,
> -		unsigned long type)
> +static long __init
> +lmb_add_region(struct lmb_region *rgn, unsigned long base, unsigned long size)
>  {
>  	unsigned long i, coalesced = 0;
>  	long adjacent;
>
>  	/* First try and coalesce this LMB with another. */
> -	for (i=0; i < rgn->cnt ;i++) {
> +	for (i=0; i < rgn->cnt; i++) {
>  		unsigned long rgnbase = rgn->region[i].base;
>  		unsigned long rgnsize = rgn->region[i].size;
> -		unsigned long rgntype = rgn->region[i].type;
> -
> -		if ( rgntype != type )
> -			continue;
>
>  		adjacent = lmb_addrs_adjacent(base,size,rgnbase,rgnsize);
>  		if ( adjacent > 0 ) {
> @@ -227,17 +172,15 @@
>  	}
>
>  	/* Couldn't coalesce the LMB, so add it to the sorted table. */
> -	for (i=rgn->cnt-1; i >= 0 ;i--) {
> +	for (i=rgn->cnt-1; i >= 0; i--) {
>  		if (base < rgn->region[i].base) {
>  			rgn->region[i+1].base = rgn->region[i].base;
>  			rgn->region[i+1].physbase = rgn->region[i].physbase;
>  			rgn->region[i+1].size = rgn->region[i].size;
> -			rgn->region[i+1].type = rgn->region[i].type;
>  		}  else {
>  			rgn->region[i+1].base = base;
>  			rgn->region[i+1].physbase = lmb_abs_to_phys(base);
>  			rgn->region[i+1].size = size;
> -			rgn->region[i+1].type = type;
>  			break;
>  		}
>  	}
> @@ -246,12 +189,38 @@
>  	return 0;
>  }
>
> -long
> +/* This routine called with relocation disabled. */
> +long __init
> +lmb_add(unsigned long base, unsigned long size)
> +{
> +	unsigned long offset = reloc_offset();
> +	struct lmb *_lmb = PTRRELOC(&lmb);
> +	struct lmb_region *_rgn = &(_lmb->memory);
> +
> +	/* On pSeries LPAR systems, the first LMB is our RMO region. */
> +	if ( base == 0 )
> +		_lmb->rmo_size = size;
> +
> +	return lmb_add_region(_rgn, base, size);
> +
> +}
> +
> +long __init
> +lmb_reserve(unsigned long base, unsigned long size)
> +{
> +	unsigned long offset = reloc_offset();
> +	struct lmb *_lmb = PTRRELOC(&lmb);
> +	struct lmb_region *_rgn = &(_lmb->reserved);
> +
> +	return lmb_add_region(_rgn, base, size);
> +}
> +
> +long __init
>  lmb_overlaps_region(struct lmb_region *rgn, unsigned long base, unsigned long size)
>  {
>  	unsigned long i;
>
> -	for (i=0; i < rgn->cnt ;i++) {
> +	for (i=0; i < rgn->cnt; i++) {
>  		unsigned long rgnbase = rgn->region[i].base;
>  		unsigned long rgnsize = rgn->region[i].size;
>  		if ( lmb_addrs_overlap(base,size,rgnbase,rgnsize) ) {
> @@ -262,13 +231,13 @@
>  	return (i < rgn->cnt) ? i : -1;
>  }
>
> -unsigned long
> +unsigned long __init
>  lmb_alloc(unsigned long size, unsigned long align)
>  {
>  	return lmb_alloc_base(size, align, LMB_ALLOC_ANYWHERE);
>  }
>
> -unsigned long
> +unsigned long __init
>  lmb_alloc_base(unsigned long size, unsigned long align, unsigned long max_addr)
>  {
>  	long i, j;
> @@ -278,13 +247,9 @@
>  	struct lmb_region *_mem = &(_lmb->memory);
>  	struct lmb_region *_rsv = &(_lmb->reserved);
>
> -	for (i=_mem->cnt-1; i >= 0 ;i--) {
> +	for (i=_mem->cnt-1; i >= 0; i--) {
>  		unsigned long lmbbase = _mem->region[i].base;
>  		unsigned long lmbsize = _mem->region[i].size;
> -		unsigned long lmbtype = _mem->region[i].type;
> -
> -		if ( lmbtype != LMB_MEMORY_AREA )
> -			continue;
>
>  		if ( max_addr == LMB_ALLOC_ANYWHERE )
>  			base = _ALIGN_DOWN(lmbbase+lmbsize-size, align);
> @@ -305,12 +270,12 @@
>  	if ( i < 0 )
>  		return 0;
>
> -	lmb_add_region(_rsv, base, size, LMB_MEMORY_AREA);
> +	lmb_add_region(_rsv, base, size);
>
>  	return base;
>  }
>
> -unsigned long
> +unsigned long __init
>  lmb_phys_mem_size(void)
>  {
>  	unsigned long offset = reloc_offset();
> @@ -327,7 +292,7 @@
>  #endif /* CONFIG_MSCHUNKS */
>  }
>
> -unsigned long
> +unsigned long __init
>  lmb_end_of_DRAM(void)
>  {
>  	unsigned long offset = reloc_offset();
> @@ -335,9 +300,7 @@
>  	struct lmb_region *_mem = &(_lmb->memory);
>  	unsigned long idx;
>
> -	for(idx=_mem->cnt-1; idx >= 0 ;idx--) {
> -		if ( _mem->region[idx].type != LMB_MEMORY_AREA )
> -			continue;
> +	for(idx=_mem->cnt-1; idx >= 0; idx--) {
>  #ifdef CONFIG_MSCHUNKS
>  		return (_mem->region[idx].physbase + _mem->region[idx].size);
>  #else
> @@ -348,8 +311,7 @@
>  	return 0;
>  }
>
> -
> -unsigned long
> +unsigned long __init
>  lmb_abs_to_phys(unsigned long aa)
>  {
>  	unsigned long i, pa = aa;
> @@ -357,7 +319,7 @@
>  	struct lmb *_lmb = PTRRELOC(&lmb);
>  	struct lmb_region *_mem = &(_lmb->memory);
>
> -	for (i=0; i < _mem->cnt ;i++) {
> +	for (i=0; i < _mem->cnt; i++) {
>  		unsigned long lmbbase = _mem->region[i].base;
>  		unsigned long lmbsize = _mem->region[i].size;
>  		if ( lmb_addrs_overlap(aa,1,lmbbase,lmbsize) ) {
> @@ -367,48 +329,4 @@
>  	}
>
>  	return pa;
> -}
> -
> -void
> -lmb_dump(char *str)
> -{
> -	unsigned long i;
> -
> -	udbg_printf("\nlmb_dump: %s\n", str);
> -	udbg_printf("    debug                       = %s\n",
> -		(lmb.debug) ? "TRUE" : "FALSE");
> -	udbg_printf("    memory.cnt                  = %d\n",
> -		lmb.memory.cnt);
> -	udbg_printf("    memory.size                 = 0x%lx\n",
> -		lmb.memory.size);
> -	udbg_printf("    memory.lcd_size             = 0x%lx\n",
> -		lmb.memory.lcd_size);
> -	for (i=0; i < lmb.memory.cnt ;i++) {
> -		udbg_printf("    memory.region[%d].base       = 0x%lx\n",
> -			i, lmb.memory.region[i].base);
> -		udbg_printf("                      .physbase = 0x%lx\n",
> -			lmb.memory.region[i].physbase);
> -		udbg_printf("                      .size     = 0x%lx\n",
> -			lmb.memory.region[i].size);
> -		udbg_printf("                      .type     = 0x%lx\n",
> -			lmb.memory.region[i].type);
> -	}
> -
> -	udbg_printf("\n");
> -	udbg_printf("    reserved.cnt                = %d\n",
> -		lmb.reserved.cnt);
> -	udbg_printf("    reserved.size               = 0x%lx\n",
> -		lmb.reserved.size);
> -	udbg_printf("    reserved.lcd_size           = 0x%lx\n",
> -		lmb.reserved.lcd_size);
> -	for (i=0; i < lmb.reserved.cnt ;i++) {
> -		udbg_printf("    reserved.region[%d].base     = 0x%lx\n",
> -			i, lmb.reserved.region[i].base);
> -		udbg_printf("                      .physbase = 0x%lx\n",
> -			lmb.reserved.region[i].physbase);
> -		udbg_printf("                      .size     = 0x%lx\n",
> -			lmb.reserved.region[i].size);
> -		udbg_printf("                      .type     = 0x%lx\n",
> -			lmb.reserved.region[i].type);
> -	}
>  }
> ===== arch/ppc64/kernel/prom.c 1.52 vs edited =====
> --- 1.52/arch/ppc64/kernel/prom.c	Sun Feb  1 13:40:21 2004
> +++ edited/arch/ppc64/kernel/prom.c	Wed Feb  4 14:42:19 2004
> @@ -668,9 +668,6 @@
>          prom_print(RELOC("    memory.size                 = 0x"));
>          prom_print_hex(_lmb->memory.size);
>  	prom_print_nl();
> -        prom_print(RELOC("    memory.lcd_size             = 0x"));
> -        prom_print_hex(_lmb->memory.lcd_size);
> -	prom_print_nl();
>          for (i=0; i < _lmb->memory.cnt ;i++) {
>                  prom_print(RELOC("    memory.region[0x"));
>  		prom_print_hex(i);
> @@ -683,9 +680,6 @@
>                  prom_print(RELOC("                      .size     = 0x"));
>                  prom_print_hex(_lmb->memory.region[i].size);
>  		prom_print_nl();
> -                prom_print(RELOC("                      .type     = 0x"));
> -                prom_print_hex(_lmb->memory.region[i].type);
> -		prom_print_nl();
>          }
>
>  	prom_print_nl();
> @@ -695,9 +689,6 @@
>          prom_print(RELOC("    reserved.size                 = 0x"));
>          prom_print_hex(_lmb->reserved.size);
>  	prom_print_nl();
> -        prom_print(RELOC("    reserved.lcd_size             = 0x"));
> -        prom_print_hex(_lmb->reserved.lcd_size);
> -	prom_print_nl();
>          for (i=0; i < _lmb->reserved.cnt ;i++) {
>                  prom_print(RELOC("    reserved.region[0x"));
>  		prom_print_hex(i);
> @@ -709,9 +700,6 @@
>  		prom_print_nl();
>                  prom_print(RELOC("                      .size     = 0x"));
>                  prom_print_hex(_lmb->reserved.region[i].size);
> -		prom_print_nl();
> -                prom_print(RELOC("                      .type     = 0x"));
> -                prom_print_hex(_lmb->reserved.region[i].type);
>  		prom_print_nl();
>          }
>  }
> ===== arch/ppc64/mm/numa.c 1.16 vs edited =====
> --- 1.16/arch/ppc64/mm/numa.c	Tue Jan 20 13:07:09 2004
> +++ edited/arch/ppc64/mm/numa.c	Wed Feb  4 15:12:02 2004
> @@ -257,10 +257,6 @@
>
>  		for (i = 0; i < lmb.memory.cnt; i++) {
>  			unsigned long physbase, size;
> -			unsigned long type = lmb.memory.region[i].type;
> -
> -			if (type != LMB_MEMORY_AREA)
> -				continue;
>
>  			physbase = lmb.memory.region[i].physbase;
>  			size = lmb.memory.region[i].size;
> ===== arch/ppc64/mm/init.c 1.55 vs edited =====
> --- 1.55/arch/ppc64/mm/init.c	Tue Jan 20 13:07:09 2004
> +++ edited/arch/ppc64/mm/init.c	Wed Feb  4 15:11:54 2004
> @@ -702,10 +702,6 @@
>  	/* add all physical memory to the bootmem map */
>  	for (i=0; i < lmb.memory.cnt; i++) {
>  		unsigned long physbase, size;
> -		unsigned long type = lmb.memory.region[i].type;
> -
> -		if ( type != LMB_MEMORY_AREA )
> -			continue;
>
>  		physbase = lmb.memory.region[i].physbase;
>  		size = lmb.memory.region[i].size;
> @@ -746,11 +742,7 @@
>
>  	for (i=0; i < lmb.memory.cnt; i++) {
>  		unsigned long physbase, size;
> -		unsigned long type = lmb.memory.region[i].type;
>  		struct kcore_list *kcore_mem;
> -
> -		if (type != LMB_MEMORY_AREA)
> -			continue;
>
>  		physbase = lmb.memory.region[i].physbase;
>  		size = lmb.memory.region[i].size;
>

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Wed Feb  4 21:51:13 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 4 Feb 2004 21:51:13 +1100
Subject: ioremap problems
In-Reply-To: <401C49DD.1050008@austin.ibm.com>
References: <20040127104640.GK11236@krispykreme> <20040127105533.GL11236@krispykreme> <401C49DD.1050008@austin.ibm.com>
Message-ID: <20040204105113.GI19011@krispykreme>


> The patch you posted didn't apply cleanly for me on latest 2.6 from
> Ameslab; it inserted the Kconfig option in the middle of the vio stuff
> for some reason.  So I hand-patched the Kconfig; here's the patch I
> generated if anyone wants it.
>
> It would be nice to get this into Ameslab or mainline, it's very useful.

Let me try it on Andrew once again but I'll merge it into ameslab
regardless.

Here's another simple patch based on the x86 version. It warns whenever
we use more than 8kB of stack (out of 16kB). I set the threshold rather
high since it doesn't catch the full extent of stack usage (it only
catches the start of the last irq).
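
For anyone following the arithmetic, here is a standalone user-space
sketch of the check. It is not the kernel code: the 16kB stack size is
taken from the text above, while the thread_info size and all the
sketch_* names are made up for illustration.

```c
#include <assert.h>

/* Assumed values for illustration only; the kernel defines its own. */
#define SKETCH_THREAD_SIZE      16384UL	/* 16kB kernel stack */
#define SKETCH_THREAD_INFO_SIZE 64UL	/* hypothetical sizeof(struct thread_info) */
#define SKETCH_WARN_THRESHOLD   8192UL	/* warn when less than 8kB is free */

/*
 * The stack grows down towards thread_info, which sits at the lowest
 * address of the THREAD_SIZE-aligned stack area.  Masking the stack
 * pointer with (THREAD_SIZE - 1) gives its offset from that base, i.e.
 * the number of bytes below the current stack pointer.
 */
static unsigned long sketch_stack_offset(unsigned long sp)
{
	/* The mask trick only works for a power-of-two stack size. */
	assert((SKETCH_THREAD_SIZE & (SKETCH_THREAD_SIZE - 1)) == 0);
	return sp & (SKETCH_THREAD_SIZE - 1);
}

/* Returns 1 when fewer than 8kB remain above thread_info, else 0. */
static int sketch_stack_low(unsigned long sp)
{
	return sketch_stack_offset(sp) <
	       SKETCH_THREAD_INFO_SIZE + SKETCH_WARN_THRESHOLD;
}
```

Since the check runs at interrupt entry, it only samples the stack
pointer at that moment, which is why the threshold is set generously.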

Anton

===== arch/ppc64/Kconfig 1.31 vs edited =====


 gr16b-anton/arch/ppc64/Kconfig      |    4 ++++
 gr16b-anton/arch/ppc64/kernel/irq.c |   15 +++++++++++++++
 2 files changed, 19 insertions(+)

diff -puN arch/ppc64/Kconfig~debug_stackoverflow arch/ppc64/Kconfig
--- gr16b/arch/ppc64/Kconfig~debug_stackoverflow	2004-01-22 01:18:46.234108336 +1100
+++ gr16b-anton/arch/ppc64/Kconfig	2004-01-22 01:18:46.242108222 +1100
@@ -312,6 +312,10 @@ config DEBUG_KERNEL
 	  Say Y here if you are developing drivers or trying to debug and
 	  identify kernel problems.

+config DEBUG_STACKOVERFLOW
+	bool "Check for stack overflows"
+	depends on DEBUG_KERNEL
+
 config DEBUG_SLAB
 	bool "Debug memory allocations"
 	depends on DEBUG_KERNEL
diff -puN arch/ppc64/kernel/irq.c~debug_stackoverflow arch/ppc64/kernel/irq.c
--- gr16b/arch/ppc64/kernel/irq.c~debug_stackoverflow	2004-01-22 01:18:46.237108293 +1100
+++ gr16b-anton/arch/ppc64/kernel/irq.c	2004-01-22 01:25:34.498313587 +1100
@@ -568,6 +568,21 @@ int do_IRQ(struct pt_regs *regs)

 	irq_enter();

+#ifdef CONFIG_DEBUG_STACKOVERFLOW
+	/* Debugging check for stack overflow: is there less than 8KB free? */
+	{
+		long sp;
+
+		sp = (unsigned long)_get_SP() & (THREAD_SIZE-1);
+
+		if (unlikely(sp < (sizeof(struct thread_info) + 8192))) {
+			printk("do_IRQ: stack overflow: %ld\n",
+				sp - sizeof(struct thread_info));
+			dump_stack();
+		}
+	}
+#endif
+
 	lpaca = get_paca();
 #ifdef CONFIG_SMP
 	if (lpaca->xLpPaca.xIntDword.xFields.xIpiCnt) {

_


From linas at austin.ibm.com  Thu Feb  5 03:03:03 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 10:03:03 -0600
Subject: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com>; from olof@austin.ibm.com on Tue, Feb 03, 2004 at 10:52:11PM -0600
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com>
Message-ID: <20040204100303.E27780@forte.austin.ibm.com>


On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
> On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
>
> > Patch for multiple EEH-related bugs.  Please review this patch,
> > & if appropriate, please apply.  It should apply cleanly to
> > the current ameslab tree (Feb 03 2004  2.6.2-rc3).
>
> Linas,
>
> I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
> made the diff against a current ameslab tree?

It applied to a fresh bk pull as of yesterday morning ...
I am having a very hard time understanding how bk works,
I'll check again.

--linas



From linas at austin.ibm.com  Thu Feb  5 03:42:13 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 10:42:13 -0600
Subject: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <20040204100303.E27780@forte.austin.ibm.com>; from linas@austin.ibm.com on Wed, Feb 04, 2004 at 10:03:03AM -0600
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com> <20040204100303.E27780@forte.austin.ibm.com>
Message-ID: <20040204104213.F27780@forte.austin.ibm.com>


On Wed, Feb 04, 2004 at 10:03:03AM -0600, linas at austin.ibm.com wrote:
>
> On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
> > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
> >
> > > Patch for multiple EEH-related bugs.  Please review this patch,
> > > & if appropriate, please apply.  It should apply cleanly to
> > > the current ameslab tree (Feb 03 2004  2.6.2-rc3).
> >
> > Linas,
> >
> > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
> > made the diff against a current ameslab tree?
>
> It applied to a fresh bk pull as of yesterday morning ...
> I am having a very hard time understanding how bk works,
> I'll check again.

I just did a fresh bk clone bk://source.scl.ameslab.gov/linux-2.5
Maybe I'm using the wrong tree?

--linas



From olof at austin.ibm.com  Thu Feb  5 04:03:45 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Wed, 04 Feb 2004 11:03:45 -0600
Subject: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <20040204104213.F27780@forte.austin.ibm.com>
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com> <20040204100303.E27780@forte.austin.ibm.com> <20040204104213.F27780@forte.austin.ibm.com>
Message-ID: <402125F1.2010608@austin.ibm.com>


linas at austin.ibm.com wrote:

> I just did a fresh bk clone bk://source.scl.ameslab.gov/linux-2.5
> Maybe I'm using the wrong tree?

That's really weird, since I just did the exact same thing and got failures:

olof at olof tmp $ bk clone -q bk://source.scl.ameslab.gov/linux-2.5
olof at olof tmp $ cd linux-2.5/
olof at olof linux-2.5 $ patch -p1 < ~/eeh-bug-fixes-2.6.2.patch
[...]
arch/ppc64/kernel/eeh.c 1.17: 483 lines
Get file arch/ppc64/kernel/eeh.c from SCCS with lock? [y]
arch/ppc64/kernel/eeh.c 1.17 -> 1.18: 483 lines
patching file arch/ppc64/kernel/eeh.c
Hunk #3 succeeded at 142 with fuzz 1.
Hunk #4 FAILED at 163.
Hunk #7 FAILED at 271.
2 out of 15 hunks FAILED -- saving rejects to file
arch/ppc64/kernel/eeh.c.rej
[...]
arch/ppc64/kernel/pci.c 1.42: 543 lines
Get file arch/ppc64/kernel/pci.c from SCCS with lock? [y]
arch/ppc64/kernel/pci.c 1.42 -> 1.43: 543 lines
patching file arch/ppc64/kernel/pci.c
Hunk #2 FAILED at 109.
1 out of 3 hunks FAILED -- saving rejects to file
arch/ppc64/kernel/pci.c.rej
[...]
drivers/pci/hotplug/rpaphp_core.c 1.2: 1040 lines
Get file drivers/pci/hotplug/rpaphp_core.c from SCCS with lock? [y]
drivers/pci/hotplug/rpaphp_core.c 1.2 -> 1.3: 1040 lines
patching file drivers/pci/hotplug/rpaphp_core.c
Hunk #3 FAILED at 542.
1 out of 3 hunks FAILED -- saving rejects to file
drivers/pci/hotplug/rpaphp_core.c.rej


-Olof


From kravetz at us.ibm.com  Thu Feb  5 04:40:36 2004
From: kravetz at us.ibm.com (Mike Kravetz)
Date: Wed, 4 Feb 2004 09:40:36 -0800
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <1075856864.1449.205.camel@nighthawk>
References: <1075856864.1449.205.camel@nighthawk>
Message-ID: <20040204174036.GB4996@w-mikek2.beaverton.ibm.com>


On the subject of lmb's ... I've got a pSeries 615.  On this machine
I only see one lmb of 4GB in size.  This is what is dug out of the
prom by 'prom_initialize_lmb' before any coalescing.  I was somehow
expecting more lmb's of a smaller size.  Of course, I don't know the
hardware specifics of this machine or how many DIMMs of what size it
contains.

Dave, what does the lmb layout on your 'bigger' machine look like?

--
Mike


From linas at austin.ibm.com  Thu Feb  5 04:41:44 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 11:41:44 -0600
Subject: [2.5][PATCH] Remove warnings from rtas-proc.c
In-Reply-To: <Pine.A41.4.44.0401262117480.39576-100000@forte.austin.ibm.com>; from olof@austin.ibm.com on Mon, Jan 26, 2004 at 09:20:00PM -0600
References: <Pine.A41.4.44.0401262117480.39576-100000@forte.austin.ibm.com>
Message-ID: <20040204114143.H27780@forte.austin.ibm.com>


On Mon, Jan 26, 2004 at 09:20:00PM -0600, olof at austin.ibm.com wrote:
>
> I'm tired of seeing the warnings go by every time I compile. It might be
> valid C99 code, but my GCC still warns about it:
>
> arch/ppc64/kernel/rtas-proc.c: In function `ppc_rtas_poweron_read':
> arch/ppc64/kernel/rtas-proc.c:294: warning: ISO C90 forbids mixed
> declarations and code
> ..and so on..

Not all compilers seem to generate this warning.  Mine doesn't...
although I have noticed that other people's compilers do ...
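
For reference, a minimal sketch of the construct behind that warning and
its C90-clean equivalent. The function names are invented for
illustration and have nothing to do with rtas-proc.c. In GCC the
diagnostic is controlled by -Wdeclaration-after-statement (and -pedantic
in C90 mode), so whether it shows up depends on compiler version and
flags.

```c
#include <assert.h>

/* Mixed form: valid C99, but C90 wants declarations at the top of a block. */
static int sketch_sum_mixed(const int *v, int n)
{
	int total = 0;

	assert(n >= 0);		/* a statement ... */
	int i;			/* ... followed by a declaration: triggers the warning */

	for (i = 0; i < n; i++)
		total += v[i];
	return total;
}

/* C90-clean form: all declarations precede the first statement. */
static int sketch_sum_c90(const int *v, int n)
{
	int total = 0;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		total += v[i];
	return total;
}
```

Both functions behave identically; only the position of the declaration
differs.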



From haveblue at us.ibm.com  Thu Feb  5 04:47:23 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: 04 Feb 2004 09:47:23 -0800
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <20040204174036.GB4996@w-mikek2.beaverton.ibm.com>
References: <1075856864.1449.205.camel@nighthawk>
	 <20040204174036.GB4996@w-mikek2.beaverton.ibm.com>
Message-ID: <1075916843.14153.2397.camel@nighthawk>


On Wed, 2004-02-04 at 09:40, Mike Kravetz wrote:
> On the subject of lmb's ... I've got a pSeries 615.  On this machine
> I only see one lmb of 4GB in size.  This is what is dug out of the
> prom by 'prom_initialize_lmb' before any coalescing.  I was somehow
> expecting more lmb's of a smaller size.  Of course, I don't know the
> hardware specifics of this machine or how many DIMMs of what size it
> contains.
>
> Dave, what does the lmb layout on your 'bigger' machine look like?

There were a bunch of them, I'd say around 30 or so.  But, I'm booting a
64GB p650 in an 8-way LPAR, so the hypervisor might change the layout.

--dave



From linas at austin.ibm.com  Thu Feb  5 05:02:52 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 12:02:52 -0600
Subject: EEH as module  [was Re: LPARCFG]
In-Reply-To: <20040201050227.GB22694@krispykreme>; from anton@samba.org on Sun, Feb 01, 2004 at 04:02:28PM +1100
References: <20040201050227.GB22694@krispykreme>
Message-ID: <20040204120252.J27780@forte.austin.ibm.com>


On Sun, Feb 01, 2004 at 04:02:28PM +1100, Anton Blanchard wrote:
>
> Hi,
>
> Any reason we cant make
> xxx
> tristate (so we can compile it as a
> module?)
>
> Im keeping an eye on our kernel size (its getting huge), any chance
> we get to cut it down is worth it.

Since I'm planning to be whacking EEH in a big way, does anyone feel
that it should be turned into a module?

--linas



From jdewand at redhat.com  Thu Feb  5 07:15:04 2004
From: jdewand at redhat.com (Julie DeWandel)
Date: Wed, 04 Feb 2004 15:15:04 -0500
Subject: [2.4] [PATCH] hash_page rework, take 2
References: <Pine.A41.4.44.0402032206230.85750-200000@forte.austin.ibm.com>
Message-ID: <402152C8.7080907@redhat.com>


Hi Olof,

Thank you for the explanations. In most cases, I agree but I still have
one or two things I wanted to follow up on. I have added my comments to
yours below.

I looked over the new patch and am pretty happy with it. I had to modify
it a bit (we took out quicklists in our distro) and I wanted to keep the
UL in the #defines for the PTE bits. But please read through the patch,
looking for my "JSD:" comments.

Thanks!
Julie

olof at austin.ibm.com wrote:

<snip>

>-       spin_unlock(&hash_table_lock[lock_slot].lock);
>+out_unlock:
>+       smp_wmb();
>
>JSD: I believe an smp_mb__before_clear_bit() would be a better choice, or
>JSD: at least an lwsync because this is an unlock.
>
>I'm not sure we really need any synchronization at all, since the only
>memory that's protected by the _PAGE_BUSY bit is the word itself, so
>the lock and contents change will be seen at the same time (or at least
>in the right order).
>
>smp_wmb() is cheaper (eieio) than smp_mb() (sync), so either the
>smp_wmb() should stay or it should be taken out altogether. I'd rather
>err on the side of caution, so I'll keep it.
>
OK, I managed to convince myself that using an eieio is ok here. I was
concerned that other processors might not have seen the stores that
preceded the eieio instruction, since eieio is normally only used when
dealing with device memory, whereas lwsync ensures other processors have
seen any stores to system memory at the point the lock is released. But
the only stores that matter here are to the hpte (which is sync'd) and
to the pte, which holds the lock bit itself. So when another processor
sees the pte contents without the lock bit set, it will necessarily be
seeing the updated value as well.
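
To make the "lock bit lives in the word it protects" argument concrete,
here is a portable C11 sketch. It is only an analogy: it uses C11
atomics with acquire/release ordering rather than ldarx/stdcx. and
eieio, and SKETCH_BUSY and the function names are hypothetical, not the
kernel's _PAGE_BUSY or any real kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_BUSY 0x0800ULL	/* hypothetical busy bit, not the kernel's value */

/* Spin until the busy bit is clear, then set it; returns the old contents. */
static uint64_t sketch_lock(_Atomic uint64_t *word)
{
	uint64_t old = atomic_load_explicit(word, memory_order_relaxed);

	for (;;) {
		if (old & SKETCH_BUSY) {
			/* Someone else holds the bit: reload and retry. */
			old = atomic_load_explicit(word, memory_order_relaxed);
			continue;
		}
		/* On failure the current contents are written back into old. */
		if (atomic_compare_exchange_weak_explicit(word, &old,
							  old | SKETCH_BUSY,
							  memory_order_acquire,
							  memory_order_relaxed))
			return old;
	}
}

/* Publish the new contents and drop the busy bit in one store. */
static void sketch_unlock(_Atomic uint64_t *word, uint64_t new_val)
{
	assert(!(new_val & SKETCH_BUSY));
	atomic_store_explicit(word, new_val, memory_order_release);
}
```

Because the new value and the cleared bit arrive in the same store, an
observer of that word cannot see "unlocked" together with stale
contents, whatever barrier precedes the store.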

<snip>

>JSD: Better question for (1) is why are interrupts being disabled here?
>JSD: Can this routine be called from interrupt context?
>
>Without disabling interrupts, there's a risk for deadlocks if the
>processor gets interrupted and the interrupt handler causes a page fault
>that needs to be resolved. Since the lock is held for writing, the handler
>will wait forever when locking for reading. This is actually similar to
>the original deadlock that this whole patch is meant to remove, but the
>window is really small (just a few instructions) now. Likewise an
>interrupt on a different processor is not a problem since forward progress
>is still guaranteed on the processor holding it for writing so the reader
>will eventually get the lock.
>
So it is true that an interrupt handler can cause a page fault? Can you
provide me with an example?

>JSD: (2) I don't see how this code is preventing another processor
>JSD: from grabbing the read_lock immediately after this processor has
>JSD: checked to make sure it isn't held.
>
>And on the second question: This is the trick with using the rwlock,
>there's no need to _prevent_ reading, all I needed to know is that all
>readers that started before pte_free_sync() have completed:
>
>No one can get a reference to a PTE after it's been free'd, so the only
>risk is if someone has walked the table right during pte_free() and is
>still holding a reference to it.  hash_page will hold the lock for
>reading while this takes place, so all we need to know is that we
>_could_ take the lock for writing (i.e. no readers for the table). Even
>if another CPU comes in and traverses the tree we're safe since there's
>no way they can end up in that PTE (since it's been removed from the
>tree).
>
>This is all an ad-hoc solution since there's no RCU in 2.4, so I needed
>another light-weight synchronization method.
>
Let me see if I understand this. When someone wants to free a page
pointed to by an entry in a 3rd level page table, they clear out the pte
in the page table using pte_clear(). Then they call pte_free with the
address of the page they are freeing up (not really a page table entry
but the actual page address). This page address is added to the batch
list. Later, the idle loop or process termination code calls
do_check_pgt_cache which will free all the pages in the batch list.

However, prior to freeing the pages, pte_free_batch() will call
pte_free_sync(). pte_free_sync tries to ensure that the pte_hash_locks
aren't held on any processor. If a processor is holding it, it could be
the case that the processor is currently walking the page tables and
might have loaded, for example, the address of a pmd that we have since
cleared. Since it is still using the data in that page, we don't want to
free it until they are done with it. If we wait until they drop the
lock, they are done and we also know any new reference will see the
cleared value.

Is this correct?
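
The "wait for walkers that started earlier" idea can be sketched in
user space as follows. Note the substitution: this uses a plain per-CPU
reader count with C11 atomics in place of the patch's rwlock, and every
name here (sketch_*, SKETCH_NR_CPUS) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count */

/* One reader count per CPU; stands in for the per-CPU rwlock. */
static atomic_int sketch_readers[SKETCH_NR_CPUS];

/* Reader side: a page table walk on "cpu" brackets itself with these. */
static void sketch_walk_begin(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	atomic_fetch_add(&sketch_readers[cpu], 1);
}

static void sketch_walk_end(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	atomic_fetch_sub(&sketch_readers[cpu], 1);
}

/*
 * Freeing side: call only after the entry has been unlinked from the
 * table.  Readers are not kept out; we only wait until each CPU has
 * been seen with no walk in flight.  Returns the number of CPUs checked.
 */
static int sketch_free_sync(void)
{
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		while (atomic_load(&sketch_readers[cpu]) != 0)
			;	/* spin until the in-flight walk finishes */
	return cpu;
}
```

A walk that begins after the unlink cannot reach the removed entry, so
once each count has been observed at zero the batched pages are safe to
free.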

<snip>

>------------------------------------------------------------------------
>
>===== arch/ppc64/kernel/htab.c 1.11 vs edited =====
>--- 1.11/arch/ppc64/kernel/htab.c	Thu Dec 18 16:13:25 2003
>+++ edited/arch/ppc64/kernel/htab.c	Tue Feb  3 22:01:59 2004
>@@ -48,6 +48,29 @@
> #include <asm/iSeries/HvCallHpt.h>
> #include <asm/cputable.h>
>
>+#define HPTE_LOCK_BIT 3
>+
>+static inline void pSeries_lock_hpte(HPTE *hptep)
>+{
>+	unsigned long *word = &hptep->dw0.dword0;
>+
>+	while (1) {
>+		if (!test_and_set_bit(HPTE_LOCK_BIT, word))
>+			break;
>+		while(test_bit(HPTE_LOCK_BIT, word))
>+			cpu_relax();
>+	}
>+}
>+
>+static inline void pSeries_unlock_hpte(HPTE *hptep)
>+{
>+	unsigned long *word = &hptep->dw0.dword0;
>+
>+	asm volatile("lwsync":::"memory");
>+	clear_bit(HPTE_LOCK_BIT, word);
>+}
>+
>+
> /*
>  * Note:  pte   --> Linux PTE
>  *        HPTE  --> PowerPC Hashed Page Table Entry
>@@ -64,6 +87,7 @@
>
> extern unsigned long _SDR1;
> extern unsigned long klimit;
>+extern rwlock_t pte_hash_lock[] __cacheline_aligned_in_smp;
>
> void make_pte(HPTE *htab, unsigned long va, unsigned long pa,
> 	      int mode, unsigned long hash_mask, int large);
>@@ -320,51 +344,73 @@
> 	unsigned long va, vpn;
> 	unsigned long newpp, prpn;
> 	unsigned long hpteflags, lock_slot;
>+	unsigned long access_ok, tmp;
> 	long slot;
> 	pte_t old_pte, new_pte;
>+	int ret = 0;
>
> 	/* Search the Linux page table for a match with va */
> 	va = (vsid << 28) | (ea & 0x0fffffff);
> 	vpn = va >> PAGE_SHIFT;
> 	lock_slot = get_lock_slot(vpn);
>
>-	/* Acquire the hash table lock to guarantee that the linux
>-	 * pte we fetch will not change
>+	/*
>+	 * Check the user's access rights to the page.  If access should be
>+	 * prevented then send the problem up to do_page_fault.
> 	 */
>-	spin_lock(&hash_table_lock[lock_slot].lock);
>-
>+
> 	/*
> 	 * Check the user's access rights to the page.  If access should be
> 	 * prevented then send the problem up to do_page_fault.
> 	 */
>-#ifdef CONFIG_SHARED_MEMORY_ADDRESSING
>+
> 	access |= _PAGE_PRESENT;
>-	if (unlikely(access & ~(pte_val(*ptep)))) {
>+
>+	/* We'll do access checking and _PAGE_BUSY setting in assembly, since
>+	 * it needs to be atomic.
>+	 */
>+
>+	__asm__ __volatile__ ("\n
>+	1:	ldarx	%0,0,%3\n
>+		# Check if PTE is busy\n
>+		andi.	%1,%0,%4\n
>+		bne-	1b\n
>+		ori	%0,%0,%4\n
>+		# Write the linux PTE atomically (setting busy)\n
>+		stdcx.	%0,0,%3\n
>+		bne-	1b\n
>+		# Check access rights (access & ~(pte_val(*ptep)))\n
>+		andc.	%1,%2,%0\n
>+		bne-	2f\n
>+		li      %1,1\n
>+		b	3f\n
>+	2:      li      %1,0\n
>+	3:"
>+	: "=r" (old_pte), "=r" (access_ok)
>+	: "r" (access), "r" (ptep), "i" (_PAGE_BUSY)
>+        : "cr0", "memory");
>+
>+#ifdef CONFIG_SHARED_MEMORY_ADDRESSING
>+	if (unlikely(!access_ok)) {
> 		if(!(((ea >> SMALLOC_EA_SHIFT) ==
> 		      (SMALLOC_START >> SMALLOC_EA_SHIFT)) &&
> 		     ((current->thread.flags) & PPC_FLAG_SHARED))) {
>-			spin_unlock(&hash_table_lock[lock_slot].lock);
>-			return 1;
>+			ret = 1;
>+			goto out_unlock;
> 		}
> 	}
> #else
>-	access |= _PAGE_PRESENT;
>-	if (unlikely(access & ~(pte_val(*ptep)))) {
>-		spin_unlock(&hash_table_lock[lock_slot].lock);
>-		return 1;
>+	if (unlikely(!access_ok)) {
>+		ret = 1;
>+		goto out_unlock;
> 	}
> #endif
>
> 	/*
>-	 * We have found a pte (which was present).
>-	 * The spinlocks prevent this status from changing
>-	 * The hash_table_lock prevents the _PAGE_HASHPTE status
>-	 * from changing (RPN, DIRTY and ACCESSED too)
>-	 * The page_table_lock prevents the pte from being
>-	 * invalidated or modified
>-	 */
>-
>-	/*
>+	 * We have found a proper pte. The hash_table_lock protects
>+	 * the pte from deallocation and the _PAGE_BUSY bit protects
>+	 * the contents of the PTE from changing.
>+	 *
> 	 * At this point, we have a pte (old_pte) which can be used to build
> 	 * or update an HPTE. There are 2 cases:
> 	 *
>@@ -385,7 +431,7 @@
> 	else
> 		pte_val(new_pte) |= _PAGE_ACCESSED;
>
>-	newpp = computeHptePP(pte_val(new_pte));
>+	newpp = computeHptePP(pte_val(new_pte) & ~_PAGE_BUSY);
>
> 	/* Check if pte already has an hpte (case 2) */
> 	if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) {
>@@ -400,12 +446,13 @@
> 		slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP;
> 		slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12;
>
>-		/* XXX fix large pte flag */
>+	        /* XXX fix large pte flag */
> 		if (ppc_md.hpte_updatepp(slot, secondary,
> 					 newpp, va, 0) == -1) {
> 			pte_val(old_pte) &= ~_PAGE_HPTEFLAGS;
> 		} else {
> 			if (!pte_same(old_pte, new_pte)) {
>+				/* _PAGE_BUSY is still set in new_pte */
> 				*ptep = new_pte;
> 			}
> 		}
>@@ -425,12 +472,19 @@
> 		pte_val(new_pte) |= ((slot<<12) &
> 				     (_PAGE_GROUP_IX | _PAGE_SECONDARY));
>
>+		smp_wmb();
>+		/* _PAGE_BUSY is not set in new_pte */
> 		*ptep = new_pte;
>+
>+		return 0;
> 	}
>
>-	spin_unlock(&hash_table_lock[lock_slot].lock);
>+out_unlock:
>+	smp_wmb();
>
>-	return 0;
>+	pte_val(*ptep) &= ~_PAGE_BUSY;
>+
>+	return ret;
> }
>
> /*
>@@ -497,11 +551,14 @@
> 	pgdir = mm->pgd;
> 	if (pgdir == NULL) return 1;
>
>-	/*
>-	 * Lock the Linux page table to prevent mmap and kswapd
>-	 * from modifying entries while we search and update
>+	/* The pte_hash_lock is used to block any PTE deallocations
>+	 * while we walk the tree and use the entry. While technically
>+	 * we both read and write the PTE entry while holding the read
>+	 * lock, the _PAGE_BUSY bit will block pte_update()s to the
>+	 * specific entry.
> 	 */
>-	spin_lock(&mm->page_table_lock);
>+
>+	read_lock(&pte_hash_lock[smp_processor_id()]);
>
> 	ptep = find_linux_pte(pgdir, ea);
> 	/*
>@@ -514,8 +571,7 @@
> 		/* If no pte, send the problem up to do_page_fault */
> 		ret = 1;
> 	}
>-
>-	spin_unlock(&mm->page_table_lock);
>+	read_unlock(&pte_hash_lock[smp_processor_id()]);
>
> 	return ret;
> }
>@@ -540,8 +596,6 @@
> 	lock_slot = get_lock_slot(vpn);
> 	hash = hpt_hash(vpn, large);
>
>-	spin_lock_irqsave(&hash_table_lock[lock_slot].lock, flags);
>-
> 	pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0));
> 	secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15;
> 	if (secondary) hash = ~hash;
>@@ -551,8 +605,6 @@
> 	if (pte_val(pte) & _PAGE_HASHPTE) {
> 		ppc_md.hpte_invalidate(slot, secondary, va, large, local);
> 	}
>-
>-	spin_unlock_irqrestore(&hash_table_lock[lock_slot].lock, flags);
> }
>
> long plpar_pte_enter(unsigned long flags,
>@@ -787,6 +839,8 @@
>
> 	avpn = vpn >> 11;
>
>+	pSeries_lock_hpte(hptep);
>+
> 	dw0 = hptep->dw0.dw0;
>
> 	/*
>@@ -794,9 +848,13 @@
> 	 * the AVPN, hash group, and valid bits.  By doing it this way,
> 	 * it is common with the pSeries LPAR optimal path.
> 	 */
>-	if (dw0.bolted) return;
>+	if (dw0.bolted) {
>+		pSeries_unlock_hpte(hptep);
>
>-	/* Invalidate the hpte. */
>+		return;
>+	}
>+
>+	/* Invalidate the hpte. This clears the lock as well. */
> 	hptep->dw0.dword0 = 0;
>
> 	/* Invalidate the tlb */
>@@ -875,6 +933,8 @@
>
> 	avpn = vpn >> 11;
>
>+	pSeries_lock_hpte(hptep);
>+
> 	dw0 = hptep->dw0.dw0;
> 	if ((dw0.avpn == avpn) &&
> 	    (dw0.v) && (dw0.h == secondary)) {
>@@ -900,10 +960,14 @@
> 		hptep->dw0.dw0 = dw0;
>
> 		__asm__ __volatile__ ("ptesync" : : : "memory");
>+
>+		pSeries_unlock_hpte(hptep);
>
> 		return 0;
> 	}
>
>+	pSeries_unlock_hpte(hptep);
>+
> 	return -1;
> }
>
>@@ -1062,9 +1126,11 @@
> 		dw0 = hptep->dw0.dw0;
> 		if (!dw0.v) {
> 			/* retry with lock held */
>+			pSeries_lock_hpte(hptep);
> 			dw0 = hptep->dw0.dw0;
> 			if (!dw0.v)
> 				break;
>+			pSeries_unlock_hpte(hptep);
> 		}
> 		hptep++;
> 	}
>@@ -1079,9 +1145,11 @@
> 			dw0 = hptep->dw0.dw0;
> 			if (!dw0.v) {
> 				/* retry with lock held */
>+				pSeries_lock_hpte(hptep);
> 				dw0 = hptep->dw0.dw0;
> 				if (!dw0.v)
> 					break;
>+				pSeries_unlock_hpte(hptep);
> 			}
> 			hptep++;
> 		}
>@@ -1304,9 +1372,11 @@
>
> 		if (dw0.v && !dw0.bolted) {
> 			/* retry with lock held */
>+			pSeries_lock_hpte(hptep);
> 			dw0 = hptep->dw0.dw0;
> 			if (dw0.v && !dw0.bolted)
> 				break;
>+			pSeries_unlock_hpte(hptep);
> 		}
>
> 		slot_offset++;
>===== arch/ppc64/mm/init.c 1.8 vs edited =====
>--- 1.8/arch/ppc64/mm/init.c	Tue Jan  6 17:54:44 2004
>+++ edited/arch/ppc64/mm/init.c	Tue Feb  3 21:49:59 2004
>@@ -104,16 +104,94 @@
>  */
> mmu_gather_t     mmu_gathers[NR_CPUS];
>
>+/* PTE free batching structures. We need a lock since not all
>+ * operations take place under page_table_lock. Keep it per-CPU
>+ * to avoid bottlenecks.
>+ */
>+
>+struct pte_freelist_batch ____cacheline_aligned pte_freelist_cur[NR_CPUS] __cacheline_aligned_in_smp;
>+rwlock_t pte_hash_lock[NR_CPUS] __cacheline_aligned_in_smp = { [0 ... NR_CPUS-1] = RW_LOCK_UNLOCKED };
>+
>+unsigned long pte_freelist_forced_free;
>+
>+static inline void pte_free_sync(void)
>+{
>+	unsigned long flags;
>+	int i;
>+
>+	/* All we need to know is that we can get the write lock if
>+	 * we wanted to, i.e. that no hash_page()s are holding it for reading.
>+	 * If none are reading, that means there's no currently executing
>+	 * hash_page() that might be working on one of the PTE's that will
>+	 * be deleted. Likewise, if there is a reader, we need to get the
>+	 * write lock to know when it releases the lock.
>+	 */
>+
>+	for (i = 0; i < smp_num_cpus; i++)
>+		if (is_read_locked(&pte_hash_lock[i])) {
>+			/* So we don't deadlock with a reader on current cpu */
>+			if(i == smp_processor_id())
>+				local_irq_save(flags);
>+
>+			write_lock(&pte_hash_lock[i]);
>+			write_unlock(&pte_hash_lock[i]);
>+
>+			if(i == smp_processor_id())
>+				local_irq_restore(flags);
>+		}
>+}
>+
>+
>+/* This is only called when we are critically out of memory
>+ * (and fail to get a page in pte_free_tlb).
>+ */
>+void pte_free_now(pte_t *pte)
>+{
>+	pte_freelist_forced_free++;
>+
>+	pte_free_sync();
>+
>+	pte_free_kernel(pte);
>+}
>+
>+/* Deallocates the pte-free batch after synchronizing with readers of
>+ * any page tables.
>+ */
>+void pte_free_batch(void **batch, int size)
>+{
>+	unsigned int i;
>+
>+	pte_free_sync();
>+
>+	for (i = 0; i < size; i++)
>+		pte_free_kernel(batch[i]);
>+
>+	free_page((unsigned long)batch);
>+}
>+
>+
> int do_check_pgt_cache(int low, int high)
> {
> 	int freed = 0;
>+	struct pte_freelist_batch *batch;
>+
>+	/* We use this function to push the current pte free batch to be
>+	 * deallocated, since do_check_pgt_cache() is called at the end of each
>+	 * free_one_pgd() and other parts of VM rely on all PTE's being
>+	 * properly freed upon return from that function.
>+	 */
>+
>+	batch = &pte_freelist_cur[smp_processor_id()];
>+
>+	if(batch->entry) {
>+		pte_free_batch(batch->entry, batch->index);
>+		batch->entry = NULL;
>+	}
>
> 	if (pgtable_cache_size > high) {
> 		do {
> 			if (pgd_quicklist)
> 				free_page((unsigned long)pgd_alloc_one_fast(0)), ++freed;
>-			if (pmd_quicklist)
>-				free_page((unsigned long)pmd_alloc_one_fast(0, 0)), ++freed;
> 			if (pte_quicklist)
> 				free_page((unsigned long)pte_alloc_one_fast(0, 0)), ++freed;
> 		} while (pgtable_cache_size > low);
>@@ -290,7 +368,9 @@
> void
> local_flush_tlb_mm(struct mm_struct *mm)
> {
>-	spin_lock(&mm->page_table_lock);
>+	unsigned long flags;
>+
>+	spin_lock_irqsave(&mm->page_table_lock, flags);
>
> 	if ( mm->map_count ) {
> 		struct vm_area_struct *mp;
>
JSD: I believe you said the _irqsave wasn't needed here so this hunk and the next
JSD: one can be removed.

>@@ -298,7 +378,7 @@
> 			local_flush_tlb_range( mm, mp->vm_start, mp->vm_end );
> 	}
>
>-	spin_unlock(&mm->page_table_lock);
>+	spin_unlock_irqrestore(&mm->page_table_lock, flags);
> }
>
> /*
>===== include/asm-ppc64/pgalloc.h 1.2 vs edited =====
>--- 1.2/include/asm-ppc64/pgalloc.h	Tue Apr  9 06:31:08 2002
>+++ edited/include/asm-ppc64/pgalloc.h	Tue Feb  3 21:50:36 2004
>@@ -15,7 +15,6 @@
> #define quicklists      get_paca()
>
> #define pgd_quicklist 		(quicklists->pgd_cache)
>-#define pmd_quicklist 		(quicklists->pmd_cache)
> #define pte_quicklist 		(quicklists->pte_cache)
> #define pgtable_cache_size 	(quicklists->pgtable_cache_sz)
>
>@@ -60,10 +59,10 @@
> static inline pmd_t*
> pmd_alloc_one_fast (struct mm_struct *mm, unsigned long addr)
> {
>-	unsigned long *ret = (unsigned long *)pmd_quicklist;
>+	unsigned long *ret = (unsigned long *)pte_quicklist;
>
> 	if (ret != NULL) {
>-		pmd_quicklist = (unsigned long *)(*ret);
>+		pte_quicklist = (unsigned long *)(*ret);
> 		ret[0] = 0;
> 		--pgtable_cache_size;
> 	}
>@@ -80,14 +79,6 @@
> 	return pmd;
> }
>
>-static inline void
>-pmd_free (pmd_t *pmd)
>-{
>-	*(unsigned long *)pmd = (unsigned long) pmd_quicklist;
>-	pmd_quicklist = (unsigned long *) pmd;
>-	++pgtable_cache_size;
>-}
>-
> #define pmd_populate(MM, PMD, PTE)	pmd_set(PMD, PTE)
>
> static inline pte_t*
>@@ -115,12 +106,54 @@
> }
>
> static inline void
>-pte_free (pte_t *pte)
>+pte_free_kernel (pte_t *pte)
> {
> 	*(unsigned long *)pte = (unsigned long) pte_quicklist;
> 	pte_quicklist = (unsigned long *) pte;
> 	++pgtable_cache_size;
> }
>+
>+
>+/* Use the PTE functions for freeing PMD as well, since the same
>+ * problem with tree traversals apply. Since pmd pointers are always
>+ * virtual, no need for a page_address() translation.
>+ */
>+
>+#define pte_free(pte_page)      __pte_free(pte_page)
>
JSD: Your original patch defined pte_free to be __pte_free(page_address(pte_page))
JSD: Is the page_address() wrapper no longer necessary?

>+#define pmd_free(pmd)           __pte_free(pmd)
>+
>+struct pte_freelist_batch
>+{
>+	unsigned int	index;
>+	void	      **entry;
>+};
>+
>+#define PTE_FREELIST_SIZE	(PAGE_SIZE / sizeof(void *))
>+
>+extern void pte_free_now(pte_t *pte);
>+extern void pte_free_batch(void **batch, int size);
>+extern struct ____cacheline_aligned pte_freelist_batch pte_freelist_cur[] __cacheline_aligned_in_smp;
>+
>+static inline void __pte_free(pte_t *pte)
>+{
>+	struct pte_freelist_batch *batchp = &pte_freelist_cur[smp_processor_id()];
>+
>+	if (batchp->entry == NULL) {
>+		batchp->entry = (void **)__get_free_page(GFP_ATOMIC);
>+		if (batchp->entry == NULL) {
>+			pte_free_now(pte);
>+			return;
>+		}
>+		batchp->index = 0;
>+	}
>+
>+	batchp->entry[batchp->index++] = pte;
>+	if (batchp->index == PTE_FREELIST_SIZE) {
>+		pte_free_batch(batchp->entry, batchp->index);
>+		batchp->entry = NULL;
>+	}
>+}
>+
>
> extern int do_check_pgt_cache(int, int);
>
>===== include/asm-ppc64/pgtable.h 1.7 vs edited =====
>--- 1.7/include/asm-ppc64/pgtable.h	Mon Aug 25 23:47:52 2003
>+++ edited/include/asm-ppc64/pgtable.h	Tue Feb  3 21:33:22 2004
>@@ -88,22 +88,22 @@
>  * Bits in a linux-style PTE.  These match the bits in the
>  * (hardware-defined) PowerPC PTE as closely as possible.
>  */
>-#define _PAGE_PRESENT	0x001UL	/* software: pte contains a translation */
>-#define _PAGE_USER	0x002UL	/* matches one of the PP bits */
>-#define _PAGE_RW	0x004UL	/* software: user write access allowed */
>-#define _PAGE_GUARDED	0x008UL
>-#define _PAGE_COHERENT	0x010UL	/* M: enforce memory coherence (SMP systems) */
>-#define _PAGE_NO_CACHE	0x020UL	/* I: cache inhibit */
>-#define _PAGE_WRITETHRU	0x040UL	/* W: cache write-through */
>-#define _PAGE_DIRTY	0x080UL	/* C: page changed */
>-#define _PAGE_ACCESSED	0x100UL	/* R: page referenced */
>-#define _PAGE_HPTENOIX	0x200UL /* software: pte HPTE slot unknown */
>-#define _PAGE_HASHPTE	0x400UL	/* software: pte has an associated HPTE */
>-#define _PAGE_EXEC	0x800UL	/* software: i-cache coherence required */
>-#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */
>-#define _PAGE_GROUP_IX  0x7000UL /* software: HPTE index within group */
>+#define _PAGE_PRESENT	0x0001 /* software: pte contains a translation */
>+#define _PAGE_USER	0x0002 /* matches one of the PP bits */
>+#define _PAGE_RW	0x0004 /* software: user write access allowed */
>+#define _PAGE_GUARDED	0x0008
>+#define _PAGE_COHERENT	0x0010 /* M: enforce memory coherence (SMP systems) */
>+#define _PAGE_NO_CACHE	0x0020 /* I: cache inhibit */
>+#define _PAGE_WRITETHRU	0x0040 /* W: cache write-through */
>+#define _PAGE_DIRTY	0x0080 /* C: page changed */
>+#define _PAGE_ACCESSED	0x0100 /* R: page referenced */
>+#define _PAGE_BUSY	0x0200 /* software: pte & hash are busy */
>+#define _PAGE_HASHPTE	0x0400 /* software: pte has an associated HPTE */
>+#define _PAGE_EXEC	0x0800 /* software: i-cache coherence required */
>+#define _PAGE_GROUP_IX  0x7000 /* software: HPTE index within group */
>+#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */
> /* Bits 0x7000 identify the index within an HPT Group */
>-#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_HPTENOIX | _PAGE_SECONDARY | _PAGE_GROUP_IX)
>+#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
> /* PAGE_MASK gives the right answer below, but only by accident */
> /* It should be preserving the high 48 bits and then specifically */
> /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */
>@@ -281,13 +281,15 @@
> 	unsigned long old, tmp;
>
> 	__asm__ __volatile__("\n\
>-1:	ldarx	%0,0,%3	\n\
>+1:	ldarx	%0,0,%3 \n\
>+        andi.   %1,%0,%7 # loop on _PAGE_BUSY set\n\
>+        bne-    1b \n\
> 	andc	%1,%0,%4 \n\
> 	or	%1,%1,%5 \n\
> 	stdcx.	%1,0,%3 \n\
> 	bne-	1b"
> 	: "=&r" (old), "=&r" (tmp), "=m" (*p)
>-	: "r" (p), "r" (clr), "r" (set), "m" (*p)
>+	: "r" (p), "r" (clr), "r" (set), "m" (*p), "i" (_PAGE_BUSY)
> 	: "cc" );
> 	return old;
> }
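
A note on the pte_update() hunk quoted above: the added andi./bne- pair
sends the loop back to the ldarx for as long as _PAGE_BUSY is set, so the
PTE is only rewritten once the busy bit has cleared. Below is a minimal
user-space sketch of that idea, using C11 atomics in place of the PowerPC
ldarx/stdcx. pair. The names pte_update_sketch and PAGE_BUSY are invented
for illustration and are not part of the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative stand-in for _PAGE_BUSY (0x0200 in the quoted patch). */
#define PAGE_BUSY UINT64_C(0x0200)

/*
 * Atomically clear the bits in `clr` and set the bits in `set`,
 * but only once the busy bit is observed clear. Returns the old value.
 * Hypothetical helper; the kernel code does this in inline assembly.
 */
static uint64_t pte_update_sketch(_Atomic uint64_t *p,
                                  uint64_t clr, uint64_t set)
{
    uint64_t old = atomic_load(p);

    for (;;) {
        uint64_t updated;

        if (old & PAGE_BUSY) {
            /* Someone else owns the entry: reload and try again. */
            old = atomic_load(p);
            continue;
        }
        updated = (old & ~clr) | set;
        /* On failure `old` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(p, &old, updated))
            return old;
    }
}
```

The compare-exchange plays the role of the load-reserve/store-conditional
pair; as in the assembly, a caller spins for as long as the bit stays set.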
>
>
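
For reference, the __pte_free() batching quoted earlier in this message
can be sketched in plain C as follows. This is only an illustration under
stated assumptions: BATCH_SIZE, free_now(), flush_batch() and
deferred_free() are hypothetical names, malloc/free stand in for the page
allocator, and the flush simply frees every entry at once. What
pte_free_batch() really does with a full batch is not shown in the quoted
hunk.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical; the patch uses PAGE_SIZE / sizeof(void *). */
#define BATCH_SIZE 4

struct free_batch {
    unsigned int index;   /* next unused slot in entry[] */
    void **entry;         /* NULL until a batch page is allocated */
};

/* Counters so the behaviour can be observed in a test. */
static unsigned long freed_now, freed_batched;

/* Fallback path: release one object immediately. */
static void free_now(void *p)
{
    free(p);
    freed_now++;
}

/* Release every queued object, then the batch array itself. */
static void flush_batch(void **batch, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++) {
        free(batch[i]);
        freed_batched++;
    }
    free(batch);
}

/* Queue p for a later bulk free; fall back if no batch can be allocated. */
static void deferred_free(struct free_batch *b, void *p)
{
    if (b->entry == NULL) {
        b->entry = malloc(BATCH_SIZE * sizeof(void *));
        if (b->entry == NULL) {
            free_now(p);
            return;
        }
        b->index = 0;
    }

    b->entry[b->index++] = p;
    if (b->index == BATCH_SIZE) {
        flush_batch(b->entry, b->index);
        b->entry = NULL;
    }
}
```

As in the patch, index is only reset when a fresh batch array is
allocated, so it keeps its old value between a flush and the next free.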

--
Julie DeWandel <jdewand at redhat.com>
Red Hat, Inc.
Tel (978) 692-3113 x23251


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From moilanen at austin.ibm.com  Thu Feb  5 07:19:05 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Wed, 04 Feb 2004 14:19:05 -0600
Subject: halt vs halt -p
Message-ID: <1075925944.1421.41.camel@magik>


Maybe I don't know the history, but why do we power-off on a 'halt'?
According to the man page, we shouldn't power-off unless it's a 'halt
-p' or a 'poweroff'.

I checked x86 and it doesn't do anything on a halt.  I put a patch below
of what I would have expected the code to look like.

Thanks,
Jake

diff -Nru a/arch/ppc64/kernel/iSeries_setup.c b/arch/ppc64/kernel/iSeries_setup.c
--- a/arch/ppc64/kernel/iSeries_setup.c Wed Feb  4 13:37:46 2004
+++ b/arch/ppc64/kernel/iSeries_setup.c Wed Feb  4 13:37:46 2004
@@ -786,7 +786,6 @@
  */
 void iSeries_halt(void)
 {
-       mf_powerOff();
 }

 /* JDH Hack */
diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
--- a/arch/ppc64/kernel/rtas.c  Wed Feb  4 13:37:46 2004
+++ b/arch/ppc64/kernel/rtas.c  Wed Feb  4 13:37:46 2004
@@ -417,7 +417,6 @@
 {
        if (rtas_firmware_flash_list.next)
                rtas_flash_bypass_warning();
-        rtas_power_off();
 }

 unsigned long rtas_rmo_buf = 0;
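
To spell out the behaviour the man page describes, here is a small sketch
of the decision only: power off for 'poweroff' or 'halt -p', and just stop
for a plain 'halt'. On Linux the two requests reach the kernel as separate
reboot(2) commands (LINUX_REBOOT_CMD_HALT and LINUX_REBOOT_CMD_POWER_OFF).
The enum and the function choose_action() are made-up names for
illustration, not code from the tree.

```c
#include <string.h>

/* Hypothetical labels for the two outcomes discussed above. */
enum shutdown_action {
    ACT_HALT,       /* stop the system, leave the power on */
    ACT_POWER_OFF   /* stop the system and remove power */
};

/*
 * Pick the action from the command name and whether -p was given.
 * Only 'poweroff' and 'halt -p' ask for the power to be cut.
 */
static enum shutdown_action choose_action(const char *cmd, int p_flag)
{
    if (strcmp(cmd, "poweroff") == 0)
        return ACT_POWER_OFF;
    if (strcmp(cmd, "halt") == 0 && p_flag)
        return ACT_POWER_OFF;
    return ACT_HALT;
}
```

With the patch above applied, the halt hooks would match the ACT_HALT case
and no longer power the machine off themselves.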




From boutcher at us.ibm.com  Thu Feb  5 07:27:04 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Wed, 4 Feb 2004 14:27:04 -0600
Subject: halt vs halt -p
In-Reply-To: <1075925944.1421.41.camel@magik>
Message-ID: <OF0CCB1191.7A2FE6BA-ON86256E30.00703A1B-86256E30.007057B6@us.ibm.com>


Well, validate the iSeries path.  On the iSeries, OS/400 kicks off a
graceful shutdown of the Linux partition, and at the end the partition is
supposed to end up powered off.  I'm not sure whether that ends up in
iSeries_halt() or not.

Dave Boutcher
IBM Linux Technology Center




From linas at austin.ibm.com  Thu Feb  5 07:28:53 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 14:28:53 -0600
Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com>; from olof@austin.ibm.com on Tue, Feb 03, 2004 at 10:52:11PM -0600
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com>
Message-ID: <20040204142853.A28220@forte.austin.ibm.com>

On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
> On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
>
> > Patch for multiple EEH-related bugs.  Please review this patch,
> > & if appropriate, please apply.  It should apply cleanly to
> > the current ameslab tree (Feb 03 2004  2.6.2-rc3).
>
> I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
> made the diff against a current ameslab tree?

Right tree, bad email attachment.

I don't know how it happened, but what I sent out had some trailing
whitespace whacked. The attached patch should not have this problem.

--linas
-------------- next part --------------
===== arch/ppc64/kernel/chrp_setup.c 1.49 vs edited =====
--- 1.49/arch/ppc64/kernel/chrp_setup.c	Mon Jan 19 20:07:02 2004
+++ edited/arch/ppc64/kernel/chrp_setup.c	Tue Feb  3 16:35:44 2004
@@ -71,6 +71,7 @@
 extern void openpic_init_irq_desc(irq_desc_t *);

 extern void find_and_init_phbs(void);
+extern void __init eeh_init(void);

 extern void pSeries_get_boot_time(struct rtc_time *rtc_time);
 extern void pSeries_get_rtc_time(struct rtc_time *rtc_time);
===== arch/ppc64/kernel/eeh.c 1.17 vs edited =====
--- 1.17/arch/ppc64/kernel/eeh.c	Tue Feb  3 11:03:04 2004
+++ edited/arch/ppc64/kernel/eeh.c	Tue Feb  3 16:38:19 2004
@@ -38,44 +38,68 @@
 #define BUID_LO(buid) ((buid) & 0xffffffff)
 #define CONFIG_ADDR(busno, devfn) (((((busno) & 0xff) << 8) | ((devfn) & 0xf8)) << 8)

-unsigned long eeh_total_mmio_ffs;
-unsigned long eeh_false_positives;
 /* RTAS tokens */
 static int ibm_set_eeh_option;
 static int ibm_set_slot_reset;
 static int ibm_read_slot_reset_state;

-static int eeh_implemented;
+static int eeh_subsystem_enabled;
 #define EEH_MAX_OPTS 4096
 static char *eeh_opts;
 static int eeh_opts_last;

-unsigned char	slot_err_buf[RTAS_ERROR_LOG_MAX];
+/* System monitoring statistics */
+unsigned long eeh_total_mmio_ffs;
+unsigned long eeh_false_positives;
+unsigned long eeh_ignored_failures;

 pte_t *find_linux_pte(pgd_t *pgdir, unsigned long va);	/* from htab.c */
 static int eeh_check_opts_config(struct device_node *dn,
 				 int class_code, int vendor_id, int device_id,
 				 int default_state);

-unsigned long eeh_token_to_phys(unsigned long token)
+
+/**
+ * eeh_token_to_phys - convert EEH address token to phys address
+ * @token i/o token, should be address in the form 0xA....
+ *
+ * Converts EEH address tokens into physical addresses.  Note that
+ * this routine does *not* convert I/O BAR addresses (which start
+ * with 0xE...) to phys addresses!
+ */
+unsigned long
+eeh_token_to_phys(unsigned long token)
 {
+	pte_t *ptep;
+	unsigned long pa, vaddr;
 	if (REGION_ID(token) == EEH_REGION_ID) {
-		unsigned long vaddr = IO_TOKEN_TO_ADDR(token);
-		pte_t *ptep = find_linux_pte(ioremap_mm.pgd, vaddr);
-		unsigned long pa = pte_pfn(*ptep) << PAGE_SHIFT;
-		return pa | (vaddr & (PAGE_SIZE-1));
-	} else
+		vaddr = IO_TOKEN_TO_ADDR(token);
+	} else {
 		return token;
+	}
+
+	ptep = find_linux_pte(ioremap_mm.pgd, vaddr);
+	pa = pte_pfn(*ptep) << PAGE_SHIFT;
+	return pa | (vaddr & (PAGE_SIZE-1));
 }

-/* Check for an eeh failure at the given token address.
+/**
+ * eeh_check_failure - check if all 1's data is due to EEH slot freeze
+ * @token i/o token, should be address in the form 0xA....
+ * @val value, should be all 1's (XXX why do we need this arg??)
+ * @who arbitrary ID, useful for debugging
+ *
+ * Check for an eeh failure at the given token address.
  * The given value has been read and it should be 1's (0xff, 0xffff or
  * 0xffffffff).
  *
  * Probe to determine if an error actually occurred.  If not return val.
  * Otherwise panic.
+ *
+ * Note this routine might be called in an interrupt context ...
  */
-unsigned long eeh_check_failure(void *token, unsigned long val)
+unsigned long
+eeh_check_failure(void *token, unsigned long val, int who)
 {
 	unsigned long addr;
 	struct pci_dev *dev;
@@ -85,28 +109,27 @@
 	/* IO BAR access could get us here...or if we manually force EEH
 	 * operation on even if the hardware won't support it.
 	 */
-	if (!eeh_implemented || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE)
+	if (!eeh_subsystem_enabled || ibm_read_slot_reset_state == RTAS_UNKNOWN_SERVICE)
 		return val;

-	/* Finding the phys addr + pci device is quite expensive.
-	 * However, the RTAS call is MUCH slower.... :(
-	 */
+	/* Finding the phys addr + pci device; this is pretty quick. */
 	addr = eeh_token_to_phys((unsigned long)token);
-	dev = pci_find_dev_by_addr(addr);
-	if (!dev) {
-		printk("EEH: no pci dev found for addr=0x%lx\n", addr);
-		return val;
-	}
+	dev = pci_get_device_by_addr(addr);
+
+	if (!dev) return val;
+
 	dn = pci_device_to_OF_node(dev);
 	if (!dn) {
-		printk("EEH: no pci dn found for addr=0x%lx\n", addr);
+		pci_dev_put (dev);
 		return val;
 	}

 	/* Access to IO BARs might get this far and still not want checking. */
-	if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || dn->eeh_mode & EEH_MODE_NOCHECK)
+	if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) ||
+			 dn->eeh_mode & EEH_MODE_NOCHECK) {
+		pci_dev_put (dev);
 		return val;
-
+	}

 	/* Now test for an EEH failure.  This is VERY expensive.
 	 * Note that the eeh_config_addr may be a parent device
@@ -119,6 +142,7 @@
 				dn->eeh_config_addr, BUID_HI(dn->phb->buid),
 				BUID_LO(dn->phb->buid));
 		if (ret == 0 && rets[1] == 1 && rets[0] >= 2) {
+			unsigned char	slot_err_buf[RTAS_ERROR_LOG_MAX];
 			unsigned long	slot_err_ret;

 			memset(slot_err_buf, 0, RTAS_ERROR_LOG_MAX);
@@ -139,23 +163,36 @@
 			 * the system in light of potential corruption, we
 			 * can use it here.
 			 */
-			if (panic_on_oops)
-				panic("EEH: MMIO failure (%ld) on device:\n%s\n",
-				      rets[0], pci_name(dev));
-			else
-				printk("EEH: MMIO failure (%ld) on device:\n%s\n",
-				       rets[0], pci_name(dev));
+			if (panic_on_oops) {
+				panic("EEH: MMIO failure (%ld) on device:\n%s\n", rets[0], pci_name(dev));
+			} else {
+				eeh_ignored_failures ++;
+				if (!in_interrupt()) {  /* XXX this will be replaced by eehdaemon */
+					printk(KERN_INFO "EEH: MMIO failure (%ld) on device:%s %s\n",
+						rets[0], pci_name(dev), pci_pretty_name(dev));
+				}
+			}
+		} else {
+			eeh_false_positives++;
 		}
 	}
-	eeh_false_positives++;
+	pci_dev_put (dev);
 	return val;	/* good case */
-
 }

 struct eeh_early_enable_info {
 	unsigned int buid_hi;
 	unsigned int buid_lo;
-	int adapters_enabled;
+
+	/* Handy-dandy statistics help us understand what's going on */
+	int num_phbs_found;
+	int num_of_nodes;
+	int num_devices;
+	int num_devices_bad_status;
+	int num_devices_graphics;
+	int num_devices_w_eeh_parent;
+	int num_devices_w_eeh_disabled;
+	int num_adapters_enabled;
 };

 /* Enable eeh for the given device node. */
@@ -170,12 +207,17 @@
 	u32 *regs;
 	int enable;

-	if (status && strcmp(status, "ok") != 0)
+	info->num_devices ++;
+	if (status && strcmp(status, "ok") != 0) {
+		info->num_devices_bad_status ++;
 		return NULL;	/* ignore devices with bad status */
+	}

 	/* Weed out PHBs or other bad nodes. */
-	if (!class_code || !vendor_id || !device_id)
+	if (!class_code || !vendor_id || !device_id) {
+		info->num_devices_bad_status ++;
 		return NULL;
+	}

 	/* Ignore known PHBs and EADs bridges */
 	if (*vendor_id == PCI_VENDOR_ID_IBM &&
@@ -191,28 +233,36 @@
 	 * But there are a few cases like display devices that make sense.
 	 */
 	enable = 1;	/* i.e. we will do checking */
-	if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY)
+	if ((*class_code >> 16) == PCI_BASE_CLASS_DISPLAY) {
+		printk (KERN_INFO "EEH: %s: display device, disabling EEH checking.\n", dn->full_name);
+		info->num_devices_graphics ++;
 		enable = 0;
+	}

 	if (!eeh_check_opts_config(dn, *class_code, *vendor_id, *device_id, enable)) {
 		if (enable) {
-			printk(KERN_INFO "EEH: %s user requested to run without EEH.\n", dn->full_name);
+			printk(KERN_NOTICE "EEH: %s user requested to run without EEH.\n", dn->full_name);
 			enable = 0;
 		}
 	}

 	if (!enable)
+	{
 		dn->eeh_mode = EEH_MODE_NOCHECK;
+		info->num_devices_w_eeh_disabled ++;
+		return NULL;
+	}

 	/* This device may already have an EEH parent. */
 	if (dn->parent && (dn->parent->eeh_mode & EEH_MODE_SUPPORTED)) {
 		/* Parent supports EEH. */
 		dn->eeh_mode |= EEH_MODE_SUPPORTED;
 		dn->eeh_config_addr = dn->parent->eeh_config_addr;
+		info->num_devices_w_eeh_parent ++;
 		return NULL;
 	}

-	/* Ok..see if this device supports EEH. */
+	/* Ok... see if this device supports EEH. */
 	regs = (u32 *)get_property(dn, "reg", 0);
 	if (regs) {
 		/* First register entry is addr (00BBSS00)  */
@@ -221,16 +271,21 @@
 				regs[0], info->buid_hi, info->buid_lo,
 				EEH_ENABLE);
 		if (ret == 0) {
-			info->adapters_enabled++;
+			info->num_adapters_enabled++;
 			dn->eeh_mode |= EEH_MODE_SUPPORTED;
 			dn->eeh_config_addr = regs[0];
+			printk (KERN_DEBUG "EEH: %s: eeh enabled\n", dn->full_name);
+		} else {
+			printk (KERN_WARNING "EEH: %s: rtas_call failed.\n", dn->full_name);
 		}
+	} else {
+		printk (KERN_WARNING "EEH: %s: unable to get reg property.\n", dn->full_name);
 	}
 	return NULL;
 }

 /*
- * Initialize eeh by trying to enable it for all of the adapters in the system.
+ * Initialize EEH by trying to enable it for all of the adapters in the system.
  * As a side effect we can determine here if eeh is supported at all.
  * Note that we leave EEH on so failed config cycles won't cause a machine
  * check.  If a user turns off EEH for a particular adapter they are really
@@ -243,7 +298,7 @@
  * The eeh-force-off/on option does literally what it says, so if Linux must
  * avoid enabling EEH this must be done.
  */
-void eeh_init(void)
+void __init eeh_init(void)
 {
 	struct device_node *phb;
 	struct eeh_early_enable_info info;
@@ -261,26 +316,37 @@
 	 * of I/O macros even if we can't actually test for EEH failure.
 	 */
 	if (eeh_force_on > eeh_force_off)
-		eeh_implemented = 1;
+		eeh_subsystem_enabled = 1;
 	else if (ibm_set_eeh_option == RTAS_UNKNOWN_SERVICE)
 		return;

 	if (eeh_force_off > eeh_force_on) {
 		/* User is forcing EEH off.  Be noisy if it is implemented. */
-		if (eeh_implemented)
+		if (eeh_subsystem_enabled)
 			printk(KERN_WARNING "EEH: WARNING: PCI Enhanced I/O Error Handling is user disabled\n");
-		eeh_implemented = 0;
+		eeh_subsystem_enabled = 0;
 		return;
 	}

-
 	/* Enable EEH for all adapters.  Note that eeh requires buid's */
-	info.adapters_enabled = 0;
+	info.num_adapters_enabled = 0;
+	info.num_of_nodes = 0;
+	info.num_phbs_found = 0;
+	info.num_devices = 0;
+	info.num_devices_bad_status = 0;
+	info.num_devices_graphics = 0;
+	info.num_devices_w_eeh_parent = 0;
+	info.num_devices_w_eeh_disabled = 0;
 	for (phb = of_find_node_by_name(NULL, "pci"); phb; phb = of_find_node_by_name(phb, "pci")) {
+
 		int len;
-		int *buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len);
+		int *buid_vals;
+
+		info.num_of_nodes ++;
+		buid_vals = (int *) get_property(phb, "ibm,fw-phb-id", &len);
 		if (!buid_vals)
 			continue;
+		info.num_phbs_found ++;
 		if (len == sizeof(int)) {
 			info.buid_lo = buid_vals[0];
 			info.buid_hi = 0;
@@ -288,24 +354,88 @@
 			info.buid_hi = buid_vals[0];
 			info.buid_lo = buid_vals[1];
 		} else {
-			printk("EEH: odd ibm,fw-phb-id len returned: %d\n", len);
+			printk(KERN_INFO "EEH: odd ibm,fw-phb-id len returned: %d\n", len);
 			continue;
 		}
 		traverse_pci_devices(phb, early_enable_eeh, NULL, &info);
 	}
-	if (info.adapters_enabled) {
+	if (info.num_adapters_enabled) {
 		printk(KERN_INFO "EEH: PCI Enhanced I/O Error Handling Enabled\n");
-		eeh_implemented = 1;
+		eeh_subsystem_enabled = 1;
+	}
+	printk(KERN_INFO "EEH: num_of_nodes=%d\n", info.num_of_nodes);
+	printk(KERN_INFO "EEH: num_phbs_found=%d\n", info.num_phbs_found);
+	printk(KERN_INFO "EEH: num_devices=%d\n", info.num_devices);
+	printk(KERN_INFO "EEH: num_devices_bad_status=%d\n", info.num_devices_bad_status);
+	printk(KERN_INFO "EEH: num_devices_graphics=%d\n", info.num_devices_graphics);
+	printk(KERN_INFO "EEH: num_devices_w_eeh_parent=%d\n", info.num_devices_w_eeh_parent);
+	printk(KERN_INFO "EEH: num_devices_w_eeh_disabled=%d\n", info.num_devices_w_eeh_disabled);
+	printk(KERN_INFO "EEH: num_adapters_enabled=%d\n", info.num_adapters_enabled);
+}
+
+/**
+ * eeh_add_device - perform EEH initialization for the indicated pci device
+ * @dev: pci device for which to set up EEH
+ *
+ * This routine can be used to perform EEH initialization for PCI
+ * devices that were added after system boot (e.g. hotplug, dlpar).
+ * Whether this actually enables EEH or not for this device depends
+ * on the type of the device, on earlier boot command-line
+ * arguments & etc.
+ */
+void
+eeh_add_device (struct pci_dev *dev)
+{
+	struct device_node *dn;
+	struct pci_controller *phb;
+	struct eeh_early_enable_info info;
+
+	if (!dev || !eeh_subsystem_enabled) return;
+
+	printk (KERN_DEBUG "EEH: adding device %s %s\n",
+					                 pci_name (dev), pci_pretty_name(dev));
+	dn = pci_device_to_OF_node(dev);
+	if (NULL == dn) return;
+
+	phb = PCI_GET_PHB_PTR(dev);
+	if (NULL == phb || 0 == phb->buid) {
+		printk (KERN_WARNING "EEH: Expected buid but found none\n");
+		return;
 	}
+
+	info.buid_hi = BUID_HI(phb->buid);
+	info.buid_lo = BUID_LO(phb->buid);
+
+	early_enable_eeh(dn, &info);
+	pci_addr_cache_insert_device (dev);
 }

+/**
+ * eeh_remove_device - undo EEH setup for the indicated pci device
+ * @dev: pci device to be removed
+ *
+ * This routine should be called when a device is removed from a running
+ * system (e.g. by hotplug or dlpar).
+ */
+void
+eeh_remove_device (struct pci_dev *dev)
+{
+	if (!dev || !eeh_subsystem_enabled) return;
+
+	/* Unregister the device with the EEH/PCI address search system */
+	printk (KERN_DEBUG "EEH: remove device %s %s\n",
+					                 pci_name (dev), pci_pretty_name(dev));
+	pci_addr_cache_remove_device (dev);
+
+}

-int eeh_set_option(struct pci_dev *dev, int option)
+int
+eeh_set_option(struct pci_dev *dev, int option)
 {
 	struct device_node *dn = pci_device_to_OF_node(dev);
 	struct pci_controller *phb = PCI_GET_PHB_PTR(dev);

-	if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_implemented)
+	if (dn == NULL || phb == NULL || phb->buid == 0 || !eeh_subsystem_enabled)
 		return -2;

 	return rtas_call(ibm_set_eeh_option, 4, 1, NULL,
@@ -316,7 +446,7 @@

 /* If EEH is implemented, find the PCI device using given phys addr
  * and check to see if eeh failure checking is disabled.
- * Remap the addr (trivially) to the EEH region if not.
+ * Remap the addr (trivially) to the EEH region if EEH checking enabled.
  * For addresses not known to PCI the vaddr is simply returned unchanged.
  */
 void *eeh_ioremap(unsigned long addr, void *vaddr)
@@ -324,28 +454,72 @@
 	struct pci_dev *dev;
 	struct device_node *dn;

-	if (!eeh_implemented)
+	if (!eeh_subsystem_enabled)
 		return vaddr;
-	dev = pci_find_dev_by_addr(addr);
+	dev = pci_get_device_by_addr(addr);
 	if (!dev)
 		return vaddr;
-	dn = pci_device_to_OF_node(dev);
-	if (!dn)
+
+	dn = pci_device_to_OF_node(dev);
+	if (!dn) {
+		pci_dev_put (dev);
 		return vaddr;
-	if (dn->eeh_mode & EEH_MODE_NOCHECK)
+	}
+	if (dn->eeh_mode & EEH_MODE_NOCHECK) {
+		pci_dev_put (dev);
 		return vaddr;
+	}

+	pci_dev_put (dev);
 	return (void *)IO_ADDR_TO_TOKEN(vaddr);
 }

 static int eeh_proc_falsepositive_read(char *page, char **start, off_t off,
 			 int count, int *eof, void *data)
 {
-	int len;
-	len = sprintf(page, "eeh_false_positives=%ld\n"
-		      "eeh_total_mmio_ffs=%ld\n",
-		      eeh_false_positives, eeh_total_mmio_ffs);
-	return len;
+	char *p, *buffer;
+#define EEH_PROC_BUFSZ 250
+	int n=0, bs=EEH_PROC_BUFSZ;
+
+	if (count < 0) return -EINVAL;
+
+	buffer = kmalloc (EEH_PROC_BUFSZ,GFP_KERNEL);
+	if (!buffer) return -ENOMEM;
+
+	p = buffer;
+
+	if (0 == eeh_subsystem_enabled) {
+		n += snprintf (p+n, bs-n, "EEH Subsystem is globally disabled\n");
+		n += snprintf(p+n, bs-n, "eeh_total_mmio_ffs=%ld\n",
+		      	eeh_total_mmio_ffs);
+	} else {
+		n += snprintf (p+n, bs-n, "EEH Subsystem is enabled\n");
+		n += snprintf(p+n, bs-n,
+		      "eeh_total_mmio_ffs=%ld\n"
+				"eeh_false_positives=%ld\n"
+				"eeh_ignored_failures=%ld\n",
+		      eeh_total_mmio_ffs,
+		      eeh_false_positives,
+				eeh_ignored_failures);
+	}
+
+	/* Misc machinations of the proc file system */
+	if (off >= strlen(buffer)) {
+		*eof = 1;
+		kfree(buffer);
+		return 0;
+	}
+	if (n > strlen(buffer) - off)
+		n = strlen(buffer) - off;
+	if (n > count)
+		n = count;
+	else
+		*eof = 1;
+
+	memcpy(page, buffer + off, n);
+	*start = page;
+	kfree(buffer);
+	return n;
 }

 /* Implementation of /proc/ppc64/eeh
@@ -362,6 +536,12 @@
 	return 0;
 }

+static int __init eeh_init_late(void)
+{
+	eeh_init_proc ();
+	return 0;
+}
+
 /*
  * Test if "dev" should be configured on or off.
  * This processes the options literally from left to right.
@@ -456,7 +636,7 @@
 		if (*cur) {
 			int curlen = curend-cur;
 			if (eeh_opts_last + curlen > EEH_MAX_OPTS-2) {
-				printk(KERN_INFO "EEH: sorry...too many eeh cmd line options\n");
+				printk(KERN_WARNING "EEH: sorry...too many eeh cmd line options\n");
 				return 1;
 			}
 			eeh_opts[eeh_opts_last++] = state ? '+' : '-';
@@ -478,6 +658,6 @@
 	return eeh_parm(str, 1);
 }

-__initcall(eeh_init_proc);
+__initcall(eeh_init_late);
 __setup("eeh-off", eehoff_parm);
 __setup("eeh-on", eehon_parm);
===== arch/ppc64/kernel/pSeries_pci.c 1.34 vs edited =====
--- 1.34/arch/ppc64/kernel/pSeries_pci.c	Fri Jan 30 21:22:28 2004
+++ edited/arch/ppc64/kernel/pSeries_pci.c	Tue Feb  3 16:35:49 2004
@@ -530,7 +530,7 @@
 			dev->resource[i].start += hose->pci_mem_offset;
 			dev->resource[i].end += hose->pci_mem_offset;
 		}
-        }
+	}
 }
 EXPORT_SYMBOL(pcibios_fixup_device_resources);

===== arch/ppc64/kernel/pci.c 1.42 vs edited =====
--- 1.42/arch/ppc64/kernel/pci.c	Mon Jan 19 20:07:05 2004
+++ edited/arch/ppc64/kernel/pci.c	Tue Feb  3 16:39:53 2004
@@ -23,6 +23,8 @@
 #include <linux/bootmem.h>
 #include <linux/module.h>
 #include <linux/mm.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>

 #include <asm/processor.h>
 #include <asm/io.h>
@@ -107,42 +109,264 @@
 	}
 }

-/* Given an mmio phys address, find a pci device that implements
- * this address.  This is of course expensive, but only used
- * for device initialization or error paths.
- * For io BARs it is assumed the pci_io_base has already been added
- * into addr.
+/**
+ * The pci address cache subsystem.  This subsystem places
+ * PCI device address resources into a red-black tree, sorted
+ * according to the address range, so that given only an i/o
+ * address, the corresponding PCI device can be **quickly**
+ * found.
  *
- * Bridges are ignored although they could be used to optimize the search.
+ * Currently, the only customer of this code is the EEH subsystem;
+ * thus, this code has been somewhat tailored to suit EEH better.
+ * In particular, the cache does *not* hold the addresses of devices
+ * for which EEH is not enabled.
+ *
+ * (Implementation Note: The RB tree seems to be better/faster
+ * than any hash algo I could think of for this problem, even
+ * with the penalty of slow pointer chases for d-cache misses).
  */
-struct pci_dev *pci_find_dev_by_addr(unsigned long addr)
+struct pci_io_addr_range
 {
-	struct pci_dev *dev = NULL;
+	struct rb_node rb_node;
+	unsigned long addr_lo;
+	unsigned long addr_hi;
+	struct pci_dev *pcidev;
+	unsigned int flags;
+};
+
+struct pci_io_addr_cache
+{
+	struct rb_root rb_root;
+	spinlock_t piar_lock;
+} pci_io_addr_cache_root;
+
+static inline struct pci_dev *
+__pci_get_device_by_addr (unsigned long addr)
+{
+	struct rb_node *n = pci_io_addr_cache_root.rb_root.rb_node;
+	while (n)
+	{
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+		if (addr < piar->addr_lo) {
+			n = n->rb_left;
+		} else
+		if (addr > piar->addr_hi) {
+			n = n->rb_right;
+		} else {
+			pci_dev_get (piar->pcidev);
+			return piar->pcidev;
+		}
+	}
+	return NULL;
+}
+
+/**
+ * pci_get_device_by_addr - Get device, given only address
+ * @addr: mmio (PIO) phys address or i/o port number
+ *
+ * Given an mmio phys address, or a port number, find a pci device
+ * that implements this address.  Be sure to pci_dev_put the device
+ * when finished.  I/O port numbers are assumed to be offset
+ * from zero (that is, they do *not* have pci_io_addr added in).
+ * It is safe to call this function within an interrupt.
+ */
+struct pci_dev *
+pci_get_device_by_addr (unsigned long addr)
+{
+	struct pci_dev *dev;
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	dev = __pci_get_device_by_addr (addr);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+	return dev;
+}
+
+/* Handy-dandy debug print routine, does nothing more
+ * than print out the contents of our addr cache. */
+static void
+pci_addr_cache_print (struct pci_io_addr_cache *cache)
+{
+	struct rb_node *n;
+	n = rb_first (&cache->rb_root);
+	int cnt=0;
+	while (n) {
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+		printk (KERN_DEBUG "PCI: %s addr range %d [%lx -%lx]: %s %s\n",
+			(piar->flags & IORESOURCE_IO) ? "i/o" : "mem",
+			cnt,
+			piar->addr_lo, piar->addr_hi,
+			pci_name (piar->pcidev),
+			pci_pretty_name (piar->pcidev));
+		cnt ++;
+		n = rb_next (n);
+	}
+}
+
+/* Insert address range into the rb tree. */
+static inline struct pci_io_addr_range *
+pci_addr_cache_insert (struct pci_dev *dev,
+                unsigned long alo, unsigned long ahi, unsigned int flags)
+{
+	struct rb_node **p = &pci_io_addr_cache_root.rb_root.rb_node;
+	struct rb_node * parent = NULL;
+	struct pci_io_addr_range *piar;
+
+	// Walk tree, find a place to insert into tree
+	while (*p) {
+		parent = *p;
+		piar = rb_entry (parent, struct pci_io_addr_range, rb_node);
+		if (alo < piar->addr_lo) {
+			p = &parent->rb_left;
+		} else if (ahi > piar->addr_hi) {
+			p = &parent->rb_right;
+		} else {
+			if (dev != piar->pcidev ||
+			    alo != piar->addr_lo || ahi != piar->addr_hi) {
+				printk (KERN_WARNING "PIAR: overlapping address range\n");
+			}
+			return piar;
+		}
+	}
+	piar = (struct pci_io_addr_range *) kmalloc (
+			sizeof(struct pci_io_addr_range),  GFP_ATOMIC);
+
+	if (!piar) return NULL;  // whoops
+
+	piar->addr_lo = alo;
+	piar->addr_hi = ahi;
+	piar->pcidev = dev;
+	piar->flags = flags;
+
+	rb_link_node (&piar->rb_node, parent, p);
+	rb_insert_color (&piar->rb_node, &pci_io_addr_cache_root.rb_root);
+	return piar;
+}
+
+inline void
+__pci_addr_cache_insert_device (struct pci_dev *dev)
+{
+	struct device_node *dn;
+	dn = pci_device_to_OF_node(dev);
+	if (!dn) {
+		printk(KERN_WARNING "PCI: no pci dn found for dev=%s %s\n",
+			pci_name(dev), pci_pretty_name(dev));
+		pci_dev_put (dev);
+		return;
+	}
+
+	// Skip any devices for which EEH is not enabled.
+	if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) ||
+	      dn->eeh_mode & EEH_MODE_NOCHECK) {
+		printk(KERN_INFO "PCI: skip building address cache for=%s %s\n",
+			pci_name(dev), pci_pretty_name(dev));
+		pci_dev_put (dev);
+		return;
+	}
+
+	// Walk resources on this device, poke them into the tree
 	int i;
-	unsigned long ioaddr;
+	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
+		unsigned long start = pci_resource_start(dev,i);
+		unsigned long end = pci_resource_end(dev,i);
+		unsigned int flags = pci_resource_flags(dev,i);
+
+		// We are interested only in bus addresses, not dma or other stuff
+		if (0 == (flags & (IORESOURCE_IO | IORESOURCE_MEM))) continue;
+		if (start == 0 || ~start == 0 || end == 0 || ~end == 0)
+			 continue;
+		pci_addr_cache_insert (dev, start, end, flags);
+	}
+}

-	ioaddr = (addr > isa_io_base) ? addr - isa_io_base : 0;
+/**
+ * pci_addr_cache_insert_device - Add a device to the address cache
+ * @dev: PCI device whose I/O addresses we are interested in.
+ *
+ * In order to support the fast lookup of devices based on addresses,
+ * we maintain a cache of devices that can be quickly searched.
+ * This routine adds a device to that cache.
+ */
+void
+pci_addr_cache_insert_device (struct pci_dev *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	__pci_addr_cache_insert_device (dev);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+}

-	while ((dev = pci_find_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
-		if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
+static inline void
+__pci_addr_cache_remove_device (struct pci_dev *dev)
+{
+	struct rb_node *n;
+
+restart:
+	n = rb_first (&pci_io_addr_cache_root.rb_root);
+	while (n) {
+		struct pci_io_addr_range *piar;
+		piar = rb_entry (n, struct pci_io_addr_range, rb_node);
+
+		if (piar->pcidev == dev)
+		{
+			rb_erase (n, &pci_io_addr_cache_root.rb_root);
+			kfree (piar);
+			goto restart;
+		}
+		n = rb_next (n);
+	}
+	pci_dev_put (dev);
+}
+
+/**
+ * pci_addr_cache_remove_device - remove pci device from addr cache
+ * @dev: device to remove
+ *
+ * Remove a device from the addr-cache tree.
+ * This is potentially expensive, since it will walk
+ * the tree multiple times (once per resource).
+ * But so what; device removal doesn't need to be that fast.
+ */
+void
+pci_addr_cache_remove_device (struct pci_dev *dev)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
+	__pci_addr_cache_remove_device (dev);
+	spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
+}
+
+/**
+ * pci_addr_cache_build - Build a cache of I/O addresses
+ *
+ * Build a cache of pci i/o addresses.  This cache will be used to
+ * find the pci device that corresponds to a given address.
+ * This routine scans all pci busses to build the cache.
+ * Must be run late in boot process, after the pci controllers
+ * have been scanned for devices (after all device resources are known).
+ */
+static __init void
+pci_addr_cache_build (void)
+{
+	struct pci_dev *dev = NULL;
+
+	spin_lock_init (&pci_io_addr_cache_root.piar_lock);
+
+	while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
+		// Ignore PCI bridges ( XXX why ??)
+		if ((dev->class >> 16) == PCI_BASE_CLASS_BRIDGE) {
+			pci_dev_put (dev);
 			continue;
-
-		for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
-			unsigned long start = pci_resource_start(dev,i);
-			unsigned long end = pci_resource_end(dev,i);
-			unsigned int flags = pci_resource_flags(dev,i);
-			if (start == 0 || ~start == 0 ||
-			    end == 0 || ~end == 0)
-				continue;
-			if ((flags & IORESOURCE_IO) &&
-			    (ioaddr >= start && ioaddr <= end))
-				return dev;
-			else if ((flags & IORESOURCE_MEM) &&
-				 (addr >= start && addr <= end))
-				return dev;
 		}
+		pci_addr_cache_insert_device (dev);
 	}
-	return NULL;
+
+	// Verify tree built up above, echo back the list of addrs.
+	pci_addr_cache_print (&pci_io_addr_cache_root);
 }

 void
@@ -343,6 +567,8 @@

 	printk("PCI: Probing PCI hardware done\n");
 	//ppc64_boot_msg(0x41, "PCI Done");
+
+	pci_addr_cache_build ();

 	return 0;
 }
===== arch/ppc64/kernel/pci.h 1.10 vs edited =====
--- 1.10/arch/ppc64/kernel/pci.h	Fri Sep 12 06:01:39 2003
+++ edited/arch/ppc64/kernel/pci.h	Tue Feb  3 16:35:50 2004
@@ -37,11 +37,15 @@
 void *traverse_pci_devices(struct device_node *start, traverse_func pre, traverse_func post, void *data);
 void *traverse_all_pci_devices(traverse_func pre);

-struct pci_dev *pci_find_dev_by_addr(unsigned long addr);
 void pci_devs_phb_init(void);
 void pci_fix_bus_sysdata(void);
 struct device_node *fetch_dev_dn(struct pci_dev *dev);

 #define PCI_GET_PHB_PTR(dev)    (((struct device_node *)(dev)->sysdata)->phb)
+
+/* PCI address cache management routines */
+struct pci_dev *pci_get_device_by_addr(unsigned long addr);
+void pci_addr_cache_insert_device (struct pci_dev *dev);
+void pci_addr_cache_remove_device (struct pci_dev *dev);

 #endif /* __PPC_KERNEL_PCI_H__ */
===== drivers/pci/hotplug/rpaphp_core.c 1.2 vs edited =====
--- 1.2/drivers/pci/hotplug/rpaphp_core.c	Tue Dec  9 11:03:38 2003
+++ edited/drivers/pci/hotplug/rpaphp_core.c	Tue Feb  3 16:35:51 2004
@@ -30,6 +30,7 @@
 #include <linux/smp.h>
 #include <linux/smp_lock.h>
 #include <linux/init.h>
+#include <asm/eeh.h>       /* for eeh_add_device() */
 #include <asm/rtas.h>		/* rtas_call */
 #include <asm/pci-bridge.h>	/* for pci_controller */
 #include "../pci.h"		/* for pci_add_new_bus*/
@@ -512,6 +513,7 @@
 		}

 		dev = rpaphp_find_pci_dev(slot->dn->child);
+		eeh_add_device(dev);
 	}
 	else {
 		/* slot is not enabled */
@@ -540,12 +542,12 @@
 		goto exit;
 	}

+	/* remove the device from the pci core */
+	eeh_remove_device(slot->dev);
+	pci_remove_bus_device(slot->dev);

-        /* remove the device from the pci core */
-        pci_remove_bus_device(slot->dev);
-
-        pci_dev_put(slot->dev);
-        slot->state = NOT_CONFIGURED;
+	pci_dev_put(slot->dev);
+	slot->state = NOT_CONFIGURED;

 	dbg("%s: adapter in slot[%s] unconfigured.\n", __FUNCTION__, slot->name);

===== include/asm-ppc64/eeh.h 1.6 vs edited =====
--- 1.6/include/asm-ppc64/eeh.h	Fri Sep 12 06:06:51 2003
+++ edited/include/asm-ppc64/eeh.h	Tue Feb  3 16:35:51 2004
@@ -45,22 +45,37 @@
 /* This is for profiling only */
 extern unsigned long eeh_total_mmio_ffs;

-void eeh_init(void);
-int eeh_get_state(unsigned long ea);
-unsigned long eeh_check_failure(void *token, unsigned long val);
+unsigned long eeh_check_failure(void *token, unsigned long val, int who);
 void *eeh_ioremap(unsigned long addr, void *vaddr);

+/**
+ * eeh_add_device - perform EEH initialization for the indicated pci device
+ * @dev: pci device for which to set up EEH
+ *
+ * This routine can be used to perform EEH initialization for PCI
+ * devices that were added after system boot (e.g. hotplug, dlpar).
+ * Whether this actually enables EEH or not for this device depends
+ * on the type of the device, on earlier boot command-line
+ * arguments & etc.
+ */
+void   eeh_add_device(struct pci_dev *);
+
+/**
+ * eeh_remove_device - undo EEH setup for the indicated pci device
+ * @dev: pci device to be removed
+ *
+ * This routine should be called when a device is removed from a running
+ * system (e.g. by hotplug or dlpar).
+ */
+void   eeh_remove_device(struct pci_dev *);
+
+
 #define EEH_DISABLE		0
 #define EEH_ENABLE		1
 #define EEH_RELEASE_LOADSTORE	2
 #define EEH_RELEASE_DMA		3
 int eeh_set_option(struct pci_dev *dev, int options);

-/* Given a PCI device check if eeh should be configured or not.
- * This may look at firmware properties and/or kernel cmdline options.
- */
-int is_eeh_configured(struct pci_dev *dev);
-
 /* Translate a (possible) eeh token to a physical addr.
  * If "token" is not an eeh token it is simply returned under
  * the assumption that it is already a physical addr.
@@ -78,11 +93,16 @@
  * If this macro yields TRUE, the caller relays to eeh_check_failure()
  * which does further tests out of line.
  */
-/* #define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0) */
-/* #define EEH_POSSIBLE_ERROR(addr, vaddr, val) ((vaddr) != (addr) && EEH_POSSIBLE_IO_ERROR(val) */
 /* This version is rearranged to collect some profiling data */
-#define EEH_POSSIBLE_IO_ERROR(val) (~(val) == 0 && ++eeh_total_mmio_ffs)
-#define EEH_POSSIBLE_ERROR(addr, vaddr, val) (EEH_POSSIBLE_IO_ERROR(val) && (vaddr) != (addr))
+#define EEH_POSSIBLE_IO_ERROR(val, type)				\
+		((val) == (type)~0 && ++eeh_total_mmio_ffs)
+
+/* The vaddr will equal the addr if EEH checking is disabled for
+ * this device.  This is because eeh_ioremap() will not have
+ * remapped to 0xA0, and thus both vaddr and addr will be 0xE0...
+ */
+#define EEH_POSSIBLE_ERROR(addr, vaddr, val, type)			\
+		(EEH_POSSIBLE_IO_ERROR(val, type) && (vaddr) != (addr))

 /*
  * MMIO read/write operations with EEH support.
@@ -101,8 +121,8 @@
 static inline u8 eeh_readb(void *addr) {
 	volatile u8 *vaddr = (volatile u8 *)IO_TOKEN_TO_ADDR(addr);
 	u8 val = in_8(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u8))
+		return eeh_check_failure(addr, val, 8);
 	return val;
 }
 static inline void eeh_writeb(u8 val, void *addr) {
@@ -112,25 +132,47 @@
 static inline u16 eeh_readw(void *addr) {
 	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
 	u16 val = in_le16(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16))
+		return eeh_check_failure(addr, val, 16);
 	return val;
 }
 static inline void eeh_writew(u16 val, void *addr) {
 	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
 	out_le16(vaddr, val);
 }
+static inline u16 eeh_raw_readw(void *addr) {
+	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
+	u16 val = in_be16(vaddr);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u16))
+		return eeh_check_failure(addr, val, 17);
+	return val;
+}
+static inline void eeh_raw_writew(u16 val, void *addr) {
+	volatile u16 *vaddr = (volatile u16 *)IO_TOKEN_TO_ADDR(addr);
+	out_be16(vaddr, val);
+}
 static inline u32 eeh_readl(void *addr) {
 	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
 	u32 val = in_le32(vaddr);
-	if (EEH_POSSIBLE_ERROR(addr, vaddr, val))
-		return eeh_check_failure(addr, val);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32))
+		return eeh_check_failure(addr, val, 32);
 	return val;
 }
 static inline void eeh_writel(u32 val, void *addr) {
 	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
 	out_le32(vaddr, val);
 }
+static inline u32 eeh_raw_readl(void *addr) {
+	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
+	u32 val = in_be32(vaddr);
+	if (EEH_POSSIBLE_ERROR(addr, vaddr, val, u32))
+		return eeh_check_failure(addr, val, 33);
+	return val;
+}
+static inline void eeh_raw_writel(u32 val, void *addr) {
+	volatile u32 *vaddr = (volatile u32 *)IO_TOKEN_TO_ADDR(addr);
+	out_be32(vaddr, val);
+}

 static inline void eeh_memset_io(void *addr, int c, unsigned long n) {
 	void *vaddr = (void *)IO_TOKEN_TO_ADDR(addr);
@@ -139,8 +181,14 @@
 static inline void eeh_memcpy_fromio(void *dest, void *src, unsigned long n) {
 	void *vsrc = (void *)IO_TOKEN_TO_ADDR(src);
 	memcpy(dest, vsrc, n);
-	/* look for ffff's here at dest[n] */
+	/* Look for ffff's here at dest[n].  Assume that at least 4 bytes
+	 * were copied. Check all four bytes.
+	 */
+	if ((n>=4) && (EEH_POSSIBLE_ERROR(src, vsrc, (*(u32 *)((u8 *)dest+n-4)), u32))) {
+		eeh_check_failure(src, (*(u32 *)((u8 *)dest+n-4)), 88);
+	}
 }
+
 static inline void eeh_memcpy_toio(void *dest, void *src, unsigned long n) {
 	void *vdest = (void *)IO_TOKEN_TO_ADDR(dest);
 	memcpy(vdest, src, n);
@@ -158,8 +206,8 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_8((u8 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u8))
+		return eeh_check_failure((void*)(port), val, -8);
 	return val;
 }

@@ -173,8 +221,8 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_le16((u16 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u16))
+		return eeh_check_failure((void*)(port), val, -16);
 	return val;
 }

@@ -188,14 +236,33 @@
 	if (_IO_IS_ISA(port) && !_IO_HAS_ISA_BUS)
 		return ~0;
 	val = in_le32((u32 *)(port+pci_io_base));
-	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val))
-		return eeh_check_failure((void*)(port+pci_io_base), val);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR(val, u32))
+		return eeh_check_failure((void*)(port), val, -32);
 	return val;
 }

 static inline void eeh_outl(u32 val, unsigned long port) {
 	if (!_IO_IS_ISA(port) || _IO_HAS_ISA_BUS)
 		return out_le32((u32 *)(port+pci_io_base), val);
+}
+
+/* in-string eeh macros */
+static inline void eeh_insb(unsigned long port, void * buf, int ns) {
+	_insb((u8 *)(port+pci_io_base), buf, ns);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u8*)buf)+ns-1)), u8))
+		eeh_check_failure((void*)(port), *(u8*)buf, -9);
+}
+
+static inline void eeh_insw_ns(unsigned long port, void * buf, int ns) {
+	_insw_ns((u16 *)(port+pci_io_base), buf, ns);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u16*)buf)+ns-1)), u16))
+		eeh_check_failure((void*)(port), *(u16*)buf, -17);
+}
+
+static inline void eeh_insl_ns(unsigned long port, void * buf, int nl) {
+	_insl_ns((u32 *)(port+pci_io_base), buf, nl);
+	if (!_IO_IS_ISA(port) && EEH_POSSIBLE_IO_ERROR((*(((u32*)buf)+nl-1)), u32))
+		eeh_check_failure((void*)(port), *(u32*)buf, -33);
 }

 #endif /* _EEH_H */
===== include/asm-ppc64/io.h 1.11 vs edited =====
--- 1.11/include/asm-ppc64/io.h	Mon Jan 19 20:08:22 2004
+++ edited/include/asm-ppc64/io.h	Tue Feb  3 16:35:52 2004
@@ -49,6 +49,13 @@
 #define outb(data,addr)		writeb(data,((unsigned long)(addr)))
 #define outw(data,addr)		writew(data,((unsigned long)(addr)))
 #define outl(data,addr)		writel(data,((unsigned long)(addr)))
+/*
+ * The *_ns versions below don't do byte-swapping.
+ * Neither do the standard versions now, these are just here
+ * for older code.
+ */
+#define insw_ns(port, buf, ns)	_insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
+#define insl_ns(port, buf, nl)	_insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
 #else
 #define readb(addr)		eeh_readb((void*)(addr))
 #define readw(addr)		eeh_readw((void*)(addr))
@@ -71,12 +78,16 @@
  * They are only used in practice for transferring buffers which
  * are arrays of bytes, and byte-swapping is not appropriate in
  * that case.  - paulus */
-#define insb(port, buf, ns)	_insb((u8 *)((port)+pci_io_base), (buf), (ns))
-#define outsb(port, buf, ns)	_outsb((u8 *)((port)+pci_io_base), (buf), (ns))
-#define insw(port, buf, ns)	_insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define outsw(port, buf, ns)	_outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define insl(port, buf, nl)	_insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
-#define outsl(port, buf, nl)	_outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
+#define insb(port, buf, ns)	eeh_insb((port), (buf), (ns))
+#define insw(port, buf, ns)	eeh_insw_ns((port), (buf), (ns))
+#define insl(port, buf, nl)	eeh_insl_ns((port), (buf), (nl))
+#define insw_ns(port, buf, ns)	eeh_insw_ns((port), (buf), (ns))
+#define insl_ns(port, buf, nl)	eeh_insl_ns((port), (buf), (nl))
+
+#define outsb(port, buf, ns)  _outsb((u8 *)((port)+pci_io_base), (buf), (ns))
+#define outsw(port, buf, ns)  _outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
+#define outsl(port, buf, nl)  _outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
+
 #endif

 extern void _insb(volatile u8 *port, void *buf, int ns);
@@ -106,9 +117,7 @@
  * Neither do the standard versions now, these are just here
  * for older code.
  */
-#define insw_ns(port, buf, ns)	_insw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
 #define outsw_ns(port, buf, ns)	_outsw_ns((u16 *)((port)+pci_io_base), (buf), (ns))
-#define insl_ns(port, buf, nl)	_insl_ns((u32 *)((port)+pci_io_base), (buf), (nl))
 #define outsl_ns(port, buf, nl)	_outsl_ns((u32 *)((port)+pci_io_base), (buf), (nl))


@@ -177,6 +186,9 @@

 /*
  * 8, 16 and 32 bit, big and little endian I/O operations, with barrier.
+ * These routines do not perform EEH-related I/O address translation,
+ * and should not be used directly by device drivers.  Use inb/readb
+ * instead.
  */
 static inline int in_8(volatile unsigned char *addr)
 {
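
A note on the type argument added to EEH_POSSIBLE_IO_ERROR above. As far
as I can tell, it is needed because C promotes a u8 or u16 operand to int
before applying ~, so the old test ~(val) == 0 can never be true for those
widths. The following is a minimal user-space sketch, not kernel code: the
macro names OLD_ALL_ONES and NEW_ALL_ONES and the helper functions are made
up for illustration, and uint8_t/uint16_t/uint32_t stand in for the
kernel's u8/u16/u32.

```c
#include <stdint.h>

/* Old-style test. For 8- and 16-bit values, val is promoted to int
 * first, so ~0xff becomes 0xffffff00 (nonzero) and the test never fires. */
#define OLD_ALL_ONES(val)        (~(val) == 0)

/* New-style test, as in the patch: compare against all-ones of the
 * access width. */
#define NEW_ALL_ONES(val, type)  ((val) == (type)~0)

/* Hypothetical wrappers so each width can be exercised separately. */
int old_check_u8(uint8_t v)   { return OLD_ALL_ONES(v); }
int new_check_u8(uint8_t v)   { return NEW_ALL_ONES(v, uint8_t); }
int old_check_u16(uint16_t v) { return OLD_ALL_ONES(v); }
int new_check_u16(uint16_t v) { return NEW_ALL_ONES(v, uint16_t); }
int old_check_u32(uint32_t v) { return OLD_ALL_ONES(v); }
int new_check_u32(uint32_t v) { return NEW_ALL_ONES(v, uint32_t); }
```

Only the 32-bit case behaves the same under both forms, since a 32-bit
unsigned value is not promoted. Some compilers warn about the old form for
the narrower types, which is a useful hint in itself.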

From linas at austin.ibm.com  Thu Feb  5 07:37:48 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 14:37:48 -0600
Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <20040204142853.A28220@forte.austin.ibm.com>; from linas@austin.ibm.com on Wed, Feb 04, 2004 at 02:28:53PM -0600
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com>
Message-ID: <20040204143748.M27780@forte.austin.ibm.com>

On Wed, Feb 04, 2004 at 02:28:53PM -0600, linas at austin.ibm.com wrote:
> On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
> > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
> >
> > > Patch for multiple EEH-related bugs.  Please review this patch,
> > > & if appropriate, please apply.  It should apply cleanly to
> > > the current ameslab tree (Feb 03 2004  2.6.2-rc3).
> >
> > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
> > made the diff against a current ameslab tree?
>
> Right tree, bad email attachment.
>
> I don't know how it happened, but what I sent out had some trailing
> whitespace whacked. The attached patch should not have this problem.

Arghhh! Not my day.

The ppc64 mailing list manager or one of the mail gateways seems to
be removing whitespace.  When I send the patch to myself, it's OK, but
when I send it on the list, it gets mangled ...  see attached.

--linas

-------------- next part --------------


          the last 9 lines should have 1-9 whitespaces followed by newline

From haveblue at us.ibm.com  Thu Feb  5 08:31:49 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: 04 Feb 2004 13:31:49 -0800
Subject: lmb_phys_mem_size() vs. lmb_end_of_DRAM()
Message-ID: <1075930308.27981.863.camel@nighthawk>


It looks to me like lmb_phys_mem_size() will return the largest valid
physical address on the system while lmb_end_of_DRAM() is used just for
usable RAM.

lmb_phys_mem_size() determines how big the kernel's ZONE_DMA is and
lmb_end_of_DRAM() is used to figure out everything else like bootmem
sizes and the zone start/end_pfn values.

Can we just abolish the use of lmb_end_of_DRAM(), and bootmem reserve
the I/O areas?  We'll probably have some unused struct pages, but that
should be just about the only side-effect, and it keeps from having the
confusing distinction.  That's what we do on x86 at least.

--dave


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Thu Feb  5 10:07:08 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 05 Feb 2004 10:07:08 +1100
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <402152C8.7080907@redhat.com>
References: <Pine.A41.4.44.0402032206230.85750-200000@forte.austin.ibm.com>
	 <402152C8.7080907@redhat.com>
Message-ID: <1075936028.4019.27.camel@gaston>


Hi Julie !

> OK, I managed to convince myself that using an eieio is ok here. I was
> concerned that other processors might not have seen any of the stores
> that preceded the eieio instruction, since eieio is normally only used
> when dealing with device memory. lwsync ensures other processors have
> seen any stores to system memory at the point the lock is released. But
> the only stores that matter here are the hpte (and it is sync'd) and the
> pte and it has the lock bit. So when another processor sees the pte
> contents without the lock bit set it will, by default, be seeing the
> updated value as well.

eieio enforces store ordering on cacheable accesses too, which is all
we should need at this point.
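
The argument in the quoted paragraph (new contents and cleared lock bit
arrive in the same store, so anyone who sees the bit clear also sees the
new value) can be sketched in portable C11 atomics. This is only an
analogy under stated assumptions, not the hash_page code: PTE_BUSY here is
a made-up bit value, pte_lock()/pte_unlock() are hypothetical names, and
memory_order_release stands in for whatever barrier the real code issues
before the releasing store.

```c
#include <stdatomic.h>
#include <stdint.h>

#define PTE_BUSY 0x1ULL	/* hypothetical lock bit, not the kernel's value */

/* Spin until we are the one who sets the busy bit; return the PTE
 * contents as they were just before we took the lock. */
uint64_t pte_lock(_Atomic uint64_t *pte)
{
	for (;;) {
		uint64_t old = atomic_fetch_or_explicit(pte, PTE_BUSY,
							memory_order_acquire);
		if (!(old & PTE_BUSY))
			return old;
	}
}

/* Publish the new contents and drop the lock with a single store.
 * Stores made before this call are ordered ahead of it. */
void pte_unlock(_Atomic uint64_t *pte, uint64_t newval)
{
	atomic_store_explicit(pte, newval & ~PTE_BUSY, memory_order_release);
}
```

Because the unlock is one store to the same word that carries the data,
there is no window in which the lock bit is clear but the old value is
still visible.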

> So it is true that an interrupt handler can cause a page fault? Can you
> provide me with an example?

Not really a "page fault" in the linux sense, but rather a hash miss,
yes. Typically, a driver accessing ioremap'ed IO space or a module
running vmalloc'ed memory can trigger a hash miss.

With my 2.6 implementation, there shouldn't be a problem as only
hash_page will set PAGE_BUSY and this is done with interrupts off,
so it can't be re-entered on the same CPU.

> Let me see if I understand this. When someone wants to free a page
> pointed to by an entry in a 3rd level page table, they clear out the pte
> in the page table using pte_clear(). Then they call pte_free with the
> address of the page they are freeing up (not really a page table entry
> but the actual page address). This page address is added to the batch
> list. Later, the idle loop or process termination code calls
> do_check_pgt_cache which will free all the pages in the batch list.

Yes. What we need is to make sure no CPU was currently walking the
page tables when the pte_clear occurred, that is, that no CPU is
actually still using the PTEs in the page we are about to get rid
of, which basically means we must make sure that no CPU that was in
hash_page at the time of the pte_clear is still in that function.

Ben.



From linas at austin.ibm.com  Thu Feb  5 11:37:22 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 4 Feb 2004 18:37:22 -0600
Subject: Root Drive Mirroring and LVM.
In-Reply-To: <16417.31540.78193.699721@notabene.cse.unsw.edu.au>; from neilb@cse.unsw.edu.au on Thu, Feb 05, 2004 at 10:07:32AM +1100
References: <20040204122317.K27780@forte.austin.ibm.com> <Pine.LNX.4.44.0402041726350.20232-100000@coffee.psychology.mcmaster.ca> <20040204164946.P27780@forte.austin.ibm.com> <16417.31540.78193.699721@notabene.cse.unsw.edu.au>
Message-ID: <20040204183722.R27780@forte.austin.ibm.com>


On Thu, Feb 05, 2004 at 10:07:32AM +1100, Neil Brown wrote:
> On Wednesday February 4, linas at austin.ibm.com wrote:
> > On Wed, Feb 04, 2004 at 05:32:01PM -0500, Mark Hahn wrote:
> > > > So what's wrong with autodetect, again?
> > >
> > > my guess is: in-kernel autodetect is the problem.
> > > out-of-kernel detection can be much smarter,
> > > and can be more easily tested/replaced.
> >
> > Hm, yes, that makes sense.
>
>
> Good :-)
>
> Just to flesh out my thoughts a bit more:
>
>  If the root filesystem is on an MD array, then I see the process of
>  assembling that md array as quite similar to the process of finding
>  the device for the root filesystem.
>
>  We don't expect the kernel, or anyone else, to scan all devices
>  looking for something that looks like a root filesystem, and loading
>  that.  Rather we tell the kernel or boot loader exactly where to find
>  the root filesystem.  And if the root filesystem moves, we get to
>  explicitly tell the boot loader where it is (root=/dev/hdc1 or
>  whatever).

Well, this is actually one of my more bitter complaints.  I'd much
much rather have some symbolic disk label, and have the kernel scan
*all* devices for that.  I can defend this for both low-end and high-end
machines.

On the low end i.e. PC's, PC servers: I've dealt (multiple times) with
disk and/or controller failure.  If you have more than 3 disks, it can
get very confusing as to which cable is which.  Add to that that some
BIOS'es allow you to enable/disable controllers, which cause devices
to be renumbered. E.g. I have a mobo with two 'plain vanilla' ide
connectors and two '100MHZ' 80-wire ide connectors.  Which one is
/dev/hda and which is not depends on the BIOS settings.  Worse:
if I plug in a 3rd party ide controller, then the numbering becomes
insane:

/dev/hda: mobo 33MHz controller
/dev/hdc: mobo 33MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller
/dev/hdi: mobo 100MHz controller
/dev/hdk: mobo 100MHz controller

But if I disable the 33MHz mobo connector in bios:

/dev/hda: mobo 100MHz controller
/dev/hdc: mobo 100MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller

Since some types of disk failures will lock up the kernel during boot,
you spend a lot of time plugging and unplugging disks and rebooting.
It gets confusing after a while.

To add to the confusion: If you RAID-1 mirror the root partition,
but then mount it as /dev/hdk (and not /dev/md0) during a rescue
operation (because rescue disks often don't have RAID on them),
and then edit /etc/fstab to try to match the new config...
you end up with two root partitions (each of the mirror pair) with
different /etc/fstab's, on different cables/controllers, ...
after fiddling with that, it's a nightmare to figure out which
is which, what's mounted how and where.

I also had a similar experience on a machine with 30-odd scsi disks.
We installed a new kernel on /dev/sdp, rebooted ... and what confusion,
since /dev/sdp was now /dev/sdg.  Worse, since the bootloader was
unaware of this difference ... So then we made a change, and rebooted...
... again, much confusion till we worked it out.

As a cure-all, I started fiddling with ext2 disk labels.  However,
an ext2 disk label, when written on a RAID-1 device, will identify
*three* things: /dev/md0 (for example), and both mirror pairs:
/dev/hda1 and /dev/hde1 all have the same label.
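
To make the ambiguity concrete, here is a minimal sketch of a label
lookup over a device table. Everything in it is assumed for illustration:
struct blkdev, find_by_label(), the "rootfs" and "scratch" labels and the
extra /dev/hdc1 entry are hypothetical, not any real tool's API or my
actual setup.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: a device node and the filesystem label found on it. */
struct blkdev {
	const char *path;
	const char *label;
};

/* Count the devices carrying the given label. If first is non-NULL,
 * it receives the first match (or NULL when there is none). */
size_t find_by_label(const struct blkdev *devs, size_t n,
		     const char *label, const struct blkdev **first)
{
	size_t hits = 0;

	if (first)
		*first = NULL;
	for (size_t i = 0; i < n; i++) {
		if (devs[i].label && strcmp(devs[i].label, label) == 0) {
			if (hits == 0 && first)
				*first = &devs[i];
			hits++;
		}
	}
	return hits;
}

/* A RAID-1 root: the array and both halves of the mirror share one label. */
const struct blkdev example_devs[] = {
	{ "/dev/md0",  "rootfs"  },
	{ "/dev/hda1", "rootfs"  },
	{ "/dev/hde1", "rootfs"  },
	{ "/dev/hdc1", "scratch" },
};
```

A lookup for "rootfs" in this table returns three hits, so a caller has to
treat any count other than one as "cannot decide" rather than taking the
first match.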

reiserfs doesn't have ext2 labels....

I wanted to fool with other partition table schemes (non-DOS partition
tables) until I realized most PC rescue disks wouldn't be able to get
you out of that jam.

So then I thought about using LVM to label my disks (i.e. use LVM
for only one reason, and no other reason: to be able to assign logical
names, and not physical names).  But think about it... it's scary,
LVM is not widespread, somewhat buggy, and standard debian/redhat/suse
rescue diskettes won't have it ... etc.


>  Assembling the root device should be handled the same way.  We tell
>  the boot loader/kernel where to expect it, but can over-ride that if
>  we need to:
>    md=0,/dev/hdc1,/dev/hde4
>
>  All other md arrays can, and so should, be assembled by code running
>  out of the root filesystem.

*If* you mounted the "right" root filesystem.  If you have multiple
copies around, and edit fstab on one but not the others, and then
recable and reboot ...

>  This could be some program that
>  assembles anything it finds after scanning all devices, or something
>  a bit more focused, but it should be controllable by the sysadmin.

At least once I managed to mount /usr as /var because of the confusion,
and upon reboot the init.d scripts spewed /var/lock crud into my poor
/usr filesystem ...

I want to know that /dev/mdwhatever is /var *before* I mount it,
not after.

>  It is true that in-kernel auto-detect can be controlled by fiddling
>  with partition types, but the problem is that it runs *before* the
>  root filesystem is mounted and so could conceivably confuse the
>  assembly of the root device (if e.g. you plugged in some other device
>  that also claimed to be part of /dev/md0, and it got scanned before
>  your real root device).

Well, yes, that too, I suppose.  But as you see, explicitly
specifying it at the boot prompt works only if you type in the
right thing ...

--linas


cc.ing ppc64 because although not an architecture issue, it is a
sysadmin issue on enterprise-class machines.

cc.ing hot-plug because this is a cold-plug issue.

Note you get similar crud for multiple ethernet cards ...


From johnrose at austin.ibm.com  Thu Feb  5 11:47:39 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Wed, 04 Feb 2004 18:47:39 -0600
Subject: [PATCH] RPA PCI Hotplug - Remove Adapter Config at DLPAR add time
Message-ID: <1075942059.3026.96.camel@verve>


Patch below fixes:
https://bugzilla.linux.ibm.com/show_bug.cgi?id=6136

Details in bug if you're interested, comments welcome.

Thanks-
John

diff -Nru a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c
--- a/drivers/pci/hotplug/rpaphp_core.c	Wed Feb  4 18:42:00 2004
+++ b/drivers/pci/hotplug/rpaphp_core.c	Wed Feb  4 18:42:00 2004
@@ -814,29 +814,16 @@
 					}

 					slot->dev = rpaphp_find_adapter_pdev(slot);
-
-					if (!slot->dev && slot_name) {
-						 /* adapter being added doesn't have pci_dev yet */
-						slot->dev = rpaphp_config_adapter(slot);
-						if (!slot->dev) {
-							err("%s: add new adapter device for slot[%s] failed\n",
-							__FUNCTION__, slot->name);
-							kfree(slot->hotplug_slot->info);
-							kfree(slot->hotplug_slot->name);
-							kfree(slot->hotplug_slot);
-							kfree(slot);
-							pci_dev_put(slot->bridge);
-							continue;
-
-						}
-					}
-
 					if(slot->dev) {
 						slot->state = CONFIGURED;
 						pci_dev_get(slot->dev);
 					}
-					else
+					else {
+						/* DLPAR add as opposed to
+						 * boot time */
 						slot->state = NOT_CONFIGURED;
+					}
+
 				}
 				dbg("%s registering slot:path[%s] index[%x], name[%s] pdomain[%x] type[%d]\n",
 					__FUNCTION__, dn->full_name, slot->index, slot->name,



From anton at samba.org  Thu Feb  5 12:25:11 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 5 Feb 2004 12:25:11 +1100
Subject: lmb_phys_mem_size() vs. lmb_end_of_DRAM()
In-Reply-To: <1075930308.27981.863.camel@nighthawk>
References: <1075930308.27981.863.camel@nighthawk>
Message-ID: <20040205012511.GM19011@krispykreme>


> It looks to me like lmb_phys_mem_size() will return the largest valid
> physical address on the system while lmb_end_of_DRAM() is used just for
> usable RAM.
>
> lmb_phys_mem_size() determines how big the kernel's ZONE_DMA is and
> lmb_end_of_DRAM() is used to figure out everything else like bootmem
> sizes and the zone start/end_pfn values.
>
> Can we just abolish the use of lmb_end_of_DRAM(), and bootmem reserve
> the I/O areas?  We'll probably have some unused struct pages, but that
> should be just about the only side-effect, and it keeps from having the
> confusing distinction.  That's what we do on x86 at least.

On POWER4 and newer the IO hole sits above memory, so
__pa(lmb_end_of_DRAM()) == lmb_phys_mem_size(). The only place we care
about this distinction is POWER3.

It is similar to x86: 270 has 3G MEM - 1G IO - 15G MEM. The nighthawk
is worse, it has 1G MEM - 3G IO - 63G MEM.
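
The arithmetic behind those two layouts, as a small user-space sketch. It
assumes RAM starts at physical address 0 and that the IO hole sits
directly between the two RAM chunks; struct mem_region, total_ram() and
top_of_ram() are made-up names for illustration and are not the lmb code.

```c
#include <stddef.h>
#include <stdint.h>

#define GB (1ULL << 30)

/* Hypothetical description of one contiguous chunk of RAM. */
struct mem_region {
	uint64_t base;
	uint64_t size;
};

/* Amount of RAM: the sum of all region sizes. */
uint64_t total_ram(const struct mem_region *r, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += r[i].size;
	return sum;
}

/* Highest physical address backed by RAM, plus one. */
uint64_t top_of_ram(const struct mem_region *r, size_t n)
{
	uint64_t top = 0;

	for (size_t i = 0; i < n; i++)
		if (r[i].base + r[i].size > top)
			top = r[i].base + r[i].size;
	return top;
}

/* 270: 3G MEM, 1G IO, 15G MEM.  nighthawk: 1G MEM, 3G IO, 63G MEM. */
const struct mem_region m270[]      = { { 0, 3 * GB }, { 4 * GB, 15 * GB } };
const struct mem_region nighthawk[] = { { 0, 1 * GB }, { 4 * GB, 63 * GB } };
```

For the 270 this gives 18G of RAM with the top of RAM at 19G; for the
nighthawk, 64G of RAM with the top at 67G. The difference in each case is
the size of the IO hole, which is the number a mem_map sized to the top of
RAM ends up covering without real memory behind it.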

x86 doesn't have struct pages backing IO regions, does it? On ppc64 we need
to create a mem_map array big enough to hit the top of RAM, so it seems to
me that we have to use lmb_end_of_DRAM().

So we currently do have struct pages for the IO region, and we should
probably reserve them.

sparc64 passes in the size of the hole to free_area_init_node, I guess
we should be doing this too.

Anton


From bugzilla at watkins-home.com  Thu Feb  5 12:31:01 2004
From: bugzilla at watkins-home.com (Guy)
Date: Wed, 4 Feb 2004 20:31:01 -0500
Subject: Root Drive Mirroring and LVM.
In-Reply-To: <20040204183722.R27780@forte.austin.ibm.com>
Message-ID: <200402050129.i151T2i30440@dns1.watkins-home.com>


I upgraded my firmware on a 2940U2W.  That changed the order my SCSI buses
were scanned.  This changed the boot order of my disks.  I had to disable
the bios on the 2940U2W so it would not attempt to boot from the disks on
that bus.

My MB has 2 SCSI buses and I have 2 SCSI cards.

So, anything that could have prevented this would be good.

-----Original Message-----
From: linas at austin.ibm.com
Sent: Wednesday, February 04, 2004 7:37 PM
Subject: Re: Root Drive Mirroring and LVM.

On Thu, Feb 05, 2004 at 10:07:32AM +1100, Neil Brown wrote:
> On Wednesday February 4, linas at austin.ibm.com wrote:
> > On Wed, Feb 04, 2004 at 05:32:01PM -0500, Mark Hahn wrote:
> > > > So what's wrong with autodetect, again?
> > >
> > > my guess is: in-kernel autodetect is the problem. out-of-kernel
> > > detection can be much smarter, and can be more easily
> > > tested/replaced.
> >
> > Hm, yes, that makes sense.
>
> Good :-)
>
> Just to flesh out my thoughts a bit more:
>
>  If the root filesystem is on an MD array, then I see the process of
>  assembling that md array as quite similar to the process of finding
>  the device for the root filesystem.
>
>  We don't expect the kernel, or anyone else, to scan all devices
>  looking for something that looks like a root filesystem, and loading
>  that. Rather we tell the kernel or boot loader exactly where to
>  find the root filesystem. And if the root filesystem moves, we get
>  to explicitly tell the boot loader where it is (root=/dev/hdc1 or
>  whatever).

Well, this is actually one of my more bitter complaints.  I'd much
much rather have some symbolic disk label, and have the kernel scan
*all* devices for that.  I can defend this for both low-end and high-end
machines.

On the low end i.e. PC's, PC servers: I've dealt (multiple times) with
disk and/or controller failure.  If you have more than 3 disks, it can
get very confusing as to which cable is which.  Add to that that some
BIOS'es allow you to enable/disable controllers, which cause devices
to be renumbered. E.g. I have a mobo with two 'plain vanilla' ide
connectors and two '100MHZ' 80-wire ide connectors.  Which one is
/dev/hda and which is not depends on the BIOS settings.  Worse:
if I plug in a 3rd party ide controller, then the numbering becomes
insane:

/dev/hda: mobo 33MHz controller
/dev/hdc: mobo 33MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller
/dev/hdi: mobo 100MHz controller
/dev/hdk: mobo 100MHz controller

But if I disable the 33MHz mobo connector in bios:

/dev/hda: mobo 100MHz controller
/dev/hdc: mobo 100MHz controller
/dev/hde: pci card controller
/dev/hdg: pci card controller

Since some types of disk failures will lock up the kernel during boot,
you spend a lot of time plugging and unplugging disks and rebooting.
It gets confusing after a while.

To add to the confusion: If you RAID-1 mirror the root partition,
but then mount it as /dev/hdk (and not /dev/md0) during a rescue
operation (because rescue disks often don't have RAID on them),
and then edit /etc/fstab to try to match the new config...
you end up with two root partitions (each of the mirror pair) with
different /etc/fstab's, on different cables/controllers, ...
after fiddling with that, it's a nightmare to figure out which
is which, what's mounted how and where.

I also had a similar experience on a machine with 30-odd scsi disks.
We installed a new kernel on /dev/sdp, rebooted ... and what confusion,
since /dev/sdp was now /dev/sdg.  Worse, since the bootloader was
unaware of this difference ... So then we made a change, and rebooted...
... again, much confusion till we worked it out.

As a cure-all, I started fiddling with ext2 disk labels.  However,
an ext2 disk label, when written on a RAID-1 device, will identify
*three* things: /dev/md0 (for example), and both halves of the mirror
pair: /dev/hda1 and /dev/hde1 all have the same label.
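
To make the ambiguity concrete, here is a minimal sketch of a label
lookup over a list of already-scanned devices.  It is not how any real
tool does it; struct labelled_dev and find_by_label() are hypothetical
names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: a block device and the filesystem label found on it. */
struct labelled_dev {
    const char *dev;    /* e.g. "/dev/md0" */
    const char *label;  /* e.g. "root" */
};

/*
 * Count the devices carrying `label` and store the first match in *first
 * (NULL if there is none).  A count above 1 means the label alone cannot
 * pick a device.
 */
size_t find_by_label(const struct labelled_dev *devs, size_t ndevs,
                     const char *label, const char **first)
{
    size_t matches = 0;

    assert(first != NULL);
    *first = NULL;
    for (size_t i = 0; i < ndevs; i++) {
        if (strcmp(devs[i].label, label) != 0)
            continue;
        if (matches == 0)
            *first = devs[i].dev;
        matches++;
    }
    return matches;
}
```

With /dev/md0, /dev/hda1 and /dev/hde1 all labelled "root", the lookup
reports three matches, so a label-based scheme needs some extra rule
(for instance, prefer the md device) before it can choose one.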

reiserfs doesn't have ext2 labels....

I wanted to fool with other partition table schemes (non-DOS partition
tables) until I realized most PC rescue disks wouldn't be able to get
you out of that jam.

So then I thought about using LVM to label my disks (i.e. use LVM
for only one reason, and no other reason: to be able to assign logical
names, and not physical names).  But think about it... it's scary,
LVM is not widespread, somewhat buggy, and standard debian/redhat/suse
rescue diskettes won't have it ... etc.

>  Assembling the root device should be handled the same way. We tell
>  the boot loader/kernel where to expect it, but can over-ride that if
>  we need to:
>    md=0,/dev/hdc1,/dev/hde4
>
>  All other md arrays can, and so should, be assembled by code running
>  out of the root filesystem.

*If* you mounted the "right" root filesystem.  If you have multiple
copies around, and edit fstab on one but not the others, and then
recable and reboot ...

>  This could be some program that assembles anything it finds after
>  scanning all devices, or something a bit more focused, but it should
>  be controllable by the sysadmin.

At least once I managed to mount /usr as /var because of the confusion,
and upon reboot the init.d scripts spewed /var/lock crud into my poor
/usr filesystem ...

I want to know that /dev/mdwhatever is /var *before* I mount it,
not after.

>  It is true that in-kernel auto-detect can be controlled by fiddling
>  with partition types, but the problem is that it runs *before* the
>  root filesystem is mounted and so could conceivably confuse the
>  assembly of the root device (if e.g. you plugged in some other device
>  that also claimed to be part of /dev/md0, and it got scanned before
>  your real root device).

Well, yes, that too, I suppose.  But as you see, explicitly
specifying it at the boot prompt works only if you type in the
right thing ...

--linas

cc.ing ppc64 because although not an architecture issue, it is a
sysadmin issue on enterprise-class machines.

cc.ing hot-plug because this is a cold-plug issue.

Note you get similar crud for multiple ethernet cards ...

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Feb  6 00:36:50 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 6 Feb 2004 00:36:50 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <20040204174036.GB4996@w-mikek2.beaverton.ibm.com>
References: <1075856864.1449.205.camel@nighthawk> <20040204174036.GB4996@w-mikek2.beaverton.ibm.com>
Message-ID: <20040205133650.GV19011@krispykreme>


Hi,

> On the subject of lmb's ... I've got a pSeries 615.  On this machine
> I only see one lmb of 4GB in size.  This is what is dug out of the
> prom by 'prom_initialize_lmb' before any coalescing.  I was somehow
> expecting more lmb's of a smaller size.  Of course, I don't know the
> hardware specifics of this machine or how many DIMMs of what size it
> contains.
>
> Dave, what does the lmb layout on your 'bigger' machine look like?

You only start caring about it when we talk partitioning and hot swap
memory. The reality is your 615's memory is most probably contiguous.

I seem to remember firmware was doing a 32bit backwards compatible
thing and splitting memory into at least two blocks, one of 4GB or less.
Not sure if they do that any more.

Anton



From anton at samba.org  Fri Feb  6 00:37:18 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 6 Feb 2004 00:37:18 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <20040204094059.GF19011@krispykreme>
References: <1075856864.1449.205.camel@nighthawk> <20040204041820.GF22694@krispykreme> <20040204094059.GF19011@krispykreme>
Message-ID: <20040205133718.GW19011@krispykreme>


> Boots on a p630. Will get some help testing it on iSeries tomorrow
> (hi Stephen :) and will push it if there are no complaints.

Boots on iSeries (thanks to Stephen for testing) and it's merged into
ameslab.

Anton



From anton at samba.org  Fri Feb  6 03:04:31 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 6 Feb 2004 03:04:31 +1100
Subject: EEH as module  [was Re: LPARCFG]
In-Reply-To: <20040204120252.J27780@forte.austin.ibm.com>
References: <20040201050227.GB22694@krispykreme> <20040204120252.J27780@forte.austin.ibm.com>
Message-ID: <20040205160431.GY19011@krispykreme>


> Since I'm planning to be whacking EEH in a big way, does anyone feel
> that it should be turned into a module?

Hmm, I hadn't thought about that. It's fairly core and could get messy
trying to modularise it.

I am looking forward to EEH being whacked around however :)

Anton



From anton at samba.org  Fri Feb  6 03:07:10 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 6 Feb 2004 03:07:10 +1100
Subject: LPARCFG
In-Reply-To: <401D591E.9030503@austin.ibm.com>
References: <20040201050227.GB22694@krispykreme> <401D591E.9030503@austin.ibm.com>
Message-ID: <20040205160710.GZ19011@krispykreme>


> Building as a module was broken when the code was checked in to ameslab
> 2.6; I suggested turning it off.  I think lparcfg uses unexported
> symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out.  Should any of
> those be exported?
>
> See this thread for the history:
> http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html

Ahh yeah, I have such a short memory.

Actually I enabled it as a module and built all modules and didn't get
any warnings. Either we have everything exported now, I'm not getting
undefined symbol warnings any more or else I'm going blind.

Anton



From kravetz at us.ibm.com  Fri Feb  6 04:15:22 2004
From: kravetz at us.ibm.com (Mike Kravetz)
Date: Thu, 5 Feb 2004 09:15:22 -0800
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <20040205133650.GV19011@krispykreme>
References: <1075856864.1449.205.camel@nighthawk> <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> <20040205133650.GV19011@krispykreme>
Message-ID: <20040205171522.GB4983@w-mikek2.beaverton.ibm.com>


On Fri, Feb 06, 2004 at 12:36:50AM +1100, Anton Blanchard wrote:
> >
> > Dave, what does the lmb layout on your 'bigger' machine look like?
>
> You only start caring about it when we talk partitioning and hot swap
> memory.

Exactly, and this is an area I am interested in pursuing.  Hence, the
interest in Dave's machine.  I haven't gone through enough pSeries
architecture documentation yet to know what to expect.  One would think
that a smaller lmb size would be better for hot swap memory.  At least
in the 'remove' side of the equation.

>         The reality is your 615's memory is most probably contiguous.

Yup.  I think I'll try to put together a 'hack' to make the memory
layout on my 615 look like one of the bigger machines with dynamic
LPAR capabilities.  In other words, multiple (smaller) possibly
non-contiguous lmb's.
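
Roughly what I have in mind, as a sketch only: carve the one contiguous
range into fixed-size pieces.  struct lmb_region and split_region() are
hypothetical names for this illustration, not the kernel's lmb code, and
the chunk size is whatever the caller picks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one logical memory block: base address and size. */
struct lmb_region {
    uint64_t base;
    uint64_t size;
};

/*
 * Split [base, base + size) into chunks of at most chunk_size bytes.
 * Writes up to max_out regions to out and returns how many were written.
 */
size_t split_region(uint64_t base, uint64_t size, uint64_t chunk_size,
                    struct lmb_region *out, size_t max_out)
{
    size_t n = 0;

    assert(chunk_size > 0);
    while (size > 0 && n < max_out) {
        uint64_t len = size < chunk_size ? size : chunk_size;

        out[n].base = base;
        out[n].size = len;
        n++;
        base += len;
        size -= len;
    }
    return n;
}
```

Splitting a single 4GB block with a 256MB chunk size (an arbitrary
choice here) gives 16 regions; making them non-contiguous would be a
matter of skipping some of them.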

--
Mike



From olof at austin.ibm.com  Fri Feb  6 04:46:51 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 05 Feb 2004 11:46:51 -0600
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <402152C8.7080907@redhat.com>
References: <Pine.A41.4.44.0402032206230.85750-200000@forte.austin.ibm.com> <402152C8.7080907@redhat.com>
Message-ID: <4022818B.90902@austin.ibm.com>


Julie DeWandel wrote:
> Hi Olof,
>
> Thank you for the explanations. In most cases, I agree but I still have
> one or two things I wanted to follow up on. I have added my comments to
> yours below.

See below.

> olof at austin.ibm.com wrote:

>> JSD: Better question for (1) is why are interrupts being disabled here?
>> JSD: Can this routine be called from interrupt context?
>>
>> Without disabling interrupts, there's a risk for deadlocks if the
>> processor gets interrupted and the interrupt handler causes a page fault
>> that needs to be resolved. Since the lock is held for writing, the handler
>> will wait forever when locking for reading. This is actually similar to
>> the original deadlock that this whole patch is meant to remove, but the
>> window is really small (just a few instructions) now. Likewise an
>> interrupt on a different processor is not a problem since forward
>> progress is still guaranteed on the processor holding it for writing so
>> the reader will eventually get the lock.
>>
> So it is true that an interrupt handler can cause a page fault? Can you
> provide me with an example?

Ben explained this pretty well yesterday. What we saw was mostly in
drivers loaded as modules when they happened to interrupt a processor
currently executing in vmalloc(). Stacks could look like:

c0000000c961f170  c00000000049bc30  __ex_table ?kernel? 0x3c30
c0000000c961f200  c00000000000b544  .do_hash_page_DSI ?kernel? 0x10
c0000000c961f4f0  c0000000c961f580  xfrm_policy_list_Rsmp_cdacf85e ?? 0xc8d6dd28
c0000000c961f580  d000000000055f64  .tux_data_ready ?tux? 0x78
c0000000c961f620  c000000000294680  .tcp_data_queue ?kernel? 0x7d0
c0000000c961f6e0  c000000000295b10  .tcp_rcv_established ?kernel? 0x2b8
c0000000c961f7a0  c0000000002a1288  .tcp_v4_do_rcv ?kernel? 0x1b0
c0000000c961f830  c0000000002a1a48  .tcp_v4_rcv ?kernel? 0x7ac
c0000000c961f8d0  c00000000027b034  .ip_local_deliver_finish ?kernel? 0x14c
c0000000c961f960  c00000000027aba4  .ip_local_deliver ?kernel? 0x60
c0000000c961f9e0  c00000000027b4cc  .ip_rcv_finish ?kernel? 0x34c
c0000000c961fa80  c00000000027ae04  .ip_rcv ?kernel? 0x20c
c0000000c961fb20  c00000000025a26c  .netif_receive_skb ?kernel? 0x240
c0000000c961fbc0  c00000000017bb7c  .e1000_clean_rx_irq ?kernel? 0x394
c0000000c961fcd0  c00000000017b378  .e1000_clean ?kernel? 0x7c
c0000000c961fd90  c00000000025a7a0  .net_rx_action ?kernel? 0x15c
c0000000c961fe50  c000000000066c40  .do_softirq ?kernel? 0x1b4
c0000000c961ff00  c000000000011fec  .do_IRQ ?kernel? 0x164
c0000000c961ff90  c00000000000ae20  HardwareInterrupt_entry ?kernel? 0x38
c00000055c3f6a80  c000000000179324  .e1000_xmit_frame ?kernel? 0x1e8
c00000055c3f6d70  c00000000000aaa8  DataAccessSLB_common ?kernel? 0x108
c00000055c3f6e30  c000000000090d2c  .__vmalloc ?kernel? 0x1a4
c00000055c3f6f00  c0000000000d284c  .alloc_fd_array ?kernel? 0x4c
c00000055c3f6f80  c0000000000d297c  .expand_fd_array ?kernel? 0x9c
c00000055c3f7030  c00000000005a6f0  .copy_files ?kernel? 0x214
c00000055c3f70f0  c00000000005adbc  .copy_process ?kernel? 0x440
c00000055c3f71c0  c00000000005b838  .do_fork ?kernel? 0x40
c00000055c3f7280  c000000000014100  .sys_clone ?kernel? 0x9c
c00000055c3f7320  c000000000029944  .sys32_clone ?kernel? 0x28
c00000055c3f7390  c00000000000fe48  ret_from_syscall_1
exception: c00 (System Call) regs c00000055c3f7400
                   c000000000017020  .arch_kernel_thread ?kernel? 0x24
c00000055c3f76f0  c0000005599b9000  xfrm_policy_list_Rsmp_cdacf85e ?? 0x591077a8
c00000055c3f7780  d000000000073610  .start_external_cgi ?tux? 0x2c
c00000055c3f7800  d00000000007368c  .query_extcgi ?tux? 0x18
c00000055c3f7880  d0000000000602e0  .http_process_message ?tux? 0x2e4
c00000055c3f7910  d0000000000571fc  .tux_schedule_atom ?tux? 0x40
c00000055c3f7990  d0000000000587d8  .process_requests ?tux? 0x14c
c00000055c3f7a30  d0000000000672d8  .event_loop ?tux? 0x1a0
c00000055c3f7ad0  d00000000006977c  .__sys_tux ?tux? 0x3c8
c00000055c3f7bc0  c00000000024de0c  .sys_tux ?kernel? 0x17c
c00000055c3f7c60  c00000000000fe48  ret_from_syscall_1
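
The same-CPU case can be shown with a toy model.  This is a sketch under
heavy simplification (one CPU, one lock, no real spinning), and every
name in it is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one CPU and one reader/writer lock. */
struct toy_cpu {
    bool writer_held;   /* this CPU holds the lock for writing */
    bool irqs_enabled;  /* interrupts may be delivered right now */
    bool irq_pending;   /* an interrupt arrived while irqs were disabled */
    bool handler_done;  /* handler took the read lock and finished */
    bool deadlocked;    /* handler would spin forever on the read lock */
};

/* Interrupt handler body: needs the lock for reading to resolve a fault. */
void toy_handler(struct toy_cpu *cpu)
{
    if (cpu->writer_held)
        cpu->deadlocked = true;  /* the writer is the code we interrupted */
    else
        cpu->handler_done = true;
}

/* An interrupt arrives: run the handler now, or defer it if irqs are off. */
void toy_raise_irq(struct toy_cpu *cpu)
{
    if (cpu->irqs_enabled)
        toy_handler(cpu);
    else
        cpu->irq_pending = true;
}

void toy_write_lock(struct toy_cpu *cpu, bool disable_irqs)
{
    assert(!cpu->writer_held);
    if (disable_irqs)
        cpu->irqs_enabled = false;
    cpu->writer_held = true;
}

/* Drop the lock, re-enable interrupts, then deliver anything deferred. */
void toy_write_unlock(struct toy_cpu *cpu)
{
    cpu->writer_held = false;
    cpu->irqs_enabled = true;
    if (cpu->irq_pending) {
        cpu->irq_pending = false;
        toy_handler(cpu);
    }
}
```

Taking the write lock with interrupts left on and then raising an
interrupt marks the model as deadlocked; with interrupts off the handler
is deferred until after the unlock and completes normally.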

>> This is all an ad-hoc solution since there's no RCU in 2.4, so I needed
>> another light-weight synchronization method.
>>
> Let me see if I understand this. When someone wants to free a page
> pointed to by an entry in a 3rd level page table, they clear out the pte
> in the page table using pte_clear(). Then they call pte_free with the
> address of the page they are freeing up (not really a page table entry
> but the actual page address). This page address is added to the batch
> list. Later, the idle loop or process termination code calls
> do_check_pgt_cache which will free all the pages in the batch list.
>
> However, prior to freeing the pages, pte_free_batch() will call
> pte_free_sync(). pte_free_sync tries to ensure that the pte_hash_locks
> aren't held on any processor. If a processor is holding it, it could be
> the case that the processor is currently walking the page tables and
> might have loaded, for example, the address of a pmd that we have since
> cleared. Since it is still using the data in that page, we don't want to
> free it until they are done with it. If we wait until they drop the
> lock, they are done and we also know any new reference will see the
> cleared value.
>
> Is this correct?

Yes.
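
In userspace terms the scheme looks roughly like the sketch below.  It
is a single-threaded model, not the patch code: the toy_* names are
hypothetical, a plain counter stands in for the per-processor
pte_hash_locks, and where the real code waits for the locks to drop the
model just returns so the caller can retry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_BATCH_MAX 8   /* arbitrary size for the sketch */

/* Hypothetical batch of pages waiting to be freed. */
struct toy_batch {
    void *pages[TOY_BATCH_MAX];
    size_t count;
};

/* Number of page-table walkers currently inside the locked section. */
int toy_walkers;

/* Queue a page instead of freeing it immediately.  Returns 0, or -1 if full. */
int toy_pte_free(struct toy_batch *b, void *page)
{
    assert(b != NULL);
    if (b->count == TOY_BATCH_MAX)
        return -1;
    b->pages[b->count++] = page;
    return 0;
}

/*
 * Free everything in the batch, but only once no walker can still hold a
 * pointer into these pages.  Returns the number of pages freed; 0 with a
 * non-empty batch means the caller has to retry later.
 */
size_t toy_pte_free_batch(struct toy_batch *b)
{
    size_t freed = 0;

    if (toy_walkers > 0)
        return 0;   /* the real code would wait here for the locks to drop */

    while (b->count > 0) {
        free(b->pages[--b->count]);
        freed++;
    }
    return freed;
}
```

The point of the ordering is the one you describe: the entry is cleared
first, so once the current walkers are done nobody can pick up a new
reference, and only then is the memory released.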

>> void
>> local_flush_tlb_mm(struct mm_struct *mm)
>> {
>> -    spin_lock(&mm->page_table_lock);
>> +    unsigned long flags;
>> +
>> +    spin_lock_irqsave(&mm->page_table_lock, flags);
>>
>>     if ( mm->map_count ) {
>>         struct vm_area_struct *mp;
>>
> JSD: I believe you said the _irqsave wasn't needed here so this hunk and
> JSD: the next one can be removed.

Grmbl. I'll make sure it's gone before I push a fix.

>> +/* Use the PTE functions for freeing PMD as well, since the same
>> + * problem with tree traversals apply. Since pmd pointers are always
>> + * virtual, no need for a page_address() translation.
>> + */
>> +
>> +#define pte_free(pte_page)      __pte_free(pte_page)
>>
> JSD: Your original patch defined pte_free to be
> __pte_free(page_address(pte_page))
> JSD: Is the page_address() wrapper no longer necessary?

This is a difference between the RedHat and ames/mainline trees due to
quicklists. In mainline/ames, pte's are just put on free lists, so the
translation isn't needed there. You still need it in your tree.


-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM



From moilanen at austin.ibm.com  Fri Feb  6 06:01:26 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Thu, 05 Feb 2004 13:01:26 -0600
Subject: [PATCH][2.6] rtas error-inject support
In-Reply-To: <16416.18469.3956.192501@cargo.ozlabs.ibm.com>
References: <1075480843.682.188.camel@magik>
	 <9F2C2DDE-5349-11D8-928E-000A95A0560C@us.ibm.com>
	 <1075488969.682.192.camel@magik>
	 <264AA26A-535D-11D8-928E-000A95A0560C@us.ibm.com>
	 <1075501743.681.214.camel@magik>
	 <16416.18469.3956.192501@cargo.ozlabs.ibm.com>
Message-ID: <1076007685.1023.1209.camel@magik>


> In general, since this is a patch for 2.6 (according to the subject
> line :), it would be better to do all this in userspace using the rtas
> syscall that was added recently.  But here are comments on the patch
> anyway:

I mentioned this to you on IRC, but the main reason I was trying to keep
the same interface was because this functionality is for test teams.
The test teams already have tools and test suites that use this
interface in 2.4.  I would hate to break them when they will probably
need this interface in the very near future.

> > +config RTAS_ERRINJCT
> > +	bool "RTAS Errinject"
>
> How about bool "RTAS Error injection facility" ?
agreed

> > +static unsigned int open_token = 0;
>
> No need to initialize things to 0, C does that by default.

Done.  Normally I try following the style of the particular file.  If
this is bothering you, you probably want to clean up a couple other
variables in the file.

> > @@ -207,7 +221,8 @@
> >  void proc_rtas_init(void)
> >  {
> >  	struct proc_dir_entry *entry;
> > -
> > + 	int errinjct_token;
> > +
>
> I can't see any difference between the blank line that is deleted and
> the one that is inserted (not even any whitespace).  I wonder why diff
> put that in?  Or has your mailer deleted trailing whitespace?

not sure either.

>
> That's a user pointer that you're using strnlen on.  Ouch.  Use
> strnlen_user or copy_from_user.  In fact, do we need to check the
> string length at all?  Would it matter if there was a null in the
> buffer, and the value taken as a string was shorter than we thought?

strnlen taken out.  I don't remember what the thought was for having
this in there.  It was over a year ago when this was originally written.

>
> Another access to the user buffer without using *user functions.
> Ouch.

Fixed.
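
For reference, the bounded copy amounts to something like the following.
This is a userspace sketch only: memchr/memcpy stand in for the kernel's
user-access helpers, and copy_token() and TOKEN_MAX are hypothetical
names (TOKEN_MAX just mirrors ERRINJCT_TOKEN_LEN).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 24  /* mirrors ERRINJCT_TOKEN_LEN from the patch */

/*
 * Copy a token of at most TOKEN_MAX characters from an untrusted buffer of
 * src_len bytes into dst, which must hold TOKEN_MAX + 1 bytes.  Returns the
 * token length, or -1 if no terminating NUL appears within the allowed
 * range, so dst is never left unterminated.
 */
int copy_token(char *dst, const char *src, size_t src_len)
{
    size_t limit = src_len < TOKEN_MAX + 1 ? src_len : TOKEN_MAX + 1;
    const char *nul;

    assert(dst != NULL && src != NULL);
    nul = memchr(src, '\0', limit);
    if (nul == NULL)
        return -1;
    memcpy(dst, src, (size_t)(nul - src) + 1);
    return (int)(nul - src);
}
```

The caller then knows the workspace, if any, starts one byte past the
returned length.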

>
> This worries me.  We copy the workspace (the contents of which are
> undefined since we just kmalloc'd it) to rtas_data_buf, but we never
> copy rtas_data_buf back to the workspace after the rtas call.  So how
> can the contents of workspace ever be anything but undefined?

I'm not sure I'm following you here.  The workspace is passed in (from
ppc_rtas_errinjct_write).  The workspace information is just for RTAS to
know the specifics on the error that is being injected (or NULL
if that particular injection does not use a workspace).

> Why can't we just store a pointer to the token name within the OF
> property value?  Why do we have to make a copy of it?

IIRC the original thought was to keep the string parsing to one place
and only do it once.

> In fact, why do we need to parse the list here at all?  We use the
> list for two things: matching the token name in the write function,
> and listing the tokens in the read function.  In both cases we could
> just run through the ibm,errinjct-tokens (what do they have against
> vwls?) property value almost as easily as ei_token_list.

Changed it to do this parsing.
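
The walk over the property looks roughly like this sketch.  It assumes
the layout the patch relies on (a NUL-terminated name followed by a
32-bit value, repeated) and assumes the value is stored big-endian;
find_errinjct_token() is a hypothetical name, and unlike the patch it
bounds-checks every step and avoids unaligned reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Look up `name` in a property laid out as repeated
 * { NUL-terminated name, 32-bit big-endian value } entries.
 * Returns 0 and stores the value in *token on success, -1 if the name
 * is absent or the buffer is truncated.
 */
int find_errinjct_token(const unsigned char *prop, size_t len,
                        const char *name, uint32_t *token)
{
    size_t pos = 0;

    assert(token != NULL);
    while (pos < len) {
        const unsigned char *nul = memchr(prop + pos, '\0', len - pos);
        size_t name_len, val_pos;

        if (nul == NULL)
            return -1;                  /* name runs off the end */
        name_len = (size_t)(nul - (prop + pos));
        val_pos = pos + name_len + 1;
        if (len - val_pos < 4)
            return -1;                  /* value is truncated */

        if (strcmp((const char *)(prop + pos), name) == 0) {
            const unsigned char *v = prop + val_pos;

            /* Assemble the value byte by byte: no alignment assumptions. */
            *token = ((uint32_t)v[0] << 24) | ((uint32_t)v[1] << 16) |
                     ((uint32_t)v[2] << 8) | (uint32_t)v[3];
            return 0;
        }
        pos = val_pos + 4;
    }
    return -1;
}
```

The same function works unchanged on a buffer read from
/proc/device-tree/rtas/ibm,errinjct-tokens, which is what a userspace
version of this interface would do.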

> > +/* Error inject defines */
> > +#define ERRINJCT_TOKEN_LEN 24  /* Max length of an error inject token */
> > +#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. */
> > +#define WORKSPACE_SIZE 1024
>
> I worry about these too.  15 tokens sounds like future machines are
> going to exceed this limit quite easily.  Also, is the
> workspace size limit there something that applies to all RTAS
> functions, or just to the error injection functions?  If the latter,
> you should choose a name that indicates that.

This was taken directly out of the RPA.  If the RPA changes in the
future, it's just a matter of changing a #define.  A workspace is used
on other rtas calls, but we have not implemented any of them
(ibm,get-indices, ibm,get-vpd, etc...).  Until that day, I changed this
to ERRINJCT_WORKSPACE_SIZE.

> Overall, this looks to me like something that could be done just as
> well or better in userspace.  Doing it in userspace would make it
> easy to avoid having an arbitrary limit on the number of tokens, for
> instance.  Userspace could just read
> /proc/device-tree/rtas/ibm,errinjct-tokens to get the list and match
> against that directly.

This could probably be done just as well in user land.  The problem is that
some of these test teams need to start testing very soon.  Their
resources are thin, and I doubt they'll have time to rewrite interfaces. Why
don't we use this interface for now, deprecate it, and tell the test
teams that the interface will change in the 2.7 time frame?

Jake
-------------- next part --------------
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.1415  -> 1.1416
#	arch/ppc64/kernel/rtas.c	1.22    -> 1.23
#	arch/ppc64/defconfig	1.42    -> 1.43
#	arch/ppc64/kernel/rtas-proc.c	1.13    -> 1.14
#	  arch/ppc64/Kconfig	1.39    -> 1.40
#	include/asm-ppc64/rtas.h	1.17    -> 1.18
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 04/02/05	moilanen at zippy.ltc.austin.ibm.com	1.1416
# RTAS Error Inject support
# --------------------------------------------
#
diff -Nru a/arch/ppc64/Kconfig b/arch/ppc64/Kconfig
--- a/arch/ppc64/Kconfig	Thu Feb  5 12:47:44 2004
+++ b/arch/ppc64/Kconfig	Thu Feb  5 12:47:44 2004
@@ -165,6 +165,14 @@
 	Provide system capacity information via human readable
 	<key word>=<value> pairs through a /proc/ppc64/lparcfg interface.

+config RTAS_ERRINJCT
+	bool "RTAS Error inject facility"
+	depends on PPC_RTAS
+	help
+	Provide ability to inject errors into hardware for the purpose
+	of testing hardware error code path.  Do not use on production
+	machine.
+
 endmenu


diff -Nru a/arch/ppc64/defconfig b/arch/ppc64/defconfig
--- a/arch/ppc64/defconfig	Thu Feb  5 12:47:44 2004
+++ b/arch/ppc64/defconfig	Thu Feb  5 12:47:44 2004
@@ -65,6 +65,7 @@
 CONFIG_RTAS_FLASH=m
 CONFIG_SCANLOG=m
 CONFIG_LPARCFG=y
+# CONFIG_RTAS_ERRINJCT is not set

 #
 # General setup
diff -Nru a/arch/ppc64/kernel/rtas-proc.c b/arch/ppc64/kernel/rtas-proc.c
--- a/arch/ppc64/kernel/rtas-proc.c	Thu Feb  5 12:47:44 2004
+++ b/arch/ppc64/kernel/rtas-proc.c	Thu Feb  5 12:47:44 2004
@@ -126,6 +126,9 @@

 static unsigned long rtas_tone_frequency = 1000;
 static unsigned long rtas_tone_volume = 0;
+#ifdef CONFIG_RTAS_ERRINJCT
+static unsigned int open_token;
+#endif

 /* ****************STRUCTS******************************************* */
 struct individual_sensor {
@@ -165,6 +168,14 @@
 		size_t count, loff_t *ppos);
 static ssize_t ppc_rtas_rmo_buf_read(struct file *file, char *buf,
 				    size_t count, loff_t *ppos);
+#ifdef CONFIG_RTAS_ERRINJCT
+static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file);
+static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file);
+static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf,
+	        size_t count, loff_t *ppos);
+static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf,
+		size_t count, loff_t *ppos);
+#endif

 struct file_operations ppc_rtas_poweron_operations = {
 	.read =		ppc_rtas_poweron_read,
@@ -189,6 +200,15 @@
 	.write =	ppc_rtas_tone_volume_write
 };

+#ifdef CONFIG_RTAS_ERRINJCT
+struct file_operations ppc_rtas_errinjct_operations = {
+    .open =		ppc_rtas_errinjct_open,
+    .read = 		ppc_rtas_errinjct_read,
+    .write = 		ppc_rtas_errinjct_write,
+    .release = 		ppc_rtas_errinjct_release
+};
+#endif
+
 static struct file_operations ppc_rtas_rmo_buf_ops = {
 	.read =		ppc_rtas_rmo_buf_read,
 };
@@ -244,6 +264,13 @@

 	entry = create_proc_entry("rmo_buffer", S_IRUSR, proc_ppc64.rtas);
 	if (entry) entry->proc_fops = &ppc_rtas_rmo_buf_ops;
+
+#ifdef CONFIG_RTAS_ERRINJCT
+ 	if (rtas_token("ibm,errinjct") != RTAS_UNKNOWN_SERVICE) {
+ 		entry = create_proc_entry("errinjct",S_IWUSR|S_IRUGO, proc_ppc64.rtas);
+ 		if (entry) entry->proc_fops = &ppc_rtas_errinjct_operations;
+	}
+#endif
 }

 /* ****************************************************************** */
@@ -932,6 +959,157 @@
 	*ppos += n;
 	return n;
 }
+
+#ifdef CONFIG_RTAS_ERRINJCT
+/* ****************************************************************** */
+/* ERRINJCT			                                      */
+/* ****************************************************************** */
+static int ppc_rtas_errinjct_open(struct inode *inode, struct file *file)
+{
+	int rc;
+
+	/* We will only allow one process to use error inject at a
+	   time.  Since errinjct is usually only used for testing,
+	   this shouldn't be an issue */
+	if (open_token) {
+		return -EAGAIN;
+	}
+	rc = rtas_errinjct_open();
+	if (rc < 0) {
+		return -EIO;
+	}
+	open_token = rc;
+
+	return 0;
+}
+
+static ssize_t ppc_rtas_errinjct_write(struct file * file, const char * buf,
+				       size_t count, loff_t *ppos)
+{
+
+	char * ei_token;
+	char * workspace = NULL;
+	size_t max_len;
+	int token_len;
+	int rc;
+
+	/* Verify the errinjct token length */
+	if (count < ERRINJCT_TOKEN_LEN) {
+		max_len = count;
+	} else {
+		max_len = ERRINJCT_TOKEN_LEN;
+	}
+
+	ei_token = (char *)kmalloc(max_len, GFP_KERNEL);
+	if (!ei_token) {
+		printk(KERN_WARNING "error: kmalloc failed\n");
+		return -ENOMEM;
+	}
+
+	token_len = strncpy_from_user(ei_token, buf, max_len);
+	if (token_len <= 0) {
+		kfree(ei_token);
+		return -EFAULT;
+	}
+	token_len++;
+
+	if (count > token_len + ERRINJCT_WORKSPACE_SIZE) {
+		count = token_len + ERRINJCT_WORKSPACE_SIZE;
+	}
+
+	buf += token_len;
+
+	/* check if there is a workspace */
+	if (count > token_len) {
+		/* Verify the workspace size */
+		if ((count - token_len) > ERRINJCT_WORKSPACE_SIZE) {
+			max_len = ERRINJCT_WORKSPACE_SIZE;
+		} else {
+			max_len = count - token_len;
+		}
+
+		workspace = (char *)kmalloc(max_len, GFP_KERNEL);
+		if (!workspace) {
+			printk(KERN_WARNING "error: failed kmalloc\n");
+			kfree(ei_token);
+			return -ENOMEM;
+		}
+		if (copy_from_user(workspace, buf, max_len)) {
+			kfree(ei_token);
+			kfree(workspace);
+			return -EFAULT;
+		}
+	}
+
+	rc = rtas_errinjct(open_token, ei_token, workspace, max_len);
+
+	if (count > token_len) {
+		kfree(workspace);
+	}
+	kfree(ei_token);
+
+	return rc < 0 ? rc : count;
+}
+
+static int ppc_rtas_errinjct_release(struct inode *inode, struct file *file)
+{
+	int rc;
+
+	rc = rtas_errinjct_close(open_token);
+	if (rc) {
+		return rc;
+	}
+	open_token = 0;
+	return 0;
+}
+
+static ssize_t ppc_rtas_errinjct_read(struct file *file, char *buf,
+				      size_t count, loff_t *ppos)
+{
+	char * buffer;
+	char * token_array;
+	char * token_array_end;
+	int array_len;
+	int n = 0;
+
+	buffer = (char *)kmalloc(MAX_ERRINJCT_TOKENS * (ERRINJCT_TOKEN_LEN+1),
+				 GFP_KERNEL);
+	if (!buffer) {
+		printk(KERN_ERR "error: kmalloc failed\n");
+		return -ENOMEM;
+	}
+
+	token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", &array_len);
+	token_array_end = token_array + array_len;
+
+	while (token_array < token_array_end) {
+		n += sprintf(buffer+n, "%s", token_array);
+		token_array += strlen(token_array) + 1;
+		n += sprintf(buffer+n, " %d\n", *(int *)token_array);
+		token_array += sizeof(int);
+	}
+
+	if (*ppos >= strlen(buffer)) {
+		kfree(buffer);
+		return 0;
+	}
+	if (n > strlen(buffer) - *ppos)
+		n = strlen(buffer) - *ppos;
+
+	if (n > count)
+		n = count;
+
+	if (copy_to_user(buf, buffer + *ppos, n)) {
+		kfree(buffer);
+		return -EFAULT;
+	}
+
+	*ppos += n;
+
+	kfree(buffer);
+	return n;
+}
+#endif /* CONFIG_RTAS_ERRINJCT */

 #define RMO_READ_BUF_MAX 30

diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
--- a/arch/ppc64/kernel/rtas.c	Thu Feb  5 12:47:44 2004
+++ b/arch/ppc64/kernel/rtas.c	Thu Feb  5 12:47:44 2004
@@ -225,6 +225,10 @@
 	int order = status - 9900;
 	unsigned long ms;

+	if (status < RTAS_EXTENDED_DELAY_MIN ||
+	    status > RTAS_EXTENDED_DELAY_MAX)
+		return 0;
+
 	if (order < 0)
 		order = 0;	/* RTC depends on this for -2 clock busy */
 	else if (order > 5)
@@ -463,6 +467,127 @@
 	return 0;
 }

+#ifdef CONFIG_RTAS_ERRINJCT
+int
+rtas_errinjct_open(void)
+{
+	u32 ret[2];
+	int open_token;
+	int rc;
+	unsigned int time;
+
+
+	while (1) {
+		/*
+		 * The rc and open_token values are backwards due to a
+		 * misprint in the RPA.
+		 */
+		open_token = rtas_call(rtas_token("ibm,open-errinjct"), 0, 2, (void *) &ret);
+		rc = ret[0];
+
+		if (rc == RTAS_BUSY) {
+			continue;
+		}
+
+		if ((time = rtas_extended_busy_delay_time(rc))) {
+			udelay(time * 1000);
+			continue;
+		}
+
+		if (rc < 0) {
+			printk(KERN_WARNING "error: ibm,open-errinjct failed (%d)\n", rc);
+			return rc;
+		}
+
+		return open_token;
+	}
+}
+
+int
+rtas_errinjct(unsigned int open_token, char * ei_token, char * workspace, size_t workspace_size)
+{
+	char * token_array;
+	char * token_array_end;
+	int array_len;
+	int rtas_ei_token = -1;
+	unsigned int time;
+	int rc = 0;
+
+	token_array = (char *) get_property(rtas.dev, "ibm,errinjct-tokens", &array_len);
+	token_array_end = token_array + array_len;
+
+	while (token_array < token_array_end) {
+		if (strcmp(token_array, ei_token) == 0) {
+			rtas_ei_token = *(int *)(token_array + strlen(token_array) + 1);
+			break;
+		}
+		token_array += strlen(token_array) + 1;
+		token_array += sizeof(int);
+	}
+
+ 	if (rtas_ei_token == -1) {
+ 		return -EINVAL;
+ 	}
+
+ 	spin_lock(&rtas_data_buf_lock);
+
+	while (1) {
+		if (rc != RTAS_BUSY && workspace) {
+			memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
+			memcpy(rtas_data_buf, workspace, workspace_size);
+		}
+
+		rc = rtas_call(rtas_token("ibm,errinjct"), 3, 1, NULL,
+ 			       rtas_ei_token, open_token, __pa(rtas_data_buf));
+
+ 		if (rc == RTAS_BUSY) {
+ 			continue;
+ 		}
+
+ 		if ((time = rtas_extended_busy_delay_time(rc))) {
+ 			spin_unlock(&rtas_data_buf_lock);
+ 			udelay(time * 1000);
+ 			spin_lock(&rtas_data_buf_lock);
+ 			continue;
+ 		}
+
+ 		if (rc != 0) {
+ 			printk(KERN_WARNING "error: ibm,errinjct failed (%d)\n", rc);
+ 		}
+
+ 		spin_unlock(&rtas_data_buf_lock);
+
+ 		return rc;
+ 	}
+}
+
+int
+rtas_errinjct_close(unsigned int open_token)
+{
+ 	int rc;
+ 	unsigned int time;
+
+ 	while (1) {
+ 		rc = rtas_call(rtas_token("ibm,close-errinjct"), 1, 1, NULL, open_token);
+
+ 		if (rc == RTAS_BUSY) {
+ 			continue;
+ 		}
+
+ 		if ((time = rtas_extended_busy_delay_time(rc))) {
+ 			udelay(time * 1000);
+ 			continue;
+ 		}
+
+ 		if (rc != 0) {
+ 			printk(KERN_WARNING "error: ibm,close-errinjct failed (%d)\n", rc);
+ 		}
+
+ 		return rc;
+ 	}
+}
+
+#endif

 EXPORT_SYMBOL(rtas_firmware_flash_list);
 EXPORT_SYMBOL(rtas_token);
diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h
--- a/include/asm-ppc64/rtas.h	Thu Feb  5 12:47:44 2004
+++ b/include/asm-ppc64/rtas.h	Thu Feb  5 12:47:44 2004
@@ -22,6 +22,11 @@
 /* Buffer size for ppc_rtas system call. */
 #define RTAS_RMOBUF_MAX (64 * 1024)

+/* Error inject defines */
+#define ERRINJCT_TOKEN_LEN 24  /* Max length of an error inject token */
+#define MAX_ERRINJCT_TOKENS 15 /* Max # tokens. */
+#define ERRINJCT_WORKSPACE_SIZE 1024
+
 /* RTAS return codes */
 #define RTAS_BUSY		-2	/* RTAS Return Status - Busy */
 #define RTAS_EXTENDED_DELAY_MIN 9900
@@ -178,6 +183,9 @@
 extern int rtas_get_sensor(int sensor, int index, int *state);
 extern int rtas_get_power_level(int powerdomain, int *level);
 extern int rtas_set_indicator(int indicator, int index, int new_value);
+extern int rtas_errinjct_open(void);
+extern int rtas_errinjct(unsigned int, char *, char *, size_t);
+extern int rtas_errinjct_close(unsigned int);

 /* Given an RTAS status code of 9900..9905 compute the hinted delay */
 unsigned int rtas_extended_busy_delay_time(int status);
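
The 9900..9905 range mentioned in the header comment encodes a delay hint.
A minimal sketch of that mapping follows, assuming the convention that
status 990x asks the caller to wait 10^x milliseconds. The helper name,
RTAS_EXTENDED_DELAY_MAX, and the choice to return 0 for any other status
are illustrative assumptions, not taken from the patch:

```c
#define RTAS_EXTENDED_DELAY_MIN 9900
#define RTAS_EXTENDED_DELAY_MAX 9905	/* assumed upper bound of the range */

/*
 * Hypothetical helper: map an extended-busy status to a delay in
 * milliseconds.  Returns 0 when the status is outside 9900..9905.
 */
unsigned int extended_busy_delay_ms(int status)
{
	unsigned int ms = 1;
	int order;

	if (status < RTAS_EXTENDED_DELAY_MIN ||
	    status > RTAS_EXTENDED_DELAY_MAX)
		return 0;

	/* 9900 -> 1 ms, 9901 -> 10 ms, ... 9905 -> 100000 ms */
	for (order = status - RTAS_EXTENDED_DELAY_MIN; order > 0; order--)
		ms *= 10;

	return ms;
}
```

With a mapping like this, a caller that loops on "if ((time = ...))" only
delays when firmware actually returned an extended-busy status.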

From linas at austin.ibm.com  Fri Feb  6 07:26:43 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Thu, 5 Feb 2004 14:26:43 -0600
Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <20040204142853.A28220@forte.austin.ibm.com>; from linas@austin.ibm.com on Wed, Feb 04, 2004 at 02:28:53PM -0600
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com>
Message-ID: <20040205142643.T27780@forte.austin.ibm.com>


OK,

Fifth time's a charm ...

base64 encoding the patch helps prevent the mail gateways from mangling it,
but then it's too big for the mailing list manager.  You can fetch the patch from:

http://www-124.ibm.com/linux/patches/?patch_id=1344

To repeat the original note:

Patch for multiple EEH-related bugs.  Please review this patch,
& if appropriate, please apply.  It should apply cleanly to
the current ameslab tree (Feb 03 2004  2.6.2-rc3).

This patch fixes multiple EEH-related bugs:

-- Fixes the eeh_check_failure() usage in an interrupt context.
   This routine is now safe to use in an interrupt. The fix was to
   build a cache of IO addresses and check that, instead of using
   the pci routines.
-- Merges in Olof Johansson's sizeof patch when checking for failure
-- Adds EEH tests to array/string reads
-- Fixes bugs with address resolution (some i/o addresses were handled
   incorrectly, resulting in EEH errors slipping by undetected.)
-- Adds EEH support to the PCI Hotplug system (so that devices that
   get added/removed get properly registered with the EEH subsystem.)
-- Fixes improper use of /proc filesystem.
-- Adds some misc statistics.
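
As a rough illustration of the first item (checking a cache of IO
addresses instead of calling into the pci routines), here is a minimal
sketch.  It assumes a sorted array of non-overlapping ranges searched by
binary search; the struct and function names are hypothetical and the
data structure used in the actual patch may differ.  The point is only
that a lookup like this never sleeps, so it can run in interrupt context:

```c
#include <stddef.h>

/* Hypothetical cache entry: one IO range owned by one device. */
struct io_range {
	unsigned long start;	/* first address in the range */
	unsigned long end;	/* last address in the range (inclusive) */
	int dev_id;		/* stand-in for a device pointer */
};

/*
 * Look up addr in a cache sorted by start address.  Returns the
 * matching entry, or NULL if no cached range contains addr.
 */
const struct io_range *io_cache_lookup(const struct io_range *cache,
				       size_t n, unsigned long addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < cache[mid].start)
			hi = mid;
		else if (addr > cache[mid].end)
			lo = mid + 1;
		else
			return &cache[mid];
	}
	return NULL;
}
```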

Please note that the EEH subsystem will be undergoing a major revision
in the not-too-distant future; this patch is a 'stopgap' to address the
immediate concerns/issues until that time.

--linas


On Wed, Feb 04, 2004 at 02:28:53PM -0600, linas at austin.ibm.com wrote:
> On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
> > On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
> >
> > > Patch for multiple EEH-related bugs.  Please review this patch,
> > > & if appropriate, please apply.  It should apply cleanly to
> > > the current ameslab tree (Feb 03 2004  2.6.2-rc3).
> >
> > I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
> > made the diff against a current ameslab tree?
>
> Right tree, bad email attachment.
>
> I don't know how it happened, but what I sent out had some trailing
> whitespace whacked. The attached patch should not have this problem.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From moilanen at austin.ibm.com  Fri Feb  6 09:10:33 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Thu, 05 Feb 2004 16:10:33 -0600
Subject: [PATCH][2.6] Nested Interrupt support
In-Reply-To: <16400.37866.867318.95501@cargo.ozlabs.ibm.com>
References: <1074094346.2389.42.camel@magik>
	 <20040119042022.GA20834@krispykreme> <1074781690.23288.571.camel@magik>
	 <16400.37866.867318.95501@cargo.ozlabs.ibm.com>
Message-ID: <1076019033.1023.1266.camel@magik>


> > Here's the patch using per cpu data for the irq stack.
>
> Can't we find somewhere on the kernel stack to stash this?  Could we
> use regs->softe maybe?

Do you have any better idea how you would want this?

I still think Anton's idea of storing this in the per-cpu data makes
more sense since it's the priority stack of each CPU.

Jake




From johnrose at austin.ibm.com  Fri Feb  6 11:20:47 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 05 Feb 2004 18:20:47 -0600
Subject: PCI Probe Question
Message-ID: <1076026847.4798.43.camel@verve>


My question involves pci_read_bases(), which reads resource info from
the config space of an adapter.  I'm using an e100.

At boot, and at hotplug-enable time, pci_read_bases() reads 0xfc00 for
the base of the adapter's only "I/O" resource region.

After DLPAR removing and adding back the parent bus of the card,
pci_read_bases() reads a different value (0x1000) for the base of the
same region.  It reads the same size as at boot, and all the other
regions are identical to their boot-time bases and sizes.

How can the base address of the only I/O region change like this?  I
thought that was a static property of an adapter.  Despite this
difference, the card still works.

Thoughts?
John

BTW - the addr and buid values are identical to those at boot, so the
rtas read pci config call is passing identical inputs in both cases.




From willschm at us.ibm.com  Fri Feb  6 15:24:22 2004
From: willschm at us.ibm.com (Will Schmidt)
Date: Thu, 5 Feb 2004 22:24:22 -0600
Subject: LPARCFG
In-Reply-To: <20040205160710.GZ19011@krispykreme>
Message-ID: <OF9785332C.D4020F5D-ON86256E32.00178CE0-86256E32.0018349E@us.ibm.com>


> > Building as a module was broken when the code was checked in to ameslab
> > 2.6; I suggested turning it off.  I think lparcfg uses unexported
> > symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out.  Should any of
> > those be exported?
> >
> > See this thread for the history:
> > http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html
>
> Ahh yeah, I have such a short memory.
>
> Actually I enabled it as a module and built all modules and didnt get
> any warnings. Either we have everything exported now, Im not getting
> undefined symbol warnings any more or else Im going blind.

I've got some more updates for this code, and will try to get a patch onto this
list tomorrow (I still need to forward port to current and check on
iSeries).
I couldn't build it as a module without exporting a few symbols, but my
tree is about a week old, so I'm probably missing those fixes.


-Will

willschm at us.ibm.com
Linux on PowerPC-64 Development
IBM Rochester




From anton at samba.org  Sat Feb  7 02:11:55 2004
From: anton at samba.org (Anton Blanchard)
Date: Sat, 7 Feb 2004 02:11:55 +1100
Subject: [PATCH] kill lmb_add_io
In-Reply-To: <20040205171522.GB4983@w-mikek2.beaverton.ibm.com>
References: <1075856864.1449.205.camel@nighthawk> <20040204174036.GB4996@w-mikek2.beaverton.ibm.com> <20040205133650.GV19011@krispykreme> <20040205171522.GB4983@w-mikek2.beaverton.ibm.com>
Message-ID: <20040206151155.GK19011@krispykreme>


> Exactly, and this is an area I am interested in pursuing.  Hence, the
> interest in Dave's machine.  I haven't gone through enough pSeries
> architecture documentation yet to know what to expect.  One would think
> that a smaller lmb size would be better for hot swap memory.  At least
> in the 'remove' side of the equation.

A good start would be to tar up a device tree of such a machine. You can
untar it and use lsprop to look around.

Actually it would be worth having a repository of device trees of
various machines and OF versions somewhere. We often need to know if
some random version of OF on some machine has a particular property.

Anton



From johnrose at austin.ibm.com  Sat Feb  7 03:28:49 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Fri, 06 Feb 2004 10:28:49 -0600
Subject: PCI Probe Question
In-Reply-To: <1076026847.4798.43.camel@verve>
References: <1076026847.4798.43.camel@verve>
Message-ID: <1076084929.6881.2.camel@verve>


> How can the base address of the only I/O region change like this?  I
> thought that was a static property of an adapter.  Despite this
> difference, the card still works.

So apparently the bases of these regions can be changed by firmware
and/or OS.  The sizes won't change, but the bases might.  So this isn't
a bug :)  Thanks to Mike Lyons for the pci knowledge.
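
To make the size/base distinction concrete, here is a small sketch of the
usual BAR sizing arithmetic: all ones are written to the BAR and the
value read back tells which address bits are hardwired to zero.  The
function names are hypothetical, and the 64-byte I/O region used in the
usage note is purely an example value, not something measured on the
e100 discussed here:

```c
#include <stdint.h>

/*
 * Given the value read back from a BAR after writing all ones,
 * return the size of the region in bytes (0 if the BAR is unused).
 */
uint32_t bar_size(uint32_t probe, int is_io)
{
	/* Low bits are type flags: 2 bits for I/O, 4 bits for memory. */
	uint32_t mask = is_io ? (probe & ~UINT32_C(0x3))
			      : (probe & ~UINT32_C(0xf));
	uint32_t size;

	if (mask == 0)
		return 0;

	size = ~mask + 1;	/* lowest writable address bit */
	if (is_io)
		size &= 0xffff;	/* upper 16 bits may read back as zero */
	return size;
}

/* Any base is acceptable as long as it is aligned to the region size. */
int base_is_valid(uint32_t base, uint32_t size)
{
	return size != 0 && (base & (size - 1)) == 0;
}
```

For a 64-byte I/O region, both 0xfc00 and 0x1000 pass base_is_valid(),
which fits the observation that the base moved while the size stayed put.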

John




From jimix at watson.ibm.com  Sat Feb  7 06:12:44 2004
From: jimix at watson.ibm.com (Jimi Xenidis)
Date: Fri, 6 Feb 2004 14:12:44 -0500
Subject: missing RELOC()s in prom.c?
Message-ID: <16419.59180.461757.753005@kitch0.watson.ibm.com>


Fellow coder (cc'd) found the following issue
-JX


===== /scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c 1.46 vs edited =====
--- 1.46/arch/ppc64/kernel/prom.c	Tue Dec  9 11:45:05 2003
+++ edited//scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c	Fri Feb  6 13:35:47 2004
@@ -1143,9 +1143,9 @@
 				sizeof(option));
 			if (option[0] != 0) {
 				found = 1;
-				if (!strcmp(option, "off"))
+				if (!strcmp(option, RELOC("off")))
 					my_smt_enabled = SMT_OFF;
-				else if (!strcmp(option, "on"))
+				else if (!strcmp(option, RELOC("on")))
 					my_smt_enabled = SMT_ON;
 				else
 					my_smt_enabled = SMT_DYNAMIC;
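
For readers unfamiliar with why the string literals need RELOC(): this
code runs before the kernel is executing at the address it was linked
at, so a pointer the compiler generated for "off" points at the link
address, not at where the bytes currently sit.  The following userspace
sketch imitates that situation.  LINK_BASE, the image layout and all
helper names are hypothetical stand-ins, not the kernel's macros:

```c
#include <stdint.h>
#include <string.h>

/* Address the image was "linked" at (illustrative value). */
#define LINK_BASE ((uintptr_t)0xc000000000000000ull)

/* Where the image actually sits right now: wherever this array is. */
static const char image[] = "off\0on";
#define OFF_STR 0u	/* offset of "off" within the image */
#define ON_STR  4u	/* offset of "on" within the image */

/* The address the compiler would have baked in for a literal. */
static uintptr_t linked_addr(unsigned int offset)
{
	return LINK_BASE + offset;
}

/* Difference between where we run and where we were linked. */
static uintptr_t reloc_offset(void)
{
	return (uintptr_t)image - LINK_BASE;
}

/* Turn a link-time address into one that is usable right now. */
static const char *ptr_reloc(uintptr_t linked)
{
	return (const char *)(linked + reloc_offset());
}

int option_is_off(const char *option)
{
	return strcmp(option, ptr_reloc(linked_addr(OFF_STR))) == 0;
}

int option_is_on(const char *option)
{
	return strcmp(option, ptr_reloc(linked_addr(ON_STR))) == 0;
}
```

Without the ptr_reloc() step, strcmp() would be handed the raw link-time
address, which is the bug the patch above addresses.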



From haveblue at us.ibm.com  Sat Feb  7 06:37:45 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: 06 Feb 2004 11:37:45 -0800
Subject: [PATCH] get CROSS32 from environment
Message-ID: <1076096265.5716.340.camel@nighthawk>

I'm doing some x86->ppc64 cross compiles.  The top-level CROSS_COMPILE
comes out of the environment just fine, but the 32-bit compiler is just
set up to be empty.  This patch gets it out of the environment, if
present.

--dave
-------------- next part --------------
--- linux-2.6.1-clean/arch/ppc64/boot/Makefile	2004-01-08 22:59:56.000000000 -0800
+++ linux-2.6.1-memhotplug/arch/ppc64/boot/Makefile	2004-02-04 15:02:17.000000000 -0800
@@ -20,7 +20,7 @@
 #	CROSS32_COMPILE is setup as a prefix just like CROSS_COMPILE
 #	in the toplevel makefile.

-CROSS32_COMPILE =
+CROSS32_COMPILE ?=
 #CROSS32_COMPILE = /usr/local/ppc/bin/powerpc-linux-

 BOOTCC		:= $(CROSS32_COMPILE)gcc

From jschopp at austin.ibm.com  Sat Feb  7 09:53:32 2004
From: jschopp at austin.ibm.com (jschopp at austin.ibm.com)
Date: Fri, 6 Feb 2004 16:53:32 -0600 (CST)
Subject: missing RELOC()s in prom.c?
In-Reply-To: <16419.59180.461757.753005@kitch0.watson.ibm.com>
Message-ID: <Pine.A41.4.44.0402061651410.27162-100000@forte.austin.ibm.com>


Good catch.  I think this is dead on the money.  I'll push it to ameslab
unless somebody beats me to it.

-JOel

On Fri, 6 Feb 2004, Jimi Xenidis wrote:

>
> Fellow coder (cc'd) found the following issue
> -JX
>
>
> ===== /scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c 1.46 vs edited =====
> --- 1.46/arch/ppc64/kernel/prom.c	Tue Dec  9 11:45:05 2003
> +++ edited//scratch1/jimix/work/linux/linux-2.5/arch/ppc64/kernel/prom.c	Fri Feb  6 13:35:47 2004
> @@ -1143,9 +1143,9 @@
>  				sizeof(option));
>  			if (option[0] != 0) {
>  				found = 1;
> -				if (!strcmp(option, "off"))
> +				if (!strcmp(option, RELOC("off")))
>  					my_smt_enabled = SMT_OFF;
> -				else if (!strcmp(option, "on"))
> +				else if (!strcmp(option, RELOC("on")))
>  					my_smt_enabled = SMT_ON;
>  				else
>  					my_smt_enabled = SMT_DYNAMIC;
>
>
>




From paulus at samba.org  Sat Feb  7 12:42:27 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 7 Feb 2004 12:42:27 +1100
Subject: missing RELOC()s in prom.c?
In-Reply-To: <16419.59180.461757.753005@kitch0.watson.ibm.com>
References: <16419.59180.461757.753005@kitch0.watson.ibm.com>
Message-ID: <16420.17027.764354.90415@cargo.ozlabs.ibm.com>


Jimi Xenidis writes:

> Fellow coder (cc'd) found the following issue
> -JX

Hmmm, good catch, but it's not obvious to me why smt_setup() has to be
done at prom_init time.  I don't see anything in there that has to be
done then - we are just looking at the command line and the device
tree and setting some fields in the naca.  That could be done more
easily (without the RELOCs) once we have set up the MMU, surely?
We use _naca->smt_state in prom_hold_cpus, it is true, but that only
affects the setting of cpu_available_map and cpu_present_at_boot, and
the values of those bitmaps don't affect any calls to OF.

Does anyone have a reason why smt_setup has to be called from
prom_init?

Paul.



From anton at samba.org  Sat Feb  7 12:57:28 2004
From: anton at samba.org (Anton Blanchard)
Date: Sat, 7 Feb 2004 12:57:28 +1100
Subject: missing RELOC()s in prom.c?
In-Reply-To: <16420.17027.764354.90415@cargo.ozlabs.ibm.com>
References: <16419.59180.461757.753005@kitch0.watson.ibm.com> <16420.17027.764354.90415@cargo.ozlabs.ibm.com>
Message-ID: <20040207015727.GO19011@krispykreme>


> Does anyone have a reason why smt_setup has to be called from
> prom_init?

Good point. We need the paca and prom.c police, someone who will come
knocking any time code gets added to either :) (for the benefit of
others, the reason for using percpu data rather than the paca is that
we can do node local allocation of per cpu memory one day)

Speaking of things in the paca, does anyone know why we a) have an
option to overclock the decr and b) have an option to overclock the decr
on cpu0 separately? If it's an issue with iSeries servicing events, does
the current HZ=1000 solve this?

Anton (looking to remove another field from the paca)



From boutcher at us.ibm.com  Sat Feb  7 14:20:46 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Fri, 6 Feb 2004 21:20:46 -0600
Subject: missing RELOC()s in prom.c?
In-Reply-To: <20040207015727.GO19011@krispykreme>
Message-ID: <OF081CF8A9.A9C62975-ON86256E33.00118DB4-86256E33.00125CF4@us.ibm.com>


Anton Blanchard <anton at samba.org> wrote on 02/06/2004 07:57:28 PM:
> Speaking of things in the paca, does anyone know why we a) have an
> option to overclock the decr and b) have an option to overclock the
> decr on cpu0 separately? If its an issue with iseries servicing
> events, does the current HZ=1000 solve this?

Because on legacy iSeries there are no real interrupts.  There are only lp
events put on a queue.  And we check the queue on each decrementer.  And
there's an option somewhere or other (search for spread_lp_events) for
whether we check the queue only with cpu0 or with all cpus.

By the way, at one point there was a cool option to stagger the
decrementers so that even with overclocking, multiple CPUs would check the
lp event queues at staggered intervals within the decrementer intervals.

Overclocking the decrementer doesn't do anything for jiffies, but it does
reduce the latency on I/O events and is only useful on legacy iSeries.

As to whether HZ helps, it's too late at night, and Hugh made me drink too
much beer to contemplate that.

Dave Boutcher
IBM Linux Technology Center



From anton at samba.org  Sat Feb  7 20:40:35 2004
From: anton at samba.org (Anton Blanchard)
Date: Sat, 7 Feb 2004 20:40:35 +1100
Subject: missing RELOC()s in prom.c?
In-Reply-To: <OF081CF8A9.A9C62975-ON86256E33.00118DB4-86256E33.00125CF4@us.ibm.com>
References: <20040207015727.GO19011@krispykreme> <OF081CF8A9.A9C62975-ON86256E33.00118DB4-86256E33.00125CF4@us.ibm.com>
Message-ID: <20040207094035.GR19011@krispykreme>


> Cause on legacy iSeries there are no real interrupts.  There are only lp
> events put on a queue.  And we check the queue on each decrementer.  And
> there's an option somewhere or other (search for spread_lp_events) for
> whether we check the queue only with cpu0 or with all cpus..
>
> By the way, at one point there was a cool option to stagger the
> decrementers so that even with overclocking, multiple CPUs would check the
> lp event queues at staggered intervals within the decrementer intervals.

In fact we are currently staggering the decrementers in 2.6. It was done
to avoid all cpus hitting some global locks in the timer code. The
global lock has been removed and x86 has moved back to triggering all
timer irqs at the same time.

Considering that staggered interrupts should help iseries, we might
continue to do it on ppc64. Also I notice there is a spread_events boot
option, should we make this the default?
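
A minimal sketch of what staggering means here, assuming the simplest
scheme of spacing the CPUs evenly across one timer period.  The function
name and parameters are hypothetical and this is not the 2.6 code:

```c
#include <assert.h>

/*
 * Offset, in timebase ticks, to add to the first decrementer
 * deadline of a given cpu so that ncpus cpus fire at evenly
 * spaced points within one jiffy.
 */
unsigned long stagger_offset(unsigned int cpu, unsigned int ncpus,
			     unsigned long ticks_per_jiffy)
{
	assert(ncpus > 0 && cpu < ncpus);

	return (ticks_per_jiffy / ncpus) * cpu;
}
```

With 4 cpus and 1000 ticks per jiffy the offsets come out as 0, 250, 500
and 750, so an lp event queue polled from every cpu is checked four
times per jiffy instead of once.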

> As to whether HZ helps, It's too late at night, and Hugh made me drink too
> much beer to contemplate that.

Watch out for Hugh :) My theory is if our usual overclocking of the decr
in 2.4 ends up the same frequency as base 2.6 (1000Hz) then we can remove
the option.

Anton



From jimix at watson.ibm.com  Sun Feb  8 02:52:55 2004
From: jimix at watson.ibm.com (Jimi Xenidis)
Date: Sat, 7 Feb 2004 10:52:55 -0500
Subject: missing RELOC()s in prom.c?
In-Reply-To: <16420.17027.764354.90415@cargo.ozlabs.ibm.com>
References: <16419.59180.461757.753005@kitch0.watson.ibm.com>
	<16420.17027.764354.90415@cargo.ozlabs.ibm.com>
Message-ID: <16421.2519.579220.101653@kitch0.watson.ibm.com>


>>>>> "PM" == Paul Mackerras <paulus at samba.org> writes:

 PM> Jimi Xenidis writes:
 >> Fellow coder (cc'd) found the following issue
 >> -JX

 PM> Hmmm, good catch, but it's not obvious to me why smt_setup() has to be
 PM> done at prom_init time.

we wondered this as well ;-)
-JX




From mostrows at watson.ibm.com  Sun Feb  8 09:40:38 2004
From: mostrows at watson.ibm.com (Michal Ostrowski)
Date: Sat, 07 Feb 2004 17:40:38 -0500
Subject: G5 timebase_frequency
Message-ID: <1076193638.10104.3326.camel@brick.watson.ibm.com>


I've noticed that timebase_frequency on a G5 appears to be defined by OF
to be 33.3333MHz. OTOH, the specs for the 970 claim that timebase has a
frequency of 1/8 that of the CPU clock rate.  So, on a 2.0GHz processor
timebase would be 250MHz.

Which one of these, if any, is correct?

My guess would be to say 250MHz, but my experiments show that a clock
based on timebase, assuming that rate, appears to be about 10% slow.

--
Michal Ostrowski <mostrows at watson.ibm.com>



From benh at kernel.crashing.org  Sun Feb  8 17:45:19 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 08 Feb 2004 17:45:19 +1100
Subject: G5 timebase_frequency
In-Reply-To: <1076193638.10104.3326.camel@brick.watson.ibm.com>
References: <1076193638.10104.3326.camel@brick.watson.ibm.com>
Message-ID: <1076222702.887.81.camel@gaston>


On Sun, 2004-02-08 at 09:40, Michal Ostrowski wrote:
> I've noticed that timebase_frequency on a G5 appears to be defined by OF
> to be 33.3333MHz. OTOH, the specs for the 970 claim that timebase has a
> frequency of 1/8 that of the CPU clock rate.  So, on a 2.0GHz processor
> timebase would be 250MHz.
>
> Which one of these, if any is correct?
>
> My guess would be to say 250MHz, but my experiments show that a clock
> based on timebase, assuming that rate, appears to be about 10% slow.

10% Only ? :)

It's really 33MHz as OF says. AFAIK, the HID0[19] bit is set by the
firmware causing the timebase to be clocked on the rising edge of
the TBEN input, thus the timebase is externally clocked. This allows
a stable timebase when slewing the cpu & bus clocks.
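
To see how far apart the two candidate frequencies are, here is a small
sketch that converts a timebase tick count to nanoseconds for a given
frequency.  The function name is hypothetical; the two frequencies are
the ones quoted in this thread:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

/*
 * Convert timebase ticks to nanoseconds.  Whole seconds and the
 * remainder are handled separately so the multiplication does not
 * overflow for large tick counts.
 */
uint64_t tb_ticks_to_ns(uint64_t ticks, uint64_t tb_hz)
{
	uint64_t sec, rem;

	assert(tb_hz > 0);

	sec = ticks / tb_hz;
	rem = ticks % tb_hz;

	return sec * NSEC_PER_SEC + rem * NSEC_PER_SEC / tb_hz;
}
```

A count of 250,000,000 ticks is one second if the timebase runs at
250MHz, but about 7.5 seconds at the 33.3333MHz reported by OF.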

Ben.




From anton at samba.org  Sun Feb  8 18:46:13 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 8 Feb 2004 18:46:13 +1100
Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
In-Reply-To: <401EDFD1.3010203@austin.ibm.com>
References: <401EDFD1.3010203@austin.ibm.com>
Message-ID: <20040208074613.GB19011@krispykreme>


Hi Olof,

> I've spent some time cleaning up the PCI DMA mapping code in 2.6. The
> patch is large (117KB, 4000 lines), so I won't post it here. It can be
> found at:

Very nice! I've thrown this onto a few machines here and am stressing
it with random IO benchmarks. It's something we desperately needed done.

> I also replaced the old buddy-style allocator for TCE ranges with a
> simpler bitmap allocator. Time and benchmarking will tell if it's
> efficient enough, but it's fairly well abstracted and can easily be
> replaced

Agreed. I suspect (as with our SLB allocation code) we will only know
once the big IO benchmarks have beaten on it. We should get Jose, Rick
and Nancy onto it as soon as possible.

Some things to think about:

- Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
  64bit slot)

- Likelihood of large (multiple page) non SG requests. The e1000 comes
  to mind here, it has an MTU of 16kB so could do a pci_map_single of
  that size.

- Peak TCE usage. After chasing emulex TCE starvation you guys would
  know the figures for this better than I.

- Whether virtual merging makes sense. Virtual merging will place more
  pressure on our TCE allocation code (because we will end up asking for
  much more high order TCE allocations). It's also more complex; I'd
  prefer to avoid it unless we do see a performance advantage.

- We currently allocate a 2GB window in PCI space for TCEs. This is 4MB
  worth of TCE tables. Unfortunately we have to allocate an 8MB window
  on POWER4 boxes because firmware sets up some chip inits to cover the
  TCE region. If we allocate less and let normal memory get into this
  region, our performance grinds to a halt. (It's to do with the way
  TCE coherency is done on POWER4).

  Allocating a 2GB region unconditionally is also wrong, I have a
  nighthawk node that has a 3GB IO hole, and yes there is PCI memory
  allocated at 1GB and above (see below). We get away with it by luck with
  the current code but it's going to hit when we switch to your new code.

  If we allocate 4MB on POWER3, 8MB on POWER4 and check the OF
  properties to get the correct range we should be good.

- False sharing between the CPU and host bridge. We store a few things
  to a TCE cacheline (eg for an SG list) then initiate IO. The IO device
  requests the first address, the host bridge realises it must do a TCE
  lookup. It then caches this cacheline.

  Meantime the cpu is setting up another request. It stores to the same
  cacheline which forces the cacheline in the host bridge to be flushed.
  It still hasn't completed the first sg list, so it has to refetch it.

  I think the answer here is to allocate an SG list within a cacheline
  then move onto the next cacheline for the next request. As suggested
  by davem we should convert the network drivers over to using SG lists.

- TCE table bypass, DAC.
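
The window and table sizes in the list above are related by simple
arithmetic.  A sketch, assuming one 8-byte TCE per 4kB page (the entry
and page sizes are inferred from the 2GB/4MB figures rather than stated
anywhere in this thread, and the names are hypothetical):

```c
#include <stdint.h>

#define TCE_PAGE_SIZE  4096u	/* assumed: one TCE maps one 4kB page */
#define TCE_ENTRY_SIZE 8u	/* assumed: each TCE is 8 bytes */

/* Bytes of TCE table needed to map a DMA window of the given size. */
uint64_t tce_table_bytes(uint64_t window_bytes)
{
	return window_bytes / TCE_PAGE_SIZE * TCE_ENTRY_SIZE;
}
```

Under those assumptions a 2GB window needs 4MB of table, an 8MB table
covers a 4GB window, and a 128MB per-slot window needs 256kB.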

Anton

  Bus  0, device  12, function  0:
    SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 3).
      IRQ 17.
      Master Capable.  Latency=74.  Min Gnt=17.Max Lat=64.
      I/O at 0x7ffc00 [0x7ffcff].
      Non-prefetchable 32 bit memory at 0x40101000 [0x401010ff].
      Non-prefetchable 32 bit memory at 0x40100000 [0x40100fff].



From olof at austin.ibm.com  Mon Feb  9 13:17:33 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Sun, 8 Feb 2004 20:17:33 -0600 (CST)
Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
In-Reply-To: <20040208074613.GB19011@krispykreme>
Message-ID: <Pine.A41.4.44.0402082008001.85324-100000@forte.austin.ibm.com>


Anton,

Thanks for the feedback. Comments to a few of the items are below.


-Olof

On Sun, 8 Feb 2004, Anton Blanchard wrote:

> > I also replaced the old buddy-style allocator for TCE ranges with a
> > simpler bitmap allocator. Time and benchmarking will tell if it's
> > efficient enough, but it's fairly well abstracted and can easily be
> > replaced
>
> Agreed. I suspect (as with our SLB allocation code) we will only know
> once the big IO benchmarks have beaten on it. We should get Jose, Rick
> and Nancy onto it as soon as possible.
>
> Some things to think about:
>
> - Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
>   64bit slot)

Not sure I follow this one. Are you saying we need to guarantee a minimum?
On SMP, we just carve up the space reserved by the PHB in equal chunks per
slot, on LPAR we just use the ibm,dma-window property to size the space.

> - Likelihood of large (multiple page) non SG requests. The e1000 comes
>   to mind here, it has an MTU of 16kB so could do a pci_map_single of
>   that size.

Yes, the behaviour here would be interesting. It'll still only be 4-page
allocations so I don't expect any trouble. Allocations closer to 16 pages
would be more likely to fail due to fragmentation.
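
The page count for a single mapping can be worked out as below.  This is
a sketch assuming 4kB pages; the function name is hypothetical:

```c
#include <assert.h>

#define DMA_PAGE_SIZE 4096ul	/* assumed page size */

/* Number of TCE entries a buffer of len bytes at addr touches. */
unsigned long tce_pages_needed(unsigned long addr, unsigned long len)
{
	unsigned long first, last;

	assert(len > 0);

	first = addr / DMA_PAGE_SIZE;
	last = (addr + len - 1) / DMA_PAGE_SIZE;

	return last - first + 1;
}
```

A page-aligned 16kB buffer needs 4 entries; one that starts in the
middle of a page touches a fifth.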

> - Peak TCE usage. After chasing emulex TCE starvation you guys would
>   know the figures for this better than I.

Good point. I'll follow up on this.

[...]

> - We currently allocate a 2GB window in PCI space for TCEs. This is 4MB
>   worth of TCE tables. Unfortunately we have to allocate an 8MB window
>   on POWER4 boxes because firmware sets up some chip inits to cover the
>   TCE region. If we allocate less and let normal memory get into this
>   region, our performance grinds to a halt. (Its to do with the way
>   TCE coherency is done on POWER4).
>
>   Allocating a 2GB region unconditionally is also wrong, I have a
>   nighthawk node that has a 3GB IO hole, and yes there is PCI memory
>   allocated at 1GB and above (see below). We get away with it by luck with
>   the current code but its going to hit when we switch to your new code.
>
>   If we allocate 4MB on POWER3, 8MB on POWER4 and check the OF
>   properties to get the correct range we should be good.

Can we blindly base this on architecture, i.e. all POWER4-based systems
will be fine with 8MB and the other way around?

> - False sharing between the CPU and host bridge. We store a few things
>   to a TCE cacheline (eg for an SG list) then initiate IO. The IO device
>   requests the first address, the host bridge realises it must do a TCE
>   lookup. It then caches this cacheline.
>
>   Meantime the cpu is setting up another request. It stores to the same
>   cacheline which forces the cacheline in the host bridge to be flushed.
>   It still hasnt completed the first sg list, so it has to refetch it.
>
>   I think the answer here is to allocate an SG list within a cacheline
>   then move onto the next cacheline for the next request. As suggested
>   by davem we should convert the network drivers over to using SG lists.

There's room for improvement here, but I did a simple first implementation
that will try to honor the PHB cachelines when allocating:

The "next" hint will always be bumped up to a new cacheline, and SG list
allocations will use their own hint instead of the next pointer. The
result should be that SG lists are packed in cache lines, while other
allocations should always move between lines.

The downside is when the table gets full, but the load should be evenly
spread between cache lines at least. I.e. while we'll have sharing, it
should get evenly distributed in round-robin fashion. The next-allocation
hint is explicitly NOT updated at free time to accommodate this.
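
A minimal sketch of the hint behaviour described above, under stated
assumptions: a tiny 64-entry table, 16 entries per PHB cacheline, and a
first-fit scan from the hint.  All names and sizes are hypothetical, and
the separate SG-list hint is left out to keep the example short:

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define TBL_ENTRIES      64u	/* hypothetical table size */
#define ENTRIES_PER_LINE 16u	/* hypothetical entries per cacheline */
#define BITS_PER_WORD    (sizeof(unsigned long) * CHAR_BIT)

struct tce_table {
	unsigned long map[(TBL_ENTRIES + BITS_PER_WORD - 1) / BITS_PER_WORD];
	unsigned int hint;	/* where the next search starts */
};

void tce_init(struct tce_table *t)
{
	memset(t, 0, sizeof(*t));
}

static int entry_used(const struct tce_table *t, unsigned int i)
{
	return (t->map[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1ul;
}

static void mark_range(struct tce_table *t, unsigned int start,
		       unsigned int n, int used)
{
	for (unsigned int i = start; i < start + n; i++) {
		unsigned long bit = 1ul << (i % BITS_PER_WORD);

		if (used)
			t->map[i / BITS_PER_WORD] |= bit;
		else
			t->map[i / BITS_PER_WORD] &= ~bit;
	}
}

static int range_is_free(const struct tce_table *t, unsigned int start,
			 unsigned int n)
{
	if (start + n > TBL_ENTRIES)
		return 0;	/* ranges never wrap around the table */
	for (unsigned int i = start; i < start + n; i++)
		if (entry_used(t, i))
			return 0;
	return 1;
}

/* Allocate n contiguous entries; returns first index or -1 if full. */
int tce_alloc(struct tce_table *t, unsigned int n)
{
	assert(n > 0 && n <= TBL_ENTRIES);

	for (unsigned int tries = 0; tries < TBL_ENTRIES; tries++) {
		unsigned int start = (t->hint + tries) % TBL_ENTRIES;

		if (!range_is_free(t, start, n))
			continue;
		mark_range(t, start, n, 1);
		/* Bump the hint to the start of the next cacheline. */
		t->hint = (start + n + ENTRIES_PER_LINE - 1)
			  / ENTRIES_PER_LINE * ENTRIES_PER_LINE % TBL_ENTRIES;
		return (int)start;
	}
	return -1;
}

/* Free a range; the hint is deliberately left alone. */
void tce_free(struct tce_table *t, unsigned int start, unsigned int n)
{
	mark_range(t, start, n, 0);
}
```

Successive allocations land at entries 0, 16, 32 and so on, and freeing
entry 0 does not pull the next allocation back into the first cacheline.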

> - TCE table bypass, DAC.

Yes, table bypass will be useful for machines with less than 2GB memory as
well (G5's in particular). Ben has a ppc_pci_md (or similar) with function
pointers to the pci_*map* functions, we might need something like that for
boot/runtime selection of allocation behaviour.


-Olof




From anton at samba.org  Tue Feb 10 01:47:28 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 10 Feb 2004 01:47:28 +1100
Subject: [2.6] [PATCH] [LARGE] Rewrite/cleanup of PCI DMA mapping code
In-Reply-To: <Pine.A41.4.44.0402082008001.85324-100000@forte.austin.ibm.com>
References: <20040208074613.GB19011@krispykreme> <Pine.A41.4.44.0402082008001.85324-100000@forte.austin.ibm.com>
Message-ID: <20040209144728.GE19011@krispykreme>


> > - Minimum guaranteed TCE table size (128MB for 32bit slot, 256MB for
> >   64bit slot)
>
> Not sure I follow this one. Are you saying we need to guarantee a minimum?
> On SMP, we just carve up the space reserved by the PHB in equal chunks per
> slot, on LPAR we just use the ibm,dma-window property to size the space.

Just looking at worst case scenarios, e.g. how would an emulex survive
with only 256MB of TCE space?

> Yes, the behaviour here would be interesting. It'll still only be 4-page
> allocations so I don't expect any trouble. Allocations closer to 16 pages
> would be more likely to fail due to fragmentation.

Agreed. I haven't seen a device that wanted to do really large allocations.
They might do large consistent allocations but that's not so bad, we can
take our time to allocate those and can in the end fail them (they are
usually during driver init).

The only other thing to keep in mind is we enable physical merging
on scatter gather lists, if we were really unlucky we could end up with
a very large element in the SG list.
>
> Can we blindly base this on architecture, i.e. all POWER4-based systems
> will be fine with 8MB and the other way around?

Assuming we are happy with 2GB on pre POWER4 machines we can go with
that. Then in tce table init we check the OF properties and put a limit
on the POWER4 tce table and perhaps the POWER3/RS64 one (eg on
nighthawk).

> There's room for improvement here, but did a simple first implementation
> that will try to honor the PHB cachelines when allocating:

Yep, I'm just dumping all the ideas that have come up over the last year.
I like simple and would prefer not to complicate the allocator unless we
have to.

Anton



From hollisb at us.ibm.com  Tue Feb 10 09:31:22 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Mon, 9 Feb 2004 16:31:22 -0600
Subject: extreme RTAS printks
Message-ID: <B068E4AF-5B4F-11D8-AA0F-000A95A0560C@us.ibm.com>


I have here a boot log in which 280 of the 450 lines are "RTAS" hex
dumps. This is getting really ridiculous. Regardless of how important
these messages are (very likely they're reporting "hey! this partition
was killed unexpectedly!" which is worthless to me), I would much
rather see English from a daemon than hex in my boot log.

Could we *please* kill printk_log_rtas()? This data is available via
/proc files; there is no need to spam our boot logs with it.

--
Hollis Blanchard
IBM Linux Technology Center



From johnrose at austin.ibm.com  Tue Feb 10 10:18:13 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 09 Feb 2004 17:18:13 -0600
Subject: [PATCH] RPA DLPAR/PCIHP cleanups
Message-ID: <1076368692.18739.199.camel@verve>


Hi Linda, World-

I'd like to push the changes below to Ameslab tomorrow morning, and
generate our final submission against the mainline tree from this.  The
patch:
- Fixes whitespace misuse
- Removes some debug prints (which you removed in your VIO code anyway)
- Fixes a hotplug bug
- Adds a semaphore to the DLPAR interface to protect against multiple
users
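
The locking change follows the usual single-exit pattern: take the lock
once, and route every error path through one label that releases it.  A
userspace sketch of that shape follows.  The toy lock and the three
flag parameters are hypothetical stand-ins for the semaphore and for
the device-tree and slot lookups:

```c
#include <assert.h>
#include <errno.h>

/* Toy lock that only records whether it is held. */
struct toy_lock {
	int held;
};

static void toy_down(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void toy_up(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

static struct toy_lock slot_lock;

/*
 * Hypothetical add operation.  Every failure sets rc and jumps to
 * the exit label, so the lock is dropped on all paths.
 */
int add_slot(int have_node, int already_added, int bus_ok)
{
	int rc = 0;

	toy_down(&slot_lock);

	if (!have_node) {
		rc = -ENODEV;
		goto exit;
	}

	if (already_added) {
		rc = -EINVAL;
		goto exit;
	}

	if (!bus_ok)
		rc = -EIO;
exit:
	toy_up(&slot_lock);
	return rc;
}
```

The early "return -ENODEV" style would leave the lock held on error
paths, which is why those returns become "rc = ...; goto exit;".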

If there are no objections, will push tomorrow.

Thanks-
John

diff -Nru a/drivers/pci/hotplug/rpadlpar_core.c b/drivers/pci/hotplug/rpadlpar_core.c
--- a/drivers/pci/hotplug/rpadlpar_core.c	Mon Feb  9 17:05:49 2004
+++ b/drivers/pci/hotplug/rpadlpar_core.c	Mon Feb  9 17:05:49 2004
@@ -3,7 +3,7 @@
  *
  * John Rose <johnrose at austin.ibm.com>
  * Linda Xie <lxie at us.ibm.com>
- *
+ *
  * October 2003
  *
  * Copyright (C) 2003 IBM.
@@ -16,6 +16,7 @@
 #include <linux/init.h>
 #include <linux/pci.h>
 #include <asm/pci-bridge.h>
+#include <asm/semaphore.h>
 #include "../pci.h"
 #include "rpaphp.h"
 #include "rpadlpar.h"
@@ -23,6 +24,8 @@
 #define MODULE_VERSION "1.0"
 #define MODULE_NAME "rpadlpar_io"

+static DECLARE_MUTEX(rpadlpar_sem);
+
 static inline int is_hotplug_capable(struct device_node *dn)
 {
 	unsigned char *ptr = get_property(dn, "ibm,fw-pci-hot-plug-ctrl", NULL);
@@ -175,7 +178,7 @@
 	struct pci_bus *secondary_bus;

 	if (!bridge_dev) {
-		printk(KERN_ERR "%s: %s() unexpected null device\n",
+		printk(KERN_ERR "%s: %s() unexpected null device\n",
 				MODULE_NAME, __FUNCTION__);
 		return 1;
 	}
@@ -183,7 +186,7 @@
 	secondary_bus = bridge_dev->subordinate;

 	if (unmap_bus_range(secondary_bus)) {
-		printk(KERN_ERR "%s: failed to unmap bus range\n",
+		printk(KERN_ERR "%s: failed to unmap bus range\n",
 				__FUNCTION__);
 		return 1;
 	}
@@ -203,36 +206,47 @@
  * 0			Success
  * -ENODEV		Not a valid drc_name
  * -EINVAL		Slot already added
+ * -ERESTARTSYS		Signalled before obtaining lock
  * -EIO			Internal PCI Error
  */
 int dlpar_add_slot(char *drc_name)
 {
 	struct device_node *dn = find_php_slot_node(drc_name);
 	struct pci_dev *dev;
+	int rc = 0;
+
+	if (down_interruptible(&rpadlpar_sem))
+		return -ERESTARTSYS;

-	if (!dn)
-		return -ENODEV;
+	if (!dn) {
+		rc = -ENODEV;
+		goto exit;
+	}

 	/* Check for existing hotplug slot */
-	if (find_slot(drc_name))
-		return -EINVAL;
+	if (find_slot(drc_name)) {
+		rc = -EINVAL;
+		goto exit;
+	}

 	/* Add pci bus */
 	dev = dlpar_pci_add_bus(dn);
 	if (!dev) {
 		printk(KERN_ERR "%s: unable to add bus %s\n", __FUNCTION__,
 				drc_name);
-		return -EIO;
+		rc = -EIO;
+		goto exit;
 	}

 	/* Add hotplug slot for new bus */
 	if (rpaphp_add_slot(drc_name)) {
 		printk(KERN_ERR "%s: unable to add hotplug slot %s\n",
 				__FUNCTION__, drc_name);
-		return -EIO;
+		rc = -EIO;
 	}
-
-	return 0;
+exit:
+	up(&rpadlpar_sem);
+	return rc;
 }

 /**
@@ -245,6 +259,7 @@
  * 0			Success
  * -ENODEV		Not a valid drc_name
  * -EINVAL		Slot already removed
+ * -ERESTARTSYS		Signalled before obtaining lock
  * -EIO			Internal PCI Error
  */
 int dlpar_remove_slot(char *drc_name)
@@ -252,35 +267,46 @@
 	struct device_node *dn = find_php_slot_node(drc_name);
 	struct slot *slot;
 	struct pci_dev *bridge_dev;
+	int rc = 0;
+
+	if (down_interruptible(&rpadlpar_sem))
+		return -ERESTARTSYS;

-	if (!dn)
-		return -ENODEV;
+	if (!dn) {
+		rc = -ENODEV;
+		goto exit;
+	}

-	if (!(slot = find_slot(drc_name)))
-		return -EINVAL;
+	if (!(slot = find_slot(drc_name))) {
+		rc = -EINVAL;
+		goto exit;
+	}

 	bridge_dev = slot->bridge;
 	if (!bridge_dev) {
 		printk(KERN_ERR "%s: %s(): unexpected null bridge device\n",
 				MODULE_NAME, __FUNCTION__);
-		return -EIO;
+		rc = -EIO;
+		goto exit;
 	}

 	/* Remove hotplug slot */
 	if (rpaphp_remove_slot(slot)) {
 		printk(KERN_ERR "%s: %s(): unable to remove hotplug slot %s\n",
 				MODULE_NAME, __FUNCTION__, drc_name);
-		return -EIO;
+		rc = -EIO;
+		goto exit;
 	}

 	/* Remove pci bus */
 	if (dlpar_pci_remove_bus(bridge_dev)) {
 		printk(KERN_ERR "%s: %s() unable to remove pci bus %s\n",
 				MODULE_NAME, __FUNCTION__, drc_name);
-		return -EIO;
+		rc = -EIO;
 	}
-
-	return 0;
+exit:
+	up(&rpadlpar_sem);
+	return rc;
 }

 static inline int is_dlpar_capable(void)
diff -Nru a/drivers/pci/hotplug/rpadlpar_sysfs.c b/drivers/pci/hotplug/rpadlpar_sysfs.c
--- a/drivers/pci/hotplug/rpadlpar_sysfs.c	Mon Feb  9 17:05:49 2004
+++ b/drivers/pci/hotplug/rpadlpar_sysfs.c	Mon Feb  9 17:05:49 2004
@@ -116,7 +116,7 @@
 static void dlpar_io_release(struct kobject *kobj)
 {
 	/* noop */
-	return;
+	return;
 }

 struct kobj_type ktype_dlpar_io = {
diff -Nru a/drivers/pci/hotplug/rpaphp.h b/drivers/pci/hotplug/rpaphp.h
--- a/drivers/pci/hotplug/rpaphp.h	Mon Feb  9 17:05:49 2004
+++ b/drivers/pci/hotplug/rpaphp.h	Mon Feb  9 17:05:49 2004
@@ -47,8 +47,8 @@
 #define ERR_SENSE_USE -9002     /* No DR operation will succeed, slot is unusable  */

 /* Sensor values from rtas_get-sensor */
-#define EMPTY           0       /* No card in slot */
-#define PRESENT         1       /* Card in slot */
+#define EMPTY	0       /* No card in slot */
+#define PRESENT	1       /* Card in slot */

 #if !defined(CONFIG_HOTPLUG_PCI_MODULE)
 	#define MY_NAME "rpaphp"
@@ -81,11 +81,11 @@
  */
 struct slot {
 	u32	magic;
-        int     state;
-        u32     index;
-        u32     type;
-        u32     power_domain;
-        char    *name;
+	int     state;
+	u32     index;
+	u32     type;
+	u32     power_domain;
+	char    *name;
 	struct	device_node *dn;/* slot's device_node in OFDT		*/
 				/* dn has phb info			*/
 	struct	pci_dev	*bridge;/* slot's pci_dev in pci_devices	*/
diff -Nru a/drivers/pci/hotplug/rpaphp_core.c b/drivers/pci/hotplug/rpaphp_core.c
--- a/drivers/pci/hotplug/rpaphp_core.c	Mon Feb  9 17:05:49 2004
+++ b/drivers/pci/hotplug/rpaphp_core.c	Mon Feb  9 17:05:49 2004
@@ -38,8 +38,8 @@
 #include "pci_hotplug.h"


-static int debug;
-static struct semaphore rpaphp_sem;
+static int debug;
+static struct semaphore rpaphp_sem;
 static int rpaphp_debug;
 static LIST_HEAD (rpaphp_slot_head);
 static int num_slots = 0;
@@ -79,13 +79,13 @@
 {
 	int rc;

-        rc = rtas_get_sensor(DR_ENTITY_SENSE, index, state);
-
-        if (rc) {
+	rc = rtas_get_sensor(DR_ENTITY_SENSE, index, state);
+
+	if (rc) {
 		if (rc ==  NEED_POWER || rc == PWR_ONLY) {
-			dbg("%s: slot must be power up to get sensor-state\n",
+			dbg("%s: slot must be power up to get sensor-state\n",
 				__FUNCTION__);
-		} else if (rc == ERR_SENSE_USE)
+		} else if (rc == ERR_SENSE_USE)
 			info("%s: slot is unusable\n", __FUNCTION__);
 		   else err("%s failed to get sensor state\n", __FUNCTION__);
 	}
@@ -95,8 +95,8 @@
 static struct pci_dev *rpaphp_find_bridge_pdev(struct slot *slot)
 {
 	struct pci_dev		*retval_dev = NULL;
-
-	retval_dev = rpaphp_find_pci_dev(slot->dn);
+
+	retval_dev = rpaphp_find_pci_dev(slot->dn);

 	return retval_dev;
 }
@@ -105,7 +105,7 @@
 {
 	struct pci_dev * retval_dev = NULL;

-	retval_dev = rpaphp_find_pci_dev(slot->dn->child);
+	retval_dev = rpaphp_find_pci_dev(slot->dn->child);

 	return retval_dev;
 }
@@ -118,7 +118,7 @@
 		dbg("%s - slot == NULL\n", function);
 		return -1;
 	}
-
+
 	if (!slot->hotplug_slot) {
 		dbg("%s - slot->hotplug_slot == NULL!\n", function);
 		return -1;
@@ -137,7 +137,7 @@

 	slot = (struct slot *)hotplug_slot->private;
 	if (slot_paranoia_check(slot, function))
-                return NULL;
+		return NULL;
 	return slot;
 }

@@ -145,16 +145,12 @@
 {
 	int	rc;

-	dbg("Entry %s: status=%d\n", __FUNCTION__, status);
-
 	/* status: LED_OFF or LED_ON */
 	rc = rtas_set_indicator(DR_INDICATOR, slot->index, status);
 	if (rc)
-		err("slot(%s) set attention-status(%d) failed! rc=0x%x\n",
-			slot->name, status, rc);
-
-	dbg("Exit %s, rc=0x%x\n", __FUNCTION__, rc);
-
+		err("slot(%s) set attention-status(%d) failed! rc=0x%x\n",
+			slot->name, status, rc);
+
 	return rc;
 }

@@ -162,12 +158,12 @@
 {
 	int	rc;

-        rc = rtas_get_power_level(slot->power_domain, (int *)value);
-        if (rc)
-                err("failed to get power-level for slot(%s), rc=0x%x\n",
+	rc = rtas_get_power_level(slot->power_domain, (int *)value);
+	if (rc)
+		err("failed to get power-level for slot(%s), rc=0x%x\n",
 			slot->name, rc);
-
-        return rc;
+
+	return rc;
 }

 static int rpaphp_get_attention_status(struct slot *slot)
@@ -191,8 +187,6 @@
 	if (slot == NULL)
 		return -ENODEV;

-	dbg("%s - Entry: slot[%s] value[0x%x]\n",
-		__FUNCTION__, slot->name, value);
 	down(&rpaphp_sem);
 	switch (value) {
 		case 0:
@@ -213,8 +207,7 @@

 	}
 	up(&rpaphp_sem);
-
-	dbg("%s - Exit: rc[%d]\n",  __FUNCTION__, retval);
+
 	return retval;
 }

@@ -229,7 +222,7 @@
 {
 	int retval;
 	struct slot *slot = get_slot(hotplug_slot, __FUNCTION__);
-
+
 	if (slot == NULL)
 		return -ENODEV;

@@ -254,21 +247,16 @@
 		return -ENODEV;


-	dbg("%s - Entry: slot[%s]\n",
-		__FUNCTION__, slot->name);
-
 	down(&rpaphp_sem);
 	*value = rpaphp_get_attention_status(slot);
 	up(&rpaphp_sem);

-	dbg("%s - Exit: value[0x%x] rc[%d]\n",
-		__FUNCTION__, *value, retval);
 	return retval;
 }

 /*
  * get_adapter_status - get  the status of a slot
- *
+ *
  * 0-- slot is empty
  * 1-- adapter is configured
  * 2-- adapter is not configured
@@ -278,32 +266,30 @@
 {
 	int	state, rc;

-	dbg("Entry %s\n", __FUNCTION__);
+	*value 		  = NOT_VALID;

-	*value 		  = NOT_VALID;
+	rc = rpaphp_get_sensor_state(slot->index, &state);

-	rc = rpaphp_get_sensor_state(slot->index, &state);
-
-	if (rc)
-		goto exit;
+	if (rc)
+		goto exit;

 	if (state == PRESENT) {
 		dbg("slot is occupied\n");
-
+
 		if (!is_init) /* at run-time slot->state can be changed by */
 			  /* config/unconfig adapter	 		   */
 			*value = slot->state;
 		else {
-		if (!slot->dn->child)
-			dbg("%s: %s is not valid OFDT node\n",
+		if (!slot->dn->child)
+			dbg("%s: %s is not valid OFDT node\n",
 				__FUNCTION__, slot->dn->full_name);
-		else
-			if (rpaphp_find_pci_dev(slot->dn->child))
+		else
+			if (rpaphp_find_pci_dev(slot->dn->child))
 				*value = CONFIGURED;
 			else {
 				dbg("%s: can't find pdev of adapter in slot[%s]\n",
 					__FUNCTION__, slot->name);
-				*value = NOT_CONFIGURED;
+				*value = NOT_CONFIGURED;
 				}
 		}
 	}
@@ -312,10 +298,8 @@
 		dbg("slot is empty\n");
 			*value = state;
 		}
-
-exit:    dbg("Exit %s slot[%s] has adapter-status %d rtas call's rc=0x%x\n",
-		__FUNCTION__, slot->name, *value, rc);

+exit:
 	return rc;
 }

@@ -344,14 +328,11 @@

 	if (slot == NULL)
 		return -ENODEV;
-
-	dbg("%s - Entry: slot->name[%s] slot->type[%d]\n",
-		__FUNCTION__, slot->name, slot->type);

-	down(&rpaphp_sem);
+	down(&rpaphp_sem);

 	switch (slot->type) {
-		case 1:
+		case 1:
 		case 2:
 		case 3:
 		case 4:
@@ -378,10 +359,10 @@
 		default:
 			*value = PCI_SPEED_UNKNOWN;
 			break;
-
+
 	}

-	up(&rpaphp_sem);
+	up(&rpaphp_sem);

 	return 0;
 }
@@ -400,7 +381,7 @@
 	return 0;
 }

-/*
+/*
  * rpaphp_validate_slot - make sure the name of the slot matches
  * 				the location code , if the slots is not
  *				empty.
@@ -409,11 +390,8 @@
 {
 	struct device_node	*dn;
 	int			retval = 0;
-
-	dbg("Entry %s: (name: %s index: 0x%x\n",
-		__FUNCTION__, slot_name, slot_index);

-	for(dn = find_all_nodes(); dn; dn = dn->next) {
+	for(dn = find_all_nodes(); dn; dn = dn->next) {

 		int 		*index;
 		unsigned char	*loc_code;
@@ -423,30 +401,25 @@
 		if (index && *index == slot_index) {
 		char *slash, tmp_str[128];

-			loc_code = get_property(dn, "ibm,loc-code", NULL);
+			loc_code = get_property(dn, "ibm,loc-code", NULL);
 		if (!loc_code) {
 			retval = -1;
 			goto exit;
 		}

-		dbg("%s: name=%s loc-code=%s index=0x%x\n",
-			__FUNCTION__, slot_name, loc_code, slot_index);
-
 		strcpy(tmp_str, loc_code);
 		slash = strrchr(tmp_str, '/');
 		if (slash) {
 			*slash = '\0';
 		}
-		if (strcmp(slot_name, tmp_str))
+		if (strcmp(slot_name, tmp_str))
 			retval = -1;
-		goto exit;
+		goto exit;
 		}

 	}

 exit:
-	dbg("Exit %s with retval=%d\n", __FUNCTION__, retval);
-
 	return retval;
 }

@@ -454,8 +427,6 @@
 static void rpaphp_fixup_new_devices(struct pci_bus *bus)
 {
 	struct pci_dev *dev;
-
-	dbg("Enter rpaphp_fixup_new_devices()\n");

 	list_for_each_entry(dev, &bus->devices, bus_list) {
 	/*
@@ -467,30 +438,26 @@
 			pcibios_fixup_device_resources(dev, bus);
 			pci_read_irq_line(dev);
 			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
-                        	struct resource *r = &dev->resource[i];
-                        	if (r->parent || !r->start || !r->flags)
-                                	continue;
-                        	rpaphp_claim_resource(dev, i);
-                	}
-
+				struct resource *r = &dev->resource[i];
+				if (r->parent || !r->start || !r->flags)
+					continue;
+				rpaphp_claim_resource(dev, i);
+			}
 		}
 	}
 }

-static struct pci_dev *rpaphp_config_adapter(struct slot *slot)
+static struct pci_dev *rpaphp_config_adapter(struct slot *slot)
 {
 	struct pci_bus 		*pci_bus;
 	struct device_node	*dn;
 	int 			num;
 	struct pci_dev		*dev = NULL;

-	dbg("Entry %s: slot[%s]\n",
-		__FUNCTION__, slot->name);
-
 	if (slot->bridge) {
-
+
 		pci_bus = slot->bridge->subordinate;
-
+
 		if (!pci_bus) {
 			err("%s: can't find bus structure\n", __FUNCTION__);
 			goto exit;
@@ -498,14 +465,12 @@

 		for (dn = slot->dn->child; dn; dn = dn->sibling) {
 			dbg("child dn's devfn=[%x]\n", dn->devfn);
-				num = pci_scan_slot(pci_bus,
+				num = pci_scan_slot(pci_bus,
 				PCI_DEVFN(PCI_SLOT(dn->devfn),  0));

 				dbg("pci_scan_slot return num=%d\n", num);

 			if (num) {
-				dbg("%s: calling rpaphp_fixup_new_devices()\n",
-					__FUNCTION__);
 				rpaphp_fixup_new_devices(pci_bus);
 				pci_bus_add_devices(pci_bus);
 			}
@@ -518,42 +483,37 @@
 		err("slot doesn't have pci_dev structure\n");
 		dev = NULL;
 		goto exit;
-	}
+	}

-exit:
+exit:
 	dbg("Exit %s: pci_dev %s\n", __FUNCTION__, dev? "found":"not found");

 	return dev;
 }

-static int rpaphp_unconfig_adapter(struct slot *slot)
+static int rpaphp_unconfig_adapter(struct slot *slot)
 {
 	int			retval = 0;

-	dbg("Entry %s: slot[%s]\n",
-		__FUNCTION__, slot->name);
 	if (!slot->dev) {
 		info("%s: no card in slot[%s]\n",
 			__FUNCTION__, slot->name);

 		retval = -EINVAL;
-		goto exit;
+		goto exit;
 	}

+	/* remove the device from the pci core */
+	pci_remove_bus_device(slot->dev);

-        /* remove the device from the pci core */
-        pci_remove_bus_device(slot->dev);
+	pci_dev_put(slot->dev);
+	slot->state = NOT_CONFIGURED;

-        pci_dev_put(slot->dev);
-        slot->state = NOT_CONFIGURED;
-
 	dbg("%s: adapter in slot[%s] unconfigured.\n", __FUNCTION__, slot->name);
-
-exit:
-	dbg("Exit %s, rc=0x%x\n", __FUNCTION__, retval);

+exit:
 	return retval;
-
+
 }

 /* free up the memory user be a slot */
@@ -561,39 +521,32 @@
 static void rpaphp_release_slot(struct hotplug_slot *hotplug_slot)
 {
 	struct slot *slot = get_slot(hotplug_slot, __FUNCTION__);
-
+
 	if (slot == NULL)
 		return;

-	dbg("%s - Entry: slot[%s]\n",
-		__FUNCTION__, slot->name);
 	kfree(slot->hotplug_slot->info);
 	kfree(slot->hotplug_slot->name);
 	kfree(slot->hotplug_slot);
 	pci_dev_put(slot->bridge);
 	pci_dev_put(slot->dev);
 	kfree(slot);
-	dbg("%s - Exit\n", __FUNCTION__);
 }

 int rpaphp_remove_slot(struct slot *slot)
 {
 	int retval = 0;

-	dbg("%s - Entry: slot[%s]\n",
-		__FUNCTION__, slot->name);
-
   	sysfs_remove_link(slot->hotplug_slot->kobj.parent,
-                          slot->bridge->slot_name);
-
+			slot->bridge->slot_name);
+
 	list_del(&slot->rpaphp_slot_list);
 	retval = pci_hp_deregister(slot->hotplug_slot);
 	if (retval)
 		err("Problem unregistering a slot %s\n", slot->name);
 	num_slots--;

-	dbg("%s - Exit: rc[%d]\n", __FUNCTION__, retval);
-	return retval;
+	return retval;
 }

 static int is_php_dn(struct device_node *dn, int **indexes,  int **names, int **types, int **power_domains)
@@ -604,7 +557,7 @@

 	/* &names[1] contains NULL terminated slot names */
 	*names = (int *)get_property(dn, "ibm,drc-names", NULL);
-	if (!*names)
+	if (!*names)
 		return(0);

 	/* &types[1] contains NULL terminated slot types */
@@ -615,7 +568,7 @@
 	/* power_domains[1...n] are the slot power domains */
 	*power_domains = (int *)get_property(dn,
 		"ibm,drc-power-domains", NULL);
-	if (!*power_domains)
+	if (!*power_domains)
 		return(0);

 	if (!get_property(dn, "ibm,fw-pci-hot-plug-ctrl", NULL))
@@ -629,17 +582,17 @@
 	struct slot *slot;

 	slot = kmalloc(sizeof(struct slot), GFP_KERNEL);
-	if (!slot)
+	if (!slot)
 		return (NULL);
 	memset(slot, 0, sizeof(struct slot));
-	slot->hotplug_slot = kmalloc(sizeof(struct hotplug_slot),
+	slot->hotplug_slot = kmalloc(sizeof(struct hotplug_slot),
 		GFP_KERNEL);
 	if (!slot->hotplug_slot) {
 		kfree(slot);
 		return (NULL);
-        }
+	}
 	memset(slot->hotplug_slot, 0, sizeof(struct hotplug_slot));
-	slot->hotplug_slot->info = kmalloc(sizeof(struct hotplug_slot_info),
+	slot->hotplug_slot->info = kmalloc(sizeof(struct hotplug_slot_info),
 		GFP_KERNEL);
 	if (!slot->hotplug_slot->info) {
 		kfree(slot->hotplug_slot);
@@ -659,17 +612,14 @@

 static int setup_hotplug_slot_info(struct slot *slot)
 {
-	dbg("%s Initilize the slot info structure ...\n",
-		__FUNCTION__);
-
-	rpaphp_get_power_status(slot,
-		&slot->hotplug_slot->info->power_status);
+	rpaphp_get_power_status(slot,
+		&slot->hotplug_slot->info->power_status);

 	rpaphp_get_adapter_status(slot, 1,
-		&slot->hotplug_slot->info->adapter_status);
+		&slot->hotplug_slot->info->adapter_status);

 	if (slot->hotplug_slot->info->adapter_status == NOT_VALID) {
-		dbg("%s: NOT_VALID: skip dn->full_name=%s\n",
+		dbg("%s: NOT_VALID: skip dn->full_name=%s\n",
 			__FUNCTION__, slot->dn->full_name);
 		    kfree(slot->hotplug_slot->info);
 		    kfree(slot->hotplug_slot->name);
@@ -682,7 +632,7 @@

 static int register_slot(struct slot *slot)
 {
-	int retval;
+	int retval;

 	retval = pci_hp_register(slot->hotplug_slot);
 	if (retval) {
@@ -692,7 +642,7 @@
 	}
 	/* create symlink between slot->name and it's bus_id */
 	dbg("%s: sysfs_create_link: %s --> %s\n", __FUNCTION__,
-		slot->bridge->slot_name, slot->name);
+		slot->bridge->slot_name, slot->name);
 	retval = sysfs_create_link(slot->hotplug_slot->kobj.parent,
 			&slot->hotplug_slot->kobj,
 			slot->bridge->slot_name);
@@ -702,12 +652,12 @@
 		return (retval);
 	}
 	/* add slot to our internal list */
-	dbg("%s adding slot[%s] to rpaphp_slot_list\n",
+	dbg("%s adding slot[%s] to rpaphp_slot_list\n",
 		__FUNCTION__, slot->name);

 	list_add(&slot->rpaphp_slot_list, &rpaphp_slot_head);

-	info("Slot [%s] (bus_id=%s) registered\n",
+	info("Slot [%s] (bus_id=%s) registered\n",
 		slot->name, slot->bridge->slot_name);
 	return (0);
 }
@@ -721,38 +671,33 @@
 	struct slot		*slot;
 	int 			retval = 0;
 	int 			i;
-        struct device_node 	*dn;
-        int 			*indexes, *names, *types, *power_domains;
-        char 			*name, *type;
-
-	dbg("Entry %s: %s\n", __FUNCTION__,
-			slot_name? slot_name: "init");
+	struct device_node 	*dn;
+	int 			*indexes, *names, *types, *power_domains;
+	char 			*name, *type;

 	for (dn = find_all_nodes(); dn; dn = dn->next) {

 		if (dn->name != 0 && strcmp(dn->name, "pci") == 0)	{
 			if (!is_php_dn(dn, &indexes, &names, &types, &power_domains))
 				continue;
-
+
 			dbg("%s : found device_node in OFDT full_name=%s, name=%s\n",
 				__FUNCTION__, dn->full_name, dn->name);

 			name = (char *)&names[1];
 			type = (char *)&types[1];
-
-			dbg("%s: indexes=%d\n", __FUNCTION__, indexes[0]);

-			for (i = 0; i < indexes[0];
-				i++,
+			for (i = 0; i < indexes[0];
+				i++,
 				name += (strlen(name) + 1),
 				type += (strlen(type) + 1)) {

 				dbg("%s: name[%s] index[%x]\n",
 					__FUNCTION__, name, indexes[i+1]);

-				if (slot_name && strcmp(slot_name, name))
+				if (slot_name && strcmp(slot_name, name))
 					continue;
-
+
 				if (rpaphp_validate_slot(name, indexes[i + 1])) {
 					dbg("%s: slot(%s, 0x%x) is invalid.\n",
 						__FUNCTION__, name, indexes[i+ 1]);
@@ -767,7 +712,7 @@
 				slot->name = slot->hotplug_slot->name;
 				slot->index = indexes[i + 1];
 				strcpy(slot->name, name);
-				slot->type = simple_strtoul(type, NULL, 10);
+				slot->type = simple_strtoul(type, NULL, 10);
 				if (slot->type < 1  || slot->type > 16)
 					slot->type = 0;

@@ -779,7 +724,7 @@
 				slot->dn = dn;

 				/*
-			 	* Initilize the slot info structure with some known
+			 	* Initilize the slot info structure with some known
 			 	* good values.
 			 	*/
 				if (setup_hotplug_slot_info(slot))
@@ -787,15 +732,15 @@

 				slot->bridge = rpaphp_find_bridge_pdev(slot);
 				if (!slot->bridge && slot_name) { /* slot being added doesn't have pci_dev yet*/
-					dbg("%s: no pci_dev for bridge dn %s\n",
+					dbg("%s: no pci_dev for bridge dn %s\n",
 							__FUNCTION__, slot_name);
-					    kfree(slot->hotplug_slot->info);
-					    kfree(slot->hotplug_slot->name);
-					    kfree(slot->hotplug_slot);
-					    kfree(slot);
+					kfree(slot->hotplug_slot->info);
+					kfree(slot->hotplug_slot->name);
+					kfree(slot->hotplug_slot);
+					kfree(slot);
 					continue;
 				}
-
+
 				/* find slot's pci_dev if it's not empty*/
 				if (slot->hotplug_slot->info->adapter_status == EMPTY) {
 					slot->state = EMPTY;  /* slot is empty */
@@ -812,49 +757,35 @@
 						continue;

 					}
-
-					slot->dev = rpaphp_find_adapter_pdev(slot);
-
-					if (!slot->dev && slot_name) {
-						 /* adapter being added doesn't have pci_dev yet */
-						slot->dev = rpaphp_config_adapter(slot);
-						if (!slot->dev) {
-							err("%s: add new adapter device for slot[%s] failed\n",
-							__FUNCTION__, slot->name);
-							kfree(slot->hotplug_slot->info);
-							kfree(slot->hotplug_slot->name);
-							kfree(slot->hotplug_slot);
-							kfree(slot);
-							pci_dev_put(slot->bridge);
-							continue;
-
-						}
-					}

+					slot->dev = rpaphp_find_adapter_pdev(slot);
 					if(slot->dev) {
 						slot->state = CONFIGURED;
 						pci_dev_get(slot->dev);
 					}
-					else
+					else {
+						/* DLPAR add as opposed to
+						 * boot time */
 						slot->state = NOT_CONFIGURED;
+					}
 				}
 				dbg("%s registering slot:path[%s] index[%x], name[%s] pdomain[%x] type[%d]\n",
-					__FUNCTION__, dn->full_name, slot->index, slot->name,
-					slot->power_domain, slot->type);
+					__FUNCTION__, dn->full_name, slot->index, slot->name,
+					slot->power_domain, slot->type);

 				if ((retval = register_slot(slot)))
 					goto exit;

 				num_slots++;
-
-				if (slot_name)
+
+				if (slot_name)
 					goto exit;

 			}/* for indexes */
 		}/* "pci" */
 	}/* find_all_nodes */
 exit:
-	dbg("%s - Exit: num_slots=%d rc[%d]\n",
+	dbg("%s - Exit: num_slots=%d rc[%d]\n",
 		__FUNCTION__, num_slots, retval);
 	return retval;
 }
@@ -867,12 +798,8 @@
 {
 	int 			retval = 0;

-	dbg("Entry %s\n", __FUNCTION__);
-
 	retval = rpaphp_add_slot(NULL);

-	dbg("Exit %s with retval=%d\n", __FUNCTION__, retval);
-
 	return retval;
 }

@@ -881,17 +808,12 @@
 {
 	int 			retval = 0;

-	dbg("Entry %s\n", __FUNCTION__);
-
 	init_MUTEX(&rpaphp_sem);
-
+
 	/* initialize internal data structure etc. */
 	retval = init_slots();
 	if (!num_slots)
 		retval = -ENODEV;
-
-	dbg("Exit %s with retval=%d, num_slots=%d\n",
-		__FUNCTION__, retval, num_slots);

 	return retval;
 }
@@ -904,12 +826,12 @@
 	/*
 	 * Unregister all of our slots with the pci_hotplug subsystem,
 	 * and free up all memory that we had allocated.
-	 * memory will be freed in release_slot callback.
+	 * memory will be freed in release_slot callback.
 	 */

 	list_for_each_safe (tmp, n, &rpaphp_slot_head) {
 		slot = list_entry(tmp, struct slot, rpaphp_slot_list);
-		sysfs_remove_link(slot->hotplug_slot->kobj.parent,
+		sysfs_remove_link(slot->hotplug_slot->kobj.parent,
 			slot->bridge->slot_name);
 		list_del(&slot->rpaphp_slot_list);
 		pci_hp_deregister(slot->hotplug_slot);
@@ -923,7 +845,6 @@
 {
 	int retval = 0;

-	dbg("Entry %s\n", __FUNCTION__);
 	info(DRIVER_DESC " version: " DRIVER_VERSION "\n");

 	rpaphp_debug = debug;
@@ -931,7 +852,6 @@
 	/* read all the PRA info from the system */
 	retval = init_rpa();

-	dbg("Exit %s with retval=%d\n", __FUNCTION__, retval);
 	return retval;
 }

@@ -951,21 +871,24 @@
 	if (slot == NULL)
 		return -ENODEV;

-	dbg("%s - Entry: slot[%s]\n",
-		__FUNCTION__, slot->name);
-
 	dbg("ENABLING SLOT %s\n", slot->name);

 	down(&rpaphp_sem);

+	if (slot->state == CONFIGURED) {
+		dbg("%s: %s is already enabled\n",
+			__FUNCTION__, slot->name);
+		goto exit;
+	}
+
-	retval = rpaphp_get_sensor_state(slot->index, &state);
-
-	if (retval)
-		goto exit;
+	retval = rpaphp_get_sensor_state(slot->index, &state);
+
+	if (retval)
+		goto exit;

 	dbg("%s: sensor state[%d]\n", __FUNCTION__, state);

-	/* if slot is not empty, enable the adapter */
+	/* if slot is not empty, enable the adapter */
 	if (state == PRESENT) {
 		dbg("%s : slot[%s] is occupid.\n", __FUNCTION__, slot->name);

@@ -984,7 +907,7 @@
 		}

 	}
-	else if (state == EMPTY) {
+	else if (state == EMPTY) {
 		dbg("%s : slot[%s] is empty\n", __FUNCTION__, slot->name);
 		slot->state = EMPTY;
 	}
@@ -993,30 +916,26 @@
 		slot->state = NOT_VALID;
 		retval = -EINVAL;
 	}
-
-exit:
+
+exit:
 	if (slot->state != NOT_VALID)
 		rpaphp_set_attention_status(slot, LED_ON);
 	else
 		rpaphp_set_attention_status(slot, LED_ID);

 	up(&rpaphp_sem);
-	dbg("%s - Exit: rc[%d]\n",  __FUNCTION__, retval);
-
-        return retval;
+
+	return retval;
 }

 static int disable_slot(struct hotplug_slot *hotplug_slot)
 {
 	int	retval;
 	struct slot *slot = get_slot(hotplug_slot, __FUNCTION__);
-

 	if (slot == NULL)
 		return -ENODEV;
-
-	dbg("%s - Entry: slot[%s]\n",
-		__FUNCTION__, slot->name);
+
 	dbg("DISABLING SLOT %s\n", slot->name);

 	down(&rpaphp_sem);
@@ -1024,13 +943,12 @@
 	rpaphp_set_attention_status(slot, LED_ID);

 	retval = rpaphp_unconfig_adapter(slot);
-
+
 	rpaphp_set_attention_status(slot, LED_OFF);

 	up(&rpaphp_sem);

-	dbg("%s - Exit: rc[%d]\n",  __FUNCTION__, retval);
-        return retval;
+	return retval;
 }

 module_init(rpaphp_init);
diff -Nru a/drivers/pci/hotplug/rpaphp_pci.c b/drivers/pci/hotplug/rpaphp_pci.c
--- a/drivers/pci/hotplug/rpaphp_pci.c	Mon Feb  9 17:05:49 2004
+++ b/drivers/pci/hotplug/rpaphp_pci.c	Mon Feb  9 17:05:49 2004
@@ -32,12 +32,12 @@
 	struct pci_dev		*retval_dev = NULL, *dev = NULL;

 	while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
-		if(!dev->bus)
+		if(!dev->bus)
 			continue;
-
-		if (dev->devfn != dn->devfn)
+
+		if (dev->devfn != dn->devfn)
 			continue;
-
+
 		if (dn->phb->global_number == pci_domain_nr(dev->bus) &&
 		    dn->busno == dev->bus->number) {
 			retval_dev = dev;
@@ -46,9 +46,9 @@
 	}

 	return retval_dev;
-
+
 }
-
+
 int rpaphp_claim_resource(struct pci_dev *dev, int resource)
 {
 	struct resource *res = &dev->resource[resource];
@@ -63,9 +63,9 @@

 	if (err) {
 		err("PCI: %s region %d of %s %s [%lx:%lx]\n",
-		       root ? "Address space collision on" :
-			      "No parent found for",
-		       resource, dtype, pci_name(dev), res->start, res->end);
+			root ? "Address space collision on" :
+			"No parent found for",
+			resource, dtype, pci_name(dev), res->start, res->end);
 	}

 	return err;
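
The locking added to dlpar_add_slot() and dlpar_remove_slot() in the patch
above follows a single-exit pattern: take the semaphore once, record the
result in rc, and release the semaphore at one exit label. A minimal
userspace sketch of that pattern follows. Everything prefixed toy_ is
hypothetical and stands in for the kernel primitives; it is an illustration
under those assumptions, not the driver code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a semaphore used as a mutex: 1 = free, 0 = held. */
struct toy_sem {
	int count;
};

static struct toy_sem toy_dlpar_sem = { 1 };

/* Hypothetical slot table; the real code walks the device tree instead. */
static const char *toy_slot_names[] = { "slot-a", "slot-b" };
static int toy_slot_added[2];

static int toy_down(struct toy_sem *sem)
{
	assert(sem->count == 1);	/* single-threaded sketch: must be free */
	sem->count--;
	return 0;	/* down_interruptible() could return nonzero here */
}

static void toy_up(struct toy_sem *sem)
{
	sem->count++;
}

/* Returns 0, -ENODEV (unknown name) or -EINVAL (already added). */
int toy_add_slot(const char *name)
{
	size_t i, n = sizeof(toy_slot_names) / sizeof(toy_slot_names[0]);
	int rc = 0;

	if (toy_down(&toy_dlpar_sem))
		return -EINTR;	/* lock not taken, so no up() on this path */

	for (i = 0; i < n; i++)
		if (strcmp(name, toy_slot_names[i]) == 0)
			break;
	if (i == n) {
		rc = -ENODEV;
		goto exit;
	}
	if (toy_slot_added[i]) {
		rc = -EINVAL;
		goto exit;
	}
	toy_slot_added[i] = 1;
exit:
	toy_up(&toy_dlpar_sem);	/* the only release point */
	return rc;
}
```

The pattern only stays balanced if every "goto exit" is reached after the
lock has been taken; a jump to the label from before the down would release
a semaphore that was never acquired.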




From boutcher at us.ibm.com  Tue Feb 10 14:01:55 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Mon, 9 Feb 2004 21:01:55 -0600
Subject: extreme RTAS printks
In-Reply-To: <B068E4AF-5B4F-11D8-AA0F-000A95A0560C@us.ibm.com>
Message-ID: <OF7F83BCA4.E931E8FB-ON86256E36.00108C31-86256E36.0010A7E3@us.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/09/2004 04:31:22 PM:
> Could we *please* kill printk_log_rtas()? This data is available via
> /proc files; there is no need to spam our boot logs with it.

Agreed!!!!!  Or at least provide the decoder ring so we (and later, end
users) know what they say!

Dave Boutcher
IBM Linux Technology Center



From olof at austin.ibm.com  Tue Feb 10 14:59:04 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Mon, 9 Feb 2004 21:59:04 -0600 (CST)
Subject: extreme RTAS printks
In-Reply-To: <B068E4AF-5B4F-11D8-AA0F-000A95A0560C@us.ibm.com>
Message-ID: <Pine.A41.4.44.0402092157030.27856-100000@forte.austin.ibm.com>


On Mon, 9 Feb 2004, Hollis Blanchard wrote:

> Could we *please* kill printk_log_rtas()? This data is available via
> /proc files; there is no need to spam our boot logs with it.

I think killing it completely is a bad idea. There are times when a
machine has bad hardware, but good enough to boot halfway up. Getting the
error messages then could be very helpful. Likewise, at runtime the RTAS
messages are also useful since they'd show up on the console (and in the
dmesg output in kdb).

But, as you said, at boot time there's normally limited use for them. If
they're killed, I want to see a command line option to enable them if
needed. I also don't think that post-boot output should be removed at all.
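
A rough userspace sketch of what such a command line switch could look
like. The option name "rtasmsgs=" and the function name are assumptions made
up for illustration, and the parsing is done on a plain string rather than
through the kernel's own boot-parameter hooks.

```c
#include <assert.h>
#include <string.h>

/*
 * Scan a space-separated command line for the hypothetical option
 * "rtasmsgs=on" / "rtasmsgs=off". Returns 1 when boot-time RTAS
 * messages are requested, 0 otherwise (off is the default here).
 * If the option appears more than once, the last occurrence wins.
 */
int toy_rtas_msgs_enabled(const char *cmdline)
{
	static const char key[] = "rtasmsgs=";
	const size_t klen = sizeof(key) - 1;
	const char *p = cmdline;
	int enabled = 0;

	assert(cmdline != NULL);

	while (*p) {
		size_t len;

		while (*p == ' ')	/* skip separators */
			p++;
		len = strcspn(p, " ");	/* length of the current token */

		if (len > klen && strncmp(p, key, klen) == 0) {
			const char *val = p + klen;
			size_t vlen = len - klen;

			if (vlen == 2 && strncmp(val, "on", 2) == 0)
				enabled = 1;
			else if (vlen == 3 && strncmp(val, "off", 3) == 0)
				enabled = 0;
			/* any other value is ignored */
		}
		p += len;
	}
	return enabled;
}
```

Defaulting to off matches the case where the boot-time messages are killed
and only come back when someone asks for them; post-boot output would not be
gated by this switch at all.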

-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM



From nathanl at austin.ibm.com  Tue Feb 10 15:00:43 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Mon, 09 Feb 2004 22:00:43 -0600
Subject: extreme RTAS printks
In-Reply-To: <B068E4AF-5B4F-11D8-AA0F-000A95A0560C@us.ibm.com>
References: <B068E4AF-5B4F-11D8-AA0F-000A95A0560C@us.ibm.com>
Message-ID: <4028576B.40205@austin.ibm.com>

Hollis Blanchard wrote:
> Could we *please* kill printk_log_rtas()? This data is available via
> /proc files; there is no need to spam our boot logs with it.

How about making it a config option?  Patch attached.  I'm not sure the
help text is 100% correct - do RAS tools depend on the messages being in
dmesg output or do they look at the /proc files?

Another idea is changing the log level of these messages from KERN_ERR
to something lower priority like KERN_INFO or KERN_DEBUG.  At least that
way the console doesn't get spammed in default configurations.
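
To make the log level idea concrete, here is a small userspace model of the
usual printk rule: a message reaches the console only when its level is
numerically lower than the console loglevel. The toy_ names are made up for
the sketch; the numeric values mirror KERN_ERR (3), KERN_INFO (6) and
KERN_DEBUG (7).

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values mirror the kernel's KERN_ERR, KERN_INFO, KERN_DEBUG. */
enum toy_loglevel {
	TOY_KERN_ERR = 3,
	TOY_KERN_INFO = 6,
	TOY_KERN_DEBUG = 7
};

/* A message is shown on the console when its level is below the threshold. */
int toy_reaches_console(int msg_level, int console_loglevel)
{
	assert(msg_level >= 0 && msg_level <= 7);
	return msg_level < console_loglevel;
}

/* How many of nlines hex-dump lines at msg_level would hit the console. */
size_t toy_console_lines(size_t nlines, int msg_level, int console_loglevel)
{
	return toy_reaches_console(msg_level, console_loglevel) ? nlines : 0;
}
```

Under this model, with a console loglevel of 7, KERN_DEBUG lines stay off
the console but KERN_INFO lines still appear, so how quiet KERN_INFO ends up
depends on how the console loglevel is configured. Either way the messages
remain in the dmesg buffer.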

Nathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rtas_verbose.patch
Type: text/x-patch
Size: 1067 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040209/4772b6c2/attachment.bin 

From boutcher at us.ibm.com  Wed Feb 11 04:18:00 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Tue, 10 Feb 2004 11:18:00 -0600
Subject: extreme RTAS printks
In-Reply-To: <Pine.A41.4.44.0402092157030.27856-100000@forte.austin.ibm.com>
Message-ID: <OF83312104.33B8BF45-ON86256E36.005ED6A2-86256E36.005F0866@us.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/09/2004 09:59:04 PM:
> I think killing it completely is a bad idea. There are times when a
> machine has bad hardware, but good enough to boot halfway up. Getting the
> error messages then could be very helpful. Likewise, at runtime the RTAS
> messages are also useful since they'd show up on the console (and in the
> dmesg output in kdb).

OK, but is there any way to make them more meaningful than a hex dump?  No
user is going to take any action on a hex dump, and there is no way of
differentiating bad hardware from some mild informational log.  I know the
kernel doesn't decode these things, but can you at least get a severity or
a classification or something?

Dave Boutcher
IBM Linux Technology Center




From hollisb at us.ibm.com  Wed Feb 11 04:18:11 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 10 Feb 2004 11:18:11 -0600
Subject: extreme RTAS printks
In-Reply-To: <Pine.A41.4.44.0402092157030.27856-100000@forte.austin.ibm.com>
References: <Pine.A41.4.44.0402092157030.27856-100000@forte.austin.ibm.com>
Message-ID: <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com>


On Feb 9, 2004, at 9:59 PM, olof at austin.ibm.com wrote:

> On Mon, 9 Feb 2004, Hollis Blanchard wrote:
>
>> Could we *please* kill printk_log_rtas()? This data is available via
>> /proc files; there is no need to spam our boot logs with it.
>
> I think killing it completely is a bad idea. There are times when a
> machine has bad hardware, but good enough to boot halfway up. Getting
> the error messages then could be very helpful.

Doesn't the service processor log such hardware errors for exactly this
reason?

> Likewise, at runtime the RTAS messages are also useful since they'd
> show up on the console (and in the dmesg output in kdb).

So you're saying you *do* want to see hex dumps? I think printing 64
lines of hex to a console you're actually trying to use is much worse
even than getting it at boot time.

--
Hollis Blanchard
IBM Linux Technology Center


From olof at austin.ibm.com  Wed Feb 11 04:18:57 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 10 Feb 2004 11:18:57 -0600
Subject: extreme RTAS printks
In-Reply-To: <OF83312104.33B8BF45-ON86256E36.005ED6A2-86256E36.005F0866@us.ibm.com>
References: <OF83312104.33B8BF45-ON86256E36.005ED6A2-86256E36.005F0866@us.ibm.com>
Message-ID: <40291281.1000905@austin.ibm.com>


David Boutcher wrote:

> OK, but is there any way to make them more meaningful than a hex dump?  No
> user is going to take any action on a hex dump, and there is no way of
> differentiating bad hardware from some mild informational log.  I know the
> kernel doesn't decode these things, but can you at least get a severity or
> a classification or something?

Hmm, there's a userspace package that's used to parse all that info. I
don't know enough about the internal binary format, but I'd think that
there's severity levels to it. Maybe we can have a threshold for what
gets printed at boot and not?

Mike, got any insight to share? :-)


-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM


From olof at austin.ibm.com  Wed Feb 11 04:25:44 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 10 Feb 2004 11:25:44 -0600
Subject: extreme RTAS printks
In-Reply-To: <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com>
References: <Pine.A41.4.44.0402092157030.27856-100000@forte.austin.ibm.com> <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com>
Message-ID: <40291418.7050003@austin.ibm.com>


Hollis Blanchard wrote:
>> I think killing it completely is a bad idea. There are times when a
>> machine has bad hardware, but good enough to boot halfway up. Getting the
>> error messages then could be very helpful.
>
> Doesn't the service processor log such hardware errors for exactly this
> reason?

I thought the SP only logged checkstops and other severe errors. Or does
it log ECC parity errors/corrections and other "minor" problems too?

>> Likewise, at runtime the RTAS
>> messages are also useful since they'd show up on the console (and in the
>> dmesg output in kdb).
>
>
> So you're saying you *do* want to see hex dumps? Printing 64 lines of
> hex to a console you're actually trying to use I think is much worse
> even than getting it at boot time.

If you're getting the RTAS messages on the console, the machine is
likely going to be unstable anyway. Would you prefer it to be quiet and
not warn at all?

Also, the data needs to be accessible from a debugger, be it via dmesg
or in other ways. Hex is much better than nothing at all.


-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM


From strosake at austin.ibm.com  Wed Feb 11 09:30:50 2004
From: strosake at austin.ibm.com (Mike Strosaker)
Date: Tue, 10 Feb 2004 16:30:50 -0600
Subject: extreme RTAS printks
In-Reply-To: <OFC47F0C50.45AEA1F2-ON87256E36.0076C8A7@us.ibm.com>
References: <OFC47F0C50.45AEA1F2-ON87256E36.0076C8A7@us.ibm.com>
Message-ID: <40295B9A.2040101@austin.ibm.com>


Olof Johansson wrote:
> Hmm, there's a userspace package that's used to parse all that info. I
> don't know enough about the internal binary format, but I'd think that
> there's severity levels to it. Maybe we can have a threshold for what
> gets printed at boot and not?

It would require a fairly significant amount of parsing inside the
kernel to determine if the message is worthy of being printed.  There's
no one severity field that makes thresholding easy.  Also, the parsing
code would need to be updated to reflect any new error log formats in
the future, which is why it's better done in userspace.

Nathan Lynch wrote:
> How about making it a config option? Patch attached. I'm not sure the
> help text is 100% correct - do RAS tools depend on the messages being
> in dmesg output or do they look at the /proc files?
>
> Another idea is changing the log level of these messages from KERN_ERR
> to something lower priority like KERN_INFO or KERN_DEBUG. At least
> that way the console doesn't get spammed in default configurations.

The userspace RAS tools look at the /proc file; there's no harm from
that perspective in either of the above solutions.  I'm all for doing
either, but the latter (KERN_INFO or KERN_DEBUG) sounds like a
particularly good compromise to me...  that way the messages are still
there in case of emergency.
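
To make the log-level trade-off concrete, here is a minimal userspace
sketch of how the console threshold filters messages.  It is a model, not
kernel code: the function names are made up for illustration, and the
threshold is passed in as a parameter because the effective value depends
on how the machine is configured (7 is the usual built-in default, and
booting with "quiet" typically lowers it to 4).

```c
/* Userspace model of console log-level filtering; illustrative only. */
#include <assert.h>

/* Numeric levels behind KERN_ERR ("<3>"), KERN_INFO ("<6>"), KERN_DEBUG ("<7>"). */
enum { LVL_ERR = 3, LVL_INFO = 6, LVL_DEBUG = 7 };

/*
 * Hypothetical helper: a message is echoed to the console only when its
 * level is numerically lower than the console threshold.
 */
int reaches_console(int msg_level, int console_threshold)
{
	assert(msg_level >= 0 && msg_level <= 7);
	return msg_level < console_threshold;
}

/*
 * Hypothetical helper: every message is stored in the log buffer (what
 * dmesg reads) whatever its level, which is why lowering the level keeps
 * the data "there in case of emergency".
 */
int reaches_log_buffer(int msg_level)
{
	assert(msg_level >= 0 && msg_level <= 7);
	return 1;
}
```

With a threshold of 7, a KERN_INFO dump would still hit the console and
only KERN_DEBUG would stay quiet; with a threshold of 4, both are kept off
the console but remain in the log buffer.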

Thanks,
Mike Strosaker


From boutcher at us.ibm.com  Wed Feb 11 10:28:24 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Tue, 10 Feb 2004 17:28:24 -0600
Subject: extreme RTAS printks
In-Reply-To: <40295B9A.2040101@austin.ibm.com>
Message-ID: <OF24F6988F.B24021C6-ON86256E36.0080D77E-86256E36.0080F196@us.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/10/2004 04:30:50 PM:
> The userspace RAS tools look at the /proc file; there's no harm from
> that perspective in either of the above solutions.  I'm all for doing
> either, but the latter (KERN_INFO or KERN_DEBUG) sounds like a
> particularly good compromise to me...  that way the messages are still
> there in case of emergency.

So we are going to document the format of the hex dump so that it is useful
to people?  If not, I'm back to wondering exactly WHO the large kernel
messages are useful for.

Dave Boutcher
IBM Linux Technology Center



From olof at austin.ibm.com  Wed Feb 11 10:30:59 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 10 Feb 2004 17:30:59 -0600
Subject: extreme RTAS printks
In-Reply-To: <OF24F6988F.B24021C6-ON86256E36.0080D77E-86256E36.0080F196@us.ibm.com>
References: <OF24F6988F.B24021C6-ON86256E36.0080D77E-86256E36.0080F196@us.ibm.com>
Message-ID: <402969B3.1080401@austin.ibm.com>


David Boutcher wrote:

> So we are going to document the format of the hex dump so that it is useful
> to people?  If not, I'm back to wondering exactly WHO the large kernel
> messages are useful for.


Doesn't look like it. I was under the impression that the RAS tools used
the /var/log/messages dump. If they don't, then there's no use in
printing it at all (_as long as_ there's a way to get to it from a
debugger, and/or as long as the last ones are dumped right before a panic).


-Olof


--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM


From moilanen at austin.ibm.com  Thu Feb 12 01:22:35 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Wed, 11 Feb 2004 08:22:35 -0600
Subject: extreme RTAS printks
In-Reply-To: <402969B3.1080401@austin.ibm.com>
References: 
	 <OF24F6988F.B24021C6-ON86256E36.0080D77E-86256E36.0080F196@us.ibm.com>
	 <402969B3.1080401@austin.ibm.com>
Message-ID: <1076509355.10309.40.camel@DYN279927END.austin.ibm.com>


On Tue, 2004-02-10 at 17:30, Olof Johansson wrote:
> David Boutcher wrote:
>
> > So we are going to document the format of the hex dump so that it is useful
> > to people?  If not, I'm back to wondering exactly WHO the large kernel
> > messages are useful for.
>
>
> Doesn't look like it. I was of the impression that the RAS tools used
> the /var/log/messages dump. If they don't, then there's no use in
> printing it at all (_as long as_ there's a way to get to it from a
> debugger, and/or as long as the last ones are dumped right before a panic).

Putting them both in /proc and in /var/log/messages was an interim
solution until all boxes have rtas_errd and diagela.  These errors need
to be saved for a CE to diagnose the problem in the field, otherwise we
have no first failure data to analyze.

These messages should only be seen in two cases: either there is an error
that needs to be reported from a real failure, or NVRAM is not being
cleared of the error because rtas_errd is not installed on the machine
and it's showing up on every boot.

The error logs are going to 2k, so there could be a lot of messages
printed in the future.  I think it's a good compromise to move the log
level to KERN_INFO.  That way the data is still stored for a CE, and the
messages won't annoy Hollis. :)

Thanks,
Jake



From linas at austin.ibm.com  Thu Feb 12 04:10:39 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 11 Feb 2004 11:10:39 -0600
Subject: extreme RTAS printks
In-Reply-To: <40291418.7050003@austin.ibm.com>; from olof@austin.ibm.com on Tue, Feb 10, 2004 at 11:25:44AM -0600
References: <Pine.A41.4.44.0402092157030.27856-100000@forte.austin.ibm.com> <1A5F9B3C-5BED-11D8-BDA7-000A95A0560C@us.ibm.com> <40291418.7050003@austin.ibm.com>
Message-ID: <20040211111039.B58152@forte.austin.ibm.com>


On Tue, Feb 10, 2004 at 11:25:44AM -0600, Olof Johansson wrote:
>
> I thought the SP only logged checkstops and other severe errors. Or does
> it log ECC parity errors/corrections and other "minor" problems too?

It's reported minor/ridiculous warnings in the past ...

> If you're getting the RTAS messages on the console, the machine is
> likely going to be unstable anyway. Would you prefer it to be quiet and
> not warn at all?

The preference that Hollis & the majority express is for an English-language
message.  When I asked about this ages ago, the answer I got back was:

"The logic to decode these messages is complex and is getting more complex
every day with new additions. This logic is too big to fit in the kernel
and should be a userland process/daemon (closed source, at that) instead."

I dunno, I'm not sure I buy the "too complex" story. Sure, there's a
lot of data in the hex dump, but a simple two-sentence English-language
summary sure would be nice.  *Especially* during boot of a failing machine.
And that amount of code would not be big or complex.

> Also, the data needs to be accessible from a debugger, be it via dmesg
> or in other ways. Hex is much better than nothing at all.

Well, a quick extension to add RTAS error decode to kdb sure would be cute!
I don't think it's hard at all, not the basics.
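
As a rough idea of what "the basics" might look like, here is a small
sketch that pulls a severity out of the start of an error log and turns it
into a word.  The layout is an assumption, not something stated in this
thread: it takes the severity to be the top three bits of the second byte
of a fixed 8-byte header, with the values listed in the enum.  The function
names are made up.  A single header field is not a full diagnosis, so this
is a one-line hint at best.

```c
/* Sketch only: header layout and severity values are assumptions. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	SEV_NO_ERROR = 0, SEV_EVENT = 1, SEV_WARNING = 2,
	SEV_ERROR_SYNC = 3, SEV_ERROR = 4, SEV_FATAL = 5
};

/* Returns the 3-bit severity, or -1 if the buffer is too short to hold a header. */
int rtas_log_severity(const uint8_t *log, size_t len)
{
	assert(log != NULL);
	if (len < 8)
		return -1;
	return (log[1] & 0xE0) >> 5;
}

/* Maps a severity value to a short English word for a one-line summary. */
const char *rtas_severity_name(int sev)
{
	switch (sev) {
	case SEV_NO_ERROR:   return "no error";
	case SEV_EVENT:      return "event";
	case SEV_WARNING:    return "warning";
	case SEV_ERROR_SYNC: return "error (sync)";
	case SEV_ERROR:      return "error";
	case SEV_FATAL:      return "fatal";
	default:             return "unknown";
	}
}

/* Print anything at or above the threshold; never hide a log we cannot decode. */
int rtas_log_worth_printing(const uint8_t *log, size_t len, int threshold)
{
	int sev = rtas_log_severity(log, len);

	return sev < 0 || sev >= threshold;
}
```

A kdb command or a boot-time filter could call rtas_severity_name() on the
buffer before deciding whether to follow up with the hex dump.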


--linas


From linas at austin.ibm.com  Thu Feb 12 04:34:23 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 11 Feb 2004 11:34:23 -0600
Subject: extreme RTAS printks
In-Reply-To: <OF24F6988F.B24021C6-ON86256E36.0080D77E-86256E36.0080F196@us.ibm.com>; from boutcher@us.ibm.com on Tue, Feb 10, 2004 at 05:28:24PM -0600
References: <40295B9A.2040101@austin.ibm.com> <OF24F6988F.B24021C6-ON86256E36.0080D77E-86256E36.0080F196@us.ibm.com>
Message-ID: <20040211113422.C58152@forte.austin.ibm.com>


On Tue, Feb 10, 2004 at 05:28:24PM -0600, David Boutcher wrote:
>
> owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/10/2004 04:30:50 PM:
> > The userspace RAS tools look at the /proc file; there's no harm from
> > that perspective in either of the above solutions.  I'm all for doing
> > either, but the latter (KERN_INFO or KERN_DEBUG) sounds like a
> > particularly good compromise to me...  that way the messages are still
> > there in case of emergency.
>
> So we are going to document the format of the hex dump so that it is useful
> to people?  If not, I'm back to wondering exactly WHO the large kernel
> messages are useful for.

The format is documented in the RPA.  I heard a rumour last summer that
there was a move afoot to make the RPA open to the general public, but
I don't know whether that panned out.

--linas



From brking at us.ibm.com  Thu Feb 12 10:29:19 2004
From: brking at us.ibm.com (Brian King)
Date: Wed, 11 Feb 2004 17:29:19 -0600
Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com> <20040205142643.T27780@forte.austin.ibm.com>
Message-ID: <402ABACF.3090809@us.ibm.com>


I just want to express the importance of this patch. The 2.6 ipr driver
requires it, since it regularly hits the eeh_check_failure bug. Please
apply.


-Brian


linas at austin.ibm.com wrote:
> OK,
>
> Fifth time's a charm ...
>
> base64 encoding the patch helps prevent the mail gateways from mangling it,
> but then it's too big for the mailing list manager.  You can ftp the patch
>
> http://www-124.ibm.com/linux/patches/?patch_id=1344
>
> To repeat the original note:
>
> Patch for multiple EEH-related bugs.  Please review this patch,
> & if appropriate, please apply.  It should apply cleanly to
> the current ameslab tree (Feb 03 2004  2.6.2-rc3).
>
> This patch fixes multiple EEH-related bugs:
>
> -- Fixes the eeh_check_failure() usage in an interrupt context.
>    This routine is now safe to use in an interrupt. The fix was to
>    build a cache of IO addresses and check that, instead of using
>    the pci routines.
> -- Merges in Olof Johansson's sizeof patch when checking for failure
> -- Adds EEH tests to array/string reads
> -- Fixes bugs with address resolution (some i/o addresses were handled
>    incorrectly, resulting in EEH errors slipping by undetected.)
> -- Adds EEH support to the PCI Hotplug system (so that devices that
>    get added/removed get properly registered with the EEH subsystem.)
> -- Fixes improper use of /proc filesystem.
> -- Adds some misc statistics.
>
> Please note that the EEH subsystem will be undergoing a major revision
> in the not-too-distant future; this patch is a 'stopgap' to address the
> immediate concerns/issues until that time.
>
> --linas
>
>
> On Wed, Feb 04, 2004 at 02:28:53PM -0600, linas at austin.ibm.com wrote:
>
>>On Tue, Feb 03, 2004 at 10:52:11PM -0600, olof at austin.ibm.com wrote:
>>
>>>On Tue, 3 Feb 2004 linas at austin.ibm.com wrote:
>>>
>>>
>>>>Patch for multiple EEH-related bugs.  Please review this patch,
>>>>& if appropriate, please apply.  It should apply cleanly to
>>>>the current ameslab tree (Feb 03 2004  2.6.2-rc3).
>>>
>>>I have patch failures in eeh.c, pci.c and rpaphp_core.c. Are you sure you
>>>made the diff against a current ameslab tree?
>>
>>Right tree, bad email attachment.
>>
>>I don't know how it happened, but what I sent out had some trailing
>>whitespace whacked. The attached patch should not have this problem.
>
>
>
>
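
For anyone wondering why an address cache makes eeh_check_failure() usable
from an interrupt, the idea in the quoted note can be sketched as a lookup
over pre-registered I/O ranges that never sleeps and never walks the PCI
lists.  This is not the code in the patch: the structure, the names, and
the sorted-array-plus-binary-search layout are all stand-ins chosen to keep
the example short.

```c
/* Sketch of an I/O address cache; names and layout are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct io_range {
	uint64_t start;   /* first address covered by the device */
	uint64_t end;     /* last address covered (inclusive) */
	int dev_id;       /* stand-in for a pointer to the owning device */
};

/*
 * Returns the dev_id owning addr, or -1 if no registered range covers it.
 * The array must be sorted by start and the ranges must not overlap.
 * The lookup only reads memory, so nothing in it can block.
 */
int io_cache_lookup(const struct io_range *r, size_t n, uint64_t addr)
{
	size_t lo = 0, hi = n;

	assert(r != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < r[mid].start)
			hi = mid;
		else if (addr > r[mid].end)
			lo = mid + 1;
		else
			return r[mid].dev_id;
	}
	return -1;
}
```

Hotplug add/remove would have to update the table under a lock, which is
presumably why the patch also hooks EEH into the PCI hotplug code.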


--
Brian King
eServer Storage I/O
IBM Linux Technology Center



From olh at suse.de  Fri Feb 13 02:09:07 2004
From: olh at suse.de (Olaf Hering)
Date: Thu, 12 Feb 2004 16:09:07 +0100
Subject: autoconsole
In-Reply-To: <20040115011808.GD27924@krispykreme>
References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <D56F210C-4617-11D8-92C2-000A95A0560C@us.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme>
Message-ID: <20040212150907.GA13059@suse.de>


 On Thu, Jan 15, Anton Blanchard wrote:

>
> > I haven't had time to check it out yet, but Sparc pushed an
> > add_preferred_console() to 2.5 a couple weeks ago. See
> > arch/sparc/kernel/setup.c set_preferred_console() ; it's a bit cleaner
> > looking than what you've posted here. :)
>
> Agreed, how does this look? I could only compile test it, I dont have a
> machine to run on at the moment.

Have you tested it? console=tty1 doesn't work; console output still goes
straight to ttyS0. cmd_line is probably not yet set: my cmdline contains
a lot of stuff, but only the first word was printed with

        printk("%s(%u) cmd_line is %s\n",__FUNCTION__,__LINE__,cmd_line);

in set_preferred_console(). So strstr(cmd_line, "console=") does not
trigger. Does it just not work for me?
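
To show the failure mode outside the kernel, here is a small userspace
sketch of the check.  It is not the set_preferred_console() code from the
patch; find_console_arg() is a made-up helper, and taking the last
"console=" on the line is an assumption about the desired behaviour.

```c
/* Userspace sketch of looking for console= in a command line. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copies the value of the last "console=" option into out (truncating if
 * needed) and returns 1, or returns 0 if the option is absent.
 */
int find_console_arg(const char *cmdline, char *out, size_t outlen)
{
	const size_t keylen = strlen("console=");
	const char *p = cmdline;
	const char *last = NULL;
	size_t n;

	assert(cmdline != NULL && out != NULL && outlen > 0);
	while ((p = strstr(p, "console=")) != NULL) {
		/* Only accept a match that starts a word. */
		if (p == cmdline || p[-1] == ' ')
			last = p + keylen;
		p += keylen;
	}
	if (last == NULL)
		return 0;

	n = strcspn(last, " ");
	if (n >= outlen)
		n = outlen - 1;
	memcpy(out, last, n);
	out[n] = '\0';
	return 1;
}
```

If the buffer handed to it holds only the first word of the real command
line, the search cannot succeed, which matches what the printk shows.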

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From olh at suse.de  Fri Feb 13 02:55:53 2004
From: olh at suse.de (Olaf Hering)
Date: Thu, 12 Feb 2004 16:55:53 +0100
Subject: bogus check in sg_add()
Message-ID: <20040212155553.GA11426@suse.de>


what is up with this check in ameslab? Still needed, what does it fix?


diff -purN linux-2.5/drivers/scsi/sg.c linuxppc64-2.5/drivers/scsi/sg.c
--- linux-2.5/drivers/scsi/sg.c 2004-02-06 08:21:23.000000000 +0000
+++ linuxppc64-2.5/drivers/scsi/sg.c    2004-02-10 08:37:26.000000000 +0000
@@ -1343,6 +1343,9 @@ sg_add(struct class_device *cl_dev)
        struct cdev * cdev = NULL;
        int k, error;

+       if (scsidp->type == 255)
+               return 0;
+
        disk = alloc_disk(1);
        if (!disk)
                return -ENOMEM;

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From brking at us.ibm.com  Fri Feb 13 04:19:06 2004
From: brking at us.ibm.com (Brian King)
Date: Thu, 12 Feb 2004 11:19:06 -0600
Subject: bogus check in sg_add()
References: <20040212155553.GA11426@suse.de>
Message-ID: <402BB58A.7090208@us.ibm.com>


Someone please remove this check. The ipr driver reports a device of
type 255, and wants an sg started for it.

-Brian

Olaf Hering wrote:
> what is up with this check in ameslab? Still needed, what does it fix?
>
>
> diff -purN linux-2.5/drivers/scsi/sg.c linuxppc64-2.5/drivers/scsi/sg.c
> --- linux-2.5/drivers/scsi/sg.c 2004-02-06 08:21:23.000000000 +0000
> +++ linuxppc64-2.5/drivers/scsi/sg.c    2004-02-10 08:37:26.000000000 +0000
> @@ -1343,6 +1343,9 @@ sg_add(struct class_device *cl_dev)
>         struct cdev * cdev = NULL;
>         int k, error;
>
> +       if (scsidp->type == 255)
> +               return 0;
> +
>         disk = alloc_disk(1);
>         if (!disk)
>                 return -ENOMEM;
>
> --
> USB is for mice, FireWire is for men!
>
> sUse lINUX ag, nÜRNBERG
>
>
>


--
Brian King
eServer Storage I/O
IBM Linux Technology Center



From moilanen at austin.ibm.com  Fri Feb 13 07:55:14 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Thu, 12 Feb 2004 14:55:14 -0600
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
Message-ID: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com>

Here's a patch to fix hardware breakpoints and add support for LPAR
systems in xmon.

On an SMP system, the breakpoints appeared to work as far as hitting
the breakpoint.  When you exited xmon to continue, you would instead
hit the same breakpoint again.  There were a couple of errors in the
check to see whether xmon needed to single-step over the instruction.

The other problem I was seeing was that when a breakpoint was cleared on
one CPU, the other CPUs would not clear their dabr until they hit xmon,
unless they were stopped as well.

Thanks,
Jake

-------------- next part --------------
A non-text attachment was scrubbed...
Name: linux-2.6-xmon-dabr-fix-1.patch
Type: text/x-patch
Size: 4794 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040212/85c8f2d6/attachment.bin 

From olof at austin.ibm.com  Fri Feb 13 08:00:58 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 12 Feb 2004 15:00:58 -0600
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <40104461.5030804@austin.ibm.com>
References: <40104461.5030804@austin.ibm.com>
Message-ID: <402BE98A.9090304@austin.ibm.com>

After feedback from Julie and Ben, here's a revised mainline patch.
Consider this last call for comments before I push. :-)


Changes:

* Add an smp_mb() in pte_free_sync(), to make sure that any 0-writes to
PTE/PMDs are seen on other processors before the free.

* Make sure "newpp" is never more than 3 bits. This saves us from
crashing older iSeries hypervisor. Shouldn't be a problem since we don't
do aging in 2.4, but I prefer the more conservative approach.

* Add an isync after the _PAGE_BUSY lock, to avoid out-of-order execution.

* Don't invalidate/update/validate a HPTE in hpte_updatepp unless the pp
bits have changed. This avoids a pileup of faults caused by other
processors faulting during the period when the HPTE is invalid and/or
several processors faulting at the same time and resolving the same fault.

* Other logic fixes based on discussions on this list, mostly dealing
with timing windows during which _PAGE_BUSY was cleared inappropriately.


-Olof


--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hash_page-rework-feb12
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040212/a4abb01e/attachment.txt 

From johnrose at us.ibm.com  Fri Feb 13 08:13:57 2004
From: johnrose at us.ibm.com (John H Rose)
Date: Thu, 12 Feb 2004 15:13:57 -0600
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
Message-ID: <OF65A2D034.3A7A4DBE-ON85256E38.00746A29-86256E38.0074A4AF@us.ibm.com>


I might be confused, but don't these qualify as hardware watchpoints rather
than breakpoints?  Regardless, the patch looks good :)

Thanks-
John

-----------------------
John Rose
pSeries Linux Development
johnrose at austin.ibm.com
Office: 512-838-0298
Tieline: 678-0298

Jake Moilanen <moilanen at austin.ibm.com>@lists.linuxppc.org on 02/12/2004
02:55:14 PM

Sent by:    owner-linuxppc64-dev at lists.linuxppc.org


To:    linuxppc64-dev at lists.linuxppc.org
cc:
Subject:    [PATCH][2.6] hardware breakpoint fix/support for xmon


Here's a patch to fix hardware breakpoints and add support for LPAR
systems in xmon.

On an SMP system, the breakpoints appeared to have been working for
hitting the breakpoint.  When you exited xmon to continue, you would
instead hit the same breakpoint again.  There were a couple errors for
the check to see if xmon needed to single step over the instruction.

The other problem I was seeing was when breakpoint was cleared on one
CPU, unless the other CPUS were stopped as well, they would not clear
their dabr until they hit xmon.

Thanks,
Jake



From benh at kernel.crashing.org  Fri Feb 13 08:20:02 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 13 Feb 2004 08:20:02 +1100
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <402BE98A.9090304@austin.ibm.com>
References: <40104461.5030804@austin.ibm.com>
	 <402BE98A.9090304@austin.ibm.com>
Message-ID: <1076620802.12434.39.camel@gaston>


On Fri, 2004-02-13 at 08:00, Olof Johansson wrote:
> After feedback from Julie and Ben, here's a revised mainline patch.
> Consider this last call for comments before I push. :-)
>
>
> Changes:
>
> * Add an smp_mb() in pte_free_sync(), to make sure that any 0-writes to
> PTE/PMDs are seen on other processors before the free.

Actually, what matters is before the is_locked(), we act both
as a write barrier for the 0 and a read barrier for is_locked().

That said... I wonder if the implementation of is_locked() shouldn't
have a rmb() by default after all ...

> * Make sure "newpp" is never more than 3 bits. This saves us from
> crashing older iSeries hypervisor. Shouldn't be a problem since we don't
> do aging in 2.4, but I prefer the more conservative approach.

Not only crashing older HVs, but crashing the kernel with newer HVs ;)

> * Add an isync after the _PAGE_BUSY lock, to avoid out-of-order execution.
>
> * Don't invalidate/update/validate a HPTE in hpte_updatepp unless the pp
> bits have changed. This avoids a pileup of faults caused by other
> processors faulting during the period when the HPTE is invalid and/or
> several processors faulting at the same time and resolving the same fault.
>
> * Other logic fixes based on discussions on this list, mostly dealing
> with timing windows during which _PAGE_BUSY was cleared inappropriately.

Did you spot any case that could have affected the 2.6 version ?

Ben.



From moilanen at austin.ibm.com  Fri Feb 13 08:26:50 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Thu, 12 Feb 2004 15:26:50 -0600
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <OF65A2D034.3A7A4DBE-ON85256E38.00746A29-86256E38.0074A4AF@us.ibm.com>
References: 
	 <OF65A2D034.3A7A4DBE-ON85256E38.00746A29-86256E38.0074A4AF@us.ibm.com>
Message-ID: <1076621210.10309.93.camel@DYN279927END.austin.ibm.com>


On Thu, 2004-02-12 at 15:13, John H Rose wrote:
> I might be confused, but don't these qualify as hardware watchpoints rather
> than breakpoints?  Regardless, the patch looks good :)

The correct name is "Data Access Breakpoints".  But depending on who or
what you are asking they'll go by different names.  I think it's more of
an AIXism to call them watchpoints.  I was following the xmon naming
convention of calling them hardware breakpoints.

Thanks,
Jake



From paulus at samba.org  Fri Feb 13 08:34:16 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 13 Feb 2004 08:34:16 +1100
Subject: bogus check in sg_add()
In-Reply-To: <402BB58A.7090208@us.ibm.com>
References: <20040212155553.GA11426@suse.de>
	<402BB58A.7090208@us.ibm.com>
Message-ID: <16427.61784.266371.410005@cargo.ozlabs.ibm.com>


Brian King writes:
>
> Someone please remove this check. The ipr driver reports a device of
> type 255, and wants an sg started for it.

Done.  I always like reducing the diffs between ameslab and the
official trees. :)

For interest, the check was added by Todd Inglett on 14 Jan 2003 with
this changeset comment:

  Ignore host devices which are erroneously created by scsi_get_host_dev().
  Mike Anderson <andmike at us.ibm.com> is looking into it.

Paul.


From olh at suse.de  Fri Feb 13 08:40:33 2004
From: olh at suse.de (Olaf Hering)
Date: Thu, 12 Feb 2004 22:40:33 +0100
Subject: bogus check in sg_add()
In-Reply-To: <16427.61784.266371.410005@cargo.ozlabs.ibm.com>
References: <20040212155553.GA11426@suse.de> <402BB58A.7090208@us.ibm.com> <16427.61784.266371.410005@cargo.ozlabs.ibm.com>
Message-ID: <20040212214033.GB30422@suse.de>


 On Fri, Feb 13, Paul Mackerras wrote:

> Brian King writes:
> >
> > Someone please remove this check. The ipr driver reports a device of
> > type 255, and wants an sg started for it.
>
> Done.  I always like reducing the diffs between ameslab and the
> official trees. :)

There are also lots of mb() in e100 and e1000, stuff for jgarzik?


--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From paulus at samba.org  Fri Feb 13 08:42:56 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 13 Feb 2004 08:42:56 +1100
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com>
References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com>
Message-ID: <16427.62304.699990.252289@cargo.ozlabs.ibm.com>


Jake Moilanen writes:

> Here's a patch to fix hardware breakpoints and add support for LPAR
> systems in xmon.
>
> On an SMP system, the breakpoints appeared to have been working for
> hitting the breakpoint.  When you exited xmon to continue, you would
> instead hit the same breakpoint again.  There were a couple errors for
> the check to see if xmon needed to single step over the instruction.
>
> The other problem I was seeing was when breakpoint was cleared on one
> CPU, unless the other CPUS were stopped as well, they would not clear
> their dabr until they hit xmon.

Hmmm... We're all hacking on xmon these days, it seems.  Anton was
making some changes to xmon just yesterday, and I am about to rework
the xmon entry/exit and breakpoint insertion/removal to make it work
properly on SMP systems.  Don't push for now, and I'll talk to Anton
today about merging your changes in with ours.

Thanks,
Paul.



From sleddog at us.ibm.com  Fri Feb 13 08:51:00 2004
From: sleddog at us.ibm.com (Dave Boutcher)
Date: Thu, 12 Feb 2004 15:51:00 -0600
Subject: support for dma-mapping
Message-ID: <qdj1xp06k6z.fsf@pom.rchland.ibm.com>


I wonder what you all (especially paulus and sfr) think of the following:

Currently ppc64 uses the asm-generic version...

===== dma-mapping.h 1.1 vs edited =====
--- 1.1/include/asm-ppc64/dma-mapping.h	Sat Dec 21 22:36:58 2002
+++ edited/dma-mapping.h	Thu Feb 12 15:38:59 2004
@@ -1 +1,157 @@
-#include <asm-generic/dma-mapping.h>
+/* Copyright (C) 2004 IBM
+ *
+ * Implements the generic device dma API
+ */
+
+#ifndef _ASM_DMA_MAPPING_H
+#define _ASM_DMA_MAPPING_H
+
+/* Include the busses we support */
+#include <linux/pci.h>
+#include <linux/vio.h>
+/* need struct page definitions */
+#include <linux/mm.h>
+
+static inline int
+dma_supported(struct device *dev, u64 mask)
+{
+	if (dev->bus == &pci_bus_type) return pci_dma_supported(to_pci_dev(dev), mask);
+	if (dev->bus == &vio_bus_type) return vio_dma_supported(to_vio_dev(dev), mask);
+	BUG();
+}
+
+static inline int
+dma_set_mask(struct device *dev, u64 dma_mask)
+{
+	if (dev->bus == &pci_bus_type) return pci_set_dma_mask(to_pci_dev(dev), dma_mask);
+	if (dev->bus == &vio_bus_type) return vio_set_dma_mask(to_vio_dev(dev), dma_mask);
+	BUG();
+}
+
+static inline void *
+dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
+		   int flag)
+{
+	if (dev->bus == &pci_bus_type) return pci_alloc_consistent(to_pci_dev(dev), size, dma_handle);
+	if (dev->bus == &vio_bus_type) return vio_alloc_consistent(to_vio_dev(dev), size, dma_handle);
+	BUG();
+}
+
+static inline void
+dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
+		    dma_addr_t dma_handle)
+{
+	if (dev->bus == &pci_bus_type) { pci_free_consistent(to_pci_dev(dev), size, cpu_addr, dma_handle); return; }
+	if (dev->bus == &vio_bus_type) { vio_free_consistent(to_vio_dev(dev), size, cpu_addr, dma_handle); return; }
+	BUG();
+}
+
+static inline dma_addr_t
+dma_map_single(struct device *dev, void *cpu_addr, size_t size,
+	       enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) return pci_map_single(to_pci_dev(dev), cpu_addr, size, (int)direction);
+	if (dev->bus == &vio_bus_type) return vio_map_single(to_vio_dev(dev), cpu_addr, size, (int)direction);
+	BUG();
+}
+
+static inline void
+dma_unmap_single(struct device *dev, dma_addr_t dma_addr, size_t size,
+		 enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) { pci_unmap_single(to_pci_dev(dev), dma_addr, size, (int)direction); return; }
+	if (dev->bus == &vio_bus_type) { vio_unmap_single(to_vio_dev(dev), dma_addr, size, (int)direction); return; }
+	BUG();
+}
+
+static inline dma_addr_t
+dma_map_page(struct device *dev, struct page *page,
+	     unsigned long offset, size_t size,
+	     enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) return pci_map_page(to_pci_dev(dev), page, offset, size, (int)direction);
+	if (dev->bus == &vio_bus_type) return vio_map_page(to_vio_dev(dev), page, offset, size, (int)direction);
+	BUG();
+}
+
+
+static inline void
+dma_unmap_page(struct device *dev, dma_addr_t dma_address, size_t size,
+	       enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_unmap_page(to_pci_dev(dev), dma_address, size, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_unmap_page(to_vio_dev(dev), dma_address, size, (int)direction);
+	else BUG();
+}
+
+static inline int
+dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
+	   enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) return pci_map_sg(to_pci_dev(dev), sg, nents, (int)direction);
+	if (dev->bus == &vio_bus_type) return vio_map_sg(to_vio_dev(dev), sg, nents, (int)direction);
+	BUG();
+}
+
+static inline void
+dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nhwentries,
+	     enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_unmap_sg(to_pci_dev(dev), sg, nhwentries, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_unmap_sg(to_vio_dev(dev), sg, nhwentries, (int)direction);
+	else BUG();
+}
+
+static inline void
+dma_sync_single(struct device *dev, dma_addr_t dma_handle, size_t size,
+		enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_dma_sync_single(to_pci_dev(dev), dma_handle, size, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_dma_sync_single(to_vio_dev(dev), dma_handle, size, (int)direction);
+	else BUG();
+}
+
+static inline void
+dma_sync_sg(struct device *dev, struct scatterlist *sg, int nelems,
+	    enum dma_data_direction direction)
+{
+	if (dev->bus == &pci_bus_type) pci_dma_sync_sg(to_pci_dev(dev), sg, nelems, (int)direction);
+	else if (dev->bus == &vio_bus_type) vio_dma_sync_sg(to_vio_dev(dev), sg, nelems, (int)direction);
+	else BUG();
+}
+
+/* Now for the API extensions over the pci_ one */
+
+#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
+#define dma_free_noncoherent(d, s, v, h) dma_free_coherent(d, s, v, h)
+#define dma_is_consistent(d)	(1)
+
+static inline int
+dma_get_cache_alignment(void)
+{
+	/* no easy way to get cache size on all processors, so return
+	 * the maximum possible, to be safe */
+	return (1 << L1_CACHE_SHIFT_MAX);
+}
+
+static inline void
+dma_sync_single_range(struct device *dev, dma_addr_t dma_handle,
+		      unsigned long offset, size_t size,
+		      enum dma_data_direction direction)
+{
+	/* just sync everything, that's all the pci API can do */
+	dma_sync_single(dev, dma_handle, offset+size, direction);
+}
+
+static inline void
+dma_cache_sync(void *vaddr, size_t size,
+	       enum dma_data_direction direction)
+{
+	/* could define this in terms of the dma_cache ... operations,
+	 * but if you get this on a platform, you should convert the platform
+	 * to using the generic device DMA API */
+	BUG();
+}
+
+#endif
+

Dave B


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olof at austin.ibm.com  Fri Feb 13 09:18:25 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 12 Feb 2004 16:18:25 -0600
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <1076620802.12434.39.camel@gaston>
References: <40104461.5030804@austin.ibm.com>	 <402BE98A.9090304@austin.ibm.com> <1076620802.12434.39.camel@gaston>
Message-ID: <402BFBB1.5000302@austin.ibm.com>


Benjamin Herrenschmidt wrote:

>>* Add a smp_mb() in pte_free_sync(), to make sure that any 0-writes to
>>PTE/PMDs are seen on other processors before the free.
>
>
> Actually, what matters is before the is_locked(), we act both
> as a write barrier for the 0 and a read barrier for is_locked().
>
> That said... I wonder if the implementation of is_locked() shouldn't
> have a rmb() by default after all ...

I can't even find any other users of is_read_locked in the ppc64 code. I
guess it should be fixed for future reference though. :-)

As for the memory barrier: Since smp_mb() (sync) is "larger" than
smp_rmb() (lwsync), we should be fine to keep it outside the loop:

* Another CPU has taken the read lock, seeing the old PTE value: mb()
will make us see the read lock.
* Another CPU will shortly take the read lock: Either the mb() will make
them see the new PTE value, or we will see their read lock.

Does the above make sense?
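
To make the two cases concrete, here is a minimal user-space sketch of
the ordering argument. It is not the kernel code: it assumes C11 atomics,
with a sequentially consistent fence standing in for smp_mb(), and it
assumes the walker also has a full barrier between taking the lock and
reading the PTE. The names pte, read_lock_held, clear_and_wait() and
walk() are hypothetical and exist only for illustration.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins: one PTE word and one "reader holds lock" flag. */
static atomic_ulong pte = 0x1234;
static atomic_int read_lock_held = 0;

/* Freeing side: zero the PTE, full barrier, then wait for readers. */
static void clear_and_wait(void)
{
	atomic_store_explicit(&pte, 0, memory_order_relaxed);
	/* Stands in for smp_mb(): orders the 0-write before the lock reads. */
	atomic_thread_fence(memory_order_seq_cst);
	while (atomic_load_explicit(&read_lock_held, memory_order_relaxed))
		;	/* a reader that may have seen the old PTE is still active */
}

/* Walking side: take the read lock, full barrier, then read the PTE. */
static unsigned long walk(void)
{
	unsigned long v;

	atomic_store_explicit(&read_lock_held, 1, memory_order_relaxed);
	/* Assumed barrier on the reader side (the lock acquire in real code). */
	atomic_thread_fence(memory_order_seq_cst);
	v = atomic_load_explicit(&pte, memory_order_relaxed);
	atomic_store_explicit(&read_lock_held, 0, memory_order_release);
	return v;
}
```

With a full fence on both sides, at least one side has to observe the
other's store: either walk() reads 0, or clear_and_wait() sees the lock
held and keeps spinning. A single barrier before the loop is enough for
that; it does not need to be repeated on every iteration.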

>>* Make sure "newpp" is never more than 3 bits. This saves us from
>>crashing older iSeries hypervisor. Shouldn't be a problem since we don't
>>do aging in 2.4, but I prefer the more conservative approach.
>
> Not only crashing older HVs, but crashing the kernel with newer HVs ;)

Oops. Either way, we shouldn't be exposed more now than before on 2.4
since none of the pp code was really changed, and no flags besides
_PAGE_BUSY were redefined.

>>* Other logic fixes based on discussions on this list, mostly dealing
>>with timing windows during which _PAGE_BUSY was cleared inappropriately.
>
>
> Did you spot any case that could have affected the 2.6 version ?

I didn't look much at it yet, but there's no isync after the loop at the
top of __hash_page (add one right before "Step 2"). I can supply a patch,
but it's pretty obvious where it should go...


-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM



From benh at kernel.crashing.org  Fri Feb 13 09:25:12 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 13 Feb 2004 09:25:12 +1100
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <402BFBB1.5000302@austin.ibm.com>
References: <40104461.5030804@austin.ibm.com>
	 <402BE98A.9090304@austin.ibm.com> <1076620802.12434.39.camel@gaston>
	 <402BFBB1.5000302@austin.ibm.com>
Message-ID: <1076624712.13813.48.camel@gaston>


> I can't even find any other users of is_read_locked in the ppc64 code. I
> guess it should be fixed for future reference though. :-)
>
> As for the memory barrier: Since smp_mb() (sync) is "larger" than
> smp_rmb() (lwsync), we should be fine to keep it outside the loop:

Sure, the code is fine, I was correcting your comments :)

> I didn't look much at it yet, but there's no isync after the loop at the
> top of __hash_page (add one right before "Step 2"). I can supply a patch,
> but it's pretty obvious where it should go...

I did already; it's in Linus's tree.

Ben.




From olof at austin.ibm.com  Fri Feb 13 09:30:23 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 12 Feb 2004 16:30:23 -0600
Subject: [2.4] [PATCH] hash_page rework, take 2
In-Reply-To: <1076624712.13813.48.camel@gaston>
References: <40104461.5030804@austin.ibm.com>	 <402BE98A.9090304@austin.ibm.com> <1076620802.12434.39.camel@gaston>	 <402BFBB1.5000302@austin.ibm.com> <1076624712.13813.48.camel@gaston>
Message-ID: <402BFE7F.4030907@austin.ibm.com>


Benjamin Herrenschmidt wrote:
>>I can't even find any other users of is_read_locked in the ppc64 code. I
>>guess it should be fixed for future reference though. :-)
>>
>>As for the memory barrier: Since smp_mb() (sync) is "larger" than
>>smp_rmb() (lwsync), we should be fine to keep it outside the loop:
>
>
> Sure, the code is fine, I was correcting your comments :)

Thanks. I'll fix them before any push.

>>I didn't look much at it yet, but there's no isync after the loop at the
>>top of __hash_page (add one right before "Step 2"). I can supply a patch,
>>but it's pretty obvious where it should go...
>
> I did already; it's in Linus's tree.

Ok, my ames tree that I checked with might have been slightly out of date.


-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM



From anton at samba.org  Fri Feb 13 10:00:45 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 13 Feb 2004 10:00:45 +1100
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <16427.62304.699990.252289@cargo.ozlabs.ibm.com>
References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> <16427.62304.699990.252289@cargo.ozlabs.ibm.com>
Message-ID: <20040212230045.GJ25922@krispykreme>


> Hmmm... We're all hacking on xmon these days, it seems.  Anton was
> making some changes to xmon just yesterday, and I am about to rework
> the xmon entry/exit and breakpoint insertion/removal to make it work
> properly on SMP systems.  Don't push for now, and I'll talk to Anton
> today about merging your changes in with ours.

For the benefit of others on the list, here's what I've got:

- recover from bad SPR read/write (we get a program check)
- remove some old code (bat and segment register stuff)
- update the help text to match reality
- add a "press ? for help" when xmon first appears to make rusty happy
- protect against flushing bad parts of memory (from Milton)
- don't print iseries specific stuff on pseries in SPR dump (S)
- add code to dump the segment table or SLB
- remove a number of functions that wouldn't work on LPAR

I'm trying to make sure xmon is solid, so I'm interested in any way someone
can lock up xmon once this patch goes in.
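
The recovery from a bad SPR access or cache flush follows a
setjmp/longjmp pattern: arm a fault handler, attempt the risky
operation, and land back at the setjmp if the fault fires. A minimal
user-space sketch of that control flow is below. It only imitates the
shape of the patch: raise_fault() and risky_read() are hypothetical
stand-ins for the real exception path and the SPR access, and
fault_handler plays the role of debugger_fault_handler.

```c
#include <setjmp.h>
#include <stddef.h>

static jmp_buf fault_jmp;
/* Non-NULL only while a guarded access is in progress. */
static void (*fault_handler)(void);

static void handle_fault(void)
{
	longjmp(fault_jmp, 1);
}

/* Hypothetical stand-in for the exception path calling the hook. */
static void raise_fault(void)
{
	if (fault_handler)
		fault_handler();
}

/* Hypothetical risky access: faults when 'bad' is set. */
static unsigned long risky_read(int bad)
{
	if (bad)
		raise_fault();
	return 42;
}

/* Returns 1 and fills *out on success, 0 if the access faulted. */
static int guarded_read(int bad, unsigned long *out)
{
	if (setjmp(fault_jmp) == 0) {
		fault_handler = handle_fault;
		*out = risky_read(bad);
		fault_handler = NULL;
		return 1;
	}
	/* Reached via longjmp() from handle_fault(). */
	fault_handler = NULL;
	return 0;
}
```

The handler is cleared on both paths, so a later fault that happens
outside a guarded access is not swallowed by a stale hook.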

Anton

===== arch/ppc64/kernel/traps.c 1.25 vs edited =====
--- 1.25/arch/ppc64/kernel/traps.c	Tue Jan 20 13:07:09 2004
+++ edited/arch/ppc64/kernel/traps.c	Thu Feb 12 15:54:09 2004
@@ -372,6 +372,13 @@
 {
 	siginfo_t info;

+#ifdef CONFIG_DEBUG_KERNEL
+	if (debugger_fault_handler) {
+		debugger_fault_handler(regs);
+		return;
+	}
+#endif
+
 	if (regs->msr & 0x100000) {
 		/* IEEE FP exception */

===== arch/ppc64/xmon/privinst.h 1.2 vs edited =====
--- 1.2/arch/ppc64/xmon/privinst.h	Mon Jun 10 12:37:26 2002
+++ edited/arch/ppc64/xmon/privinst.h	Thu Feb 12 16:18:53 2004
@@ -43,39 +43,12 @@
 GSETSPR(275, sprg3)
 GSETSPR(282, ear)
 GSETSPR(287, pvr)
-GSETSPR(528, bat0u)
-GSETSPR(529, bat0l)
-GSETSPR(530, bat1u)
-GSETSPR(531, bat1l)
-GSETSPR(532, bat2u)
-GSETSPR(533, bat2l)
-GSETSPR(534, bat3u)
-GSETSPR(535, bat3l)
 GSETSPR(1008, hid0)
 GSETSPR(1009, hid1)
 GSETSPR(1010, iabr)
 GSETSPR(1013, dabr)
 GSETSPR(1023, pir)

-static inline int get_sr(int n)
-{
-	int ret;
-
-#if 0
-	// DRENG does not assemble
-	asm (" mfsrin %0,%1" : "=r" (ret) : "r" (n << 28));
-#endif
-	return ret;
-}
-
-static inline void set_sr(int n, int val)
-{
-#if 0
-	// DRENG does not assemble
-	asm ("mtsrin %0,%1" : : "r" (val), "r" (n << 28));
-#endif
-}
-
 static inline void store_inst(void *p)
 {
 	asm volatile ("dcbst 0,%0; sync; icbi 0,%0; isync" : : "r" (p));
@@ -90,4 +63,3 @@
 {
 	asm volatile ("dcbi 0,%0; icbi 0,%0" : : "r" (p));
 }
-
===== arch/ppc64/xmon/xmon.c 1.33 vs edited =====
--- 1.33/arch/ppc64/xmon/xmon.c	Sat Feb  7 14:17:23 2004
+++ edited/arch/ppc64/xmon/xmon.c	Thu Feb 12 16:48:23 2004
@@ -115,10 +115,7 @@
 #endif /* CONFIG_SMP */
 static void csum(void);
 static void bootcmds(void);
-static void mem_translate(void);
-static void mem_check(void);
-static void mem_find_real(void);
-static void mem_find_vsid(void);
+void dump_segments(void);

 static void debug_trace(void);

@@ -149,7 +146,15 @@
   b	show breakpoints\n\
   bd	set data breakpoint\n\
   bi	set instruction breakpoint\n\
-  bc	clear breakpoint\n\
+  bc	clear breakpoint\n"
+#ifdef CONFIG_SMP
+  "\
+  c	print cpus stopped in xmon\n\
+  ci	send xmon interrupt to all other cpus\n\
+  c#	try to switch to cpu number h (in hex)\n"
+#endif
+  "\
+  C	checksum\n\
   d	dump bytes\n\
   di	dump instructions\n\
   df	dump float values\n\
@@ -162,7 +167,6 @@
   md	compare two blocks of memory\n\
   ml	locate a block of memory\n\
   mz	zero a block of memory\n\
-  mx	translation information for an effective address\n\
   mi	show information about memory allocation\n\
   p 	show the task list\n\
   r	print registers\n\
@@ -171,7 +175,14 @@
   t	print backtrace\n\
   T	Enable/Disable PPCDBG flags\n\
   x	exit monitor\n\
-";
+  u	dump segment table or SLB\n\
+  ?	help\n"
+#ifndef CONFIG_PPC_ISERIES
+  "\
+  zr	reboot\n\
+  zh	halt\n"
+#endif
+;

 static int xmon_trace[NR_CPUS];
 #define SSTEP	1		/* stepping because of 's' command */
@@ -308,6 +319,7 @@
 #endif /* CONFIG_SMP */
 	remove_bpts();
 	disable_surveillance();
+	printf("press ? for help ");
 	cmd = cmds(excp);
 	if (cmd == 's') {
 		xmon_trace[smp_processor_id()] = SSTEP;
@@ -332,17 +344,6 @@
 	set_msrd(msr);		/* restore interrupt enable */
 }

-void
-xmon_irq(int irq, void *d, struct pt_regs *regs)
-{
-	unsigned long flags;
-	local_save_flags(flags);
-	local_irq_disable();
-	printf("Keyboard interrupt\n");
-	xmon(regs);
-	local_irq_restore(flags);
-}
-
 int
 xmon_bpt(struct pt_regs *regs)
 {
@@ -524,18 +525,6 @@
 			case 'z':
 				memzcan();
 				break;
-			case 'x':
-				mem_translate();
-				break;
-			case 'c':
-				mem_check();
-				break;
-			case 'f':
-				mem_find_real();
-				break;
-			case 'e':
-				mem_find_vsid();
-				break;
 			case 'i':
 				show_mem();
 				break;
@@ -587,11 +576,16 @@
 			cpu_cmd();
 			break;
 #endif /* CONFIG_SMP */
+#ifndef CONFIG_PPC_ISERIES
 		case 'z':
 			bootcmds();
+#endif
 		case 'T':
 			debug_trace();
 			break;
+		case 'u':
+			dump_segments();
+			break;
 		default:
 			printf("Unrecognized command: ");
 		        do {
@@ -1056,14 +1050,23 @@
 		termch = 0;
 	nflush = 1;
 	scanhex(&nflush);
-	nflush = (nflush + 31) / 32;
-	if (cmd != 'i') {
-		for (; nflush > 0; --nflush, adrs += 0x20)
-			cflush((void *) adrs);
-	} else {
-		for (; nflush > 0; --nflush, adrs += 0x20)
-			cinval((void *) adrs);
+	nflush = (nflush + L1_CACHE_BYTES - 1) / L1_CACHE_BYTES;
+	if( setjmp(bus_error_jmp) == 0 ) {
+		debugger_fault_handler = handle_fault;
+		sync();
+
+		if (cmd != 'i') {
+			for (; nflush > 0; --nflush, adrs += L1_CACHE_BYTES)
+				cflush((void *) adrs);
+		} else {
+			for (; nflush > 0; --nflush, adrs += L1_CACHE_BYTES)
+				cinval((void *) adrs);
+		}
+		sync();
+		/* wait a little while to see if we get a machine check */
+		__delay(200);
 	}
+	debugger_fault_handler = 0;
 }

 unsigned long
@@ -1072,6 +1075,7 @@
 	unsigned int instrs[2];
 	unsigned long (*code)(void);
 	unsigned long opd[3];
+	unsigned long ret = -1UL;

 	instrs[0] = 0x7c6002a6 + ((n & 0x1F) << 16) + ((n & 0x3e0) << 6);
 	instrs[1] = 0x4e800020;
@@ -1082,7 +1086,22 @@
 	store_inst(instrs+1);
 	code = (unsigned long (*)(void)) opd;

-	return code();
+	if (setjmp(bus_error_jmp) == 0) {
+		debugger_fault_handler = handle_fault;
+		sync();
+
+		ret = code();
+
+		sync();
+		/* wait a little while to see if we get a machine check */
+		__delay(200);
+	} else {
+		printf("*** Error reading spr %x\n", n);
+	}
+
+	debugger_fault_handler = 0;
+
+	return ret;
 }

 void
@@ -1101,7 +1120,20 @@
 	store_inst(instrs+1);
 	code = (unsigned long (*)(unsigned long)) opd;

-	code(val);
+	if (setjmp(bus_error_jmp) == 0) {
+		debugger_fault_handler = handle_fault;
+		sync();
+
+		code(val);
+
+		sync();
+		/* wait a little while to see if we get a machine check */
+		__delay(200);
+	} else {
+		printf("*** Error writing spr %x\n", n);
+	}
+
+	debugger_fault_handler = 0;
 }

 static unsigned long regno;
@@ -1111,11 +1143,14 @@
 void
 super_regs()
 {
-	int i, cmd;
+	int cmd;
 	unsigned long val;
-	struct paca_struct*  ptrPaca = NULL;
-	struct ItLpPaca*  ptrLpPaca = NULL;
-	struct ItLpRegSave*  ptrLpRegSave = NULL;
+#ifdef CONFIG_PPC_ISERIES
+	int i;
+	struct paca_struct *ptrPaca = NULL;
+	struct ItLpPaca *ptrLpPaca = NULL;
+	struct ItLpRegSave *ptrLpRegSave = NULL;
+#endif

 	cmd = skipbl();
 	if (cmd == '\n') {
@@ -1129,10 +1164,7 @@
 		printf("sp   = %.16lx  sprg3= %.16lx\n", sp, get_sprg3());
 		printf("toc  = %.16lx  dar  = %.16lx\n", toc, get_dar());
 		printf("srr0 = %.16lx  srr1 = %.16lx\n", get_srr0(), get_srr1());
-		printf("asr  = %.16lx\n", mfasr());
-		for (i = 0; i < 8; ++i)
-			printf("sr%.2ld = %.16lx  sr%.2ld = %.16lx\n", i, get_sr(i), i+8, get_sr(i+8));
-
+#ifdef CONFIG_PPC_ISERIES
 		// Dump out relevant Paca data areas.
 		printf("Paca: \n");
 		ptrPaca = get_paca();
@@ -1148,7 +1180,8 @@
 		printf("    Saved Sprg0=%.16lx  Saved Sprg1=%.16lx \n", ptrLpRegSave->xSPRG0, ptrLpRegSave->xSPRG0);
 		printf("    Saved Sprg2=%.16lx  Saved Sprg3=%.16lx \n", ptrLpRegSave->xSPRG2, ptrLpRegSave->xSPRG3);
 		printf("    Saved Msr  =%.16lx  Saved Nia  =%.16lx \n", ptrLpRegSave->xMSR, ptrLpRegSave->xNIA);
-
+#endif
+
 		return;
 	}

@@ -1162,11 +1195,6 @@
 	case 'r':
 		printf("spr %lx = %lx\n", regno, read_spr(regno));
 		break;
-	case 's':
-		val = get_sr(regno);
-		scanhex(&val);
-		set_sr(regno, val);
-		break;
 	case 'm':
 		val = get_msr();
 		scanhex(&val);
@@ -1923,240 +1951,8 @@
 	}
 }

-void
-mem_translate()
-{
-	int c;
-	unsigned long ea, va, vsid, vpn, page, hpteg_slot_primary, hpteg_slot_secondary, primary_hash, i, *steg, esid, stabl;
-	HPTE *  hpte;
-	struct mm_struct * mm;
-	pte_t  *ptep = NULL;
-	void * pgdir;
-
-	c = inchar();
-	if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n')
-		termch = c;
-	scanhex((void *)&ea);
-
-	if ((ea >= KRANGE_START) && (ea <= (KRANGE_START + (1UL<<60)))) {
-		ptep = 0;
-		vsid = get_kernel_vsid(ea);
-		va = ( vsid << 28 ) | ( ea & 0x0fffffff );
-	} else {
-		// if in vmalloc range, use the vmalloc page directory
-		if ( ( ea >= VMALLOC_START ) && ( ea <= VMALLOC_END ) ) {
-			mm = &init_mm;
-			vsid = get_kernel_vsid( ea );
-		}
-		// if in ioremap range, use the ioremap page directory
-		else if ( ( ea >= IMALLOC_START ) && ( ea <= IMALLOC_END ) ) {
-			mm = &ioremap_mm;
-			vsid = get_kernel_vsid( ea );
-		}
-		// if in user range, use the current task's page directory
-		else if ( ( ea >= USER_START ) && ( ea <= USER_END ) ) {
-			mm = current->mm;
-			vsid = get_vsid(mm->context, ea );
-		}
-		pgdir = mm->pgd;
-		va = ( vsid << 28 ) | ( ea & 0x0fffffff );
-		ptep = find_linux_pte( pgdir, ea );
-	}
-
-	vpn = ((vsid << 28) | (((ea) & 0xFFFF000))) >> 12;
-	page = vpn & 0xffff;
-	esid = (ea >> 28)  & 0xFFFFFFFFF;
-
-  // Search the primary group for an available slot
-	primary_hash = ( vsid & 0x7fffffffff ) ^ page;
-	hpteg_slot_primary = ( primary_hash & htab_data.htab_hash_mask ) * HPTES_PER_GROUP;
-	hpteg_slot_secondary = ( ~primary_hash & htab_data.htab_hash_mask ) * HPTES_PER_GROUP;
-
-	printf("ea             : %.16lx\n", ea);
-	printf("esid           : %.16lx\n", esid);
-	printf("vsid           : %.16lx\n", vsid);
-
-	printf("\nSoftware Page Table\n-------------------\n");
-	printf("ptep           : %.16lx\n", ((unsigned long *)ptep));
-	if(ptep) {
-		printf("*ptep          : %.16lx\n", *((unsigned long *)ptep));
-	}
-
-	hpte  = htab_data.htab  + hpteg_slot_primary;
-	printf("\nHardware Page Table\n-------------------\n");
-	printf("htab base      : %.16lx\n", htab_data.htab);
-	printf("slot primary   : %.16lx\n", hpteg_slot_primary);
-	printf("slot secondary : %.16lx\n", hpteg_slot_secondary);
-	printf("\nPrimary Group\n");
-	for (i=0; i<8; ++i) {
-		if ( hpte->dw0.dw0.v != 0 ) {
-			printf("%d: (hpte)%.16lx %.16lx\n", i, hpte->dw0.dword0, hpte->dw1.dword1);
-			printf("          vsid: %.13lx   api: %.2lx  hash: %.1lx\n",
-			       (hpte->dw0.dw0.avpn)>>5,
-			       (hpte->dw0.dw0.avpn) & 0x1f,
-			       (hpte->dw0.dw0.h));
-			printf("          rpn: %.13lx \n", (hpte->dw1.dw1.rpn));
-			printf("           pp: %.1lx \n",
-			       ((hpte->dw1.dw1.pp0)<<2)|(hpte->dw1.dw1.pp));
-			printf("        wimgn: %.2lx  reference: %.1lx  change: %.1lx\n",
-			       ((hpte->dw1.dw1.w)<<4)|
-			       ((hpte->dw1.dw1.i)<<3)|
-			       ((hpte->dw1.dw1.m)<<2)|
-			       ((hpte->dw1.dw1.g)<<1)|
-			       ((hpte->dw1.dw1.n)<<0),
-			       hpte->dw1.dw1.r, hpte->dw1.dw1.c);
-		}
-		hpte++;
-	}
-
-	printf("\nSecondary Group\n");
-	// Search the secondary group
-	hpte  = htab_data.htab  + hpteg_slot_secondary;
-	for (i=0; i<8; ++i) {
-		if(hpte->dw0.dw0.v) {
-			printf("%d: (hpte)%.16lx %.16lx\n", i, hpte->dw0.dword0, hpte->dw1.dword1);
-			printf("          vsid: %.13lx   api: %.2lx  hash: %.1lx\n",
-			       (hpte->dw0.dw0.avpn)>>5,
-			       (hpte->dw0.dw0.avpn) & 0x1f,
-			       (hpte->dw0.dw0.h));
-			printf("          rpn: %.13lx \n", (hpte->dw1.dw1.rpn));
-			printf("           pp: %.1lx \n",
-			       ((hpte->dw1.dw1.pp0)<<2)|(hpte->dw1.dw1.pp));
-			printf("        wimgn: %.2lx  reference: %.1lx  change: %.1lx\n",
-			       ((hpte->dw1.dw1.w)<<4)|
-			       ((hpte->dw1.dw1.i)<<3)|
-			       ((hpte->dw1.dw1.m)<<2)|
-			       ((hpte->dw1.dw1.g)<<1)|
-			       ((hpte->dw1.dw1.n)<<0),
-			       hpte->dw1.dw1.r, hpte->dw1.dw1.c);
-		}
-		hpte++;
-	}
-
-	printf("\nHardware Segment Table\n-----------------------\n");
-	stabl = (unsigned long)(KERNELBASE+(_ASR&0xFFFFFFFFFFFFFFFE));
-	steg = (unsigned long *)((stabl) | ((esid & 0x1f) << 7));
-
-	printf("stab base      : %.16lx\n", stabl);
-	printf("slot           : %.16lx\n", steg);
-
-	for (i=0; i<8; ++i) {
-		printf("%d: (ste) %.16lx %.16lx\n", i,
-		       *((unsigned long *)(steg+i*2)),*((unsigned long *)(steg+i*2+1)) );
-	}
-}
-
-void mem_check()
-{
-	unsigned long htab_size_bytes;
-	unsigned long htab_end;
-	unsigned long last_rpn;
-	HPTE *hpte1, *hpte2;
-
-	htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG
-	htab_end = (unsigned long)htab_data.htab + htab_size_bytes;
-	// last_rpn = (naca->physicalMemorySize-1) >> PAGE_SHIFT;
-	last_rpn = 0xfffff;
-
-	printf("\nHardware Page Table Check\n-------------------\n");
-	printf("htab base      : %.16lx\n", htab_data.htab);
-	printf("htab size      : %.16lx\n", htab_size_bytes);
-
-#if 1
-	for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) {
-		if ( hpte1->dw0.dw0.v != 0 ) {
-			if ( hpte1->dw1.dw1.rpn <= last_rpn ) {
-				for(hpte2 = hpte1+1; hpte2 < (HPTE *)htab_end; hpte2++) {
-					if ( hpte2->dw0.dw0.v != 0 ) {
-						if(hpte1->dw1.dw1.rpn == hpte2->dw1.dw1.rpn) {
-							printf(" Duplicate rpn: %.13lx \n", (hpte1->dw1.dw1.rpn));
-							printf("   hpte1: %16.16lx  *hpte1: %16.16lx %16.16lx\n",
-							       hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1);
-							printf("   hpte2: %16.16lx  *hpte2: %16.16lx %16.16lx\n",
-							       hpte2, hpte2->dw0.dword0, hpte2->dw1.dword1);
-						}
-					}
-				}
-			} else {
-				printf(" Bogus rpn: %.13lx \n", (hpte1->dw1.dw1.rpn));
-				printf("   hpte: %16.16lx  *hpte: %16.16lx %16.16lx\n",
-				       hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1);
-			}
-		}
-	}
-#endif
-	printf("\nDone -------------------\n");
-}
-
-void mem_find_real()
+static void debug_trace(void)
 {
-	unsigned long htab_size_bytes;
-	unsigned long htab_end;
-	unsigned long last_rpn;
-	HPTE *hpte1;
-	unsigned long pa, rpn;
-	int c;
-
-	c = inchar();
-	if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n')
-		termch = c;
-	scanhex((void *)&pa);
-	rpn = pa >> 12;
-
-	htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG
-	htab_end = (unsigned long)htab_data.htab + htab_size_bytes;
-	// last_rpn = (naca->physicalMemorySize-1) >> PAGE_SHIFT;
-	last_rpn = 0xfffff;
-
-	printf("\nMem Find RPN\n-------------------\n");
-	printf("htab base      : %.16lx\n", htab_data.htab);
-	printf("htab size      : %.16lx\n", htab_size_bytes);
-
-	for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) {
-		if ( hpte1->dw0.dw0.v != 0 ) {
-			if ( hpte1->dw1.dw1.rpn == rpn ) {
-				printf(" Found rpn: %.13lx \n", (hpte1->dw1.dw1.rpn));
-				printf("      hpte: %16.16lx  *hpte1: %16.16lx %16.16lx\n",
-				       hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1);
-			}
-		}
-	}
-	printf("\nDone -------------------\n");
-}
-
-void mem_find_vsid()
-{
-	unsigned long htab_size_bytes;
-	unsigned long htab_end;
-	HPTE *hpte1;
-	unsigned long vsid;
-	int c;
-
-	c = inchar();
-	if ((isxdigit(c) && c != 'f' && c != 'd') || c == '\n')
-		termch = c;
-	scanhex((void *)&vsid);
-
-	htab_size_bytes = htab_data.htab_num_ptegs * 128; // 128B / PTEG
-	htab_end = (unsigned long)htab_data.htab + htab_size_bytes;
-
-	printf("\nMem Find VSID\n-------------------\n");
-	printf("htab base      : %.16lx\n", htab_data.htab);
-	printf("htab size      : %.16lx\n", htab_size_bytes);
-
-	for(hpte1 = htab_data.htab; hpte1 < (HPTE *)htab_end; hpte1++) {
-		if ( hpte1->dw0.dw0.v != 0 ) {
-			if ( ((hpte1->dw0.dw0.avpn)>>5) == vsid ) {
-				printf(" Found vsid: %.16lx \n", ((hpte1->dw0.dw0.avpn) >> 5));
-				printf("       hpte: %16.16lx  *hpte1: %16.16lx %16.16lx\n",
-				       hpte1, hpte1->dw0.dword0, hpte1->dw1.dword1);
-			}
-		}
-	}
-	printf("\nDone -------------------\n");
-}
-
-static void debug_trace(void) {
         unsigned long val, cmd, on;

 	cmd = skipbl();
@@ -2202,4 +1998,48 @@
 		}
 		cmd = skipbl();
 	}
+}
+
+static void dump_slb(void)
+{
+	int i;
+	unsigned long tmp;
+
+	printf("SLB contents of cpu %d\n", smp_processor_id());
+
+	for (i = 0; i < naca->slb_size; i++) {
+		asm volatile("slbmfee  %0,%1" : "=r" (tmp) : "r" (i));
+		printf("%02d %016lx ", i, tmp);
+
+		asm volatile("slbmfev  %0,%1" : "=r" (tmp) : "r" (i));
+		printf("%016lx\n", tmp);
+	}
+}
+
+static void dump_stab(void)
+{
+	int i;
+	unsigned long *tmp = (unsigned long *)get_paca()->xStab_data.virt;
+
+	printf("Segment table contents of cpu %d\n", smp_processor_id());
+
+	for (i = 0; i < PAGE_SIZE/16; i++) {
+		unsigned long a, b;
+
+		a = *tmp++;
+		b = *tmp++;
+
+		if (a || b) {
+			printf("%03d %016lx ", i, a);
+			printf("%016lx\n", b);
+		}
+	}
+}
+
+void dump_segments(void)
+{
+	if (cur_cpu_spec->cpu_features & CPU_FTR_SLB)
+		dump_slb();
+	else
+		dump_stab();
 }
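
For readers puzzling over the instrs[0] expression in read_spr(): it
builds an "mfspr r3,n" instruction by hand. The SPR number is split into
two 5-bit halves that sit swapped in the instruction word, which is why
the low five bits are shifted by 16 and the high five bits by 6. A small
stand-alone sketch of just that arithmetic follows; encode_mfspr_r3() is
a hypothetical helper name, not something in the patch.

```c
#include <stdint.h>

/* Base opcode used in read_spr(): mfspr r3,0 */
#define MFSPR_R3_BASE 0x7c6002a6u

/* Hypothetical helper mirroring the expression in read_spr(). */
static uint32_t encode_mfspr_r3(unsigned int n)
{
	/* Low five bits of n go to bits 16-20, high five bits to bits 11-15. */
	return MFSPR_R3_BASE + ((n & 0x1f) << 16) + ((n & 0x3e0) << 6);
}
```

For example, SPR 8 (the link register) gives 0x7c6802a6, and SPR 1023
(pir in privinst.h) gives 0x7c7ffaa6. The second word, 0x4e800020, is
blr, so the two-instruction buffer reads the SPR into r3 and returns.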



From anton at samba.org  Fri Feb 13 17:24:50 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 13 Feb 2004 17:24:50 +1100
Subject: Resending the patch: Re: 2.6: PATCH for multiple EEH bugs
In-Reply-To: <402ABACF.3090809@us.ibm.com>
References: <20040203183459.B27780@forte.austin.ibm.com> <Pine.A41.4.44.0402032246520.85316-100000@forte.austin.ibm.com> <20040204142853.A28220@forte.austin.ibm.com> <20040205142643.T27780@forte.austin.ibm.com> <402ABACF.3090809@us.ibm.com>
Message-ID: <20040213062450.GL25922@krispykreme>


> I just want to express the importance of this patch. The 2.6 ipr driver
> requires it, since it regularly hits the eeh_check_failure bug. Please
> apply.

Yep, it's in the queue. Linas, I'd like to remove the messages printed at
boot; they are very verbose and will result in the Hollis police coming
after you.

Anton

PCI: skip building address cache for=0000:00:0c.0 Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE]
PCI: skip building address cache for=0000:00:0d.0 Alteon Networks Inc. AceNIC Gigabit Ethernet
PCI: skip building address cache for=0000:00:0e.0 Alteon Networks Inc. AceNIC Gigabit Ethernet (#2)
PCI: skip building address cache for=0000:00:0f.0 Matrox Graphics, Inc. MGA G200
PCI: skip building address cache for=0000:00:10.0 Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (#2)
PCI: skip building address cache for=0000:00:11.0 LSI Logic / Symbios Logic 53c896
PCI: skip building address cache for=0000:00:11.1 LSI Logic / Symbios Logic 53c896 (#2)
PCI: skip building address cache for=0001:40:0b.0 Intel Corp. 82546EB Gigabit Ethernet Controller (Copper)
PCI: skip building address cache for=0001:40:0b.1 Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) (#2)
PCI: skip building address cache for=0001:40:0c.0 Alteon Networks Inc. AceNIC Gigabit Ethernet (#3)
SCSI subsystem initialized
matroxfb: Matrox MGA-G200 (PCI) detected
matroxfb: BIOS on your Matrox device does



From moilanen at austin.ibm.com  Sat Feb 14 01:00:05 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Fri, 13 Feb 2004 08:00:05 -0600
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <20040212230045.GJ25922@krispykreme>
References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com>
	 <16427.62304.699990.252289@cargo.ozlabs.ibm.com>
	 <20040212230045.GJ25922@krispykreme>
Message-ID: <1076680805.10309.145.camel@DYN279927END.austin.ibm.com>


> I'm trying to make sure xmon is solid, so I'm interested in any way someone
> can lock up xmon once this patch goes in.

I have seen a number of instances during bringup where one of the CPUs
hangs in an RTAS call (due to a FW DSI); xmon then comes in, calls
disable_surveillance(), and hangs on the rtas lock.  I would suggest
removing the disable_surveillance() call, since there are no boxes that
I'm aware of that enable surveillance by default.  Maybe make disabling
surveillance a separate command, similar to KDB.

Thanks,
Jake




From hollisb at us.ibm.com  Sat Feb 14 02:05:22 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 13 Feb 2004 09:05:22 -0600
Subject: hotplug devices require interrupts?
Message-ID: <0BC11108-5E36-11D8-B629-000A95A0560C@us.ibm.com>


A patch from Linda yesterday pointed out that of_finish_dynamic_node()
(prom.c) requires an "interrupts" property for hotplugged device nodes.
Why is that? Until very recently the vty nodes didn't have that... the
fact that they do now only coincidentally means we didn't shoot
ourselves in the foot (unless there are still other interrupt-less
devices).

Certainly, lacking interrupts is not an ENODEV error in the
non-dynamic finish_node/finish_node_interrupts case; why is it for
hotplug?

--
Hollis Blanchard
IBM Linux Technology Center




From lxiep at us.ibm.com  Sat Feb 14 03:04:34 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Fri, 13 Feb 2004 10:04:34 -0600
Subject: hotplug devices require interrupts?
References: <0BC11108-5E36-11D8-B629-000A95A0560C@us.ibm.com>
Message-ID: <402CF592.9050209@us.ltcfwd.linux.ibm.com>


Hollis Blanchard wrote:
> A patch from Linda yesterday pointed out that of_finish_dynamic_node()
> (prom.c) requires an "interrupts" property for hotplugged device nodes.
> Why is that? Until very recently the vty nodes didn't have that... the
> fact that they do now only coincidentally means we didn't shoot
> ourselves in the foot (unless there are still other interrupt-less
> devices).
>
> Certainly, lacking interrupts is not an ENODEV error in the
> non-dynamic finish_node/finish_node_interrupts case; why is it for hotplug?
>

Nathan & Hollis,

Sorry for the confusion. The "interrupts" property is present on most PCI
I/O devices, but not all. I will submit another patch that will fix the
problem.

Thanks,

Linda




From lxiep at us.ibm.com  Sat Feb 14 03:43:49 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Fri, 13 Feb 2004 10:43:49 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
Message-ID: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>

The attached patch fixes the OF device tree update code so that it supports
Hotplug and DLPAR PCI multifunction cards.

Comments are welcome.

Thanks,

Linda

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: OFDT_update.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040213/42da4b42/attachment.txt 

From hollisb at us.ibm.com  Sat Feb 14 04:15:49 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 13 Feb 2004 11:15:49 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
Message-ID: <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com>


On Feb 13, 2004, at 10:43 AM, Linda Xie wrote:

> The attached patch fixes OF device tree update code so that it supports
> Hotplug and DLPAR PCI multifunction cards.

I'm still unclear on the requirement to have an "interrupts" property.
In this patch, all non-PCI devices without interrupts will be treated
as errors. Why? If the new node doesn't have the property, just skip
the interrupt code (bringing you to what you've called "fixup_pci"). It
doesn't even have to be a goto: break the interrupt code into its own
function and then:

	ints = (unsigned int *) get_property(node, "interrupts", &intlen);
	if (ints)
		of_finish_dynamic_interrupts(...);

	node->phb = parent->phb;
	regs = (u32 *)get_property(node, "reg", 0);

> @@ -2929,8 +2930,12 @@
>
>  	ints = (unsigned int *) get_property(node, "interrupts", &intlen);
>  	if (!ints) {
> -		err = -ENODEV;
> -		goto out;
> +		char *name = get_property(node, "name", NULL);
> +
> +		/* hot-pluggable cards don't have "interrupts" */
> +		if (name && !strcmp(name, "pci") && !get_property(node, "built-in", NULL))
> +			goto fixup_pci;
> +		goto out;
>  	}
>
>  	intrcells = prom_n_intr_cells(node);
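
A minimal sketch of the restructuring suggested above, written so it
compiles outside the kernel. The struct layouts, the simplified
get_property() and the irqs_parsed flag are assumptions made for
illustration; only the name of_finish_dynamic_interrupts() and the
"interrupts" property come from this thread. The point it tries to show is
that a node without "interrupts" simply skips the interrupt parsing and
still reaches the common fixups, with no error and no goto.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's device-tree types. */
struct property {
	const char *name;
	int length;
	void *value;
};

struct device_node {
	const struct property *props;	/* array of properties */
	int nprops;
	int irqs_parsed;		/* set when interrupt parsing ran */
};

/* Simplified get_property(): returns the value, or NULL if absent. */
static void *get_property(const struct device_node *np, const char *name,
			  int *lenp)
{
	for (int i = 0; i < np->nprops; i++) {
		if (strcmp(np->props[i].name, name) == 0) {
			if (lenp)
				*lenp = np->props[i].length;
			return np->props[i].value;
		}
	}
	return NULL;
}

/* Placeholder for the interrupt parsing split into its own function. */
static int of_finish_dynamic_interrupts(struct device_node *np,
					unsigned int *ints, int intlen)
{
	(void)ints;
	(void)intlen;
	np->irqs_parsed = 1;
	return 0;
}

/* A missing "interrupts" property is not an error; parsing is skipped. */
static int finish_dynamic_node(struct device_node *np)
{
	int intlen = 0;
	unsigned int *ints = get_property(np, "interrupts", &intlen);
	int err = 0;

	if (ints)
		err = of_finish_dynamic_interrupts(np, ints, intlen);
	if (err)
		return err;

	/* ... the phb and "reg" fixups would follow here for every node ... */
	return 0;
}
```

With this shape, finish_dynamic_node() returns 0 for a node with or
without "interrupts"; only a failure inside the interrupt parsing itself
would propagate as an error.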

--
Hollis Blanchard
IBM Linux Technology Center




From lxiep at us.ibm.com  Sat Feb 14 05:00:28 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Fri, 13 Feb 2004 12:00:28 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com>
Message-ID: <402D10BC.3000404@us.ltcfwd.linux.ibm.com>


Hollis Blanchard wrote:
> On Feb 13, 2004, at 10:43 AM, Linda Xie wrote:

> I'm still unclear on the requirement to have an "interrupts" property.
> In this patch, all non-PCI devices without interrupts will be treated as
> errors.

They are not treated as errors; the patch just skips some lines that are
not needed for those devices.

> If the new node doesn't have the property, just skip the
> interrupt code (bringing you to what you've called "fixup_pci"). It
> doesn't even have to be a goto: break the interrupt code into its own
> function and then:
>
>      ints = (unsigned int *) get_property(node, "interrupts", &intlen);
>      if (ints)
>         of_finish_dynamic_interrupts(...)
>
>     node->phb = parent->phb;
>     regs = (u32 *)get_property(node, "reg", 0);
>

Because I didn't want to change the "if(!ints)" logic that was there
before. BTW, I think that yours looks better.

Nathan, Any thoughts?




From willschm at northrelay02.pok.ibm.com  Sat Feb 14 05:23:03 2004
From: willschm at northrelay02.pok.ibm.com (will schmidt)
Date: Fri, 13 Feb 2004 10:23:03 -0800
Subject: LPARCFG
In-Reply-To: <200402060425.i164PRJ60614@news.rchland.ibm.com>
References: <200402060425.i164PRJ60614@news.rchland.ibm.com>
Message-ID: <200402131620.i1DGKTOX103590@northrelay02.pok.ibm.com>


This patch includes:
- more of the function calls needed/requested for the licence manager folks.
- change config option for lparcfg back to tristate
- exported those pesky symbols needed via lparcfg as a module.   (with
kallsyms exporting all symbols, this issue might be masked depending
upon your kernel config)
- change H_SET_PURR to H_PURR

i realise i've also got some whitespace in the patch, i'll clean those
out before i push anything up to ames.


Will Schmidt wrote:
>>>Building as a module was broken when the code was checked in to ameslab
>>>2.6; I suggested turning it off.  I think lparcfg uses unexported
>>>symbols -- cur_cpu_spec, systemcfg, and plpar_hcall_4out.  Should any
>
> of
>
>>>those be exported?
>>>
>>>See this thread for the history:
>>>http://lists.linuxppc.org/linuxppc64-dev/200312/msg00023.html
>>
>>Ahh yeah, I have such a short memory.
>>
>>Actually I enabled it as a module and built all modules and didnt get
>>any warnings. Either we have everything exported now, Im not getting
>>undefined symbol warnings any more or else Im going blind.
>
>
> I've got some more updates for this code, will try to get a patch onto this
> list tomorrow.   (still need to forward port to current and check on
> iSeries)..
> I couldnt build it as a module without exporting a few symbols, but  my
> tree is about a week old, so i'm probably missing those fixes.
>
>
> -Will
>
> willschm at us.ibm.com
> Linux on PowerPC-64 Development
> IBM Rochester
>
>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lparcfg.0212.diff
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040213/0282971c/attachment.txt 

From olh at suse.de  Sat Feb 14 05:42:08 2004
From: olh at suse.de (Olaf Hering)
Date: Fri, 13 Feb 2004 19:42:08 +0100
Subject: [PATCH] PCI_DMA_NONE undeclared in asm/vio.h
Message-ID: <20040213184208.GA24365@suse.de>


asm/vio.h requires PCI_DMA_NONE, so include it either in the drivers or
directly in asm/vio.h


--- ./drivers/scsi/ibmvscsi/rpa_vscsi.c~        2004-02-13 18:56:02.000000000 +0100
+++ ./drivers/scsi/ibmvscsi/rpa_vscsi.c 2004-02-13 19:41:13.000000000 +0100
@@ -28,6 +28,7 @@
  */

 #include <linux/module.h>
+#include <linux/pci.h>
 #include <asm/vio.h>
 #include <asm/pci_dma.h>
 #include <asm/hvcall.h>
--- ./drivers/scsi/ibmvscsi/ibmvscsi.c~ 2004-02-13 18:56:02.000000000 +0100
+++ ./drivers/scsi/ibmvscsi/ibmvscsi.c  2004-02-13 19:38:14.000000000 +0100
@@ -63,6 +63,7 @@
  */

 #include <linux/module.h>
+#include <linux/pci.h>
 #include <asm/vio.h>
 #include "ibmvscsi.h"


--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From olh at suse.de  Sat Feb 14 05:48:52 2004
From: olh at suse.de (Olaf Hering)
Date: Fri, 13 Feb 2004 19:48:52 +0100
Subject: vio_unmap_sg and vio_bus_type not exported
Message-ID: <20040213184852.GA9470@suse.de>


I get two unresolved symbols after the recent changes,
current ameslab 2.6


kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg
kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type


--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From hollisb at us.ibm.com  Sat Feb 14 06:16:41 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 13 Feb 2004 13:16:41 -0600
Subject: vio_unmap_sg and vio_bus_type not exported
In-Reply-To: <20040213184852.GA9470@suse.de>
References: <20040213184852.GA9470@suse.de>
Message-ID: <2791A786-5E59-11D8-B629-000A95A0560C@us.ibm.com>


On Feb 13, 2004, at 12:48 PM, Olaf Hering wrote:
>
> I get two unresolved symbols after the recent changes,
> current ameslab 2.6
>
> kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg
> kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type

This must be because of the new dma-mapping.h. As long as vio_bus_type
is referenced in that header, it will have to be exported. And the
hacks multiply...

--
Hollis Blanchard
IBM Linux Technology Center




From boutcher at us.ibm.com  Sat Feb 14 07:07:31 2004
From: boutcher at us.ibm.com (David Boutcher)
Date: Fri, 13 Feb 2004 14:07:31 -0600
Subject: vio_unmap_sg and vio_bus_type not exported
In-Reply-To: <2791A786-5E59-11D8-B629-000A95A0560C@us.ibm.com>
Message-ID: <OFD9D87B05.8439D6DE-ON86256E39.006E5A7A-86256E39.006E8D67@us.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/13/2004 01:16:41 PM:
> On Feb 13, 2004, at 12:48 PM, Olaf Hering wrote:
> >
> > I get two unresolved symbols after the recent changes, current
> > ameslab 2.6
> >
> > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg
> > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type
>
> This must be because of the new dma-mapping.h. As long as vio_bus_type
> is referenced in that header, it will have to be exported. And the
> hacks multiply...

There was a one hour window last night where I had a bug....does this still
happen with a current bk pull?

Dave Boutcher
IBM Linux Technology Center



From olh at suse.de  Sat Feb 14 07:46:12 2004
From: olh at suse.de (Olaf Hering)
Date: Fri, 13 Feb 2004 21:46:12 +0100
Subject: likely() disappeared
Message-ID: <20040213204612.GA28320@suse.de>


likely() from linux/compiler.h disappeared into a #ifdef __KERNEL__.
arch/ppc64/boot/prom.c calls do_div() in number().
what is the correct fix, other than not using the 'make all' target?


--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From olh at suse.de  Sat Feb 14 08:11:31 2004
From: olh at suse.de (Olaf Hering)
Date: Fri, 13 Feb 2004 22:11:31 +0100
Subject: vio_unmap_sg and vio_bus_type not exported
In-Reply-To: <OFD9D87B05.8439D6DE-ON86256E39.006E5A7A-86256E39.006E8D67@us.ibm.com>
References: <2791A786-5E59-11D8-B629-000A95A0560C@us.ibm.com> <OFD9D87B05.8439D6DE-ON86256E39.006E5A7A-86256E39.006E8D67@us.ibm.com>
Message-ID: <20040213211131.GA19947@suse.de>


 On Fri, Feb 13, David Boutcher wrote:

> owner-linuxppc64-dev at lists.linuxppc.org wrote on 02/13/2004 01:16:41 PM:
> > On Feb 13, 2004, at 12:48 PM, Olaf Hering wrote:
> > >
> > > I get two unresolved symbols after the recent changes,
> > > current ameslab 2.6
> > >
> > > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_unmap_sg
> > > kernel/drivers/usb/core/usbcore.ko needs unknown symbol vio_bus_type
> >
> > This must be because of the new dma-mapping.h. As long as vio_bus_type
> > is referenced in that header, it will have to be exported. And the
> > hacks multiply...
>
> There was a one hour window last night where I had a bug....does this still
> happen with a current bk pull?

yes, current bk.

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From olh at suse.de  Sat Feb 14 08:18:51 2004
From: olh at suse.de (Olaf Hering)
Date: Fri, 13 Feb 2004 22:18:51 +0100
Subject: g5 requires ADB_PMU
Message-ID: <20040213211851.GA20600@suse.de>


The kernel doesn't link without PMU.

--- ../linux-2.6.2/arch/ppc64/Kconfig   2004-02-13 17:56:05.000000000 +0000
+++ ./arch/ppc64/Kconfig  2004-02-13 21:14:45.000000000 +0000
@@ -87,6 +87,7 @@
 config PPC_PMAC
        depends on PPC_PSERIES
        bool "Apple PowerMac G5 support"
+       select ADB_PMU

 config PPC_PMAC64
        bool


--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From benh at kernel.crashing.org  Sat Feb 14 10:04:33 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 14 Feb 2004 10:04:33 +1100
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
Message-ID: <1076713472.8000.83.camel@gaston>


On Sat, 2004-02-14 at 03:43, Linda Xie wrote:
> The attached patch fixes OF device tree update code so that it supports
> Hotplug and DLPAR PCI multifunction cards.
>
> Comments are welcome.

Can't this code be moved out of prom.c ?

It's time to think about cleaning up that pile of mess that is prom.c,
ultimately, things that have to run in the context of the firmware
environment early during boot, should be completely split from things
that are used later on during normal kernel usage.

And low level device-tree accessors split from higher level things like
this hotplug stuff.

We won't do that right away, but starting to slowly move things away
as they are modified is a good way to make things simpler for Anton and
me when it's time to do the big cleanup.

Ben.




From nathanl at austin.ibm.com  Sat Feb 14 11:09:53 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Fri, 13 Feb 2004 18:09:53 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <1076713472.8000.83.camel@gaston>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <1076713472.8000.83.camel@gaston>
Message-ID: <402D6751.6080502@austin.ibm.com>


Benjamin Herrenschmidt wrote:
> Can't this code be moved out of prom.c ?

Yes, the of_find accessors etc. could be moved to their own file,
probably without too much pain.

> It's time to think about cleaning up that pile of mess that is prom.c,
> ultimately, things that have to run in the context of the firmware
> environment early during boot, should be completely split from things
> that are used later on during normal kernel usage.
>
> And low level device-tree accessors split from higher level things like
> this hotplug stuff.

The hotplug stuff (of_add_node, of_remove_node) necessarily uses the
same lock as the accessors, but we could take more care to avoid
building it on platforms which don't need it.  I believe those functions
should be compiled in only when building for pSeries with CONFIG_HOTPLUG=y.

Nathan



From olh at suse.de  Sat Feb 14 19:59:37 2004
From: olh at suse.de (Olaf Hering)
Date: Sat, 14 Feb 2004 09:59:37 +0100
Subject: likely() disappeared
In-Reply-To: <20040213204612.GA28320@suse.de>
References: <20040213204612.GA28320@suse.de>
Message-ID: <20040214085937.GA1426@suse.de>


 On Fri, Feb 13, Olaf Hering wrote:

>
> likely() from linux/compiler.h disappeared into a #ifdef __KERNEL__.
> arch/ppc64/boot/prom.c calls do_div() in number().
> what is the correct fix, other than not using the 'make all' target?

It's already in ameslab:


diff -purN linux-2.5/arch/ppc64/boot/prom.c linuxppc64-2.5/arch/ppc64/boot/prom.c
--- linux-2.5/arch/ppc64/boot/prom.c	2003-07-09 17:20:23.000000000 +0000
+++ linuxppc64-2.5/arch/ppc64/boot/prom.c	2004-02-12 12:05:57.000000000 +0000
@@ -11,9 +11,6 @@
 #include <linux/string.h>
 #include <linux/ctype.h>

-#define BITS_PER_LONG 32
-#include <asm/div64.h>
-
 int (*prom)(void *);

 void *chosen_handle;
@@ -28,6 +25,9 @@ void chrpboot(int a1, int a2, void *prom

 void printk(char *fmt, ...);

+/* there is no convenient header to get this from...  -- paulus */
+extern unsigned long strlen(const char *);
+
 int
 write(void *handle, void *ptr, int nb)
 {
@@ -352,7 +352,7 @@ static int skip_atoi(const char **s)
 #define SPECIAL	32		/* 0x */
 #define LARGE	64		/* use 'ABCDEF' instead of 'abcdef' */

-static char * number(char * str, long long num, int base, int size, int precision, int type)
+static char * number(char * str, long num, int base, int size, int precision, int type)
 {
 	char c,sign,tmp[66];
 	const char *digits="0123456789abcdefghijklmnopqrstuvwxyz";
@@ -388,8 +388,10 @@ static char * number(char * str, long lo
 	i = 0;
 	if (num == 0)
 		tmp[i++]='0';
-	else while (num != 0)
-		tmp[i++] = digits[do_div(num,base)];
+	else while (num != 0) {
+		tmp[i++] = digits[num % base];
+		num /= base;
+	}
 	if (i > precision)
 		precision = i;
 	size -= precision;
@@ -424,7 +426,7 @@ int sprintf(char * buf, const char *fmt,
 int vsprintf(char *buf, const char *fmt, va_list args)
 {
 	int len;
-	unsigned long long num;
+	unsigned long num;
 	int i, base;
 	char * str;
 	const char *s;
@@ -575,9 +577,7 @@ int vsprintf(char *buf, const char *fmt,
 				--fmt;
 			continue;
 		}
-		if (qualifier == 'L')
-			num = va_arg(args, long long);
-		else if (qualifier == 'l') {
+		if (qualifier == 'l') {
 			num = va_arg(args, unsigned long);
 			if (flags & SIGN)
 				num = (signed long) num;
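
For readers following the number() hunk above, here is a small standalone
sketch of the same digit loop: plain % and / on an unsigned long instead
of do_div(), which is what lets the patch drop the asm/div64.h include.
The function name ulong_to_str and its buffer-size handling are
illustrative assumptions and are not part of the boot wrapper.

```c
#include <stddef.h>

/*
 * Convert num to text in the given base (2..36) using plain % and /.
 * Writes a NUL-terminated string to out and returns its length,
 * or 0 if the base is invalid or the buffer is too small.
 */
static size_t ulong_to_str(unsigned long num, unsigned int base,
			   char *out, size_t outlen)
{
	static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
	char tmp[66];	/* enough for 64 binary digits */
	size_t i = 0, n;

	if (base < 2 || base > 36)
		return 0;

	if (num == 0)
		tmp[i++] = '0';
	else while (num != 0) {
		tmp[i++] = digits[num % base];
		num /= base;
	}

	if (i + 1 > outlen)
		return 0;

	/* Digits were produced least significant first; reverse them. */
	for (n = 0; n < i; n++)
		out[n] = tmp[i - 1 - n];
	out[i] = '\0';
	return i;
}
```

The digits come out least significant first, which is why number() in the
patch fills tmp[] and later emits it backwards.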

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From paulus at samba.org  Sat Feb 14 20:24:31 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 14 Feb 2004 20:24:31 +1100
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
Message-ID: <16429.59727.123989.862372@cargo.ozlabs.ibm.com>


Linda Xie writes:

> The attached patch fixes OF device tree update code so that it supports
> Hotplug and DLPAR PCI multifunction cards.
>
> Comments are welcome.

This looks to me like it would benefit from having parts of it such as
the interrupt property parsing code split out into subroutines.

Paul.



From anton at samba.org  Sun Feb 15 00:57:48 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 15 Feb 2004 00:57:48 +1100
Subject: likely() disappeared
In-Reply-To: <20040214085937.GA1426@suse.de>
References: <20040213204612.GA28320@suse.de> <20040214085937.GA1426@suse.de>
Message-ID: <20040214135748.GF9910@krispykreme>


> It's already in ameslab:

Thanks, I forwarded paulus' patch onto Linus and akpm.

Anton



From benh at kernel.crashing.org  Sun Feb 15 13:48:18 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 15 Feb 2004 13:48:18 +1100
Subject: People with ATAPI problems: possible fix
Message-ID: <1076813298.853.2.camel@gaston>


Hi !

Recently, there have been lots of reports of ATAPI issues, typically
with CD/DVD burners, where DMA would be disabled for no obvious reasons,
with the driver spitting a message about a timeout waiting for dbdma
command stop.

The problem is that our driver doesn't deal with buffer underruns due
to the drive not sending as much data as requested.

This patch is an attempt at fixing this. Please let me know if it
helps.

Ben.

===== drivers/ide/ppc/pmac.c 1.50 vs edited =====
--- 1.50/drivers/ide/ppc/pmac.c	Sat Feb 14 19:29:16 2004
+++ edited/drivers/ide/ppc/pmac.c	Sun Feb 15 13:47:12 2004
@@ -55,7 +55,7 @@

 #define IDE_PMAC_DEBUG

-#define DMA_WAIT_TIMEOUT	100
+#define DMA_WAIT_TIMEOUT	50

 typedef struct pmac_ide_hwif {
 	unsigned long			regbase;
@@ -2032,8 +2032,11 @@
 	dstat = readl(&dma->status);
 	writel(((RUN|WAKE|DEAD) << 16), &dma->control);
 	pmac_ide_destroy_dmatable(drive);
-	/* verify good dma status */
-	return (dstat & (RUN|DEAD|ACTIVE)) != RUN;
+	/* verify good dma status. we don't check for ACTIVE beeing 0. We should...
+	 * in theory, but with ATAPI decices doing buffer underruns, that would
+	 * cause us to disable DMA, which isn't what we want
+	 */
+	return (dstat & (RUN|DEAD)) != RUN;
 }

 /*
@@ -2047,7 +2050,7 @@
 {
 	pmac_ide_hwif_t* pmif = (pmac_ide_hwif_t *)HWIF(drive)->hwif_data;
 	volatile struct dbdma_regs *dma;
-	unsigned long status;
+	unsigned long status, timeout;

 	if (pmif == NULL)
 		return 0;
@@ -2063,17 +2066,8 @@
 	 * - The dbdma fifo hasn't yet finished flushing to
 	 * to system memory when the disk interrupt occurs.
 	 *
-	 * The trick here is to increment drive->waiting_for_dma,
-	 * and return as if no interrupt occurred. If the counter
-	 * reach a certain timeout value, we then return 1. If
-	 * we really got the interrupt, it will happen right away
-	 * again.
-	 * Apple's solution here may be more elegant. They issue
-	 * a DMA channel interrupt (a separate irq line) via a DBDMA
-	 * NOP command just before the STOP, and wait for both the
-	 * disk and DBDMA interrupts to have completed.
 	 */
-
+
 	/* If ACTIVE is cleared, the STOP command have passed and
 	 * transfer is complete.
 	 */
@@ -2085,15 +2079,26 @@
 			called while not waiting\n", HWIF(drive)->index);

 	/* If dbdma didn't execute the STOP command yet, the
-	 * active bit is still set */
-	drive->waiting_for_dma++;
-	if (drive->waiting_for_dma >= DMA_WAIT_TIMEOUT) {
-		printk(KERN_WARNING "ide%d, timeout waiting \
-			for dbdma command stop\n", HWIF(drive)->index);
-		return 1;
-	}
-	udelay(5);
-	return 0;
+	 * active bit is still set. We consider that we aren't
+	 * sharing interrupts (which is hopefully the case with
+	 * those controllers) and so we just try to flush the
+	 * channel for pending data in the fifo
+	 */
+	udelay(1);
+	writel((FLUSH << 16) | FLUSH, &dma->control);
+	timeout = 0;
+	for (;;) {
+		udelay(1);
+		status = readl(&dma->status);
+		if ((status & FLUSH) == 0)
+			break;
+		if (++timeout > 100) {
+			printk(KERN_WARNING "ide%d, ide_dma_test_irq \
+			timeout flushing channel\n", HWIF(drive)->index);
+			break;
+		}
+	}
+	return 1;
 }

 static int __pmac
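
The bounded flush loop in the last hunk can be tried without hardware with
a sketch like the one below. The callback type, the fake device and the
value chosen for FLUSH_BIT are assumptions for illustration; the real
driver reads dma->status with readl() and calls udelay(1) between reads.
Unlike the patch, which returns 1 in both cases after printing a warning,
this version returns distinct codes so the timeout path can be checked.

```c
#include <stdint.h>

#define FLUSH_BIT	0x2000u	/* assumed bit value, for illustration only */
#define FLUSH_POLL_MAX	100

/* Hypothetical register accessor so the loop can run without hardware. */
typedef uint32_t (*read_status_fn)(void *ctx);

/*
 * Poll until the FLUSH bit clears or the retry budget is used up.
 * Returns 0 when the flush completed and -1 on timeout.
 */
static int wait_for_flush(read_status_fn read_status, void *ctx)
{
	unsigned int timeout = 0;

	for (;;) {
		/* the driver would call udelay(1) here between reads */
		if ((read_status(ctx) & FLUSH_BIT) == 0)
			return 0;
		if (++timeout > FLUSH_POLL_MAX)
			return -1;
	}
}

/* Fake device: reports FLUSH set for a fixed number of reads. */
struct fake_dma {
	unsigned int busy_reads;
};

static uint32_t fake_read_status(void *ctx)
{
	struct fake_dma *dma = ctx;

	if (dma->busy_reads > 0) {
		dma->busy_reads--;
		return FLUSH_BIT;
	}
	return 0;
}
```

The design point is the same as in the patch: the wait is bounded, so a
channel that never finishes flushing cannot hang the interrupt path.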




From hollisb at us.ibm.com  Tue Feb 17 05:57:15 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Mon, 16 Feb 2004 12:57:15 -0600
Subject: open_pic.c without CONFIG_SMP
Message-ID: <4031128B.8050602@us.ibm.com>

I don't know if this patch is correct, but something is needed here.

--
Hollis Blanchard
IBM Linux Technology Center
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: openpic-UP.diff
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040216/9b2124bc/attachment.txt 

From olh at suse.de  Tue Feb 17 07:49:32 2004
From: olh at suse.de (Olaf Hering)
Date: Mon, 16 Feb 2004 21:49:32 +0100
Subject: [PATCH] unbreaking pseries nvram driver
Message-ID: <20040216204932.GA3926@suse.de>


rtas_token returns an int, the rtas_call arguments are also an int. The
error code returned by rtas_token is also an int.
So, why the unsigned int?


--- /tmp/linuxppc64-2.5/arch/ppc64/kernel/pSeries_nvram.c	2004-02-12 03:47:53.000000000 +0000
+++ ./arch/ppc64/kernel/pSeries_nvram.c	2004-02-16 20:45:12.000000000 +0000
@@ -29,7 +29,7 @@
 #include <asm/machdep.h>

 static unsigned int nvram_size;
-static unsigned int nvram_fetch, nvram_store;
+static int nvram_fetch, nvram_store;
 static char nvram_buf[NVRW_CNT];	/* assume this is in the first 4GB */
 static spinlock_t nvram_lock = SPIN_LOCK_UNLOCKED;

@@ -41,7 +41,7 @@ static ssize_t pSeries_nvram_read(char *
 	unsigned long flags;
 	char *p = buf;

-	if (nvram_size == 0 || nvram_fetch)
+	if (nvram_size == 0 || nvram_fetch == RTAS_UNKNOWN_SERVICE)
 		return -ENODEV;

 	if (*index >= nvram_size)
@@ -83,7 +83,7 @@ static ssize_t pSeries_nvram_write(char
 	unsigned long flags;
 	const char *p = buf;

-	if (nvram_size == 0 || nvram_store)
+	if (nvram_size == 0 || nvram_store == RTAS_UNKNOWN_SERVICE)
 		return -ENODEV;

 	if (*index >= nvram_size)
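
To make the effect of the two changed tests concrete, here is a small
userspace sketch. The helper names and the token value used in the example
are hypothetical, and the value given for RTAS_UNKNOWN_SERVICE is an
assumption; the two conditions themselves are copied from the hunks above.
The old test rejects every nonzero token, so a valid token makes the
driver return -ENODEV; the new test rejects only the sentinel.

```c
#include <errno.h>

#define RTAS_UNKNOWN_SERVICE	(-1)	/* assumed sentinel value */

/* Old test: any nonzero token counts as "missing", so valid tokens fail. */
static int nvram_check_old(unsigned int size, unsigned int token)
{
	if (size == 0 || token)
		return -ENODEV;
	return 0;
}

/* Patched test: only the sentinel means the service is unavailable. */
static int nvram_check_new(unsigned int size, int token)
{
	if (size == 0 || token == RTAS_UNKNOWN_SERVICE)
		return -ENODEV;
	return 0;
}
```

Keeping the token as a signed int also matters here: the comparison with a
negative sentinel then reads naturally and matches the return type of
rtas_token as described in this message.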
--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From anton at samba.org  Tue Feb 17 15:35:27 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 17 Feb 2004 15:35:27 +1100
Subject: KDB in ameslab
Message-ID: <20040217043527.GC25491@krispykreme>


Hi,

Linus had a nasty problem debugging what should have been a simple
problem but became a nightmare due to an incorrect debug hook. In
response to this I have cleaned up our debug hooks, it should be much
harder to screw up.

A side effect of this is that KDB is probably broken. I started looking
into fixing it however I noticed it looks out of date. Does someone have
the urge to update it?

Anton



From segher at kernel.crashing.org  Tue Feb 17 18:19:06 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 17 Feb 2004 08:19:06 +0100
Subject: open_pic.c without CONFIG_SMP
In-Reply-To: <4031128B.8050602@us.ibm.com>
References: <4031128B.8050602@us.ibm.com>
Message-ID: <924B9C8A-6119-11D8-BF03-000A95A4DC02@kernel.crashing.org>


> I don't know if this patch is correct, but something is needed here.

Why?  OpenPIC has IPIs whether or not the system is SMP, i.e., it's
a feature of the interrupt controller itself, not of the system.

Of course it may not be _necessary_ to initialize this stuff, but
it won't hurt, either.


Segher




From anton at samba.org  Tue Feb 17 18:19:25 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 17 Feb 2004 18:19:25 +1100
Subject: autoconsole
In-Reply-To: <20040212150907.GA13059@suse.de>
References: <20040113134042.A3771@w-mikek2.beaverton.ibm.com> <D56F210C-4617-11D8-92C2-000A95A0560C@us.ibm.com> <20040113235744.GB13397@krispykreme> <20040113185245.A1747@w-mikek2.beaverton.ibm.com> <20040114080749.GA374@suse.de> <080D5891-46A6-11D8-9E58-000A95A0560C@us.ibm.com> <20040115011808.GD27924@krispykreme> <20040212150907.GA13059@suse.de>
Message-ID: <20040217071925.GE25491@krispykreme>


> have you tested it? console=tty1 doesnt work, console output still goes
> straight to ttyS0. cmd_line is probably not yet set, my cmdline contains
> alot of stuff, but only the first word was printed with
>
>         printk("%s(%u) cmd_line is %s\n",__FUNCTION__,__LINE__,cmd_line);
>
> in set_preferred_console(). So strstr(cmd_line, "console=") does not
> trigger. Does it just not work for me?

Yeah, we are parsing cmd_line after it's been tokenised. I've converted it
to use saved_command_line (and found what looks like another case in eeh
init). It's in ameslab and I'll send it upstream after 2.6.3 is released.
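
A minimal userspace sketch of why the strstr() on cmd_line saw only the
first word, assuming the tokeniser splits the buffer in place by writing
NUL bytes over the separators. The helper names tokenise_in_place and
has_console_arg are made up for illustration; cmd_line,
saved_command_line and the "console=" match come from this thread.

```c
#include <string.h>

/*
 * Mimic in-place tokenising: every space becomes a NUL, so the buffer
 * reads as its first word only when treated as a C string afterwards.
 */
static void tokenise_in_place(char *line)
{
	for (; *line; line++)
		if (*line == ' ')
			*line = '\0';
}

/* Returns 1 if "console=" is visible in the given string, else 0. */
static int has_console_arg(const char *line)
{
	return strstr(line, "console=") != NULL;
}
```

Given a copy taken before tokenising (the role saved_command_line plays),
has_console_arg() still finds the option, while the tokenised buffer ends
at the first separator and the search fails.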

Anton



From anton at samba.org  Tue Feb 17 18:35:10 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 17 Feb 2004 18:35:10 +1100
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <1076680805.10309.145.camel@DYN279927END.austin.ibm.com>
References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com> <16427.62304.699990.252289@cargo.ozlabs.ibm.com> <20040212230045.GJ25922@krispykreme> <1076680805.10309.145.camel@DYN279927END.austin.ibm.com>
Message-ID: <20040217073510.GF25491@krispykreme>


> I have seen a number of instances during bringup where one of the CPUs
> will hang in a rtas call (due to a FW DSI), and then xmon would come in
> and disable_surveillance and hang on the rtas lock.  I would suggest
> removing the disable_surveillance() call since there are no boxes that
> I'm aware of that enable surveillance by default.  Maybe make it a
> separate command to disable surveillance similar to KDB.

Interesting, I enable it on all our machines and would have thought we
should be doing it automatically. From a RAS perspective we want to
reboot the machine if it locks up.

Maybe we can have a spin_trylock rtas call and whinge if the lock was
already taken. Once we switch to stopping all cpus on xmon entry we
want to do this to avoid deadlocks anyway.
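
The trylock idea could look roughly like the sketch below. It is a
userspace approximation under stated assumptions: a C11 atomic_flag stands
in for the RTAS spinlock, and rtas_trylock, rtas_unlock and
debugger_rtas_call are hypothetical names, not existing kernel functions.
The debugger path never spins; if the lock is held it complains and skips
the call.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical stand-in for the RTAS lock; not the kernel spinlock API. */
static atomic_flag rtas_lock = ATOMIC_FLAG_INIT;

/* Returns 1 if the lock was acquired, 0 if someone else holds it. */
static int rtas_trylock(void)
{
	return !atomic_flag_test_and_set(&rtas_lock);
}

static void rtas_unlock(void)
{
	atomic_flag_clear(&rtas_lock);
}

/*
 * Debugger-side call: never spin. If the lock is busy, complain and
 * give up so the debugger stays usable.
 * Returns 0 if the call was made, -1 if it was skipped.
 */
static int debugger_rtas_call(void (*call)(void))
{
	if (!rtas_trylock()) {
		fprintf(stderr, "rtas lock already taken, skipping call\n");
		return -1;
	}
	call();
	rtas_unlock();
	return 0;
}

/* Fake RTAS service used to observe whether the call happened. */
static int calls_made;

static void fake_disable_surveillance(void)
{
	calls_made++;
}
```

If another CPU is stuck inside firmware while holding the lock, this path
returns -1 instead of deadlocking, which is the behaviour wanted once all
CPUs are stopped on xmon entry.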

Anton



From moilanen at austin.ibm.com  Wed Feb 18 00:46:34 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 17 Feb 2004 07:46:34 -0600
Subject: [PATCH][2.6] hardware breakpoint fix/support for xmon
In-Reply-To: <20040217073510.GF25491@krispykreme>
References: <1076619314.10309.85.camel@DYN279927END.austin.ibm.com>
	 <16427.62304.699990.252289@cargo.ozlabs.ibm.com>
	 <20040212230045.GJ25922@krispykreme>
	 <1076680805.10309.145.camel@DYN279927END.austin.ibm.com>
	 <20040217073510.GF25491@krispykreme>
Message-ID: <1077025594.4441.67.camel@DYN279927END.austin.ibm.com>


> Interesting, I enable it on all our machines and would have thought we
> should be doing it automatically. From a RAS perspective we want to
> reboot the machine if it locks up.

I believe it was enabled by default on 2.4, but on 2.6 I don't think it
is enabled by the OS unless the surveillance= boot option is specified.

IIRC FW has it off by default as well on any GA'd system.

Thanks,
Jake




From hollisb at us.ibm.com  Wed Feb 18 08:01:37 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 17 Feb 2004 15:01:37 -0600
Subject: bk pushed to deleted files?
Message-ID: <798E3D07-618C-11D8-8A98-000A95A0560C@us.ibm.com>


Hi, I just pushed some changes to include/asm-ppc64/hvconsole.h,
arch/ppc64/kernel/hvconsole.c, and drivers/char/hvc_console.c.
Unfortunately this is what I see:
	drivers/char/hvc_console.c: 2 deltas
	BitKeeper/deleted/.del-hvconsole.c: 3 deltas
	BitKeeper/deleted/.del-hvconsole.h: 3 deltas
... despite the fact that neither of those two files has ever been
deleted. Of course my changes were not applied to the correct
(undeleted) files.

Help?

--
Hollis Blanchard
IBM Linux Technology Center




From hollisb at us.ibm.com  Wed Feb 18 09:44:38 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 17 Feb 2004 16:44:38 -0600
Subject: bk pushed to deleted files?
In-Reply-To: <798E3D07-618C-11D8-8A98-000A95A0560C@us.ibm.com>
References: <798E3D07-618C-11D8-8A98-000A95A0560C@us.ibm.com>
Message-ID: <DDD6BDD8-619A-11D8-8A98-000A95A0560C@us.ibm.com>


On Feb 17, 2004, at 3:01 PM, Hollis Blanchard wrote:
>
> Hi, I just pushed some changes to include/asm-ppc64/hvconsole.h,
> arch/ppc64/kernel/hvconsole.c, and drivers/char/hvc_console.c.
> Unfortunately this is what I see:
> 	drivers/char/hvc_console.c: 2 deltas
> 	BitKeeper/deleted/.del-hvconsole.c: 3 deltas
> 	BitKeeper/deleted/.del-hvconsole.h: 3 deltas
> ... despite the fact that neither of those two files has ever been
> deleted. Of course my changes were not applied to the correct
> (undeleted) files.

The files were deleted at ameslab recently when there was a merge
conflict from upstream, and my tree was from before that deletion, so
bk correctly noted that my diffs were to a deleted version of the file.

I've just pushed the lost changes again, which unfortunately won't be
associated with the not-lost changes in a Changeset.

Who knew merging could be so hazardous...

--
Hollis Blanchard
IBM Linux Technology Center




From nathanl at austin.ibm.com  Wed Feb 18 10:37:11 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Tue, 17 Feb 2004 17:37:11 -0600
Subject: [PATCH] (2.6) put prom.c on a diet
Message-ID: <4032A5A7.4040700@austin.ibm.com>

Hi-

This patch separates the Open Firmware "client" code which runs early in
the boot process from code which is used to access and manipulate the
kernel's copy of the Open Firmware device tree. The former remains in
prom.c; the latter is placed in a new file, of_devtree.c.

I tried to be fairly aggressive about pulling everything out of prom.c
that could conceivably be used during normal system functioning after
the system has booted.

This has been compile-tested and booted on a Power4 lpar.  I believe I've
avoided breaking any builds, please point out anything I've missed.  Feel free
to suggest a different name for the new file.  Also, please help me get the
author credits right :)

Nathan

 arch/ppc64/kernel/Makefile     |    2
 arch/ppc64/kernel/of_devtree.c | 1025 ++++++++++++++++++++++++++++++++++++++++
 arch/ppc64/kernel/prom.c       | 1030 -----------------------------------------
 include/asm-ppc64/prom.h       |   24
 4 files changed, 1049 insertions(+), 1032 deletions(-)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: split_prom.c.patch.bz2
Type: application/x-bzip
Size: 9924 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040217/f2e74a29/attachment.bin 

From anton at samba.org  Wed Feb 18 11:03:54 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 18 Feb 2004 11:03:54 +1100
Subject: [PATCH] (2.6) put prom.c on a diet
In-Reply-To: <4032A5A7.4040700@austin.ibm.com>
References: <4032A5A7.4040700@austin.ibm.com>
Message-ID: <20040218000353.GC22534@krispykreme>


Hi,

> This patch separates the Open Firmware "client" code which runs early in
> the boot process from code which is used to access and manipulate the
> kernel's copy of the Open Firmware device tree. The former remains in
> prom.c; the latter is placed in a new file, of_devtree.c.

Very nice! Assuming no one has pending stuff that touches prom.c I think
we should merge this ASAP.

Anton


From benh at kernel.crashing.org  Wed Feb 18 11:04:06 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 18 Feb 2004 11:04:06 +1100
Subject: [PATCH] (2.6) put prom.c on a diet
In-Reply-To: <4032A5A7.4040700@austin.ibm.com>
References: <4032A5A7.4040700@austin.ibm.com>
Message-ID: <1077062646.1080.66.camel@gaston>


On Wed, 2004-02-18 at 10:37, Nathan Lynch wrote:
> Hi-
>
> This patch separates the Open Firmware "client" code which runs early in
> the boot process from code which is used to access and manipulate the
> kernel's copy of the Open Firmware device tree. The former remains in
> prom.c; the latter is placed in a new file, of_devtree.c.
>
> I tried to be fairly aggressive about pulling everything out of prom.c
> that could conceivably be used during normal system functioning after
> the system has booted.
>
> This has been compile-tested and booted on a Power4 lpar.  I believe I've
> avoided breaking any builds, please point out anything I've missed.  Feel free
> to suggest a different name for the new file.  Also, please help me get the
> author credits right :)

While you are at it, can you unify the function definition "style" ?

That is replace

int
myfunc(...)

with

int myfunc(...)

Whatever the respective merits of each version, the "Linux" way
is the second form ;)

The credits are probably scattered all over the place. I'd name paulus
of course, myself, probably dave engebretsen and peter bergner too...

Ben.



From nathanl at austin.ibm.com  Wed Feb 18 11:59:32 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Tue, 17 Feb 2004 18:59:32 -0600
Subject: [PATCH] (2.6) put prom.c on a diet
In-Reply-To: <1077062646.1080.66.camel@gaston>
References: <4032A5A7.4040700@austin.ibm.com> <1077062646.1080.66.camel@gaston>
Message-ID: <4032B8F4.3090905@austin.ibm.com>

Benjamin Herrenschmidt wrote:
> While you are at it, can you unify the function definition "style" ?

Alright, I've done this.

> The credits are probably scattered all over the place. I'd name paulus
> of course, myself, probably dave engebretsen and peter bergner too...

Ok, I've tried to do this right.

I also updated the patch to the latest available Ameslab code.

Nathan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: split_prom.c.patch.bz2
Type: application/x-bzip
Size: 10339 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040217/3dc0e321/attachment.bin 

From hollisb at us.ibm.com  Thu Feb 19 10:49:23 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Wed, 18 Feb 2004 17:49:23 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <4033DFB7.1020603@ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com> <4033DFB7.1020603@ltcfwd.linux.ibm.com>
Message-ID: <13C60E86-626D-11D8-ACBB-000A95A0560C@us.ibm.com>


On Feb 18, 2004, at 3:57 PM, Linda Xie wrote:
>
> Ben, please let me know your plan for pushing Nathan's split_prom
> patch, so I can recut the OFDT_update patch against of_devtree.c.

of_devtree.c is already present in ameslab linux-2.5.

--
Hollis Blanchard
IBM Linux Technology Center



From benh at kernel.crashing.org  Thu Feb 19 13:27:40 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 19 Feb 2004 13:27:40 +1100
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <4033DFB7.1020603@ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
	 <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com>
	 <4033DFB7.1020603@ltcfwd.linux.ibm.com>
Message-ID: <1077157659.20788.190.camel@gaston>


On Thu, 2004-02-19 at 08:57, Linda Xie wrote:
> Hi,
>
> Attached is the updated OFDT_update patch.
>
> Comments are welcome.
>
>
> Ben,
> Please let me know your plan for pushing Nathan's split_prom patch,
> so I can recut the OFDT_update patch against of_devtree.c.

Well... somebody pushed it already so ...

Ben.



From hollisb at us.ibm.com  Fri Feb 20 03:34:12 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Thu, 19 Feb 2004 10:34:12 -0600
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <4034DEA9.4070004@ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com> <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com> <4033DFB7.1020603@ltcfwd.linux.ibm.com> <13C60E86-626D-11D8-ACBB-000A95A0560C@us.ibm.com> <4034DEA9.4070004@ltcfwd.linux.ibm.com>
Message-ID: <730B4500-62F9-11D8-BBC4-000A95A0560C@us.ibm.com>


On Feb 19, 2004, at 10:04 AM, Linda Xie wrote:
> Hollis Blanchard wrote:
>>
>> of_devtree.c is already present in ameslab linux-2.5.
>
> The one in ameslab that you are talking about is of_device.c, am I right?

Yup, my mistake.

--
Hollis Blanchard
IBM Linux Technology Center



From stefan at nocrew.org  Fri Feb 20 08:12:51 2004
From: stefan at nocrew.org (Stefan Berndtsson)
Date: Thu, 19 Feb 2004 22:12:51 +0100
Subject: People with ATAPI problems: possible fix
In-Reply-To: <1076813298.853.2.camel@gaston> (Benjamin Herrenschmidt's
 message of "Sun, 15 Feb 2004 13:48:18 +1100")
References: <1076813298.853.2.camel@gaston>
Message-ID: <8765e2srho.fsf@hades.nocrew.org>


Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> Hi !
>
> Recently, there have been lots of reports of ATAPI issues, typically
> with CD/DVD burners, where DMA would be disabled for no obvious reason,
> with the driver spitting a message about a timeout waiting for dbdma
> command stop.
>
> The problem is that our driver doesn't deal with buffer underruns due
> to the drive not sending as much data as requested.
>
> This patch is an attempt at fixing this. Please let me know if it
> helps.

First impression is that it seems to work. It no longer claims to kill DMA
when running growisofs, and it is still active when writing is done.

Thanks.

/Stefan


From benh at kernel.crashing.org  Fri Feb 20 08:49:57 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 20 Feb 2004 08:49:57 +1100
Subject: [PATCH] support for Hotplug/DLPAR PCI multifunction cards
In-Reply-To: <4034DEA9.4070004@ltcfwd.linux.ibm.com>
References: <402CFEC5.5090802@us.ltcfwd.linux.ibm.com>
	 <44CE5954-5E48-11D8-B629-000A95A0560C@us.ibm.com>
	 <4033DFB7.1020603@ltcfwd.linux.ibm.com>
	 <13C60E86-626D-11D8-ACBB-000A95A0560C@us.ibm.com>
	 <4034DEA9.4070004@ltcfwd.linux.ibm.com>
Message-ID: <1077227397.20789.554.camel@gaston>


> of_devtree.c is NOT in ameslab linux-2.5.  Nathan's split_prom.c.patch
> (posted on 2/17) generates of_devtree.c (a new file) by pulling
> everything out of prom.c that could conceivably be used after the
> system has booted. I checked the ameslab tree this morning, and his
> patch is not there yet.
>
> The one in ameslab that you are talking about is of_device.c, am I right?

Oh right, somebody told me it was and I didn't double check.

Anton, do you want Nathan's patch in now?

Ben.



From benh at kernel.crashing.org  Fri Feb 20 09:05:50 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 20 Feb 2004 09:05:50 +1100
Subject: People with ATAPI problems: possible fix
In-Reply-To: <8765e2srho.fsf@hades.nocrew.org>
References: <1076813298.853.2.camel@gaston>
	 <8765e2srho.fsf@hades.nocrew.org>
Message-ID: <1077228350.20781.587.camel@gaston>


> First impression is that it seems to work. It no longer claims to kill DMA
> when running growisofs, and it is still active when writing is done.

And is the actual data on the written media OK?

Ben.



From stefan at nocrew.org  Fri Feb 20 10:21:02 2004
From: stefan at nocrew.org (Stefan Berndtsson)
Date: Fri, 20 Feb 2004 00:21:02 +0100
Subject: People with ATAPI problems: possible fix
In-Reply-To: <1077228350.20781.587.camel@gaston> (Benjamin Herrenschmidt's
 message of "Fri, 20 Feb 2004 09:05:50 +1100")
References: <1076813298.853.2.camel@gaston> <8765e2srho.fsf@hades.nocrew.org>
	<1077228350.20781.587.camel@gaston>
Message-ID: <87y8qyr6zl.fsf@hades.nocrew.org>


Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

>> First impression is that it seems to work. It no longer claims to kill DMA
>> when running growisofs, and it is still active when writing is done.
>
> And is the actual data on the written media OK?

At least the parts of the movie I looked at worked in my dvd player,
so I'd say it is.

/Stefan



From rsa at us.ibm.com  Tue Feb 24 02:00:31 2004
From: rsa at us.ibm.com (Ryan Arnold)
Date: 23 Feb 2004 09:00:31 -0600
Subject: New driver (hvcs) review request
Message-ID: <1077548434.933.15.camel@SigurRos.rchland.ibm.com>


Greetings,

I have a new driver that I'd like some comments on.  The following is
the driver description from the source file hvcs.c.

 * This is the device driver for the IBM Hypervisor Virtual Console
 * Server, "hvcs".  The IBM hvcs provides a TTY interface to allow
 * Linux user space applications access to the system consoles of
 * partitioned RPA supported operating systems (Linux and AIX)
 * running on the same partitioned IBM POWER architecture eServer.
 * Physical hardware consoles per partition do not exist on these
 * platforms and system consoles are interacted with through
 * hypervisor interfaces utilized by this driver.

I've included with this email a link to the patch that I plan on
checking in to the Ameslab 2.6 linux trees as soon as the community
feels it is in decent enough shape to do so.  I would appreciate any and
all comments that could be given regarding this patch because I also
plan on submitting it to the LKML soon.

This patch will probably not apply cleanly to the current Ameslab
source, because it may need to be reconciled with some other recent
patches.

http://www-124.ibm.com/linux/patches/?patch_id=1377

For those interested, I would appreciate some comments on the
questions/concerns I have outlined in the "TODO" section of driver
comments.

This is a basic working driver that has not had extensive testing done
against it.

Thanks,

Ryan S. Arnold
<rsa at us.ibm.com>
IBM Linux Technology Center
IBM Rochester, MN.



From linas at austin.ibm.com  Tue Feb 24 06:46:39 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Mon, 23 Feb 2004 13:46:39 -0600
Subject: KDB in ameslab
In-Reply-To: <20040217043527.GC25491@krispykreme>; from anton@samba.org on Tue, Feb 17, 2004 at 03:35:27PM +1100
References: <20040217043527.GC25491@krispykreme>
Message-ID: <20040223134639.A74832@forte.austin.ibm.com>


On Tue, Feb 17, 2004 at 03:35:27PM +1100, Anton Blanchard wrote:
>
> Hi,
>
> Linus had a nasty problem debugging what should have been a simple
> problem but became a nightmare due to an incorrect debug hook. In
> response to this I have cleaned up our debug hooks, it should be much
> harder to screw up.
>
> A side effect of this is that KDB is probably broken. I started looking
> into fixing it however I noticed it looks out of date. Does someone have
> the urge to update it?

I don't have the urge to update it but I'm motivated (today) to fix it
up enough to work.  Do you want a patch on the mailing list, or just
a bk push?

--linas


From linas at austin.ibm.com  Tue Feb 24 08:21:01 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Mon, 23 Feb 2004 15:21:01 -0600
Subject: [PATCH] Re: KDB in ameslab
In-Reply-To: <20040223134639.A74832@forte.austin.ibm.com>; from linas@austin.ibm.com on Mon, Feb 23, 2004 at 01:46:39PM -0600
References: <20040217043527.GC25491@krispykreme> <20040223134639.A74832@forte.austin.ibm.com>
Message-ID: <20040223152101.B74832@forte.austin.ibm.com>

Hi,


On Mon, Feb 23, 2004 at 01:46:39PM -0600, linas at austin.ibm.com wrote:
>
> On Tue, Feb 17, 2004 at 03:35:27PM +1100, Anton Blanchard wrote:
> >
> > Hi,
> >
> > Linus had a nasty problem debugging what should have been a simple
> > problem but became a nightmare due to an incorrect debug hook. In
> > response to this I have cleaned up our debug hooks, it should be much
> > harder to screw up.
> >
> > A side effect of this is that KDB is probably broken. I started looking
> > into fixing it however I noticed it looks out of date. Does someone have
> > the urge to update it?
>
> I don't have the urge to update it but I'm motivated (today) to fix it
> up enough to work.  Do you want a patch on the mailing list, or just
> a bk push?


The attachments below are the only changes I needed to
make to get kdb to compile & run with today's (2.6.3)
ameslab bk tree.

The file include/linux/dis-asm.h seems to be missing from
the ameslab tree, I'm not sure why.  I append it below.
It's a copy of this file from the 2.4 trees (which seem
to be identical).

Please apply this patch & file to ameslab.

--linas

p.s. where is the "newest" KDB ?  I found
ftp://oss.sgi.com/projects/kdb/download/v4.3
but googling KDB is such a sad experience that I'm not
convinced that something newer isn't hiding somewhere.


-------------- next part --------------



===== kdbasupport.c 1.5 vs edited =====
--- 1.5/arch/ppc64/kdb/kdbasupport.c	Thu Jan 22 00:11:58 2004
+++ edited/kdbasupport.c	Mon Feb 23 13:49:24 2004
@@ -732,7 +732,6 @@
 	asm volatile("sync; isync");
 }

-extern void (*debugger_fault_handler)(struct pt_regs *);
 extern void longjmp(u_int *, int);

 unsigned long
@@ -2028,12 +2027,12 @@
 	kdb_map_scc();		/* map sysrq key */
 #endif

-	debugger = kdb_debugger;
-	debugger_bpt = kdb_debugger_bpt;
-	debugger_sstep = kdb_debugger_sstep;
-	debugger_iabr_match = kdb_debugger_iabr_match;
-	debugger_dabr_match = kdb_debugger_dabr_match;
-	debugger_fault_handler = NULL; /* this guy is normally off. */
+	__debugger = kdb_debugger;
+	__debugger_bpt = kdb_debugger_bpt;
+	__debugger_sstep = kdb_debugger_sstep;
+	__debugger_iabr_match = kdb_debugger_iabr_match;
+	__debugger_dabr_match = kdb_debugger_dabr_match;
+	__debugger_fault_handler = NULL; /* this guy is normally off. */
 				    /* = kdb_debugger_fault_handler; */

 	kdba_enable_lbr();
-------------- next part --------------
/* Interface between the opcode library and its callers.
   Written by Cygnus Support, 1993.

   The opcode library (libopcodes.a) provides instruction decoders for
   a large variety of instruction sets, callable with an identical
   interface, for making instruction-processing programs more independent
   of the instruction set being processed.  */

/* Hacked by Scott Lurndal at SGI (02/1999) for linux kernel debugger */
/* Upgraded to cygnus CVS Keith Owens <kaos at sgi.com> 30 Oct 2000 */

#ifndef DIS_ASM_H
#define DIS_ASM_H

#ifdef __cplusplus
extern "C" {
#endif

	/*
	 * Misc definitions
	 */
#ifndef PARAMS
#define PARAMS(x)	x
#endif
#define PTR void *
#define FILE int
#if !defined(NULL)
#define NULL 0
#endif

#define abort()		dis_abort(__LINE__)

static inline void
dis_abort(int line)
{
	panic("Aborting disassembler @ line %d\n", line);
}

#include <linux/slab.h>
#include <asm/page.h>
#define xstrdup(string) ({ char *res = kdb_strdup(string, GFP_ATOMIC); if (!res) BUG(); res; })
#define xmalloc(size) ({ void *res = kmalloc(size, GFP_ATOMIC); if (!res) BUG(); res; })
#define free(address) kfree(address)

#include <bfd.h>

typedef int (*fprintf_ftype) PARAMS((PTR, const char*, ...));

enum dis_insn_type {
  dis_noninsn,			/* Not a valid instruction */
  dis_nonbranch,		/* Not a branch instruction */
  dis_branch,			/* Unconditional branch */
  dis_condbranch,		/* Conditional branch */
  dis_jsr,			/* Jump to subroutine */
  dis_condjsr,			/* Conditional jump to subroutine */
  dis_dref,			/* Data reference instruction */
  dis_dref2			/* Two data references in instruction */
};

/* This struct is passed into the instruction decoding routine,
   and is passed back out into each callback.  The various fields are used
   for conveying information from your main routine into your callbacks,
   for passing information into the instruction decoders (such as the
   addresses of the callback functions), or for passing information
   back from the instruction decoders to their callers.

   It must be initialized before it is first passed; this can be done
   by hand, or using one of the initialization macros below.  */

typedef struct disassemble_info {
  fprintf_ftype fprintf_func;
  fprintf_ftype fprintf_dummy;
  PTR stream;
  PTR application_data;

  /* Target description.  We could replace this with a pointer to the bfd,
     but that would require one.  There currently isn't any such requirement
     so to avoid introducing one we record these explicitly.  */
  /* The bfd_flavour.  This can be bfd_target_unknown_flavour.  */
  enum bfd_flavour flavour;
  /* The bfd_arch value.  */
  enum bfd_architecture arch;
  /* The bfd_mach value.  */
  unsigned long mach;
  /* Endianness (for bi-endian cpus).  Mono-endian cpus can ignore this.  */
  enum bfd_endian endian;

  /* An array of pointers to symbols either at the location being disassembled
     or at the start of the function being disassembled.  The array is sorted
     so that the first symbol is intended to be the one used.  The others are
     present for any misc. purposes.  This is not set reliably, but if it is
     not NULL, it is correct.  */
  asymbol **symbols;
  /* Number of symbols in array.  */
  int num_symbols;

  /* For use by the disassembler.
     The top 16 bits are reserved for public use (and are documented here).
     The bottom 16 bits are for the internal use of the disassembler.  */
  unsigned long flags;
#define INSN_HAS_RELOC	0x80000000
  PTR private_data;

  /* Function used to get bytes to disassemble.  MEMADDR is the
     address of the stuff to be disassembled, MYADDR is the address to
     put the bytes in, and LENGTH is the number of bytes to read.
     INFO is a pointer to this struct.
     Returns an errno value or 0 for success.  */
  int (*read_memory_func)
    PARAMS ((bfd_vma memaddr, bfd_byte *myaddr, unsigned int length,
	     struct disassemble_info *info));

  /* Function which should be called if we get an error that we can't
     recover from.  STATUS is the errno value from read_memory_func and
     MEMADDR is the address that we were trying to read.  INFO is a
     pointer to this struct.  */
  void (*memory_error_func)
    PARAMS ((int status, bfd_vma memaddr, struct disassemble_info *info));

  /* Function called to print ADDR.  */
  void (*print_address_func)
    PARAMS ((bfd_vma addr, struct disassemble_info *info));

  /* Function called to determine if there is a symbol at the given ADDR.
     If there is, the function returns 1, otherwise it returns 0.
     This is used by ports which support an overlay manager where
     the overlay number is held in the top part of an address.  In
     some circumstances we want to include the overlay number in the
     address, (normally because there is a symbol associated with
     that address), but sometimes we want to mask out the overlay bits.  */
  int (* symbol_at_address_func)
    PARAMS ((bfd_vma addr, struct disassemble_info * info));

  /* These are for buffer_read_memory.  */
  bfd_byte *buffer;
  bfd_vma buffer_vma;
  unsigned int buffer_length;

  /* This variable may be set by the instruction decoder.  It suggests
      the number of bytes objdump should display on a single line.  If
      the instruction decoder sets this, it should always set it to
      the same value in order to get reasonable looking output.  */
  int bytes_per_line;

  /* the next two variables control the way objdump displays the raw data */
  /* For example, if bytes_per_line is 8 and bytes_per_chunk is 4, the */
  /* output will look like this:
     00:   00000000 00000000
     with the chunks displayed according to "display_endian". */
  int bytes_per_chunk;
  enum bfd_endian display_endian;

  /* Number of octets per incremented target address
     Normally one, but some DSPs have byte sizes of 16 or 32 bits
   */
  unsigned int octets_per_byte;

  /* Results from instruction decoders.  Not all decoders yet support
     this information.  This info is set each time an instruction is
     decoded, and is only valid for the last such instruction.

     To determine whether this decoder supports this information, set
     insn_info_valid to 0, decode an instruction, then check it.  */

  char insn_info_valid;		/* Branch info has been set. */
  char branch_delay_insns;	/* How many sequential insn's will run before
				   a branch takes effect.  (0 = normal) */
  char data_size;		/* Size of data reference in insn, in bytes */
  enum dis_insn_type insn_type;	/* Type of instruction */
  bfd_vma target;		/* Target address of branch or dref, if known;
				   zero if unknown.  */
  bfd_vma target2;		/* Second target address for dref2 */

  /* Command line options specific to the target disassembler.  */
  char * disassembler_options;

} disassemble_info;


/* Standard disassemblers.  Disassemble one instruction at the given
   target address.  Return number of bytes processed.  */
typedef int (*disassembler_ftype)
     PARAMS((bfd_vma, disassemble_info *));

extern int print_insn_big_mips		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_little_mips	PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_i386_att		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_i386_intel	PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_ia64		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_i370		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_m68hc11		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_m68hc12		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_m68k		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_z8001		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_z8002		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_h8300		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_h8300h		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_h8300s		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_h8500		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_alpha		PARAMS ((bfd_vma, disassemble_info*));
extern disassembler_ftype arc_get_disassembler PARAMS ((int, int));
extern int print_insn_big_arm		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_little_arm	PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_sparc		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_big_a29k		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_little_a29k	PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_i860		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_i960		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_sh		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_shl		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_hppa		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_fr30		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_m32r		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_m88k		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_mcore		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_mn10200		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_mn10300		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_ns32k		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_big_powerpc	PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_little_powerpc	PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_rs6000		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_w65		PARAMS ((bfd_vma, disassemble_info*));
extern disassembler_ftype cris_get_disassembler PARAMS ((bfd *));
extern int print_insn_d10v		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_d30v		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_v850		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_tic30		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_vax		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_tic54x		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_tic80		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_pj		PARAMS ((bfd_vma, disassemble_info*));
extern int print_insn_avr		PARAMS ((bfd_vma, disassemble_info*));

extern void print_arm_disassembler_options PARAMS ((FILE *));
extern void parse_arm_disassembler_option  PARAMS ((char *));
extern int  get_arm_regname_num_options    PARAMS ((void));
extern int  set_arm_regname_option         PARAMS ((int));
extern int  get_arm_regnames               PARAMS ((int, const char **, const char **, const char ***));

/* Fetch the disassembler for a given BFD, if that support is available.  */
extern disassembler_ftype disassembler	PARAMS ((bfd *));

/* Document any target specific options available from the disassembler.  */
extern void disassembler_usage          PARAMS ((FILE *));


/* This block of definitions is for particular callers who read instructions
   into a buffer before calling the instruction decoder.  */

/* Here is a function which callers may wish to use for read_memory_func.
   It gets bytes from a buffer.  */
extern int buffer_read_memory
  PARAMS ((bfd_vma, bfd_byte *, unsigned int, struct disassemble_info *));

/* This function goes with buffer_read_memory.
   It prints a message using info->fprintf_func and info->stream.  */
extern void perror_memory PARAMS ((int, bfd_vma, struct disassemble_info *));


/* Just print the address in hex.  This is included for completeness even
   though both GDB and objdump provide their own (to print symbolic
   addresses).  */
extern void generic_print_address
  PARAMS ((bfd_vma, struct disassemble_info *));

/* Always true.  */
extern int generic_symbol_at_address
  PARAMS ((bfd_vma, struct disassemble_info *));

/* Macro to initialize a disassemble_info struct.  This should be called
   by all applications creating such a struct.  */
#define INIT_DISASSEMBLE_INFO(INFO, STREAM, FPRINTF_FUNC) \
  (INFO).flavour = bfd_target_unknown_flavour, \
  (INFO).arch = bfd_arch_unknown, \
  (INFO).mach = 0, \
  (INFO).endian = BFD_ENDIAN_UNKNOWN, \
  (INFO).octets_per_byte = 1, \
  INIT_DISASSEMBLE_INFO_NO_ARCH(INFO, STREAM, FPRINTF_FUNC)

/* Call this macro to initialize only the internal variables for the
   disassembler.  Architecture dependent things such as byte order, or machine
   variant are not touched by this macro.  This makes things much easier for
   GDB which must initialize these things separately.  */

#define INIT_DISASSEMBLE_INFO_NO_ARCH(INFO, STREAM, FPRINTF_FUNC) \
  (INFO).fprintf_func = (fprintf_ftype)(FPRINTF_FUNC), \
  (INFO).stream = (PTR)(STREAM), \
  (INFO).symbols = NULL, \
  (INFO).num_symbols = 0, \
  (INFO).buffer = NULL, \
  (INFO).buffer_vma = 0, \
  (INFO).buffer_length = 0, \
  (INFO).read_memory_func = buffer_read_memory, \
  (INFO).memory_error_func = perror_memory, \
  (INFO).print_address_func = generic_print_address, \
  (INFO).symbol_at_address_func = generic_symbol_at_address, \
  (INFO).flags = 0, \
  (INFO).bytes_per_line = 0, \
  (INFO).bytes_per_chunk = 0, \
  (INFO).display_endian = BFD_ENDIAN_UNKNOWN, \
  (INFO).insn_info_valid = 0

#ifdef __cplusplus
};
#endif

#endif /* ! defined (DIS_ASM_H) */
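
For anyone unfamiliar with the header above, here is a minimal user-space
sketch of how the buffer fields and the read_memory_func callback in
struct disassemble_info fit together. It is not the libopcodes code: the
demo_* types and functions are hypothetical stand-ins (the real types come
from bfd.h), and the struct is cut down to the fields the sketch needs.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-ins for bfd_vma and bfd_byte from <bfd.h>. */
typedef unsigned long demo_vma;
typedef unsigned char demo_byte;

/* Cut-down, hypothetical version of struct disassemble_info: only the
   buffer fields and the read callback are kept. */
struct demo_info {
	demo_byte *buffer;          /* bytes to disassemble */
	demo_vma buffer_vma;        /* target address of buffer[0] */
	unsigned int buffer_length; /* number of valid bytes in buffer */
	int (*read_memory_func)(demo_vma memaddr, demo_byte *myaddr,
				unsigned int length, struct demo_info *info);
};

/* Copy LENGTH bytes starting at target address MEMADDR into MYADDR.
   Returns 0 on success, or EIO if the requested range is not fully
   inside the buffer. */
static int demo_buffer_read_memory(demo_vma memaddr, demo_byte *myaddr,
				   unsigned int length, struct demo_info *info)
{
	demo_vma offset;

	if (memaddr < info->buffer_vma)
		return EIO;
	offset = memaddr - info->buffer_vma;
	if (offset > info->buffer_length ||
	    length > info->buffer_length - offset)
		return EIO;
	memcpy(myaddr, info->buffer + offset, length);
	return 0;
}

/* Fetch one big-endian 32-bit instruction word through the callback,
   the way a PowerPC decoder would before decoding it.
   Returns 0 on success or the errno value from the callback. */
static int demo_fetch_insn(demo_vma addr, struct demo_info *info,
			   unsigned long *insn)
{
	demo_byte raw[4];
	int status;

	assert(info != NULL && info->read_memory_func != NULL);
	status = info->read_memory_func(addr, raw, sizeof(raw), info);
	if (status != 0)
		return status;
	*insn = ((unsigned long)raw[0] << 24) | ((unsigned long)raw[1] << 16) |
		((unsigned long)raw[2] << 8) | (unsigned long)raw[3];
	return 0;
}
```

The point of the indirection is that the decoder never touches memory
directly, so a kernel debugger can substitute its own safe reader for the
buffer-backed one.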

From olof at austin.ibm.com  Tue Feb 24 08:27:21 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 23 Feb 2004 15:27:21 -0600
Subject: [PATCH] Re: KDB in ameslab
In-Reply-To: <20040223152101.B74832@forte.austin.ibm.com>
References: <20040217043527.GC25491@krispykreme> <20040223134639.A74832@forte.austin.ibm.com> <20040223152101.B74832@forte.austin.ibm.com>
Message-ID: <403A7039.6060605@austin.ibm.com>


linas at austin.ibm.com wrote:

> p.s. where is the "newest" KDB ?  I found
> ftp://oss.sgi.com/projects/kdb/download/v4.3
> but googling KDB is such a sad experience that I'm not
> convinced that something newer isn't hiding somewhere.

4.3 seems to be the current version. Keith Owens recently announced a port
of it to kernel 2.6.3.


-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM


From anton at samba.org  Tue Feb 24 08:43:17 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 24 Feb 2004 08:43:17 +1100
Subject: KDB in ameslab
In-Reply-To: <20040223134639.A74832@forte.austin.ibm.com>
References: <20040217043527.GC25491@krispykreme> <20040223134639.A74832@forte.austin.ibm.com>
Message-ID: <20040223214317.GU5801@krispykreme>


> I don't have the urge to update it but I'm motivated (today) to fix it
> up enough to work.  Do you want a patch on the mailing list, or just
> a bk push?

I'm OK with a push; it probably just requires fixes to the code that
initialises the debugger hooks.

The xmon stuff can now be enabled/disabled at runtime (at this stage you
have to use sysrq to do it); it would be nice if kdb could be too. That
would mean we could have both debuggers compiled in and enable either
at runtime.

With a bit of effort xmon could become a module too.
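
To make the runtime switching idea concrete, here is a rough user-space
sketch, assuming the hooks are plain function pointers that are NULL when
no debugger is attached. Everything named demo_* is hypothetical and only
for illustration; the kernel hooks are the __debugger family seen in the
kdb patch, and the return values 1 and 2 exist only so the two stub
debuggers can be told apart.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pt_regs. */
struct demo_regs {
	unsigned long nip;
};

typedef int (*demo_hook_t)(struct demo_regs *);

/* The hook pointer. NULL means no debugger is currently enabled. */
static demo_hook_t demo_debugger_hook;

/* Wrapper that callers (e.g. a trap handler) use. Because it checks the
   pointer itself, a caller cannot jump through a NULL hook by mistake.
   Returns 0 when no debugger handled the event. */
static int demo_debugger(struct demo_regs *regs)
{
	assert(regs != NULL);
	if (demo_debugger_hook)
		return demo_debugger_hook(regs);
	return 0;
}

/* Two stub debuggers, both compiled in at the same time. */
static int demo_xmon(struct demo_regs *regs)
{
	(void)regs;
	return 1;
}

static int demo_kdb(struct demo_regs *regs)
{
	(void)regs;
	return 2;
}

/* Select a debugger at runtime, or pass NULL to disable debugging. */
static void demo_debugger_select(demo_hook_t hook)
{
	demo_debugger_hook = hook;
}
```

With this shape, enabling xmon or kdb from sysrq would amount to one
pointer assignment per hook.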

Anton


From anton at samba.org  Tue Feb 24 08:56:11 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 24 Feb 2004 08:56:11 +1100
Subject: CONFIG_PAGEALLOC_DEBUG
Message-ID: <20040223215611.GV5801@krispykreme>


Hi,

Here's a first stab at CONFIG_PAGEALLOC_DEBUG. It's a useful debug feature
where you unmap unused pages, catching use-after-free bugs etc.

It only works on pSeries SMP at the moment; we really need to rework how
we do it. The current updateboltedpp hooks aren't good enough because
they only write-protect but still allow reading.

At the moment I just turn off the valid bit and leave the entry there,
but that won't work on LPAR. I think we will have to remove the bolted
entry completely and reinsert it.

You might have to tune how the slab cache interacts (for maximum
coverage you pretty much want all allocations, even small ones, to end up
on their own page, and you don't want any of the slab caches to be
operating).
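
A toy model of the "turn off the valid bit and leave the entry there"
approach, assuming a flat array in place of the real hashed page table.
The toy_* names and the two-bit entry layout are made up for
illustration; only the three behaviours (insert skips bolted slots, find
matches on bolted rather than valid, update flips valid in place) are
taken from the pSeries part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 8

/* Toy entry: a stand-in for the valid and bolted bits of a real HPTE. */
struct toy_hpte {
	unsigned long vpn;
	unsigned int valid  : 1;	/* translation may be used */
	unsigned int bolted : 1;	/* slot is reserved, never reused */
};

static struct toy_hpte toy_htab[TOY_SLOTS];

/* Insert: a slot is free only if it is neither valid nor bolted. */
static long toy_insert(unsigned long vpn, int bolted)
{
	long i;

	for (i = 0; i < TOY_SLOTS; i++) {
		if (!toy_htab[i].valid && !toy_htab[i].bolted) {
			toy_htab[i].vpn = vpn;
			toy_htab[i].valid = 1;
			toy_htab[i].bolted = bolted ? 1 : 0;
			return i;
		}
	}
	return -1;
}

/* Find a bolted entry whether or not it is currently valid. */
static long toy_find_bolted(unsigned long vpn)
{
	long i;

	for (i = 0; i < TOY_SLOTS; i++)
		if (toy_htab[i].bolted && toy_htab[i].vpn == vpn)
			return i;
	return -1;
}

/* Flip the valid bit in place; the entry keeps its slot. */
static void toy_update_valid(unsigned long vpn, int valid)
{
	long slot = toy_find_bolted(vpn);

	assert(slot != -1);	/* plays the role of BUG_ON(slot == -1) */
	toy_htab[slot].valid = valid ? 1 : 0;
}

/* An access "faults" (returns 0) unless a valid entry maps the page. */
static int toy_access_ok(unsigned long vpn)
{
	long i;

	for (i = 0; i < TOY_SLOTS; i++)
		if (toy_htab[i].valid && toy_htab[i].vpn == vpn)
			return 1;
	return 0;
}
```

The point of checking bolted in the insert path is that an invalidated
debug page must not have its slot stolen, otherwise re-enabling it later
would find nothing to flip back.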

Anton

---

 foobar-anton/arch/ppc64/Kconfig               |    8 ++++++++
 foobar-anton/arch/ppc64/kernel/iSeries_htab.c |    7 ++++---
 foobar-anton/arch/ppc64/kernel/idle.c         |   11 +++++++++++
 foobar-anton/arch/ppc64/kernel/pSeries_htab.c |   25 +++++++++++--------------
 foobar-anton/arch/ppc64/kernel/pSeries_lpar.c |   11 ++++-------
 foobar-anton/arch/ppc64/mm/hash_utils.c       |   11 +++++++++++
 foobar-anton/arch/ppc64/mm/init.c             |    2 +-
 foobar-anton/include/asm-ppc64/cacheflush.h   |    5 +++++
 foobar-anton/include/asm-ppc64/cputable.h     |    7 +++++--
 foobar-anton/include/asm-ppc64/machdep.h      |    4 ++--
 mm/slab.c                                     |    0
 11 files changed, 62 insertions(+), 29 deletions(-)

diff -puN arch/ppc64/Kconfig~ppc64-config_pagealloc_debug arch/ppc64/Kconfig
--- foobar/arch/ppc64/Kconfig~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.922539797 +1100
+++ foobar-anton/arch/ppc64/Kconfig	2004-02-21 13:58:15.996534209 +1100
@@ -401,6 +401,14 @@ config DEBUG_SPINLOCK_SLEEP
 	  If you say Y here, various routines which may sleep will become very
 	  noisy if they are called with a spinlock held.

+config DEBUG_PAGEALLOC
+	bool "Page alloc debugging"
+	depends on DEBUG_KERNEL
+	help
+	  Unmap pages from the kernel linear mapping after free_pages().
+	  This results in a large slowdown, but helps to find certain types
+	  of memory corruptions.
+
 endmenu

 source "security/Kconfig"
diff -puN arch/ppc64/kernel/iSeries_htab.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/iSeries_htab.c
--- foobar/arch/ppc64/kernel/iSeries_htab.c~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.928539344 +1100
+++ foobar-anton/arch/ppc64/kernel/iSeries_htab.c	2004-02-21 13:58:15.997534133 +1100
@@ -167,7 +167,7 @@ static long iSeries_hpte_find(unsigned l
  *
  * No need to lock here because we should be the only user.
  */
-static void iSeries_hpte_updateboltedpp(unsigned long newpp, unsigned long ea)
+static void iSeries_hpte_updatevalid(unsigned long valid, unsigned long ea)
 {
 	unsigned long vsid,va,vpn;
 	long slot;
@@ -176,8 +176,9 @@ static void iSeries_hpte_updateboltedpp(
 	va = (vsid << 28) | (ea & 0x0fffffff);
 	vpn = va >> PAGE_SHIFT;
 	slot = iSeries_hpte_find(vpn);
-	if (slot == -1)
-		panic("updateboltedpp: Could not find page to bolt\n");
+	BUG_ON(slot == -1);
+
+	/* XXX FIXME */
 	HvCallHpt_setPp(slot, newpp);
 }

diff -puN arch/ppc64/kernel/idle.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/idle.c
--- foobar/arch/ppc64/kernel/idle.c~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.934538891 +1100
+++ foobar-anton/arch/ppc64/kernel/idle.c	2004-02-21 14:49:51.761292296 +1100
@@ -132,6 +132,17 @@ int default_idle(void)
 {
 	long oldval;

+#if 0
+	struct page *tmp = alloc_pages(GFP_KERNEL, 0);
+	unsigned char *foo =  __va(page_to_pfn(tmp) << PAGE_SHIFT);
+	foo[0] = '1';
+	free_pages(foo, 0);
+
+	printk("use after free: %p\n", foo);
+	printk("%c\n", foo[0]);
+	printk("passed\n");
+#endif
+
 	while (1) {
 		oldval = test_and_clear_thread_flag(TIF_NEED_RESCHED);

diff -puN arch/ppc64/kernel/pSeries_htab.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/pSeries_htab.c
--- foobar/arch/ppc64/kernel/pSeries_htab.c~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.939538513 +1100
+++ foobar-anton/arch/ppc64/kernel/pSeries_htab.c	2004-02-21 15:07:47.212994956 +1100
@@ -60,11 +60,11 @@ long pSeries_hpte_insert(unsigned long h
 	for (i = 0; i < HPTES_PER_GROUP; i++) {
 		dw0 = hptep->dw0.dw0;

-		if (!dw0.v) {
+		if (!dw0.v && !dw0.bolted) {
 			/* retry with lock held */
 			pSeries_lock_hpte(hptep);
 			dw0 = hptep->dw0.dw0;
-			if (!dw0.v)
+			if (!dw0.v && !dw0.bolted)
 				break;
 			pSeries_unlock_hpte(hptep);
 		}
@@ -177,7 +177,7 @@ static long pSeries_hpte_find(unsigned l
 			hptep = htab_data.htab + slot;
 			dw0 = hptep->dw0.dw0;

-			if ((dw0.avpn == (vpn >> 11)) && dw0.v &&
+			if ((dw0.avpn == (vpn >> 11)) && dw0.bolted &&
 			    (dw0.h == j)) {
 				/* HPTE matches */
 				if (j)
@@ -230,14 +230,12 @@ static long pSeries_hpte_updatepp(unsign
 }

 /*
- * Update the page protection bits. Intended to be used to create
- * guard pages for kernel data structures on pages which are bolted
- * in the HPT. Assumes pages being operated on will not be stolen.
- * Does not work on large pages.
+ * Change the valid bit on bolted pages. Used by debugging code such
+ * as CONFIG_PAGEALLOC_DEBUG to cause accesses on certain pages to fault.
  *
- * No need to lock here because we should be the only user.
+ * We assume the caller provides any locking.
  */
-static void pSeries_hpte_updateboltedpp(unsigned long newpp, unsigned long ea)
+static void pSeries_hpte_updatevalid(unsigned long valid, unsigned long ea)
 {
 	unsigned long vsid, va, vpn, flags;
 	long slot;
@@ -248,11 +246,10 @@ static void pSeries_hpte_updateboltedpp(
 	vpn = va >> PAGE_SHIFT;

 	slot = pSeries_hpte_find(vpn);
-	if (slot == -1)
-		panic("could not find page to bolt\n");
-	hptep = htab_data.htab + slot;
+	BUG_ON(slot == -1);

-	set_pp_bit(newpp, hptep);
+	hptep = htab_data.htab + slot;
+	hptep->dw0.dw0.v = valid;

 	/* Ensure it is out of the tlb too */
 	spin_lock_irqsave(&pSeries_tlbie_lock, flags);
@@ -376,7 +373,7 @@ void hpte_init_pSeries(void)

 	ppc_md.hpte_invalidate	= pSeries_hpte_invalidate;
 	ppc_md.hpte_updatepp	= pSeries_hpte_updatepp;
-	ppc_md.hpte_updateboltedpp = pSeries_hpte_updateboltedpp;
+	ppc_md.hpte_updatevalid = pSeries_hpte_updatevalid;
 	ppc_md.hpte_insert	= pSeries_hpte_insert;
 	ppc_md.hpte_remove     	= pSeries_hpte_remove;

diff -puN arch/ppc64/kernel/pSeries_lpar.c~ppc64-config_pagealloc_debug arch/ppc64/kernel/pSeries_lpar.c
--- foobar/arch/ppc64/kernel/pSeries_lpar.c~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.945538060 +1100
+++ foobar-anton/arch/ppc64/kernel/pSeries_lpar.c	2004-02-21 13:58:16.003533680 +1100
@@ -487,8 +487,7 @@ static long pSeries_lpar_hpte_find(unsig
 	return -1;
 }

-static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp,
-					     unsigned long ea)
+static void pSeries_lpar_hpte_updatevalid(unsigned long valid, unsigned long ea)
 {
 	unsigned long lpar_rc;
 	unsigned long vsid, va, vpn, flags;
@@ -499,11 +498,9 @@ static void pSeries_lpar_hpte_updatebolt
 	vpn = va >> PAGE_SHIFT;

 	slot = pSeries_lpar_hpte_find(vpn);
-	if (slot == -1)
-		panic("updateboltedpp: Could not find page to bolt\n");
+	BUG_ON(slot == -1);

-	flags = newpp & 3;
-	lpar_rc = plpar_pte_protect(flags, slot, 0);
+	/* XXX FIXME */

 	if (lpar_rc != H_Success)
 		panic("Bad return code from pte bolted protect rc = %lx\n",
@@ -555,7 +552,7 @@ void pSeries_lpar_mm_init(void)
 {
 	ppc_md.hpte_invalidate	= pSeries_lpar_hpte_invalidate;
 	ppc_md.hpte_updatepp	= pSeries_lpar_hpte_updatepp;
-	ppc_md.hpte_updateboltedpp = pSeries_lpar_hpte_updateboltedpp;
+	ppc_md.hpte_updatevalid = pSeries_lpar_hpte_updatevalid;
 	ppc_md.hpte_insert	= pSeries_lpar_hpte_insert;
 	ppc_md.hpte_remove	= pSeries_lpar_hpte_remove;
 	ppc_md.flush_hash_range	= pSeries_lpar_flush_hash_range;
diff -puN arch/ppc64/mm/hash_utils.c~ppc64-config_pagealloc_debug arch/ppc64/mm/hash_utils.c
--- foobar/arch/ppc64/mm/hash_utils.c~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.951537607 +1100
+++ foobar-anton/arch/ppc64/mm/hash_utils.c	2004-02-21 13:58:16.005533529 +1100
@@ -357,3 +357,14 @@ void __init htab_finish_init(void)
 	make_bl(htab_call_hpte_remove, ppc_md.hpte_remove);
 	make_bl(htab_call_hpte_updatepp, ppc_md.hpte_updatepp);
 }
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+void kernel_map_pages(struct page *page, int numpages, int enable)
+{
+	int i;
+
+	for (i = 0; i < numpages; i++)
+		ppc_md.hpte_updatevalid(enable,
+			(unsigned long)page_address(page) + PAGE_SIZE * i);
+}
+#endif
diff -puN arch/ppc64/mm/init.c~ppc64-config_pagealloc_debug arch/ppc64/mm/init.c
--- foobar/arch/ppc64/mm/init.c~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.958537079 +1100
+++ foobar-anton/arch/ppc64/mm/init.c	2004-02-21 14:51:33.299823789 +1100
@@ -666,7 +666,7 @@ void __init mm_init_ppc64(void)
 	for (index = 0; index < NR_CPUS; index++) {
 		lpaca = &paca[index];
 		guard_page = ((unsigned long)lpaca) + 0x1000;
-		ppc_md.hpte_updateboltedpp(PP_RXRX, guard_page);
+		ppc_md.hpte_updatevalid(0, guard_page);
 	}

 	ppc64_boot_msg(0x100, "MM Init Done");
diff -puN include/asm-ppc64/cacheflush.h~ppc64-config_pagealloc_debug include/asm-ppc64/cacheflush.h
--- foobar/include/asm-ppc64/cacheflush.h~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.963536701 +1100
+++ foobar-anton/include/asm-ppc64/cacheflush.h	2004-02-21 13:58:16.009533227 +1100
@@ -32,4 +32,9 @@ do { memcpy(dst, src, len); \

 extern void __flush_dcache_icache(void *page_va);

+#ifdef CONFIG_DEBUG_PAGEALLOC
+/* internal debugging function */
+void kernel_map_pages(struct page *page, int numpages, int enable);
+#endif
+
 #endif /* _PPC64_CACHEFLUSH_H */
diff -puN include/asm-ppc64/cputable.h~ppc64-config_pagealloc_debug include/asm-ppc64/cputable.h
--- foobar/include/asm-ppc64/cputable.h~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.969536248 +1100
+++ foobar-anton/include/asm-ppc64/cputable.h	2004-02-21 13:58:16.010533152 +1100
@@ -139,8 +139,11 @@ extern firmware_feature_t firmware_featu
                                  CPU_FTR_TLBIEL | CPU_FTR_NOEXECUTE | \
                                  CPU_FTR_NODSISRALIGN)

-/* iSeries doesn't support large pages */
-#ifdef CONFIG_PPC_ISERIES
+/*
+ * iSeries doesn't support large pages and we cant use large pages when
+ * page alloc debug is enabled
+ */
+#if defined(CONFIG_PPC_ISERIES) || defined(CONFIG_DEBUG_PAGEALLOC)
 #define CPU_FTR_PPCAS_ARCH_V2	(CPU_FTR_PPCAS_ARCH_V2_BASE)
 #else
 #define CPU_FTR_PPCAS_ARCH_V2	(CPU_FTR_PPCAS_ARCH_V2_BASE | CPU_FTR_16M_PAGE)
diff -puN include/asm-ppc64/machdep.h~ppc64-config_pagealloc_debug include/asm-ppc64/machdep.h
--- foobar/include/asm-ppc64/machdep.h~ppc64-config_pagealloc_debug	2004-02-21 13:58:15.975535795 +1100
+++ foobar-anton/include/asm-ppc64/machdep.h	2004-02-21 13:58:16.011533076 +1100
@@ -40,8 +40,8 @@ struct machdep_calls {
 					 unsigned long va,
 					 int large,
 					 int local);
-	void            (*hpte_updateboltedpp)(unsigned long newpp,
-					       unsigned long ea);
+	void            (*hpte_updatevalid)(unsigned long valid,
+					    unsigned long ea);
 	long		(*hpte_insert)(unsigned long hpte_group,
 				       unsigned long va,
 				       unsigned long prpn,
diff -puN mm/slab.c~ppc64-config_pagealloc_debug mm/slab.c
diff -puN -L arch/ppc64/mm/ash_utils.c /dev/null /dev/null

_



From olof at austin.ibm.com  Tue Feb 24 11:41:02 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Mon, 23 Feb 2004 18:41:02 -0600 (CST)
Subject: TCE changes pushed...
Message-ID: <Pine.A41.4.44.0402231835230.79562-100000@forte.austin.ibm.com>


I just pushed a big changeset to ameslab, consisting of the TCE rewrite.
While I have tried to build it in all ways imaginable, there's still a
risk I missed a driver that needed changes to build. Also, for those of
you maintaining/developing the VIO drivers, the following changes should
be noted:

* <asm/iSeries/iSeries_dma.h> is no more. Use the global <asm/iommu.h>
instead.

* Likewise, <asm/pci_dma.h> has been renamed to <asm/iommu.h>. If your
driver can't find NO_TCE (or other defines), check your includes.

* There's been a bunch of renames. TceTable is now iommu_table, and the
tce_table structure members have been renamed accordingly.

Most things have a 1-1 mapping, so it's just a matter of figuring out the
renames. Let me know if anything is unclear (or if there are any other build
breaks or strange behaviour).


Thanks,

Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM




From marco.killijian at laas.fr  Tue Feb 24 21:04:11 2004
From: marco.killijian at laas.fr (Marc-Olivier Killijian)
Date: Tue, 24 Feb 2004 11:04:11 +0100
Subject: People with ATAPI problems: possible fix
In-Reply-To: <1076813298.853.2.camel@gaston>
References: <1076813298.853.2.camel@gaston>
Message-ID: <1077617051.1054.15.camel@tsfmok>


Hi,

I used to have the problem described by Branden Robinson in a previous
mail on DebianPPC: a Samsung CDRW/DVD drive that would neither eject
discs nor play them, nor do anything else.
I patched a 2.4.24-ben1 kernel with your code and now the CD features are
working fine. I haven't tested burning CDs or playing DVDs yet.

Cheers,
Marco

On Sun 15/02/2004 at 03:48, Benjamin Herrenschmidt wrote:
>
> Recently, there have been lots of reports of ATAPI issues, typically
> with CD/DVD burners, where DMA would be disabled for no obvious reason,
> with the driver spitting out a message about a timeout waiting for dbdma
> command stop.
>
> The problem is that our driver doesn't deal with buffer underruns due
> to the drive not sending as much data as requested.
>
> This patch is an attempt at fixing this. Please let me know if it
> helps.
[deleted]



From hollisb at us.ibm.com  Wed Feb 25 02:57:19 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 24 Feb 2004 09:57:19 -0600
Subject: [PATCH] rpaphp driver changes for vio and multifunction cards support-please review
In-Reply-To: <403AC039.3020903@ltcfwd.linux.ibm.com>
References: <403AC039.3020903@ltcfwd.linux.ibm.com>
Message-ID: <20697AC2-66E2-11D8-B1E2-000A95A0560C@us.ibm.com>


On Feb 23, 2004, at 9:08 PM, Linda Xie wrote:
> Any comments would be much appreciated.

In register_slot(), what is this code doing?
>         case VIO_DEV:
>             /* create symlink between slot->name and its uni-address */
>             vio_uni_addr = strchr(slot->dn->full_name, '@');
>             if (!vio_uni_addr)
>                 return (1);
>             retval = sysfs_create_link(slot->hotplug_slot->kobj.parent,
>                 &slot->hotplug_slot->kobj,
>                 vio_uni_addr);

Can you show me tree output? I don't know what these sysfs directories
look like. Whatever this symlink is, do we *need* it? If not I'd say
take it out, as userland will start depending on it and then we'll be
stuck with it forever.
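
For reference, a small sketch of what the quoted strchr() call yields,
assuming a device-tree path of the form "/vdevice/vty@30000000" (that
example path is an assumption, not taken from the patch). strchr()
returns a pointer to the '@' itself, so as quoted the link name would
start with '@'. The helper unit_address() below is hypothetical and
shows one way to get only the part after the last '@'.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the text after the last '@' in a device-tree style path,
 * or NULL if there is no '@' or nothing follows it.
 * Hypothetical helper, not part of the rpaphp driver.
 */
static const char *unit_address(const char *full_name)
{
	const char *at = strrchr(full_name, '@');

	if (at == NULL || at[1] == '\0')
		return NULL;
	return at + 1;
}
```

Using strrchr() rather than strchr() is a choice made here so that a
path with an '@' in an earlier component still yields the address of the
final node.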

--
Hollis Blanchard
IBM Linux Technology Center




From hollisb at us.ibm.com  Wed Feb 25 03:15:19 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 24 Feb 2004 10:15:19 -0600
Subject: [PATCH] rpaphp driver changes for vio and multifunction cards support-please review
In-Reply-To: <403AC039.3020903@ltcfwd.linux.ibm.com>
References: <403AC039.3020903@ltcfwd.linux.ibm.com>
Message-ID: <A402B9D6-66E4-11D8-8F7A-000A95A0560C@us.ibm.com>


rpaphp_get_vio_adapter_status() is extern in rpaphp.h but inline in
rpaphp_vio.c. It can't be inlined because it's called from
rpaphp_core.c.

Other functions, like setup_vio_hotplug_slot_info(), are marked inline
(in rpaphp.h), are not static (in rpaphp_vio.c), but aren't used by
anyone else. All such functions (in all these files) should be made
static (and removed from rpaphp.h, since nobody else needs to know
about them).

--
Hollis Blanchard
IBM Linux Technology Center




From olh at suse.de  Wed Feb 25 05:21:20 2004
From: olh at suse.de (Olaf Hering)
Date: Tue, 24 Feb 2004 19:21:20 +0100
Subject: 2.6.3: Debug: sleeping function called from invalid context at include/linux/rwsem.h:43 
Message-ID: <20040224182120.GA4026@suse.de>


This is with the ameslab tree as of Sunday evening, on a 6way p660.

papaya:~ # Debug: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1
Calcpu 2: Vector: 300 (Data Access) at [c0000000864c7980]
    pc: c000000000091dd0 (.kmem_calche_alloc+0x54/0xc0)
    lr: c000000000091e2c (.kmem_cache_alloc+0xb0/0xc0)
    sp: c00 00000864c7c00
   msr: a000000000001032
   dar: 2777d6650
 dsisr: 40000000
  current = T0xc000000021306d80
  paca    = 0xc000000000444000
    pid   = 6280, comm = ld64.so.1
rpress ? for help 2:mon> ace:
[c000000000044718] .do_page_fault+0x10c/0x514
[c00000000000aa94] stab_bolted_user_return+0x118/0x11c
[c000000000091e2c] .kmem_cache_alloc+0xb0/0xc0
[c00000000020ce80] .idr_pre_get+0x58/0x144
[c000000000079644] .sys_timer_create+0x10c/0x624
[c00000000000f014] .ret_from_syscall_1+0x0/0xa4
NETDEV WATCHDOG: eth1: transmit timed out
eth1: transmit timed out, status 06f3, resetting.
sym0:4:0: ABORT operation started.

--
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG



From olof at austin.ibm.com  Wed Feb 25 07:35:11 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 24 Feb 2004 14:35:11 -0600
Subject: [PATCH] xmon command for symbol lookups
Message-ID: <403BB57F.7010200@austin.ibm.com>

One of the most crippling drawbacks (for me) with xmon is that you need
a System.map to look up every address in when you're digging through
disassembly.

The attached patch adds a new command, "n <address>" that will show the
corresponding symbol in the same way that the "t" command does.
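
A minimal sketch of the address-to-symbol formatting, assuming a small
sorted table instead of kallsyms. The table layout and the toy_* names
are illustrative; the two entries are derived from the session output in
this mail (start = address - offset, with the size after the slash).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative symbol record; the kernel's kallsyms data is laid out differently. */
struct toy_sym {
	uint64_t start;
	uint64_t size;
	const char *name;
};

/* Two symbols reconstructed from the xmon session in this mail. */
static const struct toy_sym toy_syms[] = {
	{ 0xc00000000024d58cULL, 0x30, ".__sysrq_unlock_table" },
	{ 0xc00000000024d650ULL, 0x94, ".handle_sysrq" },
};

/*
 * Format addr as "name+0xoffset/0xsize" into buf.
 * Returns 0 if a containing symbol was found, -1 otherwise.
 */
static int toy_lookup(uint64_t addr, char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(toy_syms) / sizeof(toy_syms[0]); i++) {
		const struct toy_sym *s = &toy_syms[i];

		if (addr >= s->start && addr - s->start < s->size) {
			snprintf(buf, len, "%s+0x%llx/0x%llx", s->name,
				 (unsigned long long)(addr - s->start),
				 (unsigned long long)s->size);
			return 0;
		}
	}
	return -1;
}
```

A linear scan is enough for a sketch; with a table sorted by start
address a binary search would be the obvious refinement.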

It also fixes another thing that's irritated me before: xmon doesn't
accept values in 0x<hex> form, only bare <hex>.
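
A sketch of hex parsing with an optional "0x" prefix, working on a
string rather than on xmon's character-at-a-time input. parse_hex and
hex_value are hypothetical names. One edge worth handling in any
prefix-skipping parser is a lone "0": if the leading zero is consumed
before checking for 'x', the value zero can end up being rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse a hex number with an optional 0x/0X prefix.
 * Returns the number of characters consumed (0 if there is no number)
 * and stores the value in *out. Overflow is not detected.
 */
static size_t parse_hex(const char *s, uint64_t *out)
{
	const char *p = s;
	uint64_t v = 0;
	int d;

	/* Skip "0x" only when a digit follows, so "0" and "0x" parse as zero. */
	if (p[0] == '0' && (p[1] == 'x' || p[1] == 'X') &&
	    hex_value((unsigned char)p[2]) >= 0)
		p += 2;

	if (hex_value((unsigned char)*p) < 0)
		return 0;

	while ((d = hex_value((unsigned char)*p)) >= 0) {
		v = (v << 4) | (uint64_t)d;
		p++;
	}
	*out = v;
	return (size_t)(p - s);
}
```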


3:mon> t
c00000007e4978b0  c00000000004d894  .xmon+0x15c/0x358
c00000007e497a90  c00000000004d2e0  .sysrq_handle_xmon+0x5c/0x64
c00000007e497b20  c00000000024d7c4  .__handle_sysrq_nolock+0xe0/0x184
c00000007e497bd0  c00000000024d6c0  .handle_sysrq+0x70/0x94
c00000007e497c60  c000000000102dd0  .write_sysrq_trigger+0x88/0xac
c00000007e497cf0  c0000000000b4464  .vfs_write+0x10c/0x164
c00000007e497d90  c0000000000b45a0  .sys_write+0x50/0x94
c00000007e497e30  c0000000000119bc  ret_from_syscall_1
exception: c00 (System Call) regs c00000007e497ea0
                   000000000ff27b2c
<Stack drops into userspace 00000000ffffe780>
3:mon> di c00000000024d6c0
c00000000024d6c0  60000000      nop
c00000000024d6c4  38210090      addi    r1,r1,144
c00000000024d6c8  e8010010      ld      r0,16(r1)
c00000000024d6cc  ebe1fff8      ld      r31,-8(r1)
c00000000024d6d0  eb81ffe0      ld      r28,-32(r1)
c00000000024d6d4  eba1ffe8      ld      r29,-24(r1)
c00000000024d6d8  ebc1fff0      ld      r30,-16(r1)
c00000000024d6dc  7c0803a6      mtlr    r0
c00000000024d6e0  4bfffeac      b       0xc00000000024d58c
c00000000024d6e4  fbc1fff0      std     r30,-16(r1)
c00000000024d6e8  ebc2c5d8      ld      r30,-14888(r2)
c00000000024d6ec  7c0802a6      mflr    r0
c00000000024d6f0  fb41ffd0      std     r26,-48(r1)
c00000000024d6f4  fb61ffd8      std     r27,-40(r1)
c00000000024d6f8  7cba2b78      mr      r26,r5
c00000000024d6fc  7c9b2378      mr      r27,r4
3:mon> n 0xc00000000024d58c
c00000000024d58c: .__sysrq_unlock_table+0x0/0x30
3:mon>


Another enhancement would be to simply print the symbol as a comment
behind the address when doing disasm, but that would make the output
wider than 80 characters. Does anyone have problems with that or should
I add that as well?


Thanks,

Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: xmon-n-cmd
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040224/0d049fe1/attachment.txt 

From hollisb at us.ibm.com  Wed Feb 25 09:31:13 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 24 Feb 2004 16:31:13 -0600
Subject: [PATCH] rpadlpar changes for DLPAR VIO devices -- please review
In-Reply-To: <403BC5DA.4010705@ltcfwd.linux.ibm.com>
References: <403BC5DA.4010705@ltcfwd.linux.ibm.com>
Message-ID: <270FFA96-6719-11D8-8F7A-000A95A0560C@us.ibm.com>


On Feb 24, 2004, at 3:44 PM, Linda Xie wrote:

> +	if (strstr(drc_name, "-V"))
> +		dn = find_php_slot_vio_node(drc_name);
> +	else
> +		dn = find_php_slot_pci_node(drc_name);

I'm not sure this is safe as a canonical test. Maybe
sprintf("%s-V%i-D%i", ...)? Is this namespace defined and documented
somewhere? I just don't feel comfortable with a two-character test
determining it one way or the other.
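
A sketch of a stricter check, assuming (and this is only an assumption,
since the namespace is exactly what is in question) that a VIO name ends
in "-V<digits>-D<digits>". The function name and the sample names in the
note below are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Skip one or more decimal digits; return NULL if there are none. */
static const char *skip_digits(const char *p)
{
	const char *start = p;

	while (isdigit((unsigned char)*p))
		p++;
	return p == start ? NULL : p;
}

/*
 * Return 1 if name ends in "-V<digits>-D<digits>", 0 otherwise.
 * The format is an assumed one, not a documented naming rule.
 */
static int looks_like_vio_drc(const char *name)
{
	const char *p = name;

	while ((p = strstr(p, "-V")) != NULL) {
		const char *q = skip_digits(p + 2);

		if (q != NULL && q[0] == '-' && q[1] == 'D') {
			q = skip_digits(q + 2);
			if (q != NULL && *q == '\0')
				return 1;
		}
		p += 2;	/* keep looking for a later "-V" */
	}
	return 0;
}
```

Under that assumption a name such as "U0.1-P1-VGA" contains "-V" and
would pass the two-character test, but is rejected here.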

--
Hollis Blanchard
IBM Linux Technology Center




From benh at kernel.crashing.org  Wed Feb 25 09:46:54 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 25 Feb 2004 09:46:54 +1100
Subject: [PATCH] xmon command for symbol lookups
In-Reply-To: <403BB57F.7010200@austin.ibm.com>
References: <403BB57F.7010200@austin.ibm.com>
Message-ID: <1077662814.965.26.camel@gaston>


On Wed, 2004-02-25 at 07:35, Olof Johansson wrote:
> One of the most crippling drawbacks (for me) with xmon is that you need
> a System.map to lookup all addresses in when you're digging through
> disassembly.
>
> The attached patch adds a new command, "n <address>" that will show the
> corresponding symbol in the same way that the "t" command does.
>
> It also fixes another thing that's irritated me before: xmon doesn't
> take values in 0x<hex> form, only <hex>.

Hrm... well... I have this on ppc32 already, but used different
commands: la and ls (lookup address and lookup symbol). I have also
added the ability to use a symbol (preceded by $) anywhere
you can enter a number (like you can use % with a register
name). What about keeping xmon in sync? :)

Another _real cool_ feature is a dmesg in xmon btw :)


> 3:mon> t
> c00000007e4978b0  c00000000004d894  .xmon+0x15c/0x358
> c00000007e497a90  c00000000004d2e0  .sysrq_handle_xmon+0x5c/0x64
> c00000007e497b20  c00000000024d7c4  .__handle_sysrq_nolock+0xe0/0x184
> c00000007e497bd0  c00000000024d6c0  .handle_sysrq+0x70/0x94
> c00000007e497c60  c000000000102dd0  .write_sysrq_trigger+0x88/0xac
> c00000007e497cf0  c0000000000b4464  .vfs_write+0x10c/0x164
> c00000007e497d90  c0000000000b45a0  .sys_write+0x50/0x94
> c00000007e497e30  c0000000000119bc  ret_from_syscall_1
> exception: c00 (System Call) regs c00000007e497ea0
>                    000000000ff27b2c
> <Stack drops into userspace 00000000ffffe780>
> 3:mon> di c00000000024d6c0
> c00000000024d6c0  60000000      nop
> c00000000024d6c4  38210090      addi    r1,r1,144
> c00000000024d6c8  e8010010      ld      r0,16(r1)
> c00000000024d6cc  ebe1fff8      ld      r31,-8(r1)
> c00000000024d6d0  eb81ffe0      ld      r28,-32(r1)
> c00000000024d6d4  eba1ffe8      ld      r29,-24(r1)
> c00000000024d6d8  ebc1fff0      ld      r30,-16(r1)
> c00000000024d6dc  7c0803a6      mtlr    r0
> c00000000024d6e0  4bfffeac      b       0xc00000000024d58c
> c00000000024d6e4  fbc1fff0      std     r30,-16(r1)
> c00000000024d6e8  ebc2c5d8      ld      r30,-14888(r2)
> c00000000024d6ec  7c0802a6      mflr    r0
> c00000000024d6f0  fb41ffd0      std     r26,-48(r1)
> c00000000024d6f4  fb61ffd8      std     r27,-40(r1)
> c00000000024d6f8  7cba2b78      mr      r26,r5
> c00000000024d6fc  7c9b2378      mr      r27,r4
> 3:mon> n 0xc00000000024d58c
> c00000000024d58c: .__sysrq_unlock_table+0x0/0x30
> 3:mon>
>
>
>
>
>
> Another enhancement would be to simply print the symbol as a comment
> behind the address when doing disasm, but that would make the output
> wider than 80 characters. Does anyone have problems with that or should
> I add that as well?
>
>
> Thanks,
>
> Olof
>
> --
> Olof Johansson                                        Office: 4F005/905
> pSeries Linux Development                             IBM Systems Group
> Email: olof at austin.ibm.com                          Phone: 512-838-9858
> All opinions are my own and not those of IBM
>
> ______________________________________________________________________
> ===== arch/ppc64/xmon/xmon.c 1.35 vs edited =====
> --- 1.35/arch/ppc64/xmon/xmon.c	Sun Feb 15 15:23:37 2004
> +++ edited/arch/ppc64/xmon/xmon.c	Tue Feb 24 14:25:34 2004
> @@ -83,6 +83,7 @@
>  static void dump(void);
>  static void prdump(unsigned long, long);
>  static int ppc_inst_dump(unsigned long, long);
> +static void lookupsymbol(void);
>  void print_address(unsigned long);
>  static int getsp(void);
>  static void backtrace(struct pt_regs *);
> @@ -167,6 +168,7 @@
>    ml	locate a block of memory\n\
>    mz	zero a block of memory\n\
>    mi	show information about memory allocation\n\
> +  n     lookup address -> symbol name\n\
>    p 	show the task list\n\
>    r	print registers\n\
>    s	single step\n\
> @@ -537,6 +539,9 @@
>  		case 'd':
>  			dump();
>  			break;
> +		case 'n':
> +			lookupsymbol();
> +			break;
>  		case 'r':
>  			if (excp != NULL)
>  				prregs(excp);	/* print regs */
> @@ -1644,6 +1649,23 @@
>  	printf("0x%lx", addr);
>  }
>
> +
> +void
> +lookupsymbol(void)
> +{
> +	int c;
> +
> +	c = inchar();
> +	if (c == '\n')
> +		termch = c;
> +	scanhex((void *)&adrs);
> +	if( termch != '\n')
> +		termch = 0;
> +	printf("%016lx: ", adrs);
> +	xmon_print_symbol("%s\n", adrs);
> +}
> +
> +
>  /*
>   * Memory operations - move, set, print differences
>   */
> @@ -1820,6 +1842,14 @@
>  		}
>  		printf("invalid register name '%%%s'\n", regname);
>  		return 0;
> +	}
> +
> +	/* skip leading "0x" if any */
> +
> +	if (c == '0') {
> +		c = inchar();
> +		if (c == 'x')
> +			c = inchar();
>  	}
>
>  	d = hexdigit(c);
--
Benjamin Herrenschmidt <benh at kernel.crashing.org>




From rsa at us.ibm.com  Wed Feb 25 09:52:05 2004
From: rsa at us.ibm.com (Ryan Arnold)
Date: 24 Feb 2004 16:52:05 -0600
Subject: hvcs driver (with 80 character columns + revisions) revised :WAS
	Re: New driver (hvcs) review request
In-Reply-To: <1077577919.5940.90.camel@gaston>
References: <1077548434.933.15.camel@SigurRos.rchland.ibm.com> 
	<1077577919.5940.90.camel@gaston>
Message-ID: <1077663127.21201.5.camel@SigurRos.rchland.ibm.com>


On Mon, 2004-02-23 at 17:11, Benjamin Herrenschmidt wrote:
> Hi Ryan. Before somebody dives into the code per-se, could you
> reformat the driver properly ? Normally, linux code is supposed
> to fit in 80 columns. We can accept exceptions, but not a whole
> driver using more than 132 cols :)

For the 22" monitor challenged, here is a revised version of this driver
with no line wider than 80 characters.  Additionally, this version
contains some first-pass revisions that should make the driver
more readable, thanks to Dave Hansen and Ben H.

http://www-124.ibm.com/linux/patches/misc/hvcs-20040224.diff

Thanks
Ryan S. Arnold <rsa at us.ibm.com>
IBM LTC, Rochester MN.




From olof at austin.ibm.com  Wed Feb 25 09:55:03 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 24 Feb 2004 16:55:03 -0600
Subject: [PATCH] xmon command for symbol lookups
In-Reply-To: <1077662814.965.26.camel@gaston>
References: <403BB57F.7010200@austin.ibm.com> <1077662814.965.26.camel@gaston>
Message-ID: <403BD647.5060509@austin.ibm.com>


Benjamin Herrenschmidt wrote:

> Hrm... well... I have this on ppc32 already, but used different
> commands: la and ls (lookup address and lookup symbol), also I
> have added the ability to use a symbol (preceded by the $) every
> time you can enter a number (like you can use % with a register
> name). What about making xmon in sync ? :)

Ah, I was reinventing the wheel! I'll bring that stuff over from ppc32
instead then.

> Another _real cool_ feature is a dmesg in xmon btw :)

Yeah, I can look at that next.

-Olof

--
Olof Johansson                                        Office: 4F005/905
pSeries Linux Development                             IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM



From anton at samba.org  Wed Feb 25 10:05:06 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 25 Feb 2004 10:05:06 +1100
Subject: [PATCH] xmon command for symbol lookups
In-Reply-To: <403BD647.5060509@austin.ibm.com>
References: <403BB57F.7010200@austin.ibm.com> <1077662814.965.26.camel@gaston> <403BD647.5060509@austin.ibm.com>
Message-ID: <20040224230506.GE5801@krispykreme>


> Ah, I was reinventing the wheel! I'll bring that stuff over from ppc32
> instead then.

Not completely, ppc32 xmon doesn't use kallsyms. I do like the idea of
doing lookups in disassembly too; I'm not sure if ppc32 does that.

> > Another _real cool_ feature is a dmesg in xmon btw :)
> Yeah, I can look at that next.

kdb adds all these nasty hooks into kernel/printk.c. I wonder if we can't
make the relevant symbols non-static so xmon, kdb, kgdb etc. can get at
them.

Anton



From benh at kernel.crashing.org  Wed Feb 25 10:06:30 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 25 Feb 2004 10:06:30 +1100
Subject: [PATCH] xmon command for symbol lookups
In-Reply-To: <20040224230506.GE5801@krispykreme>
References: <403BB57F.7010200@austin.ibm.com>
	 <1077662814.965.26.camel@gaston> <403BD647.5060509@austin.ibm.com>
	 <20040224230506.GE5801@krispykreme>
Message-ID: <1077663989.1105.28.camel@gaston>


On Wed, 2004-02-25 at 10:05, Anton Blanchard wrote:
> > Ah, I was reinventing the wheel! I'll bring that stuff over from ppc32
> > instead then.
>
> Not completely, ppc32 xmon doesn't use kallsyms. I do like the idea of
> doing lookups in disassembly too; I'm not sure if ppc32 does that.

No, it doesn't and that's a good feature.

> > > Another _real cool_ feature is a dmesg in xmon btw :)
> > Yeah, I can look at that next.
>
> kdb adds all these nasty hooks into kernel/printk.c. I wonder if we can't
> make the relevant symbols non-static so xmon, kdb, kgdb etc. can get at
> them.

Yeah, no need for hooks, and not even for making anything non-static btw;
we can use kallsyms to locate the symbols :)

Ben.




From hollisb at us.ibm.com  Wed Feb 25 10:20:52 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 24 Feb 2004 17:20:52 -0600
Subject: revised hvcs driver 
In-Reply-To: <1077663127.21201.5.camel@SigurRos.rchland.ibm.com>
References: <1077548434.933.15.camel@SigurRos.rchland.ibm.com>  <1077577919.5940.90.camel@gaston> <1077663127.21201.5.camel@SigurRos.rchland.ibm.com>
Message-ID: <16F915D3-6720-11D8-8F7A-000A95A0560C@us.ibm.com>


On Feb 24, 2004, at 4:52 PM, Ryan Arnold wrote:
>
> http://www-124.ibm.com/linux/patches/misc/hvcs-20040224.diff

One more comment: don't add things to hvconsole.h unless you want/need
other code to be able to use it (which is the case for the prototypes
currently in hvconsole.h). Your constants, struct, and prototypes
probably don't fall into that category.

--
Hollis Blanchard
IBM Linux Technology Center




From benh at kernel.crashing.org  Wed Feb 25 10:40:17 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 25 Feb 2004 10:40:17 +1100
Subject: [PATCH] xmon command for symbol lookups
In-Reply-To: <1077663989.1105.28.camel@gaston>
References: <403BB57F.7010200@austin.ibm.com>
	 <1077662814.965.26.camel@gaston> <403BD647.5060509@austin.ibm.com>
	 <20040224230506.GE5801@krispykreme>  <1077663989.1105.28.camel@gaston>
Message-ID: <1077666016.1128.30.camel@gaston>


On Wed, 2004-02-25 at 10:06, Benjamin Herrenschmidt wrote:

> > Not completely, ppc32 xmon doesnt use kallsyms. I do like the idea of
> > doing lookups in disassembly too, Im not sure if ppc32 does that.
>
> No, it doesn't and that's a good feature.

I mean that is a good feature to add :)


Ben.




From hollisb at us.ibm.com  Wed Feb 25 11:23:13 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 24 Feb 2004 18:23:13 -0600
Subject: [ppc64 patch] rename virtual IO sysfs directory
Message-ID: <CC9DB6E0-6728-11D8-8F7A-000A95A0560C@us.ibm.com>


This names the virtual IO devices sysfs directory /sys/devices/vio, to
match /sys/bus/vio. Linus, please apply.

===== arch/ppc64/kernel/vio.c 1.14 vs edited =====
--- 1.14/arch/ppc64/kernel/vio.c        Tue Feb 24 16:00:14 2004
+++ edited/arch/ppc64/kernel/vio.c      Tue Feb 24 18:30:39 2004
@@ -149,7 +149,7 @@
                 return 1;
         }
         memset(vio_bus_device, 0, sizeof(struct vio_dev));
-       strcpy(vio_bus_device->dev.bus_id, "vdevice");
+       strcpy(vio_bus_device->dev.bus_id, "vio");

         err = device_register(&vio_bus_device->dev);
         if (err) {


--
Hollis Blanchard
IBM Linux Technology Center




From hollisb at us.ibm.com  Wed Feb 25 11:38:14 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 24 Feb 2004 18:38:14 -0600
Subject: [ppc64 patch] rename virtual IO sysfs directory
Message-ID: <E574FC6F-672A-11D8-8F7A-000A95A0560C@us.ibm.com>


[Bad address book first time.]

This names the virtual IO devices sysfs directory /sys/devices/vio, to
match /sys/bus/vio. Linus, please apply.

===== arch/ppc64/kernel/vio.c 1.14 vs edited =====
--- 1.14/arch/ppc64/kernel/vio.c        Tue Feb 24 16:00:14 2004
+++ edited/arch/ppc64/kernel/vio.c      Tue Feb 24 18:30:39 2004
@@ -149,7 +149,7 @@
                 return 1;
         }
         memset(vio_bus_device, 0, sizeof(struct vio_dev));
-       strcpy(vio_bus_device->dev.bus_id, "vdevice");
+       strcpy(vio_bus_device->dev.bus_id, "vio");

         err = device_register(&vio_bus_device->dev);
         if (err) {


--
Hollis Blanchard
IBM Linux Technology Center




From anton at samba.org  Wed Feb 25 20:21:23 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 25 Feb 2004 20:21:23 +1100
Subject: [PATCH] ppc64 procfs cleanup
Message-ID: <20040225092123.GH5801@krispykreme>


Hi,

Olaf pointed me at an issue with our current procfs code and modules. I
decided to go through all of our code and clean things up.

Patch is at http://samba.org/~anton/fixup_procfs.patch

- Use initcalls everywhere. This allowed us to remove the iseries proc
  callback interface.
- Kill proc_pmc.c. Most of it wasn't used (and we are planning to export the
  PMCs via sysfs). The few things left were iseries specific so they
  got moved into iSeries_proc.c.
- Kill pmc.c. We don't use those statistics and the ones that are left
  can be gained via PMCs.
- Create /proc/iSeries and /proc/ppc64 very early. This means we no
  longer have to call proc_ppc64_init in all the drivers; we can
  assume it's there.
- Fix some error return cases in rtas-proc.c and rtas-flash.
- Don't even try some pseries specific drivers on mac.

I'm planning to merge this tomorrow unless there are any objections.

Anton



From nathanl at austin.ibm.com  Wed Feb 25 21:19:31 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Wed, 25 Feb 2004 04:19:31 -0600
Subject: [PATCH] ppc64 procfs cleanup
In-Reply-To: <20040225092123.GH5801@krispykreme>
References: <20040225092123.GH5801@krispykreme>
Message-ID: <403C76B3.9000108@austin.ibm.com>


> - Create /proc/iSeries and /proc/ppc64 very early. This means we no
>   longer have to call proc_ppc64_init in all the drivers; we can
>   assume it's there.

+__initcall(proc_ppc64_init);

What guarantee is there that proc_ppc64_init will run before the init
routines for scanlog, rtas, etc?  Perhaps
subsys_initcall(proc_ppc64_init) would be better?

Nathan



From anton at samba.org  Thu Feb 26 00:49:08 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 26 Feb 2004 00:49:08 +1100
Subject: [PATCH] ppc64 procfs cleanup
In-Reply-To: <403C76B3.9000108@austin.ibm.com>
References: <20040225092123.GH5801@krispykreme> <403C76B3.9000108@austin.ibm.com>
Message-ID: <20040225134908.GJ5801@krispykreme>


Hi Nathan

Are you trying to rival me for worst hours kept? :)

> +__initcall(proc_ppc64_init);
>
> What guarantee is there that proc_ppc64_init will run before the init
> routines for scanlog, rtas, etc?  Perhaps
> subsys_initcall(proc_ppc64_init) would be better?

I'll see your subsys_initcall and raise you a core_initcall.

The stuff in proc_ppc64_init shouldn't have any dependencies; I broke off
the important bits into proc_ppc64_create, which is a core_initcall.
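
To illustrate the ordering argument, here is a rough userspace model, not
kernel code. It assumes the 2.6 initcall levels (core_initcall = 1,
subsys_initcall = 4, __initcall/device_initcall = 6); every model_* name is
made up for the sketch. Calls run level by level, and calls within one level
run in table order, which stands in for link order.

```c
#include <assert.h>
#include <stddef.h>

/* Level numbers follow the 2.6 initcall sections (assumption of this model). */
enum model_level {
	MODEL_CORE = 1,
	MODEL_SUBSYS = 4,
	MODEL_DEVICE = 6,
	MODEL_LAST = 7
};

struct model_initcall {
	enum model_level level;
	int (*fn)(void);
};

static int model_proc_ready;	/* stands in for /proc/ppc64 existing */
static int model_users_ok;	/* drivers that found the directory */

/* Hypothetical equivalent of creating the directory early. */
static int model_proc_create(void)
{
	model_proc_ready = 1;
	return 0;
}

/* Hypothetical driver init that needs the directory to exist already. */
static int model_scanlog_init(void)
{
	if (!model_proc_ready)
		return -1;
	model_users_ok++;
	return 0;
}

/* Run all calls, lowest level first; returns the number of failed calls. */
static int model_run_initcalls(const struct model_initcall *calls, size_t n)
{
	int failures = 0;
	int level;
	size_t i;

	assert(calls != NULL || n == 0);
	for (level = MODEL_CORE; level <= MODEL_LAST; level++)
		for (i = 0; i < n; i++)
			if ((int)calls[i].level == level && calls[i].fn() != 0)
				failures++;
	return failures;
}
```

With both calls at MODEL_DEVICE the result depends on table order, so the
driver can run before the directory exists. Moving the create step to
MODEL_CORE removes that dependency on ordering within a level.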

Anton



From anton at samba.org  Thu Feb 26 01:55:14 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 26 Feb 2004 01:55:14 +1100
Subject: thrashing 2.6 ameslab
Message-ID: <20040225145514.GL5801@krispykreme>


Hi,

I just got the following WARN_ON when pounding the machine with
bash-shared-mappings. It's part of the tlb flush rework. Paul originally
had something in there to recognise the mm had changed and to fix it
up silently.

I changed it to WARN_ON thinking it shouldn't occur normally (an mm
changing in the middle of a batch). However, looking at this trace I guess
it can. Looks like we encountered memory pressure in copy_page_range
and ended up scanning pages doing page_referenced.

The problem is, copy_page_range is in the middle of a tlb batch...

I'll restore the silent flush in arch/ppc64/mm/tlb.c
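
A minimal userspace sketch of the "silent flush" idea, under the assumption
that a batch records which mm its entries belong to. None of these names come
from arch/ppc64/mm/tlb.c; model_batch, model_flush and model_batch_add are
hypothetical and only show the control flow.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BATCH_MAX 4

struct model_batch {
	int mm_id;				/* owner of the queued entries */
	size_t count;				/* entries currently queued */
	unsigned long addr[MODEL_BATCH_MAX];
	unsigned long flushes;			/* how many flushes have happened */
};

/* Drop all queued entries; counts as one flush if anything was queued. */
static void model_flush(struct model_batch *b)
{
	if (b->count) {
		b->flushes++;
		b->count = 0;
	}
}

/* Queue one entry, flushing first if the batch belongs to another mm. */
static void model_batch_add(struct model_batch *b, int mm_id, unsigned long addr)
{
	/* A different mm mid-batch is handled quietly rather than warned about. */
	if (b->count != 0 && b->mm_id != mm_id)
		model_flush(b);
	if (b->count == 0)
		b->mm_id = mm_id;

	assert(b->count < MODEL_BATCH_MAX);
	b->addr[b->count++] = addr;

	if (b->count == MODEL_BATCH_MAX)
		model_flush(b);
}
```

In the trace below, the second mm would come from page_referenced running
under memory pressure while copy_page_range still has a batch open.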

Anton

Badness in hpte_update at arch/ppc64/mm/tlb.c:71
Call Trace:
[c0000000000a32cc] .page_referenced+0x10c/0x25c
[c000000000095a14] .refill_inactive_zone+0xa7c/0xb2c
[c000000000095b60] .shrink_zone+0x9c/0xd4
[c000000000095d04] .shrink_caches+0x16c/0x194
[c000000000095e28] .try_to_free_pages+0xfc/0x224
[c00000000008a29c] .__alloc_pages+0x254/0x438
[c00000000008a4b4] .__get_free_pages+0x34/0x80
[c00000000008f10c] .cache_grow+0x158/0x528
[c00000000008f7c4] .cache_alloc_refill+0x2e8/0x398
[c00000000008fc7c] .kmem_cache_alloc+0xac/0xc0
[c00000000009cec0] .__pmd_alloc+0x60/0x14c
[c000000000098b4c] .copy_page_range+0x690/0x768
[c000000000058948] .copy_mm+0x62c/0x730
[c000000000059920] .copy_process+0x71c/0xfd4
[c00000000005a21c] .do_fork+0x44/0x210
[c000000000016ba8] .sys_fork+0x28/0x40
[c0000000000117d4] .ret_from_syscall_1+0x0/0xa4



From johnrose at austin.ibm.com  Thu Feb 26 04:21:52 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Wed, 25 Feb 2004 11:21:52 -0600
Subject: [PATCH] ppc64 procfs cleanup
In-Reply-To: <OFAD8CC52C.FCC96DDC-ON85256E45.0058AEE4-86256E45.0058B2B4@us.ibm.com>
References: 
	 <OFAD8CC52C.FCC96DDC-ON85256E45.0058AEE4-86256E45.0058B2B4@us.ibm.com>
Message-ID: <1077729711.10733.12.camel@verve.austin.ibm.com>


Hi Anton-

From rtas_flash.c:
+	if (rtas_token("ibm,update-flash-64-and-reboot") ==
+		       RTAS_UNKNOWN_SERVICE) {
+		printk(KERN_ERR "rtas_flash: no firmware flash support\n");
+		return 1;

Can we not add this? :)  The current module init creates the three /proc
files regardless, and handles the case of "function not supported" with
a certain return code upon /proc file read.

This allows the userland tool to distinguish between the error cases of
"You don't have the module compiled in/loaded" and "You don't have the
firmware functionality."

-static inline struct proc_dir_entry * create_flash_pde(const char *filename,
-					struct file_operations *fops)
+static struct proc_dir_entry *create_flash_pde(const char *filename,
+					       struct file_operations *fops)

For my own education, could you explain when it's appropriate to
inline?  I was under the impression that functions that could be macros
were good candidates.

Thanks-
John




From hollisb at us.ibm.com  Thu Feb 26 04:41:49 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Wed, 25 Feb 2004 11:41:49 -0600
Subject: [PATCH] ppc64 procfs cleanup
In-Reply-To: <1077729711.10733.12.camel@verve.austin.ibm.com>
References: <OFAD8CC52C.FCC96DDC-ON85256E45.0058AEE4-86256E45.0058B2B4@us.ibm.com> <1077729711.10733.12.camel@verve.austin.ibm.com>
Message-ID: <E3F004E2-67B9-11D8-B826-000A95A0560C@us.ibm.com>


On Feb 25, 2004, at 11:21 AM, John Rose wrote:
> For my own education, could you explain when it's appropriate to
> inline?  I was under the impression that functions that could be macros
> were good candidates.

It's been observed that too much is being inlined, to the point that
we're bigger and slower because we're eating icache. So recent wisdom
has been to let the compiler do it except in specific cases that can
demonstrate a performance improvement.

--
Hollis Blanchard
IBM Linux Technology Center




From olof at austin.ibm.com  Thu Feb 26 04:58:36 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Wed, 25 Feb 2004 11:58:36 -0600 (CST)
Subject: [PATCH] ppc64 procfs cleanup
In-Reply-To: <1077729711.10733.12.camel@verve.austin.ibm.com>
Message-ID: <Pine.A41.4.44.0402251155460.27692-100000@forte.austin.ibm.com>


On Wed, 25 Feb 2004, John Rose wrote:

> -static inline struct proc_dir_entry * create_flash_pde(const char *filename,
> -					struct file_operations *fops)
> +static struct proc_dir_entry *create_flash_pde(const char *filename,
> +					       struct file_operations *fops)
>
> For my own education, could you explain when it's appropriate to
> inline?  I was under the impression that functions that could be macros
> were good candidates.

Inlining should only be done where taking the additional function call
adds significant overhead and it's called often enough to impact system
performance.

In addition to the binary bloat, inlining makes debugging painful since
the code will be inserted in the caller and following the flow in
disassembly is hard.


-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM




From hollisb at us.ibm.com  Thu Feb 26 04:59:42 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Wed, 25 Feb 2004 11:59:42 -0600
Subject: [PATCH] rpadlpar changes for DLPAR VIO devices -- please review
In-Reply-To: <403CDE23.4040007@ltcfwd.linux.ibm.com>
References: <403BC5DA.4010705@ltcfwd.linux.ibm.com> <270FFA96-6719-11D8-8F7A-000A95A0560C@us.ibm.com> <403CDE23.4040007@ltcfwd.linux.ibm.com>
Message-ID: <63620E9A-67BC-11D8-B826-000A95A0560C@us.ibm.com>


On Feb 25, 2004, at 11:40 AM, Linda Xie wrote:

> Hollis Blanchard wrote:
>
>> On Feb 24, 2004, at 3:44 PM, Linda Xie wrote:
>>
>>> +    if (strstr(drc_name, "-V"))
>>> +        dn = find_php_slot_vio_node(drc_name);
>>> +    else
>>> +        dn = find_php_slot_pci_node(drc_name);
>>
>> I'm not sure this is safe as a canonical test. Maybe
>> sprintf("%s-V%i-D%i", ...)? Is this namespace defined and documented
>> somewhere? I just don't feel comfortable with a two-character test
>> determining it one way or the other.
>
> I agreed [ event with sprintf("%s-V%i-D%i", ...)]. Because  the format
>  can be
> changed by FW at anytime,  It looks like we have to search OFDT twice
> (the worst case) for a given
> drc-name:
> Call  find_php_slot_vio_node(drc_name) first, if return value is NULL,
> then
> call find_php_slot_pci_node(drc_name).

I had trouble understanding that first part, but I agree that it makes
sense to search both for vdevice and pci devices for a given location
code, rather than try to parse the location code yourself.
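
The search-both approach could look roughly like the sketch below. It is a
userspace model: the tables stand in for the device tree, the drc-name strings
are invented for illustration and are not real firmware location codes, and
model_find_slot is a hypothetical stand-in for calling
find_php_slot_vio_node() and then find_php_slot_pci_node().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_node {
	const char *drc_name;
	const char *bus;
};

/* Invented names; a real lookup would walk the device tree instead. */
static const struct model_node model_vio_nodes[] = {
	{ "example-V1-C2", "vio" },
};

static const struct model_node model_pci_nodes[] = {
	{ "example-P1-C3", "pci" },
	{ "example-P1-VGA", "pci" },	/* contains "-V" but is not a VIO slot */
};

/* Exact-match search of one table; returns NULL if the name is absent. */
static const struct model_node *model_find(const struct model_node *tbl,
					   size_t n, const char *drc_name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(tbl[i].drc_name, drc_name) == 0)
			return &tbl[i];
	return NULL;
}

/* Try the VIO nodes first, then fall back to PCI; never parse the name. */
static const struct model_node *model_find_slot(const char *drc_name)
{
	const struct model_node *dn;

	assert(drc_name != NULL);
	dn = model_find(model_vio_nodes,
			sizeof(model_vio_nodes) / sizeof(model_vio_nodes[0]),
			drc_name);
	if (!dn)
		dn = model_find(model_pci_nodes,
				sizeof(model_pci_nodes) / sizeof(model_pci_nodes[0]),
				drc_name);
	return dn;
}
```

The second PCI entry shows why a strstr(drc_name, "-V") test is fragile: the
substring matches, yet the node is only found in the PCI table.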

--
Hollis Blanchard
IBM Linux Technology Center




From olof at austin.ibm.com  Thu Feb 26 09:32:41 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Wed, 25 Feb 2004 16:32:41 -0600 (CST)
Subject: Adding kallsyms_lookupname()
Message-ID: <Pine.A41.4.44.0402251618360.28842-200000@forte.austin.ibm.com>

Rusty,

Attached patch adds a kallsyms_lookupname() function for lookups of a
symbol name to an address.

I've attempted to be somewhat efficient and skip all "stems" where the
base part of the name doesn't match. I also added a loop through module
symbols to try finding the symbol there in case it's not found in the
kernel table. That part is not as efficient, but that's OK.

Furthermore, it's intentionally not exported as a symbol for module use,
since it can be used to circumvent other symbol export restrictions.
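
For readers unfamiliar with the stem-compressed name table, here is a
simplified userspace sketch of a name-to-address lookup. It assumes each entry
is one byte giving the number of characters shared with the previous name,
followed by the NUL-terminated remainder. The table contents, addresses and
model_* names are made up, and the sketch leaves out the stem-skipping
optimisation and the module fallback that the attached patch has.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NAME_MAX 128
#define MODEL_NUM_SYMS 4

/* Each entry: <shared-prefix length> <suffix> '\0'. Contents are invented. */
static const char model_names[] =
	"\0" "sys_fork\0"
	"\4" "read\0"		/* sys_ + read */
	"\4" "write\0"		/* sys_ + write */
	"\0" "xmon_init\0";

static const unsigned long model_addresses[MODEL_NUM_SYMS] = {
	0x1000UL, 0x2000UL, 0x3000UL, 0x4000UL
};

/* Return the address recorded for name, or 0 if it is not in the table. */
static unsigned long model_lookup_name(const char *name)
{
	char namebuf[MODEL_NAME_MAX];
	const char *p = model_names;
	size_t i;

	namebuf[0] = '\0';
	for (i = 0; i < MODEL_NUM_SYMS; i++) {
		size_t prefix = (unsigned char)*p++;
		size_t len = strlen(p);

		assert(prefix + len < MODEL_NAME_MAX);
		/* Keep the shared prefix from the previous name, append the suffix. */
		memcpy(namebuf + prefix, p, len + 1);
		if (strcmp(namebuf, name) == 0)
			return model_addresses[i];
		p += len + 1;
	}
	return 0;
}
```

Because every name is rebuilt from the one before it, the table has to be
walked from the start; skipping whole stems, as the patch attempts, only saves
the comparisons, not the walk.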


Please consider for upstream inclusion.


Thanks,

-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM
-------------- next part --------------
===== include/linux/kallsyms.h 1.3 vs edited =====
--- 1.3/include/linux/kallsyms.h	Wed Dec 25 21:46:20 2002
+++ edited/include/linux/kallsyms.h	Tue Feb 24 21:59:20 2004
@@ -8,6 +8,9 @@
 #include <linux/config.h>
 
 #ifdef CONFIG_KALLSYMS
+/* Lookup the address of a symbol. Returns 0 if not found. */
+unsigned long kallsyms_lookupname(char *name);
+
 /* Lookup an address.  modname is set to NULL if it's in the kernel. */
 const char *kallsyms_lookup(unsigned long addr,
 			    unsigned long *symbolsize,
@@ -18,6 +21,11 @@
 extern void __print_symbol(const char *fmt, unsigned long address);
 
 #else /* !CONFIG_KALLSYMS */
+
+static inline unsigned long kallsyms_lookupname(char *name)
+{
+	return 0;
+}
 
 static inline const char *kallsyms_lookup(unsigned long addr,
 					  unsigned long *symbolsize,
===== kernel/kallsyms.c 1.14 vs edited =====
--- 1.14/kernel/kallsyms.c	Sun Aug 31 18:14:13 2003
+++ edited/kernel/kallsyms.c	Wed Feb 25 16:29:42 2004
@@ -37,6 +37,58 @@
 	return 0;
 }
 
+/* Lookup the address of a symbol. Returns 0 if not found. */
+unsigned long kallsyms_lookupname(char *name)
+{
+	unsigned long i;
+	char namebuf[128];
+	char *knames = kallsyms_names;
+	unsigned int namelen = strlen(name);
+	unsigned long val;
+	char type;
+
+	/* This kernel should never had been booted. */
+	BUG_ON(!kallsyms_addresses);
+
+	namebuf[0] = 0;
+
+	for (i = 0; i < kallsyms_num_syms; i++) { 
+		unsigned prefix = *knames++;
+		unsigned len = strlen(knames);
+
+		/* Skip over as long as prefix at 0 doesn't match */
+		if (!prefix && len <= namelen &&
+		    strncmp(knames, name, len)) {
+			do {
+				knames += len + 1;
+				prefix = *knames++;
+				len = strlen(knames);
+				i++;
+			} while (prefix && i < kallsyms_num_syms);
+		}
+
+		strncpy(namebuf + prefix, knames, 127 - prefix);
+
+		if (prefix + len == namelen &&
+		     !strncmp(namebuf, name, namelen))
+			return kallsyms_addresses[i];
+		knames += len + 1;
+	}
+
+	/* If not found above, try looking up the name in modules.
+	 * This isn't all that efficient, but performance isn't critical
+	 * here.
+	 */
+	i = 0;
+	while (module_get_kallsym(i++, &val, &type, namebuf)) {
+		namebuf[127] = 0; /* Just in case */
+		if (!strcmp(namebuf, name))
+			return val;
+	}
+
+	return 0;
+}
+
 /* Lookup an address.  modname is set to NULL if it's in the kernel. */
 const char *kallsyms_lookup(unsigned long addr,
 			    unsigned long *symbolsize,

From olof at austin.ibm.com  Thu Feb 26 09:46:05 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Wed, 25 Feb 2004 16:46:05 -0600 (CST)
Subject: xmon support for symbol lookup
Message-ID: <Pine.A41.4.44.0402251634020.25240-200000@forte.austin.ibm.com>

Attached patch adds symbol lookup functions to xmon, similar to what Ben
added to ppc32 but not requiring a linked-in system.map.

It requires the kallsyms_lookupname patch (see previous post), so I won't
push this until I've heard back that it's accepted upstream.

Commands added are "la <symbol>" and "ls <address>". The syntax $<symbol>
can also be used to specify addresses, like on ppc32:

0:mon> di $.sysrq_handle_xmon
c000000000046b5c  7c0802a6      mflr    r0
c000000000046b60  fba1ffe8      std     r29,-24(r1)
c000000000046b64  7c9d2378      mr      r29,r4
c000000000046b68  f8010010      std     r0,16(r1)
c000000000046b6c  f821ff71      stdu    r1,-144(r1)
c000000000046b70  48004691      bl      0xc00000000004b200     # .xmon_init+0x0


(also notice the added comments in the disasm output with symbols)
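
The address syntax can be sketched in userspace as below. This is an
illustration and not the xmon scanhex() code: it parses from a string instead
of reading characters with inchar(), model_lookup is a hypothetical stand-in
for the name lookup, and the address it returns is invented.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name lookup; returns 0 for unknown symbols. */
static unsigned long model_lookup(const char *name)
{
	if (strcmp(name, ".xmon_init") == 0)
		return 0x4b200UL;	/* made-up address */
	return 0;
}

/*
 * Parse "$symbol", "0x1f" or "1f" from s.
 * Returns 1 and stores the value in *vp on success, 0 on failure.
 */
static int model_parse_address(const char *s, unsigned long *vp)
{
	unsigned long v = 0;
	int ndigits = 0;

	assert(s != NULL && vp != NULL);

	if (*s == '$') {
		char sym[64];
		size_t i = 0;

		s++;
		/* Copy up to the first whitespace, leaving room for the NUL. */
		while (*s && !isspace((unsigned char)*s) && i < sizeof(sym) - 1)
			sym[i++] = *s++;
		sym[i] = '\0';
		*vp = model_lookup(sym);
		return *vp != 0;
	}

	/* Skip a leading "0x" if any. */
	if (s[0] == '0' && s[1] == 'x')
		s += 2;

	while (isxdigit((unsigned char)*s)) {
		int c = tolower((unsigned char)*s++);

		v = (v << 4) + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
		ndigits++;
	}
	if (ndigits == 0)
		return 0;
	*vp = v;
	return 1;
}
```

A lookup result of 0 doubles as "not found", so a symbol that really lives at
address 0 could not be named this way; that matches the convention stated for
kallsyms_lookupname().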


-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

-------------- next part --------------
===== arch/ppc64/xmon/ppc-dis.c 1.1 vs edited =====
--- 1.1/arch/ppc64/xmon/ppc-dis.c	Thu Feb 14 06:14:36 2002
+++ edited/arch/ppc64/xmon/ppc-dis.c	Wed Feb 25 11:22:49 2004
@@ -18,6 +18,7 @@
 along with this file; see the file COPYING.  If not, write to the Free
 Software Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.  */
 
+#include <linux/kallsyms.h>
 #include "nonstdio.h"
 #include "ansidecl.h"
 #include "ppc.h"
@@ -61,6 +62,7 @@
       int invalid;
       int need_comma;
       int need_paren;
+      unsigned long addr;
 
       table_op = PPC_OP (opcode->opcode);
       if (op < table_op)
@@ -93,6 +95,7 @@
       /* Now extract and print the operands.  */
       need_comma = 0;
       need_paren = 0;
+      addr = 0;
       for (opindex = opcode->operands; *opindex != 0; opindex++)
 		{
 		  long value;
@@ -134,9 +137,10 @@
 		    fprintf(out, "r%ld", value);
 		  else if ((operand->flags & PPC_OPERAND_FPR) != 0)
 		    fprintf(out, "f%ld", value);
-		  else if ((operand->flags & PPC_OPERAND_RELATIVE) != 0)
+		  else if ((operand->flags & PPC_OPERAND_RELATIVE) != 0) {
 		    print_address (memaddr + value);
-		  else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0)
+		    addr = memaddr + value;
+		  } else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0)
 		    print_address (value & 0xffffffff);
 		  else if ((operand->flags & PPC_OPERAND_CR) == 0
 			   || (dialect & PPC_OPCODE_PPC) == 0)
@@ -178,6 +182,21 @@
 	      need_paren = 1;
 	    }
 	}
+
+      if (addr) {
+          char namebuf[128];
+	  const char *name;
+	  char *modname;
+	  long size, offset;
+
+          name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
+          if (name) {
+              if(modname)
+		      fprintf(out, "\t# [%s]%s+0x%lx", modname, name, offset);
+	      else
+	              fprintf(out, "\t# %s+0x%lx", name, offset);
+	  }
+      }
 
       /* We have found and printed an instruction; return.  */
       return 4;
===== arch/ppc64/xmon/xmon.c 1.34 vs edited =====
--- 1.34/arch/ppc64/xmon/xmon.c	Sat Feb 14 06:40:59 2004
+++ edited/arch/ppc64/xmon/xmon.c	Wed Feb 25 11:38:19 2004
@@ -115,6 +115,7 @@
 static void csum(void);
 static void bootcmds(void);
 void dump_segments(void);
+void symbol_lookup(void);
 
 static void debug_trace(void);
 
@@ -160,6 +161,8 @@
   dd	dump double values\n\
   e	print exception information\n\
   f	flush cache\n\
+  la	lookup address\n\
+  ls	lookup symbol\n\
   m	examine/change memory\n\
   mm	move a block of memory\n\
   ms	set a block of memory\n\
@@ -537,6 +540,9 @@
 		case 'd':
 			dump();
 			break;
+		case 'l':
+			symbol_lookup();
+			break;
 		case 'r':
 			if (excp != NULL)
 				prregs(excp);	/* print regs */
@@ -1142,6 +1148,7 @@
 extern char exc_prolog;
 extern char dec_exc;
 
+
 void
 super_regs()
 {
@@ -1644,6 +1651,7 @@
 	printf("0x%lx", addr);
 }
 
+
 /*
  * Memory operations - move, set, print differences
  */
@@ -1822,8 +1830,40 @@
 		return 0;
 	}
 
+	/* skip leading "0x" if any */
+
+	if (c == '0') {
+		c = inchar();
+		if (c == 'x')
+			c = inchar();
+	}
+
+	if (c == '0') {
+		c = inchar();
+		if (c == 'x')
+			c = inchar();
+	} else if (c == '$') {
+		static char symname[64];
+		int i;
+		for (i=0; i<63; i++) {
+                        c = inchar();
+			if (isspace(c)) {
+				termch = c;
+				break;
+			}
+			symname[i] = c;
+		}
+		symname[i++] = 0;
+		*vp = kallsyms_lookupname(symname);
+		if (!(*vp)) {
+			printf("unknown symbol '%s'\n", symname);
+			return 0;
+		}
+		return 1;
+	}
+	
 	d = hexdigit(c);
-	if( d == EOF ){
+	if (d == EOF) {
 		termch = c;
 		return 0;
 	}
@@ -1832,7 +1872,7 @@
 		v = (v << 4) + d;
 		c = inchar();
 		d = hexdigit(c);
-	} while( d != EOF );
+	} while (d != EOF);
 	termch = c;
 	*vp = v;
 	return 1;
@@ -1907,13 +1947,48 @@
 	lineptr = str;
 }
 
+
+void
+symbol_lookup(void)
+{
+        int type = inchar();
+        unsigned long addr;
+        static char tmp[64];
+
+        switch (type) {
+	case 'a':
+		if (scanhex(&addr)) {
+			printf("%lx: ", addr);
+			xmon_print_symbol("%s\n", addr);
+		}
+		termch = 0;
+		break;
+	case 's':
+		getstring(tmp, 64);
+		if (setjmp(bus_error_jmp) == 0) {
+			__debugger_fault_handler = handle_fault;
+			sync();
+			addr = kallsyms_lookupname(tmp);
+			if (addr) 
+				printf("%s: %lx\n", tmp, addr);
+			else
+				printf("Symbol '%s' not found.\n", tmp);
+			sync();
+		}
+		__debugger_fault_handler = 0;
+		termch = 0;
+		break;
+        }
+}
+
+
 /* xmon version of __print_symbol */
 void __xmon_print_symbol(const char *fmt, unsigned long address)
 {
 	char *modname;
 	const char *name;
 	unsigned long offset, size;
-	char namebuf[128];
+	static char namebuf[128];
 
 	if (setjmp(bus_error_jmp) == 0) {
 		__debugger_fault_handler = handle_fault;

From mjanders at us.ibm.com  Thu Feb 26 10:05:27 2004
From: mjanders at us.ibm.com (Michael Anderson)
Date: Wed, 25 Feb 2004 17:05:27 -0600
Subject: PCI Interrupts with CONFIG_PPC_ISERIES in linux 2.6.3
Message-ID: <OF3C8BB998.5AE1BF27-ON86256E45.007DE107-86256E45.007EE71F@us.ibm.com>


It appears that I am not getting interrupts on my iseries machine.  I
compiled the ibmsis, ipr, olympic and icom drivers as modules.  The
ibmsis, ipr and olympic drivers hung when modprobe was attempted.  The
icom driver did modprobe successfully, since no interrupts are needed for
icom to install, but when I attempted to transmit data, which does
require interrupts, none were received and the transmit operations hung.
I loaded this same driver on a similar pseries install and the icom
driver did receive interrupts, so this appears to be an iseries issue.

Any known problems in this area?  I've seen this behavior before on the 2.4
kernel last fall.




From anton at samba.org  Thu Feb 26 10:05:39 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 26 Feb 2004 10:05:39 +1100
Subject: [PATCH] ppc64 procfs cleanup
In-Reply-To: <1077729711.10733.12.camel@verve.austin.ibm.com>
References: <OFAD8CC52C.FCC96DDC-ON85256E45.0058AEE4-86256E45.0058B2B4@us.ibm.com> <1077729711.10733.12.camel@verve.austin.ibm.com>
Message-ID: <20040225230539.GM5801@krispykreme>


> +	if (rtas_token("ibm,update-flash-64-and-reboot") ==
> +		       RTAS_UNKNOWN_SERVICE) {
> +		printk(KERN_ERR "rtas_flash: no firmware flash support\n");
> +		return 1;
>
> Can we not add this? :)  The current module init creates the three /proc
> files regardless, and handles the case of "function not supported" with
> a certain return code upon /proc file read.

I'm thinking Linus and his G5 here :) We have a bunch of stuff that's pseries
specific that G5 should never know about. I just copied how scanlog
handles this (doesn't load if you don't have the required RTAS methods).

I'm open to other suggestions, I guess we could do a platform & PSERIES
check.

Anton



From anton at samba.org  Thu Feb 26 10:10:55 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 26 Feb 2004 10:10:55 +1100
Subject: PCI Interrupts with CONFIG_PPC_ISERIES in linux 2.6.3
In-Reply-To: <OF3C8BB998.5AE1BF27-ON86256E45.007DE107-86256E45.007EE71F@us.ibm.com>
References: <OF3C8BB998.5AE1BF27-ON86256E45.007DE107-86256E45.007EE71F@us.ibm.com>
Message-ID: <20040225231055.GN5801@krispykreme>


Hi Mike,

> It appears that I am not getting interrupts on my iseries machine.  I
> compiled the ibmsis, ipr, olympic and icom drivers as modules.  The
> ibmsis, ipr and olympic drivers hung when modprobe was attempted.  The
> icom driver did modprobe successfully, since no interrupts are needed for
> icom to install, but when I attempted to transmit data, which does
> require interrupts, none were received and the transmit operations hung.
> I loaded this same driver on a similar pseries install and the icom
> driver did receive interrupts, so this appears to be an iseries issue.
>
> Any known problems in this area?  I've seen this behavior before on the 2.4
> kernel last fall.

Yep, I noticed it on our iseries box here; there is a problem with
ameslab at the moment.

Linus' tree (Paul's large IRQ patch got merged yesterday) should work. We'll
be looking into the ameslab issue today.

Anton



From paulus at samba.org  Thu Feb 26 11:38:50 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 26 Feb 2004 11:38:50 +1100
Subject: xmon support for symbol lookup
In-Reply-To: <Pine.A41.4.44.0402251634020.25240-200000@forte.austin.ibm.com>
References: <Pine.A41.4.44.0402251634020.25240-200000@forte.austin.ibm.com>
Message-ID: <16445.16410.436976.639488@cargo.ozlabs.ibm.com>


Hi Olof,

> Attached patch adds symbol lookup functions to xmon, similar to what Ben
> added to ppc32 but not requiring a linked-in system.map.

Nice :)

> It requires the kallsyms_lookupname patch (see previous post), so I won't
> push this until I've heard back that it's accepted upstream.
>
> Commands added are "la <symbol>" and "ls <address>". The syntax $<symbol>
> can also be used to specify addresses, like on ppc32:

Hopefully you mean "la <address>" and "ls <symbol>", at least that is
what the code seems to implement.

> ===== arch/ppc64/xmon/ppc-dis.c 1.1 vs edited =====
> --- 1.1/arch/ppc64/xmon/ppc-dis.c	Thu Feb 14 06:14:36 2002
> +++ edited/arch/ppc64/xmon/ppc-dis.c	Wed Feb 25 11:22:49 2004

Someone needs to update ppc-dis.c and ppc-opc.c with a more recent
version from the BFD library.  The one we have seems to be missing
some power4 instructions, and also seems to have a 32-bit assumption
here:

> +		  } else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0)
>  		    print_address (value & 0xffffffff);

I don't like this bit, though:

> +      if (addr) {
> +          char namebuf[128];
> +	  const char *name;
> +	  char *modname;
> +	  long size, offset;
> +
> +          name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
> +          if (name) {
> +              if(modname)
> +		      fprintf(out, "\t# [%s]%s+0x%lx", modname, name, offset);
> +	      else
> +	              fprintf(out, "\t# %s+0x%lx", name, offset);
> +	  }
> +      }

This should be put in the print_address function.  (With correct
indentation. :)  If we minimize the changes we make to ppc-dis.c, it
makes it easier to update it from the BFD version as we go along.

And here it looks like you are sometimes using 8 spaces to indent, and
sometimes a tab:

> +void
> +symbol_lookup(void)
> +{
> +        int type = inchar();
> +        unsigned long addr;
> +        static char tmp[64];
> +
> +        switch (type) {
> +	case 'a':
> +		if (scanhex(&addr)) {

It would be good if you could use tabs everywhere.

Regards,
Paul.



From olof at austin.ibm.com  Thu Feb 26 12:59:00 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Wed, 25 Feb 2004 19:59:00 -0600 (CST)
Subject: xmon support for symbol lookup
In-Reply-To: <16445.16410.436976.639488@cargo.ozlabs.ibm.com>
Message-ID: <Pine.A41.4.44.0402251926410.43084-200000@forte.austin.ibm.com>

Paul,

Thanks for your feedback, see below.

On Thu, 26 Feb 2004, Paul Mackerras wrote:

> Hopefully you mean "la <address>" and "ls <symbol>", at least that is
> what the code seems to implement.

Ack, yes. Helptext has been updated to clarify too.

Whitespace weirdness was partially because I was trying to follow the weird
existing style of ppc-dis.c. The space/tab issue was cut-n-paste garbage.
All that has been fixed.

I also consolidated some of the static char arrays to use one global
instead to save some BSS.


Actually, I can push everything but the "ls" functionality soonish
if it'll take a while to get the kallsyms change into mainline.


-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM


>
> > ===== arch/ppc64/xmon/ppc-dis.c 1.1 vs edited =====
> > --- 1.1/arch/ppc64/xmon/ppc-dis.c	Thu Feb 14 06:14:36 2002
> > +++ edited/arch/ppc64/xmon/ppc-dis.c	Wed Feb 25 11:22:49 2004
>
> Someone needs to update ppc-dis.c and ppc-opc.c with a more recent
> version from the BFD library.  The one we have seems to be missing
> some power4 instructions, and also seems to have a 32-bit assumption
> here:
>
> > +		  } else if ((operand->flags & PPC_OPERAND_ABSOLUTE) != 0)
> >  		    print_address (value & 0xffffffff);
>
> I don't like this bit, though:
>
> > +      if (addr) {
> > +          char namebuf[128];
> > +	  const char *name;
> > +	  char *modname;
> > +	  long size, offset;
> > +
> > +          name = kallsyms_lookup(addr, &size, &offset, &modname, namebuf);
> > +          if (name) {
> > +              if(modname)
> > +		      fprintf(out, "\t# [%s]%s+0x%lx", modname, name, offset);
> > +	      else
> > +	              fprintf(out, "\t# %s+0x%lx", name, offset);
> > +	  }
> > +      }
>
> This should be put in the print_address function.  (With correct
> indentation. :)  If we minimize the changes we make to ppc_dis.c, it
> makes it easier to update it from the BFD version as we go along.
>
> And here it looks like you are sometimes using 8 spaces to indent, and
> sometimes a tab:
>
> > +void
> > +symbol_lookup(void)
> > +{
> > +        int type = inchar();
> > +        unsigned long addr;
> > +        static char tmp[64];
> > +
> > +        switch (type) {
> > +	case 'a':
> > +		if (scanhex(&addr)) {
>
> It would be good if you could use tabs everywhere.
>
> Regards,
> Paul.
>
-------------- next part --------------
===== arch/ppc64/xmon/xmon.c 1.34 vs edited =====
--- 1.34/arch/ppc64/xmon/xmon.c	Sat Feb 14 06:40:59 2004
+++ edited/arch/ppc64/xmon/xmon.c	Wed Feb 25 19:53:15 2004
@@ -50,6 +50,7 @@
 static unsigned long nidump = 16;
 static unsigned long ncsum = 4096;
 static int termch;
+static char tmpstr[128];
 
 static u_int bus_error_jmp[100];
 #define setjmp xmon_setjmp
@@ -115,6 +116,7 @@
 static void csum(void);
 static void bootcmds(void);
 void dump_segments(void);
+void symbol_lookup(void);
 
 static void debug_trace(void);
 
@@ -160,6 +162,8 @@
   dd	dump double values\n\
   e	print exception information\n\
   f	flush cache\n\
+  la	lookup symbol+offset of specified address\n\
+  ls	lookup address of specified symbol\n\
   m	examine/change memory\n\
   mm	move a block of memory\n\
   ms	set a block of memory\n\
@@ -537,6 +541,9 @@
 		case 'd':
 			dump();
 			break;
+		case 'l':
+			symbol_lookup();
+			break;
 		case 'r':
 			if (excp != NULL)
 				prregs(excp);	/* print regs */
@@ -1641,9 +1648,22 @@
 void
 print_address(unsigned long addr)
 {
-	printf("0x%lx", addr);
+	const char *name;
+	char *modname;
+	long size, offset;
+
+	name = kallsyms_lookup(addr, &size, &offset, &modname, tmpstr);
+
+	if (name) {
+		if (modname)
+			printf("0x%lx\t# [%s]%s+0x%lx", addr, modname, name, offset);
+		else
+			printf("0x%lx\t# %s+0x%lx", addr, name, offset);
+	} else
+		printf("0x%lx", addr);
 }
 
+
 /*
  * Memory operations - move, set, print differences
  */
@@ -1822,8 +1842,33 @@
 		return 0;
 	}
 
+	/* skip leading "0x" if any */
+
+	if (c == '0') {
+		c = inchar();
+		if (c == 'x')
+			c = inchar();
+	} else if (c == '$') {
+		int i;
+		for (i=0; i<63; i++) {
+			c = inchar();
+			if (isspace(c)) {
+				termch = c;
+				break;
+			}
+			tmpstr[i] = c;
+		}
+		tmpstr[i++] = 0;
+		*vp = kallsyms_lookupname(tmpstr);
+		if (!(*vp)) {
+			printf("unknown symbol '%s'\n", tmpstr);
+			return 0;
+		}
+		return 1;
+	}
+	
 	d = hexdigit(c);
-	if( d == EOF ){
+	if (d == EOF) {
 		termch = c;
 		return 0;
 	}
@@ -1832,7 +1877,7 @@
 		v = (v << 4) + d;
 		c = inchar();
 		d = hexdigit(c);
-	} while( d != EOF );
+	} while (d != EOF);
 	termch = c;
 	*vp = v;
 	return 1;
@@ -1907,19 +1952,53 @@
 	lineptr = str;
 }
 
+
+void
+symbol_lookup(void)
+{
+	int type = inchar();
+	unsigned long addr;
+	static char tmp[64];
+
+	switch (type) {
+	case 'a':
+		if (scanhex(&addr)) {
+			printf("%lx: ", addr);
+			xmon_print_symbol("%s\n", addr);
+		}
+		termch = 0;
+		break;
+	case 's':
+		getstring(tmp, 64);
+		if (setjmp(bus_error_jmp) == 0) {
+			__debugger_fault_handler = handle_fault;
+			sync();
+			addr = kallsyms_lookupname(tmp);
+			if (addr) 
+				printf("%s: %lx\n", tmp, addr);
+			else
+				printf("Symbol '%s' not found.\n", tmp);
+			sync();
+		}
+		__debugger_fault_handler = 0;
+		termch = 0;
+		break;
+	}
+}
+
+
 /* xmon version of __print_symbol */
 void __xmon_print_symbol(const char *fmt, unsigned long address)
 {
 	char *modname;
 	const char *name;
 	unsigned long offset, size;
-	char namebuf[128];
 
 	if (setjmp(bus_error_jmp) == 0) {
 		__debugger_fault_handler = handle_fault;
 		sync();
 		name = kallsyms_lookup(address, &size, &offset, &modname,
-				       namebuf);
+				       tmpstr);
 		sync();
 		/* wait a little while to see if we get a machine check */
 		__delay(200);

From rusty at au1.ibm.com  Thu Feb 26 17:51:08 2004
From: rusty at au1.ibm.com (Rusty Russell)
Date: Thu, 26 Feb 2004 17:51:08 +1100
Subject: Adding kallsyms_lookupname() 
In-Reply-To: Your message of "Wed, 25 Feb 2004 16:32:41 MDT."
             <Pine.A41.4.44.0402251618360.28842-200000@forte.austin.ibm.com> 
Message-ID: <20040226225409.91D7C17DD8@ozlabs.au.ibm.com>


In message <Pine.A41.4.44.0402251618360.28842-200000 at forte.austin.ibm.com> you write:
> Rusty,
>
> Attached patch adds a kallsyms_lookupname() function for lookups of a
> symbol name to an address.

OK, I simplified it a bit, and gave it some prototype love:

> +unsigned long kallsyms_lookupname(char *name);
....
> +static inline const unsigned long kallsyms_lookupname(unsigned long addr)

How's this:

Name: kallsyms_lookupname() Function For Debuggers
Author: Olof Johansson, Rusty Russell
Status: Experimental

Attached patch adds a kallsyms_lookup_name() function that looks up the
address of a symbol given its name.

It's intentionally not exported as a symbol for module use, since it
can be used to circumvent other symbol export restrictions.

diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/include/linux/kallsyms.h .32130-linux-2.6.3-bk7.updated/include/linux/kallsyms.h
--- .32130-linux-2.6.3-bk7/include/linux/kallsyms.h	2003-09-22 09:47:17.000000000 +1000
+++ .32130-linux-2.6.3-bk7.updated/include/linux/kallsyms.h	2004-02-26 13:57:39.000000000 +1100
@@ -8,6 +8,9 @@
 #include <linux/config.h>

 #ifdef CONFIG_KALLSYMS
+/* Lookup the address for a symbol. Returns 0 if not found. */
+unsigned long kallsyms_lookup_name(const char *name);
+
 /* Lookup an address.  modname is set to NULL if it's in the kernel. */
 const char *kallsyms_lookup(unsigned long addr,
 			    unsigned long *symbolsize,
@@ -19,6 +22,11 @@ extern void __print_symbol(const char *f

 #else /* !CONFIG_KALLSYMS */

+static inline unsigned long kallsyms_lookup_name(const char *name)
+{
+	return 0;
+}
+
 static inline const char *kallsyms_lookup(unsigned long addr,
 					  unsigned long *symbolsize,
 					  unsigned long *offset,
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/include/linux/module.h .32130-linux-2.6.3-bk7.updated/include/linux/module.h
--- .32130-linux-2.6.3-bk7/include/linux/module.h	2004-02-04 15:39:14.000000000 +1100
+++ .32130-linux-2.6.3-bk7.updated/include/linux/module.h	2004-02-26 15:20:29.000000000 +1100
@@ -282,6 +282,10 @@ struct module *module_get_kallsym(unsign
 				  unsigned long *value,
 				  char *type,
 				  char namebuf[128]);
+
+/* Look for this name: can be of form module:name. */
+unsigned long module_kallsyms_lookup_name(const char *name);
+
 int is_exported(const char *name, const struct module *mod);

 extern void __module_put_and_exit(struct module *mod, long code)
@@ -434,6 +438,11 @@ static inline struct module *module_get_
 	return NULL;
 }

+static inline unsigned long module_kallsyms_lookup_name(const char *name)
+{
+	return 0;
+}
+
 static inline int is_exported(const char *name, const struct module *mod)
 {
 	return 0;
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/kernel/kallsyms.c .32130-linux-2.6.3-bk7.updated/kernel/kallsyms.c
--- .32130-linux-2.6.3-bk7/kernel/kallsyms.c	2003-09-22 10:28:13.000000000 +1000
+++ .32130-linux-2.6.3-bk7.updated/kernel/kallsyms.c	2004-02-26 14:22:37.000000000 +1100
@@ -37,6 +37,25 @@ static inline int is_kernel_text(unsigne
 	return 0;
 }

+/* Lookup the address for this symbol. Returns 0 if not found. */
+unsigned long kallsyms_lookup_name(const char *name)
+{
+	char namebuf[128];
+	unsigned long i;
+	char *knames;
+
+	for (i = 0, knames = kallsyms_names; i < kallsyms_num_syms; i++) {
+		unsigned prefix = *knames++;
+
+		strlcpy(namebuf + prefix, knames, 127 - prefix);
+		if (strcmp(namebuf, name) == 0)
+			return kallsyms_addresses[i];
+
+		knames += strlen(knames) + 1;
+	}
+	return module_kallsyms_lookup_name(name);
+}
+
 /* Lookup an address.  modname is set to NULL if it's in the kernel. */
 const char *kallsyms_lookup(unsigned long addr,
 			    unsigned long *symbolsize,
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .32130-linux-2.6.3-bk7/kernel/module.c .32130-linux-2.6.3-bk7.updated/kernel/module.c
--- .32130-linux-2.6.3-bk7/kernel/module.c	2004-02-26 11:53:26.000000000 +1100
+++ .32130-linux-2.6.3-bk7.updated/kernel/module.c	2004-02-26 14:40:25.000000000 +1100
@@ -1892,6 +1892,37 @@ struct module *module_get_kallsym(unsign
 	up(&module_mutex);
 	return NULL;
 }
+
+static unsigned long mod_find_symname(struct module *mod, const char *name)
+{
+	unsigned int i;
+
+	for (i = 0; i < mod->num_symtab; i++)
+		if (strcmp(name, mod->strtab+mod->symtab[i].st_name) == 0)
+			return mod->symtab[i].st_value;
+	return 0;
+}
+
+/* Look for this name: can be of form module:name. */
+unsigned long module_kallsyms_lookup_name(const char *name)
+{
+	struct module *mod;
+	char *colon;
+	unsigned long ret = 0;
+
+	/* Don't lock: we're in enough trouble already. */
+	if ((colon = strchr(name, ':')) != NULL) {
+		*colon = '\0';
+		if ((mod = find_module(name)) != NULL)
+			ret = mod_find_symname(mod, colon+1);
+		*colon = ':';
+	} else {
+		list_for_each_entry(mod, &modules, list)
+			if ((ret = mod_find_symname(mod, name)) != 0)
+				break;
+	}
+	return ret;
+}
 #endif /* CONFIG_KALLSYMS */

 /* Called by the /proc file system to return a list of modules. */
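
For readers following the loop in kallsyms_lookup_name() above: the name
table appears to be prefix-compressed, with each entry holding one byte
that says how many leading characters to reuse from the previous name,
followed by the rest of the name. Below is a minimal userspace sketch of
that scheme, assuming a made-up four-entry table. The names demo_names,
demo_addresses and demo_lookup_name are hypothetical, not kernel symbols,
and memcpy stands in for the strlcpy used in the patch.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_NUM_SYMS 4

/*
 * Hypothetical stand-in for the compressed name table: each entry is one
 * prefix-length byte followed by the NUL-terminated remainder of the name.
 * Decoded, the entries are sys_open, sys_read, sys_readv and vmalloc.
 */
static const char demo_names[] =
	"\x00" "sys_open\0"
	"\x04" "read\0"
	"\x08" "v\0"
	"\x00" "vmalloc";

/* Made-up addresses, one per entry in demo_names. */
static const unsigned long demo_addresses[DEMO_NUM_SYMS] = {
	0x1000, 0x2000, 0x3000, 0x4000
};

/* Look up the address for a name. Returns 0 if not found. */
unsigned long demo_lookup_name(const char *name)
{
	char namebuf[128] = "";
	const char *p = demo_names;
	size_t i;

	for (i = 0; i < DEMO_NUM_SYMS; i++) {
		size_t prefix = (unsigned char)*p++;
		size_t len = strlen(p);

		if (prefix + len >= sizeof(namebuf))
			return 0;	/* malformed table entry */

		/* Keep the first 'prefix' chars of the previous name. */
		memcpy(namebuf + prefix, p, len + 1);
		if (strcmp(namebuf, name) == 0)
			return demo_addresses[i];

		p += len + 1;
	}
	return 0;
}
```

With this table, demo_lookup_name("sys_readv") rebuilds the name on top of
the "sys_read" left in the buffer and returns 0x3000. An unknown name
returns 0, the same convention the patch documents.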

--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Feb 27 02:58:42 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 27 Feb 2004 02:58:42 +1100
Subject: PCI Interrupts with CONFIG_PPC_ISERIES in linux 2.6.3
In-Reply-To: <20040225231055.GN5801@krispykreme>
References: <OF3C8BB998.5AE1BF27-ON86256E45.007DE107-86256E45.007EE71F@us.ibm.com> <20040225231055.GN5801@krispykreme>
Message-ID: <20040226155842.GR5801@krispykreme>


> Linus' tree (Paul's large IRQ patch got merged yesterday) should work. We'll
> be looking into the ameslab issue today.

Paul has merged this into ameslab, so iSeries IRQs should work again. I
tested it with a pcnet32 card in our iSeries box and it worked fine.

Anton


From akpm at osdl.org  Fri Feb 27 10:06:51 2004
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 26 Feb 2004 15:06:51 -0800
Subject: Adding kallsyms_lookupname()
In-Reply-To: <20040226225409.91D7C17DD8@ozlabs.au.ibm.com>
References: <Pine.A41.4.44.0402251618360.28842-200000@forte.austin.ibm.com>
	<20040226225409.91D7C17DD8@ozlabs.au.ibm.com>
Message-ID: <20040226150651.66d45f91.akpm@osdl.org>


Rusty Russell <rusty at au1.ibm.com> wrote:
>
> Name: kallsyms_lookupname() Function For Debuggers

What uses this?

Whatever it is, I'd prefer to merge both caller and callee please.


From olof at austin.ibm.com  Fri Feb 27 10:23:42 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Thu, 26 Feb 2004 17:23:42 -0600 (CST)
Subject: Adding kallsyms_lookupname()
In-Reply-To: <20040226150651.66d45f91.akpm@osdl.org>
Message-ID: <Pine.A41.4.44.0402261717120.74742-200000@forte.austin.ibm.com>

On Thu, 26 Feb 2004, Andrew Morton wrote:

> Rusty Russell <rusty at au1.ibm.com> wrote:
> >
> > Name: kallsyms_lookupname() Function For Debuggers
>
> What uses this?
>
> Whatever it is, I'd prefer to merge both caller and callee please.

Attached is the corresponding patch to xmon on ppc64. Ben said he might
backport these changes to ppc32 as well.


-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

-------------- next part --------------
===== arch/ppc64/xmon/xmon.c 1.34 vs edited =====
--- 1.34/arch/ppc64/xmon/xmon.c	Sat Feb 14 06:40:59 2004
+++ edited/arch/ppc64/xmon/xmon.c	Wed Feb 25 19:53:15 2004
@@ -50,6 +50,7 @@
 static unsigned long nidump = 16;
 static unsigned long ncsum = 4096;
 static int termch;
+static char tmpstr[128];
 
 static u_int bus_error_jmp[100];
 #define setjmp xmon_setjmp
@@ -115,6 +116,7 @@
 static void csum(void);
 static void bootcmds(void);
 void dump_segments(void);
+void symbol_lookup(void);
 
 static void debug_trace(void);
 
@@ -160,6 +162,8 @@
   dd	dump double values\n\
   e	print exception information\n\
   f	flush cache\n\
+  la	lookup symbol+offset of specified address\n\
+  ls	lookup address of specified symbol\n\
   m	examine/change memory\n\
   mm	move a block of memory\n\
   ms	set a block of memory\n\
@@ -537,6 +541,9 @@
 		case 'd':
 			dump();
 			break;
+		case 'l':
+			symbol_lookup();
+			break;
 		case 'r':
 			if (excp != NULL)
 				prregs(excp);	/* print regs */
@@ -1641,9 +1648,22 @@
 void
 print_address(unsigned long addr)
 {
-	printf("0x%lx", addr);
+	const char *name;
+	char *modname;
+	long size, offset;
+
+	name = kallsyms_lookup(addr, &size, &offset, &modname, tmpstr);
+
+	if (name) {
+		if (modname)
+			printf("0x%lx\t# %s:%s+0x%lx", addr, modname, name, offset);
+		else
+			printf("0x%lx\t# %s+0x%lx", addr, name, offset);
+	} else
+		printf("0x%lx", addr);
 }
 
+
 /*
  * Memory operations - move, set, print differences
  */
@@ -1822,8 +1842,33 @@
 		return 0;
 	}
 
+	/* skip leading "0x" if any */
+
+	if (c == '0') {
+		c = inchar();
+		if (c == 'x')
+			c = inchar();
+	} else if (c == '$') {
+		int i;
+		for (i=0; i<63; i++) {
+			c = inchar();
+			if (isspace(c)) {
+				termch = c;
+				break;
+			}
+			tmpstr[i] = c;
+		}
+		tmpstr[i++] = 0;
+		*vp = kallsyms_lookupname(tmpstr);
+		if (!(*vp)) {
+			printf("unknown symbol '%s'\n", tmpstr);
+			return 0;
+		}
+		return 1;
+	}
+	
 	d = hexdigit(c);
-	if( d == EOF ){
+	if (d == EOF) {
 		termch = c;
 		return 0;
 	}
@@ -1832,7 +1877,7 @@
 		v = (v << 4) + d;
 		c = inchar();
 		d = hexdigit(c);
-	} while( d != EOF );
+	} while (d != EOF);
 	termch = c;
 	*vp = v;
 	return 1;
@@ -1907,19 +1952,53 @@
 	lineptr = str;
 }
 
+
+void
+symbol_lookup(void)
+{
+	int type = inchar();
+	unsigned long addr;
+	static char tmp[64];
+
+	switch (type) {
+	case 'a':
+		if (scanhex(&addr)) {
+			printf("%lx: ", addr);
+			xmon_print_symbol("%s\n", addr);
+		}
+		termch = 0;
+		break;
+	case 's':
+		getstring(tmp, 64);
+		if (setjmp(bus_error_jmp) == 0) {
+			__debugger_fault_handler = handle_fault;
+			sync();
+			addr = kallsyms_lookupname(tmp);
+			if (addr) 
+				printf("%s: %lx\n", tmp, addr);
+			else
+				printf("Symbol '%s' not found.\n", tmp);
+			sync();
+		}
+		__debugger_fault_handler = 0;
+		termch = 0;
+		break;
+	}
+}
+
+
 /* xmon version of __print_symbol */
 void __xmon_print_symbol(const char *fmt, unsigned long address)
 {
 	char *modname;
 	const char *name;
 	unsigned long offset, size;
-	char namebuf[128];
 
 	if (setjmp(bus_error_jmp) == 0) {
 		__debugger_fault_handler = handle_fault;
 		sync();
 		name = kallsyms_lookup(address, &size, &offset, &modname,
-				       namebuf);
+				       tmpstr);
 		sync();
 		/* wait a little while to see if we get a machine check */
 		__delay(200);

From akpm at osdl.org  Fri Feb 27 10:37:36 2004
From: akpm at osdl.org (Andrew Morton)
Date: Thu, 26 Feb 2004 15:37:36 -0800
Subject: Adding kallsyms_lookupname()
In-Reply-To: <Pine.A41.4.44.0402261717120.74742-200000@forte.austin.ibm.com>
References: <20040226150651.66d45f91.akpm@osdl.org>
	<Pine.A41.4.44.0402261717120.74742-200000@forte.austin.ibm.com>
Message-ID: <20040226153736.74fecb3b.akpm@osdl.org>


olof at austin.ibm.com wrote:
>
> On Thu, 26 Feb 2004, Andrew Morton wrote:
>
> > Rusty Russell <rusty at au1.ibm.com> wrote:
> > >
> > > Name: kallsyms_lookupname() Function For Debuggers
> >
> > What uses this?
> >
> > Whatever it is, I'd prefer to merge both caller and callee please.
>
> Attached is the corresponding patch to xmon on ppc64.

OK, thanks.   Is this ready to be merged?

> +void symbol_lookup(void);

Should I make this static?



From strosake at austin.ibm.com  Fri Feb 27 11:02:55 2004
From: strosake at austin.ibm.com (Mike Strosaker)
Date: Thu, 26 Feb 2004 18:02:55 -0600
Subject: [PATCH] (2.6) os-term call upon kernel panic
Message-ID: <403E892F.3080001@austin.ibm.com>


This patch will cause the os-term RTAS call to be invoked after a
kernel panic.  The call notifies the platform that the OS is terminating
normal operation, which causes the service processor (on systems so
equipped) to perform some pre-defined actions (like a call home).  The
os-term routine is given the lowest priority on panic_notifier_list so
that it will be the last routine invoked, since it may not return.
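
To illustrate the ordering argument, here is a minimal sketch, assuming
the chain is kept sorted from highest to lowest priority so that an
INT_MIN entry always ends up at the tail. All demo_* names are
hypothetical and are not the kernel's notifier API.

```c
#include <limits.h>
#include <stddef.h>

struct demo_notifier {
	int (*call)(struct demo_notifier *self, unsigned long event,
		    void *data);
	int priority;
	struct demo_notifier *next;
};

/* Records the order in which callbacks ran. */
struct demo_log {
	int seen[8];
	int n;
};

/* Insert nb so the list stays sorted from highest to lowest priority. */
void demo_chain_register(struct demo_notifier **head,
			 struct demo_notifier *nb)
{
	while (*head != NULL && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

/* Invoke every callback in list order; returns how many were called. */
int demo_chain_call(struct demo_notifier *head, unsigned long event,
		    void *data)
{
	int n = 0;

	for (; head != NULL; head = head->next) {
		head->call(head, event, data);
		n++;
	}
	return n;
}

/* Example callback: append this notifier's priority to the log. */
int demo_log_priority(struct demo_notifier *self, unsigned long event,
		      void *data)
{
	struct demo_log *rec = data;

	(void)event;
	if (rec->n < 8)
		rec->seen[rec->n++] = self->priority;
	return 0;
}
```

Whatever order the entries are registered in, the INT_MIN one is called
last, which is what a handler that may never return needs.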

Comments welcome.

Thanks,
Mike

diff -Nru a/arch/ppc64/kernel/chrp_setup.c b/arch/ppc64/kernel/chrp_setup.c
--- a/arch/ppc64/kernel/chrp_setup.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/chrp_setup.c	Thu Feb 26 16:54:18 2004
@@ -268,6 +268,7 @@
  	ppc_md.restart        = rtas_restart;
  	ppc_md.power_off      = rtas_power_off;
  	ppc_md.halt           = rtas_halt;
+	ppc_md.panic          = rtas_os_term;

  	ppc_md.get_boot_time  = pSeries_get_boot_time;
  	ppc_md.get_rtc_time   = pSeries_get_rtc_time;
diff -Nru a/arch/ppc64/kernel/iSeries_setup.c b/arch/ppc64/kernel/iSeries_setup.c
--- a/arch/ppc64/kernel/iSeries_setup.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/iSeries_setup.c	Thu Feb 26 16:54:18 2004
@@ -323,6 +323,7 @@
  	ppc_md.restart = iSeries_restart;
  	ppc_md.power_off = iSeries_power_off;
  	ppc_md.halt = iSeries_halt;
+	ppc_md.panic = iSeries_panic;

  	ppc_md.get_boot_time = iSeries_get_boot_time;
  	ppc_md.set_rtc_time = iSeries_set_rtc_time;
@@ -790,6 +791,14 @@
  void iSeries_halt(void)
  {
  	mf_powerOff();
+}
+
+/*
+ * Document me.
+ */
+void iSeries_panic(void)
+{
+	mf_reboot();
  }

  /* JDH Hack */
diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
--- a/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:54:18 2004
@@ -420,6 +420,17 @@
          rtas_power_off();
  }

+void
+rtas_os_term(void)
+{
+	long status;
+	char *str = "OS panic";
+
+	status = rtas_call(rtas_token("ibm,os-term"), 1, 1, NULL, __pa(str));
+	if (status != 0)
+		printk(KERN_EMERG "ibm,os-term call failed %ld\n", status);
+}
+
  unsigned long rtas_rmo_buf = 0;

  asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
diff -Nru a/arch/ppc64/kernel/setup.c b/arch/ppc64/kernel/setup.c
--- a/arch/ppc64/kernel/setup.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/setup.c	Thu Feb 26 16:54:18 2004
@@ -25,6 +25,7 @@
  #include <linux/version.h>
  #include <linux/tty.h>
  #include <linux/root_dev.h>
+#include <linux/notifier.h>
  #include <asm/io.h>
  #include <asm/prom.h>
  #include <asm/processor.h>
@@ -93,6 +94,12 @@

  struct machdep_calls ppc_md;

+static int ppc64_panic_event(struct notifier_block *, unsigned long, void *);
+static struct notifier_block ppc64_panic_block = {
+	notifier_call: ppc64_panic_event,
+	priority: INT_MIN /* may not return; must be done last */
+};
+
  /*
   * Perhaps we can put the pmac screen_info[] here
   * on pmac as well so we don't need the ifdef's.
@@ -316,6 +323,13 @@

  EXPORT_SYMBOL(machine_halt);

+static int ppc64_panic_event(struct notifier_block *this,
+			     unsigned long event, void *ptr)
+{
+	ppc_md.panic();		/* May not return */
+	return NOTIFY_DONE;
+}
+
  unsigned long ppc_proc_freq;
  unsigned long ppc_tb_freq;

@@ -610,8 +624,9 @@
  	dcache_bsize = systemcfg->dCacheL1LineSize;
  	icache_bsize = systemcfg->iCacheL1LineSize;

-	/* reboot on panic */
-	panic_timeout = 180;
+	/* do not reboot on panic */
+	panic_timeout = 0;
+	notifier_chain_register(&panic_notifier_list, &ppc64_panic_block);

  	init_mm.start_code = PAGE_OFFSET;
  	init_mm.end_code = (unsigned long) _etext;
diff -Nru a/include/asm-ppc64/machdep.h b/include/asm-ppc64/machdep.h
--- a/include/asm-ppc64/machdep.h	Thu Feb 26 16:54:18 2004
+++ b/include/asm-ppc64/machdep.h	Thu Feb 26 16:54:18 2004
@@ -82,6 +82,7 @@
  	void		(*restart)(char *cmd);
  	void		(*power_off)(void);
  	void		(*halt)(void);
+	void		(*panic)(void);

  	int		(*set_rtc_time)(struct rtc_time *);
  	void		(*get_rtc_time)(struct rtc_time *);
diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h
--- a/include/asm-ppc64/rtas.h	Thu Feb 26 16:54:18 2004
+++ b/include/asm-ppc64/rtas.h	Thu Feb 26 16:54:18 2004
@@ -175,6 +175,7 @@
  extern void rtas_restart(char *cmd);
  extern void rtas_power_off(void);
  extern void rtas_halt(void);
+extern void rtas_os_term(void);
  extern int rtas_get_sensor(int sensor, int index, int *state);
  extern int rtas_get_power_level(int powerdomain, int *level);
  extern int rtas_set_indicator(int indicator, int index, int new_value);


From olof at austin.ibm.com  Fri Feb 27 11:15:37 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Thu, 26 Feb 2004 18:15:37 -0600 (CST)
Subject: Adding kallsyms_lookupname()
In-Reply-To: <20040226153736.74fecb3b.akpm@osdl.org>
Message-ID: <Pine.A41.4.44.0402261809590.24730-200000@forte.austin.ibm.com>

On Thu, 26 Feb 2004, Andrew Morton wrote:

> OK, thanks.   Is this ready to be merged?

The attached new patch is good to go -- Rusty renamed the function and I
didn't notice at first. It's been built and tested together now.

> > +void symbol_lookup(void);
>
> Should I make this static?

Yep, done in the attached patch as well.


Thanks,

-Olof

Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM


-------------- next part --------------
===== arch/ppc64/xmon/xmon.c 1.34 vs edited =====
--- 1.34/arch/ppc64/xmon/xmon.c	Sat Feb 14 06:40:59 2004
+++ edited/arch/ppc64/xmon/xmon.c	Wed Feb 25 19:53:15 2004
@@ -50,6 +50,7 @@
 static unsigned long nidump = 16;
 static unsigned long ncsum = 4096;
 static int termch;
+static char tmpstr[128];
 
 static u_int bus_error_jmp[100];
 #define setjmp xmon_setjmp
@@ -115,6 +116,7 @@
 static void csum(void);
 static void bootcmds(void);
 void dump_segments(void);
+static void symbol_lookup(void);
 
 static void debug_trace(void);
 
@@ -160,6 +162,8 @@
   dd	dump double values\n\
   e	print exception information\n\
   f	flush cache\n\
+  la	lookup symbol+offset of specified address\n\
+  ls	lookup address of specified symbol\n\
   m	examine/change memory\n\
   mm	move a block of memory\n\
   ms	set a block of memory\n\
@@ -537,6 +541,9 @@
 		case 'd':
 			dump();
 			break;
+		case 'l':
+			symbol_lookup();
+			break;
 		case 'r':
 			if (excp != NULL)
 				prregs(excp);	/* print regs */
@@ -1641,9 +1648,22 @@
 void
 print_address(unsigned long addr)
 {
-	printf("0x%lx", addr);
+	const char *name;
+	char *modname;
+	long size, offset;
+
+	name = kallsyms_lookup(addr, &size, &offset, &modname, tmpstr);
+
+	if (name) {
+		if (modname)
+			printf("0x%lx\t# %s:%s+0x%lx", addr, modname, name, offset);
+		else
+			printf("0x%lx\t# %s+0x%lx", addr, name, offset);
+	} else
+		printf("0x%lx", addr);
 }
 
+
 /*
  * Memory operations - move, set, print differences
  */
@@ -1822,8 +1842,33 @@
 		return 0;
 	}
 
+	/* skip leading "0x" if any */
+
+	if (c == '0') {
+		c = inchar();
+		if (c == 'x')
+			c = inchar();
+	} else if (c == '$') {
+		int i;
+		for (i=0; i<63; i++) {
+			c = inchar();
+			if (isspace(c)) {
+				termch = c;
+				break;
+			}
+			tmpstr[i] = c;
+		}
+		tmpstr[i++] = 0;
+		*vp = kallsyms_lookup_name(tmpstr);
+		if (!(*vp)) {
+			printf("unknown symbol '%s'\n", tmpstr);
+			return 0;
+		}
+		return 1;
+	}
+	
 	d = hexdigit(c);
-	if( d == EOF ){
+	if (d == EOF) {
 		termch = c;
 		return 0;
 	}
@@ -1832,7 +1877,7 @@
 		v = (v << 4) + d;
 		c = inchar();
 		d = hexdigit(c);
-	} while( d != EOF );
+	} while (d != EOF);
 	termch = c;
 	*vp = v;
 	return 1;
@@ -1907,19 +1952,53 @@
 	lineptr = str;
 }
 
+
+static void
+symbol_lookup(void)
+{
+	int type = inchar();
+	unsigned long addr;
+	static char tmp[64];
+
+	switch (type) {
+	case 'a':
+		if (scanhex(&addr)) {
+			printf("%lx: ", addr);
+			xmon_print_symbol("%s\n", addr);
+		}
+		termch = 0;
+		break;
+	case 's':
+		getstring(tmp, 64);
+		if (setjmp(bus_error_jmp) == 0) {
+			__debugger_fault_handler = handle_fault;
+			sync();
+			addr = kallsyms_lookup_name(tmp);
+			if (addr) 
+				printf("%s: %lx\n", tmp, addr);
+			else
+				printf("Symbol '%s' not found.\n", tmp);
+			sync();
+		}
+		__debugger_fault_handler = 0;
+		termch = 0;
+		break;
+	}
+}
+
+
 /* xmon version of __print_symbol */
 void __xmon_print_symbol(const char *fmt, unsigned long address)
 {
 	char *modname;
 	const char *name;
 	unsigned long offset, size;
-	char namebuf[128];
 
 	if (setjmp(bus_error_jmp) == 0) {
 		__debugger_fault_handler = handle_fault;
 		sync();
 		name = kallsyms_lookup(address, &size, &offset, &modname,
-				       namebuf);
+				       tmpstr);
 		sync();
 		/* wait a little while to see if we get a machine check */
 		__delay(200);

From johnrose at austin.ibm.com  Fri Feb 27 11:22:36 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 26 Feb 2004 18:22:36 -0600
Subject: [PATCH] RTAS syscall NULL ptr deref (2.6)
Message-ID: <1077841356.14211.17.camel@verve.austin.ibm.com>


The patch below fixes a NULL ptr deref in the RTAS syscall on 2.6.  I
pushed it already, but send comments if you want :)

Thanks-
John
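
One reading of the diff below, offered as an assumption rather than a
statement of the author's intent: only the token and the two counts come
from the caller, so the rets pointer inside the local copy has to be
derived from nargs before it is dereferenced. A minimal userspace sketch
of that idea follows. The demo_* names and DEMO_MAX_ARGS are hypothetical
and do not match the real struct rtas_args layout.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ARGS 16

struct demo_rtas_args {
	uint32_t token;
	uint32_t nargs;
	uint32_t nret;
	uint32_t args[DEMO_MAX_ARGS];
	uint32_t *rets;	/* must point at args[nargs]; not set by the caller */
};

/*
 * Validate the counts, then point rets at the first return slot.
 * Returns 0 on success, -1 (with rets left NULL) if the counts do not fit.
 */
int demo_prepare_rets(struct demo_rtas_args *a)
{
	if (a->nargs > DEMO_MAX_ARGS || a->nret > DEMO_MAX_ARGS ||
	    a->nargs + a->nret > DEMO_MAX_ARGS) {
		a->rets = NULL;
		return -1;
	}
	a->rets = &a->args[a->nargs];
	return 0;
}
```

Only after demo_prepare_rets() succeeds is it safe to look at rets[0],
which mirrors the check on args.rets[0] in the patch.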

diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
--- a/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:30:25 2004
+++ b/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:30:25 2004
@@ -426,6 +426,7 @@
 {
 	struct rtas_args args;
 	unsigned long flags;
+	int nargs;

 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -433,14 +434,15 @@
 	if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0)
 		return -EFAULT;

-	if (args.nargs > ARRAY_SIZE(args.args)
+	nargs = args.nargs;
+	if (nargs > ARRAY_SIZE(args.args)
 	    || args.nret > ARRAY_SIZE(args.args)
-	    || args.nargs + args.nret > ARRAY_SIZE(args.args))
+	    || nargs + args.nret > ARRAY_SIZE(args.args))
 		return -EINVAL;

 	/* Copy in args. */
 	if (copy_from_user(args.args, uargs->args,
-			   args.nargs * sizeof(rtas_arg_t)) != 0)
+			   nargs * sizeof(rtas_arg_t)) != 0)
 		return -EFAULT;

 	spin_lock_irqsave(&rtas.lock, flags);
@@ -449,14 +451,15 @@
 	enter_rtas((void *)__pa((unsigned long)&get_paca()->xRtas));
 	args = get_paca()->xRtas;

+	args.rets  = (rtas_arg_t *)&(args.args[nargs]);
 	if (args.rets[0] == -1)
 		log_rtas_error(&args);

 	spin_unlock_irqrestore(&rtas.lock, flags);

 	/* Copy out args. */
-	if (copy_to_user(uargs->args + args.nargs,
-			 args.args + args.nargs,
+	if (copy_to_user(uargs->args + nargs,
+			 args.args + nargs,
 			 args.nret * sizeof(rtas_arg_t)) != 0)
 		return -EFAULT;



From strosake at austin.ibm.com  Fri Feb 27 11:24:52 2004
From: strosake at austin.ibm.com (Mike Strosaker)
Date: Thu, 26 Feb 2004 18:24:52 -0600
Subject: (resend)  [PATCH] (2.6) os-term call upon kernel panic
Message-ID: <403E8E54.2000009@austin.ibm.com>


Resend... tabs got converted to spaces on the last one.

Thanks,
Mike


diff -Nru a/arch/ppc64/kernel/chrp_setup.c b/arch/ppc64/kernel/chrp_setup.c
--- a/arch/ppc64/kernel/chrp_setup.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/chrp_setup.c	Thu Feb 26 16:54:18 2004
@@ -268,6 +268,7 @@
  	ppc_md.restart        = rtas_restart;
  	ppc_md.power_off      = rtas_power_off;
  	ppc_md.halt           = rtas_halt;
+	ppc_md.panic          = rtas_os_term;

  	ppc_md.get_boot_time  = pSeries_get_boot_time;
  	ppc_md.get_rtc_time   = pSeries_get_rtc_time;
diff -Nru a/arch/ppc64/kernel/iSeries_setup.c b/arch/ppc64/kernel/iSeries_setup.c
--- a/arch/ppc64/kernel/iSeries_setup.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/iSeries_setup.c	Thu Feb 26 16:54:18 2004
@@ -323,6 +323,7 @@
  	ppc_md.restart = iSeries_restart;
  	ppc_md.power_off = iSeries_power_off;
  	ppc_md.halt = iSeries_halt;
+	ppc_md.panic = iSeries_panic;

  	ppc_md.get_boot_time = iSeries_get_boot_time;
  	ppc_md.set_rtc_time = iSeries_set_rtc_time;
@@ -790,6 +791,14 @@
  void iSeries_halt(void)
  {
  	mf_powerOff();
+}
+
+/*
+ * Document me.
+ */
+void iSeries_panic(void)
+{
+	mf_reboot();
  }

  /* JDH Hack */
diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
--- a/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:54:18 2004
@@ -420,6 +420,17 @@
          rtas_power_off();
  }

+void
+rtas_os_term(void)
+{
+	long status;
+	char *str = "OS panic";
+
+	status = rtas_call(rtas_token("ibm,os-term"), 1, 1, NULL, __pa(str));
+	if (status != 0)
+		printk(KERN_EMERG "ibm,os-term call failed %ld\n", status);
+}
+
  unsigned long rtas_rmo_buf = 0;

  asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
diff -Nru a/arch/ppc64/kernel/setup.c b/arch/ppc64/kernel/setup.c
--- a/arch/ppc64/kernel/setup.c	Thu Feb 26 16:54:18 2004
+++ b/arch/ppc64/kernel/setup.c	Thu Feb 26 16:54:18 2004
@@ -25,6 +25,7 @@
  #include <linux/version.h>
  #include <linux/tty.h>
  #include <linux/root_dev.h>
+#include <linux/notifier.h>
  #include <asm/io.h>
  #include <asm/prom.h>
  #include <asm/processor.h>
@@ -93,6 +94,12 @@

  struct machdep_calls ppc_md;

+static int ppc64_panic_event(struct notifier_block *, unsigned long, void *);
+static struct notifier_block ppc64_panic_block = {
+	notifier_call: ppc64_panic_event,
+	priority: INT_MIN /* may not return; must be done last */
+};
+
  /*
   * Perhaps we can put the pmac screen_info[] here
   * on pmac as well so we don't need the ifdef's.
@@ -316,6 +323,13 @@

  EXPORT_SYMBOL(machine_halt);

+static int ppc64_panic_event(struct notifier_block *this,
+			     unsigned long event, void *ptr)
+{
+	ppc_md.panic();		/* May not return */
+	return NOTIFY_DONE;
+}
+
  unsigned long ppc_proc_freq;
  unsigned long ppc_tb_freq;

@@ -610,8 +624,9 @@
  	dcache_bsize = systemcfg->dCacheL1LineSize;
  	icache_bsize = systemcfg->iCacheL1LineSize;

-	/* reboot on panic */
-	panic_timeout = 180;
+	/* do not reboot on panic */
+	panic_timeout = 0;
+	notifier_chain_register(&panic_notifier_list, &ppc64_panic_block);

  	init_mm.start_code = PAGE_OFFSET;
  	init_mm.end_code = (unsigned long) _etext;
diff -Nru a/include/asm-ppc64/machdep.h b/include/asm-ppc64/machdep.h
--- a/include/asm-ppc64/machdep.h	Thu Feb 26 16:54:18 2004
+++ b/include/asm-ppc64/machdep.h	Thu Feb 26 16:54:18 2004
@@ -82,6 +82,7 @@
  	void		(*restart)(char *cmd);
  	void		(*power_off)(void);
  	void		(*halt)(void);
+	void		(*panic)(void);

  	int		(*set_rtc_time)(struct rtc_time *);
  	void		(*get_rtc_time)(struct rtc_time *);
diff -Nru a/include/asm-ppc64/rtas.h b/include/asm-ppc64/rtas.h
--- a/include/asm-ppc64/rtas.h	Thu Feb 26 16:54:18 2004
+++ b/include/asm-ppc64/rtas.h	Thu Feb 26 16:54:18 2004
@@ -175,6 +175,7 @@
  extern void rtas_restart(char *cmd);
  extern void rtas_power_off(void);
  extern void rtas_halt(void);
+extern void rtas_os_term(void);
  extern int rtas_get_sensor(int sensor, int index, int *state);
  extern int rtas_get_power_level(int powerdomain, int *level);
  extern int rtas_set_indicator(int indicator, int index, int new_value);



From olof at austin.ibm.com  Fri Feb 27 12:15:23 2004
From: olof at austin.ibm.com (olof at austin.ibm.com)
Date: Thu, 26 Feb 2004 19:15:23 -0600 (CST)
Subject: (resend)  [PATCH] (2.6) os-term call upon kernel panic
In-Reply-To: <403E8E54.2000009@austin.ibm.com>
Message-ID: <Pine.A41.4.44.0402261903520.30274-200000@forte.austin.ibm.com>

A couple of comments:

* on iSeries, it'll result in an instant reboot, which may not be desirable
* there is no support for G5: ppc_md.panic is never set there, so the
  notifier will branch to NULL. Panic will panic. :)
* if the RTAS os-term call fails for some reason (can it?), the system
  won't reboot, since panic_timeout is 0.

The attached patch resolves those issues by leaving ppc_md.panic unset on
pmac and iSeries and registering the notifier only if it is set. The
timeout is kept at 180 as a fallback, even though it won't be used when
the os-term call succeeds on pSeries.
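
The guard can be sketched in isolation as follows. This is a userspace
illustration under assumptions, not the kernel code: demo_machdep,
demo_setup, demo_panic and demo_os_term are hypothetical names, and a
plain flag stands in for registering on the real notifier chain.

```c
#include <stddef.h>

struct demo_machdep {
	void (*panic)(void);	/* optional: NULL where no hook exists */
};

struct demo_machdep demo_md;
int demo_notifier_registered;
int demo_hook_calls;

/* Stand-in for a platform hook such as an os-term call. */
void demo_os_term(void)
{
	demo_hook_calls++;
}

/* Platform setup: record the hook, register only if one was supplied. */
void demo_setup(void (*hook)(void))
{
	demo_md.panic = hook;
	demo_notifier_registered = (demo_md.panic != NULL);
}

/*
 * Panic path: returns 1 if the platform hook ran, 0 if there was none
 * and the caller should fall back to the timeout-based reboot.
 */
int demo_panic(void)
{
	if (!demo_notifier_registered)
		return 0;
	demo_md.panic();
	return 1;
}
```

A platform that never sets the hook then takes the fallback path instead
of branching through a NULL pointer.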

I've tried building it, but I didn't test it much.


-Olof


Olof Johansson                                        Office: 4E002/905
Linux on Power Development                            IBM Systems Group
Email: olof at austin.ibm.com                          Phone: 512-838-9858
All opinions are my own and not those of IBM

-------------- next part --------------
===== arch/ppc64/kernel/chrp_setup.c 1.56 vs edited =====
--- 1.56/arch/ppc64/kernel/chrp_setup.c	Thu Feb 26 04:56:03 2004
+++ edited/arch/ppc64/kernel/chrp_setup.c	Thu Feb 26 18:47:17 2004
@@ -266,6 +266,7 @@
 	ppc_md.restart        = rtas_restart;
 	ppc_md.power_off      = rtas_power_off;
 	ppc_md.halt           = rtas_halt;
+	ppc_md.panic          = rtas_os_term;
 
 	ppc_md.get_boot_time  = pSeries_get_boot_time;
 	ppc_md.get_rtc_time   = pSeries_get_rtc_time;
===== arch/ppc64/kernel/rtas.c 1.24 vs edited =====
--- 1.24/arch/ppc64/kernel/rtas.c	Tue Feb 24 22:28:40 2004
+++ edited/arch/ppc64/kernel/rtas.c	Thu Feb 26 18:48:28 2004
@@ -420,6 +420,18 @@
         rtas_power_off();
 }
 
+void
+rtas_os_term(void)
+{
+	long status;
+	char *str = "OS panic";
+
+	status = rtas_call(rtas_token("ibm,os-term"), 1, 1, NULL, __pa(str));
+	if (status != 0)
+		printk(KERN_EMERG "ibm,os-term call failed %ld\n", status);
+}
+
+
 unsigned long rtas_rmo_buf = 0;
 
 asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
===== arch/ppc64/kernel/setup.c 1.61 vs edited =====
--- 1.61/arch/ppc64/kernel/setup.c	Tue Feb 24 21:57:17 2004
+++ edited/arch/ppc64/kernel/setup.c	Thu Feb 26 19:02:59 2004
@@ -25,6 +25,7 @@
 #include <linux/version.h>
 #include <linux/tty.h>
 #include <linux/root_dev.h>
+#include <linux/notifier.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/processor.h>
@@ -93,6 +94,13 @@
 
 struct machdep_calls ppc_md;
 
+static int ppc64_panic_event(struct notifier_block *, unsigned long, void *);
+
+static struct notifier_block ppc64_panic_block = {
+	notifier_call: ppc64_panic_event,
+	priority: INT_MIN /* may not return; must be done last */
+};
+
 /*
  * Perhaps we can put the pmac screen_info[] here
  * on pmac as well so we don't need the ifdef's.
@@ -319,6 +327,14 @@
 unsigned long ppc_proc_freq;
 unsigned long ppc_tb_freq;
 
+static int ppc64_panic_event(struct notifier_block *this,
+                             unsigned long event, void *ptr)
+{
+	ppc_md.panic();         /* May not return */
+	return NOTIFY_DONE;
+}
+
+
 #ifdef CONFIG_SMP
 DEFINE_PER_CPU(unsigned int, pvr);
 #endif
@@ -610,8 +626,11 @@
 	dcache_bsize = systemcfg->dCacheL1LineSize; 
 	icache_bsize = systemcfg->iCacheL1LineSize; 
 
 	/* reboot on panic */
 	panic_timeout = 180;
+
+	if (ppc_md.panic)
+		notifier_chain_register(&panic_notifier_list, &ppc64_panic_block);
 
 	init_mm.start_code = PAGE_OFFSET;
 	init_mm.end_code = (unsigned long) _etext;
===== include/asm-ppc64/machdep.h 1.34 vs edited =====
--- 1.34/include/asm-ppc64/machdep.h	Thu Feb 26 04:56:03 2004
+++ edited/include/asm-ppc64/machdep.h	Thu Feb 26 18:47:18 2004
@@ -81,6 +81,7 @@
 	void		(*restart)(char *cmd);
 	void		(*power_off)(void);
 	void		(*halt)(void);
+	void		(*panic)(void);
 
 	int		(*set_rtc_time)(struct rtc_time *);
 	void		(*get_rtc_time)(struct rtc_time *);
===== include/asm-ppc64/rtas.h 1.17 vs edited =====
--- 1.17/include/asm-ppc64/rtas.h	Mon Jan 19 20:08:24 2004
+++ edited/include/asm-ppc64/rtas.h	Thu Feb 26 18:56:39 2004
@@ -175,6 +175,7 @@
 extern void rtas_restart(char *cmd);
 extern void rtas_power_off(void);
 extern void rtas_halt(void);
+extern void rtas_os_term(void);
 extern int rtas_get_sensor(int sensor, int index, int *state);
 extern int rtas_get_power_level(int powerdomain, int *level);
 extern int rtas_set_indicator(int indicator, int index, int new_value);

From benh at kernel.crashing.org  Fri Feb 27 13:35:54 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 27 Feb 2004 13:35:54 +1100
Subject: [PATCH] RTAS syscall NULL ptr deref (2.6)
In-Reply-To: <1077841356.14211.17.camel@verve.austin.ibm.com>
References: <1077841356.14211.17.camel@verve.austin.ibm.com>
Message-ID: <1077849354.22397.161.camel@gaston>


On Fri, 2004-02-27 at 11:22, John Rose wrote:
> The patch below fixes a NULL ptr deref in the RTAS syscall on 2.6.  I
> pushed it already, but send comments if you want :)

Can you quickly explain how the code could do a NULL ptr deref in
the first place? (and how that's fixed). The patch looks fine
but I don't see how it fixes a NULL ptr :) And Linus is rather
picky about patch descriptions not matching actual content...

Ben.

> Thanks-
> John
>
> diff -Nru a/arch/ppc64/kernel/rtas.c b/arch/ppc64/kernel/rtas.c
> --- a/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:30:25 2004
> +++ b/arch/ppc64/kernel/rtas.c	Thu Feb 26 16:30:25 2004
> @@ -426,6 +426,7 @@
>  {
>  	struct rtas_args args;
>  	unsigned long flags;
> +	int nargs;
>
>  	if (!capable(CAP_SYS_ADMIN))
>  		return -EPERM;
> @@ -433,14 +434,15 @@
>  	if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0)
>  		return -EFAULT;
>
> -	if (args.nargs > ARRAY_SIZE(args.args)
> +	nargs = args.nargs;
> +	if (nargs > ARRAY_SIZE(args.args)
>  	    || args.nret > ARRAY_SIZE(args.args)
> -	    || args.nargs + args.nret > ARRAY_SIZE(args.args))
> +	    || nargs + args.nret > ARRAY_SIZE(args.args))
>  		return -EINVAL;
>
>  	/* Copy in args. */
>  	if (copy_from_user(args.args, uargs->args,
> -			   args.nargs * sizeof(rtas_arg_t)) != 0)
> +			   nargs * sizeof(rtas_arg_t)) != 0)
>  		return -EFAULT;
>
>  	spin_lock_irqsave(&rtas.lock, flags);
> @@ -449,14 +451,15 @@
>  	enter_rtas((void *)__pa((unsigned long)&get_paca()->xRtas));
>  	args = get_paca()->xRtas;
>
> +	args.rets  = (rtas_arg_t *)&(args.args[nargs]);
>  	if (args.rets[0] == -1)
>  		log_rtas_error(&args);
>
>  	spin_unlock_irqrestore(&rtas.lock, flags);
>
>  	/* Copy out args. */
> -	if (copy_to_user(uargs->args + args.nargs,
> -			 args.args + args.nargs,
> +	if (copy_to_user(uargs->args + nargs,
> +			 args.args + nargs,
>  			 args.nret * sizeof(rtas_arg_t)) != 0)
>  		return -EFAULT;
>
>
--
Benjamin Herrenschmidt <benh at kernel.crashing.org>



From strosake at austin.ibm.com  Fri Feb 27 16:42:08 2004
From: strosake at austin.ibm.com (Mike Strosaker)
Date: Thu, 26 Feb 2004 23:42:08 -0600
Subject: (resend)  [PATCH] (2.6) os-term call upon kernel panic
In-Reply-To: <Pine.A41.4.44.0402261903520.30274-200000@forte.austin.ibm.com>
References: <Pine.A41.4.44.0402261903520.30274-200000@forte.austin.ibm.com>
Message-ID: <403ED8B0.4080003@austin.ibm.com>


olof at austin.ibm.com wrote:
> A couple of comments:
>
> * on iSeries, it'll result in an instant reboot, maybe not desirable
> * no support for G5, will result in NULL branch. Panic will panic. :)
> * if the RTAS os-term call for some reason fails (can it?), the system
>   won't reboot since panic_timeout is 0.
>
> Attached patch resolves those issues by not setting ppc_md.panic on pmac
> and iSeries, and only registers the notifier in case it's set. The timeout
> is kept at 180 even though it won't be used on a successful panic on
> pSeries.
>
> I've tried building it, but I didn't test it much.

Thanks for the updates, Olof.  I tested your patch on a p650 LPAR, and
it works well.  It's probably better to keep panic_timeout at 180, as
your patch does, because it's definitely possible for os-term to return.
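
A simplified sketch of why the timeout still matters, assuming a hook
that may hand control back: if the hook returns, the panic path falls
through to the timed reboot, and with a timeout of 0 it would simply sit
there. The enum, panic_flow and the two fake hooks are hypothetical; a
hook that "does not return" is modelled here as one returning nonzero.

```c
#include <assert.h>
#include <stddef.h>

enum panic_outcome {
	PANIC_HANDLED_BY_FIRMWARE,	/* hook took over, never came back */
	PANIC_REBOOT_AFTER_TIMEOUT,	/* hook returned, timed reboot fires */
	PANIC_HANG			/* hook returned and timeout is 0 */
};

/* Hypothetical hook type: nonzero models "did not return to the caller". */
typedef int (*panic_hook_t)(void);

/* Simplified model of the panic path with an optional platform hook. */
enum panic_outcome panic_flow(panic_hook_t hook, int panic_timeout)
{
	assert(panic_timeout >= 0);	/* negative values are not modelled */

	if (hook && hook())
		return PANIC_HANDLED_BY_FIRMWARE;

	/* The hook was absent or returned: only the timeout can reboot us. */
	return panic_timeout > 0 ? PANIC_REBOOT_AFTER_TIMEOUT : PANIC_HANG;
}

/* Fake hooks for the two cases discussed in this thread. */
int os_term_succeeds(void) { return 1; }
int os_term_returns(void)  { return 0; }
```

Here panic_flow(os_term_returns, 180) ends in a timed reboot, while
panic_flow(os_term_returns, 0) ends in a hang, which is the case the
non-zero timeout guards against.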

Thanks,
Mike


From anton at samba.org  Sat Feb 28 02:23:14 2004
From: anton at samba.org (Anton Blanchard)
Date: Sat, 28 Feb 2004 02:23:14 +1100
Subject: 2.6 viodasd and IDE emulation
Message-ID: <20040227152313.GN5801@krispykreme>


Hi,

IDE emulation on iseries virtual disks was removed in 2.6 recently (it had
no chance of being merged upstream) so if you are having trouble finding
the root filesystem on your iseries that is probably it.

Anton


From johnrose at austin.ibm.com  Sat Feb 28 03:16:09 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Fri, 27 Feb 2004 10:16:09 -0600
Subject: [PATCH] RTAS syscall NULL ptr deref (2.6)
In-Reply-To: <1077849354.22397.161.camel@gaston>
References: <1077841356.14211.17.camel@verve.austin.ibm.com>
	 <1077849354.22397.161.camel@gaston>
Message-ID: <1077898569.17961.11.camel@verve.austin.ibm.com>


Hi Ben-

> Can you quickly explain how the code could do a NULL ptr deref in
> the first place? (and how that's fixed).

Heh sure.  The rets member of the rtas_args structure is an int pointer
into the args member, which is an int array.  Initially, I didn't set
the "rets" ptr in this syscall, because I didn't need it in the
function, and it wouldn't be useful to userspace when copied out.

The following lines were more recently added to log hardware errors:
+	if (args.rets[0] == -1)
+		log_rtas_error(&args);

Since rets was unassigned in this case, we're reading at a bad address.
The following line fixes the problem:
+	args.rets  = (rtas_arg_t *)&(args.args[nargs]);
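
A standalone sketch of the same idea, assuming a simplified layout: rets
has to point at args[nargs] before anyone reads rets[0]. The struct below
is a hypothetical, cut-down version of rtas_args (the real definition
lives in include/asm-ppc64/rtas.h), and the helper names are invented for
illustration. The nret > 0 guard is an addition in this sketch and is not
part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t rtas_arg_t;

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, simplified layout; the array size is an assumption. */
struct rtas_args_sketch {
	uint32_t token;
	uint32_t nargs;
	uint32_t nret;
	rtas_arg_t args[16];	/* inputs first, then return values */
	rtas_arg_t *rets;	/* must be set to &args[nargs] before use */
};

/* Same bounds check as the quoted ppc_rtas() hunk; nonzero means valid. */
int rtas_args_valid(const struct rtas_args_sketch *a)
{
	return a->nargs <= ARRAY_SIZE(a->args)
	    && a->nret <= ARRAY_SIZE(a->args)
	    && a->nargs + a->nret <= ARRAY_SIZE(a->args);
}

/* Point rets at the first return slot, as the one-line fix does. */
void rtas_args_set_rets(struct rtas_args_sketch *a)
{
	assert(rtas_args_valid(a));
	a->rets = &a->args[a->nargs];
}

/* Nonzero if the first return word carries the -1 error status. */
int rtas_call_failed(const struct rtas_args_sketch *a)
{
	/* The nret > 0 guard is an addition in this sketch. */
	return a->nret > 0 && a->rets[0] == (rtas_arg_t)-1;
}
```

Calling rtas_call_failed() before rtas_args_set_rets() would read through
an unset pointer, which is the bug described above.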

Thanks-
John



From hollisb at us.ibm.com  Sat Feb 28 05:12:42 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 27 Feb 2004 12:12:42 -0600
Subject: [ppc64 patch] virtual IO bus updates
Message-ID: <403F889A.9020001@us.ibm.com>

Hi Linus, please apply these two sysfs-related patches.

The first makes GregKH happy by removing the device name from the device.bus_id field (and replacing it with a "name" sysfs attribute).

The second renames the parent device from "vdevice" to "vio", making the /sys/bus and /sys/devices hierarchies consistent.

--
Hollis Blanchard
IBM Linux Technology Center
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: vio-sysfs-parent.diff
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040227/8509194f/attachment.txt 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: vio-name-attr.diff
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040227/8509194f/attachment-0001.txt 

From benh at kernel.crashing.org  Sat Feb 28 09:22:49 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 28 Feb 2004 09:22:49 +1100
Subject: [PATCH] RTAS syscall NULL ptr deref (2.6)
In-Reply-To: <1077898569.17961.11.camel@verve.austin.ibm.com>
References: <1077841356.14211.17.camel@verve.austin.ibm.com>
	 <1077849354.22397.161.camel@gaston>
	 <1077898569.17961.11.camel@verve.austin.ibm.com>
Message-ID: <1077920569.22962.19.camel@gaston>


On Sat, 2004-02-28 at 03:16, John Rose wrote:
> Hi Ben-
>
> > Can you quickly explain how the code could do a NULL ptr deref in
> > the first place? (and how that's fixed).
>
> Heh sure.  The rets member of the rtas_args structure is an int pointer
> into the args member, which is an int array.  Initially, I didn't set
> the "rets" ptr in this syscall, because I didn't need it in the
> function, and it wouldn't be useful to userspace when copied out.

Ok, makes more sense now, thanks.

Ben.

