From johnrose at austin.ibm.com  Fri Oct  1 01:45:15 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 30 Sep 2004 10:45:15 -0500
Subject: Why do we map PCI IO space so late ?
In-Reply-To: <1096532573.32754.13.camel@gaston>
References: <1096532573.32754.13.camel@gaston>
Message-ID: <1096559115.27021.33.camel@sinatra.austin.ibm.com>

Hi Ben-

Good questions :)  First let me clear something up, and forgive me if
I'm telling you stuff you already know.  The ioremap()'s that we do at
boot are _exclusively_ done for PHBs.  This creates mappings that span
the ranges for their children buses.  Why do we do this when drivers can
themselves use ioremap()?  Because some drivers still use inb()/outb(),
etc, without remapping their own space.  

The short answer to your questions is that I/O DLPAR required these PHB
ioremap()'s to be moved to a later chronological point during boot, so
that imalloc records would be kept.

Here's the long answer.  To dynamically remove a bus (EADS or PHB), we
need to iounmap() the range associated with it.  The iounmap() function
is prototyped in generic code to take one argument, the virtual address
in question.  In order to know the size of the region to unmap, we need
to keep some records of what was ioremap()'ed originally.  The imalloc
subsystem exists to keep these records.
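
The bookkeeping problem can be illustrated with a small userspace sketch.
This is not the kernel's imalloc code: region_record, record_mapping() and
forget_mapping() are hypothetical names, and a fixed-size array stands in
for whatever structure the kernel really uses.  It only shows why an unmap
call that receives nothing but an address needs a record made at map time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 16          /* arbitrary limit for the sketch */

/* One record per live mapping: start address and length. */
struct region_record {
    uintptr_t addr;
    size_t size;
    int in_use;
};

static struct region_record regions[MAX_REGIONS];

/* Remember a mapping at map time.  Returns 0 on success, -1 if full. */
int record_mapping(uintptr_t addr, size_t size)
{
    assert(size > 0);
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (!regions[i].in_use) {
            regions[i].addr = addr;
            regions[i].size = size;
            regions[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* At unmap time only the address is known; recover the size from the
 * record and drop the record.  Returns 0 if no record exists, which is
 * the situation for a mapping made before records were being kept. */
size_t forget_mapping(uintptr_t addr)
{
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (regions[i].in_use && regions[i].addr == addr) {
            regions[i].in_use = 0;
            return regions[i].size;
        }
    }
    return 0;
}
```

A mapping that was never recorded makes forget_mapping() return 0, so the
caller has no size to unmap.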

The ppc64 ioremap() implementation has the limitation that if one calls
it before mem_init_done, no imalloc records are left behind.  If we
remap the PHBs early in boot, we have no way to unmap them (or their
children) at DLPAR remove time.  Does this make sense?  

As a side note, we didn't similarly defer the remap for ISA, b/c we
assumed that we'd never want to unmap this range.  I wrote the function
that remaps for ISA, and it's a hack, you're right :)  Suggestions are
welcome.  I would ask why your ISA node doesn't have a ranges property,
b/c I thought it was mandatory from some spec.  

You asked about ioremap_explicit(). This is used in two ways.  First
during boot, to remap the necessary regions for PHBs after
mem_init_done.  We've saved off the "physical" range info from the ofdt
early in boot, and now we explicitly remap starting at virtual addr
PHBS_IO_BASE.  Second, we use it to remap the range of a newly
DLPAR-added bus.  You can imagine that in the case of adding an EADS
slot, we need the mappings to exist at exact virtual addresses relative
to its parent PHB, etc.  Hence the creation of ioremap_explicit().
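
As a rough illustration of "exact virtual addresses relative to its parent
PHB", the arithmetic might look like the sketch below.  struct io_window
and child_virt_base() are hypothetical, and the rule that a child keeps the
same offset from its parent in virtual space as in physical space is an
assumption made for the example, not a description of the actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a bus I/O window. */
struct io_window {
    uint64_t phys_base;  /* start of the window in physical I/O space */
    uint64_t size;       /* length of the window in bytes */
    uint64_t virt_base;  /* virtual address the window is mapped at */
};

/* Virtual address at which a child window would have to be mapped so
 * that it keeps the same offset from its parent in virtual space as it
 * has in physical space.  Returns 0 if the child does not lie entirely
 * inside the parent window. */
uint64_t child_virt_base(const struct io_window *parent,
                         uint64_t child_phys, uint64_t child_size)
{
    uint64_t off;

    assert(parent);
    if (child_phys < parent->phys_base)
        return 0;
    off = child_phys - parent->phys_base;
    if (off > parent->size || child_size > parent->size - off)
        return 0;
    return parent->virt_base + off;
}
```

Because the result is dictated by the parent's mapping, the caller cannot
accept whatever address a general-purpose allocator hands out; it has to
request that specific address.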

Suggestions on improvements are welcome.  Hope this helps, it's before
lunch and I'm being wordy. :)

Thanks-
John

On Thu, 2004-09-30 at 03:22, Benjamin Herrenschmidt wrote:
> Hi John !
> 
> I was going through some of the PCI setup code while working on
> some bringup stuff, and had an issue which was related to the way
> we do the ioremap'ing of the PCI IO space.
> 
> So the current scenario is:
> 
>  - early (setup_arch() time basically), we ioremap_explicit the ISA
> space and that only
> 
>  - later (pcibios_fixup time), we scan all busses and ioremap_explicit
> their various IO spaces.
> 
> I have two problems with that at the moment.
> 
> First is, I'm annoyed that during the actual PCI probing, the IO space
> is not mapped. That means that any quirk that needs IO accesses to the
> device will not work. I wonder also in which conditions we might end up
> instantiating a PCI driver as early as the PCI probing and thus crash.
> Also, this is all after console_initcalls(), so that leaves a gap of
> code that runs with PCI IO space not mapped. So far, it ended up being
> mostly ok because our console uses legacy serial drivers that use the
> ISA space which happens to be mapped early, but that sounds fragile &
> bogus to me. (For the short story, I found that while working on a board
> for which the "isa" node didn't have a "ranges" property, so we failed
> to early map it, thus the serial driver would crash doing IO cycles).
> Why can't we do the ioremap_explicit right after setting up the PHBs ?
> 
> The second thing that annoys me is that it seems we are also doing an
> ioremap_explicit for each p2p bridge IO space, aren't we ? I don't fully
> understand the logic here. Aren't those supposed to be fully enclosed by
> their parent PHB IO space, and thus mapped by those ?
> 
> Thanks for enlightening me,
> Ben.


From segher at kernel.crashing.org  Fri Oct  1 02:38:20 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Thu, 30 Sep 2004 11:38:20 -0500
Subject: reading files in /proc/device-tree
In-Reply-To: <1096546849.3081.2.camel@gaston>
References: <20040929101700.GA2623@in.ibm.com>
	<EB304AB3-122B-11D9-8370-000A95A4DC02@kernel.crashing.org>
	<1096546849.3081.2.camel@gaston>
Message-ID: <23A68A84-12FF-11D9-8370-000A95A4DC02@kernel.crashing.org>

>> Also, the format of the entries is dependent on the
>> #address-cells and #size-cells properties.
>
>  ... of the parent node :) read the OF spec for more details

Of the first (not necessarily immediate) parent that has
those properties, yes.  As memory is a child of the root
node, it will be its direct parent, yes.
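
For anyone decoding such a property by hand, here is a minimal sketch.  It
assumes the property is a flat array of big-endian 32-bit cells laid out as
<address, size> pairs, and that neither count exceeds two cells;
read_cells() and decode_reg() are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine 'cells' big-endian 32-bit cells into one number. */
uint64_t read_cells(const uint8_t *p, unsigned cells)
{
    uint64_t v = 0;

    assert(cells <= 2);         /* the sketch handles at most 64 bits */
    for (unsigned i = 0; i < cells * 4; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decode entry 'idx' of a property made of <address, size> pairs.
 * addr_cells and size_cells must come from the parent node.
 * Returns 0 on success, -1 if the entry is out of range. */
int decode_reg(const uint8_t *prop, size_t len,
               unsigned addr_cells, unsigned size_cells,
               unsigned idx, uint64_t *addr, uint64_t *size)
{
    size_t entry = (size_t)(addr_cells + size_cells) * 4;

    if (entry == 0 || (idx + 1) * entry > len)
        return -1;
    prop += idx * entry;
    *addr = read_cells(prop, addr_cells);
    *size = read_cells(prop + addr_cells * 4, size_cells);
    return 0;
}
```

The same 16 bytes decode as one 64-bit pair with 2/2 cells or as two 32-bit
pairs with 1/1 cells, which is why the parent's values have to be read
first.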


Segher


From david at gibson.dropbear.id.au  Fri Oct  1 14:03:25 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 1 Oct 2004 14:03:25 +1000
Subject: mapping memory in 0xb space
In-Reply-To: <Pine.LNX.4.44.0409282214510.26110-100000@wotan.cs.wisc.edu>
References: <20040929014017.GC5470@zax>
	<Pine.LNX.4.44.0409282214510.26110-100000@wotan.cs.wisc.edu>
Message-ID: <20041001040325.GB12890@zax>

On Wed, Sep 29, 2004 at 12:14:08AM -0500, Igor Grobman wrote:
> On Wed, 29 Sep 2004, David Gibson wrote:
> 
> > On Tue, Sep 28, 2004 at 01:52:16PM -0500, Igor Grobman wrote:
> > > On Tue, 28 Sep 2004, David Gibson wrote:
> > > 
> > > >  Recent kernels don't even
> > > > have VSIDs allocated for the 0xb... region.
> > > 
> > > Looking at both 2.6.8 and 2.4.21, I don't see a difference in
> > > get_kernel_vsid() code.
> > 
> > Ok, *very* recent kernels.  The new VSID algorithm has gone into the
> > BK tree since 2.6.8.
> 
> From the description I read, I might be better off using 0xfff.. addresses
> with that algorithm.  Not a big deal.

Perhaps.  However, there are issues there as well: older kernels have
the same 41-bit address restriction (maybe somewhat extendable) in the
0xf region, just like 0xb.  The new VSID algo gives VSIDs for every
address above 0xc000000000000000 *except* the very last segment,
0xfffffffff0000000-0xffffffffffffffff.
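
The two coverage rules can be written down as predicates.  This is only a
sketch of the rules as stated in this thread, not code from the kernel;
the function names are made up, and 256 MB segments (a shift of 28) are
assumed.

```c
#include <stdint.h>

#define SID_SHIFT 28            /* 256 MB segments */

/* Older scheme as described in the thread: only the first 2^41 bytes
 * of each 0xN000... region have a VSID. */
int old_scheme_has_vsid(uint64_t ea)
{
    uint64_t offset = ea & 0x0fffffffffffffffULL;  /* strip region nibble */

    return offset < (1ULL << 41);
}

/* Newer scheme as described: every kernel address from 0xc000...
 * upward has a VSID, except those in the very last segment. */
int new_scheme_kernel_has_vsid(uint64_t ea)
{
    uint64_t esid = ea >> SID_SHIFT;
    uint64_t last = UINT64_MAX >> SID_SHIFT;

    return ea >= 0xc000000000000000ULL && esid != last;
}
```

Under these rules 0xb00001ffffffffff is covered by the old scheme and
0xb000020000000000 is not, while the new scheme covers
0xffffffffefffffff but not 0xfffffffff0000000.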

> > > This leaves segments.  Both
> > > DataAccess_common and DataAccessSLB_common call
> > > do_stab_bolted/do_slb_bolted when confronted with an address in 0xb
> > > region.
> > 
> > Oh, so it does.  That, I think is a 2.4 thing, long gone in 2.6 (even
> > before the SLB rewrite, I'm pretty sure do_slb_bolted was only called
> > for 0xc addresses).
> 
> In my 2.4.21 source, do_slb_bolted does get called for 0xb addresses.
> And thanks for letting me know about power4 being SLB.  I was clueless on 
> the issue.

> > > Presumably, this will fault in the segments I am interested in.
> > 
> > Yes, actually, it should.  Ok, I guess the problem is deeper than I
> > thought.
> 
> Or is it?
> 
> > > Also, I narrowed it down to
> > > working (or appearing to work) as long as the highest 5 bits of the page
> > > index (those that end up as partial index in the HPTE) are zero.  This may
> > > just be a weird coincidence.
> > 
> > Could be.
> > 
> > > > Why on earth do you want to do this?
> > > 
> > > Good question ;-).  A long long time ago, I posted on this list and
> > > explained.  Since then, I found what appeared to be a solution, except
> > > that it appears power4 breaks it.  I am building a tool that allows
> > > dynamic splicing of code into a running kernel (see
> > > http://www.paradyn.org/html/kerninst.html).  In order for this to work, I
> > > need to be able to overwrite a single instruction with a jump to
> > > spliced-in code.  The target of the jump needs to be within the range (26
> > > bits).  Therefore, I have a choice of 0xbfff.. addresses with backward
> > > jumps from 0xc region, or the 0xff.. addresses for absolute jumps.  I
> > > chose 0xbff.., because I found already-working code, originally written
> > > for the performance counter interface.  Am I making more sense now?
> > 
> > Aha!  But this does actually explain the problem - there are only
> > > VSIDs assigned for the first 2^41 bytes of each region - so although
> > there are vsids for 0xb000000000000000-0xb00001ffffffffff, there
> > aren't any for 0xbff... addresses.  Likewise the Linux pagetables only
> > cover a 41-bit address range, but that won't matter if you're creating
> > HPTEs directly.
> 
> And this is why I avoided explaining fully in my first email :-).  I'd 
> like to solve one problem at a time.  What I said in my initial email
> is accurate.  Even within the valid VSID range, if the highest 5 bits of 
> the page index are not zero, I get a crash on access (e.g.  
> 0xb00001FFFFF00000, but works on 0xb00001FFF0000000).  

Hrm.  Ok.  I'm not sure why that would be.

> As for why I thought 0xbff would work,  I reasoned that
> since the highest bits are masked out in get_kernel_vsid(), and since 
> nobody else is using the 0xb region, it doesn't matter if I get a VSID 
> that is the same as some other VSID in 0xb region.  However, I did not 
> consider the bug in do_slb_bolted that you describe below.

Yes, with that bug the collision can be with a segment anywhere, not
just in the 0xb region.

> > You may have seen the comment in do_slb_bolted which claims to permit
> > a full 32-bits of ESID - it's wrong.  The code doesn't mask the ESID
> > down to 13 bits as get_kernel_vsid() does, but it probably should - an
> > overlarge ESID will cause collisions with VSIDs from entirely
> > different address places, which would be a Bad Thing.
> 
> This must be happening, although I would still like to know why it 
> misbehaves even within the valid VSID range.
> 
> > 
> > Actually, you should be able to allow ESIDs of up to 21 bits there (36
> > bit VSID - 15 bits of "context").  But you will need to make sure
> > get_kernel_vsid(), or whatever you're using to calculate the VAs for
> > the hash HPTEs is updated to match - at the moment I think it will
> > mask down to 13 bits.  I'm not sure if that will get you sufficiently
> > close to 0xc0... for your purposes.
> 
> No, it's not close enough--I really must have that very last segment.   
> It sounds like I was simply getting lucky on the power3 machine.
> Without the mask, I must have been getting random pages, and
> happily overwriting them.  
> 
> Any ideas on how I might  map that very last segment of 0xb, or for
> that matter the very last segment of 0xf ?  It need not be pretty,
> but it cannot involve modifying the kernel source, though it can rely on
> whatever dirty tricks a kernel module might get away with.  I don't
> want to modify the source, because I would like the tool to work on 
> unmodified kernels.

Um... right.  You know, I'm really not sure it's possible without
changing the kernel source, short of binary patching the do_slb_bolted
code from a module.  Sorry.  The segment code's just really not set up
to handle this.

Though, come to that, you do only need one segment, so it might not be
that hard to binary patch in a branch to some code of your own which
provides a VSID for that one segment.

> It's starting to sound like an impossible task (at least on non-recent 
> kernels).  I think I might go with a backup suboptimal solution, which 
> involves extra jumps, but at least it might work.

That may be a better idea.
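
For reference, the reach constraint behind all this can be sketched as
follows.  A PowerPC unconditional branch carries a 24-bit field with two
implied zero bits, i.e. a signed, word-aligned 26-bit value, used either as
a displacement or (for an absolute branch) as the sign-extended target.
The helper names are made up, and the unsigned-to-signed casts rely on the
usual two's-complement behaviour.

```c
#include <stdint.h>

#define BRANCH_SPAN (1LL << 25) /* 26-bit signed field: +/- 32 MB */

/* Can a relative branch located at 'from' reach 'to'? */
int reachable_relative(uint64_t from, uint64_t to)
{
    int64_t disp = (int64_t)(to - from);

    return (to & 3) == 0 && disp >= -BRANCH_SPAN && disp < BRANCH_SPAN;
}

/* Can an absolute branch reach 'to'?  The field is sign-extended, so
 * only the bottom 32 MB and the top 32 MB of the address space qualify. */
int reachable_absolute(uint64_t to)
{
    int64_t t = (int64_t)to;

    return (to & 3) == 0 && t >= -BRANCH_SPAN && t < BRANCH_SPAN;
}
```

This shows why only the last 32 MB below 0xc000000000000000, or the last
32 MB of the address space, are useful targets: a branch at
0xc000000000000000 reaches back to 0xbffffffffe000000 and no further, and
an absolute branch reaches 0xfffffffffe000000 but not 0xfffffffff0000000.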

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From benh at kernel.crashing.org  Fri Oct  1 17:21:04 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 01 Oct 2004 17:21:04 +1000
Subject: Why do we map PCI IO space so late ?
In-Reply-To: <1096559115.27021.33.camel@sinatra.austin.ibm.com>
References: <1096532573.32754.13.camel@gaston>
	<1096559115.27021.33.camel@sinatra.austin.ibm.com>
Message-ID: <1096615264.11463.93.camel@gaston>

On Fri, 2004-10-01 at 01:45, John Rose wrote:
> Hi Ben-
> 
> Good questions :)  First let me clear something up, and forgive me if
> I'm telling you stuff you already know.  The ioremap()'s that we do at
> boot are _exclusively_ done for PHBs.  This creates mappings that span
> the ranges for their children buses.  Why do we do this when drivers can
> themselves use ioremap()?  Because some drivers still use inb()/outb(),
> etc, without remapping their own space.  

Yah, that at least is obvious :)

> The short answer to your questions is that I/O DLPAR required these PHB
> ioremap()'s to be moved to a later chronological point during boot, so
> that imalloc records would be kept.

Okay, that makes more sense to me now.

> Here's the long answer.  To dynamically remove a bus (EADS or PHB), we
> need to iounmap() the range associated with it.  The iounmap() function
> is prototyped in generic code to take one argument, the virtual address
> in question.  In order to know the size of the region to unmap, we need
> to keep some records of what was ioremap()'ed originally.  The imalloc
> subsystem exists to keep these records.

Right.

> The ppc64 ioremap() implementation has the limitation that if one calls
> it before mem_init_done, no imalloc records are left behind.  If we
> remap the PHBs early in boot, we have no way to unmap them (or their
> children) at DLPAR remove time.  Does this make sense?  

Yup.

> As a side note, we didn't similarly defer the remap for ISA, b/c we
> assumed that we'd never want to unmap this range.  I wrote the function
> that remaps for ISA, and it's a hack, you're right :)  Suggestions are
> welcome.  I would ask why your ISA node doesn't have a ranges property,
> b/c I thought it was mandatory from some spec.  

The OF tree of this board is still a work in progress. It has to be
mapped early anyway for other reasons, like the console serial driver
which will be initialized before we do the real mapping.

> You asked about ioremap_explicit(). This is used in two ways.  First
> during boot, to remap the necessary regions for PHBs after
> mem_init_done.  We've saved off the "physical" range info from the ofdt
> early in boot, and now we explicitly remap starting at virtual addr
> PHBS_IO_BASE.  Second, we use it to remap the range of a newly
> DLPAR-added bus.  You can imagine that in the case of adding an EADS
> slot, we need the mappings to exist at exact virtual addresses relative
> to its parent PHB, etc.  Hence the creation of ioremap_explicit().
> 
> Suggestions on improvements are welcome.  Hope this helps, it's before
> lunch and I'm being wordy. :)

Thanks, it's enough for now, I need to think of alternative (read: simpler)
ways to deal with that in the future, but for now, it's fine.

Ben.


From david at gibson.dropbear.id.au  Fri Oct  1 18:45:14 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 1 Oct 2004 18:45:14 +1000
Subject: [PPC64] Change bad choice of VSID_MULTIPLIER
Message-ID: <20041001084514.GB19046@zax>

Andrew/Linus, please apply:

We recently changed the VSID allocation on PPC64 to use a new scheme
based on a multiplicative hash.  It turns out our choice of multiplier
(the largest 28-bit prime) wasn't so great: with large contiguous
mappings, we can get very poor hash scattering.  In particular earlier
machines (without 16M pages) which had a reasonable amount of RAM (>2G
or so) wouldn't boot, because the linear mapping overflowed some hash
buckets.

This patch changes the multiplier to something which seems to work
better (it is, rather arbitrarily, the median of the primes between
2^27 and 2^28).  Some more theory should almost certainly go into the
choice of this constant, to avoid more pathological cases.  But for
now, this choice fixes a serious bug, and seems to do at least as well
at scattering as the old choice on a handful of simple testcases.
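
For readers who want to check the constants in the head.S hunk below, here
is a userspace sketch of the scramble.  It assumes the proto-VSID of a
kernel address is simply the address shifted right by 28 bits; with that
assumption it reproduces both the old and the new values in the patch.
The function takes the multiplier as a parameter only so that the two
choices can be compared.

```c
#include <assert.h>
#include <stdint.h>

#define VSID_BITS     36
#define VSID_MODULUS  ((1ULL << VSID_BITS) - 1)

/* (proto_vsid * multiplier) mod (2^36 - 1), computed without a divide.
 * Since 2^36 is congruent to 1 modulo 2^36 - 1, the high and low 36-bit
 * halves of the product can be added, followed by one final correction
 * when the sum reaches the modulus. */
uint64_t vsid_scramble(uint64_t proto_vsid, uint64_t multiplier)
{
    uint64_t x;

    assert(proto_vsid < VSID_MODULUS);
    assert(multiplier < (1ULL << 28));  /* product then fits in 64 bits */

    x = proto_vsid * multiplier;
    x = (x >> VSID_BITS) + (x & VSID_MODULUS);
    return (x + ((x + 1) >> VSID_BITS)) & VSID_MODULUS;
}
```

With proto-VSID 0xc00000000 (KERNELBASE) this gives 0x40bffffd5 for the
old multiplier and 0x408f92c94 for the new one; 0xd00000000 (VMALLOCBASE)
gives 0xf09b89af5 with the new multiplier.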

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/include/asm-ppc64/mmu_context.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu_context.h	2004-09-20 10:12:50.000000000 +1000
+++ working-2.6/include/asm-ppc64/mmu_context.h	2004-10-01 18:28:01.565963320 +1000
@@ -108,11 +108,10 @@
  *
  * This scramble is only well defined for proto-VSIDs below
  * 0xFFFFFFFFF, so both proto-VSID and actual VSID 0xFFFFFFFFF are
- * reserved.  VSID_MULTIPLIER is prime (the largest 28-bit prime, in
- * fact), so in particular it is co-prime to VSID_MODULUS, making this
- * a 1:1 scrambling function.  Because the modulus is 2^n-1 we can
- * compute it efficiently without a divide or extra multiply (see
- * below).
+ * reserved.  VSID_MULTIPLIER is prime, so in particular it is
+ * co-prime to VSID_MODULUS, making this a 1:1 scrambling function.
+ * Because the modulus is 2^n-1 we can compute it efficiently without
+ * a divide or extra multiply (see below).
  *
  * This scheme has several advantages over older methods:
  *
Index: working-2.6/include/asm-ppc64/mmu.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu.h	2004-09-20 10:12:50.000000000 +1000
+++ working-2.6/include/asm-ppc64/mmu.h	2004-10-01 18:28:01.566963168 +1000
@@ -202,7 +202,7 @@
 #define SLB_VSID_KERNEL		(SLB_VSID_KP|SLB_VSID_C)
 #define SLB_VSID_USER		(SLB_VSID_KP|SLB_VSID_KS)
 
-#define VSID_MULTIPLIER	ASM_CONST(268435399)	/* largest 28-bit prime */
+#define VSID_MULTIPLIER	ASM_CONST(200730139)	/* 28-bit prime */
 #define VSID_BITS	36
 #define VSID_MODULUS	((1UL<<VSID_BITS)-1)
 
Index: working-2.6/arch/ppc64/kernel/head.S
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/head.S	2004-09-24 10:14:09.000000000 +1000
+++ working-2.6/arch/ppc64/kernel/head.S	2004-10-01 18:34:48.870941232 +1000
@@ -551,14 +551,14 @@
 	.llong	0		/* Reserved */
 	.llong	0		/* Reserved */
 	.llong	(KERNELBASE>>SID_SHIFT)
-	.llong	0x40bffffd5	/* KERNELBASE VSID */
+	.llong	0x408f92c94	/* KERNELBASE VSID */
 	/* We have to list the bolted VMALLOC segment here, too, so that it
 	 * will be restored on shared processor switch */
 	.llong	(VMALLOCBASE>>SID_SHIFT)
-	.llong	0xb0cffffd1	/* VMALLOCBASE VSID */
+	.llong	0xf09b89af5	/* VMALLOCBASE VSID */
 	.llong	8192		/* # pages to map (32 MB) */
 	.llong	0		/* Offset from start of loadarea to start of map */
-	.llong	0x40bffffd50000	/* VPN of first page to map */
+	.llong	0x408f92c940000	/* VPN of first page to map */
 
 	. = 0x6100
 

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From grave at ipno.in2p3.fr  Sat Oct  2 02:04:14 2004
From: grave at ipno.in2p3.fr (grave)
Date: Fri, 01 Oct 2004 16:04:14 +0000
Subject: XServe Node running a debian with only one processor
In-Reply-To: <1096548321l.32616l.0l@ipnnarval> (from grave@ipno.in2p3.fr on
	Thu Sep 30 14:45:21 2004)
References: <1096546729l.32147l.0l@ipnnarval> <1096548321l.32616l.0l@ipnnarval>
Message-ID: <1096646654l.2901l.2l@ipnnarval>

Got the xserve booting
(thanks to http://ozlabs.org/ppc64-patches/patch.pl?id=59)

But I can only run a single CPU kernel, does somebody know how to get  
the second CPU on ?

The kernel is a ppc64 one with SMP compiled in, but it only boots with
the nosmp option (kernel from kernel.org + patch to setup.c and
pmac_features.c).

Thanks in advance for any hint...

xavier


From igor at cs.wisc.edu  Sat Oct  2 04:05:12 2004
From: igor at cs.wisc.edu (Igor Grobman)
Date: Fri, 1 Oct 2004 13:05:12 -0500 (CDT)
Subject: mapping memory in 0xb space
In-Reply-To: <20041001040325.GB12890@zax>
References: <20040929014017.GC5470@zax>
	<Pine.LNX.4.44.0409282214510.26110-100000@wotan.cs.wisc.edu>
	<20041001040325.GB12890@zax>
Message-ID: <Pine.LNX.4.58.0410011113530.5565@wotan.cs.wisc.edu>

A question for the rest of you, who haven't been following this thread.
Is there publicly available documentation on the power4 extensions,
specifically the large page support, how it affects the HPT hashing, and
the SLB, including the new instructions for maintaining it in software?
I haven't been able to find anything yet.


On Fri, 1 Oct 2004, David Gibson wrote:

> On Wed, Sep 29, 2004 at 12:14:08AM -0500, Igor Grobman wrote:
> > On Wed, 29 Sep 2004, David Gibson wrote:
> >
> > > On Tue, Sep 28, 2004 at 01:52:16PM -0500, Igor Grobman wrote:
> > > > On Tue, 28 Sep 2004, David Gibson wrote:
> > > >
> > > > >  Recent kernels don't even
> > > > > have VSIDs allocated for the 0xb... region.
> > > >
> > > > Looking at both 2.6.8 and 2.4.21, I don't see a difference in
> > > > get_kernel_vsid() code.
> > >
> > > Ok, *very* recent kernels.  The new VSID algorithm has gone into the
> > > BK tree since 2.6.8.
> >
> > From the description I read, I might be better off using 0xfff.. addresses
> > with that algorithm.  Not a big deal.
>
> Perhaps.  However, there are issues there as well: older kernels have
> the same 41-bit address restriction (maybe somewhat extendable) in the
> 0xf region, just like 0xb.  The new VSID algo gives VSIDs for every
> address above 0xc000000000000000 *except* the very last segment,
> 0xfffffffff0000000-0xffffffffffffffff.

Lucky me!  I'll take a look at what the VSID for the last segment
conflicts with, maybe it will be something unused.  Or I'll have to think
of something else clever.  Right now, I still want my 2.4.21
implementation to work.

> > > > Also, I narrowed it down to
> > > > working (or appearing to work) as long as the highest 5 bits of the page
> > > > index (those that end up as partial index in the HPTE) are zero.  This may
> > > > just be a weird coincidence.
> > >
> > > Could be.
> > >
> > > > > Why on earth do you want to do this?
> > > >
> > > > Good question ;-).  A long long time ago, I posted on this list and
> > > > explained.  Since then, I found what appeared to be a solution, except
> > > > that it appears power4 breaks it.  I am building a tool that allows
> > > > dynamic splicing of code into a running kernel (see
> > > > http://www.paradyn.org/html/kerninst.html).  In order for this to work, I
> > > > need to be able to overwrite a single instruction with a jump to
> > > > spliced-in code.  The target of the jump needs to be within the range (26
> > > > bits).  Therefore, I have a choice of 0xbfff.. addresses with backward
> > > > jumps from 0xc region, or the 0xff.. addresses for absolute jumps.  I
> > > > chose 0xbff.., because I found already-working code, originally written
> > > > for the performance counter interface.  Am I making more sense now?
> > >
> > > Aha!  But this does actually explain the problem - there are only
> > > > VSIDs assigned for the first 2^41 bytes of each region - so although
> > > there are vsids for 0xb000000000000000-0xb00001ffffffffff, there
> > > aren't any for 0xbff... addresses.  Likewise the Linux pagetables only
> > > cover a 41-bit address range, but that won't matter if you're creating
> > > HPTEs directly.
> >
> > And this is why I avoided explaining fully in my first email :-).  I'd
> > like to solve one problem at a time.  What I said in my initial email
> > is accurate.  Even within the valid VSID range, if the highest 5 bits of
> > the page index are not zero, I get a crash on access (e.g.
> > 0xb00001FFFFF00000, but works on 0xb00001FFF0000000).
>
> Hrm.  Ok.  I'm not sure why that would be.

Here is some more background.  Maybe it will help you think of what's
going wrong here.  I noticed that if I write to the remapped
0xb00001FFF0000000, the changes do not show up at the physical address I
mapped it to.  At this point, I noticed that get_free_page() returns a
4K page frame above 256MB, which means that in reality, it's an
address within a large page.  SLB entry created by do_slb_bolted likewise
has the large page bit set.  I changed my code to create an HPTE mapping
for the large page, and finally I get a sensible result: changes to the
remapped page show up on the physical page.  Note that even though I
create a mapping for the whole large page, I only write to the 4K chunk
that corresponds to the address returned by get_free_page() -- I do not
want to clobber random memory.

In summary, mapping the first large page of the 0xb00001FFF segment works,
but mapping any other within that segment causes a kernel crash.  There
must be something I don't understand about how large pages fit into the
HPT.  Could you point me to documentation on the large page extensions of
power4, and, while we are at it, documentation on the SLB?  So far, I
simply guessed on how it works, based on the code I see in the kernel.

For what it's worth, here is (roughly) the relevant code I am using:

frame = get_free_page(GFP_KERNEL);
pa = (unsigned long)__v2a(frame) & 0xFFFFFFFFFF000000;
//want physical address to point to the corresponding large page.

ea = 0xb00001FFFF000000;
vsid = get_kernel_vsid(ea);
va = ( vsid << 28 ) | ( ea & 0xfffffff );
vpn = va >> PAGE_SHIFT;
rpn = pa >> PAGE_SHIFT;
hpteflags = _PAGE_ACCESSED|_PAGE_COHERENT|PP_RWXX;
slot = ppc_md->hpte_insert(vpn, rpn, hpteflags, 1, 1);

smallpage_offset = ( (unsigned long) __v2a(frame) - pa);
return ea + smallpage_offset;
//only access the relevant 4K chunk within the large page


>
> > As for why I thought 0xbff would work,  I reasoned that
> > since the highest bits are masked out in get_kernel_vsid(), and since
> > nobody else is using the 0xb region, it doesn't matter if I get a VSID
> > that is the same as some other VSID in 0xb region.  However, I did not
> > consider the bug in do_slb_bolted that you describe below.
>
> Yes, with that bug the collision can be with a segment anywhere, not
> just in the 0xb region.

OK, I will deal with this, somehow.  Binary patch idea might just work.

> Though, come to that, you do only need one segment, so it might not be
> that hard to binary patch in a branch to some code of your own which
> provides a VSID for that one segment.
>
> > It's starting to sound like an impossible task (at least on non-recent
> > kernels).  I think I might go with a backup suboptimal solution, which
> > involves extra jumps, but at least it might work.
>
> That may be a better idea.

I'd like to avoid this, but if I only have to incur this for the binary
patch to do_slb_bolted, I might be fine.

Thanks,
Igor


From jschopp at austin.ibm.com  Sat Oct  2 06:41:55 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Fri, 01 Oct 2004 15:41:55 -0500
Subject: [PATCH][0/2] ppc64 pre/post boot memory macros
Message-ID: <415DC113.1080007@austin.ibm.com>

I'm sending two patches for review and passing upstream.  The basic idea 
is that these patches put in place some macros such that memory 
management can be easily split into pre and post boot.  This is based on 
the work of Mike Kravetz and Dave Hansen.  It should be harmless, as the 
new macros are currently defined to the same thing the old macros were. 
  It is also isolated to ppc64 files, so the other arch guys don't need 
to worry.

Ultimately my motivation is to move toward hotplug memory.  Acceptance of 
these patches will allow us to carry smaller patches out of mainline and 
ease our development greatly.

Comments/feedback/flames welcome.  The patches are against 2.6.9-rc3 and
have been boot tested on a Power5 LPAR.


From jschopp at austin.ibm.com  Sat Oct  2 06:43:27 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Fri, 01 Oct 2004 15:43:27 -0500
Subject: [PATCH][1/2] ppc64 pre/post boot memory macros
In-Reply-To: <415DC113.1080007@austin.ibm.com>
References: <415DC113.1080007@austin.ibm.com>
Message-ID: <415DC16F.7030402@austin.ibm.com>


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ppc64-daveh.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041001/14bca17c/attachment.txt 

From jschopp at austin.ibm.com  Sat Oct  2 06:43:56 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Fri, 01 Oct 2004 15:43:56 -0500
Subject: [PATCH][2/2] ppc64 pre/post boot memory macros
In-Reply-To: <415DC113.1080007@austin.ibm.com>
References: <415DC113.1080007@austin.ibm.com>
Message-ID: <415DC18C.6040501@austin.ibm.com>


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ppc64-dave-hmore.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041001/52e74dd2/attachment.txt 

From schwab at suse.de  Sat Oct  2 07:40:04 2004
From: schwab at suse.de (Andreas Schwab)
Date: Fri, 01 Oct 2004 23:40:04 +0200
Subject: Machine check during PCI scan on PMac G5
Message-ID: <jesm8y9ky3.fsf@sykes.suse.de>

Has anyone been able to get 2.6.9-rc3 running on the new PMacs
(PowerMac7,3)?  I'm getting a machine check during PCI scan in
u3_ht_read_config while doing in_8 on 0xe00000008094800e.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From benh at kernel.crashing.org  Sat Oct  2 21:17:01 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 02 Oct 2004 21:17:01 +1000
Subject: Machine check during PCI scan on PMac G5
In-Reply-To: <jesm8y9ky3.fsf@sykes.suse.de>
References: <jesm8y9ky3.fsf@sykes.suse.de>
Message-ID: <1096715821.26913.35.camel@gaston>

On Sat, 2004-10-02 at 07:40, Andreas Schwab wrote:
> Has anyone been able to get 2.6.9-rc3 running on the new PMacs
> (PowerMac7,3)?  I'm getting a machine check during PCI scan in
> u3_ht_read_config while doing in_8 on 0xe00000008094800e.

Argh... again ! Looks like the box doesn't like us to probe the
PCI device that is there. Can you print out the precise devfn
bus number & offset where the machine check happens ?

I wonder if it's something that is turned off by the firmware
like one of the K2 internal USB1 controllers that are unused on
this machine.
K2 is notoriously allergic to us probing things that are turned off.

This patch should help by preventing the config space accesses to
occur on those devices that aren't in the device-tree, I'll push it
to Linus as a temporary fix if you confirm it works.

Ben.

===== arch/ppc64/kernel/pmac_pci.c 1.5 vs edited =====
--- 1.5/arch/ppc64/kernel/pmac_pci.c	2004-07-25 14:51:52 +10:00
+++ edited/arch/ppc64/kernel/pmac_pci.c	2004-08-04 10:26:07 +10:00
@@ -271,7 +271,7 @@
 				    int offset, int len, u32 *val)
 {
 	struct pci_controller *hose;
-	struct device_node *busdn;
+	struct device_node *busdn, *dn;
 	unsigned long addr;
 
 	if (bus->self)
@@ -282,6 +282,16 @@
 		return PCIBIOS_DEVICE_NOT_FOUND;
 	hose = busdn->phb;
 	if (hose == NULL)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	/* We only allow config cycles to devices that are in OF device-tree
+	 * as we are apparently having some weird things going on with some
+	 * revs of K2 on recent G5s
+	 */
+	for (dn = busdn->child; dn; dn = dn->sibling)
+		if (dn->devfn == devfn)
+			break;
+	if (dn == NULL)
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
 	addr = u3_ht_cfg_access(hose, bus->number, devfn, offset);


From benh at kernel.crashing.org  Sat Oct  2 21:21:57 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 02 Oct 2004 21:21:57 +1000
Subject: XServe Node running a debian with only one processor
In-Reply-To: <1096646654l.2901l.2l@ipnnarval>
References: <1096546729l.32147l.0l@ipnnarval>
	<1096548321l.32616l.0l@ipnnarval>  <1096646654l.2901l.2l@ipnnarval>
Message-ID: <1096716117.3634.40.camel@gaston>

On Sat, 2004-10-02 at 02:04, grave wrote:
> Got the xserve booting
> (thanks to http://ozlabs.org/ppc64-patches/patch.pl?id=59)
> 
> But I can only run a single CPU kernel, does somebody know how to get  
> the second CPU on ?
> 
> The kernel is a ppc64 one with smp compiled in but only able to boot  
> with nosmp option
> kernel from kernel.org + patch to setup.c and pmac_features.c)
> 
> Thanks in advance for any hint...

What exact version ? what patches ? What happens (last printed on
serial console) if you try to boot SMP ?

Ben.


From schwab at suse.de  Sun Oct  3 05:50:54 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sat, 02 Oct 2004 21:50:54 +0200
Subject: Machine check during PCI scan on PMac G5
In-Reply-To: <1096715821.26913.35.camel@gaston> (Benjamin Herrenschmidt's
	message of "Sat, 02 Oct 2004 21:17:01 +1000")
References: <jesm8y9ky3.fsf@sykes.suse.de> <1096715821.26913.35.camel@gaston>
Message-ID: <jey8ioj3vl.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> Argh... again ! Looks like the box doesn't like us to probe the
> PCI device that is there. Can you print out the precise devfn
> bus number & offset where the machine check happens ?

The first occurrence is devfn 48, bus number 0, offset 14.

> This patch should help by preventing the config space accesses to
> occur on those devices that aren't in the device-tree, I'll push it
> to Linus as a temporary fix if you confirm it works.

Thanks, I can confirm that it works.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From benh at kernel.crashing.org  Sun Oct  3 10:38:38 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 03 Oct 2004 10:38:38 +1000
Subject: [PATCH] Fix booting on some recent G5s
Message-ID: <1096763918.26914.63.camel@gaston>

Hi !

Some recent G5s have a problem with PCI/HT probing. They crash
(machine check) during the probe of some slot numbers; it seems
to be related to some functions being disabled by the firmware
inside the K2 ASIC.

This patch limits the config space accesses to devices that are
present in the OF device-tree. This fixes the problem and shouldn't
"add" any limitation. If you plug a "random" PCI card with no OF
driver, the firmware will still build a node for it with the
default set of properties created from the config space.
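
For readers without the kernel source at hand, the filter described above
can be sketched in isolation as follows. This is only an illustration:
struct of_node and both function names are hypothetical stand-ins, not the
kernel's struct device_node or any real kernel API.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct device_node. */
struct of_node {
	int devfn;                 /* PCI device/function number of this node */
	struct of_node *child;     /* first child node */
	struct of_node *sibling;   /* next node on the same bus */
};

/* Return the child of busdn whose devfn matches, or NULL if none does. */
static struct of_node *find_child_by_devfn(const struct of_node *busdn,
					   int devfn)
{
	struct of_node *dn;

	for (dn = busdn->child; dn != NULL; dn = dn->sibling)
		if (dn->devfn == devfn)
			return dn;
	return NULL;
}

/* Non-zero only when the device is described under the given bus node. */
static int config_access_allowed(const struct of_node *busdn, int devfn)
{
	return busdn != NULL && find_child_by_devfn(busdn, devfn) != NULL;
}
```

A config accessor built this way would bail out with "device not found"
whenever config_access_allowed() returns 0, before touching the hardware.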

Ben.

Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>


--- 1.5/arch/ppc64/kernel/pmac_pci.c	2004-07-25 14:51:52 +10:00
+++ edited/arch/ppc64/kernel/pmac_pci.c	2004-08-04 10:26:07 +10:00
@@ -271,7 +271,7 @@
 				    int offset, int len, u32 *val)
 {
 	struct pci_controller *hose;
-	struct device_node *busdn;
+	struct device_node *busdn, *dn;
 	unsigned long addr;
 
 	if (bus->self)
@@ -282,6 +282,16 @@
 		return PCIBIOS_DEVICE_NOT_FOUND;
 	hose = busdn->phb;
 	if (hose == NULL)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	/* We only allow config cycles to devices that are in OF device-tree
+	 * as we are apparently having some weird things going on with some
+	 * revs of K2 on recent G5s
+	 */
+	for (dn = busdn->child; dn; dn = dn->sibling)
+		if (dn->devfn == devfn)
+			break;
+	if (dn == NULL)
 		return PCIBIOS_DEVICE_NOT_FOUND;
 
 	addr = u3_ht_cfg_access(hose, bus->number, devfn, offset);
--- 1.21/arch/ppc/platforms/pmac_pci.c	2004-07-29 14:58:35 +10:00
+++ edited/arch/ppc/platforms/pmac_pci.c	2004-08-17 14:18:09 +10:00
@@ -315,6 +315,10 @@
 	unsigned int addr;
 	int i;
 
+	struct device_node *np = pci_busdev_to_OF_node(bus, devfn);
+	if (np == NULL)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
 	/*
 	 * When a device in K2 is powered down, we die on config
 	 * cycle accesses. Fix that here.
@@ -362,6 +366,9 @@
 	unsigned int addr;
 	int i;
 
+	struct device_node *np = pci_busdev_to_OF_node(bus, devfn);
+	if (np == NULL)
+		return PCIBIOS_DEVICE_NOT_FOUND;
 	/*
 	 * When a device in K2 is powered down, we die on config
 	 * cycle accesses. Fix that here.


From benh at kernel.crashing.org  Sun Oct  3 10:51:46 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 03 Oct 2004 10:51:46 +1000
Subject: Machine check during PCI scan on PMac G5
In-Reply-To: <4231083A-14D4-11D9-AE7A-000A95A4DC02@kernel.crashing.org>
References: <jesm8y9ky3.fsf@sykes.suse.de>
	<1096715821.26913.35.camel@gaston> <jey8ioj3vl.fsf@sykes.suse.de>
	<4231083A-14D4-11D9-AE7A-000A95A4DC02@kernel.crashing.org>
Message-ID: <1096764706.11996.77.camel@gaston>

On Sun, 2004-10-03 at 10:36, Segher Boessenkool wrote:
> >> Argh... again ! Looks like the box doesn't like us to probe the
> >> PCI device that is there. Can you print out the precise devfn
> >> bus number & offset where the machine check happens ?
> >
> > The first occurrence is devfn 48, bus number 0, offset 14.
> 
> That's the "header type" field on the GEM shim.
> 
> I'd rather not have this fixed by the device-tree check, for
> various reasons; note that this issue probably is related to the
> "config space not readable while GEM is in sleep mode" problem
> on older Macs.  Is the GEM powered on during boot, on these boxes?

I'm surprised, I'm not sure it's actually GEM (Andreas, is the Sungem
functioning properly on this box after this fix ?). I think the
numbering of the shims can change from firmware to firmware; it's
more probably one of the USBs. There is code in pmac_feature.c to
power up the GEM (but only if it has a device-node).

I think the proper solution is the filter from the device-tree on
Apple G5s, at least for now, though OF itself probably has a property
somewhere that tells it which slots to probe and not to probe, I need
to find it.

Ben.


From segher at kernel.crashing.org  Sun Oct  3 10:36:26 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Sat, 2 Oct 2004 19:36:26 -0500
Subject: Machine check during PCI scan on PMac G5
In-Reply-To: <jey8ioj3vl.fsf@sykes.suse.de>
References: <jesm8y9ky3.fsf@sykes.suse.de> <1096715821.26913.35.camel@gaston>
	<jey8ioj3vl.fsf@sykes.suse.de>
Message-ID: <4231083A-14D4-11D9-AE7A-000A95A4DC02@kernel.crashing.org>

>> Argh... again ! Looks like the box doesn't like us to probe the
>> PCI device that is there. Can you print out the precise devfn
>> bus number & offset where the machine check happens ?
>
> The first occurrence is devfn 48, bus number 0, offset 14.

That's the "header type" field on the GEM shim.

I'd rather not have this fixed by the device-tree check, for
various reasons; note that this issue probably is related to the
"config space not readable while GEM is in sleep mode" problem
on older Macs.  Is the GEM powered on during boot, on these boxes?


Segher


From benh at kernel.crashing.org  Sun Oct  3 13:44:44 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 03 Oct 2004 13:44:44 +1000
Subject: Machine check during PCI scan on PMac G5
In-Reply-To: <1096764706.11996.77.camel@gaston>
References: <jesm8y9ky3.fsf@sykes.suse.de>
	<1096715821.26913.35.camel@gaston> <jey8ioj3vl.fsf@sykes.suse.de>
	<4231083A-14D4-11D9-AE7A-000A95A4DC02@kernel.crashing.org>
	<1096764706.11996.77.camel@gaston>
Message-ID: <1096775084.9539.4.camel@gaston>


> I'm surprised, I'm not sure it's actually GEM (Andreas, is the Sungem
> functioning properly on this box after this fix ?). I think the
> numbering of the Shims can change from firmware to firmware, it's
> more probably one of the USBs. There is code in pmac_feature.c to
> power up the GEM (but only if it has a device-node).

Ok, after digging in the OF code, it seems that on machines without
a PCI-X bridge, shim 6 is just not used and the stuff is really upset
when we probe it. K2 is a weird beast that needs care...
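
For reference, the numbers reported earlier in the thread line up with
this. A devfn packs the slot in its upper five bits and the function in
its lower three, so devfn 48 is slot 6, function 0, and offset 14 (0x0e,
assuming the reported value is decimal) is the header type byte of the
standard config header. A small sketch of the arithmetic; the helper names
are made up here and simply mirror what PCI_SLOT()/PCI_FUNC() do:

```c
/* Slot number: upper five bits of devfn (same arithmetic as PCI_SLOT()). */
static unsigned int slot_of(unsigned int devfn)
{
	return (devfn >> 3) & 0x1f;
}

/* Function number: lower three bits of devfn (as in PCI_FUNC()). */
static unsigned int func_of(unsigned int devfn)
{
	return devfn & 0x07;
}

/* Offset of the header type byte in the standard PCI config header. */
#define CFG_HEADER_TYPE 0x0e
```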

Ben.


From schwab at suse.de  Sun Oct  3 21:52:54 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sun, 03 Oct 2004 13:52:54 +0200
Subject: Machine check during PCI scan on PMac G5
In-Reply-To: <1096764706.11996.77.camel@gaston> (Benjamin Herrenschmidt's
	message of "Sun, 03 Oct 2004 10:51:46 +1000")
References: <jesm8y9ky3.fsf@sykes.suse.de> <1096715821.26913.35.camel@gaston>
	<jey8ioj3vl.fsf@sykes.suse.de>
	<4231083A-14D4-11D9-AE7A-000A95A4DC02@kernel.crashing.org>
	<1096764706.11996.77.camel@gaston>
Message-ID: <je8yaohvc9.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> I'm surprised, I'm not sure it's actually GEM (Andreas, is the Sungem
> functioning properly on this box after this fix ?).

It appears to be.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From schwab at suse.de  Sun Oct  3 21:59:47 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sun, 03 Oct 2004 13:59:47 +0200
Subject: PM72 works also on PowerMac7,3
Message-ID: <je4qlchv0s.fsf@sykes.suse.de>

The therm_pm72 driver appears to work fine on the PowerMac7,3.

Andreas.

Signed-off-by: Andreas Schwab <schwab at suse.de>

--- linux-2.6/drivers/macintosh/therm_pm72.c.~1~	2004-08-19 11:31:30.000000000 +0200
+++ linux-2.6/drivers/macintosh/therm_pm72.c	2004-10-03 13:55:22.361631501 +0200
@@ -1301,7 +1301,8 @@ static int __init therm_pm72_init(void)
 {
 	struct device_node *np;
 
-	if (!machine_is_compatible("PowerMac7,2"))
+	if (!machine_is_compatible("PowerMac7,2") &&
+	    !machine_is_compatible("PowerMac7,3"))
 	    	return -ENODEV;
 
 	printk(KERN_INFO "PowerMac G5 Thermal control driver %s\n", VERSION);

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From benh at kernel.crashing.org  Sun Oct  3 22:06:45 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 03 Oct 2004 22:06:45 +1000
Subject: PM72 works also on PowerMac7,3
In-Reply-To: <je4qlchv0s.fsf@sykes.suse.de>
References: <je4qlchv0s.fsf@sykes.suse.de>
Message-ID: <1096805205.9514.17.camel@gaston>

On Sun, 2004-10-03 at 21:59, Andreas Schwab wrote:
> The therm_pm72 driver appears to work fine on the PowerMac7,3.

Before committing this, I'd rather make sure the code and fan IDs are
actually the same in Darwin. Also, just allowing the 7,3 may
enable the code on the new water-cooled machines. Before doing so,
I'd rather make sure we get that right too.

I'm waiting for one of these to be delivered by Apple, they seem
to take ages, but hopefully, it should be there soon.

Ben.


From schwab at suse.de  Sun Oct  3 22:20:43 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sun, 03 Oct 2004 14:20:43 +0200
Subject: Properly recognize PowerMac7,3
Message-ID: <jezn34gfhg.fsf@sykes.suse.de>

Make the PowerMac7,3 no longer unknown.

Andreas.

Signed-off-by: Andreas Schwab <schwab at suse.de>

--- linux-2.6/arch/ppc64/kernel/pmac_feature.c.~1~	2004-09-28 00:28:34.000000000 +0200
+++ linux-2.6/arch/ppc64/kernel/pmac_feature.c	2004-10-03 14:17:03.458461540 +0200
@@ -343,6 +343,10 @@ static struct pmac_mb_def pmac_mb_defs[]
 		PMAC_TYPE_POWERMAC_G5,		g5_features,
 		0,
 	},
+	{	"PowerMac7,3",			"PowerMac G5",
+		PMAC_TYPE_POWERMAC_G5,		g5_features,
+		0,
+	},
 	{       "RackMac3,1",                   "XServe G5",
 		PMAC_TYPE_POWERMAC_G5,          g5_features,
 		0,

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From grave at ipno.in2p3.fr  Mon Oct  4 17:59:44 2004
From: grave at ipno.in2p3.fr (grave)
Date: Mon, 04 Oct 2004 07:59:44 +0000
Subject: Re : XServe Node running a debian with only
 one processor
In-Reply-To: <415D85D1.4040701@austin.ibm.com> (from olof@austin.ibm.com on
	Fri Oct  1 18:29:05 2004)
References: <1096546729l.32147l.0l@ipnnarval>
	<1096548321l.32616l.0l@ipnnarval> <1096646654l.2901l.2l@ipnnarval>
	<415D85D1.4040701@austin.ibm.com>
Message-ID: <1096876784l.19627l.4l@ipnnarval>

Here is some information:

console output in the attached file
I use a cross compiler ppc32 -> ppc64
gcc-3.4.1
GNU ld version 2.15

kernel from ftp.kernel.org 2.6.6

patch (had to apply it reversed because of the initial diff I think) :

diff -ur linux-2.6.6-working/arch/ppc64/kernel/pmac_feature.c linux-2.6.6/arch/ppc64/kernel/pmac_feature.c
--- linux-2.6.6-working/arch/ppc64/kernel/pmac_feature.c	2004-05-13 17:00:12.000000000 -0600
+++ linux-2.6.6/arch/ppc64/kernel/pmac_feature.c	2004-05-09 20:32:54.000000000 -0600
@@ -343,10 +343,6 @@
 		PMAC_TYPE_POWERMAC_G5,		g5_features,
 		0,
 	},
-	{	"RackMac3,1",			"XServe G5",
-		PMAC_TYPE_POWERMAC_G5,		g5_features,
-		0,
-	},
 };

 /*
diff -ur linux-2.6.6-working/arch/ppc64/kernel/setup.c linux-2.6.6/arch/ppc64/kernel/setup.c
--- linux-2.6.6-working/arch/ppc64/kernel/setup.c	2004-05-13 16:06:33.000000000 -0600
+++ linux-2.6.6/arch/ppc64/kernel/setup.c	2004-05-09 20:32:29.000000000 -0600
@@ -547,7 +547,7 @@
 int __init ppc_init(void)
 {
 	/* clear the progress line */
-	if(ppc_md.progress) ppc_md.progress(" ", 0xffff);
+	ppc_md.progress(" ", 0xffff);

 	if (ppc_md.init != NULL) {
 		ppc_md.init();
-------------- next part --------------
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes)
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
Mount-cache hash table entries: 256 (order: 0, 4096 bytes)
POSIX conformance testing by UNIFIX
PowerMac SMP probe found 2 cpus
Processor 1 found.
Synchronizing timebase
Got ack
score 299, offset 1000
score 299, offset 500
score -299, offset 250
score 299, offset 375
score -299, offset 312
score -299, offset 343
score -299, offset 359
score -299, offset 367
score -283, offset 371
score -247, offset 373
score 133, offset 374
score -239, offset 373
Min 373 (score -237), Max 374 (score 129)
Final offset: 374 (127/300)
Brought up 2 CPUs

After a few seconds:

[c0000000000172f4] .kernel_thread+0x4c/0x68
 <0>Kernel panic: Attempted to k00025ef1c0[1] 'swapper' THREAD: c0000000025e8000
 CPU: 0
GPR00: C000000000077E90 C0000000025EBAE0 C00000000045EA58 FFFFFFFFFFFFFFFF
GPR04: 0000000000000DE7 FFFFFFFFFFFFFFFF C000000000397200 C000000000397218
GPR08: 0000000000000000 C000000000359180 0000000000000001 C0000000004AE730
GPR12: 0000000088004044 C000000000308000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000001000
GPR20: C0000000004B16E0 000000000000007F 0000000000000008 000000000000001B
GPR24: C00000007DFCDD80 C0000000025EBBE0 0000000000000036 0000000000000080
GPR28: C0000000025EBBE0 C0000000004364F0 C0000000003A5628 0000000000000000
NIP [c000000000077e9c] .smp_call_function_all_cpus+0x7c/0x98
LR [c000000000077e90] .smp_call_function_all_cpus+0x70/0x98
Call Trace:
[c000000000079c28] .do_tune_cpucache+0xb4/0x3fc
[c00000000007a050] .enable_cpucache+0xe0/0x118
[c00000000007a7b0] .kmem_cache_create+0x728/0x79c
[c0000000002f5448] .sk_init+0x30/0xdc
[c0000000002f539c] .sock_init+0x3c/0xb8
[c00000000000c6ec] .init+0x238/0x43c
[c0000000000172f4] .kernel_thread+0x4c/0x68
 <0>Kernel panic: Attempted to kill init!
 smp_call_function on cpu 0: other cpus not responding (0)
Rebooting in 180 seconds..


From grave at ipno.in2p3.fr  Mon Oct  4 18:48:26 2004
From: grave at ipno.in2p3.fr (grave)
Date: Mon, 04 Oct 2004 08:48:26 +0000
Subject: discovered the patch pages and how it work on penguinppc64.org
 sorry for the previous mail...
Message-ID: <1096879706l.20867l.0l@ipnnarval>

http://ozlabs.org/ppc64-patches/patch.pl?id=62 makes everything
go right !

One more time sorry...


From benh at kernel.crashing.org  Mon Oct  4 18:55:29 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 04 Oct 2004 18:55:29 +1000
Subject: discovered the patch pages and how it work on penguinppc64.org
	sorry for the previous mail...
In-Reply-To: <1096879706l.20867l.0l@ipnnarval>
References: <1096879706l.20867l.0l@ipnnarval>
Message-ID: <1096880129.9514.70.camel@gaston>

On Mon, 2004-10-04 at 18:48, grave wrote:
> http://ozlabs.org/ppc64-patches/patch.pl?id=62 make the all things  
> going right !
> 
> One more time sorry...

Hrm, that should be in Linus tree already...

Ben.


From grave at ipno.in2p3.fr  Mon Oct  4 22:31:21 2004
From: grave at ipno.in2p3.fr (grave)
Date: Mon, 04 Oct 2004 12:31:21 +0000
Subject: Re : discovered the patch pages and how it
 work on penguinppc64.org sorry for the previous mail...
In-Reply-To: <1096880129.9514.70.camel@gaston> (from
	benh@kernel.crashing.org on Mon Oct  4 10:55:29 2004)
References: <1096879706l.20867l.0l@ipnnarval> <1096880129.9514.70.camel@gaston>
Message-ID: <1096893081l.23876l.0l@ipnnarval>

On 04.10.2004 10:55:29, Benjamin Herrenschmidt wrote:
> On Mon, 2004-10-04 at 18:48, grave wrote:
> > http://ozlabs.org/ppc64-patches/patch.pl?id=62 make the all things
> > going right !
> >
> > One more time sorry...
> 
> Hrm, that should be in Linus tree already...
Not in the 2.6.6 tree from www.kernel.org

It's present in 2.6.8.1 but this one crashes at boot (see attached file).
This kernel also crashes if I use the nosmp option

xavier
-------------- next part --------------
Min 8 (score -13), Max 9 (score 51)
Final offset: 8 (9/300)
Brought up 2 CPUs
NET: Registered protocol family 16
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2 POWERMAC
NIP: C0000000002DA15C XER: 0000000000000000 LR: C00000000000C600
REGS: c0000000027e7be0 TRAP: 0300   Not tainted  (2.6.8.1)
MSR: 9000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
DAR: 0000000000000000, DSISR: 0000000008000000
TASK: c0000000027e1200[1] 'swapper' THREAD: c0000000027e4000 CPU: 0
GPR00: C00000000000C600 C0000000027E7E60 C000000000437E78 C0000000002AEC28
GPR04: 000000000000FFFF 0000000000000000 C000000000493C48 C00000007DE5BD78
GPR08: 0000000000000002 0000000000000000 0000000000000002 0000000000000000
GPR12: 0000000028000042 C000000000304000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000220000 0000000000230000 0000000001400000
GPR24: C000000000304000 C000000000435008 C0000000002F4348 C0000000002F8268
GPR28: 0000000000000000 C000000000436420 C000000000364260 C0000000002F7F30
NIP [c0000000002da15c] .ppc_init+0x30/0xa4
LR [c00000000000c600] .init+0x234/0x428
Call Trace:
[c0000000027e7e60] [c0000000027e7ef0] 0xc0000000027e7ef0 (unreliable)
[c0000000027e7ef0] [c00000000000c600] .init+0x234/0x428
[c0000000027e7f90] [c000000000017734] .kernel_thread+0x4c/0x68
 <0>Kernel panic: Attempted to kill init!
 <0>Rebooting in 180 seconds..

From benh at kernel.crashing.org  Mon Oct  4 23:32:23 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 04 Oct 2004 23:32:23 +1000
Subject: Re: Re : discovered the patch pages and how it work on
	penguinppc64.org sorry for the previous mail...
In-Reply-To: <1096893081l.23876l.0l@ipnnarval>
References: <1096879706l.20867l.0l@ipnnarval>
	<1096880129.9514.70.camel@gaston>  <1096893081l.23876l.0l@ipnnarval>
Message-ID: <1096896743.9516.84.camel@gaston>

On Mon, 2004-10-04 at 22:31, grave wrote:
> On 04.10.2004 10:55:29, Benjamin Herrenschmidt wrote:
> > On Mon, 2004-10-04 at 18:48, grave wrote:
> > > http://ozlabs.org/ppc64-patches/patch.pl?id=62 make the all things
> > > going right !
> > >
> > > One more time sorry...
> > 
> > Hrm, that should be in Linus tree already...
> Not in the 2.6.6 tree from www.kernel.org
> 
> It's present in 2.6.8.1 but this one crashes at boot (see attached file).
> This kernel also crashes if I use the nosmp option

Can you try 2.6.9-rc3 and let me know ? Or better, the current bk
snapshot of 2.6.9

Ben.


From grave at ipno.in2p3.fr  Mon Oct  4 23:48:28 2004
From: grave at ipno.in2p3.fr (grave)
Date: Mon, 04 Oct 2004 13:48:28 +0000
Subject: Re : Re : discovered the patch pages and
 how it work on penguinppc64.org sorry for the previous mail...
In-Reply-To: <1096896743.9516.84.camel@gaston> (from
	benh@kernel.crashing.org on Mon Oct  4 15:32:23 2004)
References: <1096879706l.20867l.0l@ipnnarval>
	<1096880129.9514.70.camel@gaston> <1096893081l.23876l.0l@ipnnarval>
	<1096896743.9516.84.camel@gaston>
Message-ID: <1096897708l.24855l.1l@ipnnarval>


> Can you try 2.6.9-rc3 and let me know ? Or better, the current bk
> snapshot of 2.6.9

I also tried with the bk tree (2.6.9-rc1-ames); it also crashed...
I'll retry in order to send you a log of the crash...

Where can I get the 2.6.9-rc3 tree ? I didn't find where it is ?

I'm trying to have something "better" than 2.6.6 in order to have
thermal management.

xavier


From benh at kernel.crashing.org  Mon Oct  4 23:45:50 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 04 Oct 2004 23:45:50 +1000
Subject: Re: Re : Re : discovered the patch pages and how it work on
	penguinppc64.org sorry for the previous mail...
In-Reply-To: <1096897708l.24855l.1l@ipnnarval>
References: <1096879706l.20867l.0l@ipnnarval>
	<1096880129.9514.70.camel@gaston> <1096893081l.23876l.0l@ipnnarval>
	<1096896743.9516.84.camel@gaston>  <1096897708l.24855l.1l@ipnnarval>
Message-ID: <1096897549.23141.93.camel@gaston>

On Mon, 2004-10-04 at 23:48, grave wrote:
> > Can you try 2.6.9-rc3 and let me know ? Or better, the current bk
> > snapshot of 2.6.9
> 
> I also tried with the bk tree (2.6.9-rc1-ames); it also crashed...
> I'll retry in order to send you a log of the crash...

ames ? just use mainstream

> Where can I get the 2.6.9-rc3 tree ? I didn't find where it is ?

kernel.org ?

> I'm trying to have something "better" than 2.6.6 in order to have
> thermal management.
> 
> xavier
-- 
Benjamin Herrenschmidt <benh at kernel.crashing.org>


From benh at kernel.crashing.org  Mon Oct  4 23:46:54 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 04 Oct 2004 23:46:54 +1000
Subject: Re : Re : discovered the patch pages and how it
	work on penguinppc64.org sorry for the previous mail...
In-Reply-To: <1096897549.23141.93.camel@gaston>
References: <1096879706l.20867l.0l@ipnnarval>
	<1096880129.9514.70.camel@gaston> <1096893081l.23876l.0l@ipnnarval>
	<1096896743.9516.84.camel@gaston>  <1096897708l.24855l.1l@ipnnarval>
	<1096897549.23141.93.camel@gaston>
Message-ID: <1096897613.9539.95.camel@gaston>

On Mon, 2004-10-04 at 23:45, Benjamin Herrenschmidt wrote:

> > I'm trying to have something "better" than 2.6.6 in order to have
> > thermal management.


BTW. Thermal control isn't there yet for xserve's ... soon hopefully

Ben.


From moilanen at austin.ibm.com  Tue Oct  5 05:43:05 2004
From: moilanen at austin.ibm.com (moilanen at austin.ibm.com)
Date: Mon,  4 Oct 2004 14:43:05 -0500
Subject: [PATCH 1/1] rtas_flash_4gig
Message-ID: <200410041942.i94Jg4WA154540@westrelay04.boulder.ibm.com>

We should probably check to make sure that all of the flash
list headers are below 4gig, not just the first one.

We could see this situation happen if we are low on memory
and get a page allocated that's over the 4 gig boundary.
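
As a standalone illustration of the idea (walk the whole chain and reject
it if any header sits at or above the 4 GB boundary), here is a minimal
sketch. The struct and function names are hypothetical and much simpler
than the real flash_block_list; in particular the absolute address is
stored in its own field here rather than derived from the pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define FOUR_GB (UINT64_C(4) * 1024 * 1024 * 1024)

/* Hypothetical list header: only the fields this check needs. */
struct flash_hdr {
	uint64_t abs_addr;          /* absolute address of this header */
	struct flash_hdr *next;     /* next header in the chain, or NULL */
};

/* Return 1 if every header in the chain lies below 4 GB, else 0. */
static int all_headers_below_4gb(const struct flash_hdr *f)
{
	for (; f != NULL; f = f->next)
		if (f->abs_addr >= FOUR_GB)
			return 0;
	return 1;
}
```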

Jake

Signed-off-by: Jake Moilanen <moilanen at austin.ibm.com>


---


diff -puN arch/ppc64/kernel/rtas.c~rtas_flash_4gig arch/ppc64/kernel/rtas.c
--- linux-2.6-bk/arch/ppc64/kernel/rtas.c~rtas_flash_4gig	Mon Oct  4 10:46:46 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/kernel/rtas.c	Mon Oct  4 14:22:31 2004
@@ -338,6 +338,12 @@ rtas_flash_firmware(void)
 			f->next = (struct flash_block_list *)virt_to_abs(f->next);
 		else
 			f->next = NULL;
+
+		if (f->next >= 4UL*1024*1024*1024) {
+			printk(KERN_ALERT "FLASH: aborted...flash list header addr above 4GB\n");
+			return;
+		}
+
 		/* make num_blocks into the version/length field */
 		f->num_blocks = (FLASH_BLOCK_LIST_VERSION << 56) | ((f->num_blocks+1)*16);
 	}

_


From schwab at suse.de  Tue Oct  5 06:51:55 2004
From: schwab at suse.de (Andreas Schwab)
Date: Mon, 04 Oct 2004 22:51:55 +0200
Subject: Sound on G5
Message-ID: <jebrfi1a1g.fsf@sykes.suse.de>

Is anyone already working on sound support for the PowerMac G5, by chance?
That's actually the only thing still missing.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From benh at kernel.crashing.org  Tue Oct  5 11:25:03 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 05 Oct 2004 11:25:03 +1000
Subject: Sound on G5
In-Reply-To: <jebrfi1a1g.fsf@sykes.suse.de>
References: <jebrfi1a1g.fsf@sykes.suse.de>
Message-ID: <1096939502.24584.6.camel@gaston>

On Tue, 2004-10-05 at 06:51, Andreas Schwab wrote:
> Is anyone already working on sound support for the PowerMac G5, by chance?
> That's actually the only thing still missing.

Nobody really seriously ATM. One of the main issues is that the darwin
driver abuses apple "do-platform-*" shit. It's a mechanism they invented
to put sort-of "scripts" (in binary form) in the device-tree that can
contain elementary ops such as write GPIOs, I2C, etc...

This is extremely messy and difficult to parse. I have written the
basis for parsing them, but interpreting them is even more shitty as
the actual implementation of each ops sort-of depends on the target
object.

It's really a piece-of-shit imho.

So we could go that way and complete my "interpreter" or just hard code
all of the GPIOs we need in the driver hoping apple don't shuffle them
too much in upcoming models...

Ben.


From david at gibson.dropbear.id.au  Tue Oct  5 13:13:41 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 5 Oct 2004 13:13:41 +1000
Subject: [PPC64] Squash EEH warnings
Message-ID: <20041005031341.GA3695@zax>

Andrew, please apply:

A slightly non-ideal version of the recent patch which fixed EEH being
a no-op went in.  The srcsave variable in eeh_memcpy_fromio() is now
never referenced on non-pSeries machines, and so spews hundreds of
warnings.  The variable doesn't actually accomplish anything, so this
patch gets rid of it.

Signed-off-by: David Gibson <dwg at au1.ibm.com>

Index: working-2.6/include/asm-ppc64/eeh.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/eeh.h	2004-10-05 10:08:10.000000000 +1000
+++ working-2.6/include/asm-ppc64/eeh.h	2004-10-05 13:09:24.730992368 +1000
@@ -196,7 +196,6 @@
 static inline void eeh_memcpy_fromio(void *dest, const volatile void __iomem *src, unsigned long n) {
 	void *vsrc = (void __force *) src;
 	void *destsave = dest;
-	const volatile void __iomem *srcsave = src;
 	unsigned long nsave = n;
 
 	while(n && (!EEH_CHECK_ALIGN(vsrc, 4) || !EEH_CHECK_ALIGN(dest, 4))) {
@@ -227,7 +226,7 @@
 	 */
 	if ((nsave >= 4) &&
 		(EEH_POSSIBLE_ERROR((*((u32 *) destsave+nsave-4)), u32))) {
-		eeh_check_failure(srcsave, (*((u32 *) destsave+nsave-4)));
+		eeh_check_failure(src, (*((u32 *) destsave+nsave-4)));
 	}
 }
 

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From david at gibson.dropbear.id.au  Tue Oct  5 15:26:27 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 5 Oct 2004 15:26:27 +1000
Subject: [TRIVIAL, PPC64]  Remove redundant #ifdef CONFIG_ALTIVEC
Message-ID: <20041005052627.GD3695@zax>

Andrew, please apply:

arch/ppc64/kernel/process.c has an #ifdef CONFIG_ALTIVEC within an
#ifdef CONFIG_ALTIVEC.  This patch removes the inner one.

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/arch/ppc64/kernel/process.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/process.c	2004-10-05 10:08:10.000000000 +1000
+++ working-2.6/arch/ppc64/kernel/process.c	2004-10-05 15:18:56.581996496 +1000
@@ -147,7 +147,6 @@
  */
 void flush_altivec_to_thread(struct task_struct *tsk)
 {
-#ifdef CONFIG_ALTIVEC
 	if (tsk->thread.regs) {
 		preempt_disable();
 		if (tsk->thread.regs->msr & MSR_VEC) {
@@ -158,7 +157,6 @@
 		}
 		preempt_enable();
 	}
-#endif
 }
 
 int dump_task_altivec(struct pt_regs *regs, elf_vrregset_t *vrregs)


-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From david at gibson.dropbear.id.au  Tue Oct  5 16:42:56 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 5 Oct 2004 16:42:56 +1000
Subject: [PPC64] xmon sparse cleanups
Message-ID: <20041005064255.GF3695@zax>

Andrew, please apply:

This patch removes many sparse warnings from the xmon code.  Mostly
K&R function declarations and 0-instead-of-NULLs.

I believe this removes all save one sparse error in xmon, excepting
those inherited from header files.
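
To illustrate the two kinds of change for anyone not familiar with what
sparse complains about, here is a tiny standalone sketch modelled on the
input helpers touched below. It is not the xmon code itself, and
current_input() is a made-up helper added only so the state can be
inspected.

```c
#include <stddef.h>

static char *lineptr;

/*
 * Old style (what is being removed):
 *
 *	void take_input(str)
 *	char *str;
 *	{ ... }
 *
 * With a prototype the compiler can check the argument type at every call.
 */
static void take_input(char *str)
{
	lineptr = str;
}

/* "(void)" states that there are no parameters; "()" leaves them unspecified. */
static void flush_input(void)
{
	lineptr = NULL;		/* NULL rather than a bare 0 for pointers */
}

/* Hypothetical accessor, for illustration only. */
static const char *current_input(void)
{
	return lineptr;
}
```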

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/arch/ppc64/xmon/xmon.c
===================================================================
--- working-2.6.orig/arch/ppc64/xmon/xmon.c	2004-09-24 10:14:09.000000000 +1000
+++ working-2.6/arch/ppc64/xmon/xmon.c	2004-10-05 16:31:01.822963256 +1000
@@ -645,7 +645,7 @@
 	for (i = 0; i < NBPTS; ++i, ++bp)
 		if (bp->enabled && pc == bp->address)
 			return bp;
-	return 0;
+	return NULL;
 }
 
 static struct bpt *in_breakpoint_table(unsigned long nip, unsigned long *offp)
@@ -1582,7 +1582,7 @@
 extern char dec_exc;
 
 void
-super_regs()
+super_regs(void)
 {
 	int cmd;
 	unsigned long val;
@@ -1816,7 +1816,7 @@
     "";
 
 void
-memex()
+memex(void)
 {
 	int cmd, inc, i, nslash;
 	unsigned long n;
@@ -1967,7 +1967,7 @@
 }
 
 int
-bsesc()
+bsesc(void)
 {
 	int c;
 
@@ -1985,7 +1985,7 @@
 			 || ('a' <= (c) && (c) <= 'f') \
 			 || ('A' <= (c) && (c) <= 'F'))
 void
-dump()
+dump(void)
 {
 	int c;
 
@@ -2150,7 +2150,7 @@
 static unsigned mask;
 
 void
-memlocate()
+memlocate(void)
 {
 	unsigned a, n;
 	unsigned char val[4];
@@ -2183,7 +2183,7 @@
 static unsigned long mlim = 0xffffffff;
 
 void
-memzcan()
+memzcan(void)
 {
 	unsigned char v;
 	unsigned a;
@@ -2212,7 +2212,7 @@
 
 /* Input scanning routines */
 int
-skipbl()
+skipbl(void)
 {
 	int c;
 
@@ -2237,8 +2237,7 @@
 };
 
 int
-scanhex(vp)
-unsigned long *vp;
+scanhex(unsigned long *vp)
 {
 	int c, d;
 	unsigned long v;
@@ -2322,7 +2321,7 @@
 }
 
 void
-scannl()
+scannl(void)
 {
 	int c;
 
@@ -2365,13 +2364,13 @@
 static char *lineptr;
 
 void
-flush_input()
+flush_input(void)
 {
 	lineptr = NULL;
 }
 
 int
-inchar()
+inchar(void)
 {
 	if (lineptr == NULL || *lineptr == 0) {
 		if (fgets(line, sizeof(line), stdin) == NULL) {
@@ -2384,8 +2383,7 @@
 }
 
 void
-take_input(str)
-char *str;
+take_input(char *str)
 {
 	lineptr = str;
 }
Index: working-2.6/arch/ppc64/xmon/ppc-opc.c
===================================================================
--- working-2.6.orig/arch/ppc64/xmon/ppc-opc.c	2004-08-09 09:51:38.000000000 +1000
+++ working-2.6/arch/ppc64/xmon/ppc-opc.c	2004-10-05 16:41:20.355047248 +1000
@@ -20,6 +20,7 @@
    Software Foundation, 59 Temple Place - Suite 330, Boston, MA
    02111-1307, USA.  */
 
+#include <linux/stddef.h>
 #include "nonstdio.h"
 #include "ppc.h"
 
@@ -110,12 +111,12 @@
   /* The zero index is used to indicate the end of the list of
      operands.  */
 #define UNUSED 0
-  { 0, 0, 0, 0, 0 },
+  { 0, 0, NULL, NULL, 0 },
 
   /* The BA field in an XL form instruction.  */
 #define BA UNUSED + 1
 #define BA_MASK (0x1f << 16)
-  { 5, 16, 0, 0, PPC_OPERAND_CR },
+  { 5, 16, NULL, NULL, PPC_OPERAND_CR },
 
   /* The BA field in an XL form instruction when it must be the same
      as the BT field in the same instruction.  */
@@ -125,7 +126,7 @@
   /* The BB field in an XL form instruction.  */
 #define BB BAT + 1
 #define BB_MASK (0x1f << 11)
-  { 5, 11, 0, 0, PPC_OPERAND_CR },
+  { 5, 11, NULL, NULL, PPC_OPERAND_CR },
 
   /* The BB field in an XL form instruction when it must be the same
      as the BA field in the same instruction.  */
@@ -168,21 +169,21 @@
 
   /* The BF field in an X or XL form instruction.  */
 #define BF BDPA + 1
-  { 3, 23, 0, 0, PPC_OPERAND_CR },
+  { 3, 23, NULL, NULL, PPC_OPERAND_CR },
 
   /* An optional BF field.  This is used for comparison instructions,
      in which an omitted BF field is taken as zero.  */
 #define OBF BF + 1
-  { 3, 23, 0, 0, PPC_OPERAND_CR | PPC_OPERAND_OPTIONAL },
+  { 3, 23, NULL, NULL, PPC_OPERAND_CR | PPC_OPERAND_OPTIONAL },
 
   /* The BFA field in an X or XL form instruction.  */
 #define BFA OBF + 1
-  { 3, 18, 0, 0, PPC_OPERAND_CR },
+  { 3, 18, NULL, NULL, PPC_OPERAND_CR },
 
   /* The BI field in a B form or XL form instruction.  */
 #define BI BFA + 1
 #define BI_MASK (0x1f << 16)
-  { 5, 16, 0, 0, PPC_OPERAND_CR },
+  { 5, 16, NULL, NULL, PPC_OPERAND_CR },
 
   /* The BO field in a B form instruction.  Certain values are
      illegal.  */
@@ -197,36 +198,36 @@
 
   /* The BT field in an X or XL form instruction.  */
 #define BT BOE + 1
-  { 5, 21, 0, 0, PPC_OPERAND_CR },
+  { 5, 21, NULL, NULL, PPC_OPERAND_CR },
 
   /* The condition register number portion of the BI field in a B form
      or XL form instruction.  This is used for the extended
      conditional branch mnemonics, which set the lower two bits of the
      BI field.  This field is optional.  */
 #define CR BT + 1
-  { 3, 18, 0, 0, PPC_OPERAND_CR | PPC_OPERAND_OPTIONAL },
+  { 3, 18, NULL, NULL, PPC_OPERAND_CR | PPC_OPERAND_OPTIONAL },
 
   /* The CRB field in an X form instruction.  */
 #define CRB CR + 1
-  { 5, 6, 0, 0, 0 },
+  { 5, 6, NULL, NULL, 0 },
 
   /* The CRFD field in an X form instruction.  */
 #define CRFD CRB + 1
-  { 3, 23, 0, 0, PPC_OPERAND_CR },
+  { 3, 23, NULL, NULL, PPC_OPERAND_CR },
 
   /* The CRFS field in an X form instruction.  */
 #define CRFS CRFD + 1
-  { 3, 0, 0, 0, PPC_OPERAND_CR },
+  { 3, 0, NULL, NULL, PPC_OPERAND_CR },
 
   /* The CT field in an X form instruction.  */
 #define CT CRFS + 1
-  { 5, 21, 0, 0, PPC_OPERAND_OPTIONAL },
+  { 5, 21, NULL, NULL, PPC_OPERAND_OPTIONAL },
 
   /* The D field in a D form instruction.  This is a displacement off
      a register, and implies that the next operand is a register in
      parentheses.  */
 #define D CT + 1
-  { 16, 0, 0, 0, PPC_OPERAND_PARENS | PPC_OPERAND_SIGNED },
+  { 16, 0, NULL, NULL, PPC_OPERAND_PARENS | PPC_OPERAND_SIGNED },
 
   /* The DE field in a DE form instruction.  This is like D, but is 12
      bits only.  */
@@ -252,40 +253,40 @@
 
   /* The E field in a wrteei instruction.  */
 #define E DS + 1
-  { 1, 15, 0, 0, 0 },
+  { 1, 15, NULL, NULL, 0 },
 
   /* The FL1 field in a POWER SC form instruction.  */
 #define FL1 E + 1
-  { 4, 12, 0, 0, 0 },
+  { 4, 12, NULL, NULL, 0 },
 
   /* The FL2 field in a POWER SC form instruction.  */
 #define FL2 FL1 + 1
-  { 3, 2, 0, 0, 0 },
+  { 3, 2, NULL, NULL, 0 },
 
   /* The FLM field in an XFL form instruction.  */
 #define FLM FL2 + 1
-  { 8, 17, 0, 0, 0 },
+  { 8, 17, NULL, NULL, 0 },
 
   /* The FRA field in an X or A form instruction.  */
 #define FRA FLM + 1
 #define FRA_MASK (0x1f << 16)
-  { 5, 16, 0, 0, PPC_OPERAND_FPR },
+  { 5, 16, NULL, NULL, PPC_OPERAND_FPR },
 
   /* The FRB field in an X or A form instruction.  */
 #define FRB FRA + 1
 #define FRB_MASK (0x1f << 11)
-  { 5, 11, 0, 0, PPC_OPERAND_FPR },
+  { 5, 11, NULL, NULL, PPC_OPERAND_FPR },
 
   /* The FRC field in an A form instruction.  */
 #define FRC FRB + 1
 #define FRC_MASK (0x1f << 6)
-  { 5, 6, 0, 0, PPC_OPERAND_FPR },
+  { 5, 6, NULL, NULL, PPC_OPERAND_FPR },
 
   /* The FRS field in an X form instruction or the FRT field in a D, X
      or A form instruction.  */
 #define FRS FRC + 1
 #define FRT FRS
-  { 5, 21, 0, 0, PPC_OPERAND_FPR },
+  { 5, 21, NULL, NULL, PPC_OPERAND_FPR },
 
   /* The FXM field in an XFX instruction.  */
 #define FXM FRS + 1
@@ -298,11 +299,11 @@
 
   /* The L field in a D or X form instruction.  */
 #define L FXM4 + 1
-  { 1, 21, 0, 0, PPC_OPERAND_OPTIONAL },
+  { 1, 21, NULL, NULL, PPC_OPERAND_OPTIONAL },
 
   /* The LEV field in a POWER SC form instruction.  */
 #define LEV L + 1
-  { 7, 5, 0, 0, 0 },
+  { 7, 5, NULL, NULL, 0 },
 
   /* The LI field in an I form instruction.  The lower two bits are
      forced to zero.  */
@@ -316,24 +317,24 @@
 
   /* The LS field in an X (sync) form instruction.  */
 #define LS LIA + 1
-  { 2, 21, 0, 0, PPC_OPERAND_OPTIONAL },
+  { 2, 21, NULL, NULL, PPC_OPERAND_OPTIONAL },
 
   /* The MB field in an M form instruction.  */
 #define MB LS + 1
 #define MB_MASK (0x1f << 6)
-  { 5, 6, 0, 0, 0 },
+  { 5, 6, NULL, NULL, 0 },
 
   /* The ME field in an M form instruction.  */
 #define ME MB + 1
 #define ME_MASK (0x1f << 1)
-  { 5, 1, 0, 0, 0 },
+  { 5, 1, NULL, NULL, 0 },
 
   /* The MB and ME fields in an M form instruction expressed a single
      operand which is a bitmask indicating which bits to select.  This
      is a two operand form using PPC_OPERAND_NEXT.  See the
      description in opcode/ppc.h for what this means.  */
 #define MBE ME + 1
-  { 5, 6, 0, 0, PPC_OPERAND_OPTIONAL | PPC_OPERAND_NEXT },
+  { 5, 6, NULL, NULL, PPC_OPERAND_OPTIONAL | PPC_OPERAND_NEXT },
   { 32, 0, insert_mbe, extract_mbe, 0 },
 
   /* The MB or ME field in an MD or MDS form instruction.  The high
@@ -345,7 +346,7 @@
 
   /* The MO field in an mbar instruction.  */
 #define MO MB6 + 1
-  { 5, 21, 0, 0, 0 },
+  { 5, 21, NULL, NULL, 0 },
 
   /* The NB field in an X form instruction.  The value 32 is stored as
      0.  */
@@ -361,34 +362,34 @@
   /* The RA field in an D, DS, DQ, X, XO, M, or MDS form instruction.  */
 #define RA NSI + 1
 #define RA_MASK (0x1f << 16)
-  { 5, 16, 0, 0, PPC_OPERAND_GPR },
+  { 5, 16, NULL, NULL, PPC_OPERAND_GPR },
 
   /* The RA field in the DQ form lq instruction, which has special
      value restrictions.  */
 #define RAQ RA + 1
-  { 5, 16, insert_raq, 0, PPC_OPERAND_GPR },
+  { 5, 16, insert_raq, NULL, PPC_OPERAND_GPR },
 
   /* The RA field in a D or X form instruction which is an updating
      load, which means that the RA field may not be zero and may not
      equal the RT field.  */
 #define RAL RAQ + 1
-  { 5, 16, insert_ral, 0, PPC_OPERAND_GPR },
+  { 5, 16, insert_ral, NULL, PPC_OPERAND_GPR },
 
   /* The RA field in an lmw instruction, which has special value
      restrictions.  */
 #define RAM RAL + 1
-  { 5, 16, insert_ram, 0, PPC_OPERAND_GPR },
+  { 5, 16, insert_ram, NULL, PPC_OPERAND_GPR },
 
   /* The RA field in a D or X form instruction which is an updating
      store or an updating floating point load, which means that the RA
      field may not be zero.  */
 #define RAS RAM + 1
-  { 5, 16, insert_ras, 0, PPC_OPERAND_GPR },
+  { 5, 16, insert_ras, NULL, PPC_OPERAND_GPR },
 
   /* The RB field in an X, XO, M, or MDS form instruction.  */
 #define RB RAS + 1
 #define RB_MASK (0x1f << 11)
-  { 5, 11, 0, 0, PPC_OPERAND_GPR },
+  { 5, 11, NULL, NULL, PPC_OPERAND_GPR },
 
   /* The RB field in an X form instruction when it must be the same as
      the RS field in the instruction.  This is used for extended
@@ -402,22 +403,22 @@
 #define RS RBS + 1
 #define RT RS
 #define RT_MASK (0x1f << 21)
-  { 5, 21, 0, 0, PPC_OPERAND_GPR },
+  { 5, 21, NULL, NULL, PPC_OPERAND_GPR },
 
   /* The RS field of the DS form stq instruction, which has special
      value restrictions.  */
 #define RSQ RS + 1
-  { 5, 21, insert_rsq, 0, PPC_OPERAND_GPR },
+  { 5, 21, insert_rsq, NULL, PPC_OPERAND_GPR },
 
   /* The RT field of the DQ form lq instruction, which has special
      value restrictions.  */
 #define RTQ RSQ + 1
-  { 5, 21, insert_rtq, 0, PPC_OPERAND_GPR },
+  { 5, 21, insert_rtq, NULL, PPC_OPERAND_GPR },
 
   /* The SH field in an X or M form instruction.  */
 #define SH RTQ + 1
 #define SH_MASK (0x1f << 11)
-  { 5, 11, 0, 0, 0 },
+  { 5, 11, NULL, NULL, 0 },
 
   /* The SH field in an MD form instruction.  This is split.  */
 #define SH6 SH + 1
@@ -426,12 +427,12 @@
 
   /* The SI field in a D form instruction.  */
 #define SI SH6 + 1
-  { 16, 0, 0, 0, PPC_OPERAND_SIGNED },
+  { 16, 0, NULL, NULL, PPC_OPERAND_SIGNED },
 
   /* The SI field in a D form instruction when we accept a wide range
      of positive values.  */
 #define SISIGNOPT SI + 1
-  { 16, 0, 0, 0, PPC_OPERAND_SIGNED | PPC_OPERAND_SIGNOPT },
+  { 16, 0, NULL, NULL, PPC_OPERAND_SIGNED | PPC_OPERAND_SIGNOPT },
 
   /* The SPR field in an XFX form instruction.  This is flipped--the
      lower 5 bits are stored in the upper 5 and vice- versa.  */
@@ -443,25 +444,25 @@
   /* The BAT index number in an XFX form m[ft]ibat[lu] instruction.  */
 #define SPRBAT SPR + 1
 #define SPRBAT_MASK (0x3 << 17)
-  { 2, 17, 0, 0, 0 },
+  { 2, 17, NULL, NULL, 0 },
 
   /* The SPRG register number in an XFX form m[ft]sprg instruction.  */
 #define SPRG SPRBAT + 1
 #define SPRG_MASK (0x3 << 16)
-  { 2, 16, 0, 0, 0 },
+  { 2, 16, NULL, NULL, 0 },
 
   /* The SR field in an X form instruction.  */
 #define SR SPRG + 1
-  { 4, 16, 0, 0, 0 },
+  { 4, 16, NULL, NULL, 0 },
 
   /* The STRM field in an X AltiVec form instruction.  */
 #define STRM SR + 1
 #define STRM_MASK (0x3 << 21)
-  { 2, 21, 0, 0, 0 },
+  { 2, 21, NULL, NULL, 0 },
 
   /* The SV field in a POWER SC form instruction.  */
 #define SV STRM + 1
-  { 14, 2, 0, 0, 0 },
+  { 14, 2, NULL, NULL, 0 },
 
   /* The TBR field in an XFX form instruction.  This is like the SPR
      field, but it is optional.  */
@@ -471,52 +472,52 @@
   /* The TO field in a D or X form instruction.  */
 #define TO TBR + 1
 #define TO_MASK (0x1f << 21)
-  { 5, 21, 0, 0, 0 },
+  { 5, 21, NULL, NULL, 0 },
 
   /* The U field in an X form instruction.  */
 #define U TO + 1
-  { 4, 12, 0, 0, 0 },
+  { 4, 12, NULL, NULL, 0 },
 
   /* The UI field in a D form instruction.  */
 #define UI U + 1
-  { 16, 0, 0, 0, 0 },
+  { 16, 0, NULL, NULL, 0 },
 
   /* The VA field in a VA, VX or VXR form instruction.  */
 #define VA UI + 1
 #define VA_MASK	(0x1f << 16)
-  { 5, 16, 0, 0, PPC_OPERAND_VR },
+  { 5, 16, NULL, NULL, PPC_OPERAND_VR },
 
   /* The VB field in a VA, VX or VXR form instruction.  */
 #define VB VA + 1
 #define VB_MASK (0x1f << 11)
-  { 5, 11, 0, 0, PPC_OPERAND_VR },
+  { 5, 11, NULL, NULL, PPC_OPERAND_VR },
 
   /* The VC field in a VA form instruction.  */
 #define VC VB + 1
 #define VC_MASK (0x1f << 6)
-  { 5, 6, 0, 0, PPC_OPERAND_VR },
+  { 5, 6, NULL, NULL, PPC_OPERAND_VR },
 
   /* The VD or VS field in a VA, VX, VXR or X form instruction.  */
 #define VD VC + 1
 #define VS VD
 #define VD_MASK (0x1f << 21)
-  { 5, 21, 0, 0, PPC_OPERAND_VR },
+  { 5, 21, NULL, NULL, PPC_OPERAND_VR },
 
   /* The SIMM field in a VX form instruction.  */
 #define SIMM VD + 1
-  { 5, 16, 0, 0, PPC_OPERAND_SIGNED},
+  { 5, 16, NULL, NULL, PPC_OPERAND_SIGNED},
 
   /* The UIMM field in a VX form instruction.  */
 #define UIMM SIMM + 1
-  { 5, 16, 0, 0, 0 },
+  { 5, 16, NULL, NULL, 0 },
 
   /* The SHB field in a VA form instruction.  */
 #define SHB UIMM + 1
-  { 4, 6, 0, 0, 0 },
+  { 4, 6, NULL, NULL, 0 },
 
   /* The other UIMM field in a EVX form instruction.  */
 #define EVUIMM SHB + 1
-  { 5, 11, 0, 0, 0 },
+  { 5, 11, NULL, NULL, 0 },
 
   /* The other UIMM field in a half word EVX form instruction.  */
 #define EVUIMM_2 EVUIMM + 1
@@ -533,11 +534,11 @@
   /* The WS field.  */
 #define WS EVUIMM_8 + 1
 #define WS_MASK (0x7 << 11)
-  { 3, 11, 0, 0, 0 },
+  { 3, 11, NULL, NULL, 0 },
 
   /* The L field in an mtmsrd instruction */
 #define MTMSRD_L WS + 1
-  { 1, 16, 0, 0, PPC_OPERAND_OPTIONAL },
+  { 1, 16, NULL, NULL, PPC_OPERAND_OPTIONAL },
 
 };
 
Index: working-2.6/arch/ppc64/xmon/start.c
===================================================================
--- working-2.6.orig/arch/ppc64/xmon/start.c	2004-08-09 09:51:38.000000000 +1000
+++ working-2.6/arch/ppc64/xmon/start.c	2004-10-05 16:33:50.355028808 +1000
@@ -173,7 +173,7 @@
 		c = xmon_getchar();
 		if (c == -1) {
 			if (p == str)
-				return 0;
+				return NULL;
 			break;
 		}
 		*p++ = c;


-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From grave at ipno.in2p3.fr  Tue Oct  5 18:41:04 2004
From: grave at ipno.in2p3.fr (grave)
Date: Tue, 05 Oct 2004 08:41:04 +0000
Subject: xserve and 2.6.9-rc3 and 2.6.9-rc3-bk4
Message-ID: <1096965664l.7230l.0l@ipnnarval>

Hi,

I've tried both kernels and got crashes (see attached files).

Did I miss a patch?

xavier
PS: 2.6.6 + smp patch runs fine
-------------- next part --------------
PCI: Probing PCI hardware done
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
nvram_init: Could not find nvram partition for nvram buffered error logging.
rtasd: no RTAS on system
devfs: 2004-01-31 Richard Gooch (rgooch at atnf.csiro.au)
devfs: boot_options: 0x0
Oops: Machine check, sig: 0 [#1]
SMP NR_CPUS=2 POWERMAC
NIP: C00000000014A640 XER: 0000000000000000 LR: C00000000014A614
REGS: c000000001a17a50 TRAP: 0200   Not tainted  (2.6.9-rc3-bk4)
MSR: 9000000000101032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK: c000000001a110c0[1] 'swapper' THREAD: c000000001a14000 CPU: 0
GPR00: FFFFFFFFFFFFFFFF C000000001A17CD0 C00000000043C390 00000000000000FF
GPR04: C00000000FEFB400 0000000000000010 C0000000002AAAA0 C000000000468298
GPR08: C000000000468268 E0000000828CD000 C00000000045AD5C 9000000000009032
GPR12: 0000000028000042 C000000000355780 0000000000000000 0000000000000000
GPR16: 0000000001400000 00000000016FB720 00000000016FB720 BFFFFFFFFEC00000
GPR20: 000000000023FD58 0000000000000000 0000000001A6A020 00000000016FB998
GPR24: 9000000000009032 0000000000000032 C00000000043F730 C000000000352D58
GPR28: C00000000043F728 0000000000000000 C0000000003D5718 C00000000043F730
NIP [c00000000014a640] .i8042_flush+0x6c/0x15c
LR [c00000000014a614] .i8042_flush+0x40/0x15c
Call Trace:
[c000000001a17cd0] [c000000000355780] 0xc000000000355780 (unreliable)
[c000000001a17d80] [c00000000014b240] .i8042_controller_init+0x1c/0x1e4
[c000000001a17e10] [c0000000002f4164] .i8042_init+0xe8/0x64c
[c000000001a17ef0] [c00000000000c688] .init+0x234/0x440
[c000000001a17f90] [c0000000000172b8] .kernel_thread+0x4c/0x6c
 <0>Kernel panic - not syncing: Attempted to kill init!
 <0>Rebooting in 180 seconds..
-------------- next part --------------
PCI: Probing PCI hardware done
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
nvram_init: Could not find nvram partition for nvram buffered error logging.
rtasd: no RTAS on system
devfs: 2004-01-31 Richard Gooch (rgooch at atnf.csiro.au)
devfs: boot_options: 0x0
Oops: Machine check, sig: 0 [#1]
SMP NR_CPUS=2 POWERMAC
NIP: C00000000014A244 XER: 0000000000000000 LR: C00000000014A218
REGS: c000000001a17a50 TRAP: 0200   Not tainted  (2.6.9-rc3)
MSR: 9000000000101032 EE: 0 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK: c000000001a110c0[1] 'swapper' THREAD: c000000001a14000 CPU: 0
GPR00: FFFFFFFFFFFFFFFF C000000001A17CD0 C0000000004383A8 00000000000000FF
GPR04: C00000000FEED3C0 0000000000000010 C0000000002A7A48 C000000000464298
GPR08: C000000000464268 E0000000828CD000 C000000000456D64 9000000000009032
GPR12: 0000000028000042 C000000000351780 0000000000000000 0000000000000000
GPR16: 0000000001400000 00000000016F8720 00000000016F8720 BFFFFFFFFEC00000
GPR20: 000000000023FD58 0000000000000000 0000000001A66020 00000000016F8998
GPR24: 9000000000009032 0000000000000032 C00000000043B730 C00000000034ED58
GPR28: C00000000043B728 0000000000000000 C0000000003D1728 C00000000043B730
NIP [c00000000014a244] .i8042_flush+0x6c/0x15c
LR [c00000000014a218] .i8042_flush+0x40/0x15c
Call Trace:
[c000000001a17cd0] [c000000000351780] 0xc000000000351780 (unreliable)
[c000000001a17d80] [c00000000014ae44] .i8042_controller_init+0x1c/0x1e4
[c000000001a17e10] [c0000000002f1164] .i8042_init+0xe8/0x64c
[c000000001a17ef0] [c00000000000c688] .init+0x234/0x440
[c000000001a17f90] [c0000000000172b8] .kernel_thread+0x4c/0x6c
 <0>Kernel panic - not syncing: Attempted to kill init!
 <0>Rebooting in 180 seconds..

From benh at kernel.crashing.org  Tue Oct  5 18:46:43 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 05 Oct 2004 18:46:43 +1000
Subject: xserve and 2.6.9-rc3 and 2.6.9-rc3-bk4
In-Reply-To: <1096965664l.7230l.0l@ipnnarval>
References: <1096965664l.7230l.0l@ipnnarval>
Message-ID: <1096966003.24535.48.camel@gaston>

On Tue, 2004-10-05 at 18:41, grave wrote:
> Hi,
> 
> I've tried both kernels and got crashes (see attached files).
> 
> Did I miss a patch?
> 
> xavier
> PS: 2.6.6 + smp patch runs fine

That's your .config

You have enabled the legacy x86 keyboard support ! :)

Use a g5_defconfig

I'm working on a fix so that this driver stops crashing though.

Ben.


From grave at ipno.in2p3.fr  Tue Oct  5 19:25:20 2004
From: grave at ipno.in2p3.fr (grave)
Date: Tue, 05 Oct 2004 09:25:20 +0000
Subject: Re: xserve and 2.6.9-rc3 and 2.6.9-rc3-bk4
In-Reply-To: <1096966003.24535.48.camel@gaston> (from
	benh@kernel.crashing.org on Tue Oct  5 10:46:43 2004)
References: <1096965664l.7230l.0l@ipnnarval> <1096966003.24535.48.camel@gaston>
Message-ID: <1096968320l.7230l.2l@ipnnarval>

On 05.10.2004 10:46:43, Benjamin Herrenschmidt wrote:
> On Tue, 2004-10-05 at 18:41, grave wrote:
> > Hi,
> >
> > I've tried both kernels and got crashes (see attached files).
> >
> > Did I miss a patch?
> >
> > xavier
> > PS: 2.6.6 + smp patch runs fine
> 
> That's your .config
> 
> You have enabled the legacy x86 keyboard support ! :)
> 
> Use a g5_defconfig

It works now... Thanks one more time !


From schwab at suse.de  Tue Oct  5 19:50:44 2004
From: schwab at suse.de (Andreas Schwab)
Date: Tue, 05 Oct 2004 11:50:44 +0200
Subject: Sound on G5
In-Reply-To: <1096939502.24584.6.camel@gaston> (Benjamin Herrenschmidt's
	message of "Tue, 05 Oct 2004 11:25:03 +1000")
References: <jebrfi1a1g.fsf@sykes.suse.de> <1096939502.24584.6.camel@gaston>
Message-ID: <je8yala3yj.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> So we could go that way and complete my "interpreter" or just hard code
> all of the GPIOs we need in the driver hoping apple don't shuffle them
> too much in upcoming models...

I would be happy to test anything that is available.

Thanks, Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From paulus at samba.org  Tue Oct  5 20:38:52 2004
From: paulus at samba.org (Paul Mackerras)
Date: Tue, 5 Oct 2004 20:38:52 +1000
Subject: [PPC64] xmon sparse cleanups
In-Reply-To: <20041005064255.GF3695@zax>
References: <20041005064255.GF3695@zax>
Message-ID: <16738.31164.464250.638432@cargo.ozlabs.ibm.com>

David Gibson writes:

> Andrew, please apply:
> 
> This patch removes many sparse warnings from the xmon code.  Mostly
> K&R function declarations and 0-instead-of-NULLs.

The trouble with this patch is that it makes ppc-opc.c diverge from
the version in binutils, which is where it came from.  I'd rather keep
it as close as possible to that version.  I have no problem with the
changes to the other files.
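
For readers who have not seen the two kinds of change being discussed, here is a minimal sketch. It assumes a simplified operand record with two function-pointer slots, loosely modelled on the table entries in the patch; the names example_operand, example_insert_nop and example_is_plain_field are hypothetical and do not come from binutils or the kernel.

```c
#include <stddef.h>

/* Hypothetical, simplified operand record: two sizes, two hooks, flags. */
struct example_operand {
	int bits;
	int shift;
	unsigned long (*insert)(unsigned long insn, long value);
	long (*extract)(unsigned long insn);
	unsigned long flags;
};

/*
 * Prototype-style definition: parameter types live in the parameter
 * list. The K&R spelling would name the parameters in the parentheses
 * and declare their types on separate lines before the opening brace.
 */
static unsigned long example_insert_nop(unsigned long insn, long value)
{
	(void)value;	/* unused in this illustration */
	return insn;
}

/* Pointer slots are written as NULL rather than a bare 0. */
static const struct example_operand example_operands[] = {
	{ 0, 0, NULL, NULL, 0 },
	{ 5, 16, NULL, NULL, 0 },
	{ 5, 16, example_insert_nop, NULL, 0 },
};

/* Returns 1 when the operand has no insert or extract hook. */
static int example_is_plain_field(const struct example_operand *op)
{
	return op->insert == NULL && op->extract == NULL;
}
```

Both spellings compile to the same table; the NULL form only makes it visible to tools such as sparse that the slot holds a pointer.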

Paul.


From igor at cs.wisc.edu  Wed Oct  6 03:46:53 2004
From: igor at cs.wisc.edu (Igor Grobman)
Date: Tue, 5 Oct 2004 12:46:53 -0500 (CDT)
Subject: mapping memory in 0xb space
In-Reply-To: <3337F539-14B0-11D9-AE7A-000A95A4DC02@kernel.crashing.org>
References: <20040929014017.GC5470@zax>
	<Pine.LNX.4.44.0409282214510.26110-100000@wotan.cs.wisc.edu>
	<20041001040325.GB12890@zax>
	<Pine.LNX.4.58.0410011113530.5565@wotan.cs.wisc.edu>
	<3337F539-14B0-11D9-AE7A-000A95A4DC02@kernel.crashing.org>
Message-ID: <Pine.LNX.4.58.0410051245580.9463@wotan.cs.wisc.edu>

On Sat, 2 Oct 2004, Segher Boessenkool wrote:

> > A question for the rest of you, who haven't been following this thread.
> > Is there publicly available documentation on the power4 extensions,
> > specifically the large page support, how it affects the HPT hashing,
> > and
> > the SLB, including the new instructions for maintaining it in software?
> > I haven't been able to find anything yet.
>
> http://www-106.ibm.com/developerworks/eserver/pdfs/archpub3.pdf
>
> has some info, don't know if that is enough for you -- nothing
> much POWER4 specific in there, but large pages are part of the
> architecture, so it does talk about the instructions to handle
> them etc.

Thanks, this is what I was looking for.

-Igor


From igor at cs.wisc.edu  Wed Oct  6 03:45:47 2004
From: igor at cs.wisc.edu (Igor Grobman)
Date: Tue, 5 Oct 2004 12:45:47 -0500 (CDT)
Subject: mapping memory in 0xb space
In-Reply-To: <20041001040325.GB12890@zax>
References: <20040929014017.GC5470@zax>
	<Pine.LNX.4.44.0409282214510.26110-100000@wotan.cs.wisc.edu>
	<20041001040325.GB12890@zax>
Message-ID: <Pine.LNX.4.58.0410051232410.9463@wotan.cs.wisc.edu>

One more followup on this issue, since I do have the base code working
now.  The problem was that the do_slb_bolted code sets the large page
bit in the SLB entry, but my code (and particularly the hpte_insert code)
did not insert a proper large page mapping.


On Fri, 1 Oct 2004, David Gibson wrote:
> On Wed, Sep 29, 2004 at 12:14:08AM -0500, Igor Grobman wrote:
> > On Wed, 29 Sep 2004, David Gibson wrote:
> >
> > > On Tue, Sep 28, 2004 at 01:52:16PM -0500, Igor Grobman wrote:
> > > > On Tue, 28 Sep 2004, David Gibson wrote:
> > As for why I thought 0xbff would work,  I reasoned that
> > since the highest bits are masked out in get_kernel_vsid(), and since
> > nobody else is using the 0xb region, it doesn't matter if I get a VSID
> > that is the same as some other VSID in 0xb region.  However, I did not
> > consider the bug in do_slb_bolted that you describe below.
>
> Yes, with that bug the collision can be with a segment anywhere, not
> just in the 0xb region.
>

I am not convinced anymore.  The lower 36 bits of the ordinal are still
the same in do_slb_bolted and get_kernel_vsid.  Multiplying the ordinal
by the 36-bit randomizer should produce the same lower 36 bits whether or
not the upper bits are different.  do_slb_bolted eventually clears the
upper 28 bits, before using the VSID.  I no longer think there can be
a conflict outside the 0xb region.  Is my reasoning correct?
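
The arithmetic part of this argument can be checked in isolation. The sketch below assumes only that the VSID is the low 36 bits of (ordinal * randomizer); the multiplier is a placeholder chosen for illustration, not the kernel's actual constant, and the helper names are hypothetical. It says nothing about what do_slb_bolted really does with the upper bits.

```c
#include <stdint.h>

#define EXAMPLE_VSID_BITS 36
#define EXAMPLE_VSID_MASK ((UINT64_C(1) << EXAMPLE_VSID_BITS) - 1)

/* Placeholder 36-bit multiplier; NOT the real kernel randomizer. */
#define EXAMPLE_RANDOMIZER UINT64_C(0x9E3779B97)

/* Multiply modulo 2^64, then keep only the low 36 bits. */
static uint64_t example_vsid(uint64_t ordinal)
{
	return (ordinal * EXAMPLE_RANDOMIZER) & EXAMPLE_VSID_MASK;
}

/*
 * Returns 1 when an ordinal and the same ordinal with extra bits set
 * above bit 35 produce the same low 36 bits of product.
 */
static int example_low_bits_agree(uint64_t ordinal, uint64_t high_bits)
{
	uint64_t masked = ordinal & EXAMPLE_VSID_MASK;
	uint64_t unmasked = masked | (high_bits << EXAMPLE_VSID_BITS);

	return example_vsid(masked) == example_vsid(unmasked);
}
```

Because the extra bits are multiples of 2^36, their contribution to the product is also a multiple of 2^36 and vanishes under the mask, so example_low_bits_agree() returns 1 for any inputs.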


> > > You may have seen the comment in do_slb_bolted which claims to permit
> > > a full 32-bits of ESID - it's wrong.  The code doesn't mask the ESID
> > > down to 13 bits as get_kernel_vsid() does, but it probably should - an
> > > overlarge ESID will cause collisions with VSIDs from entirely
> > > different address places, which would be a Bad Thing.
> >
> > This must be happening, although I would still like to know why it
> > misbehaves even within the valid VSID range.
> >
> > >
> > > Actually, you should be able to allow ESIDs of up to 21 bits there (36
> > > bit VSID - 15 bits of "context").  But you will need to make sure
> > > get_kernel_vsid(), or whatever you're using to calculate the VAs for
> > > the hash HPTEs is updated to match - at the moment I think it will
> > > mask down to 13 bits.  I'm not sure if that will get you sufficiently
> > > close to 0xc0... for your purposes.
> >


Thanks,
Igor


From caveman at boxacle.net  Wed Oct  6 04:24:25 2004
From: caveman at boxacle.net (CAVEMAN)
Date: Tue, 5 Oct 2004 13:24:25 -0500
Subject: Sound on G5
In-Reply-To: <1096939502.24584.6.camel@gaston>
References: <jebrfi1a1g.fsf@sykes.suse.de> <1096939502.24584.6.camel@gaston>
Message-ID: <200410051324.25817@laptop>

On Monday 04 October 2004 20:25, Benjamin Herrenschmidt wrote:
> On Tue, 2004-10-05 at 06:51, Andreas Schwab wrote:
> > Is anyone already working on sound support for the PowerMac G5, by
> > chance? That's actually the only thing still missing.
>
> Nobody really seriously ATM. One of the main issues is that the darwin
> driver abuses apple "do-platform-*" shit. It's a mechanism they invented
> to put sort-of "scripts" (in binary form) in the device-tree that can
> contain elementary ops such as write GPIOs, I2C, etc...
>
> This is extremely messy and difficult to parse. I have written the
> basis for parsing them, but interpreting them is even more shitty as
> the actual implementation of each ops sort-of depends on the target
> object.
>
> It's really a piece-of-shit imho.
>
> So we could go that way and complete my "interpreter" or just hard code
> all of the GPIOs we need in the driver hoping apple don't shuffle them
> too much in upcoming models...

I'd be willing to do some work and/or testing on this, where can I get the 
code?

Regards,
caveman


From rmk+lkml at arm.linux.org.uk  Wed Oct  6 17:26:59 2004
From: rmk+lkml at arm.linux.org.uk (Russell King)
Date: Wed, 6 Oct 2004 08:26:59 +0100
Subject: [RFC][PATCH] Way for platforms to alter built-in serial ports
In-Reply-To: <1096534248.32721.36.camel@gaston>;
	from benh@kernel.crashing.org on Thu, Sep 30, 2004 at 06:50:48PM
	+1000
References: <1096534248.32721.36.camel@gaston>
Message-ID: <20041006082658.A18379@flint.arm.linux.org.uk>

On Thu, Sep 30, 2004 at 06:50:48PM +1000, Benjamin Herrenschmidt wrote:
> +#ifndef ARCH_HAS_GET_LEGACY_SERIAL_PORTS
>  static struct old_serial_port old_serial_port[] = {
>  	SERIAL_PORT_DFNS /* defined in asm/serial.h */
>  };
> -
> +static inline struct old_serial_port *get_legacy_serial_ports(unsigned int *count)
> +{
> +	*count = ARRAY_SIZE(old_serial_port);
> +	return old_serial_port;
> +}
>  #define UART_NR	(ARRAY_SIZE(old_serial_port) + CONFIG_SERIAL_8250_NR_UARTS)
> +#endif /* ARCH_HAS_GET_LEGACY_SERIAL_PORTS */
> +

What happens if 8250.c is built as a module and
ARCH_HAS_GET_LEGACY_SERIAL_PORTS is defined?

> diff -urN linux-2.5/include/linux/serial.h linux-maple/include/linux/serial.h
> --- linux-2.5/include/linux/serial.h	2004-09-30 18:31:55.867785437 +1000
> +++ linux-maple/include/linux/serial.h	2004-09-30 15:36:57.981697919 +1000
> @@ -14,6 +14,21 @@
>  #include <asm/page.h>
>  
>  /*
> + * Definition of a legacy serial port
> + */
> +struct old_serial_port {
> +	unsigned int uart;
> +	unsigned int baud_base;
> +	unsigned int port;
> +	unsigned int irq;
> +	unsigned int flags;
> +	unsigned char hub6;
> +	unsigned char io_type;
> +	unsigned char *iomem_base;
> +	unsigned short iomem_reg_shift;
> +};
> +
> +/*
>   * Counters of the input lines (CTS, DSR, RI, CD) interrupts
>   */

serial.h is used by userspace programs.  We should not expose this
structure to those programs.  Instead, maybe creating an 8250.h
header, or even moving the existing 8250.h header ?

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core


From benh at kernel.crashing.org  Wed Oct  6 18:15:11 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 06 Oct 2004 18:15:11 +1000
Subject: [RFC][PATCH] Way for platforms to alter built-in serial ports
In-Reply-To: <20041006082658.A18379@flint.arm.linux.org.uk>
References: <1096534248.32721.36.camel@gaston>
	<20041006082658.A18379@flint.arm.linux.org.uk>
Message-ID: <1097050508.21132.15.camel@gaston>

On Wed, 2004-10-06 at 17:26, Russell King wrote:
> On Thu, Sep 30, 2004 at 06:50:48PM +1000, Benjamin Herrenschmidt wrote:
> > +#ifndef ARCH_HAS_GET_LEGACY_SERIAL_PORTS
> >  static struct old_serial_port old_serial_port[] = {
> >  	SERIAL_PORT_DFNS /* defined in asm/serial.h */
> >  };
> > -
> > +static inline struct old_serial_port *get_legacy_serial_ports(unsigned int *count)
> > +{
> > +	*count = ARRAY_SIZE(old_serial_port);
> > +	return old_serial_port;
> > +}
> >  #define UART_NR	(ARRAY_SIZE(old_serial_port) + CONFIG_SERIAL_8250_NR_UARTS)
> > +#endif /* ARCH_HAS_GET_LEGACY_SERIAL_PORTS */
> > +
> 
> What happens if 8250.c is built as a module and
> ARCH_HAS_GET_LEGACY_SERIAL_PORTS is defined?

It will call get_legacy_serial_ports(), which is hopefully exported by
the arch code.
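
As a rough illustration of what such arch code could look like, here is a minimal sketch. The structure layout is copied from the patch; the table example_arch_ports and its contents are hypothetical, and in a real kernel the function would also have to be exported (for example with EXPORT_SYMBOL) before a modular 8250.c could link against it.

```c
/* Same layout as struct old_serial_port in the patch under discussion. */
struct old_serial_port {
	unsigned int uart;
	unsigned int baud_base;
	unsigned int port;
	unsigned int irq;
	unsigned int flags;
	unsigned char hub6;
	unsigned char io_type;
	unsigned char *iomem_base;
	unsigned short iomem_reg_shift;
};

/* Hypothetical table an architecture might build at boot time. */
static struct old_serial_port example_arch_ports[] = {
	{ .baud_base = 115200, .port = 0x3f8, .irq = 4 },
	{ .baud_base = 115200, .port = 0x2f8, .irq = 3 },
};

/*
 * Arch-side replacement for the inline default in 8250.c: report how
 * many legacy ports exist and hand back the table describing them.
 */
struct old_serial_port *get_legacy_serial_ports(unsigned int *count)
{
	*count = sizeof(example_arch_ports) / sizeof(example_arch_ports[0]);
	return example_arch_ports;
}
```

An architecture with no legacy ports could return NULL here, which the reworked patch later in this thread checks for before walking the table.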

> serial.h is used by userspace programs.  We should not expose this
> structure to those programs.  Instead, maybe creating an 8250.h
> header, or even moving the existing 8250.h header ?

Hrm... ok. Or adding a #ifdef __KERNEL__ (sic !) :)
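
For reference, the __KERNEL__ variant would look roughly like the sketch below. The header name, the struct example_serial_counters and the EXAMPLE_HAS_OLD_SERIAL_PORT macro are hypothetical; the macro exists only so the effect of the guard can be observed from a normal userspace build.

```c
#ifndef EXAMPLE_SERIAL_H
#define EXAMPLE_SERIAL_H

/* Visible to every includer, userspace programs included. */
struct example_serial_counters {
	int cts, dsr, rng, dcd;
};

#ifdef __KERNEL__
/* Only visible when the header is compiled as part of the kernel. */
struct old_serial_port {
	unsigned int uart;
	unsigned int baud_base;
	unsigned int port;
	unsigned int irq;
	unsigned int flags;
	unsigned char hub6;
	unsigned char io_type;
	unsigned char *iomem_base;
	unsigned short iomem_reg_shift;
};
#define EXAMPLE_HAS_OLD_SERIAL_PORT 1
#else
#define EXAMPLE_HAS_OLD_SERIAL_PORT 0
#endif /* __KERNEL__ */

#endif /* EXAMPLE_SERIAL_H */
```

A userspace build never defines __KERNEL__, so it sees the counters structure but not struct old_serial_port.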

I'll send you a new patch later today, as I had to do another fix: we
tend to "force" register_console() even when we have nothing to
register, because we set "ops" on all ports, even those that were
never configured, and we test "ops" to decide whether to succeed or
fail in the console setup() callback.

Ben.


From benh at kernel.crashing.org  Wed Oct  6 19:07:44 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 06 Oct 2004 19:07:44 +1000
Subject: [RFC][PATCH] Way for platforms to alter built-in serial ports
In-Reply-To: <20041006082658.A18379@flint.arm.linux.org.uk>
References: <1096534248.32721.36.camel@gaston>
	<20041006082658.A18379@flint.arm.linux.org.uk>
Message-ID: <1097053663.21132.56.camel@gaston>

On Wed, 2004-10-06 at 17:26, Russell King wrote:

> serial.h is used by userspace programs.  We should not expose this
> structure to those programs.  Instead, maybe creating an 8250.h
> header, or even moving the existing 8250.h header ?

Here's a new version of that patch that moves 8250.h to
include/linux, moves the definition of struct old_serial_port there,
and also corrects the problem I told you about with serial console.

Let me know if I can send it to Andrew...

Ben.

diff -urN linux-2.5/drivers/serial/8250.c linux-maple/drivers/serial/8250.c
--- linux-2.5/drivers/serial/8250.c	2004-09-30 18:31:42.000000000 +1000
+++ linux-maple/drivers/serial/8250.c	2004-10-06 19:05:13.042342513 +1000
@@ -41,7 +41,7 @@
 #endif
 
 #include <linux/serial_core.h>
-#include "8250.h"
+#include <linux/8250.h>
 
 /*
  * Configuration:
@@ -112,11 +112,18 @@
 #define SERIAL_PORT_DFNS
 #endif
 
+#ifndef ARCH_HAS_GET_LEGACY_SERIAL_PORTS
 static struct old_serial_port old_serial_port[] = {
 	SERIAL_PORT_DFNS /* defined in asm/serial.h */
 };
-
+static inline struct old_serial_port *get_legacy_serial_ports(unsigned int *count)
+{
+	*count = ARRAY_SIZE(old_serial_port);
+	return old_serial_port;
+}
 #define UART_NR	(ARRAY_SIZE(old_serial_port) + CONFIG_SERIAL_8250_NR_UARTS)
+#endif /* ARCH_HAS_GET_LEGACY_SERIAL_PORTS */
+
 
 #ifdef CONFIG_SERIAL_8250_RSA
 
@@ -1839,22 +1846,28 @@
 {
 	struct uart_8250_port *up;
 	static int first = 1;
+	struct old_serial_port *old_ports;
+	unsigned int count;
 	int i;
 
 	if (!first)
 		return;
 	first = 0;
 
-	for (i = 0, up = serial8250_ports; i < ARRAY_SIZE(old_serial_port);
+	old_ports = get_legacy_serial_ports(&count);
+	if (old_ports == NULL)
+		return;
+
+	for (i = 0, up = serial8250_ports; i < count;
 	     i++, up++) {
-		up->port.iobase   = old_serial_port[i].port;
-		up->port.irq      = irq_canonicalize(old_serial_port[i].irq);
-		up->port.uartclk  = old_serial_port[i].baud_base * 16;
-		up->port.flags    = old_serial_port[i].flags;
-		up->port.hub6     = old_serial_port[i].hub6;
-		up->port.membase  = old_serial_port[i].iomem_base;
-		up->port.iotype   = old_serial_port[i].io_type;
-		up->port.regshift = old_serial_port[i].iomem_reg_shift;
+		up->port.iobase   = old_ports[i].port;
+		up->port.irq      = irq_canonicalize(old_ports[i].irq);
+		up->port.uartclk  = old_ports[i].baud_base * 16;
+		up->port.flags    = old_ports[i].flags;
+		up->port.hub6     = old_ports[i].hub6;
+		up->port.membase  = old_ports[i].iomem_base;
+		up->port.iotype   = old_ports[i].io_type;
+		up->port.regshift = old_ports[i].iomem_reg_shift;
 		up->port.ops      = &serial8250_pops;
 		if (share_irqs)
 			up->port.flags |= UPF_SHARE_IRQ;
@@ -1870,6 +1883,9 @@
 	for (i = 0; i < UART_NR; i++) {
 		struct uart_8250_port *up = &serial8250_ports[i];
 
+		if (!up->port.iobase)
+			continue;
+
 		up->port.line = i;
 		up->port.ops = &serial8250_pops;
 		init_timer(&up->timer);
diff -urN linux-2.5/drivers/serial/8250.h linux-maple/drivers/serial/8250.h
--- linux-2.5/drivers/serial/8250.h	2004-09-30 18:31:42.000000000 +1000
+++ /dev/null	2004-10-05 22:10:47.391719208 +1000
@@ -1,71 +0,0 @@
-/*
- *  linux/drivers/char/8250.h
- *
- *  Driver for 8250/16550-type serial ports
- *
- *  Based on drivers/char/serial.c, by Linus Torvalds, Theodore Ts'o.
- *
- *  Copyright (C) 2001 Russell King.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- *  $Id: 8250.h,v 1.8 2002/07/21 21:32:30 rmk Exp $
- */
-
-#include <linux/config.h>
-
-void serial8250_get_irq_map(unsigned int *map);
-void serial8250_suspend_port(int line);
-void serial8250_resume_port(int line);
-
-struct old_serial_port {
-	unsigned int uart;
-	unsigned int baud_base;
-	unsigned int port;
-	unsigned int irq;
-	unsigned int flags;
-	unsigned char hub6;
-	unsigned char io_type;
-	unsigned char *iomem_base;
-	unsigned short iomem_reg_shift;
-};
-
-/*
- * This replaces serial_uart_config in include/linux/serial.h
- */
-struct serial8250_config {
-	const char	*name;
-	unsigned int	fifo_size;
-	unsigned int	tx_loadsz;
-	unsigned int	flags;
-};
-
-#define UART_CAP_FIFO	(1 << 8)	/* UART has FIFO */
-#define UART_CAP_EFR	(1 << 9)	/* UART has EFR */
-#define UART_CAP_SLEEP	(1 << 10)	/* UART has IER sleep */
-
-#undef SERIAL_DEBUG_PCI
-
-#if defined(__i386__) && (defined(CONFIG_M386) || defined(CONFIG_M486))
-#define SERIAL_INLINE
-#endif
-  
-#ifdef SERIAL_INLINE
-#define _INLINE_ inline
-#else
-#define _INLINE_
-#endif
-
-#define PROBE_RSA	(1 << 0)
-#define PROBE_ANY	(~0)
-
-#define HIGH_BITS_OFFSET ((sizeof(long)-sizeof(int))*8)
-
-#ifdef CONFIG_SERIAL_8250_SHARE_IRQ
-#define SERIAL8250_SHARE_IRQS 1
-#else
-#define SERIAL8250_SHARE_IRQS 0
-#endif
diff -urN linux-2.5/drivers/serial/8250_pci.c linux-maple/drivers/serial/8250_pci.c
--- linux-2.5/drivers/serial/8250_pci.c	2004-09-30 18:31:42.000000000 +1000
+++ linux-maple/drivers/serial/8250_pci.c	2004-10-06 19:05:41.301674308 +1000
@@ -25,13 +25,12 @@
 #include <linux/serial.h>
 #include <linux/serial_core.h>
 #include <linux/8250_pci.h>
+#include <linux/8250.h>
 
 #include <asm/bitops.h>
 #include <asm/byteorder.h>
 #include <asm/io.h>
 
-#include "8250.h"
-
 /*
  * Definitions for PCI support.
  */
diff -urN linux-2.5/drivers/serial/8250_pnp.c linux-maple/drivers/serial/8250_pnp.c
--- linux-2.5/drivers/serial/8250_pnp.c	2004-09-30 18:31:42.000000000 +1000
+++ linux-maple/drivers/serial/8250_pnp.c	2004-10-06 19:05:55.788749883 +1000
@@ -25,13 +25,12 @@
 #include <linux/serial.h>
 #include <linux/serialP.h>
 #include <linux/serial_core.h>
+#include <linux/8250.h>
 
 #include <asm/bitops.h>
 #include <asm/byteorder.h>
 #include <asm/serial.h>
 
-#include "8250.h"
-
 #define UNKNOWN_DEV 0x3000
 
 
diff -urN linux-2.5/drivers/serial/au1x00_uart.c linux-maple/drivers/serial/au1x00_uart.c
--- linux-2.5/drivers/serial/au1x00_uart.c	2004-09-30 18:31:42.000000000 +1000
+++ linux-maple/drivers/serial/au1x00_uart.c	2004-10-06 19:07:39.461032916 +1000
@@ -40,7 +40,7 @@
 #endif
 
 #include <linux/serial_core.h>
-#include "8250.h"
+#include <linux/8250.h>
 
 /*
  * Debugging.
diff -urN linux-2.5/drivers/serial/serial_cs.c linux-maple/drivers/serial/serial_cs.c
--- linux-2.5/drivers/serial/serial_cs.c	2004-09-30 18:31:42.000000000 +1000
+++ linux-maple/drivers/serial/serial_cs.c	2004-10-06 19:07:35.059700476 +1000
@@ -44,6 +44,7 @@
 #include <linux/serial.h>
 #include <linux/serial_core.h>
 #include <linux/major.h>
+#include <linux/8250.h>
 #include <asm/io.h>
 #include <asm/system.h>
 
@@ -55,8 +56,6 @@
 #include <pcmcia/ds.h>
 #include <pcmcia/cisreg.h>
 
-#include "8250.h"
-
 #ifdef PCMCIA_DEBUG
 static int pc_debug = PCMCIA_DEBUG;
 MODULE_PARM(pc_debug, "i");
diff -urN linux-2.5/include/linux/8250.h linux-maple/include/linux/8250.h
--- /dev/null	2004-10-05 22:10:47.391719208 +1000
+++ linux-maple/include/linux/8250.h	2004-10-06 19:06:45.680713598 +1000
@@ -0,0 +1,74 @@
+/*
+ *  linux/drivers/char/8250.h
+ *
+ *  Driver for 8250/16550-type serial ports
+ *
+ *  Based on drivers/char/serial.c, by Linus Torvalds, Theodore Ts'o.
+ *
+ *  Copyright (C) 2001 Russell King.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ *  $Id: 8250.h,v 1.8 2002/07/21 21:32:30 rmk Exp $
+ */
+
+#include <linux/config.h>
+
+void serial8250_get_irq_map(unsigned int *map);
+void serial8250_suspend_port(int line);
+void serial8250_resume_port(int line);
+
+/*
+ * This replaces serial_uart_config in include/linux/serial.h
+ */
+struct serial8250_config {
+	const char	*name;
+	unsigned int	fifo_size;
+	unsigned int	tx_loadsz;
+	unsigned int	flags;
+};
+
+#define UART_CAP_FIFO	(1 << 8)	/* UART has FIFO */
+#define UART_CAP_EFR	(1 << 9)	/* UART has EFR */
+#define UART_CAP_SLEEP	(1 << 10)	/* UART has IER sleep */
+
+/*
+ * Definition of a legacy serial port
+ */
+struct old_serial_port {
+	unsigned int uart;
+	unsigned int baud_base;
+	unsigned int port;
+	unsigned int irq;
+	unsigned int flags;
+	unsigned char hub6;
+	unsigned char io_type;
+	unsigned char *iomem_base;
+	unsigned short iomem_reg_shift;
+};
+
+#undef SERIAL_DEBUG_PCI
+
+#if defined(__i386__) && (defined(CONFIG_M386) || defined(CONFIG_M486))
+#define SERIAL_INLINE
+#endif
+  
+#ifdef SERIAL_INLINE
+#define _INLINE_ inline
+#else
+#define _INLINE_
+#endif
+
+#define PROBE_RSA	(1 << 0)
+#define PROBE_ANY	(~0)
+
+#define HIGH_BITS_OFFSET ((sizeof(long)-sizeof(int))*8)
+
+#ifdef CONFIG_SERIAL_8250_SHARE_IRQ
+#define SERIAL8250_SHARE_IRQS 1
+#else
+#define SERIAL8250_SHARE_IRQS 0
+#endif
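
The hook above lets a platform replace the static table with ports it
discovers at runtime.  A minimal standalone sketch of what such an override
could look like follows.  It is not part of the patch: the table size
MAX_LEGACY_SERIAL_PORTS and the helper add_legacy_port() are hypothetical,
and the struct is copied from the patch only so the sketch compiles in
userspace.

```c
#include <assert.h>
#include <stddef.h>

/* Copied from the patch's include/linux/8250.h so the sketch is standalone. */
struct old_serial_port {
	unsigned int uart;
	unsigned int baud_base;
	unsigned int port;
	unsigned int irq;
	unsigned int flags;
	unsigned char hub6;
	unsigned char io_type;
	unsigned char *iomem_base;
	unsigned short iomem_reg_shift;
};

#define ARCH_HAS_GET_LEGACY_SERIAL_PORTS
#define MAX_LEGACY_SERIAL_PORTS 4	/* hypothetical limit */

static struct old_serial_port legacy_ports[MAX_LEGACY_SERIAL_PORTS];
static unsigned int legacy_port_count;

/* Hypothetical helper: record one port found while scanning firmware. */
static int add_legacy_port(unsigned int iobase, unsigned int irq,
			   unsigned int baud_base)
{
	struct old_serial_port *p;

	if (legacy_port_count >= MAX_LEGACY_SERIAL_PORTS)
		return -1;
	p = &legacy_ports[legacy_port_count++];
	p->port = iobase;
	p->irq = irq;
	p->baud_base = baud_base;
	return 0;
}

/* Arch-provided replacement for the generic inline in 8250.c. */
struct old_serial_port *get_legacy_serial_ports(unsigned int *count)
{
	assert(count != NULL);
	*count = legacy_port_count;
	/* NULL tells the caller there is nothing to register. */
	return legacy_port_count ? legacy_ports : NULL;
}
```

Returning NULL when no port was found matches the early return the patch
adds to the ISA port init loop.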


From clmason at gmail.com  Thu Oct  7 00:58:56 2004
From: clmason at gmail.com (Chris L. Mason)
Date: Wed, 6 Oct 2004 11:58:56 -0300
Subject: iMac G5 available for testing
Message-ID: <610e346604100607581144298e@mail.gmail.com>

Hi all,

I have a new iMac G5/1.8 GHz/17-inch system that I would like to make
available for testing/debugging.  If you have anything you would like
me to try booting, checking in open firmware, etc., let me know.

Thanks,


Chris


From benh at kernel.crashing.org  Thu Oct  7 08:36:09 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 07 Oct 2004 08:36:09 +1000
Subject: iMac G5 available for testing
In-Reply-To: <610e346604100607581144298e@mail.gmail.com>
References: <610e346604100607581144298e@mail.gmail.com>
Message-ID: <1097102169.8448.14.camel@gaston>

On Thu, 2004-10-07 at 00:58, Chris L. Mason wrote:
> Hi all,
> 
> I have a new iMac G5/1.8 GHz/17-inch system that I would like to make
> available for testing/debugging.  If you have anything you would like
> me to try booting, checking in open firmware, etc., let me know.

We have ordered one here. It will require some reverse engineering work
since it's a new rev of the chipset and the good old PMU chip was finally,
years later, replaced by a new "SMU" that is totally undocumented of course...

Ben.


From clmason at gmail.com  Thu Oct  7 09:12:04 2004
From: clmason at gmail.com (Chris L. Mason)
Date: Wed, 6 Oct 2004 20:12:04 -0300
Subject: iMac G5 available for testing
In-Reply-To: <1097102169.8448.14.camel@gaston>
References: <610e346604100607581144298e@mail.gmail.com>
	<1097102169.8448.14.camel@gaston>
Message-ID: <610e34660410061612379af1c8@mail.gmail.com>

On Thu, 07 Oct 2004 08:36:09 +1000, Benjamin Herrenschmidt
<benh at kernel.crashing.org> wrote:
> 
> 
> On Thu, 2004-10-07 at 00:58, Chris L. Mason wrote:
> > Hi all,
> >
> > I have a new iMac G5/1.8 GHz/17-inch system that I would like to make
> > available for testing/debugging.  If you have anything you would like
> > me to try booting, checking in open firmware, etc., let me know.
> 
> We have ordered one here. It will require some reverse engineering work
> since it's a new rev of the chipset and the good old PMU chip was finally,
> years later, replaced by a new "SMU" that is totally undocumented of course...
> 

Ah, wonderful.  :)  The good news is that with tgall's latest debug
kernel, I do at least get to boot as far as the ATA drive detection
before it freezes, although it also gets a kernel error right after the
Tux logo.  Here's an image of my boot attempt:

http://homepage.mac.com/clmason/imacboot.jpg

(Sorry for the bad quality of the image)

Segher also told me how to use the romgrabber.  I have a copy up at:

http://homepage.mac.com/clmason/OF-5.2.2f1-2004-08-18


Chris


From david at gibson.dropbear.id.au  Thu Oct  7 11:01:54 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 7 Oct 2004 11:01:54 +1000
Subject: mapping memory in 0xb space
In-Reply-To: <Pine.LNX.4.58.0410051232410.9463@wotan.cs.wisc.edu>
References: <20040929014017.GC5470@zax>
	<Pine.LNX.4.44.0409282214510.26110-100000@wotan.cs.wisc.edu>
	<20041001040325.GB12890@zax>
	<Pine.LNX.4.58.0410051232410.9463@wotan.cs.wisc.edu>
Message-ID: <20041007010154.GC25012@zax>

On Tue, Oct 05, 2004 at 12:45:47PM -0500, Igor Grobman wrote:
> One more followup on this issue, since I do have the base code working
> now.  The problem was in the fact that do_slb_bolted code sets the large
> page bit in the SLB entry, but my code (and particularly hpte_insert code)
> did not insert a proper large page mapping.
> 
> 
> On Fri, 1 Oct 2004, David Gibson wrote:
> > On Wed, Sep 29, 2004 at 12:14:08AM -0500, Igor Grobman wrote:
> > > On Wed, 29 Sep 2004, David Gibson wrote:
> > >
> > > > On Tue, Sep 28, 2004 at 01:52:16PM -0500, Igor Grobman wrote:
> > > > > On Tue, 28 Sep 2004, David Gibson wrote:
> > > As for why I thought 0xbff would work,  I reasoned that
> > > since the highest bits are masked out in get_kernel_vsid(), and since
> > > nobody else is using the 0xb region, it doesn't matter if I get a VSID
> > > that is the same as some other VSID in 0xb region.  However, I did not
> > > consider the bug in do_slb_bolted that you describe below.
> >
> > Yes, with that bug the collision can be with a segment anywhere, not
> > just in the 0xb region.
> >
> 
> I am not convinced anymore.  The lower 36 bits of the ordinal are still
> the same in do_slb_bolted and get_kernel_vsid.  Multiplying the ordinal
> by the 36-bit randomizer should produce the same lower 36 bits whether or
> not the upper bits are different.  do_slb_bolted eventually clears the
> upper 28 bits, before using the VSID.  I no longer think there can be
> a conflict outside the 0xb region.  Is my reasoning correct?

Ah, yes, I think it is.  Sorry, I guess I wasn't thinking very clearly
when I decided the collisions could be anywhere.
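
The arithmetic behind this can be checked with a small userspace sketch.
It assumes only what the thread states: the VSID is the low 36 bits of
(ordinal * randomizer).  The multiplier below is a hypothetical 36-bit
constant, not the kernel's actual randomizer, and the function names are
made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define VSID_BITS	36
#define VSID_MASK	((UINT64_C(1) << VSID_BITS) - 1)
/* Hypothetical odd 36-bit multiplier, NOT the kernel's actual randomizer. */
#define RANDOMIZER	UINT64_C(0x9E3779B97)

/* Multiply the ordinal by the randomizer and keep only the low 36 bits. */
static uint64_t scramble(uint64_t ordinal)
{
	return (ordinal * RANDOMIZER) & VSID_MASK;
}

/* Return 1 if two ordinals that share their low 36 bits scramble alike. */
static int same_vsid(uint64_t low36, uint64_t high_a, uint64_t high_b)
{
	uint64_t a, b;

	assert((low36 & ~VSID_MASK) == 0);
	a = (high_a << VSID_BITS) | low36;
	b = (high_b << VSID_BITS) | low36;
	return scramble(a) == scramble(b);
}
```

Since (h * 2^36 + l) * R is congruent to l * R modulo 2^36, same_vsid()
returns 1 for any choice of upper bits, which is the point being agreed on
here.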

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From benh at kernel.crashing.org  Thu Oct  7 18:30:09 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 07 Oct 2004 18:30:09 +1000
Subject: Gothic horrors in pci_dn.c
Message-ID: <1097137808.4894.73.camel@gaston>

Hi !

To all those who had to deal with the guts of the PCI layer on ppc64,
I'd like your comments on the issues below and on what you think I may break.

Currently, the code in pci_dn.c does 2 things articulated around a
single function. That function is traverse_pci_devices() and is supposed
to traverse the PCI tree exposed by Open Firmware and call back a
function passed as an argument for each node in the tree.

The 2 things it does are

 - Setting up the "devfn" and "busno" fields of the device nodes in the
tree in an initial traversal pass at boot
 - "Finding" a device node for a given pci_dev at any time

However, the current code makes a number of assumptions and is bogus in
many cases. Among the issues are:

 - The tree traversal goes all the way down the tree only skipping
things that don't have a "class code". That means potentially walking on
subtrees of a PCI device that aren't PCI themselves (USB ? FireWire ?)
and we have no guarantee that those busses have no "class-code"
property, though we are sure to misinterpret anything we find here.

 -  We try to manipulate host bridge nodes as if they were PCI devices,
which leads us to various funny and totally bogus special cases. First,
in update_dn_pci_info(), we have an "interesting" (at least)
heuristic to find out if a node is a host bridge or not, with a
horrible special case to avoid setting devfn 0 on U3 on blades.
Then we "use" the host bridge's devfn and busno in
is_devfn_node() later on when trying to match, which is why we have to do
the above bogus workaround.

 - Our firmware (and Apple's too in some cases) is broken in the sense
that it doesn't show the host bridge in the tree as a PCI device. Host
bridges that are themselves visible as devices on their own PCI bus
should have an additional node in the PCI domain named "host" that
represents them.

The solution to this however is very simple, but I need to make sure I
won't break anything else by doing so. It's based on a few facts:

 - The "node" of the host bridge is _NOT_ a PCI node, and thus should
not be traversed by traverse_pci_devices(). This is very easy to do
without any assumption due to the way this function works, just remove 2
lines near the beginning before the for loop.

 - The result for the update_dn_pci_info() pass is that we can rip out
the workaround completely. busno and devfn in the host bridge node are
undefined, and that's how they should be, as the node won't be traversed.
There is no "driver" for the host bridge that should make use of them.

 - Same thing with is_devfn_node().

 - We initialize "sysdata" of all pci_dev to point to the host bridge
by default. So if the host bridge happens to have an associated pci_dev,
and no "specific" node (as explained above), then we'll point to the
root node of that pci tree which is exactly what we want, cool !

 - Now the only remaining problem is the test 

	if (dn->devfn == dev->devfn && dn->busno == (dev->bus->number&0xff))

which will give an incorrect result if the host bridge has undefined
(and typically 0) values in devfn and busno fields and the device we are
looking for happens to really be 0:00.0. This is fixed by forcing those
fields on all PHB nodes to -1. (No special U3 case, all of them).
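
A minimal userspace sketch of that last point follows.  The struct and
function names are hypothetical stand-ins for the busno/devfn fields of the
device node and for the comparison quoted above; plain ints are used on
both sides, so it illustrates the idea rather than reproducing the kernel
types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the busno/devfn fields of struct device_node. */
struct node {
	int busno;
	int devfn;
};

/* Same comparison as the test quoted above, with plain ints as arguments. */
static int node_matches(const struct node *dn, int bus_number, int devfn)
{
	assert(dn != NULL);
	return dn->devfn == devfn && dn->busno == (bus_number & 0xff);
}
```

With both fields left at 0, node_matches() is true for device 0:00.0, so a
PHB node could be returned by mistake.  With both set to -1 it cannot match
any bus number or devfn in the range 0 to 255.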

Here's a patch (untested, it's getting late here) implementing these
changes; I need to know if it will work at all. Comments welcome :)

Note to Anton & Milton: Pretty much nothing relies anymore on the device
nodes for PCI devices to exist. The only mandatory ones are PHBs, but you
can easily statically lay them out in a static device-tree blob for BM.
By default, all pci_dev point to the PHB. I have a couple of fixes coming
in for u3_iommu to properly setup iommu_table for PHB nodes (it forgot to
do it) and I confirm it works with no OF nodes for the devices themselves.
Config space accesses never need the OF node either, except when you have
RTAS, but then you don't care since you have real nodes for everything.
I added a simple helper to my tree (will be pushed after 2.6.9) that gives
you the pci_controller* from the pci_dev* without doing a full device-tree
walk, and I use that for pmac & maple. You should do the same for PM.

Ben.

===== arch/ppc64/kernel/pci_dn.c 1.18 vs edited =====
--- 1.18/arch/ppc64/kernel/pci_dn.c	2004-10-05 17:24:47 +10:00
+++ edited/arch/ppc64/kernel/pci_dn.c	2004-10-07 18:35:41 +10:00
@@ -46,28 +46,13 @@
 {
 	struct pci_controller *phb = data;
 	u32 *regs;
-	char *device_type = get_property(dn, "device_type", NULL);
-	char *model;
 
 	dn->phb = phb;
-	if (device_type && (strcmp(device_type, "pci") == 0) &&
-			(get_property(dn, "class-code", NULL) == 0)) {
-		/* special case for PHB's.  Sigh. */
-		regs = (u32 *)get_property(dn, "bus-range", NULL);
-		dn->busno = regs[0];
-
-		model = (char *)get_property(dn, "model", NULL);
-		if (model && strstr(model, "U3"))
-			dn->devfn = -1;
-		else
-			dn->devfn = 0;	/* assumption */
-	} else {
-		regs = (u32 *)get_property(dn, "reg", NULL);
-		if (regs) {
-			/* First register entry is addr (00BBSS00)  */
-			dn->busno = (regs[0] >> 16) & 0xff;
-			dn->devfn = (regs[0] >> 8) & 0xff;
-		}
+	regs = (u32 *)get_property(dn, "reg", NULL);
+	if (regs) {
+		/* First register entry is addr (00BBSS00)  */
+		dn->busno = (regs[0] >> 16) & 0xff;
+		dn->devfn = (regs[0] >> 8) & 0xff;
 	}
 	return NULL;
 }
@@ -96,20 +81,25 @@
 	struct device_node *dn, *nextdn;
 	void *ret;
 
-	if (pre && ((ret = pre(start, data)) != NULL))
-		return ret;
+	/* We started with a phb, iterate over all children */
 	for (dn = start->child; dn; dn = nextdn) {
+		u32 *classp, class;
+
 		nextdn = NULL;
-		if (get_property(dn, "class-code", NULL)) {
-			if (pre && ((ret = pre(dn, data)) != NULL))
-				return ret;
-			if (dn->child)
-				/* Depth first...do children */
-				nextdn = dn->child;
-			else if (dn->sibling)
-				/* ok, try next sibling instead. */
-				nextdn = dn->sibling;
-		}
+		classp = (u32 *)get_property(dn, "class-code", NULL);
+		class = classp ? *classp : 0;
+
+		if (pre && ((ret = pre(dn, data)) != NULL))
+			return ret;
+
+		/* If we are a PCI bridge, go down */
+		if (dn->child && (class >> 8) == PCI_CLASS_BRIDGE_PCI &&
+		    (class >> 8) == PCI_CLASS_BRIDGE_CARDBUS)
+			/* Depth first...do children */
+			nextdn = dn->child;
+		else if (dn->sibling)
+			/* ok, try next sibling instead. */
+			nextdn = dn->sibling;
 		if (!nextdn) {
 			/* Walk up to next valid sibling. */
 			do {
@@ -123,21 +113,6 @@
 	return NULL;
 }
 
-/*
- * Same as traverse_pci_devices except this does it for all phbs.
- */
-static void *traverse_all_pci_devices(traverse_func pre)
-{
-	struct pci_controller *phb, *tmp;
-	void *ret;
-
-	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
-		if ((ret = traverse_pci_devices(phb->arch_data, pre, phb))
-				!= NULL)
-			return ret;
-	return NULL;
-}
-
 
 /*
  * Traversal func that looks for a <busno,devfcn> value.
@@ -147,6 +122,7 @@
 {
 	int busno = ((unsigned long)data >> 8) & 0xff;
 	int devfn = ((unsigned long)data) & 0xff;
+
 	return ((devfn == dn->devfn) && (busno == dn->busno)) ? dn : NULL;
 }
 
@@ -173,10 +149,8 @@
 
 	phb_dn = phb->arch_data;
 	dn = traverse_pci_devices(phb_dn, is_devfn_node, (void *)searchval);
-	if (dn) {
+	if (dn)
 		dev->sysdata = dn;
-		/* ToDo: call some device init hook here */
-	}
 	return dn;
 }
 EXPORT_SYMBOL(fetch_dev_dn);
@@ -188,8 +162,16 @@
  */
 void __init pci_devs_phb_init(void)
 {
+	struct pci_controller *phb, *tmp;
+
 	/* This must be done first so the device nodes have valid pci info! */
-	traverse_all_pci_devices(update_dn_pci_info);
+	list_for_each_entry_safe(phb, tmp, &hose_list, list_node) {
+		struct device_node * dn = (struct device_node *) phb->arch_data;
+		/* PHB nodes themselves must not match */
+		dn->devfn = dn->busno = -1;
+		dn->phb = phb;
+		traverse_pci_devices(phb->arch_data, update_dn_pci_info, phb);
+	}
 }
 
 
From hollisb at us.ibm.com  Thu Oct  7 20:40:27 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Thu, 7 Oct 2004 10:40:27 +0000
Subject: [patch] HVSI udbg
Message-ID: <200410071040.27907.hollisb@us.ibm.com>

This fixes a long-standing omission in HVSI support: dropping to xmon would 
basically hang your system, as there was no udbg code to read/write chars 
from xmon. It's based on the existing "LP" routines.
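
The framing the patch relies on can be sketched in plain C.  The layout
below is inferred from the patch itself (type byte 0xff, a length byte that
counts the header, two unused sequence-number bytes, then data); the
function and macro names are hypothetical and this is not the driver's
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HVSI_DATA_TYPE	0xff	/* first header byte of a data packet */
#define HVSI_HDR_LEN	4	/* type, length, two sequence-number bytes */

/*
 * Build a data packet around 'len' payload bytes.  Returns the total packet
 * length, or 0 if the packet would not fit in 'out' or in the length byte.
 */
static size_t hvsi_build_data(uint8_t *out, size_t outsz,
			      const uint8_t *payload, size_t len)
{
	size_t total = HVSI_HDR_LEN + len;

	assert(out != NULL && payload != NULL);
	if (total > outsz || total > 0xff)
		return 0;
	out[0] = HVSI_DATA_TYPE;
	out[1] = (uint8_t)total;	/* length counts the header too */
	out[2] = 0;			/* seqno left at zero, as in the patch */
	out[3] = 0;
	memcpy(out + HVSI_HDR_LEN, payload, len);
	return total;
}

/*
 * Strip the header in place.  Returns the payload length, or 0 for a short
 * read or a non-data packet.
 */
static size_t hvsi_strip_header(uint8_t *buf, size_t len)
{
	assert(buf != NULL);
	if (len <= HVSI_HDR_LEN || buf[0] != HVSI_DATA_TYPE)
		return 0;
	memmove(buf, buf + HVSI_HDR_LEN, len - HVSI_HDR_LEN);
	return len - HVSI_HDR_LEN;
}
```

For a single character this produces the same five bytes that
udbg_hvsi_putc() sends, and hvsi_strip_header() mirrors the header removal
done in udbg_hvsi_getc_poll().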

Could we get this pushed upstream soon?

-- 
Hollis Blanchard
IBM Linux Technology Center

===== arch/ppc64/kernel/pSeries_lpar.c 1.41 vs edited =====
--- 1.41/arch/ppc64/kernel/pSeries_lpar.c Tue Sep 21 23:40:30 2004
+++ edited/arch/ppc64/kernel/pSeries_lpar.c Thu Oct  7 10:52:23 2004
@@ -59,6 +59,74 @@
 
 int vtermno; /* virtual terminal# for udbg  */
 
+#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
+static void udbg_hvsi_putc(unsigned char c)
+{
+ /* packet's seqno isn't used anyways */
+ uint8_t packet[] __ALIGNED__ = { 0xff, 5, 0, 0, c };
+ int rc;
+
+ if (c == '\n')
+  udbg_hvsi_putc('\r');
+
+ do {
+  rc = plpar_put_term_char(vtermno, sizeof(packet), packet);
+ } while (rc == H_Busy);
+}
+
+static long hvsi_udbg_buf_len;
+static uint8_t hvsi_udbg_buf[256];
+
+static int udbg_hvsi_getc_poll(void)
+{
+ unsigned char ch;
+ int rc, i;
+
+ if (hvsi_udbg_buf_len == 0) {
+  rc = plpar_get_term_char(vtermno, &hvsi_udbg_buf_len, hvsi_udbg_buf);
+  if (rc != H_Success || hvsi_udbg_buf[0] != 0xff) {
+   /* bad read or non-data packet */
+   hvsi_udbg_buf_len = 0;
+  } else {
+   /* remove the packet header */
+   for (i = 4; i < hvsi_udbg_buf_len; i++)
+    hvsi_udbg_buf[i-4] = hvsi_udbg_buf[i];
+   hvsi_udbg_buf_len -= 4;
+  }
+ }
+
+ if (hvsi_udbg_buf_len <= 0 || hvsi_udbg_buf_len > 256) {
+  /* no data ready */
+  hvsi_udbg_buf_len = 0;
+  return -1;
+ }
+
+ ch = hvsi_udbg_buf[0];
+ /* shift remaining data down */
+ for (i = 1; i < hvsi_udbg_buf_len; i++) {
+  hvsi_udbg_buf[i-1] = hvsi_udbg_buf[i];
+ }
+ hvsi_udbg_buf_len--;
+
+ return ch;
+}
+
+static unsigned char udbg_hvsi_getc(void)
+{
+ int ch;
+ for (;;) {
+  ch = udbg_hvsi_getc_poll();
+  if (ch == -1) {
+   /* This shouldn't be needed...but... */
+   volatile unsigned long delay;
+   for (delay=0; delay < 2000000; delay++)
+    ;
+  } else {
+   return ch;
+  }
+ }
+}
+
 static void udbg_putcLP(unsigned char c)
 {
  char buf[16];
@@ -167,11 +235,15 @@
     ppc_md.udbg_getc_poll = udbg_getc_pollLP;
     found = 1;
    }
-  } else {
-   /* XXX implement udbg_putcLP_vtty for hvterm-protocol1 case */
-   printk(KERN_WARNING "%s doesn't speak hvterm1; "
-     "can't print udbg messages\n",
-          stdout_node->full_name);
+  } else if (device_is_compatible(stdout_node, "hvterm-protocol")) {
+   termno = (u32 *)get_property(stdout_node, "reg", NULL);
+   if (termno) {
+    vtermno = termno[0];
+    ppc_md.udbg_putc = udbg_hvsi_putc;
+    ppc_md.udbg_getc = udbg_hvsi_getc;
+    ppc_md.udbg_getc_poll = udbg_hvsi_getc_poll;
+    found = 1;
+   }
   }
  } else if (strncmp(name, "serial", 6)) {
   /* XXX fix ISA serial console */


From johnrose at austin.ibm.com  Fri Oct  8 03:54:21 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 07 Oct 2004 12:54:21 -0500
Subject: [PATCH] create iommu_free_table()
Message-ID: <1097171661.7087.1.camel@sinatra.austin.ibm.com>

The patch below creates iommu_free_table().  Iommu tables are not currently
freed in PPC64.  This could cause a memory leak for DLPAR of an EADS slot.  The
function verifies that there are no outstanding TCE entries for the range of
the table before freeing it.  I added a call to iommu_free_table() to the code
that dynamically removes a device node.  This should be fairly symmetrical with
the table allocation, which happens during dynamic addition of a device node.
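
The "no outstanding entries" check and the size calculation can be sketched
in userspace as below.  The helper names are hypothetical, and the sketch
assumes the usual kernel bitmap convention (bit n lives in word
n / BITS_PER_WORD at position n % BITS_PER_WORD).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap of 'nbits' entries, as computed in the patch. */
static size_t bitmap_bytes(size_t nbits)
{
	return (nbits + 7) / 8;
}

/*
 * Return 1 if none of the first 'nbits' bits is set.  Unlike a plain
 * nbits / BITS_PER_WORD loop, this also checks a partial trailing word.
 */
static int bitmap_is_empty(const unsigned long *map, size_t nbits)
{
	size_t full = nbits / BITS_PER_WORD;
	size_t rest = nbits % BITS_PER_WORD;
	size_t i;

	assert(map != NULL);
	for (i = 0; i < full; i++)
		if (map[i] != 0)
			return 0;
	if (rest && (map[full] & ((1UL << rest) - 1)) != 0)
		return 0;
	return 1;
}
```

The loop in the patch walks it_mapsize/64 words, so the two agree whenever
it_mapsize is a multiple of 64; the trailing-word check only matters if
that is not guaranteed.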

Comments welcome.

Thanks-
John

Signed-off-by: John Rose <johnrose at austin.ibm.com>

diff -Nru a/arch/ppc64/kernel/pSeries_iommu.c b/arch/ppc64/kernel/pSeries_iommu.c
--- a/arch/ppc64/kernel/pSeries_iommu.c	Thu Oct  7 11:08:19 2004
+++ b/arch/ppc64/kernel/pSeries_iommu.c	Thu Oct  7 11:08:19 2004
@@ -412,6 +412,38 @@
 	dn->iommu_table = iommu_init_table(tbl);
 }
 
+void iommu_free_table(struct device_node *dn)
+{
+	struct iommu_table *tbl = dn->iommu_table;
+	unsigned long bitmap_sz, i;
+	unsigned int order;
+
+	if (!tbl || !tbl->it_map) {
+		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
+				dn->full_name);
+		return;
+	}
+
+	/* verify that table contains no entries */
+	/* it_mapsize is in entries, and we're examining 64 at a time */
+	for (i = 0; i < (tbl->it_mapsize/64); i++) {
+		if (tbl->it_map[i] != 0) {
+			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
+				__FUNCTION__, dn->full_name);
+			break;
+		}
+	}
+
+	/* calculate bitmap size in bytes */
+	bitmap_sz = (tbl->it_mapsize + 7) / 8;
+
+	/* free bitmap */
+	order = get_order(bitmap_sz);
+	free_pages((unsigned long) tbl->it_map, order);
+
+	/* free table */
+	kfree(tbl);
+}
 
 void iommu_setup_pSeries(void)
 {
diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c
--- a/arch/ppc64/kernel/prom.c	Thu Oct  7 11:08:19 2004
+++ b/arch/ppc64/kernel/prom.c	Thu Oct  7 11:08:19 2004
@@ -1818,6 +1818,9 @@
 		return -EBUSY;
 	}
 
+	if (np->iommu_table)
+		iommu_free_table(np);
+
 	write_lock(&devtree_lock);
 	OF_MARK_STALE(np);
 	remove_node_proc_entries(np);
diff -Nru a/include/asm-ppc64/iommu.h b/include/asm-ppc64/iommu.h
--- a/include/asm-ppc64/iommu.h	Thu Oct  7 11:08:19 2004
+++ b/include/asm-ppc64/iommu.h	Thu Oct  7 11:08:19 2004
@@ -113,6 +113,9 @@
 /* Creates table for an individual device node */
 extern void iommu_devnode_init(struct device_node *dn);
 
+/* Frees table for an individual device node */
+extern void iommu_free_table(struct device_node *dn);
+
 #endif /* CONFIG_PPC_MULTIPLATFORM */
 
 #ifdef CONFIG_PPC_ISERIES


From linas at austin.ibm.com  Fri Oct  8 04:13:35 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Thu, 7 Oct 2004 13:13:35 -0500
Subject: [linas: [PATCH] PPC64: crash during firmware flash update]
Message-ID: <20041007181335.GA21633@austin.ibm.com>

Sent to the wrong mailing list :)

----- Forwarded message from linas -----

To: paulus at samba.org, anton at samba.org
Cc: linuxppc64-dev at lists.linuxppc.org
Subject:  [PATCH] PPC64: crash during firmware flash update


Race conditions during system shutdown after a firmware
flash can sometimes lead to an invalid pointer deref (deref
to freed memory).  This patch fixes this.  In addition, it makes
sure that the proc entries created by the firmware flash module
are removed when the module is unloaded. 


Signed-off-by: Linas Vepstas <linas at linas.org>


--- a/arch/ppc64/kernel/rtas_flash.c.orig	2004-09-20 11:59:18.000000000 -0500
+++ b/arch/ppc64/kernel/rtas_flash.c	2004-10-06 11:19:45.000000000 -0500
@@ -562,6 +562,7 @@ static int validate_flash_release(struct

 		validate_flash(args_buf);
 	}
 
+	/* The matching atomic_inc was in rtas_excl_open() */
 	atomic_dec(&dp->count);
 
 	return 0;
@@ -572,7 +573,8 @@ static void remove_flash_pde(struct proc
 	if (dp) {
 		if (dp->data != NULL)
 			kfree(dp->data);
-		remove_proc_entry(dp->name, NULL);
+		dp->owner = NULL;
+		remove_proc_entry(dp->name, dp->parent);
 	}
 }
 

----- End forwarded message -----


From benh at kernel.crashing.org  Fri Oct  8 12:27:07 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 08 Oct 2004 12:27:07 +1000
Subject: Gothic horrors in pci_dn.c
In-Reply-To: <1097137808.4894.73.camel@gaston>
References: <1097137808.4894.73.camel@gaston>
Message-ID: <1097202427.846.102.camel@gaston>

On Thu, 2004-10-07 at 18:30, Benjamin Herrenschmidt wrote:

> +		/* If we are a PCI bridge, go down */
> +		if (dn->child && (class >> 8) == PCI_CLASS_BRIDGE_PCI &&
> +		    (class >> 8) == PCI_CLASS_BRIDGE_CARDBUS)
> +			/* Depth first...do children */
> +			nextdn = dn->child;

Of course, that should have been

+		/* If we are a PCI bridge, go down */
+		if (dn->child && ((class >> 8) == PCI_CLASS_BRIDGE_PCI ||
+				  (class >> 8) == PCI_CLASS_BRIDGE_CARDBUS))
+			/* Depth first...do children */
+			nextdn = dn->child;

Ben.
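
The difference between the two conditions can be seen in a standalone
sketch.  The two class constants have the values defined in the kernel's
include/linux/pci_ids.h; the function names are hypothetical, and the
sample values assume the usual 24-bit class-code layout (class and subclass
in the upper 16 bits, programming interface in the low byte).

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kernel's include/linux/pci_ids.h. */
#define PCI_CLASS_BRIDGE_PCI		0x0604
#define PCI_CLASS_BRIDGE_CARDBUS	0x0607

/* First version: can never be true, since one value cannot equal both. */
static int is_bridge_and(uint32_t class_code)
{
	return (class_code >> 8) == PCI_CLASS_BRIDGE_PCI &&
	       (class_code >> 8) == PCI_CLASS_BRIDGE_CARDBUS;
}

/* Corrected version: true for either kind of bridge. */
static int is_bridge_or(uint32_t class_code)
{
	return (class_code >> 8) == PCI_CLASS_BRIDGE_PCI ||
	       (class_code >> 8) == PCI_CLASS_BRIDGE_CARDBUS;
}

/* Self-check over a few sample 24-bit class codes. */
static void check_bridge_tests(void)
{
	assert(!is_bridge_and(0x060400));	/* PCI-to-PCI bridge */
	assert(is_bridge_or(0x060400));
	assert(is_bridge_or(0x060700));		/* CardBus bridge */
	assert(!is_bridge_or(0x020000));	/* Ethernet controller */
}
```

With the "&&" form the traversal would never descend below any bridge, so
devices behind PCI-to-PCI bridges would not be visited.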


From paulus at samba.org  Fri Oct  8 10:44:32 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 8 Oct 2004 10:44:32 +1000
Subject: [patch] HVSI udbg
In-Reply-To: <200410071040.27907.hollisb@us.ibm.com>
References: <200410071040.27907.hollisb@us.ibm.com>
Message-ID: <16741.58096.932315.526999@cargo.ozlabs.ibm.com>

Hollis,

> --- 1.41/arch/ppc64/kernel/pSeries_lpar.c Tue Sep 21 23:40:30 2004
> +++ edited/arch/ppc64/kernel/pSeries_lpar.c Thu Oct  7 10:52:23 2004
> @@ -59,6 +59,74 @@
>  
>  int vtermno; /* virtual terminal# for udbg  */
>  
> +#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
> +static void udbg_hvsi_putc(unsigned char c)
> +{
> + /* packet's seqno isn't used anyways */
> + uint8_t packet[] __ALIGNED__ = { 0xff, 5, 0, 0, c };
> + int rc;

All the tabs in the patch seem to have got changed to spaces.  Is it
your mailer or is the list software doing something bad?

Paul.


From arnd at arndb.de  Fri Oct  8 16:22:57 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Fri, 8 Oct 2004 08:22:57 +0200
Subject: [patch] HVSI udbg
In-Reply-To: <16741.58096.932315.526999@cargo.ozlabs.ibm.com>
References: <200410071040.27907.hollisb@us.ibm.com>
	<16741.58096.932315.526999@cargo.ozlabs.ibm.com>
Message-ID: <200410080823.03298.arnd@arndb.de>

On Friday 08 October 2004 02:44, Paul Mackerras wrote:

> All the tabs in the patch seem to have got changed to spaces.  Is it
> your mailer or is the list software doing something bad?

It's the latest kmail (or Qt) update from Debian Sarge that broke this.
I have the same problem here. Attachments appear to be still working.

http://bugs.kde.org/show_bug.cgi?id=90688

 Arnd <><

From hollisb at us.ibm.com  Fri Oct  8 22:42:13 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Fri, 8 Oct 2004 12:42:13 +0000
Subject: [patch 2] HVSI udbg
Message-ID: <200410081242.13486.hollisb@us.ibm.com>

This patch (resent as attachment due to mailer troubles) adds support for the 
udbg early console interfaces when using an HVSI console.

-- 
Hollis Blanchard
IBM Linux Technology Center
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hvsi-udbg.diff
Type: text/x-diff
Size: 2552 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041008/663aa099/attachment.diff 

From david at gibson.dropbear.id.au  Mon Oct 11 12:11:46 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 11 Oct 2004 12:11:46 +1000
Subject: [PPC64] xmon sparse cleanups
In-Reply-To: <16738.31164.464250.638432@cargo.ozlabs.ibm.com>
References: <20041005064255.GF3695@zax>
	<16738.31164.464250.638432@cargo.ozlabs.ibm.com>
Message-ID: <20041011021146.GA1556@zax>

On Tue, Oct 05, 2004 at 08:38:52PM +1000, Paul Mackerras wrote:
> David Gibson writes:
> 
> > Andrew, please apply:
> > 
> > This patch removes many sparse warnings from the xmon code.  Mostly
> > K&R function declarations and 0-instead-of-NULLs.
> 
> The trouble with this patch is that it makes ppc-opc.c diverge from
> the version in binutils, which is where it came from.  I'd rather keep
> it as close as possible to that version.  I have no problem with the
> changes to the other files.

A corresponding patch has now gone into binutils CVS.  As it happens
there has already been a certain amount of divergence between the
versions, presumably because the kernel copy hasn't been updated from
binutils in quite a while.

-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From schwab at suse.de  Tue Oct 12 06:11:42 2004
From: schwab at suse.de (Andreas Schwab)
Date: Mon, 11 Oct 2004 22:11:42 +0200
Subject: 2.6.9-rc4: oops during ide probing
Message-ID: <m3llednhfl.fsf@igel.m5r.de>

I'm getting an oops during ide probing on the PMac G5 with 2.6.9-rc4:

ide-pmac: cannot find MacIO node for Kauai ATA interface
ide0: Found Apple OHare ATA controller, bus ID 0, irq 0
Oops: Kernel access of bad area, sig: 11 [#1]
NIP [...] .ide_mm_inb+0x0/0x14
LR [...] .ide_wait_not_busy+0x98/0xf0

(Sorry, I couldn't capture the whole oops.)

I've also tried the patch from
<http://ozlabs.org/ppc64-patches/patch.pl?id=339>, but that didn't help.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From pbadari at us.ibm.com  Tue Oct 12 08:07:32 2004
From: pbadari at us.ibm.com (Badari Pulavarty)
Date: 11 Oct 2004 15:07:32 -0700
Subject: 2.6.9-rc4-mm1 doesn't boot on my Power3 box
Message-ID: <1097532452.12861.398.camel@dyn318077bld.beaverton.ibm.com>

Hi,

My Power3 box doesn't boot with 2.6.9-rc4-mm1. I get the
following oops (2.6.9-rc3-mm3 has the same issue).

Any fixes?

Thanks,
Badari

kernel BUG in __flush_tlb_pending at arch/ppc64/mm/tlb.c:125!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=128 NUMA PSERIES
NIP: C00000000003E344 XER: 0000000020000000 LR: C000000000014DA0
REGS: c000000001963550 TRAP: 0700   Not tainted  (2.6.9-rc4-mm1)
MSR: a000000000023032 EE: 0 PR: 0 FP: 1 ME: 1 IR/DR: 11
TASK: c00000003f7577e0[1396] 'hotplug' THREAD: c000000001960000 CPU: 0
GPR00: 0000000004000000 C0000000019637D0 C0000000005D29F0 C0000000006B70A0
GPR04: C00000003FB597E0 000000028904198B C0000000005D1008 C0000000004583B0
GPR08: 0000000000260F00 C000000001960000 C0000000005D1008 0000000000000002
GPR12: 0000000022222482 C0000000004B9900 C00000003F757A80 00000030CAC526D0
GPR16: C0000000005D1008 000000000065E4C0 0000000000000000 C00000000F052500
GPR20: C0000000006BAD88 C000000001963990 C00000003FB597E0 C0000000006B9B38
GPR24: C000000001945200 C00000003F7577E0 0000000018221613 C00000003FB597E0
GPR28: 0000000000001260 C00000003F7577E0 0000000000000000 C0000000006B70A0
NIP [c00000000003e344] .__flush_tlb_pending+0x38/0x150
LR [c000000000014da0] .__switch_to+0xb4/0xd8
Call Trace:
[c0000000019637d0] [00000000f7fad210] 0xf7fad210 (unreliable)
--- Exception: 901 at .copy_page_range+0x218/0x61c
    LR = .copy_page_range+0x160/0x61c
[c000000001963890] [c000000000014da0] .__switch_to+0xb4/0xd8 (unreliable)
[c000000001963920] [c00000000039a5dc] .schedule+0x38c/0xc3c
[c000000001963a40] [c00000000039b028] .cond_resched+0x4c/0x80
[c000000001963ac0] [c000000000096eb0] .copy_page_range+0x29c/0x61c
[c000000001963bd0] [c00000000004fecc] .copy_process+0x8c0/0x148c
[c000000001963ce0] [c000000000050b38] .do_fork+0xa0/0x25c
[c000000001963dc0] [c000000000014680] .sys_clone+0x5c/0x74
[c000000001963e30] [c000000000010208] .ppc_clone+0x8/0xc


From dwmw2 at infradead.org  Tue Oct 12 23:51:49 2004
From: dwmw2 at infradead.org (David Woodhouse)
Date: Tue, 12 Oct 2004 14:51:49 +0100
Subject: cond_syscall() and new ABI.
Message-ID: <1097589108.318.425.camel@hades.cambridge.redhat.com>

This (in linux/asm-ppc64/unistd.h) doesn't work with the new ABI:

/*
 * "Conditional" syscalls
 *
 * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
 * but it doesn't work on all toolchains, so we just do it by hand
 */
#define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");

Two options -- either we ditch older toolchains (before 2002-03-01
probably), by switching to what we say in the comment, or we introduce
an ifdef to choose whether to include the '.' in the symbol names...

Both attached. Someone who cares can choose one :)

-- 
dwmw2
-------------- next part --------------
===== include/asm-ppc64/unistd.h 1.34 vs edited =====
--- 1.34/include/asm-ppc64/unistd.h	Tue Sep 14 01:23:12 2004
+++ edited/include/asm-ppc64/unistd.h	Tue Oct 12 14:49:48 2004
@@ -468,7 +468,11 @@
  * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
  * but it doesn't work on all toolchains, so we just do it by hand
  */
+#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ > 3)
+#define cond_syscall(x) asm(".weak\t" #x "\n\t.set\t" #x ",sys_ni_syscall");
+#else
 #define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");
+#endif
 
 #endif		/* __KERNEL__ */
 
-------------- next part --------------
===== include/asm-ppc64/unistd.h 1.34 vs edited =====
--- 1.34/include/asm-ppc64/unistd.h	Tue Sep 14 01:23:12 2004
+++ edited/include/asm-ppc64/unistd.h	Tue Oct 12 14:48:08 2004
@@ -468,7 +468,7 @@
  * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
  * but it doesn't work on all toolchains, so we just do it by hand
  */
-#define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");
+#define cond_syscall(x) void x(void) __attribute__((weak,alias("sys_ni_syscall")));
 
 #endif		/* __KERNEL__ */
 

From hch at lst.de  Wed Oct 13 00:26:27 2004
From: hch at lst.de (Christoph Hellwig)
Date: Tue, 12 Oct 2004 16:26:27 +0200
Subject: cond_syscall() and new ABI.
In-Reply-To: <1097589108.318.425.camel@hades.cambridge.redhat.com>
References: <1097589108.318.425.camel@hades.cambridge.redhat.com>
Message-ID: <20041012142627.GA19091@lst.de>

On Tue, Oct 12, 2004 at 02:51:49PM +0100, David Woodhouse wrote:
> This (in linux/asm-ppc64/unistd.h) doesn't work with the new ABI:
> 
> /*
>  * "Conditional" syscalls
>  *
>  * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
>  * but it doesn't work on all toolchains, so we just do it by hand
>  */
> #define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");
> 
> Two options -- either we ditch older toolchains (before 2002-03-01
> probably), by switching to what we say in the comment, or we introduce
> an ifdef to choose whether to include the '.' in the symbol names...
> 
> Both attached. Someone who cares can choose one :)
> 
> -- 
> dwmw2

> ===== include/asm-ppc64/unistd.h 1.34 vs edited =====
> --- 1.34/include/asm-ppc64/unistd.h	Tue Sep 14 01:23:12 2004
> +++ edited/include/asm-ppc64/unistd.h	Tue Oct 12 14:49:48 2004
> @@ -468,7 +468,11 @@
>   * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
>   * but it doesn't work on all toolchains, so we just do it by hand
>   */
> +#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ > 3)
> +#define cond_syscall(x) asm(".weak\t" #x "\n\t.set\t" #x ",sys_ni_syscall");
> +#else
>  #define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");

This is broken.  GCC 3.4 doesn't even support the non-dotted
ABI, never mind use it by default.

> ===== include/asm-ppc64/unistd.h 1.34 vs edited =====
> --- 1.34/include/asm-ppc64/unistd.h	Tue Sep 14 01:23:12 2004
> +++ edited/include/asm-ppc64/unistd.h	Tue Oct 12 14:48:08 2004
> @@ -468,7 +468,7 @@
>   * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
>   * but it doesn't work on all toolchains, so we just do it by hand
>   */
> -#define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");
> +#define cond_syscall(x) void x(void) __attribute__((weak,alias("sys_ni_syscall")));


This one, on the other hand, makes a lot of sense; it's what most architectures use.
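
For readers who haven't used the attribute form, here is a minimal
userspace sketch of what the second patch relies on.  It assumes GCC
(or a compatible compiler) on an ELF target.  The name
sys_hypothetical is made up for illustration, and the functions return
long instead of void only so the effect is observable; the patch itself
declares "void x(void)".

```c
#include <errno.h>

/* Stand-in for the kernel's "not implemented" handler. */
long sys_ni_syscall(void)
{
	return -ENOSYS;
}

/*
 * Same shape as the attribute-based macro in the patch: declare x as a
 * weak alias of sys_ni_syscall.  If some other object file provides a
 * strong definition of x, the linker picks that one instead.
 */
#define cond_syscall(x) \
	long x(void) __attribute__((weak, alias("sys_ni_syscall")))

/* Hypothetical syscall with no strong definition anywhere. */
cond_syscall(sys_hypothetical);
```

With no strong definition linked in, calling sys_hypothetical() ends up
in sys_ni_syscall() and returns -ENOSYS.  Note that the alias target has
to be defined in the same translation unit as the alias.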


From moilanen at austin.ibm.com  Wed Oct 13 00:56:19 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 12 Oct 2004 09:56:19 -0500
Subject: [PATCH 1/2][RFC] PPC64 no-exec support for user space
In-Reply-To: <20041012095248.2b6418c4@localhost>
References: <20041012095248.2b6418c4@localhost>
Message-ID: <20041012095619.63a38530@localhost>

Here is no-exec support for user space.  This patch also includes base
no-exec support. 

Once again, it requires Ben's signal-trampoline-in-vdso piece.
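
As a reading aid for the hash_low.S hunk below: the software PTE keeps
an "executable" bit, while the hardware bit at the same position means
"no execute", so the value is masked and then inverted with an xor.  A
rough C equivalent of that andi./xori pair, with the bit values taken
from this patch series and a helper name (linux_pte_to_hw_flags) that
is purely illustrative:

```c
/* Bit values as defined by this patch series. */
#define _PAGE_EXEC 0x0004UL
#define HW_NO_EXEC _PAGE_EXEC	/* same bit, opposite sense in hardware */

/* Hypothetical helper mirroring "andi. r3,r30,0x1fe; xori r3,r3,HW_NO_EXEC". */
static unsigned long linux_pte_to_hw_flags(unsigned long linux_pte)
{
	unsigned long hw = linux_pte & 0x1feUL;	/* basic set of flags */

	return hw ^ HW_NO_EXEC;			/* _PAGE_EXEC -> no-execute */
}
```

A PTE with _PAGE_EXEC set comes out with the no-execute bit clear, and
a PTE without it comes out with the bit set.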

Thanks,
Jake

Signed-off-by: Jake Moilanen <moilanen at austin.ibm.com>

---


diff -puN arch/ppc64/kernel/head.S~nx-user-ppc64 arch/ppc64/kernel/head.S
--- linux-2.6-bk/arch/ppc64/kernel/head.S~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/kernel/head.S	Thu Oct  7 15:23:52 2004
@@ -35,6 +35,7 @@
 #include <asm/offsets.h>
 #include <asm/bug.h>
 #include <asm/cputable.h>
+#include <asm/pgtable.h>	
 #include <asm/setup.h>
 
 #ifdef CONFIG_PPC_ISERIES
@@ -879,6 +880,7 @@ InstructionAccess_common:
 	ld	r3,_NIP(r1)
 	andis.	r4,r12,0x5820
 	li	r5,0x400
+	ori	r4,r4,_PAGE_EXEC	
 	b	.do_hash_page		/* Try to handle as hpte fault */
 
 	.align	7
@@ -964,11 +966,10 @@ END_FTR_SECTION_IFCLR(CPU_FTR_SLB)
 	 * accessing a userspace segment (even from the kernel). We assume
 	 * kernel addresses always have the high bit set.
 	 */
-	rlwinm	r4,r4,32-23,29,29	/* DSISR_STORE -> _PAGE_RW */
+	rlwinm	r4,r4,32-25+9,31-9,31-9	/* DSISR_STORE -> _PAGE_RW */
 	rotldi	r0,r3,15		/* Move high bit into MSR_PR posn */
 	orc	r0,r12,r0		/* MSR_PR | ~high_bit */
 	rlwimi	r4,r0,32-13,30,30	/* becomes _PAGE_USER access bit */
-	ori	r4,r4,1			/* add _PAGE_PRESENT */
 
 	/*
 	 * On iSeries, we soft-disable interrupts here, then
diff -puN arch/ppc64/mm/fault.c~nx-user-ppc64 arch/ppc64/mm/fault.c
--- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c	Thu Oct  7 15:23:52 2004
@@ -92,6 +92,7 @@ int do_page_fault(struct pt_regs *regs, 
 	unsigned long code = SEGV_MAPERR;
 	unsigned long is_write = error_code & 0x02000000;
 	unsigned long trap = TRAP(regs);
+ 	unsigned long is_exec = trap == 0x400;	
 
 	BUG_ON((trap == 0x380) || (trap == 0x480));
 
@@ -191,16 +192,19 @@ int do_page_fault(struct pt_regs *regs, 
 good_area:
 	code = SEGV_ACCERR;
 
+	if (is_exec) {
+		/* protection fault */
+		if (error_code & 0x08000000)
+			goto bad_area;
+		if (!(vma->vm_flags & VM_EXEC)) 
+			goto bad_area;
 	/* a write */
-	if (is_write) {
+	} else if (is_write) {
 		if (!(vma->vm_flags & VM_WRITE))
 			goto bad_area;
 	/* a read */
 	} else {
-		/* protection fault */
-		if (error_code & 0x08000000)
-			goto bad_area;
-		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
+		if (!(vma->vm_flags & VM_READ))
 			goto bad_area;
 	}
 
diff -puN arch/ppc64/mm/hash_low.S~nx-user-ppc64 arch/ppc64/mm/hash_low.S
--- linux-2.6-bk/arch/ppc64/mm/hash_low.S~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_low.S	Thu Oct  7 15:23:52 2004
@@ -89,7 +89,7 @@ _GLOBAL(__hash_page)
 	/* Prepare new PTE value (turn access RW into DIRTY, then
 	 * add BUSY,HASHPTE and ACCESSED)
 	 */
-	rlwinm	r30,r4,5,24,24	/* _PAGE_RW -> _PAGE_DIRTY */
+	rlwinm	r30,r4,32-9+7,31-7,31-7	/* _PAGE_RW -> _PAGE_DIRTY */
 	or	r30,r30,r31
 	ori	r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE
 	/* Write the linux PTE atomically (setting busy) */
@@ -112,11 +112,11 @@ _GLOBAL(__hash_page)
 	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
 	rldicl	r0,r3,64-12,48		/* (ea >> 12) & 0xffff */
 	xor	r28,r5,r0
-	
-	/* Convert linux PTE bits into HW equivalents
-	 */
-	andi.	r3,r30,0x1fa		/* Get basic set of flags */
-	rlwinm	r0,r30,32-2+1,30,30	/* _PAGE_RW -> _PAGE_USER (r0) */
+
+	/* Convert linux PTE bits into HW equivalents */
+	andi.	r3,r30,0x1fe		/* Get basic set of flags */
+	xori	r3,r3,HW_NO_EXEC	/* _PAGE_EXEC -> NOEXEC */
+	rlwinm	r0,r30,32-9+1,30,30	/* _PAGE_RW -> _PAGE_USER (r0) */
 	rlwinm	r4,r30,32-7+1,30,30	/* _PAGE_DIRTY -> _PAGE_USER (r4) */
 	and	r0,r0,r4		/* _PAGE_RW & _PAGE_DIRTY -> r0 bit 30 */
 	andc	r0,r30,r0		/* r0 = pte & ~r0 */
diff -puN fs/binfmt_elf.c~nx-user-ppc64 fs/binfmt_elf.c
--- linux-2.6-bk/fs/binfmt_elf.c~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/fs/binfmt_elf.c	Thu Oct  7 15:23:52 2004
@@ -89,8 +89,11 @@ static int set_brk(unsigned long start, 
 	end = ELF_PAGEALIGN(end);
 	if (end > start) {
 		unsigned long addr = do_brk(start, end - start);
+
 		if (BAD_ADDR(addr))
 			return addr;
+
+ 		sys_mprotect(start, end-start, PROT_READ|PROT_WRITE|PROT_EXEC);
 	}
 	current->mm->start_brk = current->mm->brk = end;
 	return 0;
diff -puN include/asm-ppc64/elf.h~nx-user-ppc64 include/asm-ppc64/elf.h
--- linux-2.6-bk/include/asm-ppc64/elf.h~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/include/asm-ppc64/elf.h	Thu Oct  7 15:23:52 2004
@@ -226,6 +226,13 @@ do {								\
 	else if (current->personality != PER_LINUX32)		\
 		set_personality(PER_LINUX);			\
 } while (0)
+
+/*
+ * An executable for which elf_read_implies_exec() returns TRUE will
+ * have the READ_IMPLIES_EXEC personality flag set automatically.
+ */
+#define elf_read_implies_exec_binary(ex, have_pt_gnu_stack)	(!(have_pt_gnu_stack))
+
 #endif
 
 /*
diff -puN include/asm-ppc64/page.h~nx-user-ppc64 include/asm-ppc64/page.h
--- linux-2.6-bk/include/asm-ppc64/page.h~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/include/asm-ppc64/page.h	Thu Oct  7 15:23:52 2004
@@ -233,8 +233,25 @@ extern int page_is_ram(unsigned long pfn
 
 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
 
-#define VM_DATA_DEFAULT_FLAGS	(VM_READ | VM_WRITE | VM_EXEC | \
+#define VM_DATA_DEFAULT_FLAGS32	(VM_READ | VM_WRITE | \
 				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_STACK_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | VM_EXEC | \
+				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_DATA_DEFAULT_FLAGS64	(VM_READ | VM_WRITE | \
+				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_STACK_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | VM_EXEC | \
+				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_DATA_DEFAULT_FLAGS \
+	(test_thread_flag(TIF_32BIT) ? \
+	 VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64)
+
+#define VM_STACK_DEFAULT_FLAGS \
+	(test_thread_flag(TIF_32BIT) ? \
+	 VM_STACK_DEFAULT_FLAGS32 : VM_STACK_DEFAULT_FLAGS64)
 
 #endif /* __KERNEL__ */
 #endif /* _PPC64_PAGE_H */
diff -puN include/asm-ppc64/pgtable.h~nx-user-ppc64 include/asm-ppc64/pgtable.h
--- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h	Thu Oct  7 15:23:52 2004
@@ -86,24 +86,25 @@
 #define _PAGE_PRESENT	0x0001 /* software: pte contains a translation */
 #define _PAGE_USER	0x0002 /* matches one of the PP bits */
 #define _PAGE_FILE	0x0002 /* (!present only) software: pte holds file offset */
-#define _PAGE_RW	0x0004 /* software: user write access allowed */
+#define _PAGE_EXEC	0x0004 /* No execute on POWER4 and newer (we invert) */
 #define _PAGE_GUARDED	0x0008
 #define _PAGE_COHERENT	0x0010 /* M: enforce memory coherence (SMP systems) */
 #define _PAGE_NO_CACHE	0x0020 /* I: cache inhibit */
 #define _PAGE_WRITETHRU	0x0040 /* W: cache write-through */
 #define _PAGE_DIRTY	0x0080 /* C: page changed */
 #define _PAGE_ACCESSED	0x0100 /* R: page referenced */
-#define _PAGE_EXEC	0x0200 /* software: i-cache coherence required */
+#define _PAGE_RW	0x0200 /* software: user write access allowed */
 #define _PAGE_HASHPTE	0x0400 /* software: pte has an associated HPTE */
 #define _PAGE_BUSY	0x0800 /* software: PTE & hash are busy */ 
 #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */
 #define _PAGE_GROUP_IX  0x7000 /* software: HPTE index within group */
 /* Bits 0x7000 identify the index within an HPT Group */
 #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
+
 /* PAGE_MASK gives the right answer below, but only by accident */
 /* It should be preserving the high 48 bits and then specifically */
 /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */
-#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_HPTEFLAGS)
+#define _PAGE_CHG_MASK (_PAGE_GUARDED | _PAGE_COHERENT | _PAGE_NO_CACHE | _PAGE_WRITETHRU | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_HPTEFLAGS | PAGE_MASK)
 
 #define _PAGE_BASE	(_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_COHERENT)
 
@@ -119,31 +120,32 @@
 #define PAGE_READONLY	__pgprot(_PAGE_BASE | _PAGE_USER)
 #define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
 #define PAGE_KERNEL	__pgprot(_PAGE_BASE | _PAGE_WRENABLE)
-#define PAGE_KERNEL_CI	__pgprot(_PAGE_PRESENT | _PAGE_ACCESSED | \
-			       _PAGE_WRENABLE | _PAGE_NO_CACHE | _PAGE_GUARDED)
 
 /*
- * The PowerPC can only do execute protection on a segment (256MB) basis,
- * not on a page basis.  So we consider execute permission the same as read.
+ * POWER4 and newer have per page execute protection, older chips can only
+ * do this on a segment (256MB) basis.
+ *
  * Also, write permissions imply read permissions.
  * This is the closest we can get..
+ *
+ * Note due to the way vm flags are laid out, the bits are XWR
  */
 #define __P000	PAGE_NONE
-#define __P001	PAGE_READONLY_X
+#define __P001	PAGE_READONLY
 #define __P010	PAGE_COPY
-#define __P011	PAGE_COPY_X
-#define __P100	PAGE_READONLY
+#define __P011	PAGE_COPY
+#define __P100	PAGE_READONLY_X
 #define __P101	PAGE_READONLY_X
-#define __P110	PAGE_COPY
+#define __P110	PAGE_COPY_X
 #define __P111	PAGE_COPY_X
 
 #define __S000	PAGE_NONE
-#define __S001	PAGE_READONLY_X
+#define __S001	PAGE_READONLY
 #define __S010	PAGE_SHARED
-#define __S011	PAGE_SHARED_X
-#define __S100	PAGE_READONLY
+#define __S011	PAGE_SHARED
+#define __S100	PAGE_READONLY_X
 #define __S101	PAGE_READONLY_X
-#define __S110	PAGE_SHARED
+#define __S110	PAGE_SHARED_X
 #define __S111	PAGE_SHARED_X
 
 #ifndef __ASSEMBLY__
@@ -200,7 +202,8 @@ int hash_huge_page(struct mm_struct *mm,
 })
 
 #define pte_modify(_pte, newprot) \
-  (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | pgprot_val(newprot)))
+	(__pte((pte_val(_pte) & _PAGE_CHG_MASK) | \
+	       (pgprot_val(newprot) & ~_PAGE_CHG_MASK)))
 
 #define pte_none(pte)		((pte_val(pte) & ~_PAGE_HPTEFLAGS) == 0)
 #define pte_present(pte)	(pte_val(pte) & _PAGE_PRESENT)
@@ -270,9 +273,6 @@ static inline int pte_dirty(pte_t pte) {
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
 
-static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
-static inline void pte_cache(pte_t pte)   { pte_val(pte) &= ~_PAGE_NO_CACHE; }
-
 static inline pte_t pte_rdprotect(pte_t pte) {
 	pte_val(pte) &= ~_PAGE_USER; return pte; }
 static inline pte_t pte_exprotect(pte_t pte) {
@@ -420,7 +420,7 @@ static inline void set_pte(pte_t *ptep, 
 static inline void __ptep_set_access_flags(pte_t *ptep, pte_t entry, int dirty)
 {
 	unsigned long bits = pte_val(entry) &
-		(_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW);
+		(_PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_RW | _PAGE_EXEC);
 	unsigned long old, tmp;
 
 	__asm__ __volatile__(
diff -puN arch/ppc64/mm/hugetlbpage.c~nx-user-ppc64 arch/ppc64/mm/hugetlbpage.c
--- linux-2.6-bk/arch/ppc64/mm/hugetlbpage.c~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/mm/hugetlbpage.c	Thu Oct  7 15:23:52 2004
@@ -29,8 +29,8 @@
 
 /* HugePTE layout:
  *
- * 31 30 ... 15 14 13 12 10 9  8  7   6    5    4    3    2    1    0
- * PFN>>12..... -  -  -  -  -  -  HASH_IX....   2ND  HASH RW   -    HG=1
+ * 31 30 ... 15 14 13 12 10 9  8  7   6    5    4    3    2    	1    0
+ * PFN>>12..... -  -  -  -  -  -  HASH_IX....   2ND  HASH !EXEC	RW   HG=1
  */
 
 #define HUGEPTE_SHIFT	15
@@ -41,7 +41,8 @@
 #define _HUGEPAGE_GROUP_IX	0x000000e0
 #define _HUGEPAGE_HPTEFLAGS	(_HUGEPAGE_HASHPTE | _HUGEPAGE_SECONDARY | \
 				 _HUGEPAGE_GROUP_IX)
-#define _HUGEPAGE_RW		0x00000004
+#define _HUGEPAGE_RW		0x00000002
+#define _HUGEPAGE_EXEC		0x00000004	/* this is inverted */
 
 typedef struct {unsigned int val;} hugepte_t;
 #define hugepte_val(hugepte)	((hugepte).val)
@@ -722,6 +723,7 @@ int hash_huge_page(struct mm_struct *mm,
 	hugepte_t *ptep;
 	unsigned long va, vpn;
 	int is_write;
+	int is_exec;
 	hugepte_t old_pte, new_pte;
 	unsigned long hpteflags, prpn, flags;
 	long slot;
@@ -752,6 +754,10 @@ int hash_huge_page(struct mm_struct *mm,
 	if (unlikely(is_write && !(hugepte_val(*ptep) & _HUGEPAGE_RW)))
 		return 1;
 
+	is_exec = access & _PAGE_EXEC;
+	if (unlikely(is_exec && !(hugepte_val(*ptep) & _HUGEPAGE_EXEC)))
+		return 1;
+
 	/*
 	 * At this point, we have a pte (old_pte) which can be used to build
 	 * or update an HPTE. There are 2 cases:
@@ -769,7 +775,10 @@ int hash_huge_page(struct mm_struct *mm,
 	old_pte = *ptep;
 	new_pte = old_pte;
 
-	hpteflags = 0x2 | (! (hugepte_val(new_pte) & _HUGEPAGE_RW));
+	/* _HUGEPAGE_EXEC -> HW_NO_EXEC since it's inverted */
+	hpteflags = (hugepte_val(new_pte) & _HUGEPAGE_RW) |
+		(hugepte_val(new_pte) ^ HW_NO_EXEC) |
+		(!(hugepte_val(new_pte) & _HUGEPAGE_RW));
 
 	/* Check if pte already has an hpte (case 2) */
 	if (unlikely(hugepte_val(old_pte) & _HUGEPAGE_HASHPTE)) {
diff -L arch/ppc64/kernel/pSeries_htab.c -puN /dev/null /dev/null
diff -puN arch/ppc64/kernel/pSeries_lpar.c~nx-user-ppc64 arch/ppc64/kernel/pSeries_lpar.c
--- linux-2.6-bk/arch/ppc64/kernel/pSeries_lpar.c~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/kernel/pSeries_lpar.c	Thu Oct  7 15:23:52 2004
@@ -384,7 +384,7 @@ static void pSeries_lpar_hpte_updatebolt
 	slot = pSeries_lpar_hpte_find(vpn);
 	BUG_ON(slot == -1);
 
-	flags = newpp & 3;
+	flags = newpp & 7;
 	lpar_rc = plpar_pte_protect(flags, slot, 0);
 
 	BUG_ON(lpar_rc != H_Success);
diff -puN arch/ppc64/kernel/iSeries_htab.c~nx-user-ppc64 arch/ppc64/kernel/iSeries_htab.c
--- linux-2.6-bk/arch/ppc64/kernel/iSeries_htab.c~nx-user-ppc64	Thu Oct  7 15:23:52 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_htab.c	Thu Oct  7 15:23:52 2004
@@ -144,6 +144,10 @@ static long iSeries_hpte_updatepp(unsign
 
 	HvCallHpt_get(&hpte, slot);
 	if ((hpte.dw0.dw0.avpn == avpn) && (hpte.dw0.dw0.v)) {
+		/*
+		 * Hypervisor expects bit's as NPPP, which is
+		 * different from how they are mapped in our PP.
+		 */
 		HvCallHpt_setPp(slot, (newpp & 0x3) | ((newpp & 0x4) << 1));
 		iSeries_hunlock(slot);
 		return 0;

_


From moilanen at austin.ibm.com  Wed Oct 13 00:52:48 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 12 Oct 2004 09:52:48 -0500
Subject: [PATCH 0/2][RFC] PPC64 no-exec support
Message-ID: <20041012095248.2b6418c4@localhost>

These patches add no-exec support to PPC64.  They should prohibit
executing code from the stack, or from almost any non-text segment.

Distros that compile with PT_GNU_STACK depend on Ben's signal
trampoline changes.  Without them, a process will hang on the first
signal, because the return code is put on the signal context stack to
return to the kernel on completion of the signal handler.

The patches include a base fixup from Anton for the wrong bit being used
for no-exec and for read/write on the hardware PTEs.

The patch is broken into two parts:

1/2: PPC64 no-exec support for user space:  This will prohibit user
space apps from executing in segments not marked as executable.  The
base support is in here as well.

2/2: PPC64 no-exec support for kernel space:  This prohibits the kernel
from executing non-text code.
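
To make the user-space rule concrete, here is a simplified sketch of
the permission check that patch 1/2 puts in do_page_fault().  The
VM_* values match the kernel's, but the enum and the function name
access_allowed are made up for illustration, and the extra
protection-fault test on error_code is left out.

```c
#include <stdbool.h>

#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

/* Hypothetical classification of the faulting access. */
enum access_kind { ACCESS_READ, ACCESS_WRITE, ACCESS_EXEC };

/*
 * After the patch, an instruction fetch needs VM_EXEC and a plain read
 * needs VM_READ.  Before it, a read was allowed if either VM_READ or
 * VM_EXEC was set, and fetches were not checked separately.
 */
static bool access_allowed(unsigned long vm_flags, enum access_kind kind)
{
	switch (kind) {
	case ACCESS_EXEC:
		return (vm_flags & VM_EXEC) != 0;
	case ACCESS_WRITE:
		return (vm_flags & VM_WRITE) != 0;
	default:
		return (vm_flags & VM_READ) != 0;
	}
}
```

So a fetch from a read/write mapping such as a non-executable stack is
refused, while the same mapping still allows loads and stores.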

Thanks,
Jake


From moilanen at austin.ibm.com  Wed Oct 13 00:58:52 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 12 Oct 2004 09:58:52 -0500
Subject: [PATCH 2/2][RFC] PPC64 no-exec support for kernel space
In-Reply-To: <20041012095248.2b6418c4@localhost>
References: <20041012095248.2b6418c4@localhost>
Message-ID: <20041012095852.29e583a3@localhost>

Here is the kernel piece of no-exec.  It marks all non-text pages as
no-execute.

It depends on the no-exec for user-space patch.
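
The core of the change is small.  A sketch of the decision made for
each bolted kernel mapping, assuming the text range is passed in
explicitly (the patch compares against _stext and __init_end; the
helper names here are hypothetical):

```c
#define HW_NO_EXEC 0x4UL	/* hardware no-execute bit, as in the patch */

/* Hypothetical stand-in for is_kernel_text() with an explicit range. */
static int in_text_range(unsigned long addr,
			 unsigned long text_start, unsigned long text_end)
{
	return addr >= text_start && addr < text_end;
}

/* Pick the HPTE mode for one page: text stays executable, the rest does not. */
static unsigned long bolted_mode(unsigned long addr, unsigned long mode,
				 unsigned long text_start,
				 unsigned long text_end)
{
	if (!in_text_range(addr, text_start, text_end))
		mode |= HW_NO_EXEC;

	return mode;
}
```

The end of the range is exclusive, matching the "addr < __init_end"
comparison in the patch.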

Thanks,
Jake

Signed-off-by: Jake Moilanen <moilanen at austin.ibm.com>

---


diff -puN arch/ppc64/kernel/module.c~nx-kernel-ppc64 arch/ppc64/kernel/module.c
--- linux-2.6-bk/arch/ppc64/kernel/module.c~nx-kernel-ppc64	Thu Oct  7 15:23:55 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/kernel/module.c	Thu Oct  7 15:23:55 2004
@@ -102,7 +102,8 @@ void *module_alloc(unsigned long size)
 {
 	if (size == 0)
 		return NULL;
-	return vmalloc(size);
+
+	return vmalloc_exec(size);
 }
 
 /* Free memory returned from module_alloc */
diff -puN arch/ppc64/mm/fault.c~nx-kernel-ppc64 arch/ppc64/mm/fault.c
--- linux-2.6-bk/arch/ppc64/mm/fault.c~nx-kernel-ppc64	Thu Oct  7 15:23:55 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/mm/fault.c	Thu Oct  7 15:23:55 2004
@@ -75,6 +75,21 @@ static int store_updates_sp(struct pt_re
 	return 0;
 }
 
+pte_t *lookup_address(unsigned long address) 
+{ 
+	pgd_t *pgd = pgd_offset_k(address); 
+	pmd_t *pmd;
+
+	if (pgd_none(*pgd))
+		return NULL;
+
+	pmd = pmd_offset(pgd, address); 	       
+	if (pmd_none(*pmd))
+		return NULL;
+
+        return pte_offset_kernel(pmd, address);
+} 
+
 /*
  * The error_code parameter is
  *  - DSISR for a non-SLB data access fault,
@@ -93,6 +108,7 @@ int do_page_fault(struct pt_regs *regs, 
 	unsigned long is_write = error_code & 0x02000000;
 	unsigned long trap = TRAP(regs);
  	unsigned long is_exec = trap == 0x400;	
+	pte_t *ptep;
 
 	BUG_ON((trap == 0x380) || (trap == 0x480));
 
@@ -245,6 +261,15 @@ bad_area_nosemaphore:
 		info.si_addr = (void __user *) address;
 		force_sig_info(SIGSEGV, &info, current);
 		return 0;
+	} 
+
+	ptep = lookup_address(address);
+
+	if (ptep && pte_present(*ptep) && !pte_exec(*ptep)) {
+		if (printk_ratelimit())
+			printk(KERN_CRIT "kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n", current->uid);
+		show_stack(current, (unsigned long *)__get_SP());
+		do_exit(SIGKILL);
 	}
 
 	return SIGSEGV;
diff -puN arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64 arch/ppc64/mm/hash_utils.c
--- linux-2.6-bk/arch/ppc64/mm/hash_utils.c~nx-kernel-ppc64	Thu Oct  7 15:23:55 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/mm/hash_utils.c	Thu Oct  7 15:23:55 2004
@@ -52,6 +52,7 @@
 #include <asm/cacheflush.h>
 #include <asm/cputable.h>
 #include <asm/abs_addr.h>
+#include <asm/sections.h>
 
 #ifdef DEBUG
 #define DBG(fmt...) udbg_printf(fmt)
@@ -89,12 +90,23 @@ static inline void loop_forever(void)
 		;
 }
 
+int is_kernel_text(unsigned long addr)
+{
+	if (addr >= (unsigned long)_stext && addr < (unsigned long)__init_end)
+		return 1;
+
+	return 0;
+}
+
+
+
 #ifdef CONFIG_PPC_MULTIPLATFORM
 static inline void create_pte_mapping(unsigned long start, unsigned long end,
 				      unsigned long mode, int large)
 {
 	unsigned long addr;
 	unsigned int step;
+	unsigned long tmp_mode;
 
 	if (large)
 		step = 16*MB;
@@ -112,6 +124,13 @@ static inline void create_pte_mapping(un
 		else
 			vpn = va >> PAGE_SHIFT;
 
+
+		tmp_mode = mode;
+		
+		/* Make non-kernel text non-executable */
+		if (!is_kernel_text(addr))
+			tmp_mode = mode | HW_NO_EXEC;
+
 		hash = hpt_hash(vpn, large);
 
 		hpteg = ((hash & htab_data.htab_hash_mask)*HPTES_PER_GROUP);
@@ -120,12 +139,12 @@ static inline void create_pte_mapping(un
 		if (systemcfg->platform & PLATFORM_LPAR)
 			ret = pSeries_lpar_hpte_insert(hpteg, va,
 				virt_to_abs(addr) >> PAGE_SHIFT,
-				0, mode, 1, large);
+				0, tmp_mode, 1, large);
 		else
 #endif /* CONFIG_PPC_PSERIES */
 			ret = native_hpte_insert(hpteg, va,
 				virt_to_abs(addr) >> PAGE_SHIFT,
-				0, mode, 1, large);
+				0, tmp_mode, 1, large);
 
 		if (ret == -1) {
 			ppc64_terminate_msg(0x20, "create_pte_mapping");
@@ -239,8 +258,6 @@ unsigned int hash_page_do_lazy_icache(un
 {
 	struct page *page;
 
-#define PPC64_HWNOEXEC (1 << 2)
-
 	if (!pfn_valid(pte_pfn(pte)))
 		return pp;
 
@@ -251,8 +268,8 @@ unsigned int hash_page_do_lazy_icache(un
 		if (trap == 0x400) {
 			__flush_dcache_icache(page_address(page));
 			set_bit(PG_arch_1, &page->flags);
-		} else
-			pp |= PPC64_HWNOEXEC;
+		} else 
+			pp |= HW_NO_EXEC;
 	}
 	return pp;
 }
diff -puN include/asm-ppc64/mmu.h~nx-kernel-ppc64 include/asm-ppc64/mmu.h
diff -puN include/asm-ppc64/pgtable.h~nx-kernel-ppc64 include/asm-ppc64/pgtable.h
--- linux-2.6-bk/include/asm-ppc64/pgtable.h~nx-kernel-ppc64	Thu Oct  7 15:23:55 2004
+++ linux-2.6-bk-moilanen/include/asm-ppc64/pgtable.h	Thu Oct  7 15:23:55 2004
@@ -101,6 +101,12 @@
 /* Bits 0x7000 identify the index within an HPT Group */
 #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
 
+#define HW_NO_EXEC	_PAGE_EXEC /* This is used when the bit is
+				    * inverted, even though it's the
+				    * same value, hopefully it will be
+				    * clearer in the code what is
+				    * going on. */
+
 /* PAGE_MASK gives the right answer below, but only by accident */
 /* It should be preserving the high 48 bits and then specifically */
 /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */
@@ -120,6 +126,7 @@
 #define PAGE_READONLY	__pgprot(_PAGE_BASE | _PAGE_USER)
 #define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
 #define PAGE_KERNEL	__pgprot(_PAGE_BASE | _PAGE_WRENABLE)
+#define PAGE_KERNEL_EXEC __pgprot(_PAGE_BASE | _PAGE_WRENABLE | _PAGE_EXEC)
 
 /*
  * POWER4 and newer have per page execute protection, older chips can only
@@ -266,6 +273,7 @@ int hash_huge_page(struct mm_struct *mm,
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
+static inline int pte_user(pte_t pte)  { return pte_val(pte) & _PAGE_USER;}
 static inline int pte_read(pte_t pte)  { return pte_val(pte) & _PAGE_USER;}
 static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW;}
 static inline int pte_exec(pte_t pte)  { return pte_val(pte) & _PAGE_EXEC;}
diff -puN arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64 arch/ppc64/kernel/iSeries_setup.c
--- linux-2.6-bk/arch/ppc64/kernel/iSeries_setup.c~nx-kernel-ppc64	Thu Oct  7 15:23:55 2004
+++ linux-2.6-bk-moilanen/arch/ppc64/kernel/iSeries_setup.c	Thu Oct  7 15:23:55 2004
@@ -622,6 +622,7 @@ static void __init iSeries_bolt_kernel(u
 {
 	unsigned long pa;
 	unsigned long mode_rw = _PAGE_ACCESSED | _PAGE_COHERENT | PP_RWXX;
+	unsigned long tmp_mode;
 	HPTE hpte;
 
 	for (pa = saddr; pa < eaddr ;pa += PAGE_SIZE) {
@@ -630,6 +631,12 @@ static void __init iSeries_bolt_kernel(u
 		unsigned long va = (vsid << 28) | (pa & 0xfffffff);
 		unsigned long vpn = va >> PAGE_SHIFT;
 		unsigned long slot = HvCallHpt_findValid(&hpte, vpn);
+
+		tmp_mode = mode_rw;
+
+		/* Make non-kernel text non-executable */
+		if (!is_kernel_text(ea))
+			tmp_mode = mode_rw | HW_NO_EXEC;
 
 		if (hpte.dw0.dw0.v) {
 			/* HPTE exists, so just bolt it */

_


From dwmw2 at infradead.org  Wed Oct 13 05:08:02 2004
From: dwmw2 at infradead.org (David Woodhouse)
Date: Tue, 12 Oct 2004 20:08:02 +0100
Subject: cond_syscall() and new ABI.
In-Reply-To: <200410122043.52351.arnd@arndb.de>
References: <1097589108.318.425.camel@hades.cambridge.redhat.com>
	<20041012142627.GA19091@lst.de>  <200410122043.52351.arnd@arndb.de>
Message-ID: <1097608083.5178.5.camel@localhost.localdomain>

On Tue, 2004-10-12 at 20:43 +0200, Arnd Bergmann wrote:
> A better solution IMHO would be to include the right headers from sys.c
> and have
> 
> #define cond_syscall(x) typeof(x) (x) __attribute__((weak,alias("sys_ni_syscall")));

That's true in theory, yes -- not that I can see any way that having the
'correct' prototype will actually make a difference in practice.

> Also, someone should try to find out which toolchains don't support this
> and if anybody is still using those. One issue seems to be the one from
> http://seclists.org/lists/linux-kernel/2004/Jan/2474.html, but I'm not
> sure if that is the problem that the comment refers to.

That happens with both the current inline asm method, and with the
'alias' method which translates to basically the same asm output from
gcc, but without the ifdefs.

-- 
dwmw2


From arnd at arndb.de  Wed Oct 13 04:43:52 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Tue, 12 Oct 2004 20:43:52 +0200
Subject: cond_syscall() and new ABI.
In-Reply-To: <20041012142627.GA19091@lst.de>
References: <1097589108.318.425.camel@hades.cambridge.redhat.com>
	<20041012142627.GA19091@lst.de>
Message-ID: <200410122043.52351.arnd@arndb.de>

On Tuesday 12 October 2004 16:26, Christoph Hellwig wrote:
> > ===== include/asm-ppc64/unistd.h 1.34 vs edited =====
> > --- 1.34/include/asm-ppc64/unistd.h	Tue Sep 14 01:23:12 2004
> > +++ edited/include/asm-ppc64/unistd.h	Tue Oct 12 14:48:08 2004
> > @@ -468,7 +468,7 @@
> >   * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
> >   * but it doesn't work on all toolchains, so we just do it by hand
> >   */
> > -#define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");
> > +#define cond_syscall(x) void x(void) __attribute__((weak,alias("sys_ni_syscall")));
> 
> 
> this one otoh makes lots of sense - it's what most architectures use.

It's also something that looks suboptimal to me. The syscalls should
already have a proper prototype in <linux/syscalls.h>, which typically is
not "void sys_foo(void)".
A better solution IMHO would be to include the right headers from sys.c
and have

#define cond_syscall(x) typeof(x) (x) __attribute__((weak,alias("sys_ni_syscall")));

Also, someone should try to find out which toolchains don't support this
and if anybody is still using those. One issue seems to be the one from
http://seclists.org/lists/linux-kernel/2004/Jan/2474.html, but I'm not
sure if that is the problem that the comment refers to.
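
For illustration, here is a minimal userspace sketch of the typeof-based
form suggested above. It is not kernel code: sys_demo_optional and its
prototype are hypothetical names, sys_ni_syscall() is a local stand-in for
the kernel function, and the sketch assumes GCC or Clang on an ELF target.
The prototype is kept identical to the fallback's so that the alias is
type-compatible; real syscall prototypes usually differ.

```c
#include <errno.h>

/* Stand-in for the kernel's sys_ni_syscall(): "not implemented". */
long sys_ni_syscall(void)
{
    return -ENOSYS;
}

/* Prototype as it would come from a header (hypothetical name). */
long sys_demo_optional(void);

/*
 * Weak alias whose type is taken from the existing prototype via
 * __typeof__. If another object file supplies a strong definition of
 * x, the linker prefers it; otherwise calls land in sys_ni_syscall().
 */
#define cond_syscall(x) \
    __typeof__(x) (x) __attribute__((weak, alias("sys_ni_syscall")))

cond_syscall(sys_demo_optional);
```

With no strong definition linked in, calling sys_demo_optional() returns
-ENOSYS through the alias.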

	Arnd <><



From anton at samba.org  Wed Oct 13 05:39:02 2004
From: anton at samba.org (Anton Blanchard)
Date: Wed, 13 Oct 2004 05:39:02 +1000
Subject: cond_syscall() and new ABI.
In-Reply-To: <1097589108.318.425.camel@hades.cambridge.redhat.com>
References: <1097589108.318.425.camel@hades.cambridge.redhat.com>
Message-ID: <20041012193902.GB3315@krispykreme.ozlabs.ibm.com>


> This (in linux/asm-ppc64/unistd.h) doesn't work with the new ABI:
> 
> /*
>  * "Conditional" syscalls
>  *
>  * What we want is __attribute__((weak,alias("sys_ni_syscall"))),
>  * but it doesn't work on all toolchains, so we just do it by hand
>  */
> #define cond_syscall(x) asm(".weak\t." #x "\n\t.set\t." #x ",.sys_ni_syscall");
> 
> Two options -- either we ditch older toolchains (before 2002-03-01
> probably), by switching to what we say in the comment, or we introduce
> an ifdef to choose whether to include the '.' in the symbol names...

http://ozlabs.org/ppc64-patches/

Has 5 remove -mminimal-toc patches which should fix this mess. The
syscall table is currently abusing the ABI, it would be nice to fix it.

If there are no complaints I'd like to push this patchset once 2.6.10
opens.

Anton


From grave at ipno.in2p3.fr  Wed Oct 13 17:19:56 2004
From: grave at ipno.in2p3.fr (grave)
Date: Wed, 13 Oct 2004 07:19:56 +0000
Subject: libmotovec
Message-ID: <1097651996l.1092l.0l@ipnnarval>

Hi,

Just wondering: does the current powerpc kernel benefit from something
like libmotovec?

Since VMX is also present on the ppc970 family, perhaps we will see more
IBM processors with such a velocity engine in the future, so...

xavier


From benh at kernel.crashing.org  Wed Oct 13 17:44:23 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 13 Oct 2004 17:44:23 +1000
Subject: 2.6.9-rc4: oops during ide probing
In-Reply-To: <m3llednhfl.fsf@igel.m5r.de>
References: <m3llednhfl.fsf@igel.m5r.de>
Message-ID: <1097653462.5553.43.camel@gaston>

On Tue, 2004-10-12 at 06:11, Andreas Schwab wrote:
> I'm getting an oops during ide probing on the PMac G5 with 2.6.9-rc4:

Can you send me a dump of the whole device-tree ?

Ben.


From arnd at arndb.de  Wed Oct 13 19:19:19 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Wed, 13 Oct 2004 11:19:19 +0200
Subject: cond_syscall() and new ABI.
In-Reply-To: <1097608083.5178.5.camel@localhost.localdomain>
References: <1097589108.318.425.camel@hades.cambridge.redhat.com>
	<200410122043.52351.arnd@arndb.de>
	<1097608083.5178.5.camel@localhost.localdomain>
Message-ID: <200410131119.23409.arnd@arndb.de>

On Tuesday 12 October 2004 21:08, David Woodhouse wrote:
> On Tue, 2004-10-12 at 20:43 +0200, Arnd Bergmann wrote:
> > A better solution IMHO would be to include the right headers from sys.c
> > and have
> > 
> > #define cond_syscall(x) typeof(x) (x) __attribute__((weak,alias("sys_ni_syscall")));
> 
> That's true in theory, yes -- not that I can see any way that having the
> 'correct' prototype will actually make a difference in practice.

Right, my point was mostly about having an implementation that is less
surprising to the reader, not about correctness. It might actually
become a bug as soon as someone tries to build the kernel with
a compiler that does inter-module analysis, but that's not likely
to happen soon.

	Arnd <><

From schwab at suse.de  Wed Oct 13 19:48:52 2004
From: schwab at suse.de (Andreas Schwab)
Date: Wed, 13 Oct 2004 11:48:52 +0200
Subject: 2.6.9-rc4: oops during ide probing
In-Reply-To: <1097653462.5553.43.camel@gaston> (Benjamin Herrenschmidt's
	message of "Wed, 13 Oct 2004 17:44:23 +1000")
References: <m3llednhfl.fsf@igel.m5r.de> <1097653462.5553.43.camel@gaston>
Message-ID: <jesm8jx81n.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> On Tue, 2004-10-12 at 06:11, Andreas Schwab wrote:
>> I'm getting an oops during ide probing on the PMac G5 with 2.6.9-rc4:
>
> Can you send me a dump of the whole device-tree ?

By "dump" do you mean ls -R or something more fancy?

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From benh at kernel.crashing.org  Thu Oct 14 00:28:53 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 14 Oct 2004 00:28:53 +1000
Subject: 2.6.9-rc4: oops during ide probing
In-Reply-To: <jesm8jx81n.fsf@sykes.suse.de>
References: <m3llednhfl.fsf@igel.m5r.de> <1097653462.5553.43.camel@gaston>
	<jesm8jx81n.fsf@sykes.suse.de>
Message-ID: <1097677732.10215.1.camel@gaston>

On Wed, 2004-10-13 at 19:48, Andreas Schwab wrote:
> Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:
> 
> > On Tue, 2004-10-12 at 06:11, Andreas Schwab wrote:
> >> I'm getting an oops during ide probing on the PMac G5 with 2.6.9-rc4:
> >
> > Can you send me a dump of the whole device-tree ?
> 
> By "dump" do you mean ls -R or something more fancy?

tarball of /proc/device-tree


From linas at austin.ibm.com  Thu Oct 14 05:23:56 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 13 Oct 2004 14:23:56 -0500
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <416D6D89.6030300@unix.sh>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>
	<20041013165050.GC12237@austin.ibm.com> <416D6D89.6030300@unix.sh>
Message-ID: <20041013192356.GE12237@austin.ibm.com>


Hi, 

I'm copying over to the linuxppc64-dev at ozlabs.org
mailing list, which is the right place to discuss this.

On Wed, Oct 13, 2004 at 12:01:45PM -0600, Alan Robertson was heard to remark:
> Linas Vepstas wrote:
> >Hi,
> >
> >On Wed, Oct 13, 2004 at 09:12:23AM +0800, Zhen Huang was heard to remark:
> >
> >>Hi,
> >>
> >>The watchdog I mentioned means such a device:
> >>Once we open it we must write to it regularly. 
> >>Otherwise the whole system will be reset.
> >>
> >>Many OSes have a software implementation of this.
> >>But a software watchdog depends on the health of the OS.
> >>
> >>I want to know whether there is any hardware implementation on pServer.
> >
> >
> >Yes, there is a hardware watchdog; it's implemented on all pSeries
> >machines that have service processors (thus, it goes back to at
> >least power3).  However, it is not a unix 'device' that a user-land 
> >process can 'open'; it is only accessible through RTAS calls.  The 
> >kernel daemon rtasd provides the regular heartbeat.
> >
> >The kernel enables the watchdog function with the 'enable_surveillance()' 
> >subroutine call (see arch/ppc64/kernel/rtasd.c).
> >Once it's enabled, the heartbeat is the 'event-scan' RTAS call,
> >which the kernel must call regularly from each CPU.  (I guess this
> >helps detect hung CPU's on SMP systems).  If the event-scan call 
> >isn't made within the 'surveillance timeout', the SP will reboot 
> >the OS (or call in a service request, etc.)
> >
> >I don't know if there is any interest in moving this heartbeat 
> >watchdog out from kernel space into user space; right now, 
> >rtasd is a kernel daemon, and it more or less just works.
> >
> >If it ever is converted to userland, it's not likely it will 
> >ever be a traditional unix device; instead, functions like 
> >this are moving to the sysfs file system.
> 
> This would be a logical equivalent to the well-known and long-standing 
> 'softdog' device driver which already has a well-known API, which is also 
> implemented on other hardware devices and architectures.
> 
> So, my suggestion would be that if it were moved to a userspace driver, 
> that the softdog API be retained.

I might have volunteered to hack this up real quick, were it not for
Mike Strosaker's correction, that the surveillance features were taken
out of Power5.   

Anyone on this list know why?

--linas


From strosake at austin.ibm.com  Thu Oct 14 05:57:35 2004
From: strosake at austin.ibm.com (Mike Strosaker)
Date: Wed, 13 Oct 2004 14:57:35 -0500
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <20041013192356.GE12237@austin.ibm.com>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>	<20041013165050.GC12237@austin.ibm.com>
	<416D6D89.6030300@unix.sh> <20041013192356.GE12237@austin.ibm.com>
Message-ID: <416D88AF.1010706@austin.ibm.com>

Linas Vepstas wrote:
> I might have volunteered to hack this up real quick, were it not for
> Mike Strosaker's correction, that the surveillance features were taken
> out of Power5.   
> 
> Anyone on this list know why?
> 

I sent the reason I got from the hardware RAS folks to this list a while back.
Luckily, it's still in my sent mail folder:

"Because of the virtualization layer and partitioning, the surveillance
requirement was moved to PHYP<->SP.  Apparently, this was a hotly
contested issue among the platform design folks (especially considering that
partitioned power4 systems still have OS<->SP surveillance).  I think the logic
is: If an OS goes down, it's not likely a server problem, hence no requirement
to monitor from the server side.

At least the platform gets notified of panics via os-term.  I gather
that some user space tools are expected to monitor for deadlocks/hangs
(maybe clustering tools). "

Thanks,
Mike


From alanr at unix.sh  Thu Oct 14 07:30:02 2004
From: alanr at unix.sh (Alan Robertson)
Date: Wed, 13 Oct 2004 15:30:02 -0600
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <416D88AF.1010706@austin.ibm.com>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>	<20041013165050.GC12237@austin.ibm.com>
	<416D6D89.6030300@unix.sh> <20041013192356.GE12237@austin.ibm.com>
	<416D88AF.1010706@austin.ibm.com>
Message-ID: <416D9E5A.9080102@unix.sh>

Mike Strosaker wrote:
> Linas Vepstas wrote:
> 
>> I might have volunteered to hack this up real quick, were it not for
>> Mike Strosaker's correction, that the surveillance features were taken
>> out of Power5.  
>> Anyone on this list know why?
>>
> 
> I sent the reason I got from the hardware RAS folks to this list a while 
> back.
> Luckily, it's still in my sent mail folder:
> 
> "Because of the virtualization layer and partitioning, the surveillance
> requirement was moved to PHYP<->SP.  Apparently, this was a hotly
> contested issue among the platform design folks (especially considering 
> that
> partitioned power4 systems still have OS<->SP surveillance).  I think 
> the logic
> is: If an OS goes down, it's not likely a server problem, hence no 
> requirement
> to monitor from the server side.
> 
> At least the platform gets notified of panics via os-term.  I gather
> that some user space tools are expected to monitor for deadlocks/hangs
> (maybe clustering tools). "

This is about half-right.

There is one particular circumstance which can ONLY be monitored from a 
hardware-level monitor.

OS hangs.

If the OS hangs, then nothing but a hardware timer can bring the machine 
out of its hung state.  Hangs do NOT panic (by definition), and can't be 
reliably detected any other way.

In highly available systems (like telecom systems), hardware level monitors 
are required.  Leaving it out sends the message that "availability isn't 
important".

The normal way to build a highly available system is to have layers (or a 
hierarchy) of watchers.

	At the bottom is the hardware monitor.

	Above that is an application monitor.

	Above that are resource monitors.

	etc.

But, there are certain kinds of faults which cannot be caught without this 
bottom layer monitor.


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce


From linas at austin.ibm.com  Thu Oct 14 08:12:54 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 13 Oct 2004 17:12:54 -0500
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <416D9E5A.9080102@unix.sh>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>
	<20041013165050.GC12237@austin.ibm.com> <416D6D89.6030300@unix.sh>
	<20041013192356.GE12237@austin.ibm.com>
	<416D88AF.1010706@austin.ibm.com> <416D9E5A.9080102@unix.sh>
Message-ID: <20041013221254.GF12237@austin.ibm.com>

Hi,

On Wed, Oct 13, 2004 at 03:30:02PM -0600, Alan Robertson was heard to remark:
> Mike Strosaker wrote:
> >Linas Vepstas wrote:
> >
> >>I might have volunteered to hack this up real quick, were it not for
> >>Mike Strosaker's correction, that the surveillance features were taken
> >>out of Power5.  
> >>Anyone on this list know why?
> >>
> >
> >I sent the reason I got from the hardware RAS folks to this list a while 
> >back.
> >Luckily, it's still in my sent mail folder:
> >
> >"Because of the virtualization layer and partitioning, the surveillance
> >requirement was moved to PHYP<->SP.  Apparently, this was a hotly
> >contested issue among the platform design folks (especially considering 
> >that
> >partitioned power4 systems still have OS<->SP surveillance).  I think 
> >the logic
> >is: If an OS goes down, it's not likely a server problem, hence no 
> >requirement
> >to monitor from the server side.
> >
> >At least the platform gets notified of panics via os-term.  I gather
> >that some user space tools are expected to monitor for deadlocks/hangs
> >(maybe clustering tools). "
> 
> This is about half-right.
> 
> There is one particular circumstance which can ONLY be monitored from a 
> hardware-level monitor.
> 
> OS hangs.

Heh. I think I can clarify, after talking to the firmware folks.

The core thinking behind the "platform architecture" was to make
sure that the underlying hardware, i.e. the "platform", wasn't hung.
They were not concerned about the OS itself; they assumed that OS'es
have their own independent mechanisms for detecting hung-ness.

From the platform point of view, they are concerned that they'll
have a machine with a dozen different partitions on it (a dozen 
different OS'es), and a hardware hang will take down all twelve.
So they've got the hypervisor and service processor monitoring
each other, keeping things humming.  If just one partition goes
down due to a kernel hang/crash, well, that's too bad, but it's
not the end of the world from the platform point of view.


I think Alan's point of view is from the other side of the table:
why should someone buy 12 pci-card watchdogs, one for each partition,
chewing up 12 pci slots, when the pSeries is already capable of doing
watchdog functions?   To add insult to injury, the sysadmin now needs
to duct-tape each of the watchdog cards to some sort of kill-switch,
to reboot a dead partition.  The kill-switch needs to then ssh to 
the fsp or the hmc to start the reboot.  So it gets pretty byzantine
for something that could have been 'simple' and built-in.  Never mind
that the reliability goes down:  the kill switch could fail, the 
pci watchdog card could fail (or get EEH'ed out), causing a reboot 
when no reboot was necessary, etc. 

--linas


From jschopp at austin.ibm.com  Thu Oct 14 08:32:10 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Wed, 13 Oct 2004 17:32:10 -0500
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <20041013221254.GF12237@austin.ibm.com>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>	<20041013165050.GC12237@austin.ibm.com>
	<416D6D89.6030300@unix.sh>	<20041013192356.GE12237@austin.ibm.com>	<416D88AF.1010706@austin.ibm.com>
	<416D9E5A.9080102@unix.sh> <20041013221254.GF12237@austin.ibm.com>
Message-ID: <416DACEA.1070900@austin.ibm.com>


> I think Alan's point of view is from the other side of the table:
> why should someone buy 12 pci-card watchdogs, one for each partition,
> chewing up 12 pci slots, when the pSeries is already capable of doing
> watchdog functions?   To add insult to injury, the sysadmin now needs
> to duct-tape each of the watchdog cards to some sort of kill-switch,
> to reboot a dead partition.  The kill-switch needs to then ssh to 
> the fsp or the hmc to start the reboot.  So it gets pretty byzantine
> for something that could have been 'simple' and built-in.  Never mind
> that the reliability goes down:  the kill switch could fail, the 
> pci watchdog card could fail (or get EEH'ed out), causing a reboot 
> when no reboot was necessary, etc. 

I will miss the old school hardware watchdog.  If I'd had a vote I would 
have voted to keep it.  But since it is not a democracy I can only add a 
couple points to this argument.

First, if people really care about reliability that much they will be 
running with hot spares in a HA environment.  In that case there are 
already external monitors that activate the spare on any sign of problems.

Second, this can all be done from the HMC.  The HMC is perfectly capable 
of determining the partition is hung (LED error codes, heartbeat 
timeouts).  It is also perfectly capable of rebooting a partition.  I am 
not aware that there is a way to put the two together right now, so that 
the HMC automatically reboots the partition if it hangs, but it would 
certainly be an easy feature to add to the HMC.


From linas at austin.ibm.com  Thu Oct 14 08:53:16 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 13 Oct 2004 17:53:16 -0500
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <416DACEA.1070900@austin.ibm.com>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>
	<20041013165050.GC12237@austin.ibm.com> <416D6D89.6030300@unix.sh>
	<20041013192356.GE12237@austin.ibm.com>
	<416D88AF.1010706@austin.ibm.com> <416D9E5A.9080102@unix.sh>
	<20041013221254.GF12237@austin.ibm.com>
	<416DACEA.1070900@austin.ibm.com>
Message-ID: <20041013225316.GH12237@austin.ibm.com>

On Wed, Oct 13, 2004 at 05:32:10PM -0500, Joel Schopp was heard to remark:
> 
> >I think Alan's point of view is from the other side of the table:
> >why should someone buy 12 pci-card watchdogs, one for each partition,
> >chewing up 12 pci slots, when the pSeries is already capable of doing
> >watchdog functions?   To add insult to injury, the sysadmin now needs
> >to duct-tape each of the watchdog cards to some sort of kill-switch,
> >to reboot a dead partition.  The kill-switch needs to then ssh to 
> >the fsp or the hmc to start the reboot.  So it gets pretty byzantine
> >for something that could have been 'simple' and built-in.  Never mind
> >that the reliability goes down:  the kill switch could fail, the 
> >pci watchdog card could fail (or get EEH'ed out), causing a reboot 
> >when no reboot was necessary, etc. 
> 
> I will miss the old school hardware watchdog.  If I'd had a vote I would 
> have voted to keep it.  But since it is not a democracy I can only add a 
> couple points to this argument.
> 
> First, if people really care about reliability that much they will be 
> running with hot spares in a HA environment.  In that case there are 
> already external monitors that activate the spare on any sign of problems.

Yes, well, Alan is the guy who designs and builds these systems :)
He's trying to figure out how to hook them up to the pSeries.  
You can't just cut the power, like you can for PC's :)

http://www.linux-ha.org

> Second, this can all be done from the HMC.  The HMC is perfectly capable 
> of determining the partition is hung (LED error codes, heartbeat 
> timeouts).  It is also perfectly capable of rebooting a partition.  I am 
> not aware that there is a way to put the two together right now, so that 
> the HMC automatically reboots the partition if it hangs, but it would 
> certainly be an easy feature to add to the HMC.

The HMC is a natural place for this.  One of Alan's complaints
is that (non-pSeries) HMC's tend to be semi-proprietary and mostly 
unarchitected, with a wide variation from one model to another.
The dependence on Java for core functions also makes them untrustworthy.

--linas


From alanr at unix.sh  Thu Oct 14 14:41:26 2004
From: alanr at unix.sh (Alan Robertson)
Date: Wed, 13 Oct 2004 22:41:26 -0600
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <20041013221254.GF12237@austin.ibm.com>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>
	<20041013165050.GC12237@austin.ibm.com> <416D6D89.6030300@unix.sh>
	<20041013192356.GE12237@austin.ibm.com>
	<416D88AF.1010706@austin.ibm.com> <416D9E5A.9080102@unix.sh>
	<20041013221254.GF12237@austin.ibm.com>
Message-ID: <416E0376.1010500@unix.sh>

Linas Vepstas wrote:
> Hi,
> 
> On Wed, Oct 13, 2004 at 03:30:02PM -0600, Alan Robertson was heard to remark:
> 
>>Mike Strosaker wrote:
>>
>>>Linas Vepstas wrote:
>>>
>>>
>>>>I might have volunteered to hack this up real quick, were it not for
>>>>Mike Strosaker's correction, that the surveillance features were taken
>>>>out of Power5.  
>>>>Anyone on this list know why?
>>>>
>>>
>>>I sent the reason I got from the hardware RAS folks to this list a while 
>>>back.
>>>Luckily, it's still in my sent mail folder:
>>>
>>>"Because of the virtualization layer and partitioning, the surveillance
>>>requirement was moved to PHYP<->SP.  Apparently, this was a hotly
>>>contested issue among the platform design folks (especially considering 
>>>that
>>>partitioned power4 systems still have OS<->SP surveillance).  I think 
>>>the logic
>>>is: If an OS goes down, it's not likely a server problem, hence no 
>>>requirement
>>>to monitor from the server side.
>>>
>>>At least the platform gets notified of panics via os-term.  I gather
>>>that some user space tools are expected to monitor for deadlocks/hangs
>>>(maybe clustering tools). "
>>
>>This is about half-right.
>>
>>There is one particular circumstance which can ONLY be monitored from a 
>>hardware-level monitor.
>>
>>OS hangs.
> 
> 
> Heh. I think I can clarify, after talking to the firmware folks.
> 
> The core thinking behind the "platform architecture" was to make
> sure that the underlying hardware, i.e. the "platform", wasn't hung.
> They were not concerned about the OS itself; they assumed that OS'es
> have their own independent mechanisms for detecting hung-ness.
> 
> From the platform point of view, they are concerned that they'll
> have a machine with a dozen different partitions on it (a dozen 
> different OS'es), and a hardware hang will take down all twelve.
> So they've got the hypervisor and service processor monitoring
> each other, keeping things humming.  If just one partition goes
> down due to a kernel hang/crash, well, that's too bad, but it's
> not the end of the world from the platform point of view.

And this is a great set of goals as far as they go.  But, not sufficient 
when looking at the platform as something which actually delivers services, 
not just runs the hypervisor.

[[I guess I forgot to say that in addition to being the architect for IBM's 
OSS Linux strategy and product, I worked for 21 years for Bell Labs on 
highly reliable telecommunications systems before this.  So, I have some 
reasonable knowledge of how these kinds of things work in well-tested, 
well-proven systems.  Typically, telephone systems are considered extremely 
reliable - because they follow a well-proven discipline of design.  The 
international telephone system is in effect the world's largest 
ultra-reliable computer.  And, it has been since back when telephone 
switches were made with discrete transistors - largely because of good HA 
system design]]


> I think Alan's point of view is from the other side of the table:
> why should someone buy 12 pci-card watchdogs, one for each partition,
> chewing up 12 pci slots, when the pSeries is already capable of doing
> watchdog functions?   To add insult to injury, the sysadmin now needs
> to duct-tape each of the watchdog cards to some sort of kill-switch,
> to reboot a dead partition.  The kill-switch needs to then ssh to 
> the fsp or the hmc to start the reboot.  So it gets pretty byzantine
> for something that could have been 'simple' and built-in.  Never mind
> that the reliability goes down:  the kill switch could fail, the 
> pci watchdog card could fail (or get EEH'ed out), causing a reboot 
> when no reboot was necessary, etc. 


Linas is right about the cost and complexity of the monitoring cards and 
the whole system.  In addition, if we're trying to see pSeries as a premium 
highly-reliable system better than the competition, it just doesn't send 
the right message if you tell a customer that this is what they have to do.
It looks really Rube Goldberg-ish (to say the least).

In addition, from a technical perspective, there is a basic principle in HA 
systems which is being ignored here...


	A sick system cannot reliably monitor itself.


If you're relying on a system which you believe to be sick to monitor 
itself, it will be unable to do this reliably under all circumstances - 
it's sick, and therefore not reliable -- by definition.  Crazy people may 
not think they're insane ;-).  The hardware watchdog timer is a 3rd party 
monitoring system, and therefore is likely to be reliable when the thing it 
is watching is sick - because its sanity is uncorrelated to the failure of 
the thing it is watching.

For example, if by a programming error in the kernel, you halt or loop with 
interrupts disabled -- you're screwed with no way out.  In mainframes I 
think this is called a disabled wait state.  Of course, there are more 
complex ways to do this, but hopefully one example makes the point.

This is the point of the hierarchy of monitoring I described before.  This 
is very much standard operating procedure for reliable systems in the 
telecom industry (and many others).  In fact, such a watchdog timer is a 
requirement for Carrier Grade Linux (CGL).

Here is the standard way in which highly available systems are architected to 
work -- and it's consistent with 35-year industry practice in telephony 
systems, the formal CGL requirements, and the architecture of the Linux-HA 
system.

	The hardware watchdog timer times out when it doesn't get
		a heartbeat in the allotted time.  (duhhh!)

	Just before loading the BIOS, the watchdog timer should be set
		for some "reasonable" amount of time (like a few
		seconds) for the BIOS to load and begin executing.

	The BIOS should set the timer for a reasonable
		time for the bootstrap program to load.  It must tickle
		it periodically while waiting for input from humans.*

	The bootstrap loader should work much the
		same way.  Before it jumps to the OS, it should set
		the timer for a reasonable amount of time for
		the OS to take over the tickling.*

	When it first comes up, the OS takes over and tickles the
		watchdog timer.

	When the HA monitoring subsystem comes up, it takes over
		and tickles the watchdog timer.

	As HA-aware processes start up, they tickle individual watchdog
		timers maintained by the HA monitoring subsystem (apphbd).
		If they die, or hang, they are restarted by the Recovery
		Manager.  As a special case, apphbd will restart the
		recovery manager as described below.

	The recovery manager registers with the HA monitoring subsystem
		and receives notification of insane or dead
		processes.  If they're insane it kills them.
		When they die, it restarts them.
		If the recovery manager dies (or goes insane), then
		apphbd will (kill and) restart the recovery manager.**

	When the system panics, then the watchdog timer needs to be
		tickled while waiting for human input, and while making
		progress taking a dump. [but only when actually
		making progress].

	When the OS jumps back into the BIOS for any reason
		then the timer is reset to some value suitable
		for the BIOS to take over and start tickling it.
		(~ same as the original value).

Now if the BIOS or OS or bootstrap loader, or dump process craps out and 
hangs, or the hard disk can't boot, or a peripheral hangs the bus, then 
this watchdog timer will trigger, and the system will be reset - and you'll 
get a chance to try it again.  [[If you fail too often in too short a 
period of time, then "phone home" or cry "uncle" or sit and cry if you 
like.  Or, you can just keep persisting...]]

Later on when the HA monitoring system is running, if it (or the scheduler or 
other piece of the OS) craps out and the HA monitoring system doesn't (or 
isn't able to) tickle this watchdog timer - for whatever reason - then 
everything will reboot just like it should.
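
The hand-off between boot stages described above can be sketched in a few
lines of C. This is only a model under stated assumptions: struct watchdog
and the wd_* functions are hypothetical names, time is a plain counter in
seconds, and nothing here talks to real hardware or to RTAS.

```c
#include <stdbool.h>

/* Hypothetical model of one hardware watchdog; times are in seconds. */
struct watchdog {
    bool armed;
    unsigned long deadline;   /* absolute time at which the reset fires */
};

/*
 * Arm or re-arm the timer. Each stage (firmware, boot loader, OS, HA
 * monitor) calls this with a timeout suited to whoever tickles next.
 */
void wd_tickle(struct watchdog *wd, unsigned long now, unsigned long timeout)
{
    wd->armed = true;
    wd->deadline = now + timeout;
}

/* Switch the timer off, e.g. before booting a layer that is not
 * watchdog-aware. */
void wd_disable(struct watchdog *wd)
{
    wd->armed = false;
}

/* True when the hardware would reset the machine at time `now`. */
bool wd_expired(const struct watchdog *wd, unsigned long now)
{
    return wd->armed && now >= wd->deadline;
}
```

A stage that hangs simply stops calling wd_tickle(), and wd_expired()
becomes true once the last deadline passes; no cooperation from the hung
stage is needed.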

Notice how many different kinds of errors this one single timer can detect 
and recover from - and how many of them cannot easily be recovered from at 
all without it.  Note how handy it is in designing the system to know that 
your underlying hardware has this capability built-in.  It eliminates a lot 
of complexity from several pieces of software, and does a better job too!

Without this timer, you can't easily design a truly reliable system.  (and 
maybe not at all).

<pedantic-mode>
The lowest level monitor should be the simplest and most reliable.  It 
monitors the OS.  The driver for this in the kernel should also be solid 
and no-frills.  The base-level HA monitoring system (which monitors 
processes for their health) should also be as simple as possible. 
Complexity is the enemy of reliability.  If any of these components fail, 
then the system will be rebooted unnecessarily.  This is a BadThing(TM).

Now, to use this "right", the thing that any subsystem tickling the timer 
at the next higher level should do is periodically schedule something to 
evaluate its internal sanity (data structure consistency or queue lengths 
or whatever), and tickle the watchdog timer only when it passes whatever 
its internal sanity measure is.

Then, if you go into an infinite loop, or doubt your own sanity long 
enough, someone else will eventually do something about it - you'll be 
killed and restarted (if a process) -- or rebooted (if you're the HA 
process monitor, or the BIOS, or bootstrap loader or OS).
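
A minimal sketch of that "check yourself, then tickle" rule, assuming a
hypothetical subsystem whose only sanity measure is a queue-length bound
(struct subsystem and both function names are made up for illustration):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subsystem state: a bounded work queue. */
struct subsystem {
    size_t queue_len;
    size_t queue_max;
    unsigned long tickles;   /* heartbeats sent so far */
};

/* Internal sanity measure; here just a queue-length bound. */
static bool subsystem_is_sane(const struct subsystem *s)
{
    return s->queue_len <= s->queue_max;
}

/*
 * Called periodically. A heartbeat is sent only when the sanity check
 * passes; otherwise the subsystem stays silent and lets the watcher
 * above it time out. Returns whether a heartbeat was sent.
 */
bool subsystem_heartbeat(struct subsystem *s)
{
    if (!subsystem_is_sane(s))
        return false;
    s->tickles++;            /* stands in for the real tickle call */
    return true;
}
```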

Of course, this doesn't *replace* external monitoring (see the note above 
about declaring oneself sick), but it is a good orthogonal measure, and 
simpler to implement for subsystems with limited external interfaces - like 
the bootstrap loader.
</pedantic-mode>

* = Note that these layers may have to deal with bootstrap loaders and/or 
OSes which won't tickle the watchdog timer - so they have to shut it off 
(or set it really long) when booting a layer under them which isn't 
watchdog-aware.

** = The reason why the recovery manager is not part of the apphbd process 
in our design is because the apphbd process should be as simple as it can 
be - because its death or insanity would trigger a system restart. 
Putting it in a separate process lessens the likelihood of an unnecessary 
system restart.  This is not a necessity, but I believe it to be a good 
design choice - after all it was my design choice ;-)

It is certainly true that we don't have to implement all these things 
today, or at all, but with the hardware watchdog timer, they're possible. 
And, without it, they're not.

Even without implementing all these extra HA features, it still monitors 
the OS more reliably than it can monitor itself.  So, I think this is a 
very worthwhile feature for the platform to have.

Hope this helps!


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce


From arnd at arndb.de  Thu Oct 14 21:35:13 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Thu, 14 Oct 2004 13:35:13 +0200
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <20041013192356.GE12237@austin.ibm.com>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<416D6D89.6030300@unix.sh> <20041013192356.GE12237@austin.ibm.com>
Message-ID: <200410141335.17333.arnd@arndb.de>

On Wednesday 13 October 2004 21:23, Linas Vepstas wrote:
> On Wed, Oct 13, 2004 at 12:01:45PM -0600, Alan Robertson was heard to remark:

> > This would be a logical equivalent to the well-known and long-standing 
> > 'softdog' device driver which already has a well-known API, which is also 
> > implemented on other hardware devices and architectures.
> > 
> > So, my suggestion would be that if it were moved to a userspace driver, 
> > that the softdog API be retained.
> 
> I might have volunteered to hack this up real quick, were it not for
> Mike Strosaker's correction, that the surveillance features were taken
> out of Power5.

FWIW, s390 linux has just added support for a hypervisor watchdog [1]
that looks like a hardware watchdog to linux, but is implemented with
hypercalls ("diag 0x288"). Since Power5 is typically running in hypervisor
mode, the watchdog interface could be provided completely by the
firmware.

	Arnd <><

[1] http://ftp2.de.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/\
    2.6.9-rc4/2.6.9-rc4-mm1/broken-out/s390-9-12-z-vm-watchdog-timer.patch

From alanr at unix.sh  Fri Oct 15 01:56:14 2004
From: alanr at unix.sh (Alan Robertson)
Date: Thu, 14 Oct 2004 09:56:14 -0600
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <200410141335.17333.arnd@arndb.de>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<416D6D89.6030300@unix.sh> <20041013192356.GE12237@austin.ibm.com>
	<200410141335.17333.arnd@arndb.de>
Message-ID: <416EA19E.40300@unix.sh>

Arnd Bergmann wrote:
> On Wednesday 13 October 2004 21:23, Linas Vepstas wrote:
> 
>>On Wed, Oct 13, 2004 at 12:01:45PM -0600, Alan Robertson was heard to remark:
> 
> 
>>>This would be a logical equivalent to the well-known and long-standing 
>>>'softdog' device driver which already has a well-known API, which is also 
>>>implemented on other hardware devices and architectures.
>>>
>>>So, my suggestion would be that if it were moved to a userspace driver, 
>>>that the softdog API be retained.
>>
>>I might have volunteered to hack this up real quick, were it not for
>>Mike Strosaker's correction, that the surveillance features were taken
>>out of Power5.   
> 
> 
> FWIW, s390 linux has just added support for a hypervisor watchdog [1]
> that looks like a hardware watchdog to linux, but is implemented with
> hypercalls ("diag 0x288"). Since Power5 is typically running in hypervisor
> mode, the watchdog interface could be provided completely by the
> firmware.

The method of implementation isn't that important.  Are these hypervisor 
calls (or equivalent) provided by power5?

Is there any disadvantage to running under the hypervisor?

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce


From linas at austin.ibm.com  Fri Oct 15 02:21:41 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Thu, 14 Oct 2004 11:21:41 -0500
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <416E0376.1010500@unix.sh>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>
	<20041013165050.GC12237@austin.ibm.com> <416D6D89.6030300@unix.sh>
	<20041013192356.GE12237@austin.ibm.com>
	<416D88AF.1010706@austin.ibm.com> <416D9E5A.9080102@unix.sh>
	<20041013221254.GF12237@austin.ibm.com> <416E0376.1010500@unix.sh>
Message-ID: <20041014162141.GA958@austin.ibm.com>

Hi Alan,

Long emails confuse me ...

On Wed, Oct 13, 2004 at 10:41:26PM -0600, Alan Robertson was heard to remark:
> Linas Vepstas wrote:
> >why should someone buy 12 pci-card watchdogs, one for each partition,
> >chewing up 12 pci slots, when the pSeries is already capable of doing
>
>  It looks really Rube Goldberg-ish (to say the least).

[...]
> 
> The hardware watchdog timer is a 3rd party 
> monitoring system, and therefore is likely to be reliable when the thing it 
> is watching is sick - 


Not sure where you're going with this; are you saying that 
3rd-party watchdog PCI cards, one for each partition, is a 
good idea, or a bad idea?  

Would you rather have the OS monitoring done with 
(a) watchdog PCI cards,
(b) with 'surveillance' done by firmware/hypervisor, 
(c) or with some other method?


> 	The bootstrap loader should work much the

I guess I didn't get this exposition either.  Although it's nice to 
know that boot was successful,  I see boot as a whole lot less 
important than monitoring the system once it's gone 'online'.  The boot
sequence can be monitored much more loosely, with a whole lot less
complexity.  The hypervisor knows when the OS boot sequence starts.
If the OS hasn't completely booted after, say, 10 minutes, then it
can call a human to look at the problem.  I don't see why one needs
to heartbeat once a second during boot; that's hard to do and seems
unnecessary.  By contrast, I'd expect to turn on the once-per-second
heartbeat just before the system goes 'online' or 'critical'.


--linas


From alanr at unix.sh  Fri Oct 15 03:34:48 2004
From: alanr at unix.sh (Alan Robertson)
Date: Thu, 14 Oct 2004 11:34:48 -0600
Subject: Hardware Watchdog Device in pSeries?
In-Reply-To: <20041014162141.GA958@austin.ibm.com>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>
	<20041013165050.GC12237@austin.ibm.com> <416D6D89.6030300@unix.sh>
	<20041013192356.GE12237@austin.ibm.com>
	<416D88AF.1010706@austin.ibm.com> <416D9E5A.9080102@unix.sh>
	<20041013221254.GF12237@austin.ibm.com> <416E0376.1010500@unix.sh>
	<20041014162141.GA958@austin.ibm.com>
Message-ID: <416EB8B8.8040601@unix.sh>

Linas Vepstas wrote:
> Hi Alan,
> 
> Long emails confuse me ...
> 
> On Wed, Oct 13, 2004 at 10:41:26PM -0600, Alan Robertson was heard to remark:
> 
>>Linas Vepstas wrote:
>>
>>>why should someone buy 12 pci-card watchdogs, one for each partition,
>>>chewing up 12 pci slots, when the pSeries is already capable of doing
>>
>> It looks really Rube Goldberg-ish (to say the least).
> 
> 
> [...]
> 
>>The hardware watchdog timer is a 3rd party 
>>monitoring system, and therefore is likely to be reliable when the thing it 
>>is watching is sick - 
> 
> 
> 
> Not sure where you're going with this; are you saying that 
> 3rd-party watchdog PCI cards, one for each partition, is a 
> good idea, or a bad idea?  
> 
> Would you rather have the OS monitoring done with 
> (a) watchdog PCI cards,
> (b) with 'surveillance' done by firmware/hypervisor, 
> (c) or with some other method?

I would prefer (b).  Because the software and address spaces of the 
firmware/hypervisor are separate, it is effectively a third party reset 
mechanism.  The test I would use is:  Does failure of the thing being 
monitored cause or correlate to failure in the thing doing the monitoring - 
and the answer is "no" -- therefore it's a third-party reset.


I don't have a (c) method in mind that would work in this environment.

Evaluating (a) and (b):

Method (a):
	+ is third party
	- is complex and hard to configure all around
		(think about configuring those cards with passwords
			and ssh, and ip addresses and partition names
			and so on - also think about how many things
			could break and keep this from working).
	- difficult to support
	- doesn't scale well in any obvious way
	- is relatively expensive for the customer (adds several hundred
		dollars for each partition - maybe as much as $1K)
	- difficult to bring into existence (compared to (b))
	- is ugly, kludgy, and Rube Goldberg-ish.

Method (b):
	+ is third party
	+ is relatively simple when compared to (a) (i.e., more reliable)
	+ requires little/no special configuration to make it work
	+ Shows off the advantages of pSeries architecture
	+ adds no cost to the customer's solution
	+ is comparatively easy to bring into existence (compared to a)
	+ is a natural and clean solution.

>>	The bootstrap loader should work much the
> 
> 
> I guess I didn't get this exposition either. 

---- OK -- as I said this is an improvement over the above - but
	not absolutely critical -- But I'll try explaining
	it again and see if giving a shorter answer helps -------

> Although it's nice to
> know that boot was successful,  I see boot as a whole lot less 
> important than monitoring the system once it's gone 'online'.  The boot
> sequence can be monitored much more loosely, with a whole lot less
> complexity.  The hypervisor knows when the OS boot sequence starts.
> If the OS hasn't completely booted after, say, 10 minutes, then it
> can call a human to look at the problem.  I don't see why one needs
> to heartbeat once a second during boot; that's hard to do and seems
> unnecessary. 

I didn't say anything about once a second.  It could be once every 30 
seconds - or even 5 minutes.  That gives you lots of time, and you then 
only have to heartbeat in a couple of select places, and while in input 
loops waiting for human input.  These aren't so much periodic heartbeats as 
they are progress reports.  If you stop making progress, you get reset.

> By contrast, I'd expect to turn on the once-per-second
> heartbeat just before the system goes 'online' or 'critical'.

This change decreases MTTR.  MTTR has an effect on system availability - 
even in a redundant HA cluster - since MTTR determines the probability of 
"simultaneous" failures from which the HA system cannot recover.
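
As a rough worked example of why MTTR matters, the sketch below computes 
single-node steady-state availability as MTTF / (MTTF + MTTR), plus a 
first-order estimate (MTTR / MTTF, reasonable only when MTTR is much 
smaller than MTTF) of the chance that the surviving node fails while the 
first is still being repaired.  The function names and any numbers fed to 
them are illustrative assumptions, not measurements from any system.

```c
#include <assert.h>

/* Steady-state availability of one node: MTTF / (MTTF + MTTR). */
static double availability(double mttf_hours, double mttr_hours)
{
    assert(mttf_hours > 0.0 && mttr_hours >= 0.0);
    return mttf_hours / (mttf_hours + mttr_hours);
}

/*
 * First-order estimate of the probability that a second, independent
 * node fails during the repair window of the first.  Valid only for
 * mttr_hours << mttf_hours; this is an approximation, not an exact model.
 */
static double overlap_probability(double mttf_hours, double mttr_hours)
{
    assert(mttf_hours > 0.0 && mttr_hours >= 0.0);
    return mttr_hours / mttf_hours;
}
```

Under these assumptions, cutting the repair time by a factor of ten cuts 
the estimated chance of an overlapping failure by the same factor.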

Calling a human is slow and often expensive (particularly on an emergency 
basis).   It takes minutes to hours and may result in an extra service 
charge from someone (depending on who gets the call, what time it is, and 
what arrangements are made, etc.).

A system which doesn't boot isn't providing service.  If service isn't 
being provided, it doesn't matter why it's not being provided (OS, dump, 
bootstrap, BIOS, etc.)...  The OS is not the only possible cause of 
failure.  The OS is by far more likely than these others, but all software 
has bugs.  And, hardware has transient failures as well as permanent ones.

A system with these capabilities will continue to try and provide service 
in the presence of (transient) errors until it succeeds, or exceeds some 
retry threshold, meaning a human needs to intervene and fix whatever's wrong.

This is essentially autonomic computing for the boot process.

In short:
	With this architecture, the system will come up and provide
		service, or it is broken so badly that retrying won't
		help and a human really is needed.

	Otherwise, no recovery will be performed for errors which keep
		the system from coming up (after a crash or otherwise)
		and some outages may be unnecessarily prolonged.
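
The retry-until-threshold behaviour described above could be sketched 
roughly as follows.  This assumes a hypothetical supervisor (firmware or 
hypervisor) that is told when a boot attempt timed out without a progress 
report; the type names, function names and the threshold field are all 
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum boot_action { BOOT_RETRY, BOOT_CALL_HUMAN };

struct boot_supervisor {
    unsigned int failures;     /* consecutive boot attempts that timed out */
    unsigned int max_retries;  /* hypothetical retry threshold */
};

/* Called when the watchdog expires before the next progress report. */
static enum boot_action on_boot_timeout(struct boot_supervisor *b)
{
    assert(b != NULL);
    b->failures++;
    return (b->failures > b->max_retries) ? BOOT_CALL_HUMAN : BOOT_RETRY;
}

/* Called once the system is up and providing service. */
static void on_boot_success(struct boot_supervisor *b)
{
    assert(b != NULL);
    b->failures = 0;
}
```

Transient errors are absorbed by the retries; only a failure that persists 
past the threshold escalates to a human.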

If your availability is poor, this will make zero difference.  If your 
availability is very good, this helps a little.  And, when your 
availability is very good, it's hard to find things that help even a little...

Of course, being able to say "autonomic computing wired into the lowest 
levels of the system" probably has marketing value beyond the small amount 
of improved availability it provides ;-)

[[If this system is running the air traffic control system while I'm in the 
air, I vote for adding this feature ;-)]].


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce


From alanr at unix.sh  Fri Oct 15 07:19:17 2004
From: alanr at unix.sh (Alan Robertson)
Date: Thu, 14 Oct 2004 15:19:17 -0600
Subject: My use of the term "3rd party"
In-Reply-To: <416E0376.1010500@unix.sh>
References: <OF4372D4F6.E01A4E8C-ON87256F2B.00580229-86256F2B.005A4D56@us.ibm.com>
	<OF3BEF48D0.3C639A60-ON48256F2C.00062530-48256F2C.0006E5F2@cn.ibm.com>
	<20041013165050.GC12237@austin.ibm.com> <416D6D89.6030300@unix.sh>
	<20041013192356.GE12237@austin.ibm.com>
	<416D88AF.1010706@austin.ibm.com> <416D9E5A.9080102@unix.sh>
	<20041013221254.GF12237@austin.ibm.com> <416E0376.1010500@unix.sh>
Message-ID: <416EED55.7050200@unix.sh>

I just realized that this term has a different meaning to many people than 
it does to me in this context.

I meant that it was independent of the thing it was monitoring.

That is, that its probability of failure is an independent random variable 
with respect to the thing it is measuring.  In other words, the failure of 
the watchdog timer is uncorrelated to failures of the operating system or 
other user of the watchdog timer.
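
Stated as a formula, with W the event "the watchdog timer fails" and S the 
event "the monitored system fails" (both symbols are introduced here for 
illustration only), independence means:

```latex
P(W \cap S) = P(W)\,P(S)
```

so the chance of both failing together is just the product of two small 
numbers, rather than being driven up by a common cause.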

I did *not* mean that you had to buy it from a 3rd party hardware manufacturer.

My apologies for what was probably a poor choice of terminology.


-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me claim 
from you at all times your undisguised opinions." - William Wilberforce


From benh at kernel.crashing.org  Fri Oct 15 19:16:32 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 15 Oct 2004 19:16:32 +1000
Subject: Fan control for PowerMac7_3
Message-ID: <1097831790.1131.111.camel@gaston>

Hi !

This is an experimental (read: totally untested) patch to the G5 fan
control code. All I know is that it builds :)

It should add proper support for all desktop G5s including liquid cooling.
I suggest you run it with debug enabled (#undef DEBUG -> #define DEBUG in
the beginning of the .c file) and send me the output though :)

It does _NOT_ add support for the Xserve yet !

People who already have working cooling don't _need_ to test. They are
welcome to do so in case I broke something, but only send me the
output if you feel something is wrong ...

Should apply on top of current bk.

Ben.
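
For readers following the patch, here is a small standalone sketch of the 
register encoding that set_rpm_fan() and get_rpm_fan() use: the RPM value 
is clamped and then spread over two bytes (rpm >> 5 and rpm << 3), and read 
back as (buf[0] << 5) | (buf[1] >> 3).  The helper names fcu_pack_rpm and 
fcu_unpack_rpm are hypothetical and do not exist in the driver, and the 
upper-clamp condition is assumed, since the hunk only shows the assignment 
to 8191.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp to the range used by set_rpm_fan(), then pack into two bytes. */
static void fcu_pack_rpm(int rpm, unsigned char buf[2])
{
    assert(buf != NULL);
    if (rpm < 300)
        rpm = 300;
    if (rpm > 8191)     /* assumed condition; only "rpm = 8191" is visible */
        rpm = 8191;
    buf[0] = (unsigned char)(rpm >> 5);  /* upper 8 bits of the 13-bit value */
    buf[1] = (unsigned char)(rpm << 3);  /* lower 5 bits, left-aligned */
}

/* Inverse of the packing above, as done at the end of get_rpm_fan(). */
static int fcu_unpack_rpm(const unsigned char buf[2])
{
    assert(buf != NULL);
    return (buf[0] << 5) | (buf[1] >> 3);
}
```

Because 8191 is the largest 13-bit value, every clamped RPM survives the 
round trip unchanged.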

diff -urN linux-2.5/drivers/macintosh/therm_pm72.c linux-pogo/drivers/macintosh/therm_pm72.c
--- linux-2.5/drivers/macintosh/therm_pm72.c	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.c	2004-10-15 19:09:05.000000000 +1000
@@ -46,6 +46,8 @@
  *          overtemp conditions so userland can take some policy
  *          decisions, like slewing down CPUs
  *	  - Deal with fan and i2c failures in a better way
+ *	  - Maybe do a generic PID based on params used for
+ *	    U3 and Drives ?
  *
  * History:
  *
@@ -73,6 +75,13 @@
  *        values in the configuration register
  *	- Switch back to use of target fan speed for PID, thus lowering
  *        pressure on i2c
+ *
+ *  Oct. 15, 2004 : 1.1b1 (beta)
+ *	- Add device-tree lookup for fan IDs, should detect liquid cooling
+ *        pumps when present
+ *	- Enable driver for PowerMac7,3 machines
+ *	- Split the U3/Backside cooling on U3 & U3H versions as Darwin does
+ *	- Add new CPU cooling algorithm for machines with liquid cooling
  */
 
 #include <linux/config.h>
@@ -101,7 +110,7 @@
 
 #include "therm_pm72.h"
 
-#define VERSION "0.9"
+#define VERSION "1.1b1"
 
 #undef DEBUG
 
@@ -121,16 +130,100 @@
 static struct i2c_adapter *		u3_1;
 static struct i2c_client *		fcu;
 static struct cpu_pid_state		cpu_state[2];
+static struct basckside_pid_params	backside_params;
 static struct backside_pid_state	backside_state;
 static struct drives_pid_state		drives_state;
 static int				state;
 static int				cpu_count;
+static int				cpu_pid_type;
 static pid_t				ctrl_task;
 static struct completion		ctrl_complete;
 static int				critical_state;
 static DECLARE_MUTEX(driver_lock);
 
 /*
+ * We have 2 types of CPU PID control. One is "split" old style control
+ * for intake & exhaust fans, the other is "combined" control for both
+ * CPUs that also deals with the pumps when present. To be "compatible"
+ * with OS X at this point, we only use "COMBINED" on the machines that
+ * are identified as having the pumps (though that identification is at
+ * least dodgy). Ultimately, we could probably switch completely to this
+ * algorithm provided we hack it to deal with the UP case
+ */
+#define CPU_PID_TYPE_SPLIT	0
+#define CPU_PID_TYPE_COMBINED	1
+
+/*
+ * This table describes all fans in the FCU. The "id" and "type" values
+ * are defaults valid for all earlier machines. Newer machines will
+ * eventually override the table content based on the device-tree
+ */
+struct fcu_fan_table
+{
+	char*	loc;	/* location code */
+	int	type;	/* FCU_FAN_RPM or FCU_FAN_PWM (pumps are driven as RPM fans) */
+	int	id;	/* id or -1 */
+};
+
+#define FCU_FAN_RPM		0
+#define FCU_FAN_PWM		1
+
+#define FCU_FAN_ABSENT_ID	-1
+
+#define FCU_FAN_COUNT		ARRAY_SIZE(fcu_fans)
+
+struct fcu_fan_table	fcu_fans[] = {
+	[BACKSIDE_FAN_PWM_INDEX] = {
+		.loc	= "BACKSIDE",
+		.type	= FCU_FAN_PWM,
+		.id	= BACKSIDE_FAN_PWM_DEFAULT_ID,
+	},
+	[DRIVES_FAN_RPM_INDEX] = {
+		.loc	= "DRIVE BAY",
+		.type	= FCU_FAN_RPM,
+		.id	= DRIVES_FAN_RPM_DEFAULT_ID,
+	},
+	[SLOTS_FAN_PWM_INDEX] = {
+		.loc	= "SLOT",
+		.type	= FCU_FAN_PWM,
+		.id	= SLOTS_FAN_PWM_DEFAULT_ID,
+	},
+	[CPUA_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU A INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUA_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU A EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU B INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU B EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	/* pumps aren't present by default, have to be looked up in the
+	 * device-tree
+	 */
+	[CPUA_PUMP_RPM_INDEX] = {
+		.loc	= "CPU A PUMP",
+		.type	= FCU_FAN_RPM,		
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+	[CPUB_PUMP_RPM_INDEX] = {
+		.loc	= "CPU B PUMP",
+		.type	= FCU_FAN_RPM,
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+};
+
+/*
  * i2c_driver structure to attach to the host i2c controller
  */
 
@@ -331,10 +424,16 @@
 	return 0;
 }
 
-static int set_rpm_fan(int fan, int rpm)
+static int set_rpm_fan(int fan_index, int rpm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (rpm < 300)
 		rpm = 300;
@@ -342,43 +441,55 @@
 		rpm = 8191;
 	buf[0] = rpm >> 5;
 	buf[1] = rpm << 3;
-	rc = fan_write_reg(0x10 + (fan * 2), buf, 2);
+	rc = fan_write_reg(0x10 + (id * 2), buf, 2);
 	if (rc < 0)
 		return -EIO;
 	return 0;
 }
 
-static int get_rpm_fan(int fan, int programmed)
+static int get_rpm_fan(int fan_index, int programmed)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc, reg_base;
+	int rc, id, reg_base;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0xb, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0xd, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
 	reg_base = programmed ? 0x10 : 0x11;
-	rc = fan_read_reg(reg_base + (fan * 2), buf, 2);
+	rc = fan_read_reg(reg_base + (id * 2), buf, 2);
 	if (rc != 2)
 		return -EIO;
 
 	return (buf[0] << 5) | buf[1] >> 3;
 }
 
-static int set_pwm_fan(int fan, int pwm)
+static int set_pwm_fan(int fan_index, int pwm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (pwm < 10)
 		pwm = 10;
@@ -386,32 +497,38 @@
 		pwm = 100;
 	pwm = (pwm * 2559) / 1000;
 	buf[0] = pwm;
-	rc = fan_write_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_write_reg(0x30 + (id * 2), buf, 1);
 	if (rc < 0)
 		return rc;
 	return 0;
 }
 
-static int get_pwm_fan(int fan)
+static int get_pwm_fan(int fan_index)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0x2b, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0x2d, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
-	rc = fan_read_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_read_reg(0x30 + (id * 2), buf, 1);
 	if (rc != 1)
 		return -EIO;
 
@@ -513,80 +630,84 @@
 /*
  * CPUs fans control loop
  */
-static void do_monitor_cpu(struct cpu_pid_state *state)
+
+static int do_read_one_cpu_values(struct cpu_pid_state *state, s32 *temp, s32 *power)
 {
-	s32 temp, voltage, current_a, power, power_target;
-	s32 integral, derivative, proportional, adj_in_target, sval;
-	s64 integ_p, deriv_p, prop_p, sum; 
-	int i, intake, rc;
+	s32 ltemp, volts, amps;
+	int rc = 0;
 
-	DBG("cpu %d:\n", state->index);
+	/* Default (in case of error) */
+	*temp = state->cur_temp;
+	*power = state->cur_power;
 
 	/* Read current fan status */
 	if (state->index == 0)
-		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	else
-		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
-		printk(KERN_WARNING "Error %d reading CPU %d exhaust fan !\n",
-		       rc, state->index);
-		/* XXX What do we do now ? */
-	} else
+		/* XXX What do we do now ? Nothing for now, keep old value, but
+		 * return error upstream
+		 */
+		DBG("  cpu %d, fan reading error !\n", state->index);
+	} else {
 		state->rpm = rc;
-	DBG("  current rpm: %d\n", state->rpm);
+		DBG("  cpu %d, exhaust RPM: %d\n", state->index, state->rpm);
+	}
 
 	/* Get some sensor readings and scale it */
-	temp = read_smon_adc(state, 1);
-	if (temp == -1) {
+	ltemp = read_smon_adc(state, 1);
+	if (ltemp == -1) {
+		/* XXX What do we do now ? */
 		state->overtemp++;
-		return;
+		if (rc == 0)
+			rc = -EIO;
+		DBG("  cpu %d, temp reading error !\n", state->index);
+	} else {
+		/* Fixup temperature according to diode calibration
+		 */
+		DBG("  cpu %d, temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
+		    state->index,
+		    ltemp, state->mpu.mdiode, state->mpu.bdiode);
+		*temp = ((s32)ltemp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
+		state->last_temp = *temp;
+		DBG("  temp: %d.%03d\n", FIX32TOPRINT((*temp)));
 	}
-	voltage = read_smon_adc(state, 3);
-	current_a = read_smon_adc(state, 4);
 
-	/* Fixup temperature according to diode calibration
+	/*
+	 * Read voltage & current and calculate power
 	 */
-	DBG("  temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
-	    temp, state->mpu.mdiode, state->mpu.bdiode);
-	temp = ((s32)temp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
-	state->last_temp = temp;
-	DBG("  temp: %d.%03d\n", FIX32TOPRINT(temp));
+	volts = read_smon_adc(state, 3);
+	amps = read_smon_adc(state, 4);
 
-	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
-	 * full blown immediately and try to trigger a shutdown
-	 */
-	if (temp >= ((state->mpu.tmax + 8) << 16)) {
-		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
-		       " (%d) !\n",
-		       state->index, temp >> 16);
-		state->overtemp = CPU_MAX_OVERTEMP;
-	} else if (temp > (state->mpu.tmax << 16))
-		state->overtemp++;
-	else
-		state->overtemp = 0;
-	if (state->overtemp >= CPU_MAX_OVERTEMP)
-		critical_state = 1;
-	if (state->overtemp > 0) {
-		state->rpm = state->mpu.rmaxn_exhaust_fan;
-		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
-		goto do_set_fans;
-	}
-	
-	/* Scale other sensor values according to fixed scales
+	/* Scale voltage and current raw sensor values according to fixed scales
 	 * obtained in Darwin and calculate power from I and V
 	 */
-	state->voltage = voltage *= ADC_CPU_VOLTAGE_SCALE;
-	state->current_a = current_a *= ADC_CPU_CURRENT_SCALE;
-	power = (((u64)current_a) * ((u64)voltage)) >> 16;
+	volts *= ADC_CPU_VOLTAGE_SCALE;
+	amps *= ADC_CPU_CURRENT_SCALE;
+	*power = (((u64)volts) * ((u64)amps)) >> 16;
+	state->voltage = volts;
+	state->current_a = amps;
+	state->last_power = *power;
+
+	DBG("  cpu %d, current: %d.%03d, voltage: %d.%03d, power: %d.%03d W\n",
+	    state->index, FIX32TOPRINT(amps), FIX32TOPRINT(volts),
+	    FIX32TOPRINT(*power));
+
+	return 0;
+}
+
+static void do_cpu_pid(struct cpu_pid_state *state, s32 temp, s32 power)
+{
+	s32 power_target, integral, derivative, proportional, adj_in_target, sval;
+	s64 integ_p, deriv_p, prop_p, sum; 
+	int i;
 
 	/* Calculate power target value (could be done once for all)
 	 * and convert to a 16.16 fp number
 	 */
 	power_target = ((u32)(state->mpu.pmaxh - state->mpu.padjmax)) << 16;
-
-	DBG("  current: %d.%03d, voltage: %d.%03d\n",
-	    FIX32TOPRINT(current_a), FIX32TOPRINT(voltage));
-	DBG("  power: %d.%03d W, target: %d.%03d, error: %d.%03d\n", FIX32TOPRINT(power),
+	DBG("  power target: %d.%03d, error: %d.%03d\n",
 	    FIX32TOPRINT(power_target), FIX32TOPRINT(power_target - power));
 
 	/* Store temperature and power in history array */
@@ -659,6 +780,127 @@
 		state->rpm = state->mpu.rminn_exhaust_fan;
 	if (state->rpm > state->mpu.rmaxn_exhaust_fan)
 		state->rpm = state->mpu.rmaxn_exhaust_fan;
+}
+
+static void do_monitor_cpu_combined(void)
+{
+	struct cpu_pid_state *state0 = &cpu_state[0];
+	struct cpu_pid_state *state1 = &cpu_state[1];
+	s32 temp0, power0, temp1, power1;
+	s32 temp_combi, power_combi;
+	int rc, intake, pump;
+
+	rc = do_read_one_cpu_values(state0, &temp0, &power0);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	state1->overtemp = 0;
+	rc = do_read_one_cpu_values(state1, &temp1, &power1);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	if (state1->overtemp)
+		state0->overtemp++;
+
+	temp_combi = max(temp0, temp1);
+	power_combi = max(power0, power1);
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp_combi >= ((state0->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! Temperature way above maximum (%d) !\n",
+		       temp_combi >> 16);
+		state0->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp_combi > (state0->mpu.tmax << 16))
+		state0->overtemp++;
+	else
+		state0->overtemp = 0;
+	if (state0->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state0->overtemp > 0) {
+		state0->rpm = state0->mpu.rmaxn_exhaust_fan;
+		state0->intake_rpm = intake = state0->mpu.rmaxn_intake_fan;
+		pump = CPU_PUMP_OUTPUT_MAX;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state0, temp_combi, power_combi);
+
+	/* Calculate intake fan speed */
+	intake = (state0->rpm * CPU_INTAKE_SCALE) >> 16;
+	if (intake < state0->mpu.rminn_intake_fan)
+		intake = state0->mpu.rminn_intake_fan;
+	if (intake > state0->mpu.rmaxn_intake_fan)
+		intake = state0->mpu.rmaxn_intake_fan;
+	state0->intake_rpm = intake;
+
+	/* Calculate pump speed */
+	pump = (state0->rpm * CPU_PUMP_OUTPUT_MAX) /
+		state0->mpu.rmaxn_exhaust_fan;
+	if (pump > CPU_PUMP_OUTPUT_MAX)
+		pump = CPU_PUMP_OUTPUT_MAX;
+	if (pump < CPU_PUMP_OUTPUT_MIN)
+		pump = CPU_PUMP_OUTPUT_MIN;
+	
+ do_set_fans:
+	/* We copy values from state 0 to state 1 for /sysfs */
+	state1->rpm = state0->rpm;
+	state1->intake_rpm = state0->intake_rpm;
+
+	DBG("** CPU %d RPM: %d Ex, %d In, Pump: %d, overtemp: %d\n",
+	    state0->index, (int)state0->rpm, intake, pump, state0->overtemp);
+
+	/* We should check for errors, shouldn't we ? But then, what
+	 * do we do once the error occurs ? For FCU notified fan
+	 * failures (-EFAULT) we probably want to notify userland
+	 * some way...
+	 */
+	set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+	set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+
+	if (fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUA_PUMP_RPM_INDEX, pump);
+	if (fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUB_PUMP_RPM_INDEX, pump);
+}
+
+static void do_monitor_cpu_split(struct cpu_pid_state *state)
+{
+	s32 temp, power;
+	int rc, intake;
+
+	/* Read current fan status */
+	rc = do_read_one_cpu_values(state, &temp, &power);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp >= ((state->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
+		       " (%d) !\n",
+		       state->index, temp >> 16);
+		state->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp > (state->mpu.tmax << 16))
+		state->overtemp++;
+	else
+		state->overtemp = 0;
+	if (state->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state->overtemp > 0) {
+		state->rpm = state->mpu.rmaxn_exhaust_fan;
+		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state, temp, power);
 
 	intake = (state->rpm * CPU_INTAKE_SCALE) >> 16;
 	if (intake < state->mpu.rminn_intake_fan)
@@ -677,11 +919,11 @@
 	 * some way...
 	 */
 	if (state->index == 0) {
-		set_rpm_fan(CPUA_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	} else {
-		set_rpm_fan(CPUB_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	}
 }
 
@@ -696,6 +938,7 @@
 	state->overtemp = 0;
 	state->adc_config = 0x00;
 
+
 	if (index == 0)
 		state->monitor = attach_i2c_chip(SUPPLY_MONITOR_ID, "CPU0_monitor");
 	else if (index == 1)
@@ -778,7 +1021,7 @@
 	DBG("backside:\n");
 
 	/* Check fan status */
-	rc = get_pwm_fan(BACKSIDE_FAN_PWM_ID);
+	rc = get_pwm_fan(BACKSIDE_FAN_PWM_INDEX);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading backside fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -790,12 +1033,12 @@
 	temp = i2c_smbus_read_byte_data(state->monitor, MAX6690_EXT_TEMP) << 16;
 	state->last_temp = temp;
 	DBG("  temp: %d.%03d, target: %d.%03d\n", FIX32TOPRINT(temp),
-	    FIX32TOPRINT(BACKSIDE_PID_INPUT_TARGET));
+	    FIX32TOPRINT(backside_params.input_target));
 
 	/* Store temperature and error in history array */
 	state->cur_sample = (state->cur_sample + 1) % BACKSIDE_PID_HISTORY_SIZE;
 	state->sample_history[state->cur_sample] = temp;
-	state->error_history[state->cur_sample] = temp - BACKSIDE_PID_INPUT_TARGET;
+	state->error_history[state->cur_sample] = temp - backside_params.input_target;
 	
 	/* If first loop, fill the history table */
 	if (state->first) {
@@ -804,7 +1047,7 @@
 				BACKSIDE_PID_HISTORY_SIZE;
 			state->sample_history[state->cur_sample] = temp;
 			state->error_history[state->cur_sample] =
-				temp - BACKSIDE_PID_INPUT_TARGET;
+				temp - backside_params.input_target;
 		}
 		state->first = 0;
 	}
@@ -816,7 +1059,7 @@
 		integral += state->error_history[i];
 	integral *= BACKSIDE_PID_INTERVAL;
 	DBG("  integral: %08x\n", integral);
-	integ_p = ((s64)BACKSIDE_PID_G_r) * (s64)integral;
+	integ_p = ((s64)backside_params.G_r) * (s64)integral;
 	DBG("   integ_p: %d\n", (int)(integ_p >> 36));
 	sum += integ_p;
 
@@ -825,12 +1068,12 @@
 		state->error_history[(state->cur_sample + BACKSIDE_PID_HISTORY_SIZE - 1)
 				    % BACKSIDE_PID_HISTORY_SIZE];
 	derivative /= BACKSIDE_PID_INTERVAL;
-	deriv_p = ((s64)BACKSIDE_PID_G_d) * (s64)derivative;
+	deriv_p = ((s64)backside_params.G_d) * (s64)derivative;
 	DBG("   deriv_p: %d\n", (int)(deriv_p >> 36));
 	sum += deriv_p;
 
 	/* Calculate the proportional term */
-	prop_p = ((s64)BACKSIDE_PID_G_p) * (s64)(state->error_history[state->cur_sample]);
+	prop_p = ((s64)backside_params.G_p) * (s64)(state->error_history[state->cur_sample]);
 	DBG("   prop_p: %d\n", (int)(prop_p >> 36));
 	sum += prop_p;
 
@@ -839,13 +1082,13 @@
 
 	DBG("   sum: %d\n", (int)sum);
 	state->pwm += (s32)sum;
-	if (state->pwm < BACKSIDE_PID_OUTPUT_MIN)
-		state->pwm = BACKSIDE_PID_OUTPUT_MIN;
-	if (state->pwm > BACKSIDE_PID_OUTPUT_MAX)
-		state->pwm = BACKSIDE_PID_OUTPUT_MAX;
+	if (state->pwm < backside_params.output_min)
+		state->pwm = backside_params.output_min;
+	if (state->pwm > backside_params.output_max)
+		state->pwm = backside_params.output_max;
 
 	DBG("** BACKSIDE PWM: %d\n", (int)state->pwm);
-	set_pwm_fan(BACKSIDE_FAN_PWM_ID, state->pwm);
+	set_pwm_fan(BACKSIDE_FAN_PWM_INDEX, state->pwm);
 }
 
 /*
@@ -853,6 +1096,35 @@
  */
 static int init_backside_state(struct backside_pid_state *state)
 {
+	struct device_node *u3;
+	int u3h = 1; /* conservative by default */
+
+	/*
+	 * There are different PID params for machines with U3 and machines
+	 * with U3H, pick the right ones now
+	 */
+	u3 = of_find_node_by_path("/u3");
+	if (u3 != NULL) {
+		u32 *vers = (u32 *)get_property(u3, "device-rev", NULL);
+		if (vers)
+			if (((*vers) & 0x3f) < 0x34)
+				u3h = 0;
+		of_node_put(u3);
+	}
+
+	backside_params.G_p = BACKSIDE_PID_G_p;
+	backside_params.G_r = BACKSIDE_PID_G_r;
+	backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX;
+	if (u3h) {
+		backside_params.G_d = BACKSIDE_PID_U3H_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3H_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3H_OUTPUT_MIN;
+	} else {
+		backside_params.G_d = BACKSIDE_PID_U3_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3_OUTPUT_MIN;
+	}
+
 	state->ticks = 1;
 	state->first = 1;
 	state->pwm = 50;
@@ -898,7 +1170,7 @@
 	DBG("drives:\n");
 
 	/* Check fan status */
-	rc = get_rpm_fan(DRIVES_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+	rc = get_rpm_fan(DRIVES_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading drives fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -965,7 +1237,7 @@
 		state->rpm = DRIVES_PID_OUTPUT_MAX;
 
 	DBG("** DRIVES RPM: %d\n", (int)state->rpm);
-	set_rpm_fan(DRIVES_FAN_RPM_ID, state->rpm);
+	set_rpm_fan(DRIVES_FAN_RPM_INDEX, state->rpm);
 }
 
 /*
@@ -1032,7 +1304,7 @@
 	}
 
 	/* Set the PCI fan once for now */
-	set_pwm_fan(SLOTS_FAN_PWM_ID, SLOTS_FAN_DEFAULT_PWM);
+	set_pwm_fan(SLOTS_FAN_PWM_INDEX, SLOTS_FAN_DEFAULT_PWM);
 
 	/* Initialize ADCs */
 	initialize_adc(&cpu_state[0]);
@@ -1047,9 +1319,13 @@
 		start = jiffies;
 
 		down(&driver_lock);
-		do_monitor_cpu(&cpu_state[0]);
-		if (cpu_state[1].monitor != NULL)
-			do_monitor_cpu(&cpu_state[1]);
+		if (cpu_pid_type == CPU_PID_TYPE_COMBINED)
+			do_monitor_cpu_combined();
+		else {
+			do_monitor_cpu_split(&cpu_state[0]);
+			if (cpu_state[1].monitor != NULL)
+				do_monitor_cpu_split(&cpu_state[1]);
+		}
 		do_monitor_backside(&backside_state);
 		do_monitor_drives(&drives_state);
 		up(&driver_lock);
@@ -1113,6 +1389,19 @@
 
 	DBG("counted %d CPUs in the device-tree\n", cpu_count);
 
+	/* Decide the type of PID algorithm to use based on the presence of
+	 * the pumps, though that may not be the best way, that is good enough
+	 * for now
+	 */
+	if (machine_is_compatible("PowerMac7,3")
+	    && (cpu_count > 1)
+	    && fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID
+	    && fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID) {
+		printk(KERN_INFO "Liquid cooling pumps detected, using new algorithm !\n");
+		cpu_pid_type = CPU_PID_TYPE_COMBINED;
+	} else
+		cpu_pid_type = CPU_PID_TYPE_SPLIT;
+
 	/* Create control loops for everything. If any fail, everything
 	 * fails
 	 */
@@ -1257,12 +1546,91 @@
 	return 0;
 }
 
+static void fcu_lookup_fans(struct device_node *fcu_node)
+{
+	struct device_node *np = NULL;
+	int i;
+
+	/* The table is filled by default with values that are suitable
+	 * for the old machines without device-tree information. We scan
+	 * the device-tree and override those values with whatever is
+	 * there
+	 */
+
+	DBG("Looking up FCU controls in device-tree...\n");
+
+	while ((np = of_get_next_child(fcu_node, np)) != NULL) {
+		int type = -1;
+		char *loc;
+		u32 *reg;
+
+		DBG(" control: %s, type: %s\n", np->name, np->type);
+
+		/* Detect control type */
+		if (!strcmp(np->type, "fan-rpm-control") ||
+		    !strcmp(np->type, "fan-rpm"))
+			type = FCU_FAN_RPM;
+		if (!strcmp(np->type, "fan-pwm-control") ||
+		    !strcmp(np->type, "fan-pwm"))
+			type = FCU_FAN_PWM;
+		/* Only care about fans for now */
+		if (type == -1)
+			continue;
+
+		/* Look up a matching location */
+		loc = (char *)get_property(np, "location", NULL);
+		reg = (u32 *)get_property(np, "reg", NULL);
+		if (loc == NULL || reg == NULL)
+			continue;
+		DBG(" matching location: %s, reg: 0x%08x\n", loc, *reg);
+
+		for (i = 0; i < FCU_FAN_COUNT; i++) {
+			int fan_id;
+
+			if (strcmp(loc, fcu_fans[i].loc))
+				continue;
+			DBG(" location match, index: %d\n", i);
+			fcu_fans[i].id = FCU_FAN_ABSENT_ID;
+			if (type != fcu_fans[i].type) {
+				printk(KERN_WARNING "therm_pm72: Fan type mismatch "
+				       "in device-tree for %s\n", np->full_name);
+				break;
+			}
+			if (type == FCU_FAN_RPM)
+				fan_id = ((*reg) - 0x10) / 2;
+			else
+				fan_id = ((*reg) - 0x30) / 2;
+			if (fan_id > 7) {
+				printk(KERN_WARNING "therm_pm72: Can't parse "
+				       "fan ID in device-tree for %s\n", np->full_name);
+				break;
+			}
+			DBG(" fan id -> %d, type -> %d\n", fan_id, type);
+			fcu_fans[i].id = fan_id;
+		}
+	}
+
+	/* Now dump the array */
+	printk(KERN_INFO "Detected fan controls:\n");
+	for (i = 0; i < FCU_FAN_COUNT; i++) {
+		if (fcu_fans[i].id == FCU_FAN_ABSENT_ID)
+			continue;
+		printk(KERN_INFO "  %d: %s fan, id %d, location: %s\n", i,
+		       fcu_fans[i].type == FCU_FAN_RPM ? "RPM" : "PWM",
+		       fcu_fans[i].id, fcu_fans[i].loc);
+	}
+}
+
 static int fcu_of_probe(struct of_device* dev, const struct of_match *match)
 {
 	int rc;
 
 	state = state_detached;
 
+	/* Lookup the fans in the device tree */
+	fcu_lookup_fans(dev->node);
+
+	/* Add the driver */
 	rc = i2c_add_driver(&therm_pm72_driver);
 	if (rc < 0)
 		return rc;
@@ -1301,7 +1669,8 @@
 {
 	struct device_node *np;
 
-	if (!machine_is_compatible("PowerMac7,2"))
+	if (!machine_is_compatible("PowerMac7,2") &&
+	    !machine_is_compatible("PowerMac7,3"))
 	    	return -ENODEV;
 
 	printk(KERN_INFO "PowerMac G5 Thermal control driver %s\n", VERSION);
diff -urN linux-2.5/drivers/macintosh/therm_pm72.h linux-pogo/drivers/macintosh/therm_pm72.h
--- linux-2.5/drivers/macintosh/therm_pm72.h	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.h	2004-10-15 18:58:22.000000000 +1000
@@ -119,18 +119,33 @@
 #define ADC_CPU_CURRENT_SCALE	0x1f40	/* _AD4 */
 
 /*
- * PID factors for the U3/Backside fan control loop
+ * PID factors for the U3/Backside fan control loop. We have 2 sets
+ * of values here, one set for U3 and one set for U3H
  */
-#define BACKSIDE_FAN_PWM_ID		1
-#define BACKSIDE_PID_G_d		0x02800000
+#define BACKSIDE_FAN_PWM_DEFAULT_ID	1
+#define BACKSIDE_FAN_PWM_INDEX		0
+#define BACKSIDE_PID_U3_G_d		0x02800000
+#define BACKSIDE_PID_U3H_G_d		0x01400000
 #define BACKSIDE_PID_G_p		0x00500000
 #define BACKSIDE_PID_G_r		0x00000000
-#define BACKSIDE_PID_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3H_INPUT_TARGET	0x004b0000
 #define BACKSIDE_PID_INTERVAL		5
 #define BACKSIDE_PID_OUTPUT_MAX		100
-#define BACKSIDE_PID_OUTPUT_MIN		20
+#define BACKSIDE_PID_U3_OUTPUT_MIN	20
+#define BACKSIDE_PID_U3H_OUTPUT_MIN	30
 #define BACKSIDE_PID_HISTORY_SIZE	2
 
+struct basckside_pid_params
+{
+	u32			G_d;
+	u32			G_p;
+	u32			G_r;
+	u32			input_target;
+	u32			output_min;
+	u32			output_max;
+};
+
 struct backside_pid_state
 {
 	int			ticks;
@@ -146,7 +161,8 @@
 /*
  * PID factors for the Drive Bay fan control loop
  */
-#define DRIVES_FAN_RPM_ID      		2
+#define DRIVES_FAN_RPM_DEFAULT_ID	2
+#define DRIVES_FAN_RPM_INDEX		1
 #define DRIVES_PID_G_d			0x01e00000
 #define DRIVES_PID_G_p			0x00500000
 #define DRIVES_PID_G_r			0x00000000
@@ -168,7 +184,8 @@
 	int			first;
 };
 
-#define SLOTS_FAN_PWM_ID       		2
+#define SLOTS_FAN_PWM_DEFAULT_ID	2
+#define SLOTS_FAN_PWM_INDEX		2
 #define	SLOTS_FAN_DEFAULT_PWM		50 /* Do better here ! */
 
 /*
@@ -191,10 +208,15 @@
  * CPU B FAKE POWER	49	(I_V_inputs: 18, 19)
  */
 
-#define CPUA_INTAKE_FAN_RPM_ID		3
-#define CPUA_EXHAUST_FAN_RPM_ID		4
-#define CPUB_INTAKE_FAN_RPM_ID		5
-#define CPUB_EXHAUST_FAN_RPM_ID		6
+#define CPUA_INTAKE_FAN_RPM_DEFAULT_ID	3
+#define CPUA_EXHAUST_FAN_RPM_DEFAULT_ID	4
+#define CPUB_INTAKE_FAN_RPM_DEFAULT_ID	5
+#define CPUB_EXHAUST_FAN_RPM_DEFAULT_ID	6
+
+#define CPUA_INTAKE_FAN_RPM_INDEX	3
+#define CPUA_EXHAUST_FAN_RPM_INDEX	4
+#define CPUB_INTAKE_FAN_RPM_INDEX	5
+#define CPUB_EXHAUST_FAN_RPM_INDEX	6
 
 #define CPU_INTAKE_SCALE		0x0000f852
 #define CPU_TEMP_HISTORY_SIZE		2
@@ -202,6 +224,11 @@
 #define CPU_PID_INTERVAL		1
 #define CPU_MAX_OVERTEMP		30
 
+#define CPUA_PUMP_RPM_INDEX		7
+#define CPUB_PUMP_RPM_INDEX		8
+#define CPU_PUMP_OUTPUT_MAX		3700
+#define CPU_PUMP_OUTPUT_MIN		1000
+
 struct cpu_pid_state
 {
 	int			index;
@@ -219,6 +246,7 @@
 	s32			voltage;
 	s32			current_a;
 	s32			last_temp;
+	s32			last_power;
 	int			first;
 	u8			adc_config;
 };


From benh at kernel.crashing.org  Fri Oct 15 19:19:42 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 15 Oct 2004 19:19:42 +1000
Subject: Wrong patch! (Re: Fan control for PowerMac7_3)
In-Reply-To: <1097831790.1131.111.camel@gaston>
References: <1097831790.1131.111.camel@gaston>
Message-ID: <1097831981.1131.113.camel@gaston>

On Fri, 2004-10-15 at 19:16, Benjamin Herrenschmidt wrote:
> Hi !
> 
> This is an experimental (read: totally untested) patch to the G5 fan
> control code. All I know is that it builds :)

And I sent a wrong version ... sorry, the good one in a few minutes.

Ben.


From benh at kernel.crashing.org  Fri Oct 15 19:20:50 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 15 Oct 2004 19:20:50 +1000
Subject: Fan control for PowerMac7_3
In-Reply-To: <1097831981.1131.113.camel@gaston>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston>
Message-ID: <1097832049.1149.115.camel@gaston>

On Fri, 2004-10-15 at 19:19, Benjamin Herrenschmidt wrote:
> On Fri, 2004-10-15 at 19:16, Benjamin Herrenschmidt wrote:
> > Hi !
> > 
> > This is an experimental (read: totally untested) patch to the G5 fan
> > control code. All I know is that it builds :)
> 
> And I sent a wrong version ... sorry, the good one in a few minutes.

Here it is:

diff -urN linux-2.5/drivers/macintosh/therm_pm72.c linux-pogo/drivers/macintosh/therm_pm72.c
--- linux-2.5/drivers/macintosh/therm_pm72.c	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.c	2004-10-15 19:20:06.000000000 +1000
@@ -46,6 +46,8 @@
  *          overtemp conditions so userland can take some policy
  *          decisions, like slewing down CPUs
  *	  - Deal with fan and i2c failures in a better way
+ *	  - Maybe do a generic PID based on params used for
+ *	    U3 and Drives ?
  *
  * History:
  *
@@ -73,6 +75,13 @@
  *        values in the configuration register
  *	- Switch back to use of target fan speed for PID, thus lowering
  *        pressure on i2c
+ *
+ *  Oct. 15, 2004 : 1.1b1 (beta)
+ *	- Add device-tree lookup for fan IDs, should detect liquid cooling
+ *        pumps when present
+ *	- Enable driver for PowerMac7,3 machines
+ *	- Split the U3/Backside cooling on U3 & U3H versions as Darwin does
+ *	- Add new CPU cooling algorithm for machines with liquid cooling
  */
 
 #include <linux/config.h>
@@ -101,7 +110,7 @@
 
 #include "therm_pm72.h"
 
-#define VERSION "0.9"
+#define VERSION "1.1b1"
 
 #undef DEBUG
 
@@ -121,16 +130,100 @@
 static struct i2c_adapter *		u3_1;
 static struct i2c_client *		fcu;
 static struct cpu_pid_state		cpu_state[2];
+static struct basckside_pid_params	backside_params;
 static struct backside_pid_state	backside_state;
 static struct drives_pid_state		drives_state;
 static int				state;
 static int				cpu_count;
+static int				cpu_pid_type;
 static pid_t				ctrl_task;
 static struct completion		ctrl_complete;
 static int				critical_state;
 static DECLARE_MUTEX(driver_lock);
 
 /*
+ * We have 2 types of CPU PID control. One is "split" old style control
+ * for intake & exhaust fans, the other is "combined" control for both
+ * CPUs that also deals with the pumps when present. To be "compatible"
+ * with OS X at this point, we only use "COMBINED" on the machines that
+ * are identified as having the pumps (though that identification is at
+ * least dodgy). Ultimately, we could probably switch completely to this
+ * algorithm provided we hack it to deal with the UP case
+ */
+#define CPU_PID_TYPE_SPLIT	0
+#define CPU_PID_TYPE_COMBINED	1
+
+/*
+ * This table describes all fans in the FCU. The "id" and "type" values
+ * are defaults valid for all earlier machines. Newer machines will
+ * eventually override the table content based on the device-tree
+ */
+struct fcu_fan_table
+{
+	char*	loc;	/* location code */
+	int	type;	/* FCU_FAN_RPM or FCU_FAN_PWM (pumps are RPM) */
+	int	id;	/* id or -1 */
+};
+
+#define FCU_FAN_RPM		0
+#define FCU_FAN_PWM		1
+
+#define FCU_FAN_ABSENT_ID	-1
+
+#define FCU_FAN_COUNT		ARRAY_SIZE(fcu_fans)
+
+struct fcu_fan_table	fcu_fans[] = {
+	[BACKSIDE_FAN_PWM_INDEX] = {
+		.loc	= "BACKSIDE",
+		.type	= FCU_FAN_PWM,
+		.id	= BACKSIDE_FAN_PWM_DEFAULT_ID,
+	},
+	[DRIVES_FAN_RPM_INDEX] = {
+		.loc	= "DRIVE BAY",
+		.type	= FCU_FAN_RPM,
+		.id	= DRIVES_FAN_RPM_DEFAULT_ID,
+	},
+	[SLOTS_FAN_PWM_INDEX] = {
+		.loc	= "SLOT",
+		.type	= FCU_FAN_PWM,
+		.id	= SLOTS_FAN_PWM_DEFAULT_ID,
+	},
+	[CPUA_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU A INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUA_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU A EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU B INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU B EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	/* pumps aren't present by default, have to be looked up in the
+	 * device-tree
+	 */
+	[CPUA_PUMP_RPM_INDEX] = {
+		.loc	= "CPU A PUMP",
+		.type	= FCU_FAN_RPM,		
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+	[CPUB_PUMP_RPM_INDEX] = {
+		.loc	= "CPU B PUMP",
+		.type	= FCU_FAN_RPM,
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+};
+
+/*
  * i2c_driver structure to attach to the host i2c controller
  */
 
@@ -331,10 +424,16 @@
 	return 0;
 }
 
-static int set_rpm_fan(int fan, int rpm)
+static int set_rpm_fan(int fan_index, int rpm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (rpm < 300)
 		rpm = 300;
@@ -342,43 +441,55 @@
 		rpm = 8191;
 	buf[0] = rpm >> 5;
 	buf[1] = rpm << 3;
-	rc = fan_write_reg(0x10 + (fan * 2), buf, 2);
+	rc = fan_write_reg(0x10 + (id * 2), buf, 2);
 	if (rc < 0)
 		return -EIO;
 	return 0;
 }
 
-static int get_rpm_fan(int fan, int programmed)
+static int get_rpm_fan(int fan_index, int programmed)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc, reg_base;
+	int rc, id, reg_base;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0xb, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0xd, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
 	reg_base = programmed ? 0x10 : 0x11;
-	rc = fan_read_reg(reg_base + (fan * 2), buf, 2);
+	rc = fan_read_reg(reg_base + (id * 2), buf, 2);
 	if (rc != 2)
 		return -EIO;
 
 	return (buf[0] << 5) | buf[1] >> 3;
 }
 
-static int set_pwm_fan(int fan, int pwm)
+static int set_pwm_fan(int fan_index, int pwm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (pwm < 10)
 		pwm = 10;
@@ -386,32 +497,38 @@
 		pwm = 100;
 	pwm = (pwm * 2559) / 1000;
 	buf[0] = pwm;
-	rc = fan_write_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_write_reg(0x30 + (id * 2), buf, 1);
 	if (rc < 0)
 		return rc;
 	return 0;
 }
 
-static int get_pwm_fan(int fan)
+static int get_pwm_fan(int fan_index)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0x2b, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0x2d, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
-	rc = fan_read_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_read_reg(0x30 + (id * 2), buf, 1);
 	if (rc != 1)
 		return -EIO;
 
@@ -513,80 +630,84 @@
 /*
  * CPUs fans control loop
  */
-static void do_monitor_cpu(struct cpu_pid_state *state)
+
+static int do_read_one_cpu_values(struct cpu_pid_state *state, s32 *temp, s32 *power)
 {
-	s32 temp, voltage, current_a, power, power_target;
-	s32 integral, derivative, proportional, adj_in_target, sval;
-	s64 integ_p, deriv_p, prop_p, sum; 
-	int i, intake, rc;
+	s32 ltemp, volts, amps;
+	int rc = 0;
 
-	DBG("cpu %d:\n", state->index);
+	/* Default to the previous readings (in case of error) */
+	*temp = state->last_temp;
+	*power = state->last_power;
 
 	/* Read current fan status */
 	if (state->index == 0)
-		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	else
-		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
-		printk(KERN_WARNING "Error %d reading CPU %d exhaust fan !\n",
-		       rc, state->index);
-		/* XXX What do we do now ? */
-	} else
+		/* XXX What do we do now ? Nothing for now, keep old value, but
+		 * return error upstream
+		 */
+		DBG("  cpu %d, fan reading error !\n", state->index);
+	} else {
 		state->rpm = rc;
-	DBG("  current rpm: %d\n", state->rpm);
+		DBG("  cpu %d, exhaust RPM: %d\n", state->index, state->rpm);
+	}
 
 	/* Get some sensor readings and scale it */
-	temp = read_smon_adc(state, 1);
-	if (temp == -1) {
+	ltemp = read_smon_adc(state, 1);
+	if (ltemp == -1) {
+		/* XXX What do we do now ? */
 		state->overtemp++;
-		return;
+		if (rc == 0)
+			rc = -EIO;
+		DBG("  cpu %d, temp reading error !\n", state->index);
+	} else {
+		/* Fixup temperature according to diode calibration
+		 */
+		DBG("  cpu %d, temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
+		    state->index,
+		    ltemp, state->mpu.mdiode, state->mpu.bdiode);
+		*temp = ((s32)ltemp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
+		state->last_temp = *temp;
+		DBG("  temp: %d.%03d\n", FIX32TOPRINT((*temp)));
 	}
-	voltage = read_smon_adc(state, 3);
-	current_a = read_smon_adc(state, 4);
 
-	/* Fixup temperature according to diode calibration
+	/*
+	 * Read voltage & current and calculate power
 	 */
-	DBG("  temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
-	    temp, state->mpu.mdiode, state->mpu.bdiode);
-	temp = ((s32)temp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
-	state->last_temp = temp;
-	DBG("  temp: %d.%03d\n", FIX32TOPRINT(temp));
+	volts = read_smon_adc(state, 3);
+	amps = read_smon_adc(state, 4);
 
-	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
-	 * full blown immediately and try to trigger a shutdown
-	 */
-	if (temp >= ((state->mpu.tmax + 8) << 16)) {
-		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
-		       " (%d) !\n",
-		       state->index, temp >> 16);
-		state->overtemp = CPU_MAX_OVERTEMP;
-	} else if (temp > (state->mpu.tmax << 16))
-		state->overtemp++;
-	else
-		state->overtemp = 0;
-	if (state->overtemp >= CPU_MAX_OVERTEMP)
-		critical_state = 1;
-	if (state->overtemp > 0) {
-		state->rpm = state->mpu.rmaxn_exhaust_fan;
-		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
-		goto do_set_fans;
-	}
-	
-	/* Scale other sensor values according to fixed scales
+	/* Scale voltage and current raw sensor values according to fixed scales
 	 * obtained in Darwin and calculate power from I and V
 	 */
-	state->voltage = voltage *= ADC_CPU_VOLTAGE_SCALE;
-	state->current_a = current_a *= ADC_CPU_CURRENT_SCALE;
-	power = (((u64)current_a) * ((u64)voltage)) >> 16;
+	volts *= ADC_CPU_VOLTAGE_SCALE;
+	amps *= ADC_CPU_CURRENT_SCALE;
+	*power = (((u64)volts) * ((u64)amps)) >> 16;
+	state->voltage = volts;
+	state->current_a = amps;
+	state->last_power = *power;
+
+	DBG("  cpu %d, current: %d.%03d, voltage: %d.%03d, power: %d.%03d W\n",
+	    state->index, FIX32TOPRINT(amps), FIX32TOPRINT(volts),
+	    FIX32TOPRINT(*power));
+
+	return 0;
+}
+
+static void do_cpu_pid(struct cpu_pid_state *state, s32 temp, s32 power)
+{
+	s32 power_target, integral, derivative, proportional, adj_in_target, sval;
+	s64 integ_p, deriv_p, prop_p, sum; 
+	int i;
 
 	/* Calculate power target value (could be done once for all)
 	 * and convert to a 16.16 fp number
 	 */
 	power_target = ((u32)(state->mpu.pmaxh - state->mpu.padjmax)) << 16;
-
-	DBG("  current: %d.%03d, voltage: %d.%03d\n",
-	    FIX32TOPRINT(current_a), FIX32TOPRINT(voltage));
-	DBG("  power: %d.%03d W, target: %d.%03d, error: %d.%03d\n", FIX32TOPRINT(power),
+	DBG("  power target: %d.%03d, error: %d.%03d\n",
 	    FIX32TOPRINT(power_target), FIX32TOPRINT(power_target - power));
 
 	/* Store temperature and power in history array */
@@ -659,6 +780,127 @@
 		state->rpm = state->mpu.rminn_exhaust_fan;
 	if (state->rpm > state->mpu.rmaxn_exhaust_fan)
 		state->rpm = state->mpu.rmaxn_exhaust_fan;
+}
+
+static void do_monitor_cpu_combined(void)
+{
+	struct cpu_pid_state *state0 = &cpu_state[0];
+	struct cpu_pid_state *state1 = &cpu_state[1];
+	s32 temp0, power0, temp1, power1;
+	s32 temp_combi, power_combi;
+	int rc, intake, pump;
+
+	rc = do_read_one_cpu_values(state0, &temp0, &power0);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	state1->overtemp = 0;
+	rc = do_read_one_cpu_values(state1, &temp1, &power1);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	if (state1->overtemp)
+		state0->overtemp++;
+
+	temp_combi = max(temp0, temp1);
+	power_combi = max(power0, power1);
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp_combi >= ((state0->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! Temperature way above maximum (%d) !\n",
+		       temp_combi >> 16);
+		state0->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp_combi > (state0->mpu.tmax << 16))
+		state0->overtemp++;
+	else
+		state0->overtemp = 0;
+	if (state0->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state0->overtemp > 0) {
+		state0->rpm = state0->mpu.rmaxn_exhaust_fan;
+		state0->intake_rpm = intake = state0->mpu.rmaxn_intake_fan;
+		pump = CPU_PUMP_OUTPUT_MAX;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state0, temp_combi, power_combi);
+
+	/* Calculate intake fan speed */
+	intake = (state0->rpm * CPU_INTAKE_SCALE) >> 16;
+	if (intake < state0->mpu.rminn_intake_fan)
+		intake = state0->mpu.rminn_intake_fan;
+	if (intake > state0->mpu.rmaxn_intake_fan)
+		intake = state0->mpu.rmaxn_intake_fan;
+	state0->intake_rpm = intake;
+
+	/* Calculate pump speed */
+	pump = (state0->rpm * CPU_PUMP_OUTPUT_MAX) /
+		state0->mpu.rmaxn_exhaust_fan;
+	if (pump > CPU_PUMP_OUTPUT_MAX)
+		pump = CPU_PUMP_OUTPUT_MAX;
+	if (pump < CPU_PUMP_OUTPUT_MIN)
+		pump = CPU_PUMP_OUTPUT_MIN;
+	
+ do_set_fans:
+	/* We copy values from state 0 to state 1 for /sysfs */
+	state1->rpm = state0->rpm;
+	state1->intake_rpm = state0->intake_rpm;
+
+	DBG("** CPU %d RPM: %d Ex, %d In, Pump: %d, overtemp: %d\n",
+	    state0->index, (int)state0->rpm, intake, pump, state0->overtemp);
+
+	/* We should check for errors, shouldn't we ? But then, what
+	 * do we do once the error occurs ? For FCU notified fan
+	 * failures (-EFAULT) we probably want to notify userland
+	 * some way...
+	 */
+	set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+	set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+
+	if (fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUA_PUMP_RPM_INDEX, pump);
+	if (fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUB_PUMP_RPM_INDEX, pump);
+}
+
+static void do_monitor_cpu_split(struct cpu_pid_state *state)
+{
+	s32 temp, power;
+	int rc, intake;
+
+	/* Read current fan status */
+	rc = do_read_one_cpu_values(state, &temp, &power);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp >= ((state->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
+		       " (%d) !\n",
+		       state->index, temp >> 16);
+		state->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp > (state->mpu.tmax << 16))
+		state->overtemp++;
+	else
+		state->overtemp = 0;
+	if (state->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state->overtemp > 0) {
+		state->rpm = state->mpu.rmaxn_exhaust_fan;
+		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state, temp, power);
 
 	intake = (state->rpm * CPU_INTAKE_SCALE) >> 16;
 	if (intake < state->mpu.rminn_intake_fan)
@@ -677,11 +919,11 @@
 	 * some way...
 	 */
 	if (state->index == 0) {
-		set_rpm_fan(CPUA_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	} else {
-		set_rpm_fan(CPUB_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	}
 }
 
@@ -696,6 +938,7 @@
 	state->overtemp = 0;
 	state->adc_config = 0x00;
 
+
 	if (index == 0)
 		state->monitor = attach_i2c_chip(SUPPLY_MONITOR_ID, "CPU0_monitor");
 	else if (index == 1)
@@ -778,7 +1021,7 @@
 	DBG("backside:\n");
 
 	/* Check fan status */
-	rc = get_pwm_fan(BACKSIDE_FAN_PWM_ID);
+	rc = get_pwm_fan(BACKSIDE_FAN_PWM_INDEX);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading backside fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -790,12 +1033,12 @@
 	temp = i2c_smbus_read_byte_data(state->monitor, MAX6690_EXT_TEMP) << 16;
 	state->last_temp = temp;
 	DBG("  temp: %d.%03d, target: %d.%03d\n", FIX32TOPRINT(temp),
-	    FIX32TOPRINT(BACKSIDE_PID_INPUT_TARGET));
+	    FIX32TOPRINT(backside_params.input_target));
 
 	/* Store temperature and error in history array */
 	state->cur_sample = (state->cur_sample + 1) % BACKSIDE_PID_HISTORY_SIZE;
 	state->sample_history[state->cur_sample] = temp;
-	state->error_history[state->cur_sample] = temp - BACKSIDE_PID_INPUT_TARGET;
+	state->error_history[state->cur_sample] = temp - backside_params.input_target;
 	
 	/* If first loop, fill the history table */
 	if (state->first) {
@@ -804,7 +1047,7 @@
 				BACKSIDE_PID_HISTORY_SIZE;
 			state->sample_history[state->cur_sample] = temp;
 			state->error_history[state->cur_sample] =
-				temp - BACKSIDE_PID_INPUT_TARGET;
+				temp - backside_params.input_target;
 		}
 		state->first = 0;
 	}
@@ -816,7 +1059,7 @@
 		integral += state->error_history[i];
 	integral *= BACKSIDE_PID_INTERVAL;
 	DBG("  integral: %08x\n", integral);
-	integ_p = ((s64)BACKSIDE_PID_G_r) * (s64)integral;
+	integ_p = ((s64)backside_params.G_r) * (s64)integral;
 	DBG("   integ_p: %d\n", (int)(integ_p >> 36));
 	sum += integ_p;
 
@@ -825,12 +1068,12 @@
 		state->error_history[(state->cur_sample + BACKSIDE_PID_HISTORY_SIZE - 1)
 				    % BACKSIDE_PID_HISTORY_SIZE];
 	derivative /= BACKSIDE_PID_INTERVAL;
-	deriv_p = ((s64)BACKSIDE_PID_G_d) * (s64)derivative;
+	deriv_p = ((s64)backside_params.G_d) * (s64)derivative;
 	DBG("   deriv_p: %d\n", (int)(deriv_p >> 36));
 	sum += deriv_p;
 
 	/* Calculate the proportional term */
-	prop_p = ((s64)BACKSIDE_PID_G_p) * (s64)(state->error_history[state->cur_sample]);
+	prop_p = ((s64)backside_params.G_p) * (s64)(state->error_history[state->cur_sample]);
 	DBG("   prop_p: %d\n", (int)(prop_p >> 36));
 	sum += prop_p;
 
@@ -839,13 +1082,13 @@
 
 	DBG("   sum: %d\n", (int)sum);
 	state->pwm += (s32)sum;
-	if (state->pwm < BACKSIDE_PID_OUTPUT_MIN)
-		state->pwm = BACKSIDE_PID_OUTPUT_MIN;
-	if (state->pwm > BACKSIDE_PID_OUTPUT_MAX)
-		state->pwm = BACKSIDE_PID_OUTPUT_MAX;
+	if (state->pwm < backside_params.output_min)
+		state->pwm = backside_params.output_min;
+	if (state->pwm > backside_params.output_max)
+		state->pwm = backside_params.output_max;
 
 	DBG("** BACKSIDE PWM: %d\n", (int)state->pwm);
-	set_pwm_fan(BACKSIDE_FAN_PWM_ID, state->pwm);
+	set_pwm_fan(BACKSIDE_FAN_PWM_INDEX, state->pwm);
 }
 
 /*
@@ -853,6 +1096,35 @@
  */
 static int init_backside_state(struct backside_pid_state *state)
 {
+	struct device_node *u3;
+	int u3h = 1; /* conservative by default */
+
+	/*
+	 * There are different PID params for machines with U3 and machines
+	 * with U3H, pick the right ones now
+	 */
+	u3 = of_find_node_by_path("/u3");
+	if (u3 != NULL) {
+		u32 *vers = (u32 *)get_property(u3, "device-rev", NULL);
+		if (vers)
+			if (((*vers) & 0x3f) < 0x34)
+				u3h = 0;
+		of_node_put(u3);
+	}
+
+	backside_params.G_p = BACKSIDE_PID_G_p;
+	backside_params.G_r = BACKSIDE_PID_G_r;
+	backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX;
+	if (u3h) {
+		backside_params.G_d = BACKSIDE_PID_U3H_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3H_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3H_OUTPUT_MIN;
+	} else {
+		backside_params.G_d = BACKSIDE_PID_U3_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3_OUTPUT_MIN;
+	}
+
 	state->ticks = 1;
 	state->first = 1;
 	state->pwm = 50;
@@ -898,7 +1170,7 @@
 	DBG("drives:\n");
 
 	/* Check fan status */
-	rc = get_rpm_fan(DRIVES_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+	rc = get_rpm_fan(DRIVES_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading drives fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -965,7 +1237,7 @@
 		state->rpm = DRIVES_PID_OUTPUT_MAX;
 
 	DBG("** DRIVES RPM: %d\n", (int)state->rpm);
-	set_rpm_fan(DRIVES_FAN_RPM_ID, state->rpm);
+	set_rpm_fan(DRIVES_FAN_RPM_INDEX, state->rpm);
 }
 
 /*
@@ -1032,7 +1304,7 @@
 	}
 
 	/* Set the PCI fan once for now */
-	set_pwm_fan(SLOTS_FAN_PWM_ID, SLOTS_FAN_DEFAULT_PWM);
+	set_pwm_fan(SLOTS_FAN_PWM_INDEX, SLOTS_FAN_DEFAULT_PWM);
 
 	/* Initialize ADCs */
 	initialize_adc(&cpu_state[0]);
@@ -1047,9 +1319,13 @@
 		start = jiffies;
 
 		down(&driver_lock);
-		do_monitor_cpu(&cpu_state[0]);
-		if (cpu_state[1].monitor != NULL)
-			do_monitor_cpu(&cpu_state[1]);
+		if (cpu_pid_type == CPU_PID_TYPE_COMBINED)
+			do_monitor_cpu_combined();
+		else {
+			do_monitor_cpu_split(&cpu_state[0]);
+			if (cpu_state[1].monitor != NULL)
+				do_monitor_cpu_split(&cpu_state[1]);
+		}
 		do_monitor_backside(&backside_state);
 		do_monitor_drives(&drives_state);
 		up(&driver_lock);
@@ -1113,6 +1389,19 @@
 
 	DBG("counted %d CPUs in the device-tree\n", cpu_count);
 
+	/* Decide the type of PID algorithm to use based on the presence of
+	 * the pumps, though that may not be the best way, that is good enough
+	 * for now
+	 */
+	if (machine_is_compatible("PowerMac7,3")
+	    && (cpu_count > 1)
+	    && fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID
+	    && fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID) {
+		printk(KERN_INFO "Liquid cooling pumps detected, using new algorithm !\n");
+		cpu_pid_type = CPU_PID_TYPE_COMBINED;
+	} else
+		cpu_pid_type = CPU_PID_TYPE_SPLIT;
+
 	/* Create control loops for everything. If any fail, everything
 	 * fails
 	 */
@@ -1257,12 +1546,91 @@
 	return 0;
 }
 
+static void fcu_lookup_fans(struct device_node *fcu_node)
+{
+	struct device_node *np = NULL;
+	int i;
+
+	/* The table is filled by default with values that are suitable
+	 * for the old machines without device-tree information. We scan
+	 * the device-tree and override those values with whatever is
+	 * there
+	 */
+
+	DBG("Looking up FCU controls in device-tree...\n");
+
+	while ((np = of_get_next_child(fcu_node, np)) != NULL) {
+		int type = -1;
+		char *loc;
+		u32 *reg;
+
+		DBG(" control: %s, type: %s\n", np->name, np->type);
+
+		/* Detect control type */
+		if (!strcmp(np->type, "fan-rpm-control") ||
+		    !strcmp(np->type, "fan-rpm"))
+			type = FCU_FAN_RPM;
+		if (!strcmp(np->type, "fan-pwm-control") ||
+		    !strcmp(np->type, "fan-pwm"))
+			type = FCU_FAN_PWM;
+		/* Only care about fans for now */
+		if (type == -1)
+			continue;
+
+		/* Lookup for a matching location */
+		loc = (char *)get_property(np, "location", NULL);
+		reg = (u32 *)get_property(np, "reg", NULL);
+		if (loc == NULL || reg == NULL)
+			continue;
+		DBG(" matching location: %s, reg: 0x%08x\n", loc, *reg);
+
+		for (i = 0; i < FCU_FAN_COUNT; i++) {
+			int fan_id;
+
+			if (strcmp(loc, fcu_fans[i].loc))
+				continue;
+			DBG(" location match, index: %d\n", i);
+			fcu_fans[i].id = FCU_FAN_ABSENT_ID;
+			if (type != fcu_fans[i].type) {
+				printk(KERN_WARNING "therm_pm72: Fan type mismatch "
+				       "in device-tree for %s\n", np->full_name);
+				break;
+			}
+			if (type == FCU_FAN_RPM)
+				fan_id = ((*reg) - 0x10) / 2;
+			else
+				fan_id = ((*reg) - 0x30) / 2;
+			if (fan_id > 7) {
+				printk(KERN_WARNING "therm_pm72: Can't parse "
+				       "fan ID in device-tree for %s\n", np->full_name);
+				break;
+			}
+			DBG(" fan id -> %d, type -> %d\n", fan_id, type);
+			fcu_fans[i].id = fan_id;
+		}
+	}
+
+	/* Now dump the array */
+	printk(KERN_INFO "Detected fan controls:\n");
+	for (i = 0; i < FCU_FAN_COUNT; i++) {
+		if (fcu_fans[i].id == FCU_FAN_ABSENT_ID)
+			continue;
+		printk(KERN_INFO "  %d: %s fan, id %d, location: %s\n", i,
+		       fcu_fans[i].type == FCU_FAN_RPM ? "RPM" : "PWM",
+		       fcu_fans[i].id, fcu_fans[i].loc);
+	}
+}
+
 static int fcu_of_probe(struct of_device* dev, const struct of_match *match)
 {
 	int rc;
 
 	state = state_detached;
 
+	/* Lookup the fans in the device tree */
+	fcu_lookup_fans(dev->node);
+
+	/* Add the driver */
 	rc = i2c_add_driver(&therm_pm72_driver);
 	if (rc < 0)
 		return rc;
@@ -1301,7 +1669,8 @@
 {
 	struct device_node *np;
 
-	if (!machine_is_compatible("PowerMac7,2"))
+	if (!machine_is_compatible("PowerMac7,2") &&
+	    !machine_is_compatible("PowerMac7,3"))
 	    	return -ENODEV;
 
 	printk(KERN_INFO "PowerMac G5 Thermal control driver %s\n", VERSION);
diff -urN linux-2.5/drivers/macintosh/therm_pm72.h linux-pogo/drivers/macintosh/therm_pm72.h
--- linux-2.5/drivers/macintosh/therm_pm72.h	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.h	2004-10-15 18:58:22.000000000 +1000
@@ -119,18 +119,33 @@
 #define ADC_CPU_CURRENT_SCALE	0x1f40	/* _AD4 */
 
 /*
- * PID factors for the U3/Backside fan control loop
+ * PID factors for the U3/Backside fan control loop. We have 2 sets
+ * of values here, one set for U3 and one set for U3H
  */
-#define BACKSIDE_FAN_PWM_ID		1
-#define BACKSIDE_PID_G_d		0x02800000
+#define BACKSIDE_FAN_PWM_DEFAULT_ID	1
+#define BACKSIDE_FAN_PWM_INDEX		0
+#define BACKSIDE_PID_U3_G_d		0x02800000
+#define BACKSIDE_PID_U3H_G_d		0x01400000
 #define BACKSIDE_PID_G_p		0x00500000
 #define BACKSIDE_PID_G_r		0x00000000
-#define BACKSIDE_PID_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3H_INPUT_TARGET	0x004b0000
 #define BACKSIDE_PID_INTERVAL		5
 #define BACKSIDE_PID_OUTPUT_MAX		100
-#define BACKSIDE_PID_OUTPUT_MIN		20
+#define BACKSIDE_PID_U3_OUTPUT_MIN	20
+#define BACKSIDE_PID_U3H_OUTPUT_MIN	30
 #define BACKSIDE_PID_HISTORY_SIZE	2
 
+struct basckside_pid_params
+{
+	u32			G_d;
+	u32			G_p;
+	u32			G_r;
+	u32			input_target;
+	u32			output_min;
+	u32			output_max;
+};
+
 struct backside_pid_state
 {
 	int			ticks;
@@ -146,7 +161,8 @@
 /*
  * PID factors for the Drive Bay fan control loop
  */
-#define DRIVES_FAN_RPM_ID      		2
+#define DRIVES_FAN_RPM_DEFAULT_ID	2
+#define DRIVES_FAN_RPM_INDEX		1
 #define DRIVES_PID_G_d			0x01e00000
 #define DRIVES_PID_G_p			0x00500000
 #define DRIVES_PID_G_r			0x00000000
@@ -168,7 +184,8 @@
 	int			first;
 };
 
-#define SLOTS_FAN_PWM_ID       		2
+#define SLOTS_FAN_PWM_DEFAULT_ID	2
+#define SLOTS_FAN_PWM_INDEX		2
 #define	SLOTS_FAN_DEFAULT_PWM		50 /* Do better here ! */
 
 /*
@@ -191,10 +208,15 @@
  * CPU B FAKE POWER	49	(I_V_inputs: 18, 19)
  */
 
-#define CPUA_INTAKE_FAN_RPM_ID		3
-#define CPUA_EXHAUST_FAN_RPM_ID		4
-#define CPUB_INTAKE_FAN_RPM_ID		5
-#define CPUB_EXHAUST_FAN_RPM_ID		6
+#define CPUA_INTAKE_FAN_RPM_DEFAULT_ID	3
+#define CPUA_EXHAUST_FAN_RPM_DEFAULT_ID	4
+#define CPUB_INTAKE_FAN_RPM_DEFAULT_ID	5
+#define CPUB_EXHAUST_FAN_RPM_DEFAULT_ID	6
+
+#define CPUA_INTAKE_FAN_RPM_INDEX	3
+#define CPUA_EXHAUST_FAN_RPM_INDEX	4
+#define CPUB_INTAKE_FAN_RPM_INDEX	5
+#define CPUB_EXHAUST_FAN_RPM_INDEX	6
 
 #define CPU_INTAKE_SCALE		0x0000f852
 #define CPU_TEMP_HISTORY_SIZE		2
@@ -202,6 +224,11 @@
 #define CPU_PID_INTERVAL		1
 #define CPU_MAX_OVERTEMP		30
 
+#define CPUA_PUMP_RPM_INDEX		7
+#define CPUB_PUMP_RPM_INDEX		8
+#define CPU_PUMP_OUTPUT_MAX		3700
+#define CPU_PUMP_OUTPUT_MIN		1000
+
 struct cpu_pid_state
 {
 	int			index;
@@ -219,6 +246,7 @@
 	s32			voltage;
 	s32			current_a;
 	s32			last_temp;
+	s32			last_power;
 	int			first;
 	u8			adc_config;
 };
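
The backside loop touched by the hunks above can be sketched in isolation as follows. This is a minimal userspace sketch, not the driver code: the struct and function names (pid_params, pid_state, pick_params, pid_step) are hypothetical, the final "sum >>= 36" scaling is assumed from the ">> 36" in the DBG() lines, and the temperatures are read here as 16.16 fixed point (0x00410000 as 65 degrees), which the patch itself does not state. The constants are copied from the therm_pm72.h hunk.

```c
#include <assert.h>
#include <stdint.h>

#define HISTORY_SIZE	2	/* mirrors BACKSIDE_PID_HISTORY_SIZE */
#define PID_INTERVAL	5	/* mirrors BACKSIDE_PID_INTERVAL */

/* Hypothetical stand-in for the per-chip parameter block. */
struct pid_params {
	int32_t G_d, G_p, G_r;		/* gains, fixed point */
	int32_t input_target;		/* target temperature, read as 16.16 */
	int32_t output_min, output_max;	/* PWM limits */
};

/* Hypothetical, trimmed-down loop state. */
struct pid_state {
	int32_t error_history[HISTORY_SIZE];
	int cur_sample;
	int first;
	int32_t pwm;
};

/* Pick the U3 or U3H set; values copied from the header hunk. */
static struct pid_params pick_params(int u3h)
{
	struct pid_params p = { .G_p = 0x00500000, .G_r = 0, .output_max = 100 };

	p.G_d = u3h ? 0x01400000 : 0x02800000;
	p.input_target = u3h ? 0x004b0000 : 0x00410000;
	p.output_min = u3h ? 30 : 20;
	return p;
}

/* One PID iteration: record the error, sum the three terms, clamp the PWM. */
static int32_t pid_step(const struct pid_params *p, struct pid_state *s,
			int32_t temp)
{
	int32_t error = temp - p->input_target;
	int32_t integral = 0, derivative;
	int64_t sum = 0;
	int i, prev;

	s->cur_sample = (s->cur_sample + 1) % HISTORY_SIZE;
	s->error_history[s->cur_sample] = error;
	if (s->first) {
		/* Fill the whole history with the first sample */
		for (i = 0; i < HISTORY_SIZE; i++)
			s->error_history[i] = error;
		s->first = 0;
	}

	/* Integral term */
	for (i = 0; i < HISTORY_SIZE; i++)
		integral += s->error_history[i];
	integral *= PID_INTERVAL;
	sum += (int64_t)p->G_r * integral;

	/* Derivative term */
	prev = (s->cur_sample + HISTORY_SIZE - 1) % HISTORY_SIZE;
	derivative = (s->error_history[s->cur_sample] -
		      s->error_history[prev]) / PID_INTERVAL;
	sum += (int64_t)p->G_d * derivative;

	/* Proportional term */
	sum += (int64_t)p->G_p * s->error_history[s->cur_sample];

	/* Assumed scaling step; relies on an arithmetic right shift */
	sum >>= 36;

	s->pwm += (int32_t)sum;
	if (s->pwm < p->output_min)
		s->pwm = p->output_min;
	if (s->pwm > p->output_max)
		s->pwm = p->output_max;
	return s->pwm;
}
```

Under that reading, G_p = 0x00500000 works out to 5 PWM points per degree of error, and the U3H set simply halves the derivative gain while raising the target and the PWM floor.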


From jimix at watson.ibm.com  Sat Oct 16 01:53:17 2004
From: jimix at watson.ibm.com (Jimi Xenidis)
Date: Fri, 15 Oct 2004 11:53:17 -0400
Subject: [vHype-discussion] u64 in linux
In-Reply-To: <1097849471.25095.97.camel@brick.watson.ibm.com>
References: <1097849471.25095.97.camel@brick.watson.ibm.com>
Message-ID: <16751.62061.393716.650492@kitch0.watson.ibm.com>

>>>>> "MO" == Michal Ostrowski <mostrows at watson.ibm.com> writes:

 MO> In trying to integrate ppc64 changes into the vhype linux tree, I'm
 MO> coming across a problem with usage of "u64".

 MO> On x86, u64 is "unsigned long long".  On ppc64 it is "unsigned long".

*sigh* I thought the hell over size_t unsigned int vs. unsigned long
would have taught everyone.

BTW: a thread starts here:
   http://www.ussg.iu.edu/hypermail/linux/kernel/0402.3/1428.html

After a whole lot of clicking it looks like a dropped patch.

I guess it's the cast; it seems that's the Linux way at the moment.

-JX


From jimix at watson.ibm.com  Sat Oct 16 02:46:56 2004
From: jimix at watson.ibm.com (Jimi Xenidis)
Date: Fri, 15 Oct 2004 12:46:56 -0400
Subject: u64 in linux
In-Reply-To: <16751.62061.393716.650492@kitch0.watson.ibm.com>
References: <1097849471.25095.97.camel@brick.watson.ibm.com>
	<16751.62061.393716.650492@kitch0.watson.ibm.com>
Message-ID: <16751.65280.234326.437361@kitch0.watson.ibm.com>

>>>>> "JX" == Jimi Xenidis <jimix at watson.ibm.com> writes:

Forgive the CC to my internal list.
The real question is, what was the result of this thread?
 JX>    http://www.ussg.iu.edu/hypermail/linux/kernel/0402.3/1428.html

And is casting the acceptable thing to do?
-JX


From hpa at zytor.com  Sat Oct 16 03:33:45 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 10:33:45 -0700
Subject: Fan control for PowerMac7_3
In-Reply-To: <1097832049.1149.115.camel@gaston>
References: <1097831790.1131.111.camel@gaston>	
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>
Message-ID: <417009F9.6080007@zytor.com>

Hi there,

I tried to apply this patch to top-of-tree (bkcvs), but it looks like 
the current TOT doesn't compile on ppc64 for unrelated reasons:

.config attached.

arch/ppc64/kernel/built-in.o(.text+0x79f8): In function `.sys_call_table32':
: undefined reference to `.sys_acct'
arch/ppc64/kernel/built-in.o(.text+0x7c78): In function `.sys_call_table32':
: undefined reference to `.sys_quotactl'
arch/ppc64/kernel/built-in.o(.text+0x8078): In function `.sys_call_table32':
: undefined reference to `.compat_mbind'
arch/ppc64/kernel/built-in.o(.text+0x8080): In function `.sys_call_table32':
: undefined reference to `.compat_get_mempolicy'
arch/ppc64/kernel/built-in.o(.text+0x8088): In function `.sys_call_table32':
: undefined reference to `.compat_set_mempolicy'
arch/ppc64/kernel/built-in.o(.text+0x8090): In function `.sys_call_table32':
: undefined reference to `.compat_sys_mq_open'
arch/ppc64/kernel/built-in.o(.text+0x8098): In function `.sys_call_table32':
: undefined reference to `.sys_mq_unlink'
arch/ppc64/kernel/built-in.o(.text+0x80a0): In function `.sys_call_table32':
: undefined reference to `.compat_sys_mq_timedsend'
arch/ppc64/kernel/built-in.o(.text+0x80a8): In function `.sys_call_table32':
: undefined reference to `.compat_sys_mq_timedreceive'
arch/ppc64/kernel/built-in.o(.text+0x80b0): In function `.sys_call_table32':
: undefined reference to `.compat_sys_mq_notify'
arch/ppc64/kernel/built-in.o(.text+0x80b8): In function `.sys_call_table32':
: undefined reference to `.compat_sys_mq_getsetattr'
arch/ppc64/kernel/built-in.o(.text+0x8260): In function `.sys_call_table':
: undefined reference to `.sys_acct'
arch/ppc64/kernel/built-in.o(.text+0x84e0): In function `.sys_call_table':
: undefined reference to `.sys_quotactl'
arch/ppc64/kernel/built-in.o(.text+0x88e0): In function `.sys_call_table':
: undefined reference to `.sys_mbind'
arch/ppc64/kernel/built-in.o(.text+0x88e8): In function `.sys_call_table':
: undefined reference to `.sys_get_mempolicy'
arch/ppc64/kernel/built-in.o(.text+0x88f0): In function `.sys_call_table':
: undefined reference to `.sys_set_mempolicy'
arch/ppc64/kernel/built-in.o(.text+0x88f8): In function `.sys_call_table':
: undefined reference to `.sys_mq_open'
arch/ppc64/kernel/built-in.o(.text+0x8900): In function `.sys_call_table':
: undefined reference to `.sys_mq_unlink'
arch/ppc64/kernel/built-in.o(.text+0x8908): In function `.sys_call_table':
: undefined reference to `.sys_mq_timedsend'
arch/ppc64/kernel/built-in.o(.text+0x8910): In function `.sys_call_table':
: undefined reference to `.sys_mq_timedreceive'
arch/ppc64/kernel/built-in.o(.text+0x8918): In function `.sys_call_table':
: undefined reference to `.sys_mq_notify'
arch/ppc64/kernel/built-in.o(.text+0x8920): In function `.sys_call_table':
: undefined reference to `.sys_mq_getsetattr'
make: *** [.tmp_vmlinux1] Error 1

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: .config
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041015/6b5d420b/attachment.txt 

From arnd at arndb.de  Sat Oct 16 04:58:58 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Fri, 15 Oct 2004 20:58:58 +0200
Subject: [vHype-discussion] u64 in linux
In-Reply-To: <16751.62061.393716.650492@kitch0.watson.ibm.com>
References: <1097849471.25095.97.camel@brick.watson.ibm.com>
	<16751.62061.393716.650492@kitch0.watson.ibm.com>
Message-ID: <200410152059.03647.arnd@arndb.de>

On Friday 15 October 2004 17:53, Jimi Xenidis wrote:
> BTW: a thread starts here:
>    http://www.ussg.iu.edu/hypermail/linux/kernel/0402.3/1428.html
> 
> After a whole lot of clicking it looks like a dropped patch.
> 
> I guess it's the cast; it seems that's the Linux way at the moment.

Yes, I think there have been some patches to drivers going in that
direction.
An alternative if the warning is in your own code is to use
'unsigned long long' or a user defined 'uval64' directly in
the declaration instead of 'u64'.

C99 also mandates that the macro PRIu64 contains the correct
format string for uint64_t (which afaik is always the same as u64).
It's currently not defined in linux, but could perhaps be added.

	Arnd <><

From hpa at zytor.com  Sat Oct 16 05:21:07 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 12:21:07 -0700
Subject: [vHype-discussion] u64 in linux
In-Reply-To: <200410152059.03647.arnd@arndb.de>
References: <1097849471.25095.97.camel@brick.watson.ibm.com>	<16751.62061.393716.650492@kitch0.watson.ibm.com>
	<200410152059.03647.arnd@arndb.de>
Message-ID: <41702323.9010903@zytor.com>

Arnd Bergmann wrote:
> On Friday 15 October 2004 17:53, Jimi Xenidis wrote:
> 
>>BTW: a thread starts here:
>>   http://www.ussg.iu.edu/hypermail/linux/kernel/0402.3/1428.html
>>
>>After a whole lot of clicking it looks like a dropped patch.
>>
>>I guess it's the cast; it seems that's the Linux way at the moment.
> 
> 
> Yes, I think there have been some patches to drivers going in that
> direction.
> An alternative if the warning is in your own code is to use
> 'unsigned long long' or a user defined 'uval64' directly in
> the declaration instead of 'u64'.
> 
> C99 also mandates that the macro PRIu64 contains the correct
> format string for uint64_t (which afaik is always the same as u64).
> It's currently not defined in linux, but could perhaps be added.
> 

Also, in C99, you can print any integer type by casting it to [u]intmax_t and 
using the %j length modifier (e.g. %jd or %ju).
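
The three options raised in this thread can be put side by side in a small userspace C99 sketch. The helper names (fmt_cast, fmt_pri, fmt_max) are made up for illustration, and nothing here assumes that the kernel's printk understands PRIu64 or %j; it only shows what the standard library offers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Option 1: cast to unsigned long long and print with %llu. */
static int fmt_cast(char *buf, size_t len, uint64_t v)
{
	return snprintf(buf, len, "%llu", (unsigned long long)v);
}

/* Option 2: let <inttypes.h> supply the conversion via PRIu64. */
static int fmt_pri(char *buf, size_t len, uint64_t v)
{
	return snprintf(buf, len, "%" PRIu64, v);
}

/* Option 3: cast to uintmax_t and use the j length modifier. */
static int fmt_max(char *buf, size_t len, uint64_t v)
{
	return snprintf(buf, len, "%ju", (uintmax_t)v);
}
```

All three compile without format warnings whether the 64-bit type is "unsigned long" or "unsigned long long" underneath, which is the portability problem being discussed.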

	-hpa


From hpa at zytor.com  Sat Oct 16 05:27:58 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 12:27:58 -0700
Subject: [vHype-discussion] u64 in linux
In-Reply-To: <41702323.9010903@zytor.com>
References: <1097849471.25095.97.camel@brick.watson.ibm.com>	<16751.62061.393716.650492@kitch0.watson.ibm.com>	<200410152059.03647.arnd@arndb.de>
	<41702323.9010903@zytor.com>
Message-ID: <417024BE.3060008@zytor.com>

H. Peter Anvin wrote:
> 
> Also, in C99, you can print any integer type by casting it to 
> [u]intmax_t and using %j.
> 

By the way, my very firm opinion on this is that we should match and use 
<inttypes.h> as much as possible.  Quite frankly <inttypes.h> actually 
resolves a lot of issues that previous attempts at creating these datatypes -- 
including the one in Linux -- have ignored.  This is a good thing.

Yes, there is ugliness, and I actually would have liked to see the C99 
committee adopt the M$ extension %Inn (e.g. %I64d for a 64-bit signed 
decimal integer); to make matters worse, GNU used %I for a different 
purpose, so it's not even possible to make it a compatible extension.

	-hpa


From olh at suse.de  Sat Oct 16 05:34:00 2004
From: olh at suse.de (Olaf Hering)
Date: Fri, 15 Oct 2004 21:34:00 +0200
Subject: Fan control for PowerMac7_3
In-Reply-To: <417009F9.6080007@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston> <417009F9.6080007@zytor.com>
Message-ID: <20041015193400.GA14307@suse.de>

 On Fri, Oct 15, H. Peter Anvin wrote:

> Hi there,
> 
> I tried to apply this patch to top-of-tree (bkcvs), but it looks like 
> the current TOT doesn't compile on ppc64 for unrelated reasons:
> 
> .config attached.

> # Linux kernel version: 2.6.9-rc4

rc4-bk3 builds ok for me with that config.

-- 
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From dwmw2 at infradead.org  Sat Oct 16 06:52:42 2004
From: dwmw2 at infradead.org (David Woodhouse)
Date: Fri, 15 Oct 2004 21:52:42 +0100
Subject: Reserve initrd pages.
Message-ID: <1097873562.13633.732.camel@hades.cambridge.redhat.com>

We don't mark initrd pages as reserved. If we manage to allocate enough
other stuff before using the initrd, we end up eating into the initrd
and we don't boot.

Signed-Off-By: David Woodhouse <dwmw2 at infradead.org>

===== arch/ppc64/kernel/setup.c 1.83 vs edited =====
--- 1.83/arch/ppc64/kernel/setup.c	2004-10-04 20:17:37 +01:00
+++ edited/arch/ppc64/kernel/setup.c	2004-10-15 21:02:33 +01:00
@@ -30,6 +30,7 @@
 #include <linux/notifier.h>
 #include <linux/cpu.h>
 #include <linux/unistd.h>
+#include <linux/bootmem.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/processor.h>
@@ -990,6 +991,9 @@
 
 	/* set up the bootmem stuff with available memory */
 	do_init_bootmem();
+
+	if (initrd_start)
+		reserve_bootmem(__pa(initrd_start), initrd_end-initrd_start);
 
 	/* Select the correct idle loop for the platform. */
 	idle_setup();


-- 
dwmw2


From schwab at suse.de  Sat Oct 16 07:00:48 2004
From: schwab at suse.de (Andreas Schwab)
Date: Fri, 15 Oct 2004 23:00:48 +0200
Subject: 2.6.9-rc4: oops during ide probing
In-Reply-To: <m3llednhfl.fsf@igel.m5r.de> (Andreas Schwab's message of "Mon,
	11 Oct 2004 22:11:42 +0200")
References: <m3llednhfl.fsf@igel.m5r.de>
Message-ID: <jeu0sv8znj.fsf@sykes.suse.de>

> I'm getting an oops during ide probing on the PMac G5 with 2.6.9-rc4:
>
> ide-pmac: cannot find MacIO node for Kauai ATA interface
> ide0: Found Apple OHare ATA controller, bus ID 0, irq 0
> Oops: Kernel access of bad area, sig: 11 [#1]
> NIP [...] .ide_mm_inb+0x0/0x14
> LR [...] .ide_wait_not_busy+0x98/0xf0

That turned out to be an apparent compiler bug.  The kernel is working
fine for me now.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From schwab at suse.de  Sat Oct 16 08:16:28 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sat, 16 Oct 2004 00:16:28 +0200
Subject: Fan control for PowerMac7_3
In-Reply-To: <1097832049.1149.115.camel@gaston> (Benjamin Herrenschmidt's
	message of "Fri, 15 Oct 2004 19:20:50 +1000")
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
Message-ID: <jewtxrlj9f.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> On Fri, 2004-10-15 at 19:19, Benjamin Herrenschmidt wrote:
>> On Fri, 2004-10-15 at 19:16, Benjamin Herrenschmidt wrote:
>> > Hi !
>> > 
>> > This is an experimental (read: totally untested) patch to the G5 fan
>> > control code. All I know is that it builds :)
>> 
>> And I sent a wrong version ... sorry, the good one in a few minutes.
>
> Here it is:

Here's a patch to make it compile with DEBUG enabled:

--- linux-2.6.9-rc4/drivers/macintosh/therm_pm72.c.~1~	2004-10-16 00:02:36.705511068 +0200
+++ linux-2.6.9-rc4/drivers/macintosh/therm_pm72.c	2004-10-16 00:07:04.815455733 +0200
@@ -652,7 +652,7 @@ static int do_read_one_cpu_values(struct
 		DBG("  cpu %d, fan reading error !\n", state->index);
 	} else {
 		state->rpm = rc;
-		DBG("  cpu %d, exhaust RPM: %d\n", state->rpm);
+		DBG("  cpu %d, exhaust RPM: %d\n", state->index, state->rpm);
 	}
 
 	/* Get some sensor readings and scale it */
@@ -691,8 +691,8 @@ static int do_read_one_cpu_values(struct
 	state->last_power = *power;
 
 	DBG("  cpu %d, current: %d.%03d, voltage: %d.%03d, power: %d.%03d W\n",
-	    state->index, FIX32TOPRINT(current_a), FIX32TOPRINT(voltage),
-	    FIX32TOPRINT(*power));
+	    state->index, FIX32TOPRINT(state->current_a),
+	    FIX32TOPRINT(state->voltage), FIX32TOPRINT(*power));
 
 	return 0;
 }
@@ -850,7 +850,7 @@ static void do_monitor_cpu_combined(void
 	state1->intake_rpm = state0->intake_rpm;
 
 	DBG("** CPU %d RPM: %d Ex, %d, Pump: %d, In, overtemp: %d\n",
-	    state->index, (int)state->rpm, intake, pump, state->overtemp);
+	    state1->index, (int)state1->rpm, intake, pump, state1->overtemp);
 
 	/* We should check for errors, shouldn't we ? But then, what
 	 * do we do once the error occurs ? For FCU notified fan

Andreas.
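
Breakage like this tends to go unnoticed when a debug macro expands to nothing in normal builds, so its arguments are never compiled. One common way to keep the arguments checked even when debugging is off is sketched below. It is a userspace illustration only: THERM_DEBUG, the cpu_state struct and report() are hypothetical names, and how DBG() is actually defined in therm_pm72.c is not shown in this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical switch; 0 means debug output is compiled but never run. */
#define THERM_DEBUG 0

/*
 * The arguments are always parsed and type-checked by the compiler,
 * but the call is dead code (and normally optimised away) when
 * THERM_DEBUG is 0, so they are not evaluated at run time.
 */
#define DBG(...) do { if (THERM_DEBUG) printf(__VA_ARGS__); } while (0)

/* Hypothetical, trimmed-down state for the example. */
struct cpu_state {
	int index;
	int rpm;
};

/* Returns how many times the DBG() arguments were evaluated. */
static int report(const struct cpu_state *state)
{
	int evaluated = 0;

	DBG("  cpu %d, exhaust RPM: %d (%d)\n",
	    state->index, state->rpm, ++evaluated);
	return evaluated;
}
```

With this form, a reference to an undeclared variable or a missing format argument shows up in every build rather than only when debugging is switched on.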

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From raosanth at us.ibm.com  Sat Oct 16 07:00:25 2004
From: raosanth at us.ibm.com (Santhosh Rao)
Date: Fri, 15 Oct 2004 16:00:25 -0500
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
Message-ID: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>

Ok, it appears we aren't dropping into the Open Firmware debugger 
randomly; the kernel seems to give up early in the boot process.
Below is the output of an attempted boot of 2.6.9-rc4.

Jose, ever seen anything like this?

The machine is a p615 power-4  2-CPU box with 2GB of RAM.

--
Sonny

Output:


Elapsed time since release of system processors: 1 mins 23 secs

Config file read, 4096 bytes

Welcome to yaboot version 1.3.11.SuSE
Enter "help" to get some basic usage information
boot: autobench 
Please wait, loading kernel...
   Elf64 kernel loaded...
OF stdout device is: /pci at 400000000110/isa at 3/serial at i3f8
command line: root=/dev/sda3 elevator=noop  elevator=noop 
memory layout at init:
  alloc_bottom : 000000000403c000
  alloc_top    : 0000000040000000
  alloc_top_hi : 0000000080000000
  rmo_top      : 0000000080000000
  ram_top      : 0000000080000000
Looking for displays
ERROR, cannot find space for TCE table.
EXIT called ok
0 > 

From dwmw2 at infradead.org  Sat Oct 16 09:15:50 2004
From: dwmw2 at infradead.org (David Woodhouse)
Date: Sat, 16 Oct 2004 00:15:50 +0100
Subject: Reserve initrd pages.
In-Reply-To: <1097873562.13633.732.camel@hades.cambridge.redhat.com>
References: <1097873562.13633.732.camel@hades.cambridge.redhat.com>
Message-ID: <1097882150.13633.754.camel@hades.cambridge.redhat.com>

On Fri, 2004-10-15 at 21:52 +0100, David Woodhouse wrote:
> +		reserve_bootmem(__pa(initrd_start), initrd_end-initrd_start);

That doesn't work if CONFIG_NUMA is set. This one does...

--- linux-2.6.8/arch/ppc64/kernel/setup.c~	2004-10-15 20:59:01.000000000 +0100
+++ linux-2.6.8/arch/ppc64/kernel/setup.c	2004-10-15 23:59:18.082932384 +0100
@@ -533,6 +533,8 @@
 	if (initrd_start)
 		printk("Found initrd at 0x%lx:0x%lx\n", initrd_start, initrd_end);
 
+	lmb_reserve(__pa(initrd_start), initrd_end-initrd_start);
+
 	DBG(" <- check_for_initrd()\n");
 #endif /* CONFIG_BLK_DEV_INITRD */
 }


-- 
dwmw2


From dwmw2 at infradead.org  Sat Oct 16 09:35:23 2004
From: dwmw2 at infradead.org (David Woodhouse)
Date: Sat, 16 Oct 2004 00:35:23 +0100
Subject: Fan control for PowerMac7_3
In-Reply-To: <417009F9.6080007@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<417009F9.6080007@zytor.com>
Message-ID: <1097883323.13633.757.camel@hades.cambridge.redhat.com>

On Fri, 2004-10-15 at 10:33 -0700, H. Peter Anvin wrote:
> Hi there,
> 
> I tried to apply this patch to top-of-tree (bkcvs), but it looks like 
> the current TOT doesn't compile on ppc64 for unrelated reasons:

Building with -mcall-aixdesc will work around that.

-- 
dwmw2


From benh at kernel.crashing.org  Sat Oct 16 10:39:21 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 10:39:21 +1000
Subject: Fan control for PowerMac7_3
In-Reply-To: <417009F9.6080007@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<417009F9.6080007@zytor.com>
Message-ID: <1097887160.6527.15.camel@gaston>

On Sat, 2004-10-16 at 03:33, H. Peter Anvin wrote:
> Hi there,
> 
> I tried to apply this patch to top-of-tree (bkcvs), but it looks like 
> the current TOT doesn't compile on ppc64 for unrelated reasons:

Weird... could it be cond_syscall not working ?

Ben.


From benh at kernel.crashing.org  Sat Oct 16 10:42:02 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 10:42:02 +1000
Subject: Fan control for PowerMac7_3
In-Reply-To: <1097883323.13633.757.camel@hades.cambridge.redhat.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<417009F9.6080007@zytor.com>
	<1097883323.13633.757.camel@hades.cambridge.redhat.com>
Message-ID: <1097887322.6487.21.camel@gaston>

On Sat, 2004-10-16 at 09:35, David Woodhouse wrote:
> On Fri, 2004-10-15 at 10:33 -0700, H. Peter Anvin wrote:
> > Hi there,
> > 
> > I tried to apply this patch to top-of-tree (bkcvs), but it looks like 
> > the current TOT doesn't compile on ppc64 for unrelated reasons:
> 
> Building with -mcall-aixdesc will work around that.

What is the exact problem ?

Ben.


From benh at kernel.crashing.org  Sat Oct 16 10:45:10 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 10:45:10 +1000
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
In-Reply-To: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
References: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
Message-ID: <1097887510.6487.23.camel@gaston>

On Sat, 2004-10-16 at 07:00, Santhosh Rao wrote:
> Ok, it appears we aren't dropping into the Open Firmware debugger
> randomly; the kernel seems to give up early in the boot process.
> Below is the output of an attempted boot of 2.6.9-rc4.
> 
> Jose, ever seen anything like this?
> 
> The machine is a p615 power-4  2-CPU box with 2GB of RAM.

Can you enable PROM_DEBUG in arch/ppc64/kernel/prom_init.c and send me the
output log ?

Ben.


From benh at kernel.crashing.org  Sat Oct 16 10:46:18 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 10:46:18 +1000
Subject: Reserve initrd pages.
In-Reply-To: <1097873562.13633.732.camel@hades.cambridge.redhat.com>
References: <1097873562.13633.732.camel@hades.cambridge.redhat.com>
Message-ID: <1097887578.6546.25.camel@gaston>

On Sat, 2004-10-16 at 06:52, David Woodhouse wrote:
> We don't mark initrd pages as reserved. If we manage to allocate enough
> other stuff before using the initrd, we end up eating into the initrd
> and we don't boot.

Hrm... that should be done in

> Signed-Off-By: David Woodhouse <dwmw2 at infradead.org>
> 
> ===== arch/ppc64/kernel/setup.c 1.83 vs edited =====
> --- 1.83/arch/ppc64/kernel/setup.c	2004-10-04 20:17:37 +01:00
> +++ edited/arch/ppc64/kernel/setup.c	2004-10-15 21:02:33 +01:00
> @@ -30,6 +30,7 @@
>  #include <linux/notifier.h>
>  #include <linux/cpu.h>
>  #include <linux/unistd.h>
> +#include <linux/bootmem.h>
>  #include <asm/io.h>
>  #include <asm/prom.h>
>  #include <asm/processor.h>
> @@ -990,6 +991,9 @@
>  
>  	/* set up the bootmem stuff with available memory */
>  	do_init_bootmem();
> +
> +	if (initrd_start)
> +		reserve_bootmem(__pa(initrd_start), initrd_end-initrd_start);
>  
>  	/* Select the correct idle loop for the platform. */
>  	idle_setup();
-- 
Benjamin Herrenschmidt <benh at kernel.crashing.org>


From benh at kernel.crashing.org  Sat Oct 16 10:47:41 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 10:47:41 +1000
Subject: Reserve initrd pages.
In-Reply-To: <1097873562.13633.732.camel@hades.cambridge.redhat.com>
References: <1097873562.13633.732.camel@hades.cambridge.redhat.com>
Message-ID: <1097887661.6487.28.camel@gaston>

On Sat, 2004-10-16 at 06:52, David Woodhouse wrote:
> We don't mark initrd pages as reserved. If we manage to allocate enough
> other stuff before using the initrd, we end up eating into the initrd
> and we don't boot.

That should be done in mm/init.c, do_init_bootmem() itself:

	/* reserve the sections we're already using */
	for (i=0; i < lmb.reserved.cnt; i++) {
		unsigned long physbase = lmb.reserved.region[i].physbase;
		unsigned long size = lmb.reserved.region[i].size;

		reserve_bootmem(physbase, size);
	}

The initrd is part of the "reserved map" passed in by prom_init and thus
is put in the list of reserved lmb regions.

Ben.


From benh at kernel.crashing.org  Sat Oct 16 10:48:11 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 10:48:11 +1000
Subject: Reserve initrd pages.
In-Reply-To: <1097882150.13633.754.camel@hades.cambridge.redhat.com>
References: <1097873562.13633.732.camel@hades.cambridge.redhat.com>
	<1097882150.13633.754.camel@hades.cambridge.redhat.com>
Message-ID: <1097887691.6527.30.camel@gaston>

On Sat, 2004-10-16 at 09:15, David Woodhouse wrote:
> On Fri, 2004-10-15 at 21:52 +0100, David Woodhouse wrote:
> > +		reserve_bootmem(__pa(initrd_start), initrd_end-initrd_start);
> 
> That doesn't work if CONFIG_NUMA is set. This one does...

Again, it should already be in the LMB reserve map. If it is not, then
there is a bug, but that isn't the right fix.

Ben.


From dwmw2 at infradead.org  Sat Oct 16 10:47:46 2004
From: dwmw2 at infradead.org (David Woodhouse)
Date: Sat, 16 Oct 2004 01:47:46 +0100
Subject: Fan control for PowerMac7_3
In-Reply-To: <1097887322.6487.21.camel@gaston>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<417009F9.6080007@zytor.com>
	<1097883323.13633.757.camel@hades.cambridge.redhat.com>
	<1097887322.6487.21.camel@gaston>
Message-ID: <1097887666.5788.2059.camel@baythorne.infradead.org>

On Sat, 2004-10-16 at 10:42 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2004-10-16 at 09:35, David Woodhouse wrote:
> > On Fri, 2004-10-15 at 10:33 -0700, H. Peter Anvin wrote:
> > > Hi there,
> > > 
> > > I tried to apply this patch to top-of-tree (bkcvs), but it looks like 
> > > the current TOT doesn't compile on ppc64 for unrelated reasons:
> > 
> > Building with -mcall-aixdesc will work around that.
> 
> What is the exact problem ?

cond_syscall not working due to new ABI.

-- 
dwmw2


From benh at kernel.crashing.org  Sat Oct 16 12:23:53 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 12:23:53 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097832049.1149.115.camel@gaston>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
Message-ID: <1097893432.6546.37.camel@gaston>

Ok, here's a new patch that fixes a few issues. It's been
tested on a non-liquid cooled system and appears to work ok.

diff -urN linux-2.5/drivers/macintosh/therm_pm72.c linux-pogo/drivers/macintosh/therm_pm72.c
--- linux-2.5/drivers/macintosh/therm_pm72.c	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.c	2004-10-16 12:21:42.000000000 +1000
@@ -46,6 +46,8 @@
  *          overtemp conditions so userland can take some policy
  *          decisions, like slewing down CPUs
  *	  - Deal with fan and i2c failures in a better way
+ *	  - Maybe do a generic PID based on params used for
+ *	    U3 and Drives ?
  *
  * History:
  *
@@ -73,6 +75,14 @@
  *        values in the configuration register
  *	- Switch back to use of target fan speed for PID, thus lowering
  *        pressure on i2c
+ *
+ *  Oct. 16, 2004 : 1.1b2 (beta)
+ *	- Add device-tree lookup for fan IDs, should detect liquid cooling
+ *        pumps when present
+ *	- Enable driver for PowerMac7,3 machines
+ *	- Split the U3/Backside cooling on U3 & U3H versions as Darwin does
+ *	- Add new CPU cooling algorithm for machines with liquid cooling
+ *	- Workaround for some PowerMac7,3 with empty "fan" node in the devtree
  */
 
 #include <linux/config.h>
@@ -101,7 +111,7 @@
 
 #include "therm_pm72.h"
 
-#define VERSION "0.9"
+#define VERSION "1.1b2"
 
 #undef DEBUG
 
@@ -121,16 +131,100 @@
 static struct i2c_adapter *		u3_1;
 static struct i2c_client *		fcu;
 static struct cpu_pid_state		cpu_state[2];
+static struct basckside_pid_params	backside_params;
 static struct backside_pid_state	backside_state;
 static struct drives_pid_state		drives_state;
 static int				state;
 static int				cpu_count;
+static int				cpu_pid_type;
 static pid_t				ctrl_task;
 static struct completion		ctrl_complete;
 static int				critical_state;
 static DECLARE_MUTEX(driver_lock);
 
 /*
+ * We have 2 types of CPU PID control. One is "split" old style control
+ * for intake & exhaust fans, the other is "combined" control for both
+ * CPUs that also deals with the pumps when present. To be "compatible"
+ * with OS X at this point, we only use "COMBINED" on the machines that
+ * are identified as having the pumps (though that identification is at
+ * least dodgy). Ultimately, we could probably switch completely to this
+ * algorithm provided we hack it to deal with the UP case
+ */
+#define CPU_PID_TYPE_SPLIT	0
+#define CPU_PID_TYPE_COMBINED	1
+
+/*
+ * This table describes all fans in the FCU. The "id" and "type" values
+ * are defaults valid for all earlier machines. Newer machines will
+ * eventually override the table content based on the device-tree
+ */
+struct fcu_fan_table
+{
+	char*	loc;	/* location code */
+	int	type;	/* 0 = rpm, 1 = pwm, 2 = pump */
+	int	id;	/* id or -1 */
+};
+
+#define FCU_FAN_RPM		0
+#define FCU_FAN_PWM		1
+
+#define FCU_FAN_ABSENT_ID	-1
+
+#define FCU_FAN_COUNT		ARRAY_SIZE(fcu_fans)
+
+struct fcu_fan_table	fcu_fans[] = {
+	[BACKSIDE_FAN_PWM_INDEX] = {
+		.loc	= "BACKSIDE",
+		.type	= FCU_FAN_PWM,
+		.id	= BACKSIDE_FAN_PWM_DEFAULT_ID,
+	},
+	[DRIVES_FAN_RPM_INDEX] = {
+		.loc	= "DRIVE BAY",
+		.type	= FCU_FAN_RPM,
+		.id	= DRIVES_FAN_RPM_DEFAULT_ID,
+	},
+	[SLOTS_FAN_PWM_INDEX] = {
+		.loc	= "SLOT",
+		.type	= FCU_FAN_PWM,
+		.id	= SLOTS_FAN_PWM_DEFAULT_ID,
+	},
+	[CPUA_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU A INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUA_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU A EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU B INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU B EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	/* pumps aren't present by default, have to be looked up in the
+	 * device-tree
+	 */
+	[CPUA_PUMP_RPM_INDEX] = {
+		.loc	= "CPU A PUMP",
+		.type	= FCU_FAN_RPM,		
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+	[CPUB_PUMP_RPM_INDEX] = {
+		.loc	= "CPU B PUMP",
+		.type	= FCU_FAN_RPM,
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+};
+
+/*
  * i2c_driver structure to attach to the host i2c controller
  */
 
@@ -331,10 +425,16 @@
 	return 0;
 }
 
-static int set_rpm_fan(int fan, int rpm)
+static int set_rpm_fan(int fan_index, int rpm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (rpm < 300)
 		rpm = 300;
@@ -342,43 +442,55 @@
 		rpm = 8191;
 	buf[0] = rpm >> 5;
 	buf[1] = rpm << 3;
-	rc = fan_write_reg(0x10 + (fan * 2), buf, 2);
+	rc = fan_write_reg(0x10 + (id * 2), buf, 2);
 	if (rc < 0)
 		return -EIO;
 	return 0;
 }
 
-static int get_rpm_fan(int fan, int programmed)
+static int get_rpm_fan(int fan_index, int programmed)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc, reg_base;
+	int rc, id, reg_base;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0xb, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0xd, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
 	reg_base = programmed ? 0x10 : 0x11;
-	rc = fan_read_reg(reg_base + (fan * 2), buf, 2);
+	rc = fan_read_reg(reg_base + (id * 2), buf, 2);
 	if (rc != 2)
 		return -EIO;
 
 	return (buf[0] << 5) | buf[1] >> 3;
 }
 
-static int set_pwm_fan(int fan, int pwm)
+static int set_pwm_fan(int fan_index, int pwm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (pwm < 10)
 		pwm = 10;
@@ -386,32 +498,38 @@
 		pwm = 100;
 	pwm = (pwm * 2559) / 1000;
 	buf[0] = pwm;
-	rc = fan_write_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_write_reg(0x30 + (id * 2), buf, 1);
 	if (rc < 0)
 		return rc;
 	return 0;
 }
 
-static int get_pwm_fan(int fan)
+static int get_pwm_fan(int fan_index)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0x2b, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0x2d, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
-	rc = fan_read_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_read_reg(0x30 + (id * 2), buf, 1);
 	if (rc != 1)
 		return -EIO;
 
@@ -513,80 +631,84 @@
 /*
  * CPUs fans control loop
  */
-static void do_monitor_cpu(struct cpu_pid_state *state)
+
+static int do_read_one_cpu_values(struct cpu_pid_state *state, s32 *temp, s32 *power)
 {
-	s32 temp, voltage, current_a, power, power_target;
-	s32 integral, derivative, proportional, adj_in_target, sval;
-	s64 integ_p, deriv_p, prop_p, sum; 
-	int i, intake, rc;
+	s32 ltemp, volts, amps;
+	int rc = 0;
 
-	DBG("cpu %d:\n", state->index);
+	/* Default (in case of error) */
+	*temp = state->cur_temp;
+	*power = state->cur_power;
 
 	/* Read current fan status */
 	if (state->index == 0)
-		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	else
-		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
-		printk(KERN_WARNING "Error %d reading CPU %d exhaust fan !\n",
-		       rc, state->index);
-		/* XXX What do we do now ? */
-	} else
+		/* XXX What do we do now ? Nothing for now, keep old value, but
+		 * return error upstream
+		 */
+		DBG("  cpu %d, fan reading error !\n", state->index);
+	} else {
 		state->rpm = rc;
-	DBG("  current rpm: %d\n", state->rpm);
+		DBG("  cpu %d, exhaust RPM: %d\n", state->index, state->rpm);
+	}
 
 	/* Get some sensor readings and scale it */
-	temp = read_smon_adc(state, 1);
-	if (temp == -1) {
+	ltemp = read_smon_adc(state, 1);
+	if (ltemp == -1) {
+		/* XXX What do we do now ? */
 		state->overtemp++;
-		return;
+		if (rc == 0)
+			rc = -EIO;
+		DBG("  cpu %d, temp reading error !\n", state->index);
+	} else {
+		/* Fixup temperature according to diode calibration
+		 */
+		DBG("  cpu %d, temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
+		    state->index,
+		    ltemp, state->mpu.mdiode, state->mpu.bdiode);
+		*temp = ((s32)ltemp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
+		state->last_temp = *temp;
+		DBG("  temp: %d.%03d\n", FIX32TOPRINT((*temp)));
 	}
-	voltage = read_smon_adc(state, 3);
-	current_a = read_smon_adc(state, 4);
 
-	/* Fixup temperature according to diode calibration
+	/*
+	 * Read voltage & current and calculate power
 	 */
-	DBG("  temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
-	    temp, state->mpu.mdiode, state->mpu.bdiode);
-	temp = ((s32)temp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
-	state->last_temp = temp;
-	DBG("  temp: %d.%03d\n", FIX32TOPRINT(temp));
+	volts = read_smon_adc(state, 3);
+	amps = read_smon_adc(state, 4);
 
-	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
-	 * full blown immediately and try to trigger a shutdown
-	 */
-	if (temp >= ((state->mpu.tmax + 8) << 16)) {
-		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
-		       " (%d) !\n",
-		       state->index, temp >> 16);
-		state->overtemp = CPU_MAX_OVERTEMP;
-	} else if (temp > (state->mpu.tmax << 16))
-		state->overtemp++;
-	else
-		state->overtemp = 0;
-	if (state->overtemp >= CPU_MAX_OVERTEMP)
-		critical_state = 1;
-	if (state->overtemp > 0) {
-		state->rpm = state->mpu.rmaxn_exhaust_fan;
-		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
-		goto do_set_fans;
-	}
-	
-	/* Scale other sensor values according to fixed scales
+	/* Scale voltage and current raw sensor values according to fixed scales
 	 * obtained in Darwin and calculate power from I and V
 	 */
-	state->voltage = voltage *= ADC_CPU_VOLTAGE_SCALE;
-	state->current_a = current_a *= ADC_CPU_CURRENT_SCALE;
-	power = (((u64)current_a) * ((u64)voltage)) >> 16;
+	volts *= ADC_CPU_VOLTAGE_SCALE;
+	amps *= ADC_CPU_CURRENT_SCALE;
+	*power = (((u64)volts) * ((u64)amps)) >> 16;
+	state->voltage = volts;
+	state->current_a = amps;
+	state->last_power = *power;
+
+	DBG("  cpu %d, current: %d.%03d, voltage: %d.%03d, power: %d.%03d W\n",
+	    state->index, FIX32TOPRINT(state->current_a),
+	    FIX32TOPRINT(state->voltage), FIX32TOPRINT(*power));
+
+	return 0;
+}
+
+static void do_cpu_pid(struct cpu_pid_state *state, s32 temp, s32 power)
+{
+	s32 power_target, integral, derivative, proportional, adj_in_target, sval;
+	s64 integ_p, deriv_p, prop_p, sum; 
+	int i;
 
 	/* Calculate power target value (could be done once for all)
 	 * and convert to a 16.16 fp number
 	 */
 	power_target = ((u32)(state->mpu.pmaxh - state->mpu.padjmax)) << 16;
-
-	DBG("  current: %d.%03d, voltage: %d.%03d\n",
-	    FIX32TOPRINT(current_a), FIX32TOPRINT(voltage));
-	DBG("  power: %d.%03d W, target: %d.%03d, error: %d.%03d\n", FIX32TOPRINT(power),
+	DBG("  power target: %d.%03d, error: %d.%03d\n",
 	    FIX32TOPRINT(power_target), FIX32TOPRINT(power_target - power));
 
 	/* Store temperature and power in history array */
@@ -626,7 +748,7 @@
 	 * input target is mpu.ttarget, input max is mpu.tmax
 	 */
 	integ_p = ((s64)state->mpu.pid_gr) * (s64)integral;
-	DBG("   integ_p: %d\n", (int)(deriv_p >> 36));
+	DBG("   integ_p: %d\n", (int)(integ_p >> 36));
 	sval = (state->mpu.tmax << 16) - ((integ_p >> 20) & 0xffffffff);
 	adj_in_target = (state->mpu.ttarget << 16);
 	if (adj_in_target > sval)
@@ -659,6 +781,127 @@
 		state->rpm = state->mpu.rminn_exhaust_fan;
 	if (state->rpm > state->mpu.rmaxn_exhaust_fan)
 		state->rpm = state->mpu.rmaxn_exhaust_fan;
+}
+
+static void do_monitor_cpu_combined(void)
+{
+	struct cpu_pid_state *state0 = &cpu_state[0];
+	struct cpu_pid_state *state1 = &cpu_state[1];
+	s32 temp0, power0, temp1, power1;
+	s32 temp_combi, power_combi;
+	int rc, intake, pump;
+
+	rc = do_read_one_cpu_values(state0, &temp0, &power0);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	state1->overtemp = 0;
+	rc = do_read_one_cpu_values(state1, &temp1, &power1);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	if (state1->overtemp)
+		state0->overtemp++;
+
+	temp_combi = max(temp0, temp1);
+	power_combi = max(power0, power1);
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp_combi >= ((state0->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! Temperature way above maximum (%d) !\n",
+		       temp_combi >> 16);
+		state0->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp_combi > (state0->mpu.tmax << 16))
+		state0->overtemp++;
+	else
+		state0->overtemp = 0;
+	if (state0->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state0->overtemp > 0) {
+		state0->rpm = state0->mpu.rmaxn_exhaust_fan;
+		state0->intake_rpm = intake = state0->mpu.rmaxn_intake_fan;
+		pump = CPU_PUMP_OUTPUT_MAX;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state0, temp_combi, power_combi);
+
+	/* Calculate intake fan speed */
+	intake = (state0->rpm * CPU_INTAKE_SCALE) >> 16;
+	if (intake < state0->mpu.rminn_intake_fan)
+		intake = state0->mpu.rminn_intake_fan;
+	if (intake > state0->mpu.rmaxn_intake_fan)
+		intake = state0->mpu.rmaxn_intake_fan;
+	state0->intake_rpm = intake;
+
+	/* Calculate pump speed */
+	pump = (state0->rpm * CPU_PUMP_OUTPUT_MAX) /
+		state0->mpu.rmaxn_exhaust_fan;
+	if (pump > CPU_PUMP_OUTPUT_MAX)
+		pump = CPU_PUMP_OUTPUT_MAX;
+	if (pump < CPU_PUMP_OUTPUT_MIN)
+		pump = CPU_PUMP_OUTPUT_MIN;
+	
+ do_set_fans:
+	/* We copy values from state 0 to state 1 for /sysfs */
+	state1->rpm = state0->rpm;
+	state1->intake_rpm = state0->intake_rpm;
+
+	DBG("** CPU %d RPM: %d Ex, %d In, Pump: %d, overtemp: %d\n",
+	    state1->index, (int)state1->rpm, intake, pump, state1->overtemp);
+
+	/* We should check for errors, shouldn't we ? But then, what
+	 * do we do once the error occurs ? For FCU notified fan
+	 * failures (-EFAULT) we probably want to notify userland
+	 * some way...
+	 */
+	set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+	set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+
+	if (fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUA_PUMP_RPM_INDEX, pump);
+	if (fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUB_PUMP_RPM_INDEX, pump);
+}
+
+static void do_monitor_cpu_split(struct cpu_pid_state *state)
+{
+	s32 temp, power;
+	int rc, intake;
+
+	/* Read current fan status */
+	rc = do_read_one_cpu_values(state, &temp, &power);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp >= ((state->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
+		       " (%d) !\n",
+		       state->index, temp >> 16);
+		state->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp > (state->mpu.tmax << 16))
+		state->overtemp++;
+	else
+		state->overtemp = 0;
+	if (state->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state->overtemp > 0) {
+		state->rpm = state->mpu.rmaxn_exhaust_fan;
+		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state, temp, power);
 
 	intake = (state->rpm * CPU_INTAKE_SCALE) >> 16;
 	if (intake < state->mpu.rminn_intake_fan)
@@ -677,11 +920,11 @@
 	 * some way...
 	 */
 	if (state->index == 0) {
-		set_rpm_fan(CPUA_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	} else {
-		set_rpm_fan(CPUB_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	}
 }
 
@@ -696,6 +939,7 @@
 	state->overtemp = 0;
 	state->adc_config = 0x00;
 
+
 	if (index == 0)
 		state->monitor = attach_i2c_chip(SUPPLY_MONITOR_ID, "CPU0_monitor");
 	else if (index == 1)
@@ -778,7 +1022,7 @@
 	DBG("backside:\n");
 
 	/* Check fan status */
-	rc = get_pwm_fan(BACKSIDE_FAN_PWM_ID);
+	rc = get_pwm_fan(BACKSIDE_FAN_PWM_INDEX);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading backside fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -790,12 +1034,12 @@
 	temp = i2c_smbus_read_byte_data(state->monitor, MAX6690_EXT_TEMP) << 16;
 	state->last_temp = temp;
 	DBG("  temp: %d.%03d, target: %d.%03d\n", FIX32TOPRINT(temp),
-	    FIX32TOPRINT(BACKSIDE_PID_INPUT_TARGET));
+	    FIX32TOPRINT(backside_params.input_target));
 
 	/* Store temperature and error in history array */
 	state->cur_sample = (state->cur_sample + 1) % BACKSIDE_PID_HISTORY_SIZE;
 	state->sample_history[state->cur_sample] = temp;
-	state->error_history[state->cur_sample] = temp - BACKSIDE_PID_INPUT_TARGET;
+	state->error_history[state->cur_sample] = temp - backside_params.input_target;
 	
 	/* If first loop, fill the history table */
 	if (state->first) {
@@ -804,7 +1048,7 @@
 				BACKSIDE_PID_HISTORY_SIZE;
 			state->sample_history[state->cur_sample] = temp;
 			state->error_history[state->cur_sample] =
-				temp - BACKSIDE_PID_INPUT_TARGET;
+				temp - backside_params.input_target;
 		}
 		state->first = 0;
 	}
@@ -816,7 +1060,7 @@
 		integral += state->error_history[i];
 	integral *= BACKSIDE_PID_INTERVAL;
 	DBG("  integral: %08x\n", integral);
-	integ_p = ((s64)BACKSIDE_PID_G_r) * (s64)integral;
+	integ_p = ((s64)backside_params.G_r) * (s64)integral;
 	DBG("   integ_p: %d\n", (int)(integ_p >> 36));
 	sum += integ_p;
 
@@ -825,12 +1069,12 @@
 		state->error_history[(state->cur_sample + BACKSIDE_PID_HISTORY_SIZE - 1)
 				    % BACKSIDE_PID_HISTORY_SIZE];
 	derivative /= BACKSIDE_PID_INTERVAL;
-	deriv_p = ((s64)BACKSIDE_PID_G_d) * (s64)derivative;
+	deriv_p = ((s64)backside_params.G_d) * (s64)derivative;
 	DBG("   deriv_p: %d\n", (int)(deriv_p >> 36));
 	sum += deriv_p;
 
 	/* Calculate the proportional term */
-	prop_p = ((s64)BACKSIDE_PID_G_p) * (s64)(state->error_history[state->cur_sample]);
+	prop_p = ((s64)backside_params.G_p) * (s64)(state->error_history[state->cur_sample]);
 	DBG("   prop_p: %d\n", (int)(prop_p >> 36));
 	sum += prop_p;
 
@@ -839,13 +1083,13 @@
 
 	DBG("   sum: %d\n", (int)sum);
 	state->pwm += (s32)sum;
-	if (state->pwm < BACKSIDE_PID_OUTPUT_MIN)
-		state->pwm = BACKSIDE_PID_OUTPUT_MIN;
-	if (state->pwm > BACKSIDE_PID_OUTPUT_MAX)
-		state->pwm = BACKSIDE_PID_OUTPUT_MAX;
+	if (state->pwm < backside_params.output_min)
+		state->pwm = backside_params.output_min;
+	if (state->pwm > backside_params.output_max)
+		state->pwm = backside_params.output_max;
 
 	DBG("** BACKSIDE PWM: %d\n", (int)state->pwm);
-	set_pwm_fan(BACKSIDE_FAN_PWM_ID, state->pwm);
+	set_pwm_fan(BACKSIDE_FAN_PWM_INDEX, state->pwm);
 }
 
 /*
@@ -853,6 +1097,35 @@
  */
 static int init_backside_state(struct backside_pid_state *state)
 {
+	struct device_node *u3;
+	int u3h = 1; /* conservative by default */
+
+	/*
+	 * There are different PID params for machines with U3 and machines
+	 * with U3H, pick the right ones now
+	 */
+	u3 = of_find_node_by_path("/u3@0,f8000000");
+	if (u3 != NULL) {
+		u32 *vers = (u32 *)get_property(u3, "device-rev", NULL);
+		if (vers)
+			if (((*vers) & 0x3f) < 0x34)
+				u3h = 0;
+		of_node_put(u3);
+	}
+
+	backside_params.G_p = BACKSIDE_PID_G_p;
+	backside_params.G_r = BACKSIDE_PID_G_r;
+	backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX;
+	if (u3h) {
+		backside_params.G_d = BACKSIDE_PID_U3H_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3H_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3H_OUTPUT_MIN;
+	} else {
+		backside_params.G_d = BACKSIDE_PID_U3_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3_OUTPUT_MIN;
+	}
+
 	state->ticks = 1;
 	state->first = 1;
 	state->pwm = 50;
@@ -898,7 +1171,7 @@
 	DBG("drives:\n");
 
 	/* Check fan status */
-	rc = get_rpm_fan(DRIVES_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+	rc = get_rpm_fan(DRIVES_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading drives fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -965,7 +1238,7 @@
 		state->rpm = DRIVES_PID_OUTPUT_MAX;
 
 	DBG("** DRIVES RPM: %d\n", (int)state->rpm);
-	set_rpm_fan(DRIVES_FAN_RPM_ID, state->rpm);
+	set_rpm_fan(DRIVES_FAN_RPM_INDEX, state->rpm);
 }
 
 /*
@@ -1032,7 +1305,7 @@
 	}
 
 	/* Set the PCI fan once for now */
-	set_pwm_fan(SLOTS_FAN_PWM_ID, SLOTS_FAN_DEFAULT_PWM);
+	set_pwm_fan(SLOTS_FAN_PWM_INDEX, SLOTS_FAN_DEFAULT_PWM);
 
 	/* Initialize ADCs */
 	initialize_adc(&cpu_state[0]);
@@ -1047,9 +1320,13 @@
 		start = jiffies;
 
 		down(&driver_lock);
-		do_monitor_cpu(&cpu_state[0]);
-		if (cpu_state[1].monitor != NULL)
-			do_monitor_cpu(&cpu_state[1]);
+		if (cpu_pid_type == CPU_PID_TYPE_COMBINED)
+			do_monitor_cpu_combined();
+		else {
+			do_monitor_cpu_split(&cpu_state[0]);
+			if (cpu_state[1].monitor != NULL)
+				do_monitor_cpu_split(&cpu_state[1]);
+		}
 		do_monitor_backside(&backside_state);
 		do_monitor_drives(&drives_state);
 		up(&driver_lock);
@@ -1113,6 +1390,19 @@
 
 	DBG("counted %d CPUs in the device-tree\n", cpu_count);
 
+	/* Decide the type of PID algorithm to use based on the presence of
+	 * the pumps, though that may not be the best way, that is good enough
+	 * for now
+	 */
+	if (machine_is_compatible("PowerMac7,3")
+	    && (cpu_count > 1)
+	    && fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID
+	    && fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID) {
+		printk(KERN_INFO "Liquid cooling pumps detected, using new algorithm !\n");
+		cpu_pid_type = CPU_PID_TYPE_COMBINED;
+	} else
+		cpu_pid_type = CPU_PID_TYPE_SPLIT;
+
 	/* Create control loops for everything. If any fail, everything
 	 * fails
 	 */
@@ -1257,12 +1547,91 @@
 	return 0;
 }
 
+static void fcu_lookup_fans(struct device_node *fcu_node)
+{
+	struct device_node *np = NULL;
+	int i;
+
+	/* The table is filled by default with values that are suitable
+	 * for the old machines without device-tree information. We scan
+	 * the device-tree and override those values with whatever is
+	 * there
+	 */
+
+	DBG("Looking up FCU controls in device-tree...\n");
+
+	while ((np = of_get_next_child(fcu_node, np)) != NULL) {
+		int type = -1;
+		char *loc;
+		u32 *reg;
+
+		DBG(" control: %s, type: %s\n", np->name, np->type);
+
+		/* Detect control type */
+		if (!strcmp(np->type, "fan-rpm-control") ||
+		    !strcmp(np->type, "fan-rpm"))
+			type = FCU_FAN_RPM;
+		if (!strcmp(np->type, "fan-pwm-control") ||
+		    !strcmp(np->type, "fan-pwm"))
+			type = FCU_FAN_PWM;
+		/* Only care about fans for now */
+		if (type == -1)
+			continue;
+
+		/* Lookup for a matching location */
+		loc = (char *)get_property(np, "location", NULL);
+		reg = (u32 *)get_property(np, "reg", NULL);
+		if (loc == NULL || reg == NULL)
+			continue;
+		DBG(" matching location: %s, reg: 0x%08x\n", loc, *reg);
+
+		for (i = 0; i < FCU_FAN_COUNT; i++) {
+			int fan_id;
+
+			if (strcmp(loc, fcu_fans[i].loc))
+				continue;
+			DBG(" location match, index: %d\n", i);
+			fcu_fans[i].id = FCU_FAN_ABSENT_ID;
+			if (type != fcu_fans[i].type) {
+				printk(KERN_WARNING "therm_pm72: Fan type mismatch "
+				       "in device-tree for %s\n", np->full_name);
+				break;
+			}
+			if (type == FCU_FAN_RPM)
+				fan_id = ((*reg) - 0x10) / 2;
+			else
+				fan_id = ((*reg) - 0x30) / 2;
+			if (fan_id > 7) {
+				printk(KERN_WARNING "therm_pm72: Can't parse "
+				       "fan ID in device-tree for %s\n", np->full_name);
+				break;
+			}
+			DBG(" fan id -> %d, type -> %d\n", fan_id, type);
+			fcu_fans[i].id = fan_id;
+		}
+	}
+
+	/* Now dump the array */
+	printk(KERN_INFO "Detected fan controls:\n");
+	for (i = 0; i < FCU_FAN_COUNT; i++) {
+		if (fcu_fans[i].id == FCU_FAN_ABSENT_ID)
+			continue;
+		printk(KERN_INFO "  %d: %s fan, id %d, location: %s\n", i,
+		       fcu_fans[i].type == FCU_FAN_RPM ? "RPM" : "PWM",
+		       fcu_fans[i].id, fcu_fans[i].loc);
+	}
+}
+
 static int fcu_of_probe(struct of_device* dev, const struct of_match *match)
 {
 	int rc;
 
 	state = state_detached;
 
+	/* Lookup the fans in the device tree */
+	fcu_lookup_fans(dev->node);
+
+	/* Add the driver */
 	rc = i2c_add_driver(&therm_pm72_driver);
 	if (rc < 0)
 		return rc;
@@ -1301,15 +1670,20 @@
 {
 	struct device_node *np;
 
-	if (!machine_is_compatible("PowerMac7,2"))
+	if (!machine_is_compatible("PowerMac7,2") &&
+	    !machine_is_compatible("PowerMac7,3"))
 	    	return -ENODEV;
 
 	printk(KERN_INFO "PowerMac G5 Thermal control driver %s\n", VERSION);
 
 	np = of_find_node_by_type(NULL, "fcu");
 	if (np == NULL) {
-		printk(KERN_ERR "Can't find FCU in device-tree !\n");
-		return -ENODEV;
+		/* Some machines have strangely broken device-tree */
+		np = of_find_node_by_path("/u3@0,f8000000/i2c@f8001000/fan@15e");
+		if (np == NULL) {
+			    printk(KERN_ERR "Can't find FCU in device-tree !\n");
+			    return -ENODEV;
+		}
 	}
 	of_dev = of_platform_device_create(np, "temperature");
 	if (of_dev == NULL) {
diff -urN linux-2.5/drivers/macintosh/therm_pm72.h linux-pogo/drivers/macintosh/therm_pm72.h
--- linux-2.5/drivers/macintosh/therm_pm72.h	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.h	2004-10-15 18:58:22.000000000 +1000
@@ -119,18 +119,33 @@
 #define ADC_CPU_CURRENT_SCALE	0x1f40	/* _AD4 */
 
 /*
- * PID factors for the U3/Backside fan control loop
+ * PID factors for the U3/Backside fan control loop. We have 2 sets
+ * of values here, one set for U3 and one set for U3H
  */
-#define BACKSIDE_FAN_PWM_ID		1
-#define BACKSIDE_PID_G_d		0x02800000
+#define BACKSIDE_FAN_PWM_DEFAULT_ID	1
+#define BACKSIDE_FAN_PWM_INDEX		0
+#define BACKSIDE_PID_U3_G_d		0x02800000
+#define BACKSIDE_PID_U3H_G_d		0x01400000
 #define BACKSIDE_PID_G_p		0x00500000
 #define BACKSIDE_PID_G_r		0x00000000
-#define BACKSIDE_PID_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3H_INPUT_TARGET	0x004b0000
 #define BACKSIDE_PID_INTERVAL		5
 #define BACKSIDE_PID_OUTPUT_MAX		100
-#define BACKSIDE_PID_OUTPUT_MIN		20
+#define BACKSIDE_PID_U3_OUTPUT_MIN	20
+#define BACKSIDE_PID_U3H_OUTPUT_MIN	30
 #define BACKSIDE_PID_HISTORY_SIZE	2
 
+struct basckside_pid_params
+{
+	u32			G_d;
+	u32			G_p;
+	u32			G_r;
+	u32			input_target;
+	u32			output_min;
+	u32			output_max;
+};
+
 struct backside_pid_state
 {
 	int			ticks;
@@ -146,7 +161,8 @@
 /*
  * PID factors for the Drive Bay fan control loop
  */
-#define DRIVES_FAN_RPM_ID      		2
+#define DRIVES_FAN_RPM_DEFAULT_ID	2
+#define DRIVES_FAN_RPM_INDEX		1
 #define DRIVES_PID_G_d			0x01e00000
 #define DRIVES_PID_G_p			0x00500000
 #define DRIVES_PID_G_r			0x00000000
@@ -168,7 +184,8 @@
 	int			first;
 };
 
-#define SLOTS_FAN_PWM_ID       		2
+#define SLOTS_FAN_PWM_DEFAULT_ID	2
+#define SLOTS_FAN_PWM_INDEX		2
 #define	SLOTS_FAN_DEFAULT_PWM		50 /* Do better here ! */
 
 /*
@@ -191,10 +208,15 @@
  * CPU B FAKE POWER	49	(I_V_inputs: 18, 19)
  */
 
-#define CPUA_INTAKE_FAN_RPM_ID		3
-#define CPUA_EXHAUST_FAN_RPM_ID		4
-#define CPUB_INTAKE_FAN_RPM_ID		5
-#define CPUB_EXHAUST_FAN_RPM_ID		6
+#define CPUA_INTAKE_FAN_RPM_DEFAULT_ID	3
+#define CPUA_EXHAUST_FAN_RPM_DEFAULT_ID	4
+#define CPUB_INTAKE_FAN_RPM_DEFAULT_ID	5
+#define CPUB_EXHAUST_FAN_RPM_DEFAULT_ID	6
+
+#define CPUA_INTAKE_FAN_RPM_INDEX	3
+#define CPUA_EXHAUST_FAN_RPM_INDEX	4
+#define CPUB_INTAKE_FAN_RPM_INDEX	5
+#define CPUB_EXHAUST_FAN_RPM_INDEX	6
 
 #define CPU_INTAKE_SCALE		0x0000f852
 #define CPU_TEMP_HISTORY_SIZE		2
@@ -202,6 +224,11 @@
 #define CPU_PID_INTERVAL		1
 #define CPU_MAX_OVERTEMP		30
 
+#define CPUA_PUMP_RPM_INDEX		7
+#define CPUB_PUMP_RPM_INDEX		8
+#define CPU_PUMP_OUTPUT_MAX		3700
+#define CPU_PUMP_OUTPUT_MIN		1000
+
 struct cpu_pid_state
 {
 	int			index;
@@ -219,6 +246,7 @@
 	s32			voltage;
 	s32			current_a;
 	s32			last_temp;
+	s32			last_power;
 	int			first;
 	u8			adc_config;
 };
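
For anyone following the register arithmetic in set_rpm_fan(), get_rpm_fan()
and fcu_lookup_fans() above, here is a small stand-alone sketch of the same
packing.  The helper names are made up for illustration and do not exist in
the driver; only the shifts, the 300..8191 clamp and the 0x10 register base
are taken from the patch.

```c
#include <stdint.h>

/* Pack an RPM value into the two FCU bytes, clamped as set_rpm_fan() does. */
static void fcu_encode_rpm(int rpm, uint8_t buf[2])
{
	if (rpm < 300)
		rpm = 300;
	if (rpm > 8191)
		rpm = 8191;
	buf[0] = (uint8_t)(rpm >> 5);	/* upper 8 bits of the 13-bit value */
	buf[1] = (uint8_t)(rpm << 3);	/* lower 5 bits, left-aligned */
}

/* Inverse of the above, same expression as the return in get_rpm_fan(). */
static int fcu_decode_rpm(const uint8_t buf[2])
{
	return (buf[0] << 5) | (buf[1] >> 3);
}

/* RPM fan "reg" property to fan id, as computed in fcu_lookup_fans(). */
static int fcu_rpm_reg_to_id(uint32_t reg)
{
	return (int)((reg - 0x10u) / 2);
}

/* Fan id back to the programmed-speed register used by set_rpm_fan(). */
static int fcu_rpm_id_to_reg(int id)
{
	return 0x10 + id * 2;
}
```

Since the value is at most 13 bits wide, the round trip through the two
bytes is exact for every RPM inside the clamped range.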


From hpa at zytor.com  Sat Oct 16 14:24:05 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 21:24:05 -0700
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097893432.6546.37.camel@gaston>
References: <1097831790.1131.111.camel@gaston>	
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>
Message-ID: <4170A265.6030402@zytor.com>

Benjamin Herrenschmidt wrote:
> Ok, here's a new patch that fixes a few issues, it's been
> tested on a non-liquid cooled system and appear to work ok.

I'm testing it out right now.  It is definitely suffering from some 
degree of oscillation when idling, and it seems to be one particular 
(set of) fan(s) that is having that problem.

Note that one cause of oscillation at low speed is that there is a 
minimum speed below which the fans will simply stop.  This may be what 
is happening here.

Some time later I'll try to figure out which numbers to collect and try 
to generate a graph over time.

It's definitely passing the stress test, though; make -j4 on the whole 
kernel (with the .config posted earlier) took 4:27.19.  The other stress 
test -- which used to kill the old thermal driver dead in a matter of 
seconds -- is to start a bunch of "cat /dev/zero > /dev/null" processes; 
that is happily running, and nice and quiet.

The oscillation is obnoxious, though.

	-hpa


From nathanl at austin.ibm.com  Sat Oct 16 14:37:04 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Fri, 15 Oct 2004 23:37:04 -0500
Subject: cpu hotplug broken in 2.6.9-rc4
Message-ID: <1097901423.3226.42.camel@biclops>

(Urgh, sent this to the wrong list initially, sorry.  Second try...)

Hi-

Seems that cpu hotplug got broken when benh's monster cleanup patch went
into bk (in 2.6.9-rc2-bk10).  System boots fine, but if I take down a
cpu and then try to bring it back up, I get:

# echo 1 > /sys/devices/system/cpu/cpu1/online 
Bad kernel stack pointer 7d23080 at 6373c0
cpu 0x1: Vector: c000000002ff4d80  at [c0000000077c7d40]
    pc: 00000000006373c0
    lr: 00000000006373c0
    sp: 7d23080
   msr: 1002
  current = 0xc00000000779a8a0
  paca    = 0xc000000000493d00
    pid   = 0, comm = swapper
enter ? for help
1:mon> t
SP (7d23080) is in userspace
1:mon> r
R00 = 0000000000000000   R16 = 0000000000000000
R01 = 0000000007d23080   R17 = 0000000000000000
R02 = 0000000007ad4b68   R18 = 0000000000000000
R03 = 0000000000000001   R19 = 0000000000000000
R04 = 00000000006373c0   R20 = c000000000493880
R05 = 0000000000000001   R21 = 00016bb01a585f1a
R06 = 0000000000000020   R22 = c000000000493d00
R07 = fffffffd00000000   R23 = c000000002565008
R08 = 00000000000d6000   R24 = 0000000000000001
R09 = c00000000737ef80   R25 = 0000000000000000
R10 = c00000000737ed40   R26 = 0000000000000008
R11 = c00000000737eb40   R27 = 0000000000000010
R12 = 0000000000000001   R28 = c0000000077a4000
R13 = c000000000493d00   R29 = 0000000007889698
R14 = 0000000000000000   R30 = c0000000077a4010
R15 = 0000000007ab0420   R31 = 0000000007d23080
pc  = 00000000006373c0
lr  = 00000000006373c0
msr = 0000000000001002   cr  = 22000024
ctr = 800000000010dd60   xer = 0000000000000001   trap = c000000002ff4d80

For what it's worth, the least significant half of pc (00000000006373c0)
matches the address of pseries_secondary_smp_init in the System.map:

c0000000006373c0 D pseries_secondary_smp_init

If I revert the monster patch from the 2.6.9-rc2-bk10 snapshot things
work fine.  I haven't been able to figure out yet how the stack pointer
gets a bad value.

Nathan


From benh at kernel.crashing.org  Sat Oct 16 14:50:07 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 14:50:07 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <4170A265.6030402@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>  <4170A265.6030402@zytor.com>
Message-ID: <1097902206.8965.2.camel@gaston>

On Sat, 2004-10-16 at 14:24, H. Peter Anvin wrote:
> Benjamin Herrenschmidt wrote:
> > Ok, here's a new patch that fixes a few issues, it's been
> > tested on a non-liquid cooled system and appear to work ok.
> 
> I'm testing it out right now.  It is definitely suffering from some 
> degree of oscillation when idling, and it seems to be one particular 
> (set of) fan(s) that is having that problem.

Which ones ? The CPU fans ?

> Note that one cause of oscillation at low speed is that there is a 
> minimum speed below which the fans will simply stop.  This may be what 
> is happening here.

Do the fans actually stop ? Yes we "floor" the fan speeds and indeed,
Apple's algorithm is known to slowly oscillate, on my box it's between
300 and 1000 RPM for the CPU fans over a period of a minute or 2.

Such an oscillation is expected. Something worse would mean we get
something wrong. Did you compare against OS X ?

> Some time later I'll try to figure out which numbers to collect and try 
> to generate a graph over time.
> 
> It's definitely passing the stress test, though; make -j4 on the whole 
> kernel (with the .config posted earlier) took 4:27.19.  The other stress 
> test -- which used to kill the old thermal driver dead in a matter of 
> seconds -- is to start a bunch of "cat /dev/zero > /dev/null" processes; 
> that is happily running, and nice and quiet.
> 
> The oscillation is obnoxious, though.

Hehe...

Ben.


From hpa at zytor.com  Sat Oct 16 14:58:07 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 21:58:07 -0700
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097902206.8965.2.camel@gaston>
References: <1097831790.1131.111.camel@gaston>	
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>	
	<1097893432.6546.37.camel@gaston> <4170A265.6030402@zytor.com>
	<1097902206.8965.2.camel@gaston>
Message-ID: <4170AA5F.6060107@zytor.com>

Benjamin Herrenschmidt wrote:
> On Sat, 2004-10-16 at 14:24, H. Peter Anvin wrote:
> 
>>Benjamin Herrenschmidt wrote:
>>
>>>Ok, here's a new patch that fixes a few issues, it's been
>>>tested on a non-liquid cooled system and appear to work ok.
>>
>>I'm testing it out right now.  It is definitely suffering from some 
>>degree of oscillation when idling, and it seems to be one particular 
>>(set of) fan(s) that is having that problem.
> 
> Which ones ? The CPU fans ?

I don't know how to tell; it's a significant sound.  Let me see if I can 
figure it out.

>>Note that one cause of oscillation at low speed is that there is a 
>>minimum speed below which the fans will simply stop.  This may be what 
>>is happening here.
> 
> Do the fans actually stop ? Yes we "floor" the fan speeds and indeed,
> Apple's algorithm is known to slowly oscillate, on my box it's between
> 300 and 1000 RPM for the CPU fans over a period of a minute or 2.
> 
> Such an oscillation is expected. Something worse would mean we get
> something wrong. Did you compare against OS X ?

OS X doesn't sound like this.  The oscillation period for what it's 
worth is 10 seconds.

	-hpa


From benh at kernel.crashing.org  Sat Oct 16 14:55:45 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 14:55:45 +1000
Subject: [PATCH] ppc64: Fix a typo in the code that reserves memory at boot
Message-ID: <1097902544.8963.5.camel@gaston>

Hi !

The code that marks memory regions as "reserved" early during boot
has a typo (doing incorrect rounding of the top address) which can
cause some areas to not be properly reserved. That may explain some
cases of initrd corruption reported recently.

Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>

===== arch/ppc64/kernel/prom_init.c 1.2 vs edited =====
--- 1.2/arch/ppc64/kernel/prom_init.c	2004-09-27 19:12:49 +10:00
+++ edited/arch/ppc64/kernel/prom_init.c	2004-10-16 14:53:28 +10:00
@@ -595,7 +595,7 @@
 	 * dumb and just copy this entire array to the boot params
 	 */
 	base = _ALIGN_DOWN(base, PAGE_SIZE);
-	top = _ALIGN_DOWN(top, PAGE_SIZE);
+	top = _ALIGN_UP(top, PAGE_SIZE);
 	size = top - base;
 
 	if (cnt >= (MEM_RESERVE_MAP_SIZE - 1))


From benh at kernel.crashing.org  Sat Oct 16 14:58:14 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 14:58:14 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <4170AA5F.6060107@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>  <4170A265.6030402@zytor.com>
	<1097902206.8965.2.camel@gaston>  <4170AA5F.6060107@zytor.com>
Message-ID: <1097902694.8965.8.camel@gaston>

On Sat, 2004-10-16 at 14:58, H. Peter Anvin wrote:

> I don't know how to tell; it's a significant sound.  Let me see if I can 
> figure it out.

If the dual 2.5Ghz is like the old dual 2Ghz, you can run perfectly well
with the case open, as long as you keep the plexiglass in place, which
drives the air flow, and you'll be able to see the CPU and slots fans.

You can also read the speed values from /sys/devices/temperature

> OS X doesn't sound like this.  The oscillation period for what it's 
> worth is 10 seconds.

Ok, there must be something wrong then...

Ben.


From hpa at zytor.com  Sat Oct 16 15:03:26 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 22:03:26 -0700
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097902694.8965.8.camel@gaston>
References: <1097831790.1131.111.camel@gaston>	
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>	
	<1097893432.6546.37.camel@gaston> <4170A265.6030402@zytor.com>	
	<1097902206.8965.2.camel@gaston> <4170AA5F.6060107@zytor.com>
	<1097902694.8965.8.camel@gaston>
Message-ID: <4170AB9E.5010006@zytor.com>

Benjamin Herrenschmidt wrote:
> On Sat, 2004-10-16 at 14:58, H. Peter Anvin wrote:
> 
> 
>>I don't know how to tell; it's a significant sound.  Let me see if I can 
>>figure it out.
> 
> 
> If the dual 2.5Ghz is like the old dual 2Ghz, you can run perfectly well
> with the case open, as long as you keep the plexiglass in place, which
> drives the air flow, and you'll be able to see the CPU and slots fans.
> 
> You can also read the speed values from /sys/devices/temperature
> 

That's what I'm about to do.  Hang on.

	-hpa


From benh at kernel.crashing.org  Sat Oct 16 15:02:53 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 15:02:53 +1000
Subject: [PATCH] ppc64: Fix a typo in the code that reserves memory at
	boot
In-Reply-To: <1097902544.8963.5.camel@gaston>
References: <1097902544.8963.5.camel@gaston>
Message-ID: <1097902973.9026.10.camel@gaston>

On Sat, 2004-10-16 at 14:55, Benjamin Herrenschmidt wrote:
> Hi !
> 
> The code that marks memory regions as "reserved" early during boot
> has a typo (doing incorrect rounding of the top address) which can
> cause some areas to not be properly reserved. That may explain some
> cases of initrd corruption reported recently.
> 
> Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>

Ok, ignore it and take Anton's one instead.

Ben.


From hpa at zytor.com  Sat Oct 16 15:32:07 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 22:32:07 -0700
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097902694.8965.8.camel@gaston>
References: <1097831790.1131.111.camel@gaston>	
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>	
	<1097893432.6546.37.camel@gaston> <4170A265.6030402@zytor.com>	
	<1097902206.8965.2.camel@gaston> <4170AA5F.6060107@zytor.com>
	<1097902694.8965.8.camel@gaston>
Message-ID: <4170B257.1010602@zytor.com>

Benjamin Herrenschmidt wrote:
> On Sat, 2004-10-16 at 14:58, H. Peter Anvin wrote:
> 
> 
>>I don't know how to tell; it's a significant sound.  Let me see if I can 
>>figure it out.
> 
> 
> If the dual 2.5Ghz is like the old dual 2Ghz, you can run perfectly well
> with the case open, as long as you keep the plexiglass in place, which
> drives the air flow, and you'll be able to see the CPU and slots fans.
> 
> You can also read the speed values from /sys/devices/temperature
> 
> 
>>OS X doesn't sound like this.  The oscillation period for what it's 
>>worth is 10 seconds.
> 
> 
> Ok, there must be something wrong then...
> 

It's the backside fan that oscillates; backside_fan_pwm varies between 
30 and 100 in what is pretty much a squarewave.  See attached graph (and 
note how the other fans vary with workload.)

I probably need to write a "power virus" program for the G5 to really 
test out the high end (a power virus is a program which keeps the chip 
running as hard as it can; generally keep all pipelines stuffed.)

	-hpa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: temps.pdf
Type: application/pdf
Size: 11823 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041015/4f39c9d7/attachment.pdf 

From benh at kernel.crashing.org  Sat Oct 16 15:33:04 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 15:33:04 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <4170B257.1010602@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>  <4170A265.6030402@zytor.com>
	<1097902206.8965.2.camel@gaston>  <4170AA5F.6060107@zytor.com>
	<1097902694.8965.8.camel@gaston>  <4170B257.1010602@zytor.com>
Message-ID: <1097904783.8961.23.camel@gaston>

On Sat, 2004-10-16 at 15:32, H. Peter Anvin wrote:

> It's the backside fan that oscillates; backside_fan_pwm varies between 
> 30 and 100 in what is pretty much a squarewave.  See attached graph (and 
> note how the other fans vary with workload.)
> 
> I probably need to write a "power virus" program for the G5 to really 
> test out the high end (a power virus is a program which keeps the chip 
> running as hard as it can; generally keep all pipelines stuffed.)

think about also banging FPU and Altivec units then :)

Since its low oscillation point is 30, I suppose it properly detects
U3H (can you verify that in the code, adding a printk for example in
init_backside_state()).

I'll double check the values used for the PID in Darwin

Ben


From benh at kernel.crashing.org  Sat Oct 16 15:43:19 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 16 Oct 2004 15:43:19 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <4170B257.1010602@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>  <4170A265.6030402@zytor.com>
	<1097902206.8965.2.camel@gaston>  <4170AA5F.6060107@zytor.com>
	<1097902694.8965.8.camel@gaston>  <4170B257.1010602@zytor.com>
Message-ID: <1097905399.8963.26.camel@gaston>

On Sat, 2004-10-16 at 15:32, H. Peter Anvin wrote:

> It's the backside fan that oscillates; backside_fan_pwm varies between 
> 30 and 100 in what is pretty much a squarewave.  See attached graph (and 
> note how the other fans vary with workload.)
> 
> I probably need to write a "power virus" program for the G5 to really 
> test out the high end (a power virus is a program which keeps the chip 
> running as hard as it can; generally keep all pipelines stuffed.)

Strange... The values used seem to be identical to OS X (a 75° target
which is high actually, and a different G_d value than old U3). I would
need to see the debug output and compare with the OS X driver built with
debug output as well (don't ask me to fully understand the math of the
PID algorithm)

Ben.

 
From hpa at zytor.com  Sat Oct 16 16:01:08 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 23:01:08 -0700
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097904783.8961.23.camel@gaston>
References: <1097831790.1131.111.camel@gaston>	
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>	
	<1097893432.6546.37.camel@gaston> <4170A265.6030402@zytor.com>	
	<1097902206.8965.2.camel@gaston> <4170AA5F.6060107@zytor.com>	
	<1097902694.8965.8.camel@gaston> <4170B257.1010602@zytor.com>
	<1097904783.8961.23.camel@gaston>
Message-ID: <4170B924.3040104@zytor.com>

Benjamin Herrenschmidt wrote:
> On Sat, 2004-10-16 at 15:32, H. Peter Anvin wrote:
> 
> 
>>It's the backside fan that oscillates; backside_fan_pwm varies between 
>>30 and 100 in what is pretty much a squarewave.  See attached graph (and 
>>note how the other fans vary with workload.)
>>
>>I probably need to write a "power virus" program for the G5 to really 
>>test out the high end (a power virus is a program which keeps the chip 
>>running as hard as it can; generally keep all pipelines stuffed.)
> 
> think about also banging FPU and Altivec units then :)
> 

Those would be included in "all pipelines."  I need to learn more about 
the specifics of the G5 -- and general PowerPC stuff -- before I can 
write such a program, though.

> Since its low oscillation point is 30, I suppose it properly detects
> U3H (can you verify that in the code, adding a printk for example in
> init_backside_state()).
> 
> I'll double check the values used for the PID in Darwin

I'll do that and compile with debugging enabled, and send you a log from 
hell.

	-hpa


From nathanl at austin.ibm.com  Sat Oct 16 16:14:17 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 16 Oct 2004 01:14:17 -0500
Subject: [PATCH] ppc64:  fix smp_startup_cpu for cpu hotplug
In-Reply-To: <1097901423.3226.42.camel@biclops>
References: <1097901423.3226.42.camel@biclops>
Message-ID: <1097907257.3226.47.camel@biclops>

This change is needed in order to allow cpus to be onlined after
boot.  This used to work but the declaration of
pseries_secondary_smp_init in this file was changed in Ben's big
cleanup patch a while back, so the cpu would start at a bad address.

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>


 smp.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletion(-)

Index: 2.6.9-rc4/arch/ppc64/kernel/smp.c
===================================================================
--- 2.6.9-rc4.orig/arch/ppc64/kernel/smp.c	2004-10-16 00:38:57.404529136 -0500
+++ 2.6.9-rc4/arch/ppc64/kernel/smp.c	2004-10-16 00:56:13.266054248 -0500
@@ -390,7 +390,8 @@
 static inline int __devinit smp_startup_cpu(unsigned int lcpu)
 {
 	int status;
-	unsigned long start_here = __pa(pseries_secondary_smp_init);
+	unsigned long start_here = __pa((u32)*((unsigned long *)
+					       pseries_secondary_smp_init));
 	unsigned int pcpu;
 
 	/* At boot time the cpus are already spinning in hold


From schwab at suse.de  Sun Oct 17 06:05:22 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sat, 16 Oct 2004 22:05:22 +0200
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097893432.6546.37.camel@gaston> (Benjamin Herrenschmidt's
	message of "Sat, 16 Oct 2004 12:23:53 +1000")
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>
Message-ID: <jept3iv37h.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> Ok, here's a new patch that fixes a few issues, it's been
> tested on a non-liquid cooled system and appear to work ok.

That doesn't work very well for me.  The fans are constantly spinning at a
rather high rate independent of how loaded the system is.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From paulus at samba.org  Sun Oct 17 10:53:21 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sun, 17 Oct 2004 10:53:21 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <4170B257.1010602@zytor.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston> <4170A265.6030402@zytor.com>
	<1097902206.8965.2.camel@gaston> <4170AA5F.6060107@zytor.com>
	<1097902694.8965.8.camel@gaston> <4170B257.1010602@zytor.com>
Message-ID: <16753.49793.159513.618588@cargo.ozlabs.ibm.com>

H. Peter Anvin writes:

> It's the backside fan that oscillates; backside_fan_pwm varies between 
> 30 and 100 in what is pretty much a squarewave.  See attached graph (and 
> note how the other fans vary with workload.)

The sharp rises look like the code thinks it gets into an
over-temperature situation and turns the fans on full blast.  It could
be worth putting some printks in the overtemp code.

> I probably need to write a "power virus" program for the G5 to really 
> test out the high end (a power virus is a program which keeps the chip 
> running as hard as it can; generally keep all pipelines stuffed.)

Hmmm, I should see if I can dig such a thing out of somewhere in IBM.

Paul.


From hpa at zytor.com  Sun Oct 17 10:58:22 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Sat, 16 Oct 2004 17:58:22 -0700
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <16753.49793.159513.618588@cargo.ozlabs.ibm.com>
References: <1097831790.1131.111.camel@gaston>	<1097831981.1131.113.camel@gaston>	<1097832049.1149.115.camel@gaston>	<1097893432.6546.37.camel@gaston>	<4170A265.6030402@zytor.com>	<1097902206.8965.2.camel@gaston>	<4170AA5F.6060107@zytor.com>	<1097902694.8965.8.camel@gaston>	<4170B257.1010602@zytor.com>
	<16753.49793.159513.618588@cargo.ozlabs.ibm.com>
Message-ID: <4171C3AE.3010302@zytor.com>

Paul Mackerras wrote:
> H. Peter Anvin writes:
> 
> 
>>It's the backside fan that oscillates; backside_fan_pwm varies between 
>>30 and 100 in what is pretty much a squarewave.  See attached graph (and 
>>note how the other fans vary with workload.)
> 
> The sharp rises look like the code thinks it gets into an
> over-temperature situation and turns the fans on full blast.  It could
> be worth putting some printks in the overtemp code.
> 

Changing the unsigned variables to signed per Ben's suggestion seems to 
have solved the problem.

> 
>>I probably need to write a "power virus" program for the G5 to really 
>>test out the high end (a power virus is a program which keeps the chip 
>>running as hard as it can; generally keep all pipelines stuffed.)
> 
> Hmmm, I should see if I can dig such a thing out of somewhere in IBM.
> 

That would be good; otherwise they're not too hard to write given a 
microarchitectural description.

	-hpa


From benh at kernel.crashing.org  Sun Oct 17 10:58:04 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 17 Oct 2004 10:58:04 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <jept3iv37h.fsf@sykes.suse.de>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>  <jept3iv37h.fsf@sykes.suse.de>
Message-ID: <1097974684.8965.59.camel@gaston>

On Sun, 2004-10-17 at 06:05, Andreas Schwab wrote:
> Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:
> 
> > Ok, here's a new patch that fixes a few issues, it's been
> > tested on a non-liquid cooled system and appear to work ok.
> 
> That doesn't work very well for me.  The fans are constantly spinning at a
> rather high rate independent of how loaded the system is.

Is it all fans or just the backside fan getting crazy ? This latter bug
is fixed by version #4 I'll post in a minute...

Ben.


From benh at kernel.crashing.org  Sun Oct 17 11:01:03 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 17 Oct 2004 11:01:03 +1000
Subject: Fan control for PowerMac7_3 (#4)
In-Reply-To: <1097893432.6546.37.camel@gaston>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston>
Message-ID: <1097974861.8965.62.camel@gaston>

This version fixes a bug with the backside fan doing crazy things,
it appears to work properly on the dual 2.5Ghz now. Unless I get a
negative report, I intend to submit it to Linus in a couple of days.

diff -urN linux-2.5/drivers/macintosh/therm_pm72.c linux-pogo/drivers/macintosh/therm_pm72.c
--- linux-2.5/drivers/macintosh/therm_pm72.c	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.c	2004-10-16 18:49:57.000000000 +1000
@@ -46,6 +46,9 @@
  *          overtemp conditions so userland can take some policy
  *          decisions, like slewing down CPUs
  *	  - Deal with fan and i2c failures in a better way
+ *	  - Maybe do a generic PID based on params used for
+ *	    U3 and Drives ?
+ *        - Add RackMac3,1 support (XServe g5)
  *
  * History:
  *
@@ -73,6 +76,15 @@
  *        values in the configuration register
  *	- Switch back to use of target fan speed for PID, thus lowering
  *        pressure on i2c
+ *
+ *  Oct. 16, 2004 : 1.1b3 (beta)
+ *	- Add device-tree lookup for fan IDs, should detect liquid cooling
+ *        pumps when present
+ *	- Enable driver for PowerMac7,3 machines
+ *	- Split the U3/Backside cooling on U3 & U3H versions as Darwin does
+ *	- Add new CPU cooling algorithm for machines with liquid cooling
+ *	- Workaround for some PowerMac7,3 with empty "fan" node in the devtree
+ *	- Fix a signed/unsigned compare issue in some PID loops
  */
 
 #include <linux/config.h>
@@ -101,7 +113,7 @@
 
 #include "therm_pm72.h"
 
-#define VERSION "0.9"
+#define VERSION "1.1b3"
 
 #undef DEBUG
 
@@ -121,16 +133,100 @@
 static struct i2c_adapter *		u3_1;
 static struct i2c_client *		fcu;
 static struct cpu_pid_state		cpu_state[2];
+static struct basckside_pid_params	backside_params;
 static struct backside_pid_state	backside_state;
 static struct drives_pid_state		drives_state;
 static int				state;
 static int				cpu_count;
+static int				cpu_pid_type;
 static pid_t				ctrl_task;
 static struct completion		ctrl_complete;
 static int				critical_state;
 static DECLARE_MUTEX(driver_lock);
 
 /*
+ * We have 2 types of CPU PID control. One is "split" old style control
+ * for intake & exhaust fans, the other is "combined" control for both
+ * CPUs that also deals with the pumps when present. To be "compatible"
+ * with OS X at this point, we only use "COMBINED" on the machines that
+ * are identified as having the pumps (though that identification is at
+ * least dodgy). Ultimately, we could probably switch completely to this
+ * algorithm provided we hack it to deal with the UP case
+ */
+#define CPU_PID_TYPE_SPLIT	0
+#define CPU_PID_TYPE_COMBINED	1
+
+/*
+ * This table describes all fans in the FCU. The "id" and "type" values
+ * are defaults valid for all earlier machines. Newer machines will
+ * eventually override the table content based on the device-tree
+ */
+struct fcu_fan_table
+{
+	char*	loc;	/* location code */
+	int	type;	/* 0 = rpm, 1 = pwm, 2 = pump */
+	int	id;	/* id or -1 */
+};
+
+#define FCU_FAN_RPM		0
+#define FCU_FAN_PWM		1
+
+#define FCU_FAN_ABSENT_ID	-1
+
+#define FCU_FAN_COUNT		ARRAY_SIZE(fcu_fans)
+
+struct fcu_fan_table	fcu_fans[] = {
+	[BACKSIDE_FAN_PWM_INDEX] = {
+		.loc	= "BACKSIDE",
+		.type	= FCU_FAN_PWM,
+		.id	= BACKSIDE_FAN_PWM_DEFAULT_ID,
+	},
+	[DRIVES_FAN_RPM_INDEX] = {
+		.loc	= "DRIVE BAY",
+		.type	= FCU_FAN_RPM,
+		.id	= DRIVES_FAN_RPM_DEFAULT_ID,
+	},
+	[SLOTS_FAN_PWM_INDEX] = {
+		.loc	= "SLOT",
+		.type	= FCU_FAN_PWM,
+		.id	= SLOTS_FAN_PWM_DEFAULT_ID,
+	},
+	[CPUA_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU A INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUA_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU A EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUA_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_INTAKE_FAN_RPM_INDEX] = {
+		.loc	= "CPU B INTAKE",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_INTAKE_FAN_RPM_DEFAULT_ID,
+	},
+	[CPUB_EXHAUST_FAN_RPM_INDEX] = {
+		.loc	= "CPU B EXHAUST",
+		.type	= FCU_FAN_RPM,
+		.id	= CPUB_EXHAUST_FAN_RPM_DEFAULT_ID,
+	},
+	/* pumps aren't present by default, have to be looked up in the
+	 * device-tree
+	 */
+	[CPUA_PUMP_RPM_INDEX] = {
+		.loc	= "CPU A PUMP",
+		.type	= FCU_FAN_RPM,		
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+	[CPUB_PUMP_RPM_INDEX] = {
+		.loc	= "CPU B PUMP",
+		.type	= FCU_FAN_RPM,
+		.id	= FCU_FAN_ABSENT_ID,
+	},
+};
+
+/*
  * i2c_driver structure to attach to the host i2c controller
  */
 
@@ -331,10 +427,16 @@
 	return 0;
 }
 
-static int set_rpm_fan(int fan, int rpm)
+static int set_rpm_fan(int fan_index, int rpm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (rpm < 300)
 		rpm = 300;
@@ -342,43 +444,55 @@
 		rpm = 8191;
 	buf[0] = rpm >> 5;
 	buf[1] = rpm << 3;
-	rc = fan_write_reg(0x10 + (fan * 2), buf, 2);
+	rc = fan_write_reg(0x10 + (id * 2), buf, 2);
 	if (rc < 0)
 		return -EIO;
 	return 0;
 }
 
-static int get_rpm_fan(int fan, int programmed)
+static int get_rpm_fan(int fan_index, int programmed)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc, reg_base;
+	int rc, id, reg_base;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_RPM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0xb, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0xd, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
 	reg_base = programmed ? 0x10 : 0x11;
-	rc = fan_read_reg(reg_base + (fan * 2), buf, 2);
+	rc = fan_read_reg(reg_base + (id * 2), buf, 2);
 	if (rc != 2)
 		return -EIO;
 
 	return (buf[0] << 5) | buf[1] >> 3;
 }
 
-static int set_pwm_fan(int fan, int pwm)
+static int set_pwm_fan(int fan_index, int pwm)
 {
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	if (pwm < 10)
 		pwm = 10;
@@ -386,32 +500,38 @@
 		pwm = 100;
 	pwm = (pwm * 2559) / 1000;
 	buf[0] = pwm;
-	rc = fan_write_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_write_reg(0x30 + (id * 2), buf, 1);
 	if (rc < 0)
 		return rc;
 	return 0;
 }
 
-static int get_pwm_fan(int fan)
+static int get_pwm_fan(int fan_index)
 {
 	unsigned char failure;
 	unsigned char active;
 	unsigned char buf[2];
-	int rc;
+	int rc, id;
+
+	if (fcu_fans[fan_index].type != FCU_FAN_PWM)
+		return -EINVAL;
+	id = fcu_fans[fan_index].id; 
+	if (id == FCU_FAN_ABSENT_ID)
+		return -EINVAL;
 
 	rc = fan_read_reg(0x2b, &failure, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((failure & (1 << fan)) != 0)
+	if ((failure & (1 << id)) != 0)
 		return -EFAULT;
 	rc = fan_read_reg(0x2d, &active, 1);
 	if (rc != 1)
 		return -EIO;
-	if ((active & (1 << fan)) == 0)
+	if ((active & (1 << id)) == 0)
 		return -ENXIO;
 
 	/* Programmed value or real current speed */
-	rc = fan_read_reg(0x30 + (fan * 2), buf, 1);
+	rc = fan_read_reg(0x30 + (id * 2), buf, 1);
 	if (rc != 1)
 		return -EIO;
 
@@ -513,80 +633,84 @@
 /*
  * CPUs fans control loop
  */
-static void do_monitor_cpu(struct cpu_pid_state *state)
+
+static int do_read_one_cpu_values(struct cpu_pid_state *state, s32 *temp, s32 *power)
 {
-	s32 temp, voltage, current_a, power, power_target;
-	s32 integral, derivative, proportional, adj_in_target, sval;
-	s64 integ_p, deriv_p, prop_p, sum; 
-	int i, intake, rc;
+	s32 ltemp, volts, amps;
+	int rc = 0;
 
-	DBG("cpu %d:\n", state->index);
+	/* Default (in case of error) */
+	*temp = state->cur_temp;
+	*power = state->cur_power;
 
 	/* Read current fan status */
 	if (state->index == 0)
-		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	else
-		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+		rc = get_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
-		printk(KERN_WARNING "Error %d reading CPU %d exhaust fan !\n",
-		       rc, state->index);
-		/* XXX What do we do now ? */
-	} else
+		/* XXX What do we do now ? Nothing for now, keep old value, but
+		 * return error upstream
+		 */
+		DBG("  cpu %d, fan reading error !\n", state->index);
+	} else {
 		state->rpm = rc;
-	DBG("  current rpm: %d\n", state->rpm);
+		DBG("  cpu %d, exhaust RPM: %d\n", state->index, state->rpm);
+	}
 
 	/* Get some sensor readings and scale it */
-	temp = read_smon_adc(state, 1);
-	if (temp == -1) {
+	ltemp = read_smon_adc(state, 1);
+	if (ltemp == -1) {
+		/* XXX What do we do now ? */
 		state->overtemp++;
-		return;
+		if (rc == 0)
+			rc = -EIO;
+		DBG("  cpu %d, temp reading error !\n", state->index);
+	} else {
+		/* Fixup temperature according to diode calibration
+		 */
+		DBG("  cpu %d, temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
+		    state->index,
+		    ltemp, state->mpu.mdiode, state->mpu.bdiode);
+		*temp = ((s32)ltemp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
+		state->last_temp = *temp;
+		DBG("  temp: %d.%03d\n", FIX32TOPRINT((*temp)));
 	}
-	voltage = read_smon_adc(state, 3);
-	current_a = read_smon_adc(state, 4);
 
-	/* Fixup temperature according to diode calibration
+	/*
+	 * Read voltage & current and calculate power
 	 */
-	DBG("  temp raw: %04x, m_diode: %04x, b_diode: %04x\n",
-	    temp, state->mpu.mdiode, state->mpu.bdiode);
-	temp = ((s32)temp * (s32)state->mpu.mdiode + ((s32)state->mpu.bdiode << 12)) >> 2;
-	state->last_temp = temp;
-	DBG("  temp: %d.%03d\n", FIX32TOPRINT(temp));
+	volts = read_smon_adc(state, 3);
+	amps = read_smon_adc(state, 4);
 
-	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
-	 * full blown immediately and try to trigger a shutdown
-	 */
-	if (temp >= ((state->mpu.tmax + 8) << 16)) {
-		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
-		       " (%d) !\n",
-		       state->index, temp >> 16);
-		state->overtemp = CPU_MAX_OVERTEMP;
-	} else if (temp > (state->mpu.tmax << 16))
-		state->overtemp++;
-	else
-		state->overtemp = 0;
-	if (state->overtemp >= CPU_MAX_OVERTEMP)
-		critical_state = 1;
-	if (state->overtemp > 0) {
-		state->rpm = state->mpu.rmaxn_exhaust_fan;
-		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
-		goto do_set_fans;
-	}
-	
-	/* Scale other sensor values according to fixed scales
+	/* Scale voltage and current raw sensor values according to fixed scales
 	 * obtained in Darwin and calculate power from I and V
 	 */
-	state->voltage = voltage *= ADC_CPU_VOLTAGE_SCALE;
-	state->current_a = current_a *= ADC_CPU_CURRENT_SCALE;
-	power = (((u64)current_a) * ((u64)voltage)) >> 16;
+	volts *= ADC_CPU_VOLTAGE_SCALE;
+	amps *= ADC_CPU_CURRENT_SCALE;
+	*power = (((u64)volts) * ((u64)amps)) >> 16;
+	state->voltage = volts;
+	state->current_a = amps;
+	state->last_power = *power;
+
+	DBG("  cpu %d, current: %d.%03d, voltage: %d.%03d, power: %d.%03d W\n",
+	    state->index, FIX32TOPRINT(state->current_a),
+	    FIX32TOPRINT(state->voltage), FIX32TOPRINT(*power));
+
+	return 0;
+}
+
+static void do_cpu_pid(struct cpu_pid_state *state, s32 temp, s32 power)
+{
+	s32 power_target, integral, derivative, proportional, adj_in_target, sval;
+	s64 integ_p, deriv_p, prop_p, sum; 
+	int i;
 
 	/* Calculate power target value (could be done once for all)
 	 * and convert to a 16.16 fp number
 	 */
 	power_target = ((u32)(state->mpu.pmaxh - state->mpu.padjmax)) << 16;
-
-	DBG("  current: %d.%03d, voltage: %d.%03d\n",
-	    FIX32TOPRINT(current_a), FIX32TOPRINT(voltage));
-	DBG("  power: %d.%03d W, target: %d.%03d, error: %d.%03d\n", FIX32TOPRINT(power),
+	DBG("  power target: %d.%03d, error: %d.%03d\n",
 	    FIX32TOPRINT(power_target), FIX32TOPRINT(power_target - power));
 
 	/* Store temperature and power in history array */
@@ -626,7 +750,7 @@
 	 * input target is mpu.ttarget, input max is mpu.tmax
 	 */
 	integ_p = ((s64)state->mpu.pid_gr) * (s64)integral;
-	DBG("   integ_p: %d\n", (int)(deriv_p >> 36));
+	DBG("   integ_p: %d\n", (int)(integ_p >> 36));
 	sval = (state->mpu.tmax << 16) - ((integ_p >> 20) & 0xffffffff);
 	adj_in_target = (state->mpu.ttarget << 16);
 	if (adj_in_target > sval)
@@ -655,15 +779,136 @@
 	DBG("   sum: %d\n", (int)sum);
 	state->rpm += (s32)sum;
 
-	if (state->rpm < state->mpu.rminn_exhaust_fan)
+	if (state->rpm < (int)state->mpu.rminn_exhaust_fan)
 		state->rpm = state->mpu.rminn_exhaust_fan;
-	if (state->rpm > state->mpu.rmaxn_exhaust_fan)
+	if (state->rpm > (int)state->mpu.rmaxn_exhaust_fan)
 		state->rpm = state->mpu.rmaxn_exhaust_fan;
+}
+
+static void do_monitor_cpu_combined(void)
+{
+	struct cpu_pid_state *state0 = &cpu_state[0];
+	struct cpu_pid_state *state1 = &cpu_state[1];
+	s32 temp0, power0, temp1, power1;
+	s32 temp_combi, power_combi;
+	int rc, intake, pump;
+
+	rc = do_read_one_cpu_values(state0, &temp0, &power0);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	state1->overtemp = 0;
+	rc = do_read_one_cpu_values(state1, &temp1, &power1);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+	if (state1->overtemp)
+		state0->overtemp++;
+
+	temp_combi = max(temp0, temp1);
+	power_combi = max(power0, power1);
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp_combi >= ((state0->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! Temperature way above maximum (%d) !\n",
+		       temp_combi >> 16);
+		state0->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp_combi > (state0->mpu.tmax << 16))
+		state0->overtemp++;
+	else
+		state0->overtemp = 0;
+	if (state0->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state0->overtemp > 0) {
+		state0->rpm = state0->mpu.rmaxn_exhaust_fan;
+		state0->intake_rpm = intake = state0->mpu.rmaxn_intake_fan;
+		pump = CPU_PUMP_OUTPUT_MAX;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state0, temp_combi, power_combi);
+
+	/* Calculate intake fan speed */
+	intake = (state0->rpm * CPU_INTAKE_SCALE) >> 16;
+	if (intake < (int)state0->mpu.rminn_intake_fan)
+		intake = state0->mpu.rminn_intake_fan;
+	if (intake > (int)state0->mpu.rmaxn_intake_fan)
+		intake = state0->mpu.rmaxn_intake_fan;
+	state0->intake_rpm = intake;
+
+	/* Calculate pump speed */
+	pump = (state0->rpm * CPU_PUMP_OUTPUT_MAX) /
+		state0->mpu.rmaxn_exhaust_fan;
+	if (pump > CPU_PUMP_OUTPUT_MAX)
+		pump = CPU_PUMP_OUTPUT_MAX;
+	if (pump < CPU_PUMP_OUTPUT_MIN)
+		pump = CPU_PUMP_OUTPUT_MIN;
+	
+ do_set_fans:
+	/* We copy values from state 0 to state 1 for /sysfs */
+	state1->rpm = state0->rpm;
+	state1->intake_rpm = state0->intake_rpm;
+
+	DBG("** CPU %d RPM: %d Ex, %d, Pump: %d, In, overtemp: %d\n",
+	    state1->index, (int)state1->rpm, intake, pump, state1->overtemp);
+
+	/* We should check for errors, shouldn't we ? But then, what
+	 * do we do once the error occurs ? For FCU notified fan
+	 * failures (-EFAULT) we probably want to notify userland
+	 * some way...
+	 */
+	set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+	set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+	set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state0->rpm);
+
+	if (fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUA_PUMP_RPM_INDEX, pump);
+	if (fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID)
+		set_rpm_fan(CPUB_PUMP_RPM_INDEX, pump);
+}
+
+static void do_monitor_cpu_split(struct cpu_pid_state *state)
+{
+	s32 temp, power;
+	int rc, intake;
+
+	/* Read current fan status */
+	rc = do_read_one_cpu_values(state, &temp, &power);
+	if (rc < 0) {
+		/* XXX What do we do now ? */
+	}
+
+	/* Check tmax, increment overtemp if we are there. At tmax+8, we go
+	 * full blown immediately and try to trigger a shutdown
+	 */
+	if (temp >= ((state->mpu.tmax + 8) << 16)) {
+		printk(KERN_WARNING "Warning ! CPU %d temperature way above maximum"
+		       " (%d) !\n",
+		       state->index, temp >> 16);
+		state->overtemp = CPU_MAX_OVERTEMP;
+	} else if (temp > (state->mpu.tmax << 16))
+		state->overtemp++;
+	else
+		state->overtemp = 0;
+	if (state->overtemp >= CPU_MAX_OVERTEMP)
+		critical_state = 1;
+	if (state->overtemp > 0) {
+		state->rpm = state->mpu.rmaxn_exhaust_fan;
+		state->intake_rpm = intake = state->mpu.rmaxn_intake_fan;
+		goto do_set_fans;
+	}
+
+	/* Do the PID */
+	do_cpu_pid(state, temp, power);
 
 	intake = (state->rpm * CPU_INTAKE_SCALE) >> 16;
-	if (intake < state->mpu.rminn_intake_fan)
+	if (intake < (int)state->mpu.rminn_intake_fan)
 		intake = state->mpu.rminn_intake_fan;
-	if (intake > state->mpu.rmaxn_intake_fan)
+	if (intake > (int)state->mpu.rmaxn_intake_fan)
 		intake = state->mpu.rmaxn_intake_fan;
 	state->intake_rpm = intake;
 
@@ -677,11 +922,11 @@
 	 * some way...
 	 */
 	if (state->index == 0) {
-		set_rpm_fan(CPUA_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUA_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUA_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	} else {
-		set_rpm_fan(CPUB_INTAKE_FAN_RPM_ID, intake);
-		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_ID, state->rpm);
+		set_rpm_fan(CPUB_INTAKE_FAN_RPM_INDEX, intake);
+		set_rpm_fan(CPUB_EXHAUST_FAN_RPM_INDEX, state->rpm);
 	}
 }
 
@@ -696,6 +941,7 @@
 	state->overtemp = 0;
 	state->adc_config = 0x00;
 
+
 	if (index == 0)
 		state->monitor = attach_i2c_chip(SUPPLY_MONITOR_ID, "CPU0_monitor");
 	else if (index == 1)
@@ -778,7 +1024,7 @@
 	DBG("backside:\n");
 
 	/* Check fan status */
-	rc = get_pwm_fan(BACKSIDE_FAN_PWM_ID);
+	rc = get_pwm_fan(BACKSIDE_FAN_PWM_INDEX);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading backside fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -790,12 +1036,12 @@
 	temp = i2c_smbus_read_byte_data(state->monitor, MAX6690_EXT_TEMP) << 16;
 	state->last_temp = temp;
 	DBG("  temp: %d.%03d, target: %d.%03d\n", FIX32TOPRINT(temp),
-	    FIX32TOPRINT(BACKSIDE_PID_INPUT_TARGET));
+	    FIX32TOPRINT(backside_params.input_target));
 
 	/* Store temperature and error in history array */
 	state->cur_sample = (state->cur_sample + 1) % BACKSIDE_PID_HISTORY_SIZE;
 	state->sample_history[state->cur_sample] = temp;
-	state->error_history[state->cur_sample] = temp - BACKSIDE_PID_INPUT_TARGET;
+	state->error_history[state->cur_sample] = temp - backside_params.input_target;
 	
 	/* If first loop, fill the history table */
 	if (state->first) {
@@ -804,7 +1050,7 @@
 				BACKSIDE_PID_HISTORY_SIZE;
 			state->sample_history[state->cur_sample] = temp;
 			state->error_history[state->cur_sample] =
-				temp - BACKSIDE_PID_INPUT_TARGET;
+				temp - backside_params.input_target;
 		}
 		state->first = 0;
 	}
@@ -816,7 +1062,7 @@
 		integral += state->error_history[i];
 	integral *= BACKSIDE_PID_INTERVAL;
 	DBG("  integral: %08x\n", integral);
-	integ_p = ((s64)BACKSIDE_PID_G_r) * (s64)integral;
+	integ_p = ((s64)backside_params.G_r) * (s64)integral;
 	DBG("   integ_p: %d\n", (int)(integ_p >> 36));
 	sum += integ_p;
 
@@ -825,12 +1071,12 @@
 		state->error_history[(state->cur_sample + BACKSIDE_PID_HISTORY_SIZE - 1)
 				    % BACKSIDE_PID_HISTORY_SIZE];
 	derivative /= BACKSIDE_PID_INTERVAL;
-	deriv_p = ((s64)BACKSIDE_PID_G_d) * (s64)derivative;
+	deriv_p = ((s64)backside_params.G_d) * (s64)derivative;
 	DBG("   deriv_p: %d\n", (int)(deriv_p >> 36));
 	sum += deriv_p;
 
 	/* Calculate the proportional term */
-	prop_p = ((s64)BACKSIDE_PID_G_p) * (s64)(state->error_history[state->cur_sample]);
+	prop_p = ((s64)backside_params.G_p) * (s64)(state->error_history[state->cur_sample]);
 	DBG("   prop_p: %d\n", (int)(prop_p >> 36));
 	sum += prop_p;
 
@@ -839,13 +1085,13 @@
 
 	DBG("   sum: %d\n", (int)sum);
 	state->pwm += (s32)sum;
-	if (state->pwm < BACKSIDE_PID_OUTPUT_MIN)
-		state->pwm = BACKSIDE_PID_OUTPUT_MIN;
-	if (state->pwm > BACKSIDE_PID_OUTPUT_MAX)
-		state->pwm = BACKSIDE_PID_OUTPUT_MAX;
+	if (state->pwm < backside_params.output_min)
+		state->pwm = backside_params.output_min;
+	if (state->pwm > backside_params.output_max)
+		state->pwm = backside_params.output_max;
 
 	DBG("** BACKSIDE PWM: %d\n", (int)state->pwm);
-	set_pwm_fan(BACKSIDE_FAN_PWM_ID, state->pwm);
+	set_pwm_fan(BACKSIDE_FAN_PWM_INDEX, state->pwm);
 }
 
 /*
@@ -853,6 +1099,35 @@
  */
 static int init_backside_state(struct backside_pid_state *state)
 {
+	struct device_node *u3;
+	int u3h = 1; /* conservative by default */
+
+	/*
+	 * There are different PID params for machines with U3 and machines
+	 * with U3H, pick the right ones now
+	 */
+	u3 = of_find_node_by_path("/u3@0,f8000000");
+	if (u3 != NULL) {
+		u32 *vers = (u32 *)get_property(u3, "device-rev", NULL);
+		if (vers)
+			if (((*vers) & 0x3f) < 0x34)
+				u3h = 0;
+		of_node_put(u3);
+	}
+
+	backside_params.G_p = BACKSIDE_PID_G_p;
+	backside_params.G_r = BACKSIDE_PID_G_r;
+	backside_params.output_max = BACKSIDE_PID_OUTPUT_MAX;
+	if (u3h) {
+		backside_params.G_d = BACKSIDE_PID_U3H_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3H_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3H_OUTPUT_MIN;
+	} else {
+		backside_params.G_d = BACKSIDE_PID_U3_G_d;
+		backside_params.input_target = BACKSIDE_PID_U3_INPUT_TARGET;
+		backside_params.output_min = BACKSIDE_PID_U3_OUTPUT_MIN;
+	}
+
 	state->ticks = 1;
 	state->first = 1;
 	state->pwm = 50;
@@ -898,7 +1173,7 @@
 	DBG("drives:\n");
 
 	/* Check fan status */
-	rc = get_rpm_fan(DRIVES_FAN_RPM_ID, !RPM_PID_USE_ACTUAL_SPEED);
+	rc = get_rpm_fan(DRIVES_FAN_RPM_INDEX, !RPM_PID_USE_ACTUAL_SPEED);
 	if (rc < 0) {
 		printk(KERN_WARNING "Error %d reading drives fan !\n", rc);
 		/* XXX What do we do now ? */
@@ -965,7 +1240,7 @@
 		state->rpm = DRIVES_PID_OUTPUT_MAX;
 
 	DBG("** DRIVES RPM: %d\n", (int)state->rpm);
-	set_rpm_fan(DRIVES_FAN_RPM_ID, state->rpm);
+	set_rpm_fan(DRIVES_FAN_RPM_INDEX, state->rpm);
 }
 
 /*
@@ -1032,7 +1307,7 @@
 	}
 
 	/* Set the PCI fan once for now */
-	set_pwm_fan(SLOTS_FAN_PWM_ID, SLOTS_FAN_DEFAULT_PWM);
+	set_pwm_fan(SLOTS_FAN_PWM_INDEX, SLOTS_FAN_DEFAULT_PWM);
 
 	/* Initialize ADCs */
 	initialize_adc(&cpu_state[0]);
@@ -1047,9 +1322,13 @@
 		start = jiffies;
 
 		down(&driver_lock);
-		do_monitor_cpu(&cpu_state[0]);
-		if (cpu_state[1].monitor != NULL)
-			do_monitor_cpu(&cpu_state[1]);
+		if (cpu_pid_type == CPU_PID_TYPE_COMBINED)
+			do_monitor_cpu_combined();
+		else {
+			do_monitor_cpu_split(&cpu_state[0]);
+			if (cpu_state[1].monitor != NULL)
+				do_monitor_cpu_split(&cpu_state[1]);
+		}
 		do_monitor_backside(&backside_state);
 		do_monitor_drives(&drives_state);
 		up(&driver_lock);
@@ -1113,6 +1392,19 @@
 
 	DBG("counted %d CPUs in the device-tree\n", cpu_count);
 
+	/* Decide the type of PID algorithm to use based on the presence of
+	 * the pumps, though that may not be the best way, that is good enough
+	 * for now
+	 */
+	if (machine_is_compatible("PowerMac7,3")
+	    && (cpu_count > 1)
+	    && fcu_fans[CPUA_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID
+	    && fcu_fans[CPUB_PUMP_RPM_INDEX].id != FCU_FAN_ABSENT_ID) {
+		printk(KERN_INFO "Liquid cooling pumps detected, using new algorithm !\n");
+		cpu_pid_type = CPU_PID_TYPE_COMBINED;
+	} else
+		cpu_pid_type = CPU_PID_TYPE_SPLIT;
+
 	/* Create control loops for everything. If any fail, everything
 	 * fails
 	 */
@@ -1257,12 +1549,91 @@
 	return 0;
 }
 
+static void fcu_lookup_fans(struct device_node *fcu_node)
+{
+	struct device_node *np = NULL;
+	int i;
+
+	/* The table is filled by default with values that are suitable
+	 * for the old machines without device-tree informations. We scan
+	 * the device-tree and override those values with whatever is
+	 * there
+	 */
+
+	DBG("Looking up FCU controls in device-tree...\n");
+
+	while ((np = of_get_next_child(fcu_node, np)) != NULL) {
+		int type = -1;
+		char *loc;
+		u32 *reg;
+
+		DBG(" control: %s, type: %s\n", np->name, np->type);
+
+		/* Detect control type */
+		if (!strcmp(np->type, "fan-rpm-control") ||
+		    !strcmp(np->type, "fan-rpm"))
+			type = FCU_FAN_RPM;
+		if (!strcmp(np->type, "fan-pwm-control") ||
+		    !strcmp(np->type, "fan-pwm"))
+			type = FCU_FAN_PWM;
+		/* Only care about fans for now */
+		if (type == -1)
+			continue;
+
+		/* Lookup for a matching location */
+		loc = (char *)get_property(np, "location", NULL);
+		reg = (u32 *)get_property(np, "reg", NULL);
+		if (loc == NULL || reg == NULL)
+			continue;
+		DBG(" matching location: %s, reg: 0x%08x\n", loc, *reg);
+
+		for (i = 0; i < FCU_FAN_COUNT; i++) {
+			int fan_id;
+
+			if (strcmp(loc, fcu_fans[i].loc))
+				continue;
+			DBG(" location match, index: %d\n", i);
+			fcu_fans[i].id = FCU_FAN_ABSENT_ID;
+			if (type != fcu_fans[i].type) {
+				printk(KERN_WARNING "therm_pm72: Fan type mismatch "
+				       "in device-tree for %s\n", np->full_name);
+				break;
+			}
+			if (type == FCU_FAN_RPM)
+				fan_id = ((*reg) - 0x10) / 2;
+			else
+				fan_id = ((*reg) - 0x30) / 2;
+			if (fan_id > 7) {
+				printk(KERN_WARNING "therm_pm72: Can't parse "
+				       "fan ID in device-tree for %s\n", np->full_name);
+				break;
+			}
+			DBG(" fan id -> %d, type -> %d\n", fan_id, type);
+			fcu_fans[i].id = fan_id;
+		}
+	}
+
+	/* Now dump the array */
+	printk(KERN_INFO "Detected fan controls:\n");
+	for (i = 0; i < FCU_FAN_COUNT; i++) {
+		if (fcu_fans[i].id == FCU_FAN_ABSENT_ID)
+			continue;
+		printk(KERN_INFO "  %d: %s fan, id %d, location: %s\n", i,
+		       fcu_fans[i].type == FCU_FAN_RPM ? "RPM" : "PWM",
+		       fcu_fans[i].id, fcu_fans[i].loc);
+	}
+}
+
 static int fcu_of_probe(struct of_device* dev, const struct of_match *match)
 {
 	int rc;
 
 	state = state_detached;
 
+	/* Lookup the fans in the device tree */
+	fcu_lookup_fans(dev->node);
+
+	/* Add the driver */
 	rc = i2c_add_driver(&therm_pm72_driver);
 	if (rc < 0)
 		return rc;
@@ -1301,15 +1672,20 @@
 {
 	struct device_node *np;
 
-	if (!machine_is_compatible("PowerMac7,2"))
+	if (!machine_is_compatible("PowerMac7,2") &&
+	    !machine_is_compatible("PowerMac7,3"))
 	    	return -ENODEV;
 
 	printk(KERN_INFO "PowerMac G5 Thermal control driver %s\n", VERSION);
 
 	np = of_find_node_by_type(NULL, "fcu");
 	if (np == NULL) {
-		printk(KERN_ERR "Can't find FCU in device-tree !\n");
-		return -ENODEV;
+		/* Some machines have strangely broken device-tree */
+		np = of_find_node_by_path("/u3@0,f8000000/i2c@f8001000/fan@15e");
+		if (np == NULL) {
+			    printk(KERN_ERR "Can't find FCU in device-tree !\n");
+			    return -ENODEV;
+		}
 	}
 	of_dev = of_platform_device_create(np, "temperature");
 	if (of_dev == NULL) {
diff -urN linux-2.5/drivers/macintosh/therm_pm72.h linux-pogo/drivers/macintosh/therm_pm72.h
--- linux-2.5/drivers/macintosh/therm_pm72.h	2004-09-24 14:34:05.000000000 +1000
+++ linux-pogo/drivers/macintosh/therm_pm72.h	2004-10-16 18:29:29.000000000 +1000
@@ -119,18 +119,33 @@
 #define ADC_CPU_CURRENT_SCALE	0x1f40	/* _AD4 */
 
 /*
- * PID factors for the U3/Backside fan control loop
+ * PID factors for the U3/Backside fan control loop. We have 2 sets
+ * of values here, one set for U3 and one set for U3H
  */
-#define BACKSIDE_FAN_PWM_ID		1
-#define BACKSIDE_PID_G_d		0x02800000
+#define BACKSIDE_FAN_PWM_DEFAULT_ID	1
+#define BACKSIDE_FAN_PWM_INDEX		0
+#define BACKSIDE_PID_U3_G_d		0x02800000
+#define BACKSIDE_PID_U3H_G_d		0x01400000
 #define BACKSIDE_PID_G_p		0x00500000
 #define BACKSIDE_PID_G_r		0x00000000
-#define BACKSIDE_PID_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3_INPUT_TARGET	0x00410000
+#define BACKSIDE_PID_U3H_INPUT_TARGET	0x004b0000
 #define BACKSIDE_PID_INTERVAL		5
 #define BACKSIDE_PID_OUTPUT_MAX		100
-#define BACKSIDE_PID_OUTPUT_MIN		20
+#define BACKSIDE_PID_U3_OUTPUT_MIN	20
+#define BACKSIDE_PID_U3H_OUTPUT_MIN	30
 #define BACKSIDE_PID_HISTORY_SIZE	2
 
+struct basckside_pid_params
+{
+	s32			G_d;
+	s32			G_p;
+	s32			G_r;
+	s32			input_target;
+	s32			output_min;
+	s32			output_max;
+};
+
 struct backside_pid_state
 {
 	int			ticks;
@@ -146,7 +161,8 @@
 /*
  * PID factors for the Drive Bay fan control loop
  */
-#define DRIVES_FAN_RPM_ID      		2
+#define DRIVES_FAN_RPM_DEFAULT_ID	2
+#define DRIVES_FAN_RPM_INDEX		1
 #define DRIVES_PID_G_d			0x01e00000
 #define DRIVES_PID_G_p			0x00500000
 #define DRIVES_PID_G_r			0x00000000
@@ -168,7 +184,8 @@
 	int			first;
 };
 
-#define SLOTS_FAN_PWM_ID       		2
+#define SLOTS_FAN_PWM_DEFAULT_ID	2
+#define SLOTS_FAN_PWM_INDEX		2
 #define	SLOTS_FAN_DEFAULT_PWM		50 /* Do better here ! */
 
 /*
@@ -191,10 +208,15 @@
  * CPU B FAKE POWER	49	(I_V_inputs: 18, 19)
  */
 
-#define CPUA_INTAKE_FAN_RPM_ID		3
-#define CPUA_EXHAUST_FAN_RPM_ID		4
-#define CPUB_INTAKE_FAN_RPM_ID		5
-#define CPUB_EXHAUST_FAN_RPM_ID		6
+#define CPUA_INTAKE_FAN_RPM_DEFAULT_ID	3
+#define CPUA_EXHAUST_FAN_RPM_DEFAULT_ID	4
+#define CPUB_INTAKE_FAN_RPM_DEFAULT_ID	5
+#define CPUB_EXHAUST_FAN_RPM_DEFAULT_ID	6
+
+#define CPUA_INTAKE_FAN_RPM_INDEX	3
+#define CPUA_EXHAUST_FAN_RPM_INDEX	4
+#define CPUB_INTAKE_FAN_RPM_INDEX	5
+#define CPUB_EXHAUST_FAN_RPM_INDEX	6
 
 #define CPU_INTAKE_SCALE		0x0000f852
 #define CPU_TEMP_HISTORY_SIZE		2
@@ -202,6 +224,11 @@
 #define CPU_PID_INTERVAL		1
 #define CPU_MAX_OVERTEMP		30
 
+#define CPUA_PUMP_RPM_INDEX		7
+#define CPUB_PUMP_RPM_INDEX		8
+#define CPU_PUMP_OUTPUT_MAX		3700
+#define CPU_PUMP_OUTPUT_MIN		1000
+
 struct cpu_pid_state
 {
 	int			index;
@@ -219,6 +246,7 @@
 	s32			voltage;
 	s32			current_a;
 	s32			last_temp;
+	s32			last_power;
 	int			first;
 	u8			adc_config;
 };


From benh at kernel.crashing.org  Sun Oct 17 11:12:44 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sun, 17 Oct 2004 11:12:44 +1000
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <16753.49793.159513.618588@cargo.ozlabs.ibm.com>
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston> <4170A265.6030402@zytor.com>
	<1097902206.8965.2.camel@gaston> <4170AA5F.6060107@zytor.com>
	<1097902694.8965.8.camel@gaston> <4170B257.1010602@zytor.com>
	<16753.49793.159513.618588@cargo.ozlabs.ibm.com>
Message-ID: <1097975564.14005.66.camel@gaston>

On Sun, 2004-10-17 at 10:53, Paul Mackerras wrote:
> H. Peter Anvin writes:
> 
> > It's the backside fan that oscillates; backside_fan_pwm varies between 
> > 30 and 100 in what is pretty much a squarewave.  See attached graph (and 
> > note how the other fans vary with workload.)
> 
> The sharp rises look like the code thinks it gets into an
> over-temperature situation and turns the fans on full blast.  It could
> be worth putting some printks in the overtemp code.

In practice the problem was that when I turned the min/max values into
variables, I declared them unsigned. That broke this code:

        state->pwm += (s32)sum;
        if (state->pwm < backside_params.output_min)
                state->pwm = backside_params.output_min;
        if (state->pwm > backside_params.output_max)
                state->pwm = backside_params.output_max;

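This happened when "sum" was negative enough to cause state->pwm to drop
below 0: the comparisons were then done unsigned, so the negative value
looked huge and got clamped to output_max instead of output_min.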

Turning backside_params.* to signed fixed this issue (and possibly
others as the other factors are also used as signed fixed values into
the previous calculations).
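
For illustration, here is a minimal user-space sketch of that failure
mode, assuming a 32-bit int. The helper names are hypothetical and do not
exist in therm_pm72.c; the only difference between the two functions is
the signedness of the limits.

```c
#include <stdint.h>

/* Hypothetical stand-in for the clamp with unsigned limits. */
static int32_t clamp_unsigned_limits(int32_t pwm, uint32_t lo, uint32_t hi)
{
	/* pwm is converted to unsigned for both comparisons, so a
	 * negative pwm looks like a huge positive number: the first
	 * test fails and the second one clamps to hi.
	 */
	if (pwm < lo)
		pwm = lo;
	if (pwm > hi)
		pwm = hi;
	return pwm;
}

/* Same clamp with signed limits: comparisons stay signed. */
static int32_t clamp_signed_limits(int32_t pwm, int32_t lo, int32_t hi)
{
	if (pwm < lo)
		pwm = lo;
	if (pwm > hi)
		pwm = hi;
	return pwm;
}
```

With limits of 30 and 100, clamp_unsigned_limits(-20, 30, 100) returns
100 while clamp_signed_limits(-20, 30, 100) returns 30, which would be
consistent with a PWM output jumping between 30 and 100.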

Ben.
 

From schwab at suse.de  Mon Oct 18 00:50:47 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sun, 17 Oct 2004 16:50:47 +0200
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097974684.8965.59.camel@gaston> (Benjamin Herrenschmidt's
	message of "Sun, 17 Oct 2004 10:58:04 +1000")
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston> <jept3iv37h.fsf@sykes.suse.de>
	<1097974684.8965.59.camel@gaston>
Message-ID: <jeis99pfeg.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> Is it all fans or just the backside fan going crazy? The latter bug
> is fixed by version #4, which I'll post in a minute...

I think it's all fans, but I'll test your patch just in case.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From hpa at zytor.com  Sat Oct 16 15:57:09 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Fri, 15 Oct 2004 22:57:09 -0700
Subject: Fan control for PowerMac7_3 (#3)
In-Reply-To: <1097905399.8963.26.camel@gaston>
References: <1097831790.1131.111.camel@gaston>	
	<1097831981.1131.113.camel@gaston>
	<1097832049.1149.115.camel@gaston>	
	<1097893432.6546.37.camel@gaston> <4170A265.6030402@zytor.com>	
	<1097902206.8965.2.camel@gaston> <4170AA5F.6060107@zytor.com>	
	<1097902694.8965.8.camel@gaston> <4170B257.1010602@zytor.com>
	<1097905399.8963.26.camel@gaston>
Message-ID: <4170B835.8050205@zytor.com>

Benjamin Herrenschmidt wrote:
> On Sat, 2004-10-16 at 15:32, H. Peter Anvin wrote:
> 
> 
>>It's the backside fan that oscillates; backside_fan_pwm varies between 
>>30 and 100 in what is pretty much a squarewave.  See attached graph (and 
>>note how the other fans vary with workload.)
>>
>>I probably need to write a "power virus" program for the G5 to really 
>>test out the high end (a power virus is a program which keeps the chip 
>>running as hard as it can; generally keep all pipelines stuffed.)
> 
> 
> Strange... The values used seem to be identical to OS X (a 75° target
> which is high actually, and a different G_d value than old U3). I would
> need to see the debug output and compare with the OS X driver built with
> debug output as well (don't ask me to fully understand the math of the
> PID algorithm)
> 
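
For anyone checking the 75° figure against the patch: the backside
targets in therm_pm72.h are 16.16 fixed-point numbers, so
BACKSIDE_PID_U3H_INPUT_TARGET (0x004b0000) is 75 and
BACKSIDE_PID_U3_INPUT_TARGET (0x00410000) is 65. A minimal sketch of the
conversion follows; the helper names are hypothetical and are not part
of the driver.

```c
#include <stdint.h>

/* Integer part of a 16.16 fixed-point value (non-negative inputs). */
static int32_t fix16_int(int32_t f)
{
	return f >> 16;
}

/* Fractional part in thousandths (non-negative inputs). */
static int32_t fix16_milli(int32_t f)
{
	return ((f & 0xffff) * 1000) >> 16;
}
```

fix16_int(0x004b0000) yields 75 and fix16_int(0x00410000) yields 65; a
value such as 0x00018000 splits into 1 and 500, i.e. 1.500.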

If you want the file I used to produce the graph, it has all the entries 
in /sys/devices/temperature snapshotted at 100 ms intervals (attached) 
in the following order:

[time] backside_fan_pwm backside_temperature cpu0_current
cpu0_exhaust_fan_rpm cpu0_intake_fan_rpm cpu0_temperature cpu0_voltage
cpu1_current cpu1_exhaust_fan_rpm cpu1_intake_fan_rpm cpu1_temperature
cpu1_voltage drives_fan_rpm drives_temperature

I've also attached /var/log/dmesg in case that's useful.

	-hpa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmesg.bz2
Type: application/x-bzip2
Size: 5502 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041015/7bbba4f9/attachment.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: temps.dat.bz2
Type: application/x-bzip2
Size: 25736 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041015/7bbba4f9/attachment-0001.bin 

From schwab at suse.de  Mon Oct 18 01:37:22 2004
From: schwab at suse.de (Andreas Schwab)
Date: Sun, 17 Oct 2004 17:37:22 +0200
Subject: Fan control for PowerMac7_3 (#4)
In-Reply-To: <1097974861.8965.62.camel@gaston> (Benjamin Herrenschmidt's
	message of "Sun, 17 Oct 2004 11:01:03 +1000")
References: <1097831790.1131.111.camel@gaston>
	<1097831981.1131.113.camel@gaston> <1097832049.1149.115.camel@gaston>
	<1097893432.6546.37.camel@gaston> <1097974861.8965.62.camel@gaston>
Message-ID: <jeaculmk3x.fsf@sykes.suse.de>

Benjamin Herrenschmidt <benh at kernel.crashing.org> writes:

> This version fixes a bug with the backside fan doing crazy things,
> it appears to work properly on the dual 2.5GHz now. Unless I get a
> negative report, I intend to submit it to Linus in a couple of days.

This works fine for me, too.  Thanks!

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


From olh at suse.de  Mon Oct 18 04:55:57 2004
From: olh at suse.de (Olaf Hering)
Date: Sun, 17 Oct 2004 20:55:57 +0200
Subject: [PATCH] allow kernel compile with native ppc64 compiler
Message-ID: <20041017185557.GA9619@suse.de>


The zImage is a 32bit binary, but a native powerpc64-linux gcc will
produce 64bit objects in arch/ppc64/boot.
This patch fixes it.

Signed-off-by: Olaf Hering <olh at suse.de>

diff -purN linux-2.6.9-final/arch/ppc64/boot/Makefile linux-2.6.9-final.native/arch/ppc64/boot/Makefile
--- linux-2.6.9-final/arch/ppc64/boot/Makefile	2004-10-16 03:03:50.000000000 +0000
+++ linux-2.6.9-final.native/arch/ppc64/boot/Makefile	2004-10-17 18:44:33.229249956 +0000
@@ -23,14 +23,14 @@
 CROSS32_COMPILE ?=
 #CROSS32_COMPILE = /usr/local/ppc/bin/powerpc-linux-
 
-BOOTCC		:= $(CROSS32_COMPILE)gcc
+BOOTCC		:= $(CROSS32_COMPILE)gcc -m32
 HOSTCC		:= gcc
 BOOTCFLAGS	:= $(HOSTCFLAGS) $(LINUXINCLUDE) -fno-builtin 
-BOOTAS		:= $(CROSS32_COMPILE)as
+BOOTAS		:= $(CROSS32_COMPILE)as -a32
 BOOTAFLAGS	:= -D__ASSEMBLY__ $(BOOTCFLAGS) -traditional
-BOOTLD		:= $(CROSS32_COMPILE)ld
+BOOTLD		:= $(CROSS32_COMPILE)ld -m elf32ppc
 BOOTLFLAGS	:= -Ttext 0x00400000 -e _start -T $(srctree)/$(src)/zImage.lds
-BOOTOBJCOPY	:= $(CROSS32_COMPILE)objcopy
+BOOTOBJCOPY	:= $(CROSS32_COMPILE)objcopy --target elf32-powerpc
 OBJCOPYFLAGS    := contents,alloc,load,readonly,data
 
 src-boot := crt0.S string.S prom.c main.c zlib.c imagesize.c div64.S
diff -purN linux-2.6.9-final/arch/ppc64/boot/zImage.lds linux-2.6.9-final.native/arch/ppc64/boot/zImage.lds
--- linux-2.6.9-final/arch/ppc64/boot/zImage.lds	2004-10-16 03:01:55.000000000 +0000
+++ linux-2.6.9-final.native/arch/ppc64/boot/zImage.lds	2004-10-17 18:48:14.824288338 +0000
@@ -1,4 +1,4 @@
-OUTPUT_ARCH(powerpc)
+OUTPUT_ARCH(powerpc:common)
 SEARCH_DIR(/lib); SEARCH_DIR(/usr/lib); SEARCH_DIR(/usr/local/lib); SEARCH_DIR(/usr/local/powerpc-any-elf/lib);
 /* Do we need any of these for elf?
    __DYNAMIC = 0;    */
-- 
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From paulus at samba.org  Mon Oct 18 07:46:26 2004
From: paulus at samba.org (Paul Mackerras)
Date: Mon, 18 Oct 2004 07:46:26 +1000
Subject: [PATCH] allow kernel compile with native ppc64 compiler
In-Reply-To: <20041017185557.GA9619@suse.de>
References: <20041017185557.GA9619@suse.de>
Message-ID: <16754.59442.992185.715900@cargo.ozlabs.ibm.com>

Olaf Hering writes:

> The zImage is a 32bit binary, but a native powerpc64-linux gcc will
> produce 64bit objects in arch/ppc64/boot.
> This patch fixes it.

... and breaks the compile on older toolchains that don't understand
-m32.  We need to make the -m32 conditional on HAS_BIARCH as defined
in arch/ppc64/Makefile.

Paul.


From olh at suse.de  Mon Oct 18 14:56:03 2004
From: olh at suse.de (Olaf Hering)
Date: Mon, 18 Oct 2004 06:56:03 +0200
Subject: [PATCH] allow kernel compile with native ppc64 compiler
In-Reply-To: <16754.59442.992185.715900@cargo.ozlabs.ibm.com>
References: <20041017185557.GA9619@suse.de>
	<16754.59442.992185.715900@cargo.ozlabs.ibm.com>
Message-ID: <20041018045603.GA8500@suse.de>

 On Mon, Oct 18, Paul Mackerras wrote:

> Olaf Hering writes:
> 
> > The zImage is a 32bit binary, but a native powerpc64-linux gcc will
> > produce 64bit objects in arch/ppc64/boot.
> > This patch fixes it.
> 
> ... and breaks the compile on older toolchains that don't understand
> -m32.  We need to make the -m32 conditional on HAS_BIARCH as defined
> in arch/ppc64/Makefile.

how old?

-- 
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From paulus at samba.org  Mon Oct 18 15:55:52 2004
From: paulus at samba.org (Paul Mackerras)
Date: Mon, 18 Oct 2004 15:55:52 +1000
Subject: [PATCH] allow kernel compile with native ppc64 compiler
In-Reply-To: <20041018045603.GA8500@suse.de>
References: <20041017185557.GA9619@suse.de>
	<16754.59442.992185.715900@cargo.ozlabs.ibm.com>
	<20041018045603.GA8500@suse.de>
Message-ID: <16755.23272.754150.209624@cargo.ozlabs.ibm.com>

Olaf Hering writes:

> > ... and breaks the compile on older toolchains that don't understand
> > -m32.  We need to make the -m32 conditional on HAS_BIARCH as defined
> > in arch/ppc64/Makefile.
> 
> how old?

The gcc that comes with debian sid doesn't understand -m32.  That's a
32-bit gcc, which means that I set CROSS_COMPILE when doing a ppc64
kernel compile.  With your patch I have to set CROSS32_COMPILE as
well, which seems silly when I'm compiling on a ppc32 box already.

Ben H suggested making the default BOOTCC be $(CC) -m32, which makes
sense to me.

Paul.


From olh at suse.de  Mon Oct 18 17:54:33 2004
From: olh at suse.de (Olaf Hering)
Date: Mon, 18 Oct 2004 09:54:33 +0200
Subject: [PATCH] allow kernel compile with native ppc64 compiler
In-Reply-To: <16755.23272.754150.209624@cargo.ozlabs.ibm.com>
References: <20041017185557.GA9619@suse.de>
	<16754.59442.992185.715900@cargo.ozlabs.ibm.com>
	<20041018045603.GA8500@suse.de>
	<16755.23272.754150.209624@cargo.ozlabs.ibm.com>
Message-ID: <20041018075433.GA24927@suse.de>

 On Mon, Oct 18, Paul Mackerras wrote:

> Olaf Hering writes:
> 
> > > ... and breaks the compile on older toolchains that don't understand
> > > -m32.  We need to make the -m32 conditional on HAS_BIARCH as defined
> > > in arch/ppc64/Makefile.
> > 
> > how old?
> 
> The gcc that comes with debian sid doesn't understand -m32.  That's a
> 32-bit gcc, which means that I set CROSS_COMPILE when doing a ppc64
> kernel compile.  With your patch I have to set CROSS32_COMPILE as
> well, which seems silly when I'm compiling on a ppc32 box already.

Makes sense, I confused a native powerpc64-linux gcc from last century
with a native/cross powerpc-linux gcc from last century.

> Ben H suggested making the default BOOTCC be $(CC) -m32, which makes
> sense to me.

That may break cross compile. I will provide a new patch.

-- 
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From ananth at in.ibm.com  Mon Oct 18 19:52:29 2004
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Mon, 18 Oct 2004 15:22:29 +0530
Subject: [PATCH] Kprobes for ppc64
Message-ID: <20041018095229.GA7394@in.ibm.com>

Hi,

Here is kprobes for ppc64. The patch applies on 2.6.9-rc4/2.6.9-final
and provides the kprobes + jprobes functionality.

My earlier post did not reach the mailing lists, hence this resend.

Kprobes (Kernel dynamic probes) is a lightweight mechanism for kernel
modules to insert probes into a running kernel, without the need to
modify the underlying source. The probe handlers can then be coded
to log relevant data at the probe point. More information on kprobes
can be found at:

     http://www-124.ibm.com/developerworks/oss/linux/projects/kprobes/

Jprobes (or jumper probes) is a small infrastructure to access function 
arguments. It can be used by defining a small stub with the same 
template as the routine in kernel, within which the required parameters 
can be logged. 

The following pseudocode illustrates the usage of a jprobe, where the 
sk_buff at tcp_v4_rcv() needs to be decoded:

............
struct jprobe jp;

/* stub with the same signature as tcp_v4_rcv() */
int jtcp_v4_rcv(struct sk_buff *skb)
{
	/* decode and log skb related details as required */

	/* hand control back to kprobes; does not return here */
	jprobe_return();
	return 0;
}

int init_module(void)
{
	jp.kp.addr = (kprobe_opcode_t *)<addr of tcp_v4_rcv>;
	jp.entry = JPROBE_ENTRY(jtcp_v4_rcv);
	register_jprobe(&jp);
	return 0;
}

void cleanup_module(void)
{
	unregister_jprobe(&jp);
}
............


NOTE: 
1. The current implementation uses xmon's emulate_step() and hence 
   requires xmon to be compiled in. 
2. arch_prepare_kprobe() now returns an int. I have made the necessary
   changes to the i386 and sparc64 kprobes files, but they are untested.
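
To illustrate why arch_prepare_kprobe() needs a return value on ppc64,
here is a small userspace sketch of the instruction check. The two masks
are copied from the include/asm-ppc64/kprobes.h in this patch; the
function name probe_refused() is made up for the example and is not part
of the patch.

```c
#include <stdint.h>

/* Masks as posted in include/asm-ppc64/kprobes.h: keep the primary
 * opcode and extended opcode, ignore the register fields. */
#define IS_MTMSRD(instr)	(((instr) & 0xfc0007fe) == 0x7c000164)
#define IS_RFID(instr)		(((instr) & 0xfc0007fe) == 0x4c000024)

/*
 * Hypothetical stand-in for the test done in arch_prepare_kprobe():
 * returns 1 when a probe on this instruction must be refused,
 * 0 when it is acceptable.
 */
int probe_refused(uint32_t insn)
{
	if (IS_MTMSRD(insn) || IS_RFID(insn))
		return 1;
	return 0;
}
```

With this, rfid (0x4c000024) and mtmsrd r3 (0x7c600164) are refused,
while an ordinary nop (0x60000000) is accepted. register_kprobe() turns
the non-zero return into -EINVAL.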


Thanks,
Ananth


diff -Naurp temp/linux-2.6.9-rc4/arch/i386/kernel/kprobes.c linux-2.6.9-rc4/arch/i386/kernel/kprobes.c
--- temp/linux-2.6.9-rc4/arch/i386/kernel/kprobes.c	2004-10-11 08:27:50.000000000 +0530
+++ linux-2.6.9-rc4/arch/i386/kernel/kprobes.c	2004-10-11 15:30:41.000000000 +0530
@@ -58,9 +58,10 @@ static inline int is_IF_modifier(kprobe_
 	return 0;
 }
 
-void arch_prepare_kprobe(struct kprobe *p)
+int arch_prepare_kprobe(struct kprobe *p)
 {
 	memcpy(p->insn, p->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+	return 0;
 }
 
 static inline void disarm_kprobe(struct kprobe *p, struct pt_regs *regs)
diff -Naurp temp/linux-2.6.9-rc4/arch/ppc64/Kconfig.debug linux-2.6.9-rc4/arch/ppc64/Kconfig.debug
--- temp/linux-2.6.9-rc4/arch/ppc64/Kconfig.debug	2004-10-11 08:28:49.000000000 +0530
+++ linux-2.6.9-rc4/arch/ppc64/Kconfig.debug	2004-10-11 15:30:41.000000000 +0530
@@ -6,6 +6,16 @@ config DEBUG_STACKOVERFLOW
 	bool "Check for stack overflows"
 	depends on DEBUG_KERNEL
 
+config KPROBES
+        bool "Kprobes"
+	depends on DEBUG_KERNEL
+	help
+	  Kprobes allows you to trap at almost any kernel address and
+	  execute a callback function.  register_kprobe() establishes
+	  a probepoint and specifies the callback.  Kprobes is useful
+	  for kernel debugging, non-intrusive instrumentation and testing.
+	  If in doubt, say "N".
+
 config DEBUG_STACK_USAGE
 	bool "Stack utilization instrumentation"
 	depends on DEBUG_KERNEL
diff -Naurp temp/linux-2.6.9-rc4/arch/ppc64/kernel/kprobes.c linux-2.6.9-rc4/arch/ppc64/kernel/kprobes.c
--- temp/linux-2.6.9-rc4/arch/ppc64/kernel/kprobes.c	1970-01-01 05:30:00.000000000 +0530
+++ linux-2.6.9-rc4/arch/ppc64/kernel/kprobes.c	2004-10-11 15:30:41.000000000 +0530
@@ -0,0 +1,260 @@
+/*
+ *  Kernel Probes (KProbes)
+ *  arch/ppc64/kernel/kprobes.c
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2002, 2004
+ *
+ * 2002-Oct	Created by Vamsi Krishna S <vamsi_krishna at in.ibm.com> Kernel
+ *		Probes initial implementation ( includes contributions from
+ *		Rusty Russell).
+ * 2004-July	Suparna Bhattacharya <suparna at in.ibm.com> added jumper probes
+ *		interface to access function arguments.
+ * 2004-Oct	Ananth N Mavinakayanahalli <ananth at in.ibm.com> kprobes port
+ *		for PPC64
+ */
+
+#include <linux/config.h>
+#include <linux/kprobes.h>
+#include <linux/ptrace.h>
+#include <linux/spinlock.h>
+#include <linux/preempt.h>
+#include <asm/kdebug.h>
+
+/* kprobe_status settings */
+#define KPROBE_HIT_ACTIVE	0x00000001
+#define KPROBE_HIT_SS		0x00000002
+
+static struct kprobe *current_kprobe;
+static unsigned long kprobe_status, kprobe_saved_msr;
+static struct pt_regs jprobe_saved_regs;
+
+/* we re-use xmon's emulate_step here */
+extern int emulate_step(struct pt_regs *regs, unsigned int instr);
+
+int arch_prepare_kprobe(struct kprobe *p)
+{
+	memcpy(p->insn, p->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+	if (IS_MTMSRD(p->insn[0]) || IS_RFID(p->insn[0]))
+		/* cannot put bp on RFID/MTMSRD */ 
+		return 1;
+	return 0;
+}
+
+static inline void disarm_kprobe(struct kprobe *p, struct pt_regs *regs)
+{
+	*p->addr = p->opcode;
+	regs->nip = (unsigned long)p->addr;
+}
+
+static inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
+{
+	regs->msr |= MSR_SE;
+	regs->nip = (unsigned long)&p->insn;
+}
+
+/*
+ * Interrupts are disabled on entry as trap3 is an interrupt gate and they
+ * remain disabled throughout this function.
+ */
+static inline int kprobe_handler(struct pt_regs *regs)
+{
+	struct kprobe *p;
+	int ret = 0;
+	unsigned int *addr = (unsigned int *)regs->nip;
+
+	/* We're in an interrupt, but this is clear and BUG()-safe. */
+	preempt_disable();
+
+	/* Check we're not actually recursing */
+	if (kprobe_running()) {
+		/* We *are* holding lock here, so this is safe.
+		   Disarm the probe we just hit, and ignore it. */
+		p = get_kprobe(addr);
+		if (p) {
+			disarm_kprobe(p, regs);
+			ret = 1;
+		} else {
+			p = current_kprobe;
+			if (p->break_handler && p->break_handler(p, regs)) {
+				goto ss_probe;
+			}
+		}
+		/* If it's not ours, can't be delete race, (we hold lock). */
+		goto no_kprobe;
+	}
+
+	lock_kprobes();
+	p = get_kprobe(addr);
+	if (!p) {
+		unlock_kprobes();
+		if (*addr != BREAKPOINT_INSTRUCTION) {
+			/*
+			 * The breakpoint instruction was removed right
+			 * after we hit it.  Another cpu has removed
+			 * either a probepoint or a debugger breakpoint
+			 * at this address.  In either case, no further
+			 * handling of this interrupt is appropriate.
+			 */
+			ret = 1;
+		}
+		/* Not one of ours: let kernel handle it */
+		goto no_kprobe;
+	}
+
+	kprobe_status = KPROBE_HIT_ACTIVE;
+	current_kprobe = p;
+	kprobe_saved_msr = regs->msr;
+	if (p->pre_handler(p, regs)) {
+		/* handler has already set things up, so skip ss setup */
+		return 1;
+	}
+
+ss_probe:
+	prepare_singlestep(p, regs);
+	kprobe_status = KPROBE_HIT_SS;
+	return 1;
+
+no_kprobe:
+	preempt_enable_no_resched();
+	return ret;
+}
+
+/*
+ * Called after single-stepping.  p->addr is the address of the
+ * instruction whose first byte has been replaced by the "breakpoint"
+ * instruction.  To avoid the SMP problems that can occur when we
+ * temporarily put back the original opcode to single-step, we
+ * single-stepped a copy of the instruction.  The address of this
+ * copy is p->insn.
+ */
+static void resume_execution(struct kprobe *p, struct pt_regs *regs)
+{
+	int ret;
+
+	regs->nip = (unsigned long)p->addr;
+	ret = emulate_step(regs, p->insn[0]);
+	if (ret == 0) 
+		regs->nip = (unsigned long)p->addr + 4;
+
+	regs->msr &= ~MSR_SE;
+}
+
+static inline int post_kprobe_handler(struct pt_regs *regs)
+{
+	if (!kprobe_running())
+		return 0;
+
+	if (current_kprobe->post_handler)
+		current_kprobe->post_handler(current_kprobe, regs, 0);
+
+	resume_execution(current_kprobe, regs);
+	regs->msr |= kprobe_saved_msr;
+
+	unlock_kprobes();
+	preempt_enable_no_resched();
+
+	/*
+	 * if somebody else is singlestepping across a probe point, msr
+	 * will have SE set, in which case, continue the remaining processing
+	 * of do_debug, as if this is not a probe hit.
+	 */
+	if (regs->msr & MSR_SE)
+		return 0;
+
+	return 1;
+}
+
+/* Interrupts disabled, kprobe_lock held. */
+static inline int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
+{
+	if (current_kprobe->fault_handler
+	    && current_kprobe->fault_handler(current_kprobe, regs, trapnr))
+		return 1;
+
+	if (kprobe_status & KPROBE_HIT_SS) {
+		resume_execution(current_kprobe, regs);
+		regs->msr |= kprobe_saved_msr;
+
+		unlock_kprobes();
+		preempt_enable_no_resched();
+	}
+	return 0;
+}
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int kprobe_exceptions_notify(struct notifier_block *self, unsigned long val,
+			     void *data)
+{
+	struct die_args *args = (struct die_args *)data;
+	switch (val) {
+	case DIE_IABR_MATCH:
+	case DIE_DABR_MATCH:
+	case DIE_BPT:
+		if (kprobe_handler(args->regs))
+			return NOTIFY_STOP;
+		break;
+	case DIE_SSTEP:
+		if (post_kprobe_handler(args->regs))
+			return NOTIFY_STOP;
+		break;
+	case DIE_GPF:
+	case DIE_PAGE_FAULT:
+		if (kprobe_running() &&
+		    kprobe_fault_handler(args->regs, args->trapnr))
+			return NOTIFY_STOP;
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+int setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs)
+{
+	struct jprobe *jp = container_of(p, struct jprobe, kp);
+	
+	memcpy(&jprobe_saved_regs, regs, sizeof(struct pt_regs));
+
+	/* setup return addr to the jprobe handler routine */
+	regs->nip = (unsigned long)(((func_descr_t *)jp->entry)->entry);
+	regs->gpr[2] = (unsigned long)(((func_descr_t *)jp->entry)->toc);
+
+	return 1;
+}
+
+void jprobe_return(void)
+{
+	preempt_enable_no_resched();
+	asm volatile("trap" ::: "memory");
+}
+
+void jprobe_return_end(void)
+{
+};
+
+int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
+{
+	/* 
+	 * FIXME - we should ideally be validating that we got here 'cos
+	 * of the "trap" in jprobe_return() above, before restoring the
+	 * saved regs...
+	 */
+	memcpy(regs, &jprobe_saved_regs, sizeof(struct pt_regs));
+	return 1;
+} 
diff -Naurp temp/linux-2.6.9-rc4/arch/ppc64/kernel/Makefile linux-2.6.9-rc4/arch/ppc64/kernel/Makefile
--- temp/linux-2.6.9-rc4/arch/ppc64/kernel/Makefile	2004-10-11 08:28:50.000000000 +0530
+++ linux-2.6.9-rc4/arch/ppc64/kernel/Makefile	2004-10-11 15:30:41.000000000 +0530
@@ -56,5 +56,6 @@ obj-$(CONFIG_PPC_PMAC)		+= pmac_smp.o sm
 endif
 
 obj-$(CONFIG_ALTIVEC)		+= vecemu.o vector.o
+obj-$(CONFIG_KPROBES)		+= kprobes.o
 
 CFLAGS_ioctl32.o += -Ifs/
diff -Naurp temp/linux-2.6.9-rc4/arch/ppc64/kernel/traps.c linux-2.6.9-rc4/arch/ppc64/kernel/traps.c
--- temp/linux-2.6.9-rc4/arch/ppc64/kernel/traps.c	2004-10-11 08:27:59.000000000 +0530
+++ linux-2.6.9-rc4/arch/ppc64/kernel/traps.c	2004-10-11 15:30:41.000000000 +0530
@@ -29,6 +29,7 @@
 #include <linux/interrupt.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <asm/kdebug.h>
 
 #include <asm/pgtable.h>
 #include <asm/uaccess.h>
@@ -61,6 +62,20 @@ EXPORT_SYMBOL(__debugger_dabr_match);
 EXPORT_SYMBOL(__debugger_fault_handler);
 #endif
 
+struct notifier_block *ppc64_die_chain;
+static spinlock_t die_notifier_lock = SPIN_LOCK_UNLOCKED;
+
+int register_die_notifier(struct notifier_block *nb)
+{
+	int err = 0;
+	unsigned long flags;
+
+	spin_lock_irqsave(&die_notifier_lock, flags);
+	err = notifier_chain_register(&ppc64_die_chain, nb);
+	spin_unlock_irqrestore(&die_notifier_lock, flags);
+	return err;
+}
+
 /*
  * Trap & Exception support
  */
@@ -287,6 +302,9 @@ UnknownException(struct pt_regs *regs)
 void
 InstructionBreakpointException(struct pt_regs *regs)
 {
+	if (notify_die(DIE_BPT, "iabr_match", regs, 5,
+					5, SIGTRAP) == NOTIFY_STOP)
+		return;
 	if (debugger_iabr_match(regs))
 		return;
 	_exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip);
@@ -297,6 +315,9 @@ SingleStepException(struct pt_regs *regs
 {
 	regs->msr &= ~MSR_SE;  /* Turn off 'trace' bit */
 
+	if (notify_die(DIE_SSTEP, "single_step", regs, 5,
+					5, SIGTRAP) == NOTIFY_STOP)
+		return;
 	if (debugger_sstep(regs))
 		return;
 
@@ -470,6 +491,9 @@ ProgramCheckException(struct pt_regs *re
 	} else if (regs->msr & 0x20000) {
 		/* trap exception */
 
+		if (notify_die(DIE_BPT, "breakpoint", regs, 5,
+					5, SIGTRAP) == NOTIFY_STOP)
+			return;
 		if (debugger_bpt(regs))
 			return;
 
diff -Naurp temp/linux-2.6.9-rc4/arch/ppc64/mm/fault.c linux-2.6.9-rc4/arch/ppc64/mm/fault.c
--- temp/linux-2.6.9-rc4/arch/ppc64/mm/fault.c	2004-10-11 08:28:24.000000000 +0530
+++ linux-2.6.9-rc4/arch/ppc64/mm/fault.c	2004-10-11 15:30:41.000000000 +0530
@@ -36,6 +36,7 @@
 #include <asm/mmu_context.h>
 #include <asm/system.h>
 #include <asm/uaccess.h>
+#include <asm/kdebug.h>
 
 /*
  * Check whether the instruction at regs->nip is a store using
@@ -96,6 +97,9 @@ int do_page_fault(struct pt_regs *regs, 
 	BUG_ON((trap == 0x380) || (trap == 0x480));
 
 	if (trap == 0x300) {
+		if (notify_die(DIE_PAGE_FAULT, "page_fault", regs, error_code,
+					11, SIGSEGV) == NOTIFY_STOP)
+			return 0;
 		if (debugger_fault_handler(regs))
 			return 0;
 	}
@@ -105,6 +109,9 @@ int do_page_fault(struct pt_regs *regs, 
 		return SIGSEGV;
 
 	if (error_code & 0x00400000) {
+		if (notify_die(DIE_BPT, "dabr_match", regs, error_code,
+					11, SIGSEGV) == NOTIFY_STOP)
+			return 0;
 		if (debugger_dabr_match(regs))
 			return 0;
 	}
diff -Naurp temp/linux-2.6.9-rc4/arch/ppc64/xmon/xmon.c linux-2.6.9-rc4/arch/ppc64/xmon/xmon.c
--- temp/linux-2.6.9-rc4/arch/ppc64/xmon/xmon.c	2004-10-11 08:28:48.000000000 +0530
+++ linux-2.6.9-rc4/arch/ppc64/xmon/xmon.c	2004-10-11 15:30:41.000000000 +0530
@@ -132,7 +132,7 @@ static void csum(void);
 static void bootcmds(void);
 void dump_segments(void);
 static void symbol_lookup(void);
-static int emulate_step(struct pt_regs *regs, unsigned int instr);
+int emulate_step(struct pt_regs *regs, unsigned int instr);
 static void xmon_print_symbol(unsigned long address, const char *mid,
 			      const char *after);
 static const char *getvecname(unsigned long vec);
@@ -781,7 +781,7 @@ static int branch_taken(unsigned int ins
  * or -1 if the instruction is one that should not be stepped,
  * such as an rfid, or a mtmsrd that would clear MSR_RI.
  */
-static int emulate_step(struct pt_regs *regs, unsigned int instr)
+int emulate_step(struct pt_regs *regs, unsigned int instr)
 {
 	unsigned int opcode, rd;
 	unsigned long int imm;
diff -Naurp temp/linux-2.6.9-rc4/arch/sparc64/kernel/kprobes.c linux-2.6.9-rc4/arch/sparc64/kernel/kprobes.c
--- temp/linux-2.6.9-rc4/arch/sparc64/kernel/kprobes.c	2004-10-11 08:28:49.000000000 +0530
+++ linux-2.6.9-rc4/arch/sparc64/kernel/kprobes.c	2004-10-11 15:30:41.000000000 +0530
@@ -38,10 +38,11 @@
  * - Mark that we are no longer actively in a kprobe.
  */
 
-void arch_prepare_kprobe(struct kprobe *p)
+int arch_prepare_kprobe(struct kprobe *p)
 {
 	p->insn[0] = *p->addr;
 	p->insn[1] = BREAKPOINT_INSTRUCTION_2;
+	return 0;
 }
 
 /* kprobe_status settings */
diff -Naurp temp/linux-2.6.9-rc4/include/asm-i386/kprobes.h linux-2.6.9-rc4/include/asm-i386/kprobes.h
--- temp/linux-2.6.9-rc4/include/asm-i386/kprobes.h	2004-10-11 08:28:07.000000000 +0530
+++ linux-2.6.9-rc4/include/asm-i386/kprobes.h	2004-10-11 19:28:07.000000000 +0530
@@ -38,6 +38,8 @@ typedef u8 kprobe_opcode_t;
 	? (MAX_STACK_SIZE) \
 	: (((unsigned long)current_thread_info()) + THREAD_SIZE - (ADDR)))
 
+#define JPROBE_ENTRY(pentry)	(kprobe_opcode_t *)pentry
+
 /* trap3/1 are intr gates for kprobes.  So, restore the status of IF,
  * if necessary, before executing the original int3/1 (trap) handler.
  */
diff -Naurp temp/linux-2.6.9-rc4/include/asm-ppc64/kdebug.h linux-2.6.9-rc4/include/asm-ppc64/kdebug.h
--- temp/linux-2.6.9-rc4/include/asm-ppc64/kdebug.h	1970-01-01 05:30:00.000000000 +0530
+++ linux-2.6.9-rc4/include/asm-ppc64/kdebug.h	2004-10-11 15:30:41.000000000 +0530
@@ -0,0 +1,43 @@
+#ifndef _PPC64_KDEBUG_H
+#define _PPC64_KDEBUG_H 1
+
+/* nearly identical to x86_64/i386 code */
+
+#include <linux/notifier.h>
+
+struct pt_regs;
+
+struct die_args {
+	struct pt_regs *regs;
+	const char *str;
+	long err;
+	int trapnr;
+	int signr;
+};
+
+/* 
+   Note - you should never unregister because that can race with NMIs.
+   If you really want to do it first unregister - then synchronize_kernel - 
+   then free.
+ */
+int register_die_notifier(struct notifier_block *nb);
+extern struct notifier_block *ppc64_die_chain;
+
+/* Grossly misnamed. */
+enum die_val {
+	DIE_OOPS = 1,
+	DIE_IABR_MATCH,
+	DIE_DABR_MATCH,
+	DIE_BPT,
+	DIE_SSTEP,
+	DIE_GPF,
+	DIE_PAGE_FAULT,
+};
+
+static inline int notify_die(enum die_val val,char *str,struct pt_regs *regs,long err,int trap, int sig)
+{
+	struct die_args args = { .regs=regs, .str=str, .err=err, .trapnr=trap,.signr=sig };
+	return notifier_call_chain(&ppc64_die_chain, val, &args);
+}
+
+#endif
diff -Naurp temp/linux-2.6.9-rc4/include/asm-ppc64/kprobes.h linux-2.6.9-rc4/include/asm-ppc64/kprobes.h
--- temp/linux-2.6.9-rc4/include/asm-ppc64/kprobes.h	1970-01-01 05:30:00.000000000 +0530
+++ linux-2.6.9-rc4/include/asm-ppc64/kprobes.h	2004-10-12 22:57:04.000000000 +0530
@@ -0,0 +1,53 @@
+#ifndef _ASM_KPROBES_H
+#define _ASM_KPROBES_H
+/*
+ *  Kernel Probes (KProbes)
+ *  include/asm-ppc64/kprobes.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2002, 2004
+ *
+ * 2002-Oct	Created by Vamsi Krishna S <vamsi_krishna at in.ibm.com> Kernel
+ *		Probes initial implementation ( includes suggestions from
+ *		Rusty Russell).
+ * 2004-Oct	Modified for PPC64 by Ananth N Mavinakayanahalli
+ *		<ananth at in.ibm.com>
+ */
+#include <linux/types.h>
+#include <linux/ptrace.h>
+
+struct pt_regs;
+
+typedef unsigned int kprobe_opcode_t;
+#define BREAKPOINT_INSTRUCTION	0x7fe00008	/* trap */
+#define MAX_INSN_SIZE 1
+
+#define IS_MTMSRD(instr)	(((instr) & 0xfc0007fe) == 0x7c000164)
+#define IS_RFID(instr)		(((instr) & 0xfc0007fe) == 0x4c000024)
+
+#define JPROBE_ENTRY(pentry)	(kprobe_opcode_t *)((func_descr_t *)pentry)
+
+#ifdef CONFIG_KPROBES
+extern int kprobe_exceptions_notify(struct notifier_block *self,
+				    unsigned long val, void *data);
+#else				/* !CONFIG_KPROBES */
+static inline int kprobe_exceptions_notify(struct notifier_block *self,
+					   unsigned long val, void *data)
+{
+	return 0;
+}
+#endif
+#endif				/* _ASM_KPROBES_H */
diff -Naurp temp/linux-2.6.9-rc4/include/linux/kprobes.h linux-2.6.9-rc4/include/linux/kprobes.h
--- temp/linux-2.6.9-rc4/include/linux/kprobes.h	2004-10-11 08:27:16.000000000 +0530
+++ linux-2.6.9-rc4/include/linux/kprobes.h	2004-10-11 15:30:41.000000000 +0530
@@ -94,7 +94,7 @@ static inline int kprobe_running(void)
 	return kprobe_cpu == smp_processor_id();
 }
 
-extern void arch_prepare_kprobe(struct kprobe *p);
+extern int arch_prepare_kprobe(struct kprobe *p);
 extern void show_registers(struct pt_regs *regs);
 
 /* Get the kprobe at this addr (if any).  Must have called lock_kprobes */
diff -Naurp temp/linux-2.6.9-rc4/kernel/kprobes.c linux-2.6.9-rc4/kernel/kprobes.c
--- temp/linux-2.6.9-rc4/kernel/kprobes.c	2004-10-11 08:29:12.000000000 +0530
+++ linux-2.6.9-rc4/kernel/kprobes.c	2004-10-11 15:30:41.000000000 +0530
@@ -27,6 +27,8 @@
  *		interface to access function arguments.
  * 2004-Sep	Prasanna S Panchamukhi <prasanna at in.ibm.com> Changed Kprobes
  *		exceptions notifier to be first on the priority list.
+ * 2004-Oct	Ananth N Mavinakayanahalli <ananth at in.ibm.com> 
+ *		arch_prepare_kprobe now returns an int.
  */
 #include <linux/kprobes.h>
 #include <linux/spinlock.h>
@@ -87,12 +89,17 @@ int register_kprobe(struct kprobe *p)
 	hlist_add_head(&p->hlist,
 		       &kprobe_table[hash_ptr(p->addr, KPROBE_HASH_BITS)]);
 
-	arch_prepare_kprobe(p);
+	ret = arch_prepare_kprobe(p);
+	if (ret) {
+		unregister_kprobe(p);
+		ret = -EINVAL;
+		goto out;
+	}
 	p->opcode = *p->addr;
 	*p->addr = BREAKPOINT_INSTRUCTION;
 	flush_icache_range((unsigned long) p->addr,
 			   (unsigned long) p->addr + sizeof(kprobe_opcode_t));
-      out:
+out:
 	spin_unlock_irqrestore(&kprobe_lock, flags);
 	return ret;
 }


From segher at kernel.crashing.org  Mon Oct 18 19:55:26 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Mon, 18 Oct 2004 11:55:26 +0200
Subject: [vHype-discussion] u64 in linux
In-Reply-To: <200410152059.03647.arnd@arndb.de>
References: <1097849471.25095.97.camel@brick.watson.ibm.com>
	<16751.62061.393716.650492@kitch0.watson.ibm.com>
	<200410152059.03647.arnd@arndb.de>
Message-ID: <D61A9855-20EB-11D9-862F-000A95A4DC02@kernel.crashing.org>

> C99 also mandates that the macro PRIu64 contains the correct
> format string for uint64_t (which afaik is always the same as u64).
> It's currently not defined in linux, but could perhaps be added.

Works fine for me:

	#include <inttypes.h>

	char x[] = PRIx64;
	char u[] = PRIu64;

resulting in

         .globl u
         .section        ".data"
         .align 3
         .type   u, @object
         .size   u, 3
u:
         .string "lu"
         .globl x
         .align 3
         .type   x, @object
         .size   x, 3
x:
         .string "lx"

(this is on a PPC64 system, GCC 3.4.1).
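
For completeness, a minimal userspace sketch of how these macros are
normally used in a format string. This assumes a hosted C99 environment
with <inttypes.h>; the kernel itself does not ship that header, and the
helper name format_u64() is invented for the example.

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Format a 64-bit value in decimal and hex without hard-coding
 * "%lu" or "%llu". The PRI* macros expand to the length modifier
 * that matches uint64_t on the target.
 */
int format_u64(char *buf, size_t len, uint64_t v)
{
	return snprintf(buf, len, "%" PRIu64 " (0x%" PRIx64 ")", v, v);
}
```

The same source then builds unchanged whether uint64_t is unsigned long
or unsigned long long on the target.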


Segher


From benh at kernel.crashing.org  Mon Oct 18 20:58:27 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 18 Oct 2004 20:58:27 +1000
Subject: [PATCH] allow kernel compile with native ppc64 compiler
In-Reply-To: <20041018075433.GA24927@suse.de>
References: <20041017185557.GA9619@suse.de>
	<16754.59442.992185.715900@cargo.ozlabs.ibm.com>
	<20041018045603.GA8500@suse.de>
	<16755.23272.754150.209624@cargo.ozlabs.ibm.com>
	<20041018075433.GA24927@suse.de>
Message-ID: <1098097106.30570.6.camel@gaston>

On Mon, 2004-10-18 at 17:54, Olaf Hering wrote:

> 
> > Ben H suggested making the default BOOTCC be $(CC) -m32, which makes
> > sense to me.

How so ? The idea is to add -m32 to whatever compiler you are using for
the rest of the kernel (assuming bi-arch) which is a lot more sane than
using whatever _local_ compiler you are using _and_ assuming bi-arch...

Of course, that would only be the "default", with the ability to explicitly
pass CROSS32_COMPILE to make...

Ben.


From nathanl at austin.ibm.com  Tue Oct 19 00:46:44 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Mon, 18 Oct 2004 09:46:44 -0500
Subject: [RFC] maxcpus boot option leads to dropped interrupts
Message-ID: <1098110804.3165.63.camel@biclops>

Hi-

Our test group has discovered that booting a 2.6 kernel on an SMP pSeries
LPAR with maxcpus=1 will either hang or take a very long time to boot,
with lots of dropped interrupt messages or scsi timeouts, e.g.

Probing IDE interface ide2...
hde: IBM DROM00205, ATAPI CD/DVD-ROM drive
Using cfq io scheduler
ide2 at 0xfe400-0xfe407,0xfdc02 on irq 166
Probing IDE interface ide3...
Probing IDE interface ide3...
hde: ATAPI 24X DVD-ROM drive, 256kB Cache
Uniform CD-ROM driver Revision: 3.20
ide-cd: cmd 0x25 timed out
hde: lost interrupt
hde: lost interrupt

The problem goes away if CONFIG_IRQ_ALL_CPUS is not set.

I am about 85% sure that this is due to the OF "start-cpu" method
placing the primary threads of secondary cpus in the global interrupt
queue (see the comment in arch/ppc64/kernel/smp.c::start_secondary). 
With the maxcpus parameter, we never "boot" those cpus; they simply sit
in their spin loops waiting to be kicked.  However, from the platform's
point of view they are fair game to service device interrupts.

The RTAS "start-cpu" method apparently does not behave the same way -- I
can boot a single CPU (with SMT) Power5 LPAR with maxcpus=1 and
interrupts are not lost, even though the secondary thread on the boot
cpu has been started by RTAS.  So this problem is limited to systems
which have more than one cpu device node.

I've worked around the problem by modifying the xics code to use the
default interrupt server (the boot cpu) if cpu_online_map !=
cpu_present_map.  However that's a nasty hack which will keep interrupts
from being distributed in the smt-enabled=off case.

I'm not sure whether this happens on non-xics machines.

I'm looking for ideas on how to handle this.  Some options that occur to
me are:

o  Not booting secondary cpus from the OF client code (but the PPC-OF
binding document says we can't do this).  I believe I've tried this
before, and RTAS was unable to start the secondary cpus later.  So this
is probably not the way to go.

o  In smp_cpus_done(), "shoot down" any cpus which have not been kicked
out of their spin loops.  I've got a very rough version of this
working.  However, this method assumes that the RTAS "stop-cpu"
interface is available, which is a given on LPAR, but I'm not sure it's
a safe bet on other systems.

o  Directing interrupts to the boot cpu instead of using the GIQ when
the maxcpus option is detected.  This might be the easiest alternative;
however this could have a performance impact.

Any other ideas?  Keep in mind that I would like to get the code to a
state which will allow us to hotplug-online cpus which were not started
at boot.


Nathan


From olof at austin.ibm.com  Tue Oct 19 05:40:27 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Mon, 18 Oct 2004 14:40:27 -0500
Subject: [PATCH] [PPC64] Fix CPU numa init code thinkos
Message-ID: <20041018194027.GA11753@4>

There seem to have been a couple of thinkos in the NUMA init code,
in particular in find_cpu_node():

* Property size returned is in bytes, not words
* Off-by-one error in loop iteration


Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>
Signed-off-by: Olof Johansson <olof at austin.ibm.com>


---

 linux-2.5-olof/arch/ppc64/mm/numa.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletion(-)

diff -puN arch/ppc64/mm/numa.c~find-cpu-node arch/ppc64/mm/numa.c
--- linux-2.5/arch/ppc64/mm/numa.c~find-cpu-node	2004-10-18 14:21:55.603312384 -0500
+++ linux-2.5-olof/arch/ppc64/mm/numa.c	2004-10-18 14:22:19.271552232 -0500
@@ -75,9 +75,11 @@ static struct device_node * __init find_
 		interrupt_server = (unsigned int *)get_property(cpu_node,
 					"ibm,ppc-interrupt-server#s", &len);
 
+		len = len / sizeof(u32);
+
 		if (interrupt_server && (len > 0)) {
 			while (len--) {
-				if (interrupt_server[len-1] == hw_cpuid)
+				if (interrupt_server[len] == hw_cpuid)
 					return cpu_node;
 			}
 		} else {

_
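
The two fixes can be seen in isolation in the following userspace sketch.
It is not the kernel code: len_bytes stands in for the byte length that
get_property() hands back, and the function name is made up for the
example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Search a property holding an array of 32-bit interrupt server
 * numbers. Returns 1 if hw_cpuid is present, 0 otherwise.
 */
int server_list_contains(const uint32_t *servers, int len_bytes,
			 uint32_t hw_cpuid)
{
	/* fix 1: the property length is in bytes, convert to entries */
	int len = len_bytes / (int)sizeof(uint32_t);

	if (!servers || len <= 0)
		return 0;

	while (len--) {
		/*
		 * fix 2: the post-decrement in the loop condition has
		 * already moved len onto a valid index (n-1 down to 0),
		 * so index with len, not len-1.
		 */
		if (servers[len] == hw_cpuid)
			return 1;
	}
	return 0;
}
```

With the old servers[len-1] indexing, the last entry was never compared
and the final iteration read one element before the start of the array.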


From benh at kernel.crashing.org  Tue Oct 19 09:28:42 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 19 Oct 2004 09:28:42 +1000
Subject: ppc64 breakage.
In-Reply-To: <20041018222658.GA31577@redhat.com>
References: <20041018222658.GA31577@redhat.com>
Message-ID: <1098142122.18687.35.camel@gaston>

On Tue, 2004-10-19 at 08:26, Dave Jones wrote:
> hey guys,
> 
> During a build for an iseries kernel, it blew up with ..
> 
> arch/ppc64/kernel/built-in.o(.text+0x1cd5c): In function `ioport_map':
> arch/ppc64/kernel/iomap.c:84: undefined reference to `._IO_IS_VALID'
> make: *** [.tmp_vmlinux1] Error 1
> 
> Ideas ?
> 
> The '.' looks odd. Toolchain bug ?

No, it's my fault and I hate iSeries ! You can't do anything in this
arch without breaking it :(

Why do these systematically pop up just when Linus has released the new
kernel ? Grrrrrr

Anton/Paul, does iSeries have any kind of PIO on PCI at all ? Should I do

#define _IO_IS_VALID(port)	(0)

or

#define _IO_IS_VALID(port)	(1)

For iSeries ?

Ben.


From benh at kernel.crashing.org  Tue Oct 19 09:33:56 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 19 Oct 2004 09:33:56 +1000
Subject: ppc64 build failure.
In-Reply-To: <20041018230529.GB31577@redhat.com>
References: <20041018230529.GB31577@redhat.com>
Message-ID: <1098142435.18679.38.camel@gaston>

On Tue, 2004-10-19 at 09:05, Dave Jones wrote:
> Ignore previous mail, this should fix it.
>
> .../...

What about this one instead ? io_page_mask is set to 0 by default,
so iSeries would automatically get _IO_IS_VALID(*) == 0 if it doesn't
initialize it...

===== include/asm-ppc64/eeh.h 1.20 vs edited =====
--- 1.20/include/asm-ppc64/eeh.h	2004-10-06 16:05:23 +10:00
+++ edited/include/asm-ppc64/eeh.h	2004-10-19 09:31:54 +10:00
@@ -256,10 +256,6 @@
 
 #undef EEH_CHECK_ALIGN
 
-#define MAX_ISA_PORT 0x10000
-extern unsigned long io_page_mask;
-#define _IO_IS_VALID(port) ((port) >= MAX_ISA_PORT || (1 << (port>>PAGE_SHIFT)) & io_page_mask)
-
 static inline u8 eeh_inb(unsigned long port) {
 	u8 val;
 	if (!_IO_IS_VALID(port))
===== include/asm-ppc64/io.h 1.22 vs edited =====
--- 1.22/include/asm-ppc64/io.h	2004-09-21 19:14:10 +10:00
+++ edited/include/asm-ppc64/io.h	2004-10-19 09:32:20 +10:00
@@ -33,6 +33,12 @@
 
 extern unsigned long isa_io_base;
 extern unsigned long pci_io_base;
+extern unsigned long io_page_mask;
+
+#define MAX_ISA_PORT 0x10000
+
+#define _IO_IS_VALID(port) ((port) >= MAX_ISA_PORT || (1 << (port>>PAGE_SHIFT)) \
+			    & io_page_mask)
 
 #ifdef CONFIG_PPC_ISERIES
 /* __raw_* accessors aren't supported on iSeries */


From benh at kernel.crashing.org  Tue Oct 19 11:03:29 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 19 Oct 2004 11:03:29 +1000
Subject: ppc64 build failure.
In-Reply-To: <1098142435.18679.38.camel@gaston>
References: <20041018230529.GB31577@redhat.com>
	<1098142435.18679.38.camel@gaston>
Message-ID: <1098147809.11402.0.camel@gaston>

On Tue, 2004-10-19 at 09:33, Benjamin Herrenschmidt wrote:
> On Tue, 2004-10-19 at 09:05, Dave Jones wrote:
> > Ignore previous mail, this should fix it.
> >
> > .../...
> 
> What about this one instead ? io_page_mask is set to 0 by default,
> so iSeries would automatically get _IO_IS_VALID(*) == 0 if it doesn't
> initialize it...

OK, since nobody seems to really know what IO cycles are on iSeries,
let's allow them rather than mask them, thus falling back to the former
behaviour...

===== include/asm-ppc64/eeh.h 1.20 vs edited =====
--- 1.20/include/asm-ppc64/eeh.h	2004-10-06 16:05:23 +10:00
+++ edited/include/asm-ppc64/eeh.h	2004-10-19 09:31:54 +10:00
@@ -256,10 +256,6 @@
 
 #undef EEH_CHECK_ALIGN
 
-#define MAX_ISA_PORT 0x10000
-extern unsigned long io_page_mask;
-#define _IO_IS_VALID(port) ((port) >= MAX_ISA_PORT || (1 << (port>>PAGE_SHIFT)) & io_page_mask)
-
 static inline u8 eeh_inb(unsigned long port) {
 	u8 val;
 	if (!_IO_IS_VALID(port))
===== include/asm-ppc64/io.h 1.22 vs edited =====
--- 1.22/include/asm-ppc64/io.h	2004-09-21 19:14:10 +10:00
+++ edited/include/asm-ppc64/io.h	2004-10-19 09:32:20 +10:00
@@ -33,6 +33,12 @@
 
 extern unsigned long isa_io_base;
 extern unsigned long pci_io_base;
+extern unsigned long io_page_mask;
+
+#define MAX_ISA_PORT 0x10000
+
+#define _IO_IS_VALID(port) ((port) >= MAX_ISA_PORT || (1 << (port>>PAGE_SHIFT)) \
+			    & io_page_mask)
 
 #ifdef CONFIG_PPC_ISERIES
 /* __raw_* accessors aren't supported on iSeries */
===== arch/ppc64/kernel/iSeries_pci.c 1.24 vs edited =====
--- 1.24/arch/ppc64/kernel/iSeries_pci.c	2004-09-11 15:50:12 +10:00
+++ edited/arch/ppc64/kernel/iSeries_pci.c	2004-10-19 11:02:20 +10:00
@@ -55,6 +55,7 @@
 extern unsigned long iSeries_Base_Io_Memory;    
 
 extern struct iommu_table *tceTables[256];
+extern unsigned long io_page_mask;
 
 extern void iSeries_MmIoTest(void);
 
@@ -196,6 +197,7 @@
 	PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Entry.\n"); 
 	iSeries_IoMmTable_Initialize();
 	find_and_init_phbs();
+	io_page_mask = -1;
 	/* pci_assign_all_busses = 0;		SFRXXX*/
 	PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Exit.\n"); 
 }


From benh at kernel.crashing.org  Tue Oct 19 18:28:20 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 19 Oct 2004 18:28:20 +1000
Subject: [PATCH] generic irq subsystem: ppc64 port
In-Reply-To: <200410190714.i9J7Elnx027734@hera.kernel.org>
References: <200410190714.i9J7Elnx027734@hera.kernel.org>
Message-ID: <1098174500.11449.65.camel@gaston>

Hi !

That patch will unfortunately break a load of ppc64 boxes.

If you look closely at the ppc64 code, you'll notice we don't
use the irq_desc array directly but go through a get_irq_desc()
accessor. This is because our interrupt numbers can be very
large and scattered, and thus we have a remapping tree.

I still like the idea of the patch, so it would be useful if
you added the possibility for us to just change that behaviour,
that is replace all occurrences of irq_descs + i with get_irq_desc()
and provide a generic one that just does that, with a #ifndef so
that the architecture can provide its own. 

If you agree with the principle, though, I suppose I can do it
and send a proposed patch tomorrow.
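
A minimal sketch of the idea, for illustration only and not the proposed patch. NR_IRQS is shrunk to a toy value, struct irq_desc is reduced to one field, and set_irq_status() is a hypothetical caller; the only point is that generic code goes through the accessor and an architecture can define its own get_irq_desc first.

```c
#include <assert.h>

#define NR_IRQS 16             /* assumption: tiny table for illustration */

struct irq_desc {
	unsigned int status;
};

struct irq_desc irq_desc[NR_IRQS];

/*
 * Generic fallback: a plain 1:1 array lookup.  An architecture with
 * sparse interrupt numbers would define get_irq_desc before this point
 * and supply its own lookup instead.
 */
#ifndef get_irq_desc
#define get_irq_desc(irq) (&irq_desc[(irq)])
#endif

/* Hypothetical caller: uses the accessor, never irq_desc + i. */
void set_irq_status(unsigned int irq, unsigned int status)
{
	assert(irq < NR_IRQS);
	get_irq_desc(irq)->status = status;
}
```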

Ben.


From hch at infradead.org  Tue Oct 19 18:41:32 2004
From: hch at infradead.org (Christoph Hellwig)
Date: Tue, 19 Oct 2004 09:41:32 +0100
Subject: [PATCH] generic irq subsystem: ppc64 port
In-Reply-To: <1098174500.11449.65.camel@gaston>
References: <200410190714.i9J7Elnx027734@hera.kernel.org>
	<1098174500.11449.65.camel@gaston>
Message-ID: <20041019084131.GA7100@infradead.org>

On Tue, Oct 19, 2004 at 06:28:20PM +1000, Benjamin Herrenschmidt wrote:
> Hi !
> 
> That patch will unfortunately break a load of ppc64 boxes.
> 
> If you look closely at the ppc64 code, you'll notice we don't
> use the irq_desc array directly but go through a get_irq_desc()
> accessor. This is because our interrupt numbers can be very
> large and scattered, and thus we have a remapping tree.
> 
> I still like the idea of the patch, so it would be useful if
> you added the possibility for us to just change that behaviour,
> that is replace all occurrences of irq_descs + i with get_irq_desc()
> and provide a generic one that just does that, with a #ifndef so
> that the architecture can provide its own. 
> 
> If you agree with the principle, though, I suppose I can do it
> and send a proposed patch tomorrow.

The PPC64 changes were actually my fault.  I think get_irq_desc() is okay.


From mingo at elte.hu  Tue Oct 19 19:15:57 2004
From: mingo at elte.hu (Ingo Molnar)
Date: Tue, 19 Oct 2004 11:15:57 +0200
Subject: [PATCH] generic irq subsystem: ppc64 port
In-Reply-To: <1098174500.11449.65.camel@gaston>
References: <200410190714.i9J7Elnx027734@hera.kernel.org>
	<1098174500.11449.65.camel@gaston>
Message-ID: <20041019091557.GA17473@elte.hu>


* Benjamin Herrenschmidt <benh at kernel.crashing.org> wrote:

> I still like the idea of the patch, so it would be useful if you added
> the possibility for us to just change that behaviour, that is replace
> all occurrences of irq_descs + i with get_irq_desc() and provide a
> generic one that just does that, with a #ifndef so that the
> architecture can provide its own. 

sure, we could do that. But since there are other architectures with
large irq-vector spaces too, you might want to try to move it into the
generic IRQ code and just provide a way to switch between 1:1 mapped and
sparse-mapped variants.

(of course this still means all of the direct indexing in kernel/irq/*.c
would have to change.)

	Ingo


From sfr at canb.auug.org.au  Wed Oct 20 03:05:30 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 20 Oct 2004 03:05:30 +1000
Subject: test - please ignore
Message-ID: <20041020030530.582725f7.sfr@canb.auug.org.au>

Just a test after updating the archives.

-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From sfr at canb.auug.org.au  Wed Oct 20 03:16:59 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 20 Oct 2004 03:16:59 +1000
Subject: old mailing list archives
Message-ID: <20041020031659.220bdfeb.sfr@canb.auug.org.au>

Hi all,

Thanks to Wolfgang Denk I have recovered (some of) the old list archives.
Please see http://ozlabs.org/pipermail/linuxppc64-dev/

I don't know how complete the archive is ...

-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


From jschopp at austin.ibm.com  Wed Oct 20 02:52:20 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Tue, 19 Oct 2004 11:52:20 -0500
Subject: status of ppc64 patches
Message-ID: <41754644.1010003@austin.ibm.com>

2.6.9 is now in rc4.  Linus claims that the final 2.6.9 is very close. 
Thus, I expect the floodgates into mainline to open soon.  I would hope 
that my patches would be sent on by the architecture maintainers at that 
time.

I am concerned that we may be falling behind on reviewing patches in 
general and if we don't catch up several very deserving patches may miss 
this next window of opportunity.  The backlog of "New" patches is over a 
month long now.  http://ozlabs.org/ppc64-patches/

Either this page is out of date or we have a very serious bottleneck 
problem.  I'm hoping it is the former, but guessing it is the latter.

I think we should consider bringing another architecture maintainer on 
board to help spread out the load of reviewing and approving 
architecture patches.  Somebody like Olof.  Barring that I would like to 
volunteer some of my own cycles to review some of the current backlog, 
prioritize them, make sure they still compile/boot, and rebase them.


From olof at austin.ibm.com  Wed Oct 20 06:42:14 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 19 Oct 2004 15:42:14 -0500
Subject: status of ppc64 patches
In-Reply-To: <41754644.1010003@austin.ibm.com>
References: <41754644.1010003@austin.ibm.com>
Message-ID: <41757C26.2030909@austin.ibm.com>

Joel Schopp wrote:
> 2.6.9 is now in rc4.  Linus claims that the final 2.6.9 is very close. 

2.6.9 was released yesterday. :)


-Olof


From sonny at burdell.org  Wed Oct 20 09:00:54 2004
From: sonny at burdell.org (Sonny Rao)
Date: Tue, 19 Oct 2004 19:00:54 -0400
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
In-Reply-To: <1097887510.6487.23.camel@gaston>
References: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
	<1097887510.6487.23.camel@gaston>
Message-ID: <20041019230054.GA3807@kevlar.burdell.org>

On Sat, Oct 16, 2004 at 10:45:10AM +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2004-10-16 at 07:00, Santhosh Rao wrote:
> > Ok, it appears we aren't dropping into the open firmware debugger
> > randomly, the kernel seems to give up early in the boot process
> > Below is the output of an attempted boot of 2.6.9-rc4.
> > 
> > Jose, ever seen anything like this?
> > 
> > The machine is a p615 power-4  2-CPU box with 2GB of RAM.
> 
> Can you enable PROM_DEBUG in arch/ppc64/kernel/prom_init.c and send me the
> output log ?
> 
> Ben.
> 

Ben, I'm still seeing this issue with 2.6.9 final, do you need
anything else?  I'm sure you're very busy, but please let me know if I
can help.

Sonny Rao


From benh at kernel.crashing.org  Wed Oct 20 09:38:52 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 20 Oct 2004 09:38:52 +1000
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
In-Reply-To: <20041019230054.GA3807@kevlar.burdell.org>
References: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
	<1097887510.6487.23.camel@gaston>
	<20041019230054.GA3807@kevlar.burdell.org>
Message-ID: <1098229131.5792.9.camel@gaston>

On Wed, 2004-10-20 at 09:00, Sonny Rao wrote:

> Ben, I'm still seeing this issue with 2.6.9 final, do you need
> anything else?  I'm sure you're very busy, but please let me know if I
> can help.

Well, I can't reproduce it here, but it seems that basically one of the
calls to alloc_down() is failing; you may want to trace a bit. I'll
try to find it myself too & let you know.

Ben.


From nathanl at austin.ibm.com  Wed Oct 20 10:22:29 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Tue, 19 Oct 2004 19:22:29 -0500
Subject: status of ppc64 patches
In-Reply-To: <41754644.1010003@austin.ibm.com>
References: <41754644.1010003@austin.ibm.com>
Message-ID: <1098231748.7493.114.camel@pants.austin.ibm.com>

On Tue, 2004-10-19 at 11:52, Joel Schopp wrote: 
> I am concerned that we may be falling behind on reviewing patches in 
> general and if we don't catch up several very deserving patches may miss 
> this next window of opportunity.  The backlog of "New" patches is over a 
> month long now.  http://ozlabs.org/ppc64-patches/

> Either this page is out of date or we have a very serious bottleneck 
> problem.  I'm hoping it is the former, but guessing it is the latter.

It looks to me like the backlog is a bit smaller than a first glance at
the page would suggest.  It is somewhat out of date in that several of
the patches that are marked "new" have already been picked up by Linus
or akpm.  I think quite a few of the items in the list do not correspond
to patches that are intended for submission upstream (e.g. there are
several revisions of "Fan control for PowerMac7_3").

> I think we should consider bringing another architecture maintainer on 
> board to help spread out the load of reviewing and approving 
> architecture patches.  Somebody like Olof.  Barring that I would like to 

The fact that a web page is slightly out of date and some minor
non-bugfix patches were not forwarded upstream during the late 2.6.9-rc
series fails to convince me that such a change is needed.

If you feel a patch has been overlooked, it's usually just a matter of
gently nudging one of the maintainers via email or IRC; it Works For Me
(tm) ;)


Nathan


From olof at austin.ibm.com  Wed Oct 20 11:03:01 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 19 Oct 2004 20:03:01 -0500
Subject: status of ppc64 patches
In-Reply-To: <1098231748.7493.114.camel@pants.austin.ibm.com>
References: <41754644.1010003@austin.ibm.com>
	<1098231748.7493.114.camel@pants.austin.ibm.com>
Message-ID: <20041020010301.GA29579@4>

On Tue, Oct 19, 2004 at 07:22:29PM -0500, Nathan Lynch wrote:
> > I think we should consider bringing another architecture maintainer on 
> > board to help spread out the load of reviewing and approving 
> > architecture patches.  Somebody like Olof.  Barring that I would like to 
> 
> The fact that a web page is slightly out of date and some minor
> non-bugfix patches were not forwarded upstream during the late 2.6.9-rc
> series fails to convince me that such a change is needed.

Agreed. The page is there for the maintainers to track their work, not
for us to track them.  :-)  I hope that each person tracks their own
work and follows up as needed.

And even if, in the future, current maintainers need help looking at
patches, there's no need to promote someone (myself or others) to a
"full" maintainer just to pitch in and help out. Anyone has the
opportunity to look at a patch and ask questions about it or say that
they agree or disagree with it. This happens every day on LKML and other
lists, there's no reason we should work differently on our architecture
list.

Also: Regarding re-basing patches: It has to be the duty of the developer
of the patch to re-base it to current trees if it will no longer apply
cleanly. I wouldn't expect Anton or Paul to forward-port my patches,
just as little as I would expect Andrew Morton or Linus to do so.


-Olof


From benh at kernel.crashing.org  Wed Oct 20 11:24:59 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 20 Oct 2004 11:24:59 +1000
Subject: [PATCH] generic irq subsystem: ppc64 port
In-Reply-To: <20041019091557.GA17473@elte.hu>
References: <200410190714.i9J7Elnx027734@hera.kernel.org>
	<1098174500.11449.65.camel@gaston>  <20041019091557.GA17473@elte.hu>
Message-ID: <1098235499.22943.16.camel@gaston>

On Tue, 2004-10-19 at 19:15, Ingo Molnar wrote:
> * Benjamin Herrenschmidt <benh at kernel.crashing.org> wrote:
> 
> > I still like the idea of the patch, so it would be useful if you added
> > the possibility for us to just change that behaviour, that is replace
> > all occurrences of irq_descs + i with get_irq_desc() and provide a
> > generic one that just does that, with a #ifndef so that the
> > architecture can provide its own. 
> 
> sure, we could do that. But since there are other architectures with
> large irq-vector spaces too, you might want to try to move it into the
> generic IRQ code and just provide a way to switch between 1:1 mapped and
> sparse-mapped variants.

False alert ! In fact, Paulus rewrote that stuff a while ago and I
totally forgot about it. We no longer do that, our get_irq_desc()
is nowadays just doing (&irq_desc[(irq)]). We map the large
physical interrupt numbers to "virtual" numbers that are the only
thing the generic code sees, so it's fine. 
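
As a rough illustration of that remapping (a sketch only, not the actual ppc64 implementation): the table size, the NO_IRQ_HW marker and the function names virq_create_mapping() and virq_to_hw() are all made up for this example. Large hardware numbers are assigned small virtual slots, and only the slot index is handed to generic code.

```c
#include <assert.h>

#define NR_IRQS   16           /* assumption: small virtual range */
#define NO_IRQ_HW 0xffffffffu  /* hypothetical marker for "slot unused" */

static unsigned int virt_to_hw[NR_IRQS];
static int map_ready;

/* Mark every virtual slot as free on first use. */
static void map_init(void)
{
	unsigned int i;

	for (i = 0; i < NR_IRQS; i++)
		virt_to_hw[i] = NO_IRQ_HW;
	map_ready = 1;
}

/*
 * Return the virtual number for hardware interrupt hw, allocating the
 * first free slot if it has no mapping yet.  Returns -1 when the table
 * is full.
 */
int virq_create_mapping(unsigned int hw)
{
	int free_slot = -1;
	unsigned int i;

	assert(hw != NO_IRQ_HW);
	if (!map_ready)
		map_init();
	for (i = 0; i < NR_IRQS; i++) {
		if (virt_to_hw[i] == hw)
			return (int)i;          /* already mapped */
		if (virt_to_hw[i] == NO_IRQ_HW && free_slot < 0)
			free_slot = (int)i;
	}
	if (free_slot >= 0)
		virt_to_hw[free_slot] = hw;
	return free_slot;
}

/* Translate back when talking to the interrupt controller. */
unsigned int virq_to_hw(int virq)
{
	assert(virq >= 0 && virq < NR_IRQS);
	return virt_to_hw[virq];
}
```

Because the virtual numbers stay below NR_IRQS, a plain array index on them is enough for the generic code.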

Ben.


From sfr at canb.auug.org.au  Wed Oct 20 15:47:30 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 20 Oct 2004 15:47:30 +1000
Subject: [PATCH] PPC64 iSeries compile broken in 2.6.9-bk3
Message-ID: <20041020154730.39ea3509.sfr@canb.auug.org.au>

Hi Andrew,

One of the iSeries specific files used HZ without including linux/param.h
and previously got away with it.

Signed-off-by: Stephen Rothwell <sfr at canb.auug.org.au>

Please apply and send to Linus.

-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruN 2.6.9-bk3/arch/ppc64/kernel/iSeries_proc.c 2.6.9-bk3-sfr.1/arch/ppc64/kernel/iSeries_proc.c
--- 2.6.9-bk3/arch/ppc64/kernel/iSeries_proc.c	2004-08-19 17:01:59.000000000 +1000
+++ 2.6.9-bk3-sfr.1/arch/ppc64/kernel/iSeries_proc.c	2004-10-20 15:21:23.000000000 +1000
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
+#include <linux/param.h>		/* for HZ */
 #include <asm/paca.h>
 #include <asm/processor.h>
 #include <asm/time.h>

From greg.quinn at anu.edu.au  Wed Oct 20 16:09:55 2004
From: greg.quinn at anu.edu.au (Greg Quinn)
Date: Wed, 20 Oct 2004 16:09:55 +1000
Subject: 64 bit compilation and linking - help
Message-ID: <41760133.6020801@anu.edu.au>

Sorry to intrude on your mailing list. Here at bios.org (CAMBIA) we've 
just acquired two new p615 machines courtesy of a generous IBM donation, 
and we want to put them to work ASAP.

We've installed a Suse 9 Enterprise Server distribution. I'm trying to 
compile a C application in 64-bit mode, but can't get the compilation to 
succeed. For example ...

cc -o m m.c

produces a 32 bit executable, i.e. pointers are 4 bytes. But ...

cc -o m -m64 m.c

dies with a bunch of messages like

> /usr/lib/gcc-lib/powerpc-suse-linux/3.3.3/../../../../powerpc-suse-linux/bin/ld: 
> skipping incompatible 
> /usr/lib/gcc-lib/powerpc-suse-linux/3.3.3/../../../libc.so when 
> searching for -lc 

We seem to have the 64 bit libraries installed (in /lib64 and 
/usr/lib64), I just need a clue on how to compile and link with them. 
It's probably something very simple, so I'd appreciate 10 seconds of 
somebody's time.

-- 
Greg Quinn

CAMBIA
http://www.cambiaip.org
(02) 62464523


From olh at suse.de  Wed Oct 20 16:30:46 2004
From: olh at suse.de (Olaf Hering)
Date: Wed, 20 Oct 2004 08:30:46 +0200
Subject: 64 bit compilation and linking - help
In-Reply-To: <41760133.6020801@anu.edu.au>
References: <41760133.6020801@anu.edu.au>
Message-ID: <20041020063046.GA28504@suse.de>

 On Wed, Oct 20, Greg Quinn wrote:

> We seem to have the 64 bit libraries installed (in /lib64 and 
> /usr/lib64), I just need a clue on how to compile and link with them. 
> It's probably something very simple, so I'd appreciate 10 seconds of 
> somebody's time.

You do not have enough installed; look at 'rpm -qa | grep 64bit'.
To install more rpms, use yast and search for package names which contain
'64bit'.
I think you just need glibc-devel-64bit for a simple hello_world.c.
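
Once the link works, a tiny check like the following can confirm which ABI a binary was built for. This is only a sketch; pointer_bits() and report_pointer_width() are names invented for the example, and the second one is meant to be called from main() in a test file such as m.c.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Width of a pointer in bits for whatever ABI this file was built for. */
int pointer_bits(void)
{
	return (int)(sizeof(void *) * CHAR_BIT);
}

/* Print the result; call this from main() in a test program. */
void report_pointer_width(void)
{
	int bits = pointer_bits();

	assert(bits == 32 || bits == 64);  /* the two cases discussed here */
	printf("pointers are %d bits (%d bytes)\n", bits, bits / CHAR_BIT);
}
```

Built with plain cc on the setup described above it should report 4-byte pointers, and 8-byte pointers when built with -m64.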

-- 
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG


From paulus at samba.org  Wed Oct 20 22:10:36 2004
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 20 Oct 2004 22:10:36 +1000
Subject: status of ppc64 patches
In-Reply-To: <41754644.1010003@austin.ibm.com>
References: <41754644.1010003@austin.ibm.com>
Message-ID: <16758.21948.795730.268143@cargo.ozlabs.ibm.com>

Joel Schopp writes:

> 2.6.9 is now in rc4.  Linus claims that the final 2.6.9 is very close. 
> Thus, I expect the floodgates into mainline to open soon.  I would hope 
> that my patches would be sent on by the architecture maintainers at that 
> time.

And 2.6.9 is now out, and the floodgates are open, and patches are
flowing again.

As far as your patches are concerned, I am aware of two patches that
change things so that we have __boot variants of __pa etc.  However,
your explanation didn't really get me excited about the change.  You
said something about "moving towards hotplug memory" but you didn't
explain why these changes would help with that, or how I should choose
which function to use when I'm making changes in future (that should
actually go in a file somewhere under the Documentation directory), or
why those changes need to go in now.

> I think we should consider bringing another architecture maintainer on 
> board to help spread out the load of reviewing and approving 
> architecture patches.  Somebody like Olof.  Barring that I would like to 
> volunteer some of my own cycles to review some of the current backlog, 
> prioritize them, make sure they still compile/boot, and rebase them.

Help with reviewing, compile/boot testing and rebasing patches is
always welcome. :)  Rebasing is really the responsibility of the
original submitter though, since they generally know what has been
changed and why better than anyone.

Paul.


From tiwari.amit at gmail.com  Wed Oct 20 22:08:28 2004
From: tiwari.amit at gmail.com (Amit K Tiwari)
Date: Wed, 20 Oct 2004 17:38:28 +0530
Subject: Max RAM Supported
Message-ID: <dd97ea000410200508740fce45@mail.gmail.com>

Hi,

I have just installed YDL 4.0. The OS does not show all 6GB DRAM I
have in my Power Mac G5. It shows only 1.97GB (I ran top to see how
much physical memory I have). Looking at the net,
http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Bligh-OLS2003.pdf
says that kernel 2.5 should support approx 32GB of memory.

Do I need to re-build the kernel to enable the support for all of
available memory? If yes, with what options? 'High Memory Support' is
already enabled in the kernel config.

Amit


From paulus at samba.org  Wed Oct 20 22:21:03 2004
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 20 Oct 2004 22:21:03 +1000
Subject: Max RAM Supported
In-Reply-To: <dd97ea000410200508740fce45@mail.gmail.com>
References: <dd97ea000410200508740fce45@mail.gmail.com>
Message-ID: <16758.22575.18560.155884@cargo.ozlabs.ibm.com>

Amit K Tiwari writes:

> I have just installed YDL 4.0. The OS does not show all 6GB DRAM I
> have in my Power Mac G5. It shows only 1.97GB (I ran top to see how
> much physical memory I have). Looking at the net,

Is that a 32-bit kernel or a 64-bit kernel?  (If uname -m prints ppc,
it's a 32-bit kernel; if it prints ppc64, it's a 64-bit kernel.)

The 32-bit kernel only supports 2GB of RAM, because it can only use
physical addresses below 4GB, and the space from 2GB - 4GB in the
physical address space is used for I/O and ROM.  The 64-bit kernel can
address all of the physical address space.
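
The arithmetic behind that limit can be written down as a small sketch. The function name usable_ram() and its parameters are made up for this example, and it only models the layout described above (RAM below the I/O and ROM region, nothing remapped above it).

```c
#include <assert.h>

#define GB (1ULL << 30)

/*
 * RAM reachable by a kernel limited to physical addresses below
 * addr_limit, when everything from io_start up to addr_limit is taken
 * by I/O and ROM rather than memory.
 */
unsigned long long usable_ram(unsigned long long installed,
			      unsigned long long addr_limit,
			      unsigned long long io_start)
{
	unsigned long long ceiling;

	assert(io_start <= addr_limit);
	ceiling = io_start;              /* RAM can only sit below the hole */
	return installed < ceiling ? installed : ceiling;
}
```

For the machine in question, usable_ram(6 * GB, 4 * GB, 2 * GB) gives 2 * GB, which is in line with the roughly 2GB that top reports.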

> Do I need to re-build the kernel to enable the support for all of
> available memory? If yes, with what options? 'High Memory Support' is
> already enabled in the kernel config.

You need to build a 64-bit kernel (i.e. ARCH=ppc64) rather than a
32-bit kernel (ARCH=ppc).

Paul.


From dhowells at redhat.com  Thu Oct 21 00:44:15 2004
From: dhowells at redhat.com (David Howells)
Date: Wed, 20 Oct 2004 15:44:15 +0100
Subject: [PATCH] Add key management syscalls to non-i386 archs
Message-ID: <3506.1098283455@redhat.com>

Hi Linus, Andrew,

The attached patch adds syscalls for almost all archs (everything barring
m68knommu which is in a real mess, and i386 which already has it).

It also adds 32->64 compatibility where appropriate.
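
For readers unfamiliar with these wrappers, a user-space sketch of what such a 32->64 shim does. native_add_key() is a stand-in invented for this example, not the real syscall, and key_serial_t is assumed to be a signed 32-bit type; the conversion of an out-of-range unsigned value to that type relies on the usual two's-complement behaviour of common compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t key_serial_t;  /* assumption: serials are signed 32-bit */

/* Stand-in for the native syscall: records what it was handed. */
static size_t last_plen;
static key_serial_t last_ringid;

long native_add_key(const void *payload, size_t plen, key_serial_t ringid)
{
	assert(payload != NULL || plen == 0);
	last_plen = plen;
	last_ringid = ringid;
	return 42;  /* hypothetical serial number of the new key */
}

/*
 * 32-bit entry point: widen the length, reinterpret the ring id as
 * signed, and pass the native result back to the caller.
 */
long compat_add_key(const void *payload, uint32_t plen, uint32_t ringid)
{
	return native_add_key(payload, (size_t)plen, (key_serial_t)ringid);
}
```

The cast on the ring id matters because special keyring ids are small negative numbers, which a 32-bit caller passes as large unsigned values.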

David

Signed-Off-By: David Howells <dhowells at redhat.com>
---

warthog>diffstat keys-269bk4.diff 
 arch/alpha/kernel/systbls.S        |    3 +++
 arch/arm/kernel/calls.S            |    3 +++
 arch/cris/arch-v10/kernel/entry.S  |    3 +++
 arch/h8300/kernel/syscalls.S       |    3 +++
 arch/ia64/ia32/ia32_entry.S        |    4 ++++
 arch/ia64/ia32/sys_ia32.c          |   20 ++++++++++++++++++++
 arch/ia64/kernel/entry.S           |    6 +++---
 arch/ia64/kernel/fsys.S            |    6 +++---
 arch/m32r/kernel/entry.S           |    3 +++
 arch/m68k/kernel/entry.S           |    3 +++
 arch/mips/kernel/scall32-o32.S     |    3 +++
 arch/mips/kernel/scall64-64.S      |    3 +++
 arch/mips/kernel/scall64-n32.S     |    3 +++
 arch/mips/kernel/scall64-o32.S     |    3 +++
 arch/parisc/kernel/syscall_table.S |    4 +++-
 arch/ppc/kernel/misc.S             |    3 +++
 arch/ppc64/kernel/misc.S           |    6 ++++++
 arch/ppc64/kernel/sys_ppc32.c      |   33 +++++++++++++++++++++++++++++++++
 arch/s390/kernel/compat_wrapper.S  |   26 ++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S        |    3 +++
 arch/sh/kernel/entry.S             |    4 ++++
 arch/sh64/kernel/syscalls.S        |    4 +++-
 arch/sparc/kernel/systbls.S        |    2 +-
 arch/sparc64/kernel/sys32.S        |    3 +++
 arch/sparc64/kernel/systbls.S      |    4 ++--
 arch/um/kernel/sys_call_table.c    |    3 +++
 arch/v850/kernel/entry.S           |    3 +++
 arch/x86_64/ia32/ia32entry.S       |    4 ++++
 include/asm-alpha/unistd.h         |    5 ++++-
 include/asm-arm/unistd.h           |    3 +++
 include/asm-arm26/unistd.h         |    3 +++
 include/asm-cris/unistd.h          |    5 ++++-
 include/asm-h8300/unistd.h         |    5 ++++-
 include/asm-ia64/unistd.h          |    3 +++
 include/asm-m32r/unistd.h          |    5 ++++-
 include/asm-m68k/unistd.h          |    5 ++++-
 include/asm-mips/unistd.h          |   17 +++++++++++++----
 include/asm-parisc/unistd.h        |    5 ++++-
 include/asm-ppc/unistd.h           |    5 ++++-
 include/asm-ppc64/unistd.h         |    5 ++++-
 include/asm-s390/unistd.h          |    5 ++++-
 include/asm-sh/unistd.h            |    5 ++++-
 include/asm-sh64/unistd.h          |    5 ++++-
 include/asm-sparc/unistd.h         |    3 +++
 include/asm-sparc64/unistd.h       |    3 +++
 include/asm-v850/unistd.h          |    3 +++
 include/asm-x86_64/unistd.h        |    8 +++++++-
 47 files changed, 239 insertions(+), 27 deletions(-)

diff -uNrp linux-2.6.9-bk4/arch/alpha/kernel/systbls.S linux-2.6.9-bk4-keys/arch/alpha/kernel/systbls.S
--- linux-2.6.9-bk4/arch/alpha/kernel/systbls.S	2004-10-19 10:41:41.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/alpha/kernel/systbls.S	2004-10-20 14:47:43.275151615 +0100
@@ -458,6 +458,9 @@ sys_call_table:
 	.quad sys_mq_notify
 	.quad sys_mq_getsetattr
 	.quad sys_waitid
+	.quad sys_add_key
+	.quad sys_request_key
+	.quad sys_keyctl
 
 	.size sys_call_table, . - sys_call_table
 	.type sys_call_table, @object
diff -uNrp linux-2.6.9-bk4/arch/arm/kernel/calls.S linux-2.6.9-bk4-keys/arch/arm/kernel/calls.S
--- linux-2.6.9-bk4/arch/arm/kernel/calls.S	2004-10-19 10:41:42.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/arm/kernel/calls.S	2004-10-20 14:57:39.641915157 +0100
@@ -295,6 +295,9 @@ __syscall_start:
 		.long	sys_mq_notify
 		.long	sys_mq_getsetattr
 /* 280 */	.long	sys_waitid
+		.long	sys_add_key
+		.long	sys_request_key
+		.long	sys_keyctl
 __syscall_end:
 
 		.rept	NR_syscalls - (__syscall_end - __syscall_start) / 4
diff -uNrp linux-2.6.9-bk4/arch/cris/arch-v10/kernel/entry.S linux-2.6.9-bk4-keys/arch/cris/arch-v10/kernel/entry.S
--- linux-2.6.9-bk4/arch/cris/arch-v10/kernel/entry.S	2004-06-18 13:43:42.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/cris/arch-v10/kernel/entry.S	2004-10-20 14:44:52.215209105 +0100
@@ -1079,6 +1079,9 @@ sys_call_table:	
 	.long sys_mq_timedreceive	/* 280 */
 	.long sys_mq_notify
 	.long sys_mq_getsetattr
+	.long sys_add_key
+	.long sys_request_key	/* 285 */
+	.long sys_keyctl
 		
         /*
          * NOTE!! This doesn't have to be exact - we just have
diff -uNrp linux-2.6.9-bk4/arch/h8300/kernel/syscalls.S linux-2.6.9-bk4-keys/arch/h8300/kernel/syscalls.S
--- linux-2.6.9-bk4/arch/h8300/kernel/syscalls.S	2004-06-18 13:43:42.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/h8300/kernel/syscalls.S	2004-10-20 15:00:36.035535939 +0100
@@ -289,6 +289,9 @@ SYMBOL_NAME_LABEL(sys_call_table)	
 	.long SYMBOL_NAME(sys_utimes)
  	.long SYMBOL_NAME(sys_fadvise64_64)
 	.long SYMBOL_NAME(sys_ni_syscall)	/* sys_vserver */
+	.long SYMBOL_NAME(sys_add_key)
+	.long SYMBOL_NAME(sys_request_key)	/* 275 */
+	.long SYMBOL_NAME(sys_keyctl)
 
 	.rept NR_syscalls-(.-SYMBOL_NAME(sys_call_table))/4
 		.long SYMBOL_NAME(sys_ni_syscall)
diff -uNrp linux-2.6.9-bk4/arch/ia64/ia32/ia32_entry.S linux-2.6.9-bk4-keys/arch/ia64/ia32/ia32_entry.S
--- linux-2.6.9-bk4/arch/ia64/ia32/ia32_entry.S	2004-10-19 10:41:43.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/ia64/ia32/ia32_entry.S	2004-10-20 15:25:01.365546264 +0100
@@ -495,6 +495,10 @@ ia32_syscall_table:
   	data8 compat_sys_mq_getsetattr
 	data8 sys_ni_syscall		/* reserved for kexec */
 	data8 sys32_waitid
+	data8 sys_ni_syscall		/* reserved for setaltroot */
+	data8 sys32_add_key
+	data8 sys32_request_key
+	data8 sys_keyctl
 
 	// guard against failures to increase IA32_NR_syscalls
 	.org ia32_syscall_table + 8*IA32_NR_syscalls
diff -uNrp linux-2.6.9-bk4/arch/ia64/ia32/sys_ia32.c linux-2.6.9-bk4-keys/arch/ia64/ia32/sys_ia32.c
--- linux-2.6.9-bk4/arch/ia64/ia32/sys_ia32.c	2004-10-19 10:41:43.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/ia64/ia32/sys_ia32.c	2004-10-20 15:28:48.663376741 +0100
@@ -2687,6 +2687,26 @@ asmlinkage long sys32_waitid(int which, 
 	return copy_siginfo_to_user32(uinfo, &info);
 }
 
+
+asmlinkage long sys32_add_key(const char __user *_type,
+			      const char __user *_description,
+			      const void __user *_payload,
+			      __u32 plen,
+			      __u32 ringid)
+{
+	return sys_add_key(_type, _description, _payload, (size_t) plen,
+			   (key_serial_t) ringid);
+}
+
+asmlinkage long sys32_request_key(const char __user *_type,
+				  const char __user *_description,
+				  const char __user *_callout_info,
+				  __u32 destringid)
+{
+	return sys_request_key(_type, _description, _callout_info,
+			       (key_serial_t) destringid);
+}
+
 #ifdef	NOTYET  /* UNTESTED FOR IA64 FROM HERE DOWN */
 
 asmlinkage long sys32_setreuid(compat_uid_t ruid, compat_uid_t euid)
diff -uNrp linux-2.6.9-bk4/arch/ia64/kernel/entry.S linux-2.6.9-bk4-keys/arch/ia64/kernel/entry.S
--- linux-2.6.9-bk4/arch/ia64/kernel/entry.S	2004-10-20 14:02:54.138626787 +0100
+++ linux-2.6.9-bk4-keys/arch/ia64/kernel/entry.S	2004-10-20 14:45:48.309267588 +0100
@@ -1528,9 +1528,9 @@ sys_call_table:
 	data8 sys_ni_syscall			// reserved for kexec_load
 	data8 sys_ni_syscall
 	data8 sys_setaltroot			// 1270
-	data8 sys_ni_syscall
-	data8 sys_ni_syscall
-	data8 sys_ni_syscall
+	data8 sys_add_key
+	data8 sys_request_key
+	data8 sys_keyctl
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall			// 1275
 	data8 sys_ni_syscall
diff -uNrp linux-2.6.9-bk4/arch/ia64/kernel/fsys.S linux-2.6.9-bk4-keys/arch/ia64/kernel/fsys.S
--- linux-2.6.9-bk4/arch/ia64/kernel/fsys.S	2004-10-19 10:41:43.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/ia64/kernel/fsys.S	2004-10-20 14:46:27.814789684 +0100
@@ -868,9 +868,9 @@ fsyscall_table:
 	data8 0				// kexec_load
 	data8 0
 	data8 0							// 1270
-	data8 0
-	data8 0
-	data8 0
+	data8 0				// add_key
+	data8 0				// request_key
+	data8 0				// keyctl
 	data8 0
 	data8 0							// 1275
 	data8 0
diff -uNrp linux-2.6.9-bk4/arch/m32r/kernel/entry.S linux-2.6.9-bk4-keys/arch/m32r/kernel/entry.S
--- linux-2.6.9-bk4/arch/m32r/kernel/entry.S	2004-10-19 10:41:44.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/m32r/kernel/entry.S	2004-10-20 15:09:17.798751465 +0100
@@ -994,6 +994,9 @@ ENTRY(sys_call_table)
         .long sys_mq_getsetattr
         .long sys_ni_syscall            /* reserved for kexec */
 	.long sys_waitid
+	.long sys_add_key		/* 285 */
+	.long sys_request_key
+	.long sys_keyctl
 
 syscall_table_size=(.-sys_call_table)
 
diff -uNrp linux-2.6.9-bk4/arch/m68k/kernel/entry.S linux-2.6.9-bk4-keys/arch/m68k/kernel/entry.S
--- linux-2.6.9-bk4/arch/m68k/kernel/entry.S	2004-06-18 13:43:44.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/m68k/kernel/entry.S	2004-10-20 14:45:20.678701183 +0100
@@ -663,3 +663,6 @@ sys_call_table:
 	.long sys_lremovexattr
 	.long sys_fremovexattr
 	.long sys_futex		/* 235 */
+	.long sys_add_key
+	.long sys_request_key
+	.long sys_keyctl
diff -uNrp linux-2.6.9-bk4/arch/mips/kernel/scall32-o32.S linux-2.6.9-bk4-keys/arch/mips/kernel/scall32-o32.S
--- linux-2.6.9-bk4/arch/mips/kernel/scall32-o32.S	2004-09-16 12:05:47.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/mips/kernel/scall32-o32.S	2004-10-20 14:30:46.698878816 +0100
@@ -628,6 +628,9 @@ out:	jr	ra
 	sys	sys_mq_notify		2	/* 4275 */
 	sys	sys_mq_getsetattr	3
 	sys	sys_ni_syscall		0	/* sys_vserver */
+	sys	sys_add_key		5
+	sys	sys_request_key		4
+	sys	sys_keyctl		5
 
 	.endm
 
diff -uNrp linux-2.6.9-bk4/arch/mips/kernel/scall64-64.S linux-2.6.9-bk4-keys/arch/mips/kernel/scall64-64.S
--- linux-2.6.9-bk4/arch/mips/kernel/scall64-64.S	2004-09-16 12:05:47.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/mips/kernel/scall64-64.S	2004-10-20 14:32:42.206470034 +0100
@@ -448,3 +448,6 @@ sys_call_table:
 	PTR	sys_mq_notify
 	PTR	sys_mq_getsetattr		/* 5235 */
 	PTR	sys_ni_syscall			/* sys_vserver */
+	PTR	sys_add_key
+	PTR	sys_request_key
+	PTR	sys_keyctl
diff -uNrp linux-2.6.9-bk4/arch/mips/kernel/scall64-n32.S linux-2.6.9-bk4-keys/arch/mips/kernel/scall64-n32.S
--- linux-2.6.9-bk4/arch/mips/kernel/scall64-n32.S	2004-09-16 12:05:47.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/mips/kernel/scall64-n32.S	2004-10-20 15:12:10.687967430 +0100
@@ -358,3 +358,6 @@ EXPORT(sysn32_call_table)
 	PTR	compat_sys_mq_notify
 	PTR	compat_sys_mq_getsetattr	/* 6239 */
 	PTR	sys_ni_syscall			/* sys_vserver */
+	PTR	sys_add_key
+	PTR	sys_request_key
+	PTR	sys_keyctl
diff -uNrp linux-2.6.9-bk4/arch/mips/kernel/scall64-o32.S linux-2.6.9-bk4-keys/arch/mips/kernel/scall64-o32.S
--- linux-2.6.9-bk4/arch/mips/kernel/scall64-o32.S	2004-09-16 12:05:47.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/mips/kernel/scall64-o32.S	2004-10-20 15:11:26.761722025 +0100
@@ -536,6 +536,9 @@ out:	jr	ra
 	sys	compat_sys_mq_notify	2	/* 4275 */
 	sys	compat_sys_mq_getsetattr 3
 	sys	sys_ni_syscall		0	/* sys_vserver */
+	sys	sys_add_key		5
+	sys	sys_request_key		4
+	sys	sys_keyctl		5
 
 	.endm
 
diff -uNrp linux-2.6.9-bk4/arch/parisc/kernel/syscall_table.S linux-2.6.9-bk4-keys/arch/parisc/kernel/syscall_table.S
--- linux-2.6.9-bk4/arch/parisc/kernel/syscall_table.S	2004-06-18 13:43:47.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/parisc/kernel/syscall_table.S	2004-10-20 14:58:51.533643420 +0100
@@ -341,5 +341,7 @@
   ENTRY_SAME(mq_timedreceive)
   ENTRY_SAME(mq_notify)
   ENTRY_SAME(mq_getsetattr)
-  /* Nothing yet */       /* 235 */
+	ENTRY_SAME(add_key)	/* 235 */
+	ENTRY_SAME(request_key)
+	ENTRY_SAME(keyctl)
 
diff -uNrp linux-2.6.9-bk4/arch/ppc/kernel/misc.S linux-2.6.9-bk4-keys/arch/ppc/kernel/misc.S
--- linux-2.6.9-bk4/arch/ppc/kernel/misc.S	2004-10-19 10:41:46.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/ppc/kernel/misc.S	2004-10-20 14:43:37.665815385 +0100
@@ -1447,3 +1447,6 @@ _GLOBAL(sys_call_table)
 	.long sys_mq_notify
 	.long sys_mq_getsetattr
 	.long sys_ni_syscall		/* 268 reserved for sys_kexec_load */
+	.long sys_add_key
+	.long sys_request_key		/* 270 */
+	.long sys_keyctl
diff -uNrp linux-2.6.9-bk4/arch/ppc64/kernel/misc.S linux-2.6.9-bk4-keys/arch/ppc64/kernel/misc.S
--- linux-2.6.9-bk4/arch/ppc64/kernel/misc.S	2004-10-20 14:02:55.974474037 +0100
+++ linux-2.6.9-bk4-keys/arch/ppc64/kernel/misc.S	2004-10-20 14:57:18.470763092 +0100
@@ -963,6 +963,9 @@ _GLOBAL(sys_call_table32)
 	.llong .compat_sys_mq_notify
 	.llong .compat_sys_mq_getsetattr
 	.llong .sys_ni_syscall		/* 268 reserved for sys_kexec_load */
+	.llong .sys32_add_key
+	.llong .sys32_request_key
+	.llong .sys32_keyctl
 
 	.balign 8
 _GLOBAL(sys_call_table)
@@ -1235,3 +1238,6 @@ _GLOBAL(sys_call_table)
 	.llong .sys_mq_notify
 	.llong .sys_mq_getsetattr
 	.llong .sys_ni_syscall		/* 268 reserved for sys_kexec_load */
+	.llong .sys_add_key
+	.llong .sys_request_key		/* 270 */
+	.llong .sys_keyctl
diff -uNrp linux-2.6.9-bk4/arch/ppc64/kernel/sys_ppc32.c linux-2.6.9-bk4-keys/arch/ppc64/kernel/sys_ppc32.c
--- linux-2.6.9-bk4/arch/ppc64/kernel/sys_ppc32.c	2004-10-20 14:02:56.046468047 +0100
+++ linux-2.6.9-bk4-keys/arch/ppc64/kernel/sys_ppc32.c	2004-10-20 15:29:22.936487493 +0100
@@ -1328,3 +1328,36 @@ long ppc32_timer_create(clockid_t clock,
 
 	return err;
 }
+
+asmlinkage long sys32_add_key(const char __user *_type,
+			      const char __user *_description,
+			      const void __user *_payload,
+			      u32 plen,
+			      u32 ringid)
+{
+	return sys_add_key(_type, _description, _payload, (size_t) plen,
+			   (key_serial_t) ringid);
+}
+
+asmlinkage long sys32_request_key(const char __user *_type,
+				  const char __user *_description,
+				  const char __user *_callout_info,
+				  u32 destringid)
+{
+	return sys_request_key(_type, _description, _callout_info,
+			       (key_serial_t) destringid);
+}
+
+/* Note: it is necessary to treat option as an unsigned int,
+ * with the corresponding cast to a signed int to ensure that the
+ * proper conversion (sign extension) between the register representation of a signed int (msr in 32-bit mode)
+ * and the register representation of a signed int (msr in 64-bit mode) is performed.
+ */
+asmlinkage long sys32_keyctl(u32 option, u32 arg2, u32 arg3, u32 arg4, u32 arg5)
+{
+	return sys_keyctl((int)option,
+			 (unsigned long) arg2,
+			 (unsigned long) arg3,
+			 (unsigned long) arg4,
+			 (unsigned long) arg5);
+}
diff -uNrp linux-2.6.9-bk4/arch/s390/kernel/compat_wrapper.S linux-2.6.9-bk4-keys/arch/s390/kernel/compat_wrapper.S
--- linux-2.6.9-bk4/arch/s390/kernel/compat_wrapper.S	2004-06-18 13:43:49.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/s390/kernel/compat_wrapper.S	2004-10-20 15:08:00.071403677 +0100
@@ -1406,3 +1406,29 @@ compat_sys_mq_getsetattr_wrapper:
 	llgtr	%r3,%r3			# struct compat_mq_attr *
 	llgtr	%r4,%r4			# struct compat_mq_attr *
 	jg	compat_sys_mq_getsetattr
+
+	.globl  sys32_add_key_wrapper
+sys32_add_key_wrapper:
+	lgfr	%r2,%r2			# const char *
+	llgfr	%r3,%r3			# const char *
+	llgfr	%r4,%r4			# const void *
+	llgfr	%r5,%r5			# size_t
+	llgfr	%r6,%r6			# key_serial_t
+	jg	sys_add_key		# branch to system call
+
+	.globl  sys32_request_key_wrapper
+sys32_request_key_wrapper:
+	lgfr	%r2,%r2			# const char *
+	llgfr	%r3,%r3			# const char *
+	llgfr	%r4,%r4			# const char *
+	llgfr	%r5,%r5			# key_serial_t
+	jg	sys_request_key		# branch to system call
+
+	.globl  sys32_keyctl_wrapper
+sys32_keyctl_wrapper:
+	lgfr	%r2,%r2			# int
+	llgfr	%r3,%r3			# unsigned long
+	llgfr	%r4,%r4			# unsigned long
+	llgfr	%r5,%r5			# unsigned long
+	llgfr	%r6,%r6			# unsigned long
+	jg	sys_keyctl		# branch to system call
diff -uNrp linux-2.6.9-bk4/arch/s390/kernel/syscalls.S linux-2.6.9-bk4-keys/arch/s390/kernel/syscalls.S
--- linux-2.6.9-bk4/arch/s390/kernel/syscalls.S	2004-06-18 13:43:49.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/s390/kernel/syscalls.S	2004-10-20 15:05:49.863555437 +0100
@@ -285,3 +285,6 @@ SYSCALL(sys_mq_timedsend,sys_mq_timedsen
 SYSCALL(sys_mq_timedreceive,sys_mq_timedreceive,compat_sys_mq_timedreceive_wrapper)
 SYSCALL(sys_mq_notify,sys_mq_notify,compat_sys_mq_notify_wrapper)
 SYSCALL(sys_mq_getsetattr,sys_mq_getsetattr,compat_sys_mq_getsetattr_wrapper)
+SYSCALL(sys_add_key,sys_add_key,sys32_add_key_wrapper)
+SYSCALL(sys_request_key,sys_request_key,sys32_request_key_wrapper)
+SYSCALL(sys_keyctl,sys_keyctl,sys32_keyctl_wrapper)
diff -uNrp linux-2.6.9-bk4/arch/sh/kernel/entry.S linux-2.6.9-bk4-keys/arch/sh/kernel/entry.S
--- linux-2.6.9-bk4/arch/sh/kernel/entry.S	2004-10-20 14:02:56.666416464 +0100
+++ linux-2.6.9-bk4-keys/arch/sh/kernel/entry.S	2004-10-20 14:26:32.677689027 +0100
@@ -1140,5 +1140,9 @@ ENTRY(sys_call_table)
 	.long sys_mq_timedreceive       /* 280 */
 	.long sys_mq_notify
 	.long sys_mq_getsetattr
+	.long sys_add_key
+	.long sys_request_key
+	.long sys_keyctl		/* 285 */
+	
 
 /* End of entry.S */
diff -uNrp linux-2.6.9-bk4/arch/sh64/kernel/syscalls.S linux-2.6.9-bk4-keys/arch/sh64/kernel/syscalls.S
--- linux-2.6.9-bk4/arch/sh64/kernel/syscalls.S	2004-09-16 12:05:50.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/sh64/kernel/syscalls.S	2004-10-20 15:08:45.682499668 +0100
@@ -337,4 +337,6 @@ sys_call_table:
 	.long sys_mq_timedreceive
 	.long sys_mq_notify
 	.long sys_mq_getsetattr		/* 310 */
-
+	.long sys_add_key
+	.long sys_request_key
+	.long sys_keyctl
diff -uNrp linux-2.6.9-bk4/arch/sparc/kernel/systbls.S linux-2.6.9-bk4-keys/arch/sparc/kernel/systbls.S
--- linux-2.6.9-bk4/arch/sparc/kernel/systbls.S	2004-10-19 10:41:48.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/sparc/kernel/systbls.S	2004-10-20 14:25:23.775664787 +0100
@@ -75,7 +75,7 @@ sys_call_table:
 /*265*/	.long sys_timer_delete, sys_timer_create, sys_nis_syscall, sys_io_setup, sys_io_destroy
 /*270*/	.long sys_io_submit, sys_io_cancel, sys_io_getevents, sys_mq_open, sys_mq_unlink
 /*275*/	.long sys_mq_timedsend, sys_mq_timedreceive, sys_mq_notify, sys_mq_getsetattr, sys_waitid
-/*280*/	.long sys_ni_syscall, sys_ni_syscall, sys_ni_syscall
+/*280*/	.long sys_add_key, sys_request_key, sys_keyctl
 
 #ifdef CONFIG_SUNOS_EMUL
 	/* Now the SunOS syscall table. */
diff -uNrp linux-2.6.9-bk4/arch/sparc64/kernel/sys32.S linux-2.6.9-bk4-keys/arch/sparc64/kernel/sys32.S
--- linux-2.6.9-bk4/arch/sparc64/kernel/sys32.S	2004-10-19 10:41:48.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/sparc64/kernel/sys32.S	2004-10-20 15:22:48.095792589 +0100
@@ -135,6 +135,9 @@ SIGN2(sys32_shutdown, sys_shutdown, %o0,
 SIGN3(sys32_socketpair, sys_socketpair, %o0, %o1, %o2)
 SIGN1(sys32_getpeername, sys_getpeername, %o0)
 SIGN1(sys32_getsockname, sys_getsockname, %o0)
+SIGN2(sys32_add_key, sys_add_key, %o3, %o4)
+SIGN1(sys32_request_key, sys_request_key, %o3)
+SIGN1(sys32_keyctl, sys_keyctl, %o0)
 
 	.globl		sys32_mmap2
 sys32_mmap2:
diff -uNrp linux-2.6.9-bk4/arch/sparc64/kernel/systbls.S linux-2.6.9-bk4-keys/arch/sparc64/kernel/systbls.S
--- linux-2.6.9-bk4/arch/sparc64/kernel/systbls.S	2004-10-19 10:41:48.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/sparc64/kernel/systbls.S	2004-10-20 14:42:28.934934888 +0100
@@ -76,7 +76,7 @@ sys_call_table32:
 	.word sys_timer_delete, sys32_timer_create, sys_ni_syscall, compat_sys_io_setup, sys_io_destroy
 /*270*/	.word sys32_io_submit, sys_io_cancel, compat_sys_io_getevents, sys32_mq_open, sys_mq_unlink
 	.word sys_mq_timedsend, sys_mq_timedreceive, compat_sys_mq_notify, compat_sys_mq_getsetattr, compat_sys_waitid
-/*280*/	.word sys_ni_syscall, sys_ni_syscall, sys_ni_syscall
+/*280*/	.word sys32_add_key, sys32_request_key, sys32_keyctl
 
 #endif /* CONFIG_COMPAT */
 
@@ -142,7 +142,7 @@ sys_call_table:
 	.word sys_timer_delete, sys_timer_create, sys_ni_syscall, sys_io_setup, sys_io_destroy
 /*270*/	.word sys_io_submit, sys_io_cancel, sys_io_getevents, sys_mq_open, sys_mq_unlink
 	.word sys_mq_timedsend, sys_mq_timedreceive, sys_mq_notify, sys_mq_getsetattr, sys_waitid
-/*280*/	.word sys_ni_syscall, sys_ni_syscall, sys_ni_syscall
+/*280*/	.word sys_add_key, sys_request_key, sys_keyctl
 
 #if defined(CONFIG_SUNOS_EMUL) || defined(CONFIG_SOLARIS_EMUL) || \
     defined(CONFIG_SOLARIS_EMUL_MODULE)
diff -uNrp linux-2.6.9-bk4/arch/um/kernel/sys_call_table.c linux-2.6.9-bk4-keys/arch/um/kernel/sys_call_table.c
--- linux-2.6.9-bk4/arch/um/kernel/sys_call_table.c	2004-10-19 10:41:49.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/um/kernel/sys_call_table.c	2004-10-20 14:44:10.557889241 +0100
@@ -306,6 +306,9 @@ syscall_handler_t *sys_call_table[] = {
 	[ __NR_utimes ] (syscall_handler_t *) sys_utimes,
 	[ __NR_fadvise64_64 ] (syscall_handler_t *) sys_fadvise64_64,
 	[ __NR_vserver ] (syscall_handler_t *) sys_ni_syscall,
+	[ __NR_add_key ] (syscall_handler_t *) sys_add_key,
+	[ __NR_request_key ] (syscall_handler_t *) sys_request_key,
+	[ __NR_keyctl ] (syscall_handler_t *) sys_keyctl,
 
 	ARCH_SYSCALLS
 	[ LAST_SYSCALL + 1 ... NR_syscalls ] = 
diff -uNrp linux-2.6.9-bk4/arch/v850/kernel/entry.S linux-2.6.9-bk4-keys/arch/v850/kernel/entry.S
--- linux-2.6.9-bk4/arch/v850/kernel/entry.S	2004-06-18 13:41:13.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/v850/kernel/entry.S	2004-10-20 15:02:06.154739578 +0100
@@ -1117,5 +1117,8 @@ C_DATA(sys_call_table):
 	.long CSYM(sys_pivot_root)	// 200
 	.long CSYM(sys_gettid)
 	.long CSYM(sys_tkill)
+	.long CSYM(sys_add_key)
+	.long CSYM(sys_request_key)
+	.long CSYM(sys_keyctl)		// 205
 sys_call_table_end:
 C_END(sys_call_table)
diff -uNrp linux-2.6.9-bk4/arch/x86_64/ia32/ia32entry.S linux-2.6.9-bk4-keys/arch/x86_64/ia32/ia32entry.S
--- linux-2.6.9-bk4/arch/x86_64/ia32/ia32entry.S	2004-10-19 10:41:49.000000000 +0100
+++ linux-2.6.9-bk4-keys/arch/x86_64/ia32/ia32entry.S	2004-10-20 15:04:46.183013167 +0100
@@ -587,6 +587,10 @@ ia32_sys_call_table:
 	.quad compat_sys_mq_getsetattr
 	.quad quiet_ni_syscall		/* reserved for kexec */
 	.quad sys32_waitid
+	.quad quiet_ni_syscall		/* 285 reserved for setaltroot */
+	.quad sys_add_key
+	.quad sys_request_key
+	.quad sys_keyctl
 	/* don't forget to change IA32_NR_syscalls */
 ia32_syscall_end:		
 	.rept IA32_NR_syscalls-(ia32_syscall_end-ia32_sys_call_table)/8
diff -uNrp linux-2.6.9-bk4/include/asm-alpha/unistd.h linux-2.6.9-bk4-keys/include/asm-alpha/unistd.h
--- linux-2.6.9-bk4/include/asm-alpha/unistd.h	2004-10-19 10:42:11.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-alpha/unistd.h	2004-10-20 14:18:36.681064345 +0100
@@ -374,8 +374,11 @@
 #define __NR_mq_notify			436
 #define __NR_mq_getsetattr		437
 #define __NR_waitid			438
+#define __NR_add_key			439
+#define __NR_request_key		440
+#define __NR_keyctl			441
 
-#define NR_SYSCALLS			439
+#define NR_SYSCALLS			442
 
 #if defined(__GNUC__)
 
diff -uNrp linux-2.6.9-bk4/include/asm-arm/unistd.h linux-2.6.9-bk4-keys/include/asm-arm/unistd.h
--- linux-2.6.9-bk4/include/asm-arm/unistd.h	2004-10-19 10:42:12.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-arm/unistd.h	2004-10-20 14:17:35.183426405 +0100
@@ -306,6 +306,9 @@
 #define __NR_mq_notify			(__NR_SYSCALL_BASE+278)
 #define __NR_mq_getsetattr		(__NR_SYSCALL_BASE+279)
 #define __NR_waitid			(__NR_SYSCALL_BASE+280)
+#define __NR_add_key			(__NR_SYSCALL_BASE+281)
+#define __NR_request_key		(__NR_SYSCALL_BASE+282)
+#define __NR_keyctl			(__NR_SYSCALL_BASE+283)
 
 /*
  * The following SWIs are ARM private.
diff -uNrp linux-2.6.9-bk4/include/asm-arm26/unistd.h linux-2.6.9-bk4-keys/include/asm-arm26/unistd.h
--- linux-2.6.9-bk4/include/asm-arm26/unistd.h	2004-06-18 13:44:05.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-arm26/unistd.h	2004-10-20 14:16:45.004804472 +0100
@@ -260,6 +260,9 @@
 #define __NR_lremovexattr		(__NR_SYSCALL_BASE+236)
 #define __NR_fremovexattr		(__NR_SYSCALL_BASE+237)
 #define __NR_tkill			(__NR_SYSCALL_BASE+238)
+#define __NR_add_key			(__NR_SYSCALL_BASE+239)
+#define __NR_request_key		(__NR_SYSCALL_BASE+240)
+#define __NR_keyctl			(__NR_SYSCALL_BASE+241)
 
 /*
  * The following SWIs are ARM private.
diff -uNrp linux-2.6.9-bk4/include/asm-cris/unistd.h linux-2.6.9-bk4-keys/include/asm-cris/unistd.h
--- linux-2.6.9-bk4/include/asm-cris/unistd.h	2004-06-18 13:44:05.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-cris/unistd.h	2004-10-20 14:16:21.025897563 +0100
@@ -288,8 +288,11 @@
 #define __NR_mq_timedreceive	(__NR_mq_open+3)
 #define __NR_mq_notify		(__NR_mq_open+4)
 #define __NR_mq_getsetattr	(__NR_mq_open+5)
+#define __NR_add_key		283
+#define __NR_request_key	284
+#define __NR_keyctl		285
  
-#define NR_syscalls 283
+#define NR_syscalls 286
 
 
 #ifdef __KERNEL__
diff -uNrp linux-2.6.9-bk4/include/asm-h8300/unistd.h linux-2.6.9-bk4-keys/include/asm-h8300/unistd.h
--- linux-2.6.9-bk4/include/asm-h8300/unistd.h	2004-06-18 13:44:05.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-h8300/unistd.h	2004-10-20 15:01:16.446016959 +0100
@@ -269,8 +269,11 @@
 #define __NR_clock_gettime	(__NR_timer_create+6)
 #define __NR_clock_getres	(__NR_timer_create+7)
 #define __NR_clock_nanosleep	(__NR_timer_create+8)
+#define __NR_add_key		274
+#define __NR_request_key	275
+#define __NR_keyctl		276
 
-#define NR_syscalls 268
+#define NR_syscalls 277
 
 
 /* user-visible error numbers are in the range -1 - -122: see
diff -uNrp linux-2.6.9-bk4/include/asm-ia64/unistd.h linux-2.6.9-bk4-keys/include/asm-ia64/unistd.h
--- linux-2.6.9-bk4/include/asm-ia64/unistd.h	2004-10-20 14:03:14.832904952 +0100
+++ linux-2.6.9-bk4-keys/include/asm-ia64/unistd.h	2004-10-20 14:14:59.746996878 +0100
@@ -260,6 +260,9 @@
 #define __NR_kexec_load			1268
 #define __NR_vserver			1269
 #define __NR_setaltroot			1270
+#define __NR_add_key			1271
+#define __NR_request_key		1272
+#define __NR_keyctl			1273
 
 #ifdef __KERNEL__
 
diff -uNrp linux-2.6.9-bk4/include/asm-m32r/unistd.h linux-2.6.9-bk4-keys/include/asm-m32r/unistd.h
--- linux-2.6.9-bk4/include/asm-m32r/unistd.h	2004-10-19 10:42:13.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-m32r/unistd.h	2004-10-20 14:14:34.284222397 +0100
@@ -294,8 +294,11 @@
 #define __NR_mq_getsetattr      (__NR_mq_open+5)
 #define __NR_sys_kexec_load    283
 #define __NR_waitid            284
+#define __NR_add_key		285
+#define __NR_request_key	286
+#define __NR_keyctl		287
 
-#define NR_syscalls 285
+#define NR_syscalls 288
 
 /* user-visible error numbers are in the range -1 - -124: see
  * <asm-m32r/errno.h>
diff -uNrp linux-2.6.9-bk4/include/asm-m68k/unistd.h linux-2.6.9-bk4-keys/include/asm-m68k/unistd.h
--- linux-2.6.9-bk4/include/asm-m68k/unistd.h	2004-06-18 13:44:05.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-m68k/unistd.h	2004-10-20 14:14:06.358663984 +0100
@@ -238,8 +238,11 @@
 #define __NR_lremovexattr	233
 #define __NR_fremovexattr	234
 #define __NR_futex		235
+#define __NR_add_key		236
+#define __NR_request_key	237
+#define __NR_keyctl		238
 
-#define NR_syscalls		236
+#define NR_syscalls		239
 
 /* user-visible error numbers are in the range -1 - -124: see
    <asm-m68k/errno.h> */
diff -uNrp linux-2.6.9-bk4/include/asm-mips/unistd.h linux-2.6.9-bk4-keys/include/asm-mips/unistd.h
--- linux-2.6.9-bk4/include/asm-mips/unistd.h	2004-09-16 12:06:18.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-mips/unistd.h	2004-10-20 14:12:31.321979696 +0100
@@ -298,16 +298,19 @@
 #define __NR_mq_notify			(__NR_Linux + 275)
 #define __NR_mq_getsetattr		(__NR_Linux + 276)
 #define __NR_vserver			(__NR_Linux + 277)
+#define __NR_add_key			(__NR_Linux + 278)
+#define __NR_request_key		(__NR_Linux + 279)
+#define __NR_keyctl			(__NR_Linux + 280)
 
 /*
  * Offset of the last Linux o32 flavoured syscall
  */
-#define __NR_Linux_syscalls		277
+#define __NR_Linux_syscalls		280
 
 #endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */
 
 #define __NR_O32_Linux			4000
-#define __NR_O32_Linux_syscalls		277
+#define __NR_O32_Linux_syscalls		280
 
 #if _MIPS_SIM == _MIPS_SIM_ABI64
 
@@ -552,11 +555,14 @@
 #define __NR_mq_notify			(__NR_Linux + 234)
 #define __NR_mq_getsetattr		(__NR_Linux + 235)
 #define __NR_vserver			(__NR_Linux + 236)
+#define __NR_add_key			(__NR_Linux + 237)
+#define __NR_request_key		(__NR_Linux + 238)
+#define __NR_keyctl			(__NR_Linux + 239)
 
 /*
  * Offset of the last Linux flavoured syscall
  */
-#define __NR_Linux_syscalls		236
+#define __NR_Linux_syscalls		239
 
 #endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */
 
@@ -810,11 +816,14 @@
 #define __NR_mq_notify			(__NR_Linux + 238)
 #define __NR_mq_getsetattr		(__NR_Linux + 239)
 #define __NR_vserver			(__NR_Linux + 240)
+#define __NR_add_key			(__NR_Linux + 241)
+#define __NR_request_key		(__NR_Linux + 242)
+#define __NR_keyctl			(__NR_Linux + 243)
 
 /*
  * Offset of the last N32 flavoured syscall
  */
-#define __NR_Linux_syscalls		240
+#define __NR_Linux_syscalls		243
 
 #endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */
 
diff -uNrp linux-2.6.9-bk4/include/asm-parisc/unistd.h linux-2.6.9-bk4-keys/include/asm-parisc/unistd.h
--- linux-2.6.9-bk4/include/asm-parisc/unistd.h	2004-09-16 12:06:18.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-parisc/unistd.h	2004-10-20 14:11:00.896901332 +0100
@@ -727,8 +727,11 @@
 #define __NR_mq_timedreceive    (__NR_Linux + 232)
 #define __NR_mq_notify          (__NR_Linux + 233)
 #define __NR_mq_getsetattr      (__NR_Linux + 234)
+#define __NR_add_key		(__NR_Linux + 235)
+#define __NR_request_key	(__NR_Linux + 236)
+#define __NR_keyctl		(__NR_Linux + 237)
 
-#define __NR_Linux_syscalls     235
+#define __NR_Linux_syscalls     238
 
 #define HPUX_GATEWAY_ADDR       0xC0000004
 #define LINUX_GATEWAY_ADDR      0x100
diff -uNrp linux-2.6.9-bk4/include/asm-ppc/unistd.h linux-2.6.9-bk4-keys/include/asm-ppc/unistd.h
--- linux-2.6.9-bk4/include/asm-ppc/unistd.h	2004-06-18 13:44:05.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-ppc/unistd.h	2004-10-20 14:10:32.629379614 +0100
@@ -273,8 +273,11 @@
 #define __NR_mq_notify		266
 #define __NR_mq_getsetattr	267
 #define __NR_kexec_load		268
+#define __NR_add_key		269
+#define __NR_request_key	270
+#define __NR_keyctl		271
 
-#define __NR_syscalls		269
+#define __NR_syscalls		272
 
 #define __NR(n)	#n
 
diff -uNrp linux-2.6.9-bk4/include/asm-ppc64/unistd.h linux-2.6.9-bk4-keys/include/asm-ppc64/unistd.h
--- linux-2.6.9-bk4/include/asm-ppc64/unistd.h	2004-10-19 10:42:14.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-ppc64/unistd.h	2004-10-20 14:10:19.868498694 +0100
@@ -279,8 +279,11 @@
 #define __NR_mq_notify		266
 #define __NR_mq_getsetattr	267
 #define __NR_kexec_load		268
+#define __NR_add_key		269
+#define __NR_request_key	270
+#define __NR_keyctl		271
 
-#define __NR_syscalls		269
+#define __NR_syscalls		272
 #ifdef __KERNEL__
 #define NR_syscalls	__NR_syscalls
 #endif
diff -uNrp linux-2.6.9-bk4/include/asm-s390/unistd.h linux-2.6.9-bk4-keys/include/asm-s390/unistd.h
--- linux-2.6.9-bk4/include/asm-s390/unistd.h	2004-06-18 13:44:05.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-s390/unistd.h	2004-10-20 14:09:39.572899460 +0100
@@ -269,8 +269,11 @@
 #define __NR_mq_timedreceive	274
 #define __NR_mq_notify		275
 #define __NR_mq_getsetattr	276
+#define __NR_add_key		277
+#define __NR_request_key	278
+#define __NR_keyctl		279
 
-#define NR_syscalls 277
+#define NR_syscalls 280
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff -uNrp linux-2.6.9-bk4/include/asm-sh/unistd.h linux-2.6.9-bk4-keys/include/asm-sh/unistd.h
--- linux-2.6.9-bk4/include/asm-sh/unistd.h	2004-10-20 14:03:16.058802954 +0100
+++ linux-2.6.9-bk4-keys/include/asm-sh/unistd.h	2004-10-20 14:09:16.465821351 +0100
@@ -290,8 +290,11 @@
 #define __NR_mq_timedreceive    (__NR_mq_open+3)
 #define __NR_mq_notify          (__NR_mq_open+4)
 #define __NR_mq_getsetattr      (__NR_mq_open+5)
+#define __NR_add_key		283
+#define __NR_request_key	284
+#define __NR_keyctl		285
 
-#define NR_syscalls 283
+#define NR_syscalls 286
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-sh/errno.h> */
 
diff -uNrp linux-2.6.9-bk4/include/asm-sh64/unistd.h linux-2.6.9-bk4-keys/include/asm-sh64/unistd.h
--- linux-2.6.9-bk4/include/asm-sh64/unistd.h	2004-09-16 12:06:19.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-sh64/unistd.h	2004-10-20 14:08:45.352409218 +0100
@@ -333,8 +333,11 @@
 #define __NR_mq_timedreceive    (__NR_mq_open+3)
 #define __NR_mq_notify          (__NR_mq_open+4)
 #define __NR_mq_getsetattr      (__NR_mq_open+5)
+#define __NR_add_key		311
+#define __NR_request_key	312
+#define __NR_keyctl		313
 
-#define NR_syscalls 311
+#define NR_syscalls 314
 
 /* user-visible error numbers are in the range -1 - -125: see <asm-sh64/errno.h> */
 
diff -uNrp linux-2.6.9-bk4/include/asm-sparc/unistd.h linux-2.6.9-bk4-keys/include/asm-sparc/unistd.h
--- linux-2.6.9-bk4/include/asm-sparc/unistd.h	2004-10-19 10:42:14.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-sparc/unistd.h	2004-10-20 14:08:05.303740383 +0100
@@ -296,6 +296,9 @@
 #define __NR_mq_notify		277
 #define __NR_mq_getsetattr	278
 #define __NR_waitid		279
+#define __NR_add_key		280
+#define __NR_request_key	281
+#define __NR_keyctl		282
 
 /* WARNING: You MAY NOT add syscall numbers larger than 282, since
  *          all of the syscall tables in the Sparc kernel are
diff -uNrp linux-2.6.9-bk4/include/asm-sparc64/unistd.h linux-2.6.9-bk4-keys/include/asm-sparc64/unistd.h
--- linux-2.6.9-bk4/include/asm-sparc64/unistd.h	2004-10-19 10:42:15.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-sparc64/unistd.h	2004-10-20 14:07:45.586380476 +0100
@@ -298,6 +298,9 @@
 #define __NR_mq_notify		277
 #define __NR_mq_getsetattr	278
 #define __NR_waitid		279
+#define __NR_add_key		280
+#define __NR_request_key	281
+#define __NR_keyctl		282
 
 /* WARNING: You MAY NOT add syscall numbers larger than 282, since
  *          all of the syscall tables in the Sparc kernel are
diff -uNrp linux-2.6.9-bk4/include/asm-v850/unistd.h linux-2.6.9-bk4-keys/include/asm-v850/unistd.h
--- linux-2.6.9-bk4/include/asm-v850/unistd.h	2004-09-16 12:06:20.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-v850/unistd.h	2004-10-20 14:06:45.477380562 +0100
@@ -205,6 +205,9 @@
 #define __NR_pivot_root		200
 #define __NR_gettid		201
 #define __NR_tkill		202
+#define __NR_add_key		203
+#define __NR_request_key	204
+#define __NR_keyctl		205
 
 
 /* Syscall protocol:
diff -uNrp linux-2.6.9-bk4/include/asm-x86_64/unistd.h linux-2.6.9-bk4-keys/include/asm-x86_64/unistd.h
--- linux-2.6.9-bk4/include/asm-x86_64/unistd.h	2004-10-19 10:42:16.000000000 +0100
+++ linux-2.6.9-bk4-keys/include/asm-x86_64/unistd.h	2004-10-20 14:06:01.645026869 +0100
@@ -556,8 +556,14 @@ __SYSCALL(__NR_mq_getsetattr, sys_mq_get
 __SYSCALL(__NR_kexec_load, sys_ni_syscall)
 #define __NR_waitid		247
 __SYSCALL(__NR_waitid, sys_waitid)
+#define __NR_add_key		248
+__SYSCALL(__NR_add_key, sys_add_key)
+#define __NR_request_key	249
+__SYSCALL(__NR_request_key, sys_request_key)
+#define __NR_keyctl		250
+__SYSCALL(__NR_keyctl, sys_keyctl)
 
-#define __NR_syscall_max __NR_waitid
+#define __NR_syscall_max __NR_keyctl
 #ifndef __NO_STUBS
 
 /* user-visible error numbers are in the range -1 - -4095 */


From hch at infradead.org  Thu Oct 21 01:29:57 2004
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 20 Oct 2004 16:29:57 +0100
Subject: [PATCH] Add key management syscalls to non-i386 archs
In-Reply-To: <3506.1098283455@redhat.com>
References: <3506.1098283455@redhat.com>
Message-ID: <20041020152957.GA21774@infradead.org>

> Hi Linus, Andrew,
> 
> The attached patch adds syscalls for almost all archs (everything barring
> m68knommu which is in a real mess, and i386 which already has it).
> 
> It also adds 32->64 compatibility where appropriate.

Umm, that patch added the damn multiplexer that had been vetoed multiple
times.  Why did this happen?


From matthew at wil.cx  Thu Oct 21 01:49:22 2004
From: matthew at wil.cx (Matthew Wilcox)
Date: Wed, 20 Oct 2004 16:49:22 +0100
Subject: [parisc-linux] [PATCH] Add key management syscalls to non-i386
	archs
In-Reply-To: <3506.1098283455@redhat.com>
References: <3506.1098283455@redhat.com>
Message-ID: <20041020154922.GV16153@parcelfarce.linux.theplanet.co.uk>

On Wed, Oct 20, 2004 at 03:44:15PM +0100, David Howells wrote:
> The attached patch adds syscalls for almost all archs (everything barring
> m68knommu which is in a real mess, and i386 which already has it).
> 
> It also adds 32->64 compatibility where appropriate.

> --- linux-2.6.9-bk4/arch/parisc/kernel/syscall_table.S	2004-06-18 13:43:47.000000000 +0100
> +++ linux-2.6.9-bk4-keys/arch/parisc/kernel/syscall_table.S	2004-10-20 14:58:51.533643420 +0100
> @@ -341,5 +341,7 @@
>    ENTRY_SAME(mq_timedreceive)
>    ENTRY_SAME(mq_notify)
>    ENTRY_SAME(mq_getsetattr)
> -  /* Nothing yet */       /* 235 */
> +	ENTRY_SAME(add_key)	/* 235 */
> +	ENTRY_SAME(request_key)
> +	ENTRY_SAME(keyctl)

Um, no.  Should be ENTRY_COMP() if there's compat syscalls.  And those
particular syscall numbers have already been assigned (blame Linus for
dropping the PA-RISC patch on the floor instead of including it in 2.6.9).

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain


From dhowells at redhat.com  Thu Oct 21 02:16:17 2004
From: dhowells at redhat.com (David Howells)
Date: Wed, 20 Oct 2004 17:16:17 +0100
Subject: [parisc-linux] [PATCH] Add key management syscalls to non-i386
	archs 
In-Reply-To: <20041020154922.GV16153@parcelfarce.linux.theplanet.co.uk> 
References: <20041020154922.GV16153@parcelfarce.linux.theplanet.co.uk>
	<3506.1098283455@redhat.com> 
Message-ID: <7779.1098288977@redhat.com>


> Um, no.  Should be ENTRY_COMP() if there's compat syscalls.

Not all archs (of which PA-RISC is an example) seem to require the same fixups
on the same syscalls. In some instances, the upper half of the register is
implicitly zero on 32-bit syscall entry to a 64-bit kernel. In such cases,
none of my syscalls require fixing up, assuming the pointers are automatically
correct.

> And those particular syscall numbers have already been assigned (blame Linus
> for dropping the PA-RISC patch on the floor instead of including it in
> 2.6.9).

There's not a lot I can do about that, except wave a patch under Linus's nose
and see who complains. Can you allocate three syscall numbers for me for
parisc?

David


From johnrose at austin.ibm.com  Thu Oct 21 02:35:32 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Wed, 20 Oct 2004 11:35:32 -0500
Subject: [PATCH] __ioremap_explicit() criterion change
Message-ID: <1098290132.15425.7.camel@sinatra.austin.ibm.com>

The function __ioremap_explicit() misses a possible (obscure) case when
reserving the imalloc area for the new region.  This can result in an
unexpected DLPAR-add failure for an I/O slot.  The failure is characterized
by a kernel message resembling "could not obtain imalloc area for ea 0x...".
Here's an explanation:

At boot time, imalloc regions are created for the ranges of all PHBs.  Upon 
removal of a child slot for one of these PHBs, the imalloc region is split
so that the region for the child slot can be removed.

A GFW testcase revealed the following scenario.  A PHB is remapped at boot for
virtual address range A through C.  At boot, the partition owns a slot that
spans from A to B.  This slot is DLPAR-removed, leaving an imalloc region from
B to C.  At this point, the user DLPAR-adds an EADS slot that was not present
at boot, but is a child of the PHB.  The new slot happens to have a range that
directly matches the leftover PHB range, from B to C.  The existing code accepts
only an unused region or a subset of an existing one, not an exact match, so
the operation fails.

Signed-off-by: John Rose <johnrose at austin.ibm.com>

diff -Nru a/arch/ppc64/mm/init.c b/arch/ppc64/mm/init.c
--- a/arch/ppc64/mm/init.c	Wed Oct 20 11:17:47 2004
+++ b/arch/ppc64/mm/init.c	Wed Oct 20 11:17:47 2004
@@ -263,7 +263,8 @@
 		 */
 		;
 	} else {
-		area = im_get_area(ea, size, IM_REGION_UNUSED|IM_REGION_SUBSET);
+		area = im_get_area(ea, size,
+			IM_REGION_UNUSED|IM_REGION_SUBSET|IM_REGION_EXISTS);
 		if (area == NULL) {
 			printk(KERN_ERR "could not obtain imalloc area for ea 0x%lx\n", ea);
 			return 1;


From cchaney at us.ibm.com  Thu Oct 21 03:04:20 2004
From: cchaney at us.ibm.com (Craig Chaney)
Date: Wed, 20 Oct 2004 13:04:20 -0400
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
In-Reply-To: <1098229131.5792.9.camel@gaston>
References: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
	<1097887510.6487.23.camel@gaston>
	<20041019230054.GA3807@kevlar.burdell.org>
	<1098229131.5792.9.camel@gaston>
Message-ID: <20041020170420.GA8345@sage.raleigh.ibm.com>

On Wed, Oct 20, 2004 at 09:38:52AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2004-10-20 at 09:00, Sonny Rao wrote:
> 
> > Ben, I'm still seeing this issue with 2.6.9 final, do you need
> > anything else?  I'm sure you're very busy, but please let me know if I
> > can help.
> 
> Well, I can't reproduce here, but it seems basically that one of the
> calls to alloc_down() is failing, you may want to trace a bit. I'll
> try to find by myself too & let you know.
> 
> Ben.

I can reproduce this on a p615 as well.  I did a little bit of superficial
tracking.

The call to alloc_down fails because (RELOC(alloc_top) == RELOC(rmo_top)) is
false.  On LPAR platforms, alloc_top is set to rmo_top in prom_init_mem.
However, for the p615, prom_find_machine_type() returns PLATFORM_PSERIES,
which causes the logic in prom_init_mem to set alloc_top to 0x40000000.

I can work around this by modifying prom_init_mem to set alloc_top to rmo_top
if of_platform is either PLATFORM_PSERIES_LPAR or PLATFORM_PSERIES.  This
allows me to boot a 2.6.9-rc4 kernel on a p615.

Hope this helps.

-Craig


From arnd at arndb.de  Thu Oct 21 03:08:17 2004
From: arnd at arndb.de (Arnd Bergmann)
Date: Wed, 20 Oct 2004 19:08:17 +0200
Subject: [PATCH] Add key management syscalls to non-i386 archs
In-Reply-To: <3506.1098283455@redhat.com>
References: <3506.1098283455@redhat.com>
Message-ID: <200410201908.18273.arnd@arndb.de>

On Wednesday 20 October 2004 16:44, David Howells wrote:

> diff -uNrp linux-2.6.9-bk4/arch/s390/kernel/compat_wrapper.S linux-2.6.9-bk4-keys/arch/s390/kernel/compat_wrapper.S
> --- linux-2.6.9-bk4/arch/s390/kernel/compat_wrapper.S	2004-06-18 13:43:49.000000000 +0100
> +++ linux-2.6.9-bk4-keys/arch/s390/kernel/compat_wrapper.S	2004-10-20 15:08:00.071403677 +0100
> @@ -1406,3 +1406,29 @@ compat_sys_mq_getsetattr_wrapper:
>  	llgtr	%r3,%r3			# struct compat_mq_attr *
>  	llgtr	%r4,%r4			# struct compat_mq_attr *
>  	jg	compat_sys_mq_getsetattr
> +
> +	.globl  sys32_add_key_wrapper
> +sys32_add_key_wrapper:
> +	lgfr	%r2,%r2			# const char *
> +	llgfr	%r3,%r3			# const char *
> +	llgfr	%r4,%r4			# const void *
> +	llgfr	%r5,%r5			# size_t
> +	llgfr	%r6,%r6			# key_serial_t
> +	jg	sys_add_key		# branch to system call
> +
> +	.globl  sys32_request_key_wrapper
> +sys32_request_key_wrapper:
> +	lgfr	%r2,%r2			# const char *
> +	llgfr	%r3,%r3			# const char *
> +	llgfr	%r4,%r4			# const char *
> +	llgfr	%r5,%r5			# key_serial_t
> +	jg	sys_request_key		# branch to system call
> +
> +	.globl  sys32_keyctl_wrapper
> +sys32_keyctl_wrapper:
> +	lgfr	%r2,%r2			# int
> +	llgfr	%r3,%r3			# unsigned long
> +	llgfr	%r4,%r4			# unsigned long
> +	llgfr	%r5,%r5			# unsigned long
> +	llgfr	%r6,%r6			# unsigned long
> +	jg	sys_keyctl		# branch to system call

The comments don't match with the code. Please use the correct
lgfr/llgfr/llgtr opcodes for signed/unsigned/pointer extension.
Note that for keyctl_wrapper, the actual conversion is not static
but depends on the value of %r2. You probably want to code that
conversion in C.
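
As a rough illustration of the three kinds of extension, the following
user-space C sketch models what each opcode does to a 32-bit argument
register. The helper names are made up for this example and do not exist
in the kernel.

```c
#include <stdint.h>

/* lgfr: sign-extend a 32-bit signed value (e.g. int) to 64 bits. */
static int64_t compat_ext_signed(uint32_t reg)
{
	return (int64_t)(int32_t)reg;
}

/* llgfr: zero-extend a 32-bit unsigned value (e.g. compat size_t). */
static uint64_t compat_ext_unsigned(uint32_t reg)
{
	return (uint64_t)reg;
}

/* llgtr: keep the low 31 bits and clear the rest (31-bit pointers). */
static uint64_t compat_ext_pointer(uint32_t reg)
{
	return (uint64_t)(reg & 0x7fffffffu);
}
```

By that rule an argument commented as "const char *" would want llgtr, a
size_t would want llgfr, and a signed value such as an int would want lgfr.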

	Arnd <><

From akpm at osdl.org  Thu Oct 21 03:50:27 2004
From: akpm at osdl.org (Andrew Morton)
Date: Wed, 20 Oct 2004 10:50:27 -0700
Subject: [PATCH] Add key management syscalls to non-i386 archs
In-Reply-To: <20041020152957.GA21774@infradead.org>
References: <3506.1098283455@redhat.com> <20041020152957.GA21774@infradead.org>
Message-ID: <20041020105027.54bf9e89.akpm@osdl.org>

Christoph Hellwig <hch at infradead.org> wrote:
>
> > Hi Linus, Andrew,
>  > 
>  > The attached patch adds syscalls for almost all archs (everything barring
>  > m68knommu which is in a real mess, and i386 which already has it).
>  > 
>  > It also adds 32->64 compatibility where appropriate.
> 
>  Umm, that patch added the damn multiplexer that had been vetoed multiple
>  times.  Why did this happen?

Fifteen new syscalls was judged excessive and the keyfs interface was
judged slow and bloaty.


From hch at infradead.org  Thu Oct 21 04:18:50 2004
From: hch at infradead.org (Christoph Hellwig)
Date: Wed, 20 Oct 2004 19:18:50 +0100
Subject: [PATCH] Add key management syscalls to non-i386 archs
In-Reply-To: <20041020105027.54bf9e89.akpm@osdl.org>
References: <3506.1098283455@redhat.com> <20041020152957.GA21774@infradead.org>
	<20041020105027.54bf9e89.akpm@osdl.org>
Message-ID: <20041020181850.GA23979@infradead.org>

On Wed, Oct 20, 2004 at 10:50:27AM -0700, Andrew Morton wrote:
> Christoph Hellwig <hch at infradead.org> wrote:
> >
> > > Hi Linus, Andrew,
> >  > 
> >  > The attached patch adds syscalls for almost all archs (everything barring
> >  > m68knommu which is in a real mess, and i386 which already has it).
> >  > 
> >  > It also adds 32->64 compatibility where appropriate.
> > 
> >  Umm, that patch added the damn multiplexer that had been vetoed multiple
> >  times.  Why did this happen?
> 
> Fifteen new syscalls was judged excessive and the keyfs interface was
> judged slow and bloaty.

Maybe 15 syscalls just means the API is goddamn awful and we certainly
shouldn't merge it as-is.


From linas at austin.ibm.com  Thu Oct 21 04:45:01 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 20 Oct 2004 13:45:01 -0500
Subject: status of ppc64 patches
In-Reply-To: <20041020010301.GA29579@4>
References: <41754644.1010003@austin.ibm.com>
	<1098231748.7493.114.camel@pants.austin.ibm.com>
	<20041020010301.GA29579@4>
Message-ID: <20041020184501.GF10026@austin.ibm.com>

On Tue, Oct 19, 2004 at 08:03:01PM -0500, Olof Johansson was heard to remark:
> 
> Also: Regarding re-basing patches: It has to be the duty of the developer
> of the patch to re-base it to current trees if it will no longer apply
> cleanly. 

I think this misses the point. I've re-based some of my patches more
than half-a-dozen times, and this has gotten so tedious that I've
just sort of stopped bothering sending in patches.  Excessive delays
in moving patches upstream just kills the development process.
Patches need to be handled in a timely manner, while they are still
'fresh', so that they don't need to be rebased.

Put it another way: it is, at this time, impossible for me to rebase,
because I know that my patches will conflict with others in the 
un-applied patch queue.  So all I can do is wait for the patch queue to
shrink, wait till the others get into the Torvalds tree, then bk pull, 
then hurry, hurry, rebase, test, submit, and hope I get in before 
someone else does and wrecks it again.  The turn-around time for 
"getting lucky" like this is over a month, and if one doesn't get 
lucky the first month, one has to wait a whole 'nother month for 
one's next shot.

--linas


From jschopp at austin.ibm.com  Thu Oct 21 05:28:35 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Wed, 20 Oct 2004 14:28:35 -0500
Subject: status of ppc64 patches
In-Reply-To: <16758.21948.795730.268143@cargo.ozlabs.ibm.com>
References: <41754644.1010003@austin.ibm.com>
	<16758.21948.795730.268143@cargo.ozlabs.ibm.com>
Message-ID: <4176BC63.8000700@austin.ibm.com>

> As far as your patches are concerned, I am aware of two patches that
> change things so that we have __boot variants of __pa etc.  However,
> your explanation didn't really get me excited about the change.  You
> said something about "moving towards hotplug memory" but you didn't
> explain why these changes would help with that, or how I should choose
> which function to use when I'm making changes in future (that should
> actually go in a file somewhere under the Documentation directory), or
> why those changes need to go in now.

The direct answer is that this is a big part of the size of the 
CONFIG_NONLINEAR patch, without the controversial part that actually 
does CONFIG_NONLINEAR.  CONFIG_NONLINEAR allows us to have big holes in 
physical memory and to grow physical memory after boot.  These changes 
will be necessary for whatever ends up filling the role CONFIG_NONLINEAR 
currently does in our hotplug memory tree.  So even if you hate 
CONFIG_NONLINEAR these patches will be necessary for memory hotplug 
because we will have to differentiate early boot memory from normal memory.

We have a tree that does memory add, and is part of the way to doing 
remove.  http://sprucegoose.sr71.net/patches It has 76 patches 
currently.  It is a real job to continue to forward port it.  We are 
trying to get it all upstream.  But of course it would be insane to 
merge 76 very complex patches at once, especially when a few of them are 
still buggy.

These changes need to go in now because they don't hurt anything and 
they help us a great deal on a project most everybody agrees is a good 
idea (memory hotplug).  If we didn't have a continuous development model 
they could be ignored until 2.7, but to get large features into a kernel 
that is always stable it is necessary to merge things a bit at a time. 
Even if those bits are only worthwhile in the context of the yet 
unmerged bits.

And I apologize for not making this all clear in my initial message.


From paulus at samba.org  Thu Oct 21 07:30:56 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 21 Oct 2004 07:30:56 +1000
Subject: [PATCH 1/1] rtas_flash_4gig
In-Reply-To: <200410041942.i94Jg4WA154540@westrelay04.boulder.ibm.com>
References: <200410041942.i94Jg4WA154540@westrelay04.boulder.ibm.com>
Message-ID: <16758.55568.809557.670513@cargo.ozlabs.ibm.com>

Jake,

> We should probably check to make sure that all of the flash
> list headers are above 4gig.  Not just the first one.

Why is the limit 4GB rather than the RMO size?

Paul.


From moilanen at austin.ibm.com  Thu Oct 21 08:08:17 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Wed, 20 Oct 2004 17:08:17 -0500
Subject: [PATCH 1/1] rtas_flash_4gig
In-Reply-To: <16758.55568.809557.670513@cargo.ozlabs.ibm.com>
References: <200410041942.i94Jg4WA154540@westrelay04.boulder.ibm.com>
	<16758.55568.809557.670513@cargo.ozlabs.ibm.com>
Message-ID: <20041020170817.0ee49b64@localhost>


> > We should probably check to make sure that all of the flash
> > list headers are above 4gig.  Not just the first one.
> 
> Why is the limit 4GB rather than the RMO size?

According to the RPA (item E7-41 to be exact), the block-list can be
anywhere under 4 gigs.  RTAS will make hypervisor calls to access this
memory.  

I would infer the reason why they want to allow the block-list outside
the RMO is otherwise it may have been difficult for the OS to get an
entire flash image under the RMO boundary.
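
A minimal sketch of checking every node of the list rather than only the
first one might look like the following. The node layout here is a
hypothetical model (one physical address per header plus a next pointer),
not the structure used by rtas_flash, and it only tests the start address
of each header.

```c
#include <stdint.h>
#include <stddef.h>

#define FOUR_GB 0x100000000ULL

/* Hypothetical model of one block-list header. */
struct block_list_node {
	uint64_t phys_addr;		/* physical address of this header */
	struct block_list_node *next;	/* next header, or NULL */
};

/* Returns 1 if every header in the list starts below 4 GB, else 0. */
static int block_list_below_4g(const struct block_list_node *n)
{
	for (; n != NULL; n = n->next)
		if (n->phys_addr >= FOUR_GB)
			return 0;
	return 1;
}
```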

Thanks,
Jake


From davem at davemloft.net  Thu Oct 21 08:01:49 2004
From: davem at davemloft.net (David S. Miller)
Date: Wed, 20 Oct 2004 15:01:49 -0700
Subject: [PATCH] Add key management syscalls to non-i386 archs
In-Reply-To: <3506.1098283455@redhat.com>
References: <3506.1098283455@redhat.com>
Message-ID: <20041020150149.7be06d6d.davem@davemloft.net>


David, I applaud your effort to take care of this.
However, this patch will conflict with what I've
sent into Linus already for Sparc.  I also had to
add the sys_altroot syscall entry as well.

I've mentioned several times that perhaps the best
way to deal with this problem is to purposefully
break the build of platforms when new system calls
are added.

Simply adding a:

#error new syscall entries for X and Y needed

to include/asm-*/unistd.h would handle this just
fine I think.

That way it won't be missed, and if the platform
maintainer wants to just ignore the new syscall
they can choose to do that as well.


From olof at austin.ibm.com  Thu Oct 21 08:26:41 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Wed, 20 Oct 2004 17:26:41 -0500
Subject: [PATCH] create iommu_free_table()
In-Reply-To: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
References: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
Message-ID: <4176E621.3040607@austin.ibm.com>

John Rose wrote:
> The patch below creates iommu_free_table().  Iommu tables are not currently
> freed in PPC64.  This could cause a memory leak for DLPAR of an EADS slot.  The
> function verifies that there are no outstanding TCE entries for the range of
> the table before freeing it.  I added a call to iommu_free_table() to the code
> that dynamically removes a device node.  This should be fairly symmetrical with
> the table allocation, which happens during dynamic addition of a device node.
> 
> Comments welcome.

Looks good, just a couple of minor nitpicks below.


-Olof
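
For reference, the bitmap arithmetic in the quoted patch can be modelled in
user space roughly as follows. This is a sketch under assumptions: the
function names are invented, the bitmap is taken to be an array of 64-bit
words with bit n stored as 1 << (n % 64) in word n / 64, and unlike the
quoted loop it also looks at a partial last word when the entry count is
not a multiple of 64.

```c
#include <stdint.h>
#include <stddef.h>

/* Bytes needed for a bitmap holding one bit per table entry. */
static size_t bitmap_bytes(size_t entries)
{
	return (entries + 7) / 8;
}

/* Returns 1 if none of the first `entries` bits is set, else 0. */
static int bitmap_is_empty(const uint64_t *map, size_t entries)
{
	size_t full = entries / 64;	/* whole 64-entry words */
	size_t rem = entries % 64;	/* entries in a partial last word */
	size_t i;

	for (i = 0; i < full; i++)
		if (map[i] != 0)
			return 0;

	if (rem && (map[full] & ((UINT64_C(1) << rem) - 1)))
		return 0;

	return 1;
}
```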


> Signed-off-by: John Rose <johnrose at austin.ibm.com>
> 
> diff -Nru a/arch/ppc64/kernel/pSeries_iommu.c b/arch/ppc64/kernel/pSeries_iommu.c
> --- a/arch/ppc64/kernel/pSeries_iommu.c	Thu Oct  7 11:08:19 2004
> +++ b/arch/ppc64/kernel/pSeries_iommu.c	Thu Oct  7 11:08:19 2004
> @@ -412,6 +412,38 @@
>  	dn->iommu_table = iommu_init_table(tbl);
>  }
>  
> +void iommu_free_table(struct device_node *dn)
> +{
> +	struct iommu_table *tbl = dn->iommu_table;
> +        unsigned long bitmap_sz, i;
> +        unsigned int order;
> +
> +        if (!tbl || !tbl->it_map) {

whitespace above looks wrong (or below?)

> +		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
> +				dn->full_name);
> +		return;
> +	}
> +
> +	/* verify that table contains no entries */
> +	/* it_mapsize is in entries, and we're examining 64 at a time */
> +	for (i = 0; i < (tbl->it_mapsize/64); i++) {
> +		if (tbl->it_map[i] != 0) {
> +			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
> +				__FUNCTION__, dn->full_name);
> +			break;
> +		}

Could this get spammy? It could be nice to see a WARN_ON(1) too, so the 
call stack is dumped. If that's added, a printk_ratelimit() would 
definitely be warranted around both the printk and the WARN_ON().

> +	}
> +
> +	/* calculate bitmap size in bytes */
> +	bitmap_sz = (tbl->it_mapsize + 7) / 8;
> +
> +	/* free bitmap */
> +	order = get_order(bitmap_sz);
> +	free_pages((unsigned long) tbl->it_map, order);
> +
> +	/* free table */
> +        kfree(tbl);

whitespace

> +}
>  
>  void iommu_setup_pSeries(void)
>  {
> diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c
> --- a/arch/ppc64/kernel/prom.c	Thu Oct  7 11:08:19 2004
> +++ b/arch/ppc64/kernel/prom.c	Thu Oct  7 11:08:19 2004
> @@ -1818,6 +1818,9 @@
>  		return -EBUSY;
>  	}
>  
> +	if (np->iommu_table)
> +		iommu_free_table(np);
> +
>  	write_lock(&devtree_lock);
>  	OF_MARK_STALE(np);
>  	remove_node_proc_entries(np);
> diff -Nru a/include/asm-ppc64/iommu.h b/include/asm-ppc64/iommu.h
> --- a/include/asm-ppc64/iommu.h	Thu Oct  7 11:08:19 2004
> +++ b/include/asm-ppc64/iommu.h	Thu Oct  7 11:08:19 2004
> @@ -113,6 +113,9 @@
>  /* Creates table for an individual device node */
>  extern void iommu_devnode_init(struct device_node *dn);
>  
> +/* Frees table for an individual device node */
> +extern void iommu_free_table(struct device_node *dn);
> +
>  #endif /* CONFIG_PPC_MULTIPLATFORM */
>  
>  #ifdef CONFIG_PPC_ISERIES
> 
> 
> 


From davem at davemloft.net  Thu Oct 21 09:04:50 2004
From: davem at davemloft.net (David S. Miller)
Date: Wed, 20 Oct 2004 16:04:50 -0700
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
 archs
In-Reply-To: <20041020225625.GD995@wotan.suse.de>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
	<20041020225625.GD995@wotan.suse.de>
Message-ID: <20041020160450.0914270b.davem@davemloft.net>

On Thu, 21 Oct 2004 00:56:25 +0200
Andi Kleen <ak at suse.de> wrote:

> I don't think that's a good idea.  Normally new system calls 
> are relatively obscure and the system works fine without them,
> so urgent action is not needed.
> 
> And I think we can trust architecture maintainers to regularly
> sync the system calls with i386.

I disagree quite strongly.  One major frustration for users of
non-x86 platforms is that functionality is often missing for some
time that we can make trivial to keep in sync.

I religiously watch what goes into Linus's tree for this purpose,
but that is kind of a ridiculous burden to expect every platform
maintainer to do.  It's not just system calls, we have signal handling
bug fixes, trap handling infrastructure, and now the nice generic
IRQ handling subsystem as other examples.

Simply put, if you're not watching the tree in painstaking detail
every day, you miss all of these enhancements.

The knowledge should come from the person putting the changes into
the tree, therefore it gets done once and this makes it so that
the other platform maintainers will find out about it automatically
next time they update their tree.


From ak at suse.de  Thu Oct 21 09:25:09 2004
From: ak at suse.de (Andi Kleen)
Date: Thu, 21 Oct 2004 01:25:09 +0200
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
	archs
In-Reply-To: <20041020160450.0914270b.davem@davemloft.net>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
	<20041020225625.GD995@wotan.suse.de>
	<20041020160450.0914270b.davem@davemloft.net>
Message-ID: <20041020232509.GF995@wotan.suse.de>

On Wed, Oct 20, 2004 at 04:04:50PM -0700, David S. Miller wrote:
> On Thu, 21 Oct 2004 00:56:25 +0200
> Andi Kleen <ak at suse.de> wrote:
> 
> > I don't think that's a good idea.  Normally new system calls 
> > are relatively obscure and the system works fine without them,
> > so urgent action is not needed.
> > 
> > And I think we can trust architecture maintainers to regularly
> > sync the system calls with i386.
> 
> I disagree quite strongly.  One major frustration for users of
> non-x86 platforms is that functionality is often missing for some
> time that we can make trivial to keep in sync.

I'm not sure really if the users of some embedded platform
are all cheering for key management system calls...

I guess they will prefer just something that compiles.

> 
> I religiously watch what goes into Linus's tree for this purpose,
> but that is kind of a ridiculous burden to expect every platform
> maintainer to do.  It's not just system calls, we have signal handling
> bug fixes, trap handling infrastructure, and now the nice generic
> IRQ handling subsystem as other examples.

Most of that is optional. When the arch maintainer chooses not to
use it you have just unnecessarily broken the build.

IMHO breaking the build unnecessarily is extremely bad because
it will prevent all testing. And would you really want to hold
up the whole linux testing machinery just for some obscure 
system call? IMHO not a good tradeoff.

> 
> Simply put, if you're not watching the tree in painstaking detail
> every day, you miss all of these enhancements.

I would assume the other maintainers go at least from time to 
time through the i386 diffs and check if they miss anything
(that is what I do). For system calls they do definitely, although
it may take some time.

> 
> The knowledge should come from the person putting the changes into
> the tree, therefore it gets done once and this makes it so that
> the other platform maintainers will find out about it automatically
> next time they update their tree.

And causing merging headaches and all kind of other problems.

-Andi


From ak at suse.de  Thu Oct 21 08:56:25 2004
From: ak at suse.de (Andi Kleen)
Date: Thu, 21 Oct 2004 00:56:25 +0200
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
	archs
In-Reply-To: <20041020150149.7be06d6d.davem@davemloft.net>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
Message-ID: <20041020225625.GD995@wotan.suse.de>

On Wed, Oct 20, 2004 at 03:01:49PM -0700, David S. Miller wrote:
> 
> David, I applaud your effort to take care of this.
> However, this patch will conflict with what I've
> sent into Linus already for Sparc.  I also had to
> add the sys_altroot syscall entry as well.
> 
> I've mentioned several times that perhaps the best
> way to deal with this problem is to purposefully
> break the build of platforms when new system calls
> are added.
> 
> Simply adding a:
> 
> #error new syscall entries for X and Y needed
> 
> to include/asm-*/unistd.h would handle this just
> fine I think.

I don't think that's a good idea.  Normally new system calls 
are relatively obscure and the system works fine without them,
so urgent action is not needed.

And I think we can trust architecture maintainers to regularly
sync the system calls with i386.

-Andi


From davem at davemloft.net  Thu Oct 21 09:41:44 2004
From: davem at davemloft.net (David S. Miller)
Date: Wed, 20 Oct 2004 16:41:44 -0700
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
 archs
In-Reply-To: <20041020232509.GF995@wotan.suse.de>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
	<20041020225625.GD995@wotan.suse.de>
	<20041020160450.0914270b.davem@davemloft.net>
	<20041020232509.GF995@wotan.suse.de>
Message-ID: <20041020164144.3457eafe.davem@davemloft.net>

On Thu, 21 Oct 2004 01:25:09 +0200
Andi Kleen <ak at suse.de> wrote:

> IMHO breaking the build unnecessarily is extremely bad because
> it will prevent all testing. And would you really want to hold
> up the whole linux testing machinery just for some obscure 
> system call? IMHO not a good tradeoff.

Then change the unistd.h cookie from "#error" to a "#warning".  It
accomplishes both of our goals.
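
A minimal sketch of what such a cookie could look like in an architecture's
unistd.h, assuming the three key management calls as the example. The
guard-per-syscall layout and the numbers are illustrative only; #warning is
a preprocessor extension supported by GCC.

```c
/* Illustrative syscall numbers; a real port would use its own table. */
#define __NR_add_key		269
#define __NR_request_key	270
#define __NR_keyctl		271

/*
 * Cookie added together with the generic change. An architecture that
 * has not yet wired up a call gets a warning but still builds.
 */
#ifndef __NR_add_key
#warning new syscall entry for add_key needed
#endif
#ifndef __NR_request_key
#warning new syscall entry for request_key needed
#endif
#ifndef __NR_keyctl
#warning new syscall entry for keyctl needed
#endif
```

As written nothing is printed because all three numbers are defined.
Removing any one of the defines makes the compiler emit the matching
warning while the build carries on; with #error in its place the build
would stop instead.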


From ak at suse.de  Thu Oct 21 10:10:42 2004
From: ak at suse.de (Andi Kleen)
Date: Thu, 21 Oct 2004 02:10:42 +0200
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
	archs
In-Reply-To: <20041020164144.3457eafe.davem@davemloft.net>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
	<20041020225625.GD995@wotan.suse.de>
	<20041020160450.0914270b.davem@davemloft.net>
	<20041020232509.GF995@wotan.suse.de>
	<20041020164144.3457eafe.davem@davemloft.net>
Message-ID: <20041021001041.GI995@wotan.suse.de>

On Wed, Oct 20, 2004 at 04:41:44PM -0700, David S. Miller wrote:
> On Thu, 21 Oct 2004 01:25:09 +0200
> Andi Kleen <ak at suse.de> wrote:
> 
> > IMHO breaking the build unnecessarily is extremely bad because
> > it will prevent all testing. And would you really want to hold
> > up the whole linux testing machinery just for some obscure 
> > system call? IMHO not a good tradeoff.
> 
> Then change the unistd.h cookie from "#error" to a "#warning".  It
> accomplishes both of our goals.

#warnings would be fine for me.

-Andi


From benh at kernel.crashing.org  Thu Oct 21 11:30:59 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 21 Oct 2004 11:30:59 +1000
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
In-Reply-To: <20041020170420.GA8345@sage.raleigh.ibm.com>
References: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
	<1097887510.6487.23.camel@gaston>
	<20041019230054.GA3807@kevlar.burdell.org>
	<1098229131.5792.9.camel@gaston>
	<20041020170420.GA8345@sage.raleigh.ibm.com>
Message-ID: <1098322258.4183.15.camel@gaston>

On Thu, 2004-10-21 at 03:04, Craig Chaney wrote:

> which causes the logic in prom_init_mem to set alloc_top to 0x40000000.
> 
> I can work around this by modifying prom_init_mem to set alloc_top to rmo_top
> if of_platform is either PLATFORM_PSERIES_LPAR or PLATFORM_PSERIES.  This
> allows me to boot a 2.6.9-rc4 kernel on a p615.

Yes, alloc_top and rmo_top should be both "clamped". Can you try that
patch and let me know ?

Index: linux-work/arch/ppc64/kernel/prom_init.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/prom_init.c	2004-10-20 18:38:08.911500096 +1000
+++ linux-work/arch/ppc64/kernel/prom_init.c	2004-10-21 11:30:23.570248584 +1000
@@ -675,7 +675,7 @@
 	if ( RELOC(of_platform) == PLATFORM_PSERIES_LPAR )
 		RELOC(alloc_top) = RELOC(rmo_top);
 	else
-		RELOC(alloc_top) = min(0x40000000ul, RELOC(ram_top));
+		RELOC(alloc_top) = RELOC(rmo_top) = min(0x40000000ul, RELOC(ram_top));
 	RELOC(alloc_bottom) = PAGE_ALIGN(RELOC(klimit) - offset + 0x4000);
 	RELOC(alloc_top_high) = RELOC(ram_top);
 

From david at gibson.dropbear.id.au  Thu Oct 21 11:32:07 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 21 Oct 2004 11:32:07 +1000
Subject: [PPC64] Don't build virtual IO drivers for PowerMac
Message-ID: <20041021013207.GH17760@zax>

Andrew, please apply:

Only compile vio.c on iSeries and pSeries, since other PPC64 platforms
(PowerMac) don't use virtual IO.  The resulting #ifdefs in dma.c are
kind of ugly, but at least contained, and I can't see a nicer way of
doing it for the time being.

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/arch/ppc64/kernel/Makefile
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/Makefile	2004-09-28 10:22:13.000000000 +1000
+++ working-2.6/arch/ppc64/kernel/Makefile	2004-10-05 15:47:16.541962864 +1000
@@ -11,7 +11,7 @@
 			udbg.o binfmt_elf32.o sys_ppc32.o ioctl32.o \
 			ptrace32.o signal32.o rtc.o init_task.o \
 			lmb.o cputable.o cpu_setup_power4.o idle_power4.o \
-			iommu.o sysfs.o vio.o
+			iommu.o sysfs.o
 
 obj-$(CONFIG_PPC_OF) +=	of_device.o
 
@@ -45,6 +45,7 @@
 obj-$(CONFIG_HVC_CONSOLE)	+= hvconsole.o
 obj-$(CONFIG_BOOTX_TEXT)	+= btext.o
 obj-$(CONFIG_HVCS)		+= hvcserver.o
+obj-$(CONFIG_IBMVIO)		+= vio.o
 
 obj-$(CONFIG_PPC_PMAC)		+= pmac_setup.o pmac_feature.o pmac_pci.o \
 				   pmac_time.o pmac_nvram.o pmac_low_i2c.o \
Index: working-2.6/arch/ppc64/Kconfig
===================================================================
--- working-2.6.orig/arch/ppc64/Kconfig	2004-09-28 10:22:13.000000000 +1000
+++ working-2.6/arch/ppc64/Kconfig	2004-10-05 15:47:16.541962864 +1000
@@ -110,6 +110,11 @@
 	  processors, that is, which share physical processors between
 	  two or more partitions.
 
+config IBMVIO
+	depends on PPC_PSERIES || PPC_ISERIES
+	bool
+	default y
+
 config U3_DART
 	bool 
 	depends on PPC_MULTIPLATFORM
Index: working-2.6/arch/ppc64/kernel/dma.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/dma.c	2004-08-09 09:51:38.000000000 +1000
+++ working-2.6/arch/ppc64/kernel/dma.c	2004-10-05 16:02:01.372034952 +1000
@@ -17,8 +17,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		return pci_dma_supported(to_pci_dev(dev), mask);
+#ifdef CONFIG_IBMVIO
 	if (dev->bus == &vio_bus_type)
 		return vio_dma_supported(to_vio_dev(dev), mask);
+#endif /* CONFIG_IBMVIO */
 	BUG();
 	return 0;
 }
@@ -28,8 +30,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		return pci_set_dma_mask(to_pci_dev(dev), dma_mask);
+#ifdef CONFIG_IBMVIO
 	if (dev->bus == &vio_bus_type)
 		return vio_set_dma_mask(to_vio_dev(dev), dma_mask);
+#endif /* CONFIG_IBMVIO */
 	BUG();
 	return 0;
 }
@@ -40,8 +44,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		return pci_alloc_consistent(to_pci_dev(dev), size, dma_handle);
+#ifdef CONFIG_IBMVIO
 	if (dev->bus == &vio_bus_type)
 		return vio_alloc_consistent(to_vio_dev(dev), size, dma_handle);
+#endif /* CONFIG_IBMVIO */
 	BUG();
 	return NULL;
 }
@@ -52,8 +58,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		pci_free_consistent(to_pci_dev(dev), size, cpu_addr, dma_handle);
+#ifdef CONFIG_IBMVIO
 	else if (dev->bus == &vio_bus_type)
 		vio_free_consistent(to_vio_dev(dev), size, cpu_addr, dma_handle);
+#endif /* CONFIG_IBMVIO */
 	else
 		BUG();
 }
@@ -64,8 +72,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		return pci_map_single(to_pci_dev(dev), cpu_addr, size, (int)direction);
+#ifdef CONFIG_IBMVIO
 	if (dev->bus == &vio_bus_type)
 		return vio_map_single(to_vio_dev(dev), cpu_addr, size, direction);
+#endif /* CONFIG_IBMVIO */
 	BUG();
 	return (dma_addr_t)0;
 }
@@ -76,8 +86,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		pci_unmap_single(to_pci_dev(dev), dma_addr, size, (int)direction);
+#ifdef CONFIG_IBMVIO
 	else if (dev->bus == &vio_bus_type)
 		vio_unmap_single(to_vio_dev(dev), dma_addr, size, direction);
+#endif /* CONFIG_IBMVIO */
 	else
 		BUG();
 }
@@ -89,8 +101,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		return pci_map_page(to_pci_dev(dev), page, offset, size, (int)direction);
+#ifdef CONFIG_IBMVIO
 	if (dev->bus == &vio_bus_type)
 		return vio_map_page(to_vio_dev(dev), page, offset, size, direction);
+#endif /* CONFIG_IBMVIO */
 	BUG();
 	return (dma_addr_t)0;
 }
@@ -101,8 +115,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		pci_unmap_page(to_pci_dev(dev), dma_address, size, (int)direction);
+#ifdef CONFIG_IBMVIO
 	else if (dev->bus == &vio_bus_type)
 		vio_unmap_page(to_vio_dev(dev), dma_address, size, direction);
+#endif /* CONFIG_IBMVIO */
 	else
 		BUG();
 }
@@ -113,8 +129,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		return pci_map_sg(to_pci_dev(dev), sg, nents, (int)direction);
+#ifdef CONFIG_IBMVIO
 	if (dev->bus == &vio_bus_type)
 		return vio_map_sg(to_vio_dev(dev), sg, nents, direction);
+#endif /* CONFIG_IBMVIO */
 	BUG();
 	return 0;
 }
@@ -125,8 +143,10 @@
 {
 	if (dev->bus == &pci_bus_type)
 		pci_unmap_sg(to_pci_dev(dev), sg, nhwentries, (int)direction);
+#ifdef CONFIG_IBMVIO
 	else if (dev->bus == &vio_bus_type)
 		vio_unmap_sg(to_vio_dev(dev), sg, nhwentries, direction);
+#endif /* CONFIG_IBMVIO */
 	else
 		BUG();
 }


-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From david at gibson.dropbear.id.au  Thu Oct 21 11:35:49 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 21 Oct 2004 11:35:49 +1000
Subject: [PPC64] Trivial sparse cleanups
Message-ID: <20041021013549.GI17760@zax>

Andrew, please apply:

This patch squashes a handful of assorted sparse warnings in the ppc64
code.  Should be pretty much trivial and self explanatory.
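
One hunk worth a second look is the hugetlbpage.c range check, whose
comment says both tests are needed in case addr+len overflows. A small
user-space sketch of why, using fixed-width types and invented function
names rather than the kernel's own:

```c
#include <stdint.h>

#define FOUR_GB UINT64_C(0x100000000)

/* Both tests, as in the hugetlbpage.c hunk. */
static int range_below_4g(uint64_t addr, uint64_t len)
{
	return addr < FOUR_GB && (addr + len) < FOUR_GB;
}

/* End test only: fooled when addr + len wraps around 64 bits. */
static int range_below_4g_end_only(uint64_t addr, uint64_t len)
{
	return (addr + len) < FOUR_GB;
}
```

For addr = 0xfffffffffffff000 and len = 0x2000 the sum wraps to 0x1000, so
the end-only variant accepts the range while the two-test version rejects
it. The UL suffix in the patch does not change this logic; it only makes
the type of the constant explicit.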

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/arch/ppc64/kernel/nvram.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/nvram.c	2004-09-24 10:14:09.000000000 +1000
+++ working-2.6/arch/ppc64/kernel/nvram.c	2004-10-21 11:34:39.057902952 +1000
@@ -77,7 +77,7 @@
 }
 
 
-static ssize_t dev_nvram_read(struct file *file, char *buf,
+static ssize_t dev_nvram_read(struct file *file, char __user *buf,
 			  size_t count, loff_t *ppos)
 {
 	ssize_t len;
@@ -117,7 +117,7 @@
 
 }
 
-static ssize_t dev_nvram_write(struct file *file, const char *buf,
+static ssize_t dev_nvram_write(struct file *file, const char __user *buf,
 			   size_t count, loff_t *ppos)
 {
 	ssize_t len;
Index: working-2.6/arch/ppc64/kernel/setup.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/setup.c	2004-10-05 10:08:10.000000000 +1000
+++ working-2.6/arch/ppc64/kernel/setup.c	2004-10-21 11:34:39.059902648 +1000
@@ -1111,7 +1111,7 @@
 {
 	/* ensure xmon is enabled */
 	xmon_init();
-	debugger(0);
+	debugger(NULL);
 
 	return 0;
 }
Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c	2004-10-20 10:52:39.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c	2004-10-21 11:34:39.060902496 +1000
@@ -249,7 +249,7 @@
 {
 	if (within_hugepage_high_range(addr, len))
 		return 0;
-	else if ((addr < 0x100000000) && ((addr+len) < 0x100000000)) {
+	else if ((addr < 0x100000000UL) && ((addr+len) < 0x100000000UL)) {
 		int err;
 		/* Yes, we need both tests, in case addr+len overflows
 		 * 64-bit arithmetic */
Index: working-2.6/arch/ppc64/mm/hash_utils.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hash_utils.c	2004-09-28 10:22:13.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hash_utils.c	2004-10-21 11:34:39.060902496 +1000
@@ -401,7 +401,7 @@
 		info.si_signo = SIGBUS;
 		info.si_errno = 0;
 		info.si_code = BUS_ADRERR;
-		info.si_addr = (void *)address;
+		info.si_addr = (void __user *)address;
 		force_sig_info(SIGBUS, &info, current);
 		return;
 	}


-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From benh at kernel.crashing.org  Thu Oct 21 11:51:10 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 21 Oct 2004 11:51:10 +1000
Subject: status of ppc64 patches
In-Reply-To: <20041020184501.GF10026@austin.ibm.com>
References: <41754644.1010003@austin.ibm.com>
	<1098231748.7493.114.camel@pants.austin.ibm.com>
	<20041020010301.GA29579@4> <20041020184501.GF10026@austin.ibm.com>
Message-ID: <1098323469.20954.27.camel@gaston>

On Thu, 2004-10-21 at 04:45, Linas Vepstas wrote:

> Put it another way: it is, at this time, impossible for me to rebase,
> because I know that my patches will conflict with others in the 
> un-applied patch queue.  So all I can do is wait for the patch queue to
> shrink, wait till the others get into the Torvalds tree, then bk pull, 
> then hurry, hurry, rebase, test, submit, and hope I get in before 
> someone else does and wrecks it again.  The turn-around time for 
> "getting lucky" like this is over a month, and if one doesn't get 
> lucky the first month, one has to wait a whole 'nother month for 
> one's next shot.

For some reason, it seems other people have a lot more luck than you
do ... 

Ben.


From benh at kernel.crashing.org  Thu Oct 21 11:55:32 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 21 Oct 2004 11:55:32 +1000
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
	archs
In-Reply-To: <20041020160450.0914270b.davem@davemloft.net>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
	<20041020225625.GD995@wotan.suse.de>
	<20041020160450.0914270b.davem@davemloft.net>
Message-ID: <1098323732.20955.31.camel@gaston>

On Thu, 2004-10-21 at 09:04, David S. Miller wrote:
> On Thu, 21 Oct 2004 00:56:25 +0200
> Andi Kleen <ak at suse.de> wrote:
> 
> > I don't think that's a good idea.  Normally new system calls 
> > are relatively obscure and the system works fine without them,
> > so urgent action is not needed.
> > 
> > And I think we can trust architecture maintainers to regularly
> > sync the system calls with i386.
> 
> I disagree quite strongly.  One major frustration for users of
> non-x86 platforms is that functionality is often missing for some
> time that we can make trivial to keep in sync.

I agree with David here. It's also easy for arch/platform maintainers to
"miss" a new syscall too ... for various reasons, we can't all read
_everything_ that gets posted to lkml and we all do occasionally miss
some csets going upstream, which means we can very well totally "forget"
about adding the new syscall to the arch ... until somebody complains,
which can be 1 or 2 releases later !

> I religiously watch what goes into Linus's tree for this purpose,
> but that is kind of a ridiculous burden to expect every platform
> maintainer to do.  It's not just system calls, we have signal handling
> bug fixes, trap handling infrastructure, and now the nice generic
> IRQ handling subsystem as other examples.

Right.

> Simply put, if you're not watching the tree in painstaking detail
> every day, you miss all of these enhancements.
>
> The knowledge should come from the person putting the changes into
> the tree, therefore it gets done once and this makes it so that
> the other platform maintainers will find out about it automatically
> next time they update their tree.

Agreed,
Ben.


From david at gibson.dropbear.id.au  Thu Oct 21 13:36:17 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 21 Oct 2004 13:36:17 +1000
Subject: [PPC64] xmon sparse cleanups
Message-ID: <20041021033617.GK17760@zax>

Andrew, please apply:

This patch removes many sparse warnings from the xmon code.  Mostly
K&R function declarations and 0-instead-of-NULLs.  There are still a
whole bunch of warnings in xmon/ppc-opc.c, which is a copy of a file
from binutils.

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/arch/ppc64/xmon/xmon.c
===================================================================
--- working-2.6.orig/arch/ppc64/xmon/xmon.c	2004-09-24 10:14:09.000000000 +1000
+++ working-2.6/arch/ppc64/xmon/xmon.c	2004-10-05 16:31:01.822963256 +1000
@@ -645,7 +645,7 @@
 	for (i = 0; i < NBPTS; ++i, ++bp)
 		if (bp->enabled && pc == bp->address)
 			return bp;
-	return 0;
+	return NULL;
 }
 
 static struct bpt *in_breakpoint_table(unsigned long nip, unsigned long *offp)
@@ -1582,7 +1582,7 @@
 extern char dec_exc;
 
 void
-super_regs()
+super_regs(void)
 {
 	int cmd;
 	unsigned long val;
@@ -1816,7 +1816,7 @@
     "";
 
 void
-memex()
+memex(void)
 {
 	int cmd, inc, i, nslash;
 	unsigned long n;
@@ -1967,7 +1967,7 @@
 }
 
 int
-bsesc()
+bsesc(void)
 {
 	int c;
 
@@ -1985,7 +1985,7 @@
 			 || ('a' <= (c) && (c) <= 'f') \
 			 || ('A' <= (c) && (c) <= 'F'))
 void
-dump()
+dump(void)
 {
 	int c;
 
@@ -2150,7 +2150,7 @@
 static unsigned mask;
 
 void
-memlocate()
+memlocate(void)
 {
 	unsigned a, n;
 	unsigned char val[4];
@@ -2183,7 +2183,7 @@
 static unsigned long mlim = 0xffffffff;
 
 void
-memzcan()
+memzcan(void)
 {
 	unsigned char v;
 	unsigned a;
@@ -2212,7 +2212,7 @@
 
 /* Input scanning routines */
 int
-skipbl()
+skipbl(void)
 {
 	int c;
 
@@ -2237,8 +2237,7 @@
 };
 
 int
-scanhex(vp)
-unsigned long *vp;
+scanhex(unsigned long *vp)
 {
 	int c, d;
 	unsigned long v;
@@ -2322,7 +2321,7 @@
 }
 
 void
-scannl()
+scannl(void)
 {
 	int c;
 
@@ -2365,13 +2364,13 @@
 static char *lineptr;
 
 void
-flush_input()
+flush_input(void)
 {
 	lineptr = NULL;
 }
 
 int
-inchar()
+inchar(void)
 {
 	if (lineptr == NULL || *lineptr == 0) {
 		if (fgets(line, sizeof(line), stdin) == NULL) {
@@ -2384,8 +2383,7 @@
 }
 
 void
-take_input(str)
-char *str;
+take_input(char *str)
 {
 	lineptr = str;
 }
Index: working-2.6/arch/ppc64/xmon/start.c
===================================================================
--- working-2.6.orig/arch/ppc64/xmon/start.c	2004-08-09 09:51:38.000000000 +1000
+++ working-2.6/arch/ppc64/xmon/start.c	2004-10-05 16:33:50.355028808 +1000
@@ -173,7 +173,7 @@
 		c = xmon_getchar();
 		if (c == -1) {
 			if (p == str)
-				return 0;
+				return NULL;
 			break;
 		}
 		*p++ = c;
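
For readers unfamiliar with the two kinds of warning this patch silences, here is a small stand-alone illustration in plain C. The names sketch_scanhex() and sketch_skipbl() and their string-based interfaces are hypothetical (the real xmon routines read from the console); only the declaration style and the use of NULL are the point.

```c
#include <stddef.h>

/*
 * K&R style, which sparse warns about:
 *
 *     int sketch_scanhex(s, vp)
 *     const char *s;
 *     unsigned long *vp;
 *
 * The definitions below use prototype style instead.
 */
static int hexval(int c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

/* Parse leading hex digits of s into *vp; return the number of digits used. */
int sketch_scanhex(const char *s, unsigned long *vp)
{
	unsigned long v = 0;
	int n = 0, d;

	if (s == NULL || vp == NULL)	/* NULL, not a bare 0, for pointers */
		return 0;
	while ((d = hexval((unsigned char)s[n])) >= 0) {
		v = (v << 4) | (unsigned long)d;
		n++;
	}
	if (n > 0)
		*vp = v;	/* leave *vp untouched when nothing was parsed */
	return n;
}

/* Return a pointer to the first non-blank character, or NULL if none. */
const char *sketch_skipbl(const char *s)
{
	while (*s == ' ' || *s == '\t')
		s++;
	return *s ? s : NULL;	/* "return 0" here is what sparse flags */
}
```

An empty parameter list such as "void memex()" declares a function with unspecified arguments in C, which is why the patch spells out "(void)".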


-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From wjfast at yahoo.com  Thu Oct 21 16:33:30 2004
From: wjfast at yahoo.com (Wjeeha Tahir)
Date: Wed, 20 Oct 2004 23:33:30 -0700 (PDT)
Subject: Booting Linux from HardDisk on iSeries
Message-ID: <20041021063330.34212.qmail@web14926.mail.yahoo.com>

Hi,
 
This is my first email to this list, and I am hopeful of finding a solution to my problem here. I have installed Red Hat Linux 9 in an LPAR on an iSeries machine in my office, but I am having problems with it. The kernel version reported by uname -a is 2.4.21-4.EL.

Now that installation is complete, I want to boot from disk rather than from the CD drive. I think some boot image needs to be copied onto the disk. I looked at the Technical FAQ for Linux on iSeries: http://www-1.ibm.com/servers/eserver/iseries/linux/tech_faq.html#kernel and performed the following steps.

I ran fdisk -l and the output was as follows:

Disk /dev/iseries/vda: 4194 MB, 4194892800 bytes
255 heads, 63 sectors/track, 510 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

           Device Boot    Start       End    Blocks   Id  System
/dev/iseries/vda1   *         1         2     16033+  41  PPC PReP Boot
/dev/iseries/vda2             3       384   3068415   83  Linux
/dev/iseries/vda3           385       510   1012095   82  Linux swap

Hence my Prep Partition is /dev/iseries/vda1

Next, the implementation document tells me to execute the command dd if=/boot/vmlinux.good of=/dev/iseries/vda1 bs=4k

However the problem is that there is no file by the name of vmlinux.good in the boot directory. I'll show you the listing of boot directory.

[root at TestLinux /]# cd /boot

[root at TestLinux boot]# ls

cmdline-2.4.21-4.EL kernel.h System.map-2.4.21-4.EL

config-2.4.21-4.EL message vmlinitrd-2.4.21-4.EL

grub message.ja vmlinux-2.4.21-4.EL

initrd-2.4.21-4.EL.img System.map

Now I am at a loss as to what should be the input file for the dd command. I tried the command:

dd if=/boot/vmlinitrd-2.4.21-4.EL of=/dev/iseries/vda1 bs=4k

but when I booted from "IPL Source" = *NWSSTG ,"Stream file" = *NONE, "IPL parameters" = 'root=/dev/iseries/vda1," , Linux doesn't boot and I get the following error:

Partition check:

iseries/vda: iseries/vda1 iseries/vda2 iseries/vda3

iSeries virtual I/O: viod: Disk 00 size 4000M, sectors 63, heads 255, cylinders 510, sectsize 512

iSeries virtual I/O: viod: Disk 00 partition 01 start sector 63, # sector 32067

iSeries virtual I/O: viod: Disk 00 partition 02 start sector 32130, # sector 6136830

iSeries virtual I/O: viod: Disk 00 partition 03 start sector 6168960, # sector 2024190

Loading jbd.o module

Journalled Block Device driver loaded

Loading ext3.o module

Mounting /proc filesystem

Creating block devices

Creating root device

Mounting root filesystem

VFS: Can't find ext3 filesystem on dev viod(112,1).

mount: error 22 mounting ext3

pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2

umount /initrd/proc failed: 2

Freeing unused kernel memory: 156k init

Kernel panic: No init found. Try passing init= option to kernel.

Rebooting in 180 seconds..

Can anyone tell me the exact command, specifying what to copy from where and to where? I would be very thankful for any help with this.

Kind Regards,

Wjeeha Tahir



From sfr at canb.auug.org.au  Thu Oct 21 18:05:46 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Thu, 21 Oct 2004 18:05:46 +1000
Subject: Booting Linux from HardDisk on iSeries
In-Reply-To: <20041021063330.34212.qmail@web14926.mail.yahoo.com>
References: <20041021063330.34212.qmail@web14926.mail.yahoo.com>
Message-ID: <20041021180546.780f3090.sfr@canb.auug.org.au>

On Wed, 20 Oct 2004 23:33:30 -0700 (PDT) Wjeeha Tahir <wjfast at yahoo.com> wrote:
>
> but when I booted from "IPL Source" = *NWSSTG ,"Stream file" = *NONE,
> "IPL parameters" = 'root=/dev/iseries/vda1," , the Linux doesnt boot and
                                        ^^^^
This should be vda2 ...

Linux did boot, it just could not find its root file system ...
-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From wjfast at yahoo.com  Thu Oct 21 18:30:41 2004
From: wjfast at yahoo.com (Wjeeha Tahir)
Date: Thu, 21 Oct 2004 01:30:41 -0700 (PDT)
Subject: Booting Linux from HardDisk on iSeries
In-Reply-To: <20041021180546.780f3090.sfr@canb.auug.org.au>
Message-ID: <20041021083041.82774.qmail@web14921.mail.yahoo.com>

I changed to vda2 but now Linux isn't booting at all. When the console connects to the iSeries, the screen is blank. The errors that were being given initially are not appearing now. I am totally stuck.
 
Please help in this regard.

Stephen Rothwell <sfr at canb.auug.org.au> wrote:
On Wed, 20 Oct 2004 23:33:30 -0700 (PDT) Wjeeha Tahir wrote:
>
> but when I booted from "IPL Source" = *NWSSTG ,"Stream file" = *NONE,
> "IPL parameters" = 'root=/dev/iseries/vda1," , the Linux doesnt boot and
^^^^
This should be vda2 ...

Linux did boot, it just could not find its root file system ...
-- 
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/



From jbglaw at lug-owl.de  Thu Oct 21 18:47:29 2004
From: jbglaw at lug-owl.de (Jan-Benedict Glaw)
Date: Thu, 21 Oct 2004 10:47:29 +0200
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
	archs
In-Reply-To: <20041020160450.0914270b.davem@davemloft.net>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
	<20041020225625.GD995@wotan.suse.de>
	<20041020160450.0914270b.davem@davemloft.net>
Message-ID: <20041021084728.GA5033@lug-owl.de>

On Wed, 2004-10-20 16:04:50 -0700, David S. Miller <davem at davemloft.net>
wrote in message <20041020160450.0914270b.davem at davemloft.net>:
> On Thu, 21 Oct 2004 00:56:25 +0200
> Andi Kleen <ak at suse.de> wrote:

*VAX hacker's hat on*

> I disagree quite strongly.  One major frustration for users of
> non-x86 platforms is that functionality is often missing for some
> time that we can make trivial to keep in sync.

Full ACK.

> Simply put, if you're not watching the tree in painstaking detail
> every day, you miss all of these enhancements.

Right; and these missing enhancements will cause extra-pain when they're
used some time later from core code. That is, you missed the feature
while it was discussed/accepted and need to put it in place later on. So
you've got to do extra searching etc.

> The knowledge should come from the person putting the changes into
> the tree, therefore it gets done once and this makes it so that
> the other platform maintainers will find out about it automatically
> next time they update their tree.

Here's my proposal:

$ mkdir ./Documentation/new_enhancements_to_implement
$ cat > ./Documentation/new_enhancements_to_implement/new_key_syscalls << EOF
> Dear Architecture Maintainers,
> 
> please add these four new cryptographic key functions to your syscall
> table. It's quite easy; just extend the ./include/arch-xxx/unistd.h
> for four new defines and then add them to your ./arch/xxx/kernel/entry.S
> file. For reference, here's my i386 patch doing this:
> 
> diff -Nurp
> --- path-old/to/file/one
> +++ path-new/to/file/one
>  text
> -del
> +add
>  more text
> 
> 
> Thanks, your keychain hacker:-)
> EOF
$

This way, all arch maintainers just *see* what needs to be done and
get a small introduction on how to do that. I'd *really* like to see
that! That would particularly help those that cannot do full-time
hacking on their port (like us VAX hackers:-)
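
To illustrate why a forgotten entry can go unnoticed for a release or two, here is a user-space sketch (not kernel code) of a syscall-style dispatch table. Slots that an architecture has not wired up fall through to -ENOSYS, so nothing visibly breaks until somebody actually calls them. All names and slot numbers below are made up for the example.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*sketch_call_t)(long arg);

/* Hypothetical slot numbers; real syscall numbers differ per architecture. */
enum { SK_NR_GETPID, SK_NR_ADD_KEY, SK_NR_REQUEST_KEY, SK_NR_KEYCTL, SK_NR_MAX };

static long sk_getpid(long arg)  { (void)arg; return 42; }
static long sk_add_key(long arg) { return arg + 1; }

/* "arch A" has wired up add_key; "arch B" has not got round to it yet. */
static const sketch_call_t arch_a_table[SK_NR_MAX] = {
	[SK_NR_GETPID]  = sk_getpid,
	[SK_NR_ADD_KEY] = sk_add_key,
};
static const sketch_call_t arch_b_table[SK_NR_MAX] = {
	[SK_NR_GETPID]  = sk_getpid,
};

/* Unknown or unwired slots report "function not implemented". */
long sketch_dispatch(const sketch_call_t *table, int nr, long arg)
{
	if (nr < 0 || nr >= SK_NR_MAX || table[nr] == NULL)
		return -ENOSYS;
	return table[nr](arg);
}
```

The same call succeeds on arch A and quietly returns -ENOSYS on arch B, which matches the "until somebody complains" situation described earlier in the thread.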

MfG, JBG

-- 
Jan-Benedict Glaw       jbglaw at lug-owl.de    . +49-172-7608481             _ O _
"Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg  _ _ O
 fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));

From wjfast at yahoo.com  Thu Oct 21 19:33:32 2004
From: wjfast at yahoo.com (Wjeeha Tahir)
Date: Thu, 21 Oct 2004 02:33:32 -0700 (PDT)
Subject: Fwd: Re: Booting Linux from HardDisk on iSeries
Message-ID: <20041021093332.48090.qmail@web14927.mail.yahoo.com>

Just a correction: after I changed to vda2, the following messages appear on the console:
 
mf.c: Preparing to bounce...
LINUXRH : Console connected.
pty: 2048 Unix98 ptys configured
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 256 RAM disks of 8192K size 1024 blocksize
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 2048 buckets, 32Kbytes
TCP: Hash tables configured (established 16384 bind 16384)
Linux IP multicast router 0.06 plus PIM-SM
Initializing IPsec netlink socket
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
Freeing initrd memory: 788k freed
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 156k init
Kernel panic: No init found.  Try passing init= option to kernel.
Rebooting in 180 seconds..

What to do now??
 
Thanks and Kind Regards,
Wjeeha Tahir

Wjeeha Tahir <wjfast at yahoo.com> wrote:
Date: Thu, 21 Oct 2004 01:30:41 -0700 (PDT)
From: Wjeeha Tahir 
Subject: Re: Booting Linux from HardDisk on iSeries
To: Stephen Rothwell 
CC: linuxppc64-dev at ozlabs.org

I changed to vda2 but now Linux isn't booting at all. When the console connects to the iSeries, the screen is blank. The errors that were being given initially are not appearing now. I am totally stuck.
 
Please help in this regard.

Stephen Rothwell <sfr at canb.auug.org.au> wrote:
On Wed, 20 Oct 2004 23:33:30 -0700 (PDT) Wjeeha Tahir wrote:
>
> but when I booted from "IPL Source" = *NWSSTG ,"Stream file" = *NONE,
> "IPL parameters" = 'root=/dev/iseries/vda1," , the Linux doesnt boot and
^^^^
This should be vda2 ...

Linux did boot, it just could not find its root file system ...
-- 
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/



From cchaney at us.ibm.com  Thu Oct 21 23:20:54 2004
From: cchaney at us.ibm.com (Craig Chaney)
Date: Thu, 21 Oct 2004 09:20:54 -0400
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
In-Reply-To: <1098322258.4183.15.camel@gaston>
References: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
	<1097887510.6487.23.camel@gaston>
	<20041019230054.GA3807@kevlar.burdell.org>
	<1098229131.5792.9.camel@gaston>
	<20041020170420.GA8345@sage.raleigh.ibm.com>
	<1098322258.4183.15.camel@gaston>
Message-ID: <20041021132054.GA15732@sage.raleigh.ibm.com>

On Thu, Oct 21, 2004 at 11:30:59AM +1000, Benjamin Herrenschmidt wrote:
> Yes, alloc_top and rmo_top should be both "clamped". Can you try that
> patch and let me know ?

Yup, it worked.  Your patch allows 2.6.9-rc4 to boot on a p615.

Thanks,
Craig


From sonny at burdell.org  Fri Oct 22 01:41:04 2004
From: sonny at burdell.org (Sonny Rao)
Date: Thu, 21 Oct 2004 11:41:04 -0400
Subject: 2.6.9-rc4 kernel -- "cannot find space for TCE table"
In-Reply-To: <20041021132054.GA15732@sage.raleigh.ibm.com>
References: <OFF2A7A85B.119958DD-ON87256F2E.0073427E-86256F2E.0071BEBC@us.ibm.com>
	<1097887510.6487.23.camel@gaston>
	<20041019230054.GA3807@kevlar.burdell.org>
	<1098229131.5792.9.camel@gaston>
	<20041020170420.GA8345@sage.raleigh.ibm.com>
	<1098322258.4183.15.camel@gaston>
	<20041021132054.GA15732@sage.raleigh.ibm.com>
Message-ID: <20041021154104.GA15267@kevlar.burdell.org>

On Thu, Oct 21, 2004 at 09:20:54AM -0400, Craig Chaney wrote:
> On Thu, Oct 21, 2004 at 11:30:59AM +1000, Benjamin Herrenschmidt wrote:
> > Yes, alloc_top and rmo_top should be both "clamped". Can you try that
> > patch and let me know ?
> 
> Yup, it worked.  Your patch allows 2.6.9-rc4 to boot on a p615.

Also worked on 2.6.9 final, thanks guys.

Sonny
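
As a rough user-space sketch of the "clamp both" idea confirmed in this thread: on the non-LPAR path, alloc_top and rmo_top are set together to the smaller of a 1 GB cap and the top of RAM. The 0x40000000ul constant is the one that appears in the prom_init.c fix posted to the list; the struct and function names are invented for illustration.

```c
/* 1 GB cap, the constant used in the prom_init.c fix. */
#define SKETCH_CAP 0x40000000ul

/* Hypothetical container for the three limits discussed in the thread. */
struct sketch_alloc {
	unsigned long ram_top;
	unsigned long rmo_top;
	unsigned long alloc_top;
};

static unsigned long sketch_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Non-LPAR case: clamp both tops together so they cannot disagree. */
void sketch_clamp_tops(struct sketch_alloc *a)
{
	a->alloc_top = a->rmo_top = sketch_min(SKETCH_CAP, a->ram_top);
}
```

With 2 GB of RAM both limits end up at 1 GB; with 256 MB both end up at the top of RAM.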


From mjr at us.ibm.com  Fri Oct 22 01:17:44 2004
From: mjr at us.ibm.com (Mike Ranweiler)
Date: Thu, 21 Oct 2004 10:17:44 -0500
Subject: Fwd: Re: Booting Linux from HardDisk on iSeries
In-Reply-To: <20041021093332.48090.qmail@web14927.mail.yahoo.com>
References: <20041021093332.48090.qmail@web14927.mail.yahoo.com>
Message-ID: <200410211017.45047.mjr@us.ibm.com>

On Thursday 21 October 2004 04:33, Wjeeha Tahir wrote:
> Just a correction.. after i changed to vda2 the following errors appear on
> the console:

I thought RHEL3 usually used something like 'ro root=LABEL=/' for a cmdline.  
The easiest way to do this is to boot from the B side and then put whatever's 
in /proc/cmdline from that boot into your IPL Parameters.  You can also see 
that from strsst, 5, 1, 11, F10.

Mike
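
If it helps to check what a working boot actually passed as its root device, the following small C helper (a sketch, with a made-up function name) pulls the value of the first root= token out of a command line string such as the contents of /proc/cmdline.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the value of the first "root=" token in cmdline into out.
 * Returns 1 on success, 0 if there is no such token or out is too small.
 */
int sketch_find_root(const char *cmdline, char *out, size_t outlen)
{
	const char *p = cmdline;

	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of this token */
		if (len > 5 && strncmp(p, "root=", 5) == 0) {
			if (len - 5 >= outlen)
				return 0;
			memcpy(out, p + 5, len - 5);
			out[len - 5] = '\0';
			return 1;
		}
		p += len;
	}
	return 0;
}
```

Reading /proc/cmdline with fgets() and passing the buffer to this function would give the string to compare against the IPL parameters.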


From benh at kernel.crashing.org  Fri Oct 22 11:02:44 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 22 Oct 2004 11:02:44 +1000
Subject: [PATCH] ppc64: Fix boot on some non-LPAR pSeries
Message-ID: <1098406963.6008.13.camel@gaston>

Hi !

This patch fixes a problem when allocating the TCE tables (iommu) during
early boot on some non-LPAR machines with a lot of memory.

Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>

--- linux-work.orig/arch/ppc64/kernel/prom_init.c	2004-10-20 18:38:08.911500096 +1000
+++ linux-work/arch/ppc64/kernel/prom_init.c	2004-10-21 11:30:23.570248584 +1000
@@ -675,7 +675,7 @@
 	if ( RELOC(of_platform) == PLATFORM_PSERIES_LPAR )
 		RELOC(alloc_top) = RELOC(rmo_top);
 	else
-		RELOC(alloc_top) = min(0x40000000ul, RELOC(ram_top));
+		RELOC(alloc_top) = RELOC(rmo_top) = min(0x40000000ul, RELOC(ram_top));
 	RELOC(alloc_bottom) = PAGE_ALIGN(RELOC(klimit) - offset + 0x4000);
 	RELOC(alloc_top_high) = RELOC(ram_top);
 

From paulus at samba.org  Fri Oct 22 11:59:17 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 22 Oct 2004 11:59:17 +1000
Subject: [PATCH] add syslog printing to xmon debugger.
In-Reply-To: <20040916230647.GN9645@austin.ibm.com>
References: <20040916230647.GN9645@austin.ibm.com>
Message-ID: <16760.26997.131687.456670@cargo.ozlabs.ibm.com>

Linas,

> Andrew,
> 
> Please apply at least the kernel/printk.c part of the patch,
> if you are feeling at all charitable.

Did you ever get any reaction to that?

Paul.


From paulus at samba.org  Fri Oct 22 13:49:48 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 22 Oct 2004 13:49:48 +1000
Subject: [PATCH 1/1] ppc64: Block config accesses during BIST
In-Reply-To: <200409012158.i81LwRGY176052@northrelay04.pok.ibm.com>
References: <200409012158.i81LwRGY176052@northrelay04.pok.ibm.com>
Message-ID: <16760.33628.666087.631340@cargo.ozlabs.ibm.com>

Brian King writes:

> Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters)
> have an exposure today in that they issue BIST to the adapter to reset
> the card. If, during the time it takes to complete BIST, userspace attempts
> to access PCI config space, the host bus bridge will master abort the access
> since the ipr adapter does not respond on the PCI bus for a brief period of
> time when running BIST. This master abort results in the host PCI bridge
> isolating that PCI device from the rest of the system, making the device
> unusable until Linux is rebooted. This patch is an attempt to close that
> exposure by introducing some blocking code in the arch specific PCI code.
> The intent is to have the ipr device driver invoke these routines to
> prevent userspace PCI accesses from occurring during this window.
> 
> It has been tested by running BIST on an ipr adapter while running a
> script which looped reading the config space of that adapter through sysfs.
> Without the patch, an EEH error occurs. With the patch there is no EEH
> error. Tested on Power 5 and iSeries Power 4.

The general idea seems fine to me.  There are a couple of things I
don't like about the patch though:

(1) I don't see why we need separate implementations of
    pci_block_config_io, pci_unblock_config_io and pci_start_bist for
    iSeries and for the rest.  (Maybe that just points up that we
    still have gratuitous differences between the iSeries and
    non-iSeries PCI code.)

(2) I don't think we need to add a spinlock to the device node
    structure.  A single global spinlock should suffice, particularly
    since we get serialized on the RTAS call anyway, and therefore
    there is no incentive to try to provide parallelism at the higher
    levels.
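
A minimal user-space sketch of point (2), assuming C11 atomics in place of the kernel's spinlock API: one global lock guards a per-device "blocked" flag, and config reads consult the flag under that lock. Every name here is hypothetical, and returning all-ones data while blocked is an assumption of the sketch, not something taken from Brian's patch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* One global lock for all devices, as suggested in point (2). */
static atomic_flag sketch_cfg_lock = ATOMIC_FLAG_INIT;

struct sketch_dev {
	int      cfg_blocked;	/* set while the adapter runs BIST */
	uint32_t cfg_word;	/* stand-in for a config-space register */
};

static void sketch_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&sketch_cfg_lock,
						 memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void sketch_unlock(void)
{
	atomic_flag_clear_explicit(&sketch_cfg_lock, memory_order_release);
}

void sketch_block_config_io(struct sketch_dev *d)
{
	sketch_lock();
	d->cfg_blocked = 1;
	sketch_unlock();
}

void sketch_unblock_config_io(struct sketch_dev *d)
{
	sketch_lock();
	d->cfg_blocked = 0;
	sketch_unlock();
}

/* Returns 0 and fills *val, or -1 (with all-ones data) while blocked. */
int sketch_read_config(struct sketch_dev *d, uint32_t *val)
{
	int ret = 0;

	sketch_lock();
	if (d->cfg_blocked) {
		*val = 0xffffffffu;
		ret = -1;
	} else {
		*val = d->cfg_word;
	}
	sketch_unlock();
	return ret;
}
```

Because the flag lives in the device structure but the lock does not, no per-device lock field is needed.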

Comments?

Paul.


From nathanl at austin.ibm.com  Fri Oct 22 19:19:56 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Fri, 22 Oct 2004 04:19:56 -0500
Subject: [PATCH] ppc64: cpu hotplug notifier for numa
Message-ID: <1098436795.17305.22.camel@biclops>

The NUMA properties of all "possible" cpus are not necessarily
available at boot time on pSeries LPAR.  Only the properties for present
cpus are known.

This patch modifies the ppc64 numa code to map a cpu to its node right
before it is brought up -- this means that secondary cpus are now
mapped to their nodes during smp_init() (regardless of whether
CONFIG_HOTPLUG_CPU=y).  Cpus are removed from their nodes after they
have gone offline.

Also some minor cleanups:
- Stash the "minimum common depth" in a global at boot time, so we
  don't have to rediscover it every time something changes.

- Remove unnecessary variable from of_get_associativity() which is
  accessed while possibly uninitialized.

- Remove the cpu portion from dump_numa_topology() since it will show
  only the boot cpu now.  We could display this information from
  smp_cpus_done() if necessary.
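
A stand-alone sketch of the cached-depth lookup described in the first cleanup item, written as ordinary user-space C. It mirrors the shape of of_node_numa_domain() in the patch (check the cached depth, then index the associativity array), but the names are invented and the fallback to node 0 when information is missing is an assumption of the sketch.

```c
#include <stddef.h>

/* Cached once at "boot"; -1 means no usable NUMA information. */
static int sketch_min_common_depth = -1;

/*
 * assoc[0] holds the number of entries that follow, mirroring how the
 * patch tests tmp[0] before indexing tmp[min_common_depth].
 */
int sketch_numa_domain(const unsigned int *assoc)
{
	if (sketch_min_common_depth == -1)
		return 0;
	if (assoc && assoc[0] >= (unsigned int)sketch_min_common_depth)
		return (int)assoc[sketch_min_common_depth];
	return 0;	/* no information: fall back to node 0 */
}
```

Callers no longer pass a depth argument, which is what lets the hotplug path reuse the lookup long after boot.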

Tested on a 4-way 2-node Power5 system.

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>


---

 numa.c |  192 +++++++++++++++++++++++++--------------------
 1 files changed, 108 insertions(+), 84 deletions(-)


diff -puN arch/ppc64/mm/numa.c~ppc64-numa-cpu-hotplug-notifier arch/ppc64/mm/numa.c
--- 2.6.9-bk6/arch/ppc64/mm/numa.c~ppc64-numa-cpu-hotplug-notifier	2004-10-22 01:37:04.000000000 -0500
+++ 2.6.9-bk6-nathanl/arch/ppc64/mm/numa.c	2004-10-22 01:37:04.000000000 -0500
@@ -15,6 +15,8 @@
 #include <linux/mmzone.h>
 #include <linux/module.h>
 #include <linux/nodemask.h>
+#include <linux/cpu.h>
+#include <linux/notifier.h>
 #include <asm/lmb.h>
 #include <asm/machdep.h>
 #include <asm/abs_addr.h>
@@ -39,6 +41,7 @@ int nr_cpus_in_node[MAX_NUMNODES] = { [0
 struct pglist_data *node_data[MAX_NUMNODES];
 bootmem_data_t __initdata plat_node_bdata[MAX_NUMNODES];
 static unsigned long node0_io_hole_size;
+static int min_common_depth;
 
 /*
  * We need somewhere to store start/span for each node until we have
@@ -64,7 +67,24 @@ static inline void map_cpu_to_node(int c
 	}
 }
 
-static struct device_node * __init find_cpu_node(unsigned int cpu)
+#ifdef CONFIG_HOTPLUG_CPU
+static void unmap_cpu_from_node(unsigned long cpu)
+{
+	int node = numa_cpu_lookup_table[cpu];
+
+	dbg("removing cpu %lu from node %d\n", cpu, node);
+
+	if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) {
+		cpu_clear(cpu, numa_cpumask_lookup_table[node]);
+		nr_cpus_in_node[node]--;
+	} else {
+		printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
+		       cpu, node);
+	}
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+static struct device_node * __devinit find_cpu_node(unsigned int cpu)
 {
 	unsigned int hw_cpuid = get_hard_smp_processor_id(cpu);
 	struct device_node *cpu_node = NULL;
@@ -96,26 +116,21 @@ static struct device_node * __init find_
 
 /* must hold reference to node during call */
 static int *of_get_associativity(struct device_node *dev)
- {
-	unsigned int *result;
-	int len;
-
-	result = (unsigned int *)get_property(dev, "ibm,associativity", &len);
-
-	if (len <= 0)
-		return NULL;
-
-	return result;
+{
+	return (unsigned int *)get_property(dev, "ibm,associativity", NULL);
 }
 
-static int of_node_numa_domain(struct device_node *device, int depth)
+static int of_node_numa_domain(struct device_node *device)
 {
 	int numa_domain;
 	unsigned int *tmp;
 
+	if (min_common_depth == -1)
+		return 0;
+
 	tmp = of_get_associativity(device);
-	if (tmp && (tmp[0] >= depth)) {
-		numa_domain = tmp[depth];
+	if (tmp && (tmp[0] >= min_common_depth)) {
+		numa_domain = tmp[min_common_depth];
 	} else {
 		dbg("WARNING: no NUMA information for %s\n",
 		    device->full_name);
@@ -138,7 +153,7 @@ static int of_node_numa_domain(struct de
  *
  * - Dave Hansen <haveblue at us.ibm.com>
  */
-static int find_min_common_depth(void)
+static int __init find_min_common_depth(void)
 {
 	int depth;
 	unsigned int *ref_points;
@@ -185,11 +200,72 @@ static unsigned long read_cell_ul(struct
 	return result;
 }
 
+/*
+ * Figure out to which domain a cpu belongs and stick it there.
+ * Return the id of the domain used.
+ */
+static int numa_setup_cpu(unsigned long lcpu)
+{
+	int numa_domain = 0;
+	struct device_node *cpu = find_cpu_node(lcpu);
+
+	if (!cpu) {
+		WARN_ON(1);
+		goto out;
+	}
+
+	numa_domain = of_node_numa_domain(cpu);
+
+	if (numa_domain >= MAX_NUMNODES) {
+		/*
+		 * POWER4 LPAR uses 0xffff as invalid node,
+		 * dont warn in this case.
+		 */
+		if (numa_domain != 0xffff)
+			printk(KERN_ERR "WARNING: cpu %ld "
+			       "maps to invalid NUMA node %d\n",
+			       lcpu, numa_domain);
+		numa_domain = 0;
+	}
+out:
+	node_set_online(numa_domain);
+
+	map_cpu_to_node(lcpu, numa_domain);
+
+	of_node_put(cpu);
+
+	return numa_domain;
+}
+
+static int cpu_numa_callback(struct notifier_block *nfb,
+			     unsigned long action,
+			     void *hcpu)
+{
+	unsigned long lcpu = (unsigned long)hcpu;
+	int ret = NOTIFY_DONE;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		if (min_common_depth == -1 || !numa_enabled)
+			map_cpu_to_node(lcpu, 0);
+		else
+			numa_setup_cpu(lcpu);
+		ret = NOTIFY_OK;
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		unmap_cpu_from_node(lcpu);
+		ret = NOTIFY_OK;
+		break;
+#endif
+	}
+	return ret;
+}
+
 static int __init parse_numa_properties(void)
 {
-	struct device_node *cpu = NULL;
 	struct device_node *memory = NULL;
-	int depth;
 	int max_domain = 0;
 	long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT;
 	unsigned long i;
@@ -206,44 +282,13 @@ static int __init parse_numa_properties(
 	for (i = 0; i < entries ; i++)
 		numa_memory_lookup_table[i] = ARRAY_INITIALISER;
 
-	depth = find_min_common_depth();
-
-	dbg("NUMA associativity depth for CPU/Memory: %d\n", depth);
-	if (depth < 0)
-		return depth;
-
-	for_each_cpu(i) {
-		int numa_domain;
-
-		cpu = find_cpu_node(i);
-
-		if (cpu) {
-			numa_domain = of_node_numa_domain(cpu, depth);
-			of_node_put(cpu);
-
-			if (numa_domain >= MAX_NUMNODES) {
-				/*
-			 	 * POWER4 LPAR uses 0xffff as invalid node,
-				 * dont warn in this case.
-			 	 */
-				if (numa_domain != 0xffff)
-					printk(KERN_ERR "WARNING: cpu %ld "
-					       "maps to invalid NUMA node %d\n",
-					       i, numa_domain);
-				numa_domain = 0;
-			}
-		} else {
-			dbg("WARNING: no NUMA information for cpu %ld\n", i);
-			numa_domain = 0;
-		}
-
-		node_set_online(numa_domain);
+	min_common_depth = find_min_common_depth();
 
-		if (max_domain < numa_domain)
-			max_domain = numa_domain;
+	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+	if (min_common_depth < 0)
+		return min_common_depth;
 
-		map_cpu_to_node(i, numa_domain);
-	}
+	max_domain = numa_setup_cpu(boot_cpuid);
 
 	memory = NULL;
 	while ((memory = of_find_node_by_type(memory, "memory")) != NULL) {
@@ -267,7 +312,7 @@ new_range:
 		start = _ALIGN_DOWN(start, MEMORY_INCREMENT);
 		size = _ALIGN_UP(size, MEMORY_INCREMENT);
 
-		numa_domain = of_node_numa_domain(memory, depth);
+		numa_domain = of_node_numa_domain(memory);
 
 		if (numa_domain >= MAX_NUMNODES) {
 			if (numa_domain != 0xffff)
@@ -341,8 +386,7 @@ static void __init setup_nonnuma(void)
 			numa_memory_lookup_table[i] = ARRAY_INITIALISER;
 	}
 
-	for (i = 0; i < NR_CPUS; i++)
-		map_cpu_to_node(i, 0);
+	map_cpu_to_node(boot_cpuid, 0);
 
 	node_set_online(0);
 
@@ -358,35 +402,10 @@ static void __init setup_nonnuma(void)
 static void __init dump_numa_topology(void)
 {
 	unsigned int node;
-	unsigned int cpu, count;
+	unsigned int count;
 
-	for (node = 0; node < MAX_NUMNODES; node++) {
-		if (!node_online(node))
-			continue;
-
-		printk(KERN_INFO "Node %d CPUs:", node);
-
-		count = 0;
-		/*
-		 * If we used a CPU iterator here we would miss printing
-		 * the holes in the cpumap.
-		 */
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) {
-				if (count == 0)
-					printk(" %u", cpu);
-				++count;
-			} else {
-				if (count > 1)
-					printk("-%u", cpu - 1);
-				count = 0;
-			}
-		}
-
-		if (count > 1)
-			printk("-%u", NR_CPUS - 1);
-		printk("\n");
-	}
+	if (min_common_depth == -1 || !numa_enabled)
+		return;
 
 	for (node = 0; node < MAX_NUMNODES; node++) {
 		unsigned long i;
@@ -414,6 +433,7 @@ static void __init dump_numa_topology(vo
 			printk("-0x%lx", i);
 		printk("\n");
 	}
+	return;
 }
 
 /*
@@ -469,6 +489,10 @@ void __init do_init_bootmem(void)
 		setup_nonnuma();
 	else
 		dump_numa_topology();
+	/*
+	 * This must run before the sched domains notifier.
+	 */
+	hotcpu_notifier(cpu_numa_callback, 1);
 
 	for (nid = 0; nid < numnodes; nid++) {
 		unsigned long start_paddr, end_paddr;

_
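
For anyone wanting to experiment with the bookkeeping outside the kernel, here is a user-space sketch of the map/unmap logic driven by a notifier-style callback. The names, the fixed table sizes, and the body of sk_map_cpu_to_node() are assumptions made for the example; in the patch the node comes from the device tree rather than from a parameter. The sketch sets the return value before each break.

```c
#include <stdio.h>

#define SK_MAX_NODES 4
#define SK_MAX_CPUS  32

static unsigned int sk_node_cpumask[SK_MAX_NODES];	/* one bit per cpu */
static int sk_cpu_to_node[SK_MAX_CPUS];
static int sk_cpus_in_node[SK_MAX_NODES];

enum sk_action { SK_UP_PREPARE, SK_DEAD, SK_UP_CANCELED, SK_ONLINE };
enum { SK_NOTIFY_DONE, SK_NOTIFY_OK };

static void sk_map_cpu_to_node(int cpu, int node)
{
	sk_cpu_to_node[cpu] = node;
	if (!(sk_node_cpumask[node] & (1u << cpu))) {
		sk_node_cpumask[node] |= 1u << cpu;
		sk_cpus_in_node[node]++;
	}
}

static void sk_unmap_cpu_from_node(int cpu)
{
	int node = sk_cpu_to_node[cpu];

	if (sk_node_cpumask[node] & (1u << cpu)) {
		sk_node_cpumask[node] &= ~(1u << cpu);
		sk_cpus_in_node[node]--;
	} else {
		fprintf(stderr, "cpu %d not found in node %d\n", cpu, node);
	}
}

/* node is passed in directly; the patch derives it from the device tree. */
int sk_cpu_numa_callback(enum sk_action action, int cpu, int node)
{
	int ret = SK_NOTIFY_DONE;

	switch (action) {
	case SK_UP_PREPARE:
		sk_map_cpu_to_node(cpu, node);
		ret = SK_NOTIFY_OK;
		break;
	case SK_DEAD:
	case SK_UP_CANCELED:
		sk_unmap_cpu_from_node(cpu);
		ret = SK_NOTIFY_OK;
		break;
	default:
		break;	/* events this notifier does not care about */
	}
	return ret;
}
```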


From johnrose at austin.ibm.com  Sat Oct 23 04:15:39 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Fri, 22 Oct 2004 13:15:39 -0500
Subject: [PATCH] create iommu_free_table()
In-Reply-To: <4176E621.3040607@austin.ibm.com>
References: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
	<4176E621.3040607@austin.ibm.com>
Message-ID: <1098468939.31847.14.camel@sinatra.austin.ibm.com>

Thanks for the comments and help... responses below.

On Wed, 2004-10-20 at 17:26, Olof Johansson wrote:
<snip>

> > +	for (i = 0; i < (tbl->it_mapsize/64); i++) {
> > +		if (tbl->it_map[i] != 0) {
> > +			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
> > +				__FUNCTION__, dn->full_name);
> > +			break;
> > +		}
> 
> Could this get spammy? It could be nice to see a WARN_ON(1) too, so the 
> call stack is dumped. If that's added, a printk_ratelimit() would 
> definitely be warranted around both the printk and the WARN_ON().

I'd have to disagree here.  Since the stack trace will always involve
the removal of a device node prompted by a write to /proc, it doesn't
reveal any useful info.  The printk above includes the OF path of the
device, so any offending driver can be tracked down.  

Here's a patch without the whitespace problems you pointed out.
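
For reference, the bitmap arithmetic in the patch can be sketched in
userspace roughly as below.  This is only an illustration under stated
assumptions: bitmap_bytes() and bitmap_has_entries() are hypothetical
names (not kernel functions), and uint64_t stands in for the unsigned
long words of it_map on ppc64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for a bitmap with one bit per entry. */
size_t bitmap_bytes(size_t nr_entries)
{
	return (nr_entries + 7) / 8;
}

/*
 * Hypothetical helper: returns 1 if any bit is set in the first
 * nr_entries / 64 words of the map, 0 otherwise.  Like the loop in the
 * patch, this looks at whole 64-bit words only, so a trailing partial
 * word (nr_entries not a multiple of 64) is not examined.
 */
int bitmap_has_entries(const uint64_t *map, size_t nr_entries)
{
	size_t i;

	assert(map != NULL);

	for (i = 0; i < nr_entries / 64; i++) {
		if (map[i] != 0)
			return 1;
	}
	return 0;
}
```

The early return mirrors the break in the patch: one warning per table
is enough to point at the offending device.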

Thanks-
John

Signed-off-by: John Rose <johnrose at austin.ibm.com>

diff -Nru a/arch/ppc64/kernel/pSeries_iommu.c b/arch/ppc64/kernel/pSeries_iommu.c
--- a/arch/ppc64/kernel/pSeries_iommu.c	Fri Oct 22 13:03:21 2004
+++ b/arch/ppc64/kernel/pSeries_iommu.c	Fri Oct 22 13:03:21 2004
@@ -412,6 +412,38 @@
 	dn->iommu_table = iommu_init_table(tbl);
 }
 
+void iommu_free_table(struct device_node *dn)
+{
+	struct iommu_table *tbl = dn->iommu_table;
+	unsigned long bitmap_sz, i;
+	unsigned int order;
+
+	if (!tbl || !tbl->it_map) {
+		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
+				dn->full_name);
+		return;
+	}
+
+	/* verify that table contains no entries */
+	/* it_mapsize is in entries, and we're examining 64 at a time */
+	for (i = 0; i < (tbl->it_mapsize/64); i++) {
+		if (tbl->it_map[i] != 0) {
+			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
+				__FUNCTION__, dn->full_name);
+			break;
+		}
+	}
+
+	/* calculate bitmap size in bytes */
+	bitmap_sz = (tbl->it_mapsize + 7) / 8;
+
+	/* free bitmap */
+	order = get_order(bitmap_sz);
+	free_pages((unsigned long) tbl->it_map, order);
+
+	/* free table */
+	kfree(tbl);
+}
 
 void iommu_setup_pSeries(void)
 {
diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c
--- a/arch/ppc64/kernel/prom.c	Fri Oct 22 13:03:21 2004
+++ b/arch/ppc64/kernel/prom.c	Fri Oct 22 13:03:21 2004
@@ -1818,6 +1818,9 @@
 		return -EBUSY;
 	}
 
+	if (np->iommu_table)
+		iommu_free_table(np);
+
 	write_lock(&devtree_lock);
 	OF_MARK_STALE(np);
 	remove_node_proc_entries(np);
diff -Nru a/include/asm-ppc64/iommu.h b/include/asm-ppc64/iommu.h
--- a/include/asm-ppc64/iommu.h	Fri Oct 22 13:03:21 2004
+++ b/include/asm-ppc64/iommu.h	Fri Oct 22 13:03:21 2004
@@ -113,6 +113,9 @@
 /* Creates table for an individual device node */
 extern void iommu_devnode_init(struct device_node *dn);
 
+/* Frees table for an individual device node */
+extern void iommu_free_table(struct device_node *dn);
+
 #endif /* CONFIG_PPC_MULTIPLATFORM */
 
 #ifdef CONFIG_PPC_ISERIES


From brking at us.ibm.com  Sat Oct 23 06:27:59 2004
From: brking at us.ibm.com (brking at us.ibm.com)
Date: Fri, 22 Oct 2004 15:27:59 -0500
Subject: [PATCH 2/2] ipr_block_config_io_during_bist
Message-ID: <200410222028.i9MKRxvC024092@d03av02.boulder.ibm.com>


Change ipr to use new ppc64 pci APIs to block PCI config space
accesses when running BIST to prevent PCI master aborts.

Signed-off-by: Brian King <brking at us.ibm.com>
---

 linux-2.6.9-bk7-bjking1/drivers/scsi/ipr.c |    5 ++++-
 linux-2.6.9-bk7-bjking1/drivers/scsi/ipr.h |    7 +++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff -puN drivers/scsi/ipr.c~ipr_block_config_io_during_bist drivers/scsi/ipr.c
--- linux-2.6.9-bk7/drivers/scsi/ipr.c~ipr_block_config_io_during_bist	2004-10-22 15:25:07.000000000 -0500
+++ linux-2.6.9-bk7-bjking1/drivers/scsi/ipr.c	2004-10-22 15:25:07.000000000 -0500
@@ -4935,6 +4935,7 @@ static int ipr_reset_restore_cfg_space(s
 	int rc;
 
 	ENTER;
+	pci_unblock_config_io(ioa_cfg->pdev);
 	rc = pci_restore_state(ioa_cfg->pdev);
 
 	if (rc != PCIBIOS_SUCCESSFUL) {
@@ -4989,9 +4990,11 @@ static int ipr_reset_start_bist(struct i
 	int rc;
 
 	ENTER;
-	rc = pci_write_config_byte(ioa_cfg->pdev, PCI_BIST, PCI_BIST_START);
+	pci_block_config_io(ioa_cfg->pdev);
+	rc = pci_start_bist(ioa_cfg->pdev);
 
 	if (rc != PCIBIOS_SUCCESSFUL) {
+		pci_unblock_config_io(ioa_cfg->pdev);
 		ipr_cmd->ioasa.ioasc = cpu_to_be32(IPR_IOASC_PCI_ACCESS_ERROR);
 		rc = IPR_RC_JOB_CONTINUE;
 	} else {
diff -puN drivers/scsi/ipr.h~ipr_block_config_io_during_bist drivers/scsi/ipr.h
--- linux-2.6.9-bk7/drivers/scsi/ipr.h~ipr_block_config_io_during_bist	2004-10-22 15:25:07.000000000 -0500
+++ linux-2.6.9-bk7-bjking1/drivers/scsi/ipr.h	2004-10-22 15:25:07.000000000 -0500
@@ -1112,6 +1112,13 @@ __FUNCTION__, __LINE__, ioa_cfg
 #define ipr_remove_dump_file(kobj, attr) do { } while(0)
 #endif
 
+#if !defined(CONFIG_PPC_PSERIES) && !defined(CONFIG_PPC_ISERIES)
+#define pci_block_config_io(dev) do { } while(0)
+#define pci_unblock_config_io(dev) do { } while(0)
+#define pci_start_bist(dev) \
+        pci_write_config_byte(dev, PCI_BIST, PCI_BIST_START)
+#endif
+
 /*
  * Error logging macros
  */
_


From brking at us.ibm.com  Sat Oct 23 06:27:51 2004
From: brking at us.ibm.com (brking at us.ibm.com)
Date: Fri, 22 Oct 2004 15:27:51 -0500
Subject: [PATCH 1/2] ppc64: Block config accesses during BIST (revised)
Message-ID: <200410222027.i9MKRroN023754@d03av02.boulder.ibm.com>


Some PCI adapters on pSeries and iSeries hardware (ipr scsi adapters)
have an exposure today in that they issue BIST to the adapter to reset
the card. If, during the time it takes to complete BIST, userspace attempts
to access PCI config space, the host bus bridge will master abort the access
since the ipr adapter does not respond on the PCI bus for a brief period of
time when running BIST. This master abort results in the host PCI bridge
isolating that PCI device from the rest of the system, making the device
unusable until Linux is rebooted. This patch is an attempt to close that
exposure by introducing some blocking code in the arch specific PCI code.
The intent is to have the ipr device driver invoke these routines to
prevent userspace PCI accesses from occurring during this window.
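
The blocking behaviour described above can be sketched in userspace as
follows.  This is a minimal illustration, not the kernel code: struct
fake_dev, CFG_WORDS and the function names are hypothetical, and a C11
atomic_flag spin loop stands in for the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CFG_WORDS 16	/* hypothetical size of the pretend config space */

/* Hypothetical stand-in for a PCI device; not a kernel structure. */
struct fake_dev {
	bool cfg_blocked;		/* set while BIST is running */
	uint32_t cfg_space[CFG_WORDS];	/* pretend config registers */
};

/* One global lock guards the flag and the accesses, as in the patch. */
static atomic_flag config_lock = ATOMIC_FLAG_INIT;

static void lock_config(void)
{
	while (atomic_flag_test_and_set(&config_lock))
		;	/* spin until the lock is free */
}

static void unlock_config(void)
{
	atomic_flag_clear(&config_lock);
}

void block_config_io(struct fake_dev *dev)
{
	lock_config();
	dev->cfg_blocked = true;
	unlock_config();
}

void unblock_config_io(struct fake_dev *dev)
{
	lock_config();
	dev->cfg_blocked = false;
	unlock_config();
}

/* While blocked, reads succeed but return all 1's. */
int read_config(struct fake_dev *dev, unsigned int reg, uint32_t *val)
{
	assert(reg < CFG_WORDS);

	lock_config();
	if (dev->cfg_blocked)
		*val = 0xffffffffu;
	else
		*val = dev->cfg_space[reg];
	unlock_config();
	return 0;
}

/* While blocked, writes are dropped but still reported as successful. */
int write_config(struct fake_dev *dev, unsigned int reg, uint32_t val)
{
	assert(reg < CFG_WORDS);

	lock_config();
	if (!dev->cfg_blocked)
		dev->cfg_space[reg] = val;
	unlock_config();
	return 0;
}
```

Checking the flag under the same lock that guards the access means an
access cannot slip in between the block call and the start of BIST.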

It has been tested by running BIST on an ipr adapter while running a
script which looped reading the config space of that adapter through sysfs.
Without the patch, an EEH error occurs. With the patch there is no EEH
error. Tested on Power 5 and iSeries Power 4.

Signed-off-by: Brian King <brking at us.ibm.com>
---

 linux-2.6.9-bk7-bjking1/arch/ppc64/kernel/iSeries_pci.c         |  128 +++++++++-
 linux-2.6.9-bk7-bjking1/arch/ppc64/kernel/pSeries_pci.c         |  103 +++++++-
 linux-2.6.9-bk7-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h |    1 
 linux-2.6.9-bk7-bjking1/include/asm-ppc64/pci.h                 |    6 
 linux-2.6.9-bk7-bjking1/include/asm-ppc64/prom.h                |    4 
 5 files changed, 226 insertions(+), 16 deletions(-)

diff -puN include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/prom.h
--- linux-2.6.9-bk7/include/asm-ppc64/prom.h~ppc64_block_cfg_io_during_bist	2004-10-22 10:13:40.000000000 -0500
+++ linux-2.6.9-bk7-bjking1/include/asm-ppc64/prom.h	2004-10-22 10:13:40.000000000 -0500
@@ -210,11 +210,15 @@ extern struct device_node *of_chosen;
 /* flag descriptions */
 #define OF_STALE   0 /* node is slated for deletion */
 #define OF_DYNAMIC 1 /* node and properties were allocated via kmalloc */
+#define OF_NO_CFGIO 2 /* config space accesses should fail */
 
 #define OF_IS_STALE(x) test_bit(OF_STALE, &x->_flags)
 #define OF_MARK_STALE(x) set_bit(OF_STALE, &x->_flags)
 #define OF_IS_DYNAMIC(x) test_bit(OF_DYNAMIC, &x->_flags)
 #define OF_MARK_DYNAMIC(x) set_bit(OF_DYNAMIC, &x->_flags)
+#define OF_IS_CFGIO_BLOCKED(x) test_bit(OF_NO_CFGIO, &x->_flags)
+#define OF_UNBLOCK_CFGIO(x) clear_bit(OF_NO_CFGIO, &x->_flags)
+#define OF_BLOCK_CFGIO(x) set_bit(OF_NO_CFGIO, &x->_flags)
 
 /*
  * Until 32-bit ppc can add proc_dir_entries to its device_node
diff -puN arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/pSeries_pci.c
--- linux-2.6.9-bk7/arch/ppc64/kernel/pSeries_pci.c~ppc64_block_cfg_io_during_bist	2004-10-22 10:13:40.000000000 -0500
+++ linux-2.6.9-bk7-bjking1/arch/ppc64/kernel/pSeries_pci.c	2004-10-22 10:13:40.000000000 -0500
@@ -30,6 +30,7 @@
 #include <linux/string.h>
 #include <linux/init.h>
 #include <linux/bootmem.h>
+#include <linux/spinlock.h>
 
 #include <asm/io.h>
 #include <asm/pgtable.h>
@@ -52,17 +53,16 @@ static int ibm_read_pci_config;
 static int ibm_write_pci_config;
 
 static int s7a_workaround;
+static spinlock_t config_lock = SPIN_LOCK_UNLOCKED;
 
 extern unsigned long pci_probe_only;
 
-static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val)
+static int __rtas_read_config(struct device_node *dn, int where, int size, u32 *val)
 {
 	int returnval = -1;
 	unsigned long buid, addr;
 	int ret;
 
-	if (!dn)
-		return PCIBIOS_DEVICE_NOT_FOUND;
 	if (where & (size - 1))
 		return PCIBIOS_BAD_REGISTER_NUMBER;
 
@@ -86,6 +86,23 @@ static int rtas_read_config(struct devic
 	return PCIBIOS_SUCCESSFUL;
 }
 
+static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	if (!dn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	spin_lock_irqsave(&config_lock, flags);
+	if (OF_IS_CFGIO_BLOCKED(dn))
+		*val = -1;
+	else
+		ret = __rtas_read_config(dn, where, size, val);
+	spin_unlock_irqrestore(&config_lock, flags);
+	return ret;
+}
+
 static int rtas_pci_read_config(struct pci_bus *bus,
 				unsigned int devfn,
 				int where, int size, u32 *val)
@@ -104,13 +121,11 @@ static int rtas_pci_read_config(struct p
 	return PCIBIOS_DEVICE_NOT_FOUND;
 }
 
-static int rtas_write_config(struct device_node *dn, int where, int size, u32 val)
+static int __rtas_write_config(struct device_node *dn, int where, int size, u32 val)
 {
 	unsigned long buid, addr;
 	int ret;
 
-	if (!dn)
-		return PCIBIOS_DEVICE_NOT_FOUND;
 	if (where & (size - 1))
 		return PCIBIOS_BAD_REGISTER_NUMBER;
 
@@ -128,6 +143,21 @@ static int rtas_write_config(struct devi
 	return PCIBIOS_SUCCESSFUL;
 }
 
+static int rtas_write_config(struct device_node *dn, int where, int size, u32 val)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	if (!dn)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	spin_lock_irqsave(&config_lock, flags);
+	if (!OF_IS_CFGIO_BLOCKED(dn))
+		ret = __rtas_write_config(dn, where, size, val);
+	spin_unlock_irqrestore(&config_lock, flags);
+	return ret;
+}
+
 static int rtas_pci_write_config(struct pci_bus *bus,
 				 unsigned int devfn,
 				 int where, int size, u32 val)
@@ -151,6 +181,67 @@ struct pci_ops rtas_pci_ops = {
 	rtas_pci_write_config
 };
 
+/**
+ * pci_block_config_io - Block PCI config reads/writes
+ * @pdev:	pci device struct
+ *
+ * This function blocks any PCI config accesses from occurring.
+ * Device drivers may call this prior to running BIST if the
+ * adapter cannot handle PCI config reads or writes when
+ * running BIST. When blocked, any writes will be ignored and
+ * treated as successful and any reads will return all 1's data.
+ *
+ * Return value:
+ * 	nothing
+ **/
+void pci_block_config_io(struct pci_dev *pdev)
+{
+	struct device_node *dn = pci_device_to_OF_node(pdev);
+	unsigned long flags;
+
+	spin_lock_irqsave(&config_lock, flags);
+	OF_BLOCK_CFGIO(dn);
+	spin_unlock_irqrestore(&config_lock, flags);
+}
+EXPORT_SYMBOL(pci_block_config_io);
+
+/**
+ * pci_unblock_config_io - Unblock PCI config reads/writes
+ * @pdev:	pci device struct
+ *
+ * This function allows PCI config accesses to resume.
+ *
+ * Return value:
+ * 	nothing
+ **/
+void pci_unblock_config_io(struct pci_dev *pdev)
+{
+	struct device_node *dn = pci_device_to_OF_node(pdev);
+	unsigned long flags;
+
+	spin_lock_irqsave(&config_lock, flags);
+	OF_UNBLOCK_CFGIO(dn);
+	spin_unlock_irqrestore(&config_lock, flags);
+}
+EXPORT_SYMBOL(pci_unblock_config_io);
+
+/**
+ * pci_start_bist - Start BIST on a PCI device
+ * @pdev:	pci device struct
+ *
+ * This function allows a device driver to start BIST
+ * when PCI config accesses are disabled.
+ *
+ * Return value:
+ * 	result of the config write (PCIBIOS_SUCCESSFUL on success)
+ **/
+int pci_start_bist(struct pci_dev *pdev)
+{
+	struct device_node *dn = pci_device_to_OF_node(pdev);
+	return __rtas_write_config(dn, PCI_BIST, 1, PCI_BIST_START);
+}
+EXPORT_SYMBOL(pci_start_bist);
+
 static void python_countermeasures(unsigned long addr)
 {
 	void *chip_regs;
diff -puN include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/pci.h
--- linux-2.6.9-bk7/include/asm-ppc64/pci.h~ppc64_block_cfg_io_during_bist	2004-10-22 10:13:40.000000000 -0500
+++ linux-2.6.9-bk7-bjking1/include/asm-ppc64/pci.h	2004-10-22 10:13:40.000000000 -0500
@@ -235,6 +235,12 @@ extern int pci_read_irq_line(struct pci_
 
 extern void pcibios_add_platform_entries(struct pci_dev *dev);
 
+extern void pci_block_config_io(struct pci_dev *dev);
+
+extern void pci_unblock_config_io(struct pci_dev *dev);
+
+extern int pci_start_bist(struct pci_dev *dev);
+
 #endif	/* __KERNEL__ */
 
 #endif /* __PPC64_PCI_H */
diff -puN include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist include/asm-ppc64/iSeries/iSeries_pci.h
--- linux-2.6.9-bk7/include/asm-ppc64/iSeries/iSeries_pci.h~ppc64_block_cfg_io_during_bist	2004-10-22 10:13:40.000000000 -0500
+++ linux-2.6.9-bk7-bjking1/include/asm-ppc64/iSeries/iSeries_pci.h	2004-10-22 10:13:40.000000000 -0500
@@ -91,6 +91,7 @@ struct iSeries_Device_Node {
 	int             ReturnCode;	/* Return Code Holder          */
 	int             IoRetry;        /* Current Retry Count         */
 	int             Flags;          /* Possible flags(disable/bist)*/
+#define ISERIES_CFGIO_BLOCKED	1
 	u16             Vendor;         /* Vendor ID                   */
 	u8              LogicalSlot;    /* Hv Slot Index for Tces      */
 	struct iommu_table* iommu_table;/* Device TCE Table            */ 
diff -puN arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist arch/ppc64/kernel/iSeries_pci.c
--- linux-2.6.9-bk7/arch/ppc64/kernel/iSeries_pci.c~ppc64_block_cfg_io_during_bist	2004-10-22 10:13:40.000000000 -0500
+++ linux-2.6.9-bk7-bjking1/arch/ppc64/kernel/iSeries_pci.c	2004-10-22 10:13:40.000000000 -0500
@@ -29,6 +29,7 @@
 #include <linux/module.h>
 #include <linux/ide.h>
 #include <linux/pci.h>
+#include <linux/spinlock.h>
 
 #include <asm/io.h>
 #include <asm/irq.h>
@@ -86,6 +87,7 @@ static int Pci_Retry_Max = 3;	/* Only re
 static int Pci_Error_Flag = 1;	/* Set Retry Error on. */
 
 static struct pci_ops iSeries_pci_ops;
+static spinlock_t config_lock = SPIN_LOCK_UNLOCKED;
 
 /*
  * Log Error infor in Flight Recorder to system Console.
@@ -510,16 +512,12 @@ static u64 hv_cfg_write_func[4] = {
 /*
  * Read PCI config space
  */
-static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn,
+static int __iSeries_pci_read_config(struct iSeries_Device_Node *node,
 		int offset, int size, u32 *val)
 {
-	struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn);
 	u64 fn;
 	struct HvCallPci_LoadReturn ret;
 
-	if (node == NULL)
-		return PCIBIOS_DEVICE_NOT_FOUND;
-
 	fn = hv_cfg_read_func[(size - 1) & 3];
 	HvCall3Ret16(fn, &ret, node->DsaAddr.DsaAddr, offset, 0);
 
@@ -532,20 +530,36 @@ static int iSeries_pci_read_config(struc
 	return 0;
 }
 
+static int iSeries_pci_read_config(struct pci_bus *bus, unsigned int devfn,
+		int offset, int size, u32 *val)
+{
+	struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn);
+	int ret = PCIBIOS_DEVICE_NOT_FOUND;
+	unsigned long flags;
+
+	if (node) {
+		ret = 0;
+		spin_lock_irqsave(&config_lock, flags);
+		if (node->Flags & ISERIES_CFGIO_BLOCKED)
+			*val = -1;
+		else
+			ret = __iSeries_pci_read_config(node, offset, size, val);
+		spin_unlock_irqrestore(&config_lock, flags);
+	}
+
+	return ret;
+}
+
 /*
  * Write PCI config space
  */
 
-static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn,
+static int __iSeries_pci_write_config(struct iSeries_Device_Node *node,
 		int offset, int size, u32 val)
 {
-	struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn);
 	u64 fn;
 	u64 ret;
 
-	if (node == NULL)
-		return PCIBIOS_DEVICE_NOT_FOUND;
-
 	fn = hv_cfg_write_func[(size - 1) & 3];
 	ret = HvCall4(fn, node->DsaAddr.DsaAddr, offset, val, 0);
 
@@ -555,6 +569,23 @@ static int iSeries_pci_write_config(stru
 	return 0;
 }
 
+static int iSeries_pci_write_config(struct pci_bus *bus, unsigned int devfn,
+		int offset, int size, u32 val)
+{
+	struct iSeries_Device_Node *node = find_Device_Node(bus->number, devfn);
+	int ret = PCIBIOS_DEVICE_NOT_FOUND;
+	unsigned long flags;
+
+	if (node) {
+		spin_lock_irqsave(&config_lock, flags);
+		if (!(node->Flags & ISERIES_CFGIO_BLOCKED))
+			ret = __iSeries_pci_write_config(node, offset, size, val);
+		spin_unlock_irqrestore(&config_lock, flags);
+	}
+
+	return ret;
+}
+
 static struct pci_ops iSeries_pci_ops = {
 	.read = iSeries_pci_read_config,
 	.write = iSeries_pci_write_config
@@ -817,3 +848,80 @@ void iSeries_Write_Long(u32 data, volati
 	} while (CheckReturnCode("WWL", DevNode, rc) != 0);
 }
 EXPORT_SYMBOL(iSeries_Write_Long);
+
+/**
+ * pci_block_config_io - Block PCI config reads/writes
+ * @pdev:	pci device struct
+ *
+ * This function blocks any PCI config accesses from occurring.
+ * Device drivers may call this prior to running BIST if the
+ * adapter cannot handle PCI config reads or writes when
+ * running BIST. When blocked, any writes will be ignored and
+ * treated as successful and any reads will return all 1's data.
+ *
+ * Return value:
+ * 	nothing
+ **/
+void pci_block_config_io(struct pci_dev *pdev)
+{
+	struct iSeries_Device_Node *node;
+	unsigned long flags;
+
+	node = find_Device_Node(pdev->bus->number, pdev->devfn);
+
+	if (node == NULL)
+		return;
+
+	spin_lock_irqsave(&config_lock, flags);
+	node->Flags |= ISERIES_CFGIO_BLOCKED;
+	spin_unlock_irqrestore(&config_lock, flags);
+}
+EXPORT_SYMBOL(pci_block_config_io);
+
+/**
+ * pci_unblock_config_io - Unblock PCI config reads/writes
+ * @pdev:	pci device struct
+ *
+ * This function allows PCI config accesses to resume.
+ *
+ * Return value:
+ * 	nothing
+ **/
+void pci_unblock_config_io(struct pci_dev *pdev)
+{
+	struct iSeries_Device_Node *node;
+	unsigned long flags;
+
+	node = find_Device_Node(pdev->bus->number, pdev->devfn);
+
+	if (node == NULL)
+		return;
+
+	spin_lock_irqsave(&config_lock, flags);
+	node->Flags &= ~ISERIES_CFGIO_BLOCKED;
+	spin_unlock_irqrestore(&config_lock, flags);
+}
+EXPORT_SYMBOL(pci_unblock_config_io);
+
+/**
+ * pci_start_bist - Start BIST on a PCI device
+ * @pdev:	pci device struct
+ *
+ * This function allows a device driver to start BIST
+ * when PCI config accesses are disabled.
+ *
+ * Return value:
+ * 	PCIBIOS_DEVICE_NOT_FOUND if no device node, else the config write result
+ **/
+int pci_start_bist(struct pci_dev *pdev)
+{
+	struct iSeries_Device_Node *node;
+
+	node = find_Device_Node(pdev->bus->number, pdev->devfn);
+
+	if (node == NULL)
+		return PCIBIOS_DEVICE_NOT_FOUND;
+
+	return __iSeries_pci_write_config(node, PCI_BIST, 1, PCI_BIST_START);
+}
+EXPORT_SYMBOL(pci_start_bist);
_


From paulus at samba.org  Sat Oct 23 18:19:00 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 23 Oct 2004 18:19:00 +1000
Subject: [PATCH] Kprobes for ppc64
In-Reply-To: <20041018095229.GA7394@in.ibm.com>
References: <20041018095229.GA7394@in.ibm.com>
Message-ID: <16762.5108.282382.603502@cargo.ozlabs.ibm.com>

Ananth N Mavinakayanahalli writes:

> Here is kprobes for ppc64. The patch applies on 2.6.9-rc4/2.6.9-final
> and provides the kprobes + jprobes functionality.

> 1. The current implementation uses xmon's emulate_step() and hence 
>    requires xmon to be compiled in. 

We can move emulate_step out to arch/ppc64/lib/step.c (and take out
the printfs).

> 2. arch_prepare_kprobe() now returns an int. I have made the necessary
>    changes to i386 and sparc64 kprobes files, but they are untested.

Are you going to send this upstream?

> + * Interrupts are disabled on entry as trap3 is an interrupt gate and they
> + * remain disabled throughout this function.
> + */
> +static inline int kprobe_handler(struct pt_regs *regs)

Comments about "trap3" and "interrupt gate" don't help me understand
this function on ppc64. :)  At present interrupts are enabled in a
program check exception handler but disabled in a single-step handler.
When does this function get called?

> @@ -96,6 +97,9 @@ int do_page_fault(struct pt_regs *regs, 
>  	BUG_ON((trap == 0x380) || (trap == 0x480));
>  
>  	if (trap == 0x300) {
> +		if (notify_die(DIE_PAGE_FAULT, "page_fault", regs, error_code,
> +					11, SIGSEGV) == NOTIFY_STOP)
> +			return 0;

Hmmm, this seems a bit heavyweight for adding to the page fault path.
Have you done any benchmarks with vs. without kprobes?

On the whole the patch looks OK.  I haven't checked the kprobe_handler
code to see if I think it's all SMP- and preempt-safe, but I assume
you have done it similarly on x86 and checked it there.

Paul.


From paulus at samba.org  Sat Oct 23 18:20:37 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 23 Oct 2004 18:20:37 +1000
Subject: [PPC64] xmon sparse cleanups
In-Reply-To: <20041021033617.GK17760@zax>
References: <20041021033617.GK17760@zax>
Message-ID: <16762.5205.563634.564951@cargo.ozlabs.ibm.com>

David Gibson writes:

> This patch removes many sparse warnings from the xmon code.  Mostly
> K&R function declarations and 0-instead-of-NULLs.  There are still a
> whole bunch of warnings in xmon/ppc-opc.c, which is a copy of a file
> from binutils.
> 
> Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Acked-by: Paul Mackerras <paulus at samba.org>


From paulus at samba.org  Sat Oct 23 18:22:29 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 23 Oct 2004 18:22:29 +1000
Subject: [PPC64] Trivial sparse cleanups
In-Reply-To: <20041021013549.GI17760@zax>
References: <20041021013549.GI17760@zax>
Message-ID: <16762.5317.444887.668294@cargo.ozlabs.ibm.com>

David Gibson writes:

> This patch squashes a handful of assorted sparse warnings in the ppc64
> code.  Should be pretty much trivial and self explanatory.
> 
> Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Acked-by: Paul Mackerras <paulus at samba.org>


From paulus at samba.org  Sat Oct 23 18:40:00 2004
From: paulus at samba.org (Paul Mackerras)
Date: Sat, 23 Oct 2004 18:40:00 +1000
Subject: [PATCH] ppc64: cpu hotplug notifier for numa
In-Reply-To: <1098436795.17305.22.camel@biclops>
References: <1098436795.17305.22.camel@biclops>
Message-ID: <16762.6368.569207.26902@cargo.ozlabs.ibm.com>

Nathan Lynch writes:

> This patch modifies the ppc64 numa code to map a cpu to its node right
> before it is brought up -- this means that secondary cpus are now
> mapped to their nodes during smp_init() (regardless of whether
> CONFIG_HOTPLUG_CPU=y).  Cpus are removed from their nodes after they
> have gone offline.

I get this when compiling with CONFIG_NUMA=n:

arch/ppc64/mm/numa.c:243: warning: `cpu_numa_callback' defined but not used

Only a small point, but it would be nicer if that was fixed.

Paul.


From nathanl at austin.ibm.com  Sun Oct 24 05:02:08 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 23 Oct 2004 14:02:08 -0500
Subject: [PATCH] ppc64: cpu hotplug notifier for numa
In-Reply-To: <16762.6368.569207.26902@cargo.ozlabs.ibm.com>
References: <1098436795.17305.22.camel@biclops>
	<16762.6368.569207.26902@cargo.ozlabs.ibm.com>
Message-ID: <1098558128.23102.2.camel@biclops>

On Sat, 2004-10-23 at 03:40, Paul Mackerras wrote:
> Nathan Lynch writes:
> 
> > This patch modifies the ppc64 numa code to map a cpu to its node right
> > before it is brought up -- this means that secondary cpus are now
> > mapped to their nodes during smp_init() (regardless of whether
> > CONFIG_HOTPLUG_CPU=y).  Cpus are removed from their nodes after they
> > have gone offline.
> 
> I get this when compiling with CONFIG_NUMA=n:
> 
> arch/ppc64/mm/numa.c:243: warning: `cpu_numa_callback' defined but not used
> 
> Only a small point, but it would be nicer if that was fixed.

I assume you meant CONFIG_HOTPLUG_CPU=n.  That warning actually
indicates a bug; I'm registering the notifier in the wrong way.

Will send a corrected patch.

Nathan


From geert at linux-m68k.org  Thu Oct 21 18:03:25 2004
From: geert at linux-m68k.org (Geert Uytterhoeven)
Date: Thu, 21 Oct 2004 10:03:25 +0200 (MEST)
Subject: [discuss] Re: [PATCH] Add key management syscalls to non-i386
 archs
In-Reply-To: <20041020164144.3457eafe.davem@davemloft.net>
References: <3506.1098283455@redhat.com>
	<20041020150149.7be06d6d.davem@davemloft.net>
	<20041020225625.GD995@wotan.suse.de>
	<20041020160450.0914270b.davem@davemloft.net>
	<20041020232509.GF995@wotan.suse.de>
	<20041020164144.3457eafe.davem@davemloft.net>
Message-ID: <Pine.GSO.4.61.0410211002020.614@waterleaf.sonytel.be>

On Wed, 20 Oct 2004, David S. Miller wrote:
> On Thu, 21 Oct 2004 01:25:09 +0200
> Andi Kleen <ak at suse.de> wrote:
> 
> > IMHO breaking the build unnecessarily is extremely bad because
> > it will prevent all testing. And would you really want to hold
> > up the whole linux testing machinery just for some obscure 
> > system call? IMHO not a good tradeoff.
> 
> Then change the unistd.h cookie from "#error" to a "#warning".  It
> accomplishes both of our goals.

Please do so! And not only for syscalls, but also for other things.

That way we can procmail all mails sent to lkml or bk-commits-head that
add #warnings to arch/<arch>/ or include/asm-<arch>/.

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert at linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


From nathanl at austin.ibm.com  Sun Oct 24 07:36:43 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 23 Oct 2004 16:36:43 -0500
Subject: [PATCH] ppc64: cpu hotplug notifier for numa (take 2)
In-Reply-To: <1098558128.23102.2.camel@biclops>
References: <1098436795.17305.22.camel@biclops>
	<16762.6368.569207.26902@cargo.ozlabs.ibm.com>
	<1098558128.23102.2.camel@biclops>
Message-ID: <1098567403.23102.28.camel@biclops>


The NUMA properties of all "possible" cpus are not necessarily
available at boot time on ppc64 LPAR.  Only the properties for present
cpus are known.

This patch modifies the ppc64 numa code to map a cpu to its node right
before it is brought up -- this means that secondary cpus are now
mapped to their nodes during smp_init().  Cpus are removed from their
nodes after they have gone offline.
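
The map/unmap lifecycle described above can be sketched in userspace
like this.  It is an illustration under stated assumptions, not the
kernel notifier API: the event and return-code enums, the table sizes
and lookup_node() are all hypothetical stand-ins.

```c
#include <assert.h>

#define MAX_NODES 4	/* hypothetical table sizes */
#define MAX_CPUS  8

/* Hypothetical events and return codes, loosely modelled on notifiers. */
enum cpu_event { EV_UP_PREPARE, EV_UP_CANCELED, EV_DEAD, EV_ONLINE };
enum cb_result { CB_DONE, CB_OK };

int cpu_to_node[MAX_CPUS];
int cpus_in_node[MAX_NODES];
unsigned char cpu_mapped[MAX_CPUS];

/* Hypothetical topology lookup: two cpus per node. */
static int lookup_node(unsigned int cpu)
{
	return (cpu / 2) % MAX_NODES;
}

/*
 * Map a cpu to its node just before it comes up, and unmap it once it
 * is dead or its bring-up was cancelled.  Other events are ignored.
 */
enum cb_result numa_cpu_event(enum cpu_event ev, unsigned int cpu)
{
	assert(cpu < MAX_CPUS);

	switch (ev) {
	case EV_UP_PREPARE:
		if (!cpu_mapped[cpu]) {
			int node = lookup_node(cpu);

			cpu_to_node[cpu] = node;
			cpus_in_node[node]++;
			cpu_mapped[cpu] = 1;
		}
		return CB_OK;
	case EV_UP_CANCELED:
	case EV_DEAD:
		if (cpu_mapped[cpu]) {
			cpus_in_node[cpu_to_node[cpu]]--;
			cpu_mapped[cpu] = 0;
		}
		return CB_OK;
	default:
		return CB_DONE;
	}
}
```

Doing the lookup at bring-up time rather than at boot is what lets a
cpu added later pick up whatever node information is known by then.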

Also some minor cleanups:
- Stash the "minimum common depth" in a global at boot time, so we
  don't have to rediscover it every time something changes.

- Remove unnecessary variable from of_get_associativity() which is
  accessed while possibly uninitialized.

- Remove the cpu portion from dump_numa_topology() since it will show
  only the boot cpu now.  We could display this information from
  smp_cpus_done() if necessary.

Tested on a 4-way 2-node Power5 system.

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>

---


diff -puN arch/ppc64/mm/numa.c~ppc64-numa-cpu-hotplug-notifier arch/ppc64/mm/numa.c
--- 2.6.10-rc1/arch/ppc64/mm/numa.c~ppc64-numa-cpu-hotplug-notifier	2004-10-23 15:10:39.000000000 -0500
+++ 2.6.10-rc1-nathanl/arch/ppc64/mm/numa.c	2004-10-23 16:28:58.000000000 -0500
@@ -15,6 +15,8 @@
 #include <linux/mmzone.h>
 #include <linux/module.h>
 #include <linux/nodemask.h>
+#include <linux/cpu.h>
+#include <linux/notifier.h>
 #include <asm/lmb.h>
 #include <asm/machdep.h>
 #include <asm/abs_addr.h>
@@ -39,6 +41,7 @@ int nr_cpus_in_node[MAX_NUMNODES] = { [0
 struct pglist_data *node_data[MAX_NUMNODES];
 bootmem_data_t __initdata plat_node_bdata[MAX_NUMNODES];
 static unsigned long node0_io_hole_size;
+static int min_common_depth;
 
 /*
  * We need somewhere to store start/span for each node until we have
@@ -64,7 +67,24 @@ static inline void map_cpu_to_node(int c
 	}
 }
 
-static struct device_node * __init find_cpu_node(unsigned int cpu)
+#ifdef CONFIG_HOTPLUG_CPU
+static void unmap_cpu_from_node(unsigned long cpu)
+{
+	int node = numa_cpu_lookup_table[cpu];
+
+	dbg("removing cpu %lu from node %d\n", cpu, node);
+
+	if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) {
+		cpu_clear(cpu, numa_cpumask_lookup_table[node]);
+		nr_cpus_in_node[node]--;
+	} else {
+		printk(KERN_ERR "WARNING: cpu %lu not found in node %d\n",
+		       cpu, node);
+	}
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+static struct device_node * __devinit find_cpu_node(unsigned int cpu)
 {
 	unsigned int hw_cpuid = get_hard_smp_processor_id(cpu);
 	struct device_node *cpu_node = NULL;
@@ -96,26 +116,21 @@ static struct device_node * __init find_
 
 /* must hold reference to node during call */
 static int *of_get_associativity(struct device_node *dev)
- {
-	unsigned int *result;
-	int len;
-
-	result = (unsigned int *)get_property(dev, "ibm,associativity", &len);
-
-	if (len <= 0)
-		return NULL;
-
-	return result;
+{
+	return (unsigned int *)get_property(dev, "ibm,associativity", NULL);
 }
 
-static int of_node_numa_domain(struct device_node *device, int depth)
+static int of_node_numa_domain(struct device_node *device)
 {
 	int numa_domain;
 	unsigned int *tmp;
 
+	if (min_common_depth == -1)
+		return 0;
+
 	tmp = of_get_associativity(device);
-	if (tmp && (tmp[0] >= depth)) {
-		numa_domain = tmp[depth];
+	if (tmp && (tmp[0] >= min_common_depth)) {
+		numa_domain = tmp[min_common_depth];
 	} else {
 		dbg("WARNING: no NUMA information for %s\n",
 		    device->full_name);
@@ -138,7 +153,7 @@ static int of_node_numa_domain(struct de
  *
  * - Dave Hansen <haveblue at us.ibm.com>
  */
-static int find_min_common_depth(void)
+static int __init find_min_common_depth(void)
 {
 	int depth;
 	unsigned int *ref_points;
@@ -185,11 +200,72 @@ static unsigned long read_cell_ul(struct
 	return result;
 }
 
+/*
+ * Figure out to which domain a cpu belongs and stick it there.
+ * Return the id of the domain used.
+ */
+static int numa_setup_cpu(unsigned long lcpu)
+{
+	int numa_domain = 0;
+	struct device_node *cpu = find_cpu_node(lcpu);
+
+	if (!cpu) {
+		WARN_ON(1);
+		goto out;
+	}
+
+	numa_domain = of_node_numa_domain(cpu);
+
+	if (numa_domain >= MAX_NUMNODES) {
+		/*
+		 * POWER4 LPAR uses 0xffff as invalid node,
+		 * dont warn in this case.
+		 */
+		if (numa_domain != 0xffff)
+			printk(KERN_ERR "WARNING: cpu %ld "
+			       "maps to invalid NUMA node %d\n",
+			       lcpu, numa_domain);
+		numa_domain = 0;
+	}
+out:
+	node_set_online(numa_domain);
+
+	map_cpu_to_node(lcpu, numa_domain);
+
+	of_node_put(cpu);
+
+	return numa_domain;
+}
+
+static int cpu_numa_callback(struct notifier_block *nfb,
+			     unsigned long action,
+			     void *hcpu)
+{
+	unsigned long lcpu = (unsigned long)hcpu;
+	int ret = NOTIFY_DONE;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		if (min_common_depth == -1 || !numa_enabled)
+			map_cpu_to_node(lcpu, 0);
+		else
+			numa_setup_cpu(lcpu);
+		ret = NOTIFY_OK;
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		unmap_cpu_from_node(lcpu);
+		ret = NOTIFY_OK;
+		break;
+#endif
+	}
+	return ret;
+}
+
 static int __init parse_numa_properties(void)
 {
-	struct device_node *cpu = NULL;
 	struct device_node *memory = NULL;
-	int depth;
 	int max_domain = 0;
 	long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT;
 	unsigned long i;
@@ -206,44 +282,13 @@ static int __init parse_numa_properties(
 	for (i = 0; i < entries ; i++)
 		numa_memory_lookup_table[i] = ARRAY_INITIALISER;
 
-	depth = find_min_common_depth();
-
-	dbg("NUMA associativity depth for CPU/Memory: %d\n", depth);
-	if (depth < 0)
-		return depth;
-
-	for_each_cpu(i) {
-		int numa_domain;
-
-		cpu = find_cpu_node(i);
-
-		if (cpu) {
-			numa_domain = of_node_numa_domain(cpu, depth);
-			of_node_put(cpu);
-
-			if (numa_domain >= MAX_NUMNODES) {
-				/*
-			 	 * POWER4 LPAR uses 0xffff as invalid node,
-				 * dont warn in this case.
-			 	 */
-				if (numa_domain != 0xffff)
-					printk(KERN_ERR "WARNING: cpu %ld "
-					       "maps to invalid NUMA node %d\n",
-					       i, numa_domain);
-				numa_domain = 0;
-			}
-		} else {
-			dbg("WARNING: no NUMA information for cpu %ld\n", i);
-			numa_domain = 0;
-		}
-
-		node_set_online(numa_domain);
+	min_common_depth = find_min_common_depth();
 
-		if (max_domain < numa_domain)
-			max_domain = numa_domain;
+	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
+	if (min_common_depth < 0)
+		return min_common_depth;
 
-		map_cpu_to_node(i, numa_domain);
-	}
+	max_domain = numa_setup_cpu(boot_cpuid);
 
 	memory = NULL;
 	while ((memory = of_find_node_by_type(memory, "memory")) != NULL) {
@@ -267,7 +312,7 @@ new_range:
 		start = _ALIGN_DOWN(start, MEMORY_INCREMENT);
 		size = _ALIGN_UP(size, MEMORY_INCREMENT);
 
-		numa_domain = of_node_numa_domain(memory, depth);
+		numa_domain = of_node_numa_domain(memory);
 
 		if (numa_domain >= MAX_NUMNODES) {
 			if (numa_domain != 0xffff)
@@ -341,8 +386,7 @@ static void __init setup_nonnuma(void)
 			numa_memory_lookup_table[i] = ARRAY_INITIALISER;
 	}
 
-	for (i = 0; i < NR_CPUS; i++)
-		map_cpu_to_node(i, 0);
+	map_cpu_to_node(boot_cpuid, 0);
 
 	node_set_online(0);
 
@@ -358,35 +402,10 @@ static void __init setup_nonnuma(void)
 static void __init dump_numa_topology(void)
 {
 	unsigned int node;
-	unsigned int cpu, count;
+	unsigned int count;
 
-	for (node = 0; node < MAX_NUMNODES; node++) {
-		if (!node_online(node))
-			continue;
-
-		printk(KERN_INFO "Node %d CPUs:", node);
-
-		count = 0;
-		/*
-		 * If we used a CPU iterator here we would miss printing
-		 * the holes in the cpumap.
-		 */
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {
-			if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) {
-				if (count == 0)
-					printk(" %u", cpu);
-				++count;
-			} else {
-				if (count > 1)
-					printk("-%u", cpu - 1);
-				count = 0;
-			}
-		}
-
-		if (count > 1)
-			printk("-%u", NR_CPUS - 1);
-		printk("\n");
-	}
+	if (min_common_depth == -1 || !numa_enabled)
+		return;
 
 	for (node = 0; node < MAX_NUMNODES; node++) {
 		unsigned long i;
@@ -414,6 +433,7 @@ static void __init dump_numa_topology(vo
 			printk("-0x%lx", i);
 		printk("\n");
 	}
+	return;
 }
 
 /*
@@ -460,6 +480,10 @@ static unsigned long careful_allocation(
 void __init do_init_bootmem(void)
 {
 	int nid;
+	static struct notifier_block ppc64_numa_nb = {
+		.notifier_call = cpu_numa_callback,
+		.priority = 1 /* Must run before sched domains notifier. */
+	};
 
 	min_low_pfn = 0;
 	max_low_pfn = lmb_end_of_DRAM() >> PAGE_SHIFT;
@@ -470,6 +494,8 @@ void __init do_init_bootmem(void)
 	else
 		dump_numa_topology();
 
+	register_cpu_notifier(&ppc64_numa_nb);
+
 	for (nid = 0; nid < numnodes; nid++) {
 		unsigned long start_paddr, end_paddr;
 		int i;

_


From dwm at austin.ibm.com  Sun Oct 24 10:55:36 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Sat, 23 Oct 2004 19:55:36 -0500
Subject: [PATCH 1/1] build modular usb isd200 with modular ide
Message-ID: <200410240055.i9O0taCf006206@falcon10.austin.ibm.com>


Name: inline ide_fix_driveid()

Rationale:
	This is a fix for bugme.osdl 3819.
	With any of the 2.6.9 release flavors (vanilla, mm1, ac3), one
	cannot build the usb isd200 module due to the dependency on
	ide_fix_driveid() being exported from ide-iops.

Description:
	When building IDE modular, the current ide_fix_driveid() is
	exported from ide-iops.c.  This patch makes the function an inline.

Status: compile tested on ppc64.  Other issues prevent run test.


Signed-off-by: Doug Maxey <dwm at austin.ibm.com>

ChangeLog:

++doug

 drivers/ide/ide-iops.c |   98 -------------------------------------------------
 include/linux/ide.h    |   48 +++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 99 deletions(-)

diff -Nwupa lk-2.6.9-mm1/drivers/ide/ide-iops.c lk-2.6.9-mm1.edit/drivers/ide/ide-iops.c
--- lk-2.6.9-mm1/drivers/ide/ide-iops.c	2004-10-22 15:10:30.465342832 -0500
+++ lk-2.6.9-mm1.edit/drivers/ide/ide-iops.c	2004-10-23 00:22:16.901355000 -0500
@@ -352,104 +352,6 @@ EXPORT_SYMBOL(atapi_output_bytes);
 /*
  * Beginning of Taskfile OPCODE Library and feature sets.
  */
-void ide_fix_driveid (struct hd_driveid *id)
-{
-#ifndef __LITTLE_ENDIAN
-# ifdef __BIG_ENDIAN
-	int i;
-	u16 *stringcast;
-
-	id->config         = __le16_to_cpu(id->config);
-	id->cyls           = __le16_to_cpu(id->cyls);
-	id->reserved2      = __le16_to_cpu(id->reserved2);
-	id->heads          = __le16_to_cpu(id->heads);
-	id->track_bytes    = __le16_to_cpu(id->track_bytes);
-	id->sector_bytes   = __le16_to_cpu(id->sector_bytes);
-	id->sectors        = __le16_to_cpu(id->sectors);
-	id->vendor0        = __le16_to_cpu(id->vendor0);
-	id->vendor1        = __le16_to_cpu(id->vendor1);
-	id->vendor2        = __le16_to_cpu(id->vendor2);
-	stringcast = (u16 *)&id->serial_no[0];
-	for (i = 0; i < (20/2); i++)
-		stringcast[i] = __le16_to_cpu(stringcast[i]);
-	id->buf_type       = __le16_to_cpu(id->buf_type);
-	id->buf_size       = __le16_to_cpu(id->buf_size);
-	id->ecc_bytes      = __le16_to_cpu(id->ecc_bytes);
-	stringcast = (u16 *)&id->fw_rev[0];
-	for (i = 0; i < (8/2); i++)
-		stringcast[i] = __le16_to_cpu(stringcast[i]);
-	stringcast = (u16 *)&id->model[0];
-	for (i = 0; i < (40/2); i++)
-		stringcast[i] = __le16_to_cpu(stringcast[i]);
-	id->dword_io       = __le16_to_cpu(id->dword_io);
-	id->reserved50     = __le16_to_cpu(id->reserved50);
-	id->field_valid    = __le16_to_cpu(id->field_valid);
-	id->cur_cyls       = __le16_to_cpu(id->cur_cyls);
-	id->cur_heads      = __le16_to_cpu(id->cur_heads);
-	id->cur_sectors    = __le16_to_cpu(id->cur_sectors);
-	id->cur_capacity0  = __le16_to_cpu(id->cur_capacity0);
-	id->cur_capacity1  = __le16_to_cpu(id->cur_capacity1);
-	id->lba_capacity   = __le32_to_cpu(id->lba_capacity);
-	id->dma_1word      = __le16_to_cpu(id->dma_1word);
-	id->dma_mword      = __le16_to_cpu(id->dma_mword);
-	id->eide_pio_modes = __le16_to_cpu(id->eide_pio_modes);
-	id->eide_dma_min   = __le16_to_cpu(id->eide_dma_min);
-	id->eide_dma_time  = __le16_to_cpu(id->eide_dma_time);
-	id->eide_pio       = __le16_to_cpu(id->eide_pio);
-	id->eide_pio_iordy = __le16_to_cpu(id->eide_pio_iordy);
-	for (i = 0; i < 2; ++i)
-		id->words69_70[i] = __le16_to_cpu(id->words69_70[i]);
-	for (i = 0; i < 4; ++i)
-		id->words71_74[i] = __le16_to_cpu(id->words71_74[i]);
-	id->queue_depth    = __le16_to_cpu(id->queue_depth);
-	for (i = 0; i < 4; ++i)
-		id->words76_79[i] = __le16_to_cpu(id->words76_79[i]);
-	id->major_rev_num  = __le16_to_cpu(id->major_rev_num);
-	id->minor_rev_num  = __le16_to_cpu(id->minor_rev_num);
-	id->command_set_1  = __le16_to_cpu(id->command_set_1);
-	id->command_set_2  = __le16_to_cpu(id->command_set_2);
-	id->cfsse          = __le16_to_cpu(id->cfsse);
-	id->cfs_enable_1   = __le16_to_cpu(id->cfs_enable_1);
-	id->cfs_enable_2   = __le16_to_cpu(id->cfs_enable_2);
-	id->csf_default    = __le16_to_cpu(id->csf_default);
-	id->dma_ultra      = __le16_to_cpu(id->dma_ultra);
-	id->trseuc         = __le16_to_cpu(id->trseuc);
-	id->trsEuc         = __le16_to_cpu(id->trsEuc);
-	id->CurAPMvalues   = __le16_to_cpu(id->CurAPMvalues);
-	id->mprc           = __le16_to_cpu(id->mprc);
-	id->hw_config      = __le16_to_cpu(id->hw_config);
-	id->acoustic       = __le16_to_cpu(id->acoustic);
-	id->msrqs          = __le16_to_cpu(id->msrqs);
-	id->sxfert         = __le16_to_cpu(id->sxfert);
-	id->sal            = __le16_to_cpu(id->sal);
-	id->spg            = __le32_to_cpu(id->spg);
-	id->lba_capacity_2 = __le64_to_cpu(id->lba_capacity_2);
-	for (i = 0; i < 22; i++)
-		id->words104_125[i]   = __le16_to_cpu(id->words104_125[i]);
-	id->last_lun       = __le16_to_cpu(id->last_lun);
-	id->word127        = __le16_to_cpu(id->word127);
-	id->dlf            = __le16_to_cpu(id->dlf);
-	id->csfo           = __le16_to_cpu(id->csfo);
-	for (i = 0; i < 26; i++)
-		id->words130_155[i] = __le16_to_cpu(id->words130_155[i]);
-	id->word156        = __le16_to_cpu(id->word156);
-	for (i = 0; i < 3; i++)
-		id->words157_159[i] = __le16_to_cpu(id->words157_159[i]);
-	id->cfa_power      = __le16_to_cpu(id->cfa_power);
-	for (i = 0; i < 14; i++)
-		id->words161_175[i] = __le16_to_cpu(id->words161_175[i]);
-	for (i = 0; i < 31; i++)
-		id->words176_205[i] = __le16_to_cpu(id->words176_205[i]);
-	for (i = 0; i < 48; i++)
-		id->words206_254[i] = __le16_to_cpu(id->words206_254[i]);
-	id->integrity_word  = __le16_to_cpu(id->integrity_word);
-# else
-#  error "Please fix <asm/byteorder.h>"
-# endif
-#endif
-}
-
-EXPORT_SYMBOL(ide_fix_driveid);
 
 void ide_fixstring (u8 *s, const int bytecount, const int byteswap)
 {
diff -Nwupa lk-2.6.9-mm1/include/linux/ide.h lk-2.6.9-mm1.edit/include/linux/ide.h
--- lk-2.6.9-mm1/include/linux/ide.h	2004-10-22 15:10:36.748318728 -0500
+++ lk-2.6.9-mm1.edit/include/linux/ide.h	2004-10-23 15:28:52.635380680 -0500
@@ -1204,7 +1204,53 @@ extern ide_startstop_t ide_abort(ide_dri
  */
 extern void ide_cmd(ide_drive_t *, u8, u8, ide_handler_t *);
 
-extern void ide_fix_driveid(struct hd_driveid *);
+/*
+ * ide_fix_driveid - fix IDENTIFY DEVICE data for big endian machines.
+ * @id - pointer to data from drive.
+ *
+ * Could be a one liner except for the 3 x 32 bit and 2 x 64 bit
+ * fields.  Offsets are from d1532v1r4.
+ */
+static inline void ide_fix_driveid (struct hd_driveid *id)
+{
+#ifndef __LITTLE_ENDIAN
+# ifdef __BIG_ENDIAN
+	u16 *sp = (u16*)id;
+
+	for (; sp < ((u16*)id) + 61; sp++) *sp = __le16_to_cpu(*sp);
+
+	/* lba_capacity */
+	*((u32*)sp)  = __le32_to_cpu(*((u32*)sp));
+	sp += 2;
+
+	for (; sp < ((u16*)id) + 98; sp++)
+		*sp = __le16_to_cpu(*sp);
+
+	/* Streaming Performance Granularity. words 98-99 */
+	*((u32*)sp)  = __le32_to_cpu(*((u32*)sp));
+	sp += 2;			/* word 100 */
+
+	/* lba_capacity2. words 100-103 */
+	*((u64*)sp) = __le64_to_cpu(*((u64*)sp));
+	sp += 4;			/* word 104 */
+
+	for (; sp < ((u16*)id) + 117; sp++)
+		*sp = __le16_to_cpu(*sp);
+
+
+	/* Words per Logical Sector. words 117-118 */
+	*((u32*)sp)  = __le32_to_cpu(*((u32*)sp));
+	sp += 2;			/* word 119 */
+
+	for (; sp < ((u16*)id) + 256; sp++)
+		*sp = __le16_to_cpu(*sp);
+
+# else
+#  error "Please fix <asm/byteorder.h>"
+# endif
+#endif
+}
+
 /*
  * ide_fixstring() cleans up and (optionally) byte-swaps a text string,
  * removing leading/trailing blanks and compressing internal blanks.


From hch at lst.de  Sun Oct 24 20:03:19 2004
From: hch at lst.de (Christoph Hellwig)
Date: Sun, 24 Oct 2004 12:03:19 +0200
Subject: [PATCH 1/1] build modular usb isd200 with modular ide
In-Reply-To: <200410240055.i9O0taCf006206@falcon10.austin.ibm.com>
References: <200410240055.i9O0taCf006206@falcon10.austin.ibm.com>
Message-ID: <20041024100319.GA17183@lst.de>

On Sat, Oct 23, 2004 at 07:55:36PM -0500, Doug Maxey wrote:
> 
> Name: inline ide_fix_driveid()
> 
> Rationale:
> 	This is a fix for bugme.osdl 3819.

bugme.osdl.org doesn't know of a bug #3819.

> 	With any of the 2.6.9 release flavors (vanilla, mm1, ac3), one
> 	cannot build the usb isd200 module due to the dependency on
> 	ide_fix_driveid() being exported from ide-iops.
> 
> Description:
> 	When building IDE modular, the current ide_fix_driveid() is
> 	exported from ide-iops.c.  This patch makes the function an inline.

Still doesn't make any sense. ide_fix_driveid is properly exported from
ide-iops.c, so you can use it from other modules.  The only case that
doesn't work is modular ide and builtin usb-storage, and the BLK_DEV_IDE
dependency should fix that one.

If you think that dependency is ugly (I do) just copy the routine to
isd200.c, it's a) too large to inline but b) just a trivial byteswap
that shouldn't need many changes over time.


From bzolnier at gmail.com  Sun Oct 24 22:45:53 2004
From: bzolnier at gmail.com (Bartlomiej Zolnierkiewicz)
Date: Sun, 24 Oct 2004 14:45:53 +0200
Subject: [PATCH 1/1] build modular usb isd200 with modular ide
In-Reply-To: <20041024100319.GA17183@lst.de>
References: <200410240055.i9O0taCf006206@falcon10.austin.ibm.com>
	<20041024100319.GA17183@lst.de>
Message-ID: <58cb370e041024054575c09679@mail.gmail.com>

On Sun, 24 Oct 2004 12:03:19 +0200, Christoph Hellwig <hch at lst.de> wrote:
> On Sat, Oct 23, 2004 at 07:55:36PM -0500, Doug Maxey wrote:
> >
> > Name: inline ide_fix_driveid()
> >
> > Rationale:
> >       This is a fix for bugme.osdl 3819.
> 
> bugme.osdl.org doesn't know of a bug #3819.
> 
> >       With any of the 2.6.9 release flavors (vanilla, mm1, ac3), one
> >       cannot build the usb isd200 module due to the dependency on
> >       ide_fix_driveid() being exported from ide-iops.
> >
> > Description:
> >       When building IDE modular, the current ide_fix_driveid() is
> >       exported from ide-iops.c.  This patch makes the function an inline.
> 
> Still doesn't make any sense. ide_fix_driveid is properly exported from
> ide-iops.c, so you can use it from other modules.  The only case that
> doesn't work is modular ide and builtin usb-storage, and the BLK_DEV_IDE
> dependency should fix that one.

The new ide_fix_driveid function seems buggy,
i.e. it byte-swaps id->max_multsect with id->vendor3.
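
To illustrate the point, here is a minimal user-space sketch of what a
blanket 16-bit swap does to a word that really holds two separate 8-bit
fields.  The struct id_word47 layout and the helper names are assumptions
made for this example only; the real struct hd_driveid is much larger.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-field excerpt standing in for one word of the ID data. */
struct id_word47 {
	uint8_t max_multsect;
	uint8_t vendor3;
};

/* Unconditional 16-bit byte swap, the effect of a le16 conversion on a
 * big-endian CPU. */
static uint16_t swab16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Treat the two u8 fields as one u16 and swap it, as a loop over every
 * 16-bit word of the structure would. */
static void blanket_swap(struct id_word47 *w)
{
	uint16_t raw;

	assert(sizeof(*w) == sizeof(raw));
	memcpy(&raw, w, sizeof(raw));
	raw = swab16(raw);
	memcpy(w, &raw, sizeof(raw));
}
```

After blanket_swap() the two fields have exchanged values, which is why
the removed field-by-field version left such byte pairs alone.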

> If you think that dependency is ugly (I do) just copy the routine to
> isd200.c, it's a) too large to inline but b) just a trivial byteswap
> that shouldn't need many changes over time.

The dependency is a bug; <linux/ide.h> is for the IDE driver only.

Doug, if you kill debugging code in isd200.c then only:

id->command_set_1
id->model
id->fw_rev
id->capability
id->lba_capacity
id->heads
id->cyls
id->sectors
id->command_set_2

need to be byte-swapped.


From dwm at austin.ibm.com  Mon Oct 25 09:11:12 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Sun, 24 Oct 2004 18:11:12 -0500
Subject: [PATCH 1/1] build modular usb isd200 with modular ide 
In-Reply-To: <20041024100319.GA17183@lst.de> 
Message-ID: <200410242311.i9ONBCKp019869@falcon10.austin.ibm.com>


On Sun, 24 Oct 2004 12:03:19 +0200, Christoph Hellwig wrote:
>On Sat, Oct 23, 2004 at 07:55:36PM -0500, Doug Maxey wrote:
>> 
>> Name: inline ide_fix_driveid()
>> 
>> Rationale:
>> 	This is a fix for bugme.osdl 3819.
>
>bugme.osdl.org doesn't know of a bug #3819.

Uh Oh.  Should be 3618.  Have no idea where 3819 came from.

>
>> 	With any of the 2.6.9 release flavors (vanilla, mm1, ac3), one
>> 	cannot build the usb isd200 module due to the dependency on
>> 	ide_fix_driveid() being exported from ide-iops.
>> 
>> Description:
>> 	When building IDE modular, the current ide_fix_driveid() is
>> 	exported from ide-iops.c.  This patch makes the function an inline.
>
>Still doesn't make any sense. ide_fix_driveid is properly exported from
>ide-iops.c, so you can use it from other modules.  The only case that
>doesn't work is modular ide and builtin usb-storage, and the BLK_DEV_IDE
>dependency should fix that one.
>
>If you think that dependency is ugly (I do) just copy the routine to

What happened to common code that may have more uses than originally 
intended?  Do it right once in one place, and make it available.

>isd200.c, it's a) too large to inline but b) just a trivial byteswap
>that shouldn't need many changes over time.

Except for those few 32 and 64 bit quantities that need word (16 bit) swaps,
I agree completely.  
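
One reading of the 32-bit case, as a sketch: byte-swapping each 16-bit
half in place is not enough on a big-endian CPU, because the low word
comes first in the little-endian data and would end up in the high half.
The helper names below are made up for this example, and the code works
on raw bytes so that it behaves the same on any host.

```c
#include <stdint.h>

/* Read a little-endian 16-bit word from raw bytes, on any host. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Correct 32-bit read: the first word is the low half. */
static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)get_le16(p) | ((uint32_t)get_le16(p + 2) << 16);
}

/* Value a big-endian CPU would load if each 16-bit word had been
 * byte-swapped in place but the two words kept their original order:
 * the first word lands in the high half. */
static uint32_t be_view_after_word_swaps(const uint8_t *p)
{
	return ((uint32_t)get_le16(p) << 16) | get_le16(p + 2);
}
```

For the bytes 78 56 34 12 the correct value is 0x12345678, while the
per-word view gives 0x56781234.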

The points I was trying to make were that
1) This is called in only a few places.
2) It is never on a fast path.
3) The sequence of the named elements was a little bit much.  The meanings
   of the words change, and quite a few of the fields no longer have their
   original meaning or definition.
4) Having a single (even if somewhat large) inline handles all current (and
   future) uses.

++doug


From benh at kernel.crashing.org  Mon Oct 25 10:47:09 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 25 Oct 2004 10:47:09 +1000
Subject: [PATCH] ppc64: cleanups of ppc64 pci.c
Message-ID: <1098665227.16132.11.camel@gaston>

Hi !

This patch applies on top of previously posted "ppc64: Move PCI IO mapping
from pSeries_pci.c to pci.c".

It does cosmetic cleanups & adds some debug macros to pci.c without actually
changing any functionality. Further patches against ppc64 pci.c that I'll
post will be against a file already patched with this one.
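
For readers unfamiliar with the debug macro pattern, here is a
stand-alone sketch of it.  It is not the kernel code: it uses the
standard C99 __VA_ARGS__ form and fprintf() as a stand-in for
udbg_printf(), and range_end() is a hypothetical helper.

```c
#include <stdio.h>

#undef DEBUG	/* change to "#define DEBUG" to enable the trace output */

#ifdef DEBUG
#define DBG(...) fprintf(stderr, __VA_ARGS__)
#else
#define DBG(...) do { } while (0)
#endif

/* Hypothetical helper: last address of a range, with optional tracing. */
static unsigned long range_end(unsigned long start, unsigned long size)
{
	DBG("IO 0x%lx -> 0x%lx\n", start, start + size - 1);
	return start + size - 1;
}
```

With DEBUG undefined the DBG() calls compile away, so the function
behaves the same with or without tracing.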

Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>


Index: linux-work/arch/ppc64/kernel/pci.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pci.c	2004-10-25 10:30:34.841855848 +1000
+++ linux-work/arch/ppc64/kernel/pci.c	2004-10-25 10:36:50.724712968 +1000
@@ -11,6 +11,8 @@
  *      2 of the License, or (at your option) any later version.
  */
 
+#undef DEBUG
+
 #include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/pci.h>
@@ -39,6 +41,12 @@
 
 #include "pci.h"
 
+#ifdef DEBUG
+#define DBG(fmt...) udbg_printf(fmt)
+#else
+#define DBG(fmt...)
+#endif
+
 unsigned long pci_probe_only = 1;
 unsigned long pci_assign_all_buses = 0;
 
@@ -106,11 +114,11 @@
 			dev->resource[i].flags &= ~IORESOURCE_IO;
 	}
 }
-DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_WINBOND, PCI_DEVICE_ID_WINBOND_82C105, fixup_windbond_82c105);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_WINBOND, PCI_DEVICE_ID_WINBOND_82C105,
+			 fixup_windbond_82c105);
 
-void 
-pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
-			struct resource *res)
+void  pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
+			      struct resource *res)
 {
 	unsigned long offset = 0;
 	struct pci_controller *hose = PCI_GET_PHB_PTR(dev);
@@ -215,8 +223,7 @@
 /*
  * Allocate pci_controller(phb) initialized common variables.
  */
-struct pci_controller * __init
-pci_alloc_pci_controller(enum phb_types controller_type)
+struct pci_controller * __init pci_alloc_pci_controller(enum phb_types controller_type)
 {
 	struct pci_controller *hose;
 
@@ -246,8 +253,7 @@
 /*
  * Dymnamically allocate pci_controller(phb), initialize common variables.
  */
-struct pci_controller *
-pci_alloc_phb_dynamic(enum phb_types controller_type)
+struct pci_controller * pci_alloc_phb_dynamic(enum phb_types controller_type)
 {
 	struct pci_controller *hose;
 
@@ -430,9 +436,9 @@
  *
  * Returns negative error code on failure, zero on success.
  */
-static __inline__ int
-__pci_mmap_make_offset(struct pci_dev *dev, struct vm_area_struct *vma,
-		       enum pci_mmap_state mmap_state)
+static __inline__ int __pci_mmap_make_offset(struct pci_dev *dev,
+					     struct vm_area_struct *vma,
+					     enum pci_mmap_state mmap_state)
 {
 	struct pci_controller *hose = PCI_GET_PHB_PTR(dev);
 	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
@@ -487,9 +493,9 @@
  * Set vm_flags of VMA, as appropriate for this architecture, for a pci device
  * mapping.
  */
-static __inline__ void
-__pci_mmap_set_flags(struct pci_dev *dev, struct vm_area_struct *vma,
-		     enum pci_mmap_state mmap_state)
+static __inline__ void __pci_mmap_set_flags(struct pci_dev *dev,
+					    struct vm_area_struct *vma,
+					    enum pci_mmap_state mmap_state)
 {
 	vma->vm_flags |= VM_SHM | VM_LOCKED | VM_IO;
 }
@@ -498,9 +504,10 @@
  * Set vm_page_prot of VMA, as appropriate for this architecture, for a pci
  * device mapping.
  */
-static __inline__ void
-__pci_mmap_set_pgprot(struct pci_dev *dev, struct vm_area_struct *vma,
-		      enum pci_mmap_state mmap_state, int write_combine)
+static __inline__ void __pci_mmap_set_pgprot(struct pci_dev *dev,
+					     struct vm_area_struct *vma,
+					     enum pci_mmap_state mmap_state,
+					     int write_combine)
 {
 	long prot = pgprot_val(vma->vm_page_prot);
 
@@ -613,7 +620,7 @@
 }
 
 void __devinit pci_process_bridge_OF_ranges(struct pci_controller *hose,
-					struct device_node *dev)
+					    struct device_node *dev)
 {
 	unsigned int *ranges;
 	unsigned long size;
@@ -654,6 +661,8 @@
 			res = &hose->io_resource;
 			res->flags = IORESOURCE_IO;
 			res->start = pci_addr;
+			DBG("phb%d: IO 0x%lx -> 0x%lx\n", hose->global_number,
+				    res->start, res->start + size - 1);
 			break;
 		case 2:		/* memory space */
 			memno = 0;
@@ -666,6 +675,8 @@
 				res = &hose->mem_resources[memno];
 				res->flags = IORESOURCE_MEM;
 				res->start = cpu_phys_addr;
+				DBG("phb%d: MEM 0x%lx -> 0x%lx\n", hose->global_number,
+					    res->start, res->start + size - 1);
 			}
 			break;
 		}
@@ -873,7 +884,8 @@
 
 	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
 		if (dev->resource[i].flags & IORESOURCE_IO) {
-			unsigned long offset = (unsigned long)hose->io_base_virt - pci_io_base;
+			unsigned long offset = (unsigned long)hose->io_base_virt
+				- pci_io_base;
                         unsigned long start, end, mask;
 
                         start = dev->resource[i].start += offset;


From benh at kernel.crashing.org  Mon Oct 25 11:26:30 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 25 Oct 2004 11:26:30 +1000
Subject: [PATCH] ppc64: Rework PCI <-> OF node matching
Message-ID: <1098667590.26695.1.camel@gaston>

This patch reworks the code that deals with matching PCI devices
with Open Firmware device nodes. This code made several incorrect
assumptions and can be simplified significantly. The main functional
difference now is that PHBs are no longer special cased, but that
shouldn't cause any specific problem.
It also fixes a problem where u3_iommu.c wouldn't work for PCI
devices that lacked a matching OF device node.
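
As a rough sketch of the matching this relies on: the first cell of a
node's "reg" property is decoded as 00BBSS00 into a bus number and a
devfn, and a lookup compares those against a search value carrying the
bus in bits 8-15 and the devfn in bits 0-7.  The struct pci_addr type and
the function names are made up for this example; the shifts and masks
follow the patch below.

```c
#include <stdint.h>

/* Hypothetical holder for the two fields kept per device node. */
struct pci_addr {
	int busno;
	int devfn;
};

/* Decode the first "reg" cell, laid out as 00BBSS00. */
static struct pci_addr decode_reg(uint32_t reg0)
{
	struct pci_addr a;

	a.busno = (reg0 >> 16) & 0xff;
	a.devfn = (reg0 >> 8) & 0xff;
	return a;
}

/* Compare against a search value of the form (busno << 8) | devfn. */
static int addr_matches(struct pci_addr a, unsigned long searchval)
{
	int busno = (searchval >> 8) & 0xff;
	int devfn = searchval & 0xff;

	return a.devfn == devfn && a.busno == busno;
}
```

A node whose busno and devfn are both set to -1, as done for PHB nodes,
can never match, since the masked search values are always in 0..255.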

Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>

Index: linux-work/arch/ppc64/kernel/u3_iommu.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/u3_iommu.c	2004-10-17 12:07:07.000000000 +1000
+++ linux-work/arch/ppc64/kernel/u3_iommu.c	2004-10-25 11:12:22.000000000 +1000
@@ -267,6 +267,7 @@
 
 void iommu_setup_u3(void)
 {
+	struct pci_controller *phb, *tmp;
 	struct pci_dev *dev = NULL;
 	struct device_node *dn;
 
@@ -299,6 +300,11 @@
 		if (dn)
 			dn->iommu_table = &iommu_table_u3;
 	}
+	/* We also make sure we set all PHBs ... */
+	list_for_each_entry_safe(phb, tmp, &hose_list, list_node) {
+		dn = (struct device_node *)phb->arch_data;
+		dn->iommu_table = &iommu_table_u3;
+	}
 }
 
 void __init alloc_u3_dart_table(void)
Index: linux-work/arch/ppc64/kernel/pci_dn.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pci_dn.c	2004-10-20 13:01:00.000000000 +1000
+++ linux-work/arch/ppc64/kernel/pci_dn.c	2004-10-25 11:15:30.000000000 +1000
@@ -46,29 +46,13 @@
 {
 	struct pci_controller *phb = data;
 	u32 *regs;
-	char *device_type = get_property(dn, "device_type", NULL);
-	char *model;
 
 	dn->phb = phb;
-	if (device_type && (strcmp(device_type, "pci") == 0) &&
-			(get_property(dn, "class-code", NULL) == 0)) {
-		/* special case for PHB's.  Sigh. */
-		regs = (u32 *)get_property(dn, "bus-range", NULL);
-		dn->busno = regs[0];
-
-		model = (char *)get_property(dn, "model", NULL);
-
-		if (strstr(model, "U3"))
-			dn->devfn = -1;
-		else
-			dn->devfn = 0;	/* assumption */
-	} else {
-		regs = (u32 *)get_property(dn, "reg", NULL);
-		if (regs) {
-			/* First register entry is addr (00BBSS00)  */
-			dn->busno = (regs[0] >> 16) & 0xff;
-			dn->devfn = (regs[0] >> 8) & 0xff;
-		}
+	regs = (u32 *)get_property(dn, "reg", NULL);
+	if (regs) {
+		/* First register entry is addr (00BBSS00)  */
+		dn->busno = (regs[0] >> 16) & 0xff;
+		dn->devfn = (regs[0] >> 8) & 0xff;
 	}
 	return NULL;
 }
@@ -97,20 +81,25 @@
 	struct device_node *dn, *nextdn;
 	void *ret;
 
-	if (pre && ((ret = pre(start, data)) != NULL))
-		return ret;
+	/* We started with a phb, iterate over all children */
 	for (dn = start->child; dn; dn = nextdn) {
+		u32 *classp, class;
+
 		nextdn = NULL;
-		if (get_property(dn, "class-code", NULL)) {
-			if (pre && ((ret = pre(dn, data)) != NULL))
-				return ret;
-			if (dn->child)
-				/* Depth first...do children */
-				nextdn = dn->child;
-			else if (dn->sibling)
-				/* ok, try next sibling instead. */
-				nextdn = dn->sibling;
-		}
+		classp = (u32 *)get_property(dn, "class-code", NULL);
+		class = classp ? *classp : 0;
+
+		if (pre && ((ret = pre(dn, data)) != NULL))
+			return ret;
+
+		/* If we are a PCI bridge, go down */
+		if (dn->child && ((class >> 8) == PCI_CLASS_BRIDGE_PCI ||
+				  (class >> 8) == PCI_CLASS_BRIDGE_CARDBUS))
+			/* Depth first...do children */
+			nextdn = dn->child;
+		else if (dn->sibling)
+			/* ok, try next sibling instead. */
+			nextdn = dn->sibling;
 		if (!nextdn) {
 			/* Walk up to next valid sibling. */
 			do {
@@ -124,26 +113,16 @@
 	return NULL;
 }
 
-/*
- * Same as traverse_pci_devices except this does it for all phbs.
- */
-static void *traverse_all_pci_devices(traverse_func pre)
+void __devinit pci_devs_phb_init_dynamic(struct pci_controller *phb)
 {
-	struct pci_controller *phb, *tmp;
-	void *ret;
+	struct device_node * dn = (struct device_node *) phb->arch_data;
 
-	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
-		if ((ret = traverse_pci_devices(phb->arch_data, pre, phb))
-				!= NULL)
-			return ret;
-	return NULL;
-}
+	/* PHB nodes themselves must not match */
+	dn->devfn = dn->busno = -1;
+	dn->phb = phb;
 
-void __devinit pci_devs_phb_init_dynamic(struct pci_controller *phb)
-{
 	/* Update dn->phb ptrs for new phb and children devices */
-	traverse_pci_devices((struct device_node *)phb->arch_data,
-			update_dn_pci_info, phb);
+	traverse_pci_devices(dn, update_dn_pci_info, phb);
 }
 
 /*
@@ -154,6 +133,7 @@
 {
 	int busno = ((unsigned long)data >> 8) & 0xff;
 	int devfn = ((unsigned long)data) & 0xff;
+
 	return ((devfn == dn->devfn) && (busno == dn->busno)) ? dn : NULL;
 }
 
@@ -180,10 +160,8 @@
 
 	phb_dn = phb->arch_data;
 	dn = traverse_pci_devices(phb_dn, is_devfn_node, (void *)searchval);
-	if (dn) {
+	if (dn)
 		dev->sysdata = dn;
-		/* ToDo: call some device init hook here */
-	}
 	return dn;
 }
 EXPORT_SYMBOL(fetch_dev_dn);
@@ -195,8 +173,11 @@
  */
 void __init pci_devs_phb_init(void)
 {
+	struct pci_controller *phb, *tmp;
+
 	/* This must be done first so the device nodes have valid pci info! */
-	traverse_all_pci_devices(update_dn_pci_info);
+	list_for_each_entry_safe(phb, tmp, &hose_list, list_node)
+		pci_devs_phb_init_dynamic(phb);
 }
 
 
-- 
Benjamin Herrenschmidt <benh at kernel.crashing.org>


From benh at kernel.crashing.org  Mon Oct 25 11:50:30 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 25 Oct 2004 11:50:30 +1000
Subject: [PATCH] ppc64: cleanup/split SMP code
Message-ID: <1098669030.30012.8.camel@gaston>

Hi !

This patch depends at least on two previously posted ones (and not yet merged).

[PATCH] ppc64: Fix pSeries secondary CPU setup
[PATCH] ppc64: Rewrite the openpic driver

Splits arch/ppc64/kernel/smp.c into 3 different files, smp.c, pSeries_smp.c and
iSeries_smp.c, thus removing most of the #define mess in those files and making
it easier to add a new platform.

Signed-off-by: Benjamin Herrenschmidt <benh at kernel.crashing.org>

Index: linux-work/arch/ppc64/kernel/pSeries_smp.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-work/arch/ppc64/kernel/pSeries_smp.c	2004-10-25 11:29:38.804091696 +1000
@@ -0,0 +1,393 @@
+/*
+ * SMP support for pSeries machines.
+ *
+ * Dave Engebretsen, Peter Bergner, and
+ * Mike Corrigan {engebret|bergner|mikec}@us.ibm.com
+ *
+ * Plus various changes from other IBM teams...
+ *
+ *      This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ */
+
+#undef DEBUG
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/smp_lock.h>
+#include <linux/interrupt.h>
+#include <linux/kernel_stat.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/cache.h>
+#include <linux/err.h>
+#include <linux/sysdev.h>
+#include <linux/cpu.h>
+
+#include <asm/ptrace.h>
+#include <asm/atomic.h>
+#include <asm/irq.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/io.h>
+#include <asm/prom.h>
+#include <asm/smp.h>
+#include <asm/naca.h>
+#include <asm/paca.h>
+#include <asm/time.h>
+#include <asm/ppcdebug.h>
+#include <asm/machdep.h>
+#include <asm/xics.h>
+#include <asm/cputable.h>
+#include <asm/system.h>
+#include <asm/rtas.h>
+#include <asm/plpar_wrappers.h>
+
+#include "mpic.h"
+
+#ifdef DEBUG
+#define DBG(fmt...) udbg_printf(fmt)
+#else
+#define DBG(fmt...)
+#endif
+
+extern void pseries_secondary_smp_init(unsigned long); 
+
+static void vpa_init(int cpu)
+{
+	unsigned long flags, pcpu = get_hard_smp_processor_id(cpu);
+
+	/* Register the Virtual Processor Area (VPA) */
+	flags = 1UL << (63 - 18);
+	register_vpa(flags, pcpu, __pa((unsigned long)&(paca[cpu].lppaca)));
+}
+
+
+/* Get state of physical CPU.
+ * Return codes:
+ *	0	- The processor is in the RTAS stopped state
+ *	1	- stop-self is in progress
+ *	2	- The processor is not in the RTAS stopped state
+ *	-1	- Hardware Error
+ *	-2	- Hardware Busy, Try again later.
+ */
+static int query_cpu_stopped(unsigned int pcpu)
+{
+	int cpu_status;
+	int status, qcss_tok;
+
+	qcss_tok = rtas_token("query-cpu-stopped-state");
+	if (qcss_tok == RTAS_UNKNOWN_SERVICE)
+		return -1;
+	status = rtas_call(qcss_tok, 1, 2, &cpu_status, pcpu);
+	if (status != 0) {
+		printk(KERN_ERR
+		       "RTAS query-cpu-stopped-state failed: %i\n", status);
+		return status;
+	}
+
+	return cpu_status;
+}
+
+
+#ifdef CONFIG_HOTPLUG_CPU
+
+int __cpu_disable(void)
+{
+	/* FIXME: go put this in a header somewhere */
+	extern void xics_migrate_irqs_away(void);
+
+	systemcfg->processorCount--;
+
+	/*fix boot_cpuid here*/
+	if (smp_processor_id() == boot_cpuid)
+		boot_cpuid = any_online_cpu(cpu_online_map);
+
+	/* FIXME: abstract this to not be platform specific later on */
+	xics_migrate_irqs_away();
+	return 0;
+}
+
+void __cpu_die(unsigned int cpu)
+{
+	int tries;
+	int cpu_status;
+	unsigned int pcpu = get_hard_smp_processor_id(cpu);
+
+	for (tries = 0; tries < 25; tries++) {
+		cpu_status = query_cpu_stopped(pcpu);
+		if (cpu_status == 0 || cpu_status == -1)
+			break;
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(HZ/5);
+	}
+	if (cpu_status != 0) {
+		printk("Querying DEAD? cpu %i (%i) shows %i\n",
+		       cpu, pcpu, cpu_status);
+	}
+
+	/* Isolation and deallocation are definitely done by
+	 * drslot_chrp_cpu.  If they were not they would be
+	 * done here.  Change isolate state to Isolate and
+	 * change allocation-state to Unusable.
+	 */
+	paca[cpu].cpu_start = 0;
+}
+
+/* Search all cpu device nodes for an offline logical cpu.  If a
+ * device node has a "ibm,my-drc-index" property (meaning this is an
+ * LPAR), paranoid-check whether we own the cpu.  For each "thread"
+ * of a cpu, if it is offline and has the same hw index as before,
+ * grab that in preference.
+ */
+static unsigned int find_physical_cpu_to_start(unsigned int old_hwindex)
+{
+	struct device_node *np = NULL;
+	unsigned int best = -1U;
+
+	while ((np = of_find_node_by_type(np, "cpu"))) {
+		int nr_threads, len;
+		u32 *index = (u32 *)get_property(np, "ibm,my-drc-index", NULL);
+		u32 *tid = (u32 *)
+			get_property(np, "ibm,ppc-interrupt-server#s", &len);
+
+		if (!tid)
+			tid = (u32 *)get_property(np, "reg", &len);
+
+		if (!tid)
+			continue;
+
+		/* If there is a drc-index, make sure that we own
+		 * the cpu.
+		 */
+		if (index) {
+			int state;
+			int rc = rtas_get_sensor(9003, *index, &state);
+			if (rc != 0 || state != 1)
+				continue;
+		}
+
+		nr_threads = len / sizeof(u32);
+
+		while (nr_threads--) {
+			if (0 == query_cpu_stopped(tid[nr_threads])) {
+				best = tid[nr_threads];
+				if (best == old_hwindex)
+					goto out;
+			}
+		}
+	}
+out:
+	of_node_put(np);
+	return best;
+}
+
+/**
+ * smp_startup_cpu() - start the given cpu
+ *
+ * At boot time, there is nothing to do.  At run-time, call RTAS with
+ * the appropriate start location, if the cpu is in the RTAS stopped
+ * state.
+ *
+ * Returns:
+ *	0	- failure
+ *	1	- success
+ */
+static inline int __devinit smp_startup_cpu(unsigned int lcpu)
+{
+	int status;
+	unsigned long start_here = __pa((u32)*((unsigned long *)
+					       pseries_secondary_smp_init));
+	unsigned int pcpu;
+
+	/* At boot time the cpus are already spinning in hold
+	 * loops, so nothing to do. */
+ 	if (system_state < SYSTEM_RUNNING)
+		return 1;
+
+	pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu));
+	if (pcpu == -1U) {
+		printk(KERN_INFO "No more cpus available, failing\n");
+		return 0;
+	}
+
+	/* Fixup atomic count: it exited inside IRQ handler. */
+	paca[lcpu].__current->thread_info->preempt_count	= 0;
+
+	/* At boot this is done in prom.c. */
+	paca[lcpu].hw_cpu_id = pcpu;
+
+	status = rtas_call(rtas_token("start-cpu"), 3, 1, NULL,
+			   pcpu, start_here, lcpu);
+	if (status != 0) {
+		printk(KERN_ERR "start-cpu failed: %i\n", status);
+		return 0;
+	}
+	return 1;
+}
+#else /* ... CONFIG_HOTPLUG_CPU */
+static inline int __devinit smp_startup_cpu(unsigned int lcpu)
+{
+	return 1;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
+
+static inline void smp_xics_do_message(int cpu, int msg)
+{
+	set_bit(msg, &xics_ipi_message[cpu].value);
+	mb();
+	xics_cause_IPI(cpu);
+}
+
+static void smp_xics_message_pass(int target, int msg)
+{
+	unsigned int i;
+
+	if (target < NR_CPUS) {
+		smp_xics_do_message(target, msg);
+	} else {
+		for_each_online_cpu(i) {
+			if (target == MSG_ALL_BUT_SELF
+			    && i == smp_processor_id())
+				continue;
+			smp_xics_do_message(i, msg);
+		}
+	}
+}
+
+extern void xics_request_IPIs(void);
+
+static int __init smp_xics_probe(void)
+{
+	xics_request_IPIs();
+
+	return cpus_weight(cpu_possible_map);
+}
+
+static void __devinit smp_xics_setup_cpu(int cpu)
+{
+	if (cpu != boot_cpuid)
+		xics_setup_cpu();
+}
+
+static spinlock_t timebase_lock = SPIN_LOCK_UNLOCKED;
+static unsigned long timebase = 0;
+
+static void __devinit pSeries_give_timebase(void)
+{
+	spin_lock(&timebase_lock);
+	rtas_call(rtas_token("freeze-time-base"), 0, 1, NULL);
+	timebase = get_tb();
+	spin_unlock(&timebase_lock);
+
+	while (timebase)
+		barrier();
+	rtas_call(rtas_token("thaw-time-base"), 0, 1, NULL);
+}
+
+static void __devinit pSeries_take_timebase(void)
+{
+	while (!timebase)
+		barrier();
+	spin_lock(&timebase_lock);
+	set_tb(timebase >> 32, timebase & 0xffffffff);
+	timebase = 0;
+	spin_unlock(&timebase_lock);
+}
+
+static void __devinit pSeries_late_setup_cpu(int cpu)
+{
+	extern unsigned int default_distrib_server;
+
+	if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) {
+		vpa_init(cpu); 
+	}
+
+#ifdef CONFIG_IRQ_ALL_CPUS
+	/* Put the calling processor into the GIQ.  This is really only
+	 * necessary from a secondary thread as the OF start-cpu interface
+	 * performs this function for us on primary threads.
+	 */
+	/* TODO: 9005 is #defined in rtas-proc.c -- move to a header */
+	rtas_set_indicator(9005, default_distrib_server, 1);
+#endif
+}
+
+
+void __devinit smp_pSeries_kick_cpu(int nr)
+{
+	BUG_ON(nr < 0 || nr >= NR_CPUS);
+
+	if (!smp_startup_cpu(nr))
+		return;
+
+	/*
+	 * The processor is currently spinning, waiting for the
+	 * cpu_start field to become non-zero.  After we set cpu_start,
+	 * the processor will continue on to secondary_start.
+	 */
+	paca[nr].cpu_start = 1;
+}
+
+static struct smp_ops_t pSeries_mpic_smp_ops = {
+	.message_pass	= smp_mpic_message_pass,
+	.probe		= smp_mpic_probe,
+	.kick_cpu	= smp_pSeries_kick_cpu,
+	.setup_cpu	= smp_mpic_setup_cpu,
+	.late_setup_cpu	= pSeries_late_setup_cpu,
+};
+
+static struct smp_ops_t pSeries_xics_smp_ops = {
+	.message_pass	= smp_xics_message_pass,
+	.probe		= smp_xics_probe,
+	.kick_cpu	= smp_pSeries_kick_cpu,
+	.setup_cpu	= smp_xics_setup_cpu,
+	.late_setup_cpu	= pSeries_late_setup_cpu,
+};
+
+/* This is called very early */
+void __init smp_init_pSeries(void)
+{
+	int ret, i;
+
+	DBG(" -> smp_init_pSeries()\n");
+
+	if (naca->interrupt_controller == IC_OPEN_PIC)
+		smp_ops = &pSeries_mpic_smp_ops;
+	else
+		smp_ops = &pSeries_xics_smp_ops;
+
+	/* Start secondary threads on SMT systems; primary threads
+	 * are already in the running state.
+	 */
+	for_each_present_cpu(i) {
+		if (query_cpu_stopped(get_hard_smp_processor_id(i)) == 0) {
+			printk("%16.16x : starting thread\n", i);
+			DBG("%16.16x : starting thread\n", i);
+			rtas_call(rtas_token("start-cpu"), 3, 1, &ret,
+				  get_hard_smp_processor_id(i),
+				  __pa((u32)*((unsigned long *)
+					      pseries_secondary_smp_init)),
+				  i);
+		}
+	}
+
+	if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR)
+		vpa_init(boot_cpuid);
+
+	/* Non-lpar has additional take/give timebase */
+	if (systemcfg->platform == PLATFORM_PSERIES) {
+		smp_ops->give_timebase = pSeries_give_timebase;
+		smp_ops->take_timebase = pSeries_take_timebase;
+	}
+
+
+	DBG(" <- smp_init_pSeries()\n");
+}
+
Index: linux-work/arch/ppc64/kernel/smp.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/smp.c	2004-10-25 10:24:50.000000000 +1000
+++ linux-work/arch/ppc64/kernel/smp.c	2004-10-25 11:29:38.849084856 +1000
@@ -43,19 +43,14 @@
 #include <asm/smp.h>
 #include <asm/naca.h>
 #include <asm/paca.h>
-#include <asm/iSeries/LparData.h>
-#include <asm/iSeries/HvCall.h>
-#include <asm/iSeries/HvCallCfg.h>
 #include <asm/time.h>
 #include <asm/ppcdebug.h>
 #include <asm/machdep.h>
-#include <asm/xics.h>
 #include <asm/cputable.h>
 #include <asm/system.h>
+#include <asm/abs_addr.h>
 
 #include "mpic.h"
-#include <asm/rtas.h>
-#include <asm/plpar_wrappers.h>
 
 #ifdef DEBUG
 #define DBG(fmt...) udbg_printf(fmt)
@@ -89,110 +84,6 @@
 /* Low level assembly function used to backup CPU 0 state */
 extern void __save_cpu_setup(void);
 
-extern void pseries_secondary_smp_init(unsigned long); 
-
-#ifdef CONFIG_PPC_ISERIES
-static unsigned long iSeries_smp_message[NR_CPUS];
-
-void iSeries_smp_message_recv( struct pt_regs * regs )
-{
-	int cpu = smp_processor_id();
-	int msg;
-
-	if ( num_online_cpus() < 2 )
-		return;
-
-	for ( msg = 0; msg < 4; ++msg )
-		if ( test_and_clear_bit( msg, &iSeries_smp_message[cpu] ) )
-			smp_message_recv( msg, regs );
-}
-
-static inline void smp_iSeries_do_message(int cpu, int msg)
-{
-	set_bit(msg, &iSeries_smp_message[cpu]);
-	HvCall_sendIPI(&(paca[cpu]));
-}
-
-static void smp_iSeries_message_pass(int target, int msg)
-{
-	int i;
-
-	if (target < NR_CPUS)
-		smp_iSeries_do_message(target, msg);
-	else {
-		for_each_online_cpu(i) {
-			if (target == MSG_ALL_BUT_SELF
-			    && i == smp_processor_id())
-				continue;
-			smp_iSeries_do_message(i, msg);
-		}
-	}
-}
-
-static int smp_iSeries_numProcs(void)
-{
-	unsigned np, i;
-
-	np = 0;
-        for (i=0; i < NR_CPUS; ++i) {
-                if (paca[i].lppaca.xDynProcStatus < 2) {
-			cpu_set(i, cpu_possible_map);
-			cpu_set(i, cpu_present_map);
-                        ++np;
-                }
-        }
-	return np;
-}
-
-static int smp_iSeries_probe(void)
-{
-	unsigned i;
-	unsigned np = 0;
-
-	for (i=0; i < NR_CPUS; ++i) {
-		if (paca[i].lppaca.xDynProcStatus < 2) {
-			/*paca[i].active = 1;*/
-			++np;
-		}
-	}
-
-	return np;
-}
-
-static void smp_iSeries_kick_cpu(int nr)
-{
-	BUG_ON(nr < 0 || nr >= NR_CPUS);
-
-	/* Verify that our partition has a processor nr */
-	if (paca[nr].lppaca.xDynProcStatus >= 2)
-		return;
-
-	/* The processor is currently spinning, waiting
-	 * for the cpu_start field to become non-zero
-	 * After we set cpu_start, the processor will
-	 * continue on to secondary_start in iSeries_head.S
-	 */
-	paca[nr].cpu_start = 1;
-}
-
-static void __devinit smp_iSeries_setup_cpu(int nr)
-{
-}
-
-static struct smp_ops_t iSeries_smp_ops = {
-	.message_pass = smp_iSeries_message_pass,
-	.probe        = smp_iSeries_probe,
-	.kick_cpu     = smp_iSeries_kick_cpu,
-	.setup_cpu    = smp_iSeries_setup_cpu,
-};
-
-/* This is called very early. */
-void __init smp_init_iSeries(void)
-{
-	smp_ops = &iSeries_smp_ops;
-	systemcfg->processorCount	= smp_iSeries_numProcs();
-}
-#endif
 
 #ifdef CONFIG_PPC_MULTIPLATFORM
 void smp_mpic_message_pass(int target, int msg)
@@ -238,213 +129,20 @@
 	mpic_setup_this_cpu();
 }
 
-#endif /* CONFIG_PPC_MULTIPLATFORM */
-
-#ifdef CONFIG_PPC_PSERIES
-
-/* Get state of physical CPU.
- * Return codes:
- *	0	- The processor is in the RTAS stopped state
- *	1	- stop-self is in progress
- *	2	- The processor is not in the RTAS stopped state
- *	-1	- Hardware Error
- *	-2	- Hardware Busy, Try again later.
- */
-int query_cpu_stopped(unsigned int pcpu)
-{
-	int cpu_status;
-	int status, qcss_tok;
-
-	DBG(" -> query_cpu_stopped(%d)\n", pcpu);
-	qcss_tok = rtas_token("query-cpu-stopped-state");
-	if (qcss_tok == RTAS_UNKNOWN_SERVICE)
-		return -1;
-	status = rtas_call(qcss_tok, 1, 2, &cpu_status, pcpu);
-	if (status != 0) {
-		printk(KERN_ERR
-		       "RTAS query-cpu-stopped-state failed: %i\n", status);
-		return status;
-	}
-
-	DBG(" <- query_cpu_stopped(), status: %d\n", cpu_status);
-
-	return cpu_status;
-}
-
-#ifdef CONFIG_HOTPLUG_CPU
-
-int __cpu_disable(void)
-{
-	/* FIXME: go put this in a header somewhere */
-	extern void xics_migrate_irqs_away(void);
-
-	systemcfg->processorCount--;
-
-	/*fix boot_cpuid here*/
-	if (smp_processor_id() == boot_cpuid)
-		boot_cpuid = any_online_cpu(cpu_online_map);
-
-	/* FIXME: abstract this to not be platform specific later on */
-	xics_migrate_irqs_away();
-	return 0;
-}
-
-void __cpu_die(unsigned int cpu)
-{
-	int tries;
-	int cpu_status;
-	unsigned int pcpu = get_hard_smp_processor_id(cpu);
-
-	for (tries = 0; tries < 25; tries++) {
-		cpu_status = query_cpu_stopped(pcpu);
-		if (cpu_status == 0 || cpu_status == -1)
-			break;
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		schedule_timeout(HZ/5);
-	}
-	if (cpu_status != 0) {
-		printk("Querying DEAD? cpu %i (%i) shows %i\n",
-		       cpu, pcpu, cpu_status);
-	}
-
-	/* Isolation and deallocation are definatly done by
-	 * drslot_chrp_cpu.  If they were not they would be
-	 * done here.  Change isolate state to Isolate and
-	 * change allocation-state to Unusable.
-	 */
-	paca[cpu].cpu_start = 0;
-
-	/* So we can recognize if it fails to come up next time. */
-	cpu_callin_map[cpu] = 0;
-}
-
-/* Kill this cpu */
-void cpu_die(void)
-{
-	local_irq_disable();
-	/* Some hardware requires clearing the CPPR, while other hardware does not
-	 * it is safe either way
-	 */
-	pSeriesLP_cppr_info(0, 0);
-	rtas_stop_self();
-	/* Should never get here... */
-	BUG();
-	for(;;);
-}
-
-/* Search all cpu device nodes for an offline logical cpu.  If a
- * device node has a "ibm,my-drc-index" property (meaning this is an
- * LPAR), paranoid-check whether we own the cpu.  For each "thread"
- * of a cpu, if it is offline and has the same hw index as before,
- * grab that in preference.
- */
-static unsigned int find_physical_cpu_to_start(unsigned int old_hwindex)
-{
-	struct device_node *np = NULL;
-	unsigned int best = -1U;
-
-	while ((np = of_find_node_by_type(np, "cpu"))) {
-		int nr_threads, len;
-		u32 *index = (u32 *)get_property(np, "ibm,my-drc-index", NULL);
-		u32 *tid = (u32 *)
-			get_property(np, "ibm,ppc-interrupt-server#s", &len);
-
-		if (!tid)
-			tid = (u32 *)get_property(np, "reg", &len);
-
-		if (!tid)
-			continue;
-
-		/* If there is a drc-index, make sure that we own
-		 * the cpu.
-		 */
-		if (index) {
-			int state;
-			int rc = rtas_get_sensor(9003, *index, &state);
-			if (rc != 0 || state != 1)
-				continue;
-		}
-
-		nr_threads = len / sizeof(u32);
-
-		while (nr_threads--) {
-			if (0 == query_cpu_stopped(tid[nr_threads])) {
-				best = tid[nr_threads];
-				if (best == old_hwindex)
-					goto out;
-			}
-		}
-	}
-out:
-	of_node_put(np);
-	return best;
-}
-
-/**
- * smp_startup_cpu() - start the given cpu
- *
- * At boot time, there is nothing to do.  At run-time, call RTAS with
- * the appropriate start location, if the cpu is in the RTAS stopped
- * state.
- *
- * Returns:
- *	0	- failure
- *	1	- success
- */
-static inline int __devinit smp_startup_cpu(unsigned int lcpu)
-{
-	int status;
-	unsigned long start_here = __pa((u32)*((unsigned long *)
-					       pseries_secondary_smp_init));
-	unsigned int pcpu;
-
-	/* At boot time the cpus are already spinning in hold
-	 * loops, so nothing to do. */
- 	if (system_state < SYSTEM_RUNNING)
-		return 1;
-
-	pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu));
-	if (pcpu == -1U) {
-		printk(KERN_INFO "No more cpus available, failing\n");
-		return 0;
-	}
-
-	/* Fixup atomic count: it exited inside IRQ handler. */
-	paca[lcpu].__current->thread_info->preempt_count	= 0;
-
-	/* At boot this is done in prom.c. */
-	paca[lcpu].hw_cpu_id = pcpu;
-
-	status = rtas_call(rtas_token("start-cpu"), 3, 1, NULL,
-			   pcpu, start_here, lcpu);
-	if (status != 0) {
-		printk(KERN_ERR "start-cpu failed: %i\n", status);
-		return 0;
-	}
-	return 1;
-}
-#else /* ... CONFIG_HOTPLUG_CPU */
-static inline int __devinit smp_startup_cpu(unsigned int lcpu)
-{
-	return 1;
-}
-#endif /* CONFIG_HOTPLUG_CPU */
-
-static void smp_pSeries_kick_cpu(int nr)
+void __devinit smp_generic_kick_cpu(int nr)
 {
 	BUG_ON(nr < 0 || nr >= NR_CPUS);
 
-	if (!smp_startup_cpu(nr))
-		return;
-
 	/*
 	 * The processor is currently spinning, waiting for the
 	 * cpu_start field to become non-zero After we set cpu_start,
 	 * the processor will continue on to secondary_start
 	 */
 	paca[nr].cpu_start = 1;
+	mb();
 }
-#endif /* CONFIG_PPC_PSERIES */
+
+#endif /* CONFIG_PPC_MULTIPLATFORM */
 
 static void __init smp_space_timers(unsigned int max_cpus)
 {
@@ -461,136 +159,6 @@
 	}
 }
 
-#ifdef CONFIG_PPC_PSERIES
-static void vpa_init(int cpu)
-{
-	unsigned long flags, pcpu = get_hard_smp_processor_id(cpu);
-
-	/* Register the Virtual Processor Area (VPA) */
-	flags = 1UL << (63 - 18);
-	register_vpa(flags, pcpu, __pa((unsigned long)&(paca[cpu].lppaca)));
-}
-
-static inline void smp_xics_do_message(int cpu, int msg)
-{
-	set_bit(msg, &xics_ipi_message[cpu].value);
-	mb();
-	xics_cause_IPI(cpu);
-}
-
-static void smp_xics_message_pass(int target, int msg)
-{
-	unsigned int i;
-
-	if (target < NR_CPUS) {
-		smp_xics_do_message(target, msg);
-	} else {
-		for_each_online_cpu(i) {
-			if (target == MSG_ALL_BUT_SELF
-			    && i == smp_processor_id())
-				continue;
-			smp_xics_do_message(i, msg);
-		}
-	}
-}
-
-extern void xics_request_IPIs(void);
-
-static int __init smp_xics_probe(void)
-{
-#ifdef CONFIG_SMP
-	xics_request_IPIs();
-#endif
-
-	return cpus_weight(cpu_possible_map);
-}
-
-static void __devinit smp_xics_setup_cpu(int cpu)
-{
-	if (cpu != boot_cpuid)
-		xics_setup_cpu();
-}
-
-static spinlock_t timebase_lock = SPIN_LOCK_UNLOCKED;
-static unsigned long timebase = 0;
-
-static void __devinit pSeries_give_timebase(void)
-{
-	spin_lock(&timebase_lock);
-	rtas_call(rtas_token("freeze-time-base"), 0, 1, NULL);
-	timebase = get_tb();
-	spin_unlock(&timebase_lock);
-
-	while (timebase)
-		barrier();
-	rtas_call(rtas_token("thaw-time-base"), 0, 1, NULL);
-}
-
-static void __devinit pSeries_take_timebase(void)
-{
-	while (!timebase)
-		barrier();
-	spin_lock(&timebase_lock);
-	set_tb(timebase >> 32, timebase & 0xffffffff);
-	timebase = 0;
-	spin_unlock(&timebase_lock);
-}
-
-static struct smp_ops_t pSeries_mpic_smp_ops = {
-	.message_pass	= smp_mpic_message_pass,
-	.probe		= smp_mpic_probe,
-	.kick_cpu	= smp_pSeries_kick_cpu,
-	.setup_cpu	= smp_mpic_setup_cpu,
-};
-
-static struct smp_ops_t pSeries_xics_smp_ops = {
-	.message_pass	= smp_xics_message_pass,
-	.probe		= smp_xics_probe,
-	.kick_cpu	= smp_pSeries_kick_cpu,
-	.setup_cpu	= smp_xics_setup_cpu,
-};
-
-/* This is called very early */
-void __init smp_init_pSeries(void)
-{
-	int ret, i;
-
-	DBG(" -> smp_init_pSeries()\n");
-
-	if (naca->interrupt_controller == IC_OPEN_PIC)
-		smp_ops = &pSeries_mpic_smp_ops;
-	else
-		smp_ops = &pSeries_xics_smp_ops;
-
-	/* Start secondary threads on SMT systems; primary threads
-	 * are already in the running state.
-	 */
-	for_each_present_cpu(i) {
-		if (query_cpu_stopped(get_hard_smp_processor_id(i)) == 0) {
-			printk("%16.16x : starting thread\n", i);
-			DBG("%16.16x : starting thread\n", i);
-			rtas_call(rtas_token("start-cpu"), 3, 1, &ret,
-				  get_hard_smp_processor_id(i),
-				  __pa((u32)*((unsigned long *)
-					      pseries_secondary_smp_init)),
-				  i);
-		}
-	}
-
-	if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR)
-		vpa_init(boot_cpuid);
-
-	/* Non-lpar has additional take/give timebase */
-	if (systemcfg->platform == PLATFORM_PSERIES) {
-		smp_ops->give_timebase = pSeries_give_timebase;
-		smp_ops->take_timebase = pSeries_take_timebase;
-	}
-
-
-	DBG(" <- smp_init_pSeries()\n");
-}
-#endif /* CONFIG_PPC_PSERIES */
-
 void smp_local_timer_interrupt(struct pt_regs * regs)
 {
 	update_process_times(user_mode(regs));
@@ -813,6 +381,8 @@
 {
 	unsigned int cpu;
 
+	DBG("smp_prepare_cpus\n");
+
 	/* 
 	 * setup_cpu may need to be called on the boot cpu. We havent
 	 * spun any cpus up but lets be paranoid.
@@ -877,6 +447,11 @@
 		paca[cpu].stab_real = virt_to_abs(tmp);
 	}
 
+	/* Make sure the callin-map entry is 0 (it can be left over from
+	 * a CPU hotplug)
+	 */
+	cpu_callin_map[cpu] = 0;
+
 	/* The information for processor bringup must
 	 * be written out to main store before we release
 	 * the processor.
@@ -884,6 +459,7 @@
 	mb();
 
 	/* wake up cpus */
+	DBG("smp: kicking cpu %d\n", cpu);
 	smp_ops->kick_cpu(cpu);
 
 	/*
@@ -923,7 +499,7 @@
 	return 0;
 }
 
-extern unsigned int default_distrib_server;
+
 /* Activate a secondary processor. */
 int __devinit start_secondary(void *unused)
 {
@@ -940,20 +516,8 @@
 	if (smp_ops->take_timebase)
 		smp_ops->take_timebase();
 
-#ifdef CONFIG_PPC_PSERIES
-	if (cur_cpu_spec->firmware_features & FW_FEATURE_SPLPAR) {
-		vpa_init(cpu); 
-	}
-
-#ifdef CONFIG_IRQ_ALL_CPUS
-	/* Put the calling processor into the GIQ.  This is really only
-	 * necessary from a secondary thread as the OF start-cpu interface
-	 * performs this function for us on primary threads.
-	 */
-	/* TODO: 9005 is #defined in rtas-proc.c -- move to a header */
-	rtas_set_indicator(9005, default_distrib_server, 1);
-#endif
-#endif
+	if (smp_ops->late_setup_cpu)
+		smp_ops->late_setup_cpu(cpu);
 
 	spin_lock(&call_lock);
 	cpu_set(cpu, cpu_online_map);
Index: linux-work/include/asm-ppc64/machdep.h
===================================================================
--- linux-work.orig/include/asm-ppc64/machdep.h	2004-10-25 10:24:50.000000000 +1000
+++ linux-work/include/asm-ppc64/machdep.h	2004-10-25 11:29:38.890078624 +1000
@@ -28,6 +28,7 @@
 	int   (*probe)(void);
 	void  (*kick_cpu)(int nr);
 	void  (*setup_cpu)(int nr);
+	void  (*late_setup_cpu)(int nr);
 	void  (*take_timebase)(void);
 	void  (*give_timebase)(void);
 };
@@ -86,6 +87,7 @@
 	void		(*power_off)(void);
 	void		(*halt)(void);
 	void		(*panic)(char *str);
+	void		(*cpu_die)(void);
 
 	int		(*set_rtc_time)(struct rtc_time *);
 	void		(*get_rtc_time)(struct rtc_time *);
Index: linux-work/include/asm-ppc64/smp.h
===================================================================
--- linux-work.orig/include/asm-ppc64/smp.h	2004-10-25 10:24:50.000000000 +1000
+++ linux-work/include/asm-ppc64/smp.h	2004-10-25 11:29:38.894078016 +1000
@@ -28,6 +28,8 @@
 
 extern int boot_cpuid;
 
+extern void cpu_die(void) __attribute__((noreturn));
+
 #ifdef CONFIG_SMP
 
 extern void smp_send_debugger_break(int cpu);
@@ -57,9 +59,7 @@
 
 extern int __cpu_disable(void);
 extern void __cpu_die(unsigned int cpu);
-extern void cpu_die(void) __attribute__((noreturn));
-extern int query_cpu_stopped(unsigned int pcpu);
-#endif /* !(CONFIG_SMP) */
+#endif /* CONFIG_SMP */
 
 #define get_hard_smp_processor_id(CPU) (paca[(CPU)].hw_cpu_id)
 #define set_hard_smp_processor_id(CPU, VAL) \
@@ -70,6 +70,12 @@
 extern int smp_mpic_probe(void);
 extern void smp_mpic_setup_cpu(int cpu);
 extern void smp_mpic_message_pass(int target, int msg);
+extern void smp_generic_kick_cpu(int nr);
+
+extern void smp_generic_give_timebase(void);
+extern void smp_generic_take_timebase(void);
+
+extern struct smp_ops_t *smp_ops;
 
 #endif /* __ASSEMBLY__ */
 
Index: linux-work/arch/ppc64/kernel/iSeries_smp.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-work/arch/ppc64/kernel/iSeries_smp.c	2004-10-25 11:29:38.896077712 +1000
@@ -0,0 +1,151 @@
+/*
+ * SMP support for iSeries machines.
+ *
+ * Dave Engebretsen, Peter Bergner, and
+ * Mike Corrigan {engebret|bergner|mikec}@us.ibm.com
+ *
+ * Plus various changes from other IBM teams...
+ *
+ *      This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ */
+
+#undef DEBUG
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/smp_lock.h>
+#include <linux/interrupt.h>
+#include <linux/kernel_stat.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/cache.h>
+#include <linux/err.h>
+#include <linux/sysdev.h>
+#include <linux/cpu.h>
+
+#include <asm/ptrace.h>
+#include <asm/atomic.h>
+#include <asm/irq.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+#include <asm/io.h>
+#include <asm/smp.h>
+#include <asm/naca.h>
+#include <asm/paca.h>
+#include <asm/iSeries/LparData.h>
+#include <asm/iSeries/HvCall.h>
+#include <asm/iSeries/HvCallCfg.h>
+#include <asm/time.h>
+#include <asm/ppcdebug.h>
+#include <asm/machdep.h>
+#include <asm/cputable.h>
+#include <asm/system.h>
+
+static unsigned long iSeries_smp_message[NR_CPUS];
+
+void iSeries_smp_message_recv( struct pt_regs * regs )
+{
+	int cpu = smp_processor_id();
+	int msg;
+
+	if ( num_online_cpus() < 2 )
+		return;
+
+	for ( msg = 0; msg < 4; ++msg )
+		if ( test_and_clear_bit( msg, &iSeries_smp_message[cpu] ) )
+			smp_message_recv( msg, regs );
+}
+
+static inline void smp_iSeries_do_message(int cpu, int msg)
+{
+	set_bit(msg, &iSeries_smp_message[cpu]);
+	HvCall_sendIPI(&(paca[cpu]));
+}
+
+static void smp_iSeries_message_pass(int target, int msg)
+{
+	int i;
+
+	if (target < NR_CPUS)
+		smp_iSeries_do_message(target, msg);
+	else {
+		for_each_online_cpu(i) {
+			if (target == MSG_ALL_BUT_SELF
+			    && i == smp_processor_id())
+				continue;
+			smp_iSeries_do_message(i, msg);
+		}
+	}
+}
+
+static int smp_iSeries_numProcs(void)
+{
+	unsigned np, i;
+
+	np = 0;
+        for (i=0; i < NR_CPUS; ++i) {
+                if (paca[i].lppaca.xDynProcStatus < 2) {
+			cpu_set(i, cpu_possible_map);
+			cpu_set(i, cpu_present_map);
+                        ++np;
+                }
+        }
+	return np;
+}
+
+static int smp_iSeries_probe(void)
+{
+	unsigned i;
+	unsigned np = 0;
+
+	for (i=0; i < NR_CPUS; ++i) {
+		if (paca[i].lppaca.xDynProcStatus < 2) {
+			/*paca[i].active = 1;*/
+			++np;
+		}
+	}
+
+	return np;
+}
+
+static void smp_iSeries_kick_cpu(int nr)
+{
+	BUG_ON(nr < 0 || nr >= NR_CPUS);
+
+	/* Verify that our partition has a processor nr */
+	if (paca[nr].lppaca.xDynProcStatus >= 2)
+		return;
+
+	/* The processor is currently spinning, waiting
+	 * for the cpu_start field to become non-zero
+	 * After we set cpu_start, the processor will
+	 * continue on to secondary_start in iSeries_head.S
+	 */
+	paca[nr].cpu_start = 1;
+}
+
+static void __devinit smp_iSeries_setup_cpu(int nr)
+{
+}
+
+static struct smp_ops_t iSeries_smp_ops = {
+	.message_pass = smp_iSeries_message_pass,
+	.probe        = smp_iSeries_probe,
+	.kick_cpu     = smp_iSeries_kick_cpu,
+	.setup_cpu    = smp_iSeries_setup_cpu,
+};
+
+/* This is called very early. */
+void __init smp_init_iSeries(void)
+{
+	smp_ops = &iSeries_smp_ops;
+	systemcfg->processorCount	= smp_iSeries_numProcs();
+}
+
Index: linux-work/arch/ppc64/kernel/pmac_smp.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pmac_smp.c	2004-10-25 10:24:50.000000000 +1000
+++ linux-work/arch/ppc64/kernel/pmac_smp.c	2004-10-25 11:29:38.909075736 +1000
@@ -21,6 +21,9 @@
  *  as published by the Free Software Foundation; either version
  *  2 of the License, or (at your option) any later version.
  */
+
+#undef DEBUG
+
 #include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
@@ -51,6 +54,11 @@
 
 #include "mpic.h"
 
+#ifdef DEBUG
+#define DBG(fmt...) udbg_printf(fmt)
+#else
+#define DBG(fmt...)
+#endif
 
 extern void pmac_secondary_start_1(void);
 extern void pmac_secondary_start_2(void);
@@ -102,15 +110,16 @@
 	 *   b .pmac_secondary_start - KERNELBASE
 	 */
 	switch(nr) {
-		case 1:
-			new_vector = (unsigned long)pmac_secondary_start_1;
-			break;
-		case 2:
-			new_vector = (unsigned long)pmac_secondary_start_2;
-			break;
-		case 3:
-			new_vector = (unsigned long)pmac_secondary_start_3;
-			break;
+	case 1:
+		new_vector = (unsigned long)pmac_secondary_start_1;
+		break;
+	case 2:
+		new_vector = (unsigned long)pmac_secondary_start_2;
+		break;			
+	case 3:
+	default:
+		new_vector = (unsigned long)pmac_secondary_start_3;
+		break;
 	}
 	*vector = 0x48000002 + (new_vector - KERNELBASE);
 
@@ -149,13 +158,10 @@
 		 */
 		if (num_online_cpus() < 2)		
 			g5_phy_disable_cpu1();
-		if (ppc_md.progress) ppc_md.progress("core99_setup_cpu 0 done", 0x349);
+		if (ppc_md.progress) ppc_md.progress("smp_core99_setup_cpu 0 done", 0x349);
 	}
 }
 
-extern void smp_generic_give_timebase(void);
-extern void smp_generic_take_timebase(void);
-
 struct smp_ops_t core99_smp_ops __pmacdata = {
 	.message_pass	= smp_mpic_message_pass,
 	.probe		= smp_core99_probe,
Index: linux-work/arch/ppc64/kernel/pSeries_setup.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pSeries_setup.c	2004-10-25 10:24:50.000000000 +1000
+++ linux-work/arch/ppc64/kernel/pSeries_setup.c	2004-10-25 11:29:38.911075432 +1000
@@ -321,6 +321,20 @@
 	}
 }
 
+static void pSeries_cpu_die(void)
+{
+	local_irq_disable();
+	/* Some hardware requires clearing the CPPR, while other hardware
+	 * does not; it is safe either way.
+	 */
+	pSeriesLP_cppr_info(0, 0);
+	rtas_stop_self();
+	/* Should never get here... */
+	BUG();
+	for(;;);
+}
+
+
 /*
  * Early initialization.  Relocation is on but do not reference unbolted pages
  */
@@ -588,6 +602,7 @@
 	.power_off		= rtas_power_off,
 	.halt			= rtas_halt,
 	.panic			= rtas_os_term,
+	.cpu_die		= pSeries_cpu_die,
 	.get_boot_time		= pSeries_get_boot_time,
 	.get_rtc_time		= pSeries_get_rtc_time,
 	.set_rtc_time		= pSeries_set_rtc_time,
Index: linux-work/arch/ppc64/kernel/setup.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/setup.c	2004-10-25 10:24:50.000000000 +1000
+++ linux-work/arch/ppc64/kernel/setup.c	2004-10-25 11:29:38.923073608 +1000
@@ -1308,3 +1308,10 @@
 early_param("xmon", early_xmon);
 #endif
 
+void cpu_die(void)
+{
+	if (ppc_md.cpu_die)
+		ppc_md.cpu_die();
+	local_irq_disable();
+	for (;;);
+}
Index: linux-work/arch/ppc64/kernel/Makefile
===================================================================
--- linux-work.orig/arch/ppc64/kernel/Makefile	2004-10-25 10:24:50.000000000 +1000
+++ linux-work/arch/ppc64/kernel/Makefile	2004-10-25 11:29:38.932072240 +1000
@@ -53,6 +53,8 @@
 
 ifdef CONFIG_SMP
 obj-$(CONFIG_PPC_PMAC)		+= pmac_smp.o smp-tbsync.o
+obj-$(CONFIG_PPC_ISERIES)	+= iSeries_smp.o
+obj-$(CONFIG_PPC_PSERIES)	+= pSeries_smp.o
 endif
 
 obj-$(CONFIG_ALTIVEC)		+= vecemu.o vector.o
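
For readers following the patch above: the new late_setup_cpu and cpu_die
entries are optional hooks, so generic code tests the pointer before calling
it and platforms that do not need the hook (iSeries, for instance) simply
leave it unset.  The following is only a user-space sketch of that pattern,
not kernel code; struct smp_ops_sketch, demo_late_setup_cpu(),
run_late_setup() and the call counter are hypothetical names made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct smp_ops_t. */
struct smp_ops_sketch {
	void (*setup_cpu)(int nr);
	void (*late_setup_cpu)(int nr);	/* optional: may be left NULL */
};

/* Counts hook invocations; exists only so the sketch can be observed. */
static int late_setup_calls;

static void demo_late_setup_cpu(int nr)
{
	(void)nr;
	late_setup_calls++;
}

/* A platform that supplies the hook, as the pSeries tables do above. */
static struct smp_ops_sketch with_hook = {
	.late_setup_cpu = demo_late_setup_cpu,
};

/* A platform that leaves the hook unset; static storage zeroes it. */
static struct smp_ops_sketch without_hook;

/*
 * Generic code calls the hook only when the platform provided one.
 * Returns 1 if the hook ran, 0 if it was absent.
 */
static int run_late_setup(const struct smp_ops_sketch *ops, int cpu)
{
	assert(ops != NULL);
	if (ops->late_setup_cpu) {
		ops->late_setup_cpu(cpu);
		return 1;
	}
	return 0;
}
```

Calling run_late_setup(&without_hook, 1) does nothing and returns 0, while
run_late_setup(&with_hook, 1) invokes the hook once.  The point of the
design is that start_secondary() no longer needs platform #ifdefs.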


From sfr at canb.auug.org.au  Mon Oct 25 17:35:24 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Mon, 25 Oct 2004 17:35:24 +1000
Subject: [PATCH] iSeries console: cleanup after tty_write user copies removal
Message-ID: <20041025173524.43932e3e.sfr@canb.auug.org.au>

Hi Andrew,

This patch just removes more of the infrastructure in the PPC64 iSeries
console driver that is no longer needed since we no longer need to do
copies from user mode in the tty drivers.

Signed-off-by: Stephen Rothwell <sfr at canb.auug.org.au>
-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruN 2.6.10-rc1-bk2/drivers/char/viocons.c 2.6.10-rc1-bk2-viocons.1/drivers/char/viocons.c
--- 2.6.10-rc1-bk2/drivers/char/viocons.c	2004-10-25 15:37:13.000000000 +1000
+++ 2.6.10-rc1-bk2-viocons.1/drivers/char/viocons.c	2004-10-25 17:03:17.000000000 +1000
@@ -83,15 +83,6 @@
 	u8 data[VIOCHAR_MAX_DATA];
 };
 
-/*
- * This is a place where we handle the distribution of memory
- * for copy_from_user() calls.  The buffer_available array is to
- * help us determine which buffer to use.
- */
-#define VIOCHAR_NUM_CFU_BUFFERS	7
-static struct viocharlpevent viocons_cfu_buffer[VIOCHAR_NUM_CFU_BUFFERS];
-static atomic_t viocons_cfu_buffer_available[VIOCHAR_NUM_CFU_BUFFERS];
-
 #define VIOCHAR_WINDOW		10
 #define VIOCHAR_HIGHWATERMARK	3
 
@@ -207,50 +198,6 @@
 }
 
 /*
- * This function should ONLY be called once from viocons_init2
- */
-static void viocons_init_cfu_buffer(void)
-{
-	int i;
-
-	for (i = 1; i < VIOCHAR_NUM_CFU_BUFFERS; i++)
-		atomic_set(&viocons_cfu_buffer_available[i], 1);
-}
-
-static struct viocharlpevent *viocons_get_cfu_buffer(void)
-{
-	int i;
-
-	/*
-	 * Grab the first available buffer.  It doesn't matter if we
-	 * are interrupted during this array traversal as long as we
-	 * get an available space.
-	 */
-	for (i = 0; i < VIOCHAR_NUM_CFU_BUFFERS; i++)
-		if (atomic_dec_if_positive(&viocons_cfu_buffer_available[i])
-				== 0 )
-			return &viocons_cfu_buffer[i];
-	hvlog("\n\rviocons: viocons_get_cfu_buffer : no free buffers found");
-	return NULL;
-}
-
-static void viocons_free_cfu_buffer(struct viocharlpevent *buffer)
-{
-	int i;
-
-	i = buffer - &viocons_cfu_buffer[0];
-	if (i >= (sizeof(viocons_cfu_buffer) / sizeof(viocons_cfu_buffer[0]))) {
-		hvlog("\n\rviocons: viocons_free_cfu_buffer : buffer pointer not found in list.");
-		return;
-	}
-	if (atomic_read(&viocons_cfu_buffer_available[i]) != 0) {
-		hvlog("\n\rviocons: WARNING : returning unallocated cfu buffer.");
-		return;
-	}
-	atomic_set(&viocons_cfu_buffer_available[i], 1);
-}
-
-/*
  * Add data to our pending-send buffers.  
  *
  * NOTE: Don't use printk in here because it gets nastily recursive.
@@ -438,15 +385,14 @@
  * NOTE: Don't use printk in here because it gets nastily recursive.  hvlog
  * can be used to log to the hypervisor buffer
  */
-static int internal_write(struct port_info *pi, const char *buf,
-			  size_t len, struct viocharlpevent *viochar)
+static int internal_write(struct port_info *pi, const char *buf, size_t len)
 {
 	HvLpEvent_Rc hvrc;
 	size_t bleft;
 	size_t curlen;
 	const char *curbuf;
 	unsigned long flags;
-	int copy_needed = (viochar == NULL);
+	struct viocharlpevent *viochar;
 
 	/*
 	 * Write to the hvlog of inbound data are now done prior to
@@ -462,25 +408,13 @@
 
 	spin_lock_irqsave(&consolelock, flags);
 
-	/*
-	 * If the internal_write() was passed a pointer to a
-	 * viocharlpevent then we don't need to allocate a new one
-	 * (this is the case where we are internal_writing user space
-	 * data).  If we aren't writing user space data then we need
-	 * to get an event from viopath.
-	 */
-	if (copy_needed) {
-		/* This one is fetched from the viopath data structure */
-		viochar = (struct viocharlpevent *)
-			vio_get_event_buffer(viomajorsubtype_chario);
-		/* Make sure we got a buffer */
-		if (viochar == NULL) {
-			spin_unlock_irqrestore(&consolelock, flags);
-			hvlog("\n\rviocons: Can't get viochar buffer in internal_write().");
-			return -EAGAIN;
-		}
-		initDataEvent(viochar, pi->lp);
+	viochar = vio_get_event_buffer(viomajorsubtype_chario);
+	if (viochar == NULL) {
+		spin_unlock_irqrestore(&consolelock, flags);
+		hvlog("\n\rviocons: Can't get vio buffer in internal_write().");
+		return -EAGAIN;
 	}
+	initDataEvent(viochar, pi->lp);
 
 	curbuf = buf;
 	bleft = len;
@@ -493,25 +427,16 @@
 			curlen = bleft;
 
 		viochar->event.xCorrelationToken = pi->seq++;
-
-		if (copy_needed) {
-			memcpy(viochar->data, curbuf, curlen);
-			viochar->len = curlen;
-		}
-
+		memcpy(viochar->data, curbuf, curlen);
+		viochar->len = curlen;
 		viochar->event.xSizeMinus1 =
 		    offsetof(struct viocharlpevent, data) + curlen;
 
 		hvrc = HvCallEvent_signalLpEvent(&viochar->event);
 		if (hvrc) {
-			spin_unlock_irqrestore(&consolelock, flags);
-			if (copy_needed)
-				vio_free_event_buffer(viomajorsubtype_chario, viochar);
-
 			hvlog("viocons: error sending event! %d\n", (int)hvrc);
-			return len - bleft;
+			goto out;
 		}
-
 		curbuf += curlen;
 		bleft -= curlen;
 	}
@@ -519,14 +444,9 @@
 	/* If we didn't send it all, buffer as much of it as we can. */
 	if (bleft > 0)
 		bleft -= buffer_add(pi, curbuf, bleft);
-	/*
-	 * Since we grabbed it from the viopath data structure, return
-	 * it to the data structure.
-	 */
-	if (copy_needed)
-		vio_free_event_buffer(viomajorsubtype_chario, viochar);
+out:
+	vio_free_event_buffer(viomajorsubtype_chario, viochar);
 	spin_unlock_irqrestore(&consolelock, flags);
-
 	return len - bleft;
 }
 
@@ -603,18 +523,8 @@
 
 	hvlogOutput(s, count);
 
-	if (!viopath_isactive(pi->lp)) {
-		/*
-		 * This is a VERY noisy trace message in the case where the
-		 * path manager is not active or in the case where this
-		 * function is called prior to viocons initialization.  It is
-		 * being commented out for the sake of a clear trace buffer.
-		 */
-#if 0
-		 hvlog("\n\rviocons_write: path not active to lp %d", pi->lp);
-#endif
+	if (!viopath_isactive(pi->lp))
 		return;
-	}
 
 	/* 
 	 * Any newline character found will cause a
@@ -627,17 +537,16 @@
 			 * Newline found. Print everything up to and 
 			 * including the newline
 			 */
-			internal_write(pi, &s[begin], index - begin + 1,
-					NULL);
+			internal_write(pi, &s[begin], index - begin + 1);
 			begin = index + 1;
 			/* Emit a carriage return as well */
-			internal_write(pi, &cr, 1, NULL);
+			internal_write(pi, &cr, 1);
 		}
 	}
 
 	/* If any characters left to write, write them now */
 	if ((index - begin) > 0)
-		internal_write(pi, &s[begin], index - begin, NULL);
+		internal_write(pi, &s[begin], index - begin);
 }
 
 /*
@@ -721,11 +630,9 @@
 /*
  * TTY Write method
  */
-static int viotty_write(struct tty_struct *tty,
-			const unsigned char *buf, int count)
+static int viotty_write(struct tty_struct *tty, const unsigned char *buf,
+		int count)
 {
-	int ret;
-	int total = 0;
 	struct port_info *pi;
 
 	pi = get_port_data(tty);
@@ -746,16 +653,10 @@
 	 * viotty_write call and, since the viopath isn't active to this
 	 * partition, return count.
 	 */
-	if (!viopath_isactive(pi->lp)) {
-		/* Noisy trace.  Commented unless needed. */
-#if 0
-		 hvlog("\n\rviotty_write: viopath NOT active for lp %d.",pi->lp);
-#endif
+	if (!viopath_isactive(pi->lp))
 		return count;
-	}
 
-	total = internal_write(pi, buf, count, NULL);
-	return total;
+	return internal_write(pi, buf, count);
 }
 
 /*
@@ -774,7 +675,7 @@
 		hvlogOutput(&ch, 1);
 
 	if (viopath_isactive(pi->lp))
-		internal_write(pi, &ch, 1, NULL);
+		internal_write(pi, &ch, 1);
 }
 
 /*
@@ -1270,8 +1171,6 @@
 		viotty_driver = NULL;
 	}
 
-	viocons_init_cfu_buffer();
-
 	unregister_console(&viocons_early);
 	register_console(&viocons);
 

From dwm at austin.ibm.com  Tue Oct 26 08:55:50 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Mon, 25 Oct 2004 17:55:50 -0500
Subject: [PATCH 1/1] build modular usb isd200 with modular ide 
In-Reply-To: <58cb370e041024054575c09679@mail.gmail.com> 
Message-ID: <200410252255.i9PMto6B024865@falcon10.austin.ibm.com>


On Sun, 24 Oct 2004 14:45:53 +0200, Bartlomiej Zolnierkiewicz wrote:
...
>
>The new ide_fix_driveid function seems buggy,
>ie. it byte-swaps id->max_multsect with id->vendor3.

Ok, let's look at those vars.  Both are defined in hdreg.h as bytes.
No fields in the data from the device are bytes, but are 16 bit.  On big
endian, the relative positions for an LE u16 are swapped.  If the swap is
not done on those, then one replaces the other when read.  Probably not
what was intended.  It appears that another bug is being fixed here.

Do you not agree that all reads when doing IDENTIFY xxx DEVICE are
retrieved as u16?  If not, then the current ide_fix_driveid() code is
wrong also.

Backup data, taken from the raw bits on the wire via datatransit:

$ hexdump -C eio/ata/041025-2.6.9-rc3-wcd-4-dwm.data.bin
00000000  40 00 ff 3f 37 c8 10 00  00 00 00 00 3f 00 00 00  |@..?7.......?...|
00000010  00 00 00 00 20 20 20 20  20 20 20 20 20 20 36 20  |....          6 |
00000020  54 34 30 33 32 30 41 32  00 00 00 00 30 00 42 50  |T40320A2....0.BP|
00000030  30 31 45 33 20 20 4f 54  48 53 42 49 20 41 4b 4d  |01E3  OTHSBI AKM|
00000040  30 34 36 32 41 47 42 58  20 20 20 20 20 20 20 20  |0462AGBX        |
00000050  20 20 20 20 20 20 20 20  20 20 20 20 20 20 10 80  |              ..|
00000060  00 00 00 2f 00 40 00 02  00 00 07 00 ff 3f 10 00  |.../.@.......?..|
00000070  3f 00 10 fc fb 00 10 01  00 53 a8 04 07 00 07 00  |?........S......|
00000080  03 00 78 00 78 00 78 00  78 00 00 00 00 00 00 00  |..x.x.x.x.......|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000a0  7e 00 00 00 6b 7c 08 59  03 40 49 7c 08 18 03 40  |~...k|.Y.@I|...@|
000000b0  3f 20 0f 00 00 00 80 00  fe ff 4b 60 00 00 00 00  |? ........K`....|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000100  01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 a5 89  |................|
00000200

Note that bytes 5E-5F are '10 80'.  Per d1532, table 15, in the version I am 
looking at: 
47 M F 15-8 80h
     F 7-0 00h = Reserved  
     F     01h-FFh = Maximum number of sectors that shall be transferred per 
                     interrupt on READ/WRITE MULTIPLE commands

To match max_multsect and vendor3, the bytes must be swapped.
	unsigned char	max_multsect;	/* 0=not_implemented */
	unsigned char	vendor3;	/* vendor unique */
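
For reference, the word 47 layout quoted above can be sketched in a host-independent way.  This is only an illustration, assuming the 512-byte IDENTIFY buffer is held in memory in the same byte order as the hexdump; the helper names are hypothetical and are not kernel functions.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Read IDENTIFY word n from a raw buffer.  Words are 16 bits wide and
 * arrive low byte first, so the value is assembled from two bytes
 * without relying on the host's byte order.
 */
static uint16_t identify_word(const uint8_t *raw, size_t n)
{
	return (uint16_t)(raw[2 * n] | (raw[2 * n + 1] << 8));
}

/* Bits 7-0 of word 47: maximum sectors per interrupt. */
static unsigned int word47_low_byte(const uint8_t *raw)
{
	return identify_word(raw, 47) & 0xff;
}

/* Bits 15-8 of word 47: 80h according to the table quoted above. */
static unsigned int word47_high_byte(const uint8_t *raw)
{
	return identify_word(raw, 47) >> 8;
}
```

With the dump above, offsets 0x5E and 0x5F hold 0x10 and 0x80, so the word value is 0x8010, the low byte is 0x10 (16 sectors) and the high byte is 0x80.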

Ouch! Oh man.  Depending on LE byte ordering in a u16, but only for certain 
vars.  Should this be ifdef'd in hdreg.h?  And, and, oh jeez...

What is the solution here?  Preserve the definitely non-arch neutral format
in hdreg.h?  All the char values are troubling.

Or copy and rename the entire ide_fix_driveid() into isd200?  This would be
Christoph's choice.

...
>The dependency is a bug, <linux/ide.h> is for IDE driver only.

The isd200 _is_ a bridge to ATA/ATAPI devices.  Does this mean it cannot use 
common code, just because it is not in drivers/ide?

>
>Doug, if you kill debugging code in isd200.c then only:
>
>id->command_set_1
>id->model
>id->fw_rev
>id->capability
>id->lba_capacity
>id->heads
>id->cyls
>id->sectors
>id->command_set_2
>
>need to be byte-swapped.
>

I don't plan on killing any debug code.


From paulus at samba.org  Tue Oct 26 09:12:28 2004
From: paulus at samba.org (Paul Mackerras)
Date: Tue, 26 Oct 2004 09:12:28 +1000
Subject: [PATCH 1/1] build modular usb isd200 with modular ide 
In-Reply-To: <200410252255.i9PMto6B024865@falcon10.austin.ibm.com>
References: <58cb370e041024054575c09679@mail.gmail.com>
	<200410252255.i9PMto6B024865@falcon10.austin.ibm.com>
Message-ID: <16765.34908.93713.977225@cargo.ozlabs.ibm.com>

Doug Maxey writes:

> Ok, lets look at those vars.  Both are defined in hdreg.h as bytes.
> No fields in the data from the device are bytes, but are 16 bit.  On big
> endian, the relative positions for an LE u16 are swapped.  If the swap is
> not done on those, then one replaces the other when read.  Probably not
> what was intended.  It appears that another bug is being fixed here.

No.  The only sane way to do things is to transfer data from the
device to memory as a byte stream, in other words, preserving the
ordering of the individual bytes.  That is what we do on PPC and PPC64
platforms.  That ordering is preserved (and must be preserved)
irrespective of whether the transfer is actually done in 8, 16 or 32
bit chunks.

That means that 16-bit quantities might need to be byte-swapped to be
interpreted in host byte order, but single-byte fields should always
be in their correct sequence.
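
A minimal sketch of the byte-stream rule described above, assuming a made-up four-byte slice of device data (two single-byte fields followed by one 16-bit little-endian field).  The struct and function names are hypothetical; the kernel has its own helpers for the conversion step.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical slice of device data, laid out exactly as it arrives. */
struct id_slice {
	unsigned char max_multsect;	/* single byte: never swapped */
	unsigned char vendor3;		/* single byte: never swapped */
	uint16_t      some_word;	/* bytes stored in little-endian order */
};

/* Copy the data as a byte stream, preserving the order of every byte. */
static void load_slice(struct id_slice *s, const unsigned char *wire)
{
	memcpy(s, wire, sizeof(*s));
}

/*
 * Interpret a 16-bit field whose bytes are in little-endian order.
 * Works the same on LE and BE hosts because it looks at the bytes,
 * not at the host's idea of the integer.
 */
static uint16_t le16_to_host(uint16_t v)
{
	unsigned char b[2];

	memcpy(b, &v, sizeof(b));
	return (uint16_t)(b[0] | (b[1] << 8));
}
```

The single-byte fields come out in wire order on any host; only the 16-bit field needs a conversion before it is used as a number.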

Paul.


From bzolnier at gmail.com  Tue Oct 26 09:20:08 2004
From: bzolnier at gmail.com (Bartlomiej Zolnierkiewicz)
Date: Tue, 26 Oct 2004 01:20:08 +0200
Subject: [PATCH 1/1] build modular usb isd200 with modular ide
In-Reply-To: <200410252255.i9PMto6B024865@falcon10.austin.ibm.com>
References: <58cb370e041024054575c09679@mail.gmail.com>
	<200410252255.i9PMto6B024865@falcon10.austin.ibm.com>
Message-ID: <58cb370e0410251620279fb0ee@mail.gmail.com>

On Mon, 25 Oct 2004 17:55:50 -0500, Doug Maxey <dwm at austin.ibm.com> wrote:

> >The dependency is a bug, <linux/ide.h> is for IDE driver only.
> 
> The isd200 _is_ a bridge to ATA/ATAPI devices.  Does this mean it cannot use
> common code, just because it is not in drivers/ide?

no but the common ATA/ATAPI code resides in hdreg.h or/and ata.h,
ide.h is for IDE driver _only_

> >Doug, if you kill debugging code in isd200.c then only:
> >
> >id->command_set_1
> >id->model
> >id->fw_rev
> >id->capability
> >id->lba_capacity
> >id->heads
> >id->cyls
> >id->sectors
> >id->command_set_2
> >
> >need to be byte-swapped.
> >
> 
> I don't plan on killing any debug code.

I do :)


From dwm at austin.ibm.com  Tue Oct 26 09:55:47 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Mon, 25 Oct 2004 18:55:47 -0500
Subject: [PATCH 1/1] build modular usb isd200 with modular ide 
In-Reply-To: <16765.34908.93713.977225@cargo.ozlabs.ibm.com> 
Message-ID: <200410252355.i9PNtlSp025091@falcon10.austin.ibm.com>


On Tue, 26 Oct 2004 09:12:28 +1000, Paul Mackerras wrote:
>Doug Maxey writes:
>
>> Ok, lets look at those vars.  Both are defined in hdreg.h as bytes.
>> No fields in the data from the device are bytes, but are 16 bit.  On big
>> endian, the relative positions for an LE u16 are swapped.  If the swap is
>> not done on those, then one replaces the other when read.  Probably not
>> what was intended.  It appears that another bug is being fixed here.
>
>No.  The only sane way to do things is to transfer data from the
>device to memory as a byte stream, in other words, preserving the
>ordering of the individual bytes.  That is what we do on PPC and PPC64
>platforms.  That ordering is preserved (and must be preserved)
>irrespective of whether the transfer is actually done in 8, 16 or 32
>bit chunks.

Oh yes, I am aware.  Just happen to be working on PPC64.  Have been
writing drivers for this base for several years.  It's the olde LE
device vs BE host.  The transfers are done as a 16 bit quantity, PIO.
And yes, I understand, "we have always done it this way".  Works well
when you only have to deal with single arch.

Possibly I am not making point very well, that one is preserving the
correct byte order and let the structures reflect to native location.
Strings get swapped, 16, 32, and 64 bit fields likewise.  I just missed the 
LE order that is being preserved for *some* few fields only.


>
>That means that 16-bit quantities might need to be byte-swapped to be
>interpreted in host byte order, but single-byte fields should always
>be in their correct sequence.

There is not a single reference to byte field in the ATA spec for
IDENTIFY DEVICE.  It just happens that some of the fields are 8 bits long. Or 
32 or 64.


>
>Paul.
>

++doug


From paulus at samba.org  Tue Oct 26 12:05:29 2004
From: paulus at samba.org (Paul Mackerras)
Date: Tue, 26 Oct 2004 12:05:29 +1000
Subject: [PATCH 1/1] build modular usb isd200 with modular ide 
In-Reply-To: <200410252355.i9PNtlSp025091@falcon10.austin.ibm.com>
References: <16765.34908.93713.977225@cargo.ozlabs.ibm.com>
	<200410252355.i9PNtlSp025091@falcon10.austin.ibm.com>
Message-ID: <16765.45289.684525.732044@cargo.ozlabs.ibm.com>

Doug Maxey writes:

> Oh yes, I am aware.  Just happen to be working on PPC64.  Have been
> writing drivers for this base for several years.  It's the olde LE

For Linux or some other OS?

> device vs BE host.  The transfers are done as a 16 bit quantity, PIO.
> And yes, I understand, "we have always done it this way".  Works well
> when you only have to deal with single arch.

No, we haven't always done it this way on PPC. :)  Various different
ways have been tried over the years and this is the only way that
doesn't suck.

> Possibly I am not making point very well, that one is preserving the
> correct byte order and let the structures reflect to native location.

I can't parse that sentence unambiguously...

> Strings get swapped, 16, 32, and 64 bit fields likewise.  I just missed the 
> LE order that is is being preserved for *some* few fields only.

Strings shouldn't get swapped, or at least, strings should only need
to be swapped on a BE platform if they also need to be swapped on an
LE platform.

> There is not a single reference to byte field in the ATA spec for
> IDENTIFY DEVICE.  It just happens that some of the fields are 8 bits long. Or 
> 32 or 64.

And your point is...  ?

Paul.


From benh at kernel.crashing.org  Tue Oct 26 17:16:46 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 26 Oct 2004 17:16:46 +1000
Subject: problems with iommu_free_table()
In-Reply-To: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
References: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
Message-ID: <1098775007.6916.10.camel@gaston>

On Thu, 2004-10-07 at 12:54 -0500, John Rose wrote:
> The patch below creates iommu_free_table().  Iommu tables are not currently
> freed in PPC64.  This could cause a memory leak for DLPAR of an EADS slot.  The
> function verifies that there are no outstanding TCE entries for the range of
> the table before freeing it.  I added a call to iommu_free_table() to the code
> that dynamically removes a device node.  This should be fairly symmetrical with
> the table allocation, which happens during dynamic addition of a device node.
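
The "no outstanding entries" check described above amounts to testing that an allocation bitmap is all zero.  The following is a userland sketch only, assuming one bit per table entry with bit 0 in the least significant position of word 0; the function names are hypothetical, and the handling of a partial last word is an addition of this sketch.

```c
#include <stddef.h>
#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned long words needed to hold nbits bits. */
static size_t bitmap_words(size_t nbits)
{
	return (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
}

/* Return 1 if none of the first nbits bits is set, 0 otherwise. */
static int bitmap_is_empty(const unsigned long *map, size_t nbits)
{
	size_t full = nbits / BITS_PER_WORD;	/* complete words */
	size_t rest = nbits % BITS_PER_WORD;	/* bits in the last word */
	size_t i;

	for (i = 0; i < full; i++)
		if (map[i] != 0)
			return 0;

	if (rest) {
		/* Only look at the bits that belong to the table. */
		unsigned long mask = (1UL << rest) - 1;

		if (map[full] & mask)
			return 0;
	}
	return 1;
}
```

A caller would run bitmap_is_empty() before releasing the bitmap and warn if it returns 0.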

Ouch, I should have commented earlier... now it went in and has
problems:

 - It breaks build without CONFIG_PPC_PSERIES (try a pmac-only build).
There is, more generally, a tendency to call things in
pSeries_iommu.c with the prefix "iommu_" without any mention of
"pSeries" in the name. Hey guys ! pSeries isn't alone anymore ! So
please call those pSeries-specific things pSeries_* or tce_* or
whatever, but don't add back confusion where I had such a hard time
splitting things.

 - It seems that any call to of_remove_node() will call
iommu_free_table() on np->iommu_table. That sounds bad. The iommu_table
pointer is copied at init time from the parent to all child nodes. So if
we add a phb, and then remove a device from that bus, we end up
disposing of the phb's iommu table ...

I'll send a patch fixing G5 build by renaming iommu_free_table to
tce_free_table() and putting the call in #ifdef CONFIG_PPC_PSERIES for
now, but if you start hooking too much between prom.c and the higher
level, you should start thinking about doing things differently. That is,
have of_remove_node() stay what it should have been from the beginning:
a low level function removing the node and just that, and have the
_caller_ do the grunt work of knowing what else needs to be
removed/freed/etc...

Ben.
 

From benh at kernel.crashing.org  Tue Oct 26 17:21:26 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 26 Oct 2004 17:21:26 +1000
Subject: problems with iommu_free_table()
In-Reply-To: <1098775007.6916.10.camel@gaston>
References: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
	<1098775007.6916.10.camel@gaston>
Message-ID: <1098775287.6898.14.camel@gaston>

On Tue, 2004-10-26 at 17:16 +1000, Benjamin Herrenschmidt wrote:
> I'll send a patch fixing G5 build by renaming iommu_free_table to
> tce_free_table() and putting the call in #ifdef CONFIG_PPC_PSERIES for
> now, but if you start hooking too much between prom.c and the higher
> level, you should start thinking about doing things differently. That is
> have of_remove_node() stay what it should have been from the beginning:
> a low level function removing the node and just that, and have the
> _caller_ to the grunt work of knowing what else need to be
> removed/freed/etc...

Ok, I'm keeping the name for now, just doing ifdef's plus adding a fat
comment to iommu.h

If you want iommu's in general to have the ability to add/remove tables,
then those calls (iommu_devnode_init, iommu_free_table, ...) should end
up being ppc_md. hooks so the actual implementation of the iommu knows
how to deal with them.
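
The hook idea above can be sketched as a table of function pointers that each platform fills in.  This is an illustration under stated assumptions, not the actual ppc_md layout: the struct, the variable and the demo_* stub are all hypothetical names.

```c
#include <stddef.h>

struct device_node;	/* opaque in this sketch */

/* Hypothetical per-platform hook table, in the spirit of ppc_md. */
struct iommu_hooks {
	void (*devnode_init)(struct device_node *dn);
	void (*free_table)(struct device_node *dn);
};

/* Zero-initialised: a platform without an iommu registers nothing. */
static struct iommu_hooks platform_hooks;

/*
 * Generic entry point: forwards to the platform hook when one is
 * registered.  Returns 1 if a hook ran, 0 if there was none.
 */
static int call_iommu_free_table(struct device_node *dn)
{
	if (platform_hooks.free_table == NULL)
		return 0;
	platform_hooks.free_table(dn);
	return 1;
}

/* Stand-in for a platform implementation; it only counts its calls. */
static int demo_free_calls;

static void demo_free_table(struct device_node *dn)
{
	(void)dn;
	demo_free_calls++;
}
```

A platform would assign its own function to platform_hooks.free_table at setup time; generic code then builds and links whether or not any platform provides the hook.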

If that is to remain a pSeries-only API (I don't mind at this point),
then rename those to tce_* something or pSeries_iommu_* or whatever
making it clear they are pSeries only, and be careful not to break the
build for non-pSeries.

Ben.


From ananth at in.ibm.com  Tue Oct 26 18:47:38 2004
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Tue, 26 Oct 2004 14:17:38 +0530
Subject: [PATCH] Kprobes for ppc64
In-Reply-To: <16762.5108.282382.603502@cargo.ozlabs.ibm.com>
References: <20041018095229.GA7394@in.ibm.com>
	<16762.5108.282382.603502@cargo.ozlabs.ibm.com>
Message-ID: <20041026084738.GA7425@in.ibm.com>

On Sat, Oct 23, 2004 at 06:19:00PM +1000, Paul Mackerras wrote:
> Ananth N Mavinakayanahalli writes:
> 
> > 2. arch_prepare_kprobe() now returns an int. I have made the necessary
> >    changes to i386 and sparc64 kprobes files, but is untested.
> 
> Are you going to send this upstream?

Prasanna has a set of changes which he will be pushing to akpm shortly.
This will be part of the set.

> > + * Interrupts are disabled on entry as trap3 is an interrupt gate and they
> > + * remain disabled thorough out this function.
> > + */
> > +static inline int kprobe_handler(struct pt_regs *regs)
> 
> Comments about "trap3" and "interrupt gate" don't help me understand
> this function on ppc64. :)  At present interrupts are enabled in a
> program check exception handler but disabled in a single-step handler.
> When does this function get called?

Ah, I missed the comment .. my bad :(
kprobe_handler() gets invoked from ProgramCheckException(). 
 
> > @@ -96,6 +97,9 @@ int do_page_fault(struct pt_regs *regs, 
> >  	BUG_ON((trap == 0x380) || (trap == 0x480));
> >  
> >  	if (trap == 0x300) {
> > +		if (notify_die(DIE_PAGE_FAULT, "page_fault", regs, error_code,
> > +					11, SIGSEGV) == NOTIFY_STOP)
> > +			return 0;
> 
> Hmmm, this seems a bit heavyweight for adding to the page fault path.
> Have you done any benchmarks with vs. without kprobes?

Hmm no, not yet.
 
> On the whole the patch looks OK.  I haven't checked the kprobe_handler
> code to see if I think it's all SMP- and preempt-safe, but I assume
> you have done it similarly on x86 and checked it there.

Yes - the port is based off the initial x86 code. It is SMP and
preempt safe.

Thanks for your comments! I will rework the patch a bit and post it
soon.

Thanks,
Ananth


From olof at austin.ibm.com  Tue Oct 26 23:45:46 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 26 Oct 2004 08:45:46 -0500
Subject: problems with iommu_free_table()
In-Reply-To: <1098775007.6916.10.camel@gaston>
References: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
	<1098775007.6916.10.camel@gaston>
Message-ID: <417E550A.1040400@austin.ibm.com>

Benjamin Herrenschmidt wrote:

> Ouch, I should have commented earlier... now it went in and has
> problems:
> 
>  - It breaks build without CONFIG_PPC_PSERIES (try a pmac-only build).
> There is, more generally, a tendency at calling things in
> pSeries_iommu.c with the prefix "iommu_" without any mention of
> "pSeries" in the name. Hey guys ! pSeries isn't alone anymore ! So
> please call those pSeries-specific things pSeries_* or tce_* or
> whatever, but don't add back confusion where I had such a hard time
> splitting things.

Actually, you're wrong. :) It's not pSeries-specific, see below.

>  - It seems that any call to of_remove_node() will call
> iommu_free_table() on np->iommu_table. That sounds bad. The iommu_table
> pointer is copied at init time from the parent to all child nodes. So if
> we add a phb, and then remove a device from that bus, we end up
> disposing of the phb's iommu table ...

Yep, you're right. There are two ways to fix this: Add reference counting 
to the iommu tables and do automatic deallocation, or only delete the 
tables for PHB deallocation. The second option would be preferred, since 
it should be the right way to solve the layering violation.
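
The first option (reference counting) could look roughly like the sketch below.  It is a single-threaded userland illustration with hypothetical names; kernel code would need an atomic counter and appropriate locking.

```c
#include <stdlib.h>

/* Hypothetical table shared by a parent node and its children. */
struct table {
	int refcount;
	/* mapping state would live here */
};

/* Allocate a table with one reference held by the creator. */
static struct table *table_alloc(void)
{
	struct table *t = calloc(1, sizeof(*t));

	if (t)
		t->refcount = 1;
	return t;
}

/* Take an extra reference, e.g. when a child node copies the pointer. */
static struct table *table_get(struct table *t)
{
	t->refcount++;
	return t;
}

/*
 * Drop a reference.  The table is freed only when the last holder
 * lets go.  Returns 1 if the table was freed, 0 otherwise.
 */
static int table_put(struct table *t)
{
	if (--t->refcount > 0)
		return 0;
	free(t);
	return 1;
}
```

With this scheme, removing a child device only drops the child's reference, and the parent's table survives until the parent itself is removed.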

> I'll send a patch fixing G5 build by renaming iommu_free_table to
> tce_free_table() and putting the call in #ifdef CONFIG_PPC_PSERIES for
> now,

This is the wrong solution. iommu_free_table is a companion to 
iommu_init_table, and it _is_ generic code, it just ended up in the 
wrong file (I didn't catch that myself, sorry about that).


-Olof


From johnrose at austin.ibm.com  Wed Oct 27 02:41:35 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 26 Oct 2004 11:41:35 -0500
Subject: [PATCH] ppc64: Fix g5-only build
In-Reply-To: <1098775712.6897.17.camel@gaston>
References: <1098775712.6897.17.camel@gaston>
Message-ID: <1098808895.32293.23.camel@sinatra.austin.ibm.com>

Forgive me for the cross-post, but I'm trying to answer two list
messages on the same topic.  I think it's more productive to just fix
the bug than to commit a giant comment pointing out a small bug, so I've
attached an alternate fix (build tested for g5 :).

> - It breaks build without CONFIG_PPC_PSERIES (try a pmac-only build).
> There is, more generally, a tendency at calling things in
> pSeries_iommu.c with the prefix "iommu_" without any mention of
> "pSeries" in the name. Hey guys ! pSeries isn't alone anymore ! So
> please call those pSeries-specific things pSeries_* or tce_* or
> whatever, but don't add back confusion where I had such a hard time
> splitting things.

Apologies for the build break.  I mistakenly placed the function in a pSeries 
file.  In our view, this is a generic function, complementary to
iommu_init_table(), so I've moved it to iommu.c.

> - It seems that any call to of_remove_node() will call
> iommu_free_table() on np->iommu_table. That sounds bad. The iommu_table
> pointer is copied at init time from the parent to all child nodes. So if
> we add a phb, and then remove a device from that bus, we end up
> disposing of the phb's iommu table ...

Good catch, although table allocation doesn't always happen at the PHB
level.  On POWER5, it happens at the EADS level.  My fix checks for the
ibm,dma-window property before calling the free function.  This is the
criterion for which the table is alloc'ed in the first place.
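
The ownership test described above can be sketched as follows.  This is a simplified userland model with hypothetical types: the property list is a plain array of names, whereas the patch below uses get_property() on the real device node.

```c
#include <stddef.h>
#include <string.h>

struct table { int dummy; };

/* Simplified node: a few property names plus a table pointer. */
struct node {
	const char *props[4];		/* NULL-terminated list of names */
	struct table *iommu_table;	/* may be inherited from the parent */
};

/* Return 1 if the node carries a property with the given name. */
static int node_has_prop(const struct node *n, const char *name)
{
	size_t i;

	for (i = 0; i < 4 && n->props[i]; i++)
		if (strcmp(n->props[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Only the node that carries the DMA window property is treated as
 * the owner of the table; nodes that merely inherited the pointer
 * must not free it.
 */
static int node_owns_table(const struct node *n)
{
	return n->iommu_table != NULL && node_has_prop(n, "ibm,dma-window");
}
```

A remove path would call the free function only when node_owns_table() is true, so removing a child device leaves the parent's table alone.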

Thanks-
John

Signed-off-by: John Rose <johnrose at austin.ibm.com>

diff -Nru a/arch/ppc64/kernel/iommu.c b/arch/ppc64/kernel/iommu.c
--- a/arch/ppc64/kernel/iommu.c	Tue Oct 26 11:36:42 2004
+++ b/arch/ppc64/kernel/iommu.c	Tue Oct 26 11:36:42 2004
@@ -425,6 +425,39 @@
 	return tbl;
 }
 
+void iommu_free_table(struct device_node *dn)
+{
+	struct iommu_table *tbl = dn->iommu_table;
+	unsigned long bitmap_sz, i;
+	unsigned int order;
+
+	if (!tbl || !tbl->it_map) {
+		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
+				dn->full_name);
+		return;
+	}
+
+	/* verify that table contains no entries */
+	/* it_mapsize is in entries, and we're examining 64 at a time */
+	for (i = 0; i < (tbl->it_mapsize/64); i++) {
+		if (tbl->it_map[i] != 0) {
+			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
+				__FUNCTION__, dn->full_name);
+			break;
+		}
+	}
+
+	/* calculate bitmap size in bytes */
+	bitmap_sz = (tbl->it_mapsize + 7) / 8;
+
+	/* free bitmap */
+	order = get_order(bitmap_sz);
+	free_pages((unsigned long) tbl->it_map, order);
+
+	/* free table */
+	kfree(tbl);
+}
+
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address of the buffer
  * passed here is the kernel (virtual) address of the buffer.  The buffer
diff -Nru a/arch/ppc64/kernel/pSeries_iommu.c b/arch/ppc64/kernel/pSeries_iommu.c
--- a/arch/ppc64/kernel/pSeries_iommu.c	Tue Oct 26 11:36:42 2004
+++ b/arch/ppc64/kernel/pSeries_iommu.c	Tue Oct 26 11:36:42 2004
@@ -412,39 +412,6 @@
 	dn->iommu_table = iommu_init_table(tbl);
 }
 
-void iommu_free_table(struct device_node *dn)
-{
-	struct iommu_table *tbl = dn->iommu_table;
-	unsigned long bitmap_sz, i;
-	unsigned int order;
-
-	if (!tbl || !tbl->it_map) {
-		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
-				dn->full_name);
-		return;
-	}
-
-	/* verify that table contains no entries */
-	/* it_mapsize is in entries, and we're examining 64 at a time */
-	for (i = 0; i < (tbl->it_mapsize/64); i++) {
-		if (tbl->it_map[i] != 0) {
-			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
-				__FUNCTION__, dn->full_name);
-			break;
-		}
-	}
-
-	/* calculate bitmap size in bytes */
-	bitmap_sz = (tbl->it_mapsize + 7) / 8;
-
-	/* free bitmap */
-	order = get_order(bitmap_sz);
-	free_pages((unsigned long) tbl->it_map, order);
-
-	/* free table */
-	kfree(tbl);
-}
-
 void iommu_setup_pSeries(void)
 {
 	struct pci_dev *dev = NULL;
diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c
--- a/arch/ppc64/kernel/prom.c	Tue Oct 26 11:36:42 2004
+++ b/arch/ppc64/kernel/prom.c	Tue Oct 26 11:36:42 2004
@@ -1818,8 +1818,9 @@
 		return -EBUSY;
 	}
 
-	if (np->iommu_table)
+	if ((np->iommu_table) && get_property(np, "ibm,dma-window", NULL)) {
 		iommu_free_table(np);
+	}
 
 	write_lock(&devtree_lock);
 	OF_MARK_STALE(np);


From johnrose at austin.ibm.com  Wed Oct 27 04:03:01 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 26 Oct 2004 13:03:01 -0500
Subject: [PATCH] iommu fixes, round 2
In-Reply-To: <1098808895.32293.23.camel@sinatra.austin.ibm.com>
References: <1098775712.6897.17.camel@gaston>
	<1098808895.32293.23.camel@sinatra.austin.ibm.com>
Message-ID: <1098813781.32293.40.camel@sinatra.austin.ibm.com>

Ben's patch went in before my note went out, so please disregard my
previous post.

All this might have been more easily addressed in one note on one list.
As opposed to posting three msgs on two lists and committing a patch
that creates code comments on proposed reorgs.  Might I humbly request
that our patches/reorg ideas sit on the ppc64 list for a bit before
pushing to Linus?  

Here's a patch that fixes the original build break, and removes the
ifdef's and comments that were added by Ben's patch.

We feel that iommu_free_table() is generic so we've moved it to iommu.c.
This fixes the build break (sorry g5 :).  It's complementary to
iommu_init_table(), which is generic.  

Secondly, the attempt to free the table in of_remove_node() is "as
symmetric as possible" with of_finish_node_dynamic(), where the table is
allocated.  If that's what the comment means by layering violation, I
humbly disagree. 

Thirdly, iommu_devnode_init() also has an iSeries implementation, so
it's not pSeries-specific.  No need to rename it, as suggested in the
comment.

Thanks-
John

Signed-off-by: John Rose <johnrose at austin.ibm.com>

diff -Nru a/arch/ppc64/kernel/iommu.c b/arch/ppc64/kernel/iommu.c
--- a/arch/ppc64/kernel/iommu.c	Tue Oct 26 12:51:42 2004
+++ b/arch/ppc64/kernel/iommu.c	Tue Oct 26 12:51:42 2004
@@ -425,6 +425,39 @@
 	return tbl;
 }
 
+void iommu_free_table(struct device_node *dn)
+{
+	struct iommu_table *tbl = dn->iommu_table;
+	unsigned long bitmap_sz, i;
+	unsigned int order;
+
+	if (!tbl || !tbl->it_map) {
+		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
+				dn->full_name);
+		return;
+	}
+
+	/* verify that table contains no entries */
+	/* it_mapsize is in entries, and we're examining 64 at a time */
+	for (i = 0; i < (tbl->it_mapsize/64); i++) {
+		if (tbl->it_map[i] != 0) {
+			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
+				__FUNCTION__, dn->full_name);
+			break;
+		}
+	}
+
+	/* calculate bitmap size in bytes */
+	bitmap_sz = (tbl->it_mapsize + 7) / 8;
+
+	/* free bitmap */
+	order = get_order(bitmap_sz);
+	free_pages((unsigned long) tbl->it_map, order);
+
+	/* free table */
+	kfree(tbl);
+}
+
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address of the buffer
  * passed here is the kernel (virtual) address of the buffer.  The buffer
diff -Nru a/arch/ppc64/kernel/pSeries_iommu.c b/arch/ppc64/kernel/pSeries_iommu.c
--- a/arch/ppc64/kernel/pSeries_iommu.c	Tue Oct 26 12:51:42 2004
+++ b/arch/ppc64/kernel/pSeries_iommu.c	Tue Oct 26 12:51:42 2004
@@ -412,39 +412,6 @@
 	dn->iommu_table = iommu_init_table(tbl);
 }
 
-void iommu_free_table(struct device_node *dn)
-{
-	struct iommu_table *tbl = dn->iommu_table;
-	unsigned long bitmap_sz, i;
-	unsigned int order;
-
-	if (!tbl || !tbl->it_map) {
-		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
-				dn->full_name);
-		return;
-	}
-
-	/* verify that table contains no entries */
-	/* it_mapsize is in entries, and we're examining 64 at a time */
-	for (i = 0; i < (tbl->it_mapsize/64); i++) {
-		if (tbl->it_map[i] != 0) {
-			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
-				__FUNCTION__, dn->full_name);
-			break;
-		}
-	}
-
-	/* calculate bitmap size in bytes */
-	bitmap_sz = (tbl->it_mapsize + 7) / 8;
-
-	/* free bitmap */
-	order = get_order(bitmap_sz);
-	free_pages((unsigned long) tbl->it_map, order);
-
-	/* free table */
-	kfree(tbl);
-}
-
 void iommu_setup_pSeries(void)
 {
 	struct pci_dev *dev = NULL;
diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c
--- a/arch/ppc64/kernel/prom.c	Tue Oct 26 12:51:42 2004
+++ b/arch/ppc64/kernel/prom.c	Tue Oct 26 12:51:42 2004
@@ -1818,13 +1818,8 @@
 		return -EBUSY;
 	}
 
-	/* XXX This is a layering violation, should be moved to the caller
-	 * --BenH.
-	 */
-#ifdef CONFIG_PPC_PSERIES
-	if (np->iommu_table)
+	if ((np->iommu_table) && get_property(np, "ibm,dma-window", NULL))
 		iommu_free_table(np);
-#endif /* CONFIG_PPC_PSERIES */
 
 	write_lock(&devtree_lock);
 	OF_MARK_STALE(np);
diff -Nru a/include/asm-ppc64/iommu.h b/include/asm-ppc64/iommu.h
--- a/include/asm-ppc64/iommu.h	Tue Oct 26 12:51:42 2004
+++ b/include/asm-ppc64/iommu.h	Tue Oct 26 12:51:42 2004
@@ -111,17 +111,9 @@
 extern void iommu_setup_u3(void);
 
 /* Creates table for an individual device node */
-/* XXX: This isn't generic, please name it accordingly or add
- * some ppc_md. hooks for iommu implementations to do what they
- * need to do. --BenH.
- */
 extern void iommu_devnode_init(struct device_node *dn);
 
 /* Frees table for an individual device node */
-/* XXX: This isn't generic, please name it accordingly or add
- * some ppc_md. hooks for iommu implementations to do what they
- * need to do. --BenH.
- */
 extern void iommu_free_table(struct device_node *dn);
 
 #endif /* CONFIG_PPC_MULTIPLATFORM */


From dwm at austin.ibm.com  Wed Oct 27 04:05:12 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Tue, 26 Oct 2004 13:05:12 -0500
Subject: [PATCH 1/1] build modular usb isd200 with modular ide 
In-Reply-To: <16765.45289.684525.732044@cargo.ozlabs.ibm.com> 
Message-ID: <200410261805.i9QI5CKM029850@falcon10.austin.ibm.com>

On Tue, 26 Oct 2004 12:05:29 +1000, Paul Mackerras wrote:
>For Linux or some other OS?

Linux for a little over a year, 2.6 for about 3 months, some _other_ OS for
several years.

>
>> device vs BE host.  The transfers are done as a 16 bit quantity, PIO.
>> And yes, I understand, "we have always done it this way".  Works well
>> when you only have to deal with single arch.
>
>No, we haven't always done it this way on PPC. :)  Various different
>ways have been tried over the years and this is the only way that
>doesn't suck.
>
>> Possibly I am not making point very well, that one is preserving the
>> correct byte order and let the structures reflect to native location.
>
>I can't parse that sentence unambiguously...

s/to native location/the normalized (for the host) layout/

>
>> Strings get swapped, 16, 32, and 64 bit fields likewise.  I just missed the 
>> LE order that is is being preserved for *some* few fields only.
>
>Strings shouldn't get swapped, or at least, strings should only need
>to be swapped on a BE platform if they also need to be swapped on an
>LE platform.
>
>> There is not a single reference to byte field in the ATA spec for
>> IDENTIFY DEVICE.  It just happens that some of the fields are 8 bits long. Or 
>> 32 or 64.
>
>And your point is...  ?

To me, and I do seem to be in the minority, it seems that normalizing the 
entire bytestream is the right thing (tm).  But I can see the point that 
leaving certain parts non-normalized is cheaper.

It was my mistake missing the use of the char fields.  GIITD.  In any
event, with 2.6.10-rc1 the problem seems to be solved in spite of my
meddling. :-)
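
To make the "normalize everything" view concrete, here is a minimal
sketch, not kernel code. It assumes the IDENTIFY DEVICE data has already
been read into host-order 16-bit words, and that each word of an ATA
string carries its first character in the high byte. The name
ident_string() is made up for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: unpack an ATA string from host-order 16-bit words.
 * Assumption: each word holds two characters, first character in bits 15:8.
 * 'out' must have room for 2 * nwords + 1 bytes. */
static void ident_string(const uint16_t *words, size_t nwords, char *out)
{
	size_t i;

	for (i = 0; i < nwords; i++) {
		out[2 * i]     = (char)(words[i] >> 8);   /* first character */
		out[2 * i + 1] = (char)(words[i] & 0xff); /* second character */
	}
	out[2 * nwords] = '\0';
}
```

Once the words are in host order the same code works on either
endianness, which is the appeal; the cost is touching every word even
when only a few fields are ever looked at.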

++doug


From benh at kernel.crashing.org  Wed Oct 27 09:32:35 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 27 Oct 2004 09:32:35 +1000
Subject: problems with iommu_free_table()
In-Reply-To: <417E550A.1040400@austin.ibm.com>
References: <1097171661.7087.1.camel@sinatra.austin.ibm.com>
	<1098775007.6916.10.camel@gaston>  <417E550A.1040400@austin.ibm.com>
Message-ID: <1098833556.6916.44.camel@gaston>

On Tue, 2004-10-26 at 08:45 -0500, Olof Johansson wrote:
> Benjamin Herrenschmidt wrote:

> Actually, you're wrong. :) It's not pSeries-specific, see below.

Well, it's implemented in pSeries_iommu.c ...

> Yep, you're right. There's two ways to fix this: Add reference counting 
> to the iommu tables and do automatic deallocation, or only delete the 
> tables for PHB deallocation. The second option would be preferred, since 
> it should be the right way to solve the layering violation.

Agreed.

> > I'll send a patch fixing G5 build by renaming iommu_free_table to
> > tce_free_table() and putting the call in #ifdef CONFIG_PPC_PSERIES for
> > now,
> 
> This is the wrong solution. iommu_free_table is a companion to 
> iommu_init_table, and it _is_ generic code, it just ended up in the 
> wrong file (I didn't catch that myself, sorry about that).

It's the right fix for now until you or John do something better :)
Besides, I don't fully agree with iommu_free_table() being the
counterpart of iommu_init_table() since it does kfree etc... it makes
assumptions about how the caller allocated the tables... not _that_ bad but
don't even try calling that on the U3 ones :)


From pbadari at us.ibm.com  Wed Oct 27 09:33:18 2004
From: pbadari at us.ibm.com (Badari Pulavarty)
Date: 26 Oct 2004 16:33:18 -0700
Subject: 2.6.9 iommu_alloc failures on PPC64
Message-ID: <1098833598.20643.116.camel@dyn318077bld.beaverton.ibm.com>

Hi,

When I run IO tests with the 2.6.9 kernel on PPC64, I get hundreds of the
following messages and eventually get an oops from the qlogic driver.
Is this a known problem?

BTW, this happens only with JFS not ext3. 

Thanks,
Badari

iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000000d1c80000 npages 10
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000000b56c0000 npages 10
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000000de3b8000 npages 8
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000001b16e8000 npages 6
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000000ddd88000 npages 8
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000001aead0000 npages e
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000001bd2a0000 npages 6
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000001b89e0000 npages e
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000001b8350000 npages 6
iommu_alloc failed, tbl c0000000e3fe1f00 vaddr c0000000cc440000 npages 10


From benh at kernel.crashing.org  Wed Oct 27 09:46:57 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 27 Oct 2004 09:46:57 +1000
Subject: [PATCH] iommu fixes, round 2
In-Reply-To: <1098813781.32293.40.camel@sinatra.austin.ibm.com>
References: <1098775712.6897.17.camel@gaston>
	<1098808895.32293.23.camel@sinatra.austin.ibm.com>
	<1098813781.32293.40.camel@sinatra.austin.ibm.com>
Message-ID: <1098834417.6916.62.camel@gaston>


> We feel that iommu_free_table() is generic so we've moved it to iommu.c.
> This fixes the build break (sorry g5 :).  It's complementary to
> iommu_init_table(), which is generic.  
> 
> Secondly, the attempt to free the table in of_remove_node() is "as
> symmetric as possible" with of_finish_node_dynamic(), where the table is
> allocated.  If that's what the comment means by layering violation, I
> humbly disagree. 

Nope. All the "finish" node routines are high level routines that parse
the device-tree to fill various additional things in the device nodes.
There are some remote plans of getting rid of them in the long run...

of_remove_node() is a low level routine that is responsible for removing
the node from the tree, and dealing with the /proc things, and that
should be all.

If you want to keep the iommu removal in prom.c, then you should create
an of_finish_dynamic_node() or something like that, that does that kind
of high level stuff before calling of_remove_node().
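
Roughly the layering I have in mind, as a stand-alone sketch rather than
a patch. The types are stand-ins, of_remove_node_stub() only models the
low level unlink, and of_teardown_dynamic_node() is a made-up name for
the high level wrapper:

```c
#include <stdlib.h>

/* Stand-in types for illustration; not the real ppc64 definitions. */
struct iommu_table { int dummy; };
struct device_node {
	struct iommu_table *iommu_table;
	int removed;
};

/* Low level: only takes the node out of the tree (modelled as a flag). */
static int of_remove_node_stub(struct device_node *np)
{
	np->removed = 1;
	return 0;
}

/* High level (hypothetical name): undo what the "finish" step set up,
 * then hand the node to the low level removal routine. */
static int of_teardown_dynamic_node(struct device_node *np)
{
	if (np->iommu_table) {
		free(np->iommu_table);
		np->iommu_table = NULL;
	}
	return of_remove_node_stub(np);
}
```

The point is only that the low level routine never needs to know an
iommu table exists; whoever allocated it is the one who frees it.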

> Thirdly, iommu_devnode_init() also has an iSeries implementation, so
> it's not pSeries-specific.  No need to rename it, as suggested in the
> comment.

Then let's move it, but the fact that it does a kfree() and that sort
of thing means it actually makes assumptions about how the iommu table was
allocated in the first place, which is not under control of the generic
code at this point.


From olof at austin.ibm.com  Wed Oct 27 09:56:41 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 26 Oct 2004 18:56:41 -0500
Subject: problems with iommu_free_table()
In-Reply-To: <1098833556.6916.44.camel@gaston>
References: <1097171661.7087.1.camel@sinatra.austin.ibm.com>	
	<1098775007.6916.10.camel@gaston> <417E550A.1040400@austin.ibm.com>
	<1098833556.6916.44.camel@gaston>
Message-ID: <417EE439.3080206@austin.ibm.com>

Benjamin Herrenschmidt wrote:

> It's the right fix for now until you or John do something better :)
> Besides, I don't fully agree with iommu_free_table() beeing the
> 'pending' of iommu_init_table() since it does kfree etc... it makes
> assumptions on how the caller allocated the tables... not _that_ bad but
> don't even try calling that on the U3 ones :)

Right, we discussed it today. The whole iommu table init code flow is a 
bit awkward today with alloc/setup/init. It's nonintuitive. :(


-Olof


From benh at kernel.crashing.org  Wed Oct 27 09:49:43 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 27 Oct 2004 09:49:43 +1000
Subject: [PATCH] ppc64: Fix g5-only build
In-Reply-To: <1098808895.32293.23.camel@sinatra.austin.ibm.com>
References: <1098775712.6897.17.camel@gaston>
	<1098808895.32293.23.camel@sinatra.austin.ibm.com>
Message-ID: <1098834583.6898.64.camel@gaston>

On Tue, 2004-10-26 at 11:41 -0500, John Rose wrote:
> Forgive me for the cross-post, but I'm trying to answer two list
> messages on the same topic.  I think it's more productive to just fix
> the bug than to commit a giant comment pointing out a small bug, so I've
> attached an alternate fix (build tested for g5 :).

I replied on the other list, let's stop this thread here.

Ben.


From benh at kernel.crashing.org  Wed Oct 27 09:56:55 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 27 Oct 2004 09:56:55 +1000
Subject: List message size limit
Message-ID: <1098835015.6917.69.camel@gaston>

Hi !

The message size limit on this list is about 40k. This is too small
for a lot of patches. I'd like it to be bumped to 128k or even more,
though that needs to be discussed first in case a majority of
subscribers disagree.

Ben.


From benh at kernel.crashing.org  Wed Oct 27 10:05:31 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 27 Oct 2004 10:05:31 +1000
Subject: 2.6.9 iommu_alloc failures on PPC64
In-Reply-To: <1098833598.20643.116.camel@dyn318077bld.beaverton.ibm.com>
References: <1098833598.20643.116.camel@dyn318077bld.beaverton.ibm.com>
Message-ID: <1098835531.6917.76.camel@gaston>

On Tue, 2004-10-26 at 16:33 -0700, Badari Pulavarty wrote:
> Hi,
> 
> When I run IO tests with 2.6.9 kernel on PPC64, I get hundreds of
> following messages and eventually get OOPS from qlogic driver.
> Is this a known problems ?
> 
> BTW, this happens only with JFS not ext3. 

I suppose JFS is flooding the driver with so many large requests that
the small table on your machine gets full (what machine is this
precisely ?).

The qlogic driver should be fixed to handle iommu failures more
gracefully. It should be possible in most cases to just wait for pending
IOs to complete & try again. I don't know if it's possible to ask the
upper layer to break up the request.
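
As a toy model of "wait for pending IOs & try again" (this is not the
qlogic code, and every name below is hypothetical), the idea is simply
that a failed mapping is retried after completions have released table
entries, and only reported as an error when nothing is in flight:

```c
#include <stddef.h>

/* Toy model of a small DMA mapping table; all names are hypothetical. */
struct toy_table {
	size_t free_pages;   /* entries currently available */
	size_t inflight;     /* entries held by outstanding I/O */
};

/* Returns 0 on success, -1 if the table cannot hold npages right now. */
static int toy_alloc(struct toy_table *t, size_t npages)
{
	if (npages > t->free_pages)
		return -1;
	t->free_pages -= npages;
	t->inflight += npages;
	return 0;
}

/* Model one pending I/O completing and releasing up to npages entries. */
static void toy_complete(struct toy_table *t, size_t npages)
{
	if (npages > t->inflight)
		npages = t->inflight;
	t->inflight -= npages;
	t->free_pages += npages;
}

/* Instead of failing outright, wait for completions and retry.
 * Gives up once nothing is in flight, since waiting cannot help then. */
static int toy_map_with_retry(struct toy_table *t, size_t npages)
{
	while (toy_alloc(t, npages) != 0) {
		if (t->inflight == 0)
			return -1;
		toy_complete(t, 4); /* stands in for sleeping until an I/O finishes */
	}
	return 0;
}
```

A real driver would of course sleep or requeue the command rather than
spin, and would have to cope with a request larger than the whole table.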

Ben.


From jk at ozlabs.org  Wed Oct 27 10:10:44 2004
From: jk at ozlabs.org (Jeremy Kerr)
Date: Wed, 27 Oct 2004 10:10:44 +1000
Subject: List message size limit
In-Reply-To: <1098835015.6917.69.camel@gaston>
References: <1098835015.6917.69.camel@gaston>
Message-ID: <200410271010.45081.jk@ozlabs.org>

Hi all,

> The limit of messages sizes on this list is about 40k. This is too small
> for a lot of patches. I'd like it to be pumped to 128k or even more,
> though that needs to be discussed first in case a majority of
> subscribers disagree..

Just a side-note here: patches that are provided by a URL (ie, those too large 
to be attached) will not be picked up by the patch tracking system at 
present. However, I could extend it to check URLs that appear in a message, 
possibly with some special syntax to let the parser know that it should 
follow the link (to reduce unnecessary downloads).

Any suggestions?


Jeremy


From olof at austin.ibm.com  Wed Oct 27 11:02:13 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 26 Oct 2004 20:02:13 -0500
Subject: List message size limit
In-Reply-To: <200410271010.45081.jk@ozlabs.org>
References: <1098835015.6917.69.camel@gaston>
	<200410271010.45081.jk@ozlabs.org>
Message-ID: <20041027010213.GA23655@4>

On Wed, Oct 27, 2004 at 10:10:44AM +1000, Jeremy Kerr wrote:

> Just a side-note here: patches that are provided by a URL (ie, those too large 
> to be attached) will not be picked up by the patch tracking system at 
> present. However, I could extend it to check URLs that appear in a message, 
> possibly with some special syntax to let the parser know that it should 
> follow the link (to reduce unnecessary downloads).
> 
> Any suggestions?

I say let's just up the size high enough for it to not be a concern
(1MB?). The amount of spam coming across is very low (none as far as
I've been able to tell), and if it turns out to be a problem it can be
lowered again.


-Olof


From pbadari at us.ibm.com  Wed Oct 27 10:10:50 2004
From: pbadari at us.ibm.com (Badari Pulavarty)
Date: 26 Oct 2004 17:10:50 -0700
Subject: 2.6.9 iommu_alloc failures on PPC64
In-Reply-To: <1098835531.6917.76.camel@gaston>
References: <1098833598.20643.116.camel@dyn318077bld.beaverton.ibm.com>
	<1098835531.6917.76.camel@gaston>
Message-ID: <1098835849.20643.122.camel@dyn318077bld.beaverton.ibm.com>

On Tue, 2004-10-26 at 17:05, Benjamin Herrenschmidt wrote:
> On Tue, 2004-10-26 at 16:33 -0700, Badari Pulavarty wrote:
> > Hi,
> > 
> > When I run IO tests with 2.6.9 kernel on PPC64, I get hundreds of
> > following messages and eventually get OOPS from qlogic driver.
> > Is this a known problems ?
> > 
> > BTW, this happens only with JFS not ext3. 
> 
> I suppose JFS is flooding the driver with so many large requests that
> the small table on your machine gets full (what machine is this
> precisely ?).

It's my latest P-570.


Thanks,
Badari


From benh at kernel.crashing.org  Wed Oct 27 12:05:53 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 27 Oct 2004 12:05:53 +1000
Subject: List message size limit
In-Reply-To: <20041027010213.GA23655@4>
References: <1098835015.6917.69.camel@gaston>
	<200410271010.45081.jk@ozlabs.org>  <20041027010213.GA23655@4>
Message-ID: <1098842753.610.1.camel@gaston>

On Tue, 2004-10-26 at 20:02 -0500, Olof Johansson wrote:
> On Wed, Oct 27, 2004 at 10:10:44AM +1000, Jeremy Kerr wrote:
> 
> > Just a side-note here: patches that are provided by a URL (ie, those too large 
> > to be attached) will not be picked up by the patch tracking system at 
> > present. However, I could extend it to check URLs that appear in a message, 
> > possibly with some special syntax to let the parser know that it should 
> > follow the link (to reduce unnecessary downloads).
> > 
> > Any suggestions?
> 
> I say let's just up the size high enough for it to not be a concern
> (1MB?). The amount of spam coming across is very low (none as far as
> I've been able to tell), and if turns out to be a problem it can be
> lowered again.

1MB is probably too big for archives, and even my monster patch was only
about 350K :) I think 256K would be a good limit. (Let's start the who
gets the best random number game now :)

Ben.


From dhowells at redhat.com  Wed Oct 27 19:47:08 2004
From: dhowells at redhat.com (David Howells)
Date: Wed, 27 Oct 2004 10:47:08 +0100
Subject: List message size limit 
In-Reply-To: <20041027010213.GA23655@4> 
References: <20041027010213.GA23655@4> <1098835015.6917.69.camel@gaston>
	<200410271010.45081.jk@ozlabs.org> 
Message-ID: <26685.1098870428@redhat.com>

> 
> I say let's just up the size high enough for it to not be a concern
> (1MB?). The amount of spam coming across is very low (none as far as
> I've been able to tell), and if turns out to be a problem it can be
> lowered again.

Make the limit larger only for list subscribers.

David


From pbadari at us.ibm.com  Thu Oct 28 01:31:34 2004
From: pbadari at us.ibm.com (Badari Pulavarty)
Date: 27 Oct 2004 08:31:34 -0700
Subject: 2.6.9 iommu_alloc failures on PPC64
In-Reply-To: <1098835531.6917.76.camel@gaston>
References: <1098833598.20643.116.camel@dyn318077bld.beaverton.ibm.com>
	<1098835531.6917.76.camel@gaston>
Message-ID: <1098891094.20643.134.camel@dyn318077bld.beaverton.ibm.com>

Ben,

SLES9, which has qlogic driver version 8.00.00b14, seems to work fine.
2.6.9 with qlogic driver version 8.00.00b15-k is having problems.

FYI.

Thanks,
Badari

On Tue, 2004-10-26 at 17:05, Benjamin Herrenschmidt wrote:
> On Tue, 2004-10-26 at 16:33 -0700, Badari Pulavarty wrote:
> > Hi,
> > 
> > When I run IO tests with 2.6.9 kernel on PPC64, I get hundreds of
> > following messages and eventually get OOPS from qlogic driver.
> > Is this a known problems ?
> > 
> > BTW, this happens only with JFS not ext3. 
> 
> I suppose JFS is flooding the driver with so many large requests that
> the small table on your machine gets full (what machine is this
> precisely ?).
> 
> The qlogic driver should be fixed to handle iommu failures more
> gracefully. It should be possible in most cases to just wait for pending
> IOs to complete & try again. I don't know if it's possible to ask the
> upper layer to breakup the request.
> 
> Ben.
> 
> 
> 


From hollisb at us.ibm.com  Wed Oct 27 23:38:08 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Wed, 27 Oct 2004 13:38:08 +0000
Subject: [resend patch] HVSI early boot console
Message-ID: <1098884287.3486.5.camel@localhost>

Hi Linus, I've retested this with the current BK tree as you requested.

This patch adds support for the udbg early console interfaces
when using an HVSI console. Please apply.
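
For anyone skimming, the data packet layout the patch relies on is a
4-byte header (type 0xff, total length, two seqno bytes) followed by the
payload. A stand-alone sketch of that framing follows; hvsi_pack() and
hvsi_unpack() are names invented for this example, and the one-byte
length limit is my assumption, not something the patch checks:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define HVSI_DATA_TYPE   0xff  /* first byte of a data packet, as in the patch */
#define HVSI_HEADER_LEN  4     /* type, length, two seqno bytes */

/* Hypothetical helper: wrap n payload bytes in a data packet.
 * Returns the total packet length, or 0 if it would not fit in one byte. */
static size_t hvsi_pack(uint8_t *pkt, const uint8_t *data, size_t n)
{
	if (n + HVSI_HEADER_LEN > 255)
		return 0;
	pkt[0] = HVSI_DATA_TYPE;
	pkt[1] = (uint8_t)(n + HVSI_HEADER_LEN);
	pkt[2] = 0;  /* seqno, unused here */
	pkt[3] = 0;
	memcpy(pkt + HVSI_HEADER_LEN, data, n);
	return n + HVSI_HEADER_LEN;
}

/* Hypothetical helper: strip the header.
 * Returns the payload length, or -1 for a short or non-data packet. */
static int hvsi_unpack(const uint8_t *pkt, size_t len, uint8_t *out)
{
	if (len < HVSI_HEADER_LEN || pkt[0] != HVSI_DATA_TYPE)
		return -1;
	memcpy(out, pkt + HVSI_HEADER_LEN, len - HVSI_HEADER_LEN);
	return (int)(len - HVSI_HEADER_LEN);
}
```

Packing a single character this way yields the same five bytes that
udbg_hvsi_putc() builds by hand.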

Signed-off-by: Hollis Blanchard <hollisb at us.ibm.com>

-- 
Hollis Blanchard
IBM Linux Technology Center

--- arch/ppc64/kernel/pSeries_lpar.c.orig	Tue Sep 21 23:40:30 2004
+++ arch/ppc64/kernel/pSeries_lpar.c	Thu Oct  7 10:52:23 2004
@@ -59,6 +59,74 @@
 
 int vtermno;	/* virtual terminal# for udbg  */
 
+#define __ALIGNED__ __attribute__((__aligned__(sizeof(long))))
+static void udbg_hvsi_putc(unsigned char c)
+{
+	/* packet's seqno isn't used anyways */
+	uint8_t packet[] __ALIGNED__ = { 0xff, 5, 0, 0, c };
+	int rc;
+
+	if (c == '\n')
+		udbg_hvsi_putc('\r');
+
+	do {
+		rc = plpar_put_term_char(vtermno, sizeof(packet), packet);
+	} while (rc == H_Busy);
+}
+
+static long hvsi_udbg_buf_len;
+static uint8_t hvsi_udbg_buf[256];
+
+static int udbg_hvsi_getc_poll(void)
+{
+	unsigned char ch;
+	int rc, i;
+
+	if (hvsi_udbg_buf_len == 0) {
+		rc = plpar_get_term_char(vtermno, &hvsi_udbg_buf_len, hvsi_udbg_buf);
+		if (rc != H_Success || hvsi_udbg_buf[0] != 0xff) {
+			/* bad read or non-data packet */
+			hvsi_udbg_buf_len = 0;
+		} else {
+			/* remove the packet header */
+			for (i = 4; i < hvsi_udbg_buf_len; i++)
+				hvsi_udbg_buf[i-4] = hvsi_udbg_buf[i];
+			hvsi_udbg_buf_len -= 4;
+		}
+	}
+
+	if (hvsi_udbg_buf_len <= 0 || hvsi_udbg_buf_len > 256) {
+		/* no data ready */
+		hvsi_udbg_buf_len = 0;
+		return -1;
+	}
+
+	ch = hvsi_udbg_buf[0];
+	/* shift remaining data down */
+	for (i = 1; i < hvsi_udbg_buf_len; i++) {
+		hvsi_udbg_buf[i-1] = hvsi_udbg_buf[i];
+	}
+	hvsi_udbg_buf_len--;
+
+	return ch;
+}
+
+static unsigned char udbg_hvsi_getc(void)
+{
+	int ch;
+	for (;;) {
+		ch = udbg_hvsi_getc_poll();
+		if (ch == -1) {
+			/* This shouldn't be needed...but... */
+			volatile unsigned long delay;
+			for (delay=0; delay < 2000000; delay++)
+				;
+		} else {
+			return ch;
+		}
+	}
+}
+
 static void udbg_putcLP(unsigned char c)
 {
 	char buf[16];
@@ -167,11 +235,15 @@
 				ppc_md.udbg_getc_poll = udbg_getc_pollLP;
 				found = 1;
 			}
-		} else {
-			/* XXX implement udbg_putcLP_vtty for hvterm-protocol1 case */
-			printk(KERN_WARNING "%s doesn't speak hvterm1; "
-					"can't print udbg messages\n",
-			       stdout_node->full_name);
+		} else if (device_is_compatible(stdout_node, "hvterm-protocol")) {
+			termno = (u32 *)get_property(stdout_node, "reg", NULL);
+			if (termno) {
+				vtermno = termno[0];
+				ppc_md.udbg_putc = udbg_hvsi_putc;
+				ppc_md.udbg_getc = udbg_hvsi_getc;
+				ppc_md.udbg_getc_poll = udbg_hvsi_getc_poll;
+				found = 1;
+			}
 		}
 	} else if (strncmp(name, "serial", 6)) {
 		/* XXX fix ISA serial console */


From hollisb at us.ibm.com  Wed Oct 27 23:40:04 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Wed, 27 Oct 2004 13:40:04 +0000
Subject: [resend patch] HVSI reset support
Message-ID: <1098884404.3484.11.camel@localhost>

Hi Linus, I've retested this with current BK as you requested.

This patch adds support for when the service processor (the
other end of the console) resets due to a critical error; we can resume
the connection when it comes back. Please apply.
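
The recovery flow, boiled down to a toy state machine for illustration
only. The enum and function names are invented and the real driver has
more states; this just shows the three transitions the patch adds:

```c
/* Toy model of the recovery flow; names are illustrative, not the driver's. */
enum toy_state { TOY_CLOSED, TOY_OPEN, TOY_FSP_DIED };
enum toy_event { TOY_WRITE_EIO, TOY_CLOSE_PROTOCOL, TOY_HANDSHAKE_OK };

/* Returns the next state; sets *rehandshake when a handshake should be
 * scheduled (in the driver this happens outside the lock, via a work item). */
static enum toy_state toy_step(enum toy_state s, enum toy_event e,
			       int *rehandshake)
{
	*rehandshake = 0;
	switch (e) {
	case TOY_WRITE_EIO:
		/* a failed write is how we notice the other end is gone */
		return TOY_FSP_DIED;
	case TOY_CLOSE_PROTOCOL:
		/* the service processor is back; reconnect unless the port
		 * was closed anyway */
		if (s != TOY_CLOSED)
			*rehandshake = 1;
		return s;
	case TOY_HANDSHAKE_OK:
		return TOY_OPEN;
	}
	return s;
}
```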

Signed-off-by: Hollis Blanchard <hollisb at us.ibm.com>

-- 
Hollis Blanchard
IBM Linux Technology Center

--- drivers/char/hvsi.c.orig	Mon Sep 13 19:23:15 2004
+++ drivers/char/hvsi.c	Wed Oct 20 17:10:34 2004
@@ -29,11 +29,6 @@
  * the OS cannot change the speed of the port through this protocol.
  */
 
-/* TODO:
- * test FSP reset
- * add udbg support for xmon/kdb
- */
-
 #undef DEBUG
 
 #include <linux/console.h>
@@ -54,6 +49,7 @@
 #include <asm/prom.h>
 #include <asm/uaccess.h>
 #include <asm/vio.h>
+#include <asm/param.h>
 
 #define HVSI_MAJOR	229
 #define HVSI_MINOR	128
@@ -74,6 +70,7 @@
 
 struct hvsi_struct {
 	struct work_struct writer;
+	struct work_struct handshaker;
 	wait_queue_head_t emptyq; /* woken when outbuf is emptied */
 	wait_queue_head_t stateq; /* woken when HVSI state changes */
 	spinlock_t lock;
@@ -109,6 +106,7 @@
 	HVSI_WAIT_FOR_VER_QUERY,
 	HVSI_OPEN,
 	HVSI_WAIT_FOR_MCTRL_RESPONSE,
+	HVSI_FSP_DIED,
 };
 #define HVSI_CONSOLE 0x1
 
@@ -172,6 +170,13 @@
 	} u;
 } __attribute__((packed));
 
+
+
+static inline int is_console(struct hvsi_struct *hp)
+{
+	return hp->flags & HVSI_CONSOLE;
+}
+
 static inline int is_open(struct hvsi_struct *hp)
 {
 	/* if we're waiting for an mctrl then we're already open */
@@ -188,6 +193,7 @@
 		"HVSI_WAIT_FOR_VER_QUERY",
 		"HVSI_OPEN",
 		"HVSI_WAIT_FOR_MCTRL_RESPONSE",
+		"HVSI_FSP_DIED",
 	};
 	const char *name = state_names[hp->state];
 
@@ -296,14 +302,9 @@
 	return 0;
 }
 
-/*
- * we can't call tty_hangup() directly here because we need to call that
- * outside of our lock
- */
-static struct tty_struct *hvsi_recv_control(struct hvsi_struct *hp,
-		uint8_t *packet)
+static void hvsi_recv_control(struct hvsi_struct *hp, uint8_t *packet,
+	struct tty_struct **to_hangup, struct hvsi_struct **to_handshake)
 {
-	struct tty_struct *to_hangup = NULL;
 	struct hvsi_control *header = (struct hvsi_control *)packet;
 
 	switch (header->verb) {
@@ -313,15 +314,14 @@
 				pr_debug("hvsi%i: CD dropped\n", hp->index);
 				hp->mctrl &= TIOCM_CD;
 				if (!(hp->tty->flags & CLOCAL))
-					to_hangup = hp->tty;
+					*to_hangup = hp->tty;
 			}
 			break;
 		case VSV_CLOSE_PROTOCOL:
-			printk(KERN_DEBUG
-				"hvsi%i: service processor closed connection!\n", hp->index);
-			__set_state(hp, HVSI_CLOSED);
-			to_hangup = hp->tty;
-			hp->tty = NULL;
+			pr_debug("hvsi%i: service processor came back\n", hp->index);
+			if (hp->state != HVSI_CLOSED) {
+				*to_handshake = hp;
+			}
 			break;
 		default:
 			printk(KERN_WARNING "hvsi%i: unknown HVSI control packet: ",
@@ -329,8 +329,6 @@
 			dump_packet(packet);
 			break;
 	}
-
-	return to_hangup;
 }
 
 static void hvsi_recv_response(struct hvsi_struct *hp, uint8_t *packet)
@@ -388,8 +386,8 @@
 
 	switch (hp->state) {
 		case HVSI_WAIT_FOR_VER_QUERY:
-			__set_state(hp, HVSI_OPEN);
 			hvsi_version_respond(hp, query->seqno);
+			__set_state(hp, HVSI_OPEN);
 			break;
 		default:
 			printk(KERN_ERR "hvsi%i: unexpected query: ", hp->index);
@@ -467,17 +465,20 @@
  * incoming data).
  */
 static int hvsi_load_chunk(struct hvsi_struct *hp, struct tty_struct **flip,
-		struct tty_struct **hangup)
+		struct tty_struct **hangup, struct hvsi_struct **handshake)
 {
 	uint8_t *packet = hp->inbuf;
 	int chunklen;
 
 	*flip = NULL;
 	*hangup = NULL;
+	*handshake = NULL;
 
 	chunklen = hvsi_read(hp, hp->inbuf_end, HVSI_MAX_READ);
-	if (chunklen == 0)
+	if (chunklen == 0) {
+		pr_debug("%s: 0-length read\n", __FUNCTION__);
 		return 0;
+	}
 
 	pr_debug("%s: got %i bytes\n", __FUNCTION__, chunklen);
 	dbg_dump_hex(hp->inbuf_end, chunklen);
@@ -509,7 +510,7 @@
 				*flip = hvsi_recv_data(hp, packet);
 				break;
 			case VS_CONTROL_PACKET_HEADER:
-				*hangup = hvsi_recv_control(hp, packet);
+				hvsi_recv_control(hp, packet, hangup, handshake);
 				break;
 			case VS_QUERY_RESPONSE_PACKET_HEADER:
 				hvsi_recv_response(hp, packet);
@@ -526,8 +527,8 @@
 
 		packet += len_packet(packet);
 
-		if (*hangup) {
-			pr_debug("%s: hangup\n", __FUNCTION__);
+		if (*hangup || *handshake) {
+			pr_debug("%s: hangup or handshake\n", __FUNCTION__);
 			/*
 			 * we need to send the hangup now before receiving any more data.
 			 * If we get "data, hangup, data", we can't deliver the second
@@ -560,16 +561,15 @@
 	struct hvsi_struct *hp = (struct hvsi_struct *)arg;
 	struct tty_struct *flip;
 	struct tty_struct *hangup;
+	struct hvsi_struct *handshake;
 	unsigned long flags;
-	irqreturn_t handled = IRQ_NONE;
 	int again = 1;
 
 	pr_debug("%s\n", __FUNCTION__);
 
 	while (again) {
 		spin_lock_irqsave(&hp->lock, flags);
-		again = hvsi_load_chunk(hp, &flip, &hangup);
-		handled = IRQ_HANDLED;
+		again = hvsi_load_chunk(hp, &flip, &hangup, &handshake);
 		spin_unlock_irqrestore(&hp->lock, flags);
 
 		/*
@@ -587,6 +587,11 @@
 		if (hangup) {
 			tty_hangup(hangup);
 		}
+
+		if (handshake) {
+			pr_debug("hvsi%i: attempting re-handshake\n", handshake->index);
+			schedule_work(&handshake->handshaker);
+		}
 	}
 
 	spin_lock_irqsave(&hp->lock, flags);
@@ -603,7 +608,7 @@
 		tty_flip_buffer_push(flip);
 	}
 
-	return handled;
+	return IRQ_HANDLED;
 }
 
 /* for boot console, before the irq handler is running */
@@ -757,6 +762,23 @@
 	return 0;
 }
 
+static void hvsi_handshaker(void *arg)
+{
+	struct hvsi_struct *hp = (struct hvsi_struct *)arg;
+
+	if (hvsi_handshake(hp) >= 0)
+		return;
+
+	printk(KERN_ERR "hvsi%i: re-handshaking failed\n", hp->index);
+	if (is_console(hp)) {
+		/*
+		 * ttys will re-attempt the handshake via hvsi_open, but
+		 * the console will not.
+		 */
+		printk(KERN_ERR "hvsi%i: lost console!\n", hp->index);
+	}
+}
+
 static int hvsi_put_chars(struct hvsi_struct *hp, const char *buf, int count)
 {
 	struct hvsi_data packet __ALIGNED__;
@@ -808,6 +830,10 @@
 	tty->driver_data = hp;
 	tty->low_latency = 1; /* avoid throttle/tty_flip_buffer_push race */
 
+	mb();
+	if (hp->state == HVSI_FSP_DIED)
+		return -EIO;
+
 	spin_lock_irqsave(&hp->lock, flags);
 	hp->tty = tty;
 	hp->count++;
@@ -815,7 +841,7 @@
 	h_vio_signal(hp->vtermno, VIO_IRQ_ENABLE);
 	spin_unlock_irqrestore(&hp->lock, flags);
 
-	if (hp->flags & HVSI_CONSOLE)
+	if (is_console(hp))
 		return 0; /* this has already been handshaked as the console */
 
 	ret = hvsi_handshake(hp);
@@ -889,7 +915,7 @@
 		hp->inbuf_end = hp->inbuf; /* discard remaining partial packets */
 
 		/* only close down connection if it is not the console */
-		if (!(hp->flags & HVSI_CONSOLE)) {
+		if (!is_console(hp)) {
 			h_vio_signal(hp->vtermno, VIO_IRQ_DISABLE); /* no more irqs */
 			__set_state(hp, HVSI_CLOSED);
 			/*
@@ -943,12 +969,13 @@
 		return;
 
 	n = hvsi_put_chars(hp, hp->outbuf, hp->n_outbuf);
-	if (n != 0) {
-		/*
-		 * either all data was sent or there was an error, and we throw away
-		 * data on error.
-		 */
+	if (n > 0) {
+		/* success */
+		pr_debug("%s: wrote %i chars\n", __FUNCTION__, n);
 		hp->n_outbuf = 0;
+	} else if (n == -EIO) {
+		__set_state(hp, HVSI_FSP_DIED);
+		printk(KERN_ERR "hvsi%i: service processor died\n", hp->index);
 	}
 }
 
@@ -966,6 +993,19 @@
 
 	spin_lock_irqsave(&hp->lock, flags);
 
+	pr_debug("%s: %i chars in buffer\n", __FUNCTION__, hp->n_outbuf);
+
+	if (!is_open(hp)) {
+		/*
+		 * We could have a non-open connection if the service processor died
+		 * while we were busily scheduling ourselves. In that case, it could
+		 * be minutes before the service processor comes back, so only try
+		 * again once a second.
+		 */
+		schedule_delayed_work(&hp->writer, HZ);
+		goto out;
+	}
+
 	hvsi_push(hp);
 	if (hp->n_outbuf > 0)
 		schedule_delayed_work(&hp->writer, 10);
@@ -982,6 +1022,7 @@
 		wake_up_interruptible(&hp->tty->write_wait);
 	}
 
+out:
 	spin_unlock_irqrestore(&hp->lock, flags);
 }
 
@@ -1022,6 +1063,8 @@
 
 	spin_lock_irqsave(&hp->lock, flags);
 
+	pr_debug("%s: %i chars in buffer\n", __FUNCTION__, hp->n_outbuf);
+
 	if (!is_open(hp)) {
 		/* we're either closing or not yet open; don't accept data */
 		pr_debug("%s: not open\n", __FUNCTION__);
@@ -1294,6 +1337,7 @@
 
 		hp = &hvsi_ports[hvsi_count];
 		INIT_WORK(&hp->writer, hvsi_write_worker, hp);
+		INIT_WORK(&hp->handshaker, hvsi_handshaker, hp);
 		init_waitqueue_head(&hp->emptyq);
 		init_waitqueue_head(&hp->stateq);
 		hp->lock = SPIN_LOCK_UNLOCKED;


From dhowells at redhat.com  Thu Oct 28 05:08:41 2004
From: dhowells at redhat.com (David Howells)
Date: Wed, 27 Oct 2004 20:08:41 +0100
Subject: [PATCH] Make key management syscalls work on PPC/PPC64
Message-ID: <24857.1098904121@redhat.com>


The attached patch permits my key management stuff to be used on PPC, PPC64
and PPC on PPC64. Syscall numbers were allocated by Paul Mackerras.

I've updated my keyctl utility to work on PPC/PPC64 too:

	http://people.redhat.com/~dhowells/keys/keyctl.c

Signed-Off-By: David Howells <dhowells at redhat.com>
---

warthog>diffstat keys-269bk5.diff 
 arch/ppc/kernel/misc.S        |    3 +
 arch/ppc64/Kconfig            |    5 ++
 arch/ppc64/kernel/misc.S      |    6 +++
 arch/ppc64/kernel/sys_ppc32.c |   18 +++++++++
 include/asm-ppc/unistd.h      |    5 ++
 include/asm-ppc64/unistd.h    |    5 ++
 include/linux/compat.h        |    2 +
 security/keys/Makefile        |    1 
 security/keys/compat.c        |   78 ++++++++++++++++++++++++++++++++++++++++++
 security/keys/internal.h      |   20 ++++++++++
 security/keys/keyctl.c        |   54 +++++++++++++----------------
 11 files changed, 166 insertions(+), 31 deletions(-)

diff -uNrp linux-2.6.9-bk5/arch/ppc/kernel/misc.S linux-2.6.9-bk5-keys/arch/ppc/kernel/misc.S
--- linux-2.6.9-bk5/arch/ppc/kernel/misc.S	2004-10-19 10:41:46.000000000 +0100
+++ linux-2.6.9-bk5-keys/arch/ppc/kernel/misc.S	2004-10-22 10:27:40.000000000 +0100
@@ -1447,3 +1447,6 @@ _GLOBAL(sys_call_table)
 	.long sys_mq_notify
 	.long sys_mq_getsetattr
 	.long sys_ni_syscall		/* 268 reserved for sys_kexec_load */
+	.long sys_add_key
+	.long sys_request_key		/* 270 */
+	.long sys_keyctl
diff -uNrp linux-2.6.9-bk5/arch/ppc64/Kconfig linux-2.6.9-bk5-keys/arch/ppc64/Kconfig
--- linux-2.6.9-bk5/arch/ppc64/Kconfig	2004-10-21 11:21:45.000000000 +0100
+++ linux-2.6.9-bk5-keys/arch/ppc64/Kconfig	2004-10-22 14:01:30.000000000 +0100
@@ -356,6 +356,11 @@ source "arch/ppc64/Kconfig.debug"
 
 source "security/Kconfig"
 
+config KEYS_COMPAT
+	bool
+	depends on COMPAT && KEYS
+	default y
+
 source "crypto/Kconfig"
 
 source "lib/Kconfig"
diff -uNrp linux-2.6.9-bk5/arch/ppc64/kernel/misc.S linux-2.6.9-bk5-keys/arch/ppc64/kernel/misc.S
--- linux-2.6.9-bk5/arch/ppc64/kernel/misc.S	2004-10-21 11:21:45.000000000 +0100
+++ linux-2.6.9-bk5-keys/arch/ppc64/kernel/misc.S	2004-10-22 11:08:44.000000000 +0100
@@ -963,6 +963,9 @@ _GLOBAL(sys_call_table32)
 	.llong .compat_sys_mq_notify
 	.llong .compat_sys_mq_getsetattr
 	.llong .sys_ni_syscall		/* 268 reserved for sys_kexec_load */
+	.llong .sys32_add_key
+	.llong .sys32_request_key
+	.llong .compat_keyctl
 
 	.balign 8
 _GLOBAL(sys_call_table)
@@ -1235,3 +1238,6 @@ _GLOBAL(sys_call_table)
 	.llong .sys_mq_notify
 	.llong .sys_mq_getsetattr
 	.llong .sys_ni_syscall		/* 268 reserved for sys_kexec_load */
+	.llong .sys_add_key
+	.llong .sys_request_key		/* 270 */
+	.llong .sys_keyctl
diff -uNrp linux-2.6.9-bk5/arch/ppc64/kernel/sys_ppc32.c linux-2.6.9-bk5-keys/arch/ppc64/kernel/sys_ppc32.c
--- linux-2.6.9-bk5/arch/ppc64/kernel/sys_ppc32.c	2004-10-21 11:21:45.000000000 +0100
+++ linux-2.6.9-bk5-keys/arch/ppc64/kernel/sys_ppc32.c	2004-10-22 13:55:56.000000000 +0100
@@ -1328,3 +1328,21 @@ long ppc32_timer_create(clockid_t clock,
 
 	return err;
 }
+
+asmlinkage long sys32_add_key(const char __user *_type,
+			      const char __user *_description,
+			      const void __user *_payload,
+			      u32 plen,
+			      u32 ringid)
+{
+	return sys_add_key(_type, _description, _payload, plen, ringid);
+}
+
+asmlinkage long sys32_request_key(const char __user *_type,
+				  const char __user *_description,
+				  const char __user *_callout_info,
+				  u32 destringid)
+{
+	return sys_request_key(_type, _description, _callout_info, destringid);
+}
+
diff -uNrp linux-2.6.9-bk5/include/asm-ppc/unistd.h linux-2.6.9-bk5-keys/include/asm-ppc/unistd.h
--- linux-2.6.9-bk5/include/asm-ppc/unistd.h	2004-06-18 13:44:05.000000000 +0100
+++ linux-2.6.9-bk5-keys/include/asm-ppc/unistd.h	2004-10-22 10:27:40.000000000 +0100
@@ -273,8 +273,11 @@
 #define __NR_mq_notify		266
 #define __NR_mq_getsetattr	267
 #define __NR_kexec_load		268
+#define __NR_add_key		269
+#define __NR_request_key	270
+#define __NR_keyctl		271
 
-#define __NR_syscalls		269
+#define __NR_syscalls		272
 
 #define __NR(n)	#n
 
diff -uNrp linux-2.6.9-bk5/include/asm-ppc64/unistd.h linux-2.6.9-bk5-keys/include/asm-ppc64/unistd.h
--- linux-2.6.9-bk5/include/asm-ppc64/unistd.h	2004-10-19 10:42:14.000000000 +0100
+++ linux-2.6.9-bk5-keys/include/asm-ppc64/unistd.h	2004-10-22 10:27:40.000000000 +0100
@@ -279,8 +279,11 @@
 #define __NR_mq_notify		266
 #define __NR_mq_getsetattr	267
 #define __NR_kexec_load		268
+#define __NR_add_key		269
+#define __NR_request_key	270
+#define __NR_keyctl		271
 
-#define __NR_syscalls		269
+#define __NR_syscalls		272
 #ifdef __KERNEL__
 #define NR_syscalls	__NR_syscalls
 #endif
diff -uNrp linux-2.6.9-bk5/include/linux/compat.h linux-2.6.9-bk5-keys/include/linux/compat.h
--- linux-2.6.9-bk5/include/linux/compat.h	2004-10-19 10:42:16.000000000 +0100
+++ linux-2.6.9-bk5-keys/include/linux/compat.h	2004-10-22 11:02:14.000000000 +0100
@@ -119,6 +119,8 @@ long compat_sys_shmat(int first, int sec
 long compat_sys_shmctl(int first, int second, void __user *uptr);
 long compat_sys_semtimedop(int semid, struct sembuf __user *tsems,
 		unsigned nsems, const struct compat_timespec __user *timeout);
+asmlinkage long compat_keyctl(u32 option,
+			      u32 arg2, u32 arg3, u32 arg4, u32 arg5);
 
 asmlinkage ssize_t compat_sys_readv(unsigned long fd,
 		const struct compat_iovec __user *vec, unsigned long vlen);
diff -uNrp linux-2.6.9-bk5/security/keys/compat.c linux-2.6.9-bk5-keys/security/keys/compat.c
--- linux-2.6.9-bk5/security/keys/compat.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.9-bk5-keys/security/keys/compat.c	2004-10-22 14:02:07.000000000 +0100
@@ -0,0 +1,78 @@
+/* compat.c: 32-bit compatibility syscall for 64-bit systems
+ *
+ * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells at redhat.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/keyctl.h>
+#include <linux/compat.h>
+#include "internal.h"
+
+/*****************************************************************************/
+/*
+ * the key control system call, 32-bit compatibility version for 64-bit archs
+ * - this should only be called if the 64-bit arch uses weird pointers in
+ *   32-bit mode or doesn't guarantee that the top 32-bits of the argument
+ *   registers on taking a 32-bit syscall are zero
+ * - if you can, you should call sys_keyctl directly
+ */
+asmlinkage long compat_keyctl(u32 option,
+			      u32 arg2, u32 arg3, u32 arg4, u32 arg5)
+{
+	switch (option) {
+	case KEYCTL_GET_KEYRING_ID:
+		return keyctl_get_keyring_ID(arg2, arg3);
+
+	case KEYCTL_JOIN_SESSION_KEYRING:
+		return keyctl_join_session_keyring(compat_ptr(arg2));
+
+	case KEYCTL_UPDATE:
+		return keyctl_update_key(arg2, compat_ptr(arg3), arg4);
+
+	case KEYCTL_REVOKE:
+		return keyctl_revoke_key(arg2);
+
+	case KEYCTL_DESCRIBE:
+		return keyctl_describe_key(arg2, compat_ptr(arg3), arg4);
+
+	case KEYCTL_CLEAR:
+		return keyctl_keyring_clear(arg2);
+
+	case KEYCTL_LINK:
+		return keyctl_keyring_link(arg2, arg3);
+
+	case KEYCTL_UNLINK:
+		return keyctl_keyring_unlink(arg2, arg3);
+
+	case KEYCTL_SEARCH:
+		return keyctl_keyring_search(arg2, compat_ptr(arg3),
+					     compat_ptr(arg4), arg5);
+
+	case KEYCTL_READ:
+		return keyctl_read_key(arg2, compat_ptr(arg3), arg4);
+
+	case KEYCTL_CHOWN:
+		return keyctl_chown_key(arg2, arg3, arg4);
+
+	case KEYCTL_SETPERM:
+		return keyctl_setperm_key(arg2, arg3);
+
+	case KEYCTL_INSTANTIATE:
+		return keyctl_instantiate_key(arg2, compat_ptr(arg3), arg4,
+					      arg5);
+
+	case KEYCTL_NEGATE:
+		return keyctl_negate_key(arg2, arg3, arg4);
+
+	default:
+		return -EOPNOTSUPP;
+	}
+
+} /* end compat_keyctl() */
diff -uNrp linux-2.6.9-bk5/security/keys/internal.h linux-2.6.9-bk5-keys/security/keys/internal.h
--- linux-2.6.9-bk5/security/keys/internal.h	2004-10-21 11:22:11.000000000 +0100
+++ linux-2.6.9-bk5-keys/security/keys/internal.h	2004-10-21 11:39:25.000000000 +0100
@@ -81,6 +81,26 @@ extern struct key *find_keyring_by_name(
 
 extern int install_thread_keyring(struct task_struct *tsk);
 
+/*
+ * keyctl functions
+ */
+extern long keyctl_get_keyring_ID(key_serial_t, int);
+extern long keyctl_join_session_keyring(const char __user *);
+extern long keyctl_update_key(key_serial_t, const void __user *, size_t);
+extern long keyctl_revoke_key(key_serial_t);
+extern long keyctl_keyring_clear(key_serial_t);
+extern long keyctl_keyring_link(key_serial_t, key_serial_t);
+extern long keyctl_keyring_unlink(key_serial_t, key_serial_t);
+extern long keyctl_describe_key(key_serial_t, char __user *, size_t);
+extern long keyctl_keyring_search(key_serial_t, const char __user *,
+				  const char __user *, key_serial_t);
+extern long keyctl_read_key(key_serial_t, char __user *, size_t);
+extern long keyctl_chown_key(key_serial_t, uid_t, gid_t);
+extern long keyctl_setperm_key(key_serial_t, key_perm_t);
+extern long keyctl_instantiate_key(key_serial_t, const void __user *,
+				   size_t, key_serial_t);
+extern long keyctl_negate_key(key_serial_t, unsigned, key_serial_t);
+
 
 /*
  * debugging key validation
diff -uNrp linux-2.6.9-bk5/security/keys/keyctl.c linux-2.6.9-bk5-keys/security/keys/keyctl.c
--- linux-2.6.9-bk5/security/keys/keyctl.c	2004-10-21 11:22:11.000000000 +0100
+++ linux-2.6.9-bk5-keys/security/keys/keyctl.c	2004-10-21 11:54:48.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/init.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
+#include <linux/syscalls.h>
 #include <linux/keyctl.h>
 #include <linux/fs.h>
 #include <linux/err.h>
@@ -231,7 +232,7 @@ asmlinkage long sys_request_key(const ch
  * - the keyring must have search permission to be found
  * - implements keyctl(KEYCTL_GET_KEYRING_ID)
  */
-static long keyctl_get_keyring_ID(key_serial_t id, int create)
+long keyctl_get_keyring_ID(key_serial_t id, int create)
 {
 	struct key *key;
 	long ret;
@@ -254,7 +255,7 @@ static long keyctl_get_keyring_ID(key_se
  * join the session keyring
  * - implements keyctl(KEYCTL_JOIN_SESSION_KEYRING)
  */
-static long keyctl_join_session_keyring(const char __user *_name)
+long keyctl_join_session_keyring(const char __user *_name)
 {
 	char *name;
 	long nlen, ret;
@@ -297,9 +298,9 @@ static long keyctl_join_session_keyring(
  * - the key must be writable
  * - implements keyctl(KEYCTL_UPDATE)
  */
-static long keyctl_update_key(key_serial_t id,
-			      const void __user *_payload,
-			      size_t plen)
+long keyctl_update_key(key_serial_t id,
+		       const void __user *_payload,
+		       size_t plen)
 {
 	struct key *key;
 	void *payload;
@@ -346,7 +347,7 @@ static long keyctl_update_key(key_serial
  * - the key must be writable
  * - implements keyctl(KEYCTL_REVOKE)
  */
-static long keyctl_revoke_key(key_serial_t id)
+long keyctl_revoke_key(key_serial_t id)
 {
 	struct key *key;
 	long ret;
@@ -372,7 +373,7 @@ static long keyctl_revoke_key(key_serial
  * - the keyring must be writable
  * - implements keyctl(KEYCTL_CLEAR)
  */
-static long keyctl_keyring_clear(key_serial_t ringid)
+long keyctl_keyring_clear(key_serial_t ringid)
 {
 	struct key *keyring;
 	long ret;
@@ -398,7 +399,7 @@ static long keyctl_keyring_clear(key_ser
  * - the key must be linkable
  * - implements keyctl(KEYCTL_LINK)
  */
-static long keyctl_keyring_link(key_serial_t id, key_serial_t ringid)
+long keyctl_keyring_link(key_serial_t id, key_serial_t ringid)
 {
 	struct key *keyring, *key;
 	long ret;
@@ -432,7 +433,7 @@ static long keyctl_keyring_link(key_seri
  * - we don't need any permissions on the key
  * - implements keyctl(KEYCTL_UNLINK)
  */
-static long keyctl_keyring_unlink(key_serial_t id, key_serial_t ringid)
+long keyctl_keyring_unlink(key_serial_t id, key_serial_t ringid)
 {
 	struct key *keyring, *key;
 	long ret;
@@ -470,9 +471,9 @@ static long keyctl_keyring_unlink(key_se
  *	type;uid;gid;perm;description<NUL>
  * - implements keyctl(KEYCTL_DESCRIBE)
  */
-static long keyctl_describe_key(key_serial_t keyid,
-				char __user *buffer,
-				size_t buflen)
+long keyctl_describe_key(key_serial_t keyid,
+			 char __user *buffer,
+			 size_t buflen)
 {
 	struct key *key;
 	char *tmpbuf;
@@ -532,10 +533,10 @@ static long keyctl_describe_key(key_seri
  *   there's one specified
  * - implements keyctl(KEYCTL_SEARCH)
  */
-static long keyctl_keyring_search(key_serial_t ringid,
-				  const char __user *_type,
-				  const char __user *_description,
-				  key_serial_t destringid)
+long keyctl_keyring_search(key_serial_t ringid,
+			   const char __user *_type,
+			   const char __user *_description,
+			   key_serial_t destringid)
 {
 	struct key_type *ktype;
 	struct key *keyring, *key, *dest;
@@ -649,9 +650,7 @@ static int keyctl_read_key_same(const st
  *   irrespective of how much we may have copied
  * - implements keyctl(KEYCTL_READ)
  */
-static long keyctl_read_key(key_serial_t keyid,
-			    char __user *buffer,
-			    size_t buflen)
+long keyctl_read_key(key_serial_t keyid, char __user *buffer, size_t buflen)
 {
 	struct key *key, *skey;
 	long ret;
@@ -711,7 +710,7 @@ static long keyctl_read_key(key_serial_t
  * - if the uid or gid is -1, then that parameter is not changed
  * - implements keyctl(KEYCTL_CHOWN)
  */
-static long keyctl_chown_key(key_serial_t id, uid_t uid, gid_t gid)
+long keyctl_chown_key(key_serial_t id, uid_t uid, gid_t gid)
 {
 	struct key *key;
 	long ret;
@@ -770,7 +769,7 @@ static long keyctl_chown_key(key_serial_
  * - the keyring owned by the changer
  * - implements keyctl(KEYCTL_SETPERM)
  */
-static long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
+long keyctl_setperm_key(key_serial_t id, key_perm_t perm)
 {
 	struct key *key;
 	long ret;
@@ -814,10 +813,10 @@ static long keyctl_setperm_key(key_seria
  * instantiate the key with the specified payload, and, if one is given, link
  * the key into the keyring
  */
-static long keyctl_instantiate_key(key_serial_t id,
-				   const void __user *_payload,
-				   size_t plen,
-				   key_serial_t ringid)
+long keyctl_instantiate_key(key_serial_t id,
+			    const void __user *_payload,
+			    size_t plen,
+			    key_serial_t ringid)
 {
 	struct key *key, *keyring;
 	void *payload;
@@ -877,9 +876,7 @@ static long keyctl_instantiate_key(key_s
  * negatively instantiate the key with the given timeout (in seconds), and, if
  * one is given, link the key into the keyring
  */
-static long keyctl_negate_key(key_serial_t id,
-			      unsigned timeout,
-			      key_serial_t ringid)
+long keyctl_negate_key(key_serial_t id, unsigned timeout, key_serial_t ringid)
 {
 	struct key *key, *keyring;
 	long ret;
@@ -916,7 +913,6 @@ static long keyctl_negate_key(key_serial
 /*****************************************************************************/
 /*
  * the key control system call
- * - currently invoked through prctl()
  */
 asmlinkage long sys_keyctl(int option, unsigned long arg2, unsigned long arg3,
 			   unsigned long arg4, unsigned long arg5)
diff -uNrp linux-2.6.9-bk5/security/keys/Makefile linux-2.6.9-bk5-keys/security/keys/Makefile
--- linux-2.6.9-bk5/security/keys/Makefile	2004-10-21 11:22:11.000000000 +0100
+++ linux-2.6.9-bk5-keys/security/keys/Makefile	2004-10-22 10:49:39.000000000 +0100
@@ -10,4 +10,5 @@ obj-y := \
 	user_defined.o \
 	request_key.o
 
+obj-$(CONFIG_KEYS_COMPAT) += compat.o
 obj-$(CONFIG_PROC_FS) += proc.o


From paulus at samba.org  Thu Oct 28 09:08:17 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 28 Oct 2004 09:08:17 +1000
Subject: [PATCH] iommu fixes, round 2
In-Reply-To: <1098813781.32293.40.camel@sinatra.austin.ibm.com>
References: <1098775712.6897.17.camel@gaston>
	<1098808895.32293.23.camel@sinatra.austin.ibm.com>
	<1098813781.32293.40.camel@sinatra.austin.ibm.com>
Message-ID: <16768.10849.741580.850491@cargo.ozlabs.ibm.com>

John Rose writes:

> Thirdly, iommu_devnode_init() also has an iSeries implementation, so
> it's not pSeries-specific.  No need to rename it, as suggested in the
> comment.

I would rather we didn't have two functions with the same name, as we
do for iommu_devnode_init (with iSeries and pSeries implementations),
because that is one more obstacle to eventually making a single kernel
binary that can run on iSeries and on other machines.  That goal is
still some distance off but we shouldn't make it harder to reach if
possible.  That's the motivation for having a function pointer in
ppc_md for it.
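
A minimal sketch of the function-pointer approach, assuming a cut-down
stand-in for ppc_md.  The struct, field and function names below are
hypothetical and chosen for illustration only; they are not taken from
the kernel tree.  The point is that each platform gets a uniquely named
implementation and common code only ever calls through the pointer, so
both implementations can be linked into one binary:

```c
#include <stddef.h>

/* Hypothetical stand-in for a device node; not the kernel's struct. */
struct sketch_device_node {
	const char *name;
	int iommu_owner;	/* 0 = not set up, 1 = pSeries, 2 = iSeries */
};

/* Hypothetical, cut-down machine-dependent ops table (ppc_md-like). */
struct sketch_machdep_calls {
	void (*iommu_dev_setup)(struct sketch_device_node *dn);
};

/* Distinct names per platform, so both can live in the same binary. */
static void sketch_iommu_devnode_init_pSeries(struct sketch_device_node *dn)
{
	dn->iommu_owner = 1;
}

static void sketch_iommu_devnode_init_iSeries(struct sketch_device_node *dn)
{
	dn->iommu_owner = 2;
}

static struct sketch_machdep_calls sketch_ppc_md;

/* Chosen once at boot, after the platform has been identified. */
static void sketch_select_platform(int is_iseries)
{
	sketch_ppc_md.iommu_dev_setup = is_iseries ?
		sketch_iommu_devnode_init_iSeries :
		sketch_iommu_devnode_init_pSeries;
}

/* Common code never names a platform-specific function directly. */
static int sketch_setup_device(struct sketch_device_node *dn)
{
	if (!sketch_ppc_md.iommu_dev_setup)
		return -1;	/* no platform hook registered */
	sketch_ppc_md.iommu_dev_setup(dn);
	return 0;
}
```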

Paul.


From dwm at austin.ibm.com  Thu Oct 28 09:16:29 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Wed, 27 Oct 2004 18:16:29 -0500
Subject: 2.6.10.-rc1 on ppc64 - returning from prom_init hang
Message-ID: <200410272316.i9RNGTpj005995@falcon10.austin.ibm.com>

Anyone have any thoughts on this?  I have xmon=on, but it stops before it gets
there.  For this iteration I added console=hvsi1, which did not change anything.

This is libata-dev-2.6 on a power5 system.

Config file read, 1024 bytes
Welcome
Welcome to yaboot version 1.3.12
Enter "help" to get some basic usage information
boot:
  2.6.10-rc1-ata-1         * linux
boot: 2.6.10-rc1-ata-1
Please wait, loading kernel...
   Elf64 kernel loaded...
Loading ramdisk...
ramdisk loaded at 02300000, size: 1306 Kbytes
OF stdout device is: /vdevice/vty@30000001
Hypertas detected, assuming LPAR !
command line: root=/dev/VolGroup00/LogVol00 ro rhgb quiet console=hvsi1 xmon=on
memory layout at init:
  alloc_bottom : 0000000002447000
  alloc_top    : 0000000008000000
  alloc_top_hi : 0000000075000000
  rmo_top      : 0000000008000000
  ram_top      : 0000000075000000
Looking for displays
found display   : /pci@800000020000002/pci@2,2/pci@1/display@0, opening ... done
instantiating rtas at 0x00000000077d9000... done
0000000000000000 : boot cpu     0000000000000000
0000000000000002 : starting cpu hw idx 0000000000000002... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x0000000002748000 -> 0x00000000027492f8
Device tree struct  0x000000000274a000 -> 0x000000000275a000
Calling quiesce ...
returning from prom_init
 
++doug


From johnrose at austin.ibm.com  Thu Oct 28 09:25:47 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Wed, 27 Oct 2004 18:25:47 -0500
Subject: [PATCH] iommu fixes, round 2
In-Reply-To: <16768.10849.741580.850491@cargo.ozlabs.ibm.com>
References: <1098775712.6897.17.camel@gaston>
	<1098808895.32293.23.camel@sinatra.austin.ibm.com>
	<1098813781.32293.40.camel@sinatra.austin.ibm.com>
	<16768.10849.741580.850491@cargo.ozlabs.ibm.com>
Message-ID: <1098919547.18158.4.camel@sinatra.austin.ibm.com>

On Wed, 2004-10-27 at 18:08, Paul Mackerras wrote:
> John Rose writes:
> 
> > Thirdly, iommu_devnode_init() also has an iSeries implementation, so
> > it's not pSeries-specific.  No need to rename it, as suggested in the
> > comment.
> 
> I would rather we didn't have two functions with the same name, as we
> do for iommu_devnode_init (with iSeries and pSeries implementations),
> because that is one more obstacle to eventually making a single kernel
> binary that can run on iSeries and on other machines.  That goal is
> still some distance off but we shouldn't make it harder to reach if
> possible.  That's the motivation for having a function pointer in
> ppc_md for it.

Good point.  To contradict my earlier statement, let's rename them :) 
None of the other functions in [i,p]Series_iommu.c share names.  The two
implementations are called from i and p-specific locations anyway, so
renaming won't be a problem.

Will post a patch tmw.

Thanks-
John


From paulus at samba.org  Thu Oct 28 13:59:30 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 28 Oct 2004 13:59:30 +1000
Subject: [PATCH 1/1] rtas_flash_4gig
In-Reply-To: <20041020170817.0ee49b64@localhost>
References: <200410041942.i94Jg4WA154540@westrelay04.boulder.ibm.com>
	<16758.55568.809557.670513@cargo.ozlabs.ibm.com>
	<20041020170817.0ee49b64@localhost>
Message-ID: <16768.28322.583827.9327@cargo.ozlabs.ibm.com>

Jake Moilanen writes:

> According to the RPA (item E7-41 to be exact), the block-list can be
> anywhere under 4 gigs.  RTAS will make hypervisor calls to access this
> memory.  

OK, but I don't see that we make any attempt at all to try to make
sure the memory for the block list pages is below 4G.  I also don't
see where we check the ibm,flash-block-version property (to see if we
can in fact use a linked list of headers) or where we check that the
pages we are using don't overlap OF's memory (i.e. real-size bytes
starting at real-base).

Since this is happening at reboot time, I suggest we copy the block
list into rtas_rmo_buf.  That is big enough to accommodate up to 8k
entries, which will do for up to 32MB of flash image, which should be
enough for now, shouldn't it?  If not we can just make rtas_rmo_buf a
bit bigger.
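
The sizing above can be checked with a line of arithmetic.  This is only
a sketch, and it assumes that each block-list entry describes one 4 KB
block of the flash image; the constant and function names are made up
for the example.

```c
#include <stdint.h>

/* Assumption for this sketch: one block-list entry covers one 4 KB block. */
#define SKETCH_BLOCK_BYTES 4096u

/* Largest flash image that a block list with `entries` entries can describe. */
static uint64_t sketch_max_flash_image_bytes(uint64_t entries)
{
	return entries * SKETCH_BLOCK_BYTES;
}
```

Under that assumption, 8k entries give 8192 * 4096 bytes, which is the
32MB figure quoted above.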

As for not overlapping OF, we just need a little allocator function
that keeps on allocating pages until it gets one that doesn't overlap
with OF, and then frees all the extra pages it had to allocate.  Those
pages could be linked together so we don't have to maintain a big
array of page pointers.  The common case will be that we get a page we
can use (i.e. which doesn't overlap OF) on the first try.
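
A rough userspace sketch of such an allocator follows.  Everything in it
is an assumption made for illustration: the page allocator is passed in
as a pair of callbacks, the demo pool stands in for real memory, and the
pointer value is treated as the real address.  Rejected pages are
chained through their own first bytes, so no array of page pointers is
needed.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* A rejected page stores the link to the next rejected page in itself. */
struct sketch_reject {
	struct sketch_reject *next;
};

/* Does [addr, addr + page size) intersect [of_base, of_base + of_size)? */
static int sketch_overlaps_of(uintptr_t addr, uintptr_t of_base,
			      uintptr_t of_size)
{
	return addr < of_base + of_size && addr + SKETCH_PAGE_SIZE > of_base;
}

/*
 * Keep allocating until a page outside the OF range turns up, then free
 * every page that had to be skipped.  Returns NULL if the allocator
 * runs dry before a usable page is found.
 */
static void *sketch_alloc_page_outside_of(void *(*alloc_page)(void),
					  void (*free_page)(void *),
					  uintptr_t of_base,
					  uintptr_t of_size)
{
	struct sketch_reject *rejects = NULL;
	void *page;

	while ((page = alloc_page()) != NULL) {
		struct sketch_reject *r;

		if (!sketch_overlaps_of((uintptr_t)page, of_base, of_size))
			break;
		r = page;
		r->next = rejects;
		rejects = r;
	}

	while (rejects) {
		struct sketch_reject *next = rejects->next;

		free_page(rejects);
		rejects = next;
	}
	return page;
}

/* Demo allocator: hands out four static pages in order, counts frees. */
static union {
	struct sketch_reject link;
	unsigned char bytes[SKETCH_PAGE_SIZE];
} sketch_pool[4];
static unsigned int sketch_next, sketch_freed;

static void *sketch_demo_alloc(void)
{
	return sketch_next < 4 ? (void *)&sketch_pool[sketch_next++] : NULL;
}

static void sketch_demo_free(void *page)
{
	(void)page;
	sketch_freed++;
}
```

In the common case the first page is usable, the reject list stays
empty, and the function costs a single allocation.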

Regards,
Paul.


From paulus at samba.org  Thu Oct 28 14:44:59 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 28 Oct 2004 14:44:59 +1000
Subject: module.viomap support for ppc64
In-Reply-To: <20040813094040.GA1769@suse.de>
References: <20040812173751.GA30564@suse.de>
	<1092339278.19137.8.camel@localhost>
	<1092354195.25196.11.camel@bach> <20040813094040.GA1769@suse.de>
Message-ID: <16768.31051.268932.927382@cargo.ozlabs.ibm.com>

Olaf Hering writes:

> A hack for 2.6.8-rc4 is below. Can I read the alias file via 
> while read a b c ; do : done < modules.alias ?
> Is b supposed to contain not spaces? What special delimiter chars are
> allowed? The 'name' and 'compat' property can contain almost any char.
> I used '^' for the time being.

Olaf, do you still want these changes made?  I rebased your patch on
current BK (see below).

Dave, any comments on this patch?

Paul.

diff -urN linux-2.5/arch/ppc64/kernel/vio.c test/arch/ppc64/kernel/vio.c
--- linux-2.5/arch/ppc64/kernel/vio.c	2004-09-24 15:23:06.000000000 +1000
+++ test/arch/ppc64/kernel/vio.c	2004-10-28 14:22:59.791014944 +1000
@@ -143,7 +143,7 @@
 {
 	DBGENTER();
 
-	while (ids->type) {
+	while (ids->type[0]) {
 		if ((strncmp(dev->type, ids->type, strlen(ids->type)) == 0) &&
 			device_is_compatible(dev->dev.platform_data, ids->compat))
 			return ids;
diff -urN linux-2.5/drivers/block/viodasd.c test/drivers/block/viodasd.c
--- linux-2.5/drivers/block/viodasd.c	2004-06-30 15:40:03.000000000 +1000
+++ test/drivers/block/viodasd.c	2004-10-28 14:21:06.962994664 +1000
@@ -778,7 +778,7 @@
  */
 static struct vio_device_id viodasd_device_table[] __devinitdata = {
 	{ "viodasd", "" },
-	{ 0, }
+	{ "", "" }
 };
 
 MODULE_DEVICE_TABLE(vio, viodasd_device_table);
diff -urN linux-2.5/drivers/cdrom/viocd.c test/drivers/cdrom/viocd.c
--- linux-2.5/drivers/cdrom/viocd.c	2004-08-24 07:22:47.000000000 +1000
+++ test/drivers/cdrom/viocd.c	2004-10-28 14:21:27.959005552 +1000
@@ -693,7 +693,7 @@
  */
 static struct vio_device_id viocd_device_table[] __devinitdata = {
 	{ "viocd", "" },
-	{ 0, }
+	{ "", "" }
 };
 
 MODULE_DEVICE_TABLE(vio, viocd_device_table);
diff -urN linux-2.5/drivers/char/hvc_console.c test/drivers/char/hvc_console.c
--- linux-2.5/drivers/char/hvc_console.c	2004-10-22 07:00:21.000000000 +1000
+++ test/drivers/char/hvc_console.c	2004-10-28 14:21:36.504029776 +1000
@@ -581,7 +581,7 @@
 
 static struct vio_device_id hvc_driver_table[] __devinitdata= {
 	{"serial", "hvterm1"},
-	{ NULL, }
+	{ "", "" }
 };
 MODULE_DEVICE_TABLE(vio, hvc_driver_table);
 
diff -urN linux-2.5/drivers/char/hvcs.c test/drivers/char/hvcs.c
--- linux-2.5/drivers/char/hvcs.c	2004-10-22 07:00:21.000000000 +1000
+++ test/drivers/char/hvcs.c	2004-10-28 14:17:51.265058720 +1000
@@ -527,7 +527,7 @@
 
 static struct vio_device_id hvcs_driver_table[] __devinitdata= {
 	{"serial-server", "hvterm2"},
-	{ NULL, }
+	{ "", "" }
 };
 MODULE_DEVICE_TABLE(vio, hvcs_driver_table);
 
diff -urN linux-2.5/drivers/char/viotape.c test/drivers/char/viotape.c
--- linux-2.5/drivers/char/viotape.c	2004-06-30 15:40:03.000000000 +1000
+++ test/drivers/char/viotape.c	2004-10-28 14:22:59.446934232 +1000
@@ -991,7 +991,7 @@
  */
 static struct vio_device_id viotape_device_table[] __devinitdata = {
 	{ "viotape", "" },
-	{ 0, }
+	{ "", "" }
 };
 
 MODULE_DEVICE_TABLE(vio, viotape_device_table);
diff -urN linux-2.5/drivers/net/ibmveth.c test/drivers/net/ibmveth.c
--- linux-2.5/drivers/net/ibmveth.c	2004-09-16 21:51:58.000000000 +1000
+++ test/drivers/net/ibmveth.c	2004-10-28 14:16:32.795007496 +1000
@@ -1125,7 +1125,7 @@
 
 static struct vio_device_id ibmveth_device_table[] __devinitdata= {
 	{ "network", "IBM,l-lan"},
-	{ 0,}
+	{ "",""}
 };
 
 MODULE_DEVICE_TABLE(vio, ibmveth_device_table);
diff -urN linux-2.5/drivers/net/iseries_veth.c test/drivers/net/iseries_veth.c
--- linux-2.5/drivers/net/iseries_veth.c	2004-10-20 21:20:19.000000000 +1000
+++ test/drivers/net/iseries_veth.c	2004-10-28 14:22:59.046995032 +1000
@@ -1353,7 +1353,7 @@
  */
 static struct vio_device_id veth_device_table[] __devinitdata = {
 	{ "vlan", "" },
-	{ NULL, NULL }
+	{ "", "" }
 };
 MODULE_DEVICE_TABLE(vio, veth_device_table);
 
diff -urN linux-2.5/drivers/scsi/ibmvscsi/ibmvscsi.c test/drivers/scsi/ibmvscsi/ibmvscsi.c
--- linux-2.5/drivers/scsi/ibmvscsi/ibmvscsi.c	2004-07-29 07:33:14.000000000 +1000
+++ test/drivers/scsi/ibmvscsi/ibmvscsi.c	2004-10-28 14:22:45.765019224 +1000
@@ -1368,7 +1368,7 @@
  */
 static struct vio_device_id ibmvscsi_device_table[] __devinitdata = {
 	{"vscsi", "IBM,v-scsi"},
-	{0,}
+	{ "", "" }
 };
 
 MODULE_DEVICE_TABLE(vio, ibmvscsi_device_table);
diff -urN linux-2.5/include/asm-ppc64/vio.h test/include/asm-ppc64/vio.h
--- linux-2.5/include/asm-ppc64/vio.h	2004-06-30 15:40:04.000000000 +1000
+++ test/include/asm-ppc64/vio.h	2004-10-28 14:16:32.797007192 +1000
@@ -86,9 +86,10 @@
 
 extern struct bus_type vio_bus_type;
 
+#define VIO_DEVTABLE_PROPERTY_LENGTH 32
 struct vio_device_id {
-	char *type;
-	char *compat;
+	char type[VIO_DEVTABLE_PROPERTY_LENGTH];
+	char compat[VIO_DEVTABLE_PROPERTY_LENGTH];
 };
 
 struct vio_driver {
diff -urN linux-2.5/include/linux/mod_devicetable.h test/include/linux/mod_devicetable.h
--- linux-2.5/include/linux/mod_devicetable.h	2004-02-09 18:25:16.000000000 +1100
+++ test/include/linux/mod_devicetable.h	2004-10-28 14:16:32.798007040 +1000
@@ -164,5 +164,10 @@
 	} devs[PNP_MAX_DEVICES];
 };
 
+#define VIO_DEVTABLE_PROPERTY_LENGTH 32
+struct VIO_device_id {
+	char name[VIO_DEVTABLE_PROPERTY_LENGTH];
+	char compat[VIO_DEVTABLE_PROPERTY_LENGTH];
+};
 
 #endif /* LINUX_MOD_DEVICETABLE_H */


From david at gibson.dropbear.id.au  Thu Oct 28 16:01:51 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 28 Oct 2004 16:01:51 +1000
Subject: [PPC64] Rework ppc64 hugepage code
Message-ID: <20041028060151.GA1680@zax>

Andrew, please apply:

Rework the ppc64 hugepage code.  Instead of using specially marked pmd
entries in the normal pagetables to represent hugepages, use normal
pte_t entries, in a special set of pagetables used for hugepages only.

Using pte_t instead of a special hugepte_t makes the code more similar
to that for other architectures, allowing more possibilities for
consolidating the hugepage code.

Using independent pagetables for the hugepages is also a prerequisite
for moving the hugepages into their own region well outside the normal
user address space.  The restrictions imposed by the powerpc mmu's
segment design mean we probably want to do that in the fairly near
future.
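
As an illustration of how an address is split across the two levels of
the separate hugepage tables, here is a small standalone sketch.  It
mirrors the index macros introduced in hugetlbpage.c by this patch, but
the concrete values are assumptions: 4 KB base pages (PAGE_SHIFT 12),
16MB hugepages (HPAGE_SHIFT 24), and a REGION_MASK covering the top
four address bits.  All SK_/sk_ names are local to the example.

```c
#include <stdint.h>

/* Assumed values; see the lead-in above. */
#define SK_PAGE_SHIFT		12
#define SK_HPAGE_SHIFT		24
#define SK_REGION_MASK		(0xfULL << 60)

/* One PTE page holds 2^(PAGE_SHIFT - 3) eight-byte entries. */
#define SK_HUGEPGDIR_SHIFT	(SK_HPAGE_SHIFT + SK_PAGE_SHIFT - 3)
#define SK_PTRS_PER_HUGEPTE	(1u << 9)
#define SK_PTRS_PER_HUGEPGD	(1u << 10)

/* Top level: which hugepage PTE page covers this address. */
static unsigned int sk_hugepgd_index(uint64_t addr)
{
	return (unsigned int)((addr & ~SK_REGION_MASK) >> SK_HUGEPGDIR_SHIFT);
}

/* Bottom level: which entry within that PTE page. */
static unsigned int sk_hugepte_index(uint64_t addr)
{
	return (unsigned int)((addr >> SK_HPAGE_SHIFT) % SK_PTRS_PER_HUGEPTE);
}
```

With these values each PTE page maps 512 * 16MB = 8GB, and the 1024
top-level entries together cover 8TB of address space.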

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/include/asm-ppc64/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/pgtable.h	2004-10-21 11:55:01.000000000 +1000
+++ working-2.6/include/asm-ppc64/pgtable.h	2004-10-27 12:06:02.635023544 +1000
@@ -98,6 +98,7 @@
 #define _PAGE_BUSY	0x0800 /* software: PTE & hash are busy */ 
 #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */
 #define _PAGE_GROUP_IX  0x7000 /* software: HPTE index within group */
+#define _PAGE_HUGE	0x10000 /* 16MB page */
 /* Bits 0x7000 identify the index within an HPT Group */
 #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
 /* PAGE_MASK gives the right answer below, but only by accident */
@@ -157,19 +158,19 @@
 #endif /* __ASSEMBLY__ */
 
 /* shift to put page number into pte */
-#define PTE_SHIFT (16)
+#define PTE_SHIFT (17)
 
 /* We allow 2^41 bytes of real memory, so we need 29 bits in the PMD
  * to give the PTE page number.  The bottom two bits are for flags. */
 #define PMD_TO_PTEPAGE_SHIFT (2)
 
 #ifdef CONFIG_HUGETLB_PAGE
-#define _PMD_HUGEPAGE	0x00000001U
-#define HUGEPTE_BATCH_SIZE (1<<(HPAGE_SHIFT-PMD_SHIFT))
 
 #ifndef __ASSEMBLY__
 int hash_huge_page(struct mm_struct *mm, unsigned long access,
 		   unsigned long ea, unsigned long vsid, int local);
+
+void hugetlb_mm_free_pgd(struct mm_struct *mm);
 #endif /* __ASSEMBLY__ */
 
 #define HAVE_ARCH_UNMAPPED_AREA
@@ -177,7 +178,7 @@
 #else
 
 #define hash_huge_page(mm,a,ea,vsid,local)	-1
-#define _PMD_HUGEPAGE	0
+#define hugetlb_mm_free_pgd(mm)			do {} while (0)
 
 #endif
 
@@ -213,10 +214,8 @@
 #define pmd_set(pmdp, ptep) 	\
 	(pmd_val(*(pmdp)) = (__ba_to_bpn(ptep) << PMD_TO_PTEPAGE_SHIFT))
 #define pmd_none(pmd)		(!pmd_val(pmd))
-#define	pmd_hugepage(pmd)	(!!(pmd_val(pmd) & _PMD_HUGEPAGE))
-#define	pmd_bad(pmd)		(((pmd_val(pmd)) == 0) || pmd_hugepage(pmd))
-#define	pmd_present(pmd)	((!pmd_hugepage(pmd)) \
-				 && (pmd_val(pmd) & ~_PMD_HUGEPAGE) != 0)
+#define	pmd_bad(pmd)		(pmd_val(pmd) == 0)
+#define	pmd_present(pmd)	(pmd_val(pmd) != 0)
 #define	pmd_clear(pmdp)		(pmd_val(*(pmdp)) = 0)
 #define pmd_page_kernel(pmd)	\
 	(__bpn_to_ba(pmd_val(pmd) >> PMD_TO_PTEPAGE_SHIFT))
@@ -269,6 +268,7 @@
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_HUGE;}
 
 static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
 static inline void pte_cache(pte_t pte)   { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -294,6 +294,8 @@
 	pte_val(pte) |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte) {
 	pte_val(pte) |= _PAGE_ACCESSED; return pte; }
+static inline pte_t pte_mkhuge(pte_t pte) {
+	pte_val(pte) |= _PAGE_HUGE; return pte; }
 
 /* Atomic PTE updates */
 static inline unsigned long pte_update(pte_t *p, unsigned long clr)
@@ -464,6 +466,10 @@
 
 extern void paging_init(void);
 
+struct mmu_gather;
+void hugetlb_free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *prev,
+			   unsigned long start, unsigned long end);
+
 /*
  * This gets called at the end of handling a page fault, when
  * the kernel has put a new PTE into the page table for the process.
Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c	2004-10-27 10:43:46.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c	2004-10-27 12:06:02.637023240 +1000
@@ -27,116 +27,143 @@
 
 #include <linux/sysctl.h>
 
-/* HugePTE layout:
- *
- * 31 30 ... 15 14 13 12 10 9  8  7   6    5    4    3    2    1    0
- * PFN>>12..... -  -  -  -  -  -  HASH_IX....   2ND  HASH RW   -    HG=1
- */
+#define	HUGEPGDIR_SHIFT		(HPAGE_SHIFT + PAGE_SHIFT - 3)
+#define HUGEPGDIR_SIZE		(1UL << HUGEPGDIR_SHIFT)
+#define HUGEPGDIR_MASK		(~(HUGEPGDIR_SIZE-1))
+
+#define HUGEPTE_INDEX_SIZE	9
+#define HUGEPGD_INDEX_SIZE	10
+
+#define PTRS_PER_HUGEPTE	(1 << HUGEPTE_INDEX_SIZE)
+#define PTRS_PER_HUGEPGD	(1 << HUGEPGD_INDEX_SIZE)
 
-#define HUGEPTE_SHIFT	15
-#define _HUGEPAGE_PFN		0xffff8000
-#define _HUGEPAGE_BAD		0x00007f00
-#define _HUGEPAGE_HASHPTE	0x00000008
-#define _HUGEPAGE_SECONDARY	0x00000010
-#define _HUGEPAGE_GROUP_IX	0x000000e0
-#define _HUGEPAGE_HPTEFLAGS	(_HUGEPAGE_HASHPTE | _HUGEPAGE_SECONDARY | \
-				 _HUGEPAGE_GROUP_IX)
-#define _HUGEPAGE_RW		0x00000004
-
-typedef struct {unsigned int val;} hugepte_t;
-#define hugepte_val(hugepte)	((hugepte).val)
-#define __hugepte(x)		((hugepte_t) { (x) } )
-#define hugepte_pfn(x)		\
-	((unsigned long)(hugepte_val(x)>>HUGEPTE_SHIFT) << HUGETLB_PAGE_ORDER)
-#define mk_hugepte(page,wr)	__hugepte( \
-	((page_to_pfn(page)>>HUGETLB_PAGE_ORDER) << HUGEPTE_SHIFT ) \
-	| (!!(wr) * _HUGEPAGE_RW) | _PMD_HUGEPAGE )
-
-#define hugepte_bad(x)	( !(hugepte_val(x) & _PMD_HUGEPAGE) || \
-			  (hugepte_val(x) & _HUGEPAGE_BAD) )
-#define hugepte_page(x)	pfn_to_page(hugepte_pfn(x))
-#define hugepte_none(x)	(!(hugepte_val(x) & _HUGEPAGE_PFN))
-
-
-static void flush_hash_hugepage(mm_context_t context, unsigned long ea,
-				hugepte_t pte, int local);
-
-static inline unsigned int hugepte_update(hugepte_t *p, unsigned int clr,
-					  unsigned int set)
-{
-	unsigned int old, tmp;
-
-	__asm__ __volatile__(
-	"1:	lwarx	%0,0,%3		# pte_update\n\
-	andc	%1,%0,%4 \n\
-	or	%1,%1,%5 \n\
-	stwcx.	%1,0,%3 \n\
-	bne-	1b"
-	: "=&r" (old), "=&r" (tmp), "=m" (*p)
-	: "r" (p), "r" (clr), "r" (set), "m" (*p)
-	: "cc" );
-	return old;
+static inline int hugepgd_index(unsigned long addr)
+{
+	return (addr & ~REGION_MASK) >> HUGEPGDIR_SHIFT;
 }
 
-static inline void set_hugepte(hugepte_t *ptep, hugepte_t pte)
+static pgd_t *hugepgd_offset(struct mm_struct *mm, unsigned long addr)
 {
-	hugepte_update(ptep, ~_HUGEPAGE_HPTEFLAGS,
-		       hugepte_val(pte) & ~_HUGEPAGE_HPTEFLAGS);
+	int index;
+
+	if (! mm->context.huge_pgdir)
+		return NULL;
+
+
+	index = hugepgd_index(addr);
+	BUG_ON(index >= PTRS_PER_HUGEPGD);
+	return mm->context.huge_pgdir + index;
 }
 
-static hugepte_t *hugepte_alloc(struct mm_struct *mm, unsigned long addr)
+static inline pte_t *hugepte_offset(pgd_t *dir, unsigned long addr)
 {
-	pgd_t *pgd;
-	pmd_t *pmd = NULL;
+	int index;
 
-	BUG_ON(!in_hugepage_area(mm->context, addr));
+	if (pgd_none(*dir))
+		return NULL;
 
-	pgd = pgd_offset(mm, addr);
-	pmd = pmd_alloc(mm, pgd, addr);
+	index = (addr >> HPAGE_SHIFT) % PTRS_PER_HUGEPTE;
+	return (pte_t *)pgd_page(*dir) + index;
+}
 
-	/* We shouldn't find a (normal) PTE page pointer here */
-	BUG_ON(!pmd_none(*pmd) && !pmd_hugepage(*pmd));
-	
-	return (hugepte_t *)pmd;
+static pgd_t *hugepgd_alloc(struct mm_struct *mm, unsigned long addr)
+{
+	BUG_ON(! in_hugepage_area(mm->context, addr));
+
+	if (! mm->context.huge_pgdir) {
+		pgd_t *new;
+		spin_unlock(&mm->page_table_lock);
+		/* Don't use pgd_alloc(), because we want __GFP_REPEAT */
+		new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT);
+		BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE));
+		spin_lock(&mm->page_table_lock);
+
+		/*
+		 * Because we dropped the lock, we should re-check the
+		 * entry, as somebody else could have populated it..
+		 */
+		if (mm->context.huge_pgdir)
+			pgd_free(new);
+		else
+			mm->context.huge_pgdir = new;
+	}
+	return hugepgd_offset(mm, addr);
 }
 
-static hugepte_t *hugepte_offset(struct mm_struct *mm, unsigned long addr)
+static pte_t *hugepte_alloc(struct mm_struct *mm, pgd_t *dir,
+			    unsigned long addr)
 {
-	pgd_t *pgd;
-	pmd_t *pmd = NULL;
+	if (! pgd_present(*dir)) {
+		pte_t *new;
 
-	BUG_ON(!in_hugepage_area(mm->context, addr));
+		spin_unlock(&mm->page_table_lock);
+		new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT);
+		BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE));
+		spin_lock(&mm->page_table_lock);
+		/*
+		 * Because we dropped the lock, we should re-check the
+		 * entry, as somebody else could have populated it..
+		 */
+		if (pgd_present(*dir)) {
+			if (new)
+				kmem_cache_free(zero_cache, new);
+		} else {
+			struct page *ptepage;
 
-	pgd = pgd_offset(mm, addr);
-	if (pgd_none(*pgd))
-		return NULL;
+			if (! new)
+				return NULL;
+			ptepage = virt_to_page(new);
+			ptepage->mapping = (void *) mm;
+			ptepage->index = addr & HUGEPGDIR_MASK;
+			pgd_populate(mm, dir, new);
+		}
+	}
 
-	pmd = pmd_offset(pgd, addr);
+	return hugepte_offset(dir, addr);
+}
+
+static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgd;
 
-	/* We shouldn't find a (normal) PTE page pointer here */
-	BUG_ON(!pmd_none(*pmd) && !pmd_hugepage(*pmd));
+	BUG_ON(! in_hugepage_area(mm->context, addr));
 
-	return (hugepte_t *)pmd;
+	pgd = hugepgd_offset(mm, addr);
+	if (! pgd)
+		return NULL;
+
+	return hugepte_offset(pgd, addr);
 }
 
-static void setup_huge_pte(struct mm_struct *mm, struct page *page,
-			   hugepte_t *ptep, int write_access)
+static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
 {
-	hugepte_t entry;
-	int i;
+	pgd_t *pgd;
 
-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
-	entry = mk_hugepte(page, write_access);
-	for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
-		set_hugepte(ptep+i, entry);
+	BUG_ON(! in_hugepage_area(mm->context, addr));
+
+	pgd = hugepgd_alloc(mm, addr);
+	if (! pgd)
+		return NULL;
+
+	return hugepte_alloc(mm, pgd, addr);
 }
 
-static void teardown_huge_pte(hugepte_t *ptep)
+static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
+			 struct page *page, pte_t *ptep, int write_access)
 {
-	int i;
+	pte_t entry;
 
-	for (i = 0; i < HUGEPTE_BATCH_SIZE; i++)
-		pmd_clear((pmd_t *)(ptep+i));
+	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+	if (write_access) {
+		entry =
+		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
+	} else {
+		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
+	}
+	entry = pte_mkyoung(entry);
+	entry = pte_mkhuge(entry);
+
+	set_pte(ptep, entry);
 }
 
 /*
@@ -268,34 +295,31 @@
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma)
 {
-	hugepte_t *src_pte, *dst_pte, entry;
+	pte_t *src_pte, *dst_pte, entry;
 	struct page *ptepage;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	int err = -ENOMEM;
 
 	while (addr < end) {
-		BUG_ON(! in_hugepage_area(src->context, addr));
-		BUG_ON(! in_hugepage_area(dst->context, addr));
-
-		dst_pte = hugepte_alloc(dst, addr);
+		dst_pte = huge_pte_alloc(dst, addr);
 		if (!dst_pte)
-			return -ENOMEM;
+			goto out;
 
-		src_pte = hugepte_offset(src, addr);
+		src_pte = huge_pte_offset(src, addr);
 		entry = *src_pte;
 		
-		if ((addr % HPAGE_SIZE) == 0) {
-			/* This is the first hugepte in a batch */
-			ptepage = hugepte_page(entry);
-			get_page(ptepage);
-			dst->rss += (HPAGE_SIZE / PAGE_SIZE);
-		}
-		set_hugepte(dst_pte, entry);
-
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		set_pte(dst_pte, entry);
 
-		addr += PMD_SIZE;
+		addr += HPAGE_SIZE;
 	}
-	return 0;
+
+	err = 0;
+ out:
+	return err;
 }
 
 int
@@ -310,18 +334,16 @@
 
 	vpfn = vaddr/PAGE_SIZE;
 	while (vaddr < vma->vm_end && remainder) {
-		BUG_ON(!in_hugepage_area(mm->context, vaddr));
-
 		if (pages) {
-			hugepte_t *pte;
+			pte_t *pte;
 			struct page *page;
 
-			pte = hugepte_offset(mm, vaddr);
+			pte = huge_pte_offset(mm, vaddr);
 
 			/* hugetlb should be locked, and hence, prefaulted */
-			WARN_ON(!pte || hugepte_none(*pte));
+			WARN_ON(!pte || pte_none(*pte));
 
-			page = &hugepte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)];
+			page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)];
 
 			WARN_ON(!PageCompound(page));
 
@@ -347,26 +369,31 @@
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
-	return ERR_PTR(-EINVAL);
+	pte_t *ptep;
+	struct page *page;
+
+	if (! in_hugepage_area(mm->context, address))
+		return ERR_PTR(-EINVAL);
+
+	ptep = huge_pte_offset(mm, address);
+	page = pte_page(*ptep);
+	if (page)
+		page += (address % HPAGE_SIZE) / PAGE_SIZE;
+
+	return page;
 }
 
 int pmd_huge(pmd_t pmd)
 {
-	return pmd_hugepage(pmd);
+	return 0;
 }
 
 struct page *
 follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 		pmd_t *pmd, int write)
 {
-	struct page *page;
-
-	BUG_ON(! pmd_hugepage(*pmd));
-
-	page = hugepte_page(*(hugepte_t *)pmd);
-	if (page)
-		page += ((address & ~HPAGE_MASK) >> PAGE_SHIFT);
-	return page;
+	BUG();
+	return NULL;
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma,
@@ -374,44 +401,38 @@
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long addr;
-	hugepte_t *ptep;
+	pte_t *ptep;
 	struct page *page;
-	int cpu;
-	int local = 0;
-	cpumask_t tmp;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON((start % HPAGE_SIZE) != 0);
 	BUG_ON((end % HPAGE_SIZE) != 0);
 
-	/* XXX are there races with checking cpu_vm_mask? - Anton */
-	cpu = get_cpu();
-	tmp = cpumask_of_cpu(cpu);
-	if (cpus_equal(vma->vm_mm->cpu_vm_mask, tmp))
-		local = 1;
-
 	for (addr = start; addr < end; addr += HPAGE_SIZE) {
-		hugepte_t pte;
-
-		BUG_ON(!in_hugepage_area(mm->context, addr));
+		pte_t pte;
 
-		ptep = hugepte_offset(mm, addr);
-		if (!ptep || hugepte_none(*ptep))
+		ptep = huge_pte_offset(mm, addr);
+		if (!ptep || pte_none(*ptep))
 			continue;
 
 		pte = *ptep;
-		page = hugepte_page(pte);
-		teardown_huge_pte(ptep);
-		
-		if (hugepte_val(pte) & _HUGEPAGE_HASHPTE)
-			flush_hash_hugepage(mm->context, addr,
-					    pte, local);
+		page = pte_page(pte);
+		pte_clear(ptep);
 
 		put_page(page);
 	}
-	put_cpu();
-
 	mm->rss -= (end - start) >> PAGE_SHIFT;
+	flush_tlb_pending();
+}
+
+void hugetlb_free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *prev,
+			   unsigned long start, unsigned long end)
+{
+	/* Because the huge pgtables are only 2 level, they can take
+	 * at most around 4M, much less than one hugepage which the
+	 * process is presumably entitled to use.  So we don't bother
+	 * freeing up the pagetables on unmap, and wait until
+	 * destroy_context() to clean up the lot. */
 }
 
 int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
@@ -427,16 +448,14 @@
 	spin_lock(&mm->page_table_lock);
 	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		unsigned long idx;
-		hugepte_t *pte = hugepte_alloc(mm, addr);
+		pte_t *pte = huge_pte_alloc(mm, addr);
 		struct page *page;
 
-		BUG_ON(!in_hugepage_area(mm->context, addr));
-
 		if (!pte) {
 			ret = -ENOMEM;
 			goto out;
 		}
-		if (!hugepte_none(*pte))
+		if (! pte_none(*pte))
 			continue;
 
 		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
@@ -463,7 +482,7 @@
 				goto out;
 			}
 		}
-		setup_huge_pte(mm, page, pte, vma->vm_flags & VM_WRITE);
+		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
 	}
 out:
 	spin_unlock(&mm->page_table_lock);
@@ -717,20 +736,55 @@
 	}
 }
 
+void hugetlb_mm_free_pgd(struct mm_struct *mm)
+{
+	int i;
+	pgd_t *pgdir;
+
+	spin_lock(&mm->page_table_lock);
+
+	pgdir = mm->context.huge_pgdir;
+	if (! pgdir)
+		goto out;
+
+	mm->context.huge_pgdir = NULL;
+
+	/* cleanup any hugepte pages leftover */
+	for (i = 0; i < PTRS_PER_HUGEPGD; i++) {
+		pgd_t *pgd = pgdir + i;
+
+		if (! pgd_none(*pgd)) {
+			pte_t *pte = (pte_t *)pgd_page(*pgd);
+			struct page *ptepage = virt_to_page(pte);
+
+			ptepage->mapping = NULL;
+
+			BUG_ON(memcmp(pte, empty_zero_page, PAGE_SIZE));
+			kmem_cache_free(zero_cache, pte);
+		}
+		pgd_clear(pgd);
+	}
+
+	BUG_ON(memcmp(pgdir, empty_zero_page, PAGE_SIZE));
+	kmem_cache_free(zero_cache, pgdir);
+ out:
+	spin_unlock(&mm->page_table_lock);
+}
+
 int hash_huge_page(struct mm_struct *mm, unsigned long access,
 		   unsigned long ea, unsigned long vsid, int local)
 {
-	hugepte_t *ptep;
+	pte_t *ptep;
 	unsigned long va, vpn;
 	int is_write;
-	hugepte_t old_pte, new_pte;
-	unsigned long hpteflags, prpn, flags;
+	pte_t old_pte, new_pte;
+	unsigned long hpteflags, prpn;
 	long slot;
+	int err = 1;
+
+	spin_lock(&mm->page_table_lock);
 
-	/* We have to find the first hugepte in the batch, since
-	 * that's the one that will store the HPTE flags */
-	ea &= HPAGE_MASK;
-	ptep = hugepte_offset(mm, ea);
+	ptep = huge_pte_offset(mm, ea);
 
 	/* Search the Linux page table for a match with va */
 	va = (vsid << 28) | (ea & 0x0fffffff);
@@ -740,19 +794,18 @@
 	 * If no pte found or not present, send the problem up to
 	 * do_page_fault
 	 */
-	if (unlikely(!ptep || hugepte_none(*ptep)))
-		return 1;
+	if (unlikely(!ptep || pte_none(*ptep)))
+		goto out;
 
-	BUG_ON(hugepte_bad(*ptep));
+/* 	BUG_ON(pte_bad(*ptep)); */
 
 	/* 
 	 * Check the user's access rights to the page.  If access should be
 	 * prevented then send the problem up to do_page_fault.
 	 */
 	is_write = access & _PAGE_RW;
-	if (unlikely(is_write && !(hugepte_val(*ptep) & _HUGEPAGE_RW)))
-		return 1;
-
+	if (unlikely(is_write && !(pte_val(*ptep) & _PAGE_RW)))
+		goto out;
 	/*
 	 * At this point, we have a pte (old_pte) which can be used to build
 	 * or update an HPTE. There are 2 cases:
@@ -765,41 +818,40 @@
 	 *	page is currently not DIRTY. 
 	 */
 
-	spin_lock_irqsave(&mm->page_table_lock, flags);
 
 	old_pte = *ptep;
 	new_pte = old_pte;
 
-	hpteflags = 0x2 | (! (hugepte_val(new_pte) & _HUGEPAGE_RW));
+	hpteflags = 0x2 | (! (pte_val(new_pte) & _PAGE_RW));
 
 	/* Check if pte already has an hpte (case 2) */
-	if (unlikely(hugepte_val(old_pte) & _HUGEPAGE_HASHPTE)) {
+	if (unlikely(pte_val(old_pte) & _PAGE_HASHPTE)) {
 		/* There MIGHT be an HPTE for this pte */
 		unsigned long hash, slot;
 
 		hash = hpt_hash(vpn, 1);
-		if (hugepte_val(old_pte) & _HUGEPAGE_SECONDARY)
+		if (pte_val(old_pte) & _PAGE_SECONDARY)
 			hash = ~hash;
 		slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP;
-		slot += (hugepte_val(old_pte) & _HUGEPAGE_GROUP_IX) >> 5;
+		slot += (pte_val(old_pte) & _PAGE_GROUP_IX) >> 12;
 
 		if (ppc_md.hpte_updatepp(slot, hpteflags, va, 1, local) == -1)
-			hugepte_val(old_pte) &= ~_HUGEPAGE_HPTEFLAGS;
+			pte_val(old_pte) &= ~_PAGE_HPTEFLAGS;
 	}
 
-	if (likely(!(hugepte_val(old_pte) & _HUGEPAGE_HASHPTE))) {
+	if (likely(!(pte_val(old_pte) & _PAGE_HASHPTE))) {
 		unsigned long hash = hpt_hash(vpn, 1);
 		unsigned long hpte_group;
 
-		prpn = hugepte_pfn(old_pte);
+		prpn = pte_pfn(old_pte);
 
 repeat:
 		hpte_group = ((hash & htab_data.htab_hash_mask) *
 			      HPTES_PER_GROUP) & ~0x7UL;
 
 		/* Update the linux pte with the HPTE slot */
-		hugepte_val(new_pte) &= ~_HUGEPAGE_HPTEFLAGS;
-		hugepte_val(new_pte) |= _HUGEPAGE_HASHPTE;
+		pte_val(new_pte) &= ~_PAGE_HPTEFLAGS;
+		pte_val(new_pte) |= _PAGE_HASHPTE;
 
 		/* Add in WIMG bits */
 		/* XXX We should store these in the pte */
@@ -810,7 +862,7 @@
 
 		/* Primary is full, try the secondary */
 		if (unlikely(slot == -1)) {
-			hugepte_val(new_pte) |= _HUGEPAGE_SECONDARY;
+			pte_val(new_pte) |= _PAGE_SECONDARY;
 			hpte_group = ((~hash & htab_data.htab_hash_mask) *
 				      HPTES_PER_GROUP) & ~0x7UL; 
 			slot = ppc_md.hpte_insert(hpte_group, va, prpn,
@@ -827,39 +879,20 @@
 		if (unlikely(slot == -2))
 			panic("hash_huge_page: pte_insert failed\n");
 
-		hugepte_val(new_pte) |= (slot<<5) & _HUGEPAGE_GROUP_IX;
+		pte_val(new_pte) |= (slot<<12) & _PAGE_GROUP_IX;
 
 		/* 
 		 * No need to use ldarx/stdcx here because all who
 		 * might be updating the pte will hold the
-		 * page_table_lock or the hash_table_lock
-		 * (we hold both)
+		 * page_table_lock 
 		 */
 		*ptep = new_pte;
 	}
 
-	spin_unlock_irqrestore(&mm->page_table_lock, flags);
-
-	return 0;
-}
-
-static void flush_hash_hugepage(mm_context_t context, unsigned long ea,
-				hugepte_t pte, int local)
-{
-	unsigned long vsid, vpn, va, hash, slot;
-
-	BUG_ON(hugepte_bad(pte));
-	BUG_ON(!in_hugepage_area(context, ea));
-
-	vsid = get_vsid(context.id, ea);
+	err = 0;
 
-	va = (vsid << 28) | (ea & 0x0fffffff);
-	vpn = va >> HPAGE_SHIFT;
-	hash = hpt_hash(vpn, 1);
-	if (hugepte_val(pte) & _HUGEPAGE_SECONDARY)
-		hash = ~hash;
-	slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP;
-	slot += (hugepte_val(pte) & _HUGEPAGE_GROUP_IX) >> 5;
+ out:
+	spin_unlock(&mm->page_table_lock);
 
-	ppc_md.hpte_invalidate(slot, va, 1, local);
+	return err;
 }
Index: working-2.6/include/asm-ppc64/mmu.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu.h	2004-10-05 10:08:10.000000000 +1000
+++ working-2.6/include/asm-ppc64/mmu.h	2004-10-27 12:06:02.638023088 +1000
@@ -24,6 +24,7 @@
 typedef struct {
 	mm_context_id_t id;
 #ifdef CONFIG_HUGETLB_PAGE
+	pgd_t *huge_pgdir;
 	u16 htlb_segs; /* bitmask */
 #endif
 } mm_context_t;
Index: working-2.6/include/asm-ppc64/page.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/page.h	2004-09-20 10:12:50.000000000 +1000
+++ working-2.6/include/asm-ppc64/page.h	2004-10-27 12:06:02.638023088 +1000
@@ -64,7 +64,6 @@
 #define is_hugepage_only_range(addr, len) \
 	(touches_hugepage_high_range((addr), (len)) || \
 	  touches_hugepage_low_range((addr), (len)))
-#define hugetlb_free_pgtables free_pgtables
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
 
 #define in_hugepage_area(context, addr) \
Index: working-2.6/arch/ppc64/mm/init.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/init.c	2004-10-27 10:43:46.000000000 +1000
+++ working-2.6/arch/ppc64/mm/init.c	2004-10-27 12:06:02.639022936 +1000
@@ -478,6 +478,12 @@
 	int index;
 	int err;
 
+#ifdef CONFIG_HUGETLB_PAGE
+	/* We leave htlb_segs as it was, but for a fork, we need to
+	 * clear the huge_pgdir. */
+	mm->context.huge_pgdir = NULL;
+#endif
+
 again:
 	if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL))
 		return -ENOMEM;
@@ -508,6 +514,8 @@
 	spin_unlock(&mmu_context_lock);
 
 	mm->context.id = NO_CONTEXT;
+
+	hugetlb_mm_free_pgd(mm);
 }
 
 static int __init mmu_context_init(void)
Index: working-2.6/arch/ppc64/mm/hash_utils.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hash_utils.c	2004-10-27 10:43:46.000000000 +1000
+++ working-2.6/arch/ppc64/mm/hash_utils.c	2004-10-27 12:06:02.640022784 +1000
@@ -341,9 +341,7 @@
 		     int local)
 {
 	unsigned long vsid, vpn, va, hash, secondary, slot;
-
-	/* XXX fix for large ptes */
-	unsigned long large = 0;
+	unsigned long huge = pte_huge(pte);
 
 	if ((ea >= USER_START) && (ea <= USER_END))
 		vsid = get_vsid(context, ea);
@@ -351,18 +349,18 @@
 		vsid = get_kernel_vsid(ea);
 
 	va = (vsid << 28) | (ea & 0x0fffffff);
-	if (large)
+	if (huge)
 		vpn = va >> HPAGE_SHIFT;
 	else
 		vpn = va >> PAGE_SHIFT;
-	hash = hpt_hash(vpn, large);
+	hash = hpt_hash(vpn, huge);
 	secondary = (pte_val(pte) & _PAGE_SECONDARY) >> 15;
 	if (secondary)
 		hash = ~hash;
 	slot = (hash & htab_data.htab_hash_mask) * HPTES_PER_GROUP;
 	slot += (pte_val(pte) & _PAGE_GROUP_IX) >> 12;
 
-	ppc_md.hpte_invalidate(slot, va, large, local);
+	ppc_md.hpte_invalidate(slot, va, huge, local);
 }
 
 void flush_hash_range(unsigned long context, unsigned long number, int local)
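
Annotation: hugepgd_alloc() and hugepte_alloc() in the patch above both
drop page_table_lock around an allocation that may sleep, retake it, and
then re-check whether another thread populated the entry in the meantime.
A minimal userspace sketch of that pattern follows. It assumes a pthread
mutex in place of the spinlock and calloc() in place of the zeroed slab
cache; struct ctx and ctx_get_table() are hypothetical names, not kernel
interfaces.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for mm_struct plus its lazily created table. */
struct ctx {
	pthread_mutex_t lock;
	void *table;		/* protected by lock */
};

/*
 * Return the table, allocating it on first use.
 * The caller holds c->lock on entry and still holds it on return.
 */
void *ctx_get_table(struct ctx *c, size_t size)
{
	assert(size > 0);

	if (!c->table) {
		void *new;

		/* The allocation may block, so drop the lock around it. */
		pthread_mutex_unlock(&c->lock);
		new = calloc(1, size);
		pthread_mutex_lock(&c->lock);

		/*
		 * Re-check: someone else may have installed a table while
		 * the lock was released.  free(NULL) is a no-op, so a
		 * failed allocation needs no special case here.
		 */
		if (c->table)
			free(new);
		else
			c->table = new;	/* stays NULL if calloc() failed */
	}
	return c->table;
}
```

The loser of the race frees its own allocation rather than overwriting
the winner's entry, which is what keeps the table pointer stable once it
has been published.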


-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From akpm at osdl.org  Thu Oct 28 16:41:09 2004
From: akpm at osdl.org (Andrew Morton)
Date: Wed, 27 Oct 2004 23:41:09 -0700
Subject: [PATCH] Make key management syscalls work on PPC/PPC64
In-Reply-To: <24857.1098904121@redhat.com>
References: <24857.1098904121@redhat.com>
Message-ID: <20041027234109.19b39e93.akpm@osdl.org>

David Howells <dhowells at redhat.com> wrote:
>
> The attached patch permits my key management stuff to be used on PPC, PPC64
>  and PPC on PPC64.

Please remember to test your patches with CONFIG_KEYS=n

--- 25-power4/kernel/sys.c~ppc-ppc64-make-key-management-syscalls-work-fix	2004-10-27 23:26:16.330512080 -0700
+++ 25-power4-akpm/kernel/sys.c	2004-10-27 23:27:04.516186744 -0700
@@ -286,6 +286,7 @@ cond_syscall(compat_set_mempolicy)
 cond_syscall(sys_add_key)
 cond_syscall(sys_request_key)
 cond_syscall(sys_keyctl)
+cond_syscall(compat_keyctl)
 cond_syscall(compat_sys_socketcall)
 
 /* arch-specific weak syscall entries */
_
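
Annotation: the one-line fix above matters because, with CONFIG_KEYS=n,
nothing defines compat_keyctl, yet the syscall table still refers to it.
cond_syscall() supplies a weak fallback so the reference resolves to the
"not implemented" handler. The sketch below shows the same idea in
userspace, assuming GCC or Clang on an ELF target; every demo_* name is
hypothetical and this is not the kernel's actual macro.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's "not implemented" syscall handler. */
long demo_ni_syscall(void)
{
	return -ENOSYS;
}

/*
 * Hypothetical macro modelled on cond_syscall(): declare `name` as a
 * weak alias of demo_ni_syscall.  A strong definition in another
 * object file, if one is linked in, takes precedence.
 */
#define demo_cond_syscall(name) \
	long name(void) __attribute__((weak, alias("demo_ni_syscall")))

/* No strong definition exists in this build (think CONFIG_KEYS=n). */
demo_cond_syscall(demo_compat_keyctl);

/* The table can reference the symbol unconditionally and still link. */
long (*const demo_table[])(void) = { demo_compat_keyctl };

long demo_dispatch(unsigned int nr)
{
	assert(nr < sizeof(demo_table) / sizeof(demo_table[0]));
	return demo_table[nr]();
}
```

Calling demo_dispatch(0) in this build returns -ENOSYS; without the
demo_cond_syscall() line the table initialiser would fail to link.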


From kaos at sgi.com  Thu Oct 28 16:30:15 2004
From: kaos at sgi.com (Keith Owens)
Date: Thu, 28 Oct 2004 16:30:15 +1000
Subject: [PATCH] add syslog printing to xmon debugger. 
In-Reply-To: Your message of "Fri, 22 Oct 2004 11:59:17 +1000."
	<16760.26997.131687.456670@cargo.ozlabs.ibm.com> 
Message-ID: <5227.1098945015@kao2.melbourne.sgi.com>

On Fri, 22 Oct 2004 11:59:17 +1000, 
Paul Mackerras <paulus at samba.org> wrote:
>Linas,
>
>> Andrew,
>> 
>> Please apply at least the kernel/printk.c part of the patch,
>> if you are feeling at all charitable.
>
>Did you ever get any reaction to that?

I see that the printk.c patch was lifted straight from kdb - without
any mention of kdb.  It even has the same bug as kdb, which was
corrected in kdb-v4.4-2.6.9-common-2.  The current kdb patch to
printk.c is :-

Index: linux/kernel/printk.c
===================================================================
--- linux.orig/kernel/printk.c	Tue Oct 19 07:55:35 2004
+++ linux/kernel/printk.c	Thu Oct 21 18:06:28 2004
@@ -373,6 +373,20 @@ out:
 	return error;
 }
 
+#ifdef	CONFIG_KDB
+/* kdb dmesg command needs access to the syslog buffer.  do_syslog() uses locks
+ * so it cannot be used during debugging.  Just tell kdb where the start and
+ * end of the physical and logical logs are.  This is equivalent to do_syslog(3).
+ */
+void kdb_syslog_data(char *syslog_data[4])
+{
+	syslog_data[0] = log_buf;
+	syslog_data[1] = log_buf + log_buf_len;
+	syslog_data[2] = log_buf + log_end - (logged_chars < log_buf_len ? logged_chars : log_buf_len);
+	syslog_data[3] = log_buf + log_end;
+}
+#endif	/* CONFIG_KDB */
+
 asmlinkage long sys_syslog(int type, char __user * buf, int len)
 {
 	return do_syslog(type, buf, len);
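
Annotation: kdb_syslog_data() hands back the physical bounds of the log
buffer plus a logical start and end, and the logical pair can run past
the physical end once the ring has wrapped. A reader therefore has to
reduce each position modulo the buffer size. The sketch below models
that with offsets rather than pointers; struct ring_log and
ring_log_copy() are hypothetical names, and this is an illustration
under those assumptions, not kdb's own dmesg code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a printk-style ring buffer. */
struct ring_log {
	const char *buf;	/* physical start of the buffer */
	size_t len;		/* physical size in bytes */
	size_t log_end;		/* total bytes ever written, never wrapped */
	size_t logged_chars;	/* bytes written since the last clear */
};

/*
 * Copy the oldest-to-newest contents into out, at most out_len bytes.
 * Returns the number of bytes copied.
 */
size_t ring_log_copy(const struct ring_log *rl, char *out, size_t out_len)
{
	size_t avail, start, n, i;

	assert(rl->len > 0);

	/* Same clamp as syslog_data[2]: at most one buffer's worth. */
	avail = rl->logged_chars < rl->len ? rl->logged_chars : rl->len;
	start = rl->log_end - avail;	/* logical start, may exceed len */
	n = avail < out_len ? avail : out_len;

	for (i = 0; i < n; i++)
		out[i] = rl->buf[(start + i) % rl->len];	/* wrap here */

	return n;
}
```

For an 8-byte buffer that has had "ABCDEFGHIJ" written to it, the
physical contents are "IJCDEFGH" and the copy yields "CDEFGHIJ".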


From sfr at canb.auug.org.au  Thu Oct 28 18:23:58 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Thu, 28 Oct 2004 18:23:58 +1000
Subject: [PATCH] ppc64 iSeries: fix for generic irq changes
Message-ID: <20041028182358.6b69eeac.sfr@canb.auug.org.au>

Hi Andrew,

The generic irq patches broke pci irqs on ppc64 iSeries.

Signed-off-by: Stephen Rothwell <sfr at canb.auug.org.au>

Please merge and send to Linus.
-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruN 2.6.10-rc1-bk6/arch/ppc64/kernel/iSeries_irq.c 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_irq.c
--- 2.6.10-rc1-bk6/arch/ppc64/kernel/iSeries_irq.c	2004-05-10 15:31:04.000000000 +1000
+++ 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_irq.c	2004-10-28 18:06:30.000000000 +1000
@@ -110,6 +110,7 @@
 	/* Unmask bridge interrupts in the FISR */
 	mask = 0x01010000 << function;
 	HvCallPci_unmaskFisr(bus, subBus, deviceId, mask);
+	iSeries_enable_IRQ(irq);
 	return 0;
 }
 

From sfr at canb.auug.org.au  Fri Oct 29 02:42:51 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Fri, 29 Oct 2004 02:42:51 +1000
Subject: [PATCH] ppc64 iSeries pci cleanups
Message-ID: <20041029024251.4cf06de2.sfr@canb.auug.org.au>

Hi Andrew,

This patch removes two files (iSeries_IoMmTable.[ch]) by merging them into
iSeries_pci.c.  This allowed quite a few more things to become declared
static.  It then does some fairly mechanical cleanups in iSeries_pci.c
(replacing studly caps, removing the last of the PCIFR() macros and
removing a couple of empty or unused routines).  There are no semantic
changes.
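
Annotation: for readers unfamiliar with the table being moved, it covers
a 4GB pseudo I/O window as 1024 entries of 4MB each, starting at
0xE000000000000000, and a BAR consumes as many whole entries as its size
requires. The sketch below works through that arithmetic using the
constants from the patch. The helper names iomm_index() and
iomm_entries_for_bar() are hypothetical, and the address-to-index
lookup is an assumption inferred from the table layout, since the
inlined lookup itself is not part of this excerpt.

```c
#include <assert.h>
#include <stdint.h>

#define IOMM_TABLE_MAX_ENTRIES	1024
#define IOMM_TABLE_ENTRY_SIZE	0x0000000000400000ULL	/* 4MB */
#define BASE_IO_MEMORY		0xE000000000000000ULL

/*
 * Table index for a pseudo I/O address, or -1 if the address lies
 * outside the 4GB window covered by the table.
 */
long iomm_index(uint64_t addr)
{
	uint64_t idx;

	if (addr < BASE_IO_MEMORY)
		return -1;
	idx = (addr - BASE_IO_MEMORY) / IOMM_TABLE_ENTRY_SIZE;
	if (idx >= IOMM_TABLE_MAX_ENTRIES)
		return -1;
	return (long)idx;
}

/*
 * Number of table entries a BAR of bar_size bytes consumes: the size
 * rounded up to a multiple of the entry size, as the allocation loop
 * in the removed file does by subtracting one entry per iteration.
 */
long iomm_entries_for_bar(uint64_t bar_size)
{
	assert(bar_size <= (uint64_t)IOMM_TABLE_MAX_ENTRIES *
			   IOMM_TABLE_ENTRY_SIZE);
	return (long)((bar_size + IOMM_TABLE_ENTRY_SIZE - 1) /
		      IOMM_TABLE_ENTRY_SIZE);
}
```

A zero-length BAR consumes no entries, which matches the early return
in the allocation routine.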

Signed-off-by: Stephen Rothwell <sfr at canb.auug.org.au>

Please apply and send to Linus.
-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruN 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/Makefile 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/Makefile
--- 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/Makefile	2004-10-28 14:18:05.000000000 +1000
+++ 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/Makefile	2004-10-28 14:50:05.000000000 +1000
@@ -15,8 +15,7 @@
 
 obj-$(CONFIG_PPC_OF) +=	of_device.o
 
-pci-obj-$(CONFIG_PPC_ISERIES)	+= iSeries_pci.o iSeries_pci_reset.o \
-				     iSeries_IoMmTable.o
+pci-obj-$(CONFIG_PPC_ISERIES)	+= iSeries_pci.o iSeries_pci_reset.o
 pci-obj-$(CONFIG_PPC_MULTIPLATFORM)	+= pci_dn.o pci_dma_direct.o
 
 obj-$(CONFIG_PCI)	+= pci.o pci_iommu.o iomap.o $(pci-obj-y)
diff -ruN 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_IoMmTable.c 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/iSeries_IoMmTable.c
--- 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_IoMmTable.c	2004-02-04 17:24:34.000000000 +1100
+++ 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/iSeries_IoMmTable.c	1970-01-01 10:00:00.000000000 +1000
@@ -1,169 +0,0 @@
-#define PCIFR(...)
-/************************************************************************/
-/* This module supports the iSeries I/O Address translation mapping     */
-/* Copyright (C) 20yy  <Allan H Trautman> <IBM Corp>                    */
-/*                                                                      */
-/* This program is free software; you can redistribute it and/or modify */
-/* it under the terms of the GNU General Public License as published by */
-/* the Free Software Foundation; either version 2 of the License, or    */
-/* (at your option) any later version.                                  */
-/*                                                                      */
-/* This program is distributed in the hope that it will be useful,      */ 
-/* but WITHOUT ANY WARRANTY; without even the implied warranty of       */
-/* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the        */
-/* GNU General Public License for more details.                         */
-/*                                                                      */
-/* You should have received a copy of the GNU General Public License    */ 
-/* along with this program; if not, write to the:                       */
-/* Free Software Foundation, Inc.,                                      */ 
-/* 59 Temple Place, Suite 330,                                          */ 
-/* Boston, MA  02111-1307  USA                                          */
-/************************************************************************/
-/* Change Activity:                                                     */
-/*   Created, December 14, 2000                                         */
-/*   Added Bar table for IoMm performance.                              */
-/*   Ported to ppc64                                                    */
-/*   Added dynamic table allocation                                     */
-/* End Change Activity                                                  */
-/************************************************************************/
-#include <asm/types.h>
-#include <asm/resource.h>
-#include <linux/pci.h>
-#include <linux/spinlock.h>
-#include <asm/ppcdebug.h>
-#include <asm/iSeries/HvCallPci.h>
-#include <asm/iSeries/iSeries_pci.h>
-
-#include "iSeries_IoMmTable.h"
-#include "pci.h"
-
-/*
- * Table defines
- * Each Entry size is 4 MB * 1024 Entries = 4GB I/O address space.
- */
-#define Max_Entries 1024
-unsigned long iSeries_IoMmTable_Entry_Size = 0x0000000000400000; 
-unsigned long iSeries_Base_Io_Memory       = 0xE000000000000000;
-unsigned long iSeries_Max_Io_Memory        = 0xE000000000000000;
-static   long iSeries_CurrentIndex         = 0;
-
-/*
- * Lookup Tables.
- */
-struct iSeries_Device_Node **iSeries_IoMmTable;
-u8 *iSeries_IoBarTable;
-
-/*
- * Static and Global variables
- */
-static char *iSeriesPciIoText = "iSeries PCI I/O";
-static spinlock_t iSeriesIoMmTableLock = SPIN_LOCK_UNLOCKED;
-
-/*
- * iSeries_IoMmTable_Initialize
- *
- * Allocates and initalizes the Address Translation Table and Bar
- * Tables to get them ready for use.  Must be called before any
- * I/O space is handed out to the device BARs.
- * A follow up method,iSeries_IoMmTable_Status can be called to
- * adjust the table after the device BARs have been assiged to
- * resize the table.
- */
-void iSeries_IoMmTable_Initialize(void)
-{
-	spin_lock(&iSeriesIoMmTableLock);
-	iSeries_IoMmTable  = kmalloc(sizeof(void *) * Max_Entries, GFP_KERNEL);
-	iSeries_IoBarTable = kmalloc(sizeof(u8) * Max_Entries, GFP_KERNEL);
-	spin_unlock(&iSeriesIoMmTableLock);
-	PCIFR("IoMmTable Initialized 0x%p", iSeries_IoMmTable);
-	if ((iSeries_IoMmTable == NULL) || (iSeries_IoBarTable == NULL))
-		panic("PCI: I/O tables allocation failed.\n");
-}
-
-/*
- * iSeries_IoMmTable_AllocateEntry
- *
- * Adds pci_dev entry in address translation table
- *
- * - Allocates the number of entries required in table base on BAR
- *   size.
- * - Allocates starting at iSeries_Base_Io_Memory and increases.
- * - The size is round up to be a multiple of entry size.
- * - CurrentIndex is incremented to keep track of the last entry.
- * - Builds the resource entry for allocated BARs.
- */
-static void iSeries_IoMmTable_AllocateEntry(struct pci_dev *PciDev,
-		int BarNumber)
-{
-	struct resource *BarResource = &PciDev->resource[BarNumber];
-	long BarSize = pci_resource_len(PciDev, BarNumber);
-
-	/*
-	 * No space to allocate, quick exit, skip Allocation.
-	 */
-	if (BarSize == 0)
-		return;
-	/*
-	 * Set Resource values.
-	 */
-	spin_lock(&iSeriesIoMmTableLock);
-	BarResource->name = iSeriesPciIoText;
-	BarResource->start =
-		iSeries_IoMmTable_Entry_Size * iSeries_CurrentIndex;
-	BarResource->start += iSeries_Base_Io_Memory;
-	BarResource->end = BarResource->start+BarSize-1;
-	/*
-	 * Allocate the number of table entries needed for BAR.
-	 */
-	while (BarSize > 0 ) {
-		*(iSeries_IoMmTable + iSeries_CurrentIndex) =
-			(struct iSeries_Device_Node *)PciDev->sysdata;
-		*(iSeries_IoBarTable + iSeries_CurrentIndex) = BarNumber;
-		BarSize -= iSeries_IoMmTable_Entry_Size;
-		++iSeries_CurrentIndex;
-	}
-	iSeries_Max_Io_Memory = iSeries_Base_Io_Memory +
-		(iSeries_IoMmTable_Entry_Size * iSeries_CurrentIndex);
-	spin_unlock(&iSeriesIoMmTableLock);
-}
-
-/*
- * iSeries_allocateDeviceBars
- *
- * - Allocates ALL pci_dev BAR's and updates the resources with the
- *   BAR value.  BARS with zero length will have the resources
- *   The HvCallPci_getBarParms is used to get the size of the BAR
- *   space.  It calls iSeries_IoMmTable_AllocateEntry to allocate
- *   each entry.
- * - Loops through The Bar resources(0 - 5) including the ROM
- *   is resource(6).
- */
-void iSeries_allocateDeviceBars(struct pci_dev *PciDev)
-{
-	struct resource *BarResource;
-	int BarNumber;
-
-	for (BarNumber = 0; BarNumber <= PCI_ROM_RESOURCE; ++BarNumber) {
-		BarResource = &PciDev->resource[BarNumber];
-		iSeries_IoMmTable_AllocateEntry(PciDev, BarNumber);
-    	}
-}
-
-/*
- * Translates the IoAddress to the device that is mapped to IoSpace.
- * This code is inlined, see the iSeries_pci.c file for the replacement.
- */
-struct iSeries_Device_Node *iSeries_xlateIoMmAddress(void *IoAddress)
-{
-	return NULL;	   
-}
-
-/*
- * Status hook for IoMmTable
- */
-void iSeries_IoMmTable_Status(void)
-{
-	PCIFR("IoMmTable......: 0x%p", iSeries_IoMmTable);
-	PCIFR("IoMmTable Range: 0x%p to 0x%p", iSeries_Base_Io_Memory,
-			iSeries_Max_Io_Memory);
-}
diff -ruN 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_IoMmTable.h 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/iSeries_IoMmTable.h
--- 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_IoMmTable.h	2004-02-04 17:24:34.000000000 +1100
+++ 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/iSeries_IoMmTable.h	1970-01-01 10:00:00.000000000 +1000
@@ -1,85 +0,0 @@
-#ifndef _ISERIES_IOMMTABLE_H
-#define _ISERIES_IOMMTABLE_H
-/************************************************************************/
-/* File iSeries_IoMmTable.h created by Allan Trautman on Dec 12 2001.   */
-/************************************************************************/
-/* Interfaces for the write/read Io address translation table.          */
-/* Copyright (C) 20yy  Allan H Trautman, IBM Corporation                */
-/*                                                                      */
-/* This program is free software; you can redistribute it and/or modify */
-/* it under the terms of the GNU General Public License as published by */
-/* the Free Software Foundation; either version 2 of the License, or    */
-/* (at your option) any later version.                                  */
-/*                                                                      */
-/* This program is distributed in the hope that it will be useful,      */ 
-/* but WITHOUT ANY WARRANTY; without even the implied warranty of       */
-/* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the        */
-/* GNU General Public License for more details.                         */
-/*                                                                      */
-/* You should have received a copy of the GNU General Public License    */ 
-/* along with this program; if not, write to the:                       */
-/* Free Software Foundation, Inc.,                                      */ 
-/* 59 Temple Place, Suite 330,                                          */ 
-/* Boston, MA  02111-1307  USA                                          */
-/************************************************************************/
-/* Change Activity:                                                     */
-/*   Created December 12, 2000                                          */
-/*   Ported to ppc64, August 30, 2001                                   */
-/* End Change Activity                                                  */
-/************************************************************************/
-
-struct pci_dev;
-struct iSeries_Device_Node;
-
-extern struct iSeries_Device_Node **iSeries_IoMmTable;
-extern u8 *iSeries_IoBarTable;
-extern unsigned long iSeries_Base_Io_Memory;
-extern unsigned long iSeries_Max_Io_Memory;
-extern unsigned long iSeries_Base_Io_Memory;
-extern unsigned long iSeries_IoMmTable_Entry_Size;
-/*
- * iSeries_IoMmTable_Initialize
- *
- * - Initalizes the Address Translation Table and get it ready for use.
- *   Must be called before any client calls any of the other methods.
- *
- * Parameters: None.
- *
- * Return: None.
- */
-extern void iSeries_IoMmTable_Initialize(void);
-extern void iSeries_IoMmTable_Status(void);
-
-/*
- * iSeries_allocateDeviceBars
- *
- * - Allocates ALL pci_dev BAR's and updates the resources with the BAR
- *   value.  BARS with zero length will not have the resources.  The
- *   HvCallPci_getBarParms is used to get the size of the BAR space.
- *   It calls iSeries_IoMmTable_AllocateEntry to allocate each entry.
- *
- * Parameters:
- * pci_dev = Pointer to pci_dev structure that will be mapped to pseudo
- *           I/O Address.
- *
- * Return:
- *   The pci_dev I/O resources updated with pseudo I/O Addresses.
- */
-extern void iSeries_allocateDeviceBars(struct pci_dev *);
-
-/*
- * iSeries_xlateIoMmAddress
- *
- * - Translates an I/O Memory address to Device Node that has been the
- *   allocated the psuedo I/O Address.
- *
- * Parameters:
- * IoAddress = I/O Memory Address.
- *
- * Return:
- *   An iSeries_Device_Node to the device mapped to the I/O address. The
- *   BarNumber and BarOffset are valid if the Device Node is returned.
- */
-extern struct iSeries_Device_Node *iSeries_xlateIoMmAddress(void *IoAddress);
-
-#endif /* _ISERIES_IOMMTABLE_H */
diff -ruN 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_pci.c 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/iSeries_pci.c
--- 2.6.10-rc1-bk6-irq.1/arch/ppc64/kernel/iSeries_pci.c	2004-10-25 15:37:12.000000000 +1000
+++ 2.6.10-rc1-bk6-cleanup.1/arch/ppc64/kernel/iSeries_pci.c	2004-10-27 18:43:41.000000000 +1000
@@ -1,4 +1,3 @@
-#define PCIFR(...)
 /*
  * iSeries_pci.c
  *
@@ -47,27 +46,19 @@
 #include <asm/iSeries/iSeries_pci.h>
 #include <asm/iSeries/mf.h>
 
-#include "iSeries_IoMmTable.h"
 #include "pci.h"
 
 extern int panic_timeout;
 
-extern unsigned long iSeries_Base_Io_Memory;    
-
-extern struct iommu_table *tceTables[256];
 extern unsigned long io_page_mask;
 
-extern void iSeries_MmIoTest(void);
-
 /*
  * Forward declares of prototypes. 
  */
 static struct iSeries_Device_Node *find_Device_Node(int bus, int devfn);
-static void iSeries_Scan_PHBs_Slots(struct pci_controller *Phb);
-static void iSeries_Scan_EADs_Bridge(HvBusNumber Bus, HvSubBusNumber SubBus,
-		int IdSel);
-static int iSeries_Scan_Bridge_Slot(HvBusNumber Bus,
-		struct HvCallPci_BridgeInfo *Info);
+static void scan_PHB_slots(struct pci_controller *Phb);
+static void scan_EADS_bridge(HvBusNumber Bus, HvSubBusNumber SubBus, int IdSel);
+static int scan_bridge_slot(HvBusNumber Bus, struct HvCallPci_BridgeInfo *Info);
 
 LIST_HEAD(iSeries_Global_Device_List);
 
@@ -88,7 +79,116 @@
 static struct pci_ops iSeries_pci_ops;
 
 /*
- * Log Error infor in Flight Recorder to system Console.
+ * Table defines
+ * Each Entry size is 4 MB * 1024 Entries = 4GB I/O address space.
+ */
+#define IOMM_TABLE_MAX_ENTRIES	1024
+#define IOMM_TABLE_ENTRY_SIZE	0x0000000000400000UL
+#define BASE_IO_MEMORY		0xE000000000000000UL
+
+static unsigned long max_io_memory = 0xE000000000000000UL;
+static long current_iomm_table_entry;
+
+/*
+ * Lookup Tables.
+ */
+static struct iSeries_Device_Node **iomm_table;
+static u8 *iobar_table;
+
+/*
+ * Static and Global variables
+ */
+static char *pci_io_text = "iSeries PCI I/O";
+static spinlock_t iomm_table_lock = SPIN_LOCK_UNLOCKED;
+
+/*
+ * iomm_table_initialize
+ *
+ * Allocates and initializes the Address Translation Table and BAR
+ * Tables to get them ready for use.  Must be called before any
+ * I/O space is handed out to the device BARs.
+ */
+static void iomm_table_initialize(void)
+{
+	spin_lock(&iomm_table_lock);
+	iomm_table = kmalloc(sizeof(*iomm_table) * IOMM_TABLE_MAX_ENTRIES,
+			GFP_KERNEL);
+	iobar_table = kmalloc(sizeof(*iobar_table) * IOMM_TABLE_MAX_ENTRIES,
+			GFP_KERNEL);
+	spin_unlock(&iomm_table_lock);
+	if ((iomm_table == NULL) || (iobar_table == NULL))
+		panic("PCI: I/O tables allocation failed.\n");
+}
+
+/*
+ * iomm_table_allocate_entry
+ *
+ * Adds pci_dev entry in address translation table
+ *
+ * - Allocates the number of table entries required, based on BAR
+ *   size.
+ * - Allocates starting at BASE_IO_MEMORY and increases.
+ * - The size is rounded up to be a multiple of the entry size.
+ * - current_iomm_table_entry is incremented to track the next free entry.
+ * - Builds the resource entry for allocated BARs.
+ */
+static void iomm_table_allocate_entry(struct pci_dev *dev, int bar_num)
+{
+	struct resource *bar_res = &dev->resource[bar_num];
+	long bar_size = pci_resource_len(dev, bar_num);
+
+	/*
+	 * No space to allocate, quick exit, skip Allocation.
+	 */
+	if (bar_size == 0)
+		return;
+	/*
+	 * Set Resource values.
+	 */
+	spin_lock(&iomm_table_lock);
+	bar_res->name = pci_io_text;
+	bar_res->start =
+		IOMM_TABLE_ENTRY_SIZE * current_iomm_table_entry;
+	bar_res->start += BASE_IO_MEMORY;
+	bar_res->end = bar_res->start + bar_size - 1;
+	/*
+	 * Allocate the number of table entries needed for BAR.
+	 */
+	while (bar_size > 0 ) {
+		iomm_table[current_iomm_table_entry] = dev->sysdata;
+		iobar_table[current_iomm_table_entry] = bar_num;
+		bar_size -= IOMM_TABLE_ENTRY_SIZE;
+		++current_iomm_table_entry;
+	}
+	max_io_memory = BASE_IO_MEMORY +
+		(IOMM_TABLE_ENTRY_SIZE * current_iomm_table_entry);
+	spin_unlock(&iomm_table_lock);
+}
+
+/*
+ * allocate_device_bars
+ *
+ * - Allocates ALL pci_dev BARs and updates the resources with the
+ *   BAR value.  BARs with zero length will not have the resources.
+ *   The HvCallPci_getBarParms is used to get the size of the BAR
+ *   space.  It calls iomm_table_allocate_entry to allocate
+ *   each entry.
+ * - Loops through the BAR resources (0 - 5), including the ROM,
+ *   which is resource 6.
+ */
+static void allocate_device_bars(struct pci_dev *dev)
+{
+	struct resource *bar_res;
+	int bar_num;
+
+	for (bar_num = 0; bar_num <= PCI_ROM_RESOURCE; ++bar_num) {
+		bar_res = &dev->resource[bar_num];
+		iomm_table_allocate_entry(dev, bar_num);
+	}
+}
+
+/*
+ * Log error information to system console.
  * Filter out the device not there errors.
  * PCI: EADs Connect Failed 0x18.58.10 Rc: 0x00xx
  * PCI: Read Vendor Failed 0x18.58.10 Rc: 0x00xx
@@ -99,7 +199,6 @@
 {
 	if (HvRc == 0x0302)
 		return;
-
 	printk(KERN_ERR "PCI: %s Failed: 0x%02X.%02X.%02X Rc: 0x%04X",
 	       Error_Text, Bus, SubBus, AgentId, HvRc);
 }
@@ -133,8 +232,6 @@
 	node->DevFn = PCI_DEVFN(ISERIES_ENCODE_DEVICE(AgentId), Function);
 	node->IoRetry = 0;
 	iSeries_Get_Location_Code(node);
-	PCIFR("Device 0x%02X.%2X, Node:0x%p ", ISERIES_BUS(node),
-			ISERIES_DEVFUN(node), node);
 	return node;
 }
 
@@ -160,10 +257,8 @@
 		if (ret == 0) {
 			printk("bus %d appears to exist\n", bus);
 			phb = pci_alloc_pci_controller(phb_type_hypervisor);
-			if (phb == NULL) {
-				PCIFR("Allocate pci_controller failed.");
+			if (phb == NULL)
 				return -1;
-			}
 			phb->pci_mem_offset = phb->local_number = bus;
 			phb->first_busno = bus;
 			phb->last_busno = bus;
@@ -171,10 +266,9 @@
 
 			PPCDBG(PPCDBG_BUSWALK, "PCI:Create iSeries pci_controller(%p), Bus: %04X\n",
 					phb, bus);
-			PCIFR("Create iSeries PHB controller: %04X", bus);
 
 			/* Find and connect the devices. */
-			iSeries_Scan_PHBs_Slots(phb);
+			scan_PHB_slots(phb);
 		}
 		/*
 		 * Check for Unexpected Return code, a clue that something
@@ -195,7 +289,7 @@
 void iSeries_pcibios_init(void)
 {
 	PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Entry.\n"); 
-	iSeries_IoMmTable_Initialize();
+	iomm_table_initialize();
 	find_and_init_phbs();
 	io_page_mask = -1;
 	/* pci_assign_all_busses = 0;		SFRXXX*/
@@ -231,7 +325,7 @@
 			PPCDBG(PPCDBG_BUSWALK,
 					"pdev 0x%p <==> DevNode 0x%p\n",
 					pdev, node);
-			iSeries_allocateDeviceBars(pdev);
+			allocate_device_bars(pdev);
 			iSeries_Device_Information(pdev, Buffer,
 					sizeof(Buffer));
 			printk("%d. %s\n", DeviceCount, Buffer);
@@ -241,7 +335,6 @@
 					(unsigned long)pdev);
 		pdev->irq = node->Irq;
 	}
-	iSeries_IoMmTable_Status();
 	iSeries_activate_IRQs();
 	mf_displaySrc(0xC9000200);
 }
@@ -260,7 +353,7 @@
 /*
  * Loop through each node function to find usable EADs bridges.  
  */
-static void iSeries_Scan_PHBs_Slots(struct pci_controller *Phb)
+static void scan_PHB_slots(struct pci_controller *Phb)
 {
 	struct HvCallPci_DeviceInfo *DevInfo;
 	HvBusNumber bus = Phb->local_number;	/* System Bus */	
@@ -283,7 +376,7 @@
 				sizeof(struct HvCallPci_DeviceInfo));
 		if (HvRc == 0) {
 			if (DevInfo->deviceType == HvCallPci_NodeDevice)
-				iSeries_Scan_EADs_Bridge(bus, SubBus, IdSel);
+				scan_EADS_bridge(bus, SubBus, IdSel);
 			else
 				printk("PCI: Invalid System Configuration(0x%02X)"
 				       " for bus 0x%02x id 0x%02x.\n",
@@ -295,7 +388,7 @@
 	kfree(DevInfo);
 }
 
-static void iSeries_Scan_EADs_Bridge(HvBusNumber bus, HvSubBusNumber SubBus,
+static void scan_EADS_bridge(HvBusNumber bus, HvSubBusNumber SubBus,
 		int IdSel)
 {
 	struct HvCallPci_BridgeInfo *BridgeInfo;
@@ -340,7 +433,7 @@
 				if (BridgeInfo->busUnitInfo.deviceType ==
 						HvCallPci_BridgeDevice)  {
 					/* Scan_Bridge_Slot...: 0x18.00.12 */
-					iSeries_Scan_Bridge_Slot(bus, BridgeInfo);
+					scan_bridge_slot(bus, BridgeInfo);
 				} else
 					printk("PCI: Invalid Bridge Configuration(0x%02X)",
 						BridgeInfo->busUnitInfo.deviceType);
@@ -355,7 +448,7 @@
 /*
  * This assumes that the node slot is always on the primary bus!
  */
-static int iSeries_Scan_Bridge_Slot(HvBusNumber Bus,
+static int scan_bridge_slot(HvBusNumber Bus,
 		struct HvCallPci_BridgeInfo *BridgeInfo)
 {
 	struct iSeries_Device_Node *node;
@@ -593,12 +686,8 @@
 		return -1;	/* Retry Try */
 	}
 	/* If retry was in progress, log success and rest retry count */
-	if (DevNode->IoRetry > 0) {
-		PCIFR("%s: Device 0x%04X:%02X Retry Successful(%2d).",
-				TextHdr, DevNode->DsaAddr.Dsa.busNumber, DevNode->DevFn,
-				DevNode->IoRetry);
+	if (DevNode->IoRetry > 0)
 		DevNode->IoRetry = 0;
-	}
 	return 0; 
 }
 
@@ -607,8 +696,9 @@
  * Note: Make sure the passed variable end up on the stack to avoid
  * the exposure of being device global.
  */
-static inline struct iSeries_Device_Node *xlateIoMmAddress(const volatile void __iomem *IoAddress,
-		 u64 *dsaptr, u64 *BarOffsetPtr)
+static inline struct iSeries_Device_Node *xlate_iomm_address(
+		const volatile void __iomem *IoAddress,
+		u64 *dsaptr, u64 *BarOffsetPtr)
 {
 	unsigned long OrigIoAddr;
 	unsigned long BaseIoAddr;
@@ -616,17 +706,16 @@
 	struct iSeries_Device_Node *DevNode;
 
 	OrigIoAddr = (unsigned long __force)IoAddress;
-	if ((OrigIoAddr < iSeries_Base_Io_Memory) ||
-			(OrigIoAddr >= iSeries_Max_Io_Memory))
+	if ((OrigIoAddr < BASE_IO_MEMORY) || (OrigIoAddr >= max_io_memory))
 		return NULL;
-	BaseIoAddr = OrigIoAddr - iSeries_Base_Io_Memory;
-	TableIndex = BaseIoAddr / iSeries_IoMmTable_Entry_Size;
-	DevNode = iSeries_IoMmTable[TableIndex];
+	BaseIoAddr = OrigIoAddr - BASE_IO_MEMORY;
+	TableIndex = BaseIoAddr / IOMM_TABLE_ENTRY_SIZE;
+	DevNode = iomm_table[TableIndex];
 
 	if (DevNode != NULL) {
-		int barnum = iSeries_IoBarTable[TableIndex];
+		int barnum = iobar_table[TableIndex];
 		*dsaptr = DevNode->DsaAddr.DsaAddr | (barnum << 24);
-		*BarOffsetPtr = BaseIoAddr % iSeries_IoMmTable_Entry_Size;
+		*BarOffsetPtr = BaseIoAddr % IOMM_TABLE_ENTRY_SIZE;
 	} else
 		panic("PCI: Invalid PCI IoAddress detected!\n");
 	return DevNode;
@@ -647,7 +736,7 @@
 	u64 dsa;
 	struct HvCallPci_LoadReturn ret;
 	struct iSeries_Device_Node *DevNode =
-		xlateIoMmAddress(IoAddress, &dsa, &BarOffset);
+		xlate_iomm_address(IoAddress, &dsa, &BarOffset);
 
 	if (DevNode == NULL) {
 		static unsigned long last_jiffies;
@@ -676,7 +765,7 @@
 	u64 dsa;
 	struct HvCallPci_LoadReturn ret;
 	struct iSeries_Device_Node *DevNode =
-		xlateIoMmAddress(IoAddress, &dsa, &BarOffset);
+		xlate_iomm_address(IoAddress, &dsa, &BarOffset);
 
 	if (DevNode == NULL) {
 		static unsigned long last_jiffies;
@@ -706,7 +795,7 @@
 	u64 dsa;
 	struct HvCallPci_LoadReturn ret;
 	struct iSeries_Device_Node *DevNode =
-		xlateIoMmAddress(IoAddress, &dsa, &BarOffset);
+		xlate_iomm_address(IoAddress, &dsa, &BarOffset);
 
 	if (DevNode == NULL) {
 		static unsigned long last_jiffies;
@@ -743,7 +832,7 @@
 	u64 dsa;
 	u64 rc;
 	struct iSeries_Device_Node *DevNode =
-		xlateIoMmAddress(IoAddress, &dsa, &BarOffset);
+		xlate_iomm_address(IoAddress, &dsa, &BarOffset);
 
 	if (DevNode == NULL) {
 		static unsigned long last_jiffies;
@@ -770,7 +859,7 @@
 	u64 dsa;
 	u64 rc;
 	struct iSeries_Device_Node *DevNode =
-		xlateIoMmAddress(IoAddress, &dsa, &BarOffset);
+		xlate_iomm_address(IoAddress, &dsa, &BarOffset);
 
 	if (DevNode == NULL) {
 		static unsigned long last_jiffies;
@@ -797,7 +886,7 @@
 	u64 dsa;
 	u64 rc;
 	struct iSeries_Device_Node *DevNode =
-		xlateIoMmAddress(IoAddress, &dsa, &BarOffset);
+		xlate_iomm_address(IoAddress, &dsa, &BarOffset);
 
 	if (DevNode == NULL) {
 		static unsigned long last_jiffies;
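
As a rough illustration of the lookup that xlate_iomm_address() performs in
the patch above, here is a minimal user-space sketch.  The three constants
are copied from the patch; everything else (struct iomm_slot, iomm_alloc(),
iomm_xlate(), the integer dev_id) is a hypothetical stand-in for the kernel
structures, and the table-overflow check is an addition that the patch
itself does not make.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants copied from the patch. */
#define IOMM_TABLE_MAX_ENTRIES 1024
#define IOMM_TABLE_ENTRY_SIZE  0x0000000000400000ULL	/* 4 MB per entry */
#define BASE_IO_MEMORY         0xE000000000000000ULL

/* Hypothetical stand-in for one iomm_table/iobar_table entry pair. */
struct iomm_slot {
	int dev_id;	/* illustrative device identifier */
	uint8_t bar;	/* BAR number that owns this 4 MB slot */
};

static struct iomm_slot slots[IOMM_TABLE_MAX_ENTRIES];
static size_t next_slot;	/* plays the role of current_iomm_table_entry */

/*
 * Reserve enough 4 MB slots to cover bar_size bytes.  Returns the pseudo
 * I/O start address, or 0 if the BAR is empty or the table is full.
 */
static uint64_t iomm_alloc(int dev_id, uint8_t bar, uint64_t bar_size)
{
	uint64_t start = BASE_IO_MEMORY + IOMM_TABLE_ENTRY_SIZE * next_slot;
	uint64_t needed = (bar_size + IOMM_TABLE_ENTRY_SIZE - 1) /
			  IOMM_TABLE_ENTRY_SIZE;

	if (bar_size == 0 || needed > IOMM_TABLE_MAX_ENTRIES - next_slot)
		return 0;
	while (needed--) {
		slots[next_slot].dev_id = dev_id;
		slots[next_slot].bar = bar;
		next_slot++;
	}
	return start;
}

/*
 * Translate a pseudo I/O address back to (device, BAR, offset within the
 * 4 MB entry).  Returns 0 on success, -1 if the address was never handed out.
 */
static int iomm_xlate(uint64_t addr, int *dev_id, uint8_t *bar,
		      uint64_t *offset)
{
	uint64_t max = BASE_IO_MEMORY + IOMM_TABLE_ENTRY_SIZE * next_slot;
	uint64_t rel;

	if (addr < BASE_IO_MEMORY || addr >= max)
		return -1;
	rel = addr - BASE_IO_MEMORY;
	*dev_id = slots[rel / IOMM_TABLE_ENTRY_SIZE].dev_id;
	*bar = slots[rel / IOMM_TABLE_ENTRY_SIZE].bar;
	*offset = rel % IOMM_TABLE_ENTRY_SIZE;
	return 0;
}
```

Because every allocation is rounded up to whole 4 MB entries, one division
recovers the table index and one modulo recovers the offset, which is
presumably why the patch can drop the separate iSeries_IoMmTable module.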

From jschopp at austin.ibm.com  Fri Oct 29 02:51:39 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Thu, 28 Oct 2004 11:51:39 -0500
Subject: [PPC64] Rework ppc64 hugepage code
In-Reply-To: <20041028060151.GA1680@zax>
References: <20041028060151.GA1680@zax>
Message-ID: <4181239B.5020307@austin.ibm.com>

> Andrew, please apply:
> 
> Rework the ppc64 hugepage code.  Instead of using specially marked pmd
> entries in the normal pagetables to represent hugepages, use normal
> pte_t entries, in a special set of pagetables used for hugepages only.
> 
> Using pte_t instead of a special hugepte_t makes the code more similar
> to that for other architectures, allowing more possibilities for
> consolidating the hugepage code.
> 
> Using independent pagetables for the hugepages is also a prerequisite
> for moving the hugepages into their own region well outside the normal
> user address space.  The restrictions imposed by the powerpc mmu's
> segment design mean we probably want to do that in the fairly near
> future.
> 

Besides making the code more like other architectures and being a 
prerequisite for moving hugepages into their own region this patch has 
another use.  It is on the list of prerequisites for memory hotplug 
remove on ppc64, because it unifies the method for flushing hardware 
page table entries of both large and normal sized pages.

When David originally wrote this patch I tested it on some Power4 & 
Power5 hardware and it worked flawlessly for me.

> Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Acked-by: Joel Schopp <jschopp at austin.ibm.com>


From olof at austin.ibm.com  Fri Oct 29 04:44:47 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 28 Oct 2004 13:44:47 -0500
Subject: [PATCH] [PPC64] Setup cpu_sibling_map on iSeries
Message-ID: <20041028184447.GA30644@4>

Hi,

Nathan Lynch pointed this out: The CPU sibling map is never initialized
on iSeries. This makes the scheduler very unhappy if CONFIG_SCHED_SMT
is enabled, causing an oops in find_busiest_group during boot.

The patch below adds the expected init. Please apply.
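
A minimal user-space sketch of the invariant involved, assuming a plain
32-bit mask in place of cpumask_t: each usable CPU should at least appear
in its own sibling map.  NR_CPUS, mask_t, count_procs() and mask_weight()
are illustrative names only, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8		/* assumption for this sketch */

typedef uint32_t mask_t;	/* one bit per CPU; stand-in for cpumask_t */

static mask_t possible_map, present_map;
static mask_t sibling_map[NR_CPUS];

/*
 * status[i] < 2 marks CPU i as usable, mirroring the xDynProcStatus test
 * in smp_iSeries_numProcs().  Returns the number of usable CPUs.
 */
static int count_procs(const int *status, int n)
{
	int i, np = 0;

	for (i = 0; i < n && i < NR_CPUS; i++) {
		if (status[i] < 2) {
			possible_map |= (mask_t)1 << i;
			present_map |= (mask_t)1 << i;
			/* The step the patch adds: a CPU is its own sibling. */
			sibling_map[i] |= (mask_t)1 << i;
			++np;
		}
	}
	return np;
}

/* Count the CPUs set in a mask. */
static int mask_weight(mask_t m)
{
	int w = 0;

	while (m) {
		w += (int)(m & 1);
		m >>= 1;
	}
	return w;
}
```

Without the marked line every sibling_map[i] stays empty, so any consumer
that assumes mask_weight(sibling_map[cpu]) is at least 1 sees an empty
group instead.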


Signed-off-by: Olof Johansson <olof at austin.ibm.com>

---

 linux-2.5-olof/arch/ppc64/kernel/iSeries_smp.c |    1 +
 1 files changed, 1 insertion(+)

diff -puN arch/ppc64/kernel/iSeries_smp.c~iseries-sibling-map arch/ppc64/kernel/iSeries_smp.c
--- linux-2.5/arch/ppc64/kernel/iSeries_smp.c~iseries-sibling-map	2004-10-28 13:24:03.063642740 -0500
+++ linux-2.5-olof/arch/ppc64/kernel/iSeries_smp.c	2004-10-28 13:28:06.592330464 -0500
@@ -94,6 +94,7 @@ static int smp_iSeries_numProcs(void)
                 if (paca[i].lppaca.xDynProcStatus < 2) {
 			cpu_set(i, cpu_possible_map);
 			cpu_set(i, cpu_present_map);
+			cpu_set(i, cpu_sibling_map[i]);
                         ++np;
                 }
         }

_


From johnrose at austin.ibm.com  Fri Oct 29 05:27:45 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 28 Oct 2004 14:27:45 -0500
Subject: [PATCH 1/1] rtas_flash_4gig
In-Reply-To: <16768.28322.583827.9327@cargo.ozlabs.ibm.com>
References: <200410041942.i94Jg4WA154540@westrelay04.boulder.ibm.com>
	<16758.55568.809557.670513@cargo.ozlabs.ibm.com>
	<20041020170817.0ee49b64@localhost>
	<16768.28322.583827.9327@cargo.ozlabs.ibm.com>
Message-ID: <1098991665.692.17.camel@sinatra.austin.ibm.com>

On Wed, 2004-10-27 at 22:59, Paul Mackerras wrote:
<snip>
> Since this is happening at reboot time, I suggest we copy the block
> list into rtas_rmo_buf.  

It's less complex to use rtas_rmo_buf exclusively for userspace.  I'm
against introducing kernel use of rtas_rmo_buf, even at reboot time.  If
we did, it would be proper to add a kernel lock to synchronize access to
it, but then userspace apps have no way to take that lock when
mmap()'ing /dev/mem.

This seems like overkill for a situation we've never actually
encountered, imho.

John


From johnrose at austin.ibm.com  Fri Oct 29 07:28:36 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 28 Oct 2004 16:28:36 -0500
Subject: [PATCH] iommu fixes, round 3
In-Reply-To: <16768.10849.741580.850491@cargo.ozlabs.ibm.com>
References: <1098775712.6897.17.camel@gaston>
	<1098808895.32293.23.camel@sinatra.austin.ibm.com>
	<1098813781.32293.40.camel@sinatra.austin.ibm.com>
	<16768.10849.741580.850491@cargo.ozlabs.ibm.com>
Message-ID: <1098998916.692.20.camel@sinatra.austin.ibm.com>

This patch changes the following iommu-related things:

- Renames the [i,p]series versions of iommu_devnode_init(), to keep things 
  logically separate where possible.
  
- Moves iommu_free_table() to generic iommu.c

- Creates of_cleanup_node(), which will directly precede the dynamic removal of
  any device node
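
For reference, the size arithmetic that iommu_free_table() relies on can be
sketched in user space as follows.  This is only an illustration under
assumed names: bitmap_bytes(), bitmap_is_empty(), order_for() and
SKETCH_PAGE_SIZE are hypothetical, and order_for() merely imitates what
get_order() is expected to return for a 4 KB page size.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD    (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */

/* Bytes needed for a bitmap holding one bit per TCE entry, rounded up. */
static size_t bitmap_bytes(size_t entries)
{
	return (entries + 7) / 8;
}

/*
 * Returns 1 if no bit is set in the whole words covering `entries` bits.
 * Like the loop in the patch, a trailing partial word is not examined.
 */
static int bitmap_is_empty(const unsigned long *map, size_t entries)
{
	size_t i, words = entries / BITS_PER_WORD;

	for (i = 0; i < words; i++)
		if (map[i] != 0)
			return 0;
	return 1;
}

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= bytes. */
static unsigned int order_for(size_t bytes)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}
```

The order computed this way has to match the one used when the bitmap was
allocated, since free_pages() is given an order rather than a byte count.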

Comments welcome.

Thanks-
John

Signed-off-by: John Rose <johnrose at austin.ibm.com>

diff -puN arch/ppc64/kernel/iSeries_iommu.c~iommu_free_table_fix4 arch/ppc64/kernel/iSeries_iommu.c
--- 2_6_ketchup/arch/ppc64/kernel/iSeries_iommu.c~iommu_free_table_fix4	2004-10-28 16:16:13.000000000 -0500
+++ 2_6_ketchup-johnrose/arch/ppc64/kernel/iSeries_iommu.c	2004-10-28 16:16:13.000000000 -0500
@@ -171,7 +171,7 @@ static void iommu_table_getparms(struct 
 }
 
 
-void iommu_devnode_init(struct iSeries_Device_Node *dn) {
+void iommu_devnode_init_iSeries(struct iSeries_Device_Node *dn) {
 	struct iommu_table *tbl;
 
 	tbl = (struct iommu_table *)kmalloc(sizeof(struct iommu_table), GFP_KERNEL);
diff -puN arch/ppc64/kernel/iSeries_pci.c~iommu_free_table_fix4 arch/ppc64/kernel/iSeries_pci.c
--- 2_6_ketchup/arch/ppc64/kernel/iSeries_pci.c~iommu_free_table_fix4	2004-10-28 16:16:13.000000000 -0500
+++ 2_6_ketchup-johnrose/arch/ppc64/kernel/iSeries_pci.c	2004-10-28 16:16:13.000000000 -0500
@@ -235,7 +235,7 @@ void __init iSeries_pci_final_fixup(void
 			iSeries_Device_Information(pdev, Buffer,
 					sizeof(Buffer));
 			printk("%d. %s\n", DeviceCount, Buffer);
-			iommu_devnode_init(node);
+			iommu_devnode_init_iSeries(node);
 		} else
 			printk("PCI: Device Tree not found for 0x%016lX\n",
 					(unsigned long)pdev);
diff -puN arch/ppc64/kernel/iommu.c~iommu_free_table_fix4 arch/ppc64/kernel/iommu.c
--- 2_6_ketchup/arch/ppc64/kernel/iommu.c~iommu_free_table_fix4	2004-10-28 16:16:13.000000000 -0500
+++ 2_6_ketchup-johnrose/arch/ppc64/kernel/iommu.c	2004-10-28 16:16:13.000000000 -0500
@@ -425,6 +425,39 @@ struct iommu_table *iommu_init_table(str
 	return tbl;
 }
 
+void iommu_free_table(struct device_node *dn)
+{
+	struct iommu_table *tbl = dn->iommu_table;
+	unsigned long bitmap_sz, i;
+	unsigned int order;
+
+	if (!tbl || !tbl->it_map) {
+		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
+				dn->full_name);
+		return;
+	}
+
+	/* verify that table contains no entries */
+	/* it_mapsize is in entries, and we're examining 64 at a time */
+	for (i = 0; i < (tbl->it_mapsize/64); i++) {
+		if (tbl->it_map[i] != 0) {
+			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
+				__FUNCTION__, dn->full_name);
+			break;
+		}
+	}
+
+	/* calculate bitmap size in bytes */
+	bitmap_sz = (tbl->it_mapsize + 7) / 8;
+
+	/* free bitmap */
+	order = get_order(bitmap_sz);
+	free_pages((unsigned long) tbl->it_map, order);
+
+	/* free table */
+	kfree(tbl);
+}
+
 /* Creates TCEs for a user provided buffer.  The user buffer must be
  * contiguous real kernel storage (not vmalloc).  The address of the buffer
  * passed here is the kernel (virtual) address of the buffer.  The buffer
diff -puN arch/ppc64/kernel/pSeries_iommu.c~iommu_free_table_fix4 arch/ppc64/kernel/pSeries_iommu.c
--- 2_6_ketchup/arch/ppc64/kernel/pSeries_iommu.c~iommu_free_table_fix4	2004-10-28 16:16:13.000000000 -0500
+++ 2_6_ketchup-johnrose/arch/ppc64/kernel/pSeries_iommu.c	2004-10-28 16:16:13.000000000 -0500
@@ -276,7 +276,7 @@ static void iommu_buses_init(void)
 		first_phb = 0;
 
 		for (dn = first_dn; dn != NULL; dn = dn->sibling)
-			iommu_devnode_init(dn);
+			iommu_devnode_init_pSeries(dn);
 	}
 }
 
@@ -298,7 +298,7 @@ static void iommu_buses_init_lpar(struct
 			 * Do it now because iommu_table_setparms_lpar needs it.
 			 */
 			busdn->bussubno = bus->number;
-			iommu_devnode_init(busdn);
+			iommu_devnode_init_pSeries(busdn);
 		}
 
 		/* look for a window on a bridge even if the PHB had one */
@@ -397,7 +397,7 @@ static void iommu_table_setparms_lpar(st
 }
 
 
-void iommu_devnode_init(struct device_node *dn)
+void iommu_devnode_init_pSeries(struct device_node *dn)
 {
 	struct iommu_table *tbl;
 
@@ -412,39 +412,6 @@ void iommu_devnode_init(struct device_no
 	dn->iommu_table = iommu_init_table(tbl);
 }
 
-void iommu_free_table(struct device_node *dn)
-{
-	struct iommu_table *tbl = dn->iommu_table;
-	unsigned long bitmap_sz, i;
-	unsigned int order;
-
-	if (!tbl || !tbl->it_map) {
-		printk(KERN_ERR "%s: expected TCE map for %s\n", __FUNCTION__,
-				dn->full_name);
-		return;
-	}
-
-	/* verify that table contains no entries */
-	/* it_mapsize is in entries, and we're examining 64 at a time */
-	for (i = 0; i < (tbl->it_mapsize/64); i++) {
-		if (tbl->it_map[i] != 0) {
-			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
-				__FUNCTION__, dn->full_name);
-			break;
-		}
-	}
-
-	/* calculate bitmap size in bytes */
-	bitmap_sz = (tbl->it_mapsize + 7) / 8;
-
-	/* free bitmap */
-	order = get_order(bitmap_sz);
-	free_pages((unsigned long) tbl->it_map, order);
-
-	/* free table */
-	kfree(tbl);
-}
-
 void iommu_setup_pSeries(void)
 {
 	struct pci_dev *dev = NULL;
@@ -469,7 +436,6 @@ void iommu_setup_pSeries(void)
 	}
 }
 
-
 /* These are called very early. */
 void tce_init_pSeries(void)
 {
diff -puN arch/ppc64/kernel/prom.c~iommu_free_table_fix4 arch/ppc64/kernel/prom.c
--- 2_6_ketchup/arch/ppc64/kernel/prom.c~iommu_free_table_fix4	2004-10-28 16:16:13.000000000 -0500
+++ 2_6_ketchup-johnrose/arch/ppc64/kernel/prom.c	2004-10-28 16:17:06.000000000 -0500
@@ -1740,7 +1740,7 @@ static int of_finish_dynamic_node(struct
 	if (strcmp(node->name, "pci") == 0 &&
 	    get_property(node, "ibm,dma-window", NULL)) {
 		node->bussubno = node->busno;
-		iommu_devnode_init(node);
+		iommu_devnode_init_pSeries(node);
 	} else
 		node->iommu_table = parent->iommu_table;
 #endif /* CONFIG_PPC_PSERIES */
@@ -1802,6 +1802,15 @@ int of_add_node(const char *path, struct
 }
 
 /*
+ * Prepare an OF node for removal from system
+ */
+static void of_cleanup_node(struct device_node *np)
+{
+	if (np->iommu_table && get_property(np, "ibm,dma-window", NULL))
+		iommu_free_table(np);
+}
+
+/*
  * Remove an OF device node from the system.
  * Caller should have already "gotten" np.
  */
@@ -1818,13 +1827,7 @@ int of_remove_node(struct device_node *n
 		return -EBUSY;
 	}
 
-	/* XXX This is a layering violation, should be moved to the caller
-	 * --BenH.
-	 */
-#ifdef CONFIG_PPC_PSERIES
-	if (np->iommu_table)
-		iommu_free_table(np);
-#endif /* CONFIG_PPC_PSERIES */
+	of_cleanup_node(np);
 
 	write_lock(&devtree_lock);
 	OF_MARK_STALE(np);
diff -puN include/asm-ppc64/iommu.h~iommu_free_table_fix4 include/asm-ppc64/iommu.h
--- 2_6_ketchup/include/asm-ppc64/iommu.h~iommu_free_table_fix4	2004-10-28 16:16:13.000000000 -0500
+++ 2_6_ketchup-johnrose/include/asm-ppc64/iommu.h	2004-10-28 16:16:13.000000000 -0500
@@ -110,22 +110,18 @@ struct scatterlist;
 extern void iommu_setup_pSeries(void);
 extern void iommu_setup_u3(void);
 
-/* Creates table for an individual device node */
-/* XXX: This isn't generic, please name it accordingly or add
- * some ppc_md. hooks for iommu implementations to do what they
- * need to do. --BenH.
- */
-extern void iommu_devnode_init(struct device_node *dn);
-
 /* Frees table for an individual device node */
-/* XXX: This isn't generic, please name it accordingly or add
- * some ppc_md. hooks for iommu implementations to do what they
- * need to do. --BenH.
- */
 extern void iommu_free_table(struct device_node *dn);
 
 #endif /* CONFIG_PPC_MULTIPLATFORM */
 
+#ifdef CONFIG_PPC_PSERIES
+
+/* Creates table for an individual device node */
+extern void iommu_devnode_init_pSeries(struct device_node *dn);
+
+#endif /* CONFIG_PPC_PSERIES */
+
 #ifdef CONFIG_PPC_ISERIES
 
 /* Walks all buses and creates iommu tables */
@@ -136,7 +132,7 @@ extern void __init iommu_vio_init(void);
 
 struct iSeries_Device_Node;
 /* Creates table for an individual device node */
-extern void iommu_devnode_init(struct iSeries_Device_Node *dn);
+extern void iommu_devnode_init_iSeries(struct iSeries_Device_Node *dn);
 
 #endif /* CONFIG_PPC_ISERIES */
 

_


From benh at kernel.crashing.org  Fri Oct 29 09:28:41 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 29 Oct 2004 09:28:41 +1000
Subject: 2.6.10.-rc1 on ppc64 - returning from prom_init hang
In-Reply-To: <200410272316.i9RNGTpj005995@falcon10.austin.ibm.com>
References: <200410272316.i9RNGTpj005995@falcon10.austin.ibm.com>
Message-ID: <1099006121.29690.81.camel@gaston>

On Wed, 2004-10-27 at 18:16 -0500, Doug Maxey wrote:
> Anyone have any thoughts on this?  I have xmon=on, but it stops before it gets 
> there...  for this iteration, added console=hvsi1, did not change anything.
> 
> This is libata-dev-2.6 on a power5 system.

If your system supports old-style "HVC console" HV calls to output
things, then you can enable the early debug stuff for that in setup.c.
If not, then you'll have to write some HVSI-style early debug code...
 
> Config file read, 1024 bytes
> Welcome
> Welcome to yaboot version 1.3.12
> Enter "help" to get some basic usage information
> boot:
>   2.6.10-rc1-ata-1         * linux
> boot: 2.6.10-rc1-ata-1
> Please wait, loading kernel...
>    Elf64 kernel loaded...
> Loading ramdisk...
> ramdisk loaded at 02300000, size: 1306 Kbytes
> OF stdout device is: /vdevice/vty at 30000001
> Hypertas detected, assuming LPAR !
> command line: root=/dev/VolGroup00/LogVol00 ro rhgb quiet console=hvsi1 xmon=on
> memory layout at init:
>   alloc_bottom : 0000000002447000
>   alloc_top    : 0000000008000000
>   alloc_top_hi : 0000000075000000
>   rmo_top      : 0000000008000000
>   ram_top      : 0000000075000000
> Looking for displays
> found display   : /pci at 800000020000002/pci at 2,2/pci at 1/display at 0, opening ... done
> instantiating rtas at 0x00000000077d9000... done
> 0000000000000000 : boot cpu     0000000000000000
> 0000000000000002 : starting cpu hw idx 0000000000000002... done
> copying OF device tree ...
> Building dt strings...
> Building dt structure...
> Device tree strings 0x0000000002748000 -> 0x00000000027492f8
> Device tree struct  0x000000000274a000 -> 0x000000000275a000
> Calling quiesce ...
> returning from prom_init
>  
> ++doug
> 
> 
-- 
Benjamin Herrenschmidt <benh at kernel.crashing.org>


From wli at holomorphy.com  Fri Oct 29 13:48:17 2004
From: wli at holomorphy.com (William Lee Irwin III)
Date: Thu, 28 Oct 2004 20:48:17 -0700
Subject: [RFC] Consolidate lots of hugepage code
In-Reply-To: <20041029033708.GF12247@zax>
References: <20041029033708.GF12247@zax>
Message-ID: <20041029034817.GY12934@holomorphy.com>

On Fri, Oct 29, 2004 at 01:37:08PM +1000, David Gibson wrote:
> A lot of the code in arch/*/mm/hugetlbpage.c is quite similar.  This
> patch attempts to consolidate a lot of the code across the arch's,
> putting the combined version in mm/hugetlb.c.  There are a couple of
> uglyish hacks in order to cover all the hugepage archs, but the result
> is a very large reduction in the total amount of code.  It also means
> things like hugepage lazy allocation could be implemented in one
> place, instead of six.
> As yet this is entirely untested, except on ppc64.  Comments?
> Objections?  Testing acks?
> Notes:
> 	- this patch changes the meaning of set_huge_pte() to be more
> 	  analogous to set_pte()
> 	- does SH4 need special huge_ptep_get_and_clear()??

Further consolidation is premature given that outstanding hugetlb bugs
have the implication that architectures' needs are not being served by
the current arch/core split. I have at least two relatively major hugetlb
bugs outstanding, the lack of a flush_dcache_page() analogue first, and
another (soon to be reported to affected distros) less well-understood.
Unless they're directly toward the end of restoring hugetlb to a sound
state, they're counterproductive to merge before patches doing so.

-- wli


From david at gibson.dropbear.id.au  Fri Oct 29 13:37:08 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 29 Oct 2004 13:37:08 +1000
Subject: [RFC] Consolidate lots of hugepage code
Message-ID: <20041029033708.GF12247@zax>

A lot of the code in arch/*/mm/hugetlbpage.c is quite similar.  This
patch attempts to consolidate a lot of the code across the arch's,
putting the combined version in mm/hugetlb.c.  There are a couple of
uglyish hacks in order to cover all the hugepage archs, but the result
is a very large reduction in the total amount of code.  It also means
things like hugepage lazy allocation could be implemented in one
place, instead of six.

As yet this is entirely untested, except on ppc64.  Comments?
Objections?  Testing acks?

Notes:
	- this patch changes the meaning of set_huge_pte() to be more
	  analogous to set_pte()
	- does SH4 need special huge_ptep_get_and_clear()??
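
The consolidated hugetlb_prefault() and follow_hugetlb_page() in the patch
below share some index arithmetic that is the same on every architecture.
A small user-space sketch of it, assuming 4 KB base pages and 16 MB huge
pages purely for illustration; the EX_ macros and the three helper names
are hypothetical and stand in for PAGE_SHIFT, HPAGE_SHIFT and friends.

```c
#include <assert.h>

/* Assumed sizes for illustration: 4 KB base pages, 16 MB huge pages. */
#define EX_PAGE_SHIFT	12
#define EX_HPAGE_SHIFT	24
#define EX_PAGE_SIZE	(1UL << EX_PAGE_SHIFT)
#define EX_HPAGE_SIZE	(1UL << EX_HPAGE_SHIFT)
#define EX_HPAGE_MASK	(~(EX_HPAGE_SIZE - 1))

/* The BUG_ON() checks in the patch require huge-page-aligned addresses. */
static int is_hpage_aligned(unsigned long addr)
{
	return (addr & ~EX_HPAGE_MASK) == 0;
}

/*
 * Page-cache index of the huge page backing addr: the distance into the
 * VMA in huge pages, plus the file offset (vm_pgoff counts base pages).
 */
static unsigned long hugepage_index(unsigned long addr,
				    unsigned long vm_start,
				    unsigned long vm_pgoff)
{
	return ((addr - vm_start) >> EX_HPAGE_SHIFT) +
	       (vm_pgoff >> (EX_HPAGE_SHIFT - EX_PAGE_SHIFT));
}

/* Which base-page-sized piece of its huge page an address falls in. */
static unsigned long subpage_index(unsigned long vaddr)
{
	return (vaddr / EX_PAGE_SIZE) % (EX_HPAGE_SIZE / EX_PAGE_SIZE);
}
```

Only the shift values differ between architectures, which is part of what
makes this code a candidate for living in mm/hugetlb.c.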

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2004-09-07 10:38:00.000000000 +1000
+++ working-2.6/mm/hugetlb.c	2004-10-29 11:38:27.132145776 +1000
@@ -7,9 +7,13 @@
 #include <linux/init.h>
 #include <linux/module.h>
 #include <linux/mm.h>
-#include <linux/hugetlb.h>
 #include <linux/sysctl.h>
 #include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <asm/page.h>
+#include <asm/pgtable.h>
+
+#include <linux/hugetlb.h>
 
 const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
 static unsigned long nr_huge_pages, free_huge_pages;
@@ -248,6 +252,75 @@
 	.nopage = hugetlb_nopage,
 };
 
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr);
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr);
+
+pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page)
+{
+	pte_t entry;
+
+	if (vma->vm_flags & VM_WRITE) {
+		entry =
+		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
+	} else {
+		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
+	}
+	entry = pte_mkyoung(entry);
+	entry = pte_mkhuge(entry);
+
+	return entry;
+}
+
+int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
+			    struct vm_area_struct *vma)
+{
+	pte_t *src_pte, *dst_pte, entry;
+	struct page *ptepage;
+	unsigned long addr = vma->vm_start;
+	unsigned long end = vma->vm_end;
+
+	while (addr < end) {
+		dst_pte = huge_pte_alloc(dst, addr);
+		if (!dst_pte)
+			goto nomem;
+		src_pte = huge_pte_offset(src, addr);
+		BUG_ON(!src_pte || pte_none(*src_pte)); /* prefaulted */
+		entry = *src_pte;
+		ptepage = pte_page(entry);
+		get_page(ptepage);
+		set_huge_pte(dst_pte, entry);
+		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
+		addr += HPAGE_SIZE;
+	}
+	return 0;
+
+nomem:
+	return -ENOMEM;
+}
+
+void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t pte;
+	struct page *page;
+
+	WARN_ON(!is_vm_hugetlb_page(vma));
+	BUG_ON(start & ~HPAGE_MASK);
+	BUG_ON(end & ~HPAGE_MASK);
+
+	for (address = start; address < end; address += HPAGE_SIZE) {
+		pte = huge_ptep_get_and_clear(huge_pte_offset(mm, address));
+		if (pte_none(pte))
+			continue;
+		page = pte_page(pte);
+		put_page(page);
+	}
+	mm->rss -= (end - start) >> PAGE_SHIFT;
+	flush_tlb_range(vma, start, end);
+}
+
 void zap_hugepage_range(struct vm_area_struct *vma,
 			unsigned long start, unsigned long length)
 {
@@ -257,3 +330,106 @@
 	unmap_hugepage_range(vma, start, start + length);
 	spin_unlock(&mm->page_table_lock);
 }
+
+int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = current->mm;
+	unsigned long addr;
+	int ret = 0;
+
+	WARN_ON(!is_vm_hugetlb_page(vma));
+	BUG_ON(vma->vm_start & ~HPAGE_MASK);
+	BUG_ON(vma->vm_end & ~HPAGE_MASK);
+
+	spin_lock(&mm->page_table_lock);
+	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
+		unsigned long idx;
+		pte_t *pte = huge_pte_alloc(mm, addr);
+		struct page *page;
+
+		if (!pte) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		if (! pte_none(*pte))
+			hugetlb_clean_stale_pgtable(pte);
+
+		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
+			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+		page = find_get_page(mapping, idx);
+		if (!page) {
+			/* charge the fs quota first */
+			if (hugetlb_get_quota(mapping)) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			page = alloc_huge_page();
+			if (!page) {
+				hugetlb_put_quota(mapping);
+				ret = -ENOMEM;
+				goto out;
+			}
+			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
+			if (! ret) {
+				unlock_page(page);
+			} else {
+				hugetlb_put_quota(mapping);
+				free_huge_page(page);
+				goto out;
+			}
+		}
+		mm->rss += (HPAGE_SIZE / PAGE_SIZE);
+		set_huge_pte(pte, make_huge_pte(vma, page));
+	}
+out:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
+}
+
+int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			struct page **pages, struct vm_area_struct **vmas,
+			unsigned long *position, int *length, int i)
+{
+	unsigned long vpfn, vaddr = *position;
+	int remainder = *length;
+
+	BUG_ON(!is_vm_hugetlb_page(vma));
+
+	vpfn = vaddr/PAGE_SIZE;
+	while (vaddr < vma->vm_end && remainder) {
+
+		if (pages) {
+			pte_t *pte;
+			struct page *page;
+
+			/* Some archs (sparc64, sh*) have multiple
+			 * pte_ts to each hugepage.  We have to make
+			 * sure we get the first, for the page
+			 * indexing below to work. */
+			pte = huge_pte_offset(mm, vaddr & HPAGE_MASK);
+
+			/* hugetlb should be locked, and hence, prefaulted */
+			WARN_ON(!pte || pte_none(*pte));
+
+			page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)];
+
+			WARN_ON(!PageCompound(page));
+
+			get_page(page);
+			pages[i] = page;
+		}
+
+		if (vmas)
+			vmas[i] = vma;
+
+		vaddr += PAGE_SIZE;
+		++vpfn;
+		--remainder;
+		++i;
+	}
+
+	*length = remainder;
+	*position = vaddr;
+
+	return i;
+}
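
A note on the address arithmetic in the consolidated hugetlb_prefault() and
follow_hugetlb_page() above. The following is a minimal userspace sketch, not
kernel code. The page sizes (4 KiB base pages, 4 MiB huge pages) and the three
helper names are assumptions made for illustration only; the expressions
inside them are the ones used in the patch.

```c
#include <assert.h>

/* Assumed sizes for illustration only: 4 KiB base pages, 4 MiB huge pages. */
#define PAGE_SHIFT   12
#define HPAGE_SHIFT  22
#define PAGE_SIZE    (1UL << PAGE_SHIFT)
#define HPAGE_SIZE   (1UL << HPAGE_SHIFT)
#define HPAGE_MASK   (~(HPAGE_SIZE - 1))

/* Hypothetical helper: base address of the huge page containing vaddr,
 * i.e. the value passed to huge_pte_offset() in follow_hugetlb_page(). */
static inline unsigned long hugepage_base(unsigned long vaddr)
{
	return vaddr & HPAGE_MASK;
}

/* Hypothetical helper: index of the base page within its huge page,
 * matching vpfn % (HPAGE_SIZE/PAGE_SIZE) in follow_hugetlb_page(). */
static inline unsigned long subpage_index(unsigned long vaddr)
{
	return (vaddr / PAGE_SIZE) % (HPAGE_SIZE / PAGE_SIZE);
}

/* Hypothetical helper: page-cache index of the huge page backing addr,
 * matching the idx computation in hugetlb_prefault().  vm_pgoff is
 * counted in base pages, so it is scaled down to huge pages. */
static inline unsigned long hugepage_file_index(unsigned long addr,
						unsigned long vm_start,
						unsigned long vm_pgoff)
{
	assert((vm_start & ~HPAGE_MASK) == 0);	/* mirrors the BUG_ON() */
	return ((addr - vm_start) >> HPAGE_SHIFT)
		+ (vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
}
```

Masking with HPAGE_MASK before the lookup is what lets architectures with
several pte_ts per huge page always start from the first one, so that the
subpage index can then be applied to the head page.
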
Index: working-2.6/arch/ppc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c	2004-10-29 11:37:48.139082848 +1000
+++ working-2.6/arch/ppc64/mm/hugetlbpage.c	2004-10-29 11:38:27.133145624 +1000
@@ -122,7 +122,7 @@
 	return hugepte_offset(dir, addr);
 }
 
-static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 
@@ -135,7 +135,7 @@
 	return hugepte_offset(pgd, addr);
 }
 
-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 
@@ -148,24 +148,6 @@
 	return hugepte_alloc(mm, pgd, addr);
 }
 
-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			 struct page *page, pte_t *ptep, int write_access)
-{
-	pte_t entry;
-
-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
-	if (write_access) {
-		entry =
-		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
-	} else {
-		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
-	}
-	entry = pte_mkyoung(entry);
-	entry = pte_mkhuge(entry);
-
-	set_pte(ptep, entry);
-}
-
 /*
  * This function checks for proper alignment of input addr and len parameters.
  */
@@ -292,80 +274,6 @@
 	return -EINVAL;
 }
 
-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma)
-{
-	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	int err = -ENOMEM;
-
-	while (addr < end) {
-		dst_pte = huge_pte_alloc(dst, addr);
-		if (!dst_pte)
-			goto out;
-
-		src_pte = huge_pte_offset(src, addr);
-		entry = *src_pte;
-		
-		ptepage = pte_page(entry);
-		get_page(ptepage);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
-		set_pte(dst_pte, entry);
-
-		addr += HPAGE_SIZE;
-	}
-
-	err = 0;
- out:
-	return err;
-}
-
-int
-follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		    struct page **pages, struct vm_area_struct **vmas,
-		    unsigned long *position, int *length, int i)
-{
-	unsigned long vpfn, vaddr = *position;
-	int remainder = *length;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-
-	vpfn = vaddr/PAGE_SIZE;
-	while (vaddr < vma->vm_end && remainder) {
-		if (pages) {
-			pte_t *pte;
-			struct page *page;
-
-			pte = huge_pte_offset(mm, vaddr);
-
-			/* hugetlb should be locked, and hence, prefaulted */
-			WARN_ON(!pte || pte_none(*pte));
-
-			page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)];
-
-			WARN_ON(!PageCompound(page));
-
-			get_page(page);
-			pages[i] = page;
-		}
-
-		if (vmas)
-			vmas[i] = vma;
-
-		vaddr += PAGE_SIZE;
-		++vpfn;
-		--remainder;
-		++i;
-	}
-
-	*length = remainder;
-	*position = vaddr;
-
-	return i;
-}
-
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
 {
@@ -396,35 +304,6 @@
 	return NULL;
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma,
-			  unsigned long start, unsigned long end)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr;
-	pte_t *ptep;
-	struct page *page;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-	BUG_ON((start % HPAGE_SIZE) != 0);
-	BUG_ON((end % HPAGE_SIZE) != 0);
-
-	for (addr = start; addr < end; addr += HPAGE_SIZE) {
-		pte_t pte;
-
-		ptep = huge_pte_offset(mm, addr);
-		if (!ptep || pte_none(*ptep))
-			continue;
-
-		pte = *ptep;
-		page = pte_page(pte);
-		pte_clear(ptep);
-
-		put_page(page);
-	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
-	flush_tlb_pending();
-}
-
 void hugetlb_free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *prev,
 			   unsigned long start, unsigned long end)
 {
@@ -435,60 +314,6 @@
 	 * destroy_context() to clean up the lot. */
 }
 
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-	BUG_ON((vma->vm_start % HPAGE_SIZE) != 0);
-	BUG_ON((vma->vm_end % HPAGE_SIZE) != 0);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
-		}
-		if (! pte_none(*pte))
-			continue;
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
-		}
-		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
-	spin_unlock(&mm->page_table_lock);
-	return ret;
-}
-
 /* Because we have an exclusive hugepage region which lies within the
  * normal user address space, we have to take special measures to make
  * non-huge mmap()s evade the hugepage reserved regions. */
Index: working-2.6/arch/ia64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/ia64/mm/hugetlbpage.c	2004-08-09 09:51:26.000000000 +1000
+++ working-2.6/arch/ia64/mm/hugetlbpage.c	2004-10-29 11:38:27.134145472 +1000
@@ -24,7 +24,7 @@
 
 unsigned int hpage_shift=HPAGE_SHIFT_DEFAULT;
 
-static pte_t *
+pte_t *
 huge_pte_alloc (struct mm_struct *mm, unsigned long addr)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
@@ -39,7 +39,7 @@
 	return pte;
 }
 
-static pte_t *
+pte_t *
 huge_pte_offset (struct mm_struct *mm, unsigned long addr)
 {
 	unsigned long taddr = htlbpage_to_page(addr);
@@ -57,25 +57,6 @@
 	return pte;
 }
 
-#define mk_pte_huge(entry) { pte_val(entry) |= _PAGE_P; }
-
-static void
-set_huge_pte (struct mm_struct *mm, struct vm_area_struct *vma,
-	      struct page *page, pte_t * page_table, int write_access)
-{
-	pte_t entry;
-
-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
-	if (write_access) {
-		entry =
-		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
-	} else
-		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
-	entry = pte_mkyoung(entry);
-	mk_pte_huge(entry);
-	set_pte(page_table, entry);
-	return;
-}
 /*
  * This function checks for proper alignment of input addr and len parameters.
  */
@@ -91,68 +72,6 @@
 	return 0;
 }
 
-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma)
-{
-	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
-
-	while (addr < end) {
-		dst_pte = huge_pte_alloc(dst, addr);
-		if (!dst_pte)
-			goto nomem;
-		src_pte = huge_pte_offset(src, addr);
-		entry = *src_pte;
-		ptepage = pte_page(entry);
-		get_page(ptepage);
-		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
-		addr += HPAGE_SIZE;
-	}
-	return 0;
-nomem:
-	return -ENOMEM;
-}
-
-int
-follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		    struct page **pages, struct vm_area_struct **vmas,
-		    unsigned long *st, int *length, int i)
-{
-	pte_t *ptep, pte;
-	unsigned long start = *st;
-	unsigned long pstart;
-	int len = *length;
-	struct page *page;
-
-	do {
-		pstart = start & HPAGE_MASK;
-		ptep = huge_pte_offset(mm, start);
-		pte = *ptep;
-
-back1:
-		page = pte_page(pte);
-		if (pages) {
-			page += ((start & ~HPAGE_MASK) >> PAGE_SHIFT);
-			get_page(page);
-			pages[i] = page;
-		}
-		if (vmas)
-			vmas[i] = vma;
-		i++;
-		len--;
-		start += PAGE_SIZE;
-		if (((start & HPAGE_MASK) == pstart) && len &&
-				(start < vma->vm_end))
-			goto back1;
-	} while (len && start < vma->vm_end);
-	*length = len;
-	*st = start;
-	return i;
-}
-
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long addr, int write)
 {
 	struct page *page;
@@ -231,81 +150,6 @@
 	}
 }
 
-void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
-	pte_t *pte;
-	struct page *page;
-
-	BUG_ON(start & (HPAGE_SIZE - 1));
-	BUG_ON(end & (HPAGE_SIZE - 1));
-
-	for (address = start; address < end; address += HPAGE_SIZE) {
-		pte = huge_pte_offset(mm, address);
-		if (pte_none(*pte))
-			continue;
-		page = pte_page(*pte);
-		put_page(page);
-		pte_clear(pte);
-	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
-	flush_tlb_range(vma, start, end);
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
-		}
-		if (!pte_none(*pte))
-			continue;
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				page_cache_release(page);
-				goto out;
-			}
-		}
-		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
-	spin_unlock(&mm->page_table_lock);
-	return ret;
-}
-
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 		unsigned long pgoff, unsigned long flags)
 {
Index: working-2.6/arch/i386/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/i386/mm/hugetlbpage.c	2004-10-27 10:43:46.000000000 +1000
+++ working-2.6/arch/i386/mm/hugetlbpage.c	2004-10-29 11:44:43.541035816 +1000
@@ -18,7 +18,7 @@
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd = NULL;
@@ -28,7 +28,7 @@
 	return (pte_t *) pmd;
 }
 
-static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd = NULL;
@@ -38,21 +38,6 @@
 	return (pte_t *) pmd;
 }
 
-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma, struct page *page, pte_t * page_table, int write_access)
-{
-	pte_t entry;
-
-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
-	if (write_access) {
-		entry =
-		    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
-	} else
-		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
-	entry = pte_mkyoung(entry);
-	mk_pte_huge(entry);
-	set_pte(page_table, entry);
-}
-
 /*
  * This function checks for proper alignment of input addr and len parameters.
  */
@@ -65,77 +50,6 @@
 	return 0;
 }
 
-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			struct vm_area_struct *vma)
-{
-	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
-
-	while (addr < end) {
-		dst_pte = huge_pte_alloc(dst, addr);
-		if (!dst_pte)
-			goto nomem;
-		src_pte = huge_pte_offset(src, addr);
-		entry = *src_pte;
-		ptepage = pte_page(entry);
-		get_page(ptepage);
-		set_pte(dst_pte, entry);
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
-		addr += HPAGE_SIZE;
-	}
-	return 0;
-
-nomem:
-	return -ENOMEM;
-}
-
-int
-follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		    struct page **pages, struct vm_area_struct **vmas,
-		    unsigned long *position, int *length, int i)
-{
-	unsigned long vpfn, vaddr = *position;
-	int remainder = *length;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-
-	vpfn = vaddr/PAGE_SIZE;
-	while (vaddr < vma->vm_end && remainder) {
-
-		if (pages) {
-			pte_t *pte;
-			struct page *page;
-
-			pte = huge_pte_offset(mm, vaddr);
-
-			/* hugetlb should be locked, and hence, prefaulted */
-			WARN_ON(!pte || pte_none(*pte));
-
-			page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)];
-
-			WARN_ON(!PageCompound(page));
-
-			get_page(page);
-			pages[i] = page;
-		}
-
-		if (vmas)
-			vmas[i] = vma;
-
-		vaddr += PAGE_SIZE;
-		++vpfn;
-		--remainder;
-		++i;
-	}
-
-	*length = remainder;
-	*position = vaddr;
-
-	return i;
-}
-
 #if 0	/* This is just for testing */
 struct page *
 follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
@@ -200,87 +114,15 @@
 }
 #endif
 
-void unmap_hugepage_range(struct vm_area_struct *vma,
-		unsigned long start, unsigned long end)
+void hugetlb_clean_stale_pgtable(pte_t *pte)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
-	pte_t pte;
+	pmd_t *pmd = (pmd_t *) pte;
 	struct page *page;
 
-	BUG_ON(start & (HPAGE_SIZE - 1));
-	BUG_ON(end & (HPAGE_SIZE - 1));
-
-	for (address = start; address < end; address += HPAGE_SIZE) {
-		pte = ptep_get_and_clear(huge_pte_offset(mm, address));
-		if (pte_none(pte))
-			continue;
-		page = pte_page(pte);
-		put_page(page);
-	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
-	flush_tlb_range(vma, start, end);
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
-		}
-
-		if (!pte_none(*pte)) {
-			pmd_t *pmd = (pmd_t *) pte;
-
-			page = pmd_page(*pmd);
-			pmd_clear(pmd);
-			mm->nr_ptes--;
-			dec_page_state(nr_page_table_pages);
-			page_cache_release(page);
-		}
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
-		}
-		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
-	spin_unlock(&mm->page_table_lock);
-	return ret;
+	page = pmd_page(*pmd);
+	pmd_clear(pmd);
+	dec_page_state(nr_page_table_pages);
+	page_cache_release(page);
 }
 
 /* x86_64 also uses this file */
Index: working-2.6/arch/sh64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/sh64/mm/hugetlbpage.c	2004-08-09 09:51:41.000000000 +1000
+++ working-2.6/arch/sh64/mm/hugetlbpage.c	2004-10-29 11:38:27.137145016 +1000
@@ -24,7 +24,7 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -39,7 +39,7 @@
 	return pte;
 }
 
-static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -54,23 +54,9 @@
 	return pte;
 }
 
-#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)
-
-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			 struct page *page, pte_t * page_table, int write_access)
+void set_huge_pte(pte_t *page_table, pte_t entry)
 {
 	unsigned long i;
-	pte_t entry;
-
-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
-
-	if (write_access)
-		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
-						       vma->vm_page_prot)));
-	else
-		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
-	entry = pte_mkyoung(entry);
-	mk_pte_huge(entry);
 
 	for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
 		set_pte(page_table, entry);
@@ -80,6 +66,20 @@
 	}
 }
 
+pte_t huge_ptep_get_and_clear(pte_t *ptep)
+{
+	pte_t entry;
+	unsigned long i;
+
+	entry = *ptep;
+	for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
+		pte_clear(ptep);
+		ptep++;
+	}
+
+	return entry;
+}
+
 /*
  * This function checks for proper alignment of input addr and len parameters.
  */
@@ -92,79 +92,6 @@
 	return 0;
 }
 
-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
-{
-	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	int i;
-
-	while (addr < end) {
-		dst_pte = huge_pte_alloc(dst, addr);
-		if (!dst_pte)
-			goto nomem;
-		src_pte = huge_pte_offset(src, addr);
-		BUG_ON(!src_pte || pte_none(*src_pte));
-		entry = *src_pte;
-		ptepage = pte_page(entry);
-		get_page(ptepage);
-		for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
-			set_pte(dst_pte, entry);
-			pte_val(entry) += PAGE_SIZE;
-			dst_pte++;
-		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
-		addr += HPAGE_SIZE;
-	}
-	return 0;
-
-nomem:
-	return -ENOMEM;
-}
-
-int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			struct page **pages, struct vm_area_struct **vmas,
-			unsigned long *position, int *length, int i)
-{
-	unsigned long vaddr = *position;
-	int remainder = *length;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-
-	while (vaddr < vma->vm_end && remainder) {
-		if (pages) {
-			pte_t *pte;
-			struct page *page;
-
-			pte = huge_pte_offset(mm, vaddr);
-
-			/* hugetlb should be locked, and hence, prefaulted */
-			BUG_ON(!pte || pte_none(*pte));
-
-			page = pte_page(*pte);
-
-			WARN_ON(!PageCompound(page));
-
-			get_page(page);
-			pages[i] = page;
-		}
-
-		if (vmas)
-			vmas[i] = vma;
-
-		vaddr += PAGE_SIZE;
-		--remainder;
-		++i;
-	}
-
-	*length = remainder;
-	*position = vaddr;
-
-	return i;
-}
-
 struct page *follow_huge_addr(struct mm_struct *mm,
 			      unsigned long address, int write)
 {
@@ -181,84 +108,3 @@
 {
 	return NULL;
 }
-
-void unmap_hugepage_range(struct vm_area_struct *vma,
-			  unsigned long start, unsigned long end)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
-	pte_t *pte;
-	struct page *page;
-	int i;
-
-	BUG_ON(start & (HPAGE_SIZE - 1));
-	BUG_ON(end & (HPAGE_SIZE - 1));
-
-	for (address = start; address < end; address += HPAGE_SIZE) {
-		pte = huge_pte_offset(mm, address);
-		BUG_ON(!pte);
-		if (pte_none(*pte))
-			continue;
-		page = pte_page(*pte);
-		put_page(page);
-		for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
-			pte_clear(pte);
-			pte++;
-		}
-	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
-	flush_tlb_range(vma, start, end);
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
-		}
-		if (!pte_none(*pte))
-			continue;
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
-		}
-		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
-	spin_unlock(&mm->page_table_lock);
-	return ret;
-}
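
A note on the multi-slot helpers used by sh, sh64 and sparc64, where one huge
page is described by (1 << HUGETLB_PAGE_ORDER) consecutive pte_ts. The
following is a minimal userspace model, not kernel code. The pte_t stand-in,
the order of 2, and the model_* names are assumptions for illustration; the
per-slot PAGE_SIZE increment is assumed from the loop in the removed
copy_hugetlb_page_range() above, since the full body of set_huge_pte() is not
shown in this diff.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's pte_t and constants. */
typedef struct { unsigned long val; } pte_t;
#define PAGE_SIZE		4096UL
#define HUGETLB_PAGE_ORDER	2	/* assumed: 4 base PTEs per huge page */
#define NR_SLOTS		(1 << HUGETLB_PAGE_ORDER)

/* Fill every slot of one huge page, advancing the physical address
 * by one base page per slot. */
static inline void model_set_huge_pte(pte_t *ptep, pte_t entry)
{
	int i;

	for (i = 0; i < NR_SLOTS; i++) {
		ptep[i] = entry;
		entry.val += PAGE_SIZE;
	}
}

/* Return the first slot's old value and clear all slots. */
static inline pte_t model_huge_ptep_get_and_clear(pte_t *ptep)
{
	pte_t entry = *ptep;
	int i;

	for (i = 0; i < NR_SLOTS; i++)
		ptep[i].val = 0;

	return entry;
}
```

The caller only ever sees the first slot's value, which is why the generic
unmap_hugepage_range() can treat the result like a single pte on every
architecture.
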
Index: working-2.6/arch/sh/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/sh/mm/hugetlbpage.c	2004-08-09 09:51:40.000000000 +1000
+++ working-2.6/arch/sh/mm/hugetlbpage.c	2004-10-29 11:38:27.138144864 +1000
@@ -24,7 +24,7 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -39,7 +39,7 @@
 	return pte;
 }
 
-static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -54,23 +54,9 @@
 	return pte;
 }
 
-#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)
-
-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			 struct page *page, pte_t * page_table, int write_access)
+void set_huge_pte(pte_t *page_table, pte_t entry)
 {
 	unsigned long i;
-	pte_t entry;
-
-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
-
-	if (write_access)
-		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
-						       vma->vm_page_prot)));
-	else
-		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
-	entry = pte_mkyoung(entry);
-	mk_pte_huge(entry);
 
 	for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
 		set_pte(page_table, entry);
@@ -80,6 +66,20 @@
 	}
 }
 
+pte_t huge_ptep_get_and_clear(pte_t *ptep)
+{
+	pte_t entry;
+	unsigned long i;
+
+	entry = *ptep;
+	for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
+		pte_clear(ptep);
+		ptep++;
+	}
+
+	return entry;
+}
+
 /*
  * This function checks for proper alignment of input addr and len parameters.
  */
@@ -92,79 +92,6 @@
 	return 0;
 }
 
-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
-{
-	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	int i;
-
-	while (addr < end) {
-		dst_pte = huge_pte_alloc(dst, addr);
-		if (!dst_pte)
-			goto nomem;
-		src_pte = huge_pte_offset(src, addr);
-		BUG_ON(!src_pte || pte_none(*src_pte));
-		entry = *src_pte;
-		ptepage = pte_page(entry);
-		get_page(ptepage);
-		for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
-			set_pte(dst_pte, entry);
-			pte_val(entry) += PAGE_SIZE;
-			dst_pte++;
-		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
-		addr += HPAGE_SIZE;
-	}
-	return 0;
-
-nomem:
-	return -ENOMEM;
-}
-
-int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			struct page **pages, struct vm_area_struct **vmas,
-			unsigned long *position, int *length, int i)
-{
-	unsigned long vaddr = *position;
-	int remainder = *length;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-
-	while (vaddr < vma->vm_end && remainder) {
-		if (pages) {
-			pte_t *pte;
-			struct page *page;
-
-			pte = huge_pte_offset(mm, vaddr);
-
-			/* hugetlb should be locked, and hence, prefaulted */
-			BUG_ON(!pte || pte_none(*pte));
-
-			page = pte_page(*pte);
-
-			WARN_ON(!PageCompound(page));
-
-			get_page(page);
-			pages[i] = page;
-		}
-
-		if (vmas)
-			vmas[i] = vma;
-
-		vaddr += PAGE_SIZE;
-		--remainder;
-		++i;
-	}
-
-	*length = remainder;
-	*position = vaddr;
-
-	return i;
-}
-
 struct page *follow_huge_addr(struct mm_struct *mm,
 			      unsigned long address, int write)
 {
@@ -181,84 +108,3 @@
 {
 	return NULL;
 }
-
-void unmap_hugepage_range(struct vm_area_struct *vma,
-			  unsigned long start, unsigned long end)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
-	pte_t *pte;
-	struct page *page;
-	int i;
-
-	BUG_ON(start & (HPAGE_SIZE - 1));
-	BUG_ON(end & (HPAGE_SIZE - 1));
-
-	for (address = start; address < end; address += HPAGE_SIZE) {
-		pte = huge_pte_offset(mm, address);
-		BUG_ON(!pte);
-		if (pte_none(*pte))
-			continue;
-		page = pte_page(*pte);
-		put_page(page);
-		for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
-			pte_clear(pte);
-			pte++;
-		}
-	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
-	flush_tlb_range(vma, start, end);
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
-		}
-		if (!pte_none(*pte))
-			continue;
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
-		}
-		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
-	spin_unlock(&mm->page_table_lock);
-	return ret;
-}
Index: working-2.6/arch/sparc64/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/sparc64/mm/hugetlbpage.c	2004-08-09 09:51:42.000000000 +1000
+++ working-2.6/arch/sparc64/mm/hugetlbpage.c	2004-10-29 11:38:27.138144864 +1000
@@ -21,7 +21,7 @@
 #include <asm/tlbflush.h>
 #include <asm/cacheflush.h>
 
-static pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -36,7 +36,7 @@
 	return pte;
 }
 
-static pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
 {
 	pgd_t *pgd;
 	pmd_t *pmd;
@@ -51,23 +51,9 @@
 	return pte;
 }
 
-#define mk_pte_huge(entry) do { pte_val(entry) |= _PAGE_SZHUGE; } while (0)
-
-static void set_huge_pte(struct mm_struct *mm, struct vm_area_struct *vma,
-			 struct page *page, pte_t * page_table, int write_access)
+void set_huge_pte(pte_t *page_table, pte_t entry)
 {
 	unsigned long i;
-	pte_t entry;
-
-	mm->rss += (HPAGE_SIZE / PAGE_SIZE);
-
-	if (write_access)
-		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
-						       vma->vm_page_prot)));
-	else
-		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
-	entry = pte_mkyoung(entry);
-	mk_pte_huge(entry);
 
 	for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
 		set_pte(page_table, entry);
@@ -77,6 +63,20 @@
 	}
 }
 
+pte_t huge_ptep_get_and_clear(pte_t *ptep)
+{
+	pte_t entry;
+	unsigned long i;
+
+	entry = *ptep;
+	for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
+		pte_clear(ptep);
+		ptep++;
+	}
+
+	return entry;
+}
+
 /*
  * This function checks for proper alignment of input addr and len parameters.
  */
@@ -89,79 +89,6 @@
 	return 0;
 }
 
-int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
-			    struct vm_area_struct *vma)
-{
-	pte_t *src_pte, *dst_pte, entry;
-	struct page *ptepage;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
-	int i;
-
-	while (addr < end) {
-		dst_pte = huge_pte_alloc(dst, addr);
-		if (!dst_pte)
-			goto nomem;
-		src_pte = huge_pte_offset(src, addr);
-		BUG_ON(!src_pte || pte_none(*src_pte));
-		entry = *src_pte;
-		ptepage = pte_page(entry);
-		get_page(ptepage);
-		for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
-			set_pte(dst_pte, entry);
-			pte_val(entry) += PAGE_SIZE;
-			dst_pte++;
-		}
-		dst->rss += (HPAGE_SIZE / PAGE_SIZE);
-		addr += HPAGE_SIZE;
-	}
-	return 0;
-
-nomem:
-	return -ENOMEM;
-}
-
-int follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			struct page **pages, struct vm_area_struct **vmas,
-			unsigned long *position, int *length, int i)
-{
-	unsigned long vaddr = *position;
-	int remainder = *length;
-
-	WARN_ON(!is_vm_hugetlb_page(vma));
-
-	while (vaddr < vma->vm_end && remainder) {
-		if (pages) {
-			pte_t *pte;
-			struct page *page;
-
-			pte = huge_pte_offset(mm, vaddr);
-
-			/* hugetlb should be locked, and hence, prefaulted */
-			BUG_ON(!pte || pte_none(*pte));
-
-			page = pte_page(*pte);
-
-			WARN_ON(!PageCompound(page));
-
-			get_page(page);
-			pages[i] = page;
-		}
-
-		if (vmas)
-			vmas[i] = vma;
-
-		vaddr += PAGE_SIZE;
-		--remainder;
-		++i;
-	}
-
-	*length = remainder;
-	*position = vaddr;
-
-	return i;
-}
-
 struct page *follow_huge_addr(struct mm_struct *mm,
 			      unsigned long address, int write)
 {
@@ -178,84 +105,3 @@
 {
 	return NULL;
 }
-
-void unmap_hugepage_range(struct vm_area_struct *vma,
-			  unsigned long start, unsigned long end)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	unsigned long address;
-	pte_t *pte;
-	struct page *page;
-	int i;
-
-	BUG_ON(start & (HPAGE_SIZE - 1));
-	BUG_ON(end & (HPAGE_SIZE - 1));
-
-	for (address = start; address < end; address += HPAGE_SIZE) {
-		pte = huge_pte_offset(mm, address);
-		BUG_ON(!pte);
-		if (pte_none(*pte))
-			continue;
-		page = pte_page(*pte);
-		put_page(page);
-		for (i = 0; i < (1 << HUGETLB_PAGE_ORDER); i++) {
-			pte_clear(pte);
-			pte++;
-		}
-	}
-	mm->rss -= (end - start) >> PAGE_SHIFT;
-	flush_tlb_range(vma, start, end);
-}
-
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
-{
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
-		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
-
-		if (!pte) {
-			ret = -ENOMEM;
-			goto out;
-		}
-		if (!pte_none(*pte))
-			continue;
-
-		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
-			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-		page = find_get_page(mapping, idx);
-		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
-				goto out;
-			}
-			page = alloc_huge_page();
-			if (!page) {
-				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
-				goto out;
-			}
-			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
-			if (! ret) {
-				unlock_page(page);
-			} else {
-				hugetlb_put_quota(mapping);
-				free_huge_page(page);
-				goto out;
-			}
-		}
-		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
-	spin_unlock(&mm->page_table_lock);
-	return ret;
-}
Index: working-2.6/include/asm-sh/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-sh/pgtable.h	2004-10-29 10:15:21.000000000 +1000
+++ working-2.6/include/asm-sh/pgtable.h	2004-10-29 11:38:27.139144712 +1000
@@ -194,6 +194,7 @@
 static inline pte_t pte_mkdirty(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
+static inline pte_t pte_mkhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_SZHUGE)); return pte; }
 
 /*
  * Macro and implementation to make a page protection as uncachable.
Index: working-2.6/include/asm-ia64/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-ia64/pgtable.h	2004-10-29 10:15:20.000000000 +1000
+++ working-2.6/include/asm-ia64/pgtable.h	2004-10-29 11:38:27.140144560 +1000
@@ -281,6 +281,7 @@
 #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_A))
 #define pte_mkclean(pte)	(__pte(pte_val(pte) & ~_PAGE_D))
 #define pte_mkdirty(pte)	(__pte(pte_val(pte) | _PAGE_D))
+#define pte_mkhuge(pte)		(__pte(pte_val(pte) | _PAGE_P))
 
 /*
  * Macro to a page protection value as "uncacheable".  Note that "protection" is really a
Index: working-2.6/include/asm-i386/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-i386/pgtable.h	2004-10-21 11:55:01.000000000 +1000
+++ working-2.6/include/asm-i386/pgtable.h	2004-10-29 11:38:27.141144408 +1000
@@ -236,6 +236,7 @@
 static inline pte_t pte_mkdirty(pte_t pte)	{ (pte).pte_low |= _PAGE_DIRTY; return pte; }
 static inline pte_t pte_mkyoung(pte_t pte)	{ (pte).pte_low |= _PAGE_ACCESSED; return pte; }
 static inline pte_t pte_mkwrite(pte_t pte)	{ (pte).pte_low |= _PAGE_RW; return pte; }
+static inline pte_t pte_mkhuge(pte_t pte)	{ (pte).pte_low |= _PAGE_PRESENT | _PAGE_PSE; return pte; }
 
 #ifdef CONFIG_X86_PAE
 # include <asm/pgtable-3level.h>
@@ -273,7 +274,6 @@
  */
 
 #define mk_pte(page, pgprot)	pfn_pte(page_to_pfn(page), (pgprot))
-#define mk_pte_huge(entry) ((entry).pte_low |= _PAGE_PRESENT | _PAGE_PSE)
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
Index: working-2.6/include/asm-sparc64/page.h
===================================================================
--- working-2.6.orig/include/asm-sparc64/page.h	2004-08-09 09:52:58.000000000 +1000
+++ working-2.6/include/asm-sparc64/page.h	2004-10-29 11:38:27.141144408 +1000
@@ -93,6 +93,7 @@
 #define HPAGE_SIZE		(_AC(1,UL) << HPAGE_SHIFT)
 #define HPAGE_MASK		(~(HPAGE_SIZE - 1UL))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
+#define ARCH_HAS_SETCLEAR_HUGE_PTE
 #endif
 
 #define TASK_UNMAPPED_BASE	(test_thread_flag(TIF_32BIT) ? \
Index: working-2.6/include/asm-sparc64/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-sparc64/pgtable.h	2004-08-11 10:28:33.000000000 +1000
+++ working-2.6/include/asm-sparc64/pgtable.h	2004-10-29 11:38:27.142144256 +1000
@@ -302,6 +302,7 @@
 #define pte_mkyoung(pte)	(__pte(pte_val(pte) | _PAGE_ACCESSED | _PAGE_R))
 #define pte_mkwrite(pte)	(__pte(pte_val(pte) | _PAGE_WRITE))
 #define pte_mkdirty(pte)	(__pte(pte_val(pte) | _PAGE_MODIFIED | _PAGE_W))
+#define pte_mkhuge(pte)		(__pte(pte_val(pte) | _PAGE_SZHUGE))
 
 /* to find an entry in a page-table-directory. */
 #define pgd_index(address)	(((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD))
Index: working-2.6/include/asm-sh/page.h
===================================================================
--- working-2.6.orig/include/asm-sh/page.h	2004-10-19 17:17:04.000000000 +1000
+++ working-2.6/include/asm-sh/page.h	2004-10-29 11:38:27.142144256 +1000
@@ -31,6 +31,7 @@
 #define HPAGE_SIZE		(1UL << HPAGE_SHIFT)
 #define HPAGE_MASK		(~(HPAGE_SIZE-1))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT-PAGE_SHIFT)
+#define ARCH_HAS_SETCLEAR_HUGE_PTE
 #endif
 
 #ifdef __KERNEL__
Index: working-2.6/include/asm-sh64/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-sh64/pgtable.h	2004-10-21 11:55:01.000000000 +1000
+++ working-2.6/include/asm-sh64/pgtable.h	2004-10-29 11:38:27.143144104 +1000
@@ -429,6 +429,8 @@
 extern inline pte_t pte_mkexec(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_EXECUTE)); return pte; }
 extern inline pte_t pte_mkdirty(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
 extern inline pte_t pte_mkyoung(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
+extern inline pte_t pte_mkhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) | _PAGE_SZHUGE)); return pte; }
+
 
 /*
  * Conversion functions: convert a page and protection to a page entry.
Index: working-2.6/include/asm-sh64/page.h
===================================================================
--- working-2.6.orig/include/asm-sh64/page.h	2004-08-09 09:52:55.000000000 +1000
+++ working-2.6/include/asm-sh64/page.h	2004-10-29 11:38:27.144143952 +1000
@@ -41,6 +41,7 @@
 #define HPAGE_SIZE		(1UL << HPAGE_SHIFT)
 #define HPAGE_MASK		(~(HPAGE_SIZE-1))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT-PAGE_SHIFT)
+#define ARCH_HAS_SETCLEAR_HUGE_PTE
 #endif
 
 #ifdef __KERNEL__
Index: working-2.6/include/linux/hugetlb.h
===================================================================
--- working-2.6.orig/include/linux/hugetlb.h	2004-08-09 09:53:01.000000000 +1000
+++ working-2.6/include/linux/hugetlb.h	2004-10-29 11:38:27.144143952 +1000
@@ -47,6 +47,20 @@
 int prepare_hugepage_range(unsigned long addr, unsigned long len);
 #endif
 
+#ifndef ARCH_HAS_SETCLEAR_HUGE_PTE
+#define set_huge_pte(ptep, pte)	set_pte(ptep, pte)
+#define huge_ptep_get_and_clear(ptep) ptep_get_and_clear(ptep)
+#else
+void set_huge_pte(pte_t *ptep, pte_t pte);
+pte_t huge_ptep_get_and_clear(pte_t *ptep);
+#endif
+
+#ifndef ARCH_HAS_HUGETLB_CLEAN_STALE_PGTABLE
+#define hugetlb_clean_stale_pgtable(pte)	BUG()
+#else
+void hugetlb_clean_stale_pgtable(pte_t *pte);
+#endif
+
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
Index: working-2.6/include/asm-i386/page.h
===================================================================
--- working-2.6.orig/include/asm-i386/page.h	2004-10-27 10:43:47.000000000 +1000
+++ working-2.6/include/asm-i386/page.h	2004-10-29 11:39:01.817064456 +1000
@@ -64,6 +64,7 @@
 #define HPAGE_MASK	(~(HPAGE_SIZE - 1))
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)
 #define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
+#define ARCH_HAS_HUGETLB_CLEAN_STALE_PGTABLE
 #endif
 
 
-- 
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson
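
A minimal userspace sketch of the two conventions the header changes
above rely on, for anyone who wants to poke at them outside the kernel:
pte_mkhuge() returns a modified copy rather than changing its argument
in place (as the old i386 mk_pte_huge() macro did), and set_huge_pte()
falls back to set_pte() unless the architecture defines
ARCH_HAS_SETCLEAR_HUGE_PTE.  The pte_t layout, the flag values and
hugepte_sketch_selfcheck() are stand-ins invented for this example, not
the kernel's definitions.

```c
#include <assert.h>

/* Stand-in for the per-architecture pte type (assumption). */
typedef struct { unsigned long pte_low; } pte_t;

/* Illustrative flag values only; real values are per-architecture. */
#define _PAGE_PRESENT	0x001UL
#define _PAGE_PSE	0x080UL

/* Value-returning helper: the caller's pte is passed by value and left alone. */
static inline pte_t pte_mkhuge(pte_t pte)
{
	pte.pte_low |= _PAGE_PRESENT | _PAGE_PSE;
	return pte;
}

/* Stand-in for the ordinary small-page store. */
static inline void set_pte(pte_t *ptep, pte_t pte)
{
	*ptep = pte;
}

/* Generic fallback unless the architecture says it has its own version. */
#ifndef ARCH_HAS_SETCLEAR_HUGE_PTE
#define set_huge_pte(ptep, pte)	set_pte(ptep, pte)
#else
void set_huge_pte(pte_t *ptep, pte_t pte);
#endif

/* Hypothetical usage: build a huge pte from a base entry and store it. */
void hugepte_sketch_selfcheck(void)
{
	pte_t base = { 0x1000UL };
	pte_t slot = { 0 };
	pte_t huge = pte_mkhuge(base);

	assert(base.pte_low == 0x1000UL);	/* argument not modified */
	set_huge_pte(&slot, huge);
	assert(slot.pte_low == (0x1000UL | _PAGE_PRESENT | _PAGE_PSE));
}
```

Building the same file with -DARCH_HAS_SETCLEAR_HUGE_PTE leaves
set_huge_pte() declared but undefined, which is the point where an
architecture would have to supply its own implementation.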


From Darren.Sheppard at ncode.com  Fri Oct 29 18:33:54 2004
From: Darren.Sheppard at ncode.com (Darren Sheppard)
Date: Fri, 29 Oct 2004 09:33:54 +0100
Subject: Shared Libraries and Exceptions on PSeries
Message-ID: <D412B46FD18B964694F7D63DFD2B442177BD78@neptune.ncode.com>

I am new to this site so apologies if I have inadvertently broken any
rules.

We are having trouble catching exceptions within a shared library built
on IBM PSeries running SUSE Linux 8.0 using the 32-bit gcc compiler.

We have created a very simple test application which demonstrates this.
We are pretty experienced with porting code to Unix platforms but have
never come across this before. The code sample works on all of our Unix
and Linux platforms and on Windows.

There is no possibility of upgrading to SUSE 9, as the project we are
working on is for a large multinational company who won't upgrade for
another 2 years.

Here is the code:

MAIN.CPP

#include <stdio.h>

int main(int argc, char *argv[])
{
   printf ("In main\n");

   // Defined in SHARED.CPP, which is built into the shared library.
   void shared_func();

   try
   {
      throw 1;
   }
   catch(int)
   {
      printf ("Catch in main ok\n");
   }

   // shared_func() catches its own exception, so nothing should arrive here.
   try
   {
      shared_func();
   }
   catch(...)
   {
      printf ("Caught shared exception in main - ERROR\n");
   }

   return 0;
}

SHARED.CPP

#include <stdio.h>

void shared_func()
{
   try
   {
      printf ("Throwing in shared\n");
      throw 1;
   }
   catch(...)
   {
      printf ("Caught in shared\n");
   }
}
 

From dwm at austin.ibm.com  Sat Oct 30 05:55:41 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Fri, 29 Oct 2004 14:55:41 -0500
Subject: 2.6.10-rc1-mm2 
In-Reply-To: <20041029014930.21ed5b9a.akpm@osdl.org> 
Message-ID: <200410291955.i9TJtfaj014056@falcon10.austin.ibm.com>


Andrew, 

having some troubles on ppc64.  It looks like the changes in
the scripts/Makefile.{clean,build} are expecting include/asm to
exist in the source tree.  I don't see any related file except the
include/asm-$ARCH/Kbuild


Below is the output from a hacked-up attempt to add a $(srctree) check
to fix scripts/Makefile.build.  The kbuild lines come from a $(warning)
that I added at the top of the file:


=============================
cmd=={make -j4 O=/build/dwm/build/lk-2.6.10-rc1-mm2.edit/ppc64 zImage}
  Using /build/dwm/linux/lk-2.6.10-rc1-mm2.edit as source for kernel
  CHK     include/linux/version.h
  GEN    /build/dwm/build/lk-2.6.10-rc1-mm2.edit/ppc64/Makefile
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:13: kbuild: obj=scripts/basic/Kbuild srctree=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/basic/Kbuild, make=scripts/basic/Makefile!
  GEN    /build/dwm/build/lk-2.6.10-rc1-mm2.edit/ppc64/Makefile
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:13: kbuild: obj=scripts/kconfig/Kbuild srctree=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/kconfig/Kbuild, make=scripts/kconfig/Makefile!
scripts/kconfig/conf -s arch/ppc64/Kconfig
 #
 # using defaults found in .config
 #
  SPLIT   include/linux/autoconf.h -> include/config/*
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:13: kbuild: obj=scripts/basic/Kbuild srctree=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/basic/Kbuild, make=scripts/basic/Makefile!
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:13: kbuild: obj=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/include/asm/Kbuild srctree=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit//build/dwm/linux/lk-2.6.10-rc1-mm2.edit/include/asm/Kbuild, make=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/include/asm/Makefile!
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:14: /build/dwm/linux/lk-2.6.10-rc1-mm2.edit/include/asm/Makefile: No such file or directory
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:13: kbuild: obj=scripts/Kbuild srctree=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Kbuild, make=scripts/Makefile!
make[2]: *** No rule to make target `/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/include/asm/Makefile'.  Stop.
make[1]: *** [prepare0] Error 2
make[1]: *** Waiting for unfinished jobs....
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:13: kbuild: obj=scripts/genksyms/Kbuild srctree=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/genksyms/Kbuild, make=scripts/genksyms/Makefile!
/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build:13: kbuild: obj=scripts/mod/Kbuild srctree=/build/dwm/linux/lk-2.6.10-rc1-mm2.edit/scripts/mod/Kbuild, make=scripts/mod/Makefile!
make: *** [zImage] Error 2

=============================

diff from vanilla scripts/Makefile.{build,clean}

=============================
diff -Nwupa libata-dev-2.6/scripts/Makefile.build lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build
--- libata-dev-2.6/scripts/Makefile.build       2004-10-27 15:38:46.972904640 -0500
+++ lk-2.6.10-rc1-mm2.edit/scripts/Makefile.build       2004-10-29 12:50:35.766986000 -0500
@@ -10,7 +10,7 @@ __build:
 # Read .config if it exist, otherwise ignore
 -include .config
 
-include $(obj)/Makefile
+include $(if $(wildcard $(obj)/Kbuild), $(obj)/Kbuild, $(obj)/Makefile)
 
 include scripts/Makefile.lib
 
diff -Nwupa libata-dev-2.6/scripts/Makefile.clean lk-2.6.10-rc1-mm2.edit/scripts/Makefile.clean
--- libata-dev-2.6/scripts/Makefile.clean       2004-10-27 15:38:46.972904640 -0500
+++ lk-2.6.10-rc1-mm2.edit/scripts/Makefile.clean       2004-10-29 12:50:35.766986000 -0500
@@ -7,7 +7,7 @@ src := $(obj)
 .PHONY: __clean
 __clean:
 
-include $(obj)/Makefile
+include $(if $(wildcard $(obj)/Kbuild), $(obj)/Kbuild, $(obj)/Makefile)
 
 # Figure out what we need to build from the various variables


From sam at ravnborg.org  Sat Oct 30 08:13:07 2004
From: sam at ravnborg.org (Sam Ravnborg)
Date: Sat, 30 Oct 2004 00:13:07 +0200
Subject: 2.6.10-rc1-mm2
In-Reply-To: <200410291955.i9TJtfaj014056@falcon10.austin.ibm.com>
References: <20041029014930.21ed5b9a.akpm@osdl.org>
	<200410291955.i9TJtfaj014056@falcon10.austin.ibm.com>
Message-ID: <20041029221307.GB11016@mars.ravnborg.org>

On Fri, Oct 29, 2004 at 02:55:41PM -0500, Doug Maxey wrote:
> 
> Andrew, 
> 
> having some troubles on ppc64.  It looks like the changes in
> the scripts/Makefile.{clean,build} are expecting include/asm to
> exist in the source tree.  I don't see any related file except the
> include/asm-$ARCH/Kbuild

Fix attached.

	Sam

===== Makefile 1.546 vs edited =====
--- 1.546/Makefile	2004-10-27 23:00:25 +02:00
+++ edited/Makefile	2004-10-29 23:05:42 +02:00
@@ -761,7 +761,7 @@
 prepare1: prepare2 outputmakefile
 
 prepare0: prepare1 include/linux/version.h include/asm include/config/MARKER
-	$(Q)$(MAKE) $(build)=$(srctree)/include/asm
+	$(Q)$(MAKE) $(build)=include/asm-$(ARCH)
 ifneq ($(KBUILD_MODULES),)
 	$(Q)rm -rf $(MODVERDIR)
 	$(Q)mkdir -p $(MODVERDIR)
===== include/asm-i386/Kbuild 1.1 vs edited =====
--- 1.1/include/asm-i386/Kbuild	2004-10-27 23:06:50 +02:00
+++ edited/include/asm-i386/Kbuild	2004-10-29 01:44:08 +02:00
@@ -11,7 +11,7 @@
 always  := offsets.h
 targets := offsets.s
 
-CFLAGS_offsets.o := -I arch/i386/kernel
+CFLAGS_offsets.o := -Iarch/i386/kernel
 
 $(obj)/offsets.h: $(obj)/offsets.s FORCE
 	$(call filechk,gen-asm-offsets, < $<)
===== scripts/Makefile.build 1.51 vs edited =====
--- 1.51/scripts/Makefile.build	2004-10-27 22:49:53 +02:00
+++ edited/scripts/Makefile.build	2004-10-29 23:04:40 +02:00
@@ -10,7 +10,7 @@
 # Read .config if it exist, otherwise ignore
 -include .config
 
-include $(if $(wildcard $(obj)/Kbuild), $(obj)/Kbuild, $(obj)/Makefile)
+include $(if $(wildcard $(srctree)/$(obj)/Kbuild), $(obj)/Kbuild, $(obj)/Makefile)
 
 include scripts/Makefile.lib
 
===== scripts/Makefile.clean 1.17 vs edited =====
--- 1.17/scripts/Makefile.clean	2004-10-27 22:49:53 +02:00
+++ edited/scripts/Makefile.clean	2004-10-29 23:22:26 +02:00
@@ -7,7 +7,7 @@
 .PHONY: __clean
 __clean:
 
-include $(if $(wildcard $(obj)/Kbuild), $(obj)/Kbuild, $(obj)/Makefile)
+include $(if $(wildcard $(srctree)/$(obj)/Kbuild), $(obj)/Kbuild, $(obj)/Makefile)
 
 # Figure out what we need to build from the various variables
 # ==========================================================================


From dwm at austin.ibm.com  Sat Oct 30 07:24:11 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Fri, 29 Oct 2004 16:24:11 -0500
Subject: 2.6.10-rc1-mm2 
In-Reply-To: <20041029221307.GB11016@mars.ravnborg.org> 
Message-ID: <200410292124.i9TLOBIe014728@falcon10.austin.ibm.com>


On Sat, 30 Oct 2004 00:13:07 +0200, Sam Ravnborg wrote:
>On Fri, Oct 29, 2004 at 02:55:41PM -0500, Doug Maxey wrote:
>> 
>> Andrew, 
>> 
>> having some troubles on ppc64.  It looks like the changes in
>> the scripts/Makefile.{clean,build} are expecting include/asm to
>> exist in the source tree.  I don't see any related file except the
>> include/asm-$ARCH/Kbuild
>
>Fix attached.

Worked, thanks!


From dwm at austin.ibm.com  Sat Oct 30 08:09:03 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Fri, 29 Oct 2004 17:09:03 -0500
Subject: [PATCH 1/1] ppc64 install outside of source tree
Message-ID: <200410292209.i9TM937o014943@falcon10.austin.ibm.com>


Sam, 
please apply.  Have been using this for a while.

Name: arch/ppc64/boot install outside of source tree

Rationale:
	When building outside the source tree, install.sh is looked for
	on the obj side, where it does not exist.

Status:  tested on ppc64 builds

Signed-off-by: Doug Maxey <dwm at austin.ibm.com>

ChangeLog:
* give ppc64 the ability to run install.sh from a build outside srctree.

++doug
IBM Linux Technology Center

===== arch/ppc64/boot/Makefile 1.25 vs edited =====
--- 1.25/arch/ppc64/boot/Makefile	2004-10-03 12:23:50 -05:00
+++ edited/arch/ppc64/boot/Makefile	2004-10-11 14:15:58 -05:00
@@ -118,6 +118,6 @@
 		>> $(obj)/imagesize.c
 
 install: $(CONFIGURE) $(obj)/$(BOOTIMAGE)
-	sh -x $(src)/install.sh "$(KERNELRELEASE)" "$(obj)/$(BOOTIMAGE)" "$(INSTALL_PATH)"
+	sh -x $(srctree)/$(src)/install.sh "$(KERNELRELEASE)" "$(obj)/$(BOOTIMAGE)" "$(INSTALL_PATH)"
 
 clean-files := $(addprefix $(objtree)/, $(obj-boot) vmlinux.strip)


From sam at ravnborg.org  Sun Oct 31 09:12:59 2004
From: sam at ravnborg.org (Sam Ravnborg)
Date: Sun, 31 Oct 2004 00:12:59 +0200
Subject: [PATCH 1/1] ppc64 install outside of source tree
In-Reply-To: <200410292209.i9TM937o014943@falcon10.austin.ibm.com>
References: <200410292209.i9TM937o014943@falcon10.austin.ibm.com>
Message-ID: <20041030221259.GA9592@mars.ravnborg.org>

On Fri, Oct 29, 2004 at 05:09:03PM -0500, Doug Maxey wrote:
> 
> Sam, 
> please apply.  Have been using this for a while.

Applied.

	Sam


From hpa at zytor.com  Sun Oct 31 10:37:40 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Sat, 30 Oct 2004 16:37:40 -0700
Subject: PATCH: Altivec support for RAID-6
Message-ID: <418425C4.1020900@zytor.com>

This patch allows the RAID-6 code to use Altivec on ppc/ppc64 
processors.  Note that it uses gcc <altivec.h> support, so it might 
require a fairly recent gcc -- but I haven't been able to get a clear 
answer on *how* new.

It also changes -mcpu=power4 to -mcpu=970 when CONFIG_ALTIVEC is 
enabled, since -mcpu=power4 doesn't allow -maltivec to be specified with 
it :(

The results are *impressive*, however; on a PowerMac G5 I get 6.1 GB/s 
(on one CPU!); this is close to the 7.8 GB/s for RAID-5, and almost 2x 
what my 3 GHz Pentium4 gets.

	-hpa
-------------- next part --------------
A non-text attachment was scrubbed...
Name: raid6altivec.diff
Type: text/x-patch
Size: 7285 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041030/1f500436/attachment.bin 
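
Since the attachment is not reproduced in the archive, here is a rough
scalar sketch of the P/Q syndrome computation that RAID-6 code has to
perform and that vector units such as Altivec can speed up.  It is not
taken from the patch.  It assumes the usual RAID-6 field GF(2^8) with
polynomial 0x11d and generator 2, and the names gf_mul2(),
gen_syndrome_sketch() and raid6_sketch_selfcheck() are made up for
this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply one byte by 2 in GF(2^8), reducing by the polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t x)
{
	return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0x00));
}

/*
 * Hypothetical helper: compute P (plain XOR) and Q (data weighted by
 * powers of 2) over ndata data blocks of 'bytes' bytes each.  Q is
 * evaluated with Horner's rule, from the highest-numbered disk down.
 */
static void gen_syndrome_sketch(int ndata, size_t bytes,
				const uint8_t *const *data,
				uint8_t *p, uint8_t *q)
{
	assert(ndata >= 1);

	for (size_t i = 0; i < bytes; i++) {
		uint8_t wp = data[ndata - 1][i];
		uint8_t wq = wp;

		for (int z = ndata - 2; z >= 0; z--) {
			wp ^= data[z][i];
			wq = (uint8_t)(gf_mul2(wq) ^ data[z][i]);
		}
		p[i] = wp;
		q[i] = wq;
	}
}

/* Hypothetical usage: three one-byte "disks". */
void raid6_sketch_selfcheck(void)
{
	const uint8_t d0[1] = { 0x01 }, d1[1] = { 0x02 }, d2[1] = { 0x80 };
	const uint8_t *data[3] = { d0, d1, d2 };
	uint8_t p[1], q[1];

	gen_syndrome_sketch(3, 1, data, p, q);
	assert(p[0] == 0x83);	/* 0x01 ^ 0x02 ^ 0x80 */
	assert(q[0] == 0x3f);	/* 0x01 ^ 2*0x02 ^ 4*0x80 in GF(2^8) */
}
```

A vector implementation does the same per-byte work on a whole
register of bytes at a time, which is where a speedup like the one
reported above would come from.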

From hpa at zytor.com  Sun Oct 31 10:37:59 2004
From: hpa at zytor.com (H. Peter Anvin)
Date: Sat, 30 Oct 2004 16:37:59 -0700
Subject: PATCH: Altivec support for RAID-6
Message-ID: <418425D7.1050602@zytor.com>

This patch allows the RAID-6 code to use Altivec on ppc/ppc64 
processors.  Note that it uses gcc <altivec.h> support, so it might 
require a fairly recent gcc -- but I haven't been able to get a clear 
answer on *how* new.

It also changes -mcpu=power4 to -mcpu=970 when CONFIG_ALTIVEC is 
enabled, since -mcpu=power4 doesn't allow -maltivec to be specified with 
it :(

The results are *impressive*, however; on a PowerMac G5 I get 6.1 GB/s 
(on one CPU!); this is close to the 7.8 GB/s for RAID-5, and almost 2x 
what my 3 GHz Pentium4 gets.

	-hpa


Signed-Off-By: H. Peter Anvin <hpa at zytor.com>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: raid6altivec.diff
Type: text/x-patch
Size: 7285 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20041030/65ffa544/attachment.bin 

From olh at suse.de  Thu Oct 28 17:25:59 2004
From: olh at suse.de (Olaf Hering)
Date: Thu, 28 Oct 2004 09:25:59 +0200
Subject: module.viomap support for ppc64
In-Reply-To: <16768.31051.268932.927382@cargo.ozlabs.ibm.com>
References: <20040812173751.GA30564@suse.de>
	<1092339278.19137.8.camel@localhost>
	<1092354195.25196.11.camel@bach> <20040813094040.GA1769@suse.de>
	<16768.31051.268932.927382@cargo.ozlabs.ibm.com>
Message-ID: <20041028072559.GA4977@suse.de>

 On Thu, Oct 28, Paul Mackerras wrote:

> Olaf Hering writes:
> 
> > A hack for 2.6.8-rc4 is below. Can I read the alias file via 
> > while read a b c ; do : done < modules.alias ?
> > Is b supposed to contain not spaces? What special delimiter chars are
> > allowed? The 'name' and 'compat' property can contain almost any char.
> > I used '^' for the time being.
> 
> Olaf, do you still want these changes made?  I rebased your patch on
> current BK (see below).

Yes, but how to implement it in detail was the question.

-- 
USB is for mice, FireWire is for men!

sUse lINUX ag, nÜRNBERG