From lxiep at us.ibm.com  Thu Jul  1 05:03:32 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Wed, 30 Jun 2004 14:03:32 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <16610.22790.655073.704981@cargo.ozlabs.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com>	<40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com>
Message-ID: <40E30E84.8090206@us.ibm.com>


Paul,

Paul Mackerras wrote:

>Linda,
>
>
>
>By the way, I notice that upstream rpaphp_core.c now has the call to
>eeh_register_disable_func(), although the actual function isn't
>present in arch/ppc64/kernel/eeh.c.  I thought we were going to do
>that via userspace.  Did Greg change his mind?
>
>
eeh_register_disable_func() needs to be sent upstream by the ppc64
maintainers, if it's not already in.


>
>I think that we have it the wrong way around though.  I think that the
>rpaphp code should do something analogous to a request_irq() to ask to
>be informed about slot isolation events, rather than having the EEH
>code call the rpaphp code to disable a slot.  In fact I think the
>separation is bogus; the EEH code and the rpaphp code are both part of
>the driver for the RPA PCI subsystem.
>
>
Linas, can you answer all EEH-related questions?

Thanks,

Linda

>Paul.
>
>
>


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From linas at austin.ibm.com  Thu Jul  1 05:14:33 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 14:14:33 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <40E30E84.8090206@us.ibm.com>; from lxiep@us.ibm.com on Wed, Jun 30, 2004 at 02:03:32PM -0500
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com>
Message-ID: <20040630141433.V21634@forte.austin.ibm.com>


On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> Paul Mackerras wrote:
> >
> >By the way, I notice that upstream rpaphp_core.c now has the call to
> >eeh_register_disable_func(), although the actual function isn't
> >present in arch/ppc64/kernel/eeh.c.

Paul,

You and Anton are responsible for keeping the arch/ppc64 directories
in sync between sles9, ameslab, and akpm.  You are, after all, the
one true official, designated maintainer ... if the code hasn't been
migrated to akpm ... uuh ... what am I missing?

From where I sit, the sles9 code is really the latest, greatest, most
tested and debugged arch/ppc64 code that there is.  This is the tree
that the developers get their code/patches into.  Ameslab is distinctly
downlevel (for example, sles9-2.6.5-7.79 has an eeh.c that has a number
of patches, some up to a month old, that haven't yet appeared in ameslab).
If the akpm tree doesn't have eeh_register_disable_func() then it's
badly out of date, since that function got added many moons ago.

I'm somewhat concerned; the sles9 tree is now closed/closing, so
it's a really great time to bring the other trees up-to-date, so
that we can continue to have a place to drop patches.

> >I think that we have it the wrong way around though.  I think that the

Yes, eeh_register_disable_func() is an ugly hack; it's because rpaphp
can be built as a module, and eeh cannot, yet eeh needs to call into
rpaphp.
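
For readers unfamiliar with the pattern, here is a minimal userspace
sketch of a registered-callback hook of the kind described above: the
always-built-in side keeps a function pointer, and the optional module
fills it in when it loads.  This is an illustration only; every name
below (the mock_ prefix, the callback type, the slot-name argument) is
an assumption, not the actual signature of eeh_register_disable_func().

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type: disable the slot named by 'slot_name'. */
typedef int (*mock_disable_slot_fn)(const char *slot_name);

/* Lives on the built-in side (stands in for the EEH code). */
static mock_disable_slot_fn mock_registered_disable_fn = NULL;

/* Called by the module when it loads; pass NULL when it unloads. */
void mock_register_disable_func(mock_disable_slot_fn fn)
{
    mock_registered_disable_fn = fn;
}

/* Built-in code can reach the module only through the pointer. */
int mock_eeh_isolate_slot(const char *slot_name)
{
    if (mock_registered_disable_fn == NULL)
        return -1;  /* module not loaded, nothing to call */
    return mock_registered_disable_fn(slot_name);
}

/* Lives on the module side (stands in for rpaphp). */
int mock_rpaphp_disable_slot(const char *slot_name)
{
    printf("disabling slot %s\n", slot_name);
    return 0;
}
```

The indirection is what lets the built-in code link without the
module being present; the cost is that the caller has to cope with
the pointer being unset.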

> >rpaphp code should do something analogous to a request_irq() to ask to
> >be informed about slot isolation events, rather than having the EEH
> >code call the rpaphp code to disable a slot.

Other than the module/not-a-module issue, why? I don't see any
particular advantage to an async, event-driven thingy as compared to
a plain-old subroutine call.  Subroutine calls are easy to make,
you don't need to invent infrastructure. KISS.

> > In fact I think the
> >separation is bogus; the EEH code and the rpaphp code are both part of
> >the driver for the RPA PCI subsystem.

prolly.  But note that the generic hotplug APIs need to be extended
to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect
event occurred.  Last I talked to Greg, he wasn't willing to accept
something like that yet, so it's a bit up in the air.  Personally,
I'd be happy to wait till more eeh prototype gets written before
propounding on architectural changes (and happier still to find
the time to actually do it).

--linas




From greg at kroah.com  Thu Jul  1 05:46:34 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 30 Jun 2004 12:46:34 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630141433.V21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com>
Message-ID: <20040630194634.GA22223@kroah.com>


On Wed, Jun 30, 2004 at 02:14:33PM -0500, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> > Paul Mackerras wrote:
> > >
> > >By the way, I notice that upstream rpaphp_core.c now has the call to
> > >eeh_register_disable_func(), although the actual function isn't
> > >present in arch/ppc64/kernel/eeh.c.
>
> Paul,
>
> You and Anton are responsible for keeping the arch/ppc64 directories
> in sync between sles9, ameslab, and akpm.  You are, after all, the
> one true official, designated maintainer ... if the code hasn't been
> migrated to akpm ... uuh ... what am I missing?
>
> From where I sit, the sles9 code is really the latest, greatest, most
> tested and debugged arch/ppc64 code that there is.  This is the tree
> that the developers get their code/patches into.

And that is the big problem.

Those patches/fixes should go to mainline, not directly to suse.  How
are they going to get back into mainline?  Are you all willing to now
send those same patches to the next distro that has a release (like Red
Hat?)  What about other distros that have full ppc64 releases (debian,
gentoo, etc.)

Change your development process to properly go to mainline and there
would not be any issues like this.

> I'm somewhat concerned; the sles9 tree is now closed/closing, so
> it's a really great time to bring the other trees up-to-date, so
> that we can continue to have a place to drop patches.

You should all be more concerned that all those suse patches are not
seen by anyone else.

> > >I think that we have it the wrong way around though.  I think that the
>
> Yes, eeh_register_disable_func() is an ugly hack; it's because rpaphp
> can be built as a module, and eeh cannot, yet eeh needs to call into
> rpaphp.

Which is an ugly hack in and of itself.  I only oked it for now.

Actually, since everyone agrees that this isn't the way to go, I'll go
remove it :)

> > > In fact I think the
> > >separation is bogus; the EEH code and the rpaphp code are both part of
> > >the driver for the RPA PCI subsystem.
>
> prolly.  But note that the generic hotplug APIs need to be extended
> to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect
> event occurred.  Last I talked to Greg, he wasn't willing to accept
> something like that yet, so it's a bit up in the air.

I wasn't willing to accept that, as that was the wrong way to do this.
It should be done from userspace with hotplug events like we mentioned.

And none of this "what happens about the root device" crud either, I've
seen your code in the kernel that checks for this today.  bah.

thanks,

greg k-h



From linas at austin.ibm.com  Thu Jul  1 06:56:28 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 15:56:28 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630194634.GA22223@kroah.com>; from greg@kroah.com on Wed, Jun 30, 2004 at 12:46:34PM -0700
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com>
Message-ID: <20040630155627.X21634@forte.austin.ibm.com>


On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> On Wed, Jun 30, 2004 at 02:14:33PM -0500, linas at austin.ibm.com wrote:
> > On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> > > Paul Mackerras wrote:
> > > >
> > > >By the way, I notice that upstream rpaphp_core.c now has the call to
> > > >eeh_register_disable_func(), although the actual function isn't
> > > >present in arch/ppc64/kernel/eeh.c.
> >
> > Paul,
> >
> > You and Anton are responsible for keeping the arch/ppc64 directories
> > in sync between sles9, ameslab, and akpm.  You are, after all, the
> > one true official, designated maintainer ... if the code hasn't been
> > migrated to akpm ... uuh ... what am I missing?
> >
> > From where I sit, the sles9 code is really the latest, greatest, most
> > tested and debugged arch/ppc64 code that there is.  This is the tree
> > that the developers get their code/patches into.
>
> And that is the big problem.
>
> Those patches/fixes should go to mainline, not directly to suse.  How
> are they going to get back into mainline?

My understanding is that Paul Mackerras and Anton Blanchard are the
designated maintainers of the arch/ppc64 tree.  They are responsible
for sending the patches upstream, getting them into mainline.

> Which is an ugly hack in and of itself.  I only oked it for now.
>
> Actually, since everyone agrees that this isn't the way to go, I'll go
> remove it :)

Ack, I wish you wouldn't, it will break things.

What do you suggest as the 'right way' to accomplish this?

> > > > In fact I think the
> > > >separation is bogus; the EEH code and the rpaphp code are both part of
> > > >the driver for the RPA PCI subsystem.
> >
> > prolly.  But note that the generic hotplug APIs need to be extended
> > to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect
> > event occurred.  Last I talked to Greg, he wasn't willing to accept
> > something like that yet, so it's a bit up in the air.
>
> I wasn't willing to accept that, as that was the wrong way to do this.

Yes, well, a few months ago, Torvalds made me promise that this would
be the way that it would be done eventually.  You were cc'ed on that
chain of notes.  Why didn't you take that up with him?

> It should be done from userspace with hotplug events like we mentioned.

I think you're confusing this with a different issue.  The issue here
is "the device driver thinks that the PCI bus is whacked. The device
driver wants to be able to make a call to find out if the PCI bus is
whacked."  I don't think you want to bounce that up to userspace and
back down again.

> And none of this "what happens about the root device" crud either, I've
> seen your code in the kernel that checks for this today.  bah.

? Well, if you're implying I wrote that code, I didn't.  My goal is to
get it to do 'the right thing'.  Once I stop getting distracted by
other issues.

The 'root device' issue is a real issue.  You can't execute /sbin/hotplug
if the root fs is not reachable.  If the scsi device driver suspects
that the reason it's unable to get a response from the scsi controller
is that the PCI bus is down, then the scsi device driver must be given
a mechanism for rebooting the PCI bus without having to go to user-space
to do it.

--linas




From hollisb at us.ibm.com  Thu Jul  1 07:01:01 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Wed, 30 Jun 2004 16:01:01 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630155627.X21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com>
	 <16610.22790.655073.704981@cargo.ozlabs.ibm.com>
	 <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com>
	 <20040630194634.GA22223@kroah.com>
	 <20040630155627.X21634@forte.austin.ibm.com>
Message-ID: <1088629261.18831.36.camel@localhost>


On Wed, 2004-06-30 at 15:56, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> > Those patches/fixes should go to mainline, not directly to suse.  How
> > are they going to get back into mainline?
>
> My understanding is that Paul Mackerras and Anton Blanchard are the
> designated maintainers of the arch/ppc64 tree.  They are responsible
> for sending the patches upstream, getting them into mainline.

That's not my understanding at all. Why should Paul and Anton have to
understand, agree with, and advocate your code on LKML when you can do
it yourself? They have their own code they have to worry about...

--
Hollis Blanchard
IBM Linux Technology Center




From greg at kroah.com  Thu Jul  1 07:14:14 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 30 Jun 2004 14:14:14 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630155627.X21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com>
Message-ID: <20040630211414.GB24048@kroah.com>


On Wed, Jun 30, 2004 at 03:56:28PM -0500, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> > On Wed, Jun 30, 2004 at 02:14:33PM -0500, linas at austin.ibm.com wrote:
> > > On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> > > > Paul Mackerras wrote:
> > > > >
> > > > >By the way, I notice that upstream rpaphp_core.c now has the call to
> > > > >eeh_register_disable_func(), although the actual function isn't
> > > > >present in arch/ppc64/kernel/eeh.c.
> > >
> > > Paul,
> > >
> > > You and Anton are responsible for keeping the arch/ppc64 directories
> > > in sync between sles9, ameslab, and akpm.  You are, after all, the
> > > one true official, designated maintainer ... if the code hasn't been
> > > migrated to akpm ... uuh ... what am I missing?
> > >
> > > From where I sit, the sles9 code is really the latest, greatest, most
> > > tested and debugged arch/ppc64 code that there is.  This is the tree
> > > that the developers get their code/patches into.
> >
> > And that is the big problem.
> >
> > Those patches/fixes should go to mainline, not directly to suse.  How
> > are they going to get back into mainline?
>
> My understanding is that Paul Mackerras and Anton Blanchard are the
> designated maintainers of the arch/ppc64 tree.  They are responsible
> for sending the patches upstream, getting them into mainline.

No, you are responsible for sending those patches to them, in public, to
get them, and the community, to accept them and then pass them on.  It's
not up to them to do all of the work in picking pieces out of different
trees and forwarding them upward.

> > > > > In fact I think the
> > > > >separation is bogus; the EEH code and the rpaphp code are both part of
> > > > >the driver for the RPA PCI subsystem.
> > >
> > > prolly.  But note that the generic hotplug APIs need to be extended
> > > to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect
> > > event occurred.  Last I talked to Greg, he wasn't willing to accept
> > > something like that yet, so it's a bit up in the air.
> >
> > I wasn't willing to accept that, as that was the wrong way to do this.
>
> Yes, well, a few months ago, Torvalds made me promise that this would
> be the way that it would be done eventually.  You were cc'ed on that
> chain of notes.  Why didn't you take that up with him?

Yes, I remember that.  When is "eventually" going to happen?  next week?
2.7?  2.9?

thanks,

greg k-h



From haveblue at us.ibm.com  Thu Jul  1 07:36:51 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Wed, 30 Jun 2004 14:36:51 -0700
Subject: [PATCH] fix off-by-one in mem_init()
Message-ID: <1088631410.5265.676.camel@nighthawk>

lmb_end_of_DRAM() returns the address of the end of RAM, not the
starting address of the last page.  We've been accessing mem_map[] out
of bounds for quite a while.  But, it's just a read, so it's probably
never caused a real problem.

But, during my port of CONFIG_NONLINEAR to ppc64, I have a check to make
sure that all __va() calls are given with valid physical addresses.
This code tripped that check.
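
To illustrate the kind of off-by-one described above (this is a
standalone sketch, not the attached patch; the page size, helper names,
and loop shape are all assumptions): if RAM covers N pages starting at
address 0, the end-of-RAM address maps to page index N, which is one
past the last valid index N-1.  Treating the end address as inclusive
therefore touches one page too many.

```c
#include <stddef.h>

#define MOCK_PAGE_SHIFT 12
#define MOCK_PAGE_SIZE  (1UL << MOCK_PAGE_SHIFT)

/* Page frame number that a physical address falls into. */
unsigned long mock_pfn(unsigned long addr)
{
    return addr >> MOCK_PAGE_SHIFT;
}

/* Buggy walk: treats 'end' as the start of the last page. */
unsigned long pages_visited_inclusive(unsigned long start, unsigned long end)
{
    unsigned long addr;
    unsigned long n = 0;

    for (addr = start; addr <= end; addr += MOCK_PAGE_SIZE)
        n++;
    return n;
}

/* Correct walk: 'end' is the first address past the end of RAM. */
unsigned long pages_visited_exclusive(unsigned long start, unsigned long end)
{
    unsigned long addr;
    unsigned long n = 0;

    for (addr = start; addr < end; addr += MOCK_PAGE_SIZE)
        n++;
    return n;
}
```

With four pages of RAM, the exclusive walk visits four pages and the
inclusive walk visits five; the fifth is the out-of-bounds access.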

-- Dave
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ppc64-lt-end_of_DRAM.patch
Type: text/x-patch
Size: 398 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040630/ea8a9e30/attachment.bin 

From linas at austin.ibm.com  Thu Jul  1 08:45:59 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 17:45:59 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <1088629261.18831.36.camel@localhost>; from hollisb@us.ibm.com on Wed, Jun 30, 2004 at 04:01:01PM -0500
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <1088629261.18831.36.camel@localhost>
Message-ID: <20040630174558.Y21634@forte.austin.ibm.com>


On Wed, Jun 30, 2004 at 04:01:01PM -0500, Hollis Blanchard wrote:
> On Wed, 2004-06-30 at 15:56, linas at austin.ibm.com wrote:
> > On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> > > Those patches/fixes should go to mainline, not directly to suse.  How
> > > are they going to get back into mainline?
> >
> > My understanding is that Paul Mackerras and Anton Blanchard are the
> > designated maintainers of the arch/ppc64 tree.  They are responsible
> > for sending the patches upstream, getting them into mainline.
>
> That's not my understanding at all. Why should Paul and Anton have to
> understand, agree with, and advocate your code on LKML when you can do
> it yourself? They have their own code they have to worry about...

I was referring to patches to the arch/ppc64 tree, and not patches that
go into the generic kernel.  The 1 or 2 patches I've had to generic
kernel code have gone to LKML.  All the other patches I've done
go into bugzilla, and through some magical process :) eventually ship
as code.

As to more work for Paul and Anton, that's between them and their
managers...  These various roles & responsibilities were announced
in meetings and emails ... I was just doing the management-directed
thing;  it's news to me to learn I should now bypass Paul and Anton ...
or should I take this as a promotion? :)

--linas



From linas at austin.ibm.com  Thu Jul  1 09:03:06 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 18:03:06 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630211414.GB24048@kroah.com>; from greg@kroah.com on Wed, Jun 30, 2004 at 02:14:14PM -0700
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com>
Message-ID: <20040630180306.Z21634@forte.austin.ibm.com>


On Wed, Jun 30, 2004 at 02:14:14PM -0700, Greg KH wrote:
> On Wed, Jun 30, 2004 at 03:56:28PM -0500, linas at austin.ibm.com wrote:
> >
> > My understanding is that Paul Mackerras and Anton Blanchard are the
> > designated maintainers of the arch/ppc64 tree.  They are responsible
> > for sending the patches upstream, getting them into mainline.
>
> No, you are responsible for sending those patches to them, in public, to
> get them, and the community, to accept them and then pass them on.  It's
> not up to them to do all of the work in picking pieces out of different
> trees and forwarding them upward.

Look, take this up with the management.  I'm stating the facts.
There's a half-dozen or dozen or so developers here generating
patches for the ppc64 tree, and *everyone* (with Hollis as a
clear exception?) has been operating the same way.  The patches
go into the sles9 tree, the ppc64 maintainers have the responsibility
of accepting or refusing the patches and of syncing the trees.

> Yes, I remember that.  When is "eventually" going to happen?  next week?

Personally speaking? no, not next week. I've got a backlog of bugs
to clear out first, and when that queue shrinks to zero, then I'll
get a chance to do new development.  I was hoping to start
six months ago; but it's been emergency bug-fix season, which seems
to be winding down.

--linas



From greg at kroah.com  Thu Jul  1 09:18:34 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 30 Jun 2004 16:18:34 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630180306.Z21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com> <20040630180306.Z21634@forte.austin.ibm.com>
Message-ID: <20040630231834.GE28815@kroah.com>


On Wed, Jun 30, 2004 at 06:03:06PM -0500, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 02:14:14PM -0700, Greg KH wrote:
> > On Wed, Jun 30, 2004 at 03:56:28PM -0500, linas at austin.ibm.com wrote:
> > >
> > > My understanding is that Paul Mackerras and Anton Blanchard are the
> > > designated maintainers of the arch/ppc64 tree.  They are responsible
> > > for sending the patches upstream, getting them into mainline.
> >
> > No, you are responsible for sending those patches to them, in public, to
> > get them, and the community, to accept them and then pass them on.  It's
> > not up to them to do all of the work in picking pieces out of different
> > trees and forwarding them upward.
>
> Look, take this up with the management.

Whose management?  I'm not talking company specific issues here.  I'm
talking about the proper open source kernel development model.  As you
are playing in the sandbox, you need to follow the same rules as all
other developers, otherwise the other members of the sandbox get mad and
start ignoring your nifty sand castles you want them to accept.

{ok, so the analogy broke down at the end there, but I hope my point got
across...}

> I'm stating the facts.
> There's a half-dozen or dozen or so developers here generating
> patches for the ppc64 tree, and *everyone* (with Hollis as a
> clear exception?) has been operating the same way.

That's because Hollis properly understands the Linux kernel development
process :)

> The patches go into the sles9 tree, the ppc64 maintainers have the
> responsibility of accepting or refusing the patches and of syncing the
> trees.

Huh?  If the ppc64 maintainers never know about them, how can that
happen?  If the patches are never posted to a _public_ mailing list, how
can you be sure that the patch is even the correct thing to do?

If you add a usb specific patch to the suse tree, how is the usb
maintainer going to find out about it?  ppc64 should be no different
than any other kernel subsystem.

> > Yes, I remember that.  When is "eventually" going to happen?  next week?
>
> Personally speaking? no, not next week. I've got a backlog of bugs
> to clear out first, and when that queue shrinks to zero, then I'll
> get a chance to do new development.  I was hoping to start
> six months ago; but it's been emergency bug-fix season, which seems
> to be winding down.

Ah, but wait for it to start up again with the next distro testing cycle
that doesn't have all of those suse bugs in it...

But that's not relevant here.  I want to know when this hack will be
removed, and done properly, in an arch-independent manner.  If it's not
on the shortlist of things to do, I can't live with it, and will have to
remove it, because that means it will never get fixed :(

thanks,

greg k-h



From paulus at samba.org  Thu Jul  1 09:41:09 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 09:41:09 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630141433.V21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com>
	<40E1F6C7.80907@us.ibm.com>
	<16610.22790.655073.704981@cargo.ozlabs.ibm.com>
	<40E30E84.8090206@us.ibm.com>
	<20040630141433.V21634@forte.austin.ibm.com>
Message-ID: <16611.20373.87299.105401@cargo.ozlabs.ibm.com>


Linas,

> You and Anton are responsible for keeping the arch/ppc64 directories
> in sync between sles9, ameslab, and akpm.  You are, after all, the
> one true official, designated maintainer ... if the code hasn't been

Yes, ultimately.

However, that doesn't mean that I am the subject matter expert for all
parts of the arch/ppc64 code.  Just as Linus isn't the expert for all
parts of the entire kernel tree.  In the areas where I am not the
expert I need and expect the expert(s) for those areas to be sending
me patches with explanations that I can forward upstream.

It's not good enough for the expert to just put the latest and
greatest code into ameslab.  If I'm not the expert, I don't know,
without going digging into the revision history, what the rationale
for the changes is.  I also don't know whether what is in there is
just what we want or if it is something that we just want a few
developers to try.

Thus, I need the experts in areas such as RAS and EEH to be sending me
patches suitable for forwarding to Andrew Morton, complete with
rationale and explanation.  (Which you have been doing - thanks.)

As for the SLES9 tree, we (Anton, Jeremy Kerr and I) have been
spending some effort on identifying which changes need to go
upstream, and in fact most of them are upstream already.  However, in
areas such as EEH and RAS it can take us considerable effort to work
out why changes are being made and whether they should go upstream,
when the expert in the area could just take one look and say very
quickly yes/no and why.

> From where I sit, the sles9 code is really the latest, greatest, most
> tested and debugged arch/ppc64 code that there is.  This is the tree
> that the developers get their code/patches into.  Ameslab is distinctly
> downlevel (for example, sles9-2.6.5-7.79 has an eeh.c that has a number
> of patches, some up to a month old, that haven't yet appeared in ameslab).

Well... why not?  Why haven't the people who have been debugging EEH
problems in sles9 been at least cc'ing the patches to me?

> If the akpm tree doesn't have eeh_register_disable_func() then it's
> badly out of date, since that function got added many moons ago.

I never sent that upstream because (a) the EEH developer(s) never sent
me a proposed patch to go upstream and (b) as I understood it, Greg KH
had vetoed that approach.

> Yes, eeh_register_disable_func() is an ugly hack; it's because rpaphp
> can be built as a module, and eeh cannot, yet eeh needs to call into
> rpaphp.

I think we should have a notifier list for EEH events and a way for
code to request to be added to that list.  The rpaphp code can then
be notified about EEH events.  What it does then is up to it.  Having
the EEH code decide that a slot removal is the right thing seems
bogus.  That should be up to the rpaphp code and/or userspace.

Paul.



From paulus at samba.org  Thu Jul  1 09:47:44 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 09:47:44 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630174558.Y21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com>
	<40E1F6C7.80907@us.ibm.com>
	<16610.22790.655073.704981@cargo.ozlabs.ibm.com>
	<40E30E84.8090206@us.ibm.com>
	<20040630141433.V21634@forte.austin.ibm.com>
	<20040630194634.GA22223@kroah.com>
	<20040630155627.X21634@forte.austin.ibm.com>
	<1088629261.18831.36.camel@localhost>
	<20040630174558.Y21634@forte.austin.ibm.com>
Message-ID: <16611.20768.91355.411754@cargo.ozlabs.ibm.com>


Linas,

> I was referring to patches to the arch/ppc64 tree, and not patches that
> go into the generic kernel.  The 1 or 2 patches I've had to generic
> kernel code have gone to LKML.  All the other patches I've done
> go into bugzilla, and through some magical process :) eventually ship
> as code.

To the extent that you are the EEH expert (which you seem to be, let
me know if you don't want to be), I rely on you to send me patches
with explanations.  Otherwise I have to be the EEH expert myself.

> As to more work for Paul and Anton, that's between them and their
> managers...  These various roles & responsibilities were announced
> in meetings and emails ... I was just doing the management-directed
> thing;  it's news to me to learn I should now bypass Paul and Anton ...
> or should I take this as a promotion? :)

No, you shouldn't bypass me/Anton, but you shouldn't expect us to do
all the work either. :)

Paul.



From paulus at samba.org  Thu Jul  1 09:50:26 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 09:50:26 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630180306.Z21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com>
	<40E1F6C7.80907@us.ibm.com>
	<16610.22790.655073.704981@cargo.ozlabs.ibm.com>
	<40E30E84.8090206@us.ibm.com>
	<20040630141433.V21634@forte.austin.ibm.com>
	<20040630194634.GA22223@kroah.com>
	<20040630155627.X21634@forte.austin.ibm.com>
	<20040630211414.GB24048@kroah.com>
	<20040630180306.Z21634@forte.austin.ibm.com>
Message-ID: <16611.20930.229141.507440@cargo.ozlabs.ibm.com>


linas at austin.ibm.com writes:

> Look, take this up with the management.  I'm stating the facts.
> There's a half-dozen or dozen or so developers here generating
> patches for the ppc64 tree, and *everyone* (with Hollis as a
> clear exception?) has been operating the same way.  The patches
> go into the sles9 tree, the ppc64 maintainers have the responsibility
> of accepting or refusing the patches and of syncing the trees.

OK, so I haven't been speaking up loudly and clearly and often
enough.  Hollis has got it right.  We'll have to do better in the next
bug frenzy.

Paul.



From paulus at samba.org  Thu Jul  1 10:02:25 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 10:02:25 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630231834.GE28815@kroah.com>
References: <40E1D5BE.2010504@austin.ibm.com>
	<40E1F6C7.80907@us.ibm.com>
	<16610.22790.655073.704981@cargo.ozlabs.ibm.com>
	<40E30E84.8090206@us.ibm.com>
	<20040630141433.V21634@forte.austin.ibm.com>
	<20040630194634.GA22223@kroah.com>
	<20040630155627.X21634@forte.austin.ibm.com>
	<20040630211414.GB24048@kroah.com>
	<20040630180306.Z21634@forte.austin.ibm.com>
	<20040630231834.GE28815@kroah.com>
Message-ID: <16611.21649.436501.278036@cargo.ozlabs.ibm.com>


Greg KH writes:

> But that's not relevant here.  I want to know when this hack will be
> removed, and done properly, in an arch-independent manner.  If it's not
> on the shortlist of things to do, I can't live with it, and will have to
> remove it, because that means it will never get fixed :(

What I propose is that the EEH code export a notifier list for code
that wants to know about EEH events, and do a notifier_call_chain when
events such as slot isolation occur.  The rpaphp code can then
register itself.  When a slot isolation event occurs, the rpaphp code
can then do whatever is appropriate.  This seems to me to be as
fundamental as a device driver being able to do request_irq().  The
rpaphp code is part of the device driver for the pci subsystem and it
needs to be notified when its hardware changes state.
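
A minimal userspace sketch of the notifier-list pattern proposed here,
assuming nothing about the eventual kernel interface.  All names
(eeh_notifier, eeh_notifier_register, eeh_notify,
EEH_EVENT_SLOT_ISOLATED) are hypothetical; in the kernel the existing
notifier chain machinery would take the place of the hand-rolled list.

```c
#include <stddef.h>

/* Hypothetical event code; not taken from any kernel header. */
enum { EEH_EVENT_SLOT_ISOLATED = 1 };

/* One subscriber on the list. */
struct eeh_notifier {
	int (*call)(struct eeh_notifier *self, int event, void *data);
	struct eeh_notifier *next;
};

/* Head of the list exported by the (hypothetical) EEH side. */
static struct eeh_notifier *eeh_notifier_list;

/* Subscribers add themselves; the EEH side never names them. */
void eeh_notifier_register(struct eeh_notifier *nb)
{
	nb->next = eeh_notifier_list;
	eeh_notifier_list = nb;
}

/* Report an event to every subscriber; returns how many were called. */
int eeh_notify(int event, void *data)
{
	int called = 0;
	struct eeh_notifier *nb;

	for (nb = eeh_notifier_list; nb != NULL; nb = nb->next) {
		nb->call(nb, event, data);
		called++;
	}
	return called;
}

/* Hotplug-side subscriber: records the last isolated slot. */
static int last_isolated_slot = -1;

static int hotplug_on_eeh_event(struct eeh_notifier *self, int event, void *data)
{
	(void)self;
	if (event == EEH_EVENT_SLOT_ISOLATED)
		last_isolated_slot = *(int *)data;
	return 0;
}

static struct eeh_notifier hotplug_notifier = { hotplug_on_eeh_event, NULL };
```

The event source only walks the list; what each subscriber does with
the event is decided in the subscriber's own callback.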

Paul.



From greg at kroah.com  Thu Jul  1 10:08:51 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 30 Jun 2004 17:08:51 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <16611.21649.436501.278036@cargo.ozlabs.ibm.com>
References: <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com> <20040630180306.Z21634@forte.austin.ibm.com> <20040630231834.GE28815@kroah.com> <16611.21649.436501.278036@cargo.ozlabs.ibm.com>
Message-ID: <20040701000851.GC30029@kroah.com>


On Thu, Jul 01, 2004 at 10:02:25AM +1000, Paul Mackerras wrote:
> Greg KH writes:
>
> > But that's not relevant here.  I want to know when this hack will be
> > removed, and done properly, in an arch-independent manner.  If it's not
> > on the shortlist of things to do, I can't live with it, and will have to
> > remove it, because that means it will never get fixed :(
>
> What I propose is that the EEH code export a notifier list for code
> that wants to know about EEH events, and do a notifier_call_chain when
> events such as slot isolation occur.  The rpaphp code can then
> register itself.  When a slot isolation event occurs, the rpaphp code
> can then do whatever is appropriate.  This seems to me to be as
> fundamental as a device driver being able to do request_irq().  The
> rpaphp code is part of the device driver for the pci subsystem and it
> needs to be notified when its hardware changes state.

That's acceptable to me.

thanks,

greg k-h



From linas at austin.ibm.com  Thu Jul  1 11:20:32 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 20:20:32 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <16611.20373.87299.105401@cargo.ozlabs.ibm.com>; from paulus@samba.org on Thu, Jul 01, 2004 at 09:41:09AM +1000
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com>
Message-ID: <20040630202032.A21634@forte.austin.ibm.com>


On Thu, Jul 01, 2004 at 09:41:09AM +1000, Paul Mackerras wrote:
>
> patches suitable for forwarding to Andrew Morton, complete with
> rationale and explanation.  (Which you have been doing - thanks.)

OK, well, except that, until yesterday, I hadn't been :)  There are
a few months of accumulated patches lying around.  They're mostly
trite.

> As for the SLES9 tree, we (Anton, Jeremy Kerr and I) have been
> spending some effort on identifying which changes need to go
> upstream, and in fact most of them are upstream already.  However, in
> areas such as EEH and RAS it can take us considerable effort to work
> out why changes are being made and whether they should go upstream,
> when the expert in the area could just take one look and say very
> quickly yes/no and why.

OK.  The short answer for eeh.c is 'all of it'.

> Well... why not?  Why haven't the people who have been debugging EEH
> problems in sles9 been at least cc'ing the patches to me?

I dunno... I thought this was standard operating procedure ...
one can get cc'ed on all bugzilla ppc64 activity automatically.
Whenever a bug gets marked 'fixed-awaiting-test' one can grab
the patch and go with it.  That's what SUSE does.

I do have one big concern with tracking patches by email vs.
tracking patches in bugzilla, and that is the problem of closure.
I sent four patches yesterday, heard responses for two of them.
What about the other two? Were they applied? Rejected? Still
in the queue? Accidentally ignored & forgotten?  How will
I know?  In bugzilla, there's a very clear idea of what the
status is, and there's a 'paper trail' to go with it.
It's got built-in nagging ... I can say things like 'please
apply patch 7538, it's been ready for months', which works out
a lot better than 'hey remember that email I sent you months
ago, did you ever get around to it?'

Problem with un-adorned email is you never know if a thread
came to a conclusion or not.  But bugzilla has a built-in
status tracker.. you know when the thread is dead.

I like to think of bugzilla as 'plain-email with extra status
bits' -- it's auto-threaded, everyone can review a thread at
any time, every post to every thread is there, even for
late-comers who might not have been 'subscribed' at the beginning.
And you can perform complex queries across threads and topics,
which is something mutt doesn't do, and I don't think pine,
elm, evolution or any of the other emailers do either.

What would really be slick is if there was a button on
bugzilla, that, when clicked, did a 'bk push'.

> I think we should have a notifier list for EEH events and a way for
> code to request to be added to that list.  The rpaphp code can then
> be notified about EEH events.

OK, we have that code already; presently, it's not compiled with
rpaphp_pci.c, it's compiled with eeh.c.  I can cut and paste it
into drivers/pci/hotplug/rpaphp_pci.c and/or create a new file
drivers/pci/hotplug/rpaphp_eeh.c.  That eliminates the need for
the callback.

--linas



From paulus at samba.org  Thu Jul  1 12:17:11 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 12:17:11 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630202032.A21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com>
	<40E1F6C7.80907@us.ibm.com>
	<16610.22790.655073.704981@cargo.ozlabs.ibm.com>
	<40E30E84.8090206@us.ibm.com>
	<20040630141433.V21634@forte.austin.ibm.com>
	<16611.20373.87299.105401@cargo.ozlabs.ibm.com>
	<20040630202032.A21634@forte.austin.ibm.com>
Message-ID: <16611.29735.778269.316356@cargo.ozlabs.ibm.com>


linas at austin.ibm.com writes:

> OK.  The short answer for eeh.c is 'all of it'.

OK, that doesn't give me the explanation that would need to go with
the patch when I send it to Andrew.

I want to see a notifier list exported by eeh.c as I proposed in a
previous email before that goes upstream.

> I do have one big concern with tracking patches by email vs.
> tracking patches in bugzilla, and that is the problem of closure.
> I sent four patches yesterday, heard responses for two of them.
> What about the other two? Were they applied? Rejected? Still
> in the queue? Accidentally ignored & forgotten?  How will
> I know?  In bugzilla, there's a very clear idea of what the

I thought I replied to all 4.  Let me check...

I have replied to the emails with the subjects:

[PATCH] [2.6] PPC64: log firmware errors during boot.
[PATCH] PPC64: lockfix for rtas error log
[PATCH] PPC64: (resend) Janitor signature of rtas_call() routine
[PATCH] PPC64: Janitor log_rtas_error() call arguments

(Yes, the last one I didn't directly reply to, but I rediffed the
patch and sent it to Andrew with a cc to you.)

The trouble with bugzilla is that it doesn't give me the final patch
with explanation, ready to go upstream.

> status is, and there's a 'paper trail' to go with it.
> It's got built-in nagging ... I can say things like 'please
> apply patch 7538, it's been ready for months', which works out
> a lot better than 'hey remember that email I sent you months
> ago, did you ever get around to it?'

I would much rather you sent me the patch with explanation and said
"please send this upstream".  If I send it upstream I will cc you.  I
will either do that or send you a reply telling you what problems I
see with the patch.  If I don't, feel free to re-send at 2-day
intervals. :)

> Problem with un-adorned email is you never know if a thread
> came to a conclusion or not.  But bugzilla has a built-in
> status tracker.. you know when the thread is dead.

You do know when the thread has come to a conclusion: it's when the
patch is in Linus' tree.

Paul.



From anton at samba.org  Thu Jul  1 19:37:13 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 1 Jul 2004 19:37:13 +1000
Subject: RFC: Dynamic segment table allocation
In-Reply-To: <20040630065105.GA19054@zax>
References: <20040630065105.GA19054@zax>
Message-ID: <20040701093713.GJ7523@krispykreme>


Hi,

> Can anyone see a problem with the patch below?  If not, I plan to push
> it to Andrew in the next few days.  Boots on a 4-way Power3 (RS/6000
> 270) and an RS64 iSeries lpar.
>
> PPC64 machines before Power4 need a segment table page allocated for
> each CPU.  Currently these are allocated statically in a big array in
> head.S for all CPUs.  However, except for the boot CPU, there are no
> particular constraints on the segment table's location, and the
> secondary CPUs don't come up until quite late when get_free_pages() is
> operational.
>
> This patch dynamically allocates segment table pages as the secondary
> CPUs come up.  This reduces the kernel image size by 192k...

I like the idea of removing this bloat. However I think all segment tables
have to be in the first segment, otherwise the bolted segment miss handler
could take a miss when storing to the segment table.
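
The two numbers in play can be checked with a small sketch.  The patch
removes a static ".space 4096 * 48" array, and the revised patch
description later in this thread speaks of the first 256M segment.  The
helper names and constants below are illustrative only, not kernel
definitions.

```c
#include <stdint.h>

#define STAB_PAGE_SIZE   4096u                 /* one segment table page per CPU */
#define STAB_MAX_CPUS    48u                   /* size of the static array being removed */
#define FIRST_SEG_BYTES  (UINT64_C(1) << 28)   /* 256M, assumed segment size */

/* Bytes saved by dropping the static array: 48 * 4096 = 196608 = 192k. */
uint64_t stab_static_bytes(void)
{
	return (uint64_t)STAB_PAGE_SIZE * STAB_MAX_CPUS;
}

/* True if a table at real address addr lies wholly inside the first segment. */
int stab_in_first_segment(uint64_t addr)
{
	return addr <= FIRST_SEG_BYTES - STAB_PAGE_SIZE;
}
```

Any allocator used for the secondary CPUs' tables would need to return
addresses for which the second check holds.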

Anton



From linas at austin.ibm.com  Fri Jul  2 04:14:30 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Thu, 1 Jul 2004 13:14:30 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <16611.29735.778269.316356@cargo.ozlabs.ibm.com>; from paulus@samba.org on Thu, Jul 01, 2004 at 12:17:11PM +1000
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> <16611.29735.778269.316356@cargo.ozlabs.ibm.com>
Message-ID: <20040701131430.C21634@forte.austin.ibm.com>


On Thu, Jul 01, 2004 at 12:17:11PM +1000, Paul Mackerras wrote:
> linas at austin.ibm.com writes:
>
> > OK.  The short answer for eeh.c is 'all of it'.
>
> OK, that doesn't give me the explanation that would need to go with
> the patch when I send it to Andrew.

The detailed explanations are in bugzilla.  I'll crawl through the
archives and pull these; but this makes me unhappy :(  I personally
would be happier with a process that requires less make-work; these
patches should have been picked up at the time they were authored,
not post-factum.

> I want to see a notifier list exported by eeh.c as I proposed in a
> previous email before that goes upstream.

It's currently implemented as a work queue.  Is that acceptable?
To keep gregkh happy, I'll move the work queue to
drivers/pci/hotplug/rpaphp_eeh.c; will this work?

> I thought I replied to all 4.  Let me check...

OK, right.  I thought a few had fallen through the cracks.
Again, I question the efficiency of this process; it took
longer than it should have to figure out what's up.

> The trouble with bugzilla is that it doesn't give me the final patch
> with explanation, ready to go upstream.

That's not true. Of course it does.

> > status is, and there's a 'paper trail' to go with it.
> > It's got built-in nagging ... I can say things like 'please
> > apply patch 7538, it's been ready for months', which works out
> > a lot better than 'hey remember that email I sent you months
> > ago, did you ever get around to it?'
>
> I would much rather you sent me the patch with explanation and said
> "please send this upstream".

Yes well, I and others were operating under the assumption that you
were actually monitoring bugzilla, and pulling the patches & explanations
from there, merging and sending them upstream.  Clearly, this did not
occur.  The process broke down.  How will we do better next time?

> > Problem with un-adorned email is you never know if a thread
> > came to a conclusion or not.  But bugzilla has a built-in
> > status tracker.. you know when the thread is dead.
>
> You do know when the thread has come to a conclusion, it's when the
> patch is in Linus' tree.

It's an inefficient, error-prone process.  I can generate more patches
a day than I am able to track by email.  Ergo, patch-tracking is a
bottleneck.

--linas




From linas at austin.ibm.com  Fri Jul  2 06:04:34 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Thu, 1 Jul 2004 15:04:34 -0500
Subject: [PATCH] PPC64: lockfix for rtas error log
In-Reply-To: <20040701113116.72a25ed0.moilanen@austin.ibm.com>; from moilanen@austin.ibm.com on Thu, Jul 01, 2004 at 11:31:16AM -0500
References: <20040629175007.P21634@forte.austin.ibm.com> <16610.41869.78636.349800@cargo.ozlabs.ibm.com> <20040630125027.T21634@forte.austin.ibm.com> <16611.18350.530425.178652@cargo.ozlabs.ibm.com> <20040701113116.72a25ed0.moilanen@austin.ibm.com>
Message-ID: <20040701150434.G21634@forte.austin.ibm.com>


> > As for the RTAS messages being printk'd, the only possible

A lot of those messages are 'garbage' messages; I'm trying
to clear some (maybe most?) of these up.  For example,
a lot of them are the result of EEH not being turned on
for PHBs and EADs (i.e. not being turned on for empty,
unoccupied PCI slots).  Once I get that patch in (which
has all these other patches as prerequisites), I think most
of the messages will be gone, which I know will make lots of
people happy.

--linas




From olof at austin.ibm.com  Fri Jul  2 07:15:12 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 01 Jul 2004 16:15:12 -0500
Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)]
In-Reply-To: <20040701153117.H21634@forte.austin.ibm.com>
References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com>
Message-ID: <40E47EE0.306@austin.ibm.com>


linas at austin.ibm.com wrote:

> +	/* Log the error in the unlikely case that there was one. */
> +	if (unlikely(logit)) {
> +		buff_copy = kmalloc (RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
> +		if (buff_copy) {
> +			memcpy (buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
> +		}
> +	}

This isn't performance critical code, do you really need to hard code
unlikely here? The compiler and processor do pretty good prediction on
their own where needed.
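
For readers unfamiliar with the annotation: in the kernel, unlikely(x)
is a thin wrapper over GCC's __builtin_expect, so it is a branch layout
hint and does not change what the code does.  A standalone sketch,
assuming GCC or Clang; copy_if_flagged and LOG_MAX are made-up
stand-ins for the logic in the quoted hunk.

```c
#include <stdlib.h>
#include <string.h>

/* Same shape as the kernel macro; requires GCC or Clang. */
#define unlikely(x) __builtin_expect(!!(x), 0)

#define LOG_MAX 64  /* hypothetical buffer size, not RTAS_ERROR_LOG_MAX */

/*
 * Copy src into a fresh buffer only when logit is set.
 * Returns NULL when nothing was copied or allocation failed;
 * the caller frees the result.
 */
char *copy_if_flagged(int logit, const char *src)
{
	char *copy = NULL;

	if (unlikely(logit)) {
		copy = malloc(LOG_MAX);
		if (copy)
			memcpy(copy, src, LOG_MAX);
	}
	return copy;
}
```

Dropping the macro and writing a plain "if (logit)" gives the same
results; only the compiler's guess about the common path differs.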

(also, as Greg said, you have extra whitespace in some places, including
after kmallocs and kfrees)

-Olof



From linas at austin.ibm.com  Fri Jul  2 07:20:50 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Thu, 1 Jul 2004 16:20:50 -0500
Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)]
In-Reply-To: <40E47EE0.306@austin.ibm.com>; from olof@austin.ibm.com on Thu, Jul 01, 2004 at 04:15:12PM -0500
References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com> <40E47EE0.306@austin.ibm.com>
Message-ID: <20040701162049.K21634@forte.austin.ibm.com>


On Thu, Jul 01, 2004 at 04:15:12PM -0500, Olof Johansson wrote:
> linas at austin.ibm.com wrote:
>
> > +	/* Log the error in the unlikely case that there was one. */
> > +	if (unlikely(logit)) {
> > +		buff_copy = kmalloc (RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
> > +		if (buff_copy) {
> > +			memcpy (buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
> > +		}
> > +	}
>
> This isn't performance critical code, do you really need to hard code
> unlikely here?

I figured it's more of a hint to the human reader than it is to the compiler.

> (also, as Greg said, you have extra whitespace in some places, including
> after kmallocs and kfrees)

Is this really important enough to gen a new patch? If that's what it
takes to get the patch accepted, I'll do it ...

--linas



From greg at kroah.com  Fri Jul  2 09:29:12 2004
From: greg at kroah.com (Greg KH)
Date: Thu, 1 Jul 2004 16:29:12 -0700
Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)]
In-Reply-To: <20040701162049.K21634@forte.austin.ibm.com>
References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com> <40E47EE0.306@austin.ibm.com> <20040701162049.K21634@forte.austin.ibm.com>
Message-ID: <20040701232912.GA27199@kroah.com>


On Thu, Jul 01, 2004 at 04:20:50PM -0500, linas at austin.ibm.com wrote:
>
> On Thu, Jul 01, 2004 at 04:15:12PM -0500, Olof Johansson wrote:
> > linas at austin.ibm.com wrote:
> >
> > > +	/* Log the error in the unlikely case that there was one. */
> > > +	if (unlikely(logit)) {
> > > +		buff_copy = kmalloc (RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
> > > +		if (buff_copy) {
> > > +			memcpy (buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
> > > +		}
> > > +	}
> >
> > This isn't performance critical code, do you really need to hard code
> > unlikely here?
>
> I figured it's more of a hint to the human reader than it is to the compiler.
>
> > (also, as Greg said, you have extra whitespace in some places, including
> > after kmallocs and kfrees)
>
> Is this really important enough to gen a new patch? If that's what it
> takes to get the patch accepted, I'll do it ...

Please do.

thanks,

greg k-h



From david at gibson.dropbear.id.au  Fri Jul  2 12:02:16 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 2 Jul 2004 12:02:16 +1000
Subject: RFC: Dynamic segment table allocation
In-Reply-To: <20040701093713.GJ7523@krispykreme>
References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme>
Message-ID: <20040702020216.GD5937@zax>


On Thu, Jul 01, 2004 at 07:37:13PM +1000, Anton Blanchard wrote:
>
> Hi,
>
> > Can anyone see a problem with the patch below?  If not, I plan to push
> > it to Andrew in the next few days.  Boots on a 4-way Power3 (RS/6000
> > 270) and an RS64 iSeries lpar.
> >
> > PPC64 machines before Power4 need a segment table page allocated for
> > each CPU.  Currently these are allocated statically in a big array in
> > head.S for all CPUs.  However, except for the boot CPU, there are no
> > particular constraints on the segment table's location, and the
> > secondary CPUs don't come up until quite late when get_free_pages() is
> > operational.
> >
> > This patch dynamically allocates segment table pages as the secondary
> > CPUs come up.  This reduces the kernel image size by 192k...
>
> I like the idea of removing this bloat. However I think all segment tables
> have to be in the first segment, otherwise the bolted segment miss handler
> could take a miss when storing to the segment table.

Ah... good point.

I can see two ways around that:  allocate the tables from bootmem
sufficiently low, or bolt into each table an STE for itself.  Paulus
says he prefers the former option, so I'll look at doing that.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From anton at samba.org  Fri Jul  2 12:45:51 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 2 Jul 2004 12:45:51 +1000
Subject: RFC: Dynamic segment table allocation
In-Reply-To: <20040702020216.GD5937@zax>
References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme> <20040702020216.GD5937@zax>
Message-ID: <20040702024551.GK7523@krispykreme>


> Ah... good point.
>
> I can see two ways around that:  allocate the tables from bootmem
> sufficiently low, or bolt into each table an STE for itself.  Paulus
> says he prefers the former option, so I'll look at doing that.

irqstack_early_init is doing a similar thing for irq stacks; it would be
worth consolidating the dynamic segment table allocation code with this.

Also it would be worth working with Nathan Lynch to calculate
smp_possible_cpus earlier at boot so that irqstack_early_init doesn't
allocate NR_CPUS worth of stacks all the time.

Anton



From david at gibson.dropbear.id.au  Fri Jul  2 13:58:49 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 2 Jul 2004 13:58:49 +1000
Subject: RFC: Dynamic segment table allocation
In-Reply-To: <20040702024551.GK7523@krispykreme>
References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme> <20040702020216.GD5937@zax> <20040702024551.GK7523@krispykreme>
Message-ID: <20040702035849.GF5937@zax>


On Fri, Jul 02, 2004 at 12:45:51PM +1000, Anton Blanchard wrote:
>
> > Ah... good point.
> >
> > I can see two ways around that:  allocate the tables from bootmem
> > sufficiently low, or bolt into each table an STE for itself.  Paulus
> > says he prefers the former option, so I'll look at doing that.
>
> irqstack_early_init is doing a similar thing for irq stacks, it would be
> worth consolidating the dynamic segment table allocation code with this.

Yes, noticed that.

> Also it would be worth working with Nathan Lynch to calculate
> smp_possible_cpus earlier at boot so that irqstack_early_init doesn't
> allocate NR_CPUS worth of stacks all the time.

Erm.. as far as I can tell smp_possible_cpus is already early enough
for the call to irqstack_early_init().  The patch below uses it for
the stabs allocation immediately after the call to
irqstack_early_init()...

PPC64 machines before Power4 need a segment table page allocated for
each CPU.  Currently these are allocated statically in a big array in
head.S for all CPUs.  The segment tables need to be in the first
segment (so do_stab_bolted doesn't take a recursive fault on the stab
itself), but other than that there are no constraints which require
the stabs for the secondary CPUs to be statically allocated.

This patch allocates segment tables dynamically during boot, using
lmb_alloc_base() to ensure they are within the first 256M segment.  This
reduces the kernel image size by 192k...

Signed-off-by: David Gibson <david at gibson.dropbear.id.au>

Index: working-2.6/arch/ppc64/kernel/head.S
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/head.S
+++ working-2.6/arch/ppc64/kernel/head.S
@@ -2201,11 +2201,6 @@
 ioremap_dir:
 	.space	4096

-/* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */
-	.globl	stab_array
-stab_array:
-	.space	4096 * 48
-
 /*
  * This space gets a copy of optional info passed to us by the bootstrap
  * Used to pass parameters into the kernel like root=/dev/sda1, etc.
Index: working-2.6/arch/ppc64/kernel/smp.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/smp.c
+++ working-2.6/arch/ppc64/kernel/smp.c
@@ -876,19 +876,6 @@
 	paca[cpu].prof_multiplier = 1;
 	paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock;

-	if (!(cur_cpu_spec->cpu_features & CPU_FTR_SLB)) {
-		void *tmp;
-
-		/* maximum of 48 CPUs on machines with a segment table */
-		if (cpu >= 48)
-			BUG();
-
-		tmp = &stab_array[PAGE_SIZE * cpu];
-		memset(tmp, 0, PAGE_SIZE);
-		paca[cpu].stab_addr = (unsigned long)tmp;
-		paca[cpu].stab_real = virt_to_abs(tmp);
-	}
-
 	/* The information for processor bringup must
 	 * be written out to main store before we release
 	 * the processor.
Index: working-2.6/arch/ppc64/kernel/setup.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/setup.c
+++ working-2.6/arch/ppc64/kernel/setup.c
@@ -76,6 +76,7 @@
 extern void pseries_secondary_smp_init(unsigned long);
 extern int  idle_setup(void);
 extern void vpa_init(int cpu);
+void stabs_alloc(void);

 unsigned long decr_overclock = 1;
 unsigned long decr_overclock_proc0 = 1;
@@ -639,6 +640,7 @@
 	*cmdline_p = cmd_line;

 	irqstack_early_init();
+	stabs_alloc();

 	/* set up the bootmem stuff with available memory */
 	do_init_bootmem();
Index: working-2.6/arch/ppc64/kernel/stab.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/stab.c
+++ working-2.6/arch/ppc64/kernel/stab.c
@@ -19,6 +19,8 @@
 #include <asm/paca.h>
 #include <asm/naca.h>
 #include <asm/cputable.h>
+#include <asm/lmb.h>
+#include <asm/abs_addr.h>

 static int make_ste(unsigned long stab, unsigned long esid, unsigned long vsid);
 static void make_slbe(unsigned long esid, unsigned long vsid, int large,
@@ -41,6 +43,34 @@
 #endif
 }

+/* Allocate segment tables for secondary CPUs.  These must all go in
+ * the first (bolted) segment, so that do_stab_bolted won't get a
+ * segment miss on the segment table itself. */
+void stabs_alloc(void)
+{
+	int cpu;
+
+	if (cur_cpu_spec->cpu_features & CPU_FTR_SLB)
+		return;
+
+	for_each_cpu(cpu) {
+		unsigned long newstab;
+
+		if (cpu == 0)
+			continue; /* stab for CPU 0 is statically allocated */
+
+		newstab = lmb_alloc_base(PAGE_SIZE, PAGE_SIZE, 1<<SID_SHIFT);
+		if (! newstab)
+			panic("Unable to allocate segment table for CPU %d.\n", cpu);
+
+		newstab += KERNELBASE;
+
+		paca[cpu].stab_addr = newstab;
+		paca[cpu].stab_real = virt_to_abs(newstab);
+		printk(KERN_DEBUG "Segment table for CPU %d at 0x%lx virtual, 0x%lx absolute\n", cpu, paca[cpu].stab_addr, paca[cpu].stab_real);
+	}
+}
+
 /*
  * Build an entry for the base kernel segment and put it into
  * the segment table or SLB.  All other segment table or SLB


--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From paulus at samba.org  Fri Jul  2 15:57:41 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 2 Jul 2004 15:57:41 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040701131430.C21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com>
	<40E1F6C7.80907@us.ibm.com>
	<16610.22790.655073.704981@cargo.ozlabs.ibm.com>
	<40E30E84.8090206@us.ibm.com>
	<20040630141433.V21634@forte.austin.ibm.com>
	<16611.20373.87299.105401@cargo.ozlabs.ibm.com>
	<20040630202032.A21634@forte.austin.ibm.com>
	<16611.29735.778269.316356@cargo.ozlabs.ibm.com>
	<20040701131430.C21634@forte.austin.ibm.com>
Message-ID: <16612.63829.67601.300009@cargo.ozlabs.ibm.com>


linas at austin.ibm.com writes:

> The detailed explanations are in bugzilla.  I'll crawl through the
> archives and pull these; but this makes me unhappy :(  I personally
> would be happier with a process that requires less make-work; these
> patches should have been picked up at the time they were authored,
> not post-factum.

I agree, but it is the person who knows the code who should decide (a)
whether the patch in bugzilla is what should go upstream and (b)
whether the patch has actually fixed the bug, and take the action to
push the patch upstream.  That is logically the job of whoever knows
the code best, and they should do it as soon as the patch is known to
have actually fixed the bug.

> > I want to see a notifier list exported by eeh.c as I proposed in a
> > previous email before that goes upstream.
>
> It's currently implemented as a work queue. Is that acceptable?
> To keep gregkh happy, I'll move the work-queue to
> drivers/pci/hotplug/rpaphp_eeh.c, will this work?

It's not the work queue that is the problem, it is that the EEH code
is taking a decision about what hotplug should do.  I am saying that
the EEH code should offer to provide notifications to any interested
code about slot isolation events.  The slot isolation event is a fact,
the request to do an unplug operation is policy.  Let's leave the
policy up to the rpaphp driver and/or userspace.

Specifically, I think we should leave the workqueue in eeh.c, replace
eeh_disable_slot with a notifier list, and move the check for ethernet
devices to rpaphp.c.  The eeh_register_disable_func call then changes
to notifier_chain_register(&eeh_slot_isolation_list, &my_notify_block)
or somesuch.
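
One way to picture the split between fact and policy, as a hedged
sketch rather than the eventual patch: the EEH side fills in a plain
event record, and the hotplug side alone decides what to do with it.
The struct, the enum values and the rule that ethernet slots are the
ones acted on are all assumptions made for illustration; this thread
does not spell out what the ethernet check does.

```c
/* Fact: what the EEH side observed.  No decisions are made here. */
enum dev_kind { DEV_ETHERNET, DEV_OTHER };

struct slot_isolation_event {
	int slot;
	enum dev_kind kind;
};

/* Policy: lives entirely on the hotplug side. */
enum hp_action { HP_LEAVE_ALONE, HP_UNPLUG };

/* Hypothetical rule; the real check is not described in the thread. */
enum hp_action rpaphp_policy(const struct slot_isolation_event *ev)
{
	if (ev->kind == DEV_ETHERNET)
		return HP_UNPLUG;
	return HP_LEAVE_ALONE;
}
```

Changing the rule then touches only the hotplug side; the code that
reports the event stays as it is.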

> Yes well, I and others were operating under the assumption that you
> were actually monitoring bugzilla, and pulling the patches & explanations
> from there, merging and sending them upstream.  Clearly, this did not
> occur.  The process broke down.  How will we do better next time?

I guess we need to have it clearly stated who is the designated person
for each area of the code.  Within arch/ppc64 and include/asm-ppc64, I
would like to have people who take responsibility for legacy iSeries,
RAS, DLPAR, EEH, PCI, NUMA, oprofile/perfctr, and Powermac support.
Stephen Rothwell has been pretty much looking after the iSeries stuff,
and Ben H is clearly the powermac maintainer.  So far, Anton seems to
be the expert on NUMA and oprofile/perfctr.  We have a whole team
doing DLPAR work and I look to people like John Rose to say what
should go upstream in that area.  It's been an informal process so far
which has worked in some areas but not in others.

> It's an inefficient, error-prone process.  I can generate more patches
> a day than I am able to track by email.  Ergo, patch-tracking is a
> bottleneck.

The discipline of sending patches with explanations is that it forces
you to write a description of the changes that will give someone who
doesn't know the history a reasonable idea of the motivation behind
them.  That turns out to be an important part of keeping the kernel
maintainable - the fact that you can look at a year-old change and get
a reasonable idea of what the patch author was trying to achieve is
invaluable.  That is far more important than having people churn out
patches at the maximum possible rate.  One of the big reasons why the
ameslab BK tree doesn't work as a source of patches to send upstream
is that many people haven't been putting in much in the way of
changeset descriptions.

Paul.




From david at gibson.dropbear.id.au  Fri Jul  2 16:14:21 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 2 Jul 2004 16:14:21 +1000
Subject: RFC: Dynamic segment table allocation
In-Reply-To: <20040702035849.GF5937@zax>
References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme> <20040702020216.GD5937@zax> <20040702024551.GK7523@krispykreme> <20040702035849.GF5937@zax>
Message-ID: <20040702061421.GI5937@zax>


On Fri, Jul 02, 2004 at 01:58:49PM +1000, David Gibson wrote:
>
> On Fri, Jul 02, 2004 at 12:45:51PM +1000, Anton Blanchard wrote:
> >
> > > Ah... good point.
> > >
> > > I can see two ways around that:  allocate the tables from bootmem
> > > sufficiently low, or bolt into each table an STE for itself.  Paulus
> > > says he prefers the former option, so I'll look at doing that.
> >
> > irqstack_early_init is doing a similar thing for irq stacks, it would be
> > worth consolidating the dynamic segment table allocation code with this.
>
> Yes, noticed that.
>
> > Also it would be worth working with Nathan Lynch to calculate
> > smp_possible_cpus earlier at boot so that irqstack_early_init doesn't
> > allocate NR_CPUS worth of stacks all the time.
>
> Erm.. as far as I can tell smp_possible_cpus is already early enough
> for the call to irqstack_early_init().  The patch below uses it for
> the stabs allocation immediately after the call to
> irqstack_early_init()...

Hrm.  For some reason that patch doesn't work on iSeries.  Still
debugging.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From paulus at samba.org  Fri Jul  2 21:22:24 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 2 Jul 2004 21:22:24 +1000
Subject: [PATCH] fix ras irq handlers
In-Reply-To: <1083860324.3702.8.camel@mudbug.austin.ibm.com>
References: <1083860324.3702.8.camel@mudbug.austin.ibm.com>
Message-ID: <16613.17776.748934.366072@cargo.ozlabs.ibm.com>


Nathan,

I just dug out a patch that you sent a while ago that made some
changes to ras.c, mainly to cater for the possibility of epow-events
and internal-errors having multiple elements.  I have reworked it -
let me know what you think of the patch below.

One of the changes I have made is to check the #interrupt-cells
applying to the interrupts property.  Without this I think we will
incorrectly try to get interrupt 0 or 1, since when #interrupt-cells
is 2, the second cell is the edge/level indicator.
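
(Illustration only, not part of the patch.)  The point about
#interrupt-cells can be sketched in plain userspace C.  The function
name collect_irq_numbers and its parameters are made up for this
example; only the idea of stepping through the property in units of
n_cells comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy the interrupt numbers out of a flattened "interrupts" property.
 * prop holds len_bytes bytes of 32-bit cells; each entry is n_cells
 * wide and only its first cell is the interrupt number.
 * Returns how many interrupt numbers were written to out (<= max_out).
 */
size_t collect_irq_numbers(const unsigned int *prop, size_t len_bytes,
                           unsigned int n_cells,
                           unsigned int *out, size_t max_out)
{
    size_t entries, i;

    assert(n_cells > 0);
    entries = len_bytes / (n_cells * sizeof(*prop));
    if (entries > max_out)
        entries = max_out;

    /* Step by n_cells so the sense cell is never read as an irq. */
    for (i = 0; i < entries; i++)
        out[i] = prop[i * n_cells];
    return entries;
}
```

With a two-cell property such as { 0x21, 1, 0x22, 0 }, passing
n_cells = 2 yields the two interrupt numbers, while treating it as
one cell per entry would also hand back the sense values 1 and 0 as
if they were interrupts.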

Paul.

diff -urN test25/arch/ppc64/kernel/prom.c ppc64-2.5-pseries/arch/ppc64/kernel/prom.c
--- test25/arch/ppc64/kernel/prom.c	2004-06-30 22:00:47.000000000 +1000
+++ ppc64-2.5-pseries/arch/ppc64/kernel/prom.c	2004-07-02 17:06:51.000000000 +1000
@@ -1881,8 +1881,7 @@
  * Find out the size of each entry of the interrupts property
  * for a node.
  */
-static int __devinit
-prom_n_intr_cells(struct device_node *np)
+int __devinit prom_n_intr_cells(struct device_node *np)
 {
 	struct device_node *p;
 	unsigned int *icp;
@@ -1896,7 +1895,7 @@
 		    || get_property(p, "interrupt-map", NULL) != NULL) {
 			printk("oops, node %s doesn't have #interrupt-cells\n",
 			       p->full_name);
-		return 1;
+			return 1;
 		}
 	}
 #ifdef DEBUG_IRQ
diff -urN ppc64-linux-2.5/arch/ppc64/kernel/ras.c ppc64-2.5-pseries/arch/ppc64/kernel/ras.c
--- ppc64-linux-2.5/arch/ppc64/kernel/ras.c	2004-04-13 14:04:32.000000000 +1000
+++ ppc64-2.5-pseries/arch/ppc64/kernel/ras.c	2004-07-02 21:07:28.932004888 +1000
@@ -52,6 +52,16 @@
 #include <asm/rtas.h>
 #include <asm/ppcdebug.h>

+static unsigned char log_buf[RTAS_ERROR_LOG_MAX];
+static spinlock_t log_lock = SPIN_LOCK_UNLOCKED;
+
+static int ras_get_sensor_state_token;
+static int ras_check_exception_token;
+
+#define EPOW_SENSOR_TOKEN	9
+#define EPOW_SENSOR_INDEX	0
+#define RAS_VECTOR_OFFSET	0x500
+
 static irqreturn_t ras_epow_interrupt(int irq, void *dev_id,
 					struct pt_regs * regs);
 static irqreturn_t ras_error_interrupt(int irq, void *dev_id,
@@ -59,6 +69,35 @@

 /* #define DEBUG */

+static void request_ras_irqs(struct device_node *np, char *propname,
+			irqreturn_t (*handler)(int, void *, struct pt_regs *),
+			const char *name)
+{
+	unsigned int *ireg, len, i;
+	int virq, n_intr;
+
+	ireg = (unsigned int *)get_property(np, propname, &len);
+	if (ireg == NULL)
+		return;
+	n_intr = prom_n_intr_cells(np);
+	len /= n_intr * sizeof(*ireg);
+
+	for (i = 0; i < len; i++) {
+		virq = virt_irq_create_mapping(*ireg);
+		if (virq == NO_IRQ) {
+			printk(KERN_ERR "Unable to allocate interrupt "
+			       "number for %s\n", np->full_name);
+			return;
+		}
+		if (request_irq(irq_offset_up(virq), handler, 0, name, NULL)) {
+			printk(KERN_ERR "Unable to request interrupt %d for "
+			       "%s\n", irq_offset_up(virq), np->full_name);
+			return;
+		}
+		ireg += n_intr;
+	}
+}
+
 /*
  * Initialize handlers for the set of interrupts caused by hardware errors
  * and power system events.
@@ -66,52 +105,33 @@
 static int __init init_ras_IRQ(void)
 {
 	struct device_node *np;
-	unsigned int *ireg, len, i;
-	int virq;

-	if ((np = of_find_node_by_path("/event-sources/internal-errors")) &&
-	    (ireg = (unsigned int *)get_property(np, "open-pic-interrupt",
-						 &len))) {
-		for (i=0; i<(len / sizeof(*ireg)); i++) {
-			virq = virt_irq_create_mapping(*(ireg));
-			if (virq == NO_IRQ) {
-				printk(KERN_ERR "Unable to allocate interrupt "
-				       "number for %s\n", np->full_name);
-				break;
-			}
-			request_irq(irq_offset_up(virq),
-				    ras_error_interrupt, 0,
-				    "RAS_ERROR", NULL);
-			ireg++;
-		}
+	ras_get_sensor_state_token = rtas_token("get-sensor-state");
+	ras_check_exception_token = rtas_token("check-exception");
+
+	/* Internal Errors */
+	np = of_find_node_by_path("/event-sources/internal-errors");
+	if (np != NULL) {
+		request_ras_irqs(np, "open-pic-interrupt", ras_error_interrupt,
+				 "RAS_ERROR");
+		request_ras_irqs(np, "interrupts", ras_error_interrupt,
+				 "RAS_ERROR");
+		of_node_put(np);
 	}
-	of_node_put(np);

-	if ((np = of_find_node_by_path("/event-sources/epow-events")) &&
-	    (ireg = (unsigned int *)get_property(np, "open-pic-interrupt",
-						 &len))) {
-		for (i=0; i<(len / sizeof(*ireg)); i++) {
-			virq = virt_irq_create_mapping(*(ireg));
-			if (virq == NO_IRQ) {
-				printk(KERN_ERR "Unable to allocate interrupt "
-				       " number for %s\n", np->full_name);
-				break;
-			}
-			request_irq(irq_offset_up(virq),
-				    ras_epow_interrupt, 0,
-				    "RAS_EPOW", NULL);
-			ireg++;
-		}
+	np = of_find_node_by_path("/event-sources/epow-events");
+	if (np != NULL) {
+		request_ras_irqs(np, "open-pic-interrupt", ras_epow_interrupt,
+				"RAS_EPOW");
+		request_ras_irqs(np, "interrupts", ras_epow_interrupt,
+				"RAS_EPOW");
+		of_node_put(np);
 	}
-	of_node_put(np);

 	return 1;
 }
 __initcall(init_ras_IRQ);

-static struct rtas_error_log log_buf;
-static spinlock_t log_lock = SPIN_LOCK_UNLOCKED;
-
 /*
  * Handle power subsystem events (EPOW).
  *
@@ -122,30 +142,35 @@
 static irqreturn_t
 ras_epow_interrupt(int irq, void *dev_id, struct pt_regs * regs)
 {
-	struct rtas_error_log log_entry;
-	unsigned int size = sizeof(log_entry);
-	long status = 0xdeadbeef;
+	int status = 0xdeadbeef;
+	int state = 0;
+	int virq = irq_offset_down(irq);
+	int critical;

 	spin_lock(&log_lock);

-	status = rtas_call(rtas_token("check-exception"), 6, 1, NULL,
-			   0x500, irq,
-			   RTAS_EPOW_WARNING | RTAS_POWERMGM_EVENTS,
-			   1,  /* Time Critical */
-			   __pa(&log_buf), size);
+	status = rtas_call(ras_get_sensor_state_token, 2, 2, &state,
+			   EPOW_SENSOR_TOKEN, EPOW_SENSOR_INDEX);

-	log_entry = log_buf;
-
-	spin_unlock(&log_lock);
+	if (state > 3)
+		critical = 1;  /* Time Critical */
+	else
+		critical = 0;

-	udbg_printf("EPOW <0x%lx 0x%lx>\n",
-		    *((unsigned long *)&log_entry), status);
-	printk(KERN_WARNING
-		"EPOW <0x%lx 0x%lx>\n",*((unsigned long *)&log_entry), status);
+	status = rtas_call(ras_check_exception_token, 6, 1, NULL,
+			   RAS_VECTOR_OFFSET, virt_irq_to_real(virq),
+			   RTAS_EPOW_WARNING | RTAS_POWERMGM_EVENTS,
+			   critical, __pa(&log_buf), RTAS_ERROR_LOG_MAX);
+
+	udbg_printf("EPOW <0x%lx 0x%x 0x%x>\n",
+		    *((unsigned long *)&log_buf), status, state);
+	printk(KERN_WARNING "EPOW <0x%lx 0x%x 0x%x>\n",
+	       *((unsigned long *)&log_buf), status, state);

 	/* format and print the extended information */
-	log_error((char *)&log_entry, ERR_TYPE_RTAS_LOG, 0);
-
+	log_error(log_buf, ERR_TYPE_RTAS_LOG, 0);
+
+	spin_unlock(&log_lock);
 	return IRQ_HANDLED;
 }

@@ -160,37 +185,34 @@
 static irqreturn_t
 ras_error_interrupt(int irq, void *dev_id, struct pt_regs * regs)
 {
-	struct rtas_error_log log_entry;
-	unsigned int size = sizeof(log_entry);
-	long status = 0xdeadbeef;
+	struct rtas_error_log *rtas_elog;
+	int status = 0xdeadbeef;
 	int fatal;

 	spin_lock(&log_lock);

-	status = rtas_call(rtas_token("check-exception"), 6, 1, NULL,
-			   0x500, irq,
-			   RTAS_INTERNAL_ERROR,
-			   1, /* Time Critical */
-			   __pa(&log_buf), size);
-
-	log_entry = log_buf;
+	status = rtas_call(ras_check_exception_token, 6, 1, NULL,
+			   RAS_VECTOR_OFFSET,
+			   virt_irq_to_real(irq_offset_down(irq)),
+			   RTAS_INTERNAL_ERROR,
+			   1 /* Time Critical */,
+			   __pa(&log_buf), RTAS_ERROR_LOG_MAX);

-	spin_unlock(&log_lock);
+	rtas_elog = (struct rtas_error_log *)log_buf;

-	if ((status == 0) && (log_entry.severity >= SEVERITY_ERROR_SYNC))
+	if ((status == 0) && (rtas_elog->severity >= SEVERITY_ERROR_SYNC))
 		fatal = 1;
 	else
 		fatal = 0;

 	/* format and print the extended information */
-	log_error((char *)&log_entry, ERR_TYPE_RTAS_LOG, fatal);
+	log_error(log_buf, ERR_TYPE_RTAS_LOG, fatal);

 	if (fatal) {
-		udbg_printf("HW Error <0x%lx 0x%lx>\n",
-			    *((unsigned long *)&log_entry), status);
-		printk(KERN_EMERG
-		       "Error: Fatal hardware error <0x%lx 0x%lx>\n",
-		       *((unsigned long *)&log_entry), status);
+		udbg_printf("HW Error <0x%lx 0x%x>\n",
+			    *((unsigned long *)&log_buf), status);
+		printk(KERN_EMERG "Error: Fatal hardware error <0x%lx 0x%x>\n",
+		       *((unsigned long *)&log_buf), status);

 #ifndef DEBUG
 		/* Don't actually power off when debugging so we can test
@@ -200,11 +222,13 @@
 		ppc_md.power_off();
 #endif
 	} else {
-		udbg_printf("Recoverable HW Error <0x%lx 0x%lx>\n",
-			    *((unsigned long *)&log_entry), status);
+		udbg_printf("Recoverable HW Error <0x%lx 0x%x>\n",
+			    *((unsigned long *)&log_buf), status);
 		printk(KERN_WARNING
-		       "Warning: Recoverable hardware error <0x%lx 0x%lx>\n",
-		       *((unsigned long *)&log_entry), status);
+		       "Warning: Recoverable hardware error <0x%lx 0x%x>\n",
+		       *((unsigned long *)&log_buf), status);
 	}
+
+	spin_unlock(&log_lock);
 	return IRQ_HANDLED;
 }
diff -urN test25/include/asm-ppc64/prom.h ppc64-2.5-pseries/include/asm-ppc64/prom.h
--- test25/include/asm-ppc64/prom.h	2004-06-24 21:46:45.000000000 +1000
+++ ppc64-2.5-pseries/include/asm-ppc64/prom.h	2004-07-02 17:06:22.000000000 +1000
@@ -269,6 +269,7 @@
 extern void print_properties(struct device_node *node);
 extern int prom_n_addr_cells(struct device_node* np);
 extern int prom_n_size_cells(struct device_node* np);
+extern int prom_n_intr_cells(struct device_node* np);
 extern void prom_get_irq_senses(unsigned char *senses, int off, int max);
 extern void prom_add_property(struct device_node* np, struct property* prop);



From nfont at austin.ibm.com  Sat Jul  3 01:47:07 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Fri, 02 Jul 2004 10:47:07 -0500
Subject: [PATCH] fix ras irq handlers
In-Reply-To: <16613.17776.748934.366072@cargo.ozlabs.ibm.com>
References: <1083860324.3702.8.camel@mudbug.austin.ibm.com> <16613.17776.748934.366072@cargo.ozlabs.ibm.com>
Message-ID: <40E5837B.4000008@austin.ibm.com>

Everything looks good to me.

I did add three changes you may (or may not) want to include.  These
are pretty minor things I noticed in looking over the code with your
patch applied.  I included an updated patch with these updates.

- Added a comment (/* EPOW Events */) before the irq initialization of
   epow-events to match the internal errors comment.

- Changed ras_epow_interrupt to use the same irq manipulation,
   virt_irq_to_real(irq_offset_down(irq)), that you used in
   ras_error_interrupt, for consistency.

- Moved the taking of the log_lock in ras_epow_interrupt to after the
   first rtas_call.  We don't need it until the second rtas call and I
   don't see any need to hold it longer than need be.

> One of the changes I have made is to check the #interrupt-cells
> applying to the interrupts property.  Without this I think we will
> incorrectly try to get interrupt 0 or 1, since when #interrupt-cells
> is 2, the second cell is the edge/level indicator.

Anton had opened an LTC bug (9508) for this issue.  I posted a patch
there a while back but I like your implementation better.  I'll update
the bug with your patch if that's ok.

--
Nathan Fontenot
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ras_irq_update.patch
Type: text/x-patch
Size: 8967 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040702/9e1a15c7/attachment.bin 

From linas at austin.ibm.com  Sat Jul  3 02:05:41 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Fri, 2 Jul 2004 11:05:41 -0500
Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)]
In-Reply-To: <20040701232912.GA27199@kroah.com>; from greg@kroah.com on Thu, Jul 01, 2004 at 04:29:12PM -0700
References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com> <40E47EE0.306@austin.ibm.com> <20040701162049.K21634@forte.austin.ibm.com> <20040701232912.GA27199@kroah.com>
Message-ID: <20040702110541.N21634@forte.austin.ibm.com>


Here's the latest patch, with whitespace fixes.

On Thu, Jul 01, 2004 at 04:29:12PM -0700, Greg KH wrote:
> On Thu, Jul 01, 2004 at 04:20:50PM -0500, linas at austin.ibm.com wrote:
> >
> > Is this really important enough to gen a new patch? If that's what it
> > takes to get the patch accepted, I'll do it ...
>
> Please do.

------------------------------------------

Upon closer analysis of the code, I see that log_rtas_error()
was incorrectly named, and was being used incorrectly.  The
solution is to get rid of it entirely; see patch below. So:

-- In one case kmalloc must be GFP_ATOMIC because rtas_call()
can happen in any context, incl. irqs.
-- In the other case, I turned it into GFP_KERNEL, at the cost
of doing a needless malloc/free in the vast majority of
cases where there is no error.  Small price, as I believe
that this routine is very rarely called.
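
(Illustration only.)  The locking pattern described above is: take a
private copy of the shared error buffer while the lock is held, then
do the slow logging from the copy after the lock is dropped.  A rough
userspace sketch follows, assuming a C11 atomic_flag as a stand-in
for rtas.lock and malloc() as a stand-in for kmalloc(); every name in
it (ERR_LOG_MAX, record_error, snapshot_error) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define ERR_LOG_MAX 64                  /* stand-in for RTAS_ERROR_LOG_MAX */

static atomic_flag err_lock = ATOMIC_FLAG_INIT; /* stand-in for rtas.lock */
static char shared_err_buf[ERR_LOG_MAX];        /* stand-in for rtas_err_buf */

static void lock(void)   { while (atomic_flag_test_and_set(&err_lock)) ; }
static void unlock(void) { atomic_flag_clear(&err_lock); }

/* Producer side: overwrite the shared buffer under the lock. */
void record_error(const char *text)
{
    assert(text != NULL);
    lock();
    strncpy(shared_err_buf, text, ERR_LOG_MAX - 1);
    shared_err_buf[ERR_LOG_MAX - 1] = '\0';
    unlock();
}

/*
 * Consumer side: allocate before taking the lock, copy under the lock,
 * and return the private copy.  The caller logs it and frees it with no
 * lock held.  Returns NULL if the allocation failed.
 */
char *snapshot_error(void)
{
    char *copy = malloc(ERR_LOG_MAX);

    lock();
    if (copy)
        memcpy(copy, shared_err_buf, ERR_LOG_MAX);
    unlock();
    return copy;
}
```

Once the snapshot is taken, a later record_error() call can overwrite
the shared buffer without disturbing the text that is about to be
logged.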

Patch below,
Signed-off-by: Linas Vepstas <linas at linas.org>


-------------- next part --------------
--- arch/ppc64/kernel/rtas.c.orig-pre-lockfix	2004-06-29 17:02:12.000000000 -0500
+++ arch/ppc64/kernel/rtas.c	2004-07-02 10:52:50.000000000 -0500
@@ -98,8 +98,14 @@ rtas_token(const char *service)
 }


+/** Return a copy of the detailed error text associated with the
+ *  most recent failed call to rtas.  Because the error text
+ *  might go stale if there are any other intervening rtas calls,
+ *  this routine must be called atomically with whatever produced
+ *  the error (i.e. with rtas.lock still held from the previous call).
+ */
 static int
-__log_rtas_error(void)
+__fetch_rtas_last_error(void)
 {
 	struct rtas_args err_args, save_args;

@@ -126,19 +132,6 @@ __log_rtas_error(void)
 	return err_args.rets[0];
 }

-void
-log_rtas_error(void)
-{
-	unsigned long s;
-	int rc;
-
-	spin_lock_irqsave(&rtas.lock, s);
-	rc = __log_rtas_error();
-	spin_unlock_irqrestore(&rtas.lock, s);
-	if (rc == 0)
-		log_error(rtas_err_buf, ERR_TYPE_RTAS_LOG, 0);
-}
-
 int
 rtas_call(int token, int nargs, int nret,
 	  unsigned long *outputs, ...)
@@ -147,6 +140,7 @@ rtas_call(int token, int nargs, int nret
 	int i, logit = 0;
 	unsigned long s;
 	struct rtas_args *rtas_args;
+	char * buff_copy = NULL;
 	int ret;

 	PPCDBG(PPCDBG_RTAS, "Entering rtas_call\n");
@@ -181,7 +175,7 @@ rtas_call(int token, int nargs, int nret
 	PPCDBG(PPCDBG_RTAS, "\treturned from rtas ...\n");

 	if (rtas_args->rets[0] == -1)
-		logit = (__log_rtas_error() == 0);
+		logit = (__fetch_rtas_last_error() == 0);

 	ifppcdebug(PPCDBG_RTAS) {
 		for(i=0; i < nret ;i++)
@@ -193,12 +187,21 @@ rtas_call(int token, int nargs, int nret
 			outputs[i] = rtas_args->rets[i+1];
 	ret = (int)((nret > 0) ? rtas_args->rets[0] : 0);

+	/* Log the error in the unlikely case that there was one. */
+	if (unlikely(logit)) {
+		buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
+		if (buff_copy) {
+			memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
+		}
+	}
+
 	/* Gotta do something different here, use global lock for now... */
 	spin_unlock_irqrestore(&rtas.lock, s);

-	if (logit)
-		log_error(rtas_err_buf, ERR_TYPE_RTAS_LOG, 0);
-
+	if (buff_copy) {
+		log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0);
+		kfree(buff_copy);
+	}
 	return ret;
 }

@@ -218,7 +221,7 @@ rtas_extended_busy_delay_time(int status

 	/* Use microseconds for reasonable accuracy */
 	for (ms=1; order > 0; order--)
-		ms *= 10;
+		ms *= 10;

 	return ms;
 }
@@ -409,9 +412,9 @@ rtas_restart(char *cmd)
 	if (rtas_firmware_flash_list.next)
 		rtas_flash_firmware();

-        printk("RTAS system-reboot returned %d\n",
+	printk("RTAS system-reboot returned %d\n",
 	       rtas_call(rtas_token("system-reboot"), 0, 1, NULL));
-        for (;;);
+	for (;;);
 }

 void
@@ -419,10 +422,10 @@ rtas_power_off(void)
 {
 	if (rtas_firmware_flash_list.next)
 		rtas_flash_bypass_warning();
-        /* allow power on only with power button press */
-        printk("RTAS power-off returned %d\n",
-               rtas_call(rtas_token("power-off"), 2, 1, NULL,0xffffffff,0xffffffff));
-        for (;;);
+	/* allow power on only with power button press */
+	printk("RTAS power-off returned %d\n",
+	       rtas_call(rtas_token("power-off"), 2, 1, NULL,0xffffffff,0xffffffff));
+	for (;;);
 }

 void
@@ -430,7 +433,7 @@ rtas_halt(void)
 {
 	if (rtas_firmware_flash_list.next)
 		rtas_flash_bypass_warning();
-        rtas_power_off();
+	rtas_power_off();
 }

 /* Must be in the RMO region, so we place it here */
@@ -460,7 +463,9 @@ asmlinkage int ppc_rtas(struct rtas_args
 {
 	struct rtas_args args;
 	unsigned long flags;
+	char * buff_copy;
 	int nargs;
+	int err_rc;

 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -479,18 +484,32 @@ asmlinkage int ppc_rtas(struct rtas_args
 			   nargs * sizeof(rtas_arg_t)) != 0)
 		return -EFAULT;

+	buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_KERNEL);
+
 	spin_lock_irqsave(&rtas.lock, flags);

 	get_paca()->xRtas = args;
 	enter_rtas(__pa(&get_paca()->xRtas));
 	args = get_paca()->xRtas;

-	spin_unlock_irqrestore(&rtas.lock, flags);
-
 	args.rets  = (rtas_arg_t *)&(args.args[nargs]);
-	if (args.rets[0] == -1)
-		log_rtas_error();

+	if (args.rets[0] == -1) {
+		err_rc = __fetch_rtas_last_error();
+		if ((err_rc == 0) && buff_copy) {
+			memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
+		}
+	}
+
+	spin_unlock_irqrestore(&rtas.lock, flags);
+
+	if (buff_copy) {
+		if ((args.rets[0] == -1) && (err_rc == 0)) {
+			log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0);
+		}
+		kfree(buff_copy);
+	}
+
 	/* Copy out args. */
 	if (copy_to_user(uargs->args + nargs,
 			 args.args + nargs,

From linas at austin.ibm.com  Sat Jul  3 02:33:33 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Fri, 2 Jul 2004 11:33:33 -0500
Subject: (resend) [PATCH] 2.6 RPAPHP structure size/performance
Message-ID: <20040702113333.R21634@forte.austin.ibm.com>


Hi Greg,

Please review and apply the following patch if you find it agreeable.

This patch does not make any functional changes, but does improve
both performance and memory usage by rearranging structure elements.

The need for these changes became apparent during a code review of
the disassembly involving this structure. The memory footprint of this
structure is made smaller by grouping the byte fields next to each other.
The access of the list_head can be simplified by making it the first element
of the structure, thus avoiding a needless add-immediate without negatively
impacting any of the other accesses.
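
(Illustration only.)  Both effects can be seen in a cut-down userspace
sketch.  The structs below are simplified stand-ins, not the real
struct slot, and list_node is a hypothetical substitute for list_head.

```c
#include <assert.h>
#include <stddef.h>

struct list_node { struct list_node *next, *prev; };

/* Byte fields separated by pointers: each one drags padding behind it. */
struct slot_before {
    int state;
    unsigned char removable;
    void *dn;
    unsigned char dev_type;
    void *hotplug_slot;
    struct list_node link;
};

/* List node first; the two byte fields now share one padded word. */
struct slot_after {
    struct list_node link;
    int state;
    unsigned char removable;
    unsigned char dev_type;
    void *dn;
    void *hotplug_slot;
};

/*
 * Recover the containing slot from its embedded list node.  With the
 * node at offset 0 the subtraction is by zero, so no add is needed.
 */
struct slot_after *slot_from_link(struct list_node *n)
{
    assert(n != NULL);
    return (struct slot_after *)((char *)n -
                                 offsetof(struct slot_after, link));
}
```

On common 32-bit and 64-bit ABIs sizeof(struct slot_after) comes out
smaller than sizeof(struct slot_before), though exact sizes depend on
the compiler's alignment rules.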

Signed-off-by: Linas Vepstas <linas at linas.org>

--linas

--- drivers/pci/hotplug/rpaphp.h.orig	2004-06-18 16:10:47.000000000 -0500
+++ drivers/pci/hotplug/rpaphp.h	2004-06-23 13:28:20.000000000 -0500
@@ -85,6 +85,7 @@ struct rpaphp_pci_func {
  * struct slot - slot information for each *physical* slot
  */
 struct slot {
+	struct list_head rpaphp_slot_list;
 	int state;
 	u32 index;
 	u32 type;
@@ -92,6 +93,7 @@ struct slot {
 	char *name;
 	char *location;
 	u8 removable;
+	u8 dev_type;		/* VIO or PCI */
 	struct device_node *dn;	/* slot's device_node in OFDT */
 				/* dn has phb info */
 	struct pci_dev *bridge;	/* slot's pci_dev in pci_devices */
@@ -99,9 +101,7 @@ struct slot {
 		struct list_head pci_funcs; /* pci_devs in PCI slot */
 		struct vio_dev *vio_dev; /* vio_dev in VIO slot */
 	} dev;
-	u8 dev_type;		/* VIO or PCI */
 	struct hotplug_slot *hotplug_slot;
-	struct list_head rpaphp_slot_list;
 };

 extern struct hotplug_slot_ops rpaphp_hotplug_slot_ops;



From linas at austin.ibm.com  Sat Jul  3 03:29:49 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Fri, 2 Jul 2004 12:29:49 -0500
Subject: EEH/Hotplug (was Re: [PATCH] rpaphp broken in ameslab)
In-Reply-To: <16612.63829.67601.300009@cargo.ozlabs.ibm.com>; from paulus@samba.org on Fri, Jul 02, 2004 at 03:57:41PM +1000
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> <16611.29735.778269.316356@cargo.ozlabs.ibm.com> <20040701131430.C21634@forte.austin.ibm.com> <16612.63829.67601.300009@cargo.ozlabs.ibm.com>
Message-ID: <20040702122949.S21634@forte.austin.ibm.com>


On Fri, Jul 02, 2004 at 03:57:41PM +1000, Paul Mackerras wrote:
> > > I want to see a notifier list exported by eeh.c as I proposed in a
> > > previous email before that goes upstream.
> >
> > It's currently implemented as a work queue. Is that acceptable?
> > To keep gregkh happy, I'll move the work-queue to
> > drivers/pci/hotplug/rpaphp_eeh.c, will this work?
>
> It's not the work queue that is the problem, it is that the EEH code
> is taking a decision about what hotplug should do.  I am saying that
> the EEH code should offer to provide notifications to any interested
> code about slot isolation events.  The slot isolation event is a fact,
> the request to do an unplug operation is policy.  Let's leave the
> policy up to the rpaphp driver and/or userspace.

I'm not yet convinced that hotplug should be the focal point for
device driver policy decisions, but I'll go ahead and implement the
notifier chain for now, and see what happens.
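
(Illustration only.)  A minimal userspace sketch of the notifier idea:
the reporting side only announces the isolation event, and each
registered listener decides what to do about it.  This is not the
kernel's notifier API; all names here (slot_event, notifier,
notifier_register, hotplug_listener) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum slot_event { SLOT_ISOLATED = 1 };

struct notifier {
    int (*call)(struct notifier *self, enum slot_event ev, void *slot);
    struct notifier *next;
};

static struct notifier *chain;          /* head of registered listeners */

void notifier_register(struct notifier *n)
{
    assert(n != NULL && n->call != NULL);
    n->next = chain;
    chain = n;
}

/* Report a fact to every listener; returns how many were called. */
int notifier_call_chain(enum slot_event ev, void *slot)
{
    struct notifier *n;
    int called = 0;

    for (n = chain; n != NULL; n = n->next) {
        n->call(n, ev, slot);
        called++;
    }
    return called;
}

/* Example listener: the policy (here, counting unplug requests) is its own. */
static int unplug_requests;

int hotplug_listener(struct notifier *self, enum slot_event ev, void *slot)
{
    (void)self;
    (void)slot;
    if (ev == SLOT_ISOLATED)
        unplug_requests++;
    return 0;
}
```

The reporting code never names the hotplug driver; it would call
notifier_call_chain(SLOT_ISOLATED, slot) and leave the rest to
whoever registered.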

Note that the scsi generic layer implements a bunch of policy of
almost the same kind, except that it's for the scsi bus,
and not for the pci bus.   Not all scsi device drivers use the
scsi-generic layer, but those that do get a reset sequence something
like the following:

-- if device not responding, reset device
-- if above failed, retry a few times.
-- if still failed, reset scsi bus
-- if still failed, retry a few times ...
-- if above failed, reset scsi controller
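
(Illustration only.)  The escalation ladder above can be written as a
table of recovery levels that is walked from least to most disruptive.
This is a sketch of the idea, not the scsi layer's code; reset_level,
escalate_reset and the toy reset functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A recovery action returns nonzero when the device responds again. */
typedef int (*reset_fn)(void *ctx);

struct reset_level {
    const char *name;
    reset_fn    action;
    int         attempts;       /* first try plus retries at this level */
};

/*
 * Try each level in order, repeating it 'attempts' times before moving
 * up.  Returns the index of the level that worked, or -1 if none did.
 */
int escalate_reset(const struct reset_level *levels, size_t n_levels,
                   void *ctx)
{
    size_t i;
    int a;

    assert(levels != NULL);
    for (i = 0; i < n_levels; i++)
        for (a = 0; a < levels[i].attempts; a++)
            if (levels[i].action(ctx))
                return (int)i;
    return -1;
}

/* Toy actions: ctx counts attempts; only the bus reset revives the device. */
int reset_device(void *ctx)     { ++*(int *)ctx; return 0; }
int reset_bus(void *ctx)        { ++*(int *)ctx; return 1; }
int reset_controller(void *ctx) { ++*(int *)ctx; return 1; }
```

With three device attempts followed by a bus level, the toy device
above recovers on the fourth attempt overall and the controller is
never touched.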

For pci bus disconnection events that affected scsi devices, I was
going to tap into that 'policy' code.  I'm not sure I want to comment
more until I try the prototype.

I'm not sure if anyone is thinking about i/o fabrics yet, or how
that policy gets done ... for example, one disk is attached to
two scsi controllers, and there was an eeh event on one of the
controllers; where is the failover policy implemented?  Currently,
I think the device drivers that do this are all proprietary ...

--linas


From greg at kroah.com  Sat Jul  3 04:12:15 2004
From: greg at kroah.com (Greg KH)
Date: Fri, 2 Jul 2004 11:12:15 -0700
Subject: EEH/Hotplug (was Re: [PATCH] rpaphp broken in ameslab)
In-Reply-To: <20040702122949.S21634@forte.austin.ibm.com>
References: <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> <16611.29735.778269.316356@cargo.ozlabs.ibm.com> <20040701131430.C21634@forte.austin.ibm.com> <16612.63829.67601.300009@cargo.ozlabs.ibm.com> <20040702122949.S21634@forte.austin.ibm.com>
Message-ID: <20040702181214.GA28182@kroah.com>


On Fri, Jul 02, 2004 at 12:29:49PM -0500, linas at austin.ibm.com wrote:
> On Fri, Jul 02, 2004 at 03:57:41PM +1000, Paul Mackerras wrote:
> > > > I want to see a notifier list exported by eeh.c as I proposed in a
> > > > previous email before that goes upstream.
> > >
> > > It's currently implemented as a work queue. Is that acceptable?
> > > To keep gregkh happy, I'll move the work-queue to
> > > drivers/pci/hotplug/rpaphp_eeh.c, will this work?
> >
> > It's not the work queue that is the problem, it is that the EEH code
> > is taking a decision about what hotplug should do.  I am saying that
> > the EEH code should offer to provide notifications to any interested
> > code about slot isolation events.  The slot isolation event is a fact,
> > the request to do an unplug operation is policy.  Let's leave the
> > policy up to the rpaphp driver and/or userspace.
>
> I'm not yet convinced that hotplug should be the focal point for
> device driver policy decisions,

Sorry, but you're a bit late to the table for trying to change this
overall kernel design decision :)

> but I'll go ahead and implement the notifier chain for now, and see
> what happens.

Thank you.

> Note that the scsi generic layer implements a bunch of policy of
> almost the same kind, except that it's for the scsi bus,
> and not for the pci bus.   Not all scsi device drivers use the
> scsi-generic layer, but those that do get a reset sequence something
> like the following:
>
> -- if device not responding, reset device
> -- if above failed, retry a few times.
> -- if still failed, reset scsi bus
> -- if still failed, retry a few times ...
> -- if above failed, reset scsi controller
>
> For pci bus disconnection events that affected scsi devices, I was
> going to tap into that 'policy' code.  I'm not sure I want to comment
> more until I try the prototype.

scsi errors and pci errors are quite different things.  For one, I'm
pretty sure the scsi stuff is specified by the spec.  And it's way more
common than pci errors would be.

It's also done in a generic manner, not an arch-specific way, which is a
good thing.

> I'm not sure if anyone is thinking about i/o fabrics yet, or how
> that policy gets done ... for example, one disk is attached to
> two scsi controllers, and there was an eeh event on one of the
> controllers; where is the failover policy implemented?  Currently,
> I think the device drivers that do this are all proprietary ...

The multipath people are working on this, using dm and userspace stuff.
The kernel drivers that try to do this within the kernel have been
rejected for one reason or another (not the least being that no company
seems to want to release them...)

thanks,

greg k-h


From greg at kroah.com  Sat Jul  3 06:56:51 2004
From: greg at kroah.com (Greg KH)
Date: Fri, 2 Jul 2004 13:56:51 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <40E1F6C7.80907@us.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com>
Message-ID: <20040702205651.GD29580@kroah.com>


On Tue, Jun 29, 2004 at 06:09:59PM -0500, Linda Xie wrote:
> Hi Joel,
>
> Any changes to rpaphp (or rpaphp related) now should be posted on the PCI
> hotplug mailing list (CC to kernel mailing list and Greg KH).  After
> Greg applies the changes to his tree, the changes will be merged into
> ameslab. BTW, I posted a separate patch that exports pci_scan_child_bus
> for rpaphp a while ago.
>
> Greg,  Can you apply the attached patch to your tree?
>
> Signed-off-by: Linda Xie lxie at us.ibm.com

Applied, thanks.

greg k-h


From sharada at in.ibm.com  Mon Jul  5 16:53:00 2004
From: sharada at in.ibm.com (R Sharada)
Date: Mon, 5 Jul 2004 12:23:00 +0530
Subject: Fw: debugging on ppc with xmon
Message-ID: <20040705065300.GB1458@in.ibm.com>


Hello,
	Perhaps I sent this mail out earlier to the wrong id for the list. Did
not see it on the list. Hence resending - hopefully this time I am sending to
the right id.

Thanks and Regards,
Sharada
----- Forwarded message from R Sharada <sharada at in.ibm.com> -----

Date: Mon, 5 Jul 2004 11:28:21 +0530
From: R Sharada <sharada at in.ibm.com>
To: owner-linuxppc64-dev at lists.linuxppc.org
Subject: debugging on ppc with xmon
Reply-To: sharada at in.ibm.com

Hello,
	I have been learning and understanding ppc lately, and was working on
trying to do some re-ordering of the prom code.
	As a first step, I was trying to move the setting of the global cpuid
masks to a later point in time, in setup.c, within setup_system.
	So, I moved the code performing the cpuset in the prom_hold_cpus()
function to a function, cpuid_setup(), and called it within setup_system, after
finish_device_tree() (within CONFIG_PSERIES).
	When I rebooted this kernel, it oopsed in xmon right after prom_init,
with a data SLB access (which in this case is most likely to be a wrong memory
reference or null pointer reference).
	I did go through Olof's mail on debugging with xmon on ppc. However, I
am still not clear as to what access is causing the invalid memory reference.
A zr at the xmon prompt does not reboot but faults back again into xmon.
	How can I obtain an objdump output with source line correlation to the
assembly code?
	Any pointers or directions as to how to debug this would be helpful,
as I am starting off with ppc debugging with this as the first.

The panic oops is pasted here:

Calling quiesce ...
returning from prom_init
cpu 0x0: Vector: 380 (Data SLB Access) at [c000000000563b70]
    pc: c000000000039d98: .cpuid_setup+0x1d0/0x3b0
    lr: c000000000039d7c: .cpuid_setup+0x1b4/0x3b0
    sp: c000000000563df0
   msr: 9000000000001032
   dar: 4202000
  current = 0xc0000000005f5480
  paca    = 0xc000000000564000
    pid   = 0, comm = swapper
enter ? for help
0:mon> ?

0:mon> di c000000000039d98
c000000000039d98  901c0000      stw     r0,0(r28)
c000000000039d9c  4800097d      bl      c00000000003a718        # .get_property0
c000000000039da0  60000000      nop
c000000000039da4  38a00000      li      r5,0
c000000000039da8  e882a318      ld      r4,-23784(r2)
c000000000039dac  7c7c1b78      mr      r28,r3
c000000000039db0  7fc3f378      mr      r3,r30
c000000000039db4  48000965      bl      c00000000003a718        # .get_property0
c000000000039db8  60000000      nop
c000000000039dbc  38a00001      li      r5,1
c000000000039dc0  80030000      lwz     r0,0(r3)
c000000000039dc4  2f800000      cmpwi   cr7,r0,0
c000000000039dc8  419c0018      blt     cr7,c000000000039de0    # .cpuid_setup+0
c000000000039dcc  7c0007b4      extsw   r0,r0
c000000000039dd0  7805f022      rldicl  r5,r0,62,32
c000000000039dd4  2b850002      cmplwi  cr7,r5,2

0:mon> zr
cpu 0x0: Vector: 380 (Data SLB Access) at [c0000000005630f0]
    pc: c00000000004be54: .cmds+0x1914/0x1f34
    lr: c00000000004a864: .cmds+0x324/0x1f34
    sp: c000000000563370
   msr: 9000000000001032
   dar: 0
  current = 0xc0000000005f5480
  paca    = 0xc000000000564000
    pid   = 0, comm = swapper
cpu 0x0: Exception 380 (Data SLB Access) in xmon, returning to main loop

Thanks and Regards,
Sharada


----- End forwarded message -----


From segher at kernel.crashing.org  Tue Jul  6 17:04:05 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 6 Jul 2004 09:04:05 +0200
Subject: debugging on ppc with xmon
In-Reply-To: <20040705065300.GB1458@in.ibm.com>
References: <20040705065300.GB1458@in.ibm.com>
Message-ID: <AAF7F354-CF1A-11D8-8F1A-000A95A4DC02@kernel.crashing.org>


> 	When I rebooted this kernel, it oopsed in xmon right after prom_init,
> with a data SLB access (which in this case is most likely to be a
> wrong memory reference or null pointer reference).

Null pointer references do not fail here :-(  -- so it is some
other "wrong" access.

> 	How can I obtain an objdump output with source line correlation to the
> assembly code?

objdump -drS vmlinux

It will only show the C source files though, and/or only when
you have kernel debug info enabled.  Also, it is pretty useless
at any decent compiler optimization level IMHO.

> 	Any pointers or directions as to how to debug this would be helpful,
> as I am starting off with ppc debugging with this as the first.

I normally just insert a boatload of printk()'s, see what interval
it fails in, and refine.  Repeat until done.

Or set a lot of breakpoints in your debugger -- basically the same
thing, but less convenient.


Segher




From linas at austin.ibm.com  Wed Jul  7 07:12:54 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Tue, 6 Jul 2004 16:12:54 -0500
Subject: How to block pci config-reads during device self-test?
Message-ID: <20040706161254.C21634@forte.austin.ibm.com>


Hi all,

I am having trouble with PCI config-space reads ... I have a device
(actually Brian King has it) that can perform a built-in self-test
(BIST).  However, if anything does a PCI config-read during BIST,
then the device does something crazy that makes the PCI controller
chip take it offline.

I'm not sure what's doing the config-space reads ... seems to be some
user-space tool or daemon.  I'm wondering if there is any practical
way to block such reads to a given device until its self-test
sequence is completed.  I could try to modify the architecture-specific
pci files to do this (arch/ppc64/kernel/pSeries_pci.c) but this seems
a tad ugly ... is there another way?  or do we have to just learn to
live with this hardware?

--linas



From benh at kernel.crashing.org  Wed Jul  7 07:15:49 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 06 Jul 2004 16:15:49 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040706161254.C21634@forte.austin.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com>
Message-ID: <1089148549.1898.84.camel@gaston>


> I'm not sure what's doing the config-space reads ... seems to be some
> user-space tool or daemon.  I'm wondering if there is any practical
> way to block such reads to a given device until its self-test
> sequence is completed.  I could try to modify the architecture-specific
> pci files to do this (arch/ppc64/kernel/pSeries_pci.c) but this seems
> a tad ugly ... is there another way?  or do we have to just learn to
> live with this hardware?

I see no sane solution... I have a similar (not exactly identical though)
problem on the g5 where some devices will lock up on config access when
they are power managed. What I did there was to hack a list of devices
to "ignore" on config accesses...

If your driver knows when the device is going away for the BIST and how
long, maybe you can play tricks like adding a property to the device
node indicating if the device should be skipped on config access (that
is basically return all fffffff's), have the driver set that for
the duration of the BIST, and the config ops check that.
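
[Editorial sketch, not part of Ben's message.]  The idea above can be
modelled in plain user-space C.  Everything here is hypothetical: the
struct, the skip_config flag and the demo_* helpers are illustrative
names, not kernel APIs, and a real implementation would hang the flag
off the device node and test it in the platform's config ops.  It only
shows the shape of the trick: while the flag is set, reads return all
ones, which is what a read from an absent device normally yields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-device state; a stand-in, not a kernel structure. */
struct demo_pci_dev {
	bool skip_config;	/* set by the driver while BIST runs */
	uint32_t config[64];	/* stand-in for 256 bytes of config space */
};

/* Read one 32-bit config register, or all ones if the device is flagged. */
static uint32_t demo_config_read32(const struct demo_pci_dev *dev,
				   unsigned int offset)
{
	if (dev->skip_config)
		return 0xffffffffu;	/* pretend nothing answered */
	/* Mask to a dword-aligned offset inside the 256-byte space. */
	return dev->config[(offset & 0xfc) / 4];
}

/* The driver brackets the self-test with these two calls. */
static void demo_bist_begin(struct demo_pci_dev *dev)
{
	dev->skip_config = true;
}

static void demo_bist_end(struct demo_pci_dev *dev)
{
	dev->skip_config = false;
}
```

A driver would call demo_bist_begin() just before starting the
self-test and demo_bist_end() once it knows the test has finished; any
config read in between sees 0xffffffff instead of touching the device.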

Definitely not the cleanest way, but to me, it seems like broken HW...

Ben.



From greg at kroah.com  Wed Jul  7 09:48:43 2004
From: greg at kroah.com (Greg KH)
Date: Tue, 6 Jul 2004 16:48:43 -0700
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040706161254.C21634@forte.austin.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com>
Message-ID: <20040706234843.GA11327@kroah.com>


On Tue, Jul 06, 2004 at 04:12:54PM -0500, linas at austin.ibm.com wrote:
>
> I could try to modify the architecture-specific pci files to do this
> (arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly ... is
> there another way?  or do we have to just learn to live with this
> hardware?

As I've told Brian in the past, you have to learn to just live with this
broken hardware.

Good luck,

greg k-h



From david at gibson.dropbear.id.au  Wed Jul  7 16:06:26 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 7 Jul 2004 16:06:26 +1000
Subject: [0/4] RFC: SLB rewrite
Message-ID: <20040707060626.GA987@zax>


Here, at long last, the much awaited, SLB rewrite!

I have this in four patches at the moment:

1/4: slbrewrite2     - The guts of the rewrite, miss path in asm
2/4: noprolog        - Unify do_slb_bolted with full SLB miss path
3/4: customentryexit - Optimise SLB exception entry/exit path
4/4: newvsid         - Replace VSID algorithm with a better one

Overall these seem to improve the (user address) SLB miss time by
around 25% (200ns to 150ns on a G5).  The test program and resulting
graphs are at http://www.ozlabs.org/people/dgibson/slbtest/ and the most
interesting graph is probably:
	http://www.ozlabs.org/people/dgibson/slbtest/stagger.png

These patches have been tested (though not extensively) on an Apple G5
(PowerPC 970), a p630 LPAR (Power4+), an RS/6000 270 (POWER3) and an
RS64 iSeries partition (the latter two obviously aren't expected to
show any improvement; the point is just to check I haven't broken
things dramatically for STAB machines).  Anton has also tested some
earlier versions of the rewrite on a Power5 (bare metal Linux).

If there are no serious problems, I'm thinking of pushing at least the
first three of these to Linus/akpm in the next week or so.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From david at gibson.dropbear.id.au  Wed Jul  7 16:06:35 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 7 Jul 2004 16:06:35 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
Message-ID: <20040707060635.GB987@zax>


Rewrite/cleanup of the SLB management code.  This removes nearly all
the SLB related code from arch/ppc64/kernel/stab.c and puts a
rewritten version in arch/ppc64/mm, where it better belongs.  The main
SLB miss path is in assembler and the other routines have been cleaned
up and streamlined.

Notable changes:
	- Ugly bitfields no longer used for generating SLB entries.
	- slb_allocate() (the main SLB miss routine) is now in
assembler, and all the data it uses is stored in the PACA.
	- The mm context is now copied into the PACA at context switch
time, to avoid looking up the thread struct on SLB miss.
	- An SLB miss will now never (directly) result in a call to
do_page_fault.  If we get a miss on a totally bogus address the
handler will now put in an SLB referencing VSID 0.  This will never
have any pages, so we'll get the (fatal) page fault shortly
afterwards.  This simplifies the SLB entry and exit paths.
	- The round-robin pointer in the PACA now references the
last-used instead of next-to-use SLB slot, which simplifies the asm
for updating it slightly.
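
[Editorial sketch, not part of the patch.]  Two of the changes listed
above can be illustrated in stand-alone user-space C: building the SLB
entry words with shifts and masks instead of bitfields, and advancing a
round-robin pointer that records the last-used slot.  The mask values
are copied from the patch below; the demo_* names and the value 64 for
DEMO_SLB_NUM_ENTRIES are assumptions made for illustration.  The real
miss handler is in assembler and also refuses to evict the segment
holding the kernel stack, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the patch (asm-ppc64/mmu.h and page.h hunks). */
#define ESID_MASK		0xfffffffff0000000ULL
#define SLB_ESID_V		0x0000000008000000ULL	/* entry is valid */
#define SLB_VSID_SHIFT		12
#define SLB_VSID_KP		0x0000000000000400ULL
#define SLB_VSID_C		0x0000000000000080ULL	/* class */
#define SLB_VSID_KERNEL		(SLB_VSID_KP | SLB_VSID_C)
#define SLB_NUM_BOLTED		2

/* Assumption: the SLB size is not defined in the hunks shown here. */
#define DEMO_SLB_NUM_ENTRIES	64

/* The two doublewords that would be handed to slbmte. */
struct demo_slbe {
	uint64_t esid_word;
	uint64_t vsid_word;
};

/* Compose both words with plain masks, mirroring create_slbe(). */
static struct demo_slbe demo_make_slbe(uint64_t ea, uint64_t vsid,
				       uint64_t flags, unsigned int entry)
{
	struct demo_slbe e;

	assert(entry < DEMO_SLB_NUM_ENTRIES);
	e.esid_word = (ea & ESID_MASK) | SLB_ESID_V | entry;
	e.vsid_word = (vsid << SLB_VSID_SHIFT) | flags;
	return e;
}

/* Given the last-used slot, pick the next one, skipping bolted slots. */
static unsigned int demo_next_slot(unsigned int last_used)
{
	unsigned int slot = last_used + 1;

	if (slot >= DEMO_SLB_NUM_ENTRIES)
		slot = SLB_NUM_BOLTED;
	return slot;
}
```

Because the stored value is the slot used last time, the increment
happens before the bounds check and the result can be written straight
back, with no separate "next" value to compute afterwards.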


Index: working-2.6/include/asm-ppc64/mmu.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu.h
+++ working-2.6/include/asm-ppc64/mmu.h
@@ -37,12 +37,6 @@
 		mm_context_t ctx = { .id = REGION_ID(ea), KERNEL_LOW_HPAGES}; \
 		ctx; })

-/*
- * Hardware Segment Lookaside Buffer Entry
- * This structure has been padded out to two 64b doublewords (actual SLBE's are
- * 94 bits).  This padding facilites use by the segment management
- * instructions.
- */
 typedef struct {
 	unsigned long esid: 36; /* Effective segment ID */
 	unsigned long resv0:20; /* Reserved */
@@ -71,35 +65,6 @@
 	} dw1;
 } STE;

-typedef struct {
-	unsigned long esid: 36; /* Effective segment ID */
-	unsigned long v:     1; /* Entry valid (v=1) or invalid */
-	unsigned long null1:15; /* padding to a 64b boundary */
-	unsigned long index:12; /* Index to select SLB entry. Used by slbmte */
-} slb_dword0;
-
-typedef struct {
-	unsigned long vsid: 52; /* Virtual segment ID */
-	unsigned long ks:    1; /* Supervisor (privileged) state storage key */
-	unsigned long kp:    1; /* Problem state storage key */
-	unsigned long n:     1; /* No-execute if n=1 */
-	unsigned long l:     1; /* Virt pages are large (l=1) or 4KB (l=0) */
-	unsigned long c:     1; /* Class */
-	unsigned long resv0: 7; /* Padding to a 64b boundary */
-} slb_dword1;
-
-typedef struct {
-	union {
-		unsigned long dword0;
-		slb_dword0    dw0;
-	} dw0;
-
-	union {
-		unsigned long dword1;
-		slb_dword1    dw1;
-	} dw1;
-} SLBE;
-
 /* Hardware Page Table Entry */

 #define HPTES_PER_GROUP 8
@@ -259,6 +224,30 @@
 #define STAB0_PHYS_ADDR	(STAB0_PAGE<<PAGE_SHIFT)
 #define STAB0_VIRT_ADDR	(KERNELBASE+STAB0_PHYS_ADDR)

+#define SLB_NUM_BOLTED		2
+#define SLB_CACHE_ENTRIES	8
+
+/* Bits in the SLB ESID word */
+#define SLB_ESID_V		0x0000000008000000	/* entry is valid */
+
+/* Bits in the SLB VSID word */
+#define SLB_VSID_SHIFT		12
+#define SLB_VSID_KS		0x0000000000000800
+#define SLB_VSID_KP		0x0000000000000400
+#define SLB_VSID_N		0x0000000000000200	/* no-execute */
+#define SLB_VSID_L		0x0000000000000100	/* largepage (4M) */
+#define SLB_VSID_C		0x0000000000000080	/* class */
+
+#define SLB_VSID_KERNEL		(SLB_VSID_KP|SLB_VSID_C)
+#define SLB_VSID_USER		(SLB_VSID_KP|SLB_VSID_KS)
+
+#define VSID_RANDOMIZER ASM_CONST(42470972311)
+#define VSID_MASK	0xfffffffffUL
+/* Because we never access addresses below KERNELBASE as kernel
+ * addresses, this VSID is never used for anything real, and will
+ * never have pages hashed into it */
+#define BAD_VSID	ASM_CONST(0)
+
 /* Block size masks */
 #define BL_128K	0x000
 #define BL_256K 0x001
Index: working-2.6/include/asm-ppc64/page.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/page.h
+++ working-2.6/include/asm-ppc64/page.h
@@ -27,6 +27,7 @@

 #define SID_SHIFT       28
 #define SID_MASK        0xfffffffffUL
+#define ESID_MASK	0xfffffffff0000000UL
 #define GET_ESID(x)     (((x) >> SID_SHIFT) & SID_MASK)

 #ifdef CONFIG_HUGETLB_PAGE
@@ -37,8 +38,8 @@
 #define HUGETLB_PAGE_ORDER	(HPAGE_SHIFT - PAGE_SHIFT)

 /* For 64-bit processes the hugepage range is 1T-1.5T */
-#define TASK_HPAGE_BASE 	(0x0000010000000000UL)
-#define TASK_HPAGE_END 	(0x0000018000000000UL)
+#define TASK_HPAGE_BASE ASM_CONST(0x0000010000000000)
+#define TASK_HPAGE_END 	ASM_CONST(0x0000018000000000)

 #define LOW_ESID_MASK(addr, len)	(((1U << (GET_ESID(addr+len-1)+1)) \
 	   	                	- (1U << GET_ESID(addr))) & 0xffff)
Index: working-2.6/include/asm-ppc64/mmu_context.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu_context.h
+++ working-2.6/include/asm-ppc64/mmu_context.h
@@ -136,7 +136,7 @@
 }

 extern void flush_stab(struct task_struct *tsk, struct mm_struct *mm);
-extern void flush_slb(struct task_struct *tsk, struct mm_struct *mm);
+extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);

 /*
  * switch_mm is the entry point called from the architecture independent
@@ -161,7 +161,7 @@
 		return;

 	if (cur_cpu_spec->cpu_features & CPU_FTR_SLB)
-		flush_slb(tsk, next);
+		switch_slb(tsk, next);
 	else
 		flush_stab(tsk, next);
 }
@@ -181,10 +181,6 @@
 	local_irq_restore(flags);
 }

-#define VSID_RANDOMIZER 42470972311UL
-#define VSID_MASK	0xfffffffffUL
-
-
 /* This is only valid for kernel (including vmalloc, imalloc and bolted) EA's
  */
 static inline unsigned long
Index: working-2.6/include/asm-ppc64/paca.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/paca.h
+++ working-2.6/include/asm-ppc64/paca.h
@@ -78,20 +78,25 @@
 	u64 exmc[8];		/* used for machine checks */
 	u64 exslb[8];		/* used for SLB/segment table misses
 				 * on the linear mapping */
-	u64 exdsi[8];		/* used for linear mapping hash table misses */
+	mm_context_t context;
+	u16 slb_cache[SLB_CACHE_ENTRIES];
+	u16 slb_cache_ptr;

 	/*
 	 * then miscellaneous read-write fields
 	 */
 	struct task_struct *__current;	/* Pointer to current */
 	u64 kstack;			/* Saved Kernel stack addr */
-	u64 stab_next_rr;		/* stab/slb round-robin counter */
+	u64 stab_rr;			/* stab/slb round-robin counter */
 	u64 next_jiffy_update_tb;	/* TB value for next jiffy update */
 	u64 saved_r1;			/* r1 save for RTAS calls */
 	u64 saved_msr;			/* MSR saved here by enter_rtas */
 	u32 lpevent_count;		/* lpevents processed  */
 	u8 proc_enabled;		/* irq soft-enable flag */

+	/* not yet used */
+	u64 exdsi[8];		/* used for linear mapping hash table misses */
+
 	/*
 	 * iSeries structues which the hypervisor knows about - Not
 	 * sure if these particularly need to be cacheline aligned.
Index: working-2.6/arch/ppc64/kernel/stab.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/stab.c
+++ working-2.6/arch/ppc64/kernel/stab.c
@@ -20,26 +20,10 @@
 #include <asm/naca.h>
 #include <asm/cputable.h>

-static int make_ste(unsigned long stab, unsigned long esid, unsigned long vsid);
-static void make_slbe(unsigned long esid, unsigned long vsid, int large,
-		      int kernel_segment);
+static int make_ste(unsigned long stab, unsigned long esid,
+		    unsigned long vsid);

-static inline void slb_add_bolted(void)
-{
-#ifndef CONFIG_PPC_ISERIES
-	unsigned long esid = GET_ESID(VMALLOCBASE);
-	unsigned long vsid = get_kernel_vsid(VMALLOCBASE);
-
-	WARN_ON(!irqs_disabled());
-
-	/*
-	 * Bolt in the first vmalloc segment. Since modules end
-	 * up there it gets hit very heavily.
-	 */
-	get_paca()->stab_next_rr = 1;
-	make_slbe(esid, vsid, 0, 1);
-#endif
-}
+void slb_initialize(void);

 /*
  * Build an entry for the base kernel segment and put it into
@@ -48,32 +32,13 @@
  */
 void stab_initialize(unsigned long stab)
 {
-	unsigned long esid, vsid;
-	int seg0_largepages = 0;
-
-	esid = GET_ESID(KERNELBASE);
-	vsid = get_kernel_vsid(esid << SID_SHIFT);
-
-	if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE)
-		seg0_largepages = 1;
+	unsigned long vsid = get_kernel_vsid(KERNELBASE);

 	if (cur_cpu_spec->cpu_features & CPU_FTR_SLB) {
-		/* Invalidate the entire SLB & all the ERATS */
-#ifdef CONFIG_PPC_ISERIES
-		asm volatile("isync; slbia; isync":::"memory");
-#else
-		asm volatile("isync":::"memory");
-		asm volatile("slbmte  %0,%0"::"r" (0) : "memory");
-		asm volatile("isync; slbia; isync":::"memory");
-		get_paca()->stab_next_rr = 0;
-		make_slbe(esid, vsid, seg0_largepages, 1);
-		asm volatile("isync":::"memory");
-#endif
-
-		slb_add_bolted();
+		slb_initialize();
 	} else {
 		asm volatile("isync; slbia; isync":::"memory");
-		make_ste(stab, esid, vsid);
+		make_ste(stab, GET_ESID(KERNELBASE), vsid);

 		/* Order update */
 		asm volatile("sync":::"memory");
@@ -129,7 +94,7 @@
 	 * Could not find empty entry, pick one with a round robin selection.
 	 * Search all entries in the two groups.
 	 */
-	castout_entry = get_paca()->stab_next_rr;
+	castout_entry = get_paca()->stab_rr;
 	for (i = 0; i < 16; i++) {
 		if (castout_entry < 8) {
 			global_entry = (esid & 0x1f) << 3;
@@ -148,7 +113,7 @@
 		castout_entry = (castout_entry + 1) & 0xf;
 	}

-	get_paca()->stab_next_rr = (castout_entry + 1) & 0xf;
+	get_paca()->stab_rr = (castout_entry + 1) & 0xf;

 	/* Modify the old entry to the new value. */

@@ -314,229 +279,3 @@

 	preload_stab(tsk, mm);
 }
-
-/*
- * SLB stuff
- */
-
-/*
- * Create a segment buffer entry for the given esid/vsid pair.
- *
- * NOTE: A context syncronising instruction is required before and after
- * this, in the common case we use exception entry and rfid.
- */
-static void make_slbe(unsigned long esid, unsigned long vsid, int large,
-		      int kernel_segment)
-{
-	unsigned long entry, castout_entry;
-	union {
-		unsigned long word0;
-		slb_dword0    data;
-	} esid_data;
-	union {
-		unsigned long word0;
-		slb_dword1    data;
-	} vsid_data;
-	struct paca_struct *lpaca = get_paca();
-
-	/*
-	 * We take the next entry, round robin. Previously we tried
-	 * to find a free slot first but that took too long. Unfortunately
-	 * we dont have any LRU information to help us choose a slot.
-	 */
-
-	/*
-	 * Never cast out the segment for our kernel stack. Since we
-	 * dont invalidate the ERAT we could have a valid translation
-	 * for the kernel stack during the first part of exception exit
-	 * which gets invalidated due to a tlbie from another cpu at a
-	 * non recoverable point (after setting srr0/1) - Anton
-	 *
-	 * paca Ksave is always valid (even when on the interrupt stack)
-	 * so we use that.
-	 */
-	castout_entry = lpaca->stab_next_rr;
-	do {
-		entry = castout_entry;
-		castout_entry++;
-		/*
-		 * We bolt in the first kernel segment and the first
-		 * vmalloc segment.
-		 */
-		if (castout_entry >= SLB_NUM_ENTRIES)
-			castout_entry = 2;
-		asm volatile("slbmfee  %0,%1" : "=r" (esid_data) : "r" (entry));
-	} while (esid_data.data.v &&
-		 esid_data.data.esid == GET_ESID(lpaca->kstack));
-
-	lpaca->stab_next_rr = castout_entry;
-
-	/* slbie not needed as the previous mapping is still valid. */
-
-	/*
-	 * Write the new SLB entry.
-	 */
-	vsid_data.word0 = 0;
-	vsid_data.data.vsid = vsid;
-	vsid_data.data.kp = 1;
-	if (large)
-		vsid_data.data.l = 1;
-	if (kernel_segment)
-		vsid_data.data.c = 1;
-	else
-		vsid_data.data.ks = 1;
-
-	esid_data.word0 = 0;
-	esid_data.data.esid = esid;
-	esid_data.data.v = 1;
-	esid_data.data.index = entry;
-
-	/*
-	 * No need for an isync before or after this slbmte. The exception
-	 * we enter with and the rfid we exit with are context synchronizing.
-	 */
-	asm volatile("slbmte  %0,%1" : : "r" (vsid_data), "r" (esid_data));
-}
-
-static inline void __slb_allocate(unsigned long esid, unsigned long vsid,
-				  mm_context_t context)
-{
-	int large = 0;
-	int region_id = REGION_ID(esid << SID_SHIFT);
-	unsigned long offset;
-
-	if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) {
-		if (region_id == KERNEL_REGION_ID)
-			large = 1;
-		else if (region_id == USER_REGION_ID)
-			large = in_hugepage_area(context, esid << SID_SHIFT);
-	}
-
-	make_slbe(esid, vsid, large, region_id != USER_REGION_ID);
-
-	if (region_id != USER_REGION_ID)
-		return;
-
-	offset = __get_cpu_var(stab_cache_ptr);
-	if (offset < NR_STAB_CACHE_ENTRIES)
-		__get_cpu_var(stab_cache[offset++]) = esid;
-	else
-		offset = NR_STAB_CACHE_ENTRIES+1;
-	__get_cpu_var(stab_cache_ptr) = offset;
-}
-
-/*
- * Allocate a segment table entry for the given ea.
- */
-int slb_allocate(unsigned long ea)
-{
-	unsigned long vsid, esid;
-	mm_context_t context;
-
-	/* Check for invalid effective addresses. */
-	if (unlikely(!IS_VALID_EA(ea)))
-		return 1;
-
-	/* Kernel or user address? */
-	if (REGION_ID(ea) >= KERNEL_REGION_ID) {
-		context = KERNEL_CONTEXT(ea);
-		vsid = get_kernel_vsid(ea);
-	} else {
-		if (unlikely(!current->mm))
-			return 1;
-
-		context = current->mm->context;
-		vsid = get_vsid(context.id, ea);
-	}
-
-	esid = GET_ESID(ea);
-#ifndef CONFIG_PPC_ISERIES
-	BUG_ON((esid << SID_SHIFT) == VMALLOCBASE);
-#endif
-	__slb_allocate(esid, vsid, context);
-
-	return 0;
-}
-
-/*
- * preload some userspace segments into the SLB.
- */
-static void preload_slb(struct task_struct *tsk, struct mm_struct *mm)
-{
-	unsigned long pc = KSTK_EIP(tsk);
-	unsigned long stack = KSTK_ESP(tsk);
-	unsigned long unmapped_base;
-	unsigned long pc_esid = GET_ESID(pc);
-	unsigned long stack_esid = GET_ESID(stack);
-	unsigned long unmapped_base_esid;
-	unsigned long vsid;
-
-	if (test_tsk_thread_flag(tsk, TIF_32BIT))
-		unmapped_base = TASK_UNMAPPED_BASE_USER32;
-	else
-		unmapped_base = TASK_UNMAPPED_BASE_USER64;
-
-	unmapped_base_esid = GET_ESID(unmapped_base);
-
-	if (!IS_VALID_EA(pc) || (REGION_ID(pc) >= KERNEL_REGION_ID))
-		return;
-	vsid = get_vsid(mm->context.id, pc);
-	__slb_allocate(pc_esid, vsid, mm->context);
-
-	if (pc_esid == stack_esid)
-		return;
-
-	if (!IS_VALID_EA(stack) || (REGION_ID(stack) >= KERNEL_REGION_ID))
-		return;
-	vsid = get_vsid(mm->context.id, stack);
-	__slb_allocate(stack_esid, vsid, mm->context);
-
-	if (pc_esid == unmapped_base_esid || stack_esid == unmapped_base_esid)
-		return;
-
-	if (!IS_VALID_EA(unmapped_base) ||
-	    (REGION_ID(unmapped_base) >= KERNEL_REGION_ID))
-		return;
-	vsid = get_vsid(mm->context.id, unmapped_base);
-	__slb_allocate(unmapped_base_esid, vsid, mm->context);
-}
-
-/* Flush all user entries from the segment table of the current processor. */
-void flush_slb(struct task_struct *tsk, struct mm_struct *mm)
-{
-	unsigned long offset = __get_cpu_var(stab_cache_ptr);
-	union {
-		unsigned long word0;
-		slb_dword0 data;
-	} esid_data;
-
-	if (offset <= NR_STAB_CACHE_ENTRIES) {
-		int i;
-		asm volatile("isync" : : : "memory");
-		for (i = 0; i < offset; i++) {
-			esid_data.word0 = 0;
-			esid_data.data.esid = __get_cpu_var(stab_cache[i]);
-			BUG_ON(esid_data.data.esid == GET_ESID(VMALLOCBASE));
-			asm volatile("slbie %0" : : "r" (esid_data));
-		}
-		asm volatile("isync" : : : "memory");
-	} else {
-		asm volatile("isync; slbia; isync" : : : "memory");
-		slb_add_bolted();
-	}
-
-	/* Workaround POWER5 < DD2.1 issue */
-	if (offset == 1 || offset > NR_STAB_CACHE_ENTRIES) {
-		/*
-		 * flush segment in EEH region, we dont normally access
-		 * addresses in this region.
-		 */
-		esid_data.word0 = 0;
-		esid_data.data.esid = EEH_REGION_ID;
-		asm volatile("slbie %0" : : "r" (esid_data));
-	}
-
-	__get_cpu_var(stab_cache_ptr) = 0;
-
-	preload_slb(tsk, mm);
-}
Index: working-2.6/arch/ppc64/kernel/head.S
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/head.S
+++ working-2.6/arch/ppc64/kernel/head.S
@@ -498,7 +498,6 @@
 	mtcrf	0x80,r12
 	mfspr	r12,SPRG2
 	EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, .do_slb_bolted)
-

 	/* Space for the naca.  Architected to be located at real address
 	 * NACA_PHYS_ADDR.  Various tools rely on this location being fixed.
@@ -873,11 +872,7 @@
 	ld	r3,PACA_EXGEN+EX_DAR(r13)
 	std	r3,_DAR(r1)
 	bl	.slb_allocate
-	cmpdi	r3,0			/* Check return code */
-	beq	fast_exception_return	/* Return if we succeeded */
-	li	r5,0
-	std	r5,_DSISR(r1)
-	b	.handle_page_fault
+	b	fast_exception_return

 	.align	7
 	.globl InstructionAccess_common
@@ -894,14 +889,7 @@
 	EXCEPTION_PROLOG_COMMON(0x480, PACA_EXGEN)
 	ld	r3,_NIP(r1)		/* SRR0 = NIA	*/
 	bl	.slb_allocate
-	or.	r3,r3,r3		/* Check return code */
-	beq+	fast_exception_return	/* Return if we succeeded */
-
-	ld	r4,_NIP(r1)
-	li	r5,0
-	std	r4,_DAR(r1)
-	std	r5,_DSISR(r1)
-	b      .handle_page_fault
+	b	fast_exception_return

 	.align	7
 	.globl HardwareInterrupt_common
@@ -1155,7 +1143,6 @@
  * r9 - r13 are saved in paca->exslb.
  * We assume we aren't going to take any exceptions during this procedure.
  */
-/* XXX note fix masking in get_kernel_vsid to match */
 _GLOBAL(do_slb_bolted)
 	stw	r9,PACA_EXSLB+EX_CCR(r13)	/* save CR in exc. frame */
 	std	r11,PACA_EXSLB+EX_SRR0(r13)	/* save SRR0 in exc. frame */
@@ -1167,12 +1154,13 @@
 	 */

 	/* r13 = paca */
+	/* use a cpu feature mask if we ever change our slb size */
 1:	ld	r10,PACASTABRR(r13)
-	addi	r9,r10,1
-	cmpdi	r9,SLB_NUM_ENTRIES
+	addi	r10,r10,1
+	cmpdi	r10,SLB_NUM_ENTRIES
 	blt+	2f
-	li	r9,2			/* dont touch slot 0 or 1 */
-2:	std	r9,PACASTABRR(r13)
+	li	r10,SLB_NUM_BOLTED		/* dont touch bolted slots */
+2:	std	r10,PACASTABRR(r13)

 	/* r13 = paca, r10 = entry */

Index: working-2.6/arch/ppc64/kernel/asm-offsets.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/asm-offsets.c
+++ working-2.6/arch/ppc64/kernel/asm-offsets.c
@@ -86,10 +86,16 @@
         DEFINE(PACASAVEDMSR, offsetof(struct paca_struct, saved_msr));
         DEFINE(PACASTABREAL, offsetof(struct paca_struct, stab_real));
         DEFINE(PACASTABVIRT, offsetof(struct paca_struct, stab_addr));
-	DEFINE(PACASTABRR, offsetof(struct paca_struct, stab_next_rr));
+	DEFINE(PACASTABRR, offsetof(struct paca_struct, stab_rr));
         DEFINE(PACAR1, offsetof(struct paca_struct, saved_r1));
 	DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc));
 	DEFINE(PACAPROCENABLED, offsetof(struct paca_struct, proc_enabled));
+	DEFINE(PACASLBCACHE, offsetof(struct paca_struct, slb_cache));
+	DEFINE(PACASLBCACHEPTR, offsetof(struct paca_struct, slb_cache_ptr));
+	DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
+#ifdef CONFIG_HUGETLB_PAGE
+	DEFINE(PACAHTLBSEGS, offsetof(struct paca_struct, context.htlb_segs));
+#endif /* CONFIG_HUGETLB_PAGE */
 	DEFINE(PACADEFAULTDECR, offsetof(struct paca_struct, default_decr));
 	DEFINE(PACAPROFENABLED, offsetof(struct paca_struct, prof_enabled));
 	DEFINE(PACAPROFLEN, offsetof(struct paca_struct, prof_len));
Index: working-2.6/arch/ppc64/kernel/pacaData.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/pacaData.c
+++ working-2.6/arch/ppc64/kernel/pacaData.c
@@ -55,7 +55,6 @@
 	.stab_addr = (asrv),		/* Virt pointer to segment table */ \
 	.emergency_sp = &emergency_stack[((number)+1) * PAGE_SIZE],	    \
 	.cpu_start = (start),		/* Processor start */		    \
-	.stab_next_rr = 1,						    \
 	.lppaca = {							    \
 		.xDesc = 0xd397d781,	/* "LpPa" */			    \
 		.xSize = sizeof(struct ItLpPaca),			    \
Index: working-2.6/arch/ppc64/kernel/smp.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/smp.c
+++ working-2.6/arch/ppc64/kernel/smp.c
@@ -385,8 +385,6 @@

 	/* Fixup atomic count: it exited inside IRQ handler. */
 	paca[lcpu].__current->thread_info->preempt_count	= 0;
-	/* Fixup SLB round-robin so next segment (kernel) goes in segment 0 */
-	paca[lcpu].stab_next_rr = 0;

 	/* At boot this is done in prom.c. */
 	paca[lcpu].hw_cpu_id = pcpu;
Index: working-2.6/arch/ppc64/mm/slb.c
===================================================================
--- /dev/null
+++ working-2.6/arch/ppc64/mm/slb.c
@@ -0,0 +1,136 @@
+/*
+ * PowerPC64 SLB support.
+ *
+ * Copyright (C) 2004 David Gibson <dwg at au.ibm.com>, IBM
+ * Based on earlier code written by:
+ * Dave Engebretsen and Mike Corrigan {engebret|mikejc}@us.ibm.com
+ *    Copyright (c) 2001 Dave Engebretsen
+ * Copyright (C) 2002 Anton Blanchard <anton at au.ibm.com>, IBM
+ *
+ *
+ *      This program is free software; you can redistribute it and/or
+ *      modify it under the terms of the GNU General Public License
+ *      as published by the Free Software Foundation; either version
+ *      2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/config.h>
+#include <asm/pgtable.h>
+#include <asm/mmu.h>
+#include <asm/mmu_context.h>
+#include <asm/paca.h>
+#include <asm/naca.h>
+#include <asm/cputable.h>
+
+extern void slb_allocate(unsigned long ea);
+
+static inline void create_slbe(unsigned long ea, unsigned long vsid,
+			       unsigned long flags, unsigned long entry)
+{
+	ea = (ea & ESID_MASK) | SLB_ESID_V | entry;
+	vsid = (vsid << SLB_VSID_SHIFT) | flags;
+	asm volatile("slbmte  %0,%1" :
+		     : "r" (vsid), "r" (ea)
+		     : "memory" );
+}
+
+static void slb_add_bolted(void)
+{
+#ifndef CONFIG_PPC_ISERIES
+	WARN_ON(!irqs_disabled());
+
+	/* If you change this make sure you change SLB_NUM_BOLTED
+	 * appropriately too */
+
+	/* Slot 1 - first VMALLOC segment
+         * 	Since modules end up there it gets hit very heavily.
+         */
+	create_slbe(VMALLOCBASE, get_kernel_vsid(VMALLOCBASE),
+		    SLB_VSID_KERNEL, 1);
+
+	asm volatile("isync":::"memory");
+#endif
+}
+
+/* Flush all user entries from the segment table of the current processor. */
+void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
+{
+	unsigned long offset = get_paca()->slb_cache_ptr;
+	unsigned long esid_data;
+	unsigned long pc = KSTK_EIP(tsk);
+	unsigned long stack = KSTK_ESP(tsk);
+	unsigned long unmapped_base;
+
+	if (offset <= SLB_CACHE_ENTRIES) {
+		int i;
+		asm volatile("isync" : : : "memory");
+		for (i = 0; i < offset; i++) {
+			esid_data = (unsigned long)get_paca()->slb_cache[i]
+				<< SID_SHIFT;
+			asm volatile("slbie %0" : : "r" (esid_data));
+		}
+		asm volatile("isync" : : : "memory");
+	} else {
+		asm volatile("isync; slbia; isync" : : : "memory");
+		slb_add_bolted();
+	}
+
+	/* Workaround POWER5 < DD2.1 issue */
+	if (offset == 1 || offset > SLB_CACHE_ENTRIES) {
+		/* flush segment in EEH region, we shouldn't ever
+		 * access addresses in this region. */
+		asm volatile("slbie %0" : : "r"(EEHREGIONBASE));
+	}
+
+	get_paca()->slb_cache_ptr = 0;
+	get_paca()->context = mm->context;
+
+	/*
+	 * preload some userspace segments into the SLB.
+	 */
+	if (test_tsk_thread_flag(tsk, TIF_32BIT))
+		unmapped_base = TASK_UNMAPPED_BASE_USER32;
+	else
+		unmapped_base = TASK_UNMAPPED_BASE_USER64;
+
+	if (pc >= KERNELBASE)
+		return;
+	slb_allocate(pc);
+
+	if (GET_ESID(pc) == GET_ESID(stack))
+		return;
+
+	if (stack >= KERNELBASE)
+		return;
+	slb_allocate(stack);
+
+	if ((GET_ESID(pc) == GET_ESID(unmapped_base))
+	    || (GET_ESID(stack) == GET_ESID(unmapped_base)))
+		return;
+
+	if (unmapped_base >= KERNELBASE)
+		return;
+	slb_allocate(unmapped_base);
+}
+
+void slb_initialize(void)
+{
+#ifdef CONFIG_PPC_ISERIES
+	asm volatile("isync; slbia; isync":::"memory");
+#else
+	unsigned long flags = SLB_VSID_KERNEL;
+
+	/* Invalidate the entire SLB (even slot 0) & all the ERATS */
+	if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE)
+		flags |= SLB_VSID_L;
+
+	asm volatile("isync":::"memory");
+	asm volatile("slbmte  %0,%0"::"r" (0) : "memory");
+	asm volatile("isync; slbia; isync":::"memory");
+	create_slbe(KERNELBASE, get_kernel_vsid(KERNELBASE),
+		    flags, 0);
+
+#endif
+	slb_add_bolted();
+	get_paca()->stab_rr = SLB_NUM_BOLTED;
+}
Index: working-2.6/arch/ppc64/mm/Makefile
===================================================================
--- working-2.6.orig/arch/ppc64/mm/Makefile
+++ working-2.6/arch/ppc64/mm/Makefile
@@ -4,6 +4,6 @@

 EXTRA_CFLAGS += -mno-minimal-toc

-obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o tlb.o
+obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o tlb.o slb_low.o slb.o
 obj-$(CONFIG_DISCONTIGMEM) += numa.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
Index: working-2.6/arch/ppc64/mm/fault.c
===================================================================
--- working-2.6.orig/arch/ppc64/mm/fault.c
+++ working-2.6/arch/ppc64/mm/fault.c
@@ -93,13 +93,15 @@
 	unsigned long is_write = error_code & 0x02000000;
 	unsigned long trap = TRAP(regs);

-	if (trap == 0x300 || trap == 0x380) {
+	BUG_ON((trap == 0x380) || (trap == 0x480));
+
+	if (trap == 0x300) {
 		if (debugger_fault_handler(regs))
 			return 0;
 	}

 	/* On a kernel SLB miss we can only check for a valid exception entry */
-	if (!user_mode(regs) && (trap == 0x380 || address >= TASK_SIZE))
+	if (!user_mode(regs) && (address >= TASK_SIZE))
 		return SIGSEGV;

 	if (error_code & 0x00400000) {
Index: working-2.6/arch/ppc64/mm/slb_low.S
===================================================================
--- /dev/null
+++ working-2.6/arch/ppc64/mm/slb_low.S
@@ -0,0 +1,168 @@
+/*
+ * arch/ppc64/mm/slb_low.S
+ *
+ * Low-level SLB routines
+ *
+ * Copyright (C) 2004 David Gibson <dwg at au.ibm.com>, IBM
+ *
+ * Based on earlier C version:
+ * Dave Engebretsen and Mike Corrigan {engebret|mikejc}@us.ibm.com
+ *    Copyright (c) 2001 Dave Engebretsen
+ * Copyright (C) 2002 Anton Blanchard <anton at au.ibm.com>, IBM
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/config.h>
+#include <asm/processor.h>
+#include <asm/page.h>
+#include <asm/mmu.h>
+#include <asm/ppc_asm.h>
+#include <asm/offsets.h>
+#include <asm/cputable.h>
+
+/* void slb_allocate(unsigned long ea);
+ *
+ * Create an SLB entry for the given EA (user or kernel).
+ * 	r3 = faulting address, r13 = PACA
+ *	r9, r10, r11 are clobbered by this function
+ * No other registers are examined or changed.
+ */
+_GLOBAL(slb_allocate)
+	/*
+	 * First find a slot, round robin. Previously we tried to find
+	 * a free slot first but that took too long. Unfortunately we
+	 * dont have any LRU information to help us choose a slot.
+	 */
+	srdi	r9,r1,27
+	ori	r9,r9,1			/* mangle SP for later compare */
+
+	ld	r10,PACASTABRR(r13)
+3:
+	addi	r10,r10,1
+	/* use a cpu feature mask if we ever change our slb size */
+	cmpldi	r10,SLB_NUM_ENTRIES
+
+	blt+	4f
+	li	r10,SLB_NUM_BOLTED
+4:
+	slbmfee	r11,r10
+	/* Don't throw out the segment for our kernel stack. Since we
+	 * dont invalidate the ERAT we could have a valid translation
+	 * for the kernel stack during the first part of exception
+	 * exit which gets invalidated due to a tlbie from another cpu
+	 * at a non recoverable point (after setting srr0/1) - Anton
+	 *
+	 * The >> 27 (rather than >> 28) is so that the LSB is the
+	 * valid bit - this way we check valid and ESID in one compare.
+	 */
+	srdi	r11,r11,27
+	cmpd	r11,r9
+	beq-	3b
+
+	std	r10,PACASTABRR(r13)
+
+	/* r3 = faulting address, r10 = entry */
+
+	srdi	r9,r3,60		/* get region */
+	srdi	r3,r3,28		/* get esid */
+	cmpldi	cr7,r9,0xc		/* cmp KERNELBASE for later use */
+
+	/* r9 = region, r3 = esid, cr7 = <>KERNELBASE */
+
+	rldicr.	r11,r3,32,16
+	bne-	8f			/* invalid ea bits set */
+	addi	r11,r9,-1
+	cmpldi	r11,0xb
+	blt-	8f			/* invalid region */
+
+	/* r9 = region, r3 = esid, r10 = entry, cr7 = <>KERNELBASE */
+
+	blt	cr7,0f			/* user or kernel? */
+
+	/* kernel address */
+	li	r11,SLB_VSID_KERNEL
+BEGIN_FTR_SECTION
+	bne	cr7,9f
+	li	r11,(SLB_VSID_KERNEL|SLB_VSID_L)
+END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE)
+	b	9f
+
+0:	/* user address */
+	li	r11,SLB_VSID_USER
+#ifdef CONFIG_HUGETLB_PAGE
+BEGIN_FTR_SECTION
+	/* check against the hugepage ranges */
+	cmpldi	r3,(TASK_HPAGE_END>>SID_SHIFT)
+	bge	6f			/* >= TASK_HPAGE_END */
+	cmpldi	r3,(TASK_HPAGE_BASE>>SID_SHIFT)
+	bge	5f			/* TASK_HPAGE_BASE..TASK_HPAGE_END */
+	cmpldi	r3,16
+	bge	6f			/* 4GB..TASK_HPAGE_BASE */
+
+	lhz	r9,PACAHTLBSEGS(r13)
+	srd	r9,r9,r3
+	andi.	r9,r9,1
+	beq	6f
+
+5:	/* this is a hugepage user address */
+	li	r11,(SLB_VSID_USER|SLB_VSID_L)
+END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE)
+#endif /* CONFIG_HUGETLB_PAGE */
+
+6:	ld	r9,PACACONTEXTID(r13)
+
+9:	/* r9 = "context", r3 = esid, r11 = flags, r10 = entry */
+
+	rldimi	r9,r3,15,0		/* r9= VSID ordinal */
+
+7:	rldimi	r10,r3,28,0		/* r10= ESID<<28 | entry */
+	oris	r10,r10,SLB_ESID_V@h	/* r10 |= SLB_ESID_V */
+
+	/* r9 = ordinal, r3 = esid, r11 = flags, r10 = esid_data */
+
+	li	r3,VSID_RANDOMIZER@higher
+	sldi	r3,r3,32
+	oris	r3,r3,VSID_RANDOMIZER@h
+	ori	r3,r3,VSID_RANDOMIZER@l
+
+	mulld	r9,r3,r9		/* r9 = ordinal * VSID_RANDOMIZER */
+	clrldi	r9,r9,28		/* r9 &= VSID_MASK */
+	sldi	r9,r9,SLB_VSID_SHIFT	/* r9 <<= SLB_VSID_SHIFT */
+	or	r9,r9,r11		/* r9 |= flags */
+
+	/* r9 = vsid_data, r10 = esid_data, cr7 = <>KERNELBASE */
+
+	/*
+	 * No need for an isync before or after this slbmte. The exception
+	 * we enter with and the rfid we exit with are context synchronizing.
+	 */
+	slbmte	r9,r10
+
+	bgelr	cr7			/* we're done for kernel addresses */
+
+	/* Update the slb cache */
+	lhz	r3,PACASLBCACHEPTR(r13)	/* offset = paca->slb_cache_ptr */
+	cmpldi	r3,SLB_CACHE_ENTRIES
+	bge	1f
+
+	/* still room in the slb cache */
+	sldi	r11,r3,1		/* r11 = offset * sizeof(u16) */
+	rldicl	r10,r10,36,28		/* get low 16 bits of the ESID */
+	add	r11,r11,r13		/* r11 = (u16 *)paca + offset */
+	sth	r10,PACASLBCACHE(r11)	/* paca->slb_cache[offset] = esid */
+	addi	r3,r3,1			/* offset++ */
+	b	2f
+1:					/* offset >= SLB_CACHE_ENTRIES */
+	li	r3,SLB_CACHE_ENTRIES+1
+2:
+	sth	r3,PACASLBCACHEPTR(r13)	/* paca->slb_cache_ptr = offset */
+	blr
+
+8:	/* invalid EA */
+	li	r9,0			/* 0 VSID ordinal -> BAD_VSID */
+	li	r11,SLB_VSID_USER	/* flags don't much matter */
+	b	7b


--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From david at gibson.dropbear.id.au  Wed Jul  7 16:06:48 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 7 Jul 2004 16:06:48 +1000
Subject: [2/4] RFC: SLB Rewrite (unify do_slb_bolted and slb_allocate)
Message-ID: <20040707060648.GC987@zax>


Unify do_slb_bolted with the general SLB miss path.  There is now one
SLB miss handler, written in assembler and entered with only the
low-level exception prolog (EXCEPTION_PROLOG_[PI]SERIES rather than
EXCEPTION_PROLOG_COMMON) and minimal extra save/restore logic.

Index: working-2.6/arch/ppc64/kernel/asm-offsets.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/asm-offsets.c
+++ working-2.6/arch/ppc64/kernel/asm-offsets.c
@@ -93,6 +93,7 @@
 	DEFINE(PACASLBCACHE, offsetof(struct paca_struct, slb_cache));
 	DEFINE(PACASLBCACHEPTR, offsetof(struct paca_struct, slb_cache_ptr));
 	DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
+	DEFINE(PACASLBR3, offsetof(struct paca_struct, slb_r3));
 #ifdef CONFIG_HUGETLB_PAGE
 	DEFINE(PACAHTLBSEGS, offsetof(struct paca_struct, context.htlb_segs));
 #endif /* CONFIG_HUGETLB_PAGE */
Index: working-2.6/arch/ppc64/kernel/head.S
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/head.S
+++ working-2.6/arch/ppc64/kernel/head.S
@@ -200,6 +200,7 @@
 #define EX_R13		32
 #define EX_SRR0		40
 #define EX_DAR		48
+#define EX_LR		48	/* SLB miss saves LR, but not DAR */
 #define EX_DSISR	56
 #define EX_CCR		60

@@ -433,18 +434,16 @@
 	.globl DataAccessSLB_Pseries
 DataAccessSLB_Pseries:
 	mtspr	SPRG1,r13
-	mtspr	SPRG2,r12
-	mfspr	r13,DAR
-	mfcr	r12
-	srdi	r13,r13,60
-	cmpdi	r13,0xc
-	beq	.do_slb_bolted_Pseries
-	mtcrf	0x80,r12
-	mfspr	r12,SPRG2
-	EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, DataAccessSLB_common)
+	EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, data_slb_Pseries)

 	STD_EXCEPTION_PSERIES(0x400, InstructionAccess)
-	STD_EXCEPTION_PSERIES(0x480, InstructionAccessSLB)
+
+	. = 0x480
+	.globl InstructionAccessSLB_Pseries
+InstructionAccessSLB_Pseries:
+	mtspr	SPRG1,r13
+	EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, instr_slb_Pseries)
+
 	STD_EXCEPTION_PSERIES(0x500, HardwareInterrupt)
 	STD_EXCEPTION_PSERIES(0x600, Alignment)
 	STD_EXCEPTION_PSERIES(0x700, ProgramCheck)
@@ -494,10 +493,6 @@
 	mfspr	r12,SPRG2
 	EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, .do_stab_bolted)

-_GLOBAL(do_slb_bolted_Pseries)
-	mtcrf	0x80,r12
-	mfspr	r12,SPRG2
-	EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, .do_slb_bolted)

 	/* Space for the naca.  Architected to be located at real address
 	 * NACA_PHYS_ADDR.  Various tools rely on this location being fixed.
@@ -586,27 +581,23 @@
 	.globl	DataAccessSLB_Iseries
 DataAccessSLB_Iseries:
 	mtspr	SPRG1,r13		/* save r13 */
-	mtspr	SPRG2,r12
-	mfspr	r13,DAR
-	mfcr	r12
-	srdi	r13,r13,60
-	cmpdi	r13,0xc
-	beq	.do_slb_bolted_Iseries
-	mtcrf	0x80,r12
-	mfspr	r12,SPRG2
-	EXCEPTION_PROLOG_ISERIES_1(PACA_EXGEN)
+	EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB)
 	EXCEPTION_PROLOG_ISERIES_2
-	b	DataAccessSLB_common
+	std	r3,PACASLBR3(r13)
+	mfspr	r3,DAR
+	b	.do_slb_miss

-.do_slb_bolted_Iseries:
-	mtcrf	0x80,r12
-	mfspr	r12,SPRG2
+	STD_EXCEPTION_ISERIES(0x400, InstructionAccess, PACA_EXGEN)
+
+	.globl	InstructionAccessSLB_Iseries
+InstructionAccessSLB_Iseries:
+	mtspr	SPRG1,r13		/* save r13 */
 	EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB)
 	EXCEPTION_PROLOG_ISERIES_2
-	b	.do_slb_bolted
+	std	r3,PACASLBR3(r13)
+	mr	r3,r11
+	b	.do_slb_miss

-	STD_EXCEPTION_ISERIES(0x400, InstructionAccess, PACA_EXGEN)
-	STD_EXCEPTION_ISERIES(0x480, InstructionAccessSLB, PACA_EXGEN)
 	MASKABLE_EXCEPTION_ISERIES(0x500, HardwareInterrupt)
 	STD_EXCEPTION_ISERIES(0x600, Alignment, PACA_EXGEN)
 	STD_EXCEPTION_ISERIES(0x700, ProgramCheck, PACA_EXGEN)
@@ -864,17 +855,6 @@
 	b	.do_hash_page	 	/* Try to handle as hpte fault */

 	.align	7
-	.globl DataAccessSLB_common
-DataAccessSLB_common:
-	mfspr	r10,DAR
-	std	r10,PACA_EXGEN+EX_DAR(r13)
-	EXCEPTION_PROLOG_COMMON(0x380, PACA_EXGEN)
-	ld	r3,PACA_EXGEN+EX_DAR(r13)
-	std	r3,_DAR(r1)
-	bl	.slb_allocate
-	b	fast_exception_return
-
-	.align	7
 	.globl InstructionAccess_common
 InstructionAccess_common:
 	EXCEPTION_PROLOG_COMMON(0x400, PACA_EXGEN)
@@ -884,14 +864,6 @@
 	b	.do_hash_page		/* Try to handle as hpte fault */

 	.align	7
-	.globl InstructionAccessSLB_common
-InstructionAccessSLB_common:
-	EXCEPTION_PROLOG_COMMON(0x480, PACA_EXGEN)
-	ld	r3,_NIP(r1)		/* SRR0 = NIA	*/
-	bl	.slb_allocate
-	b	fast_exception_return
-
-	.align	7
 	.globl HardwareInterrupt_common
 	.globl HardwareInterrupt_entry
 HardwareInterrupt_common:
@@ -1137,130 +1109,52 @@
 	ld	r13,PACA_EXSLB+EX_R13(r13)
 	rfid

+data_slb_Pseries:
+	std	r3,PACASLBR3(r13)
+	mfspr	r3,DAR
+	b	.do_slb_miss
+
+instr_slb_Pseries:
+	std	r3,PACASLBR3(r13)
+	mr	r3,r11		/* prolog stored SRR0 in r11 */
+	b	.do_slb_miss
+
 /*
  * r13 points to the PACA, r9 contains the saved CR,
  * r11 and r12 contain the saved SRR0 and SRR1.
+ * r3 has the faulting address
  * r9 - r13 are saved in paca->exslb.
+ * r3 is saved in paca->slb_r3
  * We assume we aren't going to take any exceptions during this procedure.
  */
-_GLOBAL(do_slb_bolted)
+_GLOBAL(do_slb_miss)
+	mflr	r10
+
 	stw	r9,PACA_EXSLB+EX_CCR(r13)	/* save CR in exc. frame */
 	std	r11,PACA_EXSLB+EX_SRR0(r13)	/* save SRR0 in exc. frame */
+	std	r10,PACA_EXSLB+EX_LR(r13)	/* save LR */

-	/*
-	 * We take the next entry, round robin. Previously we tried
-	 * to find a free slot first but that took too long. Unfortunately
-	 * we dont have any LRU information to help us choose a slot.
-	 */
-
-	/* r13 = paca */
-	/* use a cpu feature mask if we ever change our slb size */
-1:	ld	r10,PACASTABRR(r13)
-	addi	r10,r10,1
-	cmpdi	r10,SLB_NUM_ENTRIES
-	blt+	2f
-	li	r10,SLB_NUM_BOLTED		/* dont touch bolted slots */
-2:	std	r10,PACASTABRR(r13)
-
-	/* r13 = paca, r10 = entry */
-
-	/*
-	 * Never cast out the segment for our kernel stack. Since we
-	 * dont invalidate the ERAT we could have a valid translation
-	 * for the kernel stack during the first part of exception exit
-	 * which gets invalidated due to a tlbie from another cpu at a
-	 * non recoverable point (after setting srr0/1) - Anton
-	 */
-	slbmfee	r9,r10
-	srdi	r9,r9,27
-	/*
-	 * Use paca->ksave as the value of the kernel stack pointer,
-	 * because this is valid at all times.
-	 * The >> 27 (rather than >> 28) is so that the LSB is the
-	 * valid bit - this way we check valid and ESID in one compare.
-	 * In order to completely close the tiny race in the context
-	 * switch (between updating r1 and updating paca->ksave),
-	 * we check against both r1 and paca->ksave.
-	 */
-	srdi	r11,r1,27
-	ori	r11,r11,1
-	cmpd	r11,r9
-	beq-	1b
-	ld	r11,PACAKSAVE(r13)
-	srdi	r11,r11,27
- 	ori	r11,r11,1
- 	cmpd	r11,r9
- 	beq-	1b
-
-	/* r13 = paca, r10 = entry */
-
-	/* (((ea >> 28) & 0x1fff) << 15) | (ea >> 60) */
-	mfspr	r9,DAR
-	rldicl	r11,r9,36,51
-	sldi	r11,r11,15
-	srdi	r9,r9,60
-	or	r11,r11,r9
-
-	/* VSID_RANDOMIZER */
-	li	r9,9
-	sldi	r9,r9,32
-	oris	r9,r9,58231
-	ori	r9,r9,39831
-
-	/* vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK */
-	mulld	r11,r11,r9
-	clrldi	r11,r11,28
-
-	/* r13 = paca, r10 = entry, r11 = vsid */
-
-	/* Put together slb word1 */
-	sldi	r11,r11,12
-
-BEGIN_FTR_SECTION
-	/* set kp and c bits */
-	ori	r11,r11,0x480
-END_FTR_SECTION_IFCLR(CPU_FTR_16M_PAGE)
-BEGIN_FTR_SECTION
-	/* set kp, l and c bits */
-	ori	r11,r11,0x580
-END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE)
-
-	/* r13 = paca, r10 = entry, r11 = slb word1 */
-
-	/* Put together slb word0 */
-	mfspr	r9,DAR
-	clrrdi	r9,r9,28	/* get the new esid */
-	oris	r9,r9,0x800	/* set valid bit */
-	rldimi	r9,r10,0,52	/* insert entry */
-
-	/* r13 = paca, r9 = slb word0, r11 = slb word1 */
-
-	/*
-	 * No need for an isync before or after this slbmte. The exception
-	 * we enter with and the rfid we exit with are context synchronizing .
-	 */
-	slbmte	r11,r9
+	bl	.slb_allocate			/* handle it */

 	/* All done -- return from exception. */
+
+	ld	r10,PACA_EXSLB+EX_LR(r13)
+	ld	r3,PACASLBR3(r13)
 	lwz	r9,PACA_EXSLB+EX_CCR(r13)	/* get saved CR */
 	ld	r11,PACA_EXSLB+EX_SRR0(r13)	/* get saved SRR0 */

+	mtlr	r10
+
 	andi.	r10,r12,MSR_RI	/* check for unrecoverable exception */
 	beq-	unrecov_slb

-	/*
-	 * Until everyone updates binutils hardwire the POWER4 optimised
-	 * single field mtcrf
-	 */
-#if 0
-	.machine	push
-	.machine	"power4"
+.machine	push
+.machine	"power4"
 	mtcrf	0x80,r9
-	.machine	pop
-#else
-	.long 0x7d380120
-#endif
+	mtcrf	0x01,r9		/* slb_allocate uses cr0 and cr7 */
+.machine	pop

+	/* Clear RI */
 	mfmsr	r10
 	clrrdi	r10,r10,2
 	mtmsrd	r10,1
Index: working-2.6/include/asm-ppc64/paca.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/paca.h
+++ working-2.6/include/asm-ppc64/paca.h
@@ -78,6 +78,7 @@
 	u64 exmc[8];		/* used for machine checks */
 	u64 exslb[8];		/* used for SLB/segment table misses
 				 * on the linear mapping */
+	u64 slb_r3;		/* spot to save R3 on SLB miss */
 	mm_context_t context;
 	u16 slb_cache[SLB_CACHE_ENTRIES];
 	u16 slb_cache_ptr;

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From david at gibson.dropbear.id.au  Wed Jul  7 16:06:57 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 7 Jul 2004 16:06:57 +1000
Subject: [3/4] RFC: SLB Rewrite (optimize SLB entry/exit path)
Message-ID: <20040707060657.GD987@zax>


Streamlines the exception entry/exit path of the SLB miss handler to
shave a few cycles off.  The most significant change is that the RI
bit is left off throughout the whole handler, which avoids an extra
mtmsrd to turn it back off on the exit path.

Index: working-2.6/arch/ppc64/kernel/head.S
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/head.S
+++ working-2.6/arch/ppc64/kernel/head.S
@@ -434,7 +434,25 @@
 	.globl DataAccessSLB_Pseries
 DataAccessSLB_Pseries:
 	mtspr	SPRG1,r13
-	EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, data_slb_Pseries)
+	mfspr	r13,SPRG3		/* get paca address into r13 */
+	std	r9,PACA_EXSLB+EX_R9(r13)	/* save r9 - r12 */
+	std	r10,PACA_EXSLB+EX_R10(r13)
+	std	r11,PACA_EXSLB+EX_R11(r13)
+	std	r12,PACA_EXSLB+EX_R12(r13)
+	std	r3,PACASLBR3(r13)
+	mfspr	r9,SPRG1
+	std	r9,PACA_EXSLB+EX_R13(r13)
+	mfcr	r9
+	clrrdi	r12,r13,32		/* get high part of &label */
+	mfmsr	r10
+	mfspr	r11,SRR0		/* save SRR0 */
+	ori	r12,r12,(.do_slb_miss)@l
+	ori	r10,r10,MSR_IR|MSR_DR	/* DON'T set RI for SLB miss */
+	mtspr	SRR0,r12
+	mfspr	r12,SRR1		/* and SRR1 */
+	mtspr	SRR1,r10
+	mfspr	r3,DAR
+	rfid

 	STD_EXCEPTION_PSERIES(0x400, InstructionAccess)

@@ -442,7 +460,25 @@
 	.globl InstructionAccessSLB_Pseries
 InstructionAccessSLB_Pseries:
 	mtspr	SPRG1,r13
-	EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, instr_slb_Pseries)
+	mfspr	r13,SPRG3		/* get paca address into r13 */
+	std	r9,PACA_EXSLB+EX_R9(r13)	/* save r9 - r12 */
+	std	r10,PACA_EXSLB+EX_R10(r13)
+	std	r11,PACA_EXSLB+EX_R11(r13)
+	std	r12,PACA_EXSLB+EX_R12(r13)
+	std	r3,PACASLBR3(r13)
+	mfspr	r9,SPRG1
+	std	r9,PACA_EXSLB+EX_R13(r13)
+	mfcr	r9
+	clrrdi	r12,r13,32		/* get high part of &label */
+	mfmsr	r10
+	mfspr	r11,SRR0		/* save SRR0 */
+	ori	r12,r12,(.do_slb_miss)@l
+	ori	r10,r10,MSR_IR|MSR_DR	/* DON'T set RI for SLB miss */
+	mtspr	SRR0,r12
+	mfspr	r12,SRR1		/* and SRR1 */
+	mtspr	SRR1,r10
+	mr	r3,r11			/* SRR0 is faulting address */
+	rfid

 	STD_EXCEPTION_PSERIES(0x500, HardwareInterrupt)
 	STD_EXCEPTION_PSERIES(0x600, Alignment)
@@ -582,8 +618,9 @@
 DataAccessSLB_Iseries:
 	mtspr	SPRG1,r13		/* save r13 */
 	EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB)
-	EXCEPTION_PROLOG_ISERIES_2
 	std	r3,PACASLBR3(r13)
+	ld	r11,PACALPPACA+LPPACASRR0(r13)
+	ld	r12,PACALPPACA+LPPACASRR1(r13)
 	mfspr	r3,DAR
 	b	.do_slb_miss

@@ -593,8 +630,9 @@
 InstructionAccessSLB_Iseries:
 	mtspr	SPRG1,r13		/* save r13 */
 	EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB)
-	EXCEPTION_PROLOG_ISERIES_2
 	std	r3,PACASLBR3(r13)
+	ld	r11,PACALPPACA+LPPACASRR0(r13)
+	ld	r12,PACALPPACA+LPPACASRR1(r13)
 	mr	r3,r11
 	b	.do_slb_miss

@@ -1109,16 +1147,6 @@
 	ld	r13,PACA_EXSLB+EX_R13(r13)
 	rfid

-data_slb_Pseries:
-	std	r3,PACASLBR3(r13)
-	mfspr	r3,DAR
-	b	.do_slb_miss
-
-instr_slb_Pseries:
-	std	r3,PACASLBR3(r13)
-	mr	r3,r11		/* prolog stored SRR0 in r11 */
-	b	.do_slb_miss
-
 /*
  * r13 points to the PACA, r9 contains the saved CR,
  * r11 and r12 contain the saved SRR0 and SRR1.
@@ -1154,11 +1182,6 @@
 	mtcrf	0x01,r9		/* slb_allocate uses cr0 and cr7 */
 .machine	pop

-	/* Clear RI */
-	mfmsr	r10
-	clrrdi	r10,r10,2
-	mtmsrd	r10,1
-
 	mtspr	SRR0,r11
 	mtspr	SRR1,r12
 	ld	r9,PACA_EXSLB+EX_R9(r13)

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From david at gibson.dropbear.id.au  Wed Jul  7 16:07:02 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 7 Jul 2004 16:07:02 +1000
Subject: [4/4] RFC: SLB Rewrite (new VSID algorithm)
Message-ID: <20040707060702.GE987@zax>


Replace the VSID allocation algorithm.  The new algorithm first
generates a 36-bit "proto-VSID" (with 0xfffffffff reserved).  For
kernel addresses this is equal to the ESID; for user addresses it is:
	(context << 15) | esid

These are distinguishable from kernel proto-VSIDs because the top bit
is clear.  Proto-VSIDs with the top two bits equal to 10 are reserved
for now.  The proto-VSIDs are then scrambled into real VSIDs with the
(1 to 1) multiplicative hash:
	VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS
	where	VSID_MULTIPLIER = 268435399 = 0xFFFFFC7
		VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF

This scheme has a number of advantages over the old one:
	- We now have VSIDs for every kernel address (i.e. everything
above 0xC000000000000000), except the very top segment.  That
simplifies a number of things.
	- We allow for 15 significant bits of ESID for user addresses
with 20 bits of context.  i.e. 8T (43 bits) of address space for up to
1M contexts, significantly more than the old method (although we will
need changes in the hash path and context allocation to take advantage
of this).
	- Because we use a real multiplicative hash function, we have
much better hash scattering with this VSID algorithm (at least based
on some initial results).
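
As a rough illustration of the proto-VSID layout described above, the
following C sketch builds both kinds of proto-VSID and inspects the top
bit.  It is a model for reading along, not code from the patch: the
helper names user_protovsid(), kernel_protovsid() and
protovsid_top_bit() are made up for this example, and the caller is
assumed to pass only valid user addresses (below 2^43) or kernel
addresses (at or above 0xC000000000000000).

```c
#include <assert.h>
#include <stdint.h>

#define SID_SHIFT	28	/* 256MB segments: ESID = ea >> 28 */
#define USER_ESID_BITS	15	/* significant ESID bits for user addresses */
#define VSID_BITS	36	/* width of a proto-VSID */

/* Hypothetical helper: user proto-VSID = (context << 15) | esid */
static uint64_t user_protovsid(uint64_t context, uint64_t ea)
{
	return (context << USER_ESID_BITS) | (ea >> SID_SHIFT);
}

/* Hypothetical helper: kernel proto-VSID = ESID */
static uint64_t kernel_protovsid(uint64_t ea)
{
	return ea >> SID_SHIFT;
}

/* Top bit (bit 35) of a 36-bit proto-VSID: 0 for user, 1 for kernel */
static int protovsid_top_bit(uint64_t protovsid)
{
	return (int)((protovsid >> (VSID_BITS - 1)) & 1);
}
```

With a 20-bit context and a 15-bit ESID the largest user proto-VSID is
2^35-1, so the top bit stays clear, while any address at or above
0xC000000000000000 yields a proto-VSID whose top two bits are 11.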

Because VSID_MODULUS is of the form 2^n-1, we can use a trick to
compute the remainder efficiently without a divide or an extra
multiply.  This makes the new algorithm barely slower than the old one.
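
The trick can be sketched in standalone C as follows.  This mirrors the
arithmetic of vsid_scramble() in the patch below, but it is only a
userspace model: the names vsid_scramble_slow() and
vsid_scramble_fast() are invented here, and uint64_t stands in for the
kernel's 64-bit unsigned long.  It relies on 2^36 being congruent to 1
modulo 2^36-1, so the high and low 36-bit halves of the product can
simply be added.

```c
#include <assert.h>
#include <stdint.h>

#define VSID_MULTIPLIER	268435399ULL	/* 0xFFFFFC7 */
#define VSID_BITS	36
#define VSID_MODULUS	((1ULL << VSID_BITS) - 1)

/* Reference version: a plain multiply followed by a modulo. */
static uint64_t vsid_scramble_slow(uint64_t protovsid)
{
	return (protovsid * VSID_MULTIPLIER) % VSID_MODULUS;
}

/* Divide-free version, valid for protovsid < 2^36. */
static uint64_t vsid_scramble_fast(uint64_t protovsid)
{
	/* The product fits in 64 bits: 2^36 * 2^28 = 2^64. */
	uint64_t x = protovsid * VSID_MULTIPLIER;

	/* Fold: high part + low part is congruent to x mod 2^36-1,
	 * and is at most (2^36-1) + (2^28-1). */
	x = (x >> VSID_BITS) + (x & VSID_MODULUS);

	/* If x >= 2^36-1 then x+1 has the 2^36 bit set, and the
	 * answer is the low 36 bits of x+1; otherwise it is x. */
	return (x + ((x + 1) >> VSID_BITS)) & VSID_MODULUS;
}
```

For example, the kernel proto-VSID 0xC00000000 (KERNELBASE >> 28)
scrambles to 0x40BFFFFD5, the value hardwired into head.S below, and
the reserved proto-VSID 0xfffffffff scrambles to 0.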

Index: working-2.6/include/asm-ppc64/mmu_context.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu_context.h
+++ working-2.6/include/asm-ppc64/mmu_context.h
@@ -34,7 +34,7 @@
 }

 #define NO_CONTEXT		0
-#define FIRST_USER_CONTEXT	0x10    /* First 16 reserved for kernel */
+#define FIRST_USER_CONTEXT	1
 #define LAST_USER_CONTEXT	0x8000  /* Same as PID_MAX for now... */
 #define NUM_USER_CONTEXT	(LAST_USER_CONTEXT-FIRST_USER_CONTEXT)

@@ -181,46 +181,43 @@
 	local_irq_restore(flags);
 }

-/* This is only valid for kernel (including vmalloc, imalloc and bolted) EA's
- */
-static inline unsigned long
-get_kernel_vsid( unsigned long ea )
-{
-	unsigned long ordinal, vsid;
-
-	ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | (ea >> 60);
-	vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK;
-
-#ifdef HTABSTRESS
-	/* For debug, this path creates a very poor vsid distribuition.
-	 * A user program can access virtual addresses in the form
-	 * 0x0yyyyxxxx000 where yyyy = xxxx to cause multiple mappings
-	 * to hash to the same page table group.
-	 */
-	ordinal = ((ea >> 28) & 0x1fff) | (ea >> 44);
-	vsid = ordinal & VSID_MASK;
-#endif /* HTABSTRESS */
+/*
+ * WARNING - If you change these you must make sure the asm
+ * implementations in slb_allocate(), do_stab_bolted and mmu.h
+ * (ASM_VSID_SCRAMBLE macro) are changed accordingly.
+ *
+ * You'll also need to change the precomputed VSID values in head.S
+ * which are used by the iSeries firmware.
+ */
+
+static inline unsigned long vsid_scramble(unsigned long protovsid)
+{
+#if 0
+	/* The code below is equivalent to this function for arguments
+	 * < 2^VSID_BITS, which is all this should ever be called
+	 * with.  However gcc is not clever enough to compute the
+	 * modulus (2^n-1) without a second multiply. */
+	return ((protovsid * VSID_MULTIPLIER) % VSID_MODULUS);
+#else /* 1 */
+	unsigned long x;
+
+	x = protovsid * VSID_MULTIPLIER;
+	x = (x >> VSID_BITS) + (x & VSID_MODULUS);
+	return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS;
+#endif /* 1 */
+}

-	return vsid;
+/* This is only valid for addresses >= KERNELBASE */
+static inline unsigned long get_kernel_vsid(unsigned long ea)
+{
+	return vsid_scramble(ea >> SID_SHIFT);
 }

-/* This is only valid for user EA's (user EA's do not exceed 2^41 (EADDR_SIZE))
- */
-static inline unsigned long
-get_vsid( unsigned long context, unsigned long ea )
+/* This is only valid for user addresses (which are below 2^41) */
+static inline unsigned long get_vsid(unsigned long context, unsigned long ea)
 {
-	unsigned long ordinal, vsid;
-
-	ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | context;
-	vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK;
-
-#ifdef HTABSTRESS
-	/* See comment above. */
-	ordinal = ((ea >> 28) & 0x1fff) | (context << 16);
-	vsid = ordinal & VSID_MASK;
-#endif /* HTABSTRESS */
-
-	return vsid;
+	return vsid_scramble((context << USER_ESID_BITS)
+			     | (ea >> SID_SHIFT));
 }

 #endif /* __PPC64_MMU_CONTEXT_H */
Index: working-2.6/include/asm-ppc64/mmu.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu.h
+++ working-2.6/include/asm-ppc64/mmu.h
@@ -15,6 +15,7 @@

 #include <linux/config.h>
 #include <asm/page.h>
+#include <linux/stringify.h>

 #ifndef __ASSEMBLY__

@@ -241,12 +242,44 @@
 #define SLB_VSID_KERNEL		(SLB_VSID_KP|SLB_VSID_C)
 #define SLB_VSID_USER		(SLB_VSID_KP|SLB_VSID_KS)

-#define VSID_RANDOMIZER ASM_CONST(42470972311)
-#define VSID_MASK	0xfffffffffUL
-/* Because we never access addresses below KERNELBASE as kernel
- * addresses, this VSID is never used for anything real, and will
- * never have pages hashed into it */
-#define BAD_VSID	ASM_CONST(0)
+#define VSID_MULTIPLIER	ASM_CONST(268435399)	/* largest 28-bit prime */
+#define VSID_BITS	36
+#define VSID_MODULUS	((1UL<<VSID_BITS)-1)
+
+#define CONTEXT_BITS	20
+#define USER_ESID_BITS	15
+
+/*
+ * This macro generates asm code to compute the VSID scramble
+ * function.  Used in slb_allocate() and do_stab_bolted.  The function
+ * computed is: (protovsid*VSID_MULTIPLIER) % VSID_MODULUS
+ *
+ *	rt = register containing the proto-VSID and into which the
+ *		VSID will be stored
+ *	rx = scratch register (clobbered)
+ *
+ * 	- rt and rx must be different registers
+ * 	- The answer will end up in the low 36 bits of rt.  The higher
+ * 	  bits may contain other garbage, so you may need to mask the
+ * 	  result.
+ */
+#define ASM_VSID_SCRAMBLE(rt, rx)	\
+	lis	rx,VSID_MULTIPLIER@h;					\
+	ori	rx,rx,VSID_MULTIPLIER@l;				\
+	mulld	rt,rt,rx;		/* rt = rt * MULTIPLIER */	\
+									\
+	srdi	rx,rt,VSID_BITS;					\
+	clrldi	rt,rt,(64-VSID_BITS);					\
+	add	rt,rt,rx;		/* add high and low bits */	\
+	/* Now, rt == VSID (mod 2^36-1), and lies between 0 and		\
+	 * 2^36-1+2^28-1.  That in particular means that if rt >=	\
+	 * 2^36-1, then rt+1 has the 2^36 bit set.  So, if rt+1 has	\
+	 * the bit clear, rt already has the answer we want, if it	\
+	 * doesn't, the answer is the low 36 bits of rt+1.  So in all	\
+	 * cases the answer is the low 36 bits of (rt + ((rt+1) >> 36))*/\
+	addi	rx,rt,1;						\
+	srdi	rx,rx,VSID_BITS;	/* extract 2^36 bit */		\
+	add	rt,rt,rx

 /* Block size masks */
 #define BL_128K	0x000
Index: working-2.6/arch/ppc64/mm/slb_low.S
===================================================================
--- working-2.6.orig/arch/ppc64/mm/slb_low.S
+++ working-2.6/arch/ppc64/mm/slb_low.S
@@ -71,19 +71,19 @@
 	srdi	r3,r3,28		/* get esid */
 	cmpldi	cr7,r9,0xc		/* cmp KERNELBASE for later use */

-	/* r9 = region, r3 = esid, cr7 = <>KERNELBASE */
-
-	rldicr.	r11,r3,32,16
-	bne-	8f			/* invalid ea bits set */
-	addi	r11,r9,-1
-	cmpldi	r11,0xb
-	blt-	8f			/* invalid region */
+	rldimi	r10,r3,28,0		/* r10= ESID<<28 | entry */
+	oris	r10,r10,SLB_ESID_V@h	/* r10 |= SLB_ESID_V */

-	/* r9 = region, r3 = esid, r10 = entry, cr7 = <>KERNELBASE */
+	/* r3 = esid, r10 = esid_data, cr7 = <>KERNELBASE */

 	blt	cr7,0f			/* user or kernel? */

-	/* kernel address */
+	/* kernel address: proto-VSID = ESID */
+	/* WARNING - MAGIC: we don't use the VSID 0xfffffffff, but
+	 * this code will generate the protoVSID 0xfffffffff for the
+	 * top segment.  That's ok, the scramble below will translate
+	 * it to VSID 0, which is reserved as a bad VSID - one which
+	 * will never have any pages in it.  */
 	li	r11,SLB_VSID_KERNEL
 BEGIN_FTR_SECTION
 	bne	cr7,9f
@@ -91,8 +91,12 @@
 END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE)
 	b	9f

-0:	/* user address */
+0:	/* user address: proto-VSID = context<<15 | ESID */
 	li	r11,SLB_VSID_USER
+
+	srdi.	r9,r3,13
+	bne-	8f			/* invalid ea bits set */
+
 #ifdef CONFIG_HUGETLB_PAGE
 BEGIN_FTR_SECTION
 	/* check against the hugepage ranges */
@@ -114,33 +118,18 @@
 #endif /* CONFIG_HUGETLB_PAGE */

 6:	ld	r9,PACACONTEXTID(r13)
+	rldimi	r3,r9,USER_ESID_BITS,0

-9:	/* r9 = "context", r3 = esid, r11 = flags, r10 = entry */
-
-	rldimi	r9,r3,15,0		/* r9= VSID ordinal */
-
-7:	rldimi	r10,r3,28,0		/* r10= ESID<<28 | entry */
-	oris	r10,r10,SLB_ESID_V@h	/* r10 |= SLB_ESID_V */
-
-	/* r9 = ordinal, r3 = esid, r11 = flags, r10 = esid_data */
-
-	li	r3,VSID_RANDOMIZER@higher
-	sldi	r3,r3,32
-	oris	r3,r3,VSID_RANDOMIZER@h
-	ori	r3,r3,VSID_RANDOMIZER@l
-
-	mulld	r9,r3,r9		/* r9 = ordinal * VSID_RANDOMIZER */
-	clrldi	r9,r9,28		/* r9 &= VSID_MASK */
-	sldi	r9,r9,SLB_VSID_SHIFT	/* r9 <<= SLB_VSID_SHIFT */
-	or	r9,r9,r11		/* r9 |= flags */
+9:	/* r3 = protovsid, r11 = flags, r10 = esid_data, cr7 = <>KERNELBASE */
+	ASM_VSID_SCRAMBLE(r3,r9)

-	/* r9 = vsid_data, r10 = esid_data, cr7 = <>KERNELBASE */
+	rldimi	r11,r3,SLB_VSID_SHIFT,16	/* combine VSID and flags */

 	/*
 	 * No need for an isync before or after this slbmte. The exception
 	 * we enter with and the rfid we exit with are context synchronizing.
 	 */
-	slbmte	r9,r10
+	slbmte	r11,r10

 	bgelr	cr7			/* we're done for kernel addresses */

@@ -163,6 +152,6 @@
 	blr

 8:	/* invalid EA */
-	li	r9,0			/* 0 VSID ordinal -> BAD_VSID */
+	li	r3,0			/* BAD_VSID */
 	li	r11,SLB_VSID_USER	/* flags don't much matter */
-	b	7b
+	b	9b
Index: working-2.6/arch/ppc64/kernel/head.S
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/head.S
+++ working-2.6/arch/ppc64/kernel/head.S
@@ -576,11 +576,11 @@
 	.llong	0		/* Reserved */
 	.llong	0		/* Reserved */
 	.llong	0		/* Reserved */
-	.llong	0x0c00000000	/* ESID to map (Kernel at EA = 0xC000000000000000) */
-	.llong	0x06a99b4b14	/* VSID to map (Kernel at VA = 0x6a99b4b140000000) */
+	.llong	(KERNELBASE>>28)/* ESID to map */
+	.llong	0x40BFFFFD5	/* VSID to map */
 	.llong	8192		/* # pages to map (32 MB) */
 	.llong	0		/* Offset from start of loadarea to start of map */
-	.llong	0x0006a99b4b140000	/* VPN of first page to map */
+	.llong	0x40BFFFFD50000	/* VPN of first page to map */

 	. = 0x6100

@@ -1072,18 +1072,9 @@
 	rldimi	r10,r11,7,52	/* r10 = first ste of the group */

 	/* Calculate VSID */
-	/* (((ea >> 28) & 0x1fff) << 15) | (ea >> 60) */
-	rldic	r11,r11,15,36
-	ori	r11,r11,0xc
-
-	/* VSID_RANDOMIZER */
-	li	r9,9
-	sldi	r9,r9,32
-	oris	r9,r9,58231
-	ori	r9,r9,39831
-
-	mulld	r9,r11,r9
-	rldic	r9,r9,12,16	/* r9 = vsid << 12 */
+	/* This is a kernel address, so protovsid = ESID */
+	ASM_VSID_SCRAMBLE(r11, r9)
+	rldic	r9,r11,12,16	/* r9 = vsid << 12 */

 	/* Search the primary group for a free entry */
 1:	ld	r11,0(r10)	/* Test valid bit of the current ste	*/

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson


From moilanen at austin.ibm.com  Thu Jul  8 01:06:43 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Wed, 7 Jul 2004 10:06:43 -0500
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040707060635.GB987@zax>
References: <20040707060635.GB987@zax>
Message-ID: <20040707100643.0b889896.moilanen@austin.ibm.com>


This is pretty nice.  Good work.  I have just one nit below:

> +_GLOBAL(slb_allocate)
> +	/*
> +	 * First find a slot, round robin. Previously we tried to find
> +	 * a free slot first but that took too long. Unfortunately we
> +	 * dont have any LRU information to help us choose a slot.
> +	 */
> +	srdi	r9,r1,27
> +	ori	r9,r9,1			/* mangle SP for later compare */
> +
> +	ld	r10,PACASTABRR(r13)
> +3:
> +	addi	r10,r10,1
> +	/* use a cpu feature mask if we ever change our slb size */
> +	cmpldi	r10,SLB_NUM_ENTRIES
> +
> +	blt+	4f

This branch probably shouldn't carry a static prediction hint.  The
general rule is to hint only for an error case or a missed lock.  Since
about POWER4, hardware branch prediction is a little over 99% correct,
whereas with the hint you'll get a miss 1 out of 62 times, or 1.6% of
the time.  The difference is probably not measurable, but it might save
a few cycles.
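
To make the "1 out of 62" figure concrete, here is a small C model of
the round-robin index update quoted above.  It is only a sketch: it
assumes SLB_NUM_ENTRIES is 64 and SLB_NUM_BOLTED is 2 (values that
would give the 62-slot cycle, but which are not shown in this thread),
it ignores the retry taken when the chosen slot holds the kernel stack
segment, and the function names are invented for the example.

```c
#include <assert.h>

#define SLB_NUM_ENTRIES	64	/* assumed value */
#define SLB_NUM_BOLTED	2	/* assumed value */

/* Advance the round-robin index the way the asm does; *wrapped is
 * set to 1 when the index was reset to the first non-bolted slot,
 * i.e. when the "blt+ 4f" branch falls through. */
static unsigned long next_slot(unsigned long rr, int *wrapped)
{
	rr++;
	*wrapped = (rr >= SLB_NUM_ENTRIES);
	if (*wrapped)
		rr = SLB_NUM_BOLTED;
	return rr;
}

/* Count how often the wrap path is taken over a run of allocations. */
static unsigned long count_wraps(unsigned long allocations)
{
	unsigned long rr = SLB_NUM_BOLTED, wraps = 0, i;
	int wrapped;

	for (i = 0; i < allocations; i++) {
		rr = next_slot(rr, &wrapped);
		wraps += wrapped;
	}
	return wraps;
}
```

Under those assumptions the index cycles through 62 slots, so the
branch hinted as taken falls through once per 62 allocations.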

> +	 * The >> 27 (rather than >> 28) is so that the LSB is the
> +	 * valid bit - this way we check valid and ESID in one compare.
> +	 */
> +	srdi	r11,r11,27
> +	cmpd	r11,r9
> +	beq-	3b

Same as above.
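
For readers following the quoted comment about the >> 27 shift, the
single compare can be modelled in C roughly like this.  The function
name is made up for the example, and SLB_ESID_V is taken to be bit 27
(0x08000000), which is what the shift in the quoted code implies.

```c
#include <assert.h>
#include <stdint.h>

#define SLB_ESID_V	(1ULL << 27)	/* valid bit, assumed to be bit 27 */

/* Returns nonzero if the SLB entry's ESID word (as read by slbmfee)
 * is valid AND maps the 256MB segment containing sp. */
static int slot_holds_stack_segment(uint64_t slb_esid_word, uint64_t sp)
{
	/* "Mangle" SP: keep the ESID bits and force the LSB to 1, so
	 * the key looks like a valid entry for the stack segment. */
	uint64_t key = (sp >> 27) | 1;

	/* After >> 27 the LSB of the entry is its valid bit and the
	 * bits above it are the ESID, so one compare checks both. */
	return (slb_esid_word >> 27) == key;
}
```

An invalid entry has its LSB clear after the shift, so it can never
match the mangled stack pointer, whatever its ESID bits are.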

Thanks,
Jake


From benh at kernel.crashing.org  Thu Jul  8 01:33:16 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 07 Jul 2004 10:33:16 -0500
Subject: [0/4] RFC: SLB rewrite
In-Reply-To: <20040707060626.GA987@zax>
References: <20040707060626.GA987@zax>
Message-ID: <1089214395.2026.23.camel@gaston>


On Wed, 2004-07-07 at 01:06, David Gibson wrote:
> Here, at long last, the much awaited, SLB rewrite!

Hehe ! Kudos ! Looks great !

What about moving the rest of stab.c to arch/ppc64/mm/ while
we are at it ? :)

Ben.



From olof at austin.ibm.com  Thu Jul  8 02:31:12 2004
From: olof at austin.ibm.com (Olof Johansson)
Date: Wed, 07 Jul 2004 11:31:12 -0500
Subject: [0/4] RFC: SLB rewrite
In-Reply-To: <20040707060626.GA987@zax>
References: <20040707060626.GA987@zax>
Message-ID: <40EC2550.6020001@austin.ibm.com>


David Gibson wrote:

> If there are no serious problems, I'm thinking of pushing at least the
> first three of these to Linus/akpm in the next week or so.

Good work!  The code looks clean and solid to me.

Maybe the description from the top of the VSID patch should be included
as a top-of-file comment in slb_low.S or slb.c so it's around for new
readers in the future?

Also, where does the nonlinear behaviour in the reload times at high
segment counts, in the staggered measurements for the new VSID
generation, come from? HTAB hash bucket collisions/clustering?


-Olof


From david at gibson.dropbear.id.au  Thu Jul  8 11:02:51 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 8 Jul 2004 11:02:51 +1000
Subject: [0/4] RFC: SLB rewrite
In-Reply-To: <1089214395.2026.23.camel@gaston>
References: <20040707060626.GA987@zax> <1089214395.2026.23.camel@gaston>
Message-ID: <20040708010251.GA4247@zax>


On Wed, Jul 07, 2004 at 10:33:16AM -0500, Benjamin Herrenschmidt wrote:
>
> On Wed, 2004-07-07 at 01:06, David Gibson wrote:
> > Here, at long last, the much awaited, SLB rewrite!
>
> Hehe ! Kudos ! Looks great !
>
> What about moving the rest of stab.c to arch/ppc64/mm/ while
> we are at it ? :)

Well, I have some plans to do some cleanups to the stabs code, too -
mostly as it impinges even on SLB machines.  I was going to move it to
arch/ppc64/mm when I did that.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From david at gibson.dropbear.id.au  Thu Jul  8 11:41:04 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 8 Jul 2004 11:41:04 +1000
Subject: [0/4] RFC: SLB rewrite
In-Reply-To: <40EC2550.6020001@austin.ibm.com>
References: <20040707060626.GA987@zax> <40EC2550.6020001@austin.ibm.com>
Message-ID: <20040708014104.GA4309@zax>


On Wed, Jul 07, 2004 at 11:31:12AM -0500, Olof Johansson wrote:
>
> David Gibson wrote:
>
> >If there are no serious problems, I'm thinking of pushing at least the
> >first three of these to Linus/akpm in the next week or so.
>
> Good work!  The code looks clean and solid to me.
>
> Maybe the description from the top of the VSID patch should be included
> as a top-of-file comment in slb_low.S or slb.c so it's around for new
> readers in the future?

Not a bad idea.

> Also, where does the nonlinear behaviour at the high segment count
> reload times for the staggered measurements for the new VSID generation
> come from? HTAB hash bucket collisions/clustering?

That I'm not sure.  I don't think it could be from hash
collisions/clustering.  I tweaked my hash bucket simulator
(hashscatter.py, it's in the directory with the images and other
stuff) to simulate the slbmisser program running along with a bunch of
sample programs (essentially 1024 copies of bash).  It shows much
better hash scattering with the new algorithm - maximum bucket
occupancy 6, versus 127 with the old (I'm only computing the primary
hash bucket).

If I take the other processes out, and just simulate slbmisser itself
(plus the linear mapping), it still doesn't show anything untoward -
maximum occupancy 2 with the new algorithm, 16 with the old.
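
A rough illustration of the kind of measurement described above, for
readers without hashscatter.py to hand. This is not hashscatter.py and
it does not reproduce either VSID algorithm: the table size, the two
scramble functions and the simplified primary hash (low VSID bits XOR
page index) are all assumptions made up for the sketch. It only shows
how maximum primary-bucket occupancy can be counted for a given
scramble.

```python
from collections import Counter

HTAB_BUCKETS = 1 << 11  # hypothetical hash table size (a power of two)


def max_bucket_occupancy(protos, pages_per_segment, scramble):
    """Return the largest number of pages that land in one primary bucket."""
    buckets = Counter()
    for proto in protos:
        vsid = scramble(proto)
        for page in range(pages_per_segment):
            # Simplified primary hash: low VSID bits XOR page index.
            buckets[(vsid ^ page) & (HTAB_BUCKETS - 1)] += 1
    return max(buckets.values())


def clustered(proto):
    # Stand-in for a poor scramble: every context shares the same low bits.
    return proto << 16


def scattered(proto):
    # Stand-in for a better scramble: multiply by an odd constant.
    return (proto * 0x9E3779B1) & 0xFFFFFFFF


protos = range(1, 65)  # 64 hypothetical contexts, 16 pages each
print("clustered:", max_bucket_occupancy(protos, 16, clustered))
print("scattered:", max_bucket_occupancy(protos, 16, scattered))
```

With the clustered stand-in every context piles into the same 16
buckets; the multiplicative stand-in spreads them out. The real
numbers quoted above come from the real algorithms and workloads, not
from this toy.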

So I don't know what's causing the extra cost there.  Possibly just
that the actual VSID scramble itself takes different numbers of cycles
depending on the input?

I do worry a bit about that kink in the graph - what if it continues
to get worse with even larger set sizes, or with the kernel protoVSIDs
(which are numerically larger than user ones)?

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From david at gibson.dropbear.id.au  Thu Jul  8 11:46:54 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 8 Jul 2004 11:46:54 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040707100643.0b889896.moilanen@austin.ibm.com>
References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com>
Message-ID: <20040708014654.GB4309@zax>


On Wed, Jul 07, 2004 at 10:06:43AM -0500, Jake Moilanen wrote:
>
> This is pretty nice.  Good work.  I have just one nit below:
>
> > +_GLOBAL(slb_allocate)
> > +	/*
> > +	 * First find a slot, round robin. Previously we tried to find
> > +	 * a free slot first but that took too long. Unfortunately we
> > +	 * dont have any LRU information to help us choose a slot.
> > +	 */
> > +	srdi	r9,r1,27
> > +	ori	r9,r9,1			/* mangle SP for later compare */
> > +
> > +	ld	r10,PACASTABRR(r13)
> > +3:
> > +	addi	r10,r10,1
> > +	/* use a cpu feature mask if we ever change our slb size */
> > +	cmpldi	r10,SLB_NUM_ENTRIES
> > +
> > +	blt+	4f
>
> This branch probably shouldn't be predicted.  The general rule on branch
> prediction is for an error case, or a missed lock.  Since about power 4,
> the branch prediction is a little over 99% correct, you'll get a miss 1
> out of 62 times or 1.6% of the time.  It's probably not measurable, just
> might save a few cycles.

Hmm, well Anton actually suggested that branch hint, and the one
below.  Because the branch prediction is based on whether it was taken
last time round, and this is a strict round-robin, he thought the
dynamic prediction would get it wrong twice round the cycle, whereas
the static would only get it wrong once.

Or something.

He did do some timings, the difference was certainly tiny, although it
might have been (just) measurable.
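
The argument can be checked with a toy model. The sketch below assumes
the simplest possible dynamic predictor (predict whatever the branch
did last time) and compares it with an always-taken static hint on a
strict round-robin pattern. It is not a model of the real POWER4
predictors, and the cycle length of 62 just follows the "1 out of 62"
figure quoted above.

```python
def mispredictions(outcomes, predictor):
    """Count how often `predictor` guesses wrong over a branch trace."""
    misses = 0
    last = True  # assume the branch was taken before the trace starts
    for taken in outcomes:
        if predictor(last) != taken:
            misses += 1
        last = taken
    return misses


def static_taken(_last):
    # Models the blt+ hint: always predict taken.
    return True


def last_outcome(last):
    # Simplistic 1-bit dynamic predictor: repeat the previous outcome.
    return last


ENTRIES = 62   # slots stepped through per trip round the pointer
CYCLES = 100

# The branch is taken on every step except the one where the pointer wraps.
trace = ([True] * (ENTRIES - 1) + [False]) * CYCLES

print("static :", mispredictions(trace, static_taken) / CYCLES, "per cycle")
print("dynamic:", mispredictions(trace, last_outcome) / CYCLES, "per cycle")
```

The static hint is wrong only at the wrap. The last-outcome predictor
is wrong at the wrap and again on the first step after it, which is
the "twice round the cycle" above.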

> > +	 * The >> 27 (rather than >> 28) is so that the LSB is the
> > +	 * valid bit - this way we check valid and ESID in one compare.
> > +	 */
> > +	srdi	r11,r11,27
> > +	cmpd	r11,r9
> > +	beq-	3b
>
> Same as above.
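
To make the >> 27 trick in the quoted comment concrete, here is a
small model of the compare. It assumes 256MB segments (ESID = address
>> 28) and that the valid bit sits at bit 27 of the word read back
from the SLB entry, directly below the ESID; slbmfee_word() is a
hypothetical helper for the sketch, not a kernel function.

```python
ESID_SHIFT = 28          # 256MB segments: ESID is the address >> 28
SLB_ESID_V = 1 << 27     # valid bit, assumed to sit just below the ESID


def slbmfee_word(esid, valid):
    """Hypothetical model of the ESID word read back from an SLB entry."""
    return (esid << ESID_SHIFT) | (SLB_ESID_V if valid else 0)


def holds_stack(entry_word, sp):
    """True if the entry is valid AND maps the segment containing sp."""
    mangled_sp = (sp >> 27) | 1               # srdi r9,r1,27 ; ori r9,r9,1
    return (entry_word >> 27) == mangled_sp   # srdi r11,r11,27 ; cmpd r11,r9


sp = 0xC000000012345678
stack_esid = sp >> ESID_SHIFT

print(holds_stack(slbmfee_word(stack_esid, True), sp))      # valid, same ESID
print(holds_stack(slbmfee_word(stack_esid, False), sp))     # same ESID, invalid
print(holds_stack(slbmfee_word(stack_esid + 1, True), sp))  # valid, other ESID
```

Shifting by 27 keeps the valid bit as the LSB, so forcing the LSB of
the mangled stack pointer to 1 means an invalid entry can never
compare equal, and one cmpd covers both conditions.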

Of course, I do also have a plan to get rid of the slbmfee and that
loop by locking the kernel stack into a fixed SLB slot.  I have a
patch to do it even, but I need to figure out how to get it to work on
iSeries.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From brking at us.ibm.com  Fri Jul  9 02:36:28 2004
From: brking at us.ibm.com (Brian King)
Date: Thu, 08 Jul 2004 11:36:28 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040706161254.C21634@forte.austin.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com>
Message-ID: <40ED780C.6000401@us.ibm.com>


I've been doing some talking with various hardware folks to see if there
is a way to get this fixed and so far the answer I have been getting
is no. It is how the hardware works and there isn't much that can be
done about it in microcode.

To work around the problem we could add userspace pci config read and
write function pointers to the pci_dev struct which a device driver
could overload. These would be invoked for user initiated pci config
accesses. The device driver could then do the right thing for that
device if necessary.
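
As a sketch of the shape of this proposal (in Python rather than
kernel C, purely to show the control flow): the device carries a hook
for user-initiated config reads, and a driver may replace it. All
names here are hypothetical, and "answer from a snapshot while BIST
runs" is only one guess at what "the right thing" could be for a given
device.

```python
class PciDev:
    """Toy stand-in for struct pci_dev; only what the sketch needs."""

    def __init__(self, config):
        self.config = bytearray(config)
        # Hypothetical hook: a driver may replace it to intercept user reads.
        self.user_config_read = self.hw_config_read

    def hw_config_read(self, offset, size):
        """Model of a real config cycle on the bus."""
        return int.from_bytes(self.config[offset:offset + size], "little")


def user_read_config(dev, offset, size):
    """Entry point for user-initiated reads; always goes through the hook."""
    return dev.user_config_read(offset, size)


class BistAwareDriver:
    """Hypothetical driver that answers from a snapshot while BIST runs."""

    def __init__(self, dev):
        self.dev = dev
        self.snapshot = None
        dev.user_config_read = self.user_config_read

    def start_bist(self):
        self.snapshot = bytes(self.dev.config)  # copy taken before BIST starts

    def finish_bist(self):
        self.snapshot = None

    def user_config_read(self, offset, size):
        if self.snapshot is not None:
            # Device is busy: stay off the bus and answer from the copy.
            return int.from_bytes(self.snapshot[offset:offset + size], "little")
        return self.dev.hw_config_read(offset, size)


dev = PciDev(bytes([0x14, 0x10]) + bytes(254))  # vendor ID 0x1014 at offset 0
drv = BistAwareDriver(dev)
drv.start_bist()
print(hex(user_read_config(dev, 0, 2)))  # served from the snapshot
drv.finish_bist()
```

Devices whose driver never installs a hook keep the default path, so
the cost for everything else is one indirect call.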

I don't like the idea of "learning to live with this". People do run
into this problem and telling them it can't be fixed is not an
acceptable answer.

-Brian

linas at austin.ibm.com wrote:
>
> Am having trouble with PCI config-space reads ... I have a device
> (actually Brian King has it) that can perform a built-in-self test
> (BIST). However, if anything does a PCI config-read during BIST, then
> the device does something crazy that makes the PCI controller chip
> take it offline.
>
> I'm not sure what's doing the config-space reads ... seems to be some
> user-space tool or daemon. I'm wondering if there is any practical way
> to block such reads to a given device until its self-test sequence is
> completed. I could try to modify the architecture-specific pci files
> to do this (arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly
> ... is there another way? or do we have to just learn to live with
> this hardware?

--



From benh at kernel.crashing.org  Fri Jul  9 02:37:59 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 08 Jul 2004 11:37:59 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <40ED780C.6000401@us.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com>
	 <40ED780C.6000401@us.ibm.com>
Message-ID: <1089304678.5355.50.camel@gaston>


On Thu, 2004-07-08 at 11:36, Brian King wrote:

> I don't like the idea of "learning to live with this". People do run
> into this problem and telling them it can't be fixed is not an
> acceptable answer.

Then we should do the workaround in the low level pci access functions,
possibly by having the driver call an arch function to "notify" that
the given device is in a bad state for a while...
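
A minimal sketch of that arrangement, again as a Python model rather
than kernel code. The function names are made up for illustration, and
returning all-ones for a blocked read (what a config read
conventionally yields when nothing responds) is an assumption; an
error code would be an equally plausible choice.

```python
ALL_ONES = 0xFFFFFFFF  # conventional result of a config read nobody answers

_busy_devices = set()


def arch_pci_mark_busy(dev):
    """Hypothetical arch hook: the driver says the device must not be touched."""
    _busy_devices.add(dev)


def arch_pci_mark_ready(dev):
    """Hypothetical arch hook: the device can take config cycles again."""
    _busy_devices.discard(dev)


def low_level_config_read(dev, offset, hw_read):
    """Low-level accessor: never issues a bus cycle to a busy device."""
    if dev in _busy_devices:
        return ALL_ONES
    return hw_read(dev, offset)


bus_cycles = []


def fake_hw_read(dev, offset):
    # Stand-in for the real bus access; records that a cycle was issued.
    bus_cycles.append((dev, offset))
    return 0x1014


arch_pci_mark_busy("0000:01:00.0")
print(hex(low_level_config_read("0000:01:00.0", 0, fake_hw_read)))
arch_pci_mark_ready("0000:01:00.0")
print(hex(low_level_config_read("0000:01:00.0", 0, fake_hw_read)))
```

The point of putting the check this low is that every caller, kernel
or userspace, goes through it without the PCI core having to change.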

Ben.



From greg at kroah.com  Fri Jul  9 02:45:22 2004
From: greg at kroah.com (Greg KH)
Date: Thu, 8 Jul 2004 09:45:22 -0700
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <40ED780C.6000401@us.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com>
Message-ID: <20040708164522.GD14231@kroah.com>


On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote:
> I've been doing some talking with various hardware folks to see if
> there is a way to get this fixed and so far the answer I have been
> getting is no. It is how the hardware works and there isn't much that
> can be done about it in microcode.

But this is limited only to a single PCI device that you have, correct?
To add such a huge core change for only a single, broken device, that
99.99% of the Linux users and developers will never see is not a nice
option.

thanks,

greg k-h



From willy at debian.org  Fri Jul  9 03:37:59 2004
From: willy at debian.org (Matthew Wilcox)
Date: Thu, 8 Jul 2004 18:37:59 +0100
Subject: [Pcihpd-discuss] Re: How to block pci config-reads during device self-test?
In-Reply-To: <40ED780C.6000401@us.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com>
Message-ID: <20040708173759.GO27324@parcelfarce.linux.theplanet.co.uk>


On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote:
> I don't like the idea of "learning to live with this". People do run
> into this problem and telling them it can't be fixed is not an
> acceptable answer.

Sure it is.  Even the taiwanese knock-off PCI boards don't have these
kinds of problems.  Your hardware guys need to fix it.

--
"Next the statesmen will invent cheap lies, putting the blame upon
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince
himself that the war is just, and will thank God for the better sleep
he enjoys after this process of grotesque self-deception." -- Mark Twain



From brking at us.ibm.com  Fri Jul  9 03:56:06 2004
From: brking at us.ibm.com (Brian King)
Date: Thu, 08 Jul 2004 12:56:06 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040708164522.GD14231@kroah.com>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com>
Message-ID: <40ED8AB6.9020604@us.ibm.com>


Greg KH wrote:
> But this is limited only to a single PCI device that you have, correct?
> To add such a huge core change for only a single, broken device, that
> 99.99% of the Linux users and developers will never see is not a nice
> option.

It is limited to every adapter the ipr device driver controls which
means almost every scsi adapter on pSeries hardware, including the
embedded scsi controller on most systems. So the scope of this problem
for pSeries users is fairly large.

While I agree with your statement above, Greg, I would still like to try
to come to a solution to this problem that we can all live with. If that
means only changing ppc64 code and ipr code, I will go along with that.

And while I can pester the hardware folks to fix this in future chips,
we will still be living with existing hardware for a long time.

--
Brian King
eServer Storage I/O
IBM Linux Technology Center



From greg at kroah.com  Fri Jul  9 03:59:29 2004
From: greg at kroah.com (Greg KH)
Date: Thu, 8 Jul 2004 10:59:29 -0700
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <40ED8AB6.9020604@us.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com> <40ED8AB6.9020604@us.ibm.com>
Message-ID: <20040708175929.GC15655@kroah.com>


On Thu, Jul 08, 2004 at 12:56:06PM -0500, Brian King wrote:
> Greg KH wrote:
> >But this is limited only to a single PCI device that you have, correct?
> >To add such a huge core change for only a single, broken device, that
> >99.99% of the Linux users and developers will never see is not a nice
> >option.
>
> It is limited to every adapter the ipr device driver controls which
> means almost every scsi adapter on pSeries hardware, including the
> embedded scsi controller on most systems. So the scope of this problem
> for pSeries users is fairly large.

Not if you "fix this" by disabling access to the self-test mode :)

thanks,

greg k-h



From brking at us.ibm.com  Fri Jul  9 04:08:26 2004
From: brking at us.ibm.com (Brian King)
Date: Thu, 08 Jul 2004 13:08:26 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040708175929.GC15655@kroah.com>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com> <40ED8AB6.9020604@us.ibm.com> <20040708175929.GC15655@kroah.com>
Message-ID: <40ED8D9A.2030805@us.ibm.com>


Greg KH wrote:
> On Thu, Jul 08, 2004 at 12:56:06PM -0500, Brian King wrote:
>
>>Greg KH wrote:
>>
>>>But this is limited only to a single PCI device that you have,
>>>correct? To add such a huge core change for only a single, broken
>>>device, that 99.99% of the Linux users and developers will never see
>>>is not a nice option.
>>
>>It is limited to every adapter the ipr device driver controls which
>>means almost every scsi adapter on pSeries hardware, including the
>>embedded scsi controller on most systems. So the scope of this problem
>>for pSeries users is fairly large.
>
> Not if you "fix this" by disabling access to the self-test mode :)

It's not just used for the self-test mode. It is the only way I have to
give the adapter a hard reset during error recovery. So when my
eh_host_reset function gets invoked due to the adapter having severe
problems or if the adapter takes an error that requires a hard reset to
recover from, I have to run BIST.

--
Brian King
eServer Storage I/O
IBM Linux Technology Center



From linas at denx.de  Sat Jul 10 07:13:04 2004
From: linas at denx.de (Linas Vepstas)
Date: Fri, 9 Jul 2004 16:13:04 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040708164522.GD14231@kroah.com>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com>
Message-ID: <20040709211304.GG17333@bilge>


On Thu, Jul 08, 2004 at 09:45:22AM -0700, Greg KH was heard to remark:
> On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote:
> > I've been doing some talking with various hardware folks to see if
> > there is a way to get this fixed and so far the answer I have been
> > getting is no. It is how the hardware works and there isn't much
> > that can be done about it in microcode.
>
> But this is limited only to a single PCI device that you have,
> correct?

I have a vague recollection that this is not the first time this
has been seen, but that was a while ago, and I didn't understand
the issue then.

We don't know how many more devices are affected.  The pSeries
PCI hardware significantly clamps down on what is allowed on the
bus, all in the name of not corrupting the kernel.  Traditional
PCI bridges are a lot more permissive ... we actually don't know how
many more cards are out there that might go wild if touched
in unexpected ways.

However, note that most, (if not all?) of the devices on ppc64 are
now chasing or have recently chased rare bugs due to the EEH clampdown,
so this might be the tip of an iceberg.

> To add such a huge core change for only a single, broken device, that
> 99.99% of the Linux users and developers will never see is not a nice
> option.

I'll work with Brian to add something to arch/ppc64/kernel/pSeries_pci.c
that works around this problem.    There is a separate patch I'll mail
later today for dealing with EEH detection in the same codepath.

--linas



From anton at samba.org  Sun Jul 11 10:27:46 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 11 Jul 2004 10:27:46 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040707100643.0b889896.moilanen@austin.ibm.com>
References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com>
Message-ID: <20040711002746.GA5232@krispykreme>


> This branch probably shouldn't be predicted.  The general rule on branch
> prediction is for an error case, or a missed lock.  Since about power 4,
> the branch prediction is a little over 99% correct, you'll get a miss 1
> out of 62 times or 1.6% of the time.  It's probably not measurable, just
> might save a few cycles.

Milton and I looked over this a while ago. Neither of the two POWER4
branch prediction algorithms will do a good job. We are likely to get 2
mispredictions per loop whereas we only get one misprediction per loop
with static prediction.

As David pointed out we have removed the slbmfee loop completely in
subsequent patches. We could probably remove the first loop by keeping
the bolted entries to a power of 2 and using a rotate and mask insert
instruction to shift the 2 ^ 6 bit down:

	addi	r3,r3,1
	li	r4,0
	rlwimi	r4,r3,32-(6-2),31-2,31-2
	andi.	r3,r3,0x3f
	or	r3,r3,r4

That looks like 3 cycles but I'm sure we can do it in less. However even
if we manage to do it in 2 cycles, that's going to add 2 * 60 = 120
cycles in one complete loop whereas a single misprediction per loop
should take somewhere around 20 cycles I think.
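
For anyone who wants to check the wrap behaviour of that sequence
without a POWER box, here is a small Python model of the five
instructions. The rlwimi helper follows the usual definition (rotate
left, then insert under a mask numbered from the MSB); the 4 bolted
slots and 64 total entries are what the constants in the sequence
imply, and the cycle figures are just the estimates from the text.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    x &= MASK32
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ppc_mask(mb, me):
    """Ones from bit mb to bit me, with bit 0 as the MSB (mb <= me)."""
    return (MASK32 >> mb) & (MASK32 << (31 - me)) & MASK32


def rlwimi(ra, rs, sh, mb, me):
    """Rotate rs left by sh and insert the masked bits into ra."""
    m = ppc_mask(mb, me)
    return (rotl32(rs, sh) & m) | (ra & ~m & MASK32)


def next_slot(r3):
    r3 = r3 + 1                                        # addi  r3,r3,1
    r4 = 0                                             # li    r4,0
    r4 = rlwimi(r4, r3, 32 - (6 - 2), 31 - 2, 31 - 2)  # rlwimi r4,r3,...
    r3 = r3 & 0x3F                                     # andi. r3,r3,0x3f
    return r3 | r4                                     # or    r3,r3,r4


slot = 4
visited = []
for _ in range(60):
    visited.append(slot)
    slot = next_slot(slot)
print(visited[0], visited[-1], slot)  # first slot, last slot, slot after wrap

# Cost comparison using the estimates from the text.
extra_branchless = 2 * 60   # 2 extra cycles on each of 60 allocations
one_mispredict = 20         # rough cost of the single miss per loop
print(extra_branchless, "vs", one_mispredict)
```

The pointer walks 4..63 and wraps straight back to 4 without a branch,
but on those estimates the branchless version still costs more per
full loop than the one misprediction it avoids.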

Anton



From anton at samba.org  Sun Jul 11 11:05:22 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 11 Jul 2004 11:05:22 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040708014654.GB4309@zax>
References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax>
Message-ID: <20040711010522.GB5232@krispykreme>


> He did do some timings, the difference was certainly tiny, although it
> might have been (just) measurable.

Yeah it was tiny, although David's slbtest showed it to be on average
just under 1 cycle faster (probably due to averaging the extra
misprediction per 60 iterations).

> Of course, I do also have a plan to get rid of the slbmfee and that
> loop by locking the kernel stack into a fixed SLB slot.  I have a
> patch to do it even, but I need to figure out how to get it to work on
> iSeries.

There are a few more things that would be interesting to look at:

- Are there any more segments we should bolt in? So far we have
  PAGE_OFFSET, VMALLOC_BASE and the kernel stack. It may make sense to
  bolt in a userspace entry or two. The userspace stack is the best
  candidate.  Obviously the more segments we bolt in, the less that are
  available for general use and bolting in more userspace segments will
  mean mostly kernel bound workloads will suffer (eg SFS). It may make
  sense to bolt in an IO region too.

- We currently preload a few userspace segments (stack, pc and task
  unmapped base). Does it make sense to restore the entire SLB state on
  context switch? One problem we have is that we usually enter the
  kernel via libc which is in a different segment to our program text.
  This means we always take an SLB miss when we return out of libc after
  a context switch. (ie we are guaranteed to take one SLB miss per
  context switch)

It would be interesting to get a log of SLB misses for a few workloads.

Anton



From anton at samba.org  Mon Jul 12 10:13:29 2004
From: anton at samba.org (Anton Blanchard)
Date: Mon, 12 Jul 2004 10:13:29 +1000
Subject: [RFC] JS20 IDE perf patch
In-Reply-To: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com>
References: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com>
Message-ID: <20040712001329.GD30109@krispykreme>


Hi Doug,

Did this get sorted out or do we still need to fix it in 2.6?

Anton

On Mon, Dec 15, 2003 at 11:06:58AM -0600, Doug Maxey wrote:
>
> Howdy,
>
>   Have done some investigation of the AMD IDE implementation on the
>   JS20 and have come up with a set of patches to address this.  The
>   underlying assumption that the PCI bus can only do 66mhz is no
>   longer true with the HT implementation on this platform.
>
>   The specific issues seen were:
>   1) ide.c has code that assumes the PCI bus only runs between 20 and
>      66 mhz.
>   2) FW has not completely configured the chipset to run at the higher
>      IDE bus speeds.
>   3) amd74xx.c does not recognize and cannot use the higher PCI bus
>      speeds to enable the higher IDE speeds.
>
>   Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in
>   linux/ide.h.  These values are used in ide/ide.c and amd74xx.c.  If
>   there are arch specific values, they override the values set in
>   linux/ide.h via the arch specific header.
>
>   Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c.
>
>   Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in
>   amd74xx.c and is used to determine the UDMA settings for the
>   drives.
>
>   Please review these changes.  Comments are welcome.
>
> ++doug
>


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/
** This list is shutting down 7/24/2004.


From david at gibson.dropbear.id.au  Mon Jul 12 10:57:11 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 12 Jul 2004 10:57:11 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040711002746.GA5232@krispykreme>
References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040711002746.GA5232@krispykreme>
Message-ID: <20040712005711.GB9423@zax>


On Sun, Jul 11, 2004 at 10:27:46AM +1000, Anton Blanchard wrote:
>
>
> > This branch probably shouldn't be predicted.  The general rule on branch
> > prediction is for an error case, or a missed lock.  Since about power 4,
> > the branch prediction is a little over 99% correct, you'll get a miss 1
> > out of 62 times or 1.6% of the time.  It's probably not measurable, just
> > might save a few cycles.
>
> Milton and I looked over this a while ago. Neither of the two POWER4
> branch prediction algorithms will do a good job. We are likely to get 2
> mispredictions per loop whereas we only get one misprediction per loop
> with static prediction.
>
> As David pointed out we have removed the slbmfee loop completely in
> subsequent patches.

Well.. we did some timings with a hacky patch which just took out the
slbmfee on the grounds that most of the time it would just work.  I
have a patch to actually safely remove it by bolting the kernel stack
into SLB slot 1, but I'm not yet sure how to get that working properly
on iSeries.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From david at gibson.dropbear.id.au  Mon Jul 12 11:33:03 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 12 Jul 2004 11:33:03 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040711010522.GB5232@krispykreme>
References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax> <20040711010522.GB5232@krispykreme>
Message-ID: <20040712013302.GC9423@zax>


On Sun, Jul 11, 2004 at 11:05:22AM +1000, Anton Blanchard wrote:
>
>
> > He did do some timings, the difference was certainly tiny, although it
> > might have been (just) measurable.
>
> Yeah it was tiny, although David's slbtest showed it to be on average
> just under 1 cycle faster (probably due to averaging the extra
> misprediction per 60 iterations).
>
> > Of course, I do also have a plan to get rid of the slbmfee and that
> > loop by locking the kernel stack into a fixed SLB slot.  I have a
> > patch to do it even, but I need to figure out how to get it to work on
> > iSeries.
>
> There are a few more things that would be interesting to look at:
>
> - Are there any more segments we should bolt in? So far we have
>   PAGE_OFFSET, VMALLOC_BASE and the kernel stack. It may make sense to
>   bolt in a userspace entry or two. The userspace stack is the best
>   candidate.  Obviously the more segments we bolt in, the less that are
>   available for general use and bolting in more userspace segments will
>   mean mostly kernel bound workloads will suffer (eg SFS). It may make
>   sense to bolt in an IO region too.

Indeed.  My gut instinct would be that we want to keep the number of
bolted segments down, so as much of the SLB can be used as flexibly as
possible.

That said, I've had a couple of ideas on how to generate some
semblance of LRU data in ways that might be sufficiently low overhead
to be practical.  For example, what if we kept a variable containing
the ESID of the last segment cast out from the SLB.  Whenever we take
an SLB miss, we check the EA against that stored segment.  If they
match - i.e. this segment is so heavily used that we want it again
almost as soon as we've thrown it out - we put a flag indicating to
skip over this slot for some number of future misses.
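
A toy simulation of this first idea, to make the mechanics concrete.
Everything beyond the description above is an assumption of the
sketch: the protection is a per-slot counter aged once per miss, the
slot count and protection length are arbitrary, and the workload (one
hot segment interleaved with a stream of one-off segments) is
contrived to show the effect.

```python
class Slb:
    """Round-robin SLB model with a 'last cast-out' re-use check."""

    def __init__(self, slots, skip_misses):
        self.entries = [None] * slots   # ESID held in each slot
        self.skip = [0] * slots         # misses for which a slot is passed over
        self.rr = 0                     # round-robin pointer
        self.last_castout = None
        self.skip_misses = skip_misses
        self.misses = 0

    def _pick_slot(self):
        n = len(self.entries)
        for _ in range(n):
            slot = self.rr
            self.rr = (self.rr + 1) % n
            if self.skip[slot] == 0:
                return slot
        return slot  # every slot protected: fall back to plain round robin

    def access(self, esid):
        if esid in self.entries:
            return True
        self.misses += 1
        # Age existing protections; each one lasts a fixed number of misses.
        self.skip = [max(0, s - 1) for s in self.skip]
        slot = self._pick_slot()
        wanted_again = (esid == self.last_castout)
        self.last_castout = self.entries[slot]
        self.entries[slot] = esid
        # A segment wanted straight after being thrown out gets protected.
        self.skip[slot] = self.skip_misses if wanted_again else 0
        return False


def run(skip_misses):
    slb = Slb(slots=4, skip_misses=skip_misses)
    for i in range(1, 41):
        slb.access("hot")
        slb.access(("stream", i))
    return slb.misses


print("plain round robin :", run(0))
print("with castout check:", run(8))
```

The streaming segments miss either way; the difference is how often
the hot segment gets thrown out and reloaded.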

Or rather than just having permanently bolted and non-bolted slots, we
could divide the SLB into a "fast cycle" and a "slow cycle" section
with separate RR pointers.  Either we could "semibolt" certain
segments - putting them in the slow reload section.  Or we could use
some generalization of the scheme above to dynamically promote
frequently used segments to the slow cycle area.
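
The "semibolt" variant can be sketched the same way. This is only one
reading of the idea: two slot ranges with independent round-robin
pointers, and a fixed set of segments steered into the slow range. The
sizes, the segment names and the static choice of what to semibolt are
all assumptions; the dynamic promotion scheme is not modelled.

```python
class TwoCycleSlb:
    """SLB model split into slow-cycle and fast-cycle sections."""

    def __init__(self, slow_slots, fast_slots, semibolted):
        self.slow = [None] * slow_slots
        self.fast = [None] * fast_slots
        self.slow_rr = 0
        self.fast_rr = 0
        self.semibolted = set(semibolted)

    def access(self, esid):
        if esid in self.slow or esid in self.fast:
            return True
        if esid in self.semibolted:
            # Semibolted segments only compete with each other.
            self.slow[self.slow_rr] = esid
            self.slow_rr = (self.slow_rr + 1) % len(self.slow)
        else:
            self.fast[self.fast_rr] = esid
            self.fast_rr = (self.fast_rr + 1) % len(self.fast)
        return False


hot = {"stack", "libc"}
slb = TwoCycleSlb(slow_slots=2, fast_slots=4, semibolted=hot)
hot_misses = 0
for i in range(20):
    for esid in ("stack", "libc", ("stream", i)):
        if not slb.access(esid) and esid in hot:
            hot_misses += 1
print("misses on semibolted segments:", hot_misses)
```

Here the slow section is big enough for everything steered into it, so
those segments miss once each and never again; with more semibolted
segments than slow slots they would cycle among themselves, just more
slowly than the fast section.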

> - We currently preload a few userspace segments (stack, pc and task
>   unmapped base). Does it make sense to restore the entire SLB state on
>   context switch? One problem we have is that we usually enter the
>   kernel via libc which is in a different segment to our program text.
>   This means we always take an SLB miss when we return out of libc after
>   a context switch. (ie we are guaranteed to take one SLB miss per
>   context switch)
>
> It would be interesting to get a log of SLB misses for a few
> workloads.

It certainly would.  We really want to run some simulations to see how
bolting various segments, or schemes like those above would affect the
SLB miss rate.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From dwm at austin.ibm.com  Tue Jul 13 01:23:56 2004
From: dwm at austin.ibm.com (Doug Maxey)
Date: Mon, 12 Jul 2004 10:23:56 -0500
Subject: [RFC] JS20 IDE perf patch
In-Reply-To: <20040712001329.GD30109@krispykreme>
Message-ID: <200407121523.i6CFNuHW003347@falcon10.austin.ibm.com>


Anton,

  It never did get resolved in the sense that regardless of the
  tweaks, the drive seems to be limited to a throughput number in the
  upper 20's with hdparm, even with the patches.  Suspect that I may
  be beating a dead horse.

  Why I say that is that further testing on another power platform,
  with the SII controller, shows the driver setting UDMA5 with roughly
  the same numbers from hdparm.

  Mark has asked for a definitive answer on the max drive speed from
  the manufacturer, but have yet to see the numbers.

  In any event, if the drive is being limited by the driver, the fix
  would really need to go into 2.4, or go into 2.6 and be backported,
  if there is a discrepancy in the drive capabilities and how the
  setup is done.  But am not sure at this point if there is any
  usefulness in the patch.

++doug

On Mon, 12 Jul 2004 10:13:29 +1000, Anton Blanchard wrote:
>
>Hi Doug,
>
>Did this get sorted out or do we still need to fix it in 2.6?
>
>Anton
>
>On Mon, Dec 15, 2003 at 11:06:58AM -0600, Doug Maxey wrote:
>>
>> Howdy,
>>
>>   Have done some investigation of the AMD IDE implementation on the
>>   JS20 and have come up with a set of patches to address this.  The
>>   underlying assumption that the PCI bus can only do 66mhz is no
>>   longer true with the HT implementation on this platform.
>>
>>   The specific issues seen were:
>>   1) ide.c has code that assumes the PCI bus only runs between 20 and
>>      66 mhz.
>>   2) FW has not completely configured the chipset to run at the higher
>>      IDE bus speeds.
>>   3) amd74xx.c does not recognize and cannot use the higher PCI bus
>>      speeds to enable the higher IDE speeds.
>>
>>   Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in
>>   linux/ide.h.  These values are used in ide/ide.c and amd74xx.c.  If
>>   there are arch specific values, they override the values set in
>>   linux/ide.h via the arch specific header.
>>
>>   Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c.
>>
>>   Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in
>>   amd74xx.c and is used to determine the UDMA settings for the
>>   drives.
>>
>>   Please review these changes.  Comments are welcome.
>>
>> ++doug
>>
>
>




From linas at denx.de  Tue Jul 13 07:16:55 2004
From: linas at denx.de (Linas Vepstas)
Date: Mon, 12 Jul 2004 16:16:55 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040708173759.GO27324@parcelfarce.linux.theplanet.co.uk>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708173759.GO27324@parcelfarce.linux.theplanet.co.uk>
Message-ID: <20040712211655.GA27294@bilge>


On Thu, Jul 08, 2004 at 06:37:59PM +0100, Matthew Wilcox was heard to remark:
> On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote:
> > I don't like the idea of "learning to live with this". People do
> > run into this problem and telling them it can't be fixed is not an
> > acceptable answer.
>
> Sure it is. Even the taiwanese knock-off PCI boards don't have these
> kinds of problems. Your hardware guys need to fix it.

Actually, I just talked to the PCI guys here, and they seem to be saying
that this behaviour is expected by the PCI spec;  this implies that *all*
cards with BIST will have this problem.

To recap: device driver has started a BIST on the PCI card; and some
other linux daemon performs a PCI config-space I/O to the adapter.
Since the card under BIST cannot assert DEVSEL#, the PCI bridge
does a Master Abort, with the result that the device is now placed
offline.

It sounds to me like it's pretty normal that a device in BIST
is going to be ignoring pretty much anything going into it,
so the master abort is no surprise.  So the answer does indeed
seem to be that blocking pci config-space i/o during BIST is the
right thing to do (for all architectures, and not just for ppc64
where I guess we'll do it first).

--linas



From anton at samba.org  Tue Jul 13 10:43:16 2004
From: anton at samba.org (Anton Blanchard)
Date: Tue, 13 Jul 2004 10:43:16 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040712013302.GC9423@zax>
References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax> <20040711010522.GB5232@krispykreme> <20040712013302.GC9423@zax>
Message-ID: <20040713004316.GB26263@krispykreme>


> Indeed.  My gut instinct would be that we want to keep the number of
> bolted segments down, so as much of the SLB can be used as flexibly as
> possible.

Agreed.

> That said, I've had a couple of ideas on how to generate some
> semblance of LRU data in ways that might be sufficiently low overhead
> to be practical.  For example, what if we kept a variable containing
> the ESID of the last segment cast out from the SLB.  Whenever we take
> an SLB miss, we check the EA against that stored segment.  If they
> match - i.e. this segment is so heavily used that we want it again
> almost as soon as we've thrown it out - we put a flag indicating to
> skip over this slot for some number of future misses.

Interesting idea. The segments that I have seen miss a lot are:

userspace
	stack
	PC
	libc segment

kernel
	task struct

The match EA against last castout should be able to catch these and has
the advantage of adapting to various workloads.
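
A rough sketch of the last-castout check, written as a userspace C model
rather than kernel code.  Everything in it is an assumption made for
illustration: the 4-slot SLB, the round-robin victim choice, the
SKIP_MISSES window and the names slb_model / slb_access are hypothetical
and are not taken from the rewrite under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLB_SLOTS    4   /* tiny on purpose; a real SLB is larger */
#define SKIP_MISSES  8   /* hypothetical protection window, in misses */
#define ESID_SHIFT   28  /* 256MB segments */

struct slb_model {
	uint64_t esid[SLB_SLOTS];
	int      valid[SLB_SLOTS];
	unsigned skip[SLB_SLOTS];  /* misses left during which the slot is kept */
	unsigned next;             /* round-robin replacement pointer */
	uint64_t last_castout;     /* ESID most recently thrown out */
	int      have_castout;
	unsigned long misses;
};

void slb_init(struct slb_model *s)
{
	memset(s, 0, sizeof(*s));
}

/* Returns 1 if the access missed in the model SLB, 0 if it hit. */
int slb_access(struct slb_model *s, uint64_t ea)
{
	uint64_t esid = ea >> ESID_SHIFT;
	unsigned i, victim;
	int hot;

	assert(s != NULL);

	for (i = 0; i < SLB_SLOTS; i++)
		if (s->valid[i] && s->esid[i] == esid)
			return 0;

	s->misses++;

	/* Wanted again right after being cast out: treat as heavily used. */
	hot = s->have_castout && s->last_castout == esid;

	/* Round-robin victim, stepping over slots still flagged to skip.
	 * If every slot is flagged, fall back to plain round-robin. */
	victim = s->next;
	for (i = 0; i < SLB_SLOTS; i++) {
		unsigned c = (s->next + i) % SLB_SLOTS;
		if (s->skip[c] == 0) {
			victim = c;
			break;
		}
	}
	s->next = (victim + 1) % SLB_SLOTS;

	/* Each miss uses up one unit of every slot's protection. */
	for (i = 0; i < SLB_SLOTS; i++)
		if (s->skip[i])
			s->skip[i]--;

	if (s->valid[victim]) {
		s->last_castout = s->esid[victim];
		s->have_castout = 1;
	}
	s->esid[victim] = esid;
	s->valid[victim] = 1;
	s->skip[victim] = hot ? SKIP_MISSES : 0;
	return 1;
}
```

Feeding a logged stream of effective addresses through slb_access() and
reading back the misses counter would be one cheap way to compare this
against plain round-robin before touching the real miss handler.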

> It certainly would.  We really want to run some simulations to see how
> bolting various segments, or schemes like those above would affect the
> SLB miss rate.

If we cut the amount of the SLB we use to a minimum we could get a
rather accurate log of SLB accesses. Does the SLB miss handler touch
anything outside segment 0 in your rewrite? If it doesn't we could use
only one slot for replacement, and then log every miss we take.

Anton



From linas at denx.de  Wed Jul 14 09:03:28 2004
From: linas at denx.de (Linas Vepstas)
Date: Tue, 13 Jul 2004 18:03:28 -0500
Subject: [RFC/PATCH] Re: How to block pci config-reads during device self-test?
In-Reply-To: <40ED780C.6000401@us.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com>
Message-ID: <20040713230328.GQ17333@bilge>

Hi,

On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King was heard to remark:
> I've been doing some talking with various hardware folks to see if
> there is a way to get this fixed and so far the answer I have been
> getting is no. It is how the hardware works and there isn't much that
> can be done about it ...

Actually, the hardware isn't broken ... it's working as designed, more
or less per pci spec.  See below for technical details.

> I don't like the idea of "learning to live with this". People do
> run into this problem and telling them it can't be fixed is not an
> acceptable answer.
>
> linas at austin.ibm.com wrote:
> >
> >Am having trouble with PCI config-space reads ... I have a device
> >(actually Brian King has it) that can perform a built-in-self test
> >(BIST). However, if anything does a PCI config-read during BIST, then
> >the device does something crazy that makes the PCI controller chip
> >take it offline.

Actually, the device didn't do anything 'crazy'; it just
sat there silently, ignoring the i/o.

> >I'm not sure what's doing the config-space reads ... seems
> >to be some user-space tool or daemon. I'm wondering if
> >there is any practical way to block such reads to a given
> >device until its self-test sequence is completed. I could
> >try to modify the architecture-specific pci files to do this
> >(arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly ... is
> >there another way? or do we have to just learn to live with this
> >hardware?

Here's a patch w/ a recap description of the problem.

The following patch addresses a peculiar PCI i/o problem.
This patch is specific to ppc64, although one could argue that
a better patch should probably be arch indep.   But there's also
a different way to fix this problem, which is why this is an RFC.

The problem:
Some device drivers use the PCI BIST (built-in self-test) to reset
the PCI card if the card hangs in some error condition.  While the
BIST is running, the card, almost by definition, is unable to respond
to pci i/o.  If the pci controller tries to do any i/o, the card simply
won't respond, and the pci controller will do a 'master abort' a few
pci bus cycles later.  This is normal and expected behaviour for both
card and the controller.

The problem that I'm having is that I have a pci controller that
thinks master-abort==dead card, and so offlines the card.  Which
is actually reasonable too ... no one should be doing i/o during BIST
anyway. Unfortunately various user-space daemons & tools can
do pci config-space i/o at anytime (typically to perform a hardware
inventory) ... and invariably one of these user-land tools runs while
the card is doing a BIST; the pci controller then off-lines this "dead
card", and we are unhappy.

There are two possible solutions.  (1) block pci config-space i/o
during a BIST, and (2) let the device driver detect that the card
has been off-lined, and so reset the pci controller chip.

I'm starting to think (2) is better, but this patch implements (1).
Opinions?  Comments?

The patch below uses rw semaphores to block i/o access during
BIST.  All i/o access is locked with a read semaphore: thus, access
remains fast-n-cheap.  If the driver needs to BIST, it will then need
to do a down_write()/up_write() surrounding the BIST.

--linas
-------------- next part --------------
===== arch/ppc64/kernel/pSeries_pci.c 1.39 vs edited =====
--- 1.39/arch/ppc64/kernel/pSeries_pci.c	Mon Jul 12 18:29:16 2004
+++ edited/arch/ppc64/kernel/pSeries_pci.c	Tue Jul 13 15:46:08 2004
@@ -76,14 +76,22 @@

 	addr = (dn->busno << 16) | (dn->devfn << 8) | where;
 	buid = dn->phb->buid;
+
+	if (in_interrupt()) {
+		/* Driver should unplug interrupts during BIST */
+		BUG_ON (down_read_trylock(&dn->bist_lock) != 1);
+	} else {
+		down_read (&dn->bist_lock);
+	}
 	if (buid) {
 		ret = rtas_call(ibm_read_pci_config, 4, 2, &returnval,
 				addr, buid >> 32, buid & 0xffffffff, size);
 	} else {
 		ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size);
 	}
-	*val = returnval;
+	up_read (&dn->bist_lock);

+	*val = returnval;
 	if (ret)
 		return PCIBIOS_DEVICE_NOT_FOUND;

@@ -129,11 +137,19 @@

 	addr = (dn->busno << 16) | (dn->devfn << 8) | where;
 	buid = dn->phb->buid;
+
+	if (in_interrupt()) {
+		/* Driver should unplug interrupts during BIST */
+		BUG_ON (down_read_trylock(&dn->bist_lock) != 1);
+	} else {
+		down_read (&dn->bist_lock);
+	}
 	if (buid) {
 		ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, buid >> 32, buid & 0xffffffff, size, (ulong) val);
 	} else {
 		ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val);
 	}
+	up_read (&dn->bist_lock);

 	if (ret)
 		return PCIBIOS_DEVICE_NOT_FOUND;
===== arch/ppc64/kernel/prom.c 1.89 vs edited =====
--- 1.89/arch/ppc64/kernel/prom.c	Tue Jul 13 15:26:58 2004
+++ edited/arch/ppc64/kernel/prom.c	Tue Jul 13 16:03:01 2004
@@ -31,6 +31,7 @@
 #include <linux/stringify.h>
 #include <linux/delay.h>
 #include <linux/initrd.h>
+#include <linux/rwsem.h>
 #include <asm/prom.h>
 #include <asm/rtas.h>
 #include <asm/lmb.h>
@@ -3021,6 +3022,7 @@
 		kfree(np);
 		return err;
 	}
+	init_rwsem (&np->bist_lock);

 	write_lock(&devtree_lock);
 	np->sibling = np->parent->child;
===== include/asm-ppc64/prom.h 1.20 vs edited =====
--- 1.20/include/asm-ppc64/prom.h	Thu May 27 20:37:20 2004
+++ edited/include/asm-ppc64/prom.h	Tue Jul 13 15:45:28 2004
@@ -16,6 +16,7 @@
  */
 #include <linux/proc_fs.h>
 #include <asm/atomic.h>
+#include <asm/rwsem.h>

 #define PTRRELOC(x)     ((typeof(x))((unsigned long)(x) - offset))
 #define PTRUNRELOC(x)   ((typeof(x))((unsigned long)(x) + offset))
@@ -151,6 +152,12 @@
 	int	busno;			/* for pci devices */
 	int	bussubno;		/* for pci devices */
 	int	devfn;			/* for pci devices */
+
+	/* Lock to prevent 3rd-party PCI config space i/o while device
+	 * driver is running BIST on the controller; driver should obtain
+	 * down_write()/up_write() in this lock for the duration of BIST. */
+	struct rw_semaphore bist_lock;
+
 #define DN_STATUS_BIST_FAILED (1<<0)
 	int	status;			/* Current device status (non-zero is bad) */
 	int	eeh_mode;		/* See eeh.h for possible EEH_MODEs */

From david at gibson.dropbear.id.au  Wed Jul 14 10:54:21 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 14 Jul 2004 10:54:21 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
In-Reply-To: <20040713004316.GB26263@krispykreme>
References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax> <20040711010522.GB5232@krispykreme> <20040712013302.GC9423@zax> <20040713004316.GB26263@krispykreme>
Message-ID: <20040714005421.GA18558@zax>


On Tue, Jul 13, 2004 at 10:43:16AM +1000, Anton Blanchard wrote:
>
>
> > Indeed.  My gut instinct would be that we want to keep the number of
> > bolted segments down, so as much of the SLB can be used as flexibly as
> > possible.
>
> Agreed.
>
> > That said, I've had a couple of ideas on how to generate some
> > semblance of LRU data in ways that might be sufficiently low overhead
> > to be practical.  For example, what if we kept a variable containing
> > the ESID of the last segment cast out from the SLB.  Whenever we take
> > an SLB miss, we check the EA against that stored segment.  If they
> > match - i.e. this segment is so heavily used that we want it again
> > almost as soon as we've thrown it out - we put a flag indicating to
> > skip over this slot for some number of future misses.
>
> Interesting idea. The segments that I have seen miss a lot are:
>
> userspace
> 	stack
> 	PC
> 	libc segment
>
> kernel
> 	task struct
>
> The match EA against last castout should be able to catch these and has
> the advantage of adapting to various workloads.
>
> > It certainly would.  We really want to run some simulations to see how
> > bolting various segments, or schemes like those above would affect the
> > SLB miss rate.
>
> If we cut the amount of the SLB we use to a minimum we could get a
> rather accurate log of SLB accesses. Does the SLB miss handler touch
> anything outside segment 0 in your rewrite? If it doesn't we could use
> only one slot for replacement, and then log every miss we take.

Yes, that should work, the new handler shouldn't touch anything
outside segment 0.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From haveblue at us.ibm.com  Wed Jul 14 11:04:24 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Tue, 13 Jul 2004 18:04:24 -0700
Subject: oops bringing up secondary cpus
Message-ID: <1089767063.4370.160.camel@nighthawk>


First of all, this oops is very likely my fault.  I'm porting
CONFIG_NONLINEAR over to ppc64 for memory hotplug, and I have little
doubt that I screwed up somewhere.  I could really use some more eyes
analyzing this oops, though.  It's a 2-way p650 partition with 12GB of
allocated memory.

I think it occurs the first time that a secondary CPU touches its idle
task's kernel stack.  Here's the dump:

cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10]
    pc: 000000000000beac
    lr: 000000000000be88
    sp: c0000002fff3bf90
   msr: 8000000000001002
   dar: c0000002fff3bf90
 dsisr: a000000
  current = 0xc0000002fff30920
  paca    = 0xc00000000035a000
    pid   = 0, comm = swapper

pc appears to be in __secondary_start (the 0xc00000000000beac address):

objdump:
c00000000000be38 <.__secondary_start>:
...
c00000000000be94:       64 63 00 50     oris    r3,r3,80
c00000000000be98:       60 63 5f 58     ori     r3,r3,24408
c00000000000be9c:       7b 1c 1f 24     rldicr  r28,r24,3,60
c00000000000bea0:       7c 23 e0 2a     ldx     r1,r3,r28
c00000000000bea4:       38 21 3f 90     addi    r1,r1,16272
c00000000000bea8:       38 00 00 00     li      r0,0
c00000000000beac:       f8 01 00 00     std     r0,0(r1)
c00000000000beb0:       f8 2d 00 20     std     r1,32(r13)
c00000000000beb4:       e8 6d 00 38     ld      r3,56(r13)
c00000000000beb8:       60 64 00 01     ori     r4,r3,1
c00000000000bebc:       38 60 50 00     li      r3,20480


The idle task for cpu one is allocated at c0000002fff38000, so this
looks valid.  The instruction that causes the exception (std r0,0(r1))
appears to be storing r0 to the address held in r1, right?  That looks
like a quite valid virtual address to me.  Is this a problem with the SLB?
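
As a quick arithmetic check on why the address looks plausible: the
faulting sp/dar is exactly the quoted stack base plus the immediate from
the "addi r1,r1,16272" just before the store.  The sketch below only
redoes that sum; the 16KB stack size is an assumption, and the helper
names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_OFFSET  16272u    /* immediate from "addi r1,r1,16272" */
#define STACK_SIZE    0x4000u   /* assumed 16KB kernel stack */

/* Stack pointer the startup code would compute for a given stack base. */
uint64_t initial_sp(uint64_t stack_base)
{
	return stack_base + STACK_OFFSET;
}

/* 1 if addr lies inside [base, base + STACK_SIZE), else 0. */
int in_stack(uint64_t base, uint64_t addr)
{
	assert(base + STACK_SIZE > base);  /* no wraparound */
	return addr >= base && addr < base + STACK_SIZE;
}
```

With base 0xc0000002fff38000 this gives 0xc0000002fff3bf90, matching the
sp and dar values in the dump.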

enter ? for help
1:mon> r
R00 = 0000000000000000   R16 = 0000000000000000
R01 = c0000002fff3bf90   R17 = 0000000000000000
R02 = c000000000500440   R18 = 0000000000000000
R03 = c000000000505f58   R19 = 0000000000000000
R04 = 000008d12e6ab480   R20 = 0000000000000000
R05 = 0000000000000000   R21 = 0000000000c00000
R06 = 0000000000000001   R22 = 0000000000000000
R07 = 0000000000c02000   R23 = 0000000000000001
R08 = c00000000035a000   R24 = 0000000000000001
R09 = d000000008000001   R25 = 0000000000000010
R10 = 0000000000000001   R26 = 0000000000000568
R11 = 0000000000000002   R27 = 000000000000041c
R12 = 2000000000000000   R28 = 0000000000000008
R13 = c00000000035a000   R29 = 00000000000009e8
R14 = 0000000000000000   R30 = 0000000003280000
R15 = 0000000000000000   R31 = 0000000000900000
pc  = 000000000000beac
lr  = 000000000000be88
msr = 8000000000001002   cr  = 42280484
ctr = 0000000000000000   xer = 0000000020000000   trap =      300


SLB contents of cpu 1
00 c000000008000000 00006a99b4b14580
01 d000000008000000 000008d12e6ab480
02 c0000002f8000000 0000171f7cb14580
03 c000000030000000 0000a12fdcb14580
04 0000000010000000 000055a12defac00
05 0000000000000000 00001fe68f09ec00
06 00000000f0000000 000030d55709ec00
07 0000000040000000 000013596f09ec00
08 c0000002f0000000 0000171f7cb14580
09 0000000010000000 0000dcc34709ec00
10 0000000000000000 000098c475efac00
11 00000000f0000000 0000a9b33defac00
12 0000000040000000 00008c3755efac00
13 c000000030000000 0000a12fdcb14580
14 0000000010000000 000055a12defac00
15 c0000002f0000000 0000171f7cb14580
16 0000000000000000 00001fe68f09ec00
17 00000000f0000000 000030d55709ec00
18 0000000040000000 000013596f09ec00
19 0000000010000000 0000dcc34709ec00
20 0000000000000000 000098c475efac00
21 00000000f0000000 0000a9b33defac00
22 0000000040000000 00008c3755efac00
23 c000000030000000 0000a12fdcb14580
24 0000000010000000 000055a12defac00
25 c0000002f0000000 0000171f7cb14580
26 c000000030000000 0000a12fdcb14580
27 0000000000000000 0000e3779b970c00
28 00000000f0000000 0000f46663970c00
29 0000000040000000 0000d6ea7b970c00
30 0000000010000000 0000a05453970c00
31 c0000002f0000000 0000171f7cb14580
32 e000000000000000 0000a708a8242480
33 c0000002f0000000 0000171f7cb14580
34 e000000080000000 00008dee68242480
35 0000000000000000 000098c475efac00
36 00000000f0000000 0000a9b33defac00
37 0000000040000000 00008c3755efac00
38 c000000030000000 0000a12fdcb14580
39 0000000010000000 000055a12defac00
40 0000000000000000 00001fe68f09ec00
41 00000000f0000000 000030d55709ec00
42 0000000040000000 000013596f09ec00
43 c0000002f0000000 0000171f7cb14580
44 0000000010000000 0000dcc34709ec00
45 0000000000000000 000036fbefa91c00
46 00000000f0000000 000047eab7a91c00
47 0000000040000000 00002a6ecfa91c00
48 0000000010000000 0000f3d8a7a91c00
49 0000000000000000 00001fe68f09ec00
50 00000000f0000000 000030d55709ec00
51 0000000040000000 000013596f09ec00
52 c000000030000000 0000a12fdcb14580
53 0000000010000000 0000dcc34709ec00
54 c0000002f0000000 0000171f7cb14580
55 c000000030000000 0000a12fdcb14580
56 c0000002f0000000 0000171f7cb14580
57 c000000030000000 0000a12fdcb14580
58 c0000002f0000000 0000171f7cb14580
59 c000000030000000 0000a12fdcb14580
60 c0000002f0000000 0000171f7cb14580
61 c000000030000000 0000a12fdcb14580
62 c0000002f0000000 0000171f7cb14580
63 0000000000000000 000098c475efac00
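
One way to read the dump above mechanically, as a sketch: assuming the
first column is the ESID doubleword of each entry, with the valid bit at
0x08000000 and the segment number in the top 36 bits, the helper below
looks for a valid entry covering a given address.  slb_find and
dump_head are hypothetical names; dump_head copies only entries 00-03.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLB_ESID_V  0x0000000008000000ULL  /* entry-valid bit */
#define ESID_MASK   0xfffffffff0000000ULL  /* 256MB segment number */

/* First four ESID doublewords from the cpu 1 dump. */
const uint64_t dump_head[4] = {
	0xc000000008000000ULL,
	0xd000000008000000ULL,
	0xc0000002f8000000ULL,
	0xc000000030000000ULL,
};

/* Returns the index of the first valid entry covering ea, or -1. */
int slb_find(const uint64_t *esid_dwords, size_t n, uint64_t ea)
{
	size_t i;

	assert(esid_dwords != NULL);
	for (i = 0; i < n; i++) {
		uint64_t e = esid_dwords[i];

		if ((e & SLB_ESID_V) && (e & ESID_MASK) == (ea & ESID_MASK))
			return (int)i;
	}
	return -1;
}
```

Under those assumptions, entry 02 appears to be a valid entry for the
segment containing c0000002fff3bf90, while entry 03 has the valid bit
clear.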

Could stab_initialize() have been screwed up?  Any suggestions how to
debug this?  I even tried dumping the PACA on a working, and non-working
kernel, but the only difference I got was at decimal offset 480:

--- paca0-mm2-virgin    2004-07-13 13:33:04.000000000 -0700
+++ paca0-mm2-nononl    2004-07-13 16:17:21.000000000 -0700
@@ -2,7 +2,7 @@
 ][0032]: 0100000000000000 0000000000000000 d397d78102800000 0000000000000000
 ][0128]: d397d9e204000000 0000000000000000 0000000000000000 0000000000000000
 ][0256]: 0000000300000000 0000000000000000 0000000000000000 0000000000000000
-][0480]: 0000003b6fae5692 0000000000000000 c000000000000000 0000000000000000
+][0480]: e000000000000000 0000000000000000 c000000000000000 0000000000000000
 ][1024]: c000000000346180 0000000000000000 0000000000000000 0000000000000000
 ][1056]: 0000000000000000 0000000000000000 d397d78102800000 0000000000000000
 ][1152]: d397d9e204000000 0000000000000000 0000000000000000 0000000000000000

And that's in a reserved area of the xLpPaca, so it's pretty much a
guess what the hypervisor was doing.  (btw, this diff is from a slightly
different kernel than the above oops)

-- Dave




From haveblue at us.ibm.com  Thu Jul 15 09:55:25 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Wed, 14 Jul 2004 16:55:25 -0700
Subject: [PATCH] trivial __make_room() warning fix
Message-ID: <1089849324.10000.50.camel@nighthawk>

I was just compiling 2.6.8-rc1-mm1, and got a warning I don't think I've
seen before:

arch/ppc64/kernel/prom.c: In function `__make_room':
arch/ppc64/kernel/prom.c:1415: warning: unused variable `offset'

The attached patch fixes it.

-- Dave
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ppc64-initrd-warning-2.6.8-rc1-mm1.patch
Type: text/x-patch
Size: 634 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040714/4d14a010/attachment.bin 

From haveblue at us.ibm.com  Thu Jul 15 10:14:01 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Wed, 14 Jul 2004 17:14:01 -0700
Subject: Broken LPAR cpu bringup
Message-ID: <1089850441.10000.62.camel@nighthawk>


OK, after tearing my remaining hair out for a few more hours, I realized
that the oops that I posted yesterday is *not* my fault.  :)  Something
is getting really confused.

These results are from a clean 2.6.8-rc1-mm1 kernel.  It's a partition
on a p650 with 8 physical CPUs, and 2 in the partition that I'm using.
64GB of total RAM, 12 GB in the partition.  There are no other active
partitions.

The early CPU booting appears to go as planned:

instantiating rtas at 0x000000003fd3c000... done
0000000000000000 : booting  cpu /cpus/PowerPC,POWER4 at 0
0000000000000001 : starting cpu /cpus/PowerPC,POWER4 at 1... ... done
Calling quiesce ...

But, when SMP cpu bringup happens, it gets really confused and tries to
bring up *all* of the CPUs:

Partition configured for 8 cpus.
No more cpus available, failing
Processor 1 is stuck.
No more cpus available, failing
Processor 2 is stuck.
No more cpus available, failing
Processor 3 is stuck.
No more cpus available, failing
Processor 4 is stuck.
No more cpus available, failing
Processor 5 is stuck.
No more cpus available, failing
Processor 6 is stuck.
No more cpus available, failing
Processor 7 is stuck.
Brought up 1 CPUs

The mailing list didn't like me attaching the device tree, so here's
another copy:

http://www.baconmonkey.com/device-tree-p650.tar.gz

-- Dave




From nathanl at austin.ibm.com  Thu Jul 15 14:29:22 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Wed, 14 Jul 2004 23:29:22 -0500
Subject: Broken LPAR cpu bringup
In-Reply-To: <1089850441.10000.62.camel@nighthawk>
References: <1089850441.10000.62.camel@nighthawk>
Message-ID: <1089865083.13914.3.camel@biclops.private.network>


On Wed, 2004-07-14 at 19:14, Dave Hansen wrote:
> These results are from a clean 2.6.8-rc1-mm1 kernel.  It's a partition
> on a p650 with 8 physical CPUs, and 2 in the partition that I'm using.
> 64GB of total RAM, 12 GB in the partition.  There are no other active
> partitions.
>
> The early CPU booting appears to go as planned:
>
> instantiating rtas at 0x000000003fd3c000... done
> 0000000000000000 : booting  cpu /cpus/PowerPC,POWER4 at 0
> 0000000000000001 : starting cpu /cpus/PowerPC,POWER4 at 1... ... done
> Calling quiesce ...
>
> But, when SMP cpu bringup happens, it gets really confused and tries to
> bring up *all* of the CPUs:
>
> Partition configured for 8 cpus.
> No more cpus available, failing
> Processor 1 is stuck.
> No more cpus available, failing
> Processor 2 is stuck.
> No more cpus available, failing
> Processor 3 is stuck.
> No more cpus available, failing
> Processor 4 is stuck.
> No more cpus available, failing
> Processor 5 is stuck.
> No more cpus available, failing
> Processor 6 is stuck.
> No more cpus available, failing
> Processor 7 is stuck.
> Brought up 1 CPUs
>

Does this fix it?  The problem is that ppc64 doesn't know about
cpu_present_map.  The arch-independent code copies cpu_possible_map to
cpu_present_map if the latter has not been modified by the arch bootup
code -- as you have discovered, this breaks on LPAR.
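
To make the failure mode concrete, here is a toy model of the fallback
described above.  It is not the kernel code: cpumask is just a 64-bit
word, and cpu_maps, generic_fixup_present and secondary_attempts are
made-up names.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cpumask;  /* toy mask: one bit per cpu */

struct cpu_maps {
	cpumask possible;
	cpumask present;
};

/* Models the generic fallback: if the arch code never touched the
 * present map, it is taken to be the same as the possible map. */
void generic_fixup_present(struct cpu_maps *m)
{
	if (m->present == 0)
		m->present = m->possible;
}

/* Number of secondary cpus boot would try to start: every present
 * cpu other than the boot cpu. */
int secondary_attempts(const struct cpu_maps *m, unsigned boot_cpu)
{
	cpumask rest;
	int n = 0;

	assert(boot_cpu < 64);
	rest = m->present & ~((cpumask)1 << boot_cpu);
	for (; rest; rest &= rest - 1)  /* clear lowest set bit */
		n++;
	return n;
}
```

With possible = 0xff and present left empty, the fallback yields seven
secondary attempts, which lines up with the "Processor 1..7 is stuck"
messages; with present set to 0x3 by the arch code, only one secondary
is attempted.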

Nathan

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>

diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/prom.c 2.6.8-rc1-mm1.new/arch/ppc64/kernel/prom.c
--- 2.6.8-rc1-mm1/arch/ppc64/kernel/prom.c	2004-07-14 20:08:13.000000000 -0500
+++ 2.6.8-rc1-mm1.new/arch/ppc64/kernel/prom.c	2004-07-14 19:57:01.000000000 -0500
@@ -942,7 +942,7 @@ static void __init prom_hold_cpus(unsign
 #ifdef CONFIG_SMP
 			cpu_set(cpuid, RELOC(cpu_available_map));
 			cpu_set(cpuid, RELOC(cpu_possible_map));
-			cpu_set(cpuid, RELOC(cpu_present_at_boot));
+			cpu_set(cpuid, RELOC(cpu_present_map));
 			if (reg == 0)
 				cpu_set(cpuid, RELOC(cpu_online_map));
 #endif /* CONFIG_SMP */
@@ -1044,7 +1044,7 @@ static void __init prom_hold_cpus(unsign
 				_systemcfg->processorCount++;
 				cpu_set(cpuid, RELOC(cpu_available_map));
 				cpu_set(cpuid, RELOC(cpu_possible_map));
-				cpu_set(cpuid, RELOC(cpu_present_at_boot));
+				cpu_set(cpuid, RELOC(cpu_present_map));
 #endif
 			} else {
 				prom_printf("... failed: %x\n", *acknowledge);
@@ -1056,7 +1056,7 @@ static void __init prom_hold_cpus(unsign
 			cpu_set(cpuid, RELOC(cpu_available_map));
 			cpu_set(cpuid, RELOC(cpu_possible_map));
 			cpu_set(cpuid, RELOC(cpu_online_map));
-			cpu_set(cpuid, RELOC(cpu_present_at_boot));
+			cpu_set(cpuid, RELOC(cpu_present_map));
 		}
 #endif
 next:
@@ -1071,7 +1071,7 @@ next:
 				    interrupt_server[i]);
 			if (_naca->smt_state) {
 				cpu_set(cpuid, RELOC(cpu_available_map));
-				cpu_set(cpuid, RELOC(cpu_present_at_boot));
+				cpu_set(cpuid, RELOC(cpu_present_map));
 				prom_printf("available\n");
 			} else {
 				prom_printf("not available\n");
diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c 2.6.8-rc1-mm1.new/arch/ppc64/kernel/smp.c
--- 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c	2004-07-14 20:08:20.000000000 -0500
+++ 2.6.8-rc1-mm1.new/arch/ppc64/kernel/smp.c	2004-07-14 20:31:38.000000000 -0500
@@ -60,7 +60,6 @@ unsigned long cache_decay_ticks;
 cpumask_t cpu_possible_map = CPU_MASK_NONE;
 cpumask_t cpu_online_map = CPU_MASK_NONE;
 cpumask_t cpu_available_map = CPU_MASK_NONE;
-cpumask_t cpu_present_at_boot = CPU_MASK_NONE;

 EXPORT_SYMBOL(cpu_online_map);
 EXPORT_SYMBOL(cpu_possible_map);
@@ -126,7 +125,7 @@ static int smp_iSeries_numProcs(void)
                 if (paca[i].lppaca.xDynProcStatus < 2) {
 			cpu_set(i, cpu_available_map);
 			cpu_set(i, cpu_possible_map);
-			cpu_set(i, cpu_present_at_boot);
+			cpu_set(i, cpu_present_map);
                         ++np;
                 }
         }
@@ -868,7 +867,7 @@ int __devinit __cpu_up(unsigned int cpu)
 	int c;

 	/* At boot, don't bother with non-present cpus -JSCHOPP */
-	if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu))
+	if (system_state == SYSTEM_BOOTING && !cpu_present(cpu))
 		return -ENOENT;

 	paca[cpu].prof_counter = 1;
diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/xics.c 2.6.8-rc1-mm1.new/arch/ppc64/kernel/xics.c
--- 2.6.8-rc1-mm1/arch/ppc64/kernel/xics.c	2004-07-14 20:08:13.000000000 -0500
+++ 2.6.8-rc1-mm1.new/arch/ppc64/kernel/xics.c	2004-07-14 19:55:33.000000000 -0500
@@ -548,7 +548,7 @@ nextnode:
 #ifdef CONFIG_SMP
 		for_each_cpu(i) {
 			/* FIXME: Do this dynamically! --RR */
-			if (!cpu_present_at_boot(i))
+			if (!cpu_present(i))
 				continue;
 			xics_per_cpu[i] = __ioremap((ulong)inodes[get_hard_smp_processor_id(i)].addr,
 						    (ulong)inodes[get_hard_smp_processor_id(i)].size,
diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/include/asm-ppc64/smp.h 2.6.8-rc1-mm1.new/include/asm-ppc64/smp.h
--- 2.6.8-rc1-mm1/include/asm-ppc64/smp.h	2004-07-14 20:08:18.000000000 -0500
+++ 2.6.8-rc1-mm1.new/include/asm-ppc64/smp.h	2004-07-14 20:11:51.000000000 -0500
@@ -42,15 +42,11 @@ extern void smp_message_recv(int, struct
  * possible:        CPU is a candidate to be made online
  * available:       CPU is candidate for the 'possible' pool
  *                  Used to get SMT threads started at boot time.
- * present_at_boot: CPU was available at boot time.  Used in DLPAR
- *                  code to handle special cases for processor start up.
  */
-extern cpumask_t cpu_present_at_boot;
 extern cpumask_t cpu_online_map;
 extern cpumask_t cpu_possible_map;
 extern cpumask_t cpu_available_map;

-#define cpu_present_at_boot(cpu) cpu_isset(cpu, cpu_present_at_boot)
 #define cpu_available(cpu)       cpu_isset(cpu, cpu_available_map)

 /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers.




From anton at samba.org  Thu Jul 15 16:29:36 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 15 Jul 2004 16:29:36 +1000
Subject: oops bringing up secondary cpus
In-Reply-To: <1089767063.4370.160.camel@nighthawk>
References: <1089767063.4370.160.camel@nighthawk>
Message-ID: <20040715062936.GA27715@krispykreme>


Hi,

> cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10]
>     pc: 000000000000beac
>     lr: 000000000000be88
>     sp: c0000002fff3bf90
>    msr: 8000000000001002
>    dar: c0000002fff3bf90
>  dsisr: a000000
>   current = 0xc0000002fff30920
>   paca    = 0xc00000000035a000
>     pid   = 0, comm = swapper

Looks like you are trying to access a virtual address with IR and DR
off (ie you are in real mode).
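
For reference, a small sketch of how that can be read off the msr value
in the dump.  The bit values are the usual 64-bit PowerPC MSR
definitions; the helper name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_SF  (1ULL << 63)  /* 64-bit mode */
#define MSR_ME  0x1000ULL     /* machine check enable */
#define MSR_IR  0x20ULL       /* instruction relocate (translation on) */
#define MSR_DR  0x10ULL       /* data relocate (translation on) */
#define MSR_RI  0x2ULL        /* recoverable interrupt */

/* 1 if both instruction and data translation are enabled, else 0. */
int msr_translation_on(uint64_t msr)
{
	return (msr & (MSR_IR | MSR_DR)) == (MSR_IR | MSR_DR);
}
```

The dumped value 8000000000001002 decodes to SF | ME | RI, so neither IR
nor DR is set, and a c000... effective address is not translated.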

Anton



From anton at samba.org  Fri Jul 16 00:52:47 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 16 Jul 2004 00:52:47 +1000
Subject: iseries profiling support
Message-ID: <20040715145246.GF27715@krispykreme>


Hi,

Is anyone using the dprofile facility on iseries these days? If not we
can remove some code.

Anton

===== arch/ppc64/kernel/asm-offsets.c 1.21 vs edited =====
--- 1.21/arch/ppc64/kernel/asm-offsets.c	Fri Jul  2 15:23:46 2004
+++ edited/arch/ppc64/kernel/asm-offsets.c	Tue Jul 13 22:40:07 2004
@@ -91,11 +91,6 @@
 	DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc));
 	DEFINE(PACAPROCENABLED, offsetof(struct paca_struct, proc_enabled));
 	DEFINE(PACADEFAULTDECR, offsetof(struct paca_struct, default_decr));
-	DEFINE(PACAPROFENABLED, offsetof(struct paca_struct, prof_enabled));
-	DEFINE(PACAPROFLEN, offsetof(struct paca_struct, prof_len));
-	DEFINE(PACAPROFSHIFT, offsetof(struct paca_struct, prof_shift));
-	DEFINE(PACAPROFBUFFER, offsetof(struct paca_struct, prof_buffer));
-	DEFINE(PACAPROFSTEXT, offsetof(struct paca_struct, prof_stext));
         DEFINE(PACA_EXGEN, offsetof(struct paca_struct, exgen));
         DEFINE(PACA_EXMC, offsetof(struct paca_struct, exmc));
         DEFINE(PACA_EXSLB, offsetof(struct paca_struct, exslb));
===== arch/ppc64/kernel/head.S 1.67 vs edited =====
--- 1.67/arch/ppc64/kernel/head.S	Mon Jul  5 20:27:11 2004
+++ edited/arch/ppc64/kernel/head.S	Tue Jul 13 22:41:02 2004
@@ -317,36 +317,11 @@
 label##_Iseries:							\
 	mtspr	SPRG1,r13;		/* save r13 */			\
 	EXCEPTION_PROLOG_ISERIES_1(PACA_EXGEN);				\
-	lbz	r10,PACAPROFENABLED(r13);				\
-	cmpwi	r10,0;							\
-	bne-	label##_Iseries_profile;				\
-label##_Iseries_prof_ret:						\
 	lbz	r10,PACAPROCENABLED(r13);				\
 	cmpwi	0,r10,0;						\
 	beq-	label##_Iseries_masked;					\
 	EXCEPTION_PROLOG_ISERIES_2;					\
 	b	label##_common;						\
-label##_Iseries_profile:						\
-	ld	r12,PACALPPACA+LPPACASRR1(r13);				\
-	andi.	r12,r12,MSR_PR;		/* Test if in kernel */		\
-	bne	label##_Iseries_prof_ret;				\
-	ld	r11,PACALPPACA+LPPACASRR0(r13);				\
-	ld	r12,PACAPROFSTEXT(r13);	/* _stext */			\
-	subf	r11,r12,r11;		/* offset into kernel */	\
-	lwz	r12,PACAPROFSHIFT(r13);					\
-	srd	r11,r11,r12;						\
-	lwz	r12,PACAPROFLEN(r13);	/* profile table length - 1 */	\
-	cmpd	r11,r12;		/* off end? */			\
-	ble	1f;							\
-	mr	r11,r12;		/* force into last entry */	\
-1:	sldi	r11,r11,2;		/* convert to offset */		\
-	ld	r12,PACAPROFBUFFER(r13);/* profile buffer */		\
-	add	r12,r12,r11;						\
-2:	lwarx	r11,0,r12;		/* atomically increment */	\
-	addi	r11,r11,1;						\
-	stwcx.	r11,0,r12;						\
-	bne-	2b;							\
-	b	label##_Iseries_prof_ret

 #ifdef DO_SOFT_DISABLE
 #define DISABLE_INTS				\
===== arch/ppc64/kernel/iSeries_setup.c 1.27 vs edited =====
--- 1.27/arch/ppc64/kernel/iSeries_setup.c	Fri Jul  2 15:23:46 2004
+++ edited/arch/ppc64/kernel/iSeries_setup.c	Tue Jul 13 22:38:29 2004
@@ -36,6 +36,7 @@
 #include <asm/pgtable.h>
 #include <asm/mmu_context.h>
 #include <asm/cputable.h>
+#include <asm/sections.h>

 #include <asm/time.h>
 #include "iSeries_setup.h"
@@ -54,17 +55,12 @@
 #include <asm/iSeries/mf.h>

 /* Function Prototypes */
-extern void abort(void);
 extern void ppcdbg_initialize(void);
-extern void iSeries_pcibios_init(void);
 extern void tce_init_iSeries(void);

 static void build_iSeries_Memory_Map(void);
 static void setup_iSeries_cache_sizes(void);
 static void iSeries_bolt_kernel(unsigned long saddr, unsigned long eaddr);
-extern void build_valid_hpte(unsigned long vsid, unsigned long ea, unsigned long pa,
-			     pte_t *ptep, unsigned hpteflags, unsigned bolted);
-static void iSeries_setup_dprofile(void);
 extern void iSeries_setup_arch(void);
 extern void iSeries_pci_final_fixup(void);

@@ -77,16 +73,10 @@
 static unsigned long tbFreqMhz;
 static unsigned long tbFreqMhzHundreths;

-unsigned long dprof_shift;
-unsigned long dprof_len;
-unsigned int *dprof_buffer;
-
 int piranha_simulator;

 int boot_cpuid;

-extern char _end[];
-
 extern int rd_size;		/* Defined in drivers/block/rd.c */
 extern unsigned long klimit;
 extern unsigned long embedded_sysmap_start;
@@ -366,30 +356,6 @@
 	}
 	*p = 0;

-        if (strstr(cmd_line, "dprofile=")) {
-                for (q = cmd_line; (p = strstr(q, "dprofile=")) != 0; ) {
-			unsigned long size, new_klimit;
-
-                        q = p + 9;
-                        if ((p > cmd_line) && (p[-1] != ' '))
-                                continue;
-                        dprof_shift = simple_strtoul(q, &q, 0);
-			dprof_len = (unsigned long)_etext -
-				(unsigned long)_stext;
-			dprof_len >>= dprof_shift;
-			size = ((dprof_len * sizeof(unsigned int)) +
-					(PAGE_SIZE-1)) & PAGE_MASK;
-			dprof_buffer = (unsigned int *)((klimit +
-						(PAGE_SIZE-1)) & PAGE_MASK);
-			new_klimit = ((unsigned long)dprof_buffer) + size;
-			lmb_reserve(__pa(klimit), (new_klimit-klimit));
-			klimit = new_klimit;
-			memset(dprof_buffer, 0, size);
-                }
-        }
-
-	iSeries_setup_dprofile();
-
 	mf_init();
 	mf_initialized = 1;
 	mb();
@@ -834,22 +800,6 @@
 		if (embedded_sysmap_end)
 			klimit = KERNELBASE + ((embedded_sysmap_end + 4095) &
 					0xfffffffffffff000);
-	}
-}
-
-static void iSeries_setup_dprofile(void)
-{
-	if (dprof_buffer) {
-		unsigned i;
-
-		for (i = 0; i < NR_CPUS; ++i) {
-			paca[i].prof_shift = dprof_shift;
-			paca[i].prof_len = dprof_len - 1;
-			paca[i].prof_buffer = dprof_buffer;
-			paca[i].prof_stext = (unsigned *)_stext;
-			mb();
-			paca[i].prof_enabled = 1;
-		}
 	}
 }

===== arch/ppc64/kernel/irq.c 1.62 vs edited =====
--- 1.62/arch/ppc64/kernel/irq.c	Fri Jul  2 15:23:46 2004
+++ edited/arch/ppc64/kernel/irq.c	Tue Jul 13 22:42:12 2004
@@ -803,18 +803,6 @@

 	*mask = new_value;

-#ifdef CONFIG_PPC_ISERIES
-	{
-		unsigned i;
-		for (i=0; i<NR_CPUS; ++i) {
-			if ( paca[i].prof_buffer && cpu_isset(i, new_value) )
-				paca[i].prof_enabled = 1;
-			else
-				paca[i].prof_enabled = 0;
-		}
-	}
-#endif
-
 	return full_count;
 }

===== arch/ppc64/kernel/misc.S 1.83 vs edited =====
--- 1.83/arch/ppc64/kernel/misc.S	Thu Jun 17 15:46:06 2004
+++ edited/arch/ppc64/kernel/misc.S	Tue Jul 13 23:59:50 2004
@@ -453,12 +449,6 @@
 	bdnz	00b
 	sync
 	blr
-
-_GLOBAL(abs)
-	cmpi	0,r3,0
-	bge	10f
-	neg	r3,r3
-10:	blr

 _GLOBAL(_get_PVR)
 	mfspr	r3,PVR
===== arch/ppc64/kernel/smp.c 1.71 vs edited =====
--- 1.71/arch/ppc64/kernel/smp.c	Fri Jul  2 15:23:46 2004
+++ edited/arch/ppc64/kernel/smp.c	Tue Jul 13 23:37:04 2004
@@ -586,10 +586,7 @@

 void smp_local_timer_interrupt(struct pt_regs * regs)
 {
-	if (!--(get_paca()->prof_counter)) {
-		update_process_times(user_mode(regs));
-		(get_paca()->prof_counter)=get_paca()->prof_multiplier;
-	}
+	update_process_times(user_mode(regs));
 }

 void smp_message_recv(int msg, struct pt_regs *regs)
@@ -825,8 +822,6 @@
 	/* Fixup boot cpu */
 	smp_store_cpu_info(boot_cpuid);
 	cpu_callin_map[boot_cpuid] = 1;
-	paca[boot_cpuid].prof_counter = 1;
-	paca[boot_cpuid].prof_multiplier = 1;

 #ifndef CONFIG_PPC_ISERIES
 	paca[boot_cpuid].next_jiffy_update_tb = tb_last_stamp = get_tb();
@@ -872,8 +867,6 @@
 	if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu))
 		return -ENOENT;

-	paca[cpu].prof_counter = 1;
-	paca[cpu].prof_multiplier = 1;
 	paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock;

 	if (!(cur_cpu_spec->cpu_features & CPU_FTR_SLB)) {
===== include/asm-ppc64/paca.h 1.19 vs edited =====
--- 1.19/include/asm-ppc64/paca.h	Fri Jul  2 15:23:46 2004
+++ edited/include/asm-ppc64/paca.h	Tue Jul 13 22:32:42 2004
@@ -99,21 +99,6 @@
 	 */
 	struct ItLpPaca lppaca __attribute__((aligned(0x80)));
 	struct ItLpRegSave reg_save;
-
-	/*
-	 * iSeries profiling support
-	 *
-	 * FIXME: do we still want this, or can we ditch it in favour
-	 * of oprofile?
-	 */
-	u32 *prof_buffer;		/* iSeries profiling buffer */
-	u32 *prof_stext;		/* iSeries start of kernel text */
-	u32 prof_multiplier;
-	u32 prof_counter;
-	u32 prof_shift;			/* iSeries shift for profile
-					 * bucket size */
-	u32 prof_len;			/* iSeries length of profile */
-	u8 prof_enabled;		/* 1=iSeries profiling enabled */
 };

 #endif /* _PPC64_PACA_H */

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Jul 16 00:57:08 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 16 Jul 2004 00:57:08 +1000
Subject: page align emergency stack
Message-ID: <20040715145708.GG27715@krispykreme>


Page align the emergency stack.

Signed-off-by: Anton Blanchard <anton at samba.org>

===== arch/ppc64/kernel/pacaData.c 1.15 vs edited =====
--- 1.15/arch/ppc64/kernel/pacaData.c	Fri Jul  2 15:23:46 2004
+++ edited/arch/ppc64/kernel/pacaData.c	Tue Jul 13 12:40:05 2004
@@ -30,7 +30,7 @@
 /* Stack space used when we detect a bad kernel stack pointer, and
  * early in SMP boots before relocation is enabled.
  */
-char emergency_stack[PAGE_SIZE * NR_CPUS];
+char emergency_stack[PAGE_SIZE * NR_CPUS] __page_aligned;

 /* The Paca is an array with one entry per processor.  Each contains an
  * ItLpPaca, which contains the information shared between the

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Jul 16 01:46:30 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 16 Jul 2004 01:46:30 +1000
Subject: [PATCH] remove multiple IRQ optimisation
Message-ID: <20040715154630.GH27715@krispykreme>


ppc64 has an optimisation where it loops on get_irq until there are no
more interrupts to be handled. Mark Hack notes that this optimisation
hardly ever hits and costs us a potentially expensive extra read of an
interrupt register every interrupt.

Also make do_IRQ void; the callers never use the return value.
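
To make the cost argument concrete, here is a minimal user-space sketch, not kernel code. It assumes a made-up fake_get_irq() standing in for ppc_md.get_irq, and counts how many times the "interrupt register" is read. All names below are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for the interrupt controller; not kernel code. */
static int pending;	/* interrupts currently queued */
static int reg_reads;	/* how many times the "register" was read */

/* Models ppc_md.get_irq(): returns an irq number, or -1 if none pending. */
static int fake_get_irq(void)
{
	reg_reads++;
	return pending > 0 ? pending-- : -1;
}

/* Old behaviour: keep reading until the controller reports nothing left. */
static int dispatch_loop(void)
{
	int handled = 0;

	while (fake_get_irq() >= 0)
		handled++;
	return handled;
}

/* New behaviour: one read and at most one dispatch per exception. */
static int dispatch_once(void)
{
	return fake_get_irq() >= 0 ? 1 : 0;
}
```

With a single pending interrupt, which the description above says is the common case, the looping version reads the register twice (once for the interrupt, once to learn there is nothing more) while the single-shot version reads it once.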

Signed-off-by: Anton Blanchard <anton at samba.org>

===== arch/ppc64/kernel/irq.c 1.62 vs edited =====
--- 1.62/arch/ppc64/kernel/irq.c	Fri Jul  2 15:23:46 2004
+++ edited/arch/ppc64/kernel/irq.c	Fri Jul 16 01:39:54 2004
@@ -589,7 +589,7 @@
 }

 #ifdef CONFIG_PPC_ISERIES
-int do_IRQ(struct pt_regs *regs)
+void do_IRQ(struct pt_regs *regs)
 {
 	struct paca_struct *lpaca;
 	struct ItLpQueue *lpq;
@@ -629,15 +629,13 @@
 		/* Signal a fake decrementer interrupt */
 		timer_interrupt(regs);
 	}
-
-	return 1; /* lets ret_from_int know we can do checks */
 }

 #else	/* CONFIG_PPC_ISERIES */

-int do_IRQ(struct pt_regs *regs)
+void do_IRQ(struct pt_regs *regs)
 {
-	int irq, first = 1;
+	int irq;

 	irq_enter();

@@ -656,25 +654,15 @@
 	}
 #endif

-	/*
-	 * Every arch is required to implement ppc_md.get_irq.
-	 * This function will either return an irq number or -1 to
-	 * indicate there are no more pending.  But the first time
-	 * through the loop this means there wasn't an IRQ pending.
-	 * The value -2 is for buggy hardware and means that this IRQ
-	 * has already been handled. -- Tom
-	 */
-	while ((irq = ppc_md.get_irq(regs)) >= 0) {
+	irq = ppc_md.get_irq(regs);
+
+	if (irq >= 0)
 		ppc_irq_dispatch_handler(regs, irq);
-		first = 0;
-	}
-	if (irq != -2 && first)
+	else
 		/* That's not SMP safe ... but who cares ? */
 		ppc_spurious_interrupts++;

 	irq_exit();
-
-	return 1; /* lets ret_from_int know we can do checks */
 }
 #endif	/* CONFIG_PPC_ISERIES */


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From nathanl at austin.ibm.com  Fri Jul 16 08:39:10 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Thu, 15 Jul 2004 17:39:10 -0500
Subject: Broken LPAR cpu bringup
In-Reply-To: <1089865083.13914.3.camel@biclops.private.network>
References: <1089850441.10000.62.camel@nighthawk>
	 <1089865083.13914.3.camel@biclops.private.network>
Message-ID: <1089931149.12866.5.camel@pants.austin.ibm.com>


On Wed, 2004-07-14 at 23:29, Nathan Lynch wrote:
> Does this fix it?  The problem is that ppc64 doesn't know about
> cpu_present_map.  The arch-independent code copies cpu_possible_map to
> cpu_present_map if the latter has not been modified by the arch bootup
> code -- as you have discovered, this breaks on LPAR.

Well, that's not the only problem, it seems.  With this one on top of
the previous patch, does it work?  I have tested it on a similar
configuration, but not Power4.

Nathan

diff -prauN -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c 2.6.8-rc1-mm1.2/arch/ppc64/kernel/smp.c
--- 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c	2004-07-15 17:27:54.000000000 -0500
+++ 2.6.8-rc1-mm1.2/arch/ppc64/kernel/smp.c	2004-07-15 16:59:46.000000000 -0500
@@ -373,7 +373,7 @@ static inline int __devinit smp_startup_

 	/* At boot time the cpus are already spinning in hold
 	 * loops, so nothing to do. */
- 	if (system_state == SYSTEM_BOOTING)
+ 	if (system_state < SYSTEM_RUNNING)
 		return 1;

 	pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu));
@@ -423,6 +423,11 @@ static inline void look_for_more_cpus(vo
 	maxcpus = ireg[num_addr_cell + num_size_cell];
 	/* DRENG need to account for threads here too */

+	if ((cur_cpu_spec->cpu_features & CPU_FTR_SMT) &&
+	    ((naca->smt_state == SMT_ON) || (naca->smt_state == SMT_DYNAMIC)))
+		maxcpus *= 2;
+
+
 	if (maxcpus > NR_CPUS) {
 		printk(KERN_WARNING
 		       "Partition configured for %d cpus, "


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From haveblue at us.ibm.com  Fri Jul 16 08:43:34 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Thu, 15 Jul 2004 15:43:34 -0700
Subject: Broken LPAR cpu bringup
In-Reply-To: <1089931149.12866.5.camel@pants.austin.ibm.com>
References: <1089850441.10000.62.camel@nighthawk>
	 <1089865083.13914.3.camel@biclops.private.network>
	 <1089931149.12866.5.camel@pants.austin.ibm.com>
Message-ID: <1089931414.32312.13.camel@nighthawk>


On Thu, 2004-07-15 at 15:39, Nathan Lynch wrote:
> On Wed, 2004-07-14 at 23:29, Nathan Lynch wrote:
> > Does this fix it?  The problem is that ppc64 doesn't know about
> > cpu_present_map.  The arch-independent code copies cpu_possible_map to
> > cpu_present_map if the latter has not been modified by the arch bootup
> > code -- as you have discovered, this breaks on LPAR.
>
> Well, that's not the only problem, it seems.  With this one on top of
> the previous patch, does it work?  I have tested it on a similar
> configuration, but not Power4.

They still get stuck on my tree.  I'll try on a plain tree in a little
bit.

-- Dave


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From haveblue at us.ibm.com  Fri Jul 16 08:57:54 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Thu, 15 Jul 2004 15:57:54 -0700
Subject: Broken LPAR cpu bringup
In-Reply-To: <1089931414.32312.13.camel@nighthawk>
References: <1089850441.10000.62.camel@nighthawk>
	 <1089865083.13914.3.camel@biclops.private.network>
	 <1089931149.12866.5.camel@pants.austin.ibm.com>
	 <1089931414.32312.13.camel@nighthawk>
Message-ID: <1089932274.32312.23.camel@nighthawk>


On Thu, 2004-07-15 at 15:43, Dave Hansen wrote:
> On Thu, 2004-07-15 at 15:39, Nathan Lynch wrote:
> > On Wed, 2004-07-14 at 23:29, Nathan Lynch wrote:
> > > Does this fix it?  The problem is that ppc64 doesn't know about
> > > cpu_present_map.  The arch-independent code copies cpu_possible_map to
> > > cpu_present_map if the latter has not been modified by the arch bootup
> > > code -- as you have discovered, this breaks on LPAR.
> >
> > Well, that's not the only problem, it seems.  With this one on top of
> > the previous patch, does it work?  I have tested it on a similar
> > configuration, but not Power4.
>
> They still get stuck on my tree.  I'll try on a plain tree in a little
> bit.

Nope, still oopses in the load_balance sched domains code:

cpu 0x1: Vector: 380 (Data SLB Access) at [c00000000ca8b420]
    pc: c000000000046644: .find_busiest_group+0x24c/0x470
    lr: c00000000004681c: .find_busiest_group+0x424/0x470
    sp: c00000000ca8b6a0
   msr: 8000000000001032
   dar: 10
  current = 0xc00000000ca80da0
  paca    = 0xc00000000033c900
    pid   = 0, comm = swapper

1:mon> t
[c00000000ca8b7c0] c0000000000469c4 .load_balance+0x5c/0x1c4
[c00000000ca8b880] c000000000046f6c .rebalance_tick+0x120/0x144
[c00000000ca8b930] c000000000057324 .update_process_times+0x44/0x60
[c00000000ca8b9c0] c000000000036f18 .smp_local_timer_interrupt+0x40/0x50
[c00000000ca8ba30] c000000000013050 .timer_interrupt+0xf4/0x394
[c00000000ca8bb10] c00000000000a2b4 Decrementer_common+0xb4/0x100
[c00000000ca8be90] c0000000000128d0 .cpu_idle+0x2c/0x44
[c00000000ca8bf00] c0u 0x0: :  .start_secondaryy+0xd8/0x11c

I wonder if the sched domains didn't get set up correctly.

-- Dave


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From haveblue at us.ibm.com  Fri Jul 16 09:03:24 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Thu, 15 Jul 2004 16:03:24 -0700
Subject: Broken LPAR cpu bringup
In-Reply-To: <1089931414.32312.13.camel@nighthawk>
References: <1089850441.10000.62.camel@nighthawk>
	 <1089865083.13914.3.camel@biclops.private.network>
	 <1089931149.12866.5.camel@pants.austin.ibm.com>
	 <1089931414.32312.13.camel@nighthawk>
Message-ID: <1089932604.32312.28.camel@nighthawk>


I'm just going to keep replying to myself.  It's fun.

I just noticed that, with your new patch, CPU0 also doesn't boot all the
way.  This happens about the time that init is exec'd off.  The CPU0
oops is garbled, so here is what I have.  I don't have any modules and
_end is at c00000000054b000, so it appears to have jumped off into lala
land.

Unrecoverable FP Unavailable Exception 800 at c00000000ca8aa90
s  msr: 8000000000009032
  current = 0xc00000003f8ac6f0
  paca    = 0xc00000000033c000
    pid   = 246, comm = hotplug

1:mon>
1:mon> c0
0:mon>
0:mon>
0:mon> r
R00 = c00000000ca8aa90   R16 = 0000000000000080
R01 = c00000000ca8a9e0   R17 = 0000000000000080
R02 = c0000000004a5470   R18 = 0000000000000080
R03 = 0000000000000002   R19 = c00000000b17d588
R04 = 0000000000000000   R20 = 0000000000000001
R05 = 0000000000000001   R21 = 0000000000000000
R06 = 0a00000000000000   R22 = 0000000000000000
R07 = 0000000000000000   R23 = 0000000000000000
R08 = c0000000002e3809   R24 = c00000000ca8a8b0
R09 = 0000000044282444   R25 = 0000000000000001
R10 = c00000000ca8a820   R26 = c00000000002bad0
R11 = 0000000048282424   R27 = 0000000084282444
R12 = 00000000000cdfd0   R28 = c00000000ca8a8b0
R13 = c00000000033c900   R29 = 0000000000000000
R14 = 0000000000000000   R30 = 5b00000000012634
R15 = c000000000330c38   R31 = 0000000000000000
pc  = c00000000ca8aa90
lr  = c00000000ca8aa90
msr = 8000000000009032   cr  = d00000048282424
ctr = 00000000000cdfd0   xer = 0000000000000000   trap =        a
0:mon> t
[link register   ] c00000000ca8aa90
[c00000000ca8a9e0] c00000000ca8aa80 (unreliable)
[c00000000ca8aa50] 0d00000000044508
[c00000000ca8aae0] 0a28244224282424
[c00000000ca8ab50] c00000000ca8ac00
[c00000000ca8ac00] 0000000000000000
SP (d0000000ca8ae20) is in userspace
0:mon>

-- Dave


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From haveblue at us.ibm.com  Fri Jul 16 09:12:41 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Thu, 15 Jul 2004 16:12:41 -0700
Subject: oops bringing up secondary cpus
In-Reply-To: <20040715062936.GA27715@krispykreme>
References: <1089767063.4370.160.camel@nighthawk>
	 <20040715062936.GA27715@krispykreme>
Message-ID: <1089933160.32312.32.camel@nighthawk>


On Wed, 2004-07-14 at 23:29, Anton Blanchard wrote:
> > cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10]
> >     pc: 000000000000beac
> >     lr: 000000000000be88
> >     sp: c0000002fff3bf90
> >    msr: 8000000000001002
> >    dar: c0000002fff3bf90
> >  dsisr: a000000
> >   current = 0xc0000002fff30920
> >   paca    = 0xc00000000035a000
> >     pid   = 0, comm = swapper
>
> Looks like you are trying to access a virtual address with IR and DR
> off (ie you are in real mode).

Thanks for taking a look.  First of all, where are IR and DR represented
in the register dump?

Also, does __secondary_start happen in real mode?  It appears to have
some virtual addresses (of the paca) handed in to it, and it says that
relocation is off.

-- Dave


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Jul 16 09:20:08 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 16 Jul 2004 09:20:08 +1000
Subject: oops bringing up secondary cpus
In-Reply-To: <1089933160.32312.32.camel@nighthawk>
References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk>
Message-ID: <20040715232008.GA17574@krispykreme>


> > > cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10]
> > >     pc: 000000000000beac
> > >     lr: 000000000000be88
> > >     sp: c0000002fff3bf90
> > >    msr: 8000000000001002
> > >    dar: c0000002fff3bf90
> > >  dsisr: a000000
> > >   current = 0xc0000002fff30920
> > >   paca    = 0xc00000000035a000
> > >     pid   = 0, comm = swapper
> >
> > Looks like you are trying to access a virtual address with IR and DR
> > off (ie you are in real mode).
>
> Thanks for taking a look.  First of all, where are IR and DR represented
> in the register dump?

The msr holds all that info, check out include/asm/processor.h:

#define MSR_IR_LG       5               /* Instruction Relocate */
#define MSR_DR_LG       4               /* Data Relocate */

> Also, does __secondary_start happen in real mode?  It appears to have
> some virtual addresses (of the paca) handed in to it, and it says that
> relocation is off.

Ahh sorry I forgot to tell you about a trick we use. In real mode the
top 2 bits are ignored, so looking at the DAR you are trying to access
0x2fff3bf90.

Now one reason we would take a 300 on a real address is if you tried to
touch memory outside the RMO region. Since the address is above 4GB I'm
guessing this is your problem.
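
The decoding described above can be checked with a small user-space sketch. It reuses the MSR_IR_LG and MSR_DR_LG values quoted from processor.h; the helper names (msr_relocation_on, real_mode_addr, above_4gb) are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IR_LG	5	/* Instruction Relocate */
#define MSR_DR_LG	4	/* Data Relocate */
#define MSR_IR		(1ULL << MSR_IR_LG)
#define MSR_DR		(1ULL << MSR_DR_LG)

/* Hypothetical helper: true if both instruction and data relocation are on. */
static int msr_relocation_on(uint64_t msr)
{
	return (msr & MSR_IR) && (msr & MSR_DR);
}

/* Hypothetical helper: in real mode the top 2 address bits are ignored. */
static uint64_t real_mode_addr(uint64_t ea)
{
	return ea & ~(3ULL << 62);
}

/* Hypothetical helper: is the address at or above the 4GB mark? */
static int above_4gb(uint64_t addr)
{
	return addr >= (1ULL << 32);
}
```

Fed the values from the oops, msr 8000000000001002 has neither bit set (real mode), and the DAR c0000002fff3bf90 masks down to 0x2fff3bf90, which is above 4GB.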

Since r1 is the same as the DAR it sounds like it's to do with
current_set being outside the RMO:

        LOADADDR(r3,current_set)
        sldi    r28,r24,3               /* get current_set[cpu#] */
        ldx     r1,r3,r28

etc.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From haveblue at us.ibm.com  Fri Jul 16 09:33:44 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Thu, 15 Jul 2004 16:33:44 -0700
Subject: oops bringing up secondary cpus
In-Reply-To: <20040715232008.GA17574@krispykreme>
References: <1089767063.4370.160.camel@nighthawk>
	 <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk>
	 <20040715232008.GA17574@krispykreme>
Message-ID: <1089934424.32312.37.camel@nighthawk>


On Thu, 2004-07-15 at 16:20, Anton Blanchard wrote:
> Ahh sorry I forgot to tell you about a trick we use. In real mode the
> top 2 bits are ignored, so looking at the DAR you are trying to access
> 0x2fff3bf90.

Aha!  That really clears it up.

> Now one reason we would take a 300 on a real address is if you tried to
> touch memory outside the RMO region. Since the address is above 4GB I'm
> guessing this is your problem.
>
> Since r1 is the same as the DAR it sounds like it's to do with
> current_set being outside the RMO:
>
>         LOADADDR(r3,current_set)
>         sldi    r28,r24,3               /* get current_set[cpu#] */
>         ldx     r1,r3,r28

OK, but current_set is the task info for the idle process for the new
CPU, right?  It looks to me like it's allocated like any old task using
copy_process() in smp_create_idle().  This should use kmalloc() like any
other task, and that's certainly not guaranteed to be in the RMO.  Did
this change recently?  Do we need to lmb_alloc() the idle task struct
for the secondary CPUs?

-- Dave


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Jul 16 09:53:21 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 16 Jul 2004 09:53:21 +1000
Subject: oops bringing up secondary cpus
In-Reply-To: <1089934424.32312.37.camel@nighthawk>
References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk> <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk>
Message-ID: <20040715235321.GB17574@krispykreme>


> OK, but current_set is the task info for the idle process for the new
> CPU, right?  It looks to me like it's allocated like any old task using
> copy_process() in smp_create_idle().  This should use kmalloc() like any
> other task, and that's certainly not guaranteed to be in the RMO.  Did
> this change recently?  Do we need to lmb_alloc() the idle task struct
> for the secondary CPUs?

To be honest I can't see where we touch r1 in __secondary_start, have you
made any changes to it? BTW the rfid at the end of __secondary_start is
where we go to virtual mode, so you only have to worry about the code
before that point.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From haveblue at us.ibm.com  Fri Jul 16 10:00:55 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Thu, 15 Jul 2004 17:00:55 -0700
Subject: oops bringing up secondary cpus
In-Reply-To: <20040715235321.GB17574@krispykreme>
References: <1089767063.4370.160.camel@nighthawk>
	 <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk>
	 <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk>
	 <20040715235321.GB17574@krispykreme>
Message-ID: <1089936055.32312.40.camel@nighthawk>


On Thu, 2004-07-15 at 16:53, Anton Blanchard wrote:
> > OK, but current_set is the task info for the idle process for the new
> > CPU, right?  It looks to me like it's allocated like any old task using
> > copy_process() in smp_create_idle().  This should use kmalloc() like any
> > other task, and that's certainly not guaranteed to be in the RMO.  Did
> > this change recently?  Do we need to lmb_alloc() the idle task struct
> > for the secondary CPUs?
>
> To be honest I can't see where we touch r1 in __secondary_start, have you
> made any changes to it? BTW the rfid at the end of __secondary_start is
> where we go to virtual mode, so you only have to worry about the code
> before that point.

No, I haven't modified it, but I'm running 2.6.7-mm2, which looks a bit
different than current mainline.  I'll update and try again.

-- Dave


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From david at gibson.dropbear.id.au  Fri Jul 16 10:47:29 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 16 Jul 2004 10:47:29 +1000
Subject: page align emergency stack
In-Reply-To: <20040715145708.GG27715@krispykreme>
References: <20040715145708.GG27715@krispykreme>
Message-ID: <20040716004729.GA24753@zax>


On Fri, Jul 16, 2004 at 12:57:08AM +1000, Anton Blanchard wrote:
>
> Page align the emergency stack.
>
> Signed-off-by: Anton Blanchard <anton at samba.org>

Do we actually need to do this?  I noted that the old guard pages were
page aligned, but couldn't see any particular reason for it, so I
didn't transfer the alignment to the new version.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Jul 16 11:11:41 2004
From: anton at samba.org (Anton Blanchard)
Date: Fri, 16 Jul 2004 11:11:41 +1000
Subject: page align emergency stack
In-Reply-To: <20040716004729.GA24753@zax>
References: <20040715145708.GG27715@krispykreme> <20040716004729.GA24753@zax>
Message-ID: <20040716011141.GC17574@krispykreme>


> Do we actually need to do this?  I noted that the old guard pages were
> page aligned, but couldn't see any particular reason for it, so I
> didn't transfer the alignment to the new version.

The ABI requires us to have 128 bit alignment, doesn't it? I'm thinking
about what would happen if we saved altivec registers to the stack.

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From david at gibson.dropbear.id.au  Fri Jul 16 11:39:01 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 16 Jul 2004 11:39:01 +1000
Subject: page align emergency stack
In-Reply-To: <20040716011141.GC17574@krispykreme>
References: <20040715145708.GG27715@krispykreme> <20040716004729.GA24753@zax> <20040716011141.GC17574@krispykreme>
Message-ID: <20040716013901.GC24753@zax>


On Fri, Jul 16, 2004 at 11:11:41AM +1000, Anton Blanchard wrote:
>
>
> > Do we actually need to do this?  I noted that the old guard pages were
> > page aligned, but couldn't see any particular reason for it, so I
> > didn't transfer the alignment to the new version.
>
> The ABI requires us to have 128 bit alignment, doesn't it? I'm thinking
> about what would happen if we saved altivec registers to the stack.

Ok, that's not quite the same thing as page alignment...
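
A minimal sketch of the distinction, assuming 4 KB pages and GCC's aligned attribute. The buffer and helper names are illustrative only and do not come from the kernel; the kernel's __page_aligned macro is not used here.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for this sketch */

/* 128-bit (16-byte) alignment: enough for vector register saves. */
static char vec_aligned_buf[SKETCH_PAGE_SIZE] __attribute__((aligned(16)));

/* Page alignment: a much stronger requirement that implies the above. */
static char page_aligned_buf[SKETCH_PAGE_SIZE] __attribute__((aligned(4096)));

/* Hypothetical helper: is p a multiple of align (align a power of two)? */
static int is_aligned(const void *p, unsigned long align)
{
	return ((uintptr_t)p & (align - 1)) == 0;
}
```

Every page-aligned address is also 16-byte aligned, but not the other way round: an address such as 0x1010 satisfies the 128-bit requirement without being page aligned.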

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From david at gibson.dropbear.id.au  Fri Jul 16 16:10:38 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 16 Jul 2004 16:10:38 +1000
Subject: [PPC64, TRIVIAL] Rename confusing locks in ras.c, rtasd.c
Message-ID: <20040716061038.GD26044@zax>


Andrew, please apply:

Both arch/ppc64/kernel/ras.c and arch/ppc64/kernel/rtasd.c have a
spinlock variable declared static called "log_lock".  Since the code
in these files interacts quite a lot, having two different locks with
identical names is manifestly confusing.  This patch renames both
locks to something a little clearer.  In the case of ras.c it also
renames the buffer protected by the lock to a more usefully greppable
name.

Signed-off-by: David Gibson <dwg at au.ibm.com>

Index: working-2.6/arch/ppc64/kernel/ras.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/ras.c
+++ working-2.6/arch/ppc64/kernel/ras.c
@@ -109,8 +109,8 @@
 }
 __initcall(init_ras_IRQ);

-static struct rtas_error_log log_buf;
-static spinlock_t log_lock = SPIN_LOCK_UNLOCKED;
+static struct rtas_error_log ras_log_buf;
+static spinlock_t ras_log_buf_lock = SPIN_LOCK_UNLOCKED;

 /*
  * Handle power subsystem events (EPOW).
@@ -126,17 +126,17 @@
 	unsigned int size = sizeof(log_entry);
 	int status = 0xdeadbeef;

-	spin_lock(&log_lock);
+	spin_lock(&ras_log_buf_lock);

 	status = rtas_call(rtas_token("check-exception"), 6, 1, NULL,
 			   0x500, irq,
 			   RTAS_EPOW_WARNING | RTAS_POWERMGM_EVENTS,
 			   1,  /* Time Critical */
-			   __pa(&log_buf), size);
+			   __pa(&ras_log_buf), size);

-	log_entry = log_buf;
+	log_entry = ras_log_buf;

-	spin_unlock(&log_lock);
+	spin_unlock(&ras_log_buf_lock);

 	udbg_printf("EPOW <0x%lx 0x%x>\n",
 		    *((unsigned long *)&log_entry), status);
@@ -165,17 +165,17 @@
 	int status = 0xdeadbeef;
 	int fatal;

-	spin_lock(&log_lock);
+	spin_lock(&ras_log_buf_lock);

 	status = rtas_call(rtas_token("check-exception"), 6, 1, NULL,
 			   0x500, irq,
 			   RTAS_INTERNAL_ERROR,
 			   1, /* Time Critical */
-			   __pa(&log_buf), size);
+			   __pa(&ras_log_buf), size);

-	log_entry = log_buf;
+	log_entry = ras_log_buf;

-	spin_unlock(&log_lock);
+	spin_unlock(&ras_log_buf_lock);

 	if ((status == 0) && (log_entry.severity >= SEVERITY_ERROR_SYNC))
 		fatal = 1;
Index: working-2.6/arch/ppc64/kernel/rtasd.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/rtasd.c
+++ working-2.6/arch/ppc64/kernel/rtasd.c
@@ -33,7 +33,7 @@
 #define DEBUG(A...)
 #endif

-static spinlock_t log_lock = SPIN_LOCK_UNLOCKED;
+static spinlock_t rtasd_log_lock = SPIN_LOCK_UNLOCKED;

 DECLARE_WAIT_QUEUE_HEAD(rtas_log_wait);

@@ -152,7 +152,7 @@
 	if (buf == NULL)
 		return;

-	spin_lock_irqsave(&log_lock, s);
+	spin_lock_irqsave(&rtasd_log_lock, s);

 	/* get length and increase count */
 	switch (err_type & ERR_TYPE_MASK) {
@@ -163,7 +163,7 @@
 		break;
 	case ERR_TYPE_KERNEL_PANIC:
 	default:
-		spin_unlock_irqrestore(&log_lock, s);
+		spin_unlock_irqrestore(&rtasd_log_lock, s);
 		return;
 	}

@@ -174,7 +174,7 @@
 	/* Check to see if we need to or have stopped logging */
 	if (fatal || no_more_logging) {
 		no_more_logging = 1;
-		spin_unlock_irqrestore(&log_lock, s);
+		spin_unlock_irqrestore(&rtasd_log_lock, s);
 		return;
 	}

@@ -199,12 +199,12 @@
 		else
 			rtas_log_start += 1;

-		spin_unlock_irqrestore(&log_lock, s);
+		spin_unlock_irqrestore(&rtasd_log_lock, s);
 		wake_up_interruptible(&rtas_log_wait);
 		break;
 	case ERR_TYPE_KERNEL_PANIC:
 	default:
-		spin_unlock_irqrestore(&log_lock, s);
+		spin_unlock_irqrestore(&rtasd_log_lock, s);
 		return;
 	}

@@ -247,24 +247,24 @@
 		return -ENOMEM;


-	spin_lock_irqsave(&log_lock, s);
+	spin_lock_irqsave(&rtasd_log_lock, s);
 	/* if it's 0, then we know we got the last one (the one in NVRAM) */
 	if (rtas_log_size == 0 && !no_more_logging)
 		nvram_clear_error_log();
-	spin_unlock_irqrestore(&log_lock, s);
+	spin_unlock_irqrestore(&rtasd_log_lock, s);


 	error = wait_event_interruptible(rtas_log_wait, rtas_log_size);
 	if (error)
 		goto out;

-	spin_lock_irqsave(&log_lock, s);
+	spin_lock_irqsave(&rtasd_log_lock, s);
 	offset = rtas_error_log_buffer_max * (rtas_log_start & LOG_NUMBER_MASK);
 	memcpy(tmp, &rtas_log_buf[offset], count);

 	rtas_log_start += 1;
 	rtas_log_size -= 1;
-	spin_unlock_irqrestore(&log_lock, s);
+	spin_unlock_irqrestore(&rtasd_log_lock, s);

 	error = copy_to_user(buf, tmp, count) ? -EFAULT : count;
 out:


--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From david at gibson.dropbear.id.au  Fri Jul 16 16:13:27 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 16 Jul 2004 16:13:27 +1000
Subject: RFC: Fix (another) bug in rtas logging
Message-ID: <20040716061327.GE26044@zax>


Have I missed anything here?  This patch certainly fixes the crash on
boot I've been getting on our p630.

The recent patch changing the rtas error logging had a bug.  It can
result in rtas_call() attempting to call kmalloc() too early (from
setup_arch() before the slab caches are initialized), leading to an
oops on boot.

I can see no reason that log_error() can't be called with the
rtas.lock still held, so this patch avoids the race addressed by the
original patch by simply moving the log_error() call under the lock.

Signed-off-by: David Gibson <dwg at au.ibm.com>

Index: working-2.6/arch/ppc64/kernel/rtas.c
===================================================================
--- working-2.6.orig/arch/ppc64/kernel/rtas.c
+++ working-2.6/arch/ppc64/kernel/rtas.c
@@ -114,7 +114,6 @@
 	int i, logit = 0;
 	unsigned long s;
 	struct rtas_args *rtas_args;
-	char * buff_copy = NULL;
 	int ret;

 	PPCDBG(PPCDBG_RTAS, "Entering rtas_call\n");
@@ -165,19 +164,12 @@

 	/* Log the error in the unlikely case that there was one. */
 	if (unlikely(logit)) {
-		buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
-		if (buff_copy) {
-			memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
-		}
+		log_error(rtas_err_buf, ERR_TYPE_RTAS_LOG, 0);
 	}

 	/* Gotta do something different here, use global lock for now... */
 	spin_unlock_irqrestore(&rtas.lock, s);

-	if (buff_copy) {
-		log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0);
-		kfree(buff_copy);
-	}
 	return ret;
 }


--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Sat Jul 17 00:44:19 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 16 Jul 2004 10:44:19 -0400
Subject: RFC: Fix (another) bug in rtas logging
In-Reply-To: <20040716061327.GE26044@zax>
References: <20040716061327.GE26044@zax>
Message-ID: <1089989058.2487.63.camel@gaston>


On Fri, 2004-07-16 at 02:13, David Gibson wrote:
> Have I missed anything here?  This patch certainly fixes the crash on
> boot I've been getting on our p630.
>
> The recent patch changing the rtas error logging had a bug.  It can
> result in rtas_call() attempting to call kmalloc() too early (from
> setup_arch() before the slab caches are initialized), leading to an
> oops on boot.
>
> I can see no reason that log_error() can't be called with the
> rtas.lock still held, so this patch avoids the race addressed by the
> original patch by simply moving the log_error() call under the lock.

No, log_error() can call ppc_md.log_error, which can call some nvram
stuff that itself calls RTAS, afaik.
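
To make the objection concrete, here is a minimal user-space sketch of the re-entrancy problem. It is not the kernel code: the lock is modelled as a plain flag, and every name in it (lock_held, nested_firmware_call, log_error_sketch, ERR_LOG_MAX) is invented for illustration. It only shows why logging has to happen after the lock is dropped when the logging backend may itself need the same lock.

```c
#include <stdbool.h>
#include <string.h>

#define ERR_LOG_MAX 64            /* hypothetical stand-in for RTAS_ERROR_LOG_MAX */

static bool lock_held;            /* models a non-recursive rtas.lock */
static int  nested_calls_ok;      /* nested firmware calls that got the lock */
static int  nested_calls_stuck;   /* nested calls that would have spun forever */
static char err_buf[ERR_LOG_MAX] = "firmware error text";

/* Stand-in for the nested firmware call a logging backend might make. */
static void nested_firmware_call(void)
{
	if (lock_held) {
		nested_calls_stuck++; /* a real spinlock would deadlock here */
		return;
	}
	lock_held = true;
	nested_calls_ok++;
	lock_held = false;
}

/* Stand-in for log_error(): the backend re-enters the firmware layer. */
static void log_error_sketch(const char *buf)
{
	(void)buf;
	nested_firmware_call();
}

/* Variant A: log while the lock is still held. */
static void call_logging_under_lock(void)
{
	lock_held = true;
	log_error_sketch(err_buf);
	lock_held = false;
}

/* Variant B: copy under the lock, log only after dropping it. */
static void call_logging_after_unlock(void)
{
	char copy[ERR_LOG_MAX];

	lock_held = true;
	memcpy(copy, err_buf, sizeof(copy));
	lock_held = false;
	log_error_sketch(copy);
}
```

Variant B uses a stack copy only to keep the sketch short; whether a static buffer, a stack buffer, or a late allocation is appropriate in the real code is a separate question that this sketch does not answer.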

Ben.




From benh at kernel.crashing.org  Sat Jul 17 04:49:14 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 16 Jul 2004 14:49:14 -0400
Subject: oops bringing up secondary cpus
In-Reply-To: <20040715235321.GB17574@krispykreme>
References: <1089767063.4370.160.camel@nighthawk>
	 <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk>
	 <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk>
	 <20040715235321.GB17574@krispykreme>
Message-ID: <1090003753.2487.86.camel@gaston>


On Thu, 2004-07-15 at 19:53, Anton Blanchard wrote:
> > OK, but current_set is the task info for the idle process for the new
> > CPU, right?  It looks to me like it's allocated like any old task using
> > copy_process() in smp_create_idle().  This should use kmalloc() like any
> > other task, and that's certainly not guaranteed to be in the RMO.  Did
> > this change recently?  Do we need to lmb_alloc() the idle task struct
> > for the secondary CPUs?
>
> To be honest I can't see where we touch r1 in __secondary_start, have you
> made any changes to it? BTW the rfid at the end of __secondary_start is
> where we go to virtual mode, so you only have to worry about the code
> before that point.

I remember fixing just that bug after paulus' rewrite went in: we were
touching the stack before enabling the MMU.

Ben.




From benh at kernel.crashing.org  Sat Jul 17 04:53:15 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 16 Jul 2004 14:53:15 -0400
Subject: oops bringing up secondary cpus
In-Reply-To: <1089936055.32312.40.camel@nighthawk>
References: <1089767063.4370.160.camel@nighthawk>
	 <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk>
	 <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk>
	 <20040715235321.GB17574@krispykreme>  <1089936055.32312.40.camel@nighthawk>
Message-ID: <1090003995.2487.88.camel@gaston>


Ok, here's the patch I sent that got in last month fixing that:

-----Forwarded Message-----
From: Benjamin Herrenschmidt <benh at kernel.crashing.org>
To: Andrew Morton <akpm at osdl.org>
Cc: Linus Torvalds <torvalds at osdl.org>, Paul Mackerras <paulus at samba.org>, Linux Kernel list <linux-kernel at vger.kernel.org>
Subject: [PATCH] ppc64: Fix booting on LPAR machines with more than 1 CPU
Date: Thu, 24 Jun 2004 11:28:39 -0500

Hi !

The exception rewrite contains a small bug that prevents bring-up of CPUs
on logically partitioned machines. The kernel is trying to zero the backlink
on the new stack while running with relocation disabled, which can
cause it to access an address outside of the region allowed in
real mode. This seems to be a leftover from previous code, as we also zero
the backlink later after turning on the MMU. This patch removes the
offending bit.

===== arch/ppc64/kernel/head.S 1.61 vs edited =====
--- 1.61/arch/ppc64/kernel/head.S	2004-06-17 00:46:06 -05:00
+++ edited/arch/ppc64/kernel/head.S	2004-06-24 11:25:41 -05:00
@@ -1833,8 +1833,6 @@
 	sldi	r28,r24,3		/* get current_set[cpu#] */
 	ldx	r1,r3,r28
 	addi	r1,r1,THREAD_SIZE-STACK_FRAME_OVERHEAD
-	li	r0,0
-	std	r0,0(r1)
 	std	r1,PACAKSAVE(r13)

 	ld	r3,PACASTABREAL(r13)    /* get raddr of segment table       */




From david at gibson.dropbear.id.au  Mon Jul 19 10:52:23 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 19 Jul 2004 10:52:23 +1000
Subject: RFC: Fix (another) bug in rtas logging
In-Reply-To: <1089989058.2487.63.camel@gaston>
References: <20040716061327.GE26044@zax> <1089989058.2487.63.camel@gaston>
Message-ID: <20040719005223.GB10537@zax>


On Fri, Jul 16, 2004 at 10:44:19AM -0400, Benjamin Herrenschmidt wrote:
>
> On Fri, 2004-07-16 at 02:13, David Gibson wrote:
> > Have I missed anything here?  This patch certainly fixes the crash on
> > boot I've been getting on our p630.
> >
> > The recent patch changing the rtas error logging had a bug.  It can
> > result in rtas_call() attempting to call kmalloc() too early (from
> > setup_arch() before the slab caches are initialized), leading to an
> > oops on boot.
> >
> > I can see no reason that log_error() can't be called with the
> > rtas.lock still held, so this patch avoids the race addressed by the
> > original patch by simply moving the log_error() call under the lock.
>
> No, log_error() can call ppc_md.log_error, which can call some nvram
> stuff that itself calls RTAS, afaik.

Ah, yes, good point.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson



From nfont at austin.ibm.com  Tue Jul 20 02:52:25 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Mon, 19 Jul 2004 11:52:25 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
Message-ID: <40FBFC49.1070804@austin.ibm.com>

This patch will allow you to turn off the reporting of rtas messages to
/var/log/messages.  There have been several situations in which machines
spew out too many rtas messages thus making debugging more difficult.
Being able to turn off the rtas messages reporting will help by not
having to wade through the tens or hundreds of (probably) unrelated rtas
events in /var/log/messages while trying to debug a system.

Anton or Paul, could you please review and push upstream, thanks.

Signed-off-by: Nathan Fontenot <nfont at austin.ibm.com>

--
Nathan F.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rtasmsgs.patch
Type: text/x-patch
Size: 3877 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040719/e15a5d44/attachment.bin 

From johnrose at austin.ibm.com  Tue Jul 20 04:30:01 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 19 Jul 2004 13:30:01 -0500
Subject: [PATCH] imalloc supersets
Message-ID: <1090261801.18793.12.camel@sinatra.austin.ibm.com>


The patch below implements the ability to query outstanding imalloc regions for
a given virtual address range.  The patch extends im_get_area() to allow a
region criterion of IM_REGION_SUPERSET.  For a particular "superset" virtual
address and size passed into im_get_area(), the function returns the first
outstanding region that is contained within this superset region.
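
As a rough illustration of the relationships involved, the following user-space sketch classifies two half-open address ranges [addr, addr + size). It is not the code from the patch: struct region and the region_* helpers are invented for this note, and the containment test here checks both ends of the range, which is an assumption of the sketch rather than a description of what im_region_is_subset() does.

```c
#include <stdbool.h>

/* Hypothetical region descriptor; the kernel code uses struct vm_struct. */
struct region {
	unsigned long addr;
	unsigned long size;
};

static unsigned long region_end(const struct region *r)
{
	return r->addr + r->size;   /* one past the last byte */
}

static bool region_identical(const struct region *a, const struct region *b)
{
	return a->addr == b->addr && a->size == b->size;
}

/* True when a lies entirely inside b and is not identical to it. */
static bool region_is_subset(const struct region *a, const struct region *b)
{
	return !region_identical(a, b) &&
	       a->addr >= b->addr && region_end(a) <= region_end(b);
}

/* True when a entirely contains b and is not identical to it. */
static bool region_is_superset(const struct region *a, const struct region *b)
{
	return region_is_subset(b, a);
}

/* True when the regions share bytes but neither contains the other. */
static bool region_overlaps(const struct region *a, const struct region *b)
{
	bool share = a->addr < region_end(b) && b->addr < region_end(a);

	return share && !region_identical(a, b) &&
	       !region_is_subset(a, b) && !region_is_superset(a, b);
}
```

With these definitions a PHB-sized range is a superset of each slot-sized range carved out of it, and only a range that straddles a boundary counts as an overlap.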

The patch also changes iounmap_explicit() to allow for the unmapping of all
regions that fit under a "superset".

This ability is necessary for PHB DLPAR.  For a PHB removal, the RPA requires
that all of its children slots already be dynamically removed.  Each of these
slot-level removals has fractured the imalloc region assigned to the PHB at
boot.  At PHB removal time, it is necessary to iounmap() the remaining
artifacts of the initial PHB region.
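
The "unmap whatever is left under the PHB range" step can be pictured with the small user-space sketch below. The fragment table, struct frag and both helper names are hypothetical; clearing the mapped flag merely stands in for iounmap(). The loop terminates because each released fragment stops matching the query, which is the property the real code would also need from its region list.

```c
#include <stddef.h>

#define MAX_REGIONS 8

/* Hypothetical record of one leftover fragment of a larger region. */
struct frag {
	unsigned long addr;
	unsigned long size;
	int mapped;
};

static struct frag frags[MAX_REGIONS];

/* Return the first mapped fragment fully inside [addr, addr + size), or NULL. */
static struct frag *first_fragment_within(unsigned long addr, unsigned long size)
{
	size_t i;

	for (i = 0; i < MAX_REGIONS; i++) {
		struct frag *f = &frags[i];

		if (f->mapped && f->addr >= addr &&
		    f->addr + f->size <= addr + size)
			return f;
	}
	return NULL;
}

/* Release every fragment under the superset; returns how many were released. */
static int unmap_fragments_within(unsigned long addr, unsigned long size)
{
	struct frag *f;
	int released = 0;

	while ((f = first_fragment_within(addr, size)) != NULL) {
		f->mapped = 0;   /* stands in for iounmap() of the fragment */
		released++;
	}
	return released;
}
```

A return value of zero corresponds to the "nothing to unmap" case, which the caller can treat as an error.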

Thanks-
John

Signed-off-by:  John Rose <johnrose at austin.ibm.com>

diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c sles9-rc5/arch/ppc64/mm/imalloc.c
--- sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c	2004-07-16 16:11:54.000000000 -0500
+++ sles9-rc5/arch/ppc64/mm/imalloc.c	2004-07-19 11:14:29.000000000 -0500
@@ -37,33 +37,51 @@ static int get_free_im_addr(unsigned lon
 	return 0;
 }

+/* Return whether the region described by v_addr and size is a subset
+ * of the region described by parent
+ */
+static inline int im_region_is_subset(unsigned long v_addr, unsigned long size,
+			struct vm_struct *parent)
+{
+	return (int) (v_addr >= (unsigned long) parent->addr &&
+	              v_addr < (unsigned long) parent->addr + parent->size &&
+	    	      size < parent->size);
+}
+
+/* Return whether the region described by v_addr and size is a superset
+ * of the region described by child
+ */
+static int im_region_is_superset(unsigned long v_addr, unsigned long size,
+		struct vm_struct *child)
+{
+	struct vm_struct parent;
+
+	parent.addr = (void *) v_addr;
+	parent.size = size;
+
+	return im_region_is_subset((unsigned long) child->addr, child->size,
+			&parent);
+}
+
 /* Return whether the region described by v_addr and size overlaps
  * the region described by vm.  Overlapping regions meet the
  * following conditions:
  * 1) The regions share some part of the address space
  * 2) The regions aren't identical
- * 3) The first region is not a subset of the second
+ * 3) Neither region is a subset of the other
  */
-static inline int im_region_overlaps(unsigned long v_addr, unsigned long size,
+static int im_region_overlaps(unsigned long v_addr, unsigned long size,
 		     struct vm_struct *vm)
 {
+	if (im_region_is_superset(v_addr, size, vm))
+		return 0;
+
 	return (v_addr + size > (unsigned long) vm->addr + vm->size &&
 		v_addr < (unsigned long) vm->addr + vm->size) ||
 	       (v_addr < (unsigned long) vm->addr &&
 		v_addr + size > (unsigned long) vm->addr);
 }

-/* Return whether the region described by v_addr and size is a subset
- * of the region described by vm
- */
-static inline int im_region_is_subset(unsigned long v_addr, unsigned long size,
-			struct vm_struct *vm)
-{
-	return (int) (v_addr >= (unsigned long) vm->addr &&
-	              v_addr < (unsigned long) vm->addr + vm->size &&
-	    	      size < vm->size);
-}
-
 /* Determine imalloc status of region described by v_addr and size.
  * Can return one of the following:
  * IM_REGION_UNUSED   -  Entire region is unallocated in imalloc space.
@@ -73,28 +91,37 @@ static inline int im_region_is_subset(un
  * IM_REGION_EXISTS -    Exact region already allocated in imalloc space.
  *                       vm will be assigned to a ptr to the existing imlist
  *                       member.
- * IM_REGION_OVERLAPS -  A portion of the region is already allocated in
- *                       imalloc space.
+ * IM_REGION_OVERLAPS -  Region overlaps an allocated region in imalloc space.
+ * IM_REGION_SUPERSET -  Region is a superset of a region that is already
+ *                       allocated in imalloc space.
  */
 static int im_region_status(unsigned long v_addr, unsigned long size,
 		    struct vm_struct **vm)
 {
 	struct vm_struct *tmp;

-	for (tmp = imlist; tmp; tmp = tmp->next)
-		if (v_addr < (unsigned long) tmp->addr + tmp->size)
+	for (tmp = imlist; tmp; tmp = tmp->next)
+		if (v_addr < (unsigned long) tmp->addr + tmp->size)
 			break;
-
+
 	if (tmp) {
 		if (im_region_overlaps(v_addr, size, tmp))
 			return IM_REGION_OVERLAP;

 		*vm = tmp;
-		if (im_region_is_subset(v_addr, size, tmp))
+		if (im_region_is_subset(v_addr, size, tmp)) {
+			/* Return with tmp pointing to superset */
 			return IM_REGION_SUBSET;
+		}
+		if (im_region_is_superset(v_addr, size, tmp)) {
+			/* Return with tmp pointing to first subset */
+			return IM_REGION_SUPERSET;
+		}
 		else if (v_addr == (unsigned long) tmp->addr &&
-		 	 size == tmp->size)
+		 	 size == tmp->size) {
+			/* Return with tmp pointing to exact region */
 			return IM_REGION_EXISTS;
+		}
 	}

 	*vm = NULL;
@@ -208,6 +235,10 @@ static struct vm_struct * __im_get_area(
 		tmp = split_im_region(req_addr, size, tmp);
 		break;
 	case IM_REGION_EXISTS:
+		/* Return requested region */
+		break;
+	case IM_REGION_SUPERSET:
+		/* Return first existing subset of requested region */
 		break;
 	default:
 		printk(KERN_ERR "%s() unexpected imalloc region status\n",
diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/init.c sles9-rc5/arch/ppc64/mm/init.c
--- sles9-rc5-vanilla/arch/ppc64/mm/init.c	2004-07-16 16:11:54.000000000 -0500
+++ sles9-rc5/arch/ppc64/mm/init.c	2004-07-19 13:08:56.000000000 -0500
@@ -392,9 +392,28 @@ void iounmap(void *addr)
 	return;
 }

+static int iounmap_subset_regions(void *addr, unsigned long size)
+{
+	struct vm_struct *area;
+
+	/* Check whether subsets of this region exist */
+	area = im_get_area((unsigned long) addr, size, IM_REGION_SUPERSET);
+	if (area == NULL)
+		return 1;
+
+	while (area) {
+		iounmap(area->addr);
+		area = im_get_area((unsigned long) addr, size,
+				IM_REGION_SUPERSET);
+	}
+
+	return 0;
+}
+
 int iounmap_explicit(void *addr, unsigned long size)
 {
 	struct vm_struct *area;
+	int rc;

 	/* addr could be in EEH or IO region, map it to IO region regardless.
 	 */
@@ -407,12 +426,17 @@ int iounmap_explicit(void *addr, unsigne
 	area = im_get_area((unsigned long) addr, size,
 			    IM_REGION_EXISTS | IM_REGION_SUBSET);
 	if (area == NULL) {
-		printk(KERN_ERR "%s() cannot unmap nonexistant range 0x%lx\n",
+		/* Determine whether subset regions exist.  If so, unmap */
+		rc = iounmap_subset_regions(addr, size);
+		if (rc) {
+			printk(KERN_ERR
+			       "%s() cannot unmap nonexistent range 0x%lx\n",
 				__FUNCTION__, (unsigned long) addr);
-		return 1;
+			return 1;
+		}
+	} else {
+		iounmap(area->addr);
 	}
-
-	iounmap(area->addr);
 	return 0;
 }




From cfriesen at nortelnetworks.com  Tue Jul 20 05:18:45 2004
From: cfriesen at nortelnetworks.com (Chris Friesen)
Date: Mon, 19 Jul 2004 15:18:45 -0400
Subject: trying to netboot G5 Xserve HPC node
Message-ID: <40FC1E95.2000501@nortelnetworks.com>


I've got an HPC G5 Xserve (the one with no CD drive), and I'm attempting to boot
linux on it.

I've got dhcpd V3.0pl1 running on a G5 desktop (/etc/dhcp.conf included below).

When I try and netboot the Xserve, it appears to issue a bootp request, the
server sends a response, but the IP address offer is never accepted.  What am I
doing wrong?  Is there any way to get the Xserve to show firmware boot logs on
the serial console?

The logs on the dhcp server look like this:

Jul 19 18:03:57 g5-2 dhcpd: dhcp.c(2072): non-null pointer
Jul 19 18:03:57 g5-2 dhcpd: DHCPDISCOVER from 00:0d:93:9b:a8:6c via eth0
Jul 19 18:03:57 g5-2 dhcpd: Received BootP request from Macintosh netboot client
Jul 19 18:03:57 g5-2 dhcpd: DHCPOFFER on 10.40.200.5 to 00:0d:93:9b:a8:6c via eth0
Jul 19 18:04:01 g5-2 dhcpd: dhcp.c(2072): non-null pointer
Jul 19 18:04:01 g5-2 dhcpd: DHCPDISCOVER from 00:0d:93:9b:a8:6c via eth0
Jul 19 18:04:01 g5-2 dhcpd: Received BootP request from Macintosh netboot client
Jul 19 18:04:01 g5-2 dhcpd: DHCPOFFER on 10.40.200.5 to 00:0d:93:9b:a8:6c via eth0

Any ideas?

Chris

dhcpd.conf file on server
=========================================================

# global dhcpd parameters
deny unknown-clients;                   #disallow unknown connections
ddns-update-style none;                 #disallow dynamic DNS updates

allow bootp;                            #allow bootp requests, thus the dhcp
                                         #server will act as a bootp server

# which network interface the server will listen on
subnet 10.40.200.0 netmask 255.255.255.0 { #the zeros indicate which range
}                                          #of addresses are allowed to connect

#set of parameters common to all clients
group {
option broadcast-address 10.40.200.255;
option subnet-mask 255.255.255.0;

host g5_3 {
         filename "yaboot";
         server-name "tester";
         next-server tester;
         hardware ethernet 00:0D:93:9B:A8:6C;
         fixed-address 10.40.200.5;
}

   #you may paste another "host" entry here for additional clients on this network

  }

=======================================================================



From nathanl at austin.ibm.com  Tue Jul 20 08:10:14 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Mon, 19 Jul 2004 17:10:14 -0500
Subject: Broken LPAR cpu bringup
In-Reply-To: <1089932274.32312.23.camel@nighthawk>
References: <1089850441.10000.62.camel@nighthawk>
	 <1089865083.13914.3.camel@biclops.private.network>
	 <1089931149.12866.5.camel@pants.austin.ibm.com>
	 <1089931414.32312.13.camel@nighthawk> <1089932274.32312.23.camel@nighthawk>
Message-ID: <1090275014.15991.18.camel@pants.austin.ibm.com>


On Thu, 2004-07-15 at 17:57, Dave Hansen wrote:
> > They still get stuck on my tree.  I'll try on a plain tree in a little
> > bit.
>
> Nope, still oopses in the load_balance sched domains code:

Ok, third (and hopefully final) attempt.  Try this without any of the
others I sent.  I think this change caused it:

http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.8-rc1/2.6.8-rc1-mm1/broken-out/detect-too-early-schedule-attempts.patch

Backing out the above change works, too.

Tested on a 4-way partition on an 8-way p650.
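
For anyone wondering why swapping "==" for "<" matters, here is a small stand-alone sketch. It assumes that the -mm change inserted an extra state between booting and running; the enum and the state name STATE_BOOTING_LATE are invented for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical state list; the middle entry stands in for whatever
 * intermediate boot state the -mm patch introduced. */
enum sys_state_sketch {
	STATE_BOOTING,
	STATE_BOOTING_LATE,   /* assumed new state, name invented here */
	STATE_RUNNING,
	STATE_HALT,
};

/* Old test: only matches the very first boot state. */
static bool booting_by_equality(enum sys_state_sketch s)
{
	return s == STATE_BOOTING;
}

/* New test: matches every state that precedes normal operation. */
static bool booting_by_ordering(enum sys_state_sketch s)
{
	return s < STATE_RUNNING;
}
```

Under that assumption, code that runs during the intermediate state no longer takes the "we are still booting" branch with the equality test, while the ordered comparison keeps working however many pre-running states exist.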

Nathan

diff -prauN -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c
--- 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c	2004-07-19 15:54:22.000000000 -0500
+++ 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c	2004-07-19 16:16:44.000000000 -0500
@@ -374,7 +374,7 @@ static inline int __devinit smp_startup_

 	/* At boot time the cpus are already spinning in hold
 	 * loops, so nothing to do. */
- 	if (system_state == SYSTEM_BOOTING)
+ 	if (system_state < SYSTEM_RUNNING)
 		return 1;

 	pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu));
@@ -868,7 +868,7 @@ int __devinit __cpu_up(unsigned int cpu)
 	int c;

 	/* At boot, don't bother with non-present cpus -JSCHOPP */
-	if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu))
+	if (system_state < SYSTEM_RUNNING && !cpu_present_at_boot(cpu))
 		return -ENOENT;

 	paca[cpu].prof_counter = 1;
@@ -902,7 +902,7 @@ int __devinit __cpu_up(unsigned int cpu)
 	 * use this value that I found through experimentation.
 	 * -- Cort
 	 */
-	if (system_state == SYSTEM_BOOTING)
+	if (system_state < SYSTEM_RUNNING)
 		for (c = 5000; c && !cpu_callin_map[cpu]; c--)
 			udelay(100);
 #ifdef CONFIG_HOTPLUG_CPU




From hollisb at us.ibm.com  Tue Jul 20 09:07:52 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Mon, 19 Jul 2004 18:07:52 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FBFC49.1070804@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>
Message-ID: <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com>


On Jul 19, 2004, at 11:52 AM, Nathan Fontenot wrote:

> This patch will allow you to turn off the reporting of rtas messages
> to /var/log/messages. There have been several situations in which
> machines spew out too many rtas messages thus making debugging more
> difficult. Being able to turn off the rtas messages reporting will
> help by not having to wade through the tens or hundreds of (probably)
> unrelated rtas events in /var/log/messages while trying to debug a
> system.

Weren't people talking about using netlink rather than /proc for
this? I think netlink involves creating a special type of socket and
reading from that, so it's not easily shell-scriptable, but...

--
Hollis Blanchard
IBM Linux Technology Center



From nfont at austin.ibm.com  Tue Jul 20 23:26:53 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Tue, 20 Jul 2004 08:26:53 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com> <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com>
Message-ID: <40FD1D9D.40406@austin.ibm.com>


Hollis Blanchard wrote:

> Weren't people talking about using netlink rather than /proc for
> this? I think netlink involves creating a special type of socket and
> reading from that, so it's not easily shell-scriptable, but...

There have been several discussions about dealing with rtas events.  I
believe the netlink issue dealt with getting rtas events to
rtas_errd in user space.  This patch just provides a switch to turn
on/off (via /proc or a boot parameter) the reporting of rtas events to
/var/log/messages and should have no effect on rtas events going to
rtas_errd or nvram.
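
In rough terms the behaviour is the following. This is only a user-space sketch with made-up names (rtas_msgs_enabled, set_rtas_msgs, report_rtas_event and the two counters), not the code in the attached patch: the switch gates the log output and nothing else.

```c
#include <stdio.h>

static int rtas_msgs_enabled = 1;  /* hypothetical switch, on by default */
static int console_lines;          /* lines sent to the system log */
static int daemon_events;          /* events handed to the user-space daemon */

/* Stand-in for flipping the switch via /proc or a boot parameter. */
static void set_rtas_msgs(int on)
{
	rtas_msgs_enabled = on ? 1 : 0;
}

/* Deliver one event: the daemon always sees it, the log only if enabled. */
static void report_rtas_event(const char *text)
{
	daemon_events++;
	if (rtas_msgs_enabled) {
		printf("RTAS: %s\n", text);
		console_lines++;
	}
}
```

Turning the switch off stops the printf-style reporting while the event count seen by the daemon path keeps increasing.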

Nathan F.



From moilanen at austin.ibm.com  Tue Jul 20 23:31:43 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 20 Jul 2004 08:31:43 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>
	<75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com>
Message-ID: <20040720083143.2726671e.moilanen@austin.ibm.com>


> > This patch will allow you to turn off the reporting of rtas messages
> > to /var/log/messages. There have been several situations in which
> > machines spew out too many rtas messages thus making debugging more
> > difficult. Being able to turn off the rtas messages reporting will
> > help by not having to wade through the tens or hundreds of (probably)
> > unrelated rtas events in /var/log/messages while trying to debug a
> > system.
>
> Weren't people talking about using netlink rather than /proc for
> this? I think netlink involves creating a special type of socket and
> reading from that, so it's not easily shell-scriptable, but...
>

Hollis,

I think this is an intermediate step in moving towards netlink.  Even
when we move towards netlink, we still want to give an option to have
the messages in /var/log/messages.

Nathan,

Do you think we should add a CONFIG option that controls whether this is
set by default, so the distros can decide to have it printed if they don't
package ELA by default?

Thanks,
Jake



From nfont at austin.ibm.com  Wed Jul 21 00:51:02 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Tue, 20 Jul 2004 09:51:02 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <20040720083143.2726671e.moilanen@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>	<75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> <20040720083143.2726671e.moilanen@austin.ibm.com>
Message-ID: <40FD3156.5080509@austin.ibm.com>


Jake Moilanen wrote:

> Do you think we should add a CONFIG option that controls whether this is
> set by default, so the distros can decide to have it printed if they don't
> package ELA by default?

I think the default (for now) should remain to print rtas events to
/var/log/messages.  There shouldn't be very many events for most systems.

I am working on an update for rtas event handling but am not sure when
this will be ready.  Getting this patch in will at least help with rtas
spam and debugging for now.

--
Nathan F.



From hollisb at us.ibm.com  Wed Jul 21 00:57:32 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 20 Jul 2004 09:57:32 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FD3156.5080509@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>	<75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> <20040720083143.2726671e.moilanen@austin.ibm.com> <40FD3156.5080509@austin.ibm.com>
Message-ID: <20E20F6F-DA5D-11D8-8BEE-000A95A0560C@us.ibm.com>


On Jul 20, 2004, at 9:51 AM, Nathan Fontenot wrote:

> Jake Moilanen wrote:
>
>> Do you think we should add a CONFIG option that controls whether this is
>> set by default, so the distros can decide to have it printed if they don't
>> package ELA by default?
>
> I think the default (for now) should remain to print rtas events to
> /var/log/messages.  There shouldn't be very many events for most
> systems.

This has not been my experience. I hope it would be true for most
production systems, but every (pre-release) machine I've worked on
recently has loads of them, often as frequent as one message per
second. These messages make our lives more difficult than they need to
be.

--
Hollis Blanchard
IBM Linux Technology Center



From greg at kroah.com  Wed Jul 21 15:58:40 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 21 Jul 2004 01:58:40 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FBFC49.1070804@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>
Message-ID: <20040721055840.GA18787@kroah.com>


On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote:
> +	entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL);
> +	if (entry)
> +		entry->proc_fops = &ppc_rtas_msg_operations;
> +

Please do not do this in proc.  It should be in sysfs (and the side
effect of that will be that your code is smaller...)

thanks,

greg k-h



From greg at kroah.com  Thu Jul 22 03:53:20 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 21 Jul 2004 13:53:20 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FEBC0E.7060005@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com>
Message-ID: <20040721175320.GA16704@kroah.com>


On Wed, Jul 21, 2004 at 01:55:10PM -0500, Nathan Fontenot wrote:
> Greg KH wrote:
> >On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote:
> >
> >>+	entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL);
> >>+	if (entry)
> >>+		entry->proc_fops = &ppc_rtas_msg_operations;
> >>+
> >
> >
> >Please do not do this in proc.  It should be in sysfs (and the side
> >effect of that will be that your code is smaller...)
> >
> >thanks,
> >
> >greg k-h
>
> Agreed, and glad you mentioned it.
>
> Would anyone have a problem with moving everything from /proc/ppc64/rtas
> to sysfs (/sys/firmware/rtas) ?

No objection from me, please do this, it is the proper place for these
files.

thanks,

greg k-h



From nfont at austin.ibm.com  Thu Jul 22 04:55:10 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Wed, 21 Jul 2004 13:55:10 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <20040721055840.GA18787@kroah.com>
References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com>
Message-ID: <40FEBC0E.7060005@austin.ibm.com>


Greg KH wrote:
> On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote:
>
>>+	entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL);
>>+	if (entry)
>>+		entry->proc_fops = &ppc_rtas_msg_operations;
>>+
>
>
> Please do not do this in proc.  It should be in sysfs (and the side
> effect of that will be that your code is smaller...)
>
> thanks,
>
> greg k-h

Agreed, and glad you mentioned it.

Would anyone have a problem with moving everything from /proc/ppc64/rtas
to sysfs (/sys/firmware/rtas) ?  It seems that sysfs is where all these
files should live anyway.

--
Nathan F.



From moilanen at austin.ibm.com  Thu Jul 22 05:50:19 2004
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Wed, 21 Jul 2004 14:50:19 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FEBC0E.7060005@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>
	<20040721055840.GA18787@kroah.com>
	<40FEBC0E.7060005@austin.ibm.com>
Message-ID: <20040721145019.0766d4ad.moilanen@austin.ibm.com>


>
> Would anyone have a problem with moving everything from /proc/ppc64/rtas
> to sysfs (/sys/firmware/rtas) ?  It seems that sysfs is where all these
> files should live anyway.
>

I'm not sure I agree that the files like error_log, firmware_flash
should be in sysfs.  Those should probably move to something like
netlink which we discussed previously.

Do we need to move over some of the files at all (e.g. volume)?

Thanks,
Jake



From johnrose at us.ibm.com  Thu Jul 22 05:56:45 2004
From: johnrose at us.ibm.com (John H Rose)
Date: Wed, 21 Jul 2004 14:56:45 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FEBC0E.7060005@austin.ibm.com>
Message-ID: <OF08FB76A9.59C52710-ON85256ED8.006D50DE-86256ED8.006D91C3@us.ibm.com>


I know this is the trendy idea, and I don't disagree with it, but keep in
mind that you'll break tools.  Update flash, for example. :)

John

-----------------------
John Rose
pSeries Linux Development
johnrose at austin.ibm.com
Office: 512-838-0298
Tieline: 678-0298

[ Nathan Fontenot <nfont at austin.ibm.com> writes: ]
>
> Greg KH wrote:
> > On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote:
> >
> >>+   entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL);
> >>+   if (entry)
> >>+         entry->proc_fops = &ppc_rtas_msg_operations;
> >>+
> >
> > Please do not do this in proc. It should be in sysfs (and the side
> > effect of that will be that your code is smaller...)
>
> Agreed, and glad you mentioned it.
>
> Would anyone have a problem with moving everything from
> /proc/ppc64/rtas to sysfs (/sys/firmware/rtas) ? It seems that sysfs
> is where all these files should live anyway.



From haveblue at us.ibm.com  Thu Jul 22 05:57:45 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Wed, 21 Jul 2004 12:57:45 -0700
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FEBC0E.7060005@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>
	 <20040721055840.GA18787@kroah.com>  <40FEBC0E.7060005@austin.ibm.com>
Message-ID: <1090439865.5862.5.camel@nighthawk>


On Wed, 2004-07-21 at 11:55, Nathan Fontenot wrote:
> Would anyone have a problem with moving everything from /proc/ppc64/rtas
> to sysfs (/sys/firmware/rtas) ?  It seems that sysfs is where all these
> files should live anyway.

Some of them use pretty broken interfaces, and I don't think they should
be perpetuated.  The biggest offender in my mind is
ppc64/rtas/rmo_buffer.  It exports physical addresses so that they can
be mapped with /dev/mem.  I think a more refined interface is
appropriate here.

As for the other things in that directory, could you do an ls on your
system and describe which ones you think need to be kept?

-- Dave



From linas at austin.ibm.com  Thu Jul 22 06:48:40 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 21 Jul 2004 15:48:40 -0500
Subject: Resending [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size
Message-ID: <20040721204840.GD13171@austin.ibm.com>


Resending,

Trying to work around a mail gateway bug; sorry if you are receiving
this message a second time.

(If you suspect that folks are not getting your email, contact me,
I think I now understand about half the problem.  You need to configure
your exim/postfix/sendmail to use a "smarthost".  Valid smarthosts
are austin.ibm.com, us.ibm.com, smtp.linux.ibm.com.  If you don't
use the smarthost, your email will go through pixpat.austin.ibm.com
which is apparently blackholed by spam blacklists).

--linas


----- Forwarded message from Mail Delivery System <Mailer-Daemon at bilge> -----
------ This is a copy of the message, including all the headers. ------

Return-path: <linas at bilge>
Received: from linas by bilge with local (Exim 3.36 #1 (Debian))
	id 1BmhA1-0002N0-00; Mon, 19 Jul 2004 18:03:05 -0500
Date: Mon, 19 Jul 2004 18:03:05 -0500
To: paulus at au1.ibm.com, paulus at samba.org
Cc: linuxppc64-dev at lists.linuxppc.org, antonb at samba.org,
	benh at kernel.crashing.org
Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size
Message-ID: <20040719230305.GH7544 at bilge>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="32u276st3Jlj2kUU"
Content-Disposition: inline
User-Agent: Mutt/1.5.6+20040523i
From: Linas Vepstas <linas at bilge>


--32u276st3Jlj2kUU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


Paul,

Please review & forward as appropriate.

Firmware expects the error log buffer to be of a very specific size,
but different versions of firmware apparently expect different
sizes; using the wrong size results in a painful, hard-to-debug
crash in firmware.  Benh provided a patch for this some months
ago, but apparently missed this code path.  This patch sets up
the log buffer size dynamically; it also fixes a bug with the
return code not being handled correctly.

Signed-off-by: Linas Vepstas <linas at linas.org>

--linas
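
A minimal sketch of the size check described above, written as plain
userspace C so it can be compiled on its own.  The names
sketch_pick_log_size, SKETCH_UNKNOWN_SERVICE and SKETCH_LOG_MAX are
hypothetical stand-ins, and the constant values are illustrative
assumptions rather than the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical stand-ins; the values are illustrative assumptions. */
#define SKETCH_UNKNOWN_SERVICE 0xFFFFFFFFu /* "firmware did not report a size" */
#define SKETCH_LOG_MAX         2048u       /* size of the statically allocated buffer */

/*
 * Pick the buffer length to pass to firmware.  Use the size firmware
 * reported when it is usable; fall back to the compile-time maximum
 * when the value is missing or larger than the buffer we own.
 */
uint32_t sketch_pick_log_size(uint32_t reported)
{
	if (reported == SKETCH_UNKNOWN_SERVICE || reported > SKETCH_LOG_MAX)
		return SKETCH_LOG_MAX;
	return reported;
}
```

The point of the clamp is that the length handed to firmware never
exceeds the buffer actually allocated, whatever firmware reports.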

--32u276st3Jlj2kUU
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="rtas-log-length.patch"

===== arch/ppc64/kernel/rtas.c 1.38 vs edited =====
--- 1.38/arch/ppc64/kernel/rtas.c	Wed Jul 14 15:27:37 2004
+++ edited/arch/ppc64/kernel/rtas.c	Mon Jul 19 17:11:51 2004
@@ -22,7 +22,6 @@
 #include <asm/rtas.h>
 #include <asm/semaphore.h>
 #include <asm/machdep.h>
-#include <asm/paca.h>
 #include <asm/page.h>
 #include <asm/param.h>
 #include <asm/system.h>
@@ -73,7 +72,6 @@
 	return tokp ? *tokp : RTAS_UNKNOWN_SERVICE;
 }

-
 /** Return a copy of the detailed error text associated with the
  *  most recent failed call to rtas.  Because the error text
  *  might go stale if there are any other intervening rtas calls,
@@ -84,18 +82,26 @@
 __fetch_rtas_last_error(void)
 {
 	struct rtas_args err_args, save_args;
+	u32 bufsz;
+
+	bufsz = rtas_token ("rtas-error-log-max");
+	if ((bufsz == RTAS_UNKNOWN_SERVICE) ||
+	    (bufsz > RTAS_ERROR_LOG_MAX)) {
+		printk (KERN_WARNING "RTAS: bad log buffer size %d\n", bufsz);
+		bufsz = RTAS_ERROR_LOG_MAX;
+	}

 	err_args.token = rtas_token("rtas-last-error");
 	err_args.nargs = 2;
 	err_args.nret = 1;
-	err_args.rets = (rtas_arg_t *)&(err_args.args[2]);

 	err_args.args[0] = (rtas_arg_t)__pa(rtas_err_buf);
-	err_args.args[1] = RTAS_ERROR_LOG_MAX;
+	err_args.args[1] = bufsz;
 	err_args.args[2] = 0;

 	save_args = rtas.args;
 	rtas.args = err_args;
+	rtas.args.rets = (rtas_arg_t *)&(rtas.args.args[2]);

 	PPCDBG(PPCDBG_RTAS, "\tentering rtas with 0x%lx\n",
 	       __pa(&err_args));
@@ -105,6 +111,7 @@
 	err_args = rtas.args;
 	rtas.args = save_args;

+	err_args.rets = (rtas_arg_t *)&(err_args.args[2]);
 	return err_args.rets[0];
 }


--32u276st3Jlj2kUU--

----- End forwarded message -----

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From linas at austin.ibm.com  Thu Jul 22 07:08:39 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 21 Jul 2004 16:08:39 -0500
Subject: Resend: Resend: [PATCH] [2.6] PPC64: log firmware errors during boot.
Message-ID: <20040721210839.GP13171@austin.ibm.com>


Resending another bounced email.

--linas

----- Forwarded message from Mail Delivery System <Mailer-Daemon at bilge> -----
------ This is a copy of the message, including all the headers. ------

Return-path: <linas at bilge>
Received: from linas by bilge with local (Exim 3.36 #1 (Debian))
	id 1BksOD-0008Cp-00; Wed, 14 Jul 2004 17:38:13 -0500
Date: Wed, 14 Jul 2004 17:38:13 -0500
To: paulus at au1.ibm.com, paulus at samba.org
Cc: linuxppc64-dev at lists.linuxppc.org
Subject: Resend: [PATCH] [2.6] PPC64: log firmware errors during boot.
Message-ID: <20040714223813.GX17333 at bilge>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="idY8LE8SD6/8DnRI"
Content-Disposition: inline
User-Agent: Mutt/1.5.6+20040523i
From: Linas Vepstas <linas at bilge>


--idY8LE8SD6/8DnRI
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


Paul,

Resending patch from 26 June that remains unapplied.  Topics related
to this patch were discussed, but none of the discussions affected this
patch directly, so I think the patch is still good to go ...

Repeat of the original text:

Firmware can report errors at any time, and not atypically during boot.
However, these reports were being discarded until rtasd comes up,
which occurs fairly late in the boot cycle.  As a result, firmware
errors during boot were being silently ignored.

Signed-off-by: Linas Vepstas <linas at linas.org>

--linas
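
A rough userspace sketch of the idea, assuming a simple in-memory
"boot log": firmware-type errors are always recorded there, whether or
not the logging daemon is up yet.  Every name and mask value here
(sketch_log_error, sketch_boot_log, SKETCH_TYPE_MASK,
SKETCH_TYPE_RTAS_LOG) is hypothetical and only illustrates the type
check; it is not the rtasd code.

```c
#include <string.h>

/* Hypothetical error-type encoding; the values are assumptions. */
#define SKETCH_TYPE_MASK     0x0Fu
#define SKETCH_TYPE_RTAS_LOG 0x02u

/* Stand-in for the boot log: a small, always-available buffer. */
char sketch_boot_log[256];
size_t sketch_boot_log_len;

/*
 * Record a firmware error in the boot log regardless of whether any
 * later consumer (daemon, nvram) is ready.  Returns 1 if something
 * was recorded, 0 otherwise.
 */
int sketch_log_error(unsigned int err_type, const char *msg)
{
	size_t room, n;

	if (msg == NULL)
		return 0;
	/* Only firmware (RTAS) log entries go to the boot log. */
	if ((err_type & SKETCH_TYPE_MASK) != SKETCH_TYPE_RTAS_LOG)
		return 0;

	room = sizeof(sketch_boot_log) - 1 - sketch_boot_log_len;
	n = strlen(msg);
	if (n > room)
		n = room;	/* truncate rather than overflow */

	memcpy(sketch_boot_log + sketch_boot_log_len, msg, n);
	sketch_boot_log_len += n;
	sketch_boot_log[sketch_boot_log_len] = '\0';
	return n > 0;
}
```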

--idY8LE8SD6/8DnRI
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="rtas-log-boot-msgs.patch"

--- arch/ppc64/kernel/rtasd.c.orig	2004-06-28 15:33:12.000000000 -0500
+++ arch/ppc64/kernel/rtasd.c	2004-06-29 18:51:31.000000000 -0500
@@ -57,6 +57,8 @@ volatile int error_log_cnt = 0;
  */
 static unsigned char logdata[RTAS_ERROR_LOG_MAX];

+static int get_eventscan_parms(void);
+
 /* To see this info, grep RTAS /var/log/messages and each entry
  * will be collected together with obvious begin/end.
  * There will be a unique identifier on the begin and end lines.
@@ -121,6 +123,9 @@ static int log_rtas_len(char * buf)
 		len += err->extended_log_length;
 	}

+	if (rtas_error_log_max == 0) {
+		get_eventscan_parms();
+	}
 	if (len > rtas_error_log_max)
 		len = rtas_error_log_max;

@@ -148,7 +153,6 @@ void pSeries_log_error(char *buf, unsign
 	int len = 0;

 	DEBUG("logging event\n");
-
 	if (buf == NULL)
 		return;

@@ -171,6 +175,13 @@ void pSeries_log_error(char *buf, unsign
 	if (!no_more_logging && !(err_type & ERR_FLAG_BOOT))
 		nvram_write_error_log(buf, len, err_type);

+	/* rtas errors can occur during boot, and we do want to capture
+	 * those somewhere, even if nvram isn't ready (why not?), and even
+	 * if rtasd isn't ready. Put them into the boot log, at least.  */
+	if ((err_type & ERR_TYPE_MASK) == ERR_TYPE_RTAS_LOG) {
+		printk_log_rtas(buf, len);
+	}
+
 	/* Check to see if we need to or have stopped logging */
 	if (fatal || no_more_logging) {
 		no_more_logging = 1;

--idY8LE8SD6/8DnRI--

----- End forwarded message -----

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From linas at austin.ibm.com  Thu Jul 22 07:10:01 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 21 Jul 2004 16:10:01 -0500
Subject: Resend: [PATCH] 2.6 ppc64 -- Yet another unbalanced pci_dev_get()/put()
Message-ID: <20040721211001.GQ13171@austin.ibm.com>


Forwarding more bounced email; this one had a patch attached.

--linas

----- Forwarded message from Mail Delivery System <Mailer-Daemon at bilge> -----
------ This is a copy of the message, including all the headers. ------

Return-path: <linas at bilge>
Received: from linas by bilge with local (Exim 3.36 #1 (Debian))
	id 1Bkrua-0008BY-00; Wed, 14 Jul 2004 17:07:36 -0500
Date: Wed, 14 Jul 2004 17:07:36 -0500
To: paulus at au1.ibm.com, paulus at samba.org
Cc: linuxppc64-dev at lists.linuxppc.org
Subject: [PATCH] 2.6 ppc64 -- Yet another unbalanced pci_dev_get()/put()
Message-ID: <20040714220736.GW17333 at bilge>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="TybLhxa8M7aNoW+V"
Content-Disposition: inline
User-Agent: Mutt/1.5.6+20040523i
From: Linas Vepstas <linas at bilge>


--TybLhxa8M7aNoW+V
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


Paul,

Please review & forward upstream ...

This patch fixes yet another set of mismatched pci_dev_get() /
pci_dev_put() calls.  The bug should affect graphics cards only;
no other card types will see a put() without a matching get().
The mismatch occurred because we ignore EEH errors on graphics
cards :(

There is a small timing window during which this bug may cause memory
corruption on machines under heavy memory pressure while a PCI
graphics device is being hot-removed...

--linas
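
A simplified, self-contained sketch of the balancing rule: take a
reference only if at least one address range was really inserted, and
drop it only if at least one entry was really removed.  The types and
functions below (sketch_dev, sketch_range, sketch_insert_device,
sketch_remove_device) and the fixed-size slot array are hypothetical;
the real code uses an rb-tree and pci_dev_get()/pci_dev_put().

```c
#include <stddef.h>

#define SKETCH_CACHE_SLOTS 8

struct sketch_dev   { int refcount; };
struct sketch_range { unsigned long start, end; };

/* Stand-in for the address cache: each used slot points at its owner. */
struct sketch_dev *sketch_cache[SKETCH_CACHE_SLOTS];

/* Insert usable ranges; take one reference only if something went in. */
int sketch_insert_device(struct sketch_dev *dev,
			 const struct sketch_range *r, size_t n)
{
	int inserted = 0;
	size_t i;
	int s;

	for (i = 0; i < n; i++) {
		if (r[i].start == 0 || r[i].end == 0)
			continue;	/* unusable range, nothing cached */
		for (s = 0; s < SKETCH_CACHE_SLOTS; s++) {
			if (sketch_cache[s] == NULL) {
				sketch_cache[s] = dev;
				inserted++;
				break;
			}
		}
	}
	if (inserted)
		dev->refcount++;	/* matches the put in remove */
	return inserted;
}

/* Remove all entries for dev; drop the reference only if any existed. */
int sketch_remove_device(struct sketch_dev *dev)
{
	int removed = 0;
	int s;

	for (s = 0; s < SKETCH_CACHE_SLOTS; s++) {
		if (sketch_cache[s] == dev) {
			sketch_cache[s] = NULL;
			removed++;
		}
	}
	if (removed)
		dev->refcount--;
	return removed;
}
```

A device that never made it into the cache then sees neither a get
nor a put, so its count cannot drop below what its other owners hold.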


--TybLhxa8M7aNoW+V
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="eeh-more-unbalanced-get-put.patch"

===== arch/ppc64/kernel/eeh.c 1.28 vs edited =====
--- 1.28/arch/ppc64/kernel/eeh.c	Mon Jul 12 18:29:16 2004
+++ edited/arch/ppc64/kernel/eeh.c	Wed Jul 14 15:40:47 2004
@@ -250,6 +250,7 @@
 static void __pci_addr_cache_insert_device(struct pci_dev *dev)
 {
 	struct device_node *dn;
+	int really_did_insert = 0;
 	int i;

 	dn = pci_device_to_OF_node(dev);
@@ -268,7 +269,6 @@
 #endif
 		return;
 	}
-	pci_dev_get(dev);

 	/* Walk resources on this device, poke them into the tree */
 	for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) {
@@ -282,6 +282,11 @@
 		if (start == 0 || ~start == 0 || end == 0 || ~end == 0)
 			 continue;
 		pci_addr_cache_insert(dev, start, end, flags);
+		really_did_insert = 1;
+	}
+
+	if (really_did_insert) {
+		pci_dev_get (dev);
 	}
 }

@@ -305,6 +310,7 @@
 static inline void __pci_addr_cache_remove_device(struct pci_dev *dev)
 {
 	struct rb_node *n;
+	int really_did_remove = 0;

 restart:
 	n = rb_first(&pci_io_addr_cache_root.rb_root);
@@ -315,11 +321,14 @@
 		if (piar->pcidev == dev) {
 			rb_erase(n, &pci_io_addr_cache_root.rb_root);
 			kfree(piar);
+			really_did_remove = 1;
 			goto restart;
 		}
 		n = rb_next(n);
 	}
-	pci_dev_put(dev);
+	if (really_did_remove) {
+	   pci_dev_put(dev);
+	}
 }

 /**

--TybLhxa8M7aNoW+V--

----- End forwarded message -----

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From nfont at austin.ibm.com  Thu Jul 22 07:11:38 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Wed, 21 Jul 2004 16:11:38 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <OF08FB76A9.59C52710-ON85256ED8.006D50DE-86256ED8.006D91C3@us.ibm.com>
References: <OF08FB76A9.59C52710-ON85256ED8.006D50DE-86256ED8.006D91C3@us.ibm.com>
Message-ID: <40FEDC0A.1030201@austin.ibm.com>


John H Rose wrote:
> I know this is the trendy idea, and I don't disagree with it, but keep
> in mind that you'll break tools. Update flash, for example. :)

We could always provide a link from /proc/ppc64/rtas to sysfs so we
don't break any tools.

Nathan F.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From greg at kroah.com  Thu Jul 22 07:11:49 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 21 Jul 2004 17:11:49 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <20040721145019.0766d4ad.moilanen@austin.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> <20040721145019.0766d4ad.moilanen@austin.ibm.com>
Message-ID: <20040721211144.GA17352@kroah.com>


On Wed, Jul 21, 2004 at 02:50:19PM -0500, Jake Moilanen wrote:
> >
> > Would anyone have a problem with moving everything from
> > /proc/ppc64/rtas to sysfs (/sys/firmware/rtas) ? It seems that sysfs
> > is where all these files should live anyway.
>
> I'm not sure I agree that the files like error_log, firmware_flash
> should be in sysfs. Those should probably move to something like
> netlink which we discussed previously.
>
> Do we need to move over some of the files at all (eg volume).

Care to show us all of the /proc/ppc64/rtas files, and what they are
used for?  It would help with the discussion for those of us without
easy access to such machines.

thanks,

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From greg at kroah.com  Thu Jul 22 07:13:14 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 21 Jul 2004 17:13:14 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <OF08FB76A9.59C52710-ON85256ED8.006D50DE-86256ED8.006D91C3@us.ibm.com>
References: <40FEBC0E.7060005@austin.ibm.com> <OF08FB76A9.59C52710-ON85256ED8.006D50DE-86256ED8.006D91C3@us.ibm.com>
Message-ID: <20040721211313.GB17352@kroah.com>


On Wed, Jul 21, 2004 at 02:56:45PM -0500, John H Rose wrote:
>
> I know this is the trendy idea, and I don't disagree with it, but keep
> in mind that you'll break tools. Update flash, for example. :)

Any flash or firmware files should be using the kernel firmware
interface, and not reinventing the wheel again.

And yes it is a trend, one that will continue and not go away.  Please
learn to follow it :)

thanks,

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From linas at austin.ibm.com  Thu Jul 22 07:13:16 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 21 Jul 2004 16:13:16 -0500
Subject: Another: [PATCH} 2.6 ppc64 rtas crash in kmalloc
Message-ID: <20040721211316.GS13171@austin.ibm.com>


Hi,

Another bounced patch email.

--linas

----- Forwarded message from Mail Delivery System <Mailer-Daemon at bilge> -----
------ This is a copy of the message, including all the headers. ------

Return-path: <linas at bilge>
Received: from linas by bilge with local (Exim 3.36 #1 (Debian))
	id 1BkWiX-0007gc-00; Tue, 13 Jul 2004 18:29:45 -0500
Date: Tue, 13 Jul 2004 18:29:45 -0500
To: paulus at au1.ibm.com, paulus at samba.org
Cc: linuxppc64-dev at lists.linuxppc.org
Subject: [PATCH} 2.6 ppc64 rtas crash in kmalloc
Message-ID: <20040713232945.GR17333 at bilge>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="UPT3ojh+0CqEDtpF"
Content-Disposition: inline
User-Agent: Mutt/1.5.6+20040523i
From: Linas Vepstas <linas at bilge>


--UPT3ojh+0CqEDtpF
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


Paul,

Please review and send upstream as appropriate.

The recent set of rtas patches I sent in (about a week ago)
introduced a bug: the possible use of kmalloc before the
VM subsystem is initialized.  This patch checks for the VM
subsystem being ready, and avoids the kmalloc if it's not.

People typically hit this bug during very early boot stages,
when EEH is being initialized, and an rtas_call fails, leading
to the use of kmalloc to get the error message.

--linas
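
A userspace sketch of the guard, assuming a simple readiness flag in
place of the kernel's check on its allocator caches.  The names
sketch_allocator_ready, sketch_err_buf and sketch_copy_error_log are
hypothetical, and malloc() stands in for kmalloc().

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_LOG_MAX 64

/* Hypothetical flag: set once the allocator may be used. */
int sketch_allocator_ready;

/* Stand-in for the static buffer firmware writes its error text into. */
char sketch_err_buf[SKETCH_LOG_MAX] = "firmware error text";

/*
 * Return a heap copy of the error buffer, or NULL when the allocator
 * is not up yet or the allocation fails.  Callers must cope with NULL
 * and free() any copy they receive.
 */
char *sketch_copy_error_log(void)
{
	char *copy;

	if (!sketch_allocator_ready)
		return NULL;	/* too early in boot: skip the copy */

	copy = malloc(SKETCH_LOG_MAX);
	if (copy)
		memcpy(copy, sketch_err_buf, SKETCH_LOG_MAX);
	return copy;
}
```

Skipping the copy loses the detailed error text for very early
failures, which is the trade-off made in exchange for not crashing.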


--UPT3ojh+0CqEDtpF
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="rtas-kmalloc-crash.patch"

===== arch/ppc64/kernel/rtas.c 1.37 vs edited =====
--- 1.37/arch/ppc64/kernel/rtas.c	Mon Jul  5 05:27:10 2004
+++ edited/arch/ppc64/kernel/rtas.c	Tue Jul 13 18:12:06 2004
@@ -165,9 +165,13 @@

 	/* Log the error in the unlikely case that there was one. */
 	if (unlikely(logit)) {
-		buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
-		if (buff_copy) {
-			memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
+		/* Can't call kmalloc if VM subsystem is not yet up. */
+		struct cache_sizes *csizep = malloc_sizes;
+		if (csizep->cs_cachep) {
+			buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
+			if (buff_copy) {
+				memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
+			}
 		}
 	}


--UPT3ojh+0CqEDtpF--

----- End forwarded message -----

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From linas at austin.ibm.com  Thu Jul 22 07:16:50 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 21 Jul 2004 16:16:50 -0500
Subject: Resending ... [PATCH] 2.6 ppc64 Janitor prom.c memleak; whitespace
Message-ID: <20040721211650.GU13171@austin.ibm.com>


----- Forwarded message from Mail Delivery System <Mailer-Daemon at bilge> -----
------ This is a copy of the message, including all the headers. ------

Return-path: <linas at bilge>
Received: from linas by bilge with local (Exim 3.36 #1 (Debian))
	id 1BkTcP-0007ZZ-00; Tue, 13 Jul 2004 15:11:13 -0500
Date: Tue, 13 Jul 2004 15:11:13 -0500
To: paulus at au1.ibm.com, paulus at samba.org
Cc: linuxppc64-dev at lists.linuxppc.org
Subject: [PATCH] 2.6 ppc64 Janitor prom.c memleak; whitespace
Message-ID: <20040713201112.GO17333 at bilge>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="GvXjxJ+pjyke8COw"
Content-Disposition: inline
User-Agent: Mutt/1.5.6+20040523i
From: Linas Vepstas <linas at bilge>


--GvXjxJ+pjyke8COw
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline


Paul,

Please review & forward as appropriate.

I had reason to review prom.c today, and saw one minor bug
(a very unlikely memory leak) and a lot of bad indentation
(8 spaces used where a tab should have been used).  Bad
whitespace drives me crazy because my vi has ts=3, not 8.
This patch fixes the mem leak (at the very bottom of the patch)
and lots of whitespace ick.  The mem leak is unlikely because
it requires other failures to happen before the memleak
happens.

Signed-off-by: Linas Vepstas <linas at linas.org>

--linas

p.s. My next patch will make actual functional changes to prom.c

--GvXjxJ+pjyke8COw
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="prom-whitespace.patch"

===== arch/ppc64/kernel/prom.c 1.88 vs edited =====
--- 1.88/arch/ppc64/kernel/prom.c	Fri Jul  2 00:23:46 2004
+++ edited/arch/ppc64/kernel/prom.c	Tue Jul 13 15:00:28 2004
@@ -199,16 +199,16 @@
 	unsigned long offset = reloc_offset();
 	struct prom_t *_prom = PTRRELOC(&prom);
 	va_list list;
-
+
 	_prom->args.service = ADDR(service);
 	_prom->args.nargs = nargs;
 	_prom->args.nret = nret;
-        _prom->args.rets = (prom_arg_t *)&(_prom->args.args[nargs]);
+	_prom->args.rets = (prom_arg_t *)&(_prom->args.args[nargs]);

-        va_start(list, nret);
+	va_start(list, nret);
 	for (i=0; i < nargs; i++)
 		_prom->args.args[i] = va_arg(list, prom_arg_t);
-        va_end(list);
+	va_end(list);

 	for (i=0; i < nret ;i++)
 		_prom->args.rets[i] = 0;
@@ -244,17 +244,17 @@
 static void __init prom_print_hex(unsigned long val)
 {
 	unsigned long offset = reloc_offset();
-        int i, nibbles = sizeof(val)*2;
-        char buf[sizeof(val)*2+1];
+	int i, nibbles = sizeof(val)*2;
+	char buf[sizeof(val)*2+1];
 	struct prom_t *_prom = PTRRELOC(&prom);

-        for (i = nibbles-1;  i >= 0;  i--) {
-                buf[i] = (val & 0xf) + '0';
-                if (buf[i] > '9')
-                    buf[i] += ('a'-'0'-10);
-                val >>= 4;
-        }
-        buf[nibbles] = '\0';
+	for (i = nibbles-1;  i >= 0;  i--) {
+		buf[i] = (val & 0xf) + '0';
+		if (buf[i] > '9')
+			buf[i] += ('a'-'0'-10);
+		val >>= 4;
+	}
+	buf[nibbles] = '\0';
 	call_prom("write", 3, 1, _prom->stdout, buf, nibbles);
 }

@@ -343,22 +343,22 @@
 {
 	phandle node;
 	char type[64];
-        unsigned long num_cpus = 0;
-        unsigned long offset = reloc_offset();
+	unsigned long num_cpus = 0;
+	unsigned long offset = reloc_offset();
 	struct prom_t *_prom = PTRRELOC(&prom);
-        struct naca_struct *_naca = RELOC(naca);
-        struct systemcfg *_systemcfg = RELOC(systemcfg);
+	struct naca_struct *_naca = RELOC(naca);
+	struct systemcfg *_systemcfg = RELOC(systemcfg);

 	/* NOTE: _naca->debug_switch is already initialized. */
 	prom_debug("prom_initialize_naca: start...\n");

 	_naca->pftSize = 0;	/* ilog2 of htab size.  computed below. */

-        for (node = 0; prom_next_node(&node); ) {
-                type[0] = 0;
+	for (node = 0; prom_next_node(&node); ) {
+		type[0] = 0;
 		prom_getprop(node, "device_type", type, sizeof(type));

-                if (!strcmp(type, RELOC("cpu"))) {
+		if (!strcmp(type, RELOC("cpu"))) {
 			num_cpus += 1;

 			/* We're assuming *all* of the CPUs have the same
@@ -404,7 +404,7 @@
 					_naca->pftSize = pft_size[1];
 				}
 			}
-                } else if (!strcmp(type, RELOC("serial"))) {
+		} else if (!strcmp(type, RELOC("serial"))) {
 			phandle isa, pci;
 			struct isa_reg_property reg;
 			union pci_range ranges;
@@ -435,7 +435,7 @@
 					((((unsigned long)ranges.pci64.phys_hi) << 32) |
 					 (ranges.pci64.phys_lo)) + reg.address;
 			}
-                }
+		}
 	}

 	if (_systemcfg->platform == PLATFORM_POWERMAC)
@@ -465,8 +465,8 @@
 	}

 	/* We gotta have at least 1 cpu... */
-        if ( (_systemcfg->processorCount = num_cpus) < 1 )
-                PROM_BUG();
+	if ( (_systemcfg->processorCount = num_cpus) < 1 )
+		PROM_BUG();

 	_systemcfg->physicalMemorySize = lmb_phys_mem_size();

@@ -496,21 +496,21 @@
 	_systemcfg->version.minor = SYSTEMCFG_MINOR;
 	_systemcfg->processor = _get_PVR();

-        prom_debug("systemcfg->processorCount       = 0x%x\n",
+	prom_debug("systemcfg->processorCount       = 0x%x\n",
 		   _systemcfg->processorCount);
-        prom_debug("systemcfg->physicalMemorySize   = 0x%x\n",
+	prom_debug("systemcfg->physicalMemorySize   = 0x%x\n",
 		   _systemcfg->physicalMemorySize);
-        prom_debug("naca->pftSize                   = 0x%x\n",
+	prom_debug("naca->pftSize                   = 0x%x\n",
 		   _naca->pftSize);
-        prom_debug("systemcfg->dCacheL1LineSize     = 0x%x\n",
+	prom_debug("systemcfg->dCacheL1LineSize     = 0x%x\n",
 		   _systemcfg->dCacheL1LineSize);
-        prom_debug("systemcfg->iCacheL1LineSize     = 0x%x\n",
+	prom_debug("systemcfg->iCacheL1LineSize     = 0x%x\n",
 		   _systemcfg->iCacheL1LineSize);
-        prom_debug("naca->serialPortAddr            = 0x%x\n",
+	prom_debug("naca->serialPortAddr            = 0x%x\n",
 		   _naca->serialPortAddr);
-        prom_debug("naca->interrupt_controller      = 0x%x\n",
+	prom_debug("naca->interrupt_controller      = 0x%x\n",
 		   _naca->interrupt_controller);
-        prom_debug("systemcfg->platform             = 0x%x\n",
+	prom_debug("systemcfg->platform             = 0x%x\n",
 		   _systemcfg->platform);
 	prom_debug("prom_initialize_naca: end...\n");
 }
@@ -547,36 +547,36 @@
 #ifdef DEBUG_PROM
 void prom_dump_lmb(void)
 {
-        unsigned long i;
-        unsigned long offset = reloc_offset();
+	unsigned long i;
+	unsigned long offset = reloc_offset();
 	struct lmb *_lmb  = PTRRELOC(&lmb);

-        prom_printf("\nprom_dump_lmb:\n");
-        prom_printf("    memory.cnt                  = 0x%x\n",
+	prom_printf("\nprom_dump_lmb:\n");
+	prom_printf("    memory.cnt		  = 0x%x\n",
 		    _lmb->memory.cnt);
-        prom_printf("    memory.size                 = 0x%x\n",
+	prom_printf("    memory.size		 = 0x%x\n",
 		    _lmb->memory.size);
-        for (i=0; i < _lmb->memory.cnt ;i++) {
-                prom_printf("    memory.region[0x%x].base       = 0x%x\n",
+	for (i=0; i < _lmb->memory.cnt ;i++) {
+		prom_printf("    memory.region[0x%x].base       = 0x%x\n",
 			    i, _lmb->memory.region[i].base);
-                prom_printf("                      .physbase = 0x%x\n",
+		prom_printf("		      .physbase = 0x%x\n",
 			    _lmb->memory.region[i].physbase);
-                prom_printf("                      .size     = 0x%x\n",
+		prom_printf("		      .size     = 0x%x\n",
 			    _lmb->memory.region[i].size);
-        }
+	}

-        prom_printf("\n    reserved.cnt                  = 0x%x\n",
+	prom_printf("\n    reserved.cnt		  = 0x%x\n",
 		    _lmb->reserved.cnt);
-        prom_printf("    reserved.size                 = 0x%x\n",
+	prom_printf("    reserved.size		 = 0x%x\n",
 		    _lmb->reserved.size);
-        for (i=0; i < _lmb->reserved.cnt ;i++) {
-                prom_printf("    reserved.region[0x%x\n].base       = 0x%x\n",
+	for (i=0; i < _lmb->reserved.cnt ;i++) {
+		prom_printf("    reserved.region[0x%x\n].base       = 0x%x\n",
 			    i, _lmb->reserved.region[i].base);
-                prom_printf("                      .physbase = 0x%x\n",
+		prom_printf("		      .physbase = 0x%x\n",
 			    _lmb->reserved.region[i].physbase);
-                prom_printf("                      .size     = 0x%x\n",
+		prom_printf("		      .size     = 0x%x\n",
 			    _lmb->reserved.region[i].size);
-        }
+	}
 }
 #endif /* DEBUG_PROM */

@@ -584,9 +584,9 @@
 {
 	phandle node;
 	char type[64];
-        unsigned long i, offset = reloc_offset();
+	unsigned long i, offset = reloc_offset();
 	struct prom_t *_prom = PTRRELOC(&prom);
-        struct systemcfg *_systemcfg = RELOC(systemcfg);
+	struct systemcfg *_systemcfg = RELOC(systemcfg);
 	union lmb_reg_property reg;
 	unsigned long lmb_base, lmb_size;
 	unsigned long num_regs, bytes_per_reg = (_prom->encode_phys_size*2)/8;
@@ -599,11 +599,11 @@
 	if (_systemcfg->platform == PLATFORM_POWERMAC)
 		bytes_per_reg = 12;

-        for (node = 0; prom_next_node(&node); ) {
-                type[0] = 0;
-                prom_getprop(node, "device_type", type, sizeof(type));
+	for (node = 0; prom_next_node(&node); ) {
+		type[0] = 0;
+		prom_getprop(node, "device_type", type, sizeof(type));

-                if (strcmp(type, RELOC("memory")))
+		if (strcmp(type, RELOC("memory")))
 			continue;

 		num_regs = prom_getprop(node, "reg", &reg, sizeof(reg))
@@ -651,7 +651,7 @@
 	struct rtas_t *_rtas = PTRRELOC(&rtas);
 	struct systemcfg *_systemcfg = RELOC(systemcfg);
 	ihandle prom_rtas;
-        u32 getprop_rval;
+	u32 getprop_rval;
 	char hypertas_funcs[4];

 	prom_debug("prom_instantiate_rtas: start...\n");
@@ -669,7 +669,7 @@

 		prom_getprop(prom_rtas, "rtas-size",
 			     &getprop_rval, sizeof(getprop_rval));
-	        _rtas->size = getprop_rval;
+		_rtas->size = getprop_rval;
 		prom_printf("instantiating rtas");
 		if (_rtas->size != 0) {
 			unsigned long rtas_region = RTAS_INSTANTIATE_MAX;
@@ -707,9 +707,9 @@
 			prom_printf(" done\n");
 		}

-        	prom_debug("rtas->base                = 0x%x\n", _rtas->base);
-        	prom_debug("rtas->entry               = 0x%x\n", _rtas->entry);
-        	prom_debug("rtas->size                = 0x%x\n", _rtas->size);
+		prom_debug("rtas->base		= 0x%x\n", _rtas->base);
+		prom_debug("rtas->entry	       = 0x%x\n", _rtas->entry);
+		prom_debug("rtas->size		= 0x%x\n", _rtas->size);
 	}
 	prom_debug("prom_instantiate_rtas: end...\n");
 }
@@ -744,7 +744,7 @@
 {
 	phandle node;
 	ihandle phb_node;
-        unsigned long offset = reloc_offset();
+	unsigned long offset = reloc_offset();
 	char compatible[64], path[64], type[64], model[64];
 	unsigned long i, table = 0;
 	unsigned long base, vbase, align;
@@ -853,21 +853,21 @@
 		/* Call OF to setup the TCE hardware */
 		if (call_prom("package-to-path", 3, 1, node,
 			      path, sizeof(path)-1) == PROM_ERROR) {
-                        prom_printf("package-to-path failed\n");
-                } else {
-                        prom_printf("opening PHB %s", path);
-                }
-
-                phb_node = call_prom("open", 1, 1, path);
-                if ( (long)phb_node <= 0) {
-                        prom_printf("... failed\n");
-                } else {
-                        prom_printf("... done\n");
-                }
-                call_prom("call-method", 6, 0, ADDR("set-64-bit-addressing"),
+			prom_printf("package-to-path failed\n");
+		} else {
+			prom_printf("opening PHB %s", path);
+		}
+
+		phb_node = call_prom("open", 1, 1, path);
+		if ( (long)phb_node <= 0) {
+			prom_printf("... failed\n");
+		} else {
+			prom_printf("... done\n");
+		}
+		call_prom("call-method", 6, 0, ADDR("set-64-bit-addressing"),
 			  phb_node, -1, minsize,
 			  (u32) base, (u32) (base >> 32));
-                call_prom("close", 1, 0, phb_node);
+		call_prom("close", 1, 0, phb_node);

 		table++;
 	}
@@ -910,15 +910,15 @@
 	unsigned int cpu_threads, hw_cpu_num;
 	int propsize;
 	extern void __secondary_hold(void);
-        extern unsigned long __secondary_hold_spinloop;
-        extern unsigned long __secondary_hold_acknowledge;
-        unsigned long *spinloop
+	extern unsigned long __secondary_hold_spinloop;
+	extern unsigned long __secondary_hold_acknowledge;
+	unsigned long *spinloop
 		= (void *)virt_to_abs(&__secondary_hold_spinloop);
-        unsigned long *acknowledge
+	unsigned long *acknowledge
 		= (void *)virt_to_abs(&__secondary_hold_acknowledge);
-        unsigned long secondary_hold
+	unsigned long secondary_hold
 		= virt_to_abs(*PTRRELOC((unsigned long *)__secondary_hold));
-        struct systemcfg *_systemcfg = RELOC(systemcfg);
+	struct systemcfg *_systemcfg = RELOC(systemcfg);
 	struct paca_struct *lpaca = PTRRELOC(&paca[0]);
 	struct prom_t *_prom = PTRRELOC(&prom);
 #ifdef CONFIG_SMP
@@ -962,12 +962,12 @@
 	prom_debug("    1) *acknowledge   = 0x%x\n", *acknowledge);
 	prom_debug("    1) secondary_hold = 0x%x\n", secondary_hold);

-        /* Set the common spinloop variable, so all of the secondary cpus
+	/* Set the common spinloop variable, so all of the secondary cpus
 	 * will block when they are awakened from their OF spinloop.
 	 * This must occur for both SMP and non SMP kernels, since OF will
 	 * be trashed when we move the kernel.
-         */
-        *spinloop = 0;
+	 */
+	*spinloop = 0;

 #ifdef CONFIG_HMT
 	for (i=0; i < NR_CPUS; i++) {
@@ -986,7 +986,7 @@
 		if (strcmp(type, RELOC("okay")) != 0)
 			continue;

-                reg = -1;
+		reg = -1;
 		prom_getprop(node, "reg", &reg, sizeof(reg));

 		path = (char *) mem;
@@ -1124,7 +1124,7 @@
 	ihandle prom_options = 0;
 	char option[9];
 	unsigned long offset = reloc_offset();
-        struct naca_struct *_naca = RELOC(naca);
+	struct naca_struct *_naca = RELOC(naca);
 	char found = 0;

 	if (strstr(RELOC(cmd_line), RELOC("smt-enabled="))) {
@@ -1253,10 +1253,10 @@
 	struct prom_t *_prom = PTRRELOC(&prom);
 	u32 val;

-        if (prom_getprop(_prom->chosen, "stdout", &val, sizeof(val)) <= 0)
-                prom_panic("cannot find stdout");
+	if (prom_getprop(_prom->chosen, "stdout", &val, sizeof(val)) <= 0)
+		prom_panic("cannot find stdout");

-        _prom->stdout = val;
+	_prom->stdout = val;
 }

 static int __init prom_find_machine_type(void)
@@ -1306,7 +1306,7 @@
 	ihandle ih;
 	int i, j;
 	unsigned long offset = reloc_offset();
-        struct prom_t *_prom = PTRRELOC(&prom);
+	struct prom_t *_prom = PTRRELOC(&prom);
 	char type[16], *path;
 	static unsigned char default_colors[] = {
 		0x00, 0x00, 0x00,
@@ -1403,7 +1403,7 @@
 				break;
 #endif /* CONFIG_LOGO_LINUX_CLUT224 */
 	}
-
+
 	return DOUBLEWORD_ALIGN(mem);
 }

@@ -1592,7 +1592,7 @@
 {
 	struct bi_record *first, *last;

-  	prom_debug("birec_verify: r6=0x%x\n", (unsigned long)bi_recs);
+	prom_debug("birec_verify: r6=0x%x\n", (unsigned long)bi_recs);
 	if (bi_recs != NULL)
 		prom_debug("  tag=0x%x\n", bi_recs->tag);

@@ -1601,7 +1601,7 @@

 	last = (struct bi_record *)(long)bi_recs->data[0];

-  	prom_debug("  last=0x%x\n", (unsigned long)last);
+	prom_debug("  last=0x%x\n", (unsigned long)last);
 	if (last != NULL)
 		prom_debug("  last_tag=0x%x\n", last->tag);

@@ -1609,7 +1609,7 @@
 		return NULL;

 	first = (struct bi_record *)(long)last->data[0];
-  	prom_debug("  first=0x%x\n", (unsigned long)first);
+	prom_debug("  first=0x%x\n", (unsigned long)first);

 	if ( first == NULL || first != bi_recs )
 		return NULL;
@@ -1681,9 +1681,9 @@
 	/* Init prom stdout device */
 	prom_init_stdout();

-  	prom_debug("klimit=0x%x\n", RELOC(klimit));
-  	prom_debug("offset=0x%x\n", offset);
-  	prom_debug("->mem=0x%x\n", RELOC(klimit) - offset);
+	prom_debug("klimit=0x%x\n", RELOC(klimit));
+	prom_debug("offset=0x%x\n", offset);
+	prom_debug("->mem=0x%x\n", RELOC(klimit) - offset);

 	/* check out if we have bi_recs */
 	_prom->bi_recs = prom_bi_rec_verify((struct bi_record *)r6);
@@ -1713,7 +1713,7 @@
 		copy_and_flush(0, KERNELBASE - offset, 0x100, 0);

 	/* Start storing things at klimit */
-      	mem = RELOC(klimit) - offset;
+	mem = RELOC(klimit) - offset;

 	/* Get the full OF pathname of the stdout device */
 	p = (char *) mem;
@@ -1728,9 +1728,9 @@
 	_prom->encode_phys_size = (getprop_rval == 1) ? 32 : 64;

 	/* Determine which cpu is actually running right _now_ */
-        if (prom_getprop(_prom->chosen, "cpu",
+	if (prom_getprop(_prom->chosen, "cpu",
 			 &prom_cpu, sizeof(prom_cpu)) <= 0)
-                prom_panic("cannot find boot cpu");
+		prom_panic("cannot find boot cpu");

 	cpu_pkg = call_prom("instance-to-package", 1, 1, prom_cpu);
 	prom_getprop(cpu_pkg, "reg", &getprop_rval, sizeof(getprop_rval));
@@ -1739,7 +1739,7 @@

 	RELOC(boot_cpuid) = 0;

-  	prom_debug("Booting CPU hw index = 0x%x\n", _prom->cpu);
+	prom_debug("Booting CPU hw index = 0x%x\n", _prom->cpu);

 	/* Get the boot device and translate it to a full OF pathname. */
 	p = (char *) mem;
@@ -1773,18 +1773,18 @@
 	if (_systemcfg->platform != PLATFORM_POWERMAC)
 		prom_instantiate_rtas();

-        /* Initialize some system info into the Naca early... */
-        prom_initialize_naca();
+	/* Initialize some system info into the Naca early... */
+	prom_initialize_naca();

 	smt_setup();

-        /* If we are on an SMP machine, then we *MUST* do the
-         * following, regardless of whether we have an SMP
-         * kernel or not.
-         */
+	/* If we are on an SMP machine, then we *MUST* do the
+	 * following, regardless of whether we have an SMP
+	 * kernel or not.
+	 */
 	prom_hold_cpus(mem);

-  	prom_debug("after basic inits, mem=0x%x\n", mem);
+	prom_debug("after basic inits, mem=0x%x\n", mem);
 #ifdef CONFIG_BLK_DEV_INITRD
 	prom_debug("initrd_start=0x%x\n", RELOC(initrd_start));
 	prom_debug("initrd_end=0x%x\n", RELOC(initrd_end));
@@ -1796,7 +1796,7 @@
 	RELOC(klimit) = mem + offset;

 	prom_debug("new klimit is\n");
-  	prom_debug("klimit=0x%x\n", RELOC(klimit));
+	prom_debug("klimit=0x%x\n", RELOC(klimit));
 	prom_debug(" ->mem=0x%x\n", mem);

 	lmb_reserve(0, __pa(RELOC(klimit)));
@@ -2082,7 +2082,7 @@
 		i = 0;
 		adr = (struct address_range *) mem_start;
 		while ((l -= sizeof(struct pci_reg_property)) >= 0) {
- 			if (!measure_only) {
+			if (!measure_only) {
 				adr[i].space = pci_addrs[i].addr.a_hi;
 				adr[i].address = pci_addrs[i].addr.a_lo;
 				adr[i].size = pci_addrs[i].size_lo;
@@ -2121,7 +2121,7 @@
 		i = 0;
 		adr = (struct address_range *) mem_start;
 		while ((l -= sizeof(struct reg_property32)) >= 0) {
- 			if (!measure_only) {
+			if (!measure_only) {
 				adr[i].space = 2;
 				adr[i].address = rp[i].address + base_address;
 				adr[i].size = rp[i].size;
@@ -2161,7 +2161,7 @@
 		i = 0;
 		adr = (struct address_range *) mem_start;
 		while ((l -= sizeof(struct reg_property32)) >= 0) {
- 			if (!measure_only) {
+			if (!measure_only) {
 				adr[i].space = 2;
 				adr[i].address = rp[i].address + base_address;
 				adr[i].size = rp[i].size;
@@ -2189,7 +2189,7 @@
 		i = 0;
 		adr = (struct address_range *) mem_start;
 		while ((l -= sizeof(struct reg_property)) >= 0) {
- 			if (!measure_only) {
+			if (!measure_only) {
 				adr[i].space = rp[i].space;
 				adr[i].address = rp[i].address;
 				adr[i].size = rp[i].size;
@@ -2218,7 +2218,7 @@
 		i = 0;
 		adr = (struct address_range *) mem_start;
 		while ((l -= rpsize) >= 0) {
- 			if (!measure_only) {
+			if (!measure_only) {
 				adr[i].space = 0;
 				adr[i].address = rp[naddrc - 1];
 				adr[i].size = rp[naddrc + nsizec - 1];
@@ -2296,7 +2296,7 @@
 	return mem_start;
 }

-/*
+/**
  * finish_device_tree is called once things are running normally
  * (i.e. with text and data mapped to the address they were linked at).
  * It traverses the device tree and fills in the name, type,
@@ -2347,7 +2347,7 @@
 	return 1;
 }

-/*
+/**
  * Work out the sense (active-low level / active-high edge)
  * of each interrupt from the device tree.
  */
@@ -2369,7 +2369,7 @@
 	}
 }

-/*
+/**
  * Construct and return a list of the device_nodes with a given name.
  */
 struct device_node *
@@ -2388,7 +2388,7 @@
 	return head;
 }

-/*
+/**
  * Construct and return a list of the device_nodes with a given type.
  */
 struct device_node *
@@ -2407,7 +2407,7 @@
 	return head;
 }

-/*
+/**
  * Returns all nodes linked together
  */
 struct device_node *
@@ -2424,7 +2424,7 @@
 	return head;
 }

-/* Checks if the given "compat" string matches one of the strings in
+/** Checks if the given "compat" string matches one of the strings in
  * the device's "compatible" property
  */
 int
@@ -2448,7 +2448,7 @@
 }


-/*
+/**
  * Indicates whether the root node has a given value in its
  * compatible property.
  */
@@ -2457,7 +2457,7 @@
 {
 	struct device_node *root;
 	int rc = 0;
-
+
 	root = of_find_node_by_path("/");
 	if (root) {
 		rc = device_is_compatible(root, compat);
@@ -2466,7 +2466,7 @@
 	return rc;
 }

-/*
+/**
  * Construct and return a list of the device_nodes with a given type
  * and compatible property.
  */
@@ -2489,7 +2489,7 @@
 	return head;
 }

-/*
+/**
  * Find the device_node with a given full_name.
  */
 struct device_node *
@@ -2904,7 +2904,7 @@
 	u32 *regs;
 	int err = 0;
 	phandle *ibm_phandle;
-
+
 	node->name = get_property(node, "name", 0);
 	node->type = get_property(node, "device_type", 0);

@@ -2957,26 +2957,26 @@
 		if (err) goto out;
 	}

-       /* now do the rough equivalent of update_dn_pci_info, this
-        * probably is not correct for phb's, but should work for
-	* IOAs and slots.
-        */
-
-       node->phb = parent->phb;
-
-       regs = (u32 *)get_property(node, "reg", 0);
-       if (regs) {
-               node->busno = (regs[0] >> 16) & 0xff;
-               node->devfn = (regs[0] >> 8) & 0xff;
-       }
+	/* now do the rough equivalent of update_dn_pci_info, this
+	 * probably is not correct for phb's, but should work for
+	 * IOAs and slots.
+	 */
+
+	node->phb = parent->phb;
+
+	regs = (u32 *)get_property(node, "reg", 0);
+	if (regs) {
+		node->busno = (regs[0] >> 16) & 0xff;
+		node->devfn = (regs[0] >> 8) & 0xff;
+	}

 	/* fixing up iommu_table */

 	if(strcmp(node->name, "pci") == 0 &&
-                get_property(node, "ibm,dma-window", NULL)) {
-                node->bussubno = node->busno;
-                iommu_devnode_init(node);
-        }
+		get_property(node, "ibm,dma-window", NULL)) {
+		node->bussubno = node->busno;
+		iommu_devnode_init(node);
+	}
 	else
 		node->iommu_table = parent->iommu_table;

@@ -3001,13 +3001,6 @@

 	memset(np, 0, sizeof(*np));

-	np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL);
-	if (!np->full_name) {
-		kfree(np);
-		return -ENOMEM;
-	}
-	strcpy(np->full_name, path);
-
 	np->properties = proplist;
 	OF_MARK_DYNAMIC(np);
 	of_node_get(np);
@@ -3016,6 +3009,13 @@
 		kfree(np);
 		return -EINVAL; /* could also be ENOMEM, though */
 	}
+
+	np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL);
+	if (!np->full_name) {
+		kfree(np);
+		return -ENOMEM;
+	}
+	strcpy(np->full_name, path);

 	if (0 != (err = of_finish_dynamic_node(np))) {
 		kfree(np);

--GvXjxJ+pjyke8COw--

----- End forwarded message -----



From greg at kroah.com  Thu Jul 22 07:19:30 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 21 Jul 2004 17:19:30 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FEDC0A.1030201@austin.ibm.com>
References: <OF08FB76A9.59C52710-ON85256ED8.006D50DE-86256ED8.006D91C3@us.ibm.com> <40FEDC0A.1030201@austin.ibm.com>
Message-ID: <20040721211930.GB18110@kroah.com>


On Wed, Jul 21, 2004 at 04:11:38PM -0500, Nathan Fontenot wrote:
> John H Rose wrote:
> >I know this is the trendy idea, and I don't disagree with it, but
> >keep in mind that you'll break tools. Update flash, for example. :)
>
> We could always provide a link from /proc/ppc64/rtas to sysfs so we
> don't break any tools.

How will you know where sysfs is mounted? :)

And no, putting stuff like that in the kernel will not be acceptable...

greg k-h



From linas at austin.ibm.com  Thu Jul 22 07:30:41 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 21 Jul 2004 16:30:41 -0500
Subject: Resending ... Re: [PATCH] [2.6] PPC64: log firmware errors during boot.
Message-ID: <20040721213041.GC13171@austin.ibm.com>


----- Forwarded message from Mail Delivery System <Mailer-Daemon at bilge> -----
------ This is a copy of the message, including all the headers. ------

Return-path: <linas at bilge>
Received: from linas by bilge with local (Exim 3.36 #1 (Debian))
	id 1Bj0dp-0005gG-00; Fri, 09 Jul 2004 14:02:37 -0500
Date: Fri, 9 Jul 2004 14:02:37 -0500
To: Jake Moilanen <moilanen at austin.ibm.com>
Cc: paulus at samba.org, linuxppc64-dev at lists.linuxppc.org,
	linux-kernel at vger.kernel.org
Subject: Re: [PATCH] [2.6] PPC64: log firmware errors during boot.
Message-ID: <20040709190237.GE17333 at bilge>
References: <20040629191046.Q21634 at forte.austin.ibm.com> <16610.39955.554139.858593 at cargo.ozlabs.ibm.com> <20040706084116.11ab7988.moilanen at austin.ibm.com> <20040708110337.N21634 at forte.austin.ibm.com> <20040708125545.41aae667.moilanen at austin.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20040708125545.41aae667.moilanen at austin.ibm.com>
User-Agent: Mutt/1.5.6+20040523i
From: Linas Vepstas <linas at bilge>

On Thu, Jul 08, 2004 at 12:55:45PM -0500, Jake Moilanen was heard to remark:
> On Thu, 8 Jul 2004 11:03:37 -0500
> linas at austin.ibm.com wrote:
>
> > Actually, they don't seem to be queued at all; when I turned on
> > logging earlier, a whole pile of messages popped out that weren't
> > visible before.
>
> If you are seeing a different pile of messages, I would imagine the
> messages that popped out are not coming from event-scan then.  Might be
> last_error, which messages do not come in from event-scan.  I can see
> them not being logged in early boot.

Yep.  They were due to EEH not being enabled on empty slots.
Apparently, they were being generated during boot for years,
but no one noticed them before, because we had this logging turned
off.  So once burned, twice shy... if we got the messages earlier,
we'd be less likely to overlook the root problem ...

> A problem I could see, is if we make an rtas call before the VM
> is up.  The kmalloc for last_error won't like that.

Ugh, yes, eeh is initialized very early, before the vm system is up.
I'll have to prepare a patch to check malloc_sizes->cs_cachep
for NULL, and not call kmalloc() if it is.  Is there a better way
to poll to find out if VM is up?

--linas
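
A minimal user-space sketch of the guard described above, assuming a
hypothetical allocator_ready flag in place of the malloc_sizes->cs_cachep
check, and malloc() in place of kmalloc(). The static fallback buffer is an
addition for illustration; the message only proposes skipping the
allocation.

```c
#include <stddef.h>
#include <stdlib.h>

#define ERR_BUF_SIZE 1024	/* hypothetical log buffer size */

int allocator_ready;		/* hypothetical: set once the allocator is usable */
char early_err_buf[ERR_BUF_SIZE];	/* static fallback for early boot */

/* Return a buffer for the error log.  Before the allocator is up this
 * hands out the static buffer instead of calling the allocator.
 */
char *get_err_buf(int *is_dynamic)
{
	char *buf;

	if (!allocator_ready) {
		*is_dynamic = 0;
		return early_err_buf;
	}

	buf = malloc(ERR_BUF_SIZE);	/* stands in for kmalloc() */
	*is_dynamic = (buf != NULL);
	return buf ? buf : early_err_buf;
}

/* Release a buffer obtained from get_err_buf(). */
void put_err_buf(char *buf, int is_dynamic)
{
	if (is_dynamic)
		free(buf);
}
```

The caller keeps the is_dynamic flag so the static buffer is never passed
to free().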

----- End forwarded message -----



From brking at us.ibm.com  Thu Jul 22 07:47:02 2004
From: brking at us.ibm.com (Brian King)
Date: Wed, 21 Jul 2004 16:47:02 -0500
Subject: Resending: [RFC/PATCH] Re: How to block pci config-reads during
 device self-test?
In-Reply-To: <20040721211506.GT13171@austin.ibm.com>
References: <20040721211506.GT13171@austin.ibm.com>
Message-ID: <40FEE456.8050307@us.ibm.com>


> There are two possible solutions.  (1) block pci config-space i/o
> during a BIST, and (2) let the device driver detect that the card
> has been off-lined, and so reset the pci controller chip.
>
> I'm starting to think (2) is better, but this patch implements (1).
> Opinions?  Comments?

I do like the idea of 2, but have a couple comments. Often times an eeh
error will lead to the adapter being replaced, so we may need to be a
little sensitive regarding error logging. I wouldn't want hitting this
window to result in the adapter being replaced. Also, just doing 2 and
not 1 will still allow a potential problem to exist. If userspace was
constantly reading pci config space of the adapter while resetting it,
it could result in continual eeh errors, until the device driver gives
up trying to bring the adapter back to life and offlines it.

> ===== arch/ppc64/kernel/pSeries_pci.c 1.39 vs edited =====
> --- 1.39/arch/ppc64/kernel/pSeries_pci.c	Mon Jul 12 18:29:16 2004
> +++ edited/arch/ppc64/kernel/pSeries_pci.c	Tue Jul 13 15:46:08 2004
> @@ -76,14 +76,22 @@
>
>  	addr = (dn->busno << 16) | (dn->devfn << 8) | where;
>  	buid = dn->phb->buid;
> +
> +	if (in_interrupt()) {
> +		/* Driver should unplug interrupts during BIST */
> +		BUG_ON (down_read_trylock(&dn->bist_lock) != 1);
> +	} else {
> +		down_read (&dn->bist_lock);
> +	}

This does not work. Device drivers often read pci config space from
interrupt context. If userspace happened to be reading at the same time,
we could hit the BUG_ON.

Also, like we discussed before, these routines are called with the
pci_lock spinlock held, so we can't be using semaphores. I think the
best way to effectively do what this patch is trying to do is to have
the read routine return ff's and have the write routine bit bucket the
data.
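
The read-returns-ff's, write-is-dropped behaviour suggested above can be
sketched as follows. This is only an illustration: struct cfg_dev and the
bist_in_progress flag are hypothetical, and nothing here uses the real
pSeries_pci.c or RTAS interfaces.

```c
#include <stdint.h>

/* Hypothetical per-device state; not a real kernel structure. */
struct cfg_dev {
	int bist_in_progress;	/* set by the driver around a self-test */
	uint32_t regs[64];	/* stand-in for 256 bytes of config space */
};

/* Read one 32-bit config word; returns all ones while BIST is running. */
uint32_t cfg_read32(const struct cfg_dev *dev, unsigned int where)
{
	if (dev->bist_in_progress)
		return 0xffffffffu;
	return dev->regs[(where >> 2) & 63];
}

/* Write one 32-bit config word; silently dropped while BIST is running. */
void cfg_write32(struct cfg_dev *dev, unsigned int where, uint32_t val)
{
	if (dev->bist_in_progress)
		return;		/* bit-bucket the data */
	dev->regs[(where >> 2) & 63] = val;
}
```

Neither path sleeps or takes a semaphore, so the sketch would be usable
with a spinlock held or from interrupt context.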


--
Brian King
eServer Storage I/O
IBM Linux Technology Center



From nfont at austin.ibm.com  Thu Jul 22 08:23:17 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Wed, 21 Jul 2004 17:23:17 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <1090439865.5862.5.camel@nighthawk>
References: <40FBFC49.1070804@austin.ibm.com>	 <20040721055840.GA18787@kroah.com>  <40FEBC0E.7060005@austin.ibm.com> <1090439865.5862.5.camel@nighthawk>
Message-ID: <40FEECD5.40907@austin.ibm.com>


Dave Hansen wrote:

> As for the other things in that directory, could you do an ls on your
> system and describe which ones you think need to be kept?
>
> -- Dave

Here we go...

# ls /proc/ppc64/rtas
.   clock      frequency  progress    rtasmsgs  volume
..  error_log  poweron    rmo_buffer  sensors

With the addition of rmo_buffer you can make rtas calls from user space.
Thus we may be able to get rid of the following items.

o clock - get/set platform time.
o frequency - set the frequency of the system speaker.
o progress - read/write data to the system led display
o volume - set the volume of the system speaker.
o poweron - set a future power-on time for the system

The remaining entries should be retained:

o rtasmsgs - turn on/off rtas event messages to /var/log/messages.
o error_log - rtas_errd uses this to get rtas events from the kernel.
o rmo_buffer - get a rmo region to make rtas calls from user space.
o sensors - displays the current state of system sensors

All of these are defined in arch/ppc64/kernel/rtas-proc.c, except for
error_log which is in arch/ppc64/kernel/rtasd.c if anyone cares to look.

Nathan F.



From ananth at in.ibm.com  Thu Jul 22 21:34:35 2004
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Thu, 22 Jul 2004 16:34:35 +0500
Subject: [PATCH] set tbl->it_type to TCE_PCI in iommu code
Message-ID: <20040722113435.GA12185@in.ibm.com>


Hi Anton,

Here is a patch that sets struct iommu_table->it_type to TCE_PCI in
pSeries_iommu.c. Please apply

Signed-off-by: Ananth N Mavinakayanahalli <ananth at in.ibm.com>


Thanks,
Ananth
--
Ananth Narayan
Linux Technology Center,
IBM Software Lab, INDIA


diff -Naurp temp/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c risc/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c
--- temp/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c	2004-07-18 09:57:47.000000000 +0500
+++ risc/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c	2004-07-22 16:24:53.000000000 +0500
@@ -211,6 +211,7 @@ static void iommu_table_setparms(struct
 	tbl->it_index = 0;
 	tbl->it_entrysize = sizeof(union tce_entry);
 	tbl->it_blocksize = 16;
+	tbl->it_type = TCE_PCI;
 }

 /*
@@ -246,6 +247,7 @@ static void iommu_table_setparms_lpar(st
 	tbl->it_index  = dma_window[0];
 	tbl->it_entrysize = sizeof(union tce_entry);
 	tbl->it_blocksize  = 16;
+	tbl->it_type = TCE_PCI;
 }




From sharada at in.ibm.com  Thu Jul 22 22:56:10 2004
From: sharada at in.ibm.com (R Sharada)
Date: Thu, 22 Jul 2004 18:26:10 +0530
Subject: identifying cpus (and smt threads)
Message-ID: <20040722125609.GA1466@in.ibm.com>


Hello,
	I am trying to remove some initialization code of the cpumask global
data structures from the prom_hold_cpus code and had a few queries.
The four cpu maps are the following:
- cpu_online_map,
- cpu_available_map,
- cpu_possible_map, and
- cpu_present_at_boot_map

This is my understanding of the prom_hold_cpus code and mask initialization:

Each device_tree node of type cpu represents a physical processor/cpu
For each valid processor identified by the firmware (obtained from looking at
the 'status' property of the device_tree cpu node)
	Get the 'reg' property that represents the physical cpu id
	Get the 'interrupt-server#s' property that is supposed to represent
something about the smt threads, if supported
	(what does this property actually mean? On my Power4 test machine, p630,
I found this property seems to be the same as the 'reg' property in the open
firmware prompt. Does it give the individual thread ids, if SMT is supported?).
I tried to find concrete information about this property in documents, but
failed to get it.
	Based on the propsize returned from the getprop for the property
'ibm,interrupt-server#s', the number of threads (CPU_MAX_THREADS is anyway 2) is
obtained.
	The code then differentiates between the primary thread of boot cpu and
non-boot cpu.
	If primary thread of boot cpu, then set the cpuid in all the 4 cpumasks.
	If primary thread of non-boot cpu, then make them spin on secondary-hold
and when they are out, set the cpuid in the cpu_available_map, cpu_possible_map
and cpu_present_at_boot. But they are not set in the cpu_online_map. Why is
that? Aren't they online if they have come out of the secondary_hold spinloop?
	Then the secondary threads cpuids are set in the cpu_available_map and
cpu_present_at_boot map.
	Later, in setup_system, the secondary threads are started through the
rtas-call 'start-cpu' if they are already in the cpu_available_map but not on
the cpu_possible_map.
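
The mask assignments described above can be restated as a small sketch.
It models the four maps as plain bitmasks and only encodes what this
message says; struct cpu_maps, mark_cpu() and threads_to_start() are
hypothetical names, not the prom.c code, and cpu ids are assumed to fit
in one unsigned long.

```c
/* The four maps from the message, modelled as plain bitmasks. */
struct cpu_maps {
	unsigned long online;
	unsigned long available;
	unsigned long possible;
	unsigned long present_at_boot;
};

enum cpu_kind { BOOT_PRIMARY, NONBOOT_PRIMARY, SECONDARY_THREAD };

/* Record cpu `id` in the maps the description lists for its kind. */
void mark_cpu(struct cpu_maps *m, unsigned int id, enum cpu_kind kind)
{
	unsigned long bit = 1UL << id;

	/* Every kind lands in available and present_at_boot. */
	m->available |= bit;
	m->present_at_boot |= bit;
	if (kind == SECONDARY_THREAD)
		return;

	/* Primary threads are also possible; only the boot cpu is online. */
	m->possible |= bit;
	if (kind == BOOT_PRIMARY)
		m->online |= bit;
}

/* Secondary threads awaiting start-cpu: available but not possible. */
unsigned long threads_to_start(const struct cpu_maps *m)
{
	return m->available & ~m->possible;
}
```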

A few questions here:
- why is there a cpu_present_at_boot map? It seems to be the same as the
cpu_available_map. Or did I miss something?
- second, I want to move the cpuset() calls for setting the cpumasks from the
prom_hold_cpus code to setup_system. However, I see that only the primary thread
of boot cpu is set in the cpu_online_map, and the primary threads of non-boot
cpus are not set in the cpu_online_map.
- When do the other primary threads of the non-boot cpus get set in the
online_map? Can I assume that they will all be up by the time setup_system is
executed and set all the primary threads in the cpu_online_map? If not, how can
I differentiate between the primary threads of the boot and non-boot cpus?

Thanks and Regards,
Sharada



From jschopp at austin.ibm.com  Fri Jul 23 01:58:13 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Thu, 22 Jul 2004 10:58:13 -0500
Subject: identifying cpus (and smt threads)
In-Reply-To: <20040722125609.GA1466@in.ibm.com>
References: <20040722125609.GA1466@in.ibm.com>
Message-ID: <40FFE415.6040909@austin.ibm.com>


> Each device_tree node of type cpu represents a physical processor/cpu
> For each valid processor identified by the firmware (obtained from looking at
> the 'status' property of the device_tree cpu node)
> 	Get the 'reg' property that represents the physical cpu id
> 	Get the 'interrupt-server#s' property that is supposed to represent
> something about the smt threads, if supported
> 	(what does this property actually mean? On my Power4 test machine, p630,
> I found this property seems to be the same as the 'reg' property in the open
> firmware prompt. Does it give the individual thread ids, if SMT is supported?).
> I tried to find concrete information about this property in documents, but
> failed to get it.

The document you are looking for is the RPA.  The interrupt-server#s
does give the id for each thread.  For single threaded systems (Power 4
and earlier) the reg and interrupt-server#s are the same.  For SMT
systems reg is the same as the first entry in interrupt-server#s.
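
As a rough illustration of the relationship just described, the thread
count can be derived from the property length, and reg compared against
the first entry. The sketch assumes each entry in the property is one
32-bit cell; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of hardware threads, derived from the property length in bytes.
 * Assumes one 32-bit cell per thread id.
 */
size_t thread_count(size_t propsize)
{
	return propsize / sizeof(uint32_t);
}

/* Check that reg equals the first entry in the interrupt server list. */
int reg_matches_first_server(uint32_t reg, const uint32_t *servers,
			     size_t propsize)
{
	return thread_count(propsize) > 0 && servers[0] == reg;
}
```

On a single-threaded cpu the list has one entry, so the count is 1 and
that entry is the same value as reg.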

> 	Based on the propsize returned from the getprop for the property
> ibm,interrupt-server#s', the number of threads (CPU_MAX_THREADS is anyways 2) is
> obtained.
> 	The code then differentiates between the primary thread of boot cpu and
> non-boot cpu.
> 	If primary thread of boot cpu, then set the cpuid in all the 4 cpumasks.
> 	If primary thread of non-boot cpu, then make them spin on secondary-hold
> and when they are out, set the cpuid in the cpu_avaiable_map, cpu_possible_map
> and cpu_present_at_boot. But they are not set in the cpu_online_map. Why is
> that? Aren't they online if they have come out of the secondary_hold spinloop?

Some cpus may be visible that are not assigned to the partition in a
LPAR system.  They may be in use by another partition or be otherwise
utilized.  These cannot be brought online at this point.

> 	Then the secondary threads cpuids are set in the cpu_available_map and
> cpu_present_at_boot map.
> 	Later, in setup_system, the secondary threads are started through the
> rtas-call 'start-cpu' if they are already in the cpu_available_map but not on
> the cpu_possible_map.
>
> A few questions here:
> - why is there a cpu_present_at_boot map? It seems to be the same as the
> cpu_available_map. Or did I miss something?

It is largely historical.  However, it is very possible to add cpus to
the available map that were not present at boot.  This is the case on
Power5 with shared processor enabled for instance.

> - second, I want to move the cpuset() calls for setting the cpumasks from the
> prom_hold_cpus code to setup_system. However, I see that only the primary thread
> of boot cpu is set in the cpu_online_map, and the primary threads of non-boot
> cpus are not set in the cpu_online_map.
> - When do the other primary threads of the non-boot cpus get set in the
> online_map? Can I assume that they will all be up by the time setup_system is
> executed and set all the primary threads in the cpu_online_map? If not, how can
> I differentiate between the primary threads of the boot and non-boot cpus?

Not sure exactly what you mean by this.  But I assume you are talking
about cpu hotplug (ie cpu DLPAR) events which can bring cpus online at
any time after boot when a customer asks it to.




From paulus at samba.org  Fri Jul 23 05:00:09 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 22 Jul 2004 15:00:09 -0400
Subject: [PATCH] set tbl->it_type to TCE_PCI in iommu code
In-Reply-To: <20040722113435.GA12185@in.ibm.com>
References: <20040722113435.GA12185@in.ibm.com>
Message-ID: <16640.3769.571620.779049@cargo.ozlabs.ibm.com>


Ananth,

> Here is a patch that sets struct iommu_table->it_type to TCE_PCI in
> pSeries_iommu.c. Please apply

This certainly looks like a good thing to do.  To help me explain it
better to Andrew Morton, could you tell me whether this fixes an
actual bug (and if so, what bug)?  Or is it just for code cleanness,
or to help debugging?

Thanks,
Paul.



From paulus at samba.org  Fri Jul 23 05:12:55 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 22 Jul 2004 15:12:55 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <1090439865.5862.5.camel@nighthawk>
References: <40FBFC49.1070804@austin.ibm.com>
	<20040721055840.GA18787@kroah.com>
	<40FEBC0E.7060005@austin.ibm.com>
	<1090439865.5862.5.camel@nighthawk>
Message-ID: <16640.4535.25999.780795@cargo.ozlabs.ibm.com>


Dave Hansen writes:

> On Wed, 2004-07-21 at 11:55, Nathan Fontenot wrote:
> > Would anyone have a problem with moving everything from
> > /proc/ppc64/rtas to sysfs (/sys/firmware/rtas) ? It seems that sysfs
> > is where all these files should live anyway.
>
> Some of them use pretty broken interfaces, and I don't think
> they should be perpetuated. The biggest offender in my mind is
> ppc64/rtas/rmo_buffer. It exports physical addresses so that they
> can be mapped with /dev/mem. I think a more refined interface is
> appropriate here.

First, we won't remove any /proc files from 2.6 that are being used by
applications.  That's a job for 2.7.  (I think we could remove things
like /proc/ppc64/rtas/poweron though.)

Secondly, we can discuss the rmo_buffer thing if you like, but I'll be
surprised if you can come up with something that doesn't involve
unnecessarily duplicating code.  The users of rmo_buffer absolutely
need to know the physical address of the rmo_buffer memory.  Given
that, I don't see any good reason for making a device with an mmap
method for mapping the rmo_buffer memory into userspace when we
already have a perfectly good device which we can use for that
purpose, namely /dev/mem.
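
For reference, a user-space sketch of the /dev/mem approach described
above. It assumes the rmo_buffer file yields one line of the form
"<hex address> <hex size>"; that format is an assumption made for this
example, as are the function names. mmap() also requires the offset to
be page aligned, which the sketch does not check.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse "<hex address> <hex size>"; this line format is an assumption. */
int parse_rmo_line(const char *line, unsigned long *addr, unsigned long *size)
{
	return sscanf(line, "%lx %lx", addr, size) == 2 ? 0 : -1;
}

/* Map the physical range through /dev/mem; returns MAP_FAILED on error. */
void *map_rmo_buffer(unsigned long addr, unsigned long size)
{
	void *p;
	int fd = open("/dev/mem", O_RDWR);

	if (fd < 0)
		return MAP_FAILED;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		 (off_t) addr);
	close(fd);	/* the mapping outlives the descriptor */
	return p;
}
```

The caller still knows the physical address after mapping, which is the
property Paul says the users of rmo_buffer need.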

In other words, you'll need to produce a more sophisticated technical
argument than just "/dev/mem? ugh..".  8-)

Paul.



From paulus at samba.org  Fri Jul 23 05:24:53 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 22 Jul 2004 15:24:53 -0400
Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size
In-Reply-To: <20040721204840.GD13171@austin.ibm.com>
References: <20040721204840.GD13171@austin.ibm.com>
Message-ID: <16640.5253.20520.720962@cargo.ozlabs.ibm.com>


Linas,

> Firmware expects error log sizes to be of a very specific size,
> but different versions of firmware apparently expect different
> sizes; using the wrong size results in a painful, hard-to-debug crash
> in firmware. Benh provided a patch for this some months ago, but
> apparently missed this code path. This patch sets up the log buffer
> size dynamically; it also fixes a bug with the return code not being
> handled correctly.

Just a small nit, but this hunk looks pointless:

> @@ -105,6 +111,7 @@
>  	err_args = rtas.args;
>  	rtas.args = save_args;
>
> +	err_args.rets = (rtas_arg_t *)&(err_args.args[2]);
>  	return err_args.rets[0];
>  }
>

The rest of it looks fine.  If you agree that this hunk can go, I'll
send the rest of it upstream.

Paul.



From paulus at samba.org  Fri Jul 23 06:24:41 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 22 Jul 2004 16:24:41 -0400
Subject: [PATCH] 2.6 ppc64 Janitor prom.c memleak; whitespace
In-Reply-To: <20040721211650.GU13171@austin.ibm.com>
References: <20040721211650.GU13171@austin.ibm.com>
Message-ID: <16640.8841.405843.460895@cargo.ozlabs.ibm.com>


Linas Vepstas writes:

> I had reason to review prom.c today, and saw one minor bug (a very
> unlikely memory leak) and a lot of bad indentation (8 spaces used
> where tab should have been used). Bad whitespace drives me crazy
> because my vi set ts=3 not 8. This patch fixes the mem leak (at very
> bottom of the patch) and lots of whitespace ick. The mem leak is
> unlikely because it requires other failures to happen before the
> memleak happens.

Please don't combine extensive whitespace-only changes with real bug
fixes.  I'll send on the whitespace changes from your patch.

With the memory leak, I agree it needs to be fixed, but it seems to me
that if of_finish_dynamic_node() fails, we will still leak
np->full_name, even with your patch.  Also there are memory leaks in
the failure paths of of_finish_dynamic_node.  (If you feel like fixing
those, fine, but don't feel you have to. :)
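
One common way to close this kind of leak is a single unwind path that
frees everything allocated so far. The following is a user-space sketch
only: struct node, finish_node() and the allocation counter are
hypothetical stand-ins, with malloc()/free() in place of
kmalloc()/kfree(), and it is not the prom.c code.

```c
#include <stdlib.h>
#include <string.h>

struct node {			/* stand-in for struct device_node */
	char *full_name;
};

int live_allocs;		/* allocation counter, for illustration only */

void *xalloc(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_allocs++;
	return p;
}

void xfree(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Hypothetical stand-in for of_finish_dynamic_node(); fails on request. */
int finish_node(struct node *np, int fail)
{
	(void) np;
	return fail ? -1 : 0;
}

/* Build a node; every failure path releases what was allocated so far. */
struct node *add_node(const char *path, int fail_finish)
{
	struct node *np = xalloc(sizeof(*np));

	if (!np)
		return NULL;
	np->full_name = xalloc(strlen(path) + 1);
	if (!np->full_name)
		goto out_free_node;
	strcpy(np->full_name, path);

	if (finish_node(np, fail_finish))
		goto out_free_name;
	return np;

out_free_name:
	xfree(np->full_name);
out_free_node:
	xfree(np);
	return NULL;
}
```

With the labels ordered in reverse of the allocations, a failure in
finish_node() releases full_name as well as the node itself.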

Regards,
Paul.



From johnrose at austin.ibm.com  Fri Jul 23 06:36:41 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Thu, 22 Jul 2004 15:36:41 -0500
Subject: [PATCH] imalloc supersets
In-Reply-To: <1090261801.18793.12.camel@sinatra.austin.ibm.com>
References: <1090261801.18793.12.camel@sinatra.austin.ibm.com>
Message-ID: <1090528601.1648.14.camel@sinatra.austin.ibm.com>


My first patch left out a change to pgtable.h.  My apologies, attached
is the corrected patch.  So, anyone have thoughts on this? :)

Thanks-
John
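
For readers following the region terminology used in the patch (unused,
exists, subset, superset, overlap), here is a standalone sketch that
classifies one half-open address range against another. It is an
illustration written for this purpose, not the imalloc code: the enum and
function names are hypothetical, it compares both ends of each range
rather than reproducing the patch's tests, and it assumes non-zero sizes
with no address wraparound.

```c
enum region_rel {
	REL_DISJOINT,	/* no shared addresses */
	REL_EXISTS,	/* identical ranges */
	REL_SUBSET,	/* request lies inside the existing range */
	REL_SUPERSET,	/* request contains the existing range */
	REL_OVERLAP	/* ranges share addresses, neither contains the other */
};

/* Classify request [addr, addr+size) against existing [vaddr, vaddr+vsize). */
enum region_rel classify(unsigned long addr, unsigned long size,
			 unsigned long vaddr, unsigned long vsize)
{
	unsigned long end = addr + size;
	unsigned long vend = vaddr + vsize;

	if (end <= vaddr || vend <= addr)
		return REL_DISJOINT;
	if (addr == vaddr && end == vend)
		return REL_EXISTS;
	if (addr >= vaddr && end <= vend)
		return REL_SUBSET;
	if (addr <= vaddr && end >= vend)
		return REL_SUPERSET;
	return REL_OVERLAP;
}
```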

diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c sles9-rc5-sav/arch/ppc64/mm/imalloc.c
--- sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c	2004-07-16 16:11:54.000000000 -0500
+++ sles9-rc5-sav/arch/ppc64/mm/imalloc.c	2004-07-22 15:08:55.000000000 -0500
@@ -37,33 +37,51 @@ static int get_free_im_addr(unsigned lon
 	return 0;
 }

+/* Return whether the region described by v_addr and size is a subset
+ * of the region described by parent
+ */
+static inline int im_region_is_subset(unsigned long v_addr, unsigned long size,
+			struct vm_struct *parent)
+{
+	return (int) (v_addr >= (unsigned long) parent->addr &&
+	              v_addr < (unsigned long) parent->addr + parent->size &&
+	    	      size < parent->size);
+}
+
+/* Return whether the region described by v_addr and size is a superset
+ * of the region described by child
+ */
+static int im_region_is_superset(unsigned long v_addr, unsigned long size,
+		struct vm_struct *child)
+{
+	struct vm_struct parent;
+
+	parent.addr = (void *) v_addr;
+	parent.size = size;
+
+	return im_region_is_subset((unsigned long) child->addr, child->size,
+			&parent);
+}
+
 /* Return whether the region described by v_addr and size overlaps
  * the region described by vm.  Overlapping regions meet the
  * following conditions:
  * 1) The regions share some part of the address space
  * 2) The regions aren't identical
- * 3) The first region is not a subset of the second
+ * 3) Neither region is a subset of the other
  */
-static inline int im_region_overlaps(unsigned long v_addr, unsigned long size,
+static int im_region_overlaps(unsigned long v_addr, unsigned long size,
 		     struct vm_struct *vm)
 {
+	if (im_region_is_superset(v_addr, size, vm))
+		return 0;
+
 	return (v_addr + size > (unsigned long) vm->addr + vm->size &&
 		v_addr < (unsigned long) vm->addr + vm->size) ||
 	       (v_addr < (unsigned long) vm->addr &&
 		v_addr + size > (unsigned long) vm->addr);
 }

-/* Return whether the region described by v_addr and size is a subset
- * of the region described by vm
- */
-static inline int im_region_is_subset(unsigned long v_addr, unsigned long size,
-			struct vm_struct *vm)
-{
-	return (int) (v_addr >= (unsigned long) vm->addr &&
-	              v_addr < (unsigned long) vm->addr + vm->size &&
-	    	      size < vm->size);
-}
-
 /* Determine imalloc status of region described by v_addr and size.
  * Can return one of the following:
  * IM_REGION_UNUSED   -  Entire region is unallocated in imalloc space.
@@ -73,28 +91,37 @@ static inline int im_region_is_subset(un
  * IM_REGION_EXISTS -    Exact region already allocated in imalloc space.
  *                       vm will be assigned to a ptr to the existing imlist
  *                       member.
- * IM_REGION_OVERLAPS -  A portion of the region is already allocated in
- *                       imalloc space.
+ * IM_REGION_OVERLAPS -  Region overlaps an allocated region in imalloc space.
+ * IM_REGION_SUPERSET -  Region is a superset of a region that is already
+ *                       allocated in imalloc space.
  */
 static int im_region_status(unsigned long v_addr, unsigned long size,
 		    struct vm_struct **vm)
 {
 	struct vm_struct *tmp;

-	for (tmp = imlist; tmp; tmp = tmp->next)
-		if (v_addr < (unsigned long) tmp->addr + tmp->size)
+	for (tmp = imlist; tmp; tmp = tmp->next)
+		if (v_addr < (unsigned long) tmp->addr + tmp->size)
 			break;
-
+
 	if (tmp) {
 		if (im_region_overlaps(v_addr, size, tmp))
 			return IM_REGION_OVERLAP;

 		*vm = tmp;
-		if (im_region_is_subset(v_addr, size, tmp))
+		if (im_region_is_subset(v_addr, size, tmp)) {
+			/* Return with tmp pointing to superset */
 			return IM_REGION_SUBSET;
+		}
+		if (im_region_is_superset(v_addr, size, tmp)) {
+			/* Return with tmp pointing to first subset */
+			return IM_REGION_SUPERSET;
+		}
 		else if (v_addr == (unsigned long) tmp->addr &&
-		 	 size == tmp->size)
+		 	 size == tmp->size) {
+			/* Return with tmp pointing to exact region */
 			return IM_REGION_EXISTS;
+		}
 	}

 	*vm = NULL;
@@ -208,6 +235,10 @@ static struct vm_struct * __im_get_area(
 		tmp = split_im_region(req_addr, size, tmp);
 		break;
 	case IM_REGION_EXISTS:
+		/* Return requested region */
+		break;
+	case IM_REGION_SUPERSET:
+		/* Return first existing subset of requested region */
 		break;
 	default:
 		printk(KERN_ERR "%s() unexpected imalloc region status\n",
diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/init.c sles9-rc5-sav/arch/ppc64/mm/init.c
--- sles9-rc5-vanilla/arch/ppc64/mm/init.c	2004-07-16 16:11:54.000000000 -0500
+++ sles9-rc5-sav/arch/ppc64/mm/init.c	2004-07-22 15:17:56.000000000 -0500
@@ -392,9 +392,28 @@ void iounmap(void *addr)
 	return;
 }

+static int iounmap_subset_regions(void *addr, unsigned long size)
+{
+	struct vm_struct *area;
+
+	/* Check whether subsets of this region exist */
+	area = im_get_area((unsigned long) addr, size, IM_REGION_SUPERSET);
+	if (area == NULL)
+		return 1;
+
+	while (area) {
+		iounmap(area->addr);
+		area = im_get_area((unsigned long) addr, size,
+				IM_REGION_SUPERSET);
+	}
+
+	return 0;
+}
+
 int iounmap_explicit(void *addr, unsigned long size)
 {
 	struct vm_struct *area;
+	int rc;

 	/* addr could be in EEH or IO region, map it to IO region regardless.
 	 */
@@ -407,12 +426,18 @@ int iounmap_explicit(void *addr, unsigne
 	area = im_get_area((unsigned long) addr, size,
 			    IM_REGION_EXISTS | IM_REGION_SUBSET);
 	if (area == NULL) {
-		printk(KERN_ERR "%s() cannot unmap nonexistant range 0x%lx\n",
-				__FUNCTION__, (unsigned long) addr);
-		return 1;
+		/* Determine whether subset regions exist.  If so, unmap */
+		rc = iounmap_subset_regions(addr, size);
+		if (rc) {
+			printk(KERN_ERR
+			       "%s() cannot unmap nonexistant range 0x%lx\n",
+ 				__FUNCTION__, (unsigned long) addr);
+			return 1;
+		}
+	} else {
+		iounmap(area->addr);
 	}

-	iounmap(area->addr);
 	return 0;
 }

diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/include/asm-ppc64/pgtable.h sles9-rc5-sav/include/asm-ppc64/pgtable.h
--- sles9-rc5-vanilla/include/asm-ppc64/pgtable.h	2004-07-16 16:12:00.000000000 -0500
+++ sles9-rc5-sav/include/asm-ppc64/pgtable.h	2004-07-22 15:09:35.000000000 -0500
@@ -467,6 +467,7 @@ extern void hpte_init_iSeries(void);
 #define IM_REGION_SUBSET	0x2
 #define IM_REGION_EXISTS	0x4
 #define IM_REGION_OVERLAP	0x8
+#define IM_REGION_SUPERSET	0x10

 extern struct vm_struct * im_get_free_area(unsigned long size);
 extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size,

On Mon, 2004-07-19 at 13:30, John Rose wrote:
> The patch below implements the ability to query outstanding imalloc regions for
> a given virtual address range.  The patch extends im_get_area() to allow a
> region criterion of IM_REGION_SUPERSET.  For a particular "superset" virtual
> address and size passed into im_get_area(), the function returns the first
> outstanding region that is contained within this superset region.
>
> The patch also changes iounmap_explicit() to allow for the unmapping of all
> regions that fit under a "superset".
>
> This ability is necessary for PHB DLPAR.  For a PHB removal, the RPA requires
> that all of its children slots already be dynamically removed.  Each of these
> slot-level removals has fractured the imalloc region assigned to the PHB at
> boot.  At PHB removal time, it is necessary to iounmap() the remaining
> artifacts of the initial PHB region.
>
> Thanks-
> John
>
> Signed-off-by:  John Rose <johnrose at austin.ibm.com>
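
A minimal user-space sketch of the classification described above, assuming half-open ranges [addr, addr + size). The four IM_REGION_* values are taken from the pgtable.h hunk; SKETCH_REGION_UNUSED, struct region_sketch, classify_region() and unmap_contained() are hypothetical names for illustration and are not the kernel's imalloc helpers.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_REGION_UNUSED	0x1	/* assumed value for "no conflict" */
#define IM_REGION_SUBSET	0x2
#define IM_REGION_EXISTS	0x4
#define IM_REGION_OVERLAP	0x8
#define IM_REGION_SUPERSET	0x10

struct region_sketch {
	unsigned long addr;
	unsigned long size;	/* 0 marks an entry that has been unmapped */
};

/* Classify the request [addr, addr + size) against one existing region. */
static int classify_region(unsigned long addr, unsigned long size,
			   const struct region_sketch *r)
{
	unsigned long end = addr + size;
	unsigned long r_end = r->addr + r->size;

	assert(size > 0 && r->size > 0);

	if (end <= r->addr || addr >= r_end)
		return SKETCH_REGION_UNUSED;	/* disjoint */
	if (addr == r->addr && size == r->size)
		return IM_REGION_EXISTS;	/* exact match */
	if (addr >= r->addr && end <= r_end)
		return IM_REGION_SUBSET;	/* request fits inside r */
	if (addr <= r->addr && end >= r_end)
		return IM_REGION_SUPERSET;	/* request contains r */
	return IM_REGION_OVERLAP;		/* partial intersection */
}

/* Drop every live region contained in the request; return how many. */
static int unmap_contained(struct region_sketch *list, int n,
			   unsigned long addr, unsigned long size)
{
	int i, removed = 0;

	for (i = 0; i < n; i++) {
		int st;

		if (list[i].size == 0)
			continue;	/* already unmapped */
		st = classify_region(addr, size, &list[i]);
		if (st == IM_REGION_SUPERSET || st == IM_REGION_EXISTS) {
			list[i].size = 0;
			removed++;
		}
	}
	return removed;
}
```

In the PHB removal case, the request would be the boot-time PHB range and the list would hold the fragments left behind by the slot-level removals.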




From linas at austin.ibm.com  Fri Jul 23 07:38:29 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Thu, 22 Jul 2004 16:38:29 -0500
Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size
In-Reply-To: <16640.5253.20520.720962@cargo.ozlabs.ibm.com>
References: <20040721204840.GD13171@austin.ibm.com> <16640.5253.20520.720962@cargo.ozlabs.ibm.com>
Message-ID: <20040722213829.GD13396@austin.ibm.com>


On Thu, Jul 22, 2004 at 03:24:53PM -0400, Paul Mackerras was heard to remark:
>
> > Firmware expects error log sizes to be of a very specific size,
> > but different versions of firmware apparently expect different
> > sizes; using the wrong size results in a painful, hard-to-debug
> > crash in firmware. Benh provided a patch for this some months ago,
> > but apparently missed this code path. This patch sets up the log
> > buffer size dynamically; it also fixes a bug with the return code
> > not being handled correctly.
>
> Just a small nit, but this hunk looks pointless:
>
> > @@ -105,6 +111,7 @@
> >  	err_args = rtas.args;
> >  	rtas.args = save_args;
> >
> > +	err_args.rets = (rtas_arg_t *)&(err_args.args[2]);
> >  	return err_args.rets[0];
> >  }

heh.
I wondered if anyone would notice..
Yes, put this line back to where it used to be.

Silly me, for a moment there, I was thinking that
firmware was looking at this value and so I was desperately
trying to make sure it was right; I needn't have worried.
I realized this right after I hit "send" on email ...

> The rest of it looks fine. If you agree that this hunk can go, I'll
> send the rest of it upstream.

Yep, Just make sure that 'rets' is set somewhere in there,
anywhere (e.g. put this line back to how it used to be).

--linas



From nathanl at austin.ibm.com  Fri Jul 23 08:09:01 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Thu, 22 Jul 2004 17:09:01 -0500
Subject: identifying cpus (and smt threads)
In-Reply-To: <20040722125609.GA1466@in.ibm.com>
References: <20040722125609.GA1466@in.ibm.com>
Message-ID: <1090534141.3041.24.camel@booger>


On Thu, 2004-07-22 at 07:56, R Sharada wrote:

> A few questions here:
> - why is there a cpu_present_at_boot map? It seems to be the same as the
> cpu_available_map. Or did I miss something?

It's cruft and we should get rid of both of those maps and use
cpu_present_map like other architectures (e.g. ia64).  I posted a patch
a few days ago which does this, but it needs more work to accommodate
cpu hotplug.

Nathan




From nfont at austin.ibm.com  Fri Jul 23 08:25:05 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Thu, 22 Jul 2004 17:25:05 -0500
Subject: [PATCH] updates to surveillance for power5
Message-ID: <41003EC1.5030109@austin.ibm.com>

This patch updates enable_surveillance() so we do not return an error
on platforms (notably power5) that do not have a surveillance sensor.
Additionally, the rtas_call was changed to rtas_set_indicator so as to avoid
having to handle RTAS_BUSY returns.

Since this is only called when a surveillance timeout is specified at
the boot prompt, I also added a printk to inform users.

Paul or Anton, could you push this upstream if there are no objections?

Signed-off-by: Nathan Fontenot <nfont at austin.ibm.com>

thanks,
-Nathan F.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: surv_p5update-linus.patch
Type: text/x-patch
Size: 738 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040722/2b080c7f/attachment.bin 

From ananth at in.ibm.com  Fri Jul 23 15:44:08 2004
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Fri, 23 Jul 2004 10:44:08 +0500
Subject: [PATCH] set tbl->it_type to TCE_PCI in iommu code
In-Reply-To: <16640.3769.571620.779049@cargo.ozlabs.ibm.com>
References: <20040722113435.GA12185@in.ibm.com> <16640.3769.571620.779049@cargo.ozlabs.ibm.com>
Message-ID: <20040723054408.GA13344@in.ibm.com>


On Thu, Jul 22, 2004 at 03:00:09PM -0400, Paul Mackerras wrote:

Hi Paul,

> > Here is a patch that sets struct iommu_table->it_type to TCE_PCI in
> > pSeries_iommu.c. Please apply
>
> This certainly looks like a good thing to do.  To help me explain it
> better to Andrew Morton, could you tell me whether this fixes an
> actual bug (and if so, what bug)?  Or is it just for code cleanness,
> or to help debugging?

I noticed that the it_type was not being set when updating the old
tce_table code in ppc64 kdb to use the changed iommu structs. This
is just for code completeness (and it is updated in iSeries_iommu.c,
but was somehow missed in pSeries_iommu.c).


Thanks,
Ananth




From haveblue at us.ibm.com  Sat Jul 24 00:38:37 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Fri, 23 Jul 2004 07:38:37 -0700
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <16640.4535.25999.780795@cargo.ozlabs.ibm.com>
References: <40FBFC49.1070804@austin.ibm.com>
	 <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com>
	 <1090439865.5862.5.camel@nighthawk>
	 <16640.4535.25999.780795@cargo.ozlabs.ibm.com>
Message-ID: <1090593517.8876.24.camel@nighthawk>


On Thu, 2004-07-22 at 12:12, Paul Mackerras wrote:
> First, we won't remove any /proc files from 2.6 that are being used by
> applications.  That's a job for 2.7.  (I think we could remove things
> like /proc/ppc64/rtas/poweron though.)

What about the new kernel development model? ;)

> Secondly, we can discuss the rmo_buffer thing if you like, but I'll be
> surprised if you can come up with something that doesn't involve
> unnecessarily duplicating code.  The users of rmo_buffer absolutely
> need to know the physical address of the rmo_buffer memory.  Given
> that, I don't see any good reason for making a device with an mmap
> method for mapping the rmo_buffer memory into userspace when we
> already have a perfectly good device which we can use for that
> purpose, namely /dev/mem.
>
> In other words, you'll need to produce a more sophisticated technical
> argument than just "/dev/mem? ugh..".  8-)

The argument against /dev/mem is pretty simple: it's a bit too big of a
hammer.  Using it will force you to either run as root, or allow a
non-root user/group to write to it, which is an equivalent of giving
them root.

If it were another device, you can give selected users access to it, and
the worst they can do is corrupt RTAS calls (which may be a root
equivalent too, I imagine).

A suggestion would be to use the current ppc_rtas to get that
information back out.  Could there be some special args passed to the
syscall that give you the address of the buffer?

Something conceptually along the lines of:

        if (uargs->token == SPECIAL_TOKEN) {
        	uargs->rets = phys_to_virt(&rtas_rmo_buf);
        	return foo;
        }

So, from userspace and librtas, you just see it as a normal ppc_rtas
call with a token that isn't defined by the hardware spec.

This has the advantage of isolating all of the rtas code to a slightly
more restricted area.  It also makes it more explicit to someone reading
the code that the same users of ppc_rtas also want to know the physical
address of the rtas_rmo_buf.

-- Dave



From paulus at samba.org  Sat Jul 24 04:15:11 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 23 Jul 2004 14:15:11 -0400
Subject: Another: [PATCH} 2.6 ppc64 rtas crash in kmalloc
In-Reply-To: <20040721211316.GS13171@austin.ibm.com>
References: <20040721211316.GS13171@austin.ibm.com>
Message-ID: <16641.21935.42185.14181@cargo.ozlabs.ibm.com>


Linas Vepstas writes:

> The recent set of rtas patches I sent in (about a week ago)
> introduced a bug: the possible use of kmalloc before the
> VM subsystem is initialized.  This patch checks for the VM
> subsystem being ready, and avoids the kmalloc if it's not.

> +		/* Can't call kmalloc if VM subsystem is not yet up. */
> +		struct cache_sizes *csizep = malloc_sizes;
> +		if (csizep->cs_cachep) {

Please just use if (mem_init_done) for this, instead of looking at
internal slab cache variables.
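
A minimal user-space sketch of that check, assuming a single global flag that is set once the allocator is usable. Here mem_init_done_sketch stands in for the kernel's mem_init_done, malloc() stands in for kmalloc(), and SKETCH_ERR_BUF_MAX and copy_error_log() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_ERR_BUF_MAX 2048	/* hypothetical upper bound on the log size */

/* Stand-in for mem_init_done: nonzero once allocation is safe. */
static int mem_init_done_sketch;

/*
 * Return a heap copy of the error log, or NULL when it is too early
 * to allocate (or the allocation fails).  The caller must free() it.
 */
static char *copy_error_log(const char *src, size_t len)
{
	char *buf;

	assert(src != NULL && len <= SKETCH_ERR_BUF_MAX);

	if (!mem_init_done_sketch)
		return NULL;	/* allocator not up yet: skip the copy */

	buf = malloc(len);
	if (buf)
		memcpy(buf, src, len);
	return buf;
}
```

The caller only has to cope with a NULL return, which it already must do for an ordinary allocation failure, so no knowledge of the allocator's internals is needed.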

Thanks,
Paul.



From paulus at samba.org  Sat Jul 24 06:41:56 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 23 Jul 2004 16:41:56 -0400
Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size
In-Reply-To: <20040722213829.GD13396@austin.ibm.com>
References: <20040721204840.GD13171@austin.ibm.com>
	<16640.5253.20520.720962@cargo.ozlabs.ibm.com>
	<20040722213829.GD13396@austin.ibm.com>
Message-ID: <16641.30740.797512.714695@cargo.ozlabs.ibm.com>


Linas Vepstas writes:

> heh. I wondered if anyone would notice.. Yes, put this line back to
> where it used to be.
>
> Silly me, for a moment there, I was thinking that firmware was looking
> at this value and so I was desperately trying to make sure it was
> right; I needn't have worried. I realized this right after I hit
> "send" on email ...
>
> > The rest of it looks fine. If you agree that this hunk can go, I'll
> > send the rest of it upstream.
>
> Yep, Just make sure that 'rets' is set somewhere in there, anywhere
> (e.g. put this line back to how it used to be).

In fact I wasn't quite correct, since the next line dereferenced
err_args.rets.  However, it isn't necessary to set either
err_args.rets or rtas.args.rets in that function, since neither
enter_rtas() nor RTAS itself look at that field.
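
A minimal sketch of why the first return value can be read as args[nargs] without going through rets, assuming a simplified argument block in which the return slots directly follow the input arguments. struct rtas_args_sketch, init_args() and first_ret() are illustrative names and the field layout is an assumption, not a copy of the kernel's struct rtas_args.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t rtas_arg_t;

struct rtas_args_sketch {
	uint32_t token;
	uint32_t nargs;
	uint32_t nret;
	rtas_arg_t args[16];
	rtas_arg_t *rets;	/* convenience pointer to args[nargs] */
};

static void init_args(struct rtas_args_sketch *a, uint32_t token,
		      uint32_t nargs, uint32_t nret)
{
	assert(nargs + nret <= 16);
	a->token = token;
	a->nargs = nargs;
	a->nret = nret;
	a->rets = &a->args[nargs];
}

/* Read the first return slot directly; rets is not consulted. */
static rtas_arg_t first_ret(const struct rtas_args_sketch *a)
{
	return a->args[a->nargs];
}
```

Note that a plain struct copy does not rebase rets: after copying, the pointer still refers to the source structure's array. Indexing args[nargs] sidesteps that when the block is copied back and forth.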

How about this patch instead?

Regards,
Paul.

diff -urN linux-2.5/arch/ppc64/kernel/rtas.c test25/arch/ppc64/kernel/rtas.c
--- linux-2.5/arch/ppc64/kernel/rtas.c	2004-07-06 08:43:03.000000000 +1000
+++ test25/arch/ppc64/kernel/rtas.c	2004-07-24 05:22:16.716914216 +1000
@@ -22,7 +22,6 @@
 #include <asm/rtas.h>
 #include <asm/semaphore.h>
 #include <asm/machdep.h>
-#include <asm/paca.h>
 #include <asm/page.h>
 #include <asm/param.h>
 #include <asm/system.h>
@@ -73,7 +72,6 @@
 	return tokp ? *tokp : RTAS_UNKNOWN_SERVICE;
 }

-
 /** Return a copy of the detailed error text associated with the
  *  most recent failed call to rtas.  Because the error text
  *  might go stale if there are any other intervening rtas calls,
@@ -84,28 +82,32 @@
 __fetch_rtas_last_error(void)
 {
 	struct rtas_args err_args, save_args;
+	u32 bufsz;
+
+	bufsz = rtas_token ("rtas-error-log-max");
+	if ((bufsz == RTAS_UNKNOWN_SERVICE) ||
+	    (bufsz > RTAS_ERROR_LOG_MAX)) {
+		printk (KERN_WARNING "RTAS: bad log buffer size %d\n", bufsz);
+		bufsz = RTAS_ERROR_LOG_MAX;
+	}

 	err_args.token = rtas_token("rtas-last-error");
 	err_args.nargs = 2;
 	err_args.nret = 1;
-	err_args.rets = (rtas_arg_t *)&(err_args.args[2]);

 	err_args.args[0] = (rtas_arg_t)__pa(rtas_err_buf);
-	err_args.args[1] = RTAS_ERROR_LOG_MAX;
+	err_args.args[1] = bufsz;
 	err_args.args[2] = 0;

 	save_args = rtas.args;
 	rtas.args = err_args;

-	PPCDBG(PPCDBG_RTAS, "\tentering rtas with 0x%lx\n",
-	       __pa(&err_args));
 	enter_rtas(__pa(&rtas.args));
-	PPCDBG(PPCDBG_RTAS, "\treturned from rtas ...\n");

 	err_args = rtas.args;
 	rtas.args = save_args;

-	return err_args.rets[0];
+	return err_args.args[2];
 }

 int rtas_call(int token, int nargs, int nret, int *outputs, ...)



From paulus at samba.org  Sat Jul 24 09:16:59 2004
From: paulus at samba.org (Paul Mackerras)
Date: Fri, 23 Jul 2004 19:16:59 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <1090593517.8876.24.camel@nighthawk>
References: <40FBFC49.1070804@austin.ibm.com>
	<20040721055840.GA18787@kroah.com>
	<40FEBC0E.7060005@austin.ibm.com>
	<1090439865.5862.5.camel@nighthawk>
	<16640.4535.25999.780795@cargo.ozlabs.ibm.com>
	<1090593517.8876.24.camel@nighthawk>
Message-ID: <16641.40043.346159.297712@cargo.ozlabs.ibm.com>


Dave Hansen writes:

> On Thu, 2004-07-22 at 12:12, Paul Mackerras wrote:
> > First, we won't remove any /proc files from 2.6 that are being used by
> > applications.  That's a job for 2.7.  (I think we could remove things
> > like /proc/ppc64/rtas/poweron though.)
>
> What about the new kernel development model? ;)

What about it?  I'm just saying that we're not going to remove an
existing interface that usermode is using in the 2.6 series kernels.

> The argument against /dev/mem is pretty simple: it's a bit too big of a
> hammer.  Using it will force you to either run as root, or allow a
> non-root user/group to write to it, which is an equivalent of giving
> them root.

We only want root to be able to do RTAS calls.  If you can do RTAS
calls you can do things like reboot or power off the system,
enable/disable interrupts, access PCI config space, stop the CPU, or
update flash ROM or nvram.  I'm pretty sure you could persuade RTAS to
write to arbitrary physical addresses below 4GB too.

> A suggestion would be to use the current ppc_rtas to get that
> information back out.  Could there be some special args passed to the
> syscall that give you the address of the buffer?
>
> Something conceptually along the lines of:
>
>         if (uargs->token == SPECIAL_TOKEN) {
>         	uargs->rets = phys_to_virt(&rtas_rmo_buf);
>         	return foo;
>         }

Yes, that's not a bad idea iff we can find a token value that firmware
is guaranteed not to use.  I don't see anything in the RPA that says
that token values are restricted to any particular range though.  We
could use some other invalid input such as a NULL argument buffer
pointer perhaps instead.

Paul.



From linas at austin.ibm.com  Sat Jul 24 09:58:59 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Fri, 23 Jul 2004 18:58:59 -0500
Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size
In-Reply-To: <16641.30740.797512.714695@cargo.ozlabs.ibm.com>
References: <20040721204840.GD13171@austin.ibm.com> <16640.5253.20520.720962@cargo.ozlabs.ibm.com> <20040722213829.GD13396@austin.ibm.com> <16641.30740.797512.714695@cargo.ozlabs.ibm.com>
Message-ID: <20040723235859.GG13396@austin.ibm.com>


Yes, that will work too.

--linas

On Fri, Jul 23, 2004 at 04:41:56PM -0400, Paul Mackerras was heard to remark:
> Linas Vepstas writes:
>
> > heh. I wondered if anyone would notice.. Yes, put this line back to
> > where it used to be.
> >
> > Silly me, for a moment there, I was thinking that firmware was
> > looking at this value and so I was desperately trying to make sure
> > it was right; I needn't have worried. I realized this right after I
> > hit "send" on email ...
> >
> > > The rest of it looks fine. If you agree that this hunk can go,
> > > I'll send the rest of it upstream.
> >
> > Yep, Just make sure that 'rets' is set somewhere in there, anywhere
> > (e.g. put this line back to how it used to be).
>
> In fact I wasn't quite correct, since the next line dereferenced
> err_args.rets. However, it isn't necessary to set either err_args.rets
> or rtas.args.rets in that function, since neither enter_rtas() nor
> RTAS itself look at that field.



From nathanl at austin.ibm.com  Sat Jul 24 16:43:10 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Sat, 24 Jul 2004 01:43:10 -0500
Subject: UP load_up_fpu crash (2.6.8-rc2)
Message-ID: <1090651390.3592.11.camel@booger>


We seem to be broken with CONFIG_SMP=n on 2.6.8-rc2 and 2.6.8-rc1-mm1:

Freeing unused kernel memory: 280k freed
INIT: version 2.85 booting
Vector: 300 (Data Access) at [c00000003f043bb0]
    pc: c00000000000bab0: .load_up_fpu+0xb0/0x16c
    lr: 00000000400272b8
    sp: c00000003f043e30
   msr: 8000000000003032
   dar: 108
 dsisr: 40000000
  current = 0xc00000003f03d440
  paca    = 0xc0000000003cc000
    pid   = 327, comm = hotplug
enter ? for help
mon> t
[c00000003f043e30] c00000000000b4d8 .handle_page_fault+0x20/0x40
(unreliable)
--- Exception: 801 (FPU Unavailable) at 000000004000b908
SP (ffffe480) is in userspace


Any ideas?  I've seen this on pSeries lpar, haven't tried anything
else.  SMP kernels are fine.  I'll try to track down the version where
this first appeared as time permits.

Nathan




From anton at samba.org  Sun Jul 25 05:15:37 2004
From: anton at samba.org (Anton Blanchard)
Date: Sun, 25 Jul 2004 05:15:37 +1000
Subject: [PATCH] Fix migrate_irqs_away
Message-ID: <20040724191537.GA20839@krispykreme>


In migrate_irqs_away we weren't converting a virtual irq to a real one.
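
A minimal sketch of the idea, assuming a small lookup table from virtual to real interrupt numbers. The table contents, the SKETCH_* constants and both function names are hypothetical; the point is only that the loop iterates over virtual numbers while the IPI check and the firmware call need the real number.

```c
#include <assert.h>

#define SKETCH_NR_VIRQS	6
#define SKETCH_NO_IRQ	0xffffffffu	/* assumed "no mapping" marker */
#define SKETCH_XICS_IPI	2u		/* assumed IPI source number */

/* Hypothetical virtual-to-real mapping, indexed by virtual irq. */
static const unsigned int sketch_virq_map[SKETCH_NR_VIRQS] = {
	SKETCH_NO_IRQ, SKETCH_XICS_IPI, 0x105,
	SKETCH_NO_IRQ, 0x230, SKETCH_NO_IRQ
};

static unsigned int sketch_virt_to_real(unsigned int virq)
{
	assert(virq < SKETCH_NR_VIRQS);
	return sketch_virq_map[virq];
}

/*
 * Walk the virtual irqs and collect the real numbers that would be
 * handed to firmware, skipping the IPI and unmapped entries.
 */
static int collect_migratable(unsigned int *real_out, int max)
{
	unsigned int virq;
	int n = 0;

	for (virq = 0; virq < SKETCH_NR_VIRQS && n < max; virq++) {
		unsigned int real = sketch_virt_to_real(virq);

		if (real == SKETCH_XICS_IPI || real == SKETCH_NO_IRQ)
			continue;
		real_out[n++] = real;
	}
	return n;
}
```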

Signed-off-by: Anton Blanchard <anton at samba.org>

===== xics.c 1.44 vs edited =====
--- 1.44/arch/ppc64/kernel/xics.c	Fri May 28 15:34:39 2004
+++ edited/xics.c	Sun Jul 25 04:59:24 2004
@@ -657,7 +657,7 @@
 	int set_indicator = rtas_token("set-indicator");
 	const unsigned int giqs = 9005UL; /* Global Interrupt Queue Server */
 	int status = 0;
-	unsigned int irq, cpu = smp_processor_id();
+	unsigned int irq, virq, cpu = smp_processor_id();
 	int xics_status[2];
 	unsigned long flags;

@@ -677,11 +677,12 @@
 	iosync();

 	printk(KERN_WARNING "HOTPLUG: Migrating IRQs away\n");
-	for_each_irq(irq) {
-		irq_desc_t *desc = get_irq_desc(irq);
+	for_each_irq(virq) {
+		irq_desc_t *desc = get_irq_desc(virq);

 		/* We need to get IPIs still. */
-		if (irq_offset_down(irq) == XICS_IPI)
+		irq = virt_irq_to_real(irq_offset_down(virq));
+		if (irq == XICS_IPI || irq == NO_IRQ)
 			continue;

 		/* We only need to migrate enabled IRQS */
@@ -696,7 +697,7 @@
 		if (status) {
 			printk(KERN_ERR "migrate_irqs_away: irq=%d "
 					"ibm,get-xive returns %d\n",
-					irq, status);
+					virq, status);
 			goto unlock;
 		}

@@ -709,17 +710,17 @@
 			goto unlock;

 		printk(KERN_WARNING "IRQ %d affinity broken off cpu %u\n",
-		       irq, cpu);
+		       virq, cpu);

 		/* Reset affinity to all cpus */
 		xics_status[0] = default_distrib_server;

-		status = rtas_call(ibm_set_xive, 3, 1, NULL,
-				irq, xics_status[0], xics_status[1]);
+		status = rtas_call(ibm_set_xive, 3, 1, NULL, irq,
+				xics_status[0], xics_status[1]);
 		if (status)
 			printk(KERN_ERR "migrate_irqs_away irq=%d "
 					"ibm,set-xive returns %d\n",
-					irq, status);
+					virq, status);

 unlock:
 		spin_unlock_irqrestore(&desc->lock, flags);



From haveblue at us.ibm.com  Tue Jul 27 02:28:11 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Mon, 26 Jul 2004 09:28:11 -0700
Subject: UP load_up_fpu crash (2.6.8-rc2)
In-Reply-To: <1090651390.3592.11.camel@booger>
References: <1090651390.3592.11.camel@booger>
Message-ID: <1090859291.21372.2535.camel@nighthawk>


On Fri, 2004-07-23 at 23:43, Nathan Lynch wrote:
> We seem to be broken with CONFIG_SMP=n on 2.6.8-rc2 and 2.6.8-rc1-mm1:
>
> Freeing unused kernel memory: 280k freed
> INIT: version 2.85 booting
> Vector: 300 (Data Access) at [c00000003f043bb0]
>     pc: c00000000000bab0: .load_up_fpu+0xb0/0x16c
...
> --- Exception: 801 (FPU Unavailable) at 000000004000b908
> SP (ffffe480) is in userspace
>
>
> Any ideas?  I've seen this on pSeries lpar, haven't tried anything
> else.  SMP kernels are fine.  I'll try to track down the version where
> this first appeared as time permits.

I was seeing the same exception, but in a slightly different place.  I
_think_ I saw it within some altivec code.  I thought it was related to
the failed CPU bringup.  I'll post it if I see it again.

-- Dave




From johnrose at austin.ibm.com  Tue Jul 27 08:57:04 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 26 Jul 2004 17:57:04 -0500
Subject: [PATCH] ISA OFDT node refcount fix
Message-ID: <1090882624.5914.1.camel@sinatra.austin.ibm.com>


The patch below moves a misplaced of_node_put().  In the existing code, the
node in question is used just after its refcount is decremented.  Please apply.
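
A minimal sketch of the ordering problem, assuming a toy reference-counted node that is marked freed when its count drops to zero. struct node_sketch and the node_*() helpers are hypothetical stand-ins for the device-tree node and of_node_get()/of_node_put(); in the real kernel another holder may keep the node alive, so the bug can stay latent.

```c
#include <assert.h>
#include <stddef.h>

struct node_sketch {
	int refcount;
	int freed;	/* set when the last reference is dropped */
};

static struct node_sketch *node_get(struct node_sketch *n)
{
	if (n)
		n->refcount++;
	return n;
}

static void node_put(struct node_sketch *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	if (--n->refcount == 0)
		n->freed = 1;
}

/* Returns 0 on a valid access, -1 to model a use after free. */
static int node_use(const struct node_sketch *n)
{
	return n->freed ? -1 : 0;
}

/* Correct order: finish with the node, then drop the reference. */
static int use_then_put(struct node_sketch *n)
{
	int rc;

	node_get(n);
	rc = node_use(n);
	node_put(n);
	return rc;
}

/* Misplaced put: the reference is gone before the node is used. */
static int put_then_use(struct node_sketch *n)
{
	node_get(n);
	node_put(n);
	return node_use(n);
}
```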

Thanks-
John

Signed-off-by:  John Rose <johnrose at austin.ibm.com>

diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c
--- a/arch/ppc64/kernel/pSeries_pci.c	Mon Jul 26 17:47:44 2004
+++ b/arch/ppc64/kernel/pSeries_pci.c	Mon Jul 26 17:47:44 2004
@@ -284,10 +284,10 @@
 				isa_dn = of_find_node_by_type(NULL, "isa");
 				if (isa_dn) {
 					isa_io_base = pci_io_base;
-					of_node_put(isa_dn);
 					pci_process_ISA_OF_ranges(isa_dn,
 						hose->io_base_phys,
 						hose->io_base_virt);
+					of_node_put(isa_dn);
                                         /* Allow all IO */
                                         io_page_mask = -1;
 				}




From johnrose at austin.ibm.com  Tue Jul 27 09:06:21 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 26 Jul 2004 18:06:21 -0500
Subject: [PATCH] remove linux,pci-domain from OFDT
Message-ID: <1090883181.5914.4.camel@sinatra.austin.ibm.com>


The patch below scraps the creation of the "linux,pci-domain" property in the
OF device tree for each PCI Host Bridge.  This seems appropriate for the
following reasons:

1) It isn't referenced/used in the kernel.
2) It isn't exported to userspace, since it's added after /proc/device-tree
   is created.
3) Even if it was correctly exported to userspace, the same info is already
   available in sysfs.

Please apply, if there are no problems.

Thanks-
John

Signed-off-by:  John Rose <johnrose at austin.ibm.com>

diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c
--- a/arch/ppc64/kernel/pSeries_pci.c	Mon Jul 26 17:50:29 2004
+++ b/arch/ppc64/kernel/pSeries_pci.c	Mon Jul 26 17:50:29 2004
@@ -402,7 +402,6 @@
 	int *bus_range;
 	char *model;
 	enum phb_types phb_type;
- 	struct property *of_prop;

 	model = (char *)get_property(dev, "model", NULL);

@@ -448,21 +447,6 @@
 		kfree(phb);
 		return NULL;
 	}
-
-	of_prop = (struct property *)alloc_bootmem(sizeof(struct property) +
-			sizeof(phb->global_number));
-
-	if (!of_prop) {
-		kfree(phb);
-		return NULL;
-	}
-
-	memset(of_prop, 0, sizeof(struct property));
-	of_prop->name = "linux,pci-domain";
-	of_prop->length = sizeof(phb->global_number);
-	of_prop->value = (unsigned char *)&of_prop[1];
-	memcpy(of_prop->value, &phb->global_number, sizeof(phb->global_number));
-	prom_add_property(dev, of_prop);

 	phb->first_busno =  bus_range[0];
 	phb->last_busno  =  bus_range[1];




From johnrose at austin.ibm.com  Wed Jul 28 06:24:00 2004
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 27 Jul 2004 15:24:00 -0500
Subject: [PATCH] [trivial] struct pci_controller cleanup
Message-ID: <1090959840.15403.3.camel@sinatra.austin.ibm.com>


The patch below removes the unused member "pci_io_offset" from struct
pci_controller.  If there are no problems, please apply.

Thanks-
John

Signed-off-by:  John Rose <johnrose at austin.ibm.com>

diff -Nru a/include/asm-ppc64/pci-bridge.h b/include/asm-ppc64/pci-bridge.h
--- a/include/asm-ppc64/pci-bridge.h	Tue Jul 27 15:17:38 2004
+++ b/include/asm-ppc64/pci-bridge.h	Tue Jul 27 15:17:38 2004
@@ -47,7 +47,6 @@
 	 * the PCI memory space in the CPU bus space
 	 */
 	unsigned long pci_mem_offset;
-	unsigned long pci_io_offset;

 	struct pci_ops *ops;
 	volatile unsigned int *cfg_addr;




From sfr at canb.auug.org.au  Wed Jul 28 09:03:01 2004
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 28 Jul 2004 09:03:01 +1000
Subject: [PATCH] PPC64 pci_dn cleanups
Message-ID: <20040728090301.465d95b7.sfr@canb.auug.org.au>

Hi Paul,

This patch just cleans up arch/ppc64/kernel/pci_dn.c a bit, including:
	- remove it from the iSeries build completely
	- small changes to Makefile
	- remove the "post" parameter from traverse_pci_devices as
	  no one used it
	- make traverse_all_pci_devices static
	- remove CONFIG_PPC_PSERIES tests as we no longer build for iSeries
	- some reformatting (closer to "standard")
	- remove some of the pointer casts
Please apply and consider for upstream.

This has been built (with default config) on pSeries and pmac and built and
run on iSeries.
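
A minimal sketch of a depth-first walk with only a "pre" callback, assuming a toy tree with parent, child and sibling links. struct dn_sketch, traverse_sketch() and match_id() are hypothetical; unlike traverse_pci_devices, this version does not skip nodes that lack a "class-code" property.

```c
#include <assert.h>
#include <stddef.h>

struct dn_sketch {
	int id;
	struct dn_sketch *parent, *child, *sibling;
};

typedef void *(*traverse_fn_sketch)(struct dn_sketch *me, void *data);

/*
 * Visit start and everything below it, depth first.  The walk stops
 * as soon as pre() returns non-NULL, and that value is returned.
 */
static void *traverse_sketch(struct dn_sketch *start,
			     traverse_fn_sketch pre, void *data)
{
	struct dn_sketch *dn, *next;
	void *ret;

	assert(start != NULL);

	if (pre && (ret = pre(start, data)) != NULL)
		return ret;
	for (dn = start->child; dn; dn = next) {
		if (pre && (ret = pre(dn, data)) != NULL)
			return ret;
		if (dn->child) {
			next = dn->child;	/* depth first */
			continue;
		}
		/* Leaf: climb until some node has an unvisited sibling. */
		next = NULL;
		while (dn != start) {
			if (dn->sibling) {
				next = dn->sibling;
				break;
			}
			dn = dn->parent;
		}
	}
	return NULL;
}

/* Example callback: stop at the node whose id equals *(int *)data. */
static void *match_id(struct dn_sketch *dn, void *data)
{
	return dn->id == *(int *)data ? dn : NULL;
}
```

Returning the matching node from the callback is what makes the traversal usable as a lookup, which is the use the comment on traverse_pci_devices describes.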

Signed-off-by: Stephen Rothwell <sfr at canb.auug.org.au>
--
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
-------------- next part --------------
diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/Makefile 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/Makefile
--- 2.6.8-rc2-bk6/arch/ppc64/kernel/Makefile	2004-07-21 08:39:34.000000000 +1000
+++ 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/Makefile	2004-07-28 05:46:49.000000000 +1000
@@ -15,14 +15,11 @@ obj-y               :=	setup.o entry.o t

 obj-$(CONFIG_PPC_OF) +=	of_device.o

-obj-$(CONFIG_PCI)	+= pci.o pci_dn.o pci_iommu.o
+pci-obj-$(CONFIG_PPC_ISERIES)	+= iSeries_pci.o iSeries_pci_reset.o \
+				     iSeries_IoMmTable.o
+pci-obj-$(CONFIG_PPC_PSERIES)	+= pci_dn.o pci_dma_direct.o

-ifdef CONFIG_PPC_ISERIES
-obj-$(CONFIG_PCI)	+= iSeries_pci.o iSeries_pci_reset.o \
-			     iSeries_IoMmTable.o
-else
-obj-$(CONFIG_PCI)	+= pci_dma_direct.o
-endif
+obj-$(CONFIG_PCI)	+= pci.o pci_iommu.o $(pci-obj-y)

 obj-$(CONFIG_PPC_ISERIES) += iSeries_irq.o \
 			     iSeries_VpdInfo.o XmPciLpEvent.o \
diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/eeh.c 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/eeh.c
--- 2.6.8-rc2-bk6/arch/ppc64/kernel/eeh.c	2004-07-28 01:32:09.000000000 +1000
+++ 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/eeh.c	2004-07-28 05:15:46.000000000 +1000
@@ -618,7 +618,7 @@ void __init eeh_init(void)

 		info.buid_lo = BUID_LO(buid);
 		info.buid_hi = BUID_HI(buid);
-		traverse_pci_devices(phb, early_enable_eeh, NULL, &info);
+		traverse_pci_devices(phb, early_enable_eeh, &info);
 	}

 	if (eeh_subsystem_enabled) {
diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/pci.h 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci.h
--- 2.6.8-rc2-bk6/arch/ppc64/kernel/pci.h	2004-07-21 08:39:34.000000000 +1000
+++ 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci.h	2004-07-28 05:15:36.000000000 +1000
@@ -34,8 +34,8 @@ extern struct pci_dev *ppc64_isabridge_d
  *******************************************************************/
 struct device_node;
 typedef void *(*traverse_func)(struct device_node *me, void *data);
-void *traverse_pci_devices(struct device_node *start, traverse_func pre, traverse_func post, void *data);
-void *traverse_all_pci_devices(traverse_func pre);
+void *traverse_pci_devices(struct device_node *start, traverse_func pre,
+		void *data);

 void pci_devs_phb_init(void);
 void pci_fix_bus_sysdata(void);
diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/pci_dn.c 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci_dn.c
--- 2.6.8-rc2-bk6/arch/ppc64/kernel/pci_dn.c	2004-07-28 01:32:09.000000000 +1000
+++ 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci_dn.c	2004-07-28 05:49:07.000000000 +1000
@@ -19,8 +19,6 @@
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
  */
-
-#include <linux/config.h>
 #include <linux/kernel.h>
 #include <linux/pci.h>
 #include <linux/delay.h>
@@ -40,20 +38,20 @@

 #include "pci.h"

-/* Traverse_func that inits the PCI fields of the device node.
+/*
+ * Traverse_func that inits the PCI fields of the device node.
  * NOTE: this *must* be done before read/write config to the device.
  */
-static void * __init
-update_dn_pci_info(struct device_node *dn, void *data)
+static void * __init update_dn_pci_info(struct device_node *dn, void *data)
 {
-#ifdef CONFIG_PPC_PSERIES
-	struct pci_controller *phb = (struct pci_controller *)data;
+	struct pci_controller *phb = data;
 	u32 *regs;
 	char *device_type = get_property(dn, "device_type", NULL);
 	char *model;

 	dn->phb = phb;
-	if (device_type && strcmp(device_type, "pci") == 0 && get_property(dn, "class-code", NULL) == 0) {
+	if (device_type && (strcmp(device_type, "pci") == 0) &&
+			(get_property(dn, "class-code", NULL) == 0)) {
 		/* special case for PHB's.  Sigh. */
 		regs = (u32 *)get_property(dn, "bus-range", NULL);
 		dn->busno = regs[0];
@@ -72,57 +70,47 @@ update_dn_pci_info(struct device_node *d
 			dn->devfn = (regs[0] >> 8) & 0xff;
 		}
 	}
-#endif
 	return NULL;
 }

-/******************************************************************
+/*
  * Traverse a device tree stopping each PCI device in the tree.
  * This is done depth first.  As each node is processed, a "pre"
- * function is called, the children are processed recursively, and
- * then a "post" function is called.
+ * function is called and the children are processed recursively.
  *
- * The "pre" and "post" funcs return a value.  If non-zero
- * is returned from the "pre" func, the traversal stops and this
- * value is returned.  The return value from "post" is not used.
- * This return value is useful when using traverse as
- * a method of finding a device.
+ * The "pre" func returns a value.  If non-zero is returned from
+ * the "pre" func, the traversal stops and this value is returned.
+ * This return value is useful when using traverse as a method of
+ * finding a device.
  *
- * NOTE: we do not run the funcs for devices that do not appear to
+ * NOTE: we do not run the func for devices that do not appear to
  * be PCI except for the start node which we assume (this is good
  * because the start node is often a phb which may be missing PCI
  * properties).
  * We use the class-code as an indicator. If we run into
  * one of these nodes we also assume its siblings are non-pci for
  * performance.
- *
- ******************************************************************/
-void *traverse_pci_devices(struct device_node *start, traverse_func pre, traverse_func post, void *data)
+ */
+void *traverse_pci_devices(struct device_node *start, traverse_func pre,
+		void *data)
 {
 	struct device_node *dn, *nextdn;
 	void *ret;

-	if (pre && (ret = pre(start, data)) != NULL)
+	if (pre && ((ret = pre(start, data)) != NULL))
 		return ret;
 	for (dn = start->child; dn; dn = nextdn) {
 		nextdn = NULL;
-#ifdef CONFIG_PPC_PSERIES
 		if (get_property(dn, "class-code", NULL)) {
-			if (pre && (ret = pre(dn, data)) != NULL)
+			if (pre && ((ret = pre(dn, data)) != NULL))
 				return ret;
-			if (dn->child) {
+			if (dn->child)
 				/* Depth first...do children */
 				nextdn = dn->child;
-			} else if (dn->sibling) {
+			else if (dn->sibling)
 				/* ok, try next sibling instead. */
 				nextdn = dn->sibling;
-			} else {
-				/* no more children or siblings...call "post" */
-				if (post)
-					post(dn, data);
-			}
 		}
-#endif
 		if (!nextdn) {
 			/* Walk up to next valid sibling. */
 			do {
@@ -136,31 +124,35 @@ void *traverse_pci_devices(struct device
 	return NULL;
 }

-/* Same as traverse_pci_devices except this does it for all phbs.
+/*
+ * Same as traverse_pci_devices except this does it for all phbs.
  */
-void *traverse_all_pci_devices(traverse_func pre)
+static void *traverse_all_pci_devices(traverse_func pre)
 {
-	struct pci_controller* phb;
+	struct pci_controller *phb;
 	void *ret;
-	for (phb=hose_head;phb;phb=phb->next)
-		if ((ret = traverse_pci_devices((struct device_node *)phb->arch_data, pre, NULL, phb)) != NULL)
+
+	for (phb = hose_head; phb; phb = phb->next)
+		if ((ret = traverse_pci_devices(phb->arch_data, pre, phb))
+				!= NULL)
 			return ret;
 	return NULL;
 }


-/* Traversal func that looks for a <busno,devfcn> value.
+/*
+ * Traversal func that looks for a <busno,devfcn> value.
  * If found, the device_node is returned (thus terminating the traversal).
  */
-static void *
-is_devfn_node(struct device_node *dn, void *data)
+static void *is_devfn_node(struct device_node *dn, void *data)
 {
 	int busno = ((unsigned long)data >> 8) & 0xff;
 	int devfn = ((unsigned long)data) & 0xff;
-	return (devfn == dn->devfn && busno == dn->busno) ? dn : NULL;
+	return ((devfn == dn->devfn) && (busno == dn->busno)) ? dn : NULL;
 }

-/* This is the "slow" path for looking up a device_node from a
+/*
+ * This is the "slow" path for looking up a device_node from a
  * pci_dev.  It will hunt for the device under its parent's
  * phb and then update sysdata for a future fastpath.
  *
@@ -174,14 +166,14 @@ is_devfn_node(struct device_node *dn, vo
  */
 struct device_node *fetch_dev_dn(struct pci_dev *dev)
 {
-	struct device_node *orig_dn = (struct device_node *)dev->sysdata;
+	struct device_node *orig_dn = dev->sysdata;
 	struct pci_controller *phb = orig_dn->phb; /* assume same phb as orig_dn */
 	struct device_node *phb_dn;
 	struct device_node *dn;
 	unsigned long searchval = (dev->bus->number << 8) | dev->devfn;

-	phb_dn = (struct device_node *)(phb->arch_data);
-	dn = (struct device_node *)traverse_pci_devices(phb_dn, is_devfn_node, NULL, (void *)searchval);
+	phb_dn = phb->arch_data;
+	dn = traverse_pci_devices(phb_dn, is_devfn_node, (void *)searchval);
 	if (dn) {
 		dev->sysdata = dn;
 		/* ToDo: call some device init hook here */
@@ -191,25 +183,23 @@ struct device_node *fetch_dev_dn(struct
 EXPORT_SYMBOL(fetch_dev_dn);


-/******************************************************************
+/*
  * Actually initialize the phbs.
  * The buswalk on this phb has not happened yet.
- ******************************************************************/
-void __init
-pci_devs_phb_init(void)
+ */
+void __init pci_devs_phb_init(void)
 {
 	/* This must be done first so the device nodes have valid pci info! */
 	traverse_all_pci_devices(update_dn_pci_info);
 }


-static void __init
-pci_fixup_bus_sysdata_list(struct list_head *bus_list)
+static void __init pci_fixup_bus_sysdata_list(struct list_head *bus_list)
 {
 	struct list_head *ln;
 	struct pci_bus *bus;

-	for (ln=bus_list->next; ln != bus_list; ln=ln->next) {
+	for (ln = bus_list->next; ln != bus_list; ln = ln->next) {
 		bus = pci_bus_b(ln);
 		if (bus->self)
 			bus->sysdata = bus->self->sysdata;
@@ -217,7 +207,7 @@ pci_fixup_bus_sysdata_list(struct list_h
 	}
 }

-/******************************************************************
+/*
  * Fixup the bus->sysdata ptrs to point to the bus' device_node.
  * This is done late in pcibios_init().  We do this mostly for
  * sanity, but pci_dma.c uses these at DMA time so they must be
@@ -225,9 +215,8 @@ pci_fixup_bus_sysdata_list(struct list_h
  * To do this we recurse down the bus hierarchy.  Note that PHB's
  * have bus->self == NULL, but fortunately bus->sysdata is already
  * correct in this case.
- ******************************************************************/
-void __init
-pci_fix_bus_sysdata(void)
+ */
+void __init pci_fix_bus_sysdata(void)
 {
 	pci_fixup_bus_sysdata_list(&pci_root_buses);
 }
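
The traversal contract described in the comment above (call "pre" on each
node, stop at the first non-NULL result) can be sketched in user space. This
is only an illustration: struct node, walk(), match_devfn() and pack_key()
are hypothetical stand-ins, the walk is recursive rather than iterative, and
it does not skip non-PCI nodes the way traverse_pci_devices() does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct device_node; only the fields the walk needs. */
struct node {
	int busno, devfn;
	struct node *child, *sibling;
};

typedef void *(*visit_fn)(struct node *n, void *data);

/* Pre-order walk: call pre on each node, stop at the first non-NULL result. */
static void *walk(struct node *start, visit_fn pre, void *data)
{
	struct node *n;
	void *ret;

	if (start == NULL)
		return NULL;
	ret = pre(start, data);
	if (ret != NULL)
		return ret;
	for (n = start->child; n != NULL; n = n->sibling) {
		ret = walk(n, pre, data);
		if (ret != NULL)
			return ret;
	}
	return NULL;
}

/* Build a search key as (busno << 8) | devfn, like searchval in fetch_dev_dn. */
static unsigned long pack_key(int busno, int devfn)
{
	assert(busno >= 0 && busno <= 0xff && devfn >= 0 && devfn <= 0xff);
	return ((unsigned long)busno << 8) | (unsigned long)devfn;
}

/* Unpack the key and compare, in the spirit of is_devfn_node. */
static void *match_devfn(struct node *n, void *data)
{
	unsigned long key = (unsigned long)data;
	int busno = (key >> 8) & 0xff;
	int devfn = key & 0xff;

	return (n->busno == busno && n->devfn == devfn) ? n : NULL;
}
```

Calling walk(&root, match_devfn, (void *)pack_key(bus, devfn)) returns the
matching node or NULL, which is why the return value doubles as a "find"
primitive.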

From paulus at samba.org  Wed Jul 28 11:50:04 2004
From: paulus at samba.org (Paul Mackerras)
Date: Tue, 27 Jul 2004 20:50:04 -0500
Subject: UP load_up_fpu crash (2.6.8-rc2)
In-Reply-To: <1090651390.3592.11.camel@booger>
References: <1090651390.3592.11.camel@booger>
Message-ID: <16647.1612.244643.858755@cargo.ozlabs.ibm.com>


Nathan Lynch writes:

> We seem to be broken with CONFIG_SMP=n on 2.6.8-rc2 and 2.6.8-rc1-mm1:
>
> Freeing unused kernel memory: 280k freed
> INIT: version 2.85 booting
> Vector: 300 (Data Access) at [c00000003f043bb0]
>     pc: c00000000000bab0: .load_up_fpu+0xb0/0x16c
>     lr: 00000000400272b8
>     sp: c00000003f043e30
>    msr: 8000000000003032
>    dar: 108
>  dsisr: 40000000
>   current = 0xc00000003f03d440
>   paca    = 0xc0000000003cc000
>     pid   = 327, comm = hotplug
> enter ? for help
> mon> t
> [c00000003f043e30] c00000000000b4d8 .handle_page_fault+0x20/0x40
> (unreliable)
> --- Exception: 801 (FPU Unavailable) at 000000004000b908
> SP (ffffe480) is in userspace

This is very puzzling.  It appears that we have taken a FPU
unavailable trap from userspace, which is fine, but then it looks like
we think some other task owns the FPU at the moment, and that task is
a kernel thread.

We are crashing because last_task_used_math->thread.regs is NULL.
That should only happen for a kernel thread, but last_task_used_math
should never point to a kernel thread.  The only place that
last_task_used_math gets set to a non-NULL value is in load_up_fpu,
and that should only be called if we get a FPU unavailable trap from
usermode.

It would be very useful to see what last_task_used_math contains at
the time of the crash, and see what last_task_used_math->comm is, so
we can work out whether the task that owns the FPU is in fact a kernel
thread - in which case we need to work out how last_task_used_math is
getting to point at it - or if it isn't a kernel thread, in which case
we need to work out why task->thread.regs is NULL for that task.
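
As a rough illustration of the ownership hand-off described above, here is a
user-space model. All names ending in _sketch are made up for this example
and the code is not the real load_up_fpu; it only shows why a previous owner
with regs == NULL is fatal. Assuming the usual ppc64 pt_regs layout (32 GPRs
followed by nip and msr), the reported dar of 108 would also be consistent
with touching the msr field through a NULL regs pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MSR_FP_SKETCH 0x2000UL	/* illustrative "FPU enabled" bit */

struct regs_sketch {
	unsigned long msr;
};

struct task_sketch {
	const char *comm;
	struct regs_sketch *regs;	/* NULL models a kernel thread */
};

/* Models last_task_used_math: the task whose state is live in the FPU. */
static struct task_sketch *fpu_owner_sketch;

/*
 * Hand the FPU to cur.  Returns 0 on success, or -1 when the previous
 * owner has no user register frame, which is the case where the real
 * code would fault on a NULL pointer.
 */
static int take_fpu_sketch(struct task_sketch *cur)
{
	struct task_sketch *prev = fpu_owner_sketch;

	/* The trap came from user mode, so cur must have a register frame. */
	assert(cur != NULL && cur->regs != NULL);

	if (prev != NULL && prev != cur) {
		if (prev->regs == NULL)
			return -1;
		/* Previous owner must trap again before using the FPU. */
		prev->regs->msr &= ~MSR_FP_SKETCH;
	}
	cur->regs->msr |= MSR_FP_SKETCH;
	fpu_owner_sketch = cur;
	return 0;
}
```

In this model the -1 path can only be reached if something other than
take_fpu_sketch() stored a kernel thread in fpu_owner_sketch, which is the
same question being asked about last_task_used_math.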

Paul.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From sharada at in.ibm.com  Wed Jul 28 19:52:09 2004
From: sharada at in.ibm.com (R Sharada)
Date: Wed, 28 Jul 2004 15:22:09 +0530
Subject: adding boot_cpu to cpu_online_map query
Message-ID: <20040728095209.GA2224@in.ibm.com>


Hello,
	Looking at the ppc64 code, I had a query on how the boot_cpuid is set in
the cpu_online_map.

Looking at prom.c (prom_hold_cpus), I see that the boot cpu is added in the
cpu_online_map:

#ifdef CONFIG_SMP
                else {
                        prom_printf("%x : booting  cpu %s\n", cpuid, path);
                        cpu_set(cpuid, RELOC(cpu_available_map));
                        cpu_set(cpuid, RELOC(cpu_possible_map));
                        cpu_set(cpuid, RELOC(cpu_online_map));
                        cpu_set(cpuid, RELOC(cpu_present_at_boot));
                }
#endif

Then, in smp_prepare_boot_cpu(), called from start_kernel(), we add the
boot_cpuid to the cpu_online_map again:

BUG_ON(smp_processor_id() != boot_cpuid);

        /* cpu_possible is set up in prom.c */
        cpu_set(boot_cpuid, cpu_online_map);

        paca[boot_cpuid].__current = current;
        current_set[boot_cpuid] = current->thread_info;

Why is this being done twice for ppc64? Or did I miss something?
I do see that the cpu_set in prom.c is under ifdef CONFIG_SMP. So, for SMP
systems, the cpu_set for boot_cpu will be called twice and for non-SMP, it will
be set in start_kernel?
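
For what it is worth, setting the same bit twice leaves the mask unchanged, so
the duplicate call would be redundant rather than harmful. A minimal
user-space sketch of that property follows; cpumask_sketch_t and the
*_sketch helpers are hypothetical stand-ins, not the kernel's cpumask
implementation.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 32

/* Toy single-word cpu mask. */
typedef struct {
	unsigned long bits;
} cpumask_sketch_t;

/* Set the bit for cpu; repeating the call leaves the mask unchanged. */
static void cpu_set_sketch(unsigned int cpu, cpumask_sketch_t *mask)
{
	assert(cpu < NR_CPUS_SKETCH);
	mask->bits |= 1UL << cpu;
}

/* Return 1 if the bit for cpu is set, 0 otherwise. */
static int cpu_isset_sketch(unsigned int cpu, const cpumask_sketch_t *mask)
{
	assert(cpu < NR_CPUS_SKETCH);
	return (int)((mask->bits >> cpu) & 1UL);
}
```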

Thanks and Regards,
Sharada



From benh at kernel.crashing.org  Thu Jul 29 09:27:19 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 29 Jul 2004 09:27:19 +1000
Subject: [RFC/PATCH] How to block pci config-reads during device self-test?
In-Reply-To: <20040713230328.GQ17333@bilge>
References: <20040706161254.C21634@forte.austin.ibm.com>
	 <40ED780C.6000401@us.ibm.com>  <20040713230328.GQ17333@bilge>
Message-ID: <1091057239.2333.34.camel@gaston>


On Wed, 2004-07-14 at 09:03, Linas Vepstas wrote:
>
> On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King was heard to remark:
> > I've been doing some talking with various hardware folks to see if
> > there is a way to get this fixed and so far the answer I have been
> > getting is no. It is how the hardware works and there isn't much
> > that can be done about it ...
>
> Actually, the hardware isn't broken ... it's working as designed, more
> or less per pci spec. See below for technical details.

Catching up a bit late... the whole in_interrupt() test with semaphore
is totally disgusting... Also, PCI ops are called within a spinlock
anyway.

Just make the config read fail (return all FF's).

--
Benjamin Herrenschmidt <benh at kernel.crashing.org>



From anton at samba.org  Thu Jul 29 15:56:36 2004
From: anton at samba.org (Anton Blanchard)
Date: Thu, 29 Jul 2004 15:56:36 +1000
Subject: [PATCH] [trivial] struct pci_controller cleanup
In-Reply-To: <1090959840.15403.3.camel@sinatra.austin.ibm.com>
References: <1090959840.15403.3.camel@sinatra.austin.ibm.com>
Message-ID: <20040729055636.GG6456@krispykreme>


Hi,

Thanks John.

Anton

--

From: John Rose <johnrose at austin.ibm.com>

The patch below removes the unused member "pci_io_offset" from struct
pci_controller.  If there are no problems, please apply.

Signed-off-by: John Rose <johnrose at austin.ibm.com>
Signed-off-by: Anton Blanchard <anton at samba.org>

diff -Nru a/include/asm-ppc64/pci-bridge.h b/include/asm-ppc64/pci-bridge.h
--- a/include/asm-ppc64/pci-bridge.h	Tue Jul 27 15:17:38 2004
+++ b/include/asm-ppc64/pci-bridge.h	Tue Jul 27 15:17:38 2004
@@ -47,7 +47,6 @@
 	 * the PCI memory space in the CPU bus space
 	 */
 	unsigned long pci_mem_offset;
-	unsigned long pci_io_offset;

 	struct pci_ops *ops;
 	volatile unsigned int *cfg_addr;



From benh at kernel.crashing.org  Fri Jul 30 15:22:32 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 30 Jul 2004 15:22:32 +1000
Subject: [RFC][PATCH] ppc64: better handling of H_ENTER failures
Message-ID: <1091164951.2077.34.camel@gaston>


Hi !

This patch changes the hash insertion routines to return an error
instead of calling panic() when HV refuses to insert a HPTE.

The error is now propagated upstream, and either bad_page_fault() is
called (kernel mode) or a SIGBUS signal is forced (user mode). Some
other panic() cases are also turned into BUG_ON.

Overall, this should provide us with better debugging data if the
problem happens, and it keeps errors caused by userland mapping /dev/mem
and touching forbidden I/O ranges (XFree?) from bringing the whole
kernel down.
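
The intended control flow can be summarised in C. This is a sketch of the
decision only, using the result codes documented in the hash_utils.c hunk
(0 handled, 1 normal page fault, -1 critical insertion error); the enum and
dispatch_hash_result() are hypothetical names that do not appear in the
patch.

```c
#include <assert.h>

enum fault_action_sketch {
	ACTION_RETURN,		/* hash_page handled it, return from exception */
	ACTION_PAGE_FAULT,	/* take the normal page fault path */
	ACTION_SIGBUS,		/* insertion failed in user mode: force SIGBUS */
	ACTION_BAD_PAGE_FAULT	/* insertion failed in kernel mode */
};

/* Map a hash_page() result and the faulting mode onto the action taken. */
static enum fault_action_sketch dispatch_hash_result(int rc, int from_user)
{
	assert(from_user == 0 || from_user == 1);

	if (rc == 0)
		return ACTION_RETURN;
	if (rc < 0)
		return from_user ? ACTION_SIGBUS : ACTION_BAD_PAGE_FAULT;
	return ACTION_PAGE_FAULT;
}
```

The negative case corresponds to the new "12:" label in head.S, which calls
low_hash_fault instead of panicking.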

Any comments?

Ben.

===== arch/ppc64/kernel/head.S 1.61 vs edited =====
--- 1.61/arch/ppc64/kernel/head.S	2004-06-17 00:46:06 -05:00
+++ edited/arch/ppc64/kernel/head.S	2004-06-23 15:03:55 -05:00
@@ -1015,6 +1015,9 @@
 	 * interrupts if necessary.
 	 */
 	beq	.ret_from_except_lite
+	/* For a hash failure, we don't bother re-enabling interrupts */
+	ble-	12f
+
 	/*
 	 * hash_page couldn't handle it, set soft interrupt enable back
 	 * to what it was before the trap.  Note that .local_irq_restore
@@ -1025,6 +1028,8 @@
 	b	11f
 #else
 	beq+	fast_exception_return   /* Return from exception on success */
+	ble-	12f			/* Failure return from hash_page */
+
 	/* fall through */
 #endif

@@ -1042,6 +1047,15 @@
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	lwz	r4,_DAR(r1)
 	bl	.bad_page_fault
+	b	.ret_from_except
+
+/* We have a page fault that hash_page could handle but HV refused
+ * the PTE insertion
+ */
+12:	bl	.save_nvgprs
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	lwz	r4,_DAR(r1)
+	bl	.low_hash_fault
 	b	.ret_from_except

 	/* here we have a segment miss */
===== arch/ppc64/kernel/pSeries_lpar.c 1.36 vs edited =====
--- 1.36/arch/ppc64/kernel/pSeries_lpar.c	2004-04-12 12:54:09 -05:00
+++ edited/arch/ppc64/kernel/pSeries_lpar.c	2004-06-23 15:00:53 -05:00
@@ -377,7 +377,7 @@
 	lpar_rc = plpar_hcall(H_ENTER, flags, hpte_group, lhpte.dw0.dword0,
 			      lhpte.dw1.dword1, &slot, &dummy0, &dummy1);

-	if (lpar_rc == H_PTEG_Full)
+	if (unlikely(lpar_rc == H_PTEG_Full))
 		return -1;

 	/*
@@ -385,8 +385,13 @@
 	 * will fail. However we must catch the failure in hash_page
 	 * or we will loop forever, so return -2 in this case.
 	 */
-	if (lpar_rc != H_Success)
+	if (unlikely(lpar_rc != H_Success)) {
+		printk(KERN_WARNING
+		       "MMU fault ! H_ENTER failed, rc: %lx, va: %016lx, prpn: %lx,"
+		       " hpteflags: %lx, secondary: %d, bolted: %d, large: %d\n",
+		       lpar_rc, va, prpn, hpteflags, secondary, bolted, large);
 		return -2;
+	}

 	/* Because of iSeries, we have to pass down the secondary
 	 * bucket bit here as well
@@ -415,9 +420,7 @@
 		if (lpar_rc == H_Success)
 			return i;

-		if (lpar_rc != H_Not_Found)
-			panic("Bad return code from pte remove rc = %lx\n",
-			      lpar_rc);
+		BUG_ON(lpar_rc != H_Not_Found);

 		slot_offset++;
 		slot_offset &= 0x7;
@@ -447,8 +450,7 @@
 	if (lpar_rc == H_Not_Found)
 		return -1;

-	if (lpar_rc != H_Success)
-		panic("bad return code from pte protect rc = %lx\n", lpar_rc);
+	BUG_ON(lpar_rc != H_Success);

 	return 0;
 }
@@ -467,8 +469,7 @@

 	lpar_rc = plpar_pte_read(flags, slot, &dword0, &dummy_word1);

-	if (lpar_rc != H_Success)
-		panic("Error on pte read in get_hpte0 rc = %lx\n", lpar_rc);
+	BUG_ON(lpar_rc != H_Success);

 	return dword0;
 }
@@ -519,15 +520,12 @@
 	vpn = va >> PAGE_SHIFT;

 	slot = pSeries_lpar_hpte_find(vpn);
-	if (slot == -1)
-		panic("updateboltedpp: Could not find page to bolt\n");
+	BUG_ON(slot == -1);

 	flags = newpp & 3;
 	lpar_rc = plpar_pte_protect(flags, slot, 0);

-	if (lpar_rc != H_Success)
-		panic("Bad return code from pte bolted protect rc = %lx\n",
-		      lpar_rc);
+	BUG_ON(lpar_rc != H_Success);
 }

 static void pSeries_lpar_hpte_invalidate(unsigned long slot, unsigned long va,
@@ -546,8 +544,7 @@
 	if (lpar_rc == H_Not_Found)
 		return;

-	if (lpar_rc != H_Success)
-		panic("Bad return code from invalidate rc = %lx\n", lpar_rc);
+	BUG_ON(lpar_rc != H_Success);
 }

 /*
===== arch/ppc64/mm/hash_low.S 1.5 vs edited =====
--- 1.5/arch/ppc64/mm/hash_low.S	2004-04-12 12:54:08 -05:00
+++ edited/arch/ppc64/mm/hash_low.S	2004-06-22 16:14:22 -05:00
@@ -278,6 +278,10 @@
 	b	bail

 htab_pte_insert_failure:
-	b	.htab_insert_failure
+	/* Bail out restoring old PTE */
+	ld	r6,STK_PARM(r6)(r1)
+	std	r31,0(r6)
+	li	r3,-1
+	b	bail


===== arch/ppc64/mm/hash_utils.c 1.47 vs edited =====
--- 1.47/arch/ppc64/mm/hash_utils.c	2004-06-10 01:21:41 -05:00
+++ edited/arch/ppc64/mm/hash_utils.c	2004-06-23 15:08:29 -05:00
@@ -28,6 +28,7 @@
 #include <linux/ctype.h>
 #include <linux/cache.h>
 #include <linux/init.h>
+#include <linux/signal.h>

 #include <asm/ppcdebug.h>
 #include <asm/processor.h>
@@ -236,14 +237,11 @@
 	return pp;
 }

-/*
- * Called by asm hashtable.S in case of critical insert failure
+/* Result code is:
+ *  0 - handled
+ *  1 - normal page fault
+ * -1 - critical hash insertion error
  */
-void htab_insert_failure(void)
-{
-	panic("hash_page: pte_insert failed\n");
-}
-
 int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
 {
 	void *pgdir;
@@ -374,6 +372,25 @@
 	*insn_addr = (unsigned int)(0x48000001 | (offset & 0x03fffffc));
 	flush_icache_range((unsigned long)insn_addr, 4+
 			   (unsigned long)insn_addr);
+}
+
+/*
+ * low_hash_fault is called when the low level hash code failed
+ * to insert a PTE due to a hypervisor error
+ */
+void low_hash_fault(struct pt_regs *regs, unsigned long address)
+{
+	if (user_mode(regs)) {
+		siginfo_t info;
+
+		info.si_signo = SIGBUS;
+		info.si_errno = 0;
+		info.si_code = BUS_ADRERR;
+		info.si_addr = (void *)address;
+		force_sig_info(SIGBUS, &info, current);
+		return;
+	}
+	bad_page_fault(regs, address, SIGBUS);
 }

 void __init htab_finish_init(void)
===== include/asm-ppc64/system.h 1.32 vs edited =====
--- 1.32/include/asm-ppc64/system.h	2004-06-20 20:15:35 -05:00
+++ edited/include/asm-ppc64/system.h	2004-06-23 15:08:01 -05:00
@@ -105,6 +105,7 @@
 extern void bad_page_fault(struct pt_regs *regs, unsigned long address,
 			   int sig);
 extern void show_regs(struct pt_regs * regs);
+extern void low_hash_fault(struct pt_regs *regs, unsigned long address);
 extern int die(const char *str, struct pt_regs *regs, long err);

 extern void flush_instruction_cache(void);




From nathanl at austin.ibm.com  Sat Jul 31 07:45:50 2004
From: nathanl at austin.ibm.com (nathanl at austin.ibm.com)
Date: Fri, 30 Jul 2004 16:45:50 -0500
Subject: [patch 1/4] Use platform numbering of cpus for hypervisor calls.
Message-ID: <200407302145.i6ULj9c0068622@austin.ibm.com>


We were using Linux's cpu numbering for cpu-related hypervisor calls
(e.g. vpa registration, H_CONFER).  It happened to work most of the
time because Linux and the hypervisor usually, but not always, have
the same numbering for cpus.
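
To make the failure mode concrete, here is a small sketch with an invented
logical-to-hardware mapping. The table contents and the *_sketch names are
assumptions for illustration only; in the kernel the translation is done by
get_hard_smp_processor_id().

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical mapping: logical cpu N sits on hardware thread 2*N. */
static const unsigned int hw_cpu_id_sketch[NR_CPUS_SKETCH] = { 0, 2, 4, 6 };

/* Translate a Linux (logical) cpu number to the platform's numbering. */
static unsigned int get_hard_id_sketch(unsigned int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	return hw_cpu_id_sketch[cpu];
}
```

With such a mapping, passing the logical number straight to the hypervisor
is only correct for cpu 0; every other cpu would name the wrong processor.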

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>


---


diff -puN arch/ppc64/kernel/smp.c~ppc64_fix_hcall_cpuids arch/ppc64/kernel/smp.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64_fix_hcall_cpuids	2004-07-30 10:16:57.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c	2004-07-30 16:44:39.000000000 -0500
@@ -501,11 +501,11 @@ static void __init smp_space_timers(unsi
 #ifdef CONFIG_PPC_PSERIES
 void vpa_init(int cpu)
 {
-	unsigned long flags;
+	unsigned long flags, pcpu = get_hard_smp_processor_id(cpu);

 	/* Register the Virtual Processor Area (VPA) */
 	flags = 1UL << (63 - 18);
-	register_vpa(flags, cpu, __pa((unsigned long)&(paca[cpu].lppaca)));
+	register_vpa(flags, pcpu, __pa((unsigned long)&(paca[cpu].lppaca)));
 }

 static inline void smp_xics_do_message(int cpu, int msg)
diff -puN arch/ppc64/lib/locks.c~ppc64_fix_hcall_cpuids arch/ppc64/lib/locks.c
--- 2.6.8-rc2-mm1/arch/ppc64/lib/locks.c~ppc64_fix_hcall_cpuids	2004-07-30 10:16:57.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/lib/locks.c	2004-07-30 10:16:57.000000000 -0500
@@ -63,7 +63,8 @@ void __spin_yield(spinlock_t *lock)
 	HvCall2(HvCallBaseYieldProcessor, HvCall_YieldToProc,
 		((u64)holder_cpu << 32) | yield_count);
 #else
-	plpar_hcall_norets(H_CONFER, holder_cpu, yield_count);
+	plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(holder_cpu),
+			   yield_count);
 #endif
 }

@@ -179,7 +180,8 @@ void __rw_yield(rwlock_t *rw)
 	HvCall2(HvCallBaseYieldProcessor, HvCall_YieldToProc,
 		((u64)holder_cpu << 32) | yield_count);
 #else
-	plpar_hcall_norets(H_CONFER, holder_cpu, yield_count);
+	plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(holder_cpu),
+			   yield_count);
 #endif
 }


_



From nathanl at austin.ibm.com  Sat Jul 31 07:45:56 2004
From: nathanl at austin.ibm.com (nathanl at austin.ibm.com)
Date: Fri, 30 Jul 2004 16:45:56 -0500
Subject: [patch 2/4] Use cpu_present_map in ppc64
Message-ID: <200407302145.i6ULjHc0064152@austin.ibm.com>


Adopt the "standard" cpu_present_map for describing cpus which are
present in the system, but not necessarily online.  cpu_present_map is
meant to be a superset of cpu_online_map and a subset of
cpu_possible_map.
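
The relationship between the three maps can be written down as a check. This
is a user-space sketch with single-word masks; struct cpu_maps_sketch and
the helpers are hypothetical and are not part of the patch.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 32

struct cpu_maps_sketch {
	unsigned long possible;	/* may ever be brought online */
	unsigned long present;	/* physically there right now */
	unsigned long online;	/* currently running */
};

/* A present cpu is by definition also possible. */
static void mark_present_sketch(struct cpu_maps_sketch *m, unsigned int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	m->possible |= 1UL << cpu;
	m->present |= 1UL << cpu;
}

/* Only a present cpu may go online; returns -1 otherwise. */
static int mark_online_sketch(struct cpu_maps_sketch *m, unsigned int cpu)
{
	assert(cpu < NR_CPUS_SKETCH);
	if (!(m->present & (1UL << cpu)))
		return -1;
	m->online |= 1UL << cpu;
	return 0;
}

/* Check online is a subset of present, and present a subset of possible. */
static int maps_consistent_sketch(const struct cpu_maps_sketch *m)
{
	return (m->online & ~m->present) == 0 &&
	       (m->present & ~m->possible) == 0;
}
```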

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>


---


diff -puN arch/ppc64/kernel/prom.c~ppc64-add-cpu_present_map arch/ppc64/kernel/prom.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/prom.c~ppc64-add-cpu_present_map	2004-07-30 16:29:29.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/prom.c	2004-07-30 16:29:29.000000000 -0500
@@ -943,6 +943,7 @@ static void __init prom_hold_cpus(unsign
 			cpu_set(cpuid, RELOC(cpu_available_map));
 			cpu_set(cpuid, RELOC(cpu_possible_map));
 			cpu_set(cpuid, RELOC(cpu_present_at_boot));
+			cpu_set(cpuid, RELOC(cpu_present_map));
 			if (reg == 0)
 				cpu_set(cpuid, RELOC(cpu_online_map));
 #endif /* CONFIG_SMP */
@@ -1045,6 +1046,7 @@ static void __init prom_hold_cpus(unsign
 				cpu_set(cpuid, RELOC(cpu_available_map));
 				cpu_set(cpuid, RELOC(cpu_possible_map));
 				cpu_set(cpuid, RELOC(cpu_present_at_boot));
+				cpu_set(cpuid, RELOC(cpu_present_map));
 #endif
 			} else {
 				prom_printf("... failed: %x\n", *acknowledge);
@@ -1057,6 +1059,7 @@ static void __init prom_hold_cpus(unsign
 			cpu_set(cpuid, RELOC(cpu_possible_map));
 			cpu_set(cpuid, RELOC(cpu_online_map));
 			cpu_set(cpuid, RELOC(cpu_present_at_boot));
+			cpu_set(cpuid, RELOC(cpu_present_map));
 		}
 #endif
 next:
@@ -1072,6 +1075,7 @@ next:
 			if (_naca->smt_state) {
 				cpu_set(cpuid, RELOC(cpu_available_map));
 				cpu_set(cpuid, RELOC(cpu_present_at_boot));
+				cpu_set(cpuid, RELOC(cpu_present_map));
 				prom_printf("available\n");
 			} else {
 				prom_printf("not available\n");
@@ -1103,6 +1107,7 @@ next:
 			}
 /* 			cpu_set(i+1, cpu_online_map); */
 			cpu_set(i+1, RELOC(cpu_possible_map));
+			cpu_set(i+1, RELOC(cpu_present_map));
 		}
 		_systemcfg->processorCount *= 2;
 	} else {
diff -puN arch/ppc64/kernel/smp.c~ppc64-add-cpu_present_map arch/ppc64/kernel/smp.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64-add-cpu_present_map	2004-07-30 16:29:29.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c	2004-07-30 16:29:29.000000000 -0500
@@ -130,6 +130,7 @@ static int smp_iSeries_numProcs(void)
 			cpu_set(i, cpu_available_map);
 			cpu_set(i, cpu_possible_map);
 			cpu_set(i, cpu_present_at_boot);
+			cpu_set(i, cpu_present_map);
                         ++np;
                 }
         }
diff -puN arch/ppc64/kernel/setup.c~ppc64-add-cpu_present_map arch/ppc64/kernel/setup.c

_



From nathanl at austin.ibm.com  Sat Jul 31 07:46:04 2004
From: nathanl at austin.ibm.com (nathanl at austin.ibm.com)
Date: Fri, 30 Jul 2004 16:46:04 -0500
Subject: [patch 3/4] Rework secondary SMT thread setup at boot
Message-ID: <200407302145.i6ULjNc0063654@austin.ibm.com>


Our (ab)use of cpu_possible_map in setup_system to start secondary SMT
threads bothers me.  Mark such threads in cpu_possible_map during
early boot; let RTAS tell us which present cpus are still offline
later so we can start them.

I'm not totally sure about this one; it might be better to set up
cpu_sibling_map in prom_hold_cpus and use that in setup_system.
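
The idea, reduced to a user-space sketch: walk the present cpus, ask the
platform whether each one is stopped, and start only those. Everything
suffixed _sketch is invented for illustration (including the value used for
"not stopped"), and for simplicity the sketch treats logical and hardware
cpu numbers as identical, which the real code does not.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 8
#define CPU_STOPPED_SKETCH 0		/* matches "0 - stopped" in smp.c */
#define CPU_NOT_STOPPED_SKETCH 2	/* placeholder for any other state */

/* Simulated platform state; zero-initialised, so every cpu starts stopped. */
static int cpu_state_sketch[NR_CPUS_SKETCH];

/* Stand-in for query_cpu_stopped(): report the simulated state. */
static int query_cpu_stopped_sketch(unsigned int pcpu)
{
	assert(pcpu < NR_CPUS_SKETCH);
	return cpu_state_sketch[pcpu];
}

/* Start every present cpu that is still stopped; return how many. */
static int start_stopped_threads_sketch(unsigned long present_mask)
{
	unsigned int cpu;
	int started = 0;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		if (!(present_mask & (1UL << cpu)))
			continue;
		if (query_cpu_stopped_sketch(cpu) == CPU_STOPPED_SKETCH) {
			/* The real code issues the start-cpu RTAS call here. */
			cpu_state_sketch[cpu] = CPU_NOT_STOPPED_SKETCH;
			started++;
		}
	}
	return started;
}
```

Primary threads report a non-stopped state and are skipped, so no separate
"available but not possible" bookkeeping is needed to find the secondaries.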

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>


---


diff -puN arch/ppc64/kernel/setup.c~ppc64-fix-secondary-smt-thread-setup arch/ppc64/kernel/setup.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/setup.c~ppc64-fix-secondary-smt-thread-setup	2004-07-30 16:32:16.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/setup.c	2004-07-30 16:32:16.000000000 -0500
@@ -232,16 +232,17 @@ void setup_system(unsigned long r3, unsi
 		chrp_init(r3, r4, r5, r6, r7);

 #ifdef CONFIG_SMP
-		/* Start secondary threads on SMT systems */
-		for (i = 0; i < NR_CPUS; i++) {
-			if (cpu_available(i) && !cpu_possible(i)) {
+		/* Start secondary threads on SMT systems; primary threads
+		 * are already in the running state.
+		 */
+		for_each_present_cpu(i) {
+			if (query_cpu_stopped
+			    (get_hard_smp_processor_id(i)) == 0) {
 				printk("%16.16x : starting thread\n", i);
 				rtas_call(rtas_token("start-cpu"), 3, 1, &ret,
 					  get_hard_smp_processor_id(i),
 					  (u32)*((unsigned long *)pseries_secondary_smp_init),
 					  i);
-				cpu_set(i, cpu_possible_map);
-				systemcfg->processorCount++;
 			}
 		}
 #endif /* CONFIG_SMP */
diff -puN arch/ppc64/kernel/prom.c~ppc64-fix-secondary-smt-thread-setup arch/ppc64/kernel/prom.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/prom.c~ppc64-fix-secondary-smt-thread-setup	2004-07-30 16:32:16.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/prom.c	2004-07-30 16:32:16.000000000 -0500
@@ -1076,6 +1076,8 @@ next:
 				cpu_set(cpuid, RELOC(cpu_available_map));
 				cpu_set(cpuid, RELOC(cpu_present_at_boot));
 				cpu_set(cpuid, RELOC(cpu_present_map));
+				cpu_set(cpuid, RELOC(cpu_possible_map));
+				_systemcfg->processorCount++;
 				prom_printf("available\n");
 			} else {
 				prom_printf("not available\n");
diff -puN arch/ppc64/kernel/smp.c~ppc64-fix-secondary-smt-thread-setup arch/ppc64/kernel/smp.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64-fix-secondary-smt-thread-setup	2004-07-30 16:32:16.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c	2004-07-30 16:32:16.000000000 -0500
@@ -228,7 +228,6 @@ static void __devinit smp_openpic_setup_
 	do_openpic_setup_cpu();
 }

-#ifdef CONFIG_HOTPLUG_CPU
 /* Get state of physical CPU.
  * Return codes:
  *	0	- The processor is in the RTAS stopped state
@@ -237,7 +236,7 @@ static void __devinit smp_openpic_setup_
  *	-1	- Hardware Error
  *	-2	- Hardware Busy, Try again later.
  */
-static int query_cpu_stopped(unsigned int pcpu)
+int query_cpu_stopped(unsigned int pcpu)
 {
 	int cpu_status;
 	int status, qcss_tok;
@@ -254,6 +253,8 @@ static int query_cpu_stopped(unsigned in
 	return cpu_status;
 }

+#ifdef CONFIG_HOTPLUG_CPU
+
 int __cpu_disable(void)
 {
 	/* FIXME: go put this in a header somewhere */
diff -puN include/asm-ppc64/smp.h~ppc64-fix-secondary-smt-thread-setup include/asm-ppc64/smp.h
--- 2.6.8-rc2-mm1/include/asm-ppc64/smp.h~ppc64-fix-secondary-smt-thread-setup	2004-07-30 16:32:16.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/include/asm-ppc64/smp.h	2004-07-30 16:32:16.000000000 -0500
@@ -73,6 +73,7 @@ void smp_init_pSeries(void);
 extern int __cpu_disable(void);
 extern void __cpu_die(unsigned int cpu);
 extern void cpu_die(void) __attribute__((noreturn));
+extern int query_cpu_stopped(unsigned int pcpu);
 #ifdef CONFIG_SCHED_SMT
 extern cpumask_t cpu_sibling_map[NR_CPUS];
 #endif

_



From nathanl at austin.ibm.com  Sat Jul 31 07:46:11 2004
From: nathanl at austin.ibm.com (nathanl at austin.ibm.com)
Date: Fri, 30 Jul 2004 16:46:11 -0500
Subject: [patch 4/4] Remove unnecessary cpu maps (available, present_at_boot)
Message-ID: <200407302145.i6ULjXc0055970@austin.ibm.com>


With cpu_present_map, we don't need these any longer.

Signed-off-by: Nathan Lynch <nathanl at austin.ibm.com>


---


diff -puN include/asm-ppc64/smp.h~ppc64-remove-unnecessary-cpu-maps include/asm-ppc64/smp.h
--- 2.6.8-rc2-mm1/include/asm-ppc64/smp.h~ppc64-remove-unnecessary-cpu-maps	2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/include/asm-ppc64/smp.h	2004-07-30 16:38:12.000000000 -0500
@@ -36,23 +36,6 @@ extern void smp_message_recv(int, struct
 #define smp_processor_id() (get_paca()->paca_index)
 #define hard_smp_processor_id() (get_paca()->hw_cpu_id)

-/*
- * Retrieve the state of a CPU:
- * online:          CPU is in a normal run state
- * possible:        CPU is a candidate to be made online
- * available:       CPU is candidate for the 'possible' pool
- *                  Used to get SMT threads started at boot time.
- * present_at_boot: CPU was available at boot time.  Used in DLPAR
- *                  code to handle special cases for processor start up.
- */
-extern cpumask_t cpu_present_at_boot;
-extern cpumask_t cpu_online_map;
-extern cpumask_t cpu_possible_map;
-extern cpumask_t cpu_available_map;
-
-#define cpu_present_at_boot(cpu) cpu_isset(cpu, cpu_present_at_boot)
-#define cpu_available(cpu)       cpu_isset(cpu, cpu_available_map)
-
 /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers.
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
diff -puN arch/ppc64/kernel/prom.c~ppc64-remove-unnecessary-cpu-maps arch/ppc64/kernel/prom.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/prom.c~ppc64-remove-unnecessary-cpu-maps	2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/prom.c	2004-07-30 16:38:12.000000000 -0500
@@ -940,9 +940,7 @@ static void __init prom_hold_cpus(unsign
 			lpaca[cpuid].hw_cpu_id = reg;

 #ifdef CONFIG_SMP
-			cpu_set(cpuid, RELOC(cpu_available_map));
 			cpu_set(cpuid, RELOC(cpu_possible_map));
-			cpu_set(cpuid, RELOC(cpu_present_at_boot));
 			cpu_set(cpuid, RELOC(cpu_present_map));
 			if (reg == 0)
 				cpu_set(cpuid, RELOC(cpu_online_map));
@@ -1043,9 +1041,7 @@ static void __init prom_hold_cpus(unsign
 #ifdef CONFIG_SMP
 				/* Set the number of active processors. */
 				_systemcfg->processorCount++;
-				cpu_set(cpuid, RELOC(cpu_available_map));
 				cpu_set(cpuid, RELOC(cpu_possible_map));
-				cpu_set(cpuid, RELOC(cpu_present_at_boot));
 				cpu_set(cpuid, RELOC(cpu_present_map));
 #endif
 			} else {
@@ -1055,10 +1051,8 @@ static void __init prom_hold_cpus(unsign
 #ifdef CONFIG_SMP
 		else {
 			prom_printf("%x : booting  cpu %s\n", cpuid, path);
-			cpu_set(cpuid, RELOC(cpu_available_map));
 			cpu_set(cpuid, RELOC(cpu_possible_map));
 			cpu_set(cpuid, RELOC(cpu_online_map));
-			cpu_set(cpuid, RELOC(cpu_present_at_boot));
 			cpu_set(cpuid, RELOC(cpu_present_map));
 		}
 #endif
@@ -1073,8 +1067,6 @@ next:
 			prom_printf("%x : preparing thread ... ",
 				    interrupt_server[i]);
 			if (_naca->smt_state) {
-				cpu_set(cpuid, RELOC(cpu_available_map));
-				cpu_set(cpuid, RELOC(cpu_present_at_boot));
 				cpu_set(cpuid, RELOC(cpu_present_map));
 				cpu_set(cpuid, RELOC(cpu_possible_map));
 				_systemcfg->processorCount++;
diff -puN arch/ppc64/kernel/smp.c~ppc64-remove-unnecessary-cpu-maps arch/ppc64/kernel/smp.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64-remove-unnecessary-cpu-maps	2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c	2004-07-30 16:38:12.000000000 -0500
@@ -62,8 +62,6 @@ unsigned long cache_decay_ticks;

 cpumask_t cpu_possible_map = CPU_MASK_NONE;
 cpumask_t cpu_online_map = CPU_MASK_NONE;
-cpumask_t cpu_available_map = CPU_MASK_NONE;
-cpumask_t cpu_present_at_boot = CPU_MASK_NONE;

 EXPORT_SYMBOL(cpu_online_map);
 EXPORT_SYMBOL(cpu_possible_map);
@@ -127,9 +125,7 @@ static int smp_iSeries_numProcs(void)
 	np = 0;
         for (i=0; i < NR_CPUS; ++i) {
                 if (paca[i].lppaca.xDynProcStatus < 2) {
-			cpu_set(i, cpu_available_map);
 			cpu_set(i, cpu_possible_map);
-			cpu_set(i, cpu_present_at_boot);
 			cpu_set(i, cpu_present_map);
                         ++np;
                 }
@@ -890,7 +886,7 @@ int __devinit __cpu_up(unsigned int cpu)
 	int c;

 	/* At boot, don't bother with non-present cpus -JSCHOPP */
-	if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu))
+	if (system_state == SYSTEM_BOOTING && !cpu_present(cpu))
 		return -ENOENT;

 	paca[cpu].prof_counter = 1;
diff -puN arch/ppc64/kernel/xics.c~ppc64-remove-unnecessary-cpu-maps arch/ppc64/kernel/xics.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/xics.c~ppc64-remove-unnecessary-cpu-maps	2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/xics.c	2004-07-30 16:38:12.000000000 -0500
@@ -548,7 +548,7 @@ nextnode:
 #ifdef CONFIG_SMP
 		for_each_cpu(i) {
 			/* FIXME: Do this dynamically! --RR */
-			if (!cpu_present_at_boot(i))
+			if (!cpu_present(i))
 				continue;
 			xics_per_cpu[i] = __ioremap((ulong)inodes[get_hard_smp_processor_id(i)].addr,
 						    (ulong)inodes[get_hard_smp_processor_id(i)].size,

_



From nathanl at austin.ibm.com  Sat Jul 31 11:31:07 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Fri, 30 Jul 2004 20:31:07 -0500
Subject: [patch 3/4] Rework secondary SMT thread setup at boot
In-Reply-To: <200407302145.i6ULjNc0063654@austin.ibm.com>
References: <200407302145.i6ULjNc0063654@austin.ibm.com>
Message-ID: <1091237398.31118.4.camel@biclops.private.network>


On Fri, 2004-07-30 at 16:46, nathanl at austin.ibm.com wrote:
> Our (ab)use of cpu_possible_map in setup_system to start secondary SMT
> threads bothers me.  Mark such threads in cpu_possible_map during
> early boot; let RTAS tell us which present cpus are still offline
> later so we can start them.
>
> I'm not totally sure about this one, it might be better to set up
> cpu_sibling_map in prom_hold_cpus and use that in setup_system.

To clarify this last comment:  is it safe to assume we have the
query-cpu-stopped-state RTAS call (which is used by query_cpu_stopped())
available on all SMT systems?  I suppose if we're relying on the
platform having start-cpu it's ok to assume it has
query-cpu-stopped-state... or not?

Nathan

