From lxiep at us.ibm.com Thu Jul 1 05:03:32 2004
From: lxiep at us.ibm.com (Linda Xie)
Date: Wed, 30 Jun 2004 14:03:32 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <16610.22790.655073.704981@cargo.ozlabs.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com>
Message-ID: <40E30E84.8090206@us.ibm.com>

Paul,

Paul Mackerras wrote:

>Linda,
>
>By the way, I notice that upstream rpaphp_core.c now has the call to
>eeh_register_disable_func(), although the actual function isn't
>present in arch/ppc64/kernel/eeh.c. I thought we were going to do
>that via userspace. Did Greg change his mind?
>

eeh_register_disable_func needs to be sent upstream by the ppc64 maintainers, if it's not in.

>I think that we have it the wrong way around though. I think that the
>rpaphp code should do something analogous to a request_irq() to ask to
>be informed about slot isolation events, rather than having the EEH
>code call the rpaphp code to disable a slot. In fact I think the
>separation is bogus; the EEH code and the rpaphp code are both part of
>the driver for the RPA PCI subsystem.
>

Linas, can you answer all EEH-related questions?

Thanks,
Linda

>Paul.
>

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From linas at austin.ibm.com Thu Jul 1 05:14:33 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 14:14:33 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <40E30E84.8090206@us.ibm.com>; from lxiep@us.ibm.com on Wed, Jun 30, 2004 at 02:03:32PM -0500
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com>
Message-ID: <20040630141433.V21634@forte.austin.ibm.com>

On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> Paul Mackerras wrote:
> >
> >By the way, I notice that upstream rpaphp_core.c now has the call to
> >eeh_register_disable_func(), although the actual function isn't
> >present in arch/ppc64/kernel/eeh.c.

Paul,

You and Anton are responsible for keeping the arch/ppc64 directories in sync between sles9, ameslab, and akpm. You are, after all, the one true official, designated maintainer ... if the code hasn't been migrated to akpm ... uuh ... what am I missing?

From where I sit, the sles9 code is really the latest, greatest, most tested and debugged arch/ppc64 code that there is. This is the tree that the developers get their code/patches into. Ameslab is distinctly downlevel (for example, sles9-2.6.5-7.79 has an eeh.c that has a number of patches, some up to a month old, that haven't yet appeared in ameslab.).

If the akpm tree doesn't have eeh_register_disable_func() then it's badly out of date, since that function got added many moons ago.

I'm somewhat concerned; the sles9 tree is now closed/closing, so it's a really great time to bring the other trees up-to-date, so that we can continue to have a place to drop patches.

> >I think that we have it the wrong way around though. I think that the

Yes, eeh_register_disable_func() is an ugly hack; it's because rpaphp can be built as a module, and eeh cannot, yet eeh needs to call into rpaphp.
> >rpaphp code should do something analogous to a request_irq() to ask to
> >be informed about slot isolation events, rather than having the EEH
> >code call the rpaphp code to disable a slot.

Other than the module/not-a-module issue, why? I don't see any particular advantage to an async, event-driven thingy as compared to a plain-old subroutine call. Subroutine calls are easy to make, you don't need to invent infrastructure. KISS.

> > In fact I think the
> >separation is bogus; the EEH code and the rpaphp code are both part of
> >the driver for the RPA PCI subsystem.

prolly. But note that the generic hotplug APIs need to be extended to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect event occurred. Last I talked to Greg, he wasn't willing to accept something like that yet, so it's a bit up in the air.

Personally, I'd be happy to wait till more eeh prototype gets written before propounding on architectural changes, (and happier still to find the time to actually do it).

--linas

From greg at kroah.com Thu Jul 1 05:46:34 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 30 Jun 2004 12:46:34 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630141433.V21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com>
Message-ID: <20040630194634.GA22223@kroah.com>

On Wed, Jun 30, 2004 at 02:14:33PM -0500, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> > Paul Mackerras wrote:
> > >
> > >By the way, I notice that upstream rpaphp_core.c now has the call to
> > >eeh_register_disable_func(), although the actual function isn't
> > >present in arch/ppc64/kernel/eeh.c.
>
> Paul,
>
> You and Anton are responsible for keeping the arch/ppc64 directories
> in sync between sles9, ameslab, and akpm. You are, after all, the
> one true official, designated maintainer ... if the code hasn't been
> migrated to akpm ... uuh ... what am I missing?
>
> From where I sit, the sles9 code is really the latest, greatest, most
> tested and debugged arch/ppc64 code that there is. This is the tree
> that the developers get their code/patches into.

And that is the big problem.

Those patches/fixes should go to mainline, not directly to suse. How are they going to get back into mainline? Are you all willing to now send those same patches to the next distro that has a release (like Red Hat?) What about other distros that have full ppc64 releases (debian, gentoo, etc.)

Change your development process to properly go to mainline and there would not be any issues like this.

> I'm somewhat concerned; the sles9 tree is now closed/closing, so
> it's a really great time to bring the other trees up-to-date, so
> that we can continue to have a place to drop patches.

You should all be more concerned that all those suse patches are not seen by anyone else.

> > >I think that we have it the wrong way around though. I think that the
>
> Yes, eeh_register_disable_func() is an ugly hack; it's because rpaphp
> can be built as a module, and eeh cannot, yet eeh needs to call into
> rpaphp.

Which is an ugly hack in and of itself. I only oked it for now.

Actually, since everyone agrees that this isn't the way to go, I'll go remove it :)

> > > In fact I think the
> > >separation is bogus; the EEH code and the rpaphp code are both part of
> > >the driver for the RPA PCI subsystem.
>
> prolly. But note that the generic hotplug APIs need to be extended
> to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect
> event occurred. Last I talked to Greg, he wasn't willing to accept
> something like that yet, so it's a bit up in the air.
I wasn't willing to accept that, as that was the wrong way to do this.

It should be done from userspace with hotplug events like we mentioned. And none of this "what happens about the root device" crud either, I've seen your code in the kernel that checks for this today. bah.

thanks,

greg k-h

From linas at austin.ibm.com Thu Jul 1 06:56:28 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 15:56:28 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630194634.GA22223@kroah.com>; from greg@kroah.com on Wed, Jun 30, 2004 at 12:46:34PM -0700
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com>
Message-ID: <20040630155627.X21634@forte.austin.ibm.com>

On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> On Wed, Jun 30, 2004 at 02:14:33PM -0500, linas at austin.ibm.com wrote:
> > On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> > > Paul Mackerras wrote:
> > > >
> > > >By the way, I notice that upstream rpaphp_core.c now has the call to
> > > >eeh_register_disable_func(), although the actual function isn't
> > > >present in arch/ppc64/kernel/eeh.c.
> >
> > Paul,
> >
> > You and Anton are responsible for keeping the arch/ppc64 directories
> > in sync between sles9, ameslab, and akpm. You are, after all, the
> > one true official, designated maintainer ... if the code hasn't been
> > migrated to akpm ... uuh ... what am I missing?
> >
> > From where I sit, the sles9 code is really the latest, greatest, most
> > tested and debugged arch/ppc64 code that there is. This is the tree
> > that the developers get their code/patches into.
>
> And that is the big problem.
>
> Those patches/fixes should go to mainline, not directly to suse.
> How are they going to get back into mainline?

My understanding is that Paul Mackerras and Anton Blanchard are the designated maintainers of the arch/ppc64 tree. They are responsible for sending the patches upstream, getting them into mainline.

> Which is an ugly hack in and of itself. I only oked it for now.
>
> Actually, since everyone agrees that this isn't the way to go, I'll go
> remove it :)

Ack, I wish you wouldn't, it will break things. What do you suggest as the 'right way' to accomplish this?

> > > > In fact I think the
> > > >separation is bogus; the EEH code and the rpaphp code are both part of
> > > >the driver for the RPA PCI subsystem.
> >
> > prolly. But note that the generic hotplug APIs need to be extended
> > to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect
> > event occurred. Last I talked to Greg, he wasn't willing to accept
> > something like that yet, so it's a bit up in the air.
>
> I wasn't willing to accept that, as that was the wrong way to do this.

Yes, well, a few months ago, Torvalds made me promise that this would be the way that it would be done eventually. You were cc'ed on that chain of notes. Why didn't you take that up with him?

> It should be done from userspace with hotplug events like we mentioned.

I think you're confusing this with a different issue. The issue here is "the device driver thinks that the PCI bus is whacked. The device driver wants to be able to make a call to find out if the PCI bus is whacked." I don't think you want to bounce that up to userspace and back down again.

> And none of this "what happens about the root device" crud either, I've
> seen your code in the kernel that checks for this today. bah.

? Well, if you're implying I wrote that code, I didn't. My goal is to get it to do 'the right thing'. Once I stop getting distracted by other issues.

The 'root device' issue is a real issue. You can't execute /sbin/hotplug if the root fs is not reachable.
If the scsi device driver suspects that the reason it's unable to get a response from the scsi controller is that the PCI bus is down, then the scsi device driver must be given a mechanism for rebooting the PCI bus without having to go to user-space to do it.

--linas

From hollisb at us.ibm.com Thu Jul 1 07:01:01 2004
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Wed, 30 Jun 2004 16:01:01 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630155627.X21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com>
Message-ID: <1088629261.18831.36.camel@localhost>

On Wed, 2004-06-30 at 15:56, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> > Those patches/fixes should go to mainline, not directly to suse. How
> > are they going to get back into mainline?
>
> My understanding is that Paul Mackerras and Anton Blanchard are the
> designated maintainers of the arch/ppc64 tree. They are responsible
> for sending the patches upstream, getting them into mainline.

That's not my understanding at all. Why should Paul and Anton have to understand, agree with, and advocate your code on LKML when you can do it yourself? They have their own code they have to worry about...

--
Hollis Blanchard
IBM Linux Technology Center

From greg at kroah.com Thu Jul 1 07:14:14 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 30 Jun 2004 14:14:14 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630155627.X21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com>
Message-ID: <20040630211414.GB24048@kroah.com>

On Wed, Jun 30, 2004 at 03:56:28PM -0500, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> > On Wed, Jun 30, 2004 at 02:14:33PM -0500, linas at austin.ibm.com wrote:
> > > On Wed, Jun 30, 2004 at 02:03:32PM -0500, Linda Xie wrote:
> > > > Paul Mackerras wrote:
> > > > >
> > > > >By the way, I notice that upstream rpaphp_core.c now has the call to
> > > > >eeh_register_disable_func(), although the actual function isn't
> > > > >present in arch/ppc64/kernel/eeh.c.
> > >
> > > Paul,
> > >
> > > You and Anton are responsible for keeping the arch/ppc64 directories
> > > in sync between sles9, ameslab, and akpm. You are, after all, the
> > > one true official, designated maintainer ... if the code hasn't been
> > > migrated to akpm ... uuh ... what am I missing?
> > >
> > > From where I sit, the sles9 code is really the latest, greatest, most
> > > tested and debugged arch/ppc64 code that there is. This is the tree
> > > that the developers get their code/patches into.
> >
> > And that is the big problem.
> >
> > Those patches/fixes should go to mainline, not directly to suse. How
> > are they going to get back into mainline?
>
> My understanding is that Paul Mackerras and Anton Blanchard are the
> designated maintainers of the arch/ppc64 tree. They are responsible
> for sending the patches upstream, getting them into mainline.
No, you are responsible for sending those patches to them, in public, to get them, and the community, to accept them and then pass them on. It's not up to them to do all of the work in picking pieces out of different trees and forwarding them upward.

> > > > > In fact I think the
> > > > >separation is bogus; the EEH code and the rpaphp code are both part of
> > > > >the driver for the RPA PCI subsystem.
> > >
> > > prolly. But note that the generic hotplug APIs need to be extended
> > > to give device drivers a mechanism to ask RPA PHP / EEH if a disconnect
> > > event occurred. Last I talked to Greg, he wasn't willing to accept
> > > something like that yet, so it's a bit up in the air.
> >
> > I wasn't willing to accept that, as that was the wrong way to do this.
>
> Yes, well, a few months ago, Torvalds made me promise that this would
> be the way that it would be done eventually. You were cc'ed on that
> chain of notes. Why didn't you take that up with him?

Yes, I remember that. When is "eventually" going to happen? next week? 2.7? 2.9?

thanks,

greg k-h

From haveblue at us.ibm.com Thu Jul 1 07:36:51 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Wed, 30 Jun 2004 14:36:51 -0700
Subject: [PATCH] fix off-by-one in mem_init()
Message-ID: <1088631410.5265.676.camel@nighthawk>

lmb_end_of_DRAM() returns the address of the end of RAM, not the starting address of the last page. We've been accessing mem_map[] out of bounds for quite a while. But, it's just a read, so it's probably never caused a real problem.

But, during my port of CONFIG_NONLINEAR to ppc64, I have a check to make sure that all __va() calls are given with valid physical addresses. This code tripped that check.

-- Dave

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ppc64-lt-end_of_DRAM.patch
Type: text/x-patch
Size: 398 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040630/ea8a9e30/attachment.bin

From linas at austin.ibm.com Thu Jul 1 08:45:59 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 17:45:59 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <1088629261.18831.36.camel@localhost>; from hollisb@us.ibm.com on Wed, Jun 30, 2004 at 04:01:01PM -0500
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <1088629261.18831.36.camel@localhost>
Message-ID: <20040630174558.Y21634@forte.austin.ibm.com>

On Wed, Jun 30, 2004 at 04:01:01PM -0500, Hollis Blanchard wrote:
> On Wed, 2004-06-30 at 15:56, linas at austin.ibm.com wrote:
> > On Wed, Jun 30, 2004 at 12:46:34PM -0700, Greg KH wrote:
> > > Those patches/fixes should go to mainline, not directly to suse. How
> > > are they going to get back into mainline?
> >
> > My understanding is that Paul Mackerras and Anton Blanchard are the
> > designated maintainers of the arch/ppc64 tree. They are responsible
> > for sending the patches upstream, getting them into mainline.
>
> That's not my understanding at all. Why should Paul and Anton have to
> understand, agree with, and advocate your code on LKML when you can do
> it yourself? They have their own code they have to worry about...

I was referring to patches to the arch/ppc64 tree, and not patches that go into the generic kernel. The 1 or 2 patches I've had to generic kernel code have gone to LKML. All the other patches I've done go into bugzilla, and through some magical process :) eventually ship as code.

As to more work for Paul and Anton, that's between them and their managers...
These various roles & responsibilities were announced in meetings and emails ... I was just doing the management-directed thing; it's news to me to learn I should now bypass Paul and Anton ... or should I take this as a promotion? :)

--linas

From linas at austin.ibm.com Thu Jul 1 09:03:06 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Wed, 30 Jun 2004 18:03:06 -0500
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630211414.GB24048@kroah.com>; from greg@kroah.com on Wed, Jun 30, 2004 at 02:14:14PM -0700
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com>
Message-ID: <20040630180306.Z21634@forte.austin.ibm.com>

On Wed, Jun 30, 2004 at 02:14:14PM -0700, Greg KH wrote:
> On Wed, Jun 30, 2004 at 03:56:28PM -0500, linas at austin.ibm.com wrote:
> >
> > My understanding is that Paul Mackerras and Anton Blanchard are the
> > designated maintainers of the arch/ppc64 tree. They are responsible
> > for sending the patches upstream, getting them into mainline.
>
> No, you are responsible for sending those patches to them, in public, to
> get them, and the community, to accept them and then pass them on. It's
> not up to them to do all of the work in picking pieces out of different
> trees and forwarding them upward.

Look, take this up with the management. I'm stating the facts. There's a half-dozen or dozen or so developers here generating patches for the ppc64 tree, and *everyone* (with Hollis as a clear exception?) has been operating the same way. The patches go into the sles9 tree, the ppc64 maintainers have the responsibility of accepting or refusing the patches and of syncing the trees.

> Yes, I remember that.
> When is "eventually" going to happen? next week?

Personally speaking? no, not next week. I've got a backlog of bugs to clear out first, and when that queue shrinks to zero, then I'll get a chance to do new development. I was hoping to start six months ago; but it's been emergency bug-fix season, which seems to be winding down.

--linas

From greg at kroah.com Thu Jul 1 09:18:34 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 30 Jun 2004 16:18:34 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630180306.Z21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com> <20040630180306.Z21634@forte.austin.ibm.com>
Message-ID: <20040630231834.GE28815@kroah.com>

On Wed, Jun 30, 2004 at 06:03:06PM -0500, linas at austin.ibm.com wrote:
> On Wed, Jun 30, 2004 at 02:14:14PM -0700, Greg KH wrote:
> > On Wed, Jun 30, 2004 at 03:56:28PM -0500, linas at austin.ibm.com wrote:
> > >
> > > My understanding is that Paul Mackerras and Anton Blanchard are the
> > > designated maintainers of the arch/ppc64 tree. They are responsible
> > > for sending the patches upstream, getting them into mainline.
> >
> > No, you are responsible for sending those patches to them, in public, to
> > get them, and the community, to accept them and then pass them on. It's
> > not up to them to do all of the work in picking pieces out of different
> > trees and forwarding them upward.
>
> Look, take this up with the management.

Whose management? I'm not talking company specific issues here. I'm talking about the proper open source kernel development model.
As you are playing in the sandbox, you need to follow the same rules as all other developers, otherwise the other members of the sandbox get mad and start ignoring your nifty sand castles you want them to accept. {ok, so the analogy broke down at the end there, but I hope my point got across...}

> I'm stating the facts.
> There's a half-dozen or dozen or so developers here generating
> patches for the ppc64 tree, and *everyone* (with Hollis as a
> clear exception?) has been operating the same way.

That's because Hollis properly understands the Linux kernel development process :)

> The patches go into the sles9 tree, the ppc64 maintainers have the
> responsibility of accepting or refusing the patches and of syncing the
> trees.

Huh? If the ppc64 maintainers never know about them, how can that happen? If the patches are never posted to a _public_ mailing list, how can you be sure that the patch is even the correct thing to do?

If you add a usb specific patch to the suse tree, how is the usb maintainer going to find out about it? ppc64 should be no different than any other kernel subsystem.

> > Yes, I remember that. When is "eventually" going to happen? next week?
>
> Personally speaking? no, not next week. I've got a backlog of bugs
> to clear out first, and when that queue shrinks to zero, then I'll
> get a chance to do new development. I was hoping to start
> six months ago; but it's been emergency bug-fix season, which seems
> to be winding down.

Ah, but wait for it to start up again with the next distro testing cycle that doesn't have all of those suse bugs in it...

But that's not relevant here. I want to know when this hack will be removed, and done properly, in an arch-independent manner. If it's not on the shortlist of things to do, I can't live with it, and will have to remove it, because that means it will never get fixed :(

thanks,

greg k-h

From paulus at samba.org Thu Jul 1 09:41:09 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 09:41:09 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630141433.V21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com>
Message-ID: <16611.20373.87299.105401@cargo.ozlabs.ibm.com>

Linas,

> You and Anton are responsible for keeping the arch/ppc64 directories
> in sync between sles9, ameslab, and akpm. You are, after all, the
> one true official, designated maintainer ... if the code hasn't been

Yes, ultimately. However, that doesn't mean that I am the subject matter expert for all parts of the arch/ppc64 code. Just as Linus isn't the expert for all parts of the entire kernel tree.

In the areas where I am not the expert I need and expect the expert(s) for those areas to be sending me patches with explanations that I can forward upstream. It's not good enough for the expert to just put the latest and greatest code into ameslab. If I'm not the expert, I don't know, without going digging into the revision history, what the rationale for the changes is. I also don't know whether what is in there is just what we want or if it is something that we just want a few developers to try.

Thus, I need the experts in areas such as RAS and EEH to be sending me patches suitable for forwarding to Andrew Morton, complete with rationale and explanation. (Which you have been doing - thanks.)

As for the SLES9 tree, we (Anton, Jeremy Kerr and I) have been spending some effort on identifying which changes need to go upstream, and in fact most of them are upstream already.
However, in areas such as EEH and RAS it can take us considerable effort to work out why changes are being made and whether they should go upstream, when the expert in the area could just take one look and say very quickly yes/no and why.

> From where I sit, the sles9 code is really the latest, greatest, most
> tested and debugged arch/ppc64 code that there is. This is the tree
> that the developers get their code/patches into. Ameslab is distinctly
> downlevel (for example, sles9-2.6.5-7.79 has an eeh.c that has a number
> of patches, some up to a month old, that haven't yet appeared in ameslab.).

Well... why not? Why haven't the people who have been debugging EEH problems in sles9 been at least cc'ing the patches to me?

> If the akpm tree doesn't have eeh_register_disable_func() then it's
> badly out of date, since that function got added many moons ago.

I never sent that upstream because (a) the EEH developer(s) never sent me a proposed patch to go upstream and (b) as I understood it, Greg KH had vetoed that approach.

> Yes, eeh_register_disable_func() is an ugly hack; it's because rpaphp
> can be built as a module, and eeh cannot, yet eeh needs to call into
> rpaphp.

I think we should have a notifier list for EEH events and a way for code to request to be added to that list. The rpaphp code can then be notified about EEH events. What it does then is up to it. Having the EEH code decide that a slot removal is the right thing seems bogus. That should be up to the rpaphp code and/or userspace.

Paul.

From paulus at samba.org Thu Jul 1 09:47:44 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 09:47:44 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630174558.Y21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <1088629261.18831.36.camel@localhost> <20040630174558.Y21634@forte.austin.ibm.com>
Message-ID: <16611.20768.91355.411754@cargo.ozlabs.ibm.com>

Linas,

> I was referring to patches to the arch/ppc64 tree, and not patches that
> go into the generic kernel. The 1 or 2 patches I've had to generic
> kernel code have gone to LKML. All the other patches I've done
> go into bugzilla, and through some magical process :) eventually ship
> as code.

To the extent that you are the EEH expert (which you seem to be, let me know if you don't want to be), I rely on you to send me patches with explanations. Otherwise I have to be the EEH expert myself.

> As to more work for Paul and Anton, that's between them and their
> managers... These various roles & responsibilities were announced
> in meetings and emails ... I was just doing the management-directed
> thing; it's news to me to learn I should now bypass Paul and Anton ...
> or should I take this as a promotion? :)

No, you shouldn't bypass me/Anton, but you shouldn't expect us to do all the work either. :)

Paul.

From paulus at samba.org Thu Jul 1 09:50:26 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 09:50:26 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630180306.Z21634@forte.austin.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com> <20040630180306.Z21634@forte.austin.ibm.com>
Message-ID: <16611.20930.229141.507440@cargo.ozlabs.ibm.com>

linas at austin.ibm.com writes:

> Look, take this up with the management. I'm stating the facts.
> There's a half-dozen or dozen or so developers here generating
> patches for the ppc64 tree, and *everyone* (with Hollis as a
> clear exception?) has been operating the same way. The patches
> go into the sles9 tree, the ppc64 maintainers have the responsibility
> of accepting or refusing the patches and of syncing the trees.

OK, so I haven't been speaking up loudly and clearly and often enough. Hollis has got it right. We'll have to do better in the next bug frenzy.

Paul.

From paulus at samba.org Thu Jul 1 10:02:25 2004
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 1 Jul 2004 10:02:25 +1000
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <20040630231834.GE28815@kroah.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com> <20040630180306.Z21634@forte.austin.ibm.com> <20040630231834.GE28815@kroah.com>
Message-ID: <16611.21649.436501.278036@cargo.ozlabs.ibm.com>

Greg KH writes:

> But that's not relevant here. I want to know when this hack will be
> removed, and done properly, in an arch-independent manner. If it's not
> on the shortlist of things to do, I can't live with it, and will have to
> remove it, because that means it will never get fixed :(

What I propose is that the EEH code export a notifier list for code that wants to know about EEH events, and do a notifier_call_chain when events such as slot isolation occur. The rpaphp code can then register itself. When a slot isolation event occurs, the rpaphp code can then do whatever is appropriate. This seems to me to be as fundamental as a device driver being able to do request_irq(). The rpaphp code is part of the device driver for the pci subsystem and it needs to be notified when its hardware changes state.

Paul.
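[Editorial sketch] The notifier-list idea proposed above can be modelled in a few lines of plain userspace C. This is only a sketch under stated assumptions, not kernel code and not a patch from this thread: the type struct eeh_notifier, the functions eeh_notifier_register(), eeh_notifier_unregister() and eeh_notify(), the constant EEH_EVENT_SLOT_ISOLATED and the rpaphp_* names are all hypothetical. A real implementation would presumably use the kernel's own notifier facility from <linux/notifier.h> rather than a hand-rolled list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event code; the thread only names "slot isolation". */
#define EEH_EVENT_SLOT_ISOLATED 1UL

/* One subscriber on the list (a userspace stand-in, not the kernel type). */
struct eeh_notifier {
    int (*call)(struct eeh_notifier *self, unsigned long event, void *data);
    struct eeh_notifier *next;
};

/* Head of the list owned by the (model) EEH code. */
struct eeh_notifier *eeh_chain = NULL;

/* Subscribe: push the caller's block onto the front of the list. */
void eeh_notifier_register(struct eeh_notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = eeh_chain;
    eeh_chain = nb;
}

/* Unsubscribe: unlink the block; returns 0 on success, -1 if absent. */
int eeh_notifier_unregister(struct eeh_notifier *nb)
{
    struct eeh_notifier **p;

    for (p = &eeh_chain; *p != NULL; p = &(*p)->next) {
        if (*p == nb) {
            *p = nb->next;
            nb->next = NULL;
            return 0;
        }
    }
    return -1;
}

/* Publish an event to every subscriber; returns how many were called. */
int eeh_notify(unsigned long event, void *data)
{
    struct eeh_notifier *nb;
    int called = 0;

    for (nb = eeh_chain; nb != NULL; nb = nb->next) {
        nb->call(nb, event, data);
        called++;
    }
    return called;
}

/* Model rpaphp side: it decides for itself what an isolation event means. */
int rpaphp_slots_disabled = 0;

int rpaphp_eeh_event(struct eeh_notifier *self, unsigned long event, void *data)
{
    (void)self;
    (void)data;
    if (event == EEH_EVENT_SLOT_ISOLATED)
        rpaphp_slots_disabled++;
    return 0;
}

struct eeh_notifier rpaphp_nb = { rpaphp_eeh_event, NULL };
```

In this model the publishing side holds only a list head and never names an rpaphp symbol, which is the property that matters when rpaphp is a module and eeh is built in: the module would subscribe when it loads and unsubscribe when it unloads, and the policy for a slot isolation event stays on the rpaphp side.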
See http://lists.linuxppc.org/ From greg at kroah.com Thu Jul 1 10:08:51 2004 From: greg at kroah.com (Greg KH) Date: Wed, 30 Jun 2004 17:08:51 -0700 Subject: [PATCH] rpaphp broken in ameslab In-Reply-To: <16611.21649.436501.278036@cargo.ozlabs.ibm.com> References: <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <20040630194634.GA22223@kroah.com> <20040630155627.X21634@forte.austin.ibm.com> <20040630211414.GB24048@kroah.com> <20040630180306.Z21634@forte.austin.ibm.com> <20040630231834.GE28815@kroah.com> <16611.21649.436501.278036@cargo.ozlabs.ibm.com> Message-ID: <20040701000851.GC30029@kroah.com> On Thu, Jul 01, 2004 at 10:02:25AM +1000, Paul Mackerras wrote: > Greg KH writes: > > > But that's not relevant here. I want to know when this hack will be > > removed, and done properly, in a arch-independent manner. If it's not > > on the shortlist of things to do, I can't live with it, and will have to > > remove it, because that means it will never get fixed :( > > What I propose is that the EEH code export a notifier list for code > that wants to know about EEH events, and do a notifier_call_chain when > events such as slot isolation occur. The rpaphp code can then > register itself. When a slot isolation event occurs, the rpaphp code > can then do whatever is appropriate. This seems to me to be as > fundamental as a device driver being able to do request_irq(). The > rpaphp code is part of the device driver for the pci subsystem and it > needs to be notified when its hardware changes state. That's acceptable to me. thanks, greg k-h ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Jul 1 11:20:32 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Wed, 30 Jun 2004 20:20:32 -0500 Subject: [PATCH] rpaphp broken in ameslab In-Reply-To: <16611.20373.87299.105401@cargo.ozlabs.ibm.com>; from paulus@samba.org on Thu, Jul 01, 2004 at 09:41:09AM +1000 References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> Message-ID: <20040630202032.A21634@forte.austin.ibm.com> On Thu, Jul 01, 2004 at 09:41:09AM +1000, Paul Mackerras wrote: > > patches suitable for forwarding to Andrew Morton, complete with > rationale and explanation. (Which you have been doing - thanks.) OK, well, except that, until yesterday, I haven't been :) There are a few months of accumulated patches lying around. They're mostly trite. > As for the SLES9 tree, we (Anton, Jeremy Kerr and I) have been > spending some effort on identifying which changes need to go > upstream, and in fact most of them are upstream already. However, in > areas such as EEH and RAS it can take us considerable effort to work > out why changes are being made and whether they should go upstream, > when the expert in the area could just take one look and say very > quickly yes/no and why. OK. The short answer for eeh.c is 'all of it'. > Well... why not? Why haven't the people who have been debugging EEH > problems in sles9 been at least cc'ing the patches to me? I dunno... I thought this was standard operating procedure ... one can get cc'ed on all bugzilla ppc64 activity automatically. Whenever a bug gets marked 'fixed-awaiting-test' one can grab the patch and go with it. That's what SUSE does. I do have one big concern with tracking patches by email vs. tracking patches in bugzilla, and that is the problem of closure.
I sent four patches yesterday, heard responses for two of them. What about the other two? Were they applied? Rejected? Still in the queue? Accidentally ignored & forgotten? How will I know? In bugzilla, there's a very clear idea of what the status is, and there's a 'paper trail' to go with it. It's got built-in nagging ... I can say things like 'please apply patch 7538, it's been ready for months', which works out a lot better than 'hey remember that email I sent you months ago, did you ever get around to it?' The problem with un-adorned email is you never know if a thread came to a conclusion or not. But bugzilla has a built-in status tracker.. you know when the thread is dead. I like to think of bugzilla as 'plain-email with extra status bits' -- it's auto-threaded, everyone can review a thread at any time, every post to every thread is there, even for late-comers who might not have been 'subscribed' at the beginning. And you can perform complex queries across threads and topics, which is something mutt doesn't do, and I don't think pine, elm, evolution or any of the other emailers do either. What would really be slick is if there was a button on bugzilla, that, when clicked, did a 'bk push'. > I think we should have a notifier list for EEH events and a way for > code to request to be added to that list. The rpaphp code can then > be notified about EEH events. OK, we have that code already; presently, it's not compiled with rpaphp_pci.c, it's compiled with eeh.c. I can cut and paste it into drivers/pci/hotplug/rpaphp_pci.c and/or create a new file drivers/pci/hotplug/rpaphp_eeh.c. That eliminates the need for the callback. --linas ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From paulus at samba.org Thu Jul 1 12:17:11 2004 From: paulus at samba.org (Paul Mackerras) Date: Thu, 1 Jul 2004 12:17:11 +1000 Subject: [PATCH] rpaphp broken in ameslab In-Reply-To: <20040630202032.A21634@forte.austin.ibm.com> References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> Message-ID: <16611.29735.778269.316356@cargo.ozlabs.ibm.com> linas at austin.ibm.com writes: > OK. The short answer for eeh.c is 'all of it'. OK, that doesn't give me the explanation that would need to go with the patch when I send it to Andrew. I want to see a notifier list exported by eeh.c as I proposed in a previous email before that goes upstream. > I do have one big concern with tracking patches by email vs. > tracking patches in bugzilla, and that is the problem of closure. > I sent four patches yesterday, heard responses for two of them. > What about the other two? Were they applied? Rejected? Still > in the queue? Accidenetally ignored & forgotten? How will > I know? In bugzilla, theres a very clear idea of what the I thought I replied to all 4. Let me check... I have replied to the emails with the subjects: [PATCH] [2.6] PPC64: log firmware errors during boot. [PATCH] PPC64: lockfix for rtas error log [PATCH] PPC64: (resend) Janitor signature of rtas_call() routine [PATCH] PPC64: Janitor log_rtas_error() call arguments (Yes, the last one I didn't directly reply to, but I rediffed the patch and sent it to Andrew with a cc to you.) The trouble with bugzilla is that it doesn't give me the final patch with explanation, ready to go upstream. > status is, and there's a 'paper trail' to go with it. > Its got built-in nagging ... 
I can say things like 'please > apply patch 7538, its been ready for months', which works out > a lot better than 'hey remember that email I send you months > ago, did you ever get around to it?' I would much rather you sent me the patch with explanation and said "please send this upstream". If I send it upstream I will cc you. I will either do that or send you a reply telling you what problems I see with the patch. If I don't, feel free to re-send at 2-day intervals. :) > Problem with un-adorned email is you never know if a thread > came to a conclusion or not. But bugzilla has a built-in > status tracker.. you know when the thread is dead. You do know when the thread has come to a conclusion: it's when the patch is in Linus' tree. Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Jul 1 19:37:13 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 1 Jul 2004 19:37:13 +1000 Subject: RFC: Dynamic segment table allocation In-Reply-To: <20040630065105.GA19054@zax> References: <20040630065105.GA19054@zax> Message-ID: <20040701093713.GJ7523@krispykreme> Hi, > Can anyone see a problem with the patch below? If not, I plan to push > it to Andrew in the next few days. Boots on a 4-way Power3 (RS/6000 > 270) and an RS64 iSeries lpar. > > PPC64 machines before Power4 need a segment table page allocated for > each CPU. Currently these are allocated statically in a big array in > head.S for all CPUs. However, except for the boot CPU, there are no > particular constraints on the segment table's location, and the > secondary CPUs don't come up until quite late when get_free_pages() is > operational. > > This patch dynamically allocates segment table pages as the secondary > CPUs come up. This reduces the kernel image size by 192k... I like the idea of removing this bloat.
However I think all segment tables have to be in the first segment, otherwise the bolted segment miss handler could take a miss when storing to the segment table. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Fri Jul 2 04:14:30 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Thu, 1 Jul 2004 13:14:30 -0500 Subject: [PATCH] rpaphp broken in ameslab In-Reply-To: <16611.29735.778269.316356@cargo.ozlabs.ibm.com>; from paulus@samba.org on Thu, Jul 01, 2004 at 12:17:11PM +1000 References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> <16611.29735.778269.316356@cargo.ozlabs.ibm.com> Message-ID: <20040701131430.C21634@forte.austin.ibm.com> On Thu, Jul 01, 2004 at 12:17:11PM +1000, Paul Mackerras wrote: > linas at austin.ibm.com writes: > > > OK. The short answer for eeh.c is 'all of it'. > > OK, that doesn't give me the explanation that would need to go with > the patch when I send it to Andrew. The detailed explanations are in bugzilla. I'll crawl through the archives and pull these; but this makes me unhappy :( I personally would be happier with a process that requires less make-work; these patches should have been picked up at the time they were authored, not post-factum. > I want to see a notifier list exported by eeh.c as I proposed in a > previous email before that goes upstream. Its currently implemented as a work queue. Is that acceptable? To keep gregkh happy, I'll move the work-queue to drivers/pci/hotplug/rpaphp_eeh.c, will this work? > I thought I replied to all 4. Let me check... OK, right. I thought a few had fallen through the cracks. Again, I question the efficiency of this process, it took longer than it should to figure out what's up. 
> The trouble with bugzilla is that it doesn't give me the final patch > with explanation, ready to go upstream. That's not true. Of course it does. > > status is, and there's a 'paper trail' to go with it. > > Its got built-in nagging ... I can say things like 'please > > apply patch 7538, its been ready for months', which works out > > a lot better than 'hey remember that email I send you months > > ago, did you ever get around to it?' > > I would much rather you sent me the patch with explanation and said > "please send this upstream". Yes well, I and others were operating under the assumption that you were actually monitoring bugzilla, and pulling the patches & explanations from there, merging and sending them upstream. Clearly, this did not occur. The process broke down. How will we do better next time? > > Problem with un-adorned email is you never know if a thread > > came to a conclusion or not. But bugzilla has a built-in > > status tracker.. you know when the thread is dead. > > You do know when the thread has come to a conclusion, it's when the > patch is in Linus' tree. Its an inefficient, error-prone process. I can generate more patches a day than I am able to track by email. Ergo, patch-tracking is a bottleneck. --linas ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From linas at austin.ibm.com Fri Jul 2 06:04:34 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Thu, 1 Jul 2004 15:04:34 -0500 Subject: [PATCH] PPC64: lockfix for rtas error log In-Reply-To: <20040701113116.72a25ed0.moilanen@austin.ibm.com>; from moilanen@austin.ibm.com on Thu, Jul 01, 2004 at 11:31:16AM -0500 References: <20040629175007.P21634@forte.austin.ibm.com> <16610.41869.78636.349800@cargo.ozlabs.ibm.com> <20040630125027.T21634@forte.austin.ibm.com> <16611.18350.530425.178652@cargo.ozlabs.ibm.com> <20040701113116.72a25ed0.moilanen@austin.ibm.com> Message-ID: <20040701150434.G21634@forte.austin.ibm.com> > > As for the RTAS messages being printk'd, the only possible A lot of those messages are 'garbage' messages, I'm trying to clear some (maybe most?) of these up. For example, a lot of them are the result of EEH not being turned on for PHB's and EAD's (i.e. not being turned on for empty unoccupied pci slots). Once I get that patch in (which pre-req's all these other patches), I think most of the messages will be gone, which I know will make lots of people happy. --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Fri Jul 2 07:15:12 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Thu, 01 Jul 2004 16:15:12 -0500 Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)] In-Reply-To: <20040701153117.H21634@forte.austin.ibm.com> References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com> Message-ID: <40E47EE0.306@austin.ibm.com> linas at austin.ibm.com wrote: > + /* Log the error in the unlikely case that there was one. 
*/ > + if (unlikely(logit)) { > + buff_copy = kmalloc (RTAS_ERROR_LOG_MAX, GFP_ATOMIC); > + if (buff_copy) { > + memcpy (buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX); > + } > + } This isn't performance critical code, do you really need to hard code unlikely here? The compiler and processor do pretty good prediction on their own where needed. (also, as Greg said, you have extra whitespace in some places, including after kmallocs and kfrees) -Olof ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Fri Jul 2 07:20:50 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Thu, 1 Jul 2004 16:20:50 -0500 Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)] In-Reply-To: <40E47EE0.306@austin.ibm.com>; from olof@austin.ibm.com on Thu, Jul 01, 2004 at 04:15:12PM -0500 References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com> <40E47EE0.306@austin.ibm.com> Message-ID: <20040701162049.K21634@forte.austin.ibm.com> On Thu, Jul 01, 2004 at 04:15:12PM -0500, Olof Johansson wrote: > linas at austin.ibm.com wrote: > > > + /* Log the error in the unlikely case that there was one. */ > > + if (unlikely(logit)) { > > + buff_copy = kmalloc (RTAS_ERROR_LOG_MAX, GFP_ATOMIC); > > + if (buff_copy) { > > + memcpy (buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX); > > + } > > + } > > This isn't performance critical code, do you really need to hard code > unlikely here? I figured it's more of a hint to the human reader than it is to the compiler. > (also, as Greg said, you have extra whitespace in some places, including > after kmallocs and kfrees) Is this really important enough to gen a new patch? If that's what it takes to get the patch accepted, I'll do it ... --linas ** Sent via the linuxppc64-dev mail list.
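[Editorial sketch] For readers unfamiliar with the unlikely() annotation debated above, here is a userspace approximation of the quoted fragment. It is a sketch under stated assumptions, not the patch itself: unlikely() is defined here with the GCC/Clang builtin __builtin_expect, malloc() stands in for kmalloc(..., GFP_ATOMIC), the value chosen for RTAS_ERROR_LOG_MAX is a placeholder, and copy_error_log() is a hypothetical wrapper. The annotation affects only how the compiler lays out the branch; the behaviour is the same with or without it.

```c
/* Userspace approximation of the quoted hunk; not the kernel code.
 * copy_error_log() and the RTAS_ERROR_LOG_MAX value are assumptions. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#if defined(__GNUC__)
/* Hint that the condition is expected to be false most of the time. */
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

#define RTAS_ERROR_LOG_MAX 1024	/* placeholder size for this sketch */

/* Returns a heap copy of the error buffer when logit is set, else NULL.
 * The caller owns the returned buffer and must free() it. */
char *copy_error_log(const char *rtas_err_buf, int logit)
{
	char *buff_copy = NULL;

	assert(rtas_err_buf != NULL);

	/* Log the error in the unlikely case that there was one. */
	if (unlikely(logit)) {
		buff_copy = malloc(RTAS_ERROR_LOG_MAX);
		if (buff_copy)
			memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
	}
	return buff_copy;
}
```

Removing the unlikely() wrapper leaves the function's results unchanged, which is consistent with the remark in the thread that it serves mostly as a hint to the human reader.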
See http://lists.linuxppc.org/ From greg at kroah.com Fri Jul 2 09:29:12 2004 From: greg at kroah.com (Greg KH) Date: Thu, 1 Jul 2004 16:29:12 -0700 Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)] In-Reply-To: <20040701162049.K21634@forte.austin.ibm.com> References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com> <40E47EE0.306@austin.ibm.com> <20040701162049.K21634@forte.austin.ibm.com> Message-ID: <20040701232912.GA27199@kroah.com> On Thu, Jul 01, 2004 at 04:20:50PM -0500, linas at austin.ibm.com wrote: > > On Thu, Jul 01, 2004 at 04:15:12PM -0500, Olof Johansson wrote: > > linas at austin.ibm.com wrote: > > > > > + /* Log the error in the unlikely case that there was one. */ > > > + if (unlikely(logit)) { > > > + buff_copy = kmalloc (RTAS_ERROR_LOG_MAX, GFP_ATOMIC); > > > + if (buff_copy) { > > > + memcpy (buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX); > > > + } > > > + } > > > > This isn't performance critical code, do you really need to hard code > > unlikely here? > > I figured its more of a hint to the human reader than it is to the compiler. > > > (also, as Greg said, you have extra whitespace in some places, including > > after kmallocs and kfrees) > > Is this really important enough to gen a new patch? If that's what it > takes to get the patch accepted, I'll do it ... Please do. thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Fri Jul 2 12:02:16 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 2 Jul 2004 12:02:16 +1000 Subject: RFC: Dynamic segment table allocation In-Reply-To: <20040701093713.GJ7523@krispykreme> References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme> Message-ID: <20040702020216.GD5937@zax> On Thu, Jul 01, 2004 at 07:37:13PM +1000, Anton Blanchard wrote: > > Hi, > > > Can anyone see a problem with the patch below? 
If not, I plan to push > > it to Andrew in the next few days. Boots on a 4-way Power3 (RS/6000 > > 270) and an RS64 iSeries lpar. > > > > PPC64 machines before Power4 need a segment table page allocated for > > each CPU. Currently these are allocated statically in a big array in > > head.S for all CPUs. However, except for the boot CPU, there are no > > particular constraints on the segment table's location, and the > > secondary CPUs don't come up until quite late when get_free_pages() is > > operational. > > > > This patch dynamically allocates segment table pages as the secondary > > CPUs come up. This reduces the kernel image size by 192k... > > I like the idea of removing this bloat. However I think all segment tables > have to be in the first segment, otherwise the bolted segment miss handler > could take a miss when storing to the segment table. Ah... good point. I can see two ways around that: allocate the tables from bootmem sufficiently low, or bolt into each table an STE for itself. Paulus says he prefers the former option, so I'll look at doing that. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Jul 2 12:45:51 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 2 Jul 2004 12:45:51 +1000 Subject: RFC: Dynamic segment table allocation In-Reply-To: <20040702020216.GD5937@zax> References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme> <20040702020216.GD5937@zax> Message-ID: <20040702024551.GK7523@krispykreme> > Ah... good point. > > I can see two ways around that: allocate the tables from bootmem > sufficiently low, or bolt into each table an STE for itself. Paulus > says he prefers the former option, so I'll look at doing that. 
irqstack_early_init is doing a similar thing for irq stacks, it would be worth consolidating the dynamic segment table allocation code with this. Also it would be worth working with Nathan Lynch to calculate smp_possible_cpus earlier at boot so that irqstack_early_init doesnt allocate NR_CPUS worth of stacks all the time. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Fri Jul 2 13:58:49 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 2 Jul 2004 13:58:49 +1000 Subject: RFC: Dynamic segment table allocation In-Reply-To: <20040702024551.GK7523@krispykreme> References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme> <20040702020216.GD5937@zax> <20040702024551.GK7523@krispykreme> Message-ID: <20040702035849.GF5937@zax> On Fri, Jul 02, 2004 at 12:45:51PM +1000, Anton Blanchard wrote: > > > Ah... good point. > > > > I can see two ways around that: allocate the tables from bootmem > > sufficiently low, or bolt into each table an STE for itself. Paulus > > says he prefers the former option, so I'll look at doing that. > > irqstack_early_init is doing a similar thing for irq stacks, it would be > worth consolidating the dynamic segment table allocation code with this. Yes, noticed thats. > Also it would be worth working with Nathan Lynch to calculate > smp_possible_cpus earlier at boot so that irqstack_early_init doesnt > allocate NR_CPUS worth of stacks all the time. Erm.. as far as I can tell smp_possible_cpus is already early enough for the call to irqstack_early_init(). The patch below uses it for the stabs allocation immediately after the call to irqstack_early_init()... PPC64 machines before Power4 need a segment table page allocated for each CPU. Currently these are allocated statically in a big array in head.S for all CPUs. 
The segment tables need to be in the first segment (so do_stab_bolted doesn't take a recursive fault on the stab itself), but other than that there are no constraints which require the stabs for the secondary CPUs to be statically allocated. This patch allocates segment tables dynamically during boot, using lmb_alloc() to ensure they are within the first 256M segment. This reduces the kernel image size by 192k... Signed-off-by: David Gibson Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S +++ working-2.6/arch/ppc64/kernel/head.S @@ -2201,11 +2201,6 @@ ioremap_dir: .space 4096 -/* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */ - .globl stab_array -stab_array: - .space 4096 * 48 - /* * This space gets a copy of optional info passed to us by the bootstrap * Used to pass parameters into the kernel like root=/dev/sda1, etc. Index: working-2.6/arch/ppc64/kernel/smp.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/smp.c +++ working-2.6/arch/ppc64/kernel/smp.c @@ -876,19 +876,6 @@ paca[cpu].prof_multiplier = 1; paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock; - if (!(cur_cpu_spec->cpu_features & CPU_FTR_SLB)) { - void *tmp; - - /* maximum of 48 CPUs on machines with a segment table */ - if (cpu >= 48) - BUG(); - - tmp = &stab_array[PAGE_SIZE * cpu]; - memset(tmp, 0, PAGE_SIZE); - paca[cpu].stab_addr = (unsigned long)tmp; - paca[cpu].stab_real = virt_to_abs(tmp); - } - /* The information for processor bringup must * be written out to main store before we release * the processor. 
Index: working-2.6/arch/ppc64/kernel/setup.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/setup.c +++ working-2.6/arch/ppc64/kernel/setup.c @@ -76,6 +76,7 @@ extern void pseries_secondary_smp_init(unsigned long); extern int idle_setup(void); extern void vpa_init(int cpu); +void stabs_alloc(void); unsigned long decr_overclock = 1; unsigned long decr_overclock_proc0 = 1; @@ -639,6 +640,7 @@ *cmdline_p = cmd_line; irqstack_early_init(); + stabs_alloc(); /* set up the bootmem stuff with available memory */ do_init_bootmem(); Index: working-2.6/arch/ppc64/kernel/stab.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/stab.c +++ working-2.6/arch/ppc64/kernel/stab.c @@ -19,6 +19,8 @@ #include #include #include +#include +#include static int make_ste(unsigned long stab, unsigned long esid, unsigned long vsid); static void make_slbe(unsigned long esid, unsigned long vsid, int large, @@ -41,6 +43,34 @@ #endif } +/* Allocate segment tables for secondary CPUs. These must all go in + * the first (bolted) segment, so that do_stab_bolted won't get a + * segment miss on the segment table itself. 
*/ +void stabs_alloc(void) +{ + int cpu; + + if (cur_cpu_spec->cpu_features & CPU_FTR_SLB) + return; + + for_each_cpu(cpu) { + unsigned long newstab; + + if (cpu == 0) + continue; /* stab for CPU 0 is statically allocated */ + + newstab = lmb_alloc_base(PAGE_SIZE, PAGE_SIZE, 1< References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> <16611.29735.778269.316356@cargo.ozlabs.ibm.com> <20040701131430.C21634@forte.austin.ibm.com> Message-ID: <16612.63829.67601.300009@cargo.ozlabs.ibm.com> linas at austin.ibm.com writes: > The detailed explanations are in bugzilla. I'll crawl through the > archives and pull these; but this makes me unhappy :( I personally > would be happier with a process that requires less make-work; these > patches should have been picked up at the time they were authored, > not post-factum. I agree, but it is the person who knows the code who should decide (a) whether the patch in bugzilla is what should go upstream and (b) whether the patch has actually fixed the bug, and take the action to push the patch upstream. That is logically the job of whoever knows the code best, and they should do it as soon as the patch is known to have actually fixed the bug. > > I want to see a notifier list exported by eeh.c as I proposed in a > > previous email before that goes upstream. > > Its currently implemented as a work queue. Is that acceptable? > To keep gregkh happy, I'll move the work-queue to > drivers/pci/hotplug/rpaphp_eeh.c, will this work? It's not the work queue that is the problem, it is that the EEH code is taking a decision about what hotplug should do. I am saying that the EEH code should offer to provide notifications to any interested code about slot isolation events. 
The slot isolation event is a fact, the request to do an unplug operation is policy. Let's leave the policy up to the rpaphp driver and/or userspace. Specifically, I think we should leave the workqueue in eeh.c, replace eeh_disable_slot with a notifier list, and move the check for ethernet devices to rpaphp.c. The eeh_register_disable_func call then changes to notifier_chain_register(&eeh_slot_isolation_list, &my_notify_block) or somesuch. > Yes well, I and others were operating under the assumption that you > were actually monitoring bugzilla, and pulling the patches & explanations > from there, merging and sending them upstream. Clearly, this did not > occur. The process broke down. How will we do better next time? I guess we need to have it clearly stated who is the designated person for each area of the code. Within arch/ppc64 and include/asm-ppc64, I would like to have people who take responsibility for legacy iSeries, RAS, DLPAR, EEH, PCI, NUMA, oprofile/perfctr, and Powermac support. Stephen Rothwell has been pretty much looking after the iSeries stuff, and Ben H is clearly the powermac maintainer. So far, Anton seems to be the expert on NUMA and oprofile/perfctr. We have a whole team doing DLPAR work and I look to people like John Rose to say what should go upstream in that area. It's been an informal process so far which has worked in some areas but not in others. > Its an inefficient, error-prone process. I can generate more patches > a day than I am able to track by email. Ergo, patch-tracking is a > bottleneck. The discipline of sending patches with explanations is that it forces you to write a description of the changes that will give someone who doesn't know the history a reasonable idea of the motivation behind them. That turns out to be an important part of keeping the kernel maintainable - the fact that you can look at a year-old change and get a reasonable idea of what the patch author was trying to achieve is invaluable. 
That is far more important than having people churn out patches at the maximum possible rate. One of the big reasons why the ameslab BK tree doesn't work as a source of patches to send upstream is that many people haven't been putting in much in the way of changeset descriptions. Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Fri Jul 2 16:14:21 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 2 Jul 2004 16:14:21 +1000 Subject: RFC: Dynamic segment table allocation In-Reply-To: <20040702035849.GF5937@zax> References: <20040630065105.GA19054@zax> <20040701093713.GJ7523@krispykreme> <20040702020216.GD5937@zax> <20040702024551.GK7523@krispykreme> <20040702035849.GF5937@zax> Message-ID: <20040702061421.GI5937@zax> On Fri, Jul 02, 2004 at 01:58:49PM +1000, David Gibson wrote: > > On Fri, Jul 02, 2004 at 12:45:51PM +1000, Anton Blanchard wrote: > > > > > Ah... good point. > > > > > > I can see two ways around that: allocate the tables from bootmem > > > sufficiently low, or bolt into each table an STE for itself. Paulus > > > says he prefers the former option, so I'll look at doing that. > > > > irqstack_early_init is doing a similar thing for irq stacks, it would be > > worth consolidating the dynamic segment table allocation code with this. > > Yes, noticed thats. > > > Also it would be worth working with Nathan Lynch to calculate > > smp_possible_cpus earlier at boot so that irqstack_early_init doesnt > > allocate NR_CPUS worth of stacks all the time. > > Erm.. as far as I can tell smp_possible_cpus is already early enough > for the call to irqstack_early_init(). The patch below uses it for > the stabs allocation immediately after the call to > irqstack_early_init()... Hrm. For some reason that patch doesn't work on iSeries. Still debugging. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. 
http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From paulus at samba.org Fri Jul 2 21:22:24 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 2 Jul 2004 21:22:24 +1000 Subject: [PATCH] fix ras irq handlers In-Reply-To: <1083860324.3702.8.camel@mudbug.austin.ibm.com> References: <1083860324.3702.8.camel@mudbug.austin.ibm.com> Message-ID: <16613.17776.748934.366072@cargo.ozlabs.ibm.com> Nathan, I just dug out a patch that you sent a while ago that made some changes to ras.c, mainly to cater for the possibility of epow-events and internal-errors having multiple elements. I have reworked it - let me know what you think of the patch below. One of the changes I have made is to check the #interrupt-cells applying to the interrupts property. Without this I think we will incorrectly try to get interrupt 0 or 1, since when #interrupt-cells is 2, the second cell is the edge/level indicator. Paul. diff -urN test25/arch/ppc64/kernel/prom.c ppc64-2.5-pseries/arch/ppc64/kernel/prom.c --- test25/arch/ppc64/kernel/prom.c 2004-06-30 22:00:47.000000000 +1000 +++ ppc64-2.5-pseries/arch/ppc64/kernel/prom.c 2004-07-02 17:06:51.000000000 +1000 @@ -1881,8 +1881,7 @@ * Find out the size of each entry of the interrupts property * for a node. 
*/ -static int __devinit -prom_n_intr_cells(struct device_node *np) +int __devinit prom_n_intr_cells(struct device_node *np) { struct device_node *p; unsigned int *icp; @@ -1896,7 +1895,7 @@ || get_property(p, "interrupt-map", NULL) != NULL) { printk("oops, node %s doesn't have #interrupt-cells\n", p->full_name); - return 1; + return 1; } } #ifdef DEBUG_IRQ diff -urN ppc64-linux-2.5/arch/ppc64/kernel/ras.c ppc64-2.5-pseries/arch/ppc64/kernel/ras.c --- ppc64-linux-2.5/arch/ppc64/kernel/ras.c 2004-04-13 14:04:32.000000000 +1000 +++ ppc64-2.5-pseries/arch/ppc64/kernel/ras.c 2004-07-02 21:07:28.932004888 +1000 @@ -52,6 +52,16 @@ #include #include +static unsigned char log_buf[RTAS_ERROR_LOG_MAX]; +static spinlock_t log_lock = SPIN_LOCK_UNLOCKED; + +static int ras_get_sensor_state_token; +static int ras_check_exception_token; + +#define EPOW_SENSOR_TOKEN 9 +#define EPOW_SENSOR_INDEX 0 +#define RAS_VECTOR_OFFSET 0x500 + static irqreturn_t ras_epow_interrupt(int irq, void *dev_id, struct pt_regs * regs); static irqreturn_t ras_error_interrupt(int irq, void *dev_id, @@ -59,6 +69,35 @@ /* #define DEBUG */ +static void request_ras_irqs(struct device_node *np, char *propname, + irqreturn_t (*handler)(int, void *, struct pt_regs *), + const char *name) +{ + unsigned int *ireg, len, i; + int virq, n_intr; + + ireg = (unsigned int *)get_property(np, propname, &len); + if (ireg == NULL) + return; + n_intr = prom_n_intr_cells(np); + len /= n_intr * sizeof(*ireg); + + for (i = 0; i < len; i++) { + virq = virt_irq_create_mapping(*ireg); + if (virq == NO_IRQ) { + printk(KERN_ERR "Unable to allocate interrupt " + "number for %s\n", np->full_name); + return; + } + if (request_irq(irq_offset_up(virq), handler, 0, name, NULL)) { + printk(KERN_ERR "Unable to request interrupt %d for " + "%s\n", irq_offset_up(virq), np->full_name); + return; + } + ireg += n_intr; + } +} + /* * Initialize handlers for the set of interrupts caused by hardware errors * and power system events. 
@@ -66,52 +105,33 @@ static int __init init_ras_IRQ(void) { struct device_node *np; - unsigned int *ireg, len, i; - int virq; - if ((np = of_find_node_by_path("/event-sources/internal-errors")) && - (ireg = (unsigned int *)get_property(np, "open-pic-interrupt", - &len))) { - for (i=0; i<(len / sizeof(*ireg)); i++) { - virq = virt_irq_create_mapping(*(ireg)); - if (virq == NO_IRQ) { - printk(KERN_ERR "Unable to allocate interrupt " - "number for %s\n", np->full_name); - break; - } - request_irq(irq_offset_up(virq), - ras_error_interrupt, 0, - "RAS_ERROR", NULL); - ireg++; - } + ras_get_sensor_state_token = rtas_token("get-sensor-state"); + ras_check_exception_token = rtas_token("check-exception"); + + /* Internal Errors */ + np = of_find_node_by_path("/event-sources/internal-errors"); + if (np != NULL) { + request_ras_irqs(np, "open-pic-interrupt", ras_error_interrupt, + "RAS_ERROR"); + request_ras_irqs(np, "interrupts", ras_error_interrupt, + "RAS_ERROR"); + of_node_put(np); } - of_node_put(np); - if ((np = of_find_node_by_path("/event-sources/epow-events")) && - (ireg = (unsigned int *)get_property(np, "open-pic-interrupt", - &len))) { - for (i=0; i<(len / sizeof(*ireg)); i++) { - virq = virt_irq_create_mapping(*(ireg)); - if (virq == NO_IRQ) { - printk(KERN_ERR "Unable to allocate interrupt " - " number for %s\n", np->full_name); - break; - } - request_irq(irq_offset_up(virq), - ras_epow_interrupt, 0, - "RAS_EPOW", NULL); - ireg++; - } + np = of_find_node_by_path("/event-sources/epow-events"); + if (np != NULL) { + request_ras_irqs(np, "open-pic-interrupt", ras_epow_interrupt, + "RAS_EPOW"); + request_ras_irqs(np, "interrupts", ras_epow_interrupt, + "RAS_EPOW"); + of_node_put(np); } - of_node_put(np); return 1; } __initcall(init_ras_IRQ); -static struct rtas_error_log log_buf; -static spinlock_t log_lock = SPIN_LOCK_UNLOCKED; - /* * Handle power subsystem events (EPOW). 
* @@ -122,30 +142,35 @@ static irqreturn_t ras_epow_interrupt(int irq, void *dev_id, struct pt_regs * regs) { - struct rtas_error_log log_entry; - unsigned int size = sizeof(log_entry); - long status = 0xdeadbeef; + int status = 0xdeadbeef; + int state = 0; + int virq = irq_offset_down(irq); + int critical; spin_lock(&log_lock); - status = rtas_call(rtas_token("check-exception"), 6, 1, NULL, - 0x500, irq, - RTAS_EPOW_WARNING | RTAS_POWERMGM_EVENTS, - 1, /* Time Critical */ - __pa(&log_buf), size); + status = rtas_call(ras_get_sensor_state_token, 2, 2, &state, + EPOW_SENSOR_TOKEN, EPOW_SENSOR_INDEX); - log_entry = log_buf; - - spin_unlock(&log_lock); + if (state > 3) + critical = 1; /* Time Critical */ + else + critical = 0; - udbg_printf("EPOW <0x%lx 0x%lx>\n", - *((unsigned long *)&log_entry), status); - printk(KERN_WARNING - "EPOW <0x%lx 0x%lx>\n",*((unsigned long *)&log_entry), status); + status = rtas_call(ras_check_exception_token, 6, 1, NULL, + RAS_VECTOR_OFFSET, virt_irq_to_real(virq), + RTAS_EPOW_WARNING | RTAS_POWERMGM_EVENTS, + critical, __pa(&log_buf), RTAS_ERROR_LOG_MAX); + + udbg_printf("EPOW <0x%lx 0x%x 0x%x>\n", + *((unsigned long *)&log_buf), status, state); + printk(KERN_WARNING "EPOW <0x%lx 0x%x 0x%x>\n", + *((unsigned long *)&log_buf), status, state); /* format and print the extended information */ - log_error((char *)&log_entry, ERR_TYPE_RTAS_LOG, 0); - + log_error(log_buf, ERR_TYPE_RTAS_LOG, 0); + + spin_unlock(&log_lock); return IRQ_HANDLED; } @@ -160,37 +185,34 @@ static irqreturn_t ras_error_interrupt(int irq, void *dev_id, struct pt_regs * regs) { - struct rtas_error_log log_entry; - unsigned int size = sizeof(log_entry); - long status = 0xdeadbeef; + struct rtas_error_log *rtas_elog; + int status = 0xdeadbeef; int fatal; spin_lock(&log_lock); - status = rtas_call(rtas_token("check-exception"), 6, 1, NULL, - 0x500, irq, - RTAS_INTERNAL_ERROR, - 1, /* Time Critical */ - __pa(&log_buf), size); - - log_entry = log_buf; + status = 
rtas_call(ras_check_exception_token, 6, 1, NULL, + RAS_VECTOR_OFFSET, + virt_irq_to_real(irq_offset_down(irq)), + RTAS_INTERNAL_ERROR, + 1 /* Time Critical */, + __pa(&log_buf), RTAS_ERROR_LOG_MAX); - spin_unlock(&log_lock); + rtas_elog = (struct rtas_error_log *)log_buf; - if ((status == 0) && (log_entry.severity >= SEVERITY_ERROR_SYNC)) + if ((status == 0) && (rtas_elog->severity >= SEVERITY_ERROR_SYNC)) fatal = 1; else fatal = 0; /* format and print the extended information */ - log_error((char *)&log_entry, ERR_TYPE_RTAS_LOG, fatal); + log_error(log_buf, ERR_TYPE_RTAS_LOG, fatal); if (fatal) { - udbg_printf("HW Error <0x%lx 0x%lx>\n", - *((unsigned long *)&log_entry), status); - printk(KERN_EMERG - "Error: Fatal hardware error <0x%lx 0x%lx>\n", - *((unsigned long *)&log_entry), status); + udbg_printf("HW Error <0x%lx 0x%x>\n", + *((unsigned long *)&log_buf), status); + printk(KERN_EMERG "Error: Fatal hardware error <0x%lx 0x%x>\n", + *((unsigned long *)&log_buf), status); #ifndef DEBUG /* Don't actually power off when debugging so we can test @@ -200,11 +222,13 @@ ppc_md.power_off(); #endif } else { - udbg_printf("Recoverable HW Error <0x%lx 0x%lx>\n", - *((unsigned long *)&log_entry), status); + udbg_printf("Recoverable HW Error <0x%lx 0x%x>\n", + *((unsigned long *)&log_buf), status); printk(KERN_WARNING - "Warning: Recoverable hardware error <0x%lx 0x%lx>\n", - *((unsigned long *)&log_entry), status); + "Warning: Recoverable hardware error <0x%lx 0x%x>\n", + *((unsigned long *)&log_buf), status); } + + spin_unlock(&log_lock); return IRQ_HANDLED; } diff -urN test25/include/asm-ppc64/prom.h ppc64-2.5-pseries/include/asm-ppc64/prom.h --- test25/include/asm-ppc64/prom.h 2004-06-24 21:46:45.000000000 +1000 +++ ppc64-2.5-pseries/include/asm-ppc64/prom.h 2004-07-02 17:06:22.000000000 +1000 @@ -269,6 +269,7 @@ extern void print_properties(struct device_node *node); extern int prom_n_addr_cells(struct device_node* np); extern int prom_n_size_cells(struct device_node* 
np);
+extern int prom_n_intr_cells(struct device_node* np);
 extern void prom_get_irq_senses(unsigned char *senses, int off, int max);
 extern void prom_add_property(struct device_node* np, struct property* prop);

From nfont at austin.ibm.com Sat Jul 3 01:47:07 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Fri, 02 Jul 2004 10:47:07 -0500
Subject: [PATCH] fix ras irq handlers
In-Reply-To: <16613.17776.748934.366072@cargo.ozlabs.ibm.com>
References: <1083860324.3702.8.camel@mudbug.austin.ibm.com> <16613.17776.748934.366072@cargo.ozlabs.ibm.com>
Message-ID: <40E5837B.4000008@austin.ibm.com>

Everything looks good to me. I did add three changes you may (or may not)
want to include. These are pretty minor things I noticed in looking over
the code with your patch applied. I included an updated patch with these
updates.

- Added a comment (/* EPOW Events */) before the irq initialization of
  epow-events to match the internal errors comment.
- Changed ras_epow_interrupt to use the same irq manipulation,
  virt_irq_to_real(irq_offset_down(irq)), that you used in
  ras_error_interrupt, for consistency.
- Moved the taking of the log_lock in ras_epow_interrupt to after the
  first rtas_call. We don't need it until the second rtas call and I
  don't see any need to hold it longer than need be.

> One of the changes I have made is to check the #interrupt-cells
> applying to the interrupts property. Without this I think we will
> incorrectly try to get interrupt 0 or 1, since when #interrupt-cells
> is 2, the second cell is the edge/level indicator.

Anton had opened an LTC bug (9508) for this issue. I posted a patch there
a while back but I like your implementation better. I'll update the bug
with your patch if that's OK.

--
Nathan Fontenot
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ras_irq_update.patch
Type: text/x-patch
Size: 8967 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040702/9e1a15c7/attachment.bin

From linas at austin.ibm.com Sat Jul 3 02:05:41 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Fri, 2 Jul 2004 11:05:41 -0500
Subject: [PATCH] 2.6 PPC64: lockfix for rtas error log (third-times-a-charm?)]
In-Reply-To: <20040701232912.GA27199@kroah.com>; from greg@kroah.com on Thu, Jul 01, 2004 at 04:29:12PM -0700
References: <20040701141931.E21634@forte.austin.ibm.com> <1088711355.21679.152.camel@nighthawk> <20040701153117.H21634@forte.austin.ibm.com> <40E47EE0.306@austin.ibm.com> <20040701162049.K21634@forte.austin.ibm.com> <20040701232912.GA27199@kroah.com>
Message-ID: <20040702110541.N21634@forte.austin.ibm.com>

Here's the latest patch, w/ whitespace fixes.

On Thu, Jul 01, 2004 at 04:29:12PM -0700, Greg KH wrote:
> On Thu, Jul 01, 2004 at 04:20:50PM -0500, linas at austin.ibm.com wrote:
> >
> > Is this really important enough to gen a new patch? If that's what it
> > takes to get the patch accepted, I'll do it ...
>
> Please do.

------------------------------------------

Upon closer analysis of the code, I see that log_rtas_error() was
incorrectly named, and was being used incorrectly. The solution is
to get rid of it entirely; see patch below.

So:
-- In one case kmalloc must be GFP_ATOMIC because rtas_call() can happen
   in any context, incl. irqs.
-- In the other case, I turned it into GFP_KERNEL, at the cost of
   doing a needless malloc/free in the vast majority of cases where
   there is no error. Small price, as I believe that this routine
   is very rarely called.
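
(Illustration only, not part of the patch.) The idea in both cases is to
snapshot the shared error buffer while the lock is still held, and to do the
slow logging only after the lock is dropped. A minimal user-space sketch of
that pattern follows. All names here (fw_lock, fw_err_buf, fw_call, slow_log,
ERR_LOG_MAX) are hypothetical stand-ins for rtas.lock, rtas_err_buf,
rtas_call(), log_error() and RTAS_ERROR_LOG_MAX, and malloc() stands in for
kmalloc(); it is not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define ERR_LOG_MAX 64	/* hypothetical stand-in for RTAS_ERROR_LOG_MAX */

static atomic_flag fw_lock = ATOMIC_FLAG_INIT;	/* stand-in for rtas.lock */
static char fw_err_buf[ERR_LOG_MAX];		/* only stable while fw_lock is held */
static char last_logged[ERR_LOG_MAX];		/* what the logger last received */

static void fw_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&fw_lock))
		;	/* spin until the flag was previously clear */
}

static void fw_lock_release(void)
{
	atomic_flag_clear(&fw_lock);
}

/* Stand-in for log_error(): assumed slow, so never called under the lock. */
static void slow_log(const char *buf)
{
	assert(buf != NULL);
	memcpy(last_logged, buf, ERR_LOG_MAX);
}

/* Stand-in for a firmware call; a failing call fills fw_err_buf. */
int fw_call(int fail, const char *detail)
{
	char *copy = NULL;
	int ret = 0;

	fw_lock_acquire();
	if (fail) {
		ret = -1;
		memset(fw_err_buf, 0, ERR_LOG_MAX);
		strncpy(fw_err_buf, detail, ERR_LOG_MAX - 1);

		/* Snapshot under the lock; the next call may overwrite the buffer. */
		copy = malloc(ERR_LOG_MAX);
		if (copy)
			memcpy(copy, fw_err_buf, ERR_LOG_MAX);
	}
	fw_lock_release();

	/* Log from the private copy, with the lock already dropped. */
	if (copy) {
		slow_log(copy);
		free(copy);
	}
	return ret;
}
```

In the kernel the allocation inside the locked region cannot sleep, which is
why the first case above needs GFP_ATOMIC; the second case can allocate with
GFP_KERNEL because it does so before taking the lock.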
Patch below, Signed-off-by: Linas Vepstas -------------- next part -------------- --- arch/ppc64/kernel/rtas.c.orig-pre-lockfix 2004-06-29 17:02:12.000000000 -0500 +++ arch/ppc64/kernel/rtas.c 2004-07-02 10:52:50.000000000 -0500 @@ -98,8 +98,14 @@ rtas_token(const char *service) } +/** Return a copy of the detailed error text associated with the + * most recent failed call to rtas. Because the error text + * might go stale if there are any other intervening rtas calls, + * this routine must be called atomically with whatever produced + * the error (i.e. with rtas.lock still held from the previous call). + */ static int -__log_rtas_error(void) +__fetch_rtas_last_error(void) { struct rtas_args err_args, save_args; @@ -126,19 +132,6 @@ __log_rtas_error(void) return err_args.rets[0]; } -void -log_rtas_error(void) -{ - unsigned long s; - int rc; - - spin_lock_irqsave(&rtas.lock, s); - rc = __log_rtas_error(); - spin_unlock_irqrestore(&rtas.lock, s); - if (rc == 0) - log_error(rtas_err_buf, ERR_TYPE_RTAS_LOG, 0); -} - int rtas_call(int token, int nargs, int nret, unsigned long *outputs, ...) @@ -147,6 +140,7 @@ rtas_call(int token, int nargs, int nret int i, logit = 0; unsigned long s; struct rtas_args *rtas_args; + char * buff_copy = NULL; int ret; PPCDBG(PPCDBG_RTAS, "Entering rtas_call\n"); @@ -181,7 +175,7 @@ rtas_call(int token, int nargs, int nret PPCDBG(PPCDBG_RTAS, "\treturned from rtas ...\n"); if (rtas_args->rets[0] == -1) - logit = (__log_rtas_error() == 0); + logit = (__fetch_rtas_last_error() == 0); ifppcdebug(PPCDBG_RTAS) { for(i=0; i < nret ;i++) @@ -193,12 +187,21 @@ rtas_call(int token, int nargs, int nret outputs[i] = rtas_args->rets[i+1]; ret = (int)((nret > 0) ? rtas_args->rets[0] : 0); + /* Log the error in the unlikely case that there was one. 
*/ + if (unlikely(logit)) { + buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC); + if (buff_copy) { + memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX); + } + } + /* Gotta do something different here, use global lock for now... */ spin_unlock_irqrestore(&rtas.lock, s); - if (logit) - log_error(rtas_err_buf, ERR_TYPE_RTAS_LOG, 0); - + if (buff_copy) { + log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0); + kfree(buff_copy); + } return ret; } @@ -218,7 +221,7 @@ rtas_extended_busy_delay_time(int status /* Use microseconds for reasonable accuracy */ for (ms=1; order > 0; order--) - ms *= 10; + ms *= 10; return ms; } @@ -409,9 +412,9 @@ rtas_restart(char *cmd) if (rtas_firmware_flash_list.next) rtas_flash_firmware(); - printk("RTAS system-reboot returned %d\n", + printk("RTAS system-reboot returned %d\n", rtas_call(rtas_token("system-reboot"), 0, 1, NULL)); - for (;;); + for (;;); } void @@ -419,10 +422,10 @@ rtas_power_off(void) { if (rtas_firmware_flash_list.next) rtas_flash_bypass_warning(); - /* allow power on only with power button press */ - printk("RTAS power-off returned %d\n", - rtas_call(rtas_token("power-off"), 2, 1, NULL,0xffffffff,0xffffffff)); - for (;;); + /* allow power on only with power button press */ + printk("RTAS power-off returned %d\n", + rtas_call(rtas_token("power-off"), 2, 1, NULL,0xffffffff,0xffffffff)); + for (;;); } void @@ -430,7 +433,7 @@ rtas_halt(void) { if (rtas_firmware_flash_list.next) rtas_flash_bypass_warning(); - rtas_power_off(); + rtas_power_off(); } /* Must be in the RMO region, so we place it here */ @@ -460,7 +463,9 @@ asmlinkage int ppc_rtas(struct rtas_args { struct rtas_args args; unsigned long flags; + char * buff_copy; int nargs; + int err_rc; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -479,18 +484,32 @@ asmlinkage int ppc_rtas(struct rtas_args nargs * sizeof(rtas_arg_t)) != 0) return -EFAULT; + buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_KERNEL); + spin_lock_irqsave(&rtas.lock, flags); get_paca()->xRtas = args; 
 	enter_rtas(__pa(&get_paca()->xRtas));
 	args = get_paca()->xRtas;
 
-	spin_unlock_irqrestore(&rtas.lock, flags);
-
 	args.rets = (rtas_arg_t *)&(args.args[nargs]);
-	if (args.rets[0] == -1)
-		log_rtas_error();
+	if (args.rets[0] == -1) {
+		err_rc = __fetch_rtas_last_error();
+		if ((err_rc == 0) && buff_copy) {
+			memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX);
+		}
+	}
+
+	spin_unlock_irqrestore(&rtas.lock, flags);
+
+	if (buff_copy) {
+		if ((args.rets[0] == -1) && (err_rc == 0)) {
+			log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0);
+		}
+		kfree(buff_copy);
+	}
+
 	/* Copy out args. */
 	if (copy_to_user(uargs->args + nargs,
 			 args.args + nargs,

From linas at austin.ibm.com Sat Jul 3 02:33:33 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Fri, 2 Jul 2004 11:33:33 -0500
Subject: (resend) [PATCH] 2.6 RPAPHP structure size/performance
Message-ID: <20040702113333.R21634@forte.austin.ibm.com>

Hi Greg,

Please review and apply the following patch if you find it agreeable.
This patch does not make any functional changes, but does improve
both performance and memory usage by rearranging structure elements.
The need for these changes became apparent during a code review
of the disassembly involving this structure.

The memory footprint of this structure is made smaller by grouping
the byte fields next to each other. The access of the list_head can
be simplified by making it the first element of the structure,
thus avoiding a needless add-immediate without negatively impacting
any of the other accesses.
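
(Illustration only, not part of the patch.) The effect of both changes can be
checked with sizeof and offsetof. The sketch below uses cut-down, hypothetical
versions of the structure (slot_before, slot_after) and a stand-in for
struct list_head; the real struct slot has more members. On a typical LP64
ABI I would expect 64 bytes before and 56 after, but the only thing the
sketch relies on is that the regrouped layout is smaller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Byte fields scattered between pointers: each one is followed by padding. */
struct slot_before {
	int state;
	uint32_t index;
	char *name;
	uint8_t removable;
	void *dn;
	uint8_t dev_type;
	void *hotplug_slot;
	struct list_node slot_list;
};

/* Byte fields grouped together, list node moved to offset 0. */
struct slot_after {
	struct list_node slot_list;
	int state;
	uint32_t index;
	char *name;
	uint8_t removable;
	uint8_t dev_type;
	void *dn;
	void *hotplug_slot;
};

/*
 * Map a list node back to its containing slot. With the node at offset 0
 * the subtraction is by zero, so no address adjustment is needed.
 */
struct slot_after *slot_of(struct list_node *n)
{
	assert(n != NULL);
	return (struct slot_after *)((char *)n -
				     offsetof(struct slot_after, slot_list));
}
```

With the list node first, walking the slot list and getting back to the slot
needs no offset arithmetic, which is the add-immediate mentioned above.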
Signed-off-by: Linas Vepstas --linas --- drivers/pci/hotplug/rpaphp.h.orig 2004-06-18 16:10:47.000000000 -0500 +++ drivers/pci/hotplug/rpaphp.h 2004-06-23 13:28:20.000000000 -0500 @@ -85,6 +85,7 @@ struct rpaphp_pci_func { * struct slot - slot information for each *physical* slot */ struct slot { + struct list_head rpaphp_slot_list; int state; u32 index; u32 type; @@ -92,6 +93,7 @@ struct slot { char *name; char *location; u8 removable; + u8 dev_type; /* VIO or PCI */ struct device_node *dn; /* slot's device_node in OFDT */ /* dn has phb info */ struct pci_dev *bridge; /* slot's pci_dev in pci_devices */ @@ -99,9 +101,7 @@ struct slot { struct list_head pci_funcs; /* pci_devs in PCI slot */ struct vio_dev *vio_dev; /* vio_dev in VIO slot */ } dev; - u8 dev_type; /* VIO or PCI */ struct hotplug_slot *hotplug_slot; - struct list_head rpaphp_slot_list; }; extern struct hotplug_slot_ops rpaphp_hotplug_slot_ops; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Sat Jul 3 03:29:49 2004 From: linas at austin.ibm.com (linas at austin.ibm.com) Date: Fri, 2 Jul 2004 12:29:49 -0500 Subject: EEH/Hotplug (was Re: [PATCH] rpaphp broken in ameslab) In-Reply-To: <16612.63829.67601.300009@cargo.ozlabs.ibm.com>; from paulus@samba.org on Fri, Jul 02, 2004 at 03:57:41PM +1000 References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> <16611.29735.778269.316356@cargo.ozlabs.ibm.com> <20040701131430.C21634@forte.austin.ibm.com> <16612.63829.67601.300009@cargo.ozlabs.ibm.com> Message-ID: <20040702122949.S21634@forte.austin.ibm.com> On Fri, Jul 02, 2004 at 03:57:41PM +1000, Paul Mackerras wrote: > > > I want to see a notifier list exported by eeh.c as I proposed in a > > > previous email before that goes 
upstream.
> >
> > It's currently implemented as a work queue. Is that acceptable?
> > To keep gregkh happy, I'll move the work-queue to
> > drivers/pci/hotplug/rpaphp_eeh.c, will this work?
>
> It's not the work queue that is the problem, it is that the EEH code
> is taking a decision about what hotplug should do. I am saying that
> the EEH code should offer to provide notifications to any interested
> code about slot isolation events. The slot isolation event is a fact,
> the request to do an unplug operation is policy. Let's leave the
> policy up to the rpaphp driver and/or userspace.

I'm not yet convinced that hotplug should be the focal point for
device driver policy decisions, but I'll go ahead and implement the
notifier chain for now, and see what happens.

Note that the scsi generic layer implements a bunch of policy of
almost the same kind, except that it's for the scsi bus,
and not for the pci bus. Not all scsi device drivers use the
scsi-generic layer, but those that do get a reset sequence something
like the following:

-- if device not responding, reset device
-- if above failed, retry a few times.
-- if still failed, reset scsi bus
-- if still failed, retry a few times ...
-- if above failed, reset scsi controller

For pci bus disconnection events that affected scsi devices, I was
going to tap into that 'policy' code. I'm not sure I want to comment
more until I try the prototype.

I'm not sure if anyone is thinking about i/o fabrics yet, or how
that policy gets done ... for example, one disk is attached to
two scsi controllers, and there was an eeh event on one of the
controllers; where is the failover policy implemented? Currently,
I think all the device drivers that do this are proprietary ...

--linas

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From greg at kroah.com Sat Jul 3 04:12:15 2004
From: greg at kroah.com (Greg KH)
Date: Fri, 2 Jul 2004 11:12:15 -0700
Subject: EEH/Hotplug (was Re: [PATCH] rpaphp broken in ameslab)
In-Reply-To: <20040702122949.S21634@forte.austin.ibm.com>
References: <40E1F6C7.80907@us.ibm.com> <16610.22790.655073.704981@cargo.ozlabs.ibm.com> <40E30E84.8090206@us.ibm.com> <20040630141433.V21634@forte.austin.ibm.com> <16611.20373.87299.105401@cargo.ozlabs.ibm.com> <20040630202032.A21634@forte.austin.ibm.com> <16611.29735.778269.316356@cargo.ozlabs.ibm.com> <20040701131430.C21634@forte.austin.ibm.com> <16612.63829.67601.300009@cargo.ozlabs.ibm.com> <20040702122949.S21634@forte.austin.ibm.com>
Message-ID: <20040702181214.GA28182@kroah.com>

On Fri, Jul 02, 2004 at 12:29:49PM -0500, linas at austin.ibm.com wrote:
> On Fri, Jul 02, 2004 at 03:57:41PM +1000, Paul Mackerras wrote:
> > > > I want to see a notifier list exported by eeh.c as I proposed in a
> > > > previous email before that goes upstream.
> > >
> > > Its currently implemented as a work queue. Is that acceptable?
> > > To keep gregkh happy, I'll move the work-queue to
> > > drivers/pci/hotplug/rpaphp_eeh.c, will this work?
> >
> > It's not the work queue that is the problem, it is that the EEH code
> > is taking a decision about what hotplug should do. I am saying that
> > the EEH code should offer to provide notifications to any interested
> > code about slot isolation events. The slot isolation event is a fact,
> > the request to do an unplug operation is policy. Let's leave the
> > policy up to the rpaphp driver and/or userspace.
>
> I'm not yet convinced that hotplug should be the focal point for
> device driver policy decisions,

Sorry, but you're a bit late to the table for trying to change this
overall kernel design decision :)

> but I'll go ahead and implement the notifier chain for now, and see
> what happens.

Thank you.
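
(Illustration only, not from the thread.) The fact/policy split quoted above
can be sketched as a small notifier chain: the error-detection side only
reports that a slot was isolated, and each subscriber decides what to do
about it. This is a user-space sketch with hypothetical names (struct
notifier, isolation_notifier_register, isolation_notify, EEH_SLOT_ISOLATED,
hotplug_on_isolation); it is not the kernel's notifier API and not the
interface that was eventually implemented.

```c
#include <assert.h>
#include <stddef.h>

enum { EEH_SLOT_ISOLATED = 1 };	/* hypothetical event code */

struct notifier {
	int (*call)(struct notifier *self, int event, void *slot);
	struct notifier *next;
};

static struct notifier *isolation_chain;	/* singly linked list of subscribers */

/* Subscribe to isolation events, roughly analogous to request_irq(). */
void isolation_notifier_register(struct notifier *n)
{
	assert(n != NULL && n->call != NULL);
	n->next = isolation_chain;
	isolation_chain = n;
}

/*
 * Called by the error-detection side. It states the fact and nothing
 * more; returns how many subscribers were told.
 */
int isolation_notify(int event, void *slot)
{
	int told = 0;

	for (struct notifier *n = isolation_chain; n != NULL; n = n->next) {
		n->call(n, event, slot);
		told++;
	}
	return told;
}

/* Example subscriber: the policy (request an unplug) lives here. */
static int unplug_requests;

int hotplug_on_isolation(struct notifier *self, int event, void *slot)
{
	(void)self;
	(void)slot;
	if (event == EEH_SLOT_ISOLATED)
		unplug_requests++;	/* a real driver would schedule the unplug */
	return 0;
}

int hotplug_unplug_requests(void)
{
	return unplug_requests;
}
```

With no subscriber registered the event is simply dropped, so the reporting
side never needs to know whether a hotplug module is loaded.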
> Note that the scsi generic layer implements a bunch of policy
> almost the same kind of thing, except that it's for the scsi bus,
> and not for the pci bus. Not all scsi device drivers use the
> scsi-generic layer, but those that do get a reset sequence something
> like the following:
>
> -- if device not responding, reset device
> -- if above failed, retry a few times.
> -- if still failed, reset scsi bus
> -- if still failed, retry a few times ...
> -- if above failed, reset scsi controller
>
> For pci bus disconnection events that affected scsi devices, I was
> going to tap into that 'policy' code. I'm not sure I want to comment
> more until I try the prototype.

scsi errors and pci errors are quite different things. For one, I'm
pretty sure the scsi stuff is specified by the spec. And it's way more
common than pci errors would be. It's also done in a generic manner,
not an arch-specific way, which is a good thing.

> I'm not sure if anyone is thinking about i/o fabrics yet, or how
> that policy gets done ... for example, one disk is attached to
> two scsi controllers, and there was an eeh event on one of the
> controllers; where is the failover policy implemented? Currently,
> I think all the device drivers that do this are all proprietary ...

The multipath people are working on this, using dm and userspace stuff.
The kernel drivers that try to do this within the kernel have been
rejected for one reason or another (not the least being that no company
seems to want to release them...)

thanks,

greg k-h

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From greg at kroah.com Sat Jul 3 06:56:51 2004
From: greg at kroah.com (Greg KH)
Date: Fri, 2 Jul 2004 13:56:51 -0700
Subject: [PATCH] rpaphp broken in ameslab
In-Reply-To: <40E1F6C7.80907@us.ibm.com>
References: <40E1D5BE.2010504@austin.ibm.com> <40E1F6C7.80907@us.ibm.com>
Message-ID: <20040702205651.GD29580@kroah.com>

On Tue, Jun 29, 2004 at 06:09:59PM -0500, Linda Xie wrote:
> Hi Joel,
>
> Any changes to rpaphp (or rpaphp related) now should be posted on PCI
> hotplug mailing list (CC to kernel mailing list and Greg KH). After
> Greg applies the changes to his tree, the changes will be merged into
> ameslab. BTW, I posted a separate patch that exports pci_scan_child_bus
> for rpaphp a while ago.
>
> Greg, Can you apply the attached patch to your tree?
>
> Signed-off-by: Linda Xie lxie at us.ibm.com

Applied, thanks.

greg k-h

From sharada at in.ibm.com Mon Jul 5 16:53:00 2004
From: sharada at in.ibm.com (R Sharada)
Date: Mon, 5 Jul 2004 12:23:00 +0530
Subject: Fw: debugging on ppc with xmon
Message-ID: <20040705065300.GB1458@in.ibm.com>

Hello,

Perhaps I sent this mail out earlier to the wrong id for the list. Did not
see it on the list. Hence resending - hopefully this time I am sending to
the right id.

Thanks and Regards,
Sharada

----- Forwarded message from R Sharada -----

Date: Mon, 5 Jul 2004 11:28:21 +0530
From: R Sharada
To: owner-linuxppc64-dev at lists.linuxppc.org
Subject: debugging on ppc with xmon
Reply-To: sharada at in.ibm.com

Hello,

I have been learning and understanding ppc lately, and was working on
trying to do some re-ordering of the prom code. As a first step, I was
trying to move the setting of the global cpuid masks to a later point in
time, in setup.c, within setup_system.
So, I moved the code performing the cpuset in the prom_hold_cpus()
function to a function, cpuid_setup(), and called it within setup_system,
after finish_device_tree() (within CONFIG_PSERIES).

When I rebooted this kernel, it oopsed in xmon right after prom_init,
with a data SLB access (which in this case is most likely to be a wrong
memory reference or null pointer reference).

I did go through Olof's mail on debugging with xmon on ppc. However, I am
still not clear as to what access is causing the invalid memory reference.
A zr at the xmon prompt does not reboot but faults back again into xmon.

How can I obtain an objdump output with source line correlation to the
assembly code?

Any pointers or directions as to how to debug this would be helpful, as I
am starting off with ppc debugging with this as the first.

The panic oops is pasted out here:

Calling quiesce ...
returning from prom_init
cpu 0x0: Vector: 380 (Data SLB Access) at [c000000000563b70]
    pc: c000000000039d98: .cpuid_setup+0x1d0/0x3b0
    lr: c000000000039d7c: .cpuid_setup+0x1b4/0x3b0
    sp: c000000000563df0
   msr: 9000000000001032
   dar: 4202000
  current = 0xc0000000005f5480
  paca    = 0xc000000000564000
    pid   = 0, comm = swapper
enter ? for help
0:mon> ?
0:mon> di c000000000039d98
c000000000039d98  901c0000  stw     r0,0(r28)
c000000000039d9c  4800097d  bl      c00000000003a718  # .get_property0
c000000000039da0  60000000  nop
c000000000039da4  38a00000  li      r5,0
c000000000039da8  e882a318  ld      r4,-23784(r2)
c000000000039dac  7c7c1b78  mr      r28,r3
c000000000039db0  7fc3f378  mr      r3,r30
c000000000039db4  48000965  bl      c00000000003a718  # .get_property0
c000000000039db8  60000000  nop
c000000000039dbc  38a00001  li      r5,1
c000000000039dc0  80030000  lwz     r0,0(r3)
c000000000039dc4  2f800000  cmpwi   cr7,r0,0
c000000000039dc8  419c0018  blt     cr7,c000000000039de0  # .cpuid_setup+0
c000000000039dcc  7c0007b4  extsw   r0,r0
c000000000039dd0  7805f022  rldicl  r5,r0,62,32
c000000000039dd4  2b850002  cmplwi  cr7,r5,2
0:mon> zr
cpu 0x0: Vector: 380 (Data SLB Access) at [c0000000005630f0]
    pc: c00000000004be54: .cmds+0x1914/0x1f34
    lr: c00000000004a864: .cmds+0x324/0x1f34
    sp: c000000000563370
   msr: 9000000000001032
   dar: 0
  current = 0xc0000000005f5480
  paca    = 0xc000000000564000
    pid   = 0, comm = swapper
cpu 0x0: Exception 380 (Data SLB Access) in xmon, returning to main loop

Thanks and Regards,
Sharada

----- End forwarded message -----

From segher at kernel.crashing.org Tue Jul 6 17:04:05 2004
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 6 Jul 2004 09:04:05 +0200
Subject: debugging on ppc with xmon
In-Reply-To: <20040705065300.GB1458@in.ibm.com>
References: <20040705065300.GB1458@in.ibm.com>
Message-ID:

> When I rebooted this kernel, it oopsed in xmon right after prom_init,
> with a data SLB access (which in this case is most likely to be a
> wrong memory
> reference or null pointer reference).

Null pointer references do not fail here :-( -- so it is some other
"wrong" access.

> How can I obtain a objdump output with source line corelation to the
> assembly code?

objdump -drS vmlinux

It will only show the C source files though, and/or only when you have
kernel debug info enabled.
Also, it is pretty useless at any decent compiler optimization level IMHO.

> Any pointers or directions as to how to debug this would be helpful,
> as I am starting off with ppc debugging with this as the first.

I normally just insert a boatload of printk()'s, see what interval it
fails in, and refine. Repeat until done. Or set a lot of breakpoints
in your debugger -- basically the same thing, but less convenient.

Segher

From linas at austin.ibm.com Wed Jul 7 07:12:54 2004
From: linas at austin.ibm.com (linas at austin.ibm.com)
Date: Tue, 6 Jul 2004 16:12:54 -0500
Subject: How to block pci config-reads during device self-test?
Message-ID: <20040706161254.C21634@forte.austin.ibm.com>

Hi all,

Am having trouble with PCI config-space reads ... I have a device
(actually Brian King has it) that can perform a built-in self-test (BIST).
However, if anything does a PCI config-read during BIST, then the
device does something crazy that makes the PCI controller chip take
it offline.

I'm not sure what's doing the config-space reads ... seems to be some
user-space tool or daemon. I'm wondering if there is any practical
way to block such reads to a given device until its self-test
sequence is completed.

I could try to modify the architecture-specific pci files to do this
(arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly ... is
there another way? or do we have to just learn to live with this
hardware?

--linas

From benh at kernel.crashing.org Wed Jul 7 07:15:49 2004
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 06 Jul 2004 16:15:49 -0500
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040706161254.C21634@forte.austin.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com>
Message-ID: <1089148549.1898.84.camel@gaston>

> I'm not sure what's doing the config-space reads ... seems to be some
> user-space tool or daemon. I'm wondering if there is any practical
> way to block such reads to a given device until its self-test
> sequence is completed. I could try to modify the architecture-specific
> pci files to do this (arch/ppc64/kernel/pSeries_pci.c) but this seems
> a tad ugly ... is there another way? or do we have to just learn to
> live with this hardware?

I see no sane solution... I have a similar (not exactly identical
though) problem on the g5 where some devices will lock up on config
access when they are power managed. What I did there was to hack a
list of devices to "ignore" on config accesses...

If your driver knows when the device is going away for the BIST and
for how long, maybe you can play tricks like adding a property to the
device node indicating if the device should be skipped on config
access (that is, basically return all ff's), have the driver set that
for the duration of the BIST, and have the config ops check that.

Definitely not the cleanest way, but to me, it seems like broken HW...

Ben.

From greg at kroah.com Wed Jul 7 09:48:43 2004
From: greg at kroah.com (Greg KH)
Date: Tue, 6 Jul 2004 16:48:43 -0700
Subject: How to block pci config-reads during device self-test?
In-Reply-To: <20040706161254.C21634@forte.austin.ibm.com>
References: <20040706161254.C21634@forte.austin.ibm.com>
Message-ID: <20040706234843.GA11327@kroah.com>

On Tue, Jul 06, 2004 at 04:12:54PM -0500, linas at austin.ibm.com wrote:
>
> I could try to modify the architecture-specific pci files to do this
> (arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly ... is
> there another way? or do we have to just learn to live with this
> hardware?
As I've told Brian in the past, you have to learn to just live with this
broken hardware.

Good luck,

greg k-h

From david at gibson.dropbear.id.au Wed Jul 7 16:06:26 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 7 Jul 2004 16:06:26 +1000
Subject: [0/4] RFC: SLB rewrite
Message-ID: <20040707060626.GA987@zax>

Here, at long last, is the much-awaited SLB rewrite! I have this in
four patches at the moment:

1/4: slbrewrite2 - The guts of the rewrite, miss path in asm
2/4: noprolog - Unify do_slb_bolted with full SLB miss path
3/4: customentryexit - Optimise SLB exception entry/exit path
4/4: newvsid - Replace VSID algorithm with a better one

Overall these seem to improve the (user address) SLB miss time by
around 25% (200ns to 150ns on a G5). Test program and resulting
graphs are at http://www.ozlabs.org/people/dgibson/slbtest/
The most interesting graph is probably:
http://www.ozlabs.org/people/dgibson/slbtest/stagger.png

These patches have been tested (though not extensively) on an Apple G5
(PowerPC 970), a p630 LPAR (Power4+), an RS/6000 270 (POWER3) and an
RS64 iSeries partition (the latter two obviously aren't expected to
show any improvement, the point is just to check I haven't broken
things dramatically for STAB machines). Anton has also tested some
earlier versions of the rewrite on a Power5 (bare metal Linux).

If there are no serious problems, I'm thinking of pushing at least the
first three of these to Linus/akpm in the next week or so.

--
David Gibson			| For every complex problem there is a
david AT gibson.dropbear.id.au	| solution which is simple, neat and
				| wrong.
http://www.ozlabs.org/people/dgibson

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From david at gibson.dropbear.id.au Wed Jul 7 16:06:35 2004
From: david at gibson.dropbear.id.au (David Gibson)
Date: Wed, 7 Jul 2004 16:06:35 +1000
Subject: [1/4] RFC: SLB rewrite (core rewrite)
Message-ID: <20040707060635.GB987@zax>

Rewrite/cleanup of the SLB management code. This removes nearly all
the SLB related code from arch/ppc64/kernel/stab.c and puts a rewritten
version in arch/ppc64/mm, where it better belongs. The main SLB miss
path is in assembler and the other routines have been cleaned up and
streamlined.

Notable changes:
- Ugly bitfields no longer used for generating SLB entries.
- slb_allocate() (the main SLB miss routine) is now in assembler,
  and all the data it uses is stored in the PACA.
- The mm context is now copied into the PACA at context switch
  time, to avoid looking up the thread struct on SLB miss.
- An SLB miss will now never (directly) result in a call to
  do_page_fault. If we get a miss on a totally bogus address the
  handler will now put in an SLB referencing VSID 0. This will never
  have any pages, so we'll get the (fatal) page fault shortly
  afterwards. This simplifies the SLB entry and exit paths.
- The round-robin pointer in the PACA now references the last-used
  instead of next-to-use SLB slot, which simplifies the asm for
  updating it slightly.

Index: working-2.6/include/asm-ppc64/mmu.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/mmu.h
+++ working-2.6/include/asm-ppc64/mmu.h
@@ -37,12 +37,6 @@
 mm_context_t ctx = { .id = REGION_ID(ea), KERNEL_LOW_HPAGES}; \
 ctx; })

-/*
- * Hardware Segment Lookaside Buffer Entry
- * This structure has been padded out to two 64b doublewords (actual SLBE's are
- * 94 bits). This padding facilites use by the segment management
- * instructions.
- */ typedef struct { unsigned long esid: 36; /* Effective segment ID */ unsigned long resv0:20; /* Reserved */ @@ -71,35 +65,6 @@ } dw1; } STE; -typedef struct { - unsigned long esid: 36; /* Effective segment ID */ - unsigned long v: 1; /* Entry valid (v=1) or invalid */ - unsigned long null1:15; /* padding to a 64b boundary */ - unsigned long index:12; /* Index to select SLB entry. Used by slbmte */ -} slb_dword0; - -typedef struct { - unsigned long vsid: 52; /* Virtual segment ID */ - unsigned long ks: 1; /* Supervisor (privileged) state storage key */ - unsigned long kp: 1; /* Problem state storage key */ - unsigned long n: 1; /* No-execute if n=1 */ - unsigned long l: 1; /* Virt pages are large (l=1) or 4KB (l=0) */ - unsigned long c: 1; /* Class */ - unsigned long resv0: 7; /* Padding to a 64b boundary */ -} slb_dword1; - -typedef struct { - union { - unsigned long dword0; - slb_dword0 dw0; - } dw0; - - union { - unsigned long dword1; - slb_dword1 dw1; - } dw1; -} SLBE; - /* Hardware Page Table Entry */ #define HPTES_PER_GROUP 8 @@ -259,6 +224,30 @@ #define STAB0_PHYS_ADDR (STAB0_PAGE<> SID_SHIFT) & SID_MASK) #ifdef CONFIG_HUGETLB_PAGE @@ -37,8 +38,8 @@ #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) /* For 64-bit processes the hugepage range is 1T-1.5T */ -#define TASK_HPAGE_BASE (0x0000010000000000UL) -#define TASK_HPAGE_END (0x0000018000000000UL) +#define TASK_HPAGE_BASE ASM_CONST(0x0000010000000000) +#define TASK_HPAGE_END ASM_CONST(0x0000018000000000) #define LOW_ESID_MASK(addr, len) (((1U << (GET_ESID(addr+len-1)+1)) \ - (1U << GET_ESID(addr))) & 0xffff) Index: working-2.6/include/asm-ppc64/mmu_context.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu_context.h +++ working-2.6/include/asm-ppc64/mmu_context.h @@ -136,7 +136,7 @@ } extern void flush_stab(struct task_struct *tsk, struct mm_struct *mm); -extern void flush_slb(struct task_struct *tsk, struct mm_struct *mm); +extern void 
switch_slb(struct task_struct *tsk, struct mm_struct *mm); /* * switch_mm is the entry point called from the architecture independent @@ -161,7 +161,7 @@ return; if (cur_cpu_spec->cpu_features & CPU_FTR_SLB) - flush_slb(tsk, next); + switch_slb(tsk, next); else flush_stab(tsk, next); } @@ -181,10 +181,6 @@ local_irq_restore(flags); } -#define VSID_RANDOMIZER 42470972311UL -#define VSID_MASK 0xfffffffffUL - - /* This is only valid for kernel (including vmalloc, imalloc and bolted) EA's */ static inline unsigned long Index: working-2.6/include/asm-ppc64/paca.h =================================================================== --- working-2.6.orig/include/asm-ppc64/paca.h +++ working-2.6/include/asm-ppc64/paca.h @@ -78,20 +78,25 @@ u64 exmc[8]; /* used for machine checks */ u64 exslb[8]; /* used for SLB/segment table misses * on the linear mapping */ - u64 exdsi[8]; /* used for linear mapping hash table misses */ + mm_context_t context; + u16 slb_cache[SLB_CACHE_ENTRIES]; + u16 slb_cache_ptr; /* * then miscellaneous read-write fields */ struct task_struct *__current; /* Pointer to current */ u64 kstack; /* Saved Kernel stack addr */ - u64 stab_next_rr; /* stab/slb round-robin counter */ + u64 stab_rr; /* stab/slb round-robin counter */ u64 next_jiffy_update_tb; /* TB value for next jiffy update */ u64 saved_r1; /* r1 save for RTAS calls */ u64 saved_msr; /* MSR saved here by enter_rtas */ u32 lpevent_count; /* lpevents processed */ u8 proc_enabled; /* irq soft-enable flag */ + /* not yet used */ + u64 exdsi[8]; /* used for linear mapping hash table misses */ + /* * iSeries structues which the hypervisor knows about - Not * sure if these particularly need to be cacheline aligned. 
Index: working-2.6/arch/ppc64/kernel/stab.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/stab.c +++ working-2.6/arch/ppc64/kernel/stab.c @@ -20,26 +20,10 @@ #include #include -static int make_ste(unsigned long stab, unsigned long esid, unsigned long vsid); -static void make_slbe(unsigned long esid, unsigned long vsid, int large, - int kernel_segment); +static int make_ste(unsigned long stab, unsigned long esid, + unsigned long vsid); -static inline void slb_add_bolted(void) -{ -#ifndef CONFIG_PPC_ISERIES - unsigned long esid = GET_ESID(VMALLOCBASE); - unsigned long vsid = get_kernel_vsid(VMALLOCBASE); - - WARN_ON(!irqs_disabled()); - - /* - * Bolt in the first vmalloc segment. Since modules end - * up there it gets hit very heavily. - */ - get_paca()->stab_next_rr = 1; - make_slbe(esid, vsid, 0, 1); -#endif -} +void slb_initialize(void); /* * Build an entry for the base kernel segment and put it into @@ -48,32 +32,13 @@ */ void stab_initialize(unsigned long stab) { - unsigned long esid, vsid; - int seg0_largepages = 0; - - esid = GET_ESID(KERNELBASE); - vsid = get_kernel_vsid(esid << SID_SHIFT); - - if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) - seg0_largepages = 1; + unsigned long vsid = get_kernel_vsid(KERNELBASE); if (cur_cpu_spec->cpu_features & CPU_FTR_SLB) { - /* Invalidate the entire SLB & all the ERATS */ -#ifdef CONFIG_PPC_ISERIES - asm volatile("isync; slbia; isync":::"memory"); -#else - asm volatile("isync":::"memory"); - asm volatile("slbmte %0,%0"::"r" (0) : "memory"); - asm volatile("isync; slbia; isync":::"memory"); - get_paca()->stab_next_rr = 0; - make_slbe(esid, vsid, seg0_largepages, 1); - asm volatile("isync":::"memory"); -#endif - - slb_add_bolted(); + slb_initialize(); } else { asm volatile("isync; slbia; isync":::"memory"); - make_ste(stab, esid, vsid); + make_ste(stab, GET_ESID(KERNELBASE), vsid); /* Order update */ asm volatile("sync":::"memory"); @@ -129,7 +94,7 @@ * Could 
not find empty entry, pick one with a round robin selection. * Search all entries in the two groups. */ - castout_entry = get_paca()->stab_next_rr; + castout_entry = get_paca()->stab_rr; for (i = 0; i < 16; i++) { if (castout_entry < 8) { global_entry = (esid & 0x1f) << 3; @@ -148,7 +113,7 @@ castout_entry = (castout_entry + 1) & 0xf; } - get_paca()->stab_next_rr = (castout_entry + 1) & 0xf; + get_paca()->stab_rr = (castout_entry + 1) & 0xf; /* Modify the old entry to the new value. */ @@ -314,229 +279,3 @@ preload_stab(tsk, mm); } - -/* - * SLB stuff - */ - -/* - * Create a segment buffer entry for the given esid/vsid pair. - * - * NOTE: A context syncronising instruction is required before and after - * this, in the common case we use exception entry and rfid. - */ -static void make_slbe(unsigned long esid, unsigned long vsid, int large, - int kernel_segment) -{ - unsigned long entry, castout_entry; - union { - unsigned long word0; - slb_dword0 data; - } esid_data; - union { - unsigned long word0; - slb_dword1 data; - } vsid_data; - struct paca_struct *lpaca = get_paca(); - - /* - * We take the next entry, round robin. Previously we tried - * to find a free slot first but that took too long. Unfortunately - * we dont have any LRU information to help us choose a slot. - */ - - /* - * Never cast out the segment for our kernel stack. Since we - * dont invalidate the ERAT we could have a valid translation - * for the kernel stack during the first part of exception exit - * which gets invalidated due to a tlbie from another cpu at a - * non recoverable point (after setting srr0/1) - Anton - * - * paca Ksave is always valid (even when on the interrupt stack) - * so we use that. - */ - castout_entry = lpaca->stab_next_rr; - do { - entry = castout_entry; - castout_entry++; - /* - * We bolt in the first kernel segment and the first - * vmalloc segment. 
- */ - if (castout_entry >= SLB_NUM_ENTRIES) - castout_entry = 2; - asm volatile("slbmfee %0,%1" : "=r" (esid_data) : "r" (entry)); - } while (esid_data.data.v && - esid_data.data.esid == GET_ESID(lpaca->kstack)); - - lpaca->stab_next_rr = castout_entry; - - /* slbie not needed as the previous mapping is still valid. */ - - /* - * Write the new SLB entry. - */ - vsid_data.word0 = 0; - vsid_data.data.vsid = vsid; - vsid_data.data.kp = 1; - if (large) - vsid_data.data.l = 1; - if (kernel_segment) - vsid_data.data.c = 1; - else - vsid_data.data.ks = 1; - - esid_data.word0 = 0; - esid_data.data.esid = esid; - esid_data.data.v = 1; - esid_data.data.index = entry; - - /* - * No need for an isync before or after this slbmte. The exception - * we enter with and the rfid we exit with are context synchronizing. - */ - asm volatile("slbmte %0,%1" : : "r" (vsid_data), "r" (esid_data)); -} - -static inline void __slb_allocate(unsigned long esid, unsigned long vsid, - mm_context_t context) -{ - int large = 0; - int region_id = REGION_ID(esid << SID_SHIFT); - unsigned long offset; - - if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) { - if (region_id == KERNEL_REGION_ID) - large = 1; - else if (region_id == USER_REGION_ID) - large = in_hugepage_area(context, esid << SID_SHIFT); - } - - make_slbe(esid, vsid, large, region_id != USER_REGION_ID); - - if (region_id != USER_REGION_ID) - return; - - offset = __get_cpu_var(stab_cache_ptr); - if (offset < NR_STAB_CACHE_ENTRIES) - __get_cpu_var(stab_cache[offset++]) = esid; - else - offset = NR_STAB_CACHE_ENTRIES+1; - __get_cpu_var(stab_cache_ptr) = offset; -} - -/* - * Allocate a segment table entry for the given ea. - */ -int slb_allocate(unsigned long ea) -{ - unsigned long vsid, esid; - mm_context_t context; - - /* Check for invalid effective addresses. */ - if (unlikely(!IS_VALID_EA(ea))) - return 1; - - /* Kernel or user address? 
*/ - if (REGION_ID(ea) >= KERNEL_REGION_ID) { - context = KERNEL_CONTEXT(ea); - vsid = get_kernel_vsid(ea); - } else { - if (unlikely(!current->mm)) - return 1; - - context = current->mm->context; - vsid = get_vsid(context.id, ea); - } - - esid = GET_ESID(ea); -#ifndef CONFIG_PPC_ISERIES - BUG_ON((esid << SID_SHIFT) == VMALLOCBASE); -#endif - __slb_allocate(esid, vsid, context); - - return 0; -} - -/* - * preload some userspace segments into the SLB. - */ -static void preload_slb(struct task_struct *tsk, struct mm_struct *mm) -{ - unsigned long pc = KSTK_EIP(tsk); - unsigned long stack = KSTK_ESP(tsk); - unsigned long unmapped_base; - unsigned long pc_esid = GET_ESID(pc); - unsigned long stack_esid = GET_ESID(stack); - unsigned long unmapped_base_esid; - unsigned long vsid; - - if (test_tsk_thread_flag(tsk, TIF_32BIT)) - unmapped_base = TASK_UNMAPPED_BASE_USER32; - else - unmapped_base = TASK_UNMAPPED_BASE_USER64; - - unmapped_base_esid = GET_ESID(unmapped_base); - - if (!IS_VALID_EA(pc) || (REGION_ID(pc) >= KERNEL_REGION_ID)) - return; - vsid = get_vsid(mm->context.id, pc); - __slb_allocate(pc_esid, vsid, mm->context); - - if (pc_esid == stack_esid) - return; - - if (!IS_VALID_EA(stack) || (REGION_ID(stack) >= KERNEL_REGION_ID)) - return; - vsid = get_vsid(mm->context.id, stack); - __slb_allocate(stack_esid, vsid, mm->context); - - if (pc_esid == unmapped_base_esid || stack_esid == unmapped_base_esid) - return; - - if (!IS_VALID_EA(unmapped_base) || - (REGION_ID(unmapped_base) >= KERNEL_REGION_ID)) - return; - vsid = get_vsid(mm->context.id, unmapped_base); - __slb_allocate(unmapped_base_esid, vsid, mm->context); -} - -/* Flush all user entries from the segment table of the current processor. 
*/ -void flush_slb(struct task_struct *tsk, struct mm_struct *mm) -{ - unsigned long offset = __get_cpu_var(stab_cache_ptr); - union { - unsigned long word0; - slb_dword0 data; - } esid_data; - - if (offset <= NR_STAB_CACHE_ENTRIES) { - int i; - asm volatile("isync" : : : "memory"); - for (i = 0; i < offset; i++) { - esid_data.word0 = 0; - esid_data.data.esid = __get_cpu_var(stab_cache[i]); - BUG_ON(esid_data.data.esid == GET_ESID(VMALLOCBASE)); - asm volatile("slbie %0" : : "r" (esid_data)); - } - asm volatile("isync" : : : "memory"); - } else { - asm volatile("isync; slbia; isync" : : : "memory"); - slb_add_bolted(); - } - - /* Workaround POWER5 < DD2.1 issue */ - if (offset == 1 || offset > NR_STAB_CACHE_ENTRIES) { - /* - * flush segment in EEH region, we dont normally access - * addresses in this region. - */ - esid_data.word0 = 0; - esid_data.data.esid = EEH_REGION_ID; - asm volatile("slbie %0" : : "r" (esid_data)); - } - - __get_cpu_var(stab_cache_ptr) = 0; - - preload_slb(tsk, mm); -} Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S +++ working-2.6/arch/ppc64/kernel/head.S @@ -498,7 +498,6 @@ mtcrf 0x80,r12 mfspr r12,SPRG2 EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, .do_slb_bolted) - /* Space for the naca. Architected to be located at real address * NACA_PHYS_ADDR. Various tools rely on this location being fixed. @@ -873,11 +872,7 @@ ld r3,PACA_EXGEN+EX_DAR(r13) std r3,_DAR(r1) bl .slb_allocate - cmpdi r3,0 /* Check return code */ - beq fast_exception_return /* Return if we succeeded */ - li r5,0 - std r5,_DSISR(r1) - b .handle_page_fault + b fast_exception_return .align 7 .globl InstructionAccess_common @@ -894,14 +889,7 @@ EXCEPTION_PROLOG_COMMON(0x480, PACA_EXGEN) ld r3,_NIP(r1) /* SRR0 = NIA */ bl .slb_allocate - or. 
r3,r3,r3 /* Check return code */ - beq+ fast_exception_return /* Return if we succeeded */ - - ld r4,_NIP(r1) - li r5,0 - std r4,_DAR(r1) - std r5,_DSISR(r1) - b .handle_page_fault + b fast_exception_return .align 7 .globl HardwareInterrupt_common @@ -1155,7 +1143,6 @@ * r9 - r13 are saved in paca->exslb. * We assume we aren't going to take any exceptions during this procedure. */ -/* XXX note fix masking in get_kernel_vsid to match */ _GLOBAL(do_slb_bolted) stw r9,PACA_EXSLB+EX_CCR(r13) /* save CR in exc. frame */ std r11,PACA_EXSLB+EX_SRR0(r13) /* save SRR0 in exc. frame */ @@ -1167,12 +1154,13 @@ */ /* r13 = paca */ + /* use a cpu feature mask if we ever change our slb size */ 1: ld r10,PACASTABRR(r13) - addi r9,r10,1 - cmpdi r9,SLB_NUM_ENTRIES + addi r10,r10,1 + cmpdi r10,SLB_NUM_ENTRIES blt+ 2f - li r9,2 /* dont touch slot 0 or 1 */ -2: std r9,PACASTABRR(r13) + li r10,SLB_NUM_BOLTED /* dont touch bolted slots */ +2: std r10,PACASTABRR(r13) /* r13 = paca, r10 = entry */ Index: working-2.6/arch/ppc64/kernel/asm-offsets.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/asm-offsets.c +++ working-2.6/arch/ppc64/kernel/asm-offsets.c @@ -86,10 +86,16 @@ DEFINE(PACASAVEDMSR, offsetof(struct paca_struct, saved_msr)); DEFINE(PACASTABREAL, offsetof(struct paca_struct, stab_real)); DEFINE(PACASTABVIRT, offsetof(struct paca_struct, stab_addr)); - DEFINE(PACASTABRR, offsetof(struct paca_struct, stab_next_rr)); + DEFINE(PACASTABRR, offsetof(struct paca_struct, stab_rr)); DEFINE(PACAR1, offsetof(struct paca_struct, saved_r1)); DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc)); DEFINE(PACAPROCENABLED, offsetof(struct paca_struct, proc_enabled)); + DEFINE(PACASLBCACHE, offsetof(struct paca_struct, slb_cache)); + DEFINE(PACASLBCACHEPTR, offsetof(struct paca_struct, slb_cache_ptr)); + DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id)); +#ifdef CONFIG_HUGETLB_PAGE + DEFINE(PACAHTLBSEGS, offsetof(struct 
paca_struct, context.htlb_segs)); +#endif /* CONFIG_HUGETLB_PAGE */ DEFINE(PACADEFAULTDECR, offsetof(struct paca_struct, default_decr)); DEFINE(PACAPROFENABLED, offsetof(struct paca_struct, prof_enabled)); DEFINE(PACAPROFLEN, offsetof(struct paca_struct, prof_len)); Index: working-2.6/arch/ppc64/kernel/pacaData.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/pacaData.c +++ working-2.6/arch/ppc64/kernel/pacaData.c @@ -55,7 +55,6 @@ .stab_addr = (asrv), /* Virt pointer to segment table */ \ .emergency_sp = &emergency_stack[((number)+1) * PAGE_SIZE], \ .cpu_start = (start), /* Processor start */ \ - .stab_next_rr = 1, \ .lppaca = { \ .xDesc = 0xd397d781, /* "LpPa" */ \ .xSize = sizeof(struct ItLpPaca), \ Index: working-2.6/arch/ppc64/kernel/smp.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/smp.c +++ working-2.6/arch/ppc64/kernel/smp.c @@ -385,8 +385,6 @@ /* Fixup atomic count: it exited inside IRQ handler. */ paca[lcpu].__current->thread_info->preempt_count = 0; - /* Fixup SLB round-robin so next segment (kernel) goes in segment 0 */ - paca[lcpu].stab_next_rr = 0; /* At boot this is done in prom.c. */ paca[lcpu].hw_cpu_id = pcpu; Index: working-2.6/arch/ppc64/mm/slb.c =================================================================== --- /dev/null +++ working-2.6/arch/ppc64/mm/slb.c @@ -0,0 +1,136 @@ +/* + * PowerPC64 SLB support. + * + * Copyright (C) 2004 David Gibson , IBM + * Based on earlier code written by: + * Dave Engebretsen and Mike Corrigan {engebret|mikejc}@us.ibm.com + * Copyright (c) 2001 Dave Engebretsen + * Copyright (C) 2002 Anton Blanchard , IBM + * + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version.
+ */ + +#include +#include +#include +#include +#include +#include +#include + +extern void slb_allocate(unsigned long ea); + +static inline void create_slbe(unsigned long ea, unsigned long vsid, + unsigned long flags, unsigned long entry) +{ + ea = (ea & ESID_MASK) | SLB_ESID_V | entry; + vsid = (vsid << SLB_VSID_SHIFT) | flags; + asm volatile("slbmte %0,%1" : + : "r" (vsid), "r" (ea) + : "memory" ); +} + +static void slb_add_bolted(void) +{ +#ifndef CONFIG_PPC_ISERIES + WARN_ON(!irqs_disabled()); + + /* If you change this make sure you change SLB_NUM_BOLTED + * appropriately too */ + + /* Slot 1 - first VMALLOC segment + * Since modules end up there it gets hit very heavily. + */ + create_slbe(VMALLOCBASE, get_kernel_vsid(VMALLOCBASE), + SLB_VSID_KERNEL, 1); + + asm volatile("isync":::"memory"); +#endif +} + +/* Flush all user entries from the segment table of the current processor. */ +void switch_slb(struct task_struct *tsk, struct mm_struct *mm) +{ + unsigned long offset = get_paca()->slb_cache_ptr; + unsigned long esid_data; + unsigned long pc = KSTK_EIP(tsk); + unsigned long stack = KSTK_ESP(tsk); + unsigned long unmapped_base; + + if (offset <= SLB_CACHE_ENTRIES) { + int i; + asm volatile("isync" : : : "memory"); + for (i = 0; i < offset; i++) { + esid_data = (unsigned long)get_paca()->slb_cache[i] + << SID_SHIFT; + asm volatile("slbie %0" : : "r" (esid_data)); + } + asm volatile("isync" : : : "memory"); + } else { + asm volatile("isync; slbia; isync" : : : "memory"); + slb_add_bolted(); + } + + /* Workaround POWER5 < DD2.1 issue */ + if (offset == 1 || offset > SLB_CACHE_ENTRIES) { + /* flush segment in EEH region, we shouldn't ever + * access addresses in this region. */ + asm volatile("slbie %0" : : "r"(EEHREGIONBASE)); + } + + get_paca()->slb_cache_ptr = 0; + get_paca()->context = mm->context; + + /* + * preload some userspace segments into the SLB. 
+ */ + if (test_tsk_thread_flag(tsk, TIF_32BIT)) + unmapped_base = TASK_UNMAPPED_BASE_USER32; + else + unmapped_base = TASK_UNMAPPED_BASE_USER64; + + if (pc >= KERNELBASE) + return; + slb_allocate(pc); + + if (GET_ESID(pc) == GET_ESID(stack)) + return; + + if (stack >= KERNELBASE) + return; + slb_allocate(stack); + + if ((GET_ESID(pc) == GET_ESID(unmapped_base)) + || (GET_ESID(stack) == GET_ESID(unmapped_base))) + return; + + if (unmapped_base >= KERNELBASE) + return; + slb_allocate(unmapped_base); +} + +void slb_initialize(void) +{ +#ifdef CONFIG_PPC_ISERIES + asm volatile("isync; slbia; isync":::"memory"); +#else + unsigned long flags = SLB_VSID_KERNEL; + + /* Invalidate the entire SLB (even slot 0) & all the ERATS */ + if (cur_cpu_spec->cpu_features & CPU_FTR_16M_PAGE) + flags |= SLB_VSID_L; + + asm volatile("isync":::"memory"); + asm volatile("slbmte %0,%0"::"r" (0) : "memory"); + asm volatile("isync; slbia; isync":::"memory"); + create_slbe(KERNELBASE, get_kernel_vsid(KERNELBASE), + flags, 0); + +#endif + slb_add_bolted(); + get_paca()->stab_rr = SLB_NUM_BOLTED; +} Index: working-2.6/arch/ppc64/mm/Makefile =================================================================== --- working-2.6.orig/arch/ppc64/mm/Makefile +++ working-2.6/arch/ppc64/mm/Makefile @@ -4,6 +4,6 @@ EXTRA_CFLAGS += -mno-minimal-toc -obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o tlb.o +obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o tlb.o slb_low.o slb.o obj-$(CONFIG_DISCONTIGMEM) += numa.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o Index: working-2.6/arch/ppc64/mm/fault.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/fault.c +++ working-2.6/arch/ppc64/mm/fault.c @@ -93,13 +93,15 @@ unsigned long is_write = error_code & 0x02000000; unsigned long trap = TRAP(regs); - if (trap == 0x300 || trap == 0x380) { + BUG_ON((trap == 0x380) || (trap == 0x480)); + + if (trap == 0x300) { if (debugger_fault_handler(regs)) 
return 0; } /* On a kernel SLB miss we can only check for a valid exception entry */ - if (!user_mode(regs) && (trap == 0x380 || address >= TASK_SIZE)) + if (!user_mode(regs) && (address >= TASK_SIZE)) return SIGSEGV; if (error_code & 0x00400000) { Index: working-2.6/arch/ppc64/mm/slb_low.S =================================================================== --- /dev/null +++ working-2.6/arch/ppc64/mm/slb_low.S @@ -0,0 +1,168 @@ +/* + * arch/ppc64/mm/slb_low.S + * + * Low-level SLB routines + * + * Copyright (C) 2004 David Gibson , IBM + * + * Based on earlier C version: + * Dave Engebretsen and Mike Corrigan {engebret|mikejc}@us.ibm.com + * Copyright (c) 2001 Dave Engebretsen + * Copyright (C) 2002 Anton Blanchard , IBM + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#include +#include +#include +#include +#include +#include +#include + +/* void slb_allocate(unsigned long ea); + * + * Create an SLB entry for the given EA (user or kernel). + * r3 = faulting address, r13 = PACA + * r9, r10, r11 are clobbered by this function + * No other registers are examined or changed. + */ +_GLOBAL(slb_allocate) + /* + * First find a slot, round robin. Previously we tried to find + * a free slot first but that took too long. Unfortunately we + * dont have any LRU information to help us choose a slot. + */ + srdi r9,r1,27 + ori r9,r9,1 /* mangle SP for later compare */ + + ld r10,PACASTABRR(r13) +3: + addi r10,r10,1 + /* use a cpu feature mask if we ever change our slb size */ + cmpldi r10,SLB_NUM_ENTRIES + + blt+ 4f + li r10,SLB_NUM_BOLTED +4: + slbmfee r11,r10 + /* Don't throw out the segment for our kernel stack. 
Since we + * dont invalidate the ERAT we could have a valid translation + * for the kernel stack during the first part of exception + * exit which gets invalidated due to a tlbie from another cpu + * at a non recoverable point (after setting srr0/1) - Anton + * + * The >> 27 (rather than >> 28) is so that the LSB is the + * valid bit - this way we check valid and ESID in one compare. + */ + srdi r11,r11,27 + cmpd r11,r9 + beq- 3b + + std r10,PACASTABRR(r13) + + /* r3 = faulting address, r10 = entry */ + + srdi r9,r3,60 /* get region */ + srdi r3,r3,28 /* get esid */ + cmpldi cr7,r9,0xc /* cmp KERNELBASE for later use */ + + /* r9 = region, r3 = esid, cr7 = <>KERNELBASE */ + + rldicr. r11,r3,32,16 + bne- 8f /* invalid ea bits set */ + addi r11,r9,-1 + cmpldi r11,0xb + blt- 8f /* invalid region */ + + /* r9 = region, r3 = esid, r10 = entry, cr7 = <>KERNELBASE */ + + blt cr7,0f /* user or kernel? */ + + /* kernel address */ + li r11,SLB_VSID_KERNEL +BEGIN_FTR_SECTION + bne cr7,9f + li r11,(SLB_VSID_KERNEL|SLB_VSID_L) +END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE) + b 9f + +0: /* user address */ + li r11,SLB_VSID_USER +#ifdef CONFIG_HUGETLB_PAGE +BEGIN_FTR_SECTION + /* check against the hugepage ranges */ + cmpldi r3,(TASK_HPAGE_END>>SID_SHIFT) + bge 6f /* >= TASK_HPAGE_END */ + cmpldi r3,(TASK_HPAGE_BASE>>SID_SHIFT) + bge 5f /* TASK_HPAGE_BASE..TASK_HPAGE_END */ + cmpldi r3,16 + bge 6f /* 4GB..TASK_HPAGE_BASE */ + + lhz r9,PACAHTLBSEGS(r13) + srd r9,r9,r3 + andi. 
r9,r9,1 + beq 6f + +5: /* this is a hugepage user address */ + li r11,(SLB_VSID_USER|SLB_VSID_L) +END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE) +#endif /* CONFIG_HUGETLB_PAGE */ + +6: ld r9,PACACONTEXTID(r13) + +9: /* r9 = "context", r3 = esid, r11 = flags, r10 = entry */ + + rldimi r9,r3,15,0 /* r9= VSID ordinal */ + +7: rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ + oris r10,r10,SLB_ESID_V@h /* r10 |= SLB_ESID_V */ + + /* r9 = ordinal, r3 = esid, r11 = flags, r10 = esid_data */ + + li r3,VSID_RANDOMIZER@higher + sldi r3,r3,32 + oris r3,r3,VSID_RANDOMIZER@h + ori r3,r3,VSID_RANDOMIZER@l + + mulld r9,r3,r9 /* r9 = ordinal * VSID_RANDOMIZER */ + clrldi r9,r9,28 /* r9 &= VSID_MASK */ + sldi r9,r9,SLB_VSID_SHIFT /* r9 <<= SLB_VSID_SHIFT */ + or r9,r9,r11 /* r9 |= flags */ + + /* r9 = vsid_data, r10 = esid_data, cr7 = <>KERNELBASE */ + + /* + * No need for an isync before or after this slbmte. The exception + * we enter with and the rfid we exit with are context synchronizing. + */ + slbmte r9,r10 + + bgelr cr7 /* we're done for kernel addresses */ + + /* Update the slb cache */ + lhz r3,PACASLBCACHEPTR(r13) /* offset = paca->slb_cache_ptr */ + cmpldi r3,SLB_CACHE_ENTRIES + bge 1f + + /* still room in the slb cache */ + sldi r11,r3,1 /* r11 = offset * sizeof(u16) */ + rldicl r10,r10,36,28 /* get low 16 bits of the ESID */ + add r11,r11,r13 /* r11 = (u16 *)paca + offset */ + sth r10,PACASLBCACHE(r11) /* paca->slb_cache[offset] = esid */ + addi r3,r3,1 /* offset++ */ + b 2f +1: /* offset >= SLB_CACHE_ENTRIES */ + li r3,SLB_CACHE_ENTRIES+1 +2: + sth r3,PACASLBCACHEPTR(r13) /* paca->slb_cache_ptr = offset */ + blr + +8: /* invalid EA */ + li r9,0 /* 0 VSID ordinal -> BAD_VSID */ + li r11,SLB_VSID_USER /* flags don't much matter */ + b 7b -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Wed Jul 7 16:06:48 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 7 Jul 2004 16:06:48 +1000 Subject: [2/4] RFC: SLB Rewrite (unify do_slb_bolted and slb_allocate) Message-ID: <20040707060648.GC987@zax> Unify do_slb_bolted with the general SLB miss path. There is now one SLB miss handler, in assembler, and called with only the low-level exception prolog (EXCEPTION_PROLOG_[PI]SERIES rather than EXCEPTION_PROLOG_COMMON) and minimal extra save/restore logic. Index: working-2.6/arch/ppc64/kernel/asm-offsets.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/asm-offsets.c +++ working-2.6/arch/ppc64/kernel/asm-offsets.c @@ -93,6 +93,7 @@ DEFINE(PACASLBCACHE, offsetof(struct paca_struct, slb_cache)); DEFINE(PACASLBCACHEPTR, offsetof(struct paca_struct, slb_cache_ptr)); DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id)); + DEFINE(PACASLBR3, offsetof(struct paca_struct, slb_r3)); #ifdef CONFIG_HUGETLB_PAGE DEFINE(PACAHTLBSEGS, offsetof(struct paca_struct, context.htlb_segs)); #endif /* CONFIG_HUGETLB_PAGE */ Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S +++ working-2.6/arch/ppc64/kernel/head.S @@ -200,6 +200,7 @@ #define EX_R13 32 #define EX_SRR0 40 #define EX_DAR 48 +#define EX_LR 48 /* SLB miss saves LR, but not DAR */ #define EX_DSISR 56 #define EX_CCR 60 @@ -433,18 +434,16 @@ .globl DataAccessSLB_Pseries DataAccessSLB_Pseries: mtspr SPRG1,r13 - mtspr SPRG2,r12 - mfspr r13,DAR - mfcr r12 - srdi r13,r13,60 - cmpdi r13,0xc - beq .do_slb_bolted_Pseries - mtcrf 0x80,r12 - mfspr r12,SPRG2 - EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, DataAccessSLB_common) + EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, data_slb_Pseries) STD_EXCEPTION_PSERIES(0x400, InstructionAccess) - STD_EXCEPTION_PSERIES(0x480, InstructionAccessSLB) + + . 
= 0x480 + .globl InstructionAccessSLB_Pseries +InstructionAccessSLB_Pseries: + mtspr SPRG1,r13 + EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, instr_slb_Pseries) + STD_EXCEPTION_PSERIES(0x500, HardwareInterrupt) STD_EXCEPTION_PSERIES(0x600, Alignment) STD_EXCEPTION_PSERIES(0x700, ProgramCheck) @@ -494,10 +493,6 @@ mfspr r12,SPRG2 EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, .do_stab_bolted) -_GLOBAL(do_slb_bolted_Pseries) - mtcrf 0x80,r12 - mfspr r12,SPRG2 - EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, .do_slb_bolted) /* Space for the naca. Architected to be located at real address * NACA_PHYS_ADDR. Various tools rely on this location being fixed. @@ -586,27 +581,23 @@ .globl DataAccessSLB_Iseries DataAccessSLB_Iseries: mtspr SPRG1,r13 /* save r13 */ - mtspr SPRG2,r12 - mfspr r13,DAR - mfcr r12 - srdi r13,r13,60 - cmpdi r13,0xc - beq .do_slb_bolted_Iseries - mtcrf 0x80,r12 - mfspr r12,SPRG2 - EXCEPTION_PROLOG_ISERIES_1(PACA_EXGEN) + EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB) EXCEPTION_PROLOG_ISERIES_2 - b DataAccessSLB_common + std r3,PACASLBR3(r13) + mfspr r3,DAR + b .do_slb_miss -.do_slb_bolted_Iseries: - mtcrf 0x80,r12 - mfspr r12,SPRG2 + STD_EXCEPTION_ISERIES(0x400, InstructionAccess, PACA_EXGEN) + + .globl InstructionAccessSLB_Iseries +InstructionAccessSLB_Iseries: + mtspr SPRG1,r13 /* save r13 */ EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB) EXCEPTION_PROLOG_ISERIES_2 - b .do_slb_bolted + std r3,PACASLBR3(r13) + mr r3,r11 + b .do_slb_miss - STD_EXCEPTION_ISERIES(0x400, InstructionAccess, PACA_EXGEN) - STD_EXCEPTION_ISERIES(0x480, InstructionAccessSLB, PACA_EXGEN) MASKABLE_EXCEPTION_ISERIES(0x500, HardwareInterrupt) STD_EXCEPTION_ISERIES(0x600, Alignment, PACA_EXGEN) STD_EXCEPTION_ISERIES(0x700, ProgramCheck, PACA_EXGEN) @@ -864,17 +855,6 @@ b .do_hash_page /* Try to handle as hpte fault */ .align 7 - .globl DataAccessSLB_common -DataAccessSLB_common: - mfspr r10,DAR - std r10,PACA_EXGEN+EX_DAR(r13) - EXCEPTION_PROLOG_COMMON(0x380, PACA_EXGEN) - ld r3,PACA_EXGEN+EX_DAR(r13) - std r3,_DAR(r1) - 
bl .slb_allocate - b fast_exception_return - - .align 7 .globl InstructionAccess_common InstructionAccess_common: EXCEPTION_PROLOG_COMMON(0x400, PACA_EXGEN) @@ -884,14 +864,6 @@ b .do_hash_page /* Try to handle as hpte fault */ .align 7 - .globl InstructionAccessSLB_common -InstructionAccessSLB_common: - EXCEPTION_PROLOG_COMMON(0x480, PACA_EXGEN) - ld r3,_NIP(r1) /* SRR0 = NIA */ - bl .slb_allocate - b fast_exception_return - - .align 7 .globl HardwareInterrupt_common .globl HardwareInterrupt_entry HardwareInterrupt_common: @@ -1137,130 +1109,52 @@ ld r13,PACA_EXSLB+EX_R13(r13) rfid +data_slb_Pseries: + std r3,PACASLBR3(r13) + mfspr r3,DAR + b .do_slb_miss + +instr_slb_Pseries: + std r3,PACASLBR3(r13) + mr r3,r11 /* prolog stored SRR0 in r11 */ + b .do_slb_miss + /* * r13 points to the PACA, r9 contains the saved CR, * r11 and r12 contain the saved SRR0 and SRR1. + * r3 has the faulting address * r9 - r13 are saved in paca->exslb. + * r3 is saved in paca->slb_r3 * We assume we aren't going to take any exceptions during this procedure. */ -_GLOBAL(do_slb_bolted) +_GLOBAL(do_slb_miss) + mflr r10 + stw r9,PACA_EXSLB+EX_CCR(r13) /* save CR in exc. frame */ std r11,PACA_EXSLB+EX_SRR0(r13) /* save SRR0 in exc. frame */ + std r10,PACA_EXSLB+EX_LR(r13) /* save LR */ - /* - * We take the next entry, round robin. Previously we tried - * to find a free slot first but that took too long. Unfortunately - * we dont have any LRU information to help us choose a slot. - */ - - /* r13 = paca */ - /* use a cpu feature mask if we ever change our slb size */ -1: ld r10,PACASTABRR(r13) - addi r10,r10,1 - cmpdi r10,SLB_NUM_ENTRIES - blt+ 2f - li r10,SLB_NUM_BOLTED /* dont touch bolted slots */ -2: std r10,PACASTABRR(r13) - - /* r13 = paca, r10 = entry */ - - /* - * Never cast out the segment for our kernel stack. 
Since we - * dont invalidate the ERAT we could have a valid translation - * for the kernel stack during the first part of exception exit - * which gets invalidated due to a tlbie from another cpu at a - * non recoverable point (after setting srr0/1) - Anton - */ - slbmfee r9,r10 - srdi r9,r9,27 - /* - * Use paca->ksave as the value of the kernel stack pointer, - * because this is valid at all times. - * The >> 27 (rather than >> 28) is so that the LSB is the - * valid bit - this way we check valid and ESID in one compare. - * In order to completely close the tiny race in the context - * switch (between updating r1 and updating paca->ksave), - * we check against both r1 and paca->ksave. - */ - srdi r11,r1,27 - ori r11,r11,1 - cmpd r11,r9 - beq- 1b - ld r11,PACAKSAVE(r13) - srdi r11,r11,27 - ori r11,r11,1 - cmpd r11,r9 - beq- 1b - - /* r13 = paca, r10 = entry */ - - /* (((ea >> 28) & 0x1fff) << 15) | (ea >> 60) */ - mfspr r9,DAR - rldicl r11,r9,36,51 - sldi r11,r11,15 - srdi r9,r9,60 - or r11,r11,r9 - - /* VSID_RANDOMIZER */ - li r9,9 - sldi r9,r9,32 - oris r9,r9,58231 - ori r9,r9,39831 - - /* vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK */ - mulld r11,r11,r9 - clrldi r11,r11,28 - - /* r13 = paca, r10 = entry, r11 = vsid */ - - /* Put together slb word1 */ - sldi r11,r11,12 - -BEGIN_FTR_SECTION - /* set kp and c bits */ - ori r11,r11,0x480 -END_FTR_SECTION_IFCLR(CPU_FTR_16M_PAGE) -BEGIN_FTR_SECTION - /* set kp, l and c bits */ - ori r11,r11,0x580 -END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE) - - /* r13 = paca, r10 = entry, r11 = slb word1 */ - - /* Put together slb word0 */ - mfspr r9,DAR - clrrdi r9,r9,28 /* get the new esid */ - oris r9,r9,0x800 /* set valid bit */ - rldimi r9,r10,0,52 /* insert entry */ - - /* r13 = paca, r9 = slb word0, r11 = slb word1 */ - - /* - * No need for an isync before or after this slbmte. The exception - * we enter with and the rfid we exit with are context synchronizing . 
- */ - slbmte r11,r9 + bl .slb_allocate /* handle it */ /* All done -- return from exception. */ + + ld r10,PACA_EXSLB+EX_LR(r13) + ld r3,PACASLBR3(r13) lwz r9,PACA_EXSLB+EX_CCR(r13) /* get saved CR */ ld r11,PACA_EXSLB+EX_SRR0(r13) /* get saved SRR0 */ + mtlr r10 + andi. r10,r12,MSR_RI /* check for unrecoverable exception */ beq- unrecov_slb - /* - * Until everyone updates binutils hardwire the POWER4 optimised - * single field mtcrf - */ -#if 0 - .machine push - .machine "power4" +.machine push +.machine "power4" mtcrf 0x80,r9 - .machine pop -#else - .long 0x7d380120 -#endif + mtcrf 0x01,r9 /* slb_allocate uses cr0 and cr7 */ +.machine pop + /* Clear RI */ mfmsr r10 clrrdi r10,r10,2 mtmsrd r10,1 Index: working-2.6/include/asm-ppc64/paca.h =================================================================== --- working-2.6.orig/include/asm-ppc64/paca.h +++ working-2.6/include/asm-ppc64/paca.h @@ -78,6 +78,7 @@ u64 exmc[8]; /* used for machine checks */ u64 exslb[8]; /* used for SLB/segment table misses * on the linear mapping */ + u64 slb_r3; /* spot to save R3 on SLB miss */ mm_context_t context; u16 slb_cache[SLB_CACHE_ENTRIES]; u16 slb_cache_ptr; -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Wed Jul 7 16:06:57 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 7 Jul 2004 16:06:57 +1000 Subject: [3/4] RFC: SLB Rewrite (optimize SLB entry/exit path) Message-ID: <20040707060657.GD987@zax> Streamlines the exception entry/exit path of the SLB miss handler to shave a few cycles off. The most significant change is that the RI bit is left off throughout the whole handler, which avoids an extra mtmsrd to turn it back off on the exit path. 
Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S +++ working-2.6/arch/ppc64/kernel/head.S @@ -434,7 +434,25 @@ .globl DataAccessSLB_Pseries DataAccessSLB_Pseries: mtspr SPRG1,r13 - EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, data_slb_Pseries) + mfspr r13,SPRG3 /* get paca address into r13 */ + std r9,PACA_EXSLB+EX_R9(r13) /* save r9 - r12 */ + std r10,PACA_EXSLB+EX_R10(r13) + std r11,PACA_EXSLB+EX_R11(r13) + std r12,PACA_EXSLB+EX_R12(r13) + std r3,PACASLBR3(r13) + mfspr r9,SPRG1 + std r9,PACA_EXSLB+EX_R13(r13) + mfcr r9 + clrrdi r12,r13,32 /* get high part of &label */ + mfmsr r10 + mfspr r11,SRR0 /* save SRR0 */ + ori r12,r12,(.do_slb_miss)@l + ori r10,r10,MSR_IR|MSR_DR /* DON'T set RI for SLB miss */ + mtspr SRR0,r12 + mfspr r12,SRR1 /* and SRR1 */ + mtspr SRR1,r10 + mfspr r3,DAR + rfid STD_EXCEPTION_PSERIES(0x400, InstructionAccess) @@ -442,7 +460,25 @@ .globl InstructionAccessSLB_Pseries InstructionAccessSLB_Pseries: mtspr SPRG1,r13 - EXCEPTION_PROLOG_PSERIES(PACA_EXSLB, instr_slb_Pseries) + mfspr r13,SPRG3 /* get paca address into r13 */ + std r9,PACA_EXSLB+EX_R9(r13) /* save r9 - r12 */ + std r10,PACA_EXSLB+EX_R10(r13) + std r11,PACA_EXSLB+EX_R11(r13) + std r12,PACA_EXSLB+EX_R12(r13) + std r3,PACASLBR3(r13) + mfspr r9,SPRG1 + std r9,PACA_EXSLB+EX_R13(r13) + mfcr r9 + clrrdi r12,r13,32 /* get high part of &label */ + mfmsr r10 + mfspr r11,SRR0 /* save SRR0 */ + ori r12,r12,(.do_slb_miss)@l + ori r10,r10,MSR_IR|MSR_DR /* DON'T set RI for SLB miss */ + mtspr SRR0,r12 + mfspr r12,SRR1 /* and SRR1 */ + mtspr SRR1,r10 + mr r3,r11 /* SRR0 is faulting address */ + rfid STD_EXCEPTION_PSERIES(0x500, HardwareInterrupt) STD_EXCEPTION_PSERIES(0x600, Alignment) @@ -582,8 +618,9 @@ DataAccessSLB_Iseries: mtspr SPRG1,r13 /* save r13 */ EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB) - EXCEPTION_PROLOG_ISERIES_2 std r3,PACASLBR3(r13) + ld r11,PACALPPACA+LPPACASRR0(r13) + ld 
r12,PACALPPACA+LPPACASRR1(r13) mfspr r3,DAR b .do_slb_miss @@ -593,8 +630,9 @@ InstructionAccessSLB_Iseries: mtspr SPRG1,r13 /* save r13 */ EXCEPTION_PROLOG_ISERIES_1(PACA_EXSLB) - EXCEPTION_PROLOG_ISERIES_2 std r3,PACASLBR3(r13) + ld r11,PACALPPACA+LPPACASRR0(r13) + ld r12,PACALPPACA+LPPACASRR1(r13) mr r3,r11 b .do_slb_miss @@ -1109,16 +1147,6 @@ ld r13,PACA_EXSLB+EX_R13(r13) rfid -data_slb_Pseries: - std r3,PACASLBR3(r13) - mfspr r3,DAR - b .do_slb_miss - -instr_slb_Pseries: - std r3,PACASLBR3(r13) - mr r3,r11 /* prolog stored SRR0 in r11 */ - b .do_slb_miss - /* * r13 points to the PACA, r9 contains the saved CR, * r11 and r12 contain the saved SRR0 and SRR1. @@ -1154,11 +1182,6 @@ mtcrf 0x01,r9 /* slb_allocate uses cr0 and cr7 */ .machine pop - /* Clear RI */ - mfmsr r10 - clrrdi r10,r10,2 - mtmsrd r10,1 - mtspr SRR0,r11 mtspr SRR1,r12 ld r9,PACA_EXSLB+EX_R9(r13) -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Wed Jul 7 16:07:02 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 7 Jul 2004 16:07:02 +1000 Subject: [4/4] RFC: SLB Rewrite (new VSID algorithm) Message-ID: <20040707060702.GE987@zax> Replace the VSID allocation algorithm. The new algorithm first generates a 36-bit "proto-VSID" (with 0xfffffffff reserved). For kernel addresses this is equal to the ESID, for user addresses it is: (context << 15) | esid These are distinguishable from kernel proto-VSIDs because the top bit is clear. Proto-VSIDs with the top two bits equal to 10 are reserved for now.
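As a rough illustration only (this is not part of the patch), the proto-VSID construction just described might be sketched in portable C as follows. The helper names kernel_proto_vsid, user_proto_vsid and proto_vsid_is_kernel are invented for this sketch. The shift of 28 and the 15 user ESID bits mirror the patch's SID_SHIFT and USER_ESID_BITS, and the 20-bit context width is the figure quoted in this message; treat all three as assumptions of the sketch rather than as kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SID_SHIFT       28   /* assumed: 256MB segments, ESID = EA >> 28 */
#define USER_ESID_BITS  15   /* ESID bits kept for user addresses */
#define CONTEXT_BITS    20   /* assumed width of the context id */
#define PROTO_VSID_BITS 36

/* Kernel addresses (0xC000000000000000 and up): proto-VSID = ESID. */
static uint64_t kernel_proto_vsid(uint64_t ea)
{
    assert(ea >= 0xC000000000000000ULL);
    return ea >> SID_SHIFT;
}

/* User addresses: proto-VSID = (context << 15) | ESID. */
static uint64_t user_proto_vsid(uint64_t context, uint64_t ea)
{
    uint64_t esid = ea >> SID_SHIFT;

    assert(context < (1ULL << CONTEXT_BITS));
    assert(esid < (1ULL << USER_ESID_BITS));
    return (context << USER_ESID_BITS) | esid;
}

/* The top bit (bit 35) is set for kernel proto-VSIDs, clear for user ones. */
static int proto_vsid_is_kernel(uint64_t protovsid)
{
    return (int)((protovsid >> (PROTO_VSID_BITS - 1)) & 1);
}
```

With 20 context bits above 15 ESID bits, a user proto-VSID occupies at most 35 bits, which is why bit 35 is enough to tell the two kinds apart.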
The proto-VSIDs are then scrambled into real VSIDs with the (1 to 1) multiplicative hash:

	VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS

where
	VSID_MULTIPLIER = 268435399 = 0xFFFFFC7
	VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF

This scheme has a number of advantages over the old one:

- We now have VSIDs for every kernel address (i.e. everything above 0xC000000000000000), except the very top segment. That simplifies a number of things.

- We allow for 15 significant bits of ESID for user addresses with 20 bits of context. i.e. 8T (43 bits) of address space for up to 1M contexts, significantly more than the old method (although we will need changes in the hash path and context allocation to take advantage of this).

- Because we use a real multiplicative hash function, we have much better hash scattering with this VSID algorithm (at least based on some initial results).

Because the MODULUS is 2^n-1 we can use a trick to compute it efficiently without a divide or extra multiply. This makes the new algorithm barely slower than the old one.

Index: working-2.6/include/asm-ppc64/mmu_context.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu_context.h +++ working-2.6/include/asm-ppc64/mmu_context.h @@ -34,7 +34,7 @@ } #define NO_CONTEXT 0 -#define FIRST_USER_CONTEXT 0x10 /* First 16 reserved for kernel */ +#define FIRST_USER_CONTEXT 1 #define LAST_USER_CONTEXT 0x8000 /* Same as PID_MAX for now... */ #define NUM_USER_CONTEXT (LAST_USER_CONTEXT-FIRST_USER_CONTEXT) @@ -181,46 +181,43 @@ local_irq_restore(flags); } -/* This is only valid for kernel (including vmalloc, imalloc and bolted) EA's - */ -static inline unsigned long -get_kernel_vsid( unsigned long ea ) -{ - unsigned long ordinal, vsid; - - ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | (ea >> 60); - vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK; - -#ifdef HTABSTRESS - /* For debug, this path creates a very poor vsid distribuition.
- * A user program can access virtual addresses in the form - * 0x0yyyyxxxx000 where yyyy = xxxx to cause multiple mappings - * to hash to the same page table group. - */ - ordinal = ((ea >> 28) & 0x1fff) | (ea >> 44); - vsid = ordinal & VSID_MASK; -#endif /* HTABSTRESS */ +/* + * WARNING - If you change these you must make sure the asm + * implementations in slb_allocate(), do_stab_bolted and mmu.h + * (ASM_VSID_SCRAMBLE macro) are changed accordingly. + * + * You'll also need to change the precomputed VSID values in head.S + * which are used by the iSeries firmware. + */ + +static inline unsigned long vsid_scramble(unsigned long protovsid) +{ +#if 0 + /* The code below is equivalent to this function for arguments + * < 2^VSID_BITS, which is all this should ever be called + * with. However gcc is not clever enough to compute the + * modulus (2^n-1) without a second multiply. */ + return ((protovsid * VSID_MULTIPLIER) % VSID_MODULUS); +#else /* 1 */ + unsigned long x; + + x = protovsid * VSID_MULTIPLIER; + x = (x >> VSID_BITS) + (x & VSID_MODULUS); + return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS; +#endif /* 1 */ +} - return vsid; +/* This is only valid for addresses >= KERNELBASE */ +static inline unsigned long get_kernel_vsid(unsigned long ea) +{ + return vsid_scramble(ea >> SID_SHIFT); } -/* This is only valid for user EA's (user EA's do not exceed 2^41 (EADDR_SIZE)) - */ -static inline unsigned long -get_vsid( unsigned long context, unsigned long ea ) +/* This is only valid for user addresses (which are below 2^41) */ +static inline unsigned long get_vsid(unsigned long context, unsigned long ea) { - unsigned long ordinal, vsid; - - ordinal = (((ea >> 28) & 0x1fff) * LAST_USER_CONTEXT) | context; - vsid = (ordinal * VSID_RANDOMIZER) & VSID_MASK; - -#ifdef HTABSTRESS - /* See comment above. 
*/ - ordinal = ((ea >> 28) & 0x1fff) | (context << 16); - vsid = ordinal & VSID_MASK; -#endif /* HTABSTRESS */ - - return vsid; + return vsid_scramble((context << USER_ESID_BITS) + | (ea >> SID_SHIFT)); } #endif /* __PPC64_MMU_CONTEXT_H */ Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h +++ working-2.6/include/asm-ppc64/mmu.h @@ -15,6 +15,7 @@ #include #include +#include #ifndef __ASSEMBLY__ @@ -241,12 +242,44 @@ #define SLB_VSID_KERNEL (SLB_VSID_KP|SLB_VSID_C) #define SLB_VSID_USER (SLB_VSID_KP|SLB_VSID_KS) -#define VSID_RANDOMIZER ASM_CONST(42470972311) -#define VSID_MASK 0xfffffffffUL -/* Because we never access addresses below KERNELBASE as kernel - * addresses, this VSID is never used for anything real, and will - * never have pages hashed into it */ -#define BAD_VSID ASM_CONST(0) +#define VSID_MULTIPLIER ASM_CONST(268435399) /* largest 28-bit prime */ +#define VSID_BITS 36 +#define VSID_MODULUS ((1UL<= \ + * 2^36-1, then r3+1 has the 2^36 bit set. So, if r3+1 has \ + * the bit clear, r3 already has the answer we want, if it \ + * doesn't, the answer is the low 36 bits of r3+1. So in all \ + * cases the answer is the low 36 bits of (r3 + ((r3+1) >> 36))*/\ + addi rx,rt,1; \ + srdi rx,rx,VSID_BITS; /* extract 2^36 bit */ \ + add rt,rt,rx /* Block size masks */ #define BL_128K 0x000 Index: working-2.6/arch/ppc64/mm/slb_low.S =================================================================== --- working-2.6.orig/arch/ppc64/mm/slb_low.S +++ working-2.6/arch/ppc64/mm/slb_low.S @@ -71,19 +71,19 @@ srdi r3,r3,28 /* get esid */ cmpldi cr7,r9,0xc /* cmp KERNELBASE for later use */ - /* r9 = region, r3 = esid, cr7 = <>KERNELBASE */ - - rldicr. 
r11,r3,32,16 - bne- 8f /* invalid ea bits set */ - addi r11,r9,-1 - cmpldi r11,0xb - blt- 8f /* invalid region */ + rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ + oris r10,r10,SLB_ESID_V@h /* r10 |= SLB_ESID_V */ - /* r9 = region, r3 = esid, r10 = entry, cr7 = <>KERNELBASE */ + /* r3 = esid, r10 = esid_data, cr7 = <>KERNELBASE */ blt cr7,0f /* user or kernel? */ - /* kernel address */ + /* kernel address: proto-VSID = ESID */ + /* WARNING - MAGIC: we don't use the VSID 0xfffffffff, but + * this code will generate the protoVSID 0xfffffffff for the + * top segment. That's ok, the scramble below will translate + * it to VSID 0, which is reserved as a bad VSID - one which + * will never have any pages in it. */ li r11,SLB_VSID_KERNEL BEGIN_FTR_SECTION bne cr7,9f @@ -91,8 +91,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_16M_PAGE) b 9f -0: /* user address */ +0: /* user address: proto-VSID = ESID<<15 | context */ li r11,SLB_VSID_USER + + srdi. r9,r3,13 + bne- 8f /* invalid ea bits set */ + #ifdef CONFIG_HUGETLB_PAGE BEGIN_FTR_SECTION /* check against the hugepage ranges */ @@ -114,33 +118,18 @@ #endif /* CONFIG_HUGETLB_PAGE */ 6: ld r9,PACACONTEXTID(r13) + rldimi r3,r9,USER_ESID_BITS,0 -9: /* r9 = "context", r3 = esid, r11 = flags, r10 = entry */ - - rldimi r9,r3,15,0 /* r9= VSID ordinal */ - -7: rldimi r10,r3,28,0 /* r10= ESID<<28 | entry */ - oris r10,r10,SLB_ESID_V@h /* r10 |= SLB_ESID_V */ - - /* r9 = ordinal, r3 = esid, r11 = flags, r10 = esid_data */ - - li r3,VSID_RANDOMIZER@higher - sldi r3,r3,32 - oris r3,r3,VSID_RANDOMIZER@h - ori r3,r3,VSID_RANDOMIZER@l - - mulld r9,r3,r9 /* r9 = ordinal * VSID_RANDOMIZER */ - clrldi r9,r9,28 /* r9 &= VSID_MASK */ - sldi r9,r9,SLB_VSID_SHIFT /* r9 <<= SLB_VSID_SHIFT */ - or r9,r9,r11 /* r9 |= flags */ +9: /* r3 = protovsid, r11 = flags, r10 = esid_data, cr7 = <>KERNELBASE */ + ASM_VSID_SCRAMBLE(r3,r9) - /* r9 = vsid_data, r10 = esid_data, cr7 = <>KERNELBASE */ + rldimi r11,r3,SLB_VSID_SHIFT,16 /* combine VSID and flags
*/ /* * No need for an isync before or after this slbmte. The exception * we enter with and the rfid we exit with are context synchronizing. */ - slbmte r9,r10 + slbmte r11,r10 bgelr cr7 /* we're done for kernel addresses */ @@ -163,6 +152,6 @@ blr 8: /* invalid EA */ - li r9,0 /* 0 VSID ordinal -> BAD_VSID */ + li r3,0 /* BAD_VSID */ li r11,SLB_VSID_USER /* flags don't much matter */ - b 7b + b 9b Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S +++ working-2.6/arch/ppc64/kernel/head.S @@ -576,11 +576,11 @@ .llong 0 /* Reserved */ .llong 0 /* Reserved */ .llong 0 /* Reserved */ - .llong 0x0c00000000 /* ESID to map (Kernel at EA = 0xC000000000000000) */ - .llong 0x06a99b4b14 /* VSID to map (Kernel at VA = 0x6a99b4b140000000) */ + .llong (KERNELBASE>>28)/* ESID to map */ + .llong 0x40BFFFFD5 /* VSID to map */ .llong 8192 /* # pages to map (32 MB) */ .llong 0 /* Offset from start of loadarea to start of map */ - .llong 0x0006a99b4b140000 /* VPN of first page to map */ + .llong 0x40BFFFFD50000 /* VPN of first page to map */ . = 0x6100 @@ -1072,18 +1072,9 @@ rldimi r10,r11,7,52 /* r10 = first ste of the group */ /* Calculate VSID */ - /* (((ea >> 28) & 0x1fff) << 15) | (ea >> 60) */ - rldic r11,r11,15,36 - ori r11,r11,0xc - - /* VSID_RANDOMIZER */ - li r9,9 - sldi r9,r9,32 - oris r9,r9,58231 - ori r9,r9,39831 - - mulld r9,r11,r9 - rldic r9,r9,12,16 /* r9 = vsid << 12 */ + /* This is a kernel address, so protovsid = ESID */ + ASM_VSID_SCRAMBLE(r11, r9) + rldic r9,r11,12,16 /* r9 = vsid << 12 */ /* Search the primary group for a free entry */ 1: ld r11,0(r10) /* Test valid bit of the current ste */ -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Thu Jul 8 01:06:43 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 7 Jul 2004 10:06:43 -0500 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040707060635.GB987@zax> References: <20040707060635.GB987@zax> Message-ID: <20040707100643.0b889896.moilanen@austin.ibm.com> This is pretty nice. Good work. I have just one nit below: > +_GLOBAL(slb_allocate) > + /* > + * First find a slot, round robin. Previously we tried to find > + * a free slot first but that took too long. Unfortunately we > + * dont have any LRU information to help us choose a slot. > + */ > + srdi r9,r1,27 > + ori r9,r9,1 /* mangle SP for later compare */ > + > + ld r10,PACASTABRR(r13) > +3: > + addi r10,r10,1 > + /* use a cpu feature mask if we ever change our slb size */ > + cmpldi r10,SLB_NUM_ENTRIES > + > + blt+ 4f This branch probably shouldn't be predicted. The general rule on branch prediction is for an error case, or a missed lock. Since about power 4, the branch prediction is a little over 99% correct, you'll get a miss 1 out of 62 times or 1.6% of the time. It's probably not measurable, just might save a few cycles. > + * The >> 27 (rather than >> 28) is so that the LSB is the > + * valid bit - this way we check valid and ESID in one compare. > + */ > + srdi r11,r11,27 > + cmpd r11,r9 > + beq- 3b Same as above. Thanks, Jake ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Thu Jul 8 01:33:16 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Wed, 07 Jul 2004 10:33:16 -0500 Subject: [0/4] RFC: SLB rewrite In-Reply-To: <20040707060626.GA987@zax> References: <20040707060626.GA987@zax> Message-ID: <1089214395.2026.23.camel@gaston> On Wed, 2004-07-07 at 01:06, David Gibson wrote: > Here, at long last, the much awaited, SLB rewrite! Hehe ! Kudos ! Looks great ! 
What about moving the rest of stab.c to arch/ppc64/mm/ while we are at it ? :) Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From olof at austin.ibm.com Thu Jul 8 02:31:12 2004 From: olof at austin.ibm.com (Olof Johansson) Date: Wed, 07 Jul 2004 11:31:12 -0500 Subject: [0/4] RFC: SLB rewrite In-Reply-To: <20040707060626.GA987@zax> References: <20040707060626.GA987@zax> Message-ID: <40EC2550.6020001@austin.ibm.com> David Gibson wrote: > If there are no serious problems, I'm thinking of pushing at least the > first three of these to Linus/akpm in the next week or so. Good work! The code looks clean and solid to me. Maybe the description from the top of the VSID patch should be included as a top-of-file comment in slb_low.S or slb.c so it's around for new readers in the future? Also, where does the nonlinear behaviour at the high segment count reload times for the staggered measurements for the new VSID generation come from? HTAB hash bucket collisions/clustering? -Olof ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Thu Jul 8 11:02:51 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 8 Jul 2004 11:02:51 +1000 Subject: [0/4] RFC: SLB rewrite In-Reply-To: <1089214395.2026.23.camel@gaston> References: <20040707060626.GA987@zax> <1089214395.2026.23.camel@gaston> Message-ID: <20040708010251.GA4247@zax> On Wed, Jul 07, 2004 at 10:33:16AM -0500, Benjamin Herrenschmidt wrote: > > On Wed, 2004-07-07 at 01:06, David Gibson wrote: > > Here, at long last, the much awaited, SLB rewrite! > > Hehe ! Kudos ! Looks great ! > > What about moving the rest of stab.c to arch/ppc64/mm/ while > we are at it ? :) Well, I have some plans to do some cleanups to the stabs code, too - mostly as it impinges even on SLB machines. I was going to move it to arch/ppc64/mm when I did that.
-- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Thu Jul 8 11:41:04 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 8 Jul 2004 11:41:04 +1000 Subject: [0/4] RFC: SLB rewrite In-Reply-To: <40EC2550.6020001@austin.ibm.com> References: <20040707060626.GA987@zax> <40EC2550.6020001@austin.ibm.com> Message-ID: <20040708014104.GA4309@zax> On Wed, Jul 07, 2004 at 11:31:12AM -0500, Olof Johansson wrote: > > David Gibson wrote: > > >If there are no serious problems, I'm thinking of pushing at least the > >first three of these to Linus/akpm in the next week or so. > > Good work! The code looks clean and solid to me. > > Maybe the description from the top of the VSID patch should be included > as a top-of-file comment in slb_low.S or slb.c so it's around for new > readers in the future? Not a bad idea. > Also, where does the nonlinear behaviour at the high segment count > reload times for the staggered measurements for the new VSID generation > come from? HTAB hash bucket collisions/clustering? That I'm not sure. I don't think it could be from hash collisions/clustering. I tweaked my hash bucket simulator (hashscatter.py, it's in the directory with the images and other stuff) to simulate the slbmisser program running along with a bunch of sample programs (essentially 1024 copies of bash). It shows much better hash scattering with the new algorithm - maximum bucket occupancy 6, versus 127 with the old (I'm only computing the primary hash bucket). If I take the other processes out, and just simulate slbmisser itself (plus the linear mapping), it still doesn't show anything untoward - maximum occupancy 2 with the new algorithm, 16 with the old. So I don't know what's causing the extra cost there.
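For readers following along, here is a minimal C sketch of the scramble from patch 4/4, added for illustration and not part of the original message. The constants are the ones given in that patch; the name vsid_scramble_ref is invented here, and uint64_t stands in for the kernel's unsigned long on ppc64. It places the shift-and-add form next to a plain modulus so the two can be compared for proto-VSIDs below 2^36. The C source has no branches or loops; how the multiply itself behaves on real hardware is outside what this sketch can show.

```c
#include <stdint.h>

#define VSID_MULTIPLIER 268435399ULL            /* 0xFFFFFC7, from the patch */
#define VSID_BITS       36
#define VSID_MODULUS    ((1ULL << VSID_BITS) - 1)

/* Reference form: an actual modulus, valid for protovsid < 2^36. */
static uint64_t vsid_scramble_ref(uint64_t protovsid)
{
    return (protovsid * VSID_MULTIPLIER) % VSID_MODULUS;
}

/* Shift-and-add form, following the patch's vsid_scramble(). */
static uint64_t vsid_scramble(uint64_t protovsid)
{
    /* 36-bit input times 28-bit multiplier fits in 64 bits */
    uint64_t x = protovsid * VSID_MULTIPLIER;

    /* 2^36 == 1 (mod 2^36-1), so adding the high part to the low
     * part keeps the residue unchanged */
    x = (x >> VSID_BITS) + (x & VSID_MODULUS);

    /* x is now below 2*(2^36-1); this subtracts the modulus once
     * when x >= 2^36-1, without a branch */
    return (x + ((x + 1) >> VSID_BITS)) & VSID_MODULUS;
}
```

Feeding it KERNELBASE >> 28 (0xC00000000) gives 0x40BFFFFD5, which matches the precomputed VSID that patch 4/4 places in head.S, and the reserved proto-VSID 0xfffffffff maps to VSID 0 as the slb_low.S comment says.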
Possibly just that the actual VSID scramble itself takes different numbers of cycles depending on the input? I do worry a bit about that kink in the graph - what if it continues to get worse with even larger set sizes, or with the kernel protoVSIDs (which are numerically larger than user ones).. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Thu Jul 8 11:46:54 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 8 Jul 2004 11:46:54 +1000 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040707100643.0b889896.moilanen@austin.ibm.com> References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> Message-ID: <20040708014654.GB4309@zax> On Wed, Jul 07, 2004 at 10:06:43AM -0500, Jake Moilanen wrote: > > This is pretty nice. Good work. I have just one nit below: > > > +_GLOBAL(slb_allocate) > > + /* > > + * First find a slot, round robin. Previously we tried to find > > + * a free slot first but that took too long. Unfortunately we > > + * dont have any LRU information to help us choose a slot. > > + */ > > + srdi r9,r1,27 > > + ori r9,r9,1 /* mangle SP for later compare */ > > + > > + ld r10,PACASTABRR(r13) > > +3: > > + addi r10,r10,1 > > + /* use a cpu feature mask if we ever change our slb size */ > > + cmpldi r10,SLB_NUM_ENTRIES > > + > > + blt+ 4f > > This branch probably shouldn't be predicted. The general rule on branch > prediction is for an error case, or a missed lock. Since about power 4, > the branch prediction is a little over 99% correct, you'll get a miss 1 > out of 62 times or 1.6% of the time. It's probably not measurable, just > might save a few cycles. Hmm, well Anton actually suggested that branch hint, and the one below. 
Because the branch prediction is based on whether it was taken last time round, and this is a strict round-robin, he thought the dynamic prediction would get it wrong twice round the cycle, whereas the static would only get it wrong once. Or something. He did do some timings; the difference was certainly tiny, although it might have been (just) measurable. > > + * The >> 27 (rather than >> 28) is so that the LSB is the > > + * valid bit - this way we check valid and ESID in one compare. > > + */ > > + srdi r11,r11,27 > > + cmpd r11,r9 > > + beq- 3b > > Same as above. Of course, I do also have a plan to get rid of the slbmfee and that loop by locking the kernel stack into a fixed SLB slot. I have a patch to do it even, but I need to figure out how to get it to work on iSeries. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From brking at us.ibm.com Fri Jul 9 02:36:28 2004 From: brking at us.ibm.com (Brian King) Date: Thu, 08 Jul 2004 11:36:28 -0500 Subject: How to block pci config-reads during device self-test? In-Reply-To: <20040706161254.C21634@forte.austin.ibm.com> References: <20040706161254.C21634@forte.austin.ibm.com> Message-ID: <40ED780C.6000401@us.ibm.com> I've been doing some talking with various hardware folks to see if there is a way to get this fixed and so far the answer I have been getting is no. It is how the hardware works and there isn't much that can be done about it in microcode. To work around the problem we could add userspace pci config read and write function pointers to the pci_dev struct which a device driver could overload. These would be invoked for user initiated pci config accesses. The device driver could then do the right thing for that device if necessary. I don't like the idea of "learning to live with this".
People do run into this problem and telling them it can't be fixed is not an acceptable answer. -Brian linas at austin.ibm.com wrote: > > Am having trouble with PCI config-space reads ... I have a device > (actually Brian King has it) that can perform a built-in-self test > (BIST). However, if anything does a PCI config-read during BIST, then > the device does something crazy that makes the PCI controller chip > take it offline. > > I'm not sure what's doing the config-space reads ... seems to be some > user-space tool or daemon. I'm wondering if there is any practical way > to block such reads to a given device until its self-test sequence is > completed. I could try to modify the architecture-specific pci files > to do this (arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly > ... is there another way? or do we have to just learn to live with > this hardware? -- ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Fri Jul 9 02:37:59 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 08 Jul 2004 11:37:59 -0500 Subject: How to block pci config-reads during device self-test? In-Reply-To: <40ED780C.6000401@us.ibm.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> Message-ID: <1089304678.5355.50.camel@gaston> On Thu, 2004-07-08 at 11:36, Brian King wrote: > I don't like the idea of "learning to live with this". People do run > into this problem and telling them it can't be fixed is not an > acceptable answer. Then we should do the workaround in the low level pci access functions, possibly by having the driver call an arch function to "notify" that the given device is in a bad state for a while... Ben. ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From greg at kroah.com Fri Jul 9 02:45:22 2004 From: greg at kroah.com (Greg KH) Date: Thu, 8 Jul 2004 09:45:22 -0700 Subject: How to block pci config-reads during device self-test? In-Reply-To: <40ED780C.6000401@us.ibm.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> Message-ID: <20040708164522.GD14231@kroah.com> On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote: > I've been doing some talking with various hardware folks to see if > there is a way to get this fixed and so far the answer I have been > getting is no. It is how the hardware works and there isn't much that > can be done about it in microcode. But this is limited only to a single PCI device that you have, correct? To add such a huge core change for only a single, broken device, that 99.99% of the Linux users and developers will never see is not a nice option. thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From willy at debian.org Fri Jul 9 03:37:59 2004 From: willy at debian.org (Matthew Wilcox) Date: Thu, 8 Jul 2004 18:37:59 +0100 Subject: [Pcihpd-discuss] Re: How to block pci config-reads during device self-test? In-Reply-To: <40ED780C.6000401@us.ibm.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> Message-ID: <20040708173759.GO27324@parcelfarce.linux.theplanet.co.uk> On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote: > I don't like the idea of "learning to live with this". People do run > into this problem and telling them it can't be fixed is not an > acceptable answer. Sure it is. Even the taiwanese knock-off PCI boards don't have these kinds of problems. Your hardware guys need to fix it. 
-- "Next the statesmen will invent cheap lies, putting the blame upon the nation that is attacked, and every man will be glad of those conscience-soothing falsities, and will diligently study them, and refuse to examine any refutations of them; and thus he will by and by convince himself that the war is just, and will thank God for the better sleep he enjoys after this process of grotesque self-deception." -- Mark Twain ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From brking at us.ibm.com Fri Jul 9 03:56:06 2004 From: brking at us.ibm.com (Brian King) Date: Thu, 08 Jul 2004 12:56:06 -0500 Subject: How to block pci config-reads during device self-test? In-Reply-To: <20040708164522.GD14231@kroah.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com> Message-ID: <40ED8AB6.9020604@us.ibm.com> Greg KH wrote: > But this is limited only to a single PCI device that you have, correct? > To add such a huge core change for only a single, broken device, that > 99.99% of the Linux users and developers will never see is not a nice > option. It is limited to every adapter the ipr device driver controls which means almost every scsi adapter on pSeries hardware, including the embedded scsi controller on most systems. So the scope of this problem for pSeries users is fairly large. While I agree with your statement above, Greg, I would still like to try to come to a solution to this problem that we can all live with. If that means only changing ppc64 code and ipr code, I will go along with that. And while I can pester the hardware folks to fix this in future chips, we will still be living with existing hardware for a long time. -- Brian King eServer Storage I/O IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From greg at kroah.com Fri Jul 9 03:59:29 2004 From: greg at kroah.com (Greg KH) Date: Thu, 8 Jul 2004 10:59:29 -0700 Subject: How to block pci config-reads during device self-test? In-Reply-To: <40ED8AB6.9020604@us.ibm.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com> <40ED8AB6.9020604@us.ibm.com> Message-ID: <20040708175929.GC15655@kroah.com> On Thu, Jul 08, 2004 at 12:56:06PM -0500, Brian King wrote: > Greg KH wrote: > >But this is limited only to a single PCI device that you have, correct? > >To add such a huge core change for only a single, broken device, that > >99.99% of the Linux users and developers will never see is not a nice > >option. > > It is limited to every adapter the ipr device driver controls which > means almost every scsi adapter on pSeries hardware, including the > embedded scsi controller on most systems. So the scope of this problem > for pSeries users is fairly large. Not if you "fix this" by disabling access to the self-test mode :) thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From brking at us.ibm.com Fri Jul 9 04:08:26 2004 From: brking at us.ibm.com (Brian King) Date: Thu, 08 Jul 2004 13:08:26 -0500 Subject: How to block pci config-reads during device self-test? In-Reply-To: <20040708175929.GC15655@kroah.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com> <40ED8AB6.9020604@us.ibm.com> <20040708175929.GC15655@kroah.com> Message-ID: <40ED8D9A.2030805@us.ibm.com> Greg KH wrote: > On Thu, Jul 08, 2004 at 12:56:06PM -0500, Brian King wrote: > >>Greg KH wrote: >> >>>But this is limited only to a single PCI device that you have, >>>correct? To add such a huge core change for only a single, broken >>>device, that 99.99% of the Linux users and developers will never see >>>is not a nice option. 
>> >>It is limited to every adapter the ipr device driver controls which >>means almost every scsi adapter on pSeries hardware, including the >>embedded scsi controller on most systems. So the scope of this problem >>for pSeries users is fairly large. > > Not if you "fix this" by disabling access to the self-test mode :) Its not just used for the self-test mode. It is the only way I have to give the adapter a hard reset during error recovery. So when my eh_host_reset function gets invoked due to the adapter having severe problems or if the adapter takes an error that requires a hard reset to recover from, I have to run BIST. -- Brian King eServer Storage I/O IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at denx.de Sat Jul 10 07:13:04 2004 From: linas at denx.de (Linas Vepstas) Date: Fri, 9 Jul 2004 16:13:04 -0500 Subject: How to block pci config-reads during device self-test? In-Reply-To: <20040708164522.GD14231@kroah.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708164522.GD14231@kroah.com> Message-ID: <20040709211304.GG17333@bilge> On Thu, Jul 08, 2004 at 09:45:22AM -0700, Greg KH was heard to remark: > On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote: > > I've been doing some talking with various hardware folks to see if > > there is a way to get this fixed and so far the answer I have been > > getting is no. It is how the hardware works and there isn't much > > that can be done about it in microcode. > > But this is limited only to a single PCI device that you have, > correct? I have a vagure recollection that this is not the first time this has been seen, but that was a while ago, and I didn't understand the issue then. We don't know how many more devices are affected. The pSeries PCI hardware significantly clamps down on what is allowed on the bus, all in the name of not corrupting the kernel. 
Traditional PCI bridges are a lot more permissive ... we actually don't know how many more cards are out there that might go wild if touched in unexpected ways. However, note that most, (if not all?) of the devices on ppc64 are now chasing or have recently chased rare bugs due to the EEH clampdown, so this might be the tip of an iceberg. > To add such a huge core change for only a single, broken device, that > 99.99% of the Linux users and developers will never see is not a nice > option. I'll work with Brian to add something to arch/ppc64/kernel/pSeries_pci.c that works around this problem. There is a separate patch I'll mail later today for dealing with EEH detection in the same codepath. --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Jul 11 10:27:46 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 11 Jul 2004 10:27:46 +1000 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040707100643.0b889896.moilanen@austin.ibm.com> References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> Message-ID: <20040711002746.GA5232@krispykreme> > This branch probably shouldn't be predicted. The general rule on branch > prediction is for an error case, or a missed lock. Since about power 4, > the branch prediction is a little over 99% correct, you'll get a miss 1 > out of 62 times or 1.6% of the time. It's probably not measurable, just > might save a few cycles. Milton and I looked over this a while ago. Neither of the two POWER4 branch prediction algorithms will do a good job. We are likely to get 2 mispredictions per loop whereas we only get one misprediction per loop with static prediction. As David pointed out we have removed the slbmfee loop completely in subsequent patches. 
We could probably remove the first loop by keeping the bolted entries to a power of 2 and using a rotate and mask insert instruction to shift the 2 ^ 6 bit down: addi r3,r3,1 li r4,0 rlwimi r4,r3,32-(6-2),31-2,31-2 andi. r3,r3,0x3f or r3,r3,r4 That looks like 3 cycles but I'm sure we can do it in less. However even if we manage to do it in 2 cycles, that's going to add 2 * 60 = 120 cycles in one complete loop whereas a single misprediction per loop should take somewhere around 20 cycles I think. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Sun Jul 11 11:05:22 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 11 Jul 2004 11:05:22 +1000 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040708014654.GB4309@zax> References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax> Message-ID: <20040711010522.GB5232@krispykreme> > He did do some timings, the difference was certainly tiny, although it > might have been (just) measurable. Yeah it was tiny, although David's slbtest showed it to be on average just under 1 cycle faster (probably due to averaging the extra misprediction per 60 iterations). > Of course, I do also have a plan to get rid of the slbmfee and that > loop by locking the kernel stack into a fixed SLB slot. I have a > patch to do it even, but I need to figure out how to get it to work on > iSeries. There are a few more things that would be interesting to look at: - Are there any more segments we should bolt in? So far we have PAGE_OFFSET, VMALLOC_BASE and the kernel stack. It may make sense to bolt in a userspace entry or two. The userspace stack is the best candidate. Obviously the more segments we bolt in, the less that are available for general use and bolting in more userspace segments will mean mostly kernel bound workloads will suffer (eg SFS). It may make sense to bolt in an IO region too. 
- We currently preload a few userspace segments (stack, pc and task unmapped base). Does it make sense to restore the entire SLB state on context switch? One problem we have is that we usually enter the kernel via libc which is in a different segment to our program text. This means we always take an SLB miss when we return out of libc after a context switch. (ie we are guaranteed to take one SLB miss per context switch) It would be interesting to get a log of SLB misses for a few workloads. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Mon Jul 12 10:13:29 2004 From: anton at samba.org (Anton Blanchard) Date: Mon, 12 Jul 2004 10:13:29 +1000 Subject: [RFC] JS20 IDE perf patch In-Reply-To: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com> References: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com> Message-ID: <20040712001329.GD30109@krispykreme> Hi Doug, Did this get sorted out or do we still need to fix it in 2.6? Anton On Mon, Dec 15, 2003 at 11:06:58AM -0600, Doug Maxey wrote: > > Howdy, > > Have done some investigation of the AMD IDE implementation on the > JS20 and have come up with a set of patches to address this. The > underlying assumption that the PCI bus can only do 66mhz is no > longer true with the HT implementation on this platform. > > The specific issues seen were: > 1) ide.c has code that assumes the PCI bus only runs between 20 and > 66 mhz. > 2) FW has not completely configured the chipset to run at the higher > IDE bus speeds. > 3) amd74xx.c does not recognize and cannot use the higher PCI bus > speeds to enable the higher IDE speeds. > > Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in > linux/ide.h. These values are used in ide/ide.c and amd74xx.c. If > there are arch specific values, they override the the values set in > linux/ide.h via the arch specific header. > > Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c. 
> > Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in > amd74xx.c and is used to determine the UDMA settings for the > drives. > > Please review these changes. Comments are welcome. > > ++doug > ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ ** This list is shutting down 7/24/2004. From david at gibson.dropbear.id.au Mon Jul 12 10:57:11 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 12 Jul 2004 10:57:11 +1000 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040711002746.GA5232@krispykreme> References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040711002746.GA5232@krispykreme> Message-ID: <20040712005711.GB9423@zax> On Sun, Jul 11, 2004 at 10:27:46AM +1000, Anton Blanchard wrote: > > > > This branch probably shouldn't be predicted. The general rule on branch > > prediction is for an error case, or a missed lock. Since about power 4, > > the branch prediction is a little over 99% correct, you'll get a miss 1 > > out of 62 times or 1.6% of the time. It's probably not measurable, just > > might save a few cycles. > > Milton and I looked over this a while ago. Neither of the two POWER4 > branch prediction algorithms will do a good job. We are likely to get 2 > mispredictions per loop whereas we only get one misprediction per loop > with static prediction. > > As David pointed out we have removed the slbmfee loop completely in > subsequent patches. Well.. we did some timings with a hacky patch which just took out the slbmfee on the grounds that most of the time it would just work. I have a patch to actually safely remove it by bolting the kernel stack into SLB slot 1, but I'm not yet sure how to get that working properly on iSeries. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ ** This list is shutting down 7/24/2004. From david at gibson.dropbear.id.au Mon Jul 12 11:33:03 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 12 Jul 2004 11:33:03 +1000 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040711010522.GB5232@krispykreme> References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax> <20040711010522.GB5232@krispykreme> Message-ID: <20040712013302.GC9423@zax> On Sun, Jul 11, 2004 at 11:05:22AM +1000, Anton Blanchard wrote: > > > > He did do some timings, the difference was certainly tiny, although it > > might have been (just) measurable. > > Yeah it was tiny, although David's slbtest showed it to be on average > just under 1 cycle faster (probably due to averaging the extra > misprediction per 60 iterations). > > > Of course, I do also have a plan to get rid of the slbmfee and that > > loop by locking the kernel stack into a fixed SLB slot. I have a > > patch to do it even, but I need to figure out how to get it to work on > > iSeries. > > There are a few more things that would be interesting to look at: > > - Are there any more segments we should bolt in? So far we have > PAGE_OFFSET, VMALLOC_BASE and the kernel stack. It may make sense to > bolt in a userspace entry or two. The userspace stack is the best > candidate. Obviously the more segments we bolt in, the less that are > available for general use and bolting in more userspace segments will > mean mostly kernel bound workloads will suffer (eg SFS). It may make > sense to bolt in an IO region too. Indeed. My gut instinct would be that we want to keep the number of bolted segments down, so as much of the SLB can be used as flexibly as possible. That said, I've had a couple of ideas on how to generate some semblance of LRU data in ways that might be sufficiently low overhead to be practical. 
For example, what if we kept a variable containing the ESID of the last segment cast out from the SLB. Whenever we take an SLB miss, we check the EA against that stored segment. If they match - i.e. this segment is so heavily used that we want it again almost as soon as we've thrown it out - we put a flag indicating to skip over this slot for some number of future misses. Or rather than just having permanently bolted and non-bolted slots, we could divide the SLB into a "fast cycle" and a "slow cycle" section with separate RR pointers. Either we could "semibolt" certain segments - putting them in the slow reload section. Or we could use some generalization of the scheme above to dynamically promote frequently used segments to the slow cycle area. > - We currently preload a few userspace segments (stack, pc and task > unmapped base). Does it make sense to restore the entire SLB state on > context switch? One problem we have is that we usually enter the > kernel via libc which is in a different segment to our program text. > This means we always take an SLB miss when we return out of libc after > a context switch. (ie we are guaranteed to take one SLB miss per > context switch) > > It would be interesting to get a log of SLB misses for a few > workloads. It certainly would. We really want to run some simulations to see how bolting various segments, or schemes like those above would affect the SLB miss rate. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ ** This list is shutting down 7/24/2004. 
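The "last castout" idea in the message above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not code from the SLB rewrite patches: the struct, the function names, the 64-entry SLB with three bolted slots, and the SKIP_PASSES value are all hypothetical. It also reads "skip over this slot for some number of future misses" as "the round-robin pointer steps past the slot that many times", which is one possible interpretation.

```c
#include <assert.h>
#include <stdint.h>

#define SLB_SLOTS    64          /* assumed SLB size */
#define SLB_BOLTED    3          /* assumption: slots 0..2 are bolted, never replaced */
#define SKIP_PASSES   2          /* hypothetical: times the RR pointer steps past a hot slot */
#define NO_ESID      UINT64_MAX  /* marks an empty slot or "nothing cast out yet" */

struct slb_model {
	uint64_t esid[SLB_SLOTS];   /* ESID currently held by each slot */
	unsigned skip[SLB_SLOTS];   /* remaining passes for which a slot is protected */
	unsigned next;              /* round-robin replacement pointer */
	uint64_t last_castout;      /* ESID of the most recently evicted segment */
};

/* Advance the round-robin pointer, wrapping past the bolted slots. */
unsigned slb_advance(unsigned slot)
{
	return (slot + 1 == SLB_SLOTS) ? SLB_BOLTED : slot + 1;
}

void slb_model_init(struct slb_model *m)
{
	unsigned i;

	for (i = 0; i < SLB_SLOTS; i++) {
		m->esid[i] = NO_ESID;
		m->skip[i] = 0;
	}
	m->next = SLB_BOLTED;
	m->last_castout = NO_ESID;
}

/* Model one SLB miss on 'esid'; returns the slot that was (re)loaded. */
unsigned slb_model_miss(struct slb_model *m, uint64_t esid)
{
	unsigned slot = m->next;
	unsigned tries;
	int hot;

	assert(esid != NO_ESID);

	/* Wanted again right after being thrown out: treat as heavily used. */
	hot = (esid == m->last_castout);

	/* Step past protected slots, using up one pass of protection each.
	 * If every slot is protected we give up and reuse the current one. */
	for (tries = 0; tries < SLB_SLOTS - SLB_BOLTED && m->skip[slot] > 0; tries++) {
		m->skip[slot]--;
		slot = slb_advance(slot);
	}

	m->last_castout = m->esid[slot];        /* remember what we evicted */
	m->esid[slot] = esid;
	m->skip[slot] = hot ? SKIP_PASSES : 0;  /* protect a hot segment for a while */
	m->next = slb_advance(slot);
	return slot;
}
```

Feeding a logged sequence of missing ESIDs through slb_model_miss() and counting the calls would give a rough way to compare this scheme against plain round-robin (SKIP_PASSES set to 0) before touching the real miss handler.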
From dwm at austin.ibm.com Tue Jul 13 01:23:56 2004 From: dwm at austin.ibm.com (Doug Maxey) Date: Mon, 12 Jul 2004 10:23:56 -0500 Subject: [RFC] JS20 IDE perf patch In-Reply-To: <20040712001329.GD30109@krispykreme> Message-ID: <200407121523.i6CFNuHW003347@falcon10.austin.ibm.com> Anton, It never did get resolved in the sense that regardless of the tweaks, the drive seems to be limited to a throughput number in the upper 20's with hdparm, even with the patches. Suspect that I may be beating a dead horse. Why I say that is that further testing on another power platform, with the SII controller, shows the driver setting UDMA5 with roughly the same numbers from hdparm. Mark has asked for a definitive answer on the max drive speed from the manufacturer, but have yet to see the numbers. In any event, if the drive is being limited by the driver, the fix would really need to go into 2.4, or go into 2.6 and be backported, if there is a discrepancy in the drive capabilities and how the setup is done. But am not sure at this point if there is any usefulness in the patch. ++doug On Mon, 12 Jul 2004 10:13:29 +1000, Anton Blanchard wrote: > >Hi Doug, > >Did this get sorted out or do we still need to fix it in 2.6? > >Anton > >On Mon, Dec 15, 2003 at 11:06:58AM -0600, Doug Maxey wrote: >> >> Howdy, >> >> Have done some investigation of the AMD IDE implementation on the >> JS20 and have come up with a set of patches to address this. The >> underlying assumption that the PCI bus can only do 66mhz is no >> longer true with the HT implementation on this platform. >> >> The specific issues seen were: >> 1) ide.c has code that assumes the PCI bus only runs between 20 and >> 66 mhz. >> 2) FW has not completely configured the chipset to run at the higher >> IDE bus speeds. >> 3) amd74xx.c does not recognize and cannot use the higher PCI bus >> speeds to enable the higher IDE speeds. >> >> Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in >> linux/ide.h. 
These values are used in ide/ide.c and amd74xx.c. If >> there are arch specific values, they override the values set in >> linux/ide.h via the arch specific header. >> >> Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c. >> >> Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in >> amd74xx.c and is used to determine the UDMA settings for the >> drives. >> >> Please review these changes. Comments are welcome. >> >> ++doug >> > > ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ ** This list is shutting down 7/24/2004. From linas at denx.de Tue Jul 13 07:16:55 2004 From: linas at denx.de (Linas Vepstas) Date: Mon, 12 Jul 2004 16:16:55 -0500 Subject: How to block pci config-reads during device self-test? In-Reply-To: <20040708173759.GO27324@parcelfarce.linux.theplanet.co.uk> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040708173759.GO27324@parcelfarce.linux.theplanet.co.uk> Message-ID: <20040712211655.GA27294@bilge> On Thu, Jul 08, 2004 at 06:37:59PM +0100, Matthew Wilcox was heard to remark: > On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King wrote: > > I don't like the idea of "learning to live with this". People do > > run into this problem and telling them it can't be fixed is not an > > acceptable answer. > > Sure it is. Even the taiwanese knock-off PCI boards don't have these > kinds of problems. Your hardware guys need to fix it. Actually, I just talked to the PCI guys here, and they seem to be saying that this behaviour is expected by the PCI spec; this implies that *all* cards with BIST will have this problem. To recap: device driver has started a BIST on the PCI card; and some other linux daemon performs a PCI config-space I/O to the adapter. Since the card under BIST cannot assert DEVSEL#, the PCI bridge does a Master Abort, with the result that the device is now placed offline. 
It sounds to me like it's pretty normal that a device in BIST is going to be ignoring pretty much anything going into it, so the master abort is no surprise. So the answer does indeed seem to be that blocking pci config-space i/o during BIST is the right thing to do (for all architectures, and not just for ppc64 where I guess we'll do it first). --linas ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Tue Jul 13 10:43:16 2004 From: anton at samba.org (Anton Blanchard) Date: Tue, 13 Jul 2004 10:43:16 +1000 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040712013302.GC9423@zax> References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax> <20040711010522.GB5232@krispykreme> <20040712013302.GC9423@zax> Message-ID: <20040713004316.GB26263@krispykreme> > Indeed. My gut instinct would be that we want to keep the number of > bolted segments down, so as much of the SLB can be used as flexibly as > possible. Agreed. > That said, I've had a couple of ideas on how to generate some > semblance of LRU data in ways that might be sufficiently low overhead > to be practical. For example, what if we kept a variable containing > the ESID of the last segment cast out from the SLB. Whenever we take > an SLB miss, we check the EA against that stored segment. If they > match - i.e. this segment is so heavily used that we want it again > almost as soon as we've thrown it out - we put a flag indicating to > skip over this slot for some number of future misses. Interesting idea. The segments that I have seen miss a lot are: userspace stack PC libc segment kernel task struct The match EA against last castout should be able to catch these and has the advantage of adapting to various workloads. > It certainly would. We really want to run some simulations to see how > bolting various segments, or schemes like those above would affect the > SLB miss rate. 
If we cut the amount of the SLB we use to a minimum we could get a rather accurate log of SLB accesses. Does the SLB miss handler touch anything outside segment 0 in your rewrite? If it doesn't we could use only one slot for replacement, and then log every miss we take. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ ** This list is shutting down 7/24/2004. From linas at denx.de Wed Jul 14 09:03:28 2004 From: linas at denx.de (Linas Vepstas) Date: Tue, 13 Jul 2004 18:03:28 -0500 Subject: [RFC/PATCH] Re: How to block pci config-reads during device self-test? In-Reply-To: <40ED780C.6000401@us.ibm.com> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> Message-ID: <20040713230328.GQ17333@bilge> Hi, On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King was heard to remark: > I've been doing some talking with various hardware folks to see if > there is a way to get this fixed and so far the answer I have been > getting is no. It is how the hardware works and there isn't much that > can be done about it ... Actually, the hardware isn't broken ... it's working as designed, more or less per pci spec. See below for technical details. > I don't like the idea of "learning to live with this". People do > run into this problem and telling them it can't be fixed is not an > acceptable answer. > > linas at austin.ibm.com wrote: > > > >Am having trouble with PCI config-space reads ... I have a device > >(actually Brian King has it) that can perform a built-in-self test > >(BIST). However, if anything does a PCI config-read during BIST, then > >the device does something crazy that makes the PCI controller chip > >take it offline. Actually, the device didn't do anything 'crazy'; it just sat there silently, ignoring the i/o. > >I'm not sure what's doing the config-space reads ... seems > >to be some user-space tool or daemon. 
I'm wondering if > >there is any practical way to block such reads to a given > >device until its self-test sequence is completed. I could > >try to modify the architecture-specific pci files to do this > >(arch/ppc64/kernel/pSeries_pci.c) but this seems a tad ugly ... is > >there another way? or do we have to just learn to live with this > >hardware? Here's a patch w/ a recap description of the problem. The following patch addresses a peculiar PCI i/o problem. This patch is specific to ppc64, although one could argue that a better patch should probably be arch indep. But there's also a different way to fix this problem, which is why this is an RFC. The problem: Some device drivers use the PCI BIST (built-in self-test) to reset the PCI card if the card hangs in some error condition. While the BIST is running, the card, almost by definition, is unable to respond to pci i/o. If the pci controller tries to do any i/o, the card simply won't respond, and the pci controller will do a 'master abort' a few pci bus cycles later. This is normal and expected behaviour for both card and the controller. The problem that I'm having is that I have a pci controller that thinks master-abort==dead card, and so offlines the card. Which is actually reasonable too ... no one should be doing i/o during BIST anyway. Unfortunately various user-space daemons & tools can do pci config-space i/o at anytime (typically to perform a hardware inventory) ... and invariably one of these user-land tools runs while the card is doing a BIST; the pci controller then off-lines this "dead card", and we are unhappy. There are two possible solutions. (1) block pci config-space i/o during a BIST, and (2) let the device driver detect that the card has been off-lined, and so reset the pci controller chip. I'm starting to think (2) is better, but this patch implements (1). Opinions? Comments? The patch below uses rw semaphores to block i/o access during BIST. 
All i/o access is locked with a read semaphore: thus, access remains fast-n-cheap. If the driver needs to BIST, it will then need to do a down_write()/up_write() surrounding the BIST. --linas -------------- next part -------------- ===== arch/ppc64/kernel/pSeries_pci.c 1.39 vs edited ===== --- 1.39/arch/ppc64/kernel/pSeries_pci.c Mon Jul 12 18:29:16 2004 +++ edited/arch/ppc64/kernel/pSeries_pci.c Tue Jul 13 15:46:08 2004 @@ -76,14 +76,22 @@ addr = (dn->busno << 16) | (dn->devfn << 8) | where; buid = dn->phb->buid; + + if (in_interrupt()) { + /* Driver should unplug interrupts during BIST */ + BUG_ON (down_read_trylock(&dn->bist_lock) != 1); + } else { + down_read (&dn->bist_lock); + } if (buid) { ret = rtas_call(ibm_read_pci_config, 4, 2, &returnval, addr, buid >> 32, buid & 0xffffffff, size); } else { ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size); } - *val = returnval; + up_read (&dn->bist_lock); + *val = returnval; if (ret) return PCIBIOS_DEVICE_NOT_FOUND; @@ -129,11 +137,19 @@ addr = (dn->busno << 16) | (dn->devfn << 8) | where; buid = dn->phb->buid; + + if (in_interrupt()) { + /* Driver should unplug interrupts during BIST */ + BUG_ON (down_read_trylock(&dn->bist_lock) != 1); + } else { + down_read (&dn->bist_lock); + } if (buid) { ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, buid >> 32, buid & 0xffffffff, size, (ulong) val); } else { ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val); } + up_read (&dn->bist_lock); if (ret) return PCIBIOS_DEVICE_NOT_FOUND; ===== arch/ppc64/kernel/prom.c 1.89 vs edited ===== --- 1.89/arch/ppc64/kernel/prom.c Tue Jul 13 15:26:58 2004 +++ edited/arch/ppc64/kernel/prom.c Tue Jul 13 16:03:01 2004 @@ -31,6 +31,7 @@ #include #include #include +#include #include #include #include @@ -3021,6 +3022,7 @@ kfree(np); return err; } + init_rwsem (&np->bist_lock); write_lock(&devtree_lock); np->sibling = np->parent->child; ===== include/asm-ppc64/prom.h 1.20 vs edited ===== --- 
1.20/include/asm-ppc64/prom.h Thu May 27 20:37:20 2004 +++ edited/include/asm-ppc64/prom.h Tue Jul 13 15:45:28 2004 @@ -16,6 +16,7 @@ */ #include #include +#include #define PTRRELOC(x) ((typeof(x))((unsigned long)(x) - offset)) #define PTRUNRELOC(x) ((typeof(x))((unsigned long)(x) + offset)) @@ -151,6 +152,12 @@ int busno; /* for pci devices */ int bussubno; /* for pci devices */ int devfn; /* for pci devices */ + + /* Lock to prevent 3rd-party PCI config space i/o while device + * driver is running BIST on the controller; driver should obtain + * down_write()/up_write() in this lock for the duration of BIST. */ + struct rw_semaphore bist_lock; + #define DN_STATUS_BIST_FAILED (1<<0) int status; /* Current device status (non-zero is bad) */ int eeh_mode; /* See eeh.h for possible EEH_MODEs */ From david at gibson.dropbear.id.au Wed Jul 14 10:54:21 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 14 Jul 2004 10:54:21 +1000 Subject: [1/4] RFC: SLB rewrite (core rewrite) In-Reply-To: <20040713004316.GB26263@krispykreme> References: <20040707060635.GB987@zax> <20040707100643.0b889896.moilanen@austin.ibm.com> <20040708014654.GB4309@zax> <20040711010522.GB5232@krispykreme> <20040712013302.GC9423@zax> <20040713004316.GB26263@krispykreme> Message-ID: <20040714005421.GA18558@zax> On Tue, Jul 13, 2004 at 10:43:16AM +1000, Anton Blanchard wrote: > > > > Indeed. My gut instinct would be that we want to keep the number of > > bolted segments down, so as much of the SLB can be used as flexibly as > > possible. > > Agreed. > > > That said, I've had a couple of ideas on how to generate some > > semblence of LRU data in ways that might be sufficiently low overhead > > to be practical. For example, what if we kept a variable containing > > the ESID of the last segment cast out from the SLB. Whenever we take > > an SLB miss, we check the EA against that stored segment. If they > > match - i.e. 
this segment is so heavily used that we want it again > > almost as soon as we've thrown it out - we put a flag indicating to > > skip over this slot for some number of future misses. > > Interesting idea. The segments that I have seen miss a lot are: > > userspace > stack > PC > libc segment > > kernel > task struct > > The match EA against last castout should be able to catch these and has > the advantage of adapting to various workloads. > > > It certainly would. We really want to run some simulations to see how > > bolting various segments, or schemes like those above would affect the > > SLB miss rate. > > If we cut the amount of the SLB we use to a minimum we could get a > rather accurate log of SLB accesses. Does the SLB miss handler touch > anything outside segment 0 in your rewrite? If it doesnt we could use > only one slot for replacement, and then log every miss we take. Yes, that should work, the new handler shouldn't touch anything outside segment 0. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ ** This list is shutting down 7/24/2004. From haveblue at us.ibm.com Wed Jul 14 11:04:24 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Tue, 13 Jul 2004 18:04:24 -0700 Subject: oops bringing up secondary cpus Message-ID: <1089767063.4370.160.camel@nighthawk> First of all, this oops is very likely my fault. I'm porting CONFIG_NONLINEAR over to ppc64 for memory hotplug, and I have little doubt that I screwed up somewhere. I could really use some more eyes analyzing this oops, though. It's a 2-way p650 partition with 12GB of allocated memory. I think it occurs the first time that a secondary CPU touches its idle task's kernel stack. 
Here's the dump: cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10] pc: 000000000000beac lr: 000000000000be88 sp: c0000002fff3bf90 msr: 8000000000001002 dar: c0000002fff3bf90 dsisr: a000000 current = 0xc0000002fff30920 paca = 0xc00000000035a000 pid = 0, comm = swapper pc appears to be in __secondary_start (the 0xc00000000000beac address): objdump: c00000000000be38 <.__secondary_start>: ... c00000000000be94: 64 63 00 50 oris r3,r3,80 c00000000000be98: 60 63 5f 58 ori r3,r3,24408 c00000000000be9c: 7b 1c 1f 24 rldicr r28,r24,3,60 c00000000000bea0: 7c 23 e0 2a ldx r1,r3,r28 c00000000000bea4: 38 21 3f 90 addi r1,r1,16272 c00000000000bea8: 38 00 00 00 li r0,0 c00000000000beac: f8 01 00 00 std r0,0(r1) c00000000000beb0: f8 2d 00 20 std r1,32(r13) c00000000000beb4: e8 6d 00 38 ld r3,56(r13) c00000000000beb8: 60 64 00 01 ori r4,r3,1 c00000000000bebc: 38 60 50 00 li r3,20480 The idle task for cpu one is allocated at c0000002fff38000, so this looks valid. The instruction that causes the exception appears to be reading the contents of r1 into r0, right? That looks like a quite valid virtual address to me. Is this a problem with the SLB? enter ? 
for help 1:mon> r R00 = 0000000000000000 R16 = 0000000000000000 R01 = c0000002fff3bf90 R17 = 0000000000000000 R02 = c000000000500440 R18 = 0000000000000000 R03 = c000000000505f58 R19 = 0000000000000000 R04 = 000008d12e6ab480 R20 = 0000000000000000 R05 = 0000000000000000 R21 = 0000000000c00000 R06 = 0000000000000001 R22 = 0000000000000000 R07 = 0000000000c02000 R23 = 0000000000000001 R08 = c00000000035a000 R24 = 0000000000000001 R09 = d000000008000001 R25 = 0000000000000010 R10 = 0000000000000001 R26 = 0000000000000568 R11 = 0000000000000002 R27 = 000000000000041c R12 = 2000000000000000 R28 = 0000000000000008 R13 = c00000000035a000 R29 = 00000000000009e8 R14 = 0000000000000000 R30 = 0000000003280000 R15 = 0000000000000000 R31 = 0000000000900000 pc = 000000000000beac lr = 000000000000be88 msr = 8000000000001002 cr = 42280484 ctr = 0000000000000000 xer = 0000000020000000 trap = 300 SLB contents of cpu 1 00 c000000008000000 00006a99b4b14580 01 d000000008000000 000008d12e6ab480 02 c0000002f8000000 0000171f7cb14580 03 c000000030000000 0000a12fdcb14580 04 0000000010000000 000055a12defac00 05 0000000000000000 00001fe68f09ec00 06 00000000f0000000 000030d55709ec00 07 0000000040000000 000013596f09ec00 08 c0000002f0000000 0000171f7cb14580 09 0000000010000000 0000dcc34709ec00 10 0000000000000000 000098c475efac00 11 00000000f0000000 0000a9b33defac00 12 0000000040000000 00008c3755efac00 13 c000000030000000 0000a12fdcb14580 14 0000000010000000 000055a12defac00 15 c0000002f0000000 0000171f7cb14580 16 0000000000000000 00001fe68f09ec00 17 00000000f0000000 000030d55709ec00 18 0000000040000000 000013596f09ec00 19 0000000010000000 0000dcc34709ec00 20 0000000000000000 000098c475efac00 21 00000000f0000000 0000a9b33defac00 22 0000000040000000 00008c3755efac00 23 c000000030000000 0000a12fdcb14580 24 0000000010000000 000055a12defac00 25 c0000002f0000000 0000171f7cb14580 26 c000000030000000 0000a12fdcb14580 27 0000000000000000 0000e3779b970c00 28 00000000f0000000 0000f46663970c00 29 
0000000040000000 0000d6ea7b970c00 30 0000000010000000 0000a05453970c00 31 c0000002f0000000 0000171f7cb14580 32 e000000000000000 0000a708a8242480 33 c0000002f0000000 0000171f7cb14580 34 e000000080000000 00008dee68242480 35 0000000000000000 000098c475efac00 36 00000000f0000000 0000a9b33defac00 37 0000000040000000 00008c3755efac00 38 c000000030000000 0000a12fdcb14580 39 0000000010000000 000055a12defac00 40 0000000000000000 00001fe68f09ec00 41 00000000f0000000 000030d55709ec00 42 0000000040000000 000013596f09ec00 43 c0000002f0000000 0000171f7cb14580 44 0000000010000000 0000dcc34709ec00 45 0000000000000000 000036fbefa91c00 46 00000000f0000000 000047eab7a91c00 47 0000000040000000 00002a6ecfa91c00 48 0000000010000000 0000f3d8a7a91c00 49 0000000000000000 00001fe68f09ec00 50 00000000f0000000 000030d55709ec00 51 0000000040000000 000013596f09ec00 52 c000000030000000 0000a12fdcb14580 53 0000000010000000 0000dcc34709ec00 54 c0000002f0000000 0000171f7cb14580 55 c000000030000000 0000a12fdcb14580 56 c0000002f0000000 0000171f7cb14580 57 c000000030000000 0000a12fdcb14580 58 c0000002f0000000 0000171f7cb14580 59 c000000030000000 0000a12fdcb14580 60 c0000002f0000000 0000171f7cb14580 61 c000000030000000 0000a12fdcb14580 62 c0000002f0000000 0000171f7cb14580 63 0000000000000000 000098c475efac00 Could stab_initialize() have been screwed up? Any suggestions how to debug this? 
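The address arithmetic in the objdump excerpt can be checked by hand. The following is a minimal sketch that uses only values from the dump above; the function names are invented for illustration, and it assumes that `current_set[1]` held the idle task address c0000002fff38000 quoted in the mail.

```c
#include <stdint.h>

/* rldicr r28,r24,3,60 is the extended form of "sldi r28,r24,3": a left
 * shift by 3 that turns a cpu number into a byte offset into an array
 * of 8-byte pointers. */
static uint64_t current_set_offset(uint64_t cpu)
{
    return cpu << 3;
}

/* addi r1,r1,16272: the initial stack pointer sits 16272 (0x3f90) bytes
 * above the base address loaded from current_set[cpu]. */
static uint64_t initial_stack_pointer(uint64_t task_base)
{
    return task_base + 16272;
}
```

For cpu 1 this gives an offset of 8 (R28 in the register dump) and a stack pointer of c0000002fff3bf90, which matches both R01 and the DAR. Note that `std r0,0(r1)` is a store of r0 to that address rather than a load, so the fault is on the first write to the new stack.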
I even tried dumping the PACA on a working and a non-working kernel,
but the only difference I got was at decimal offset 480:

--- paca0-mm2-virgin 2004-07-13 13:33:04.000000000 -0700
+++ paca0-mm2-nononl 2004-07-13 16:17:21.000000000 -0700
@@ -2,7 +2,7 @@
 ][0032]: 0100000000000000 0000000000000000 d397d78102800000 0000000000000000
 ][0128]: d397d9e204000000 0000000000000000 0000000000000000 0000000000000000
 ][0256]: 0000000300000000 0000000000000000 0000000000000000 0000000000000000
-][0480]: 0000003b6fae5692 0000000000000000 c000000000000000 0000000000000000
+][0480]: e000000000000000 0000000000000000 c000000000000000 0000000000000000
 ][1024]: c000000000346180 0000000000000000 0000000000000000 0000000000000000
 ][1056]: 0000000000000000 0000000000000000 d397d78102800000 0000000000000000
 ][1152]: d397d9e204000000 0000000000000000 0000000000000000 0000000000000000

And that's in a reserved area of the xLpPaca, so it's pretty much a
guess what the hypervisor was doing.

(btw, this diff is from a slightly different kernel than the above
oops)

-- Dave

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/
** This list is shutting down 7/24/2004.

From haveblue at us.ibm.com Thu Jul 15 09:55:25 2004
From: haveblue at us.ibm.com (Dave Hansen)
Date: Wed, 14 Jul 2004 16:55:25 -0700
Subject: [PATCH] trivial __make_room() warning fix
Message-ID: <1089849324.10000.50.camel@nighthawk>

I was just compiling 2.6.8-rc1-mm1, and got a warning I don't think
I've seen before:

arch/ppc64/kernel/prom.c: In function `__make_room':
arch/ppc64/kernel/prom.c:1415: warning: unused variable `offset'

The attached patch fixes it.

-- Dave
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ppc64-initrd-warning-2.6.8-rc1-mm1.patch Type: text/x-patch Size: 634 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040714/4d14a010/attachment.bin From haveblue at us.ibm.com Thu Jul 15 10:14:01 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Wed, 14 Jul 2004 17:14:01 -0700 Subject: Broken LPAR cpu bringup Message-ID: <1089850441.10000.62.camel@nighthawk> OK, after tearing my remaining hair out for a few more hours, I realized that the oops that I posted yesterday is *not* my fault. :) Something is getting really confused. These results are from a clean 2.6.8-rc1-mm1 kernel. It's a partition on a p650 with 8 physical CPUs, and 2 in the partition that I'm using. 64GB of total RAM, 12 GB in the partition. There are no other active partitions. The early CPU booting appears to go as planned: instantiating rtas at 0x000000003fd3c000... done 0000000000000000 : booting cpu /cpus/PowerPC,POWER4 at 0 0000000000000001 : starting cpu /cpus/PowerPC,POWER4 at 1... ... done Calling quiesce ... But, when SMP cpu bringup happens, it gets really confused and tries to bring up *all* of the CPUs: Partition configured for 8 cpus. No more cpus available, failing Processor 1 is stuck. No more cpus available, failing Processor 2 is stuck. No more cpus available, failing Processor 3 is stuck. No more cpus available, failing Processor 4 is stuck. No more cpus available, failing Processor 5 is stuck. No more cpus available, failing Processor 6 is stuck. No more cpus available, failing Processor 7 is stuck. Brought up 1 CPUs The mailing list didn't like me attaching the device tree, so here's another copy: http://www.baconmonkey.com/device-tree-p650.tar.gz -- Dave ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Thu Jul 15 14:29:22 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Wed, 14 Jul 2004 23:29:22 -0500 Subject: Broken LPAR cpu bringup In-Reply-To: <1089850441.10000.62.camel@nighthawk> References: <1089850441.10000.62.camel@nighthawk> Message-ID: <1089865083.13914.3.camel@biclops.private.network> On Wed, 2004-07-14 at 19:14, Dave Hansen wrote: > These results are from a clean 2.6.8-rc1-mm1 kernel. It's a partition > on a p650 with 8 physical CPUs, and 2 in the partition that I'm using. > 64GB of total RAM, 12 GB in the partition. There are no other active > partitions. > > The early CPU booting appears to go as planned: > > instantiating rtas at 0x000000003fd3c000... done > 0000000000000000 : booting cpu /cpus/PowerPC,POWER4 at 0 > 0000000000000001 : starting cpu /cpus/PowerPC,POWER4 at 1... ... done > Calling quiesce ... > > But, when SMP cpu bringup happens, it gets really confused and tries to > bring up *all* of the CPUs: > > Partition configured for 8 cpus. > No more cpus available, failing > Processor 1 is stuck. > No more cpus available, failing > Processor 2 is stuck. > No more cpus available, failing > Processor 3 is stuck. > No more cpus available, failing > Processor 4 is stuck. > No more cpus available, failing > Processor 5 is stuck. > No more cpus available, failing > Processor 6 is stuck. > No more cpus available, failing > Processor 7 is stuck. > Brought up 1 CPUs > Does this fix it? The problem is that ppc64 doesn't know about cpu_present_map. The arch-independent code copies cpu_possible_map to cpu_present_map if the latter has not been modified by the arch bootup code -- as you have discovered, this breaks on LPAR. 
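The failure mode described here can be modeled in a few lines. This is a toy sketch, not the kernel code: cpumasks are reduced to a single 64-bit word, and `fixup_present_map` is a made-up name standing in for the arch-independent fallback mentioned above.

```c
#include <stdint.h>

typedef uint64_t toy_cpumask_t; /* bit n set => cpu n is in the map */

/* Hypothetical stand-in for the generic fallback: if the arch boot code
 * left the present map empty, assume every possible cpu is present. */
static toy_cpumask_t fixup_present_map(toy_cpumask_t present,
                                       toy_cpumask_t possible)
{
    return present ? present : possible;
}

/* Count the cpus that boot-time bringup would attempt to start. */
static int count_cpus(toy_cpumask_t map)
{
    int n = 0;

    for (; map; map &= map - 1) /* clear the lowest set bit each pass */
        n++;
    return n;
}
```

With 8 possible cpus and an untouched present map, the model reports 8 cpus to start, which is the behaviour in the boot log quoted above. Once the arch code records only the 2 cpus that are really in the partition, only those 2 are tried.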
Nathan Signed-off-by: Nathan Lynch diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/prom.c 2.6.8-rc1-mm1.new/arch/ppc64/kernel/prom.c --- 2.6.8-rc1-mm1/arch/ppc64/kernel/prom.c 2004-07-14 20:08:13.000000000 -0500 +++ 2.6.8-rc1-mm1.new/arch/ppc64/kernel/prom.c 2004-07-14 19:57:01.000000000 -0500 @@ -942,7 +942,7 @@ static void __init prom_hold_cpus(unsign #ifdef CONFIG_SMP cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_possible_map)); - cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); if (reg == 0) cpu_set(cpuid, RELOC(cpu_online_map)); #endif /* CONFIG_SMP */ @@ -1044,7 +1044,7 @@ static void __init prom_hold_cpus(unsign _systemcfg->processorCount++; cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_possible_map)); - cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); #endif } else { prom_printf("... failed: %x\n", *acknowledge); @@ -1056,7 +1056,7 @@ static void __init prom_hold_cpus(unsign cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_possible_map)); cpu_set(cpuid, RELOC(cpu_online_map)); - cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); } #endif next: @@ -1071,7 +1071,7 @@ next: interrupt_server[i]); if (_naca->smt_state) { cpu_set(cpuid, RELOC(cpu_available_map)); - cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); prom_printf("available\n"); } else { prom_printf("not available\n"); diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c 2.6.8-rc1-mm1.new/arch/ppc64/kernel/smp.c --- 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c 2004-07-14 20:08:20.000000000 -0500 +++ 2.6.8-rc1-mm1.new/arch/ppc64/kernel/smp.c 2004-07-14 20:31:38.000000000 -0500 @@ -60,7 +60,6 @@ unsigned long cache_decay_ticks; cpumask_t cpu_possible_map = CPU_MASK_NONE; cpumask_t cpu_online_map = CPU_MASK_NONE; cpumask_t cpu_available_map = 
CPU_MASK_NONE; -cpumask_t cpu_present_at_boot = CPU_MASK_NONE; EXPORT_SYMBOL(cpu_online_map); EXPORT_SYMBOL(cpu_possible_map); @@ -126,7 +125,7 @@ static int smp_iSeries_numProcs(void) if (paca[i].lppaca.xDynProcStatus < 2) { cpu_set(i, cpu_available_map); cpu_set(i, cpu_possible_map); - cpu_set(i, cpu_present_at_boot); + cpu_set(i, cpu_present_map); ++np; } } @@ -868,7 +867,7 @@ int __devinit __cpu_up(unsigned int cpu) int c; /* At boot, don't bother with non-present cpus -JSCHOPP */ - if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu)) + if (system_state == SYSTEM_BOOTING && !cpu_present(cpu)) return -ENOENT; paca[cpu].prof_counter = 1; diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/xics.c 2.6.8-rc1-mm1.new/arch/ppc64/kernel/xics.c --- 2.6.8-rc1-mm1/arch/ppc64/kernel/xics.c 2004-07-14 20:08:13.000000000 -0500 +++ 2.6.8-rc1-mm1.new/arch/ppc64/kernel/xics.c 2004-07-14 19:55:33.000000000 -0500 @@ -548,7 +548,7 @@ nextnode: #ifdef CONFIG_SMP for_each_cpu(i) { /* FIXME: Do this dynamically! --RR */ - if (!cpu_present_at_boot(i)) + if (!cpu_present(i)) continue; xics_per_cpu[i] = __ioremap((ulong)inodes[get_hard_smp_processor_id(i)].addr, (ulong)inodes[get_hard_smp_processor_id(i)].size, diff -Naurp -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/include/asm-ppc64/smp.h 2.6.8-rc1-mm1.new/include/asm-ppc64/smp.h --- 2.6.8-rc1-mm1/include/asm-ppc64/smp.h 2004-07-14 20:08:18.000000000 -0500 +++ 2.6.8-rc1-mm1.new/include/asm-ppc64/smp.h 2004-07-14 20:11:51.000000000 -0500 @@ -42,15 +42,11 @@ extern void smp_message_recv(int, struct * possible: CPU is a candidate to be made online * available: CPU is candidate for the 'possible' pool * Used to get SMT threads started at boot time. - * present_at_boot: CPU was available at boot time. Used in DLPAR - * code to handle special cases for processor start up. 
*/ -extern cpumask_t cpu_present_at_boot; extern cpumask_t cpu_online_map; extern cpumask_t cpu_possible_map; extern cpumask_t cpu_available_map; -#define cpu_present_at_boot(cpu) cpu_isset(cpu, cpu_present_at_boot) #define cpu_available(cpu) cpu_isset(cpu, cpu_available_map) /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Jul 15 16:29:36 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 15 Jul 2004 16:29:36 +1000 Subject: oops bringing up secondary cpus In-Reply-To: <1089767063.4370.160.camel@nighthawk> References: <1089767063.4370.160.camel@nighthawk> Message-ID: <20040715062936.GA27715@krispykreme> Hi, > cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10] > pc: 000000000000beac > lr: 000000000000be88 > sp: c0000002fff3bf90 > msr: 8000000000001002 > dar: c0000002fff3bf90 > dsisr: a000000 > current = 0xc0000002fff30920 > paca = 0xc00000000035a000 > pid = 0, comm = swapper Looks like you are trying to access a virtual address with IR and DR off (ie you are in real mode). Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Jul 16 00:52:47 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 16 Jul 2004 00:52:47 +1000 Subject: iseries profiling support Message-ID: <20040715145246.GF27715@krispykreme> Hi, Is anyone using the dprofile facility on iseries these days? If not we can remove some code. 
Anton ===== arch/ppc64/kernel/asm-offsets.c 1.21 vs edited ===== --- 1.21/arch/ppc64/kernel/asm-offsets.c Fri Jul 2 15:23:46 2004 +++ edited/arch/ppc64/kernel/asm-offsets.c Tue Jul 13 22:40:07 2004 @@ -91,11 +91,6 @@ DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc)); DEFINE(PACAPROCENABLED, offsetof(struct paca_struct, proc_enabled)); DEFINE(PACADEFAULTDECR, offsetof(struct paca_struct, default_decr)); - DEFINE(PACAPROFENABLED, offsetof(struct paca_struct, prof_enabled)); - DEFINE(PACAPROFLEN, offsetof(struct paca_struct, prof_len)); - DEFINE(PACAPROFSHIFT, offsetof(struct paca_struct, prof_shift)); - DEFINE(PACAPROFBUFFER, offsetof(struct paca_struct, prof_buffer)); - DEFINE(PACAPROFSTEXT, offsetof(struct paca_struct, prof_stext)); DEFINE(PACA_EXGEN, offsetof(struct paca_struct, exgen)); DEFINE(PACA_EXMC, offsetof(struct paca_struct, exmc)); DEFINE(PACA_EXSLB, offsetof(struct paca_struct, exslb)); ===== arch/ppc64/kernel/head.S 1.67 vs edited ===== --- 1.67/arch/ppc64/kernel/head.S Mon Jul 5 20:27:11 2004 +++ edited/arch/ppc64/kernel/head.S Tue Jul 13 22:41:02 2004 @@ -317,36 +317,11 @@ label##_Iseries: \ mtspr SPRG1,r13; /* save r13 */ \ EXCEPTION_PROLOG_ISERIES_1(PACA_EXGEN); \ - lbz r10,PACAPROFENABLED(r13); \ - cmpwi r10,0; \ - bne- label##_Iseries_profile; \ -label##_Iseries_prof_ret: \ lbz r10,PACAPROCENABLED(r13); \ cmpwi 0,r10,0; \ beq- label##_Iseries_masked; \ EXCEPTION_PROLOG_ISERIES_2; \ b label##_common; \ -label##_Iseries_profile: \ - ld r12,PACALPPACA+LPPACASRR1(r13); \ - andi. r12,r12,MSR_PR; /* Test if in kernel */ \ - bne label##_Iseries_prof_ret; \ - ld r11,PACALPPACA+LPPACASRR0(r13); \ - ld r12,PACAPROFSTEXT(r13); /* _stext */ \ - subf r11,r12,r11; /* offset into kernel */ \ - lwz r12,PACAPROFSHIFT(r13); \ - srd r11,r11,r12; \ - lwz r12,PACAPROFLEN(r13); /* profile table length - 1 */ \ - cmpd r11,r12; /* off end? 
*/ \ - ble 1f; \ - mr r11,r12; /* force into last entry */ \ -1: sldi r11,r11,2; /* convert to offset */ \ - ld r12,PACAPROFBUFFER(r13);/* profile buffer */ \ - add r12,r12,r11; \ -2: lwarx r11,0,r12; /* atomically increment */ \ - addi r11,r11,1; \ - stwcx. r11,0,r12; \ - bne- 2b; \ - b label##_Iseries_prof_ret #ifdef DO_SOFT_DISABLE #define DISABLE_INTS \ ===== arch/ppc64/kernel/iSeries_setup.c 1.27 vs edited ===== --- 1.27/arch/ppc64/kernel/iSeries_setup.c Fri Jul 2 15:23:46 2004 +++ edited/arch/ppc64/kernel/iSeries_setup.c Tue Jul 13 22:38:29 2004 @@ -36,6 +36,7 @@ #include #include #include +#include #include #include "iSeries_setup.h" @@ -54,17 +55,12 @@ #include /* Function Prototypes */ -extern void abort(void); extern void ppcdbg_initialize(void); -extern void iSeries_pcibios_init(void); extern void tce_init_iSeries(void); static void build_iSeries_Memory_Map(void); static void setup_iSeries_cache_sizes(void); static void iSeries_bolt_kernel(unsigned long saddr, unsigned long eaddr); -extern void build_valid_hpte(unsigned long vsid, unsigned long ea, unsigned long pa, - pte_t *ptep, unsigned hpteflags, unsigned bolted); -static void iSeries_setup_dprofile(void); extern void iSeries_setup_arch(void); extern void iSeries_pci_final_fixup(void); @@ -77,16 +73,10 @@ static unsigned long tbFreqMhz; static unsigned long tbFreqMhzHundreths; -unsigned long dprof_shift; -unsigned long dprof_len; -unsigned int *dprof_buffer; - int piranha_simulator; int boot_cpuid; -extern char _end[]; - extern int rd_size; /* Defined in drivers/block/rd.c */ extern unsigned long klimit; extern unsigned long embedded_sysmap_start; @@ -366,30 +356,6 @@ } *p = 0; - if (strstr(cmd_line, "dprofile=")) { - for (q = cmd_line; (p = strstr(q, "dprofile=")) != 0; ) { - unsigned long size, new_klimit; - - q = p + 9; - if ((p > cmd_line) && (p[-1] != ' ')) - continue; - dprof_shift = simple_strtoul(q, &q, 0); - dprof_len = (unsigned long)_etext - - (unsigned long)_stext; - dprof_len >>= 
dprof_shift; - size = ((dprof_len * sizeof(unsigned int)) + - (PAGE_SIZE-1)) & PAGE_MASK; - dprof_buffer = (unsigned int *)((klimit + - (PAGE_SIZE-1)) & PAGE_MASK); - new_klimit = ((unsigned long)dprof_buffer) + size; - lmb_reserve(__pa(klimit), (new_klimit-klimit)); - klimit = new_klimit; - memset(dprof_buffer, 0, size); - } - } - - iSeries_setup_dprofile(); - mf_init(); mf_initialized = 1; mb(); @@ -834,22 +800,6 @@ if (embedded_sysmap_end) klimit = KERNELBASE + ((embedded_sysmap_end + 4095) & 0xfffffffffffff000); - } -} - -static void iSeries_setup_dprofile(void) -{ - if (dprof_buffer) { - unsigned i; - - for (i = 0; i < NR_CPUS; ++i) { - paca[i].prof_shift = dprof_shift; - paca[i].prof_len = dprof_len - 1; - paca[i].prof_buffer = dprof_buffer; - paca[i].prof_stext = (unsigned *)_stext; - mb(); - paca[i].prof_enabled = 1; - } } } ===== arch/ppc64/kernel/irq.c 1.62 vs edited ===== --- 1.62/arch/ppc64/kernel/irq.c Fri Jul 2 15:23:46 2004 +++ edited/arch/ppc64/kernel/irq.c Tue Jul 13 22:42:12 2004 @@ -803,18 +803,6 @@ *mask = new_value; -#ifdef CONFIG_PPC_ISERIES - { - unsigned i; - for (i=0; iprof_counter)) { - update_process_times(user_mode(regs)); - (get_paca()->prof_counter)=get_paca()->prof_multiplier; - } + update_process_times(user_mode(regs)); } void smp_message_recv(int msg, struct pt_regs *regs) @@ -825,8 +822,6 @@ /* Fixup boot cpu */ smp_store_cpu_info(boot_cpuid); cpu_callin_map[boot_cpuid] = 1; - paca[boot_cpuid].prof_counter = 1; - paca[boot_cpuid].prof_multiplier = 1; #ifndef CONFIG_PPC_ISERIES paca[boot_cpuid].next_jiffy_update_tb = tb_last_stamp = get_tb(); @@ -872,8 +867,6 @@ if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu)) return -ENOENT; - paca[cpu].prof_counter = 1; - paca[cpu].prof_multiplier = 1; paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock; if (!(cur_cpu_spec->cpu_features & CPU_FTR_SLB)) { ===== include/asm-ppc64/paca.h 1.19 vs edited ===== --- 1.19/include/asm-ppc64/paca.h Fri Jul 2 15:23:46 2004 +++ 
edited/include/asm-ppc64/paca.h Tue Jul 13 22:32:42 2004 @@ -99,21 +99,6 @@ */ struct ItLpPaca lppaca __attribute__((aligned(0x80))); struct ItLpRegSave reg_save; - - /* - * iSeries profiling support - * - * FIXME: do we still want this, or can we ditch it in favour - * of oprofile? - */ - u32 *prof_buffer; /* iSeries profiling buffer */ - u32 *prof_stext; /* iSeries start of kernel text */ - u32 prof_multiplier; - u32 prof_counter; - u32 prof_shift; /* iSeries shift for profile - * bucket size */ - u32 prof_len; /* iSeries length of profile */ - u8 prof_enabled; /* 1=iSeries profiling enabled */ }; #endif /* _PPC64_PACA_H */ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Jul 16 00:57:08 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 16 Jul 2004 00:57:08 +1000 Subject: page align emergency stack Message-ID: <20040715145708.GG27715@krispykreme> Page align the emergency stack. Signed-off-by: Anton Blanchard ===== arch/ppc64/kernel/pacaData.c 1.15 vs edited ===== --- 1.15/arch/ppc64/kernel/pacaData.c Fri Jul 2 15:23:46 2004 +++ edited/arch/ppc64/kernel/pacaData.c Tue Jul 13 12:40:05 2004 @@ -30,7 +30,7 @@ /* Stack space used when we detect a bad kernel stack pointer, and * early in SMP boots before relocation is enabled. */ -char emergency_stack[PAGE_SIZE * NR_CPUS]; +char emergency_stack[PAGE_SIZE * NR_CPUS] __page_aligned; /* The Paca is an array with one entry per processor. Each contains an * ItLpPaca, which contains the information shared between the ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Jul 16 01:46:30 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 16 Jul 2004 01:46:30 +1000 Subject: [PATCH] remove multiple IRQ optimisation Message-ID: <20040715154630.GH27715@krispykreme> ppc64 has an optimisation where it loops on get_irq until there are no more interrupts to be handled. 
Mark Hack notes that this optimisation hardly ever hits and costs us a potentially expensive extra read of an interrupt register every interrupt. Also make do_IRQ void, the callers never use the return value. Signed-off-by: Anton Blanchard ===== arch/ppc64/kernel/irq.c 1.62 vs edited ===== --- 1.62/arch/ppc64/kernel/irq.c Fri Jul 2 15:23:46 2004 +++ edited/arch/ppc64/kernel/irq.c Fri Jul 16 01:39:54 2004 @@ -589,7 +589,7 @@ } #ifdef CONFIG_PPC_ISERIES -int do_IRQ(struct pt_regs *regs) +void do_IRQ(struct pt_regs *regs) { struct paca_struct *lpaca; struct ItLpQueue *lpq; @@ -629,15 +629,13 @@ /* Signal a fake decrementer interrupt */ timer_interrupt(regs); } - - return 1; /* lets ret_from_int know we can do checks */ } #else /* CONFIG_PPC_ISERIES */ -int do_IRQ(struct pt_regs *regs) +void do_IRQ(struct pt_regs *regs) { - int irq, first = 1; + int irq; irq_enter(); @@ -656,25 +654,15 @@ } #endif - /* - * Every arch is required to implement ppc_md.get_irq. - * This function will either return an irq number or -1 to - * indicate there are no more pending. But the first time - * through the loop this means there wasn't an IRQ pending. - * The value -2 is for buggy hardware and means that this IRQ - * has already been handled. -- Tom - */ - while ((irq = ppc_md.get_irq(regs)) >= 0) { + irq = ppc_md.get_irq(regs); + + if (irq >= 0) ppc_irq_dispatch_handler(regs, irq); - first = 0; - } - if (irq != -2 && first) + else /* That's not SMP safe ... but who cares ? */ ppc_spurious_interrupts++; irq_exit(); - - return 1; /* lets ret_from_int know we can do checks */ } #endif /* CONFIG_PPC_ISERIES */ ** Sent via the linuxppc64-dev mail list. 
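The cost argument behind this patch can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not the kernel code: `fake_get_irq` is a made-up stand-in for `ppc_md.get_irq`, a global counter holds the number of pending interrupts, and every call counts as one read of the interrupt register.

```c
static int pending; /* interrupts waiting to be handled */
static int reads;   /* calls made to the simulated interrupt register */

/* Hypothetical stand-in for ppc_md.get_irq(): returns an irq number, or
 * -1 when nothing is pending. */
static int fake_get_irq(void)
{
    reads++;
    if (pending > 0) {
        pending--;
        return 42; /* arbitrary irq number */
    }
    return -1;
}

/* Old scheme: keep polling until get_irq reports nothing pending. */
static int reads_with_loop(int npending)
{
    pending = npending;
    reads = 0;
    while (fake_get_irq() >= 0)
        ; /* the handler dispatch would go here */
    return reads;
}

/* New scheme: exactly one read per interrupt entry. */
static int reads_single(int npending)
{
    pending = npending;
    reads = 0;
    (void)fake_get_irq(); /* dispatch, or count a spurious interrupt */
    return reads;
}
```

In the common case of one pending interrupt, the looping version always pays for a second read just to learn that nothing else is waiting, while the single-read version pays for one. The loop only wins when several interrupts are pending at once, which the patch description says hardly ever happens.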
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Fri Jul 16 08:39:10 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Thu, 15 Jul 2004 17:39:10 -0500 Subject: Broken LPAR cpu bringup In-Reply-To: <1089865083.13914.3.camel@biclops.private.network> References: <1089850441.10000.62.camel@nighthawk> <1089865083.13914.3.camel@biclops.private.network> Message-ID: <1089931149.12866.5.camel@pants.austin.ibm.com> On Wed, 2004-07-14 at 23:29, Nathan Lynch wrote: > Does this fix it? The problem is that ppc64 doesn't know about > cpu_present_map. The arch-independent code copies cpu_possible_map to > cpu_present_map if the latter has not been modified by the arch bootup > code -- as you have discovered, this breaks on LPAR. Well, that's not the only problem, it seems. With this one on top of the previous patch, does it work? I have tested it on a similar configuration, but not Power4. Nathan diff -prauN -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c 2.6.8-rc1-mm1.2/arch/ppc64/kernel/smp.c --- 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c 2004-07-15 17:27:54.000000000 -0500 +++ 2.6.8-rc1-mm1.2/arch/ppc64/kernel/smp.c 2004-07-15 16:59:46.000000000 -0500 @@ -373,7 +373,7 @@ static inline int __devinit smp_startup_ /* At boot time the cpus are already spinning in hold * loops, so nothing to do. */ - if (system_state == SYSTEM_BOOTING) + if (system_state < SYSTEM_RUNNING) return 1; pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu)); @@ -423,6 +423,11 @@ static inline void look_for_more_cpus(vo maxcpus = ireg[num_addr_cell + num_size_cell]; /* DRENG need to account for threads here too */ + if ((cur_cpu_spec->cpu_features & CPU_FTR_SMT) && + ((naca->smt_state == SMT_ON) || (naca->smt_state == SMT_DYNAMIC))) + maxcpus *= 2; + + if (maxcpus > NR_CPUS) { printk(KERN_WARNING "Partition configured for %d cpus, " ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From haveblue at us.ibm.com Fri Jul 16 08:43:34 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Thu, 15 Jul 2004 15:43:34 -0700 Subject: Broken LPAR cpu bringup In-Reply-To: <1089931149.12866.5.camel@pants.austin.ibm.com> References: <1089850441.10000.62.camel@nighthawk> <1089865083.13914.3.camel@biclops.private.network> <1089931149.12866.5.camel@pants.austin.ibm.com> Message-ID: <1089931414.32312.13.camel@nighthawk> On Thu, 2004-07-15 at 15:39, Nathan Lynch wrote: > On Wed, 2004-07-14 at 23:29, Nathan Lynch wrote: > > Does this fix it? The problem is that ppc64 doesn't know about > > cpu_present_map. The arch-independent code copies cpu_possible_map to > > cpu_present_map if the latter has not been modified by the arch bootup > > code -- as you have discovered, this breaks on LPAR. > > Well, that's not the only problem, it seems. With this one on top of > the previous patch, does it work? I have tested it on a similar > configuration, but not Power4. They still get stuck on my tree. I'll try on a plain tree in a little bit. -- Dave ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From haveblue at us.ibm.com Fri Jul 16 08:57:54 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Thu, 15 Jul 2004 15:57:54 -0700 Subject: Broken LPAR cpu bringup In-Reply-To: <1089931414.32312.13.camel@nighthawk> References: <1089850441.10000.62.camel@nighthawk> <1089865083.13914.3.camel@biclops.private.network> <1089931149.12866.5.camel@pants.austin.ibm.com> <1089931414.32312.13.camel@nighthawk> Message-ID: <1089932274.32312.23.camel@nighthawk> On Thu, 2004-07-15 at 15:43, Dave Hansen wrote: > On Thu, 2004-07-15 at 15:39, Nathan Lynch wrote: > > On Wed, 2004-07-14 at 23:29, Nathan Lynch wrote: > > > Does this fix it? The problem is that ppc64 doesn't know about > > > cpu_present_map. 
The arch-independent code copies cpu_possible_map to > > > cpu_present_map if the latter has not been modified by the arch bootup > > > code -- as you have discovered, this breaks on LPAR. > > > > Well, that's not the only problem, it seems. With this one on top of > > the previous patch, does it work? I have tested it on a similar > > configuration, but not Power4. > > They still get stuck on my tree. I'll try on a plain tree in a little > bit. Nope, still oopses in the load_balance sched domains code: cpu 0x1: Vector: 380 (Data SLB Access) at [c00000000ca8b420] pc: c000000000046644: .find_busiest_group+0x24c/0x470 lr: c00000000004681c: .find_busiest_group+0x424/0x470 sp: c00000000ca8b6a0 msr: 8000000000001032 dar: 10 current = 0xc00000000ca80da0 paca = 0xc00000000033c900 pid = 0, comm = swapper 1:mon> t [c00000000ca8b7c0] c0000000000469c4 .load_balance+0x5c/0x1c4 [c00000000ca8b880] c000000000046f6c .rebalance_tick+0x120/0x144 [c00000000ca8b930] c000000000057324 .update_process_times+0x44/0x60 [c00000000ca8b9c0] c000000000036f18 .smp_local_timer_interrupt+0x40/0x50 [c00000000ca8ba30] c000000000013050 .timer_interrupt+0xf4/0x394 [c00000000ca8bb10] c00000000000a2b4 Decrementer_common+0xb4/0x100 [c00000000ca8be90] c0000000000128d0 .cpu_idle+0x2c/0x44 [c00000000ca8bf00] c0u 0x0: : .start_secondaryy+0xd8/0x11c I wonder if the sched domains didn't get set up correctly. -- Dave ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From haveblue at us.ibm.com Fri Jul 16 09:03:24 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Thu, 15 Jul 2004 16:03:24 -0700 Subject: Broken LPAR cpu bringup In-Reply-To: <1089931414.32312.13.camel@nighthawk> References: <1089850441.10000.62.camel@nighthawk> <1089865083.13914.3.camel@biclops.private.network> <1089931149.12866.5.camel@pants.austin.ibm.com> <1089931414.32312.13.camel@nighthawk> Message-ID: <1089932604.32312.28.camel@nighthawk> I'm just going to keep replying to myself. It's fun. 
I just noticed that, with your new patch, CPU0 also doesn't boot all the way. This happens about the time that init is exec'd off. The CPU0 oops is garbled, so here is what I have. I don't have any modules and _end is at c00000000054b000, so it appears to have jumped off into lala land. Unrecoverable FP Unavailable Exception 800 at c00000000ca8aa90 s msr: 8000000000009032 current = 0xc00000003f8ac6f0 paca = 0xc00000000033c000 pid = 246, comm = hotplug 1:mon> 1:mon> c0 0:mon> 0:mon> 0:mon> r R00 = c00000000ca8aa90 R16 = 0000000000000080 R01 = c00000000ca8a9e0 R17 = 0000000000000080 R02 = c0000000004a5470 R18 = 0000000000000080 R03 = 0000000000000002 R19 = c00000000b17d588 R04 = 0000000000000000 R20 = 0000000000000001 R05 = 0000000000000001 R21 = 0000000000000000 R06 = 0a00000000000000 R22 = 0000000000000000 R07 = 0000000000000000 R23 = 0000000000000000 R08 = c0000000002e3809 R24 = c00000000ca8a8b0 R09 = 0000000044282444 R25 = 0000000000000001 R10 = c00000000ca8a820 R26 = c00000000002bad0 R11 = 0000000048282424 R27 = 0000000084282444 R12 = 00000000000cdfd0 R28 = c00000000ca8a8b0 R13 = c00000000033c900 R29 = 0000000000000000 R14 = 0000000000000000 R30 = 5b00000000012634 R15 = c000000000330c38 R31 = 0000000000000000 pc = c00000000ca8aa90 lr = c00000000ca8aa90 msr = 8000000000009032 cr = d00000048282424 ctr = 00000000000cdfd0 xer = 0000000000000000 trap = a 0:mon> t [link register ] c00000000ca8aa90 [c00000000ca8a9e0] c00000000ca8aa80 (unreliable) [c00000000ca8aa50] 0d00000000044508 [c00000000ca8aae0] 0a28244224282424 [c00000000ca8ab50] c00000000ca8ac00 [c00000000ca8ac00] 0000000000000000 SP (d0000000ca8ae20) is in userspace 0:mon> -- Dave ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From haveblue at us.ibm.com Fri Jul 16 09:12:41 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Thu, 15 Jul 2004 16:12:41 -0700 Subject: oops bringing up secondary cpus In-Reply-To: <20040715062936.GA27715@krispykreme> References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> Message-ID: <1089933160.32312.32.camel@nighthawk> On Wed, 2004-07-14 at 23:29, Anton Blanchard wrote: > > cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10] > > pc: 000000000000beac > > lr: 000000000000be88 > > sp: c0000002fff3bf90 > > msr: 8000000000001002 > > dar: c0000002fff3bf90 > > dsisr: a000000 > > current = 0xc0000002fff30920 > > paca = 0xc00000000035a000 > > pid = 0, comm = swapper > > Looks like you are trying to access a virtual address with IR and DR > off (ie you are in real mode). Thanks for taking a look. First of all, where are IR and DR represented in the register dump? Also, does __secondary_start happen in real mode? It appears to have some virtual addresses (of the paca) handed in to it, and it says that relocation is off. -- Dave ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Fri Jul 16 09:20:08 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 16 Jul 2004 09:20:08 +1000 Subject: oops bringing up secondary cpus In-Reply-To: <1089933160.32312.32.camel@nighthawk> References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk> Message-ID: <20040715232008.GA17574@krispykreme> > > > cpu 0x1: Vector: 300 (Data Access) at [c0000002fff3bd10] > > > pc: 000000000000beac > > > lr: 000000000000be88 > > > sp: c0000002fff3bf90 > > > msr: 8000000000001002 > > > dar: c0000002fff3bf90 > > > dsisr: a000000 > > > current = 0xc0000002fff30920 > > > paca = 0xc00000000035a000 > > > pid = 0, comm = swapper > > > > Looks like you are trying to access a virtual address with IR and DR > > off (ie you are in real mode). > > Thanks for taking a look. First of all, where are IR and DR represented > in the register dump? The msr holds all that info; check out include/asm/processor.h: #define MSR_IR_LG 5 /* Instruction Relocate */ #define MSR_DR_LG 4 /* Data Relocate */ > Also, does __secondary_start happen in real mode? It appears to have > some virtual addresses (of the paca) handed in to it, and it says that > relocation is off. Ahh, sorry, I forgot to tell you about a trick we use. In real mode the top 2 bits are ignored, so looking at the DAR you are trying to access 0x2fff3bf90. Now one reason we would take a 300 on a real address is if you tried to touch memory outside the RMO region. Since the address is above 4GB I'm guessing this is your problem. Since r1 is the same as the DAR it sounds like it's to do with current_set being outside the RMO: LOADADDR(r3,current_set) sldi r28,r24,3 /* get current_set[cpu#] */ ldx r1,r3,r28 etc. Anton ** Sent via the linuxppc64-dev mail list. 
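The two hints in the message above (the MSR relocation bits and the real-mode address masking) can be checked by hand against the oops. What follows is only a minimal sketch in C, assuming the MSR_IR_LG and MSR_DR_LG values quoted from include/asm/processor.h and taking the "top 2 bits are ignored" rule literally. The helper names msr_bit() and real_mode_addr() are made up for illustration and are not kernel functions.

```c
#include <stdint.h>

/* Bit positions as quoted from include/asm/processor.h in the thread. */
#define MSR_IR_LG 5 /* Instruction Relocate */
#define MSR_DR_LG 4 /* Data Relocate */

/* Hypothetical helper: return the value (0 or 1) of bit 'lg' of the MSR. */
static inline int msr_bit(uint64_t msr, int lg)
{
	return (int)((msr >> lg) & 1);
}

/*
 * Hypothetical helper: model the real-mode rule described above by
 * clearing the top 2 bits of a 64-bit effective address.
 */
static inline uint64_t real_mode_addr(uint64_t ea)
{
	return ea & ~(3ULL << 62);
}
```

With the values from the oops, msr 8000000000001002 has both IR and DR clear (real mode), and dar c0000002fff3bf90 masks down to 0x2fff3bf90, which is above 4GB as noted. For comparison, the msr 8000000000001032 seen in the sched domains oops has both bits set.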
See http://lists.linuxppc.org/ From haveblue at us.ibm.com Fri Jul 16 09:33:44 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Thu, 15 Jul 2004 16:33:44 -0700 Subject: oops bringing up secondary cpus In-Reply-To: <20040715232008.GA17574@krispykreme> References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk> <20040715232008.GA17574@krispykreme> Message-ID: <1089934424.32312.37.camel@nighthawk> On Thu, 2004-07-15 at 16:20, Anton Blanchard wrote: > Ahh sorry I forgot to tell you about a trick we use. In real mode the > top 2 bits are ignored, so looking at the DAR you are trying to access > 0x2fff3bf90. Aha! That really clears it up. > Now one reason we would take a 300 on a real address is if you tried to > touch memory outside the RMO region. Since the address is above 4GB Im > guessing this is your problem. > > Since r1 is the same as the the DAR it sounds like its to do with > current_set being outside the RMO: > > LOADADDR(r3,current_set) > sldi r28,r24,3 /* get current_set[cpu#] */ > ldx r1,r3,r28 OK, but current_set is the task info for the idle process for the new CPU, right? It looks to me like it's allocated like any old task using copy_process() in smp_create_idle(). This should use kmalloc() like any other task, and that's certainly not guaranteed to be in the RMO. Did this change recently? Do we need to lmb_alloc() the idle task struct for the secondary CPUs? -- Dave ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From anton at samba.org Fri Jul 16 09:53:21 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 16 Jul 2004 09:53:21 +1000 Subject: oops bringing up secondary cpus In-Reply-To: <1089934424.32312.37.camel@nighthawk> References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk> <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk> Message-ID: <20040715235321.GB17574@krispykreme> > OK, but current_set is the task info for the idle process for the new > CPU, right? It looks to me like it's allocated like any old task using > copy_process() in smp_create_idle(). This should use kmalloc() like any > other task, and that's certainly not guaranteed to be in the RMO. Did > this change recently? Do we need to lmb_alloc() the idle task struct > for the secondary CPUs? To be honest I can't see where we touch r1 in __secondary_start; have you made any changes to it? BTW the rfid at the end of __secondary_start is where we go to virtual mode, so you only have to worry about the code before that point. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From haveblue at us.ibm.com Fri Jul 16 10:00:55 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Thu, 15 Jul 2004 17:00:55 -0700 Subject: oops bringing up secondary cpus In-Reply-To: <20040715235321.GB17574@krispykreme> References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk> <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk> <20040715235321.GB17574@krispykreme> Message-ID: <1089936055.32312.40.camel@nighthawk> On Thu, 2004-07-15 at 16:53, Anton Blanchard wrote: > > OK, but current_set is the task info for the idle process for the new > > CPU, right? It looks to me like it's allocated like any old task using > > copy_process() in smp_create_idle(). 
This should use kmalloc() like any > > other task, and that's certainly not guaranteed to be in the RMO. Did > > this change recently? Do we need to lmb_alloc() the idle task struct > > for the secondary CPUs? > > To be honest I cant see where we touch r1 in __secondary_start, have you > made in changes to it? BTW the rfid at the end of __secondary_start is > where we go to virtual mode, so you only have to worry about the code > before that point. No, I haven't modified it, but I'm running 2.6.7-mm2, which looks a bit different than current mainline. I'll update and try again. -- Dave ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Fri Jul 16 10:47:29 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 16 Jul 2004 10:47:29 +1000 Subject: page align emergency stack In-Reply-To: <20040715145708.GG27715@krispykreme> References: <20040715145708.GG27715@krispykreme> Message-ID: <20040716004729.GA24753@zax> On Fri, Jul 16, 2004 at 12:57:08AM +1000, Anton Blanchard wrote: > > Page align the emergency stack. > > Signed-off-by: Anton Blanchard Do we actually need to do this? I noted that the old guard pages were page aligned, but couldn't see any particular reason for it, so I didn't transfer the alignment to the new version. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Fri Jul 16 11:11:41 2004 From: anton at samba.org (Anton Blanchard) Date: Fri, 16 Jul 2004 11:11:41 +1000 Subject: page align emergency stack In-Reply-To: <20040716004729.GA24753@zax> References: <20040715145708.GG27715@krispykreme> <20040716004729.GA24753@zax> Message-ID: <20040716011141.GC17574@krispykreme> > Do we actually need to do this? 
I noted that the old guard pages were > page aligned, but couldn't see any particular reason for it, so I > didn't transfer the alignment to the new version. The ABI requires us to have 128-bit alignment, doesn't it? I'm thinking about what would happen if we saved altivec registers to the stack. Anton ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Fri Jul 16 11:39:01 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 16 Jul 2004 11:39:01 +1000 Subject: page align emergency stack In-Reply-To: <20040716011141.GC17574@krispykreme> References: <20040715145708.GG27715@krispykreme> <20040716004729.GA24753@zax> <20040716011141.GC17574@krispykreme> Message-ID: <20040716013901.GC24753@zax> On Fri, Jul 16, 2004 at 11:11:41AM +1000, Anton Blanchard wrote: > > > > Do we actually need to do this? I noted that the old guard pages were > > page aligned, but couldn't see any particular reason for it, so I > > didn't transfer the alignment to the new version. > > The ABI requires us to have 128-bit alignment, doesn't it? I'm thinking > about what would happen if we saved altivec registers to the stack. Ok, that's not quite the same thing as page alignment... -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Fri Jul 16 16:10:38 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 16 Jul 2004 16:10:38 +1000 Subject: [PPC64, TRIVIAL] Rename confusing locks in ras.c, rtasd.c Message-ID: <20040716061038.GD26044@zax> Andrew, please apply: Both arch/ppc64/kernel/ras.c and arch/ppc64/kernel/rtasd.c have a spinlock variable declared static called "log_lock". Since the code in these files interacts quite a lot, having two different locks with identical names is manifestly confusing. 
This patch renames both locks to something a little clearer. In the case of ras.c it also renames the buffer protected by the lock to a more usefully greppable name. Signed-off-by: David Gibson Index: working-2.6/arch/ppc64/kernel/ras.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/ras.c +++ working-2.6/arch/ppc64/kernel/ras.c @@ -109,8 +109,8 @@ } __initcall(init_ras_IRQ); -static struct rtas_error_log log_buf; -static spinlock_t log_lock = SPIN_LOCK_UNLOCKED; +static struct rtas_error_log ras_log_buf; +static spinlock_t ras_log_buf_lock = SPIN_LOCK_UNLOCKED; /* * Handle power subsystem events (EPOW). @@ -126,17 +126,17 @@ unsigned int size = sizeof(log_entry); int status = 0xdeadbeef; - spin_lock(&log_lock); + spin_lock(&ras_log_buf_lock); status = rtas_call(rtas_token("check-exception"), 6, 1, NULL, 0x500, irq, RTAS_EPOW_WARNING | RTAS_POWERMGM_EVENTS, 1, /* Time Critical */ - __pa(&log_buf), size); + __pa(&ras_log_buf), size); - log_entry = log_buf; + log_entry = ras_log_buf; - spin_unlock(&log_lock); + spin_unlock(&ras_log_buf_lock); udbg_printf("EPOW <0x%lx 0x%x>\n", *((unsigned long *)&log_entry), status); @@ -165,17 +165,17 @@ int status = 0xdeadbeef; int fatal; - spin_lock(&log_lock); + spin_lock(&ras_log_buf_lock); status = rtas_call(rtas_token("check-exception"), 6, 1, NULL, 0x500, irq, RTAS_INTERNAL_ERROR, 1, /* Time Critical */ - __pa(&log_buf), size); + __pa(&ras_log_buf), size); - log_entry = log_buf; + log_entry = ras_log_buf; - spin_unlock(&log_lock); + spin_unlock(&ras_log_buf_lock); if ((status == 0) && (log_entry.severity >= SEVERITY_ERROR_SYNC)) fatal = 1; Index: working-2.6/arch/ppc64/kernel/rtasd.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/rtasd.c +++ working-2.6/arch/ppc64/kernel/rtasd.c @@ -33,7 +33,7 @@ #define DEBUG(A...) 
#endif -static spinlock_t log_lock = SPIN_LOCK_UNLOCKED; +static spinlock_t rtasd_log_lock = SPIN_LOCK_UNLOCKED; DECLARE_WAIT_QUEUE_HEAD(rtas_log_wait); @@ -152,7 +152,7 @@ if (buf == NULL) return; - spin_lock_irqsave(&log_lock, s); + spin_lock_irqsave(&rtasd_log_lock, s); /* get length and increase count */ switch (err_type & ERR_TYPE_MASK) { @@ -163,7 +163,7 @@ break; case ERR_TYPE_KERNEL_PANIC: default: - spin_unlock_irqrestore(&log_lock, s); + spin_unlock_irqrestore(&rtasd_log_lock, s); return; } @@ -174,7 +174,7 @@ /* Check to see if we need to or have stopped logging */ if (fatal || no_more_logging) { no_more_logging = 1; - spin_unlock_irqrestore(&log_lock, s); + spin_unlock_irqrestore(&rtasd_log_lock, s); return; } @@ -199,12 +199,12 @@ else rtas_log_start += 1; - spin_unlock_irqrestore(&log_lock, s); + spin_unlock_irqrestore(&rtasd_log_lock, s); wake_up_interruptible(&rtas_log_wait); break; case ERR_TYPE_KERNEL_PANIC: default: - spin_unlock_irqrestore(&log_lock, s); + spin_unlock_irqrestore(&rtasd_log_lock, s); return; } @@ -247,24 +247,24 @@ return -ENOMEM; - spin_lock_irqsave(&log_lock, s); + spin_lock_irqsave(&rtasd_log_lock, s); /* if it's 0, then we know we got the last one (the one in NVRAM) */ if (rtas_log_size == 0 && !no_more_logging) nvram_clear_error_log(); - spin_unlock_irqrestore(&log_lock, s); + spin_unlock_irqrestore(&rtasd_log_lock, s); error = wait_event_interruptible(rtas_log_wait, rtas_log_size); if (error) goto out; - spin_lock_irqsave(&log_lock, s); + spin_lock_irqsave(&rtasd_log_lock, s); offset = rtas_error_log_buffer_max * (rtas_log_start & LOG_NUMBER_MASK); memcpy(tmp, &rtas_log_buf[offset], count); rtas_log_start += 1; rtas_log_size -= 1; - spin_unlock_irqrestore(&log_lock, s); + spin_unlock_irqrestore(&rtasd_log_lock, s); error = copy_to_user(buf, tmp, count) ? -EFAULT : count; out: -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. 
http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Fri Jul 16 16:13:27 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 16 Jul 2004 16:13:27 +1000 Subject: RFC: Fix (another) bug in rtas logging Message-ID: <20040716061327.GE26044@zax> Have I missed anything here? This patch certainly fixes the crash on boot I've been getting on our p630. The recent patch changing the rtas error logging had a bug. It can result in rtas_call() attempting to call kmalloc() too early (from setup_arch() before the slab caches are initialized), leading to an oops on boot. I can see no reason that log_error() can't be called with the rtas.lock still held, so this patch avoids the race addressed by the original patch by simply moving the log_error() call under the lock. Signed-off-by: David Gibson Index: working-2.6/arch/ppc64/kernel/rtas.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/rtas.c +++ working-2.6/arch/ppc64/kernel/rtas.c @@ -114,7 +114,6 @@ int i, logit = 0; unsigned long s; struct rtas_args *rtas_args; - char * buff_copy = NULL; int ret; PPCDBG(PPCDBG_RTAS, "Entering rtas_call\n"); @@ -165,19 +164,12 @@ /* Log the error in the unlikely case that there was one. */ if (unlikely(logit)) { - buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC); - if (buff_copy) { - memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX); - } + log_error(rtas_err_buf, ERR_TYPE_RTAS_LOG, 0); } /* Gotta do something different here, use global lock for now... */ spin_unlock_irqrestore(&rtas.lock, s); - if (buff_copy) { - log_error(buff_copy, ERR_TYPE_RTAS_LOG, 0); - kfree(buff_copy); - } return ret; } -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. 
See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sat Jul 17 00:44:19 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 16 Jul 2004 10:44:19 -0400 Subject: RFC: Fix (another) bug in rtas logging In-Reply-To: <20040716061327.GE26044@zax> References: <20040716061327.GE26044@zax> Message-ID: <1089989058.2487.63.camel@gaston> On Fri, 2004-07-16 at 02:13, David Gibson wrote: > Have I missed anything here? This bug certainly fixes the crash on > boot I've been getting on our p630. > > The recent patch changing the rtas error logging had a bug. It can > result in rtas_call() attempting to call kmalloc() too early (from > setup_arch() before the slab caches are initialized), leading to an > oops on boot. > > I can see no reason that log_error() can't be called with the > rtas.lock still held, so this patch avoids race addresses by the > original race by simply moving the log_error() call under the lock. No, log_error can call ppc_md.log_error which can call some nvram stuff itself calling RTAS afaik Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sat Jul 17 04:49:14 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 16 Jul 2004 14:49:14 -0400 Subject: oops bringing up secondary cpus In-Reply-To: <20040715235321.GB17574@krispykreme> References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk> <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk> <20040715235321.GB17574@krispykreme> Message-ID: <1090003753.2487.86.camel@gaston> On Thu, 2004-07-15 at 19:53, Anton Blanchard wrote: > > OK, but current_set is the task info for the idle process for the new > > CPU, right? It looks to me like it's allocated like any old task using > > copy_process() in smp_create_idle(). This should use kmalloc() like any > > other task, and that's certainly not guaranteed to be in the RMO. 
Did > > this change recently? Do we need to lmb_alloc() the idle task struct > > for the secondary CPUs? > > To be honest I can't see where we touch r1 in __secondary_start; have you > made any changes to it? BTW the rfid at the end of __secondary_start is > where we go to virtual mode, so you only have to worry about the code > before that point. I remember fixing just that bug after paulus' rewrite went in; we were touching the stack before enabling the MMU.... Ben. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Sat Jul 17 04:53:15 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 16 Jul 2004 14:53:15 -0400 Subject: oops bringing up secondary cpus In-Reply-To: <1089936055.32312.40.camel@nighthawk> References: <1089767063.4370.160.camel@nighthawk> <20040715062936.GA27715@krispykreme> <1089933160.32312.32.camel@nighthawk> <20040715232008.GA17574@krispykreme> <1089934424.32312.37.camel@nighthawk> <20040715235321.GB17574@krispykreme> <1089936055.32312.40.camel@nighthawk> Message-ID: <1090003995.2487.88.camel@gaston> Ok, here's the patch I sent that got in last month fixing that: -----Forwarded Message----- From: Benjamin Herrenschmidt To: Andrew Morton Cc: Linus Torvalds , Paul Mackerras , Linux Kernel list Subject: [PATCH] ppc64: Fix booting on LPAR machines with more than 1 CPU Date: Thu, 24 Jun 2004 11:28:39 -0500 Hi ! The exception rewrite contains a small bug that prevents bring up of CPUs on logically partitioned machines. The kernel is trying to zero the backlink on the new stack while running with relocation disabled, which can cause it to try to access an address outside of the region allowed in real mode. This seems to be a leftover from previous code as we also zero the backlink later after turning on the MMU. This patch removes the offending bit. 
===== arch/ppc64/kernel/head.S 1.61 vs edited ===== --- 1.61/arch/ppc64/kernel/head.S 2004-06-17 00:46:06 -05:00 +++ edited/arch/ppc64/kernel/head.S 2004-06-24 11:25:41 -05:00 @@ -1833,8 +1833,6 @@ sldi r28,r24,3 /* get current_set[cpu#] */ ldx r1,r3,r28 addi r1,r1,THREAD_SIZE-STACK_FRAME_OVERHEAD - li r0,0 - std r0,0(r1) std r1,PACAKSAVE(r13) ld r3,PACASTABREAL(r13) /* get raddr of segment table */ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From david at gibson.dropbear.id.au Mon Jul 19 10:52:23 2004 From: david at gibson.dropbear.id.au (David Gibson) Date: Mon, 19 Jul 2004 10:52:23 +1000 Subject: RFC: Fix (another) bug in rtas logging In-Reply-To: <1089989058.2487.63.camel@gaston> References: <20040716061327.GE26044@zax> <1089989058.2487.63.camel@gaston> Message-ID: <20040719005223.GB10537@zax> On Fri, Jul 16, 2004 at 10:44:19AM -0400, Benjamin Herrenschmidt wrote: > > On Fri, 2004-07-16 at 02:13, David Gibson wrote: > > Have I missed anything here? This bug certainly fixes the crash on > > boot I've been getting on our p630. > > > > The recent patch changing the rtas error logging had a bug. It can > > result in rtas_call() attempting to call kmalloc() too early (from > > setup_arch() before the slab caches are initialized), leading to an > > oops on boot. > > > > I can see no reason that log_error() can't be called with the > > rtas.lock still held, so this patch avoids race addresses by the > > original race by simply moving the log_error() call under the lock. > > No, log_error can call ppc_md.log_error which can call some nvram > stuff itself calling RTAS afaik Ah, yes, good point. -- David Gibson | For every complex problem there is a david AT gibson.dropbear.id.au | solution which is simple, neat and | wrong. http://www.ozlabs.org/people/dgibson ** Sent via the linuxppc64-dev mail list. 
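The objection raised in the rtas logging thread above (log_error() can reach RTAS again through ppc_md.log_error and the nvram code) comes down to taking a non-recursive lock twice on the same path. The following is a toy model in C, not kernel code: every toy_* name is hypothetical, and a failed toy_trylock() stands in for the point where a real spinlock would spin forever.

```c
#include <stdbool.h>

/* Toy stand-in for a non-recursive spinlock: it only records whether it is held. */
struct toy_lock {
	bool held;
};

/* Returns true if the lock was taken, false where a real spinlock would deadlock. */
static inline bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static inline void toy_unlock(struct toy_lock *l)
{
	l->held = false;
}

/* Hypothetical model of the nested RTAS call made from the logging path. */
static inline bool toy_nested_rtas_call(struct toy_lock *rtas_lock)
{
	if (!toy_trylock(rtas_lock))
		return false; /* lock already held by our caller */
	toy_unlock(rtas_lock);
	return true;
}

/*
 * Hypothetical model of the outer call: log either while still holding
 * the lock or after dropping it. Returns true if logging could complete.
 */
static inline bool toy_rtas_call(struct toy_lock *rtas_lock, bool log_under_lock)
{
	bool ok;

	if (!toy_trylock(rtas_lock))
		return false;

	if (log_under_lock) {
		ok = toy_nested_rtas_call(rtas_lock); /* re-enters with lock held */
		toy_unlock(rtas_lock);
	} else {
		toy_unlock(rtas_lock);
		ok = toy_nested_rtas_call(rtas_lock); /* lock is free again */
	}
	return ok;
}
```

In this model, logging under the lock fails on the nested acquisition, while logging after the unlock completes. That is the shape of the concern, under the assumption that the logging path really does re-enter RTAS.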
See http://lists.linuxppc.org/ From nfont at austin.ibm.com Tue Jul 20 02:52:25 2004 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Mon, 19 Jul 2004 11:52:25 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming Message-ID: <40FBFC49.1070804@austin.ibm.com> This patch will allow you to turn off the reporting of rtas messages to /var/log/messages. There have been several situations in which machines spew out too many rtas messages, thus making debugging more difficult. Being able to turn off the rtas message reporting will help by not having to wade through the tens or hundreds of (probably) unrelated rtas events in /var/log/messages while trying to debug a system. Anton or Paul, could you please review and push upstream, thanks. Signed-off-by: Nathan Fontenot -- Nathan F. -------------- next part -------------- A non-text attachment was scrubbed... Name: rtasmsgs.patch Type: text/x-patch Size: 3877 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040719/e15a5d44/attachment.bin From johnrose at austin.ibm.com Tue Jul 20 04:30:01 2004 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 19 Jul 2004 13:30:01 -0500 Subject: [PATCH] imalloc supersets Message-ID: <1090261801.18793.12.camel@sinatra.austin.ibm.com> The patch below implements the ability to query outstanding imalloc regions for a given virtual address range. The patch extends im_get_area() to allow a region criterion of IM_REGION_SUPERSET. For a particular "superset" virtual address and size passed into im_get_area(), the function returns the first outstanding region that is contained within this superset region. The patch also changes iounmap_explicit() to allow for the unmapping of all regions that fit under a "superset". This ability is necessary for PHB DLPAR. For a PHB removal, the RPA requires that all of its child slots already be dynamically removed. Each of these slot-level removals has fractured the imalloc region assigned to the PHB at boot. 
At PHB removal time, it is necessary to iounmap() the remaining artifacts of the initial PHB region. Thanks- John Signed-off-by: John Rose diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c sles9-rc5/arch/ppc64/mm/imalloc.c --- sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c 2004-07-16 16:11:54.000000000 -0500 +++ sles9-rc5/arch/ppc64/mm/imalloc.c 2004-07-19 11:14:29.000000000 -0500 @@ -37,33 +37,51 @@ static int get_free_im_addr(unsigned lon return 0; } +/* Return whether the region described by v_addr and size is a subset + * of the region described by parent + */ +static inline int im_region_is_subset(unsigned long v_addr, unsigned long size, + struct vm_struct *parent) +{ + return (int) (v_addr >= (unsigned long) parent->addr && + v_addr < (unsigned long) parent->addr + parent->size && + size < parent->size); +} + +/* Return whether the region described by v_addr and size is a superset + * of the region described by child + */ +static int im_region_is_superset(unsigned long v_addr, unsigned long size, + struct vm_struct *child) +{ + struct vm_struct parent; + + parent.addr = (void *) v_addr; + parent.size = size; + + return im_region_is_subset((unsigned long) child->addr, child->size, + &parent); +} + /* Return whether the region described by v_addr and size overlaps * the region described by vm. 
Overlapping regions meet the * following conditions: * 1) The regions share some part of the address space * 2) The regions aren't identical - * 3) The first region is not a subset of the second + * 3) Neither region is a subset of the other */ -static inline int im_region_overlaps(unsigned long v_addr, unsigned long size, +static int im_region_overlaps(unsigned long v_addr, unsigned long size, struct vm_struct *vm) { + if (im_region_is_superset(v_addr, size, vm)) + return 0; + return (v_addr + size > (unsigned long) vm->addr + vm->size && v_addr < (unsigned long) vm->addr + vm->size) || (v_addr < (unsigned long) vm->addr && v_addr + size > (unsigned long) vm->addr); } -/* Return whether the region described by v_addr and size is a subset - * of the region described by vm - */ -static inline int im_region_is_subset(unsigned long v_addr, unsigned long size, - struct vm_struct *vm) -{ - return (int) (v_addr >= (unsigned long) vm->addr && - v_addr < (unsigned long) vm->addr + vm->size && - size < vm->size); -} - /* Determine imalloc status of region described by v_addr and size. * Can return one of the following: * IM_REGION_UNUSED - Entire region is unallocated in imalloc space. @@ -73,28 +91,37 @@ static inline int im_region_is_subset(un * IM_REGION_EXISTS - Exact region already allocated in imalloc space. * vm will be assigned to a ptr to the existing imlist * member. - * IM_REGION_OVERLAPS - A portion of the region is already allocated in - * imalloc space. + * IM_REGION_OVERLAPS - Region overlaps an allocated region in imalloc space. + * IM_REGION_SUPERSET - Region is a superset of a region that is already + * allocated in imalloc space. 
*/ static int im_region_status(unsigned long v_addr, unsigned long size, struct vm_struct **vm) { struct vm_struct *tmp; - for (tmp = imlist; tmp; tmp = tmp->next) - if (v_addr < (unsigned long) tmp->addr + tmp->size) + for (tmp = imlist; tmp; tmp = tmp->next) + if (v_addr < (unsigned long) tmp->addr + tmp->size) break; - + if (tmp) { if (im_region_overlaps(v_addr, size, tmp)) return IM_REGION_OVERLAP; *vm = tmp; - if (im_region_is_subset(v_addr, size, tmp)) + if (im_region_is_subset(v_addr, size, tmp)) { + /* Return with tmp pointing to superset */ return IM_REGION_SUBSET; + } + if (im_region_is_superset(v_addr, size, tmp)) { + /* Return with tmp pointing to first subset */ + return IM_REGION_SUPERSET; + } else if (v_addr == (unsigned long) tmp->addr && - size == tmp->size) + size == tmp->size) { + /* Return with tmp pointing to exact region */ return IM_REGION_EXISTS; + } } *vm = NULL; @@ -208,6 +235,10 @@ static struct vm_struct * __im_get_area( tmp = split_im_region(req_addr, size, tmp); break; case IM_REGION_EXISTS: + /* Return requested region */ + break; + case IM_REGION_SUPERSET: + /* Return first existing subset of requested region */ break; default: printk(KERN_ERR "%s() unexpected imalloc region status\n", diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/init.c sles9-rc5/arch/ppc64/mm/init.c --- sles9-rc5-vanilla/arch/ppc64/mm/init.c 2004-07-16 16:11:54.000000000 -0500 +++ sles9-rc5/arch/ppc64/mm/init.c 2004-07-19 13:08:56.000000000 -0500 @@ -392,9 +392,28 @@ void iounmap(void *addr) return; } +static int iounmap_subset_regions(void *addr, unsigned long size) +{ + struct vm_struct *area; + + /* Check whether subsets of this region exist */ + area = im_get_area((unsigned long) addr, size, IM_REGION_SUPERSET); + if (area == NULL) + return 1; + + while (area) { + iounmap(area->addr); + area = im_get_area((unsigned long) addr, size, + IM_REGION_SUPERSET); + } + + return 0; +} + int iounmap_explicit(void *addr, unsigned long 
size) { struct vm_struct *area; + int rc; /* addr could be in EEH or IO region, map it to IO region regardless. */ @@ -407,12 +426,17 @@ int iounmap_explicit(void *addr, unsigne area = im_get_area((unsigned long) addr, size, IM_REGION_EXISTS | IM_REGION_SUBSET); if (area == NULL) { - printk(KERN_ERR "%s() cannot unmap nonexistant range 0x%lx\n", + /* Determine whether subset regions exist. If so, unmap */ + rc = iounmap_subset_regions(addr, size); + if (rc) { + printk(KERN_ERR + "%s() cannot unmap nonexistent range 0x%lx\n", __FUNCTION__, (unsigned long) addr); - return 1; + return 1; + } + } else { + iounmap(area->addr); } - - iounmap(area->addr); return 0; } ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From cfriesen at nortelnetworks.com Tue Jul 20 05:18:45 2004 From: cfriesen at nortelnetworks.com (Chris Friesen) Date: Mon, 19 Jul 2004 15:18:45 -0400 Subject: trying to netboot G5 Xserve HPC node Message-ID: <40FC1E95.2000501@nortelnetworks.com> I've got an HPC G5 Xserve (the one with no CD drive), and I'm attempting to boot linux on it. I've got dhcpd V3.0pl1 running on a G5 desktop (/etc/dhcp.conf included below). When I try and netboot the Xserve, it appears to issue a bootp request, the server sends a response, but the IP address offer is never accepted. What am I doing wrong? Is there any way to get the Xserve to show firmware boot logs on the serial console? 
The logs on the dhcp server look like this: Jul 19 18:03:57 g5-2 dhcpd: dhcp.c(2072): non-null pointer Jul 19 18:03:57 g5-2 dhcpd: DHCPDISCOVER from 00:0d:93:9b:a8:6c via eth0 Jul 19 18:03:57 g5-2 dhcpd: Received BootP request from Macintosh netboot client Jul 19 18:03:57 g5-2 dhcpd: DHCPOFFER on 10.40.200.5 to 00:0d:93:9b:a8:6c via eth0 Jul 19 18:04:01 g5-2 dhcpd: dhcp.c(2072): non-null pointer Jul 19 18:04:01 g5-2 dhcpd: DHCPDISCOVER from 00:0d:93:9b:a8:6c via eth0 Jul 19 18:04:01 g5-2 dhcpd: Received BootP request from Macintosh netboot client Jul 19 18:04:01 g5-2 dhcpd: DHCPOFFER on 10.40.200.5 to 00:0d:93:9b:a8:6c via eth0 Any ideas? Chris dhcpd.conf file on server ========================================================= # global dhcpd parameters deny unknown-clients; #disallow unknown connections ddns-update-style none; #disallow dynamic DNS updates allow bootp; #allow bootp requests, thus the dhcp #server will act as a bootp server # which network interface the server will listen on subnet 10.40.200.0 netmask 255.255.255.0 { #the zeros indicate which range } #of addresses are allowed to connect #set of parameters common to all clients group { option broadcast-address 10.40.200.255; option subnet-mask 255.255.255.0; host g5_3 { filename "yaboot"; server-name "tester"; next-server tester; hardware ethernet 00:0D:93:9B:A8:6C; fixed-address 10.40.200.5; } #you may paste another "host" entry here for additional clients on this network } ======================================================================= ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Tue Jul 20 08:10:14 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Mon, 19 Jul 2004 17:10:14 -0500 Subject: Broken LPAR cpu bringup In-Reply-To: <1089932274.32312.23.camel@nighthawk> References: <1089850441.10000.62.camel@nighthawk> <1089865083.13914.3.camel@biclops.private.network> <1089931149.12866.5.camel@pants.austin.ibm.com> <1089931414.32312.13.camel@nighthawk> <1089932274.32312.23.camel@nighthawk> Message-ID: <1090275014.15991.18.camel@pants.austin.ibm.com> On Thu, 2004-07-15 at 17:57, Dave Hansen wrote: > > They still get stuck on my tree. I'll try on a plain tree in a little > > bit. > > Nope, still oopses in the load_balance sched domains code: Ok, third (and hopefully final) attempt. Try this without any of the others I sent. I think this change caused it: http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.8-rc1/2.6.8-rc1-mm1/broken-out/detect-too-early-schedule-attempts.patch Backing out the above change works, too. Tested on a 4-way partition on an 8-way p650. Nathan diff -prauN -X /home/nathanl/working/dontdiff 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c --- 2.6.8-rc1-mm1/arch/ppc64/kernel/smp.c 2004-07-19 15:54:22.000000000 -0500 +++ 2.6.8-rc1-mm1.1/arch/ppc64/kernel/smp.c 2004-07-19 16:16:44.000000000 -0500 @@ -374,7 +374,7 @@ static inline int __devinit smp_startup_ /* At boot time the cpus are already spinning in hold * loops, so nothing to do. 
*/ - if (system_state == SYSTEM_BOOTING) + if (system_state < SYSTEM_RUNNING) return 1; pcpu = find_physical_cpu_to_start(get_hard_smp_processor_id(lcpu)); @@ -868,7 +868,7 @@ int __devinit __cpu_up(unsigned int cpu) int c; /* At boot, don't bother with non-present cpus -JSCHOPP */ - if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu)) + if (system_state < SYSTEM_RUNNING && !cpu_present_at_boot(cpu)) return -ENOENT; paca[cpu].prof_counter = 1; @@ -902,7 +902,7 @@ int __devinit __cpu_up(unsigned int cpu) * use this value that I found through experimentation. * -- Cort */ - if (system_state == SYSTEM_BOOTING) + if (system_state < SYSTEM_RUNNING) for (c = 5000; c && !cpu_callin_map[cpu]; c--) udelay(100); #ifdef CONFIG_HOTPLUG_CPU ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From hollisb at us.ibm.com Tue Jul 20 09:07:52 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Mon, 19 Jul 2004 18:07:52 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <40FBFC49.1070804@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> Message-ID: <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> On Jul 19, 2004, at 11:52 AM, Nathan Fontenot wrote: > This patch will allow you to turn off the reporting of rtas messages > to /var/log/messages. There have been several situations in which > machines spew out too many rtas messages thus making debugging more > difficult. Being able to turn off the rtas messages reporting will > help by not having to wade through the tens or hundreds of (probably) > unrelated rtas events in /var/log/messages while trying to debug a > system. Weren't people talking about using netlink rather than /proc for this? I think netlink involves creating a special type of socket and reading from that, so it's not easily shell-scriptable, but... -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From nfont at austin.ibm.com Tue Jul 20 23:26:53 2004 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Tue, 20 Jul 2004 08:26:53 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> Message-ID: <40FD1D9D.40406@austin.ibm.com> Hollis Blanchard wrote: > Weren't people were talking about using netlink rather than /proc for > this? I think netlink involves creating a special type of socket and > reading from that, so it's not easily shell-scriptable, but... There have been several discussions about dealing with rtas events. I believe the netlink issue dealt with getting rtas events to rtas_errd in user space. This patch just provides a switch to turn on/off (via /proc or a boot parameter) the reporting of rtas events to /var/log/messages and should have no effect on rtas events going to rtas_errd or nvram. Nathan F. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Tue Jul 20 23:31:43 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Tue, 20 Jul 2004 08:31:43 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> Message-ID: <20040720083143.2726671e.moilanen@austin.ibm.com> > > This patch will allow you to turn off the reporting of rtas messages > > to /var/log/messages. There have been several situations in which > > machines spew out too many rtas messages thus making debugging more > > difficult. Being able to turn off the rtas messages reporting will > > help by not having to wade through the tens or hundreds of (probably) > > unrelated rtas events in /var/log/messages while trying to debug a > > system.
> > Weren't people were talking about using netlink rather than /proc for > this? I think netlink involves creating a special type of socket and > reading from that, so it's not easily shell-scriptable, but... > Hollis, I think this is an intermediate step in moving towards netlink. Even when we move towards netlink, we still want to give an option to have the messages in /var/log/messages. Nathan, Do you think we should make a CONFIG option for whether this is set or not set by default, so the distros can decide to have it printed if they don't package ELA by default? Thanks, Jake ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nfont at austin.ibm.com Wed Jul 21 00:51:02 2004 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Tue, 20 Jul 2004 09:51:02 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <20040720083143.2726671e.moilanen@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> <20040720083143.2726671e.moilanen@austin.ibm.com> Message-ID: <40FD3156.5080509@austin.ibm.com> Jake Moilanen wrote: > Do you think we should make CONFIG option if this is set or not set by > default, so the distros can decide to have it printed if they don't > package ELA by default. I think the default (for now) should remain to print rtas events to /var/log/messages. There shouldn't be very many events for most systems. I am working on an update for rtas event handling but am not sure when this will be ready. Getting this patch in will at least help with rtas spam and debugging for now. -- Nathan F. ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From hollisb at us.ibm.com Wed Jul 21 00:57:32 2004 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Tue, 20 Jul 2004 09:57:32 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <40FD3156.5080509@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <75F11743-D9D8-11D8-A3B4-000A95A0560C@us.ibm.com> <20040720083143.2726671e.moilanen@austin.ibm.com> <40FD3156.5080509@austin.ibm.com> Message-ID: <20E20F6F-DA5D-11D8-8BEE-000A95A0560C@us.ibm.com> On Jul 20, 2004, at 9:51 AM, Nathan Fontenot wrote: > Jake Moilanen wrote: > >> Do you think we should make CONFIG option if this is set or not set by >> default, so the distros can decide to have it printed if they don't >> package ELA by default. > > I think the default (for now) should remain to print rtas events to > /var/log/messages. There shouldn't be very many events for most > systems. This has not been my experience. I hope it would be true for most production systems, but every (pre-release) machine I've worked on recently has loads of them, often as frequently as one message per second. These messages make our lives more difficult than they need to be. -- Hollis Blanchard IBM Linux Technology Center ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From greg at kroah.com Wed Jul 21 15:58:40 2004 From: greg at kroah.com (Greg KH) Date: Wed, 21 Jul 2004 01:58:40 -0400 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <40FBFC49.1070804@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> Message-ID: <20040721055840.GA18787@kroah.com> On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote: > + entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL); > + if (entry) > + entry->proc_fops = &ppc_rtas_msg_operations; > + Please do not do this in proc. It should be in sysfs (and the side effect of that will be that your code is smaller...) thanks, greg k-h ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From greg at kroah.com Thu Jul 22 03:53:20 2004 From: greg at kroah.com (Greg KH) Date: Wed, 21 Jul 2004 13:53:20 -0400 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <40FEBC0E.7060005@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> Message-ID: <20040721175320.GA16704@kroah.com> On Wed, Jul 21, 2004 at 01:55:10PM -0500, Nathan Fontenot wrote: > Greg KH wrote: > >On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote: > > > >>+ entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL); > >>+ if (entry) > >>+ entry->proc_fops = &ppc_rtas_msg_operations; > >>+ > > > > > >Please do not do this in proc. It should be in sysfs (and the side > >affect of that will be that your code is smaller...) > > > >thanks, > > > >greg k-h > > Agreed, and glad you mentioned it. > > Would anyone have a problem with moving everything from /proc/ppc64/rtas > to sysfs (/sys/firmware/rtas) ? No objection from me, please do this, it is the proper place for these files. thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nfont at austin.ibm.com Thu Jul 22 04:55:10 2004 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Wed, 21 Jul 2004 13:55:10 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <20040721055840.GA18787@kroah.com> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> Message-ID: <40FEBC0E.7060005@austin.ibm.com> Greg KH wrote: > On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote: > >>+ entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL); >>+ if (entry) >>+ entry->proc_fops = &ppc_rtas_msg_operations; >>+ > > > Please do not do this in proc. It should be in sysfs (and the side > affect of that will be that your code is smaller...) > > thanks, > > greg k-h Agreed, and glad you mentioned it. 
Would anyone have a problem with moving everything from /proc/ppc64/rtas to sysfs (/sys/firmware/rtas) ? It seems that sysfs is where all these files should live anyway. -- Nathan F. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From moilanen at austin.ibm.com Thu Jul 22 05:50:19 2004 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 21 Jul 2004 14:50:19 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <40FEBC0E.7060005@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> Message-ID: <20040721145019.0766d4ad.moilanen@austin.ibm.com> > > Would anyone have a problem with moving everything from /proc/ppc64/rtas > to sysfs (/sys/firmware/rtas) ? It seems that sysfs is where all these > files should live anyway. > I'm not sure I agree that files like error_log and firmware_flash should be in sysfs. Those should probably move to something like netlink, which we discussed previously. Do we need to move over some of the files at all (e.g. volume)? Thanks, Jake ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at us.ibm.com Thu Jul 22 05:56:45 2004 From: johnrose at us.ibm.com (John H Rose) Date: Wed, 21 Jul 2004 14:56:45 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <40FEBC0E.7060005@austin.ibm.com> Message-ID: I know this is the trendy idea, and I don't disagree with it, but keep in mind that you'll break tools. Update flash, for example. :) John ----------------------- John Rose pSeries Linux Development johnrose at austin.ibm.com Office: 512-838-0298 Tieline: 678-0298 [ Nathan Fontenot writes: ] > > Greg KH wrote: > > On Mon, Jul 19, 2004 at 11:52:25AM -0500, Nathan Fontenot wrote: > > > >>+ entry = create_proc_entry("ppc64/rtas/rtasmsgs", S_IRUSR, NULL); > >>+ if (entry) > >>+ entry->proc_fops = &ppc_rtas_msg_operations; > >>+ > > > > Please do not do this in proc.
It should be in sysfs (and the side > > affect of that will be that your code is smaller...) > > Agreed, and glad you mentioned it. > > Would anyone have a problem with moving everything from > /proc/ppc64/rtas to sysfs (/sys/firmware/rtas) ? It seems that sysfs > is where all these files should live anyway. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From haveblue at us.ibm.com Thu Jul 22 05:57:45 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Wed, 21 Jul 2004 12:57:45 -0700 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <40FEBC0E.7060005@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> Message-ID: <1090439865.5862.5.camel@nighthawk> On Wed, 2004-07-21 at 11:55, Nathan Fontenot wrote: > Would anyone have a problem with moving everything from /proc/ppc64/rtas > to sysfs (/sys/firmware/rtas) ? It seems that sysfs is where all these > files should live anyway. Some of them use pretty broken interfaces, and I don't think they should be perpetuated. The biggest offender in my mind is ppc64/rtas/rmo_buffer. It exports physical addresses so that they can be mapped with /dev/mem. I think a more refined interface is appropriate here. As for the other things in that directory, could you do an ls on your system and describe which ones you think need to be kept? -- Dave ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Jul 22 06:48:40 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 21 Jul 2004 15:48:40 -0500 Subject: Resending [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size Message-ID: <20040721204840.GD13171@austin.ibm.com> Resending, Trying to work around a mail gateway bug; sorry if you are receiving this message a second time. (If you suspect that folks are not getting your email, contact me, I think I now understand about half the problem. 
You need to configure your exim/postfix/sendmail to use a "smarthost". Valid smarthosts are austin.ibm.com, us.ibm.com, smtp.linux.ibm.com. If you don't use the smarthost, your email will go through pixpat.austin.ibm.com which is apparently blackholed by spam blacklists). --linas ----- Forwarded message from Mail Delivery System ----- ------ This is a copy of the message, including all the headers. ------ Return-path: Received: from linas by bilge with local (Exim 3.36 #1 (Debian)) id 1BmhA1-0002N0-00; Mon, 19 Jul 2004 18:03:05 -0500 Date: Mon, 19 Jul 2004 18:03:05 -0500 To: paulus at au1.ibm.com, paulus at samba.org Cc: linuxppc64-dev at lists.linuxppc.org, antonb at samba.org, benh at kernel.crashing.org Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size Message-ID: <20040719230305.GH7544 at bilge> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="32u276st3Jlj2kUU" Content-Disposition: inline User-Agent: Mutt/1.5.6+20040523i From: Linas Vepstas --32u276st3Jlj2kUU Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Paul, Please review & forward as appropriate. Firmware expects error log sizes to be of a very specific size, but different versions of firmware apparently expect different sizes; using the wrong size results in a painful, hard-to-debug crash in firmware. Benh provided a patch for this some months ago, but apparently missed this code path. This patch sets up the log buffer size dynamically; it also fixes a bug with the return code not being handled correctly. Signed-off-by: Linas Vepstas --linas --32u276st3Jlj2kUU Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="rtas-log-length.patch" ===== arch/ppc64/kernel/rtas.c 1.38 vs edited ===== --- 1.38/arch/ppc64/kernel/rtas.c Wed Jul 14 15:27:37 2004 +++ edited/arch/ppc64/kernel/rtas.c Mon Jul 19 17:11:51 2004 @@ -22,7 +22,6 @@ #include #include #include -#include #include #include #include @@ -73,7 +72,6 @@ return tokp ?
*tokp : RTAS_UNKNOWN_SERVICE; } - /** Return a copy of the detailed error text associated with the * most recent failed call to rtas. Because the error text * might go stale if there are any other intervening rtas calls, @@ -84,18 +82,26 @@ __fetch_rtas_last_error(void) { struct rtas_args err_args, save_args; + u32 bufsz; + + bufsz = rtas_token ("rtas-error-log-max"); + if ((bufsz == RTAS_UNKNOWN_SERVICE) || + (bufsz > RTAS_ERROR_LOG_MAX)) { + printk (KERN_WARNING "RTAS: bad log buffer size %d\n", bufsz); + bufsz = RTAS_ERROR_LOG_MAX; + } err_args.token = rtas_token("rtas-last-error"); err_args.nargs = 2; err_args.nret = 1; - err_args.rets = (rtas_arg_t *)&(err_args.args[2]); err_args.args[0] = (rtas_arg_t)__pa(rtas_err_buf); - err_args.args[1] = RTAS_ERROR_LOG_MAX; + err_args.args[1] = bufsz; err_args.args[2] = 0; save_args = rtas.args; rtas.args = err_args; + rtas.args.rets = (rtas_arg_t *)&(rtas.args.args[2]); PPCDBG(PPCDBG_RTAS, "\tentering rtas with 0x%lx\n", __pa(&err_args)); @@ -105,6 +111,7 @@ err_args = rtas.args; rtas.args = save_args; + err_args.rets = (rtas_arg_t *)&(err_args.args[2]); return err_args.rets[0]; } --32u276st3Jlj2kUU-- ----- End forwarded message ----- ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Jul 22 07:08:39 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 21 Jul 2004 16:08:39 -0500 Subject: Resend: Resend: [PATCH] [2.6] PPC64: log firmware errors during boot. Message-ID: <20040721210839.GP13171@austin.ibm.com> Resending another bounced email --linas ----- Forwarded message from Mail Delivery System ----- ------ This is a copy of the message, including all the headers.
------ Return-path: Received: from linas by bilge with local (Exim 3.36 #1 (Debian)) id 1BksOD-0008Cp-00; Wed, 14 Jul 2004 17:38:13 -0500 Date: Wed, 14 Jul 2004 17:38:13 -0500 To: paulus at au1.ibm.com, paulus at samba.org Cc: linuxppc64-dev at lists.linuxppc.org Subject: Resend: [PATCH] [2.6] PPC64: log firmware errors during boot. Message-ID: <20040714223813.GX17333 at bilge> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="idY8LE8SD6/8DnRI" Content-Disposition: inline User-Agent: Mutt/1.5.6+20040523i From: Linas Vepstas --idY8LE8SD6/8DnRI Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Paul, Resending patch from 26 June that remains unapplied. Topics related to this patch were discussed, but none of the discussions affected this patch directly, so I think the patch is still good to go ... Repeat of the original text: Firmware can report errors at any time, and not atypically during boot. However, these reports were being discarded until the rtasd comes up, which occurs fairly late in the boot cycle. As a result, firmware errors during boot were being silently ignored. Signed-off-by: Linas Vepstas --linas --idY8LE8SD6/8DnRI Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="rtas-log-boot-msgs.patch" --- arch/ppc64/kernel/rtasd.c.orig 2004-06-28 15:33:12.000000000 -0500 +++ arch/ppc64/kernel/rtasd.c 2004-06-29 18:51:31.000000000 -0500 @@ -57,6 +57,8 @@ volatile int error_log_cnt = 0; */ static unsigned char logdata[RTAS_ERROR_LOG_MAX]; +static int get_eventscan_parms(void); + /* To see this info, grep RTAS /var/log/messages and each entry * will be collected together with obvious begin/end. * There will be a unique identifier on the begin and end lines.
@@ -121,6 +123,9 @@ static int log_rtas_len(char * buf) len += err->extended_log_length; } + if (rtas_error_log_max == 0) { + get_eventscan_parms(); + } if (len > rtas_error_log_max) len = rtas_error_log_max; @@ -148,7 +153,6 @@ void pSeries_log_error(char *buf, unsign int len = 0; DEBUG("logging event\n"); - if (buf == NULL) return; @@ -171,6 +175,13 @@ void pSeries_log_error(char *buf, unsign if (!no_more_logging && !(err_type & ERR_FLAG_BOOT)) nvram_write_error_log(buf, len, err_type); + /* rtas errors can occur during boot, and we do want to capture + * those somewhere, even if nvram isn't ready (why not?), and even + * if rtasd isn't ready. Put them into the boot log, at least. */ + if ((err_type & ERR_TYPE_MASK) == ERR_TYPE_RTAS_LOG) { + printk_log_rtas(buf, len); + } + /* Check to see if we need to or have stopped logging */ if (fatal || no_more_logging) { no_more_logging = 1; --idY8LE8SD6/8DnRI-- ----- End forwarded message ----- ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Jul 22 07:10:01 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 21 Jul 2004 16:10:01 -0500 Subject: Resend: [PATCH] 2.6 ppc64 -- Yet another unbalanced pci_dev_get()/put() Message-ID: <20040721211001.GQ13171@austin.ibm.com> forwarding more bounced email, this one had a patch attached. --linas ----- Forwarded message from Mail Delivery System ----- ------ This is a copy of the message, including all the headers. 
------ Return-path: Received: from linas by bilge with local (Exim 3.36 #1 (Debian)) id 1Bkrua-0008BY-00; Wed, 14 Jul 2004 17:07:36 -0500 Date: Wed, 14 Jul 2004 17:07:36 -0500 To: paulus at au1.ibm.com, paulus at samba.org Cc: linuxppc64-dev at lists.linuxppc.org Subject: [PATCH] 2.6 ppc64 -- Yet another unbalanced pci_dev_get()/put() Message-ID: <20040714220736.GW17333 at bilge> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="TybLhxa8M7aNoW+V" Content-Disposition: inline User-Agent: Mutt/1.5.6+20040523i From: Linas Vepstas --TybLhxa8M7aNoW+V Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Paul, Please review & forward upstream ... This patch fixes yet another set of mismatched pci_dev_get() / pci_dev_put() calls. The bug should affect graphics cards only; no other card types will see a put() without a matching get(). The mismatch occurred because we ignore eeh errors on graphics cards :( There is a small timing window during which this bug may cause memory corruption on machines under heavy memory pressure while a pci graphics device is being hot-removed...
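The balancing rule described above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel code: struct toy_dev, toy_dev_get(), toy_dev_put(), the fixed-size cache[] array and both cache_*_device() functions are hypothetical stand-ins for struct pci_dev, pci_dev_get()/pci_dev_put() and the rb-tree address cache. The point it shows is that a reference is taken only if at least one range really went into the cache, and dropped only if at least one entry really came out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct toy_dev {
	int refcount;             /* models the device reference count */
	size_t nranges;           /* number of candidate resource ranges */
	unsigned long start[4];
	unsigned long end[4];
};

/* Hypothetical stand-in for the address cache (an rb-tree in the kernel). */
#define CACHE_SLOTS 16
struct cache_entry {
	struct toy_dev *dev;
	unsigned long start, end;
	int in_use;
};
static struct cache_entry cache[CACHE_SLOTS];

static void toy_dev_get(struct toy_dev *dev)
{
	dev->refcount++;
}

static void toy_dev_put(struct toy_dev *dev)
{
	/* A put without a matching get is exactly the bug being avoided. */
	assert(dev->refcount > 0);
	dev->refcount--;
}

/* Insert every valid range of dev; returns how many were inserted.
 * The reference is taken only if something was actually inserted. */
int cache_insert_device(struct toy_dev *dev)
{
	int inserted = 0;

	for (size_t i = 0; i < dev->nranges && i < 4; i++) {
		unsigned long s = dev->start[i], e = dev->end[i];

		/* Skip empty or all-ones ranges. */
		if (s == 0 || ~s == 0 || e == 0 || ~e == 0)
			continue;
		for (int slot = 0; slot < CACHE_SLOTS; slot++) {
			if (!cache[slot].in_use) {
				cache[slot] = (struct cache_entry){ dev, s, e, 1 };
				inserted++;
				break;
			}
		}
	}
	if (inserted)
		toy_dev_get(dev);
	return inserted;
}

/* Remove every entry owned by dev; returns how many were removed.
 * The reference is dropped only if something was actually removed. */
int cache_remove_device(struct toy_dev *dev)
{
	int removed = 0;

	for (int slot = 0; slot < CACHE_SLOTS; slot++) {
		if (cache[slot].in_use && cache[slot].dev == dev) {
			cache[slot].in_use = 0;
			removed++;
		}
	}
	if (removed)
		toy_dev_put(dev);
	return removed;
}
```

In this sketch a device that contributes no ranges (standing in for the skipped graphics card) passes through insert and remove with its reference count unchanged, while a device with one valid range gains one reference on insert and loses it again on remove.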
--linas --TybLhxa8M7aNoW+V Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="eeh-more-unbalanced-get-put.patch" ===== arch/ppc64/kernel/eeh.c 1.28 vs edited ===== --- 1.28/arch/ppc64/kernel/eeh.c Mon Jul 12 18:29:16 2004 +++ edited/arch/ppc64/kernel/eeh.c Wed Jul 14 15:40:47 2004 @@ -250,6 +250,7 @@ static void __pci_addr_cache_insert_device(struct pci_dev *dev) { struct device_node *dn; + int really_did_insert = 0; int i; dn = pci_device_to_OF_node(dev); @@ -268,7 +269,6 @@ #endif return; } - pci_dev_get(dev); /* Walk resources on this device, poke them into the tree */ for (i = 0; i < DEVICE_COUNT_RESOURCE; i++) { @@ -282,6 +282,11 @@ if (start == 0 || ~start == 0 || end == 0 || ~end == 0) continue; pci_addr_cache_insert(dev, start, end, flags); + really_did_insert = 1; + } + + if (really_did_insert) { + pci_dev_get (dev); } } @@ -305,6 +310,7 @@ static inline void __pci_addr_cache_remove_device(struct pci_dev *dev) { struct rb_node *n; + int really_did_remove = 0; restart: n = rb_first(&pci_io_addr_cache_root.rb_root); @@ -315,11 +321,14 @@ if (piar->pcidev == dev) { rb_erase(n, &pci_io_addr_cache_root.rb_root); kfree(piar); + really_did_remove = 1; goto restart; } n = rb_next(n); } - pci_dev_put(dev); + if (really_did_remove) { + pci_dev_put(dev); + } } /** --TybLhxa8M7aNoW+V-- ----- End forwarded message ----- ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nfont at austin.ibm.com Thu Jul 22 07:11:38 2004 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Wed, 21 Jul 2004 16:11:38 -0500 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: References: Message-ID: <40FEDC0A.1030201@austin.ibm.com> John H Rose wrote: > I know this is the trendy idea, and I don't disagree with it, but keep > in mind that you'll break tools. Update flash, for example. :) We could always provide a link from /proc/ppc64/rtas to sysfs so we don't break any tools. Nathan F. 
** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From greg at kroah.com Thu Jul 22 07:11:49 2004 From: greg at kroah.com (Greg KH) Date: Wed, 21 Jul 2004 17:11:49 -0400 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <20040721145019.0766d4ad.moilanen@austin.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> <20040721145019.0766d4ad.moilanen@austin.ibm.com> Message-ID: <20040721211144.GA17352@kroah.com> On Wed, Jul 21, 2004 at 02:50:19PM -0500, Jake Moilanen wrote: > > > > Would anyone have a problem with moving everything from > > /proc/ppc64/rtas to sysfs (/sys/firmware/rtas) ? It seems that sysfs > > is where all these files should live anyway. > > I'm not sure I agree that the files like error_log, firmware_flash > should be in sysfs. Those should probably move to something like > netlink which we discussed previously. > > Do we need to move over some of the files at all (eg volume). Care to show us all of the /proc/ppc64/rtas files, and what they are used for? It would help with the discussion for those of us without easy access to such machines. thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From greg at kroah.com Thu Jul 22 07:13:14 2004 From: greg at kroah.com (Greg KH) Date: Wed, 21 Jul 2004 17:13:14 -0400 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: References: <40FEBC0E.7060005@austin.ibm.com> Message-ID: <20040721211313.GB17352@kroah.com> On Wed, Jul 21, 2004 at 02:56:45PM -0500, John H Rose wrote: > > I know this is the trendy idea, and I don't disagree with it, but keep > in mind that you'll break tools. Update flash, for example. :) Any flash or firmware files should be using the kernel firmware interface, and not reinventing the wheel again. And yes it is a trend, one that will continue and not go away. 
Please learn to follow it :) thanks, greg k-h ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Jul 22 07:13:16 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 21 Jul 2004 16:13:16 -0500 Subject: Another: [PATCH} 2.6 ppc64 rtas crash in kmalloc Message-ID: <20040721211316.GS13171@austin.ibm.com> Hi, Another bounced patch email. --linas ----- Forwarded message from Mail Delivery System ----- ------ This is a copy of the message, including all the headers. ------ Return-path: Received: from linas by bilge with local (Exim 3.36 #1 (Debian)) id 1BkWiX-0007gc-00; Tue, 13 Jul 2004 18:29:45 -0500 Date: Tue, 13 Jul 2004 18:29:45 -0500 To: paulus at au1.ibm.com, paulus at samba.org Cc: linuxppc64-dev at lists.linuxppc.org Subject: [PATCH} 2.6 ppc64 rtas crash in kmalloc Message-ID: <20040713232945.GR17333 at bilge> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="UPT3ojh+0CqEDtpF" Content-Disposition: inline User-Agent: Mutt/1.5.6+20040523i From: Linas Vepstas --UPT3ojh+0CqEDtpF Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Paul, Please review and send upstream as appropriate. The recent set of rtas patches I sent in (about a week ago) introduced a bug: the possible use of kmalloc before the VM subsystem is initialized. This patch checks for the VM subsystem being ready, and avoids the kmalloc if it's not. People typically hit this bug during very early boot stages, when EEH is being initialized, and an rtas_call fails, leading to the use of kmalloc to get the error message. --linas --UPT3ojh+0CqEDtpF Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="rtas-kmalloc-crash.patch" ===== arch/ppc64/kernel/rtas.c 1.37 vs edited ===== --- 1.37/arch/ppc64/kernel/rtas.c Mon Jul 5 05:27:10 2004 +++ edited/arch/ppc64/kernel/rtas.c Tue Jul 13 18:12:06 2004 @@ -165,9 +165,13 @@ /* Log the error in the unlikely case that there was one.
*/ if (unlikely(logit)) { - buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC); - if (buff_copy) { - memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX); + /* Can't call kmalloc if VM subsystem is not yet up. */ + struct cache_sizes *csizep = malloc_sizes; + if (csizep->cs_cachep) { + buff_copy = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC); + if (buff_copy) { + memcpy(buff_copy, rtas_err_buf, RTAS_ERROR_LOG_MAX); + } } } --UPT3ojh+0CqEDtpF-- ----- End forwarded message ----- ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From linas at austin.ibm.com Thu Jul 22 07:16:50 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 21 Jul 2004 16:16:50 -0500 Subject: Resending ... [PATCH] 2.6 ppc64 Janitor prom.c memleak; whitespace Message-ID: <20040721211650.GU13171@austin.ibm.com> ----- Forwarded message from Mail Delivery System ----- ------ This is a copy of the message, including all the headers. ------ Return-path: Received: from linas by bilge with local (Exim 3.36 #1 (Debian)) id 1BkTcP-0007ZZ-00; Tue, 13 Jul 2004 15:11:13 -0500 Date: Tue, 13 Jul 2004 15:11:13 -0500 To: paulus at au1.ibm.com, paulus at samba.org Cc: linuxppc64-dev at lists.linuxppc.org Subject: [PATCH] 2.6 ppc64 Janitor prom.c memleak; whitespace Message-ID: <20040713201112.GO17333 at bilge> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="GvXjxJ+pjyke8COw" Content-Disposition: inline User-Agent: Mutt/1.5.6+20040523i From: Linas Vepstas --GvXjxJ+pjyke8COw Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Paul, Please review & forward as appropriate. I had reason to review prom.c today, and saw one minor bug (a very unlikely memory leak) and a lot of bad indentation (8 spaces used where tab should have been used). Bad whitespace drives me crazy because my vi set ts=3 not 8. This patch fixes the mem leak (at very bottom of the patch) and lots of whitespace ick. 
The mem leak is unlikely because it requires other failures to happen before the memleak happens. Signed-off-by: Linas Vepstas --linas p.s. My next patch will make actual functional changes to prom.c --GvXjxJ+pjyke8COw Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="prom-whitespace.patch" ===== arch/ppc64/kernel/prom.c 1.88 vs edited ===== --- 1.88/arch/ppc64/kernel/prom.c Fri Jul 2 00:23:46 2004 +++ edited/arch/ppc64/kernel/prom.c Tue Jul 13 15:00:28 2004 @@ -199,16 +199,16 @@ unsigned long offset = reloc_offset(); struct prom_t *_prom = PTRRELOC(&prom); va_list list; - + _prom->args.service = ADDR(service); _prom->args.nargs = nargs; _prom->args.nret = nret; - _prom->args.rets = (prom_arg_t *)&(_prom->args.args[nargs]); + _prom->args.rets = (prom_arg_t *)&(_prom->args.args[nargs]); - va_start(list, nret); + va_start(list, nret); for (i=0; i < nargs; i++) _prom->args.args[i] = va_arg(list, prom_arg_t); - va_end(list); + va_end(list); for (i=0; i < nret ;i++) _prom->args.rets[i] = 0; @@ -244,17 +244,17 @@ static void __init prom_print_hex(unsigned long val) { unsigned long offset = reloc_offset(); - int i, nibbles = sizeof(val)*2; - char buf[sizeof(val)*2+1]; + int i, nibbles = sizeof(val)*2; + char buf[sizeof(val)*2+1]; struct prom_t *_prom = PTRRELOC(&prom); - for (i = nibbles-1; i >= 0; i--) { - buf[i] = (val & 0xf) + '0'; - if (buf[i] > '9') - buf[i] += ('a'-'0'-10); - val >>= 4; - } - buf[nibbles] = '\0'; + for (i = nibbles-1; i >= 0; i--) { + buf[i] = (val & 0xf) + '0'; + if (buf[i] > '9') + buf[i] += ('a'-'0'-10); + val >>= 4; + } + buf[nibbles] = '\0'; call_prom("write", 3, 1, _prom->stdout, buf, nibbles); } @@ -343,22 +343,22 @@ { phandle node; char type[64]; - unsigned long num_cpus = 0; - unsigned long offset = reloc_offset(); + unsigned long num_cpus = 0; + unsigned long offset = reloc_offset(); struct prom_t *_prom = PTRRELOC(&prom); - struct naca_struct *_naca = RELOC(naca); - struct systemcfg *_systemcfg = 
RELOC(systemcfg); + struct naca_struct *_naca = RELOC(naca); + struct systemcfg *_systemcfg = RELOC(systemcfg); /* NOTE: _naca->debug_switch is already initialized. */ prom_debug("prom_initialize_naca: start...\n"); _naca->pftSize = 0; /* ilog2 of htab size. computed below. */ - for (node = 0; prom_next_node(&node); ) { - type[0] = 0; + for (node = 0; prom_next_node(&node); ) { + type[0] = 0; prom_getprop(node, "device_type", type, sizeof(type)); - if (!strcmp(type, RELOC("cpu"))) { + if (!strcmp(type, RELOC("cpu"))) { num_cpus += 1; /* We're assuming *all* of the CPUs have the same @@ -404,7 +404,7 @@ _naca->pftSize = pft_size[1]; } } - } else if (!strcmp(type, RELOC("serial"))) { + } else if (!strcmp(type, RELOC("serial"))) { phandle isa, pci; struct isa_reg_property reg; union pci_range ranges; @@ -435,7 +435,7 @@ ((((unsigned long)ranges.pci64.phys_hi) << 32) | (ranges.pci64.phys_lo)) + reg.address; } - } + } } if (_systemcfg->platform == PLATFORM_POWERMAC) @@ -465,8 +465,8 @@ } /* We gotta have at least 1 cpu... 
*/ - if ( (_systemcfg->processorCount = num_cpus) < 1 ) - PROM_BUG(); + if ( (_systemcfg->processorCount = num_cpus) < 1 ) + PROM_BUG(); _systemcfg->physicalMemorySize = lmb_phys_mem_size(); @@ -496,21 +496,21 @@ _systemcfg->version.minor = SYSTEMCFG_MINOR; _systemcfg->processor = _get_PVR(); - prom_debug("systemcfg->processorCount = 0x%x\n", + prom_debug("systemcfg->processorCount = 0x%x\n", _systemcfg->processorCount); - prom_debug("systemcfg->physicalMemorySize = 0x%x\n", + prom_debug("systemcfg->physicalMemorySize = 0x%x\n", _systemcfg->physicalMemorySize); - prom_debug("naca->pftSize = 0x%x\n", + prom_debug("naca->pftSize = 0x%x\n", _naca->pftSize); - prom_debug("systemcfg->dCacheL1LineSize = 0x%x\n", + prom_debug("systemcfg->dCacheL1LineSize = 0x%x\n", _systemcfg->dCacheL1LineSize); - prom_debug("systemcfg->iCacheL1LineSize = 0x%x\n", + prom_debug("systemcfg->iCacheL1LineSize = 0x%x\n", _systemcfg->iCacheL1LineSize); - prom_debug("naca->serialPortAddr = 0x%x\n", + prom_debug("naca->serialPortAddr = 0x%x\n", _naca->serialPortAddr); - prom_debug("naca->interrupt_controller = 0x%x\n", + prom_debug("naca->interrupt_controller = 0x%x\n", _naca->interrupt_controller); - prom_debug("systemcfg->platform = 0x%x\n", + prom_debug("systemcfg->platform = 0x%x\n", _systemcfg->platform); prom_debug("prom_initialize_naca: end...\n"); } @@ -547,36 +547,36 @@ #ifdef DEBUG_PROM void prom_dump_lmb(void) { - unsigned long i; - unsigned long offset = reloc_offset(); + unsigned long i; + unsigned long offset = reloc_offset(); struct lmb *_lmb = PTRRELOC(&lmb); - prom_printf("\nprom_dump_lmb:\n"); - prom_printf(" memory.cnt = 0x%x\n", + prom_printf("\nprom_dump_lmb:\n"); + prom_printf(" memory.cnt = 0x%x\n", _lmb->memory.cnt); - prom_printf(" memory.size = 0x%x\n", + prom_printf(" memory.size = 0x%x\n", _lmb->memory.size); - for (i=0; i < _lmb->memory.cnt ;i++) { - prom_printf(" memory.region[0x%x].base = 0x%x\n", + for (i=0; i < _lmb->memory.cnt ;i++) { + prom_printf(" 
memory.region[0x%x].base = 0x%x\n", i, _lmb->memory.region[i].base); - prom_printf(" .physbase = 0x%x\n", + prom_printf(" .physbase = 0x%x\n", _lmb->memory.region[i].physbase); - prom_printf(" .size = 0x%x\n", + prom_printf(" .size = 0x%x\n", _lmb->memory.region[i].size); - } + } - prom_printf("\n reserved.cnt = 0x%x\n", + prom_printf("\n reserved.cnt = 0x%x\n", _lmb->reserved.cnt); - prom_printf(" reserved.size = 0x%x\n", + prom_printf(" reserved.size = 0x%x\n", _lmb->reserved.size); - for (i=0; i < _lmb->reserved.cnt ;i++) { - prom_printf(" reserved.region[0x%x\n].base = 0x%x\n", + for (i=0; i < _lmb->reserved.cnt ;i++) { + prom_printf(" reserved.region[0x%x\n].base = 0x%x\n", i, _lmb->reserved.region[i].base); - prom_printf(" .physbase = 0x%x\n", + prom_printf(" .physbase = 0x%x\n", _lmb->reserved.region[i].physbase); - prom_printf(" .size = 0x%x\n", + prom_printf(" .size = 0x%x\n", _lmb->reserved.region[i].size); - } + } } #endif /* DEBUG_PROM */ @@ -584,9 +584,9 @@ { phandle node; char type[64]; - unsigned long i, offset = reloc_offset(); + unsigned long i, offset = reloc_offset(); struct prom_t *_prom = PTRRELOC(&prom); - struct systemcfg *_systemcfg = RELOC(systemcfg); + struct systemcfg *_systemcfg = RELOC(systemcfg); union lmb_reg_property reg; unsigned long lmb_base, lmb_size; unsigned long num_regs, bytes_per_reg = (_prom->encode_phys_size*2)/8; @@ -599,11 +599,11 @@ if (_systemcfg->platform == PLATFORM_POWERMAC) bytes_per_reg = 12; - for (node = 0; prom_next_node(&node); ) { - type[0] = 0; - prom_getprop(node, "device_type", type, sizeof(type)); + for (node = 0; prom_next_node(&node); ) { + type[0] = 0; + prom_getprop(node, "device_type", type, sizeof(type)); - if (strcmp(type, RELOC("memory"))) + if (strcmp(type, RELOC("memory"))) continue; num_regs = prom_getprop(node, "reg", &reg, sizeof(reg)) @@ -651,7 +651,7 @@ struct rtas_t *_rtas = PTRRELOC(&rtas); struct systemcfg *_systemcfg = RELOC(systemcfg); ihandle prom_rtas; - u32 getprop_rval; + u32 
getprop_rval; char hypertas_funcs[4]; prom_debug("prom_instantiate_rtas: start...\n"); @@ -669,7 +669,7 @@ prom_getprop(prom_rtas, "rtas-size", &getprop_rval, sizeof(getprop_rval)); - _rtas->size = getprop_rval; + _rtas->size = getprop_rval; prom_printf("instantiating rtas"); if (_rtas->size != 0) { unsigned long rtas_region = RTAS_INSTANTIATE_MAX; @@ -707,9 +707,9 @@ prom_printf(" done\n"); } - prom_debug("rtas->base = 0x%x\n", _rtas->base); - prom_debug("rtas->entry = 0x%x\n", _rtas->entry); - prom_debug("rtas->size = 0x%x\n", _rtas->size); + prom_debug("rtas->base = 0x%x\n", _rtas->base); + prom_debug("rtas->entry = 0x%x\n", _rtas->entry); + prom_debug("rtas->size = 0x%x\n", _rtas->size); } prom_debug("prom_instantiate_rtas: end...\n"); } @@ -744,7 +744,7 @@ { phandle node; ihandle phb_node; - unsigned long offset = reloc_offset(); + unsigned long offset = reloc_offset(); char compatible[64], path[64], type[64], model[64]; unsigned long i, table = 0; unsigned long base, vbase, align; @@ -853,21 +853,21 @@ /* Call OF to setup the TCE hardware */ if (call_prom("package-to-path", 3, 1, node, path, sizeof(path)-1) == PROM_ERROR) { - prom_printf("package-to-path failed\n"); - } else { - prom_printf("opening PHB %s", path); - } - - phb_node = call_prom("open", 1, 1, path); - if ( (long)phb_node <= 0) { - prom_printf("... failed\n"); - } else { - prom_printf("... done\n"); - } - call_prom("call-method", 6, 0, ADDR("set-64-bit-addressing"), + prom_printf("package-to-path failed\n"); + } else { + prom_printf("opening PHB %s", path); + } + + phb_node = call_prom("open", 1, 1, path); + if ( (long)phb_node <= 0) { + prom_printf("... failed\n"); + } else { + prom_printf("... 
done\n"); + } + call_prom("call-method", 6, 0, ADDR("set-64-bit-addressing"), phb_node, -1, minsize, (u32) base, (u32) (base >> 32)); - call_prom("close", 1, 0, phb_node); + call_prom("close", 1, 0, phb_node); table++; } @@ -910,15 +910,15 @@ unsigned int cpu_threads, hw_cpu_num; int propsize; extern void __secondary_hold(void); - extern unsigned long __secondary_hold_spinloop; - extern unsigned long __secondary_hold_acknowledge; - unsigned long *spinloop + extern unsigned long __secondary_hold_spinloop; + extern unsigned long __secondary_hold_acknowledge; + unsigned long *spinloop = (void *)virt_to_abs(&__secondary_hold_spinloop); - unsigned long *acknowledge + unsigned long *acknowledge = (void *)virt_to_abs(&__secondary_hold_acknowledge); - unsigned long secondary_hold + unsigned long secondary_hold = virt_to_abs(*PTRRELOC((unsigned long *)__secondary_hold)); - struct systemcfg *_systemcfg = RELOC(systemcfg); + struct systemcfg *_systemcfg = RELOC(systemcfg); struct paca_struct *lpaca = PTRRELOC(&paca[0]); struct prom_t *_prom = PTRRELOC(&prom); #ifdef CONFIG_SMP @@ -962,12 +962,12 @@ prom_debug(" 1) *acknowledge = 0x%x\n", *acknowledge); prom_debug(" 1) secondary_hold = 0x%x\n", secondary_hold); - /* Set the common spinloop variable, so all of the secondary cpus + /* Set the common spinloop variable, so all of the secondary cpus * will block when they are awakened from their OF spinloop. * This must occur for both SMP and non SMP kernels, since OF will * be trashed when we move the kernel. 
- */ - *spinloop = 0; + */ + *spinloop = 0; #ifdef CONFIG_HMT for (i=0; i < NR_CPUS; i++) { @@ -986,7 +986,7 @@ if (strcmp(type, RELOC("okay")) != 0) continue; - reg = -1; + reg = -1; prom_getprop(node, "reg", &reg, sizeof(reg)); path = (char *) mem; @@ -1124,7 +1124,7 @@ ihandle prom_options = 0; char option[9]; unsigned long offset = reloc_offset(); - struct naca_struct *_naca = RELOC(naca); + struct naca_struct *_naca = RELOC(naca); char found = 0; if (strstr(RELOC(cmd_line), RELOC("smt-enabled="))) { @@ -1253,10 +1253,10 @@ struct prom_t *_prom = PTRRELOC(&prom); u32 val; - if (prom_getprop(_prom->chosen, "stdout", &val, sizeof(val)) <= 0) - prom_panic("cannot find stdout"); + if (prom_getprop(_prom->chosen, "stdout", &val, sizeof(val)) <= 0) + prom_panic("cannot find stdout"); - _prom->stdout = val; + _prom->stdout = val; } static int __init prom_find_machine_type(void) @@ -1306,7 +1306,7 @@ ihandle ih; int i, j; unsigned long offset = reloc_offset(); - struct prom_t *_prom = PTRRELOC(&prom); + struct prom_t *_prom = PTRRELOC(&prom); char type[16], *path; static unsigned char default_colors[] = { 0x00, 0x00, 0x00, @@ -1403,7 +1403,7 @@ break; #endif /* CONFIG_LOGO_LINUX_CLUT224 */ } - + return DOUBLEWORD_ALIGN(mem); } @@ -1592,7 +1592,7 @@ { struct bi_record *first, *last; - prom_debug("birec_verify: r6=0x%x\n", (unsigned long)bi_recs); + prom_debug("birec_verify: r6=0x%x\n", (unsigned long)bi_recs); if (bi_recs != NULL) prom_debug(" tag=0x%x\n", bi_recs->tag); @@ -1601,7 +1601,7 @@ last = (struct bi_record *)(long)bi_recs->data[0]; - prom_debug(" last=0x%x\n", (unsigned long)last); + prom_debug(" last=0x%x\n", (unsigned long)last); if (last != NULL) prom_debug(" last_tag=0x%x\n", last->tag); @@ -1609,7 +1609,7 @@ return NULL; first = (struct bi_record *)(long)last->data[0]; - prom_debug(" first=0x%x\n", (unsigned long)first); + prom_debug(" first=0x%x\n", (unsigned long)first); if ( first == NULL || first != bi_recs ) return NULL; @@ -1681,9 +1681,9 @@ /* Init 
prom stdout device */ prom_init_stdout(); - prom_debug("klimit=0x%x\n", RELOC(klimit)); - prom_debug("offset=0x%x\n", offset); - prom_debug("->mem=0x%x\n", RELOC(klimit) - offset); + prom_debug("klimit=0x%x\n", RELOC(klimit)); + prom_debug("offset=0x%x\n", offset); + prom_debug("->mem=0x%x\n", RELOC(klimit) - offset); /* check out if we have bi_recs */ _prom->bi_recs = prom_bi_rec_verify((struct bi_record *)r6); @@ -1713,7 +1713,7 @@ copy_and_flush(0, KERNELBASE - offset, 0x100, 0); /* Start storing things at klimit */ - mem = RELOC(klimit) - offset; + mem = RELOC(klimit) - offset; /* Get the full OF pathname of the stdout device */ p = (char *) mem; @@ -1728,9 +1728,9 @@ _prom->encode_phys_size = (getprop_rval == 1) ? 32 : 64; /* Determine which cpu is actually running right _now_ */ - if (prom_getprop(_prom->chosen, "cpu", + if (prom_getprop(_prom->chosen, "cpu", &prom_cpu, sizeof(prom_cpu)) <= 0) - prom_panic("cannot find boot cpu"); + prom_panic("cannot find boot cpu"); cpu_pkg = call_prom("instance-to-package", 1, 1, prom_cpu); prom_getprop(cpu_pkg, "reg", &getprop_rval, sizeof(getprop_rval)); @@ -1739,7 +1739,7 @@ RELOC(boot_cpuid) = 0; - prom_debug("Booting CPU hw index = 0x%x\n", _prom->cpu); + prom_debug("Booting CPU hw index = 0x%x\n", _prom->cpu); /* Get the boot device and translate it to a full OF pathname. */ p = (char *) mem; @@ -1773,18 +1773,18 @@ if (_systemcfg->platform != PLATFORM_POWERMAC) prom_instantiate_rtas(); - /* Initialize some system info into the Naca early... */ - prom_initialize_naca(); + /* Initialize some system info into the Naca early... */ + prom_initialize_naca(); smt_setup(); - /* If we are on an SMP machine, then we *MUST* do the - * following, regardless of whether we have an SMP - * kernel or not. - */ + /* If we are on an SMP machine, then we *MUST* do the + * following, regardless of whether we have an SMP + * kernel or not. 
+ */ prom_hold_cpus(mem); - prom_debug("after basic inits, mem=0x%x\n", mem); + prom_debug("after basic inits, mem=0x%x\n", mem); #ifdef CONFIG_BLK_DEV_INITRD prom_debug("initrd_start=0x%x\n", RELOC(initrd_start)); prom_debug("initrd_end=0x%x\n", RELOC(initrd_end)); @@ -1796,7 +1796,7 @@ RELOC(klimit) = mem + offset; prom_debug("new klimit is\n"); - prom_debug("klimit=0x%x\n", RELOC(klimit)); + prom_debug("klimit=0x%x\n", RELOC(klimit)); prom_debug(" ->mem=0x%x\n", mem); lmb_reserve(0, __pa(RELOC(klimit))); @@ -2082,7 +2082,7 @@ i = 0; adr = (struct address_range *) mem_start; while ((l -= sizeof(struct pci_reg_property)) >= 0) { - if (!measure_only) { + if (!measure_only) { adr[i].space = pci_addrs[i].addr.a_hi; adr[i].address = pci_addrs[i].addr.a_lo; adr[i].size = pci_addrs[i].size_lo; @@ -2121,7 +2121,7 @@ i = 0; adr = (struct address_range *) mem_start; while ((l -= sizeof(struct reg_property32)) >= 0) { - if (!measure_only) { + if (!measure_only) { adr[i].space = 2; adr[i].address = rp[i].address + base_address; adr[i].size = rp[i].size; @@ -2161,7 +2161,7 @@ i = 0; adr = (struct address_range *) mem_start; while ((l -= sizeof(struct reg_property32)) >= 0) { - if (!measure_only) { + if (!measure_only) { adr[i].space = 2; adr[i].address = rp[i].address + base_address; adr[i].size = rp[i].size; @@ -2189,7 +2189,7 @@ i = 0; adr = (struct address_range *) mem_start; while ((l -= sizeof(struct reg_property)) >= 0) { - if (!measure_only) { + if (!measure_only) { adr[i].space = rp[i].space; adr[i].address = rp[i].address; adr[i].size = rp[i].size; @@ -2218,7 +2218,7 @@ i = 0; adr = (struct address_range *) mem_start; while ((l -= rpsize) >= 0) { - if (!measure_only) { + if (!measure_only) { adr[i].space = 0; adr[i].address = rp[naddrc - 1]; adr[i].size = rp[naddrc + nsizec - 1]; @@ -2296,7 +2296,7 @@ return mem_start; } -/* +/** * finish_device_tree is called once things are running normally * (i.e. with text and data mapped to the address they were linked at). 
* It traverses the device tree and fills in the name, type, @@ -2347,7 +2347,7 @@ return 1; } -/* +/** * Work out the sense (active-low level / active-high edge) * of each interrupt from the device tree. */ @@ -2369,7 +2369,7 @@ } } -/* +/** * Construct and return a list of the device_nodes with a given name. */ struct device_node * @@ -2388,7 +2388,7 @@ return head; } -/* +/** * Construct and return a list of the device_nodes with a given type. */ struct device_node * @@ -2407,7 +2407,7 @@ return head; } -/* +/** * Returns all nodes linked together */ struct device_node * @@ -2424,7 +2424,7 @@ return head; } -/* Checks if the given "compat" string matches one of the strings in +/** Checks if the given "compat" string matches one of the strings in * the device's "compatible" property */ int @@ -2448,7 +2448,7 @@ } -/* +/** * Indicates whether the root node has a given value in its * compatible property. */ @@ -2457,7 +2457,7 @@ { struct device_node *root; int rc = 0; - + root = of_find_node_by_path("/"); if (root) { rc = device_is_compatible(root, compat); @@ -2466,7 +2466,7 @@ return rc; } -/* +/** * Construct and return a list of the device_nodes with a given type * and compatible property. */ @@ -2489,7 +2489,7 @@ return head; } -/* +/** * Find the device_node with a given full_name. */ struct device_node * @@ -2904,7 +2904,7 @@ u32 *regs; int err = 0; phandle *ibm_phandle; - + node->name = get_property(node, "name", 0); node->type = get_property(node, "device_type", 0); @@ -2957,26 +2957,26 @@ if (err) goto out; } - /* now do the rough equivalent of update_dn_pci_info, this - * probably is not correct for phb's, but should work for - * IOAs and slots. 
- */ - - node->phb = parent->phb; - - regs = (u32 *)get_property(node, "reg", 0); - if (regs) { - node->busno = (regs[0] >> 16) & 0xff; - node->devfn = (regs[0] >> 8) & 0xff; - } + /* now do the rough equivalent of update_dn_pci_info, this + * probably is not correct for phb's, but should work for + * IOAs and slots. + */ + + node->phb = parent->phb; + + regs = (u32 *)get_property(node, "reg", 0); + if (regs) { + node->busno = (regs[0] >> 16) & 0xff; + node->devfn = (regs[0] >> 8) & 0xff; + } /* fixing up iommu_table */ if(strcmp(node->name, "pci") == 0 && - get_property(node, "ibm,dma-window", NULL)) { - node->bussubno = node->busno; - iommu_devnode_init(node); - } + get_property(node, "ibm,dma-window", NULL)) { + node->bussubno = node->busno; + iommu_devnode_init(node); + } else node->iommu_table = parent->iommu_table; @@ -3001,13 +3001,6 @@ memset(np, 0, sizeof(*np)); - np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL); - if (!np->full_name) { - kfree(np); - return -ENOMEM; - } - strcpy(np->full_name, path); - np->properties = proplist; OF_MARK_DYNAMIC(np); of_node_get(np); @@ -3016,6 +3009,13 @@ kfree(np); return -EINVAL; /* could also be ENOMEM, though */ } + + np->full_name = kmalloc(strlen(path) + 1, GFP_KERNEL); + if (!np->full_name) { + kfree(np); + return -ENOMEM; + } + strcpy(np->full_name, path); if (0 != (err = of_finish_dynamic_node(np))) { kfree(np); --GvXjxJ+pjyke8COw-- ----- End forwarded message ----- ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/

From greg at kroah.com Thu Jul 22 07:19:30 2004
From: greg at kroah.com (Greg KH)
Date: Wed, 21 Jul 2004 17:19:30 -0400
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <40FEDC0A.1030201@austin.ibm.com>
References: <40FEDC0A.1030201@austin.ibm.com>
Message-ID: <20040721211930.GB18110@kroah.com>

On Wed, Jul 21, 2004 at 04:11:38PM -0500, Nathan Fontenot wrote:
> John H Rose wrote:
> >I know this is the trendy idea, and I don't disagree with it, but
> >keep in mind that you'll break tools. Update flash, for example. :)
>
> We could always provide a link from /proc/ppc64/rtas to sysfs so we
> don't break any tools.

How will you know where sysfs is mounted? :)

And no, putting stuff like that in the kernel will not be acceptable...

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From linas at austin.ibm.com Thu Jul 22 07:30:41 2004
From: linas at austin.ibm.com (Linas Vepstas)
Date: Wed, 21 Jul 2004 16:30:41 -0500
Subject: Resending ... Re: [PATCH] [2.6] PPC64: log firmware errors during boot.
Message-ID: <20040721213041.GC13171@austin.ibm.com>

----- Forwarded message from Mail Delivery System -----

------ This is a copy of the message, including all the headers. ------

Return-path:
Received: from linas by bilge with local (Exim 3.36 #1 (Debian)) id 1Bj0dp-0005gG-00; Fri, 09 Jul 2004 14:02:37 -0500
Date: Fri, 9 Jul 2004 14:02:37 -0500
To: Jake Moilanen
Cc: paulus at samba.org, linuxppc64-dev at lists.linuxppc.org, linux-kernel at vger.kernel.org
Subject: Re: [PATCH] [2.6] PPC64: log firmware errors during boot.
Message-ID: <20040709190237.GE17333 at bilge>
References: <20040629191046.Q21634 at forte.austin.ibm.com> <16610.39955.554139.858593 at cargo.ozlabs.ibm.com> <20040706084116.11ab7988.moilanen at austin.ibm.com> <20040708110337.N21634 at forte.austin.ibm.com> <20040708125545.41aae667.moilanen at austin.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20040708125545.41aae667.moilanen at austin.ibm.com>
User-Agent: Mutt/1.5.6+20040523i
From: Linas Vepstas

On Thu, Jul 08, 2004 at 12:55:45PM -0500, Jake Moilanen was heard to remark:
> On Thu, 8 Jul 2004 11:03:37 -0500
> linas at austin.ibm.com wrote:
>
> > Actually, they don't seem to be queued at all; when I turned on
> > logging earlier, a whole pile of messages popped out that weren't
> > visible before.
>
> If you are seeing a different pile of messages, I would imagine the
> messages that popped out are not coming from event-scan then. Might be
> last_error, which messages do not come in from event-scan. I can see
> them not being logged in early boot.

Yep. They were due to EEH not being enabled on empty slots. Apparently, they were being generated during boot for years, but no one noticed them before, because we had this logging turned off. So once burned, twice shy... if we got the messages earlier, we'd be less likely to overlook the root problem ...

> A problem I could see, is if we make an rtas call before the VM
> is up. The kmalloc for last_error won't like that.

Ugh, yes, eeh is initialized very early, before the vm system is up. I'll have to prepare a patch to check malloc_sizes->cs_cachep for NULL, and not call kmalloc() if it is. Is there a better way to poll to find out if VM is up?

--linas

----- End forwarded message -----

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From brking at us.ibm.com Thu Jul 22 07:47:02 2004
From: brking at us.ibm.com (Brian King)
Date: Wed, 21 Jul 2004 16:47:02 -0500
Subject: Resending: [RFC/PATCH] Re: How to block pci config-reads during device self-test?
In-Reply-To: <20040721211506.GT13171@austin.ibm.com>
References: <20040721211506.GT13171@austin.ibm.com>
Message-ID: <40FEE456.8050307@us.ibm.com>

> There are two possible solutions. (1) block pci config-space i/o
> during a BIST, and (2) let the device driver detect that the card
> has been off-lined, and so reset the pci controller chip.
>
> I'm starting to think (2) is better, but this patch implements (1).
> Opinions? Comments?

I do like the idea of 2, but have a couple of comments. Oftentimes an eeh error will lead to the adapter being replaced, so we may need to be a little sensitive regarding error logging. I wouldn't want hitting this window to result in the adapter being replaced.

Also, just doing 2 and not 1 will still allow a potential problem to exist. If userspace was constantly reading pci config space of the adapter while resetting it, it could result in continual eeh errors, until the device driver gives up trying to bring the adapter back to life and offlines it.

> ===== arch/ppc64/kernel/pSeries_pci.c 1.39 vs edited =====
> --- 1.39/arch/ppc64/kernel/pSeries_pci.c Mon Jul 12 18:29:16 2004
> +++ edited/arch/ppc64/kernel/pSeries_pci.c Tue Jul 13 15:46:08 2004
> @@ -76,14 +76,22 @@
>
> addr = (dn->busno << 16) | (dn->devfn << 8) | where;
> buid = dn->phb->buid;
> +
> + if (in_interrupt()) {
> + /* Driver should unplug interrupts during BIST */
> + BUG_ON (down_read_trylock(&dn->bist_lock) != 1);
> + } else {
> + down_read (&dn->bist_lock);
> + }

This does not work. Device drivers often read pci config space from interrupt context. If userspace happened to be reading at the same time, we could hit the BUG_ON.
Also, like we discussed before, these routines are called with the pci_lock spinlock held, so we can't be using semaphores.

I think the best way to effectively do what this patch is trying to do is to have the read routine return ff's and have the write routine bit bucket the data.

--
Brian King
eServer Storage I/O
IBM Linux Technology Center

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From nfont at austin.ibm.com Thu Jul 22 08:23:17 2004
From: nfont at austin.ibm.com (Nathan Fontenot)
Date: Wed, 21 Jul 2004 17:23:17 -0500
Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming
In-Reply-To: <1090439865.5862.5.camel@nighthawk>
References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> <1090439865.5862.5.camel@nighthawk>
Message-ID: <40FEECD5.40907@austin.ibm.com>

Dave Hansen wrote:
> As for the other things in that directory, could you do an ls on your
> system and describe which ones you think need to be kept?
>
> -- Dave

Here we go...

# ls /proc/ppc64/rtas
.   clock      frequency  progress    rtasmsgs  volume
..  error_log  poweron    rmo_buffer  sensors

With the addition of rmo_buffer you can make rtas calls from user space. Thus we may be able to get rid of the following items.

o clock - get/set platform time.
o frequency - set the frequency of the system speaker.
o progress - read/write data to the system led display.
o volume - set the volume of the system speaker.
o poweron - set a future poweron time for the system.

The remaining items should be retained.

o rtasmsgs - turn on/off rtas event messages to /var/log/messages.
o error_log - rtas_errd uses this to get rtas events from the kernel.
o rmo_buffer - get a rmo region to make rtas calls from user space.
o sensors - displays the current state of system sensors.

All of these are defined in arch/ppc64/kernel/rtas-proc.c, except for error_log which is in arch/ppc64/kernel/rtasd.c if anyone cares to look.

Nathan F.

** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/

From ananth at in.ibm.com Thu Jul 22 21:34:35 2004
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Thu, 22 Jul 2004 16:34:35 +0500
Subject: [PATCH] set tbl->it_type to TCE_PCI in iommu code
Message-ID: <20040722113435.GA12185@in.ibm.com>

Hi Anton,

Here is a patch that sets struct iommu_table->it_type to TCE_PCI in pSeries_iommu.c. Please apply

Signed-off-by: Ananth N Mavinakayanahalli

Thanks,
Ananth

--
Ananth Narayan
Linux Technology Center, IBM Software Lab, INDIA

diff -Naurp temp/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c risc/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c
--- temp/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c 2004-07-18 09:57:47.000000000 +0500
+++ risc/linux-2.6.8-rc2/arch/ppc64/kernel/pSeries_iommu.c 2004-07-22 16:24:53.000000000 +0500
@@ -211,6 +211,7 @@ static void iommu_table_setparms(struct
 	tbl->it_index = 0;
 	tbl->it_entrysize = sizeof(union tce_entry);
 	tbl->it_blocksize = 16;
+	tbl->it_type = TCE_PCI;
 }

 /*
@@ -246,6 +247,7 @@ static void iommu_table_setparms_lpar(st
 	tbl->it_index = dma_window[0];
 	tbl->it_entrysize = sizeof(union tce_entry);
 	tbl->it_blocksize = 16;
+	tbl->it_type = TCE_PCI;
 }

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From sharada at in.ibm.com Thu Jul 22 22:56:10 2004
From: sharada at in.ibm.com (R Sharada)
Date: Thu, 22 Jul 2004 18:26:10 +0530
Subject: identifying cpus (and smt threads)
Message-ID: <20040722125609.GA1466@in.ibm.com>

Hello,

I am trying to remove some initialization code of the cpumask global data structures from the prom_hold_cpus code and had a few queries.
The four cpu maps are the following:
- cpu_online_map,
- cpu_available_map,
- cpu_possible_map, and
- cpu_present_at_boot_map

This is my understanding of the prom_hold_cpus code and mask initialization:

Each device_tree node of type cpu represents a physical processor/cpu
For each valid processor identified by the firmware (obtained from looking at the 'status' property of the device_tree cpu node)
Get the 'reg' property that represents the physical cpu id
Get the 'interrupt-server#s' property that is supposed to represent something about the smt threads, if supported
(what does this property actually mean? On my Power4 test machine, p630, I found this property seems to be the same as the 'reg' property in the open firmware prompt. Does it give the individual thread ids, if SMT is supported?).
I tried to find concrete information about this property in documents, but failed to get it.

Based on the propsize returned from the getprop for the property 'ibm,interrupt-server#s', the number of threads (CPU_MAX_THREADS is anyways 2) is obtained.
The code then differentiates between the primary thread of boot cpu and non-boot cpu.
If primary thread of boot cpu, then set the cpuid in all the 4 cpumasks.
If primary thread of non-boot cpu, then make them spin on secondary-hold and when they are out, set the cpuid in the cpu_available_map, cpu_possible_map and cpu_present_at_boot. But they are not set in the cpu_online_map. Why is that? Aren't they online if they have come out of the secondary_hold spinloop?
Then the secondary threads' cpuids are set in the cpu_available_map and cpu_present_at_boot map.
Later, in setup_system, the secondary threads are started through the rtas-call 'start-cpu' if they are already in the cpu_available_map but not on the cpu_possible_map.

A few questions here:
- why is there a cpu_present_at_boot map? It seems to be the same as the cpu_available_map. Or did I miss something?
- second, I want to move the cpuset() calls for setting the cpumasks from the prom_hold_cpus code to setup_system. However, I see that only the primary thread of boot cpu is set in the cpu_online_map, and the primary threads of non-boot cpus are not set in the cpu_online_map.
- When do the other primary threads of the non-boot cpus get set in the online_map? Can I assume that they will all be up by the time setup_system is executed and set all the primary threads in the cpu_online_map? If not, how can I differentiate between the primary threads of the boot and non-boot cpus?

Thanks and Regards,
Sharada

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/

From jschopp at austin.ibm.com Fri Jul 23 01:58:13 2004
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Thu, 22 Jul 2004 10:58:13 -0500
Subject: identifying cpus (and smt threads)
In-Reply-To: <20040722125609.GA1466@in.ibm.com>
References: <20040722125609.GA1466@in.ibm.com>
Message-ID: <40FFE415.6040909@austin.ibm.com>

> Each device_tree node of type cpu represents a physical processor/cpu
> For each valid processor identified by the firmware (obtained from looking at
> the 'status' property of the device_tree cpu node)
> Get the 'reg' property that represents the physical cpu id
> Get the 'interrupt-server#s' property that is supposed to represent
> something about the smt threads, if supported
> (what does this property actually mean? On my Power4 test machine, p630,
> I found this property seems to be the same as the 'reg' property in the open
> firmware prompt. Does it give the individual thread ids, if SMT is supported?).
> I tried to find concrete information about this property in documents, but
> failed to get it.

The document you are looking for is the RPA. The interrupt-server#s does give the id for each thread. For single threaded systems (Power 4 and earlier) the reg and interrupt-server#s are the same. For SMT systems reg is the same as the first entry in interrupt-server#s.
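
[Editorial sketch, not part of the original message.] The relationship described above (one interrupt-server entry per thread, with reg equal to the first entry) can be illustrated roughly as follows. This is a minimal standalone sketch, not the prom.c code: the names sketch_thread_count, sketch_reg_matches_first_server and SKETCH_MAX_THREADS are hypothetical, and it assumes each entry is a 32-bit cell already in host byte order and that a missing property is reported as a non-positive size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap, mirroring the CPU_MAX_THREADS value of 2 mentioned in the thread. */
#define SKETCH_MAX_THREADS 2

/*
 * Derive a thread count from the byte size of an interrupt-server
 * property, assuming one 32-bit cell per thread.  A missing or
 * malformed property is treated as a single-threaded cpu.
 */
unsigned int sketch_thread_count(int propsize)
{
	unsigned int n;

	if (propsize <= 0 || (size_t)propsize % sizeof(uint32_t) != 0)
		return 1;
	n = (unsigned int)((size_t)propsize / sizeof(uint32_t));
	return n > SKETCH_MAX_THREADS ? SKETCH_MAX_THREADS : n;
}

/*
 * Check the stated invariant: "reg" equals the first interrupt-server
 * entry.  On single-threaded systems that is the only entry.
 */
int sketch_reg_matches_first_server(uint32_t reg, const uint32_t *servers,
				    int propsize)
{
	assert(servers != NULL);
	if (propsize < (int)sizeof(uint32_t))
		return 0;
	return servers[0] == reg;
}
```

Under these assumptions a Power4-style node with a 4-byte property yields one thread, and an SMT node with an 8-byte property yields two, the first of which carries the same id as reg.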
> Based on the propsize returned from the getprop for the property
> 'ibm,interrupt-server#s', the number of threads (CPU_MAX_THREADS is anyways 2) is
> obtained.
> The code then differentiates between the primary thread of boot cpu and
> non-boot cpu.
> If primary thread of boot cpu, then set the cpuid in all the 4 cpumasks.
> If primary thread of non-boot cpu, then make them spin on secondary-hold
> and when they are out, set the cpuid in the cpu_available_map, cpu_possible_map
> and cpu_present_at_boot. But they are not set in the cpu_online_map. Why is
> that? Aren't they online if they have come out of the secondary_hold spinloop?

Some cpus may be visible that are not assigned to the partition in a LPAR system. They may be in use by another partition or be otherwise utilized. These cannot be brought online at this point.

> Then the secondary threads cpuids are set in the cpu_available_map and
> cpu_present_at_boot map.
> Later, in setup_system, the secondary threads are started through the
> rtas-call 'start-cpu' if they are already in the cpu_available_map but not on
> the cpu_possible_map.
>
> A few questions here:
> - why is there a cpu_present_at_boot map? It seems to be the same as the
> cpu_available_map. Or did I miss something?

It is largely historical. However, it is very possible to add cpus to the available map that were not present at boot. This is the case on Power5 with shared processor enabled for instance.

> - second, I want to move the cpuset() calls for setting the cpumasks from the
> prom_hold_cpus code to setup_system. However, I see that only the primary thread
> of boot cpu is set in the cpu_online_map, and the primary threads of non-boot
> cpus are not set in the cpu_online_map.
> - When do the other primary threads of the non-boot cpus get set in the
> online_map? Can I assume that they will all be up by the time setup_system is
> executed and set all the primary threads in the cpu_online_map?
If not, how can > I differentiate between the primary threads of the boot and non-boot cpus? Not sure exactly what you mean by this. But I assume you are talking about cpu hotplug (ie cpu DLPAR) events which can bring cpus online at any time after boot when a customer asks it to. From paulus at samba.org Fri Jul 23 05:00:09 2004 From: paulus at samba.org (Paul Mackerras) Date: Thu, 22 Jul 2004 15:00:09 -0400 Subject: [PATCH] set tbl->it_type to TCE_PCI in iommu code In-Reply-To: <20040722113435.GA12185@in.ibm.com> References: <20040722113435.GA12185@in.ibm.com> Message-ID: <16640.3769.571620.779049@cargo.ozlabs.ibm.com> Ananth, > Here is a patch that sets struct iommu_table->it_type to TCE_PCI in > pSeries_iommu.c. Please apply This certainly looks like a good thing to do. To help me explain it better to Andrew Morton, could you tell me whether this fixes an actual bug (and if so, what bug)? Or is it just for code cleanness, or to help debugging? Thanks, Paul. From paulus at samba.org Fri Jul 23 05:12:55 2004 From: paulus at samba.org (Paul Mackerras) Date: Thu, 22 Jul 2004 15:12:55 -0400 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <1090439865.5862.5.camel@nighthawk> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> <1090439865.5862.5.camel@nighthawk> Message-ID: <16640.4535.25999.780795@cargo.ozlabs.ibm.com> Dave Hansen writes: > On Wed, 2004-07-21 at 11:55, Nathan Fontenot wrote: > > Would anyone have a problem with moving everything from > > /proc/ppc64/rtas to sysfs (/sys/firmware/rtas) ? It seems that sysfs > > is where all these files should live anyway. > > Some of them use pretty broken interfaces, and I don't think > they should be perpetuated. The biggest offender in my mind is > ppc64/rtas/rmo_buffer.
It exports physical addresses so that they > can be mapped with /dev/mem. I think a more refined interface is > appropriate here. First, we won't remove any /proc files from 2.6 that are being used by applications. That's a job for 2.7. (I think we could remove things like /proc/ppc64/rtas/poweron though.) Secondly, we can discuss the rmo_buffer thing if you like, but I'll be surprised if you can come up with something that doesn't involve unnecessarily duplicating code. The users of rmo_buffer absolutely need to know the physical address of the rmo_buffer memory. Given that, I don't see any good reason for making a device with an mmap method for mapping the rmo_buffer memory into userspace when we already have a perfectly good device which we can use for that purpose, namely /dev/mem. In other words, you'll need to produce a more sophisticated technical argument than just "/dev/mem? ugh..". 8-) Paul. From paulus at samba.org Fri Jul 23 05:24:53 2004 From: paulus at samba.org (Paul Mackerras) Date: Thu, 22 Jul 2004 15:24:53 -0400 Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size In-Reply-To: <20040721204840.GD13171@austin.ibm.com> References: <20040721204840.GD13171@austin.ibm.com> Message-ID: <16640.5253.20520.720962@cargo.ozlabs.ibm.com> Linas, > Firmware expects error log sizes to be of a very specific size, > but different versions of firmware apparently expect different > sizes; using the wrong size results in a painful, hard-to-debug crash > in firmware. Benh provided a patch for this some months ago, but > apparently missed this code path. This patch sets up the log buffer > size dynamically; it also fixes a bug with the return code not being > handled correctly.
Just small nit, but this hunk looks pointless: > @@ -105,6 +111,7 @@ > err_args = rtas.args; > rtas.args = save_args; > > + err_args.rets = (rtas_arg_t *)&(err_args.args[2]); > return err_args.rets[0]; > } > The rest of it looks fine. If you agree that this hunk can go, I'll send the rest of it upstream. Paul. From paulus at samba.org Fri Jul 23 06:24:41 2004 From: paulus at samba.org (Paul Mackerras) Date: Thu, 22 Jul 2004 16:24:41 -0400 Subject: [PATCH] 2.6 ppc64 Janitor prom.c memleak; whitespace In-Reply-To: <20040721211650.GU13171@austin.ibm.com> References: <20040721211650.GU13171@austin.ibm.com> Message-ID: <16640.8841.405843.460895@cargo.ozlabs.ibm.com> Linas Vepstas writes: > I had reason to review prom.c today, and saw one minor bug (a very > unlikely memory leak) and a lot of bad indentation (8 spaces used > where tab should have been used). Bad whitespace drives me crazy > because my vi set ts=3 not 8. This patch fixes the mem leak (at very > bottom of the patch) and lots of whitespace ick. The mem leak is > unlikely because it requires other failures to happen before the > memleak happens. Please don't combine extensive whitespace-only changes with real bug fixes. I'll send on the whitespace changes from your patch. With the memory leak, I agree it needs to be fixed, but it seems to me that if of_finish_dynamic_node() fails, we will still leak np->full_name, even with your patch. Also there are memory leaks in the failure paths of of_finish_dynamic_node. (If you feel like fixing those, fine, but don't feel you have to. :) Regards, Paul. ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Fri Jul 23 06:36:41 2004 From: johnrose at austin.ibm.com (John Rose) Date: Thu, 22 Jul 2004 15:36:41 -0500 Subject: [PATCH] imalloc supersets In-Reply-To: <1090261801.18793.12.camel@sinatra.austin.ibm.com> References: <1090261801.18793.12.camel@sinatra.austin.ibm.com> Message-ID: <1090528601.1648.14.camel@sinatra.austin.ibm.com> My first patch left out a change to pgtable.h. My apologies, attached is the corrected patch. So, anyone have thoughts on this? :) Thanks- John diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c sles9-rc5-sav/arch/ppc64/mm/imalloc.c --- sles9-rc5-vanilla/arch/ppc64/mm/imalloc.c 2004-07-16 16:11:54.000000000 -0500 +++ sles9-rc5-sav/arch/ppc64/mm/imalloc.c 2004-07-22 15:08:55.000000000 -0500 @@ -37,33 +37,51 @@ static int get_free_im_addr(unsigned lon return 0; } +/* Return whether the region described by v_addr and size is a subset + * of the region described by parent + */ +static inline int im_region_is_subset(unsigned long v_addr, unsigned long size, + struct vm_struct *parent) +{ + return (int) (v_addr >= (unsigned long) parent->addr && + v_addr < (unsigned long) parent->addr + parent->size && + size < parent->size); +} + +/* Return whether the region described by v_addr and size is a superset + * of the region described by child + */ +static int im_region_is_superset(unsigned long v_addr, unsigned long size, + struct vm_struct *child) +{ + struct vm_struct parent; + + parent.addr = (void *) v_addr; + parent.size = size; + + return im_region_is_subset((unsigned long) child->addr, child->size, + &parent); +} + /* Return whether the region described by v_addr and size overlaps * the region described by vm. 
Overlapping regions meet the * following conditions: * 1) The regions share some part of the address space * 2) The regions aren't identical - * 3) The first region is not a subset of the second + * 3) Neither region is a subset of the other */ -static inline int im_region_overlaps(unsigned long v_addr, unsigned long size, +static int im_region_overlaps(unsigned long v_addr, unsigned long size, struct vm_struct *vm) { + if (im_region_is_superset(v_addr, size, vm)) + return 0; + return (v_addr + size > (unsigned long) vm->addr + vm->size && v_addr < (unsigned long) vm->addr + vm->size) || (v_addr < (unsigned long) vm->addr && v_addr + size > (unsigned long) vm->addr); } -/* Return whether the region described by v_addr and size is a subset - * of the region described by vm - */ -static inline int im_region_is_subset(unsigned long v_addr, unsigned long size, - struct vm_struct *vm) -{ - return (int) (v_addr >= (unsigned long) vm->addr && - v_addr < (unsigned long) vm->addr + vm->size && - size < vm->size); -} - /* Determine imalloc status of region described by v_addr and size. * Can return one of the following: * IM_REGION_UNUSED - Entire region is unallocated in imalloc space. @@ -73,28 +91,37 @@ static inline int im_region_is_subset(un * IM_REGION_EXISTS - Exact region already allocated in imalloc space. * vm will be assigned to a ptr to the existing imlist * member. - * IM_REGION_OVERLAPS - A portion of the region is already allocated in - * imalloc space. + * IM_REGION_OVERLAPS - Region overlaps an allocated region in imalloc space. + * IM_REGION_SUPERSET - Region is a superset of a region that is already + * allocated in imalloc space. 
*/ static int im_region_status(unsigned long v_addr, unsigned long size, struct vm_struct **vm) { struct vm_struct *tmp; - for (tmp = imlist; tmp; tmp = tmp->next) - if (v_addr < (unsigned long) tmp->addr + tmp->size) + for (tmp = imlist; tmp; tmp = tmp->next) + if (v_addr < (unsigned long) tmp->addr + tmp->size) break; - + if (tmp) { if (im_region_overlaps(v_addr, size, tmp)) return IM_REGION_OVERLAP; *vm = tmp; - if (im_region_is_subset(v_addr, size, tmp)) + if (im_region_is_subset(v_addr, size, tmp)) { + /* Return with tmp pointing to superset */ return IM_REGION_SUBSET; + } + if (im_region_is_superset(v_addr, size, tmp)) { + /* Return with tmp pointing to first subset */ + return IM_REGION_SUPERSET; + } else if (v_addr == (unsigned long) tmp->addr && - size == tmp->size) + size == tmp->size) { + /* Return with tmp pointing to exact region */ return IM_REGION_EXISTS; + } } *vm = NULL; @@ -208,6 +235,10 @@ static struct vm_struct * __im_get_area( tmp = split_im_region(req_addr, size, tmp); break; case IM_REGION_EXISTS: + /* Return requested region */ + break; + case IM_REGION_SUPERSET: + /* Return first existing subset of requested region */ break; default: printk(KERN_ERR "%s() unexpected imalloc region status\n", diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/arch/ppc64/mm/init.c sles9-rc5-sav/arch/ppc64/mm/init.c --- sles9-rc5-vanilla/arch/ppc64/mm/init.c 2004-07-16 16:11:54.000000000 -0500 +++ sles9-rc5-sav/arch/ppc64/mm/init.c 2004-07-22 15:17:56.000000000 -0500 @@ -392,9 +392,28 @@ void iounmap(void *addr) return; } +static int iounmap_subset_regions(void *addr, unsigned long size) +{ + struct vm_struct *area; + + /* Check whether subsets of this region exist */ + area = im_get_area((unsigned long) addr, size, IM_REGION_SUPERSET); + if (area == NULL) + return 1; + + while (area) { + iounmap(area->addr); + area = im_get_area((unsigned long) addr, size, + IM_REGION_SUPERSET); + } + + return 0; +} + int iounmap_explicit(void *addr, unsigned 
long size) { struct vm_struct *area; + int rc; /* addr could be in EEH or IO region, map it to IO region regardless. */ @@ -407,12 +426,18 @@ int iounmap_explicit(void *addr, unsigne area = im_get_area((unsigned long) addr, size, IM_REGION_EXISTS | IM_REGION_SUBSET); if (area == NULL) { - printk(KERN_ERR "%s() cannot unmap nonexistant range 0x%lx\n", - __FUNCTION__, (unsigned long) addr); - return 1; + /* Determine whether subset regions exist. If so, unmap */ + rc = iounmap_subset_regions(addr, size); + if (rc) { + printk(KERN_ERR + "%s() cannot unmap nonexistant range 0x%lx\n", + __FUNCTION__, (unsigned long) addr); + return 1; + } + } else { + iounmap(area->addr); } - iounmap(area->addr); return 0; } diff -X /home/johnrose/tmp/diffignore.txt -urpN sles9-rc5-vanilla/include/asm-ppc64/pgtable.h sles9-rc5-sav/include/asm-ppc64/pgtable.h --- sles9-rc5-vanilla/include/asm-ppc64/pgtable.h 2004-07-16 16:12:00.000000000 -0500 +++ sles9-rc5-sav/include/asm-ppc64/pgtable.h 2004-07-22 15:09:35.000000000 -0500 @@ -467,6 +467,7 @@ extern void hpte_init_iSeries(void); #define IM_REGION_SUBSET 0x2 #define IM_REGION_EXISTS 0x4 #define IM_REGION_OVERLAP 0x8 +#define IM_REGION_SUPERSET 0x10 extern struct vm_struct * im_get_free_area(unsigned long size); extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, On Mon, 2004-07-19 at 13:30, John Rose wrote: > The patch below implements the ability to query outstanding imalloc regions for > a given virtual address range. The patch extends im_get_area() to allow a > region criterion of IM_REGION_SUPERSET. For a particular "superset" virtual > address and size passed into im_get_area(), the function returns the first > outstanding region that is contained within this superset region. > > The patch also changes iounmap_explicit() to allow for the unmapping of all > regions that fit under a "superset". > > This ability is necessary for PHB DLPAR.
For a PHB removal, the RPA requires > that all of its children slots already be dynamically removed. Each of these > slot-level removals has fractured the imalloc region assigned to the PHB at > boot. At PHB removal time, it is necessary to iounmap() the remaining > artifacts of the initial PHB region. > > Thanks- > John > > Signed-off-by: John Rose From linas at austin.ibm.com Fri Jul 23 07:38:29 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Thu, 22 Jul 2004 16:38:29 -0500 Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size In-Reply-To: <16640.5253.20520.720962@cargo.ozlabs.ibm.com> References: <20040721204840.GD13171@austin.ibm.com> <16640.5253.20520.720962@cargo.ozlabs.ibm.com> Message-ID: <20040722213829.GD13396@austin.ibm.com> On Thu, Jul 22, 2004 at 03:24:53PM -0400, Paul Mackerras was heard to remark: > > > Firmware expects error log sizes to be of a very specific size, > > but different versions of firmware apparently expect different > > sizes; using the wrong size results in a painful, hard-to-debug > > crash in firmware. Benh provided a patch for this some months ago, > > but apparently missed this code path. This patch sets up the log > > buffer size dynamically; it also fixes a bug with the return code > > not being handled correctly. > > Just small nit, but this hunk looks pointless: > > > @@ -105,6 +111,7 @@ > > err_args = rtas.args; > > rtas.args = save_args; > > > > + err_args.rets = (rtas_arg_t *)&(err_args.args[2]); > > return err_args.rets[0]; > > } heh. I wondered if anyone would notice.. Yes, put this line back to where it used to be. Silly me, for a moment there, I was thinking that firmware was looking at this value and so I was desperately trying to make sure it was right; I needn't have worried. I realized this right after I hit "send" on email ... > The rest of it looks fine. If you agree that this hunk can go, I'll > send the rest of it upstream.
Yep, Just make sure that 'rets' is set somewhere in there, anywhere (e.g. put this line back to how it used to be). --linas From nathanl at austin.ibm.com Fri Jul 23 08:09:01 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Thu, 22 Jul 2004 17:09:01 -0500 Subject: identifying cpus (and smt threads) In-Reply-To: <20040722125609.GA1466@in.ibm.com> References: <20040722125609.GA1466@in.ibm.com> Message-ID: <1090534141.3041.24.camel@booger> On Thu, 2004-07-22 at 07:56, R Sharada wrote: > A few questions here: > - why is there a cpu_present_at_boot map? It seems to be the same as the > cpu_available_map. Or did I miss something? It's cruft and we should get rid of both of those maps and use cpu_present_map like other architectures (e.g. ia64). I posted a patch a few days ago which does this, but it needs more work to accommodate cpu hotplug. Nathan From nfont at austin.ibm.com Fri Jul 23 08:25:05 2004 From: nfont at austin.ibm.com (Nathan Fontenot) Date: Thu, 22 Jul 2004 17:25:05 -0500 Subject: [PATCH] updates to surveillance for power5 Message-ID: <41003EC1.5030109@austin.ibm.com> This patch updates enable_surveillance() so we do not return an error on platforms (notably power5) that do not have a surveillance sensor. Additionally, the rtas_call was changed to rtas_set_indicator so as to avoid having to handle RTAS_BUSY returns. Since this is only called when a surveillance timeout is specified at the boot prompt I also added a printk to inform users. Paul or Anton, could you push this upstream if there are no objections. Signed-off-by: Nathan Fontenot thanks, -Nathan F. -------------- next part -------------- A non-text attachment was scrubbed...
Name: surv_p5update-linus.patch Type: text/x-patch Size: 738 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20040722/2b080c7f/attachment.bin From ananth at in.ibm.com Fri Jul 23 15:44:08 2004 From: ananth at in.ibm.com (Ananth N Mavinakayanahalli) Date: Fri, 23 Jul 2004 10:44:08 +0500 Subject: [PATCH] set tbl->it_type to TCE_PCI in iommu code In-Reply-To: <16640.3769.571620.779049@cargo.ozlabs.ibm.com> References: <20040722113435.GA12185@in.ibm.com> <16640.3769.571620.779049@cargo.ozlabs.ibm.com> Message-ID: <20040723054408.GA13344@in.ibm.com> On Thu, Jul 22, 2004 at 03:00:09PM -0400, Paul Mackerras wrote: Hi Paul, > > Here is a patch that sets struct iommu_table->it_type to TCE_PCI in > > pSeries_iommu.c. Please apply > > This certainly looks like a good thing to do. To help me explain it > better to Andrew Morton, could you tell me whether this fixes an > actual bug (and if so, what bug)? Or is it just for code cleanness, > or to help debugging? I noticed that the it_type was not being set when updating the old tce_table code in ppc64 kdb to use the changed iommu structs. This is just for code completeness (and it is updated in iSeries_iommu.c, but was somehow missed in pSeries_iommu.c). Thanks, Ananth From haveblue at us.ibm.com Sat Jul 24 00:38:37 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Fri, 23 Jul 2004 07:38:37 -0700 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <16640.4535.25999.780795@cargo.ozlabs.ibm.com> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> <1090439865.5862.5.camel@nighthawk> <16640.4535.25999.780795@cargo.ozlabs.ibm.com> Message-ID: <1090593517.8876.24.camel@nighthawk> On Thu, 2004-07-22 at 12:12, Paul Mackerras wrote: > First, we won't remove any /proc files from 2.6 that are being used by > applications. That's a job for 2.7.
(I think we could remove things > like /proc/ppc64/rtas/poweron though.) What about the new kernel development model? ;) > Secondly, we can discuss the rmo_buffer thing if you like, but I'll be > surprised if you can come up with something that doesn't involve > unnecessarily duplicating code. The users of rmo_buffer absolutely > need to know the physical address of the rmo_buffer memory. Given > that, I don't see any good reason for making a device with an mmap > method for mapping the rmo_buffer memory into userspace when we > already have a perfectly good device which we can use for that > purpose, namely /dev/mem. > > In other words, you'll need to produce a more sophisticated technical > argument than just "/dev/mem? ugh..". 8-) The argument against /dev/mem is pretty simple: it's a bit too big of a hammer. Using it will force you to either run as root, or allow a non-root user/group to write to it, which is an equivalent of giving them root. If it were another device, you can give selected users access to it, and the worst they can do is corrupt RTAS calls (which may be a root equivalent too, I imagine). A suggestion would be to use the current ppc_rtas to get that information back out. Could there be some special args passed to the syscall that give you the address of the buffer? Something conceptually along the lines of: if (uargs->token == SPECIAL_TOKEN) { uargs->rets = phys_to_virt(&rtas_rmo_buf); return foo; } So, from userspace and librtas, you just see it as a normal ppc_rtas call with a token that isn't defined by the hardware spec. This has the advantage of isolating all of the rtas code to a slightly more restricted area. It also makes it more explicit to someone reading the code that the same users of ppc_rtas also want to know the physical address of the rtas_rmo_buf. -- Dave ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From paulus at samba.org Sat Jul 24 04:15:11 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 23 Jul 2004 14:15:11 -0400 Subject: Another: [PATCH} 2.6 ppc64 rtas crash in kmalloc In-Reply-To: <20040721211316.GS13171@austin.ibm.com> References: <20040721211316.GS13171@austin.ibm.com> Message-ID: <16641.21935.42185.14181@cargo.ozlabs.ibm.com> Linas Vepstas writes: > The recent set of rtas patches I sent in (about a week ago) > introduced a bug: the possible use of kmalloc before the > VM subsystem is initialized. This patch checks for the VM > subsystem being ready, and avoids the kmalloc if it's not. > + /* Can't call kmalloc if VM subsystem is not yet up. */ > + struct cache_sizes *csizep = malloc_sizes; > + if (csizep->cs_cachep) { Please just use if (mem_init_done) for this, instead of looking at internal slab cache variables. Thanks, Paul. From paulus at samba.org Sat Jul 24 06:41:56 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 23 Jul 2004 16:41:56 -0400 Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size In-Reply-To: <20040722213829.GD13396@austin.ibm.com> References: <20040721204840.GD13171@austin.ibm.com> <16640.5253.20520.720962@cargo.ozlabs.ibm.com> <20040722213829.GD13396@austin.ibm.com> Message-ID: <16641.30740.797512.714695@cargo.ozlabs.ibm.com> Linas Vepstas writes: > heh. I wondered if anyone would notice.. Yes, put this line back to > where it used to be. > > Silly me, for a moment there, I was thinking that firmware was looking > at this value and so I was desperately trying to make sure it was > right; I needn't have worried. I realized this right after I hit > "send" on email ... > > > The rest of it looks fine. If you agree that this hunk can go, I'll > > send the rest of it upstream. > > Yep, Just make sure that 'rets' is set somewhere in there, anywhere > (e.g. put this line back to how it used to be).
In fact I wasn't quite correct, since the next line dereferenced err_args.rets. However, it isn't necessary to set either err_args.rets or rtas.args.rets in that function, since neither enter_rtas() nor RTAS itself look at that field. How about this patch instead? Regards, Paul. diff -urN linux-2.5/arch/ppc64/kernel/rtas.c test25/arch/ppc64/kernel/rtas.c --- linux-2.5/arch/ppc64/kernel/rtas.c 2004-07-06 08:43:03.000000000 +1000 +++ test25/arch/ppc64/kernel/rtas.c 2004-07-24 05:22:16.716914216 +1000 @@ -22,7 +22,6 @@ #include #include #include -#include #include #include #include @@ -73,7 +72,6 @@ return tokp ? *tokp : RTAS_UNKNOWN_SERVICE; } - /** Return a copy of the detailed error text associated with the * most recent failed call to rtas. Because the error text * might go stale if there are any other intervening rtas calls, @@ -84,28 +82,32 @@ __fetch_rtas_last_error(void) { struct rtas_args err_args, save_args; + u32 bufsz; + + bufsz = rtas_token ("rtas-error-log-max"); + if ((bufsz == RTAS_UNKNOWN_SERVICE) || + (bufsz > RTAS_ERROR_LOG_MAX)) { + printk (KERN_WARNING "RTAS: bad log buffer size %d\n", bufsz); + bufsz = RTAS_ERROR_LOG_MAX; + } err_args.token = rtas_token("rtas-last-error"); err_args.nargs = 2; err_args.nret = 1; - err_args.rets = (rtas_arg_t *)&(err_args.args[2]); err_args.args[0] = (rtas_arg_t)__pa(rtas_err_buf); - err_args.args[1] = RTAS_ERROR_LOG_MAX; + err_args.args[1] = bufsz; err_args.args[2] = 0; save_args = rtas.args; rtas.args = err_args; - PPCDBG(PPCDBG_RTAS, "\tentering rtas with 0x%lx\n", - __pa(&err_args)); enter_rtas(__pa(&rtas.args)); - PPCDBG(PPCDBG_RTAS, "\treturned from rtas ...\n"); err_args = rtas.args; rtas.args = save_args; - return err_args.rets[0]; + return err_args.args[2]; } int rtas_call(int token, int nargs, int nret, int *outputs, ...) ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From paulus at samba.org Sat Jul 24 09:16:59 2004 From: paulus at samba.org (Paul Mackerras) Date: Fri, 23 Jul 2004 19:16:59 -0400 Subject: [PATCH] 2.6 PPC64: Reduce rtas spamming In-Reply-To: <1090593517.8876.24.camel@nighthawk> References: <40FBFC49.1070804@austin.ibm.com> <20040721055840.GA18787@kroah.com> <40FEBC0E.7060005@austin.ibm.com> <1090439865.5862.5.camel@nighthawk> <16640.4535.25999.780795@cargo.ozlabs.ibm.com> <1090593517.8876.24.camel@nighthawk> Message-ID: <16641.40043.346159.297712@cargo.ozlabs.ibm.com> Dave Hansen writes: > On Thu, 2004-07-22 at 12:12, Paul Mackerras wrote: > > First, we won't remove any /proc files from 2.6 that are being used by > > applications. That's a job for 2.7. (I think we could remove things > > like /proc/ppc64/rtas/poweron though.) > > What about the new kernel development model? ;) What about it? I'm just saying that we're not going to remove an existing interface that usermode is using in the 2.6 series kernels. > The argument against /dev/mem is pretty simple: it's a bit too big of a > hammer. Using it will force you to either run as root, or allow a > non-root user/group to write to it, which is an equivalent of giving > them root. We only want root to be able to do RTAS calls. If you can do RTAS calls you can do things like reboot or power off the system, enable/disable interrupts, access PCI config space, stop the CPU, or update flash ROM or nvram. I'm pretty sure you could persuade RTAS to write to arbitrary physical addresses below 4GB too. > A suggestion would be to use the current ppc_rtas to get that > information back out. Could there be some special args passed to the > syscall that give you the address of the buffer? > > Something conceptually along the lines of: > > if (uargs->token == SPECIAL_TOKEN) { > uargs->rets = phys_to_virt(&rtas_rmo_buf); > return foo; > } Yes, that's not a bad idea iff we can find a token value that firmware is guaranteed not to use. 
I don't see anything in the RPA that says that token values are restricted to any particular range though. We could use some other invalid input such as a NULL argument buffer pointer perhaps instead. Paul. From linas at austin.ibm.com Sat Jul 24 09:58:59 2004 From: linas at austin.ibm.com (Linas Vepstas) Date: Fri, 23 Jul 2004 18:58:59 -0500 Subject: [PATCH] 2.6 ppc64 RTAS: use dynamic buffer size In-Reply-To: <16641.30740.797512.714695@cargo.ozlabs.ibm.com> References: <20040721204840.GD13171@austin.ibm.com> <16640.5253.20520.720962@cargo.ozlabs.ibm.com> <20040722213829.GD13396@austin.ibm.com> <16641.30740.797512.714695@cargo.ozlabs.ibm.com> Message-ID: <20040723235859.GG13396@austin.ibm.com> Yes, that will work too. --linas On Fri, Jul 23, 2004 at 04:41:56PM -0400, Paul Mackerras was heard to remark: > Linas Vepstas writes: > > > heh. I wondered if anyone would notice.. Yes, put this line back to > > where it used to be. > > > > Silly me, for a moment there, I was thinking that firmware was > > looking at this value and so I was desperately trying to make sure > > it was right; I needn't have worried. I realized this right after I > > hit "send" on email ... > > > > > The rest of it looks fine. If you agree that this hunk can go, > > > I'll send the rest of it upstream. > > > > Yep, Just make sure that 'rets' is set somewhere in there, anywhere > > (e.g. put this line back to how it used to be). > > In fact I wasn't quite correct, since the next line dereferenced > err_args.rets. However, it isn't necessary to set either err_args.rets > or rtas.args.rets in that function, since neither enter_rtas() nor > RTAS itself look at that field. ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Sat Jul 24 16:43:10 2004 From: nathanl at austin.ibm.com (Nathan Lynch) Date: Sat, 24 Jul 2004 01:43:10 -0500 Subject: UP load_up_fpu crash (2.6.8-rc2) Message-ID: <1090651390.3592.11.camel@booger> We seem to be broken with CONFIG_SMP=n on 2.6.8-rc2 and 2.6.8-rc1-mm1: Freeing unused kernel memory: 280k freed INIT: version 2.85 booting Vector: 300 (Data Access) at [c00000003f043bb0] pc: c00000000000bab0: .load_up_fpu+0xb0/0x16c lr: 00000000400272b8 sp: c00000003f043e30 msr: 8000000000003032 dar: 108 dsisr: 40000000 current = 0xc00000003f03d440 paca = 0xc0000000003cc000 pid = 327, comm = hotplug enter ? for help mon> t [c00000003f043e30] c00000000000b4d8 .handle_page_fault+0x20/0x40 (unreliable) --- Exception: 801 (FPU Unavailable) at 000000004000b908 SP (ffffe480) is in userspace Any ideas? I've seen this on pSeries lpar, haven't tried anything else. SMP kernels are fine. I'll try to track down the version where this first appeared as time permits. Nathan From anton at samba.org Sun Jul 25 05:15:37 2004 From: anton at samba.org (Anton Blanchard) Date: Sun, 25 Jul 2004 05:15:37 +1000 Subject: [PATCH] Fix migrate_irqs_away Message-ID: <20040724191537.GA20839@krispykreme> In migrate_irqs_away we weren't converting a virtual irq to a real one.
Signed-off-by: Anton Blanchard ===== xics.c 1.44 vs edited ===== --- 1.44/arch/ppc64/kernel/xics.c Fri May 28 15:34:39 2004 +++ edited/xics.c Sun Jul 25 04:59:24 2004 @@ -657,7 +657,7 @@ int set_indicator = rtas_token("set-indicator"); const unsigned int giqs = 9005UL; /* Global Interrupt Queue Server */ int status = 0; - unsigned int irq, cpu = smp_processor_id(); + unsigned int irq, virq, cpu = smp_processor_id(); int xics_status[2]; unsigned long flags; @@ -677,11 +677,12 @@ iosync(); printk(KERN_WARNING "HOTPLUG: Migrating IRQs away\n"); - for_each_irq(irq) { - irq_desc_t *desc = get_irq_desc(irq); + for_each_irq(virq) { + irq_desc_t *desc = get_irq_desc(virq); /* We need to get IPIs still. */ - if (irq_offset_down(irq) == XICS_IPI) + irq = virt_irq_to_real(irq_offset_down(virq)); + if (irq == XICS_IPI || irq == NO_IRQ) continue; /* We only need to migrate enabled IRQS */ @@ -696,7 +697,7 @@ if (status) { printk(KERN_ERR "migrate_irqs_away: irq=%d " "ibm,get-xive returns %d\n", - irq, status); + virq, status); goto unlock; } @@ -709,17 +710,17 @@ goto unlock; printk(KERN_WARNING "IRQ %d affinity broken off cpu %u\n", - irq, cpu); + virq, cpu); /* Reset affinity to all cpus */ xics_status[0] = default_distrib_server; - status = rtas_call(ibm_set_xive, 3, 1, NULL, - irq, xics_status[0], xics_status[1]); + status = rtas_call(ibm_set_xive, 3, 1, NULL, irq, + xics_status[0], xics_status[1]); if (status) printk(KERN_ERR "migrate_irqs_away irq=%d " "ibm,set-xive returns %d\n", - irq, status); + virq, status); unlock: spin_unlock_irqrestore(&desc->lock, flags); ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From haveblue at us.ibm.com Tue Jul 27 02:28:11 2004 From: haveblue at us.ibm.com (Dave Hansen) Date: Mon, 26 Jul 2004 09:28:11 -0700 Subject: UP load_up_fpu crash (2.6.8-rc2) In-Reply-To: <1090651390.3592.11.camel@booger> References: <1090651390.3592.11.camel@booger> Message-ID: <1090859291.21372.2535.camel@nighthawk> On Fri, 2004-07-23 at 23:43, Nathan Lynch wrote: > We seem to be broken with CONFIG_SMP=n on 2.6.8-rc2 and 2.6.8-rc1-mm1: > > Freeing unused kernel memory: 280k freed > INIT: version 2.85 booting > Vector: 300 (Data Access) at [c00000003f043bb0] > pc: c00000000000bab0: .load_up_fpu+0xb0/0x16c ... > --- Exception: 801 (FPU Unavailable) at 000000004000b908 > SP (ffffe480) is in userspace > > > Any ideas? I've seen this on pSeries lpar, haven't tried anything > else. SMP kernels are fine. I'll try to track down the version where > this first appeared as time permits. I was seeing the same exception, but in a slightly different place. I _think_ I saw it within some altivec code. I thought it was related to the failed CPU bringup. I'll post it if I see it again. -- Dave From johnrose at austin.ibm.com Tue Jul 27 08:57:04 2004 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 26 Jul 2004 17:57:04 -0500 Subject: [PATCH] ISA OFDT node refcount fix Message-ID: <1090882624.5914.1.camel@sinatra.austin.ibm.com> The patch below moves a misplaced of_node_put(). In the existing code, the node in question is used just after its refcount is decremented. Please apply.
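For anyone unfamiliar with the pattern, here is a toy illustration of why the ordering matters. It is only a sketch: struct node and the node_get()/node_put()/node_use() helpers are hypothetical stand-ins, not the OF device tree API, and it assumes (as the fix implies) that the lookup hands back a reference which the caller has to drop after it is done with the node.

```c
#include <stddef.h>

/* Hypothetical refcounted node; stands in for a device tree node. */
struct node {
	int refcount;
	int freed;   /* set once the last reference has been dropped */
	int value;
};

/* Take a reference, as a lookup routine would on behalf of its caller. */
static struct node *node_get(struct node *n)
{
	if (n)
		n->refcount++;
	return n;
}

/* Drop a reference; a real implementation would free the node at zero. */
static void node_put(struct node *n)
{
	if (n && --n->refcount == 0)
		n->freed = 1;
}

/* Using a node after its last reference is gone is a bug; report it as -1. */
static int node_use(const struct node *n)
{
	return n->freed ? -1 : n->value;
}

/* Correct ordering: use the node first, drop the reference last. */
static int lookup_and_use(struct node *n)
{
	struct node *ref = node_get(n);
	int ret;

	if (!ref)
		return -1;
	ret = node_use(ref);
	node_put(ref);
	return ret;
}
```

Calling node_put() before node_use() in lookup_and_use() would reproduce the ordering that the patch corrects.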
Thanks- John Signed-off-by: John Rose diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Mon Jul 26 17:47:44 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Mon Jul 26 17:47:44 2004 @@ -284,10 +284,10 @@ isa_dn = of_find_node_by_type(NULL, "isa"); if (isa_dn) { isa_io_base = pci_io_base; - of_node_put(isa_dn); pci_process_ISA_OF_ranges(isa_dn, hose->io_base_phys, hose->io_base_virt); + of_node_put(isa_dn); /* Allow all IO */ io_page_mask = -1; } ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Tue Jul 27 09:06:21 2004 From: johnrose at austin.ibm.com (John Rose) Date: Mon, 26 Jul 2004 18:06:21 -0500 Subject: [PATCH] remove linux,pci-domain from OFDT Message-ID: <1090883181.5914.4.camel@sinatra.austin.ibm.com> The patch below scraps the creation of the "linux,pci-domain" property in the OF device tree for each PCI Host Bridge. This seems appropriate for the following reasons: 1) It isn't referenced/used in the kernel. 2) It isn't exported to userspace, since it's added after /proc/device-tree is created. 3) Even if it was correctly exported to userspace, the same info is already available in sysfs. Please apply, if there are no problems. 
Thanks- John Signed-off-by: John Rose diff -Nru a/arch/ppc64/kernel/pSeries_pci.c b/arch/ppc64/kernel/pSeries_pci.c --- a/arch/ppc64/kernel/pSeries_pci.c Mon Jul 26 17:50:29 2004 +++ b/arch/ppc64/kernel/pSeries_pci.c Mon Jul 26 17:50:29 2004 @@ -402,7 +402,6 @@ int *bus_range; char *model; enum phb_types phb_type; - struct property *of_prop; model = (char *)get_property(dev, "model", NULL); @@ -448,21 +447,6 @@ kfree(phb); return NULL; } - - of_prop = (struct property *)alloc_bootmem(sizeof(struct property) + - sizeof(phb->global_number)); - - if (!of_prop) { - kfree(phb); - return NULL; - } - - memset(of_prop, 0, sizeof(struct property)); - of_prop->name = "linux,pci-domain"; - of_prop->length = sizeof(phb->global_number); - of_prop->value = (unsigned char *)&of_prop[1]; - memcpy(of_prop->value, &phb->global_number, sizeof(phb->global_number)); - prom_add_property(dev, of_prop); phb->first_busno = bus_range[0]; phb->last_busno = bus_range[1]; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From johnrose at austin.ibm.com Wed Jul 28 06:24:00 2004 From: johnrose at austin.ibm.com (John Rose) Date: Tue, 27 Jul 2004 15:24:00 -0500 Subject: [PATCH] [trivial] struct pci_controller cleanup Message-ID: <1090959840.15403.3.camel@sinatra.austin.ibm.com> The patch below removes the unused member "pci_io_offset" from struct pci_controller. If there are no problems, please apply. Thanks- John Signed-off-by: John Rose diff -Nru a/include/asm-ppc64/pci-bridge.h b/include/asm-ppc64/pci-bridge.h --- a/include/asm-ppc64/pci-bridge.h Tue Jul 27 15:17:38 2004 +++ b/include/asm-ppc64/pci-bridge.h Tue Jul 27 15:17:38 2004 @@ -47,7 +47,6 @@ * the PCI memory space in the CPU bus space */ unsigned long pci_mem_offset; - unsigned long pci_io_offset; struct pci_ops *ops; volatile unsigned int *cfg_addr; ** Sent via the linuxppc64-dev mail list.
See http://lists.linuxppc.org/ From sfr at canb.auug.org.au Wed Jul 28 09:03:01 2004 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Wed, 28 Jul 2004 09:03:01 +1000 Subject: [PATCH] PPC64 pci_dn cleanups Message-ID: <20040728090301.465d95b7.sfr@canb.auug.org.au> Hi Paul, This patch just cleans up arch/ppc64/kernel/pci_dn.c a bit, including: - remove it from the iSeries build completely - small changes to Makefile - remove the "post" parameter from traverse_pci_devices as noone used it - make traverse_all_pci_devices static - remove CONFIG_PPC_PSERIES tests as we no longer build for iSeries - some reformatting (closer to "standard") - remove some of pointer casts Please apply and consider for upstream. This has been built (with default config) on pSeries and pmac and built and run on iSeries. Signed-off-by: Stephen Rothwell -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/Makefile 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/Makefile --- 2.6.8-rc2-bk6/arch/ppc64/kernel/Makefile 2004-07-21 08:39:34.000000000 +1000 +++ 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/Makefile 2004-07-28 05:46:49.000000000 +1000 @@ -15,14 +15,11 @@ obj-y := setup.o entry.o t obj-$(CONFIG_PPC_OF) += of_device.o -obj-$(CONFIG_PCI) += pci.o pci_dn.o pci_iommu.o +pci-obj-$(CONFIG_PPC_ISERIES) += iSeries_pci.o iSeries_pci_reset.o \ + iSeries_IoMmTable.o +pci-obj-$(CONFIG_PPC_PSERIES) += pci_dn.o pci_dma_direct.o -ifdef CONFIG_PPC_ISERIES -obj-$(CONFIG_PCI) += iSeries_pci.o iSeries_pci_reset.o \ - iSeries_IoMmTable.o -else -obj-$(CONFIG_PCI) += pci_dma_direct.o -endif +obj-$(CONFIG_PCI) += pci.o pci_iommu.o $(pci-obj-y) obj-$(CONFIG_PPC_ISERIES) += iSeries_irq.o \ iSeries_VpdInfo.o XmPciLpEvent.o \ diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/eeh.c 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/eeh.c --- 2.6.8-rc2-bk6/arch/ppc64/kernel/eeh.c 2004-07-28 01:32:09.000000000 +1000 +++ 
2.6.8-rc2-bk6.pci/arch/ppc64/kernel/eeh.c 2004-07-28 05:15:46.000000000 +1000 @@ -618,7 +618,7 @@ void __init eeh_init(void) info.buid_lo = BUID_LO(buid); info.buid_hi = BUID_HI(buid); - traverse_pci_devices(phb, early_enable_eeh, NULL, &info); + traverse_pci_devices(phb, early_enable_eeh, &info); } if (eeh_subsystem_enabled) { diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/pci.h 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci.h --- 2.6.8-rc2-bk6/arch/ppc64/kernel/pci.h 2004-07-21 08:39:34.000000000 +1000 +++ 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci.h 2004-07-28 05:15:36.000000000 +1000 @@ -34,8 +34,8 @@ extern struct pci_dev *ppc64_isabridge_d *******************************************************************/ struct device_node; typedef void *(*traverse_func)(struct device_node *me, void *data); -void *traverse_pci_devices(struct device_node *start, traverse_func pre, traverse_func post, void *data); -void *traverse_all_pci_devices(traverse_func pre); +void *traverse_pci_devices(struct device_node *start, traverse_func pre, + void *data); void pci_devs_phb_init(void); void pci_fix_bus_sysdata(void); diff -ruNp 2.6.8-rc2-bk6/arch/ppc64/kernel/pci_dn.c 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci_dn.c --- 2.6.8-rc2-bk6/arch/ppc64/kernel/pci_dn.c 2004-07-28 01:32:09.000000000 +1000 +++ 2.6.8-rc2-bk6.pci/arch/ppc64/kernel/pci_dn.c 2004-07-28 05:49:07.000000000 +1000 @@ -19,8 +19,6 @@ * along with this program; if not, write to the Free Software * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ - -#include #include #include #include @@ -40,20 +38,20 @@ #include "pci.h" -/* Traverse_func that inits the PCI fields of the device node. +/* + * Traverse_func that inits the PCI fields of the device node. * NOTE: this *must* be done before read/write config to the device. 
*/ -static void * __init -update_dn_pci_info(struct device_node *dn, void *data) +static void * __init update_dn_pci_info(struct device_node *dn, void *data) { -#ifdef CONFIG_PPC_PSERIES - struct pci_controller *phb = (struct pci_controller *)data; + struct pci_controller *phb = data; u32 *regs; char *device_type = get_property(dn, "device_type", NULL); char *model; dn->phb = phb; - if (device_type && strcmp(device_type, "pci") == 0 && get_property(dn, "class-code", NULL) == 0) { + if (device_type && (strcmp(device_type, "pci") == 0) && + (get_property(dn, "class-code", NULL) == 0)) { /* special case for PHB's. Sigh. */ regs = (u32 *)get_property(dn, "bus-range", NULL); dn->busno = regs[0]; @@ -72,57 +70,47 @@ update_dn_pci_info(struct device_node *d dn->devfn = (regs[0] >> 8) & 0xff; } } -#endif return NULL; } -/****************************************************************** +/* * Traverse a device tree stopping each PCI device in the tree. * This is done depth first. As each node is processed, a "pre" - * function is called, the children are processed recursively, and - * then a "post" function is called. + * function is called and the children are processed recursively. * - * The "pre" and "post" funcs return a value. If non-zero - * is returned from the "pre" func, the traversal stops and this - * value is returned. The return value from "post" is not used. - * This return value is useful when using traverse as - * a method of finding a device. + * The "pre" func returns a value. If non-zero is returned from + * the "pre" func, the traversal stops and this value is returned. + * This return value is useful when using traverse as a method of + * finding a device. * - * NOTE: we do not run the funcs for devices that do not appear to + * NOTE: we do not run the func for devices that do not appear to * be PCI except for the start node which we assume (this is good * because the start node is often a phb which may be missing PCI * properties). 
* We use the class-code as an indicator. If we run into * one of these nodes we also assume its siblings are non-pci for * performance. - * - ******************************************************************/ -void *traverse_pci_devices(struct device_node *start, traverse_func pre, traverse_func post, void *data) + */ +void *traverse_pci_devices(struct device_node *start, traverse_func pre, + void *data) { struct device_node *dn, *nextdn; void *ret; - if (pre && (ret = pre(start, data)) != NULL) + if (pre && ((ret = pre(start, data)) != NULL)) return ret; for (dn = start->child; dn; dn = nextdn) { nextdn = NULL; -#ifdef CONFIG_PPC_PSERIES if (get_property(dn, "class-code", NULL)) { - if (pre && (ret = pre(dn, data)) != NULL) + if (pre && ((ret = pre(dn, data)) != NULL)) return ret; - if (dn->child) { + if (dn->child) /* Depth first...do children */ nextdn = dn->child; - } else if (dn->sibling) { + else if (dn->sibling) /* ok, try next sibling instead. */ nextdn = dn->sibling; - } else { - /* no more children or siblings...call "post" */ - if (post) - post(dn, data); - } } -#endif if (!nextdn) { /* Walk up to next valid sibling. */ do { @@ -136,31 +124,35 @@ void *traverse_pci_devices(struct device return NULL; } -/* Same as traverse_pci_devices except this does it for all phbs. +/* + * Same as traverse_pci_devices except this does it for all phbs. */ -void *traverse_all_pci_devices(traverse_func pre) +static void *traverse_all_pci_devices(traverse_func pre) { - struct pci_controller* phb; + struct pci_controller *phb; void *ret; - for (phb=hose_head;phb;phb=phb->next) - if ((ret = traverse_pci_devices((struct device_node *)phb->arch_data, pre, NULL, phb)) != NULL) + + for (phb = hose_head; phb; phb = phb->next) + if ((ret = traverse_pci_devices(phb->arch_data, pre, phb)) + != NULL) return ret; return NULL; } -/* Traversal func that looks for a value. +/* + * Traversal func that looks for a value. 
* If found, the device_node is returned (thus terminating the traversal). */ -static void * -is_devfn_node(struct device_node *dn, void *data) +static void *is_devfn_node(struct device_node *dn, void *data) { int busno = ((unsigned long)data >> 8) & 0xff; int devfn = ((unsigned long)data) & 0xff; - return (devfn == dn->devfn && busno == dn->busno) ? dn : NULL; + return ((devfn == dn->devfn) && (busno == dn->busno)) ? dn : NULL; } -/* This is the "slow" path for looking up a device_node from a +/* + * This is the "slow" path for looking up a device_node from a * pci_dev. It will hunt for the device under its parent's * phb and then update sysdata for a future fastpath. * @@ -174,14 +166,14 @@ is_devfn_node(struct device_node *dn, vo */ struct device_node *fetch_dev_dn(struct pci_dev *dev) { - struct device_node *orig_dn = (struct device_node *)dev->sysdata; + struct device_node *orig_dn = dev->sysdata; struct pci_controller *phb = orig_dn->phb; /* assume same phb as orig_dn */ struct device_node *phb_dn; struct device_node *dn; unsigned long searchval = (dev->bus->number << 8) | dev->devfn; - phb_dn = (struct device_node *)(phb->arch_data); - dn = (struct device_node *)traverse_pci_devices(phb_dn, is_devfn_node, NULL, (void *)searchval); + phb_dn = phb->arch_data; + dn = traverse_pci_devices(phb_dn, is_devfn_node, (void *)searchval); if (dn) { dev->sysdata = dn; /* ToDo: call some device init hook here */ @@ -191,25 +183,23 @@ struct device_node *fetch_dev_dn(struct EXPORT_SYMBOL(fetch_dev_dn); -/****************************************************************** +/* * Actually initialize the phbs. * The buswalk on this phb has not happened yet. - ******************************************************************/ -void __init -pci_devs_phb_init(void) + */ +void __init pci_devs_phb_init(void) { /* This must be done first so the device nodes have valid pci info! 
*/ traverse_all_pci_devices(update_dn_pci_info); } -static void __init -pci_fixup_bus_sysdata_list(struct list_head *bus_list) +static void __init pci_fixup_bus_sysdata_list(struct list_head *bus_list) { struct list_head *ln; struct pci_bus *bus; - for (ln=bus_list->next; ln != bus_list; ln=ln->next) { + for (ln = bus_list->next; ln != bus_list; ln = ln->next) { bus = pci_bus_b(ln); if (bus->self) bus->sysdata = bus->self->sysdata; @@ -217,7 +207,7 @@ pci_fixup_bus_sysdata_list(struct list_h } } -/****************************************************************** +/* * Fixup the bus->sysdata ptrs to point to the bus' device_node. * This is done late in pcibios_init(). We do this mostly for * sanity, but pci_dma.c uses these at DMA time so they must be @@ -225,9 +215,8 @@ pci_fixup_bus_sysdata_list(struct list_h * To do this we recurse down the bus hierarchy. Note that PHB's * have bus->self == NULL, but fortunately bus->sysdata is already * correct in this case. - ******************************************************************/ -void __init -pci_fix_bus_sysdata(void) + */ +void __init pci_fix_bus_sysdata(void) { pci_fixup_bus_sysdata_list(&pci_root_buses); } From paulus at samba.org Wed Jul 28 11:50:04 2004 From: paulus at samba.org (Paul Mackerras) Date: Tue, 27 Jul 2004 20:50:04 -0500 Subject: UP load_up_fpu crash (2.6.8-rc2) In-Reply-To: <1090651390.3592.11.camel@booger> References: <1090651390.3592.11.camel@booger> Message-ID: <16647.1612.244643.858755@cargo.ozlabs.ibm.com> Nathan Lynch writes: > We seem to be broken with CONFIG_SMP=n on 2.6.8-rc2 and 2.6.8-rc1-mm1: > > Freeing unused kernel memory: 280k freed > INIT: version 2.85 booting > Vector: 300 (Data Access) at [c00000003f043bb0] > pc: c00000000000bab0: .load_up_fpu+0xb0/0x16c > lr: 00000000400272b8 > sp: c00000003f043e30 > msr: 8000000000003032 > dar: 108 > dsisr: 40000000 > current = 0xc00000003f03d440 > paca = 0xc0000000003cc000 > pid = 327, comm = hotplug > enter ? 
for help > mon> t > [c00000003f043e30] c00000000000b4d8 .handle_page_fault+0x20/0x40 > (unreliable) > --- Exception: 801 (FPU Unavailable) at 000000004000b908 > SP (ffffe480) is in userspace This is very puzzling. It appears that we have taken a FPU unavailable trap from userspace, which is fine, but then it looks like we think some other task owns the FPU at the moment, and that task is a kernel thread. We are crashing because last_task_used_math->thread.regs is NULL. That should only happen for a kernel thread, but last_task_used_math should never point to a kernel thread. The only place that last_task_used_math gets set to a non-NULL value is in load_up_fpu, and that should only be called if we get a FPU unavailable trap from usermode. It would be very useful to see what last_task_used_math contains at the time of the crash, and see what last_task_used_math->comm is, so we can work out whether the task that owns the FPU is in fact a kernel thread - in which case we need to work out how last_task_used_math is getting to point at it - or if it isn't a kernel thread, in which case we need to work out why task->thread.regs is NULL for that task. Paul. ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From sharada at in.ibm.com Wed Jul 28 19:52:09 2004 From: sharada at in.ibm.com (R Sharada) Date: Wed, 28 Jul 2004 15:22:09 +0530 Subject: adding boot_cpu to cpu_online_map query Message-ID: <20040728095209.GA2224@in.ibm.com> Hello, Looking at the ppc64 code, I had a query on how the boot_cpuid is set in the cpu_online_map. 
Looking at prom.c (prom_hold_cpus), I see that the boot cpu is added in the cpu_online_map: #ifdef CONFIG_SMP else { prom_printf("%x : booting cpu %s\n", cpuid, path); cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_possible_map)); cpu_set(cpuid, RELOC(cpu_online_map)); cpu_set(cpuid, RELOC(cpu_present_at_boot)); } #endif Again, in start_kernel(), in smp_prepare_boot_cpu(), we again add the boot_cpuid in the cpu_online_map: BUG_ON(smp_processor_id() != boot_cpuid); /* cpu_possible is set up in prom.c */ cpu_set(boot_cpuid, cpu_online_map); paca[boot_cpuid].__current = current; current_set[boot_cpuid] = current->thread_info; Why is this being done twice for ppc64? Or did I miss something? I do see that the cpu_set in prom.c is under ifdef CONFIG_SMP. So, for SMP systems, the cpu_set for boot_cpu will be called twice and for non-SMP, it will be set in start_kernel? Thanks and Regards, Sharada ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Thu Jul 29 09:27:19 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 29 Jul 2004 09:27:19 +1000 Subject: [RFC/PATCH] How to block pci config-reads during device self-test? In-Reply-To: <20040713230328.GQ17333@bilge> References: <20040706161254.C21634@forte.austin.ibm.com> <40ED780C.6000401@us.ibm.com> <20040713230328.GQ17333@bilge> Message-ID: <1091057239.2333.34.camel@gaston> On Wed, 2004-07-14 at 09:03, Linas Vepstas wrote: > > On Thu, Jul 08, 2004 at 11:36:28AM -0500, Brian King was heard to remark: > > I've been doing some talking with various hardware folks to see if > > there is a way to get this fixed and so far the answer I have been > > getting is no. It is how the hardware works and there isn't much > > that can be done about it ... > > Actually, the hardware isn't broken ... its working as designed, more > or less per pci spec. See below for techninical details. Catching up a bit late... 
the whole in_interrupt() test with semaphore is totally disgusting... Also, PCI ops are called within a spinlock anyway. Just make it fail (return FF's) -- Benjamin Herrenschmidt ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From anton at samba.org Thu Jul 29 15:56:36 2004 From: anton at samba.org (Anton Blanchard) Date: Thu, 29 Jul 2004 15:56:36 +1000 Subject: [PATCH] [trivial] struct pci_controller cleanup In-Reply-To: <1090959840.15403.3.camel@sinatra.austin.ibm.com> References: <1090959840.15403.3.camel@sinatra.austin.ibm.com> Message-ID: <20040729055636.GG6456@krispykreme> Hi, Thanks John. Anton -- From: John Rose The patch below removes the unused member "pci_io_offset" from struct pci_controller. If there are no problems, please apply. Signed-off-by: John Rose Signed-off-by: Anton Blanchard diff -Nru a/include/asm-ppc64/pci-bridge.h b/include/asm-ppc64/pci-bridge.h --- a/include/asm-ppc64/pci-bridge.h Tue Jul 27 15:17:38 2004 +++ b/include/asm-ppc64/pci-bridge.h Tue Jul 27 15:17:38 2004 @@ -47,7 +47,6 @@ * the PCI memory space in the CPU bus space */ unsigned long pci_mem_offset; - unsigned long pci_io_offset; struct pci_ops *ops; volatile unsigned int *cfg_addr; ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From benh at kernel.crashing.org Fri Jul 30 15:22:32 2004 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 30 Jul 2004 15:22:32 +1000 Subject: [RFC][PATCH] ppc64: better handling of H_ENTER failures Message-ID: <1091164951.2077.34.camel@gaston> Hi ! This patch changes the hash insertion routines to return an error instead of calling panic() when HV refuses to insert a HPTE. The error is now propagated upstream, and either bad_page_fault() is called (kernel mode) or a SIGBUS signal is forced (user mode). Some other panic() cases are also turned into BUG_ON. 
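
The control flow described above can be sketched in plain C. The result codes (0 handled, 1 normal page fault, -1 critical hash insertion error) are the ones documented in the comment this patch adds above hash_page(); the enum and function names here are hypothetical, and in the patch itself the dispatch is split between assembly in head.S and the new low_hash_fault() routine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for what the fault path does next. */
enum toy_action {
    TOY_RESUME,              /* hash entry inserted, return from exception */
    TOY_NORMAL_FAULT_PATH,   /* hand over to the regular page fault handler */
    TOY_SIGBUS_TO_TASK,      /* user mode: force SIGBUS on the current task */
    TOY_KERNEL_BAD_FAULT     /* kernel mode: report a bad page fault */
};

/*
 * Map a hash_page()-style result code to an action instead of
 * calling panic() when the insertion is refused.
 */
static enum toy_action toy_dispatch(int hash_result, bool user_mode)
{
    switch (hash_result) {
    case 0:
        return TOY_RESUME;
    case 1:
        return TOY_NORMAL_FAULT_PATH;
    default:
        /* Only negative values are expected here in this sketch. */
        assert(hash_result < 0);
        return user_mode ? TOY_SIGBUS_TO_TASK : TOY_KERNEL_BAD_FAULT;
    }
}
```

The point of the sketch is that a refused insertion becomes one more return value to act on, so a misbehaving user process is signalled instead of the whole kernel going down.
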
Overall, this should provide us with better debugging data if the problem happens, and avoids errors from userland mapping /dev/mem and trying to use forbidden IOs (XFree ?) to bring the whole kernel down. Any comment ? Ben. ===== arch/ppc64/kernel/head.S 1.61 vs edited ===== --- 1.61/arch/ppc64/kernel/head.S 2004-06-17 00:46:06 -05:00 +++ edited/arch/ppc64/kernel/head.S 2004-06-23 15:03:55 -05:00 @@ -1015,6 +1015,9 @@ * interrupts if necessary. */ beq .ret_from_except_lite + /* For a hash failure, we don't bother re-enabling interrupts */ + ble- 12f + /* * hash_page couldn't handle it, set soft interrupt enable back * to what it was before the trap. Note that .local_irq_restore @@ -1025,6 +1028,8 @@ b 11f #else beq+ fast_exception_return /* Return from exception on success */ + ble- 12f /* Failure return from hash_page */ + /* fall through */ #endif @@ -1042,6 +1047,15 @@ addi r3,r1,STACK_FRAME_OVERHEAD lwz r4,_DAR(r1) bl .bad_page_fault + b .ret_from_except + +/* We have a page fault that hash_page could handle but HV refused + * the PTE insertion + */ +12: bl .save_nvgprs + addi r3,r1,STACK_FRAME_OVERHEAD + lwz r4,_DAR(r1) + bl .low_hash_fault b .ret_from_except /* here we have a segment miss */ ===== arch/ppc64/kernel/pSeries_lpar.c 1.36 vs edited ===== --- 1.36/arch/ppc64/kernel/pSeries_lpar.c 2004-04-12 12:54:09 -05:00 +++ edited/arch/ppc64/kernel/pSeries_lpar.c 2004-06-23 15:00:53 -05:00 @@ -377,7 +377,7 @@ lpar_rc = plpar_hcall(H_ENTER, flags, hpte_group, lhpte.dw0.dword0, lhpte.dw1.dword1, &slot, &dummy0, &dummy1); - if (lpar_rc == H_PTEG_Full) + if (unlikely(lpar_rc == H_PTEG_Full)) return -1; /* @@ -385,8 +385,13 @@ * will fail. However we must catch the failure in hash_page * or we will loop forever, so return -2 in this case. */ - if (lpar_rc != H_Success) + if (unlikely(lpar_rc != H_Success)) { + printk(KERN_WARNING + "MMU fault ! 
H_ENTER failed, rc: %lx, va: %016lx, prpn: %lx," + " hpteflags: %lx, secondary: %d, bolted: %d, large: %d\n", + lpar_rc, va, prpn, hpteflags, secondary, bolted, large); return -2; + } /* Because of iSeries, we have to pass down the secondary * bucket bit here as well @@ -415,9 +420,7 @@ if (lpar_rc == H_Success) return i; - if (lpar_rc != H_Not_Found) - panic("Bad return code from pte remove rc = %lx\n", - lpar_rc); + BUG_ON(lpar_rc != H_Not_Found); slot_offset++; slot_offset &= 0x7; @@ -447,8 +450,7 @@ if (lpar_rc == H_Not_Found) return -1; - if (lpar_rc != H_Success) - panic("bad return code from pte protect rc = %lx\n", lpar_rc); + BUG_ON(lpar_rc != H_Success); return 0; } @@ -467,8 +469,7 @@ lpar_rc = plpar_pte_read(flags, slot, &dword0, &dummy_word1); - if (lpar_rc != H_Success) - panic("Error on pte read in get_hpte0 rc = %lx\n", lpar_rc); + BUG_ON(lpar_rc != H_Success); return dword0; } @@ -519,15 +520,12 @@ vpn = va >> PAGE_SHIFT; slot = pSeries_lpar_hpte_find(vpn); - if (slot == -1) - panic("updateboltedpp: Could not find page to bolt\n"); + BUG_ON(slot == -1); flags = newpp & 3; lpar_rc = plpar_pte_protect(flags, slot, 0); - if (lpar_rc != H_Success) - panic("Bad return code from pte bolted protect rc = %lx\n", - lpar_rc); + BUG_ON(lpar_rc != H_Success); } static void pSeries_lpar_hpte_invalidate(unsigned long slot, unsigned long va, @@ -546,8 +544,7 @@ if (lpar_rc == H_Not_Found) return; - if (lpar_rc != H_Success) - panic("Bad return code from invalidate rc = %lx\n", lpar_rc); + BUG_ON(lpar_rc != H_Success); } /* ===== arch/ppc64/mm/hash_low.S 1.5 vs edited ===== --- 1.5/arch/ppc64/mm/hash_low.S 2004-04-12 12:54:08 -05:00 +++ edited/arch/ppc64/mm/hash_low.S 2004-06-22 16:14:22 -05:00 @@ -278,6 +278,10 @@ b bail htab_pte_insert_failure: - b .htab_insert_failure + /* Bail out restoring old PTE */ + ld r6,STK_PARM(r6)(r1) + std r31,0(r6) + li r3,-1 + b bail ===== arch/ppc64/mm/hash_utils.c 1.47 vs edited ===== --- 1.47/arch/ppc64/mm/hash_utils.c 2004-06-10 
01:21:41 -05:00 +++ edited/arch/ppc64/mm/hash_utils.c 2004-06-23 15:08:29 -05:00 @@ -28,6 +28,7 @@ #include #include #include +#include #include #include @@ -236,14 +237,11 @@ return pp; } -/* - * Called by asm hashtable.S in case of critical insert failure +/* Result code is: + * 0 - handled + * 1 - normal page fault + * -1 - critical hash insertion error */ -void htab_insert_failure(void) -{ - panic("hash_page: pte_insert failed\n"); -} - int hash_page(unsigned long ea, unsigned long access, unsigned long trap) { void *pgdir; @@ -374,6 +372,25 @@ *insn_addr = (unsigned int)(0x48000001 | (offset & 0x03fffffc)); flush_icache_range((unsigned long)insn_addr, 4+ (unsigned long)insn_addr); +} + +/* + * low_hash_fault is called when we the low level hash code failed + * to instert a PTE due to an hypervisor error + */ +void low_hash_fault(struct pt_regs *regs, unsigned long address) +{ + if (user_mode(regs)) { + siginfo_t info; + + info.si_signo = SIGBUS; + info.si_errno = 0; + info.si_code = BUS_ADRERR; + info.si_addr = (void *)address; + force_sig_info(SIGBUS, &info, current); + return; + } + bad_page_fault(regs, address, SIGBUS); } void __init htab_finish_init(void) ===== include/asm-ppc64/system.h 1.32 vs edited ===== --- 1.32/include/asm-ppc64/system.h 2004-06-20 20:15:35 -05:00 +++ edited/include/asm-ppc64/system.h 2004-06-23 15:08:01 -05:00 @@ -105,6 +105,7 @@ extern void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig); extern void show_regs(struct pt_regs * regs); +extern void low_hash_fault(struct pt_regs *regs, unsigned long address); extern int die(const char *str, struct pt_regs *regs, long err); extern void flush_instruction_cache(void); ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Sat Jul 31 07:45:50 2004 From: nathanl at austin.ibm.com (nathanl at austin.ibm.com) Date: Fri, 30 Jul 2004 16:45:50 -0500 Subject: [patch 1/4] Use platform numbering of cpus for hypervisor calls. 
Message-ID: <200407302145.i6ULj9c0068622@austin.ibm.com> We were using Linux's cpu numbering for cpu-related hypervisor calls (e.g. vpa registration, H_CONFER). It happened to work most of the time because Linux and the hypervisor usually, but not always, have the same numbering for cpus. Signed-off-by: Nathan Lynch --- diff -puN arch/ppc64/kernel/smp.c~ppc64_fix_hcall_cpuids arch/ppc64/kernel/smp.c --- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64_fix_hcall_cpuids 2004-07-30 10:16:57.000000000 -0500 +++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c 2004-07-30 16:44:39.000000000 -0500 @@ -501,11 +501,11 @@ static void __init smp_space_timers(unsi #ifdef CONFIG_PPC_PSERIES void vpa_init(int cpu) { - unsigned long flags; + unsigned long flags, pcpu = get_hard_smp_processor_id(cpu); /* Register the Virtual Processor Area (VPA) */ flags = 1UL << (63 - 18); - register_vpa(flags, cpu, __pa((unsigned long)&(paca[cpu].lppaca))); + register_vpa(flags, pcpu, __pa((unsigned long)&(paca[cpu].lppaca))); } static inline void smp_xics_do_message(int cpu, int msg) diff -puN arch/ppc64/lib/locks.c~ppc64_fix_hcall_cpuids arch/ppc64/lib/locks.c --- 2.6.8-rc2-mm1/arch/ppc64/lib/locks.c~ppc64_fix_hcall_cpuids 2004-07-30 10:16:57.000000000 -0500 +++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/lib/locks.c 2004-07-30 10:16:57.000000000 -0500 @@ -63,7 +63,8 @@ void __spin_yield(spinlock_t *lock) HvCall2(HvCallBaseYieldProcessor, HvCall_YieldToProc, ((u64)holder_cpu << 32) | yield_count); #else - plpar_hcall_norets(H_CONFER, holder_cpu, yield_count); + plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(holder_cpu), + yield_count); #endif } @@ -179,7 +180,8 @@ void __rw_yield(rwlock_t *rw) HvCall2(HvCallBaseYieldProcessor, HvCall_YieldToProc, ((u64)holder_cpu << 32) | yield_count); #else - plpar_hcall_norets(H_CONFER, holder_cpu, yield_count); + plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(holder_cpu), + yield_count); #endif } _ ** Sent via the linuxppc64-dev mail list. 
See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Sat Jul 31 07:45:56 2004 From: nathanl at austin.ibm.com (nathanl at austin.ibm.com) Date: Fri, 30 Jul 2004 16:45:56 -0500 Subject: [patch 2/4] Use cpu_present_map in ppc64 Message-ID: <200407302145.i6ULjHc0064152@austin.ibm.com> Adopt the "standard" cpu_present_map for describing cpus which are present in the system, but not necessarily online. cpu_present_map is meant to be a superset of cpu_online_map and a subset of cpu_possible_map. Signed-off-by: Nathan Lynch --- diff -puN arch/ppc64/kernel/prom.c~ppc64-add-cpu_present_map arch/ppc64/kernel/prom.c --- 2.6.8-rc2-mm1/arch/ppc64/kernel/prom.c~ppc64-add-cpu_present_map 2004-07-30 16:29:29.000000000 -0500 +++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/prom.c 2004-07-30 16:29:29.000000000 -0500 @@ -943,6 +943,7 @@ static void __init prom_hold_cpus(unsign cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_possible_map)); cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); if (reg == 0) cpu_set(cpuid, RELOC(cpu_online_map)); #endif /* CONFIG_SMP */ @@ -1045,6 +1046,7 @@ static void __init prom_hold_cpus(unsign cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_possible_map)); cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); #endif } else { prom_printf("... 
failed: %x\n", *acknowledge); @@ -1057,6 +1059,7 @@ static void __init prom_hold_cpus(unsign cpu_set(cpuid, RELOC(cpu_possible_map)); cpu_set(cpuid, RELOC(cpu_online_map)); cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); } #endif next: @@ -1072,6 +1075,7 @@ next: if (_naca->smt_state) { cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_present_at_boot)); + cpu_set(cpuid, RELOC(cpu_present_map)); prom_printf("available\n"); } else { prom_printf("not available\n"); @@ -1103,6 +1107,7 @@ next: } /* cpu_set(i+1, cpu_online_map); */ cpu_set(i+1, RELOC(cpu_possible_map)); + cpu_set(i+1, RELOC(cpu_present_map)); } _systemcfg->processorCount *= 2; } else { diff -puN arch/ppc64/kernel/smp.c~ppc64-add-cpu_present_map arch/ppc64/kernel/smp.c --- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64-add-cpu_present_map 2004-07-30 16:29:29.000000000 -0500 +++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c 2004-07-30 16:29:29.000000000 -0500 @@ -130,6 +130,7 @@ static int smp_iSeries_numProcs(void) cpu_set(i, cpu_available_map); cpu_set(i, cpu_possible_map); cpu_set(i, cpu_present_at_boot); + cpu_set(i, cpu_present_map); ++np; } } diff -puN arch/ppc64/kernel/setup.c~ppc64-add-cpu_present_map arch/ppc64/kernel/setup.c _ ** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/ From nathanl at austin.ibm.com Sat Jul 31 07:46:04 2004 From: nathanl at austin.ibm.com (nathanl at austin.ibm.com) Date: Fri, 30 Jul 2004 16:46:04 -0500 Subject: [patch 3/4] Rework secondary SMT thread setup at boot Message-ID: <200407302145.i6ULjNc0063654@austin.ibm.com> Our (ab)use of cpu_possible_map in setup_system to start secondary SMT threads bothers me. Mark such threads in cpu_possible_map during early boot; let RTAS tell us which present cpus are still offline later so we can start them. I'm not totally sure about this one, it might be better to set up cpu_sibling_map in prom_hold_cpus and use that in setup_system. 
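
A rough user-space model of the approach described above: walk the present cpus, translate each logical cpu number to the platform's number, ask whether that cpu is still stopped, and start it if so. Everything prefixed toy_ or TOY_ is hypothetical; the only detail taken from the patch is that a query result of 0 means the processor is in the stopped state. In this toy any nonzero result simply means "not stopped".

```c
#include <assert.h>

#define TOY_NR_CPUS     8
#define TOY_MAX_HARD_ID 16

/* Hypothetical model of the platform as seen during boot. */
struct toy_platform {
    unsigned long present_mask;          /* bit i set: logical cpu i is present */
    unsigned int hard_id[TOY_NR_CPUS];   /* logical number -> platform number */
    int stopped[TOY_MAX_HARD_ID];        /* indexed by platform number, 1 = stopped */
};

/* Returns 0 when the cpu with platform number hw is stopped, 1 otherwise. */
static int toy_query_stopped(const struct toy_platform *p, unsigned int hw)
{
    assert(hw < TOY_MAX_HARD_ID);
    return p->stopped[hw] ? 0 : 1;
}

/*
 * Start every present cpu that is still stopped and return a mask of
 * the logical cpus that were started.
 */
static unsigned long toy_start_stopped(struct toy_platform *p)
{
    unsigned long started = 0;

    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        if (!(p->present_mask & (1UL << cpu)))
            continue;

        /* Firmware is addressed by platform number, not logical number. */
        unsigned int hw = p->hard_id[cpu];

        if (toy_query_stopped(p, hw) == 0) {
            p->stopped[hw] = 0;   /* stand-in for asking firmware to start it */
            started |= 1UL << cpu;
        }
    }
    return started;
}
```

Keying the firmware-facing state by platform number also mirrors the point of patch 1/4: the two numberings often coincide, but nothing guarantees it.
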
Signed-off-by: Nathan Lynch --- diff -puN arch/ppc64/kernel/setup.c~ppc64-fix-secondary-smt-thread-setup arch/ppc64/kernel/setup.c --- 2.6.8-rc2-mm1/arch/ppc64/kernel/setup.c~ppc64-fix-secondary-smt-thread-setup 2004-07-30 16:32:16.000000000 -0500 +++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/setup.c 2004-07-30 16:32:16.000000000 -0500 @@ -232,16 +232,17 @@ void setup_system(unsigned long r3, unsi chrp_init(r3, r4, r5, r6, r7); #ifdef CONFIG_SMP - /* Start secondary threads on SMT systems */ - for (i = 0; i < NR_CPUS; i++) { - if (cpu_available(i) && !cpu_possible(i)) { + /* Start secondary threads on SMT systems; primary threads + * are already in the running state. + */ + for_each_present_cpu(i) { + if (query_cpu_stopped + (get_hard_smp_processor_id(i)) == 0) { printk("%16.16x : starting thread\n", i); rtas_call(rtas_token("start-cpu"), 3, 1, &ret, get_hard_smp_processor_id(i), (u32)*((unsigned long *)pseries_secondary_smp_init), i); - cpu_set(i, cpu_possible_map); - systemcfg->processorCount++; } } #endif /* CONFIG_SMP */ diff -puN arch/ppc64/kernel/prom.c~ppc64-fix-secondary-smt-thread-setup arch/ppc64/kernel/prom.c --- 2.6.8-rc2-mm1/arch/ppc64/kernel/prom.c~ppc64-fix-secondary-smt-thread-setup 2004-07-30 16:32:16.000000000 -0500 +++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/prom.c 2004-07-30 16:32:16.000000000 -0500 @@ -1076,6 +1076,8 @@ next: cpu_set(cpuid, RELOC(cpu_available_map)); cpu_set(cpuid, RELOC(cpu_present_at_boot)); cpu_set(cpuid, RELOC(cpu_present_map)); + cpu_set(cpuid, RELOC(cpu_possible_map)); + _systemcfg->processorCount++; prom_printf("available\n"); } else { prom_printf("not available\n"); diff -puN arch/ppc64/kernel/smp.c~ppc64-fix-secondary-smt-thread-setup arch/ppc64/kernel/smp.c --- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64-fix-secondary-smt-thread-setup 2004-07-30 16:32:16.000000000 -0500 +++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c 2004-07-30 16:32:16.000000000 -0500 @@ -228,7 +228,6 @@ static void __devinit smp_openpic_setup_ 
 	do_openpic_setup_cpu();
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
 /* Get state of physical CPU.
  * Return codes:
  * 0 - The processor is in the RTAS stopped state
@@ -237,7 +236,7 @@ static void __devinit smp_openpic_setup_
  * -1 - Hardware Error
  * -2 - Hardware Busy, Try again later.
  */
-static int query_cpu_stopped(unsigned int pcpu)
+int query_cpu_stopped(unsigned int pcpu)
 {
 	int cpu_status;
 	int status, qcss_tok;
@@ -254,6 +253,8 @@ static int query_cpu_stopped(unsigned in
 	return cpu_status;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
+
 int __cpu_disable(void)
 {
 	/* FIXME: go put this in a header somewhere */
diff -puN include/asm-ppc64/smp.h~ppc64-fix-secondary-smt-thread-setup include/asm-ppc64/smp.h
--- 2.6.8-rc2-mm1/include/asm-ppc64/smp.h~ppc64-fix-secondary-smt-thread-setup 2004-07-30 16:32:16.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/include/asm-ppc64/smp.h 2004-07-30 16:32:16.000000000 -0500
@@ -73,6 +73,7 @@ void smp_init_pSeries(void);
 extern int __cpu_disable(void);
 extern void __cpu_die(unsigned int cpu);
 extern void cpu_die(void) __attribute__((noreturn));
+extern int query_cpu_stopped(unsigned int pcpu);
 #ifdef CONFIG_SCHED_SMT
 extern cpumask_t cpu_sibling_map[NR_CPUS];
 #endif
_

From nathanl at austin.ibm.com Sat Jul 31 07:46:11 2004
From: nathanl at austin.ibm.com (nathanl at austin.ibm.com)
Date: Fri, 30 Jul 2004 16:46:11 -0500
Subject: [patch 4/4] Remove unnecessary cpu maps (available, present_at_boot)
Message-ID: <200407302145.i6ULjXc0055970@austin.ibm.com>

With cpu_present_map, we don't need these any longer.

Signed-off-by: Nathan Lynch

---

diff -puN include/asm-ppc64/smp.h~ppc64-remove-unnecessary-cpu-maps include/asm-ppc64/smp.h
--- 2.6.8-rc2-mm1/include/asm-ppc64/smp.h~ppc64-remove-unnecessary-cpu-maps 2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/include/asm-ppc64/smp.h 2004-07-30 16:38:12.000000000 -0500
@@ -36,23 +36,6 @@ extern void smp_message_recv(int, struct
 #define smp_processor_id() (get_paca()->paca_index)
 #define hard_smp_processor_id() (get_paca()->hw_cpu_id)
 
-/*
- * Retrieve the state of a CPU:
- * online: CPU is in a normal run state
- * possible: CPU is a candidate to be made online
- * available: CPU is candidate for the 'possible' pool
- * Used to get SMT threads started at boot time.
- * present_at_boot: CPU was available at boot time. Used in DLPAR
- * code to handle special cases for processor start up.
- */
-extern cpumask_t cpu_present_at_boot;
-extern cpumask_t cpu_online_map;
-extern cpumask_t cpu_possible_map;
-extern cpumask_t cpu_available_map;
-
-#define cpu_present_at_boot(cpu) cpu_isset(cpu, cpu_present_at_boot)
-#define cpu_available(cpu) cpu_isset(cpu, cpu_available_map)
-
 /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers.
  *
  * Make sure this matches openpic_request_IPIs in open_pic.c, or what shows up
diff -puN arch/ppc64/kernel/prom.c~ppc64-remove-unnecessary-cpu-maps arch/ppc64/kernel/prom.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/prom.c~ppc64-remove-unnecessary-cpu-maps 2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/prom.c 2004-07-30 16:38:12.000000000 -0500
@@ -940,9 +940,7 @@ static void __init prom_hold_cpus(unsign
 		lpaca[cpuid].hw_cpu_id = reg;
 
 #ifdef CONFIG_SMP
-		cpu_set(cpuid, RELOC(cpu_available_map));
 		cpu_set(cpuid, RELOC(cpu_possible_map));
-		cpu_set(cpuid, RELOC(cpu_present_at_boot));
 		cpu_set(cpuid, RELOC(cpu_present_map));
 		if (reg == 0)
 			cpu_set(cpuid, RELOC(cpu_online_map));
@@ -1043,9 +1041,7 @@ static void __init prom_hold_cpus(unsign
 #ifdef CONFIG_SMP
 			/* Set the number of active processors. */
 			_systemcfg->processorCount++;
-			cpu_set(cpuid, RELOC(cpu_available_map));
 			cpu_set(cpuid, RELOC(cpu_possible_map));
-			cpu_set(cpuid, RELOC(cpu_present_at_boot));
 			cpu_set(cpuid, RELOC(cpu_present_map));
 #endif
 		} else {
@@ -1055,10 +1051,8 @@ static void __init prom_hold_cpus(unsign
 #ifdef CONFIG_SMP
 		else {
 			prom_printf("%x : booting cpu %s\n", cpuid, path);
-			cpu_set(cpuid, RELOC(cpu_available_map));
 			cpu_set(cpuid, RELOC(cpu_possible_map));
 			cpu_set(cpuid, RELOC(cpu_online_map));
-			cpu_set(cpuid, RELOC(cpu_present_at_boot));
 			cpu_set(cpuid, RELOC(cpu_present_map));
 		}
 #endif
@@ -1073,8 +1067,6 @@ next:
 			prom_printf("%x : preparing thread ... ",
 				    interrupt_server[i]);
 			if (_naca->smt_state) {
-				cpu_set(cpuid, RELOC(cpu_available_map));
-				cpu_set(cpuid, RELOC(cpu_present_at_boot));
 				cpu_set(cpuid, RELOC(cpu_present_map));
 				cpu_set(cpuid, RELOC(cpu_possible_map));
 				_systemcfg->processorCount++;
diff -puN arch/ppc64/kernel/smp.c~ppc64-remove-unnecessary-cpu-maps arch/ppc64/kernel/smp.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c~ppc64-remove-unnecessary-cpu-maps 2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/smp.c 2004-07-30 16:38:12.000000000 -0500
@@ -62,8 +62,6 @@ unsigned long cache_decay_ticks;
 
 cpumask_t cpu_possible_map = CPU_MASK_NONE;
 cpumask_t cpu_online_map = CPU_MASK_NONE;
-cpumask_t cpu_available_map = CPU_MASK_NONE;
-cpumask_t cpu_present_at_boot = CPU_MASK_NONE;
 
 EXPORT_SYMBOL(cpu_online_map);
 EXPORT_SYMBOL(cpu_possible_map);
@@ -127,9 +125,7 @@ static int smp_iSeries_numProcs(void)
 	np = 0;
 	for (i=0; i < NR_CPUS; ++i) {
 		if (paca[i].lppaca.xDynProcStatus < 2) {
-			cpu_set(i, cpu_available_map);
 			cpu_set(i, cpu_possible_map);
-			cpu_set(i, cpu_present_at_boot);
 			cpu_set(i, cpu_present_map);
 			++np;
 		}
@@ -890,7 +886,7 @@ int __devinit __cpu_up(unsigned int cpu)
 	int c;
 
 	/* At boot, don't bother with non-present cpus -JSCHOPP */
-	if (system_state == SYSTEM_BOOTING && !cpu_present_at_boot(cpu))
+	if (system_state == SYSTEM_BOOTING && !cpu_present(cpu))
 		return -ENOENT;
 
 	paca[cpu].prof_counter = 1;
diff -puN arch/ppc64/kernel/xics.c~ppc64-remove-unnecessary-cpu-maps arch/ppc64/kernel/xics.c
--- 2.6.8-rc2-mm1/arch/ppc64/kernel/xics.c~ppc64-remove-unnecessary-cpu-maps 2004-07-30 16:38:12.000000000 -0500
+++ 2.6.8-rc2-mm1-nathanl/arch/ppc64/kernel/xics.c 2004-07-30 16:38:12.000000000 -0500
@@ -548,7 +548,7 @@ nextnode:
 #ifdef CONFIG_SMP
 	for_each_cpu(i) {
 		/* FIXME: Do this dynamically! --RR */
-		if (!cpu_present_at_boot(i))
+		if (!cpu_present(i))
 			continue;
 		xics_per_cpu[i] = __ioremap((ulong)inodes[get_hard_smp_processor_id(i)].addr,
 			(ulong)inodes[get_hard_smp_processor_id(i)].size,
_

From nathanl at austin.ibm.com Sat Jul 31 11:31:07 2004
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Fri, 30 Jul 2004 20:31:07 -0500
Subject: [patch 3/4] Rework secondary SMT thread setup at boot
In-Reply-To: <200407302145.i6ULjNc0063654@austin.ibm.com>
References: <200407302145.i6ULjNc0063654@austin.ibm.com>
Message-ID: <1091237398.31118.4.camel@biclops.private.network>

On Fri, 2004-07-30 at 16:46, nathanl at austin.ibm.com wrote:
> Our (ab)use of cpu_possible_map in setup_system to start secondary SMT
> threads bothers me. Mark such threads in cpu_possible_map during
> early boot; let RTAS tell us which present cpus are still offline
> later so we can start them.
>
> I'm not totally sure about this one, it might be better to set up
> cpu_sibling_map in prom_hold_cpus and use that in setup_system.

To clarify this last comment: is it safe to assume we have the
query-cpu-stopped-state RTAS call (which is used by query_cpu_stopped())
available on all SMT systems? I suppose if we're relying on the
platform having start-cpu it's ok to assume it has
query-cpu-stopped-state... or not?

Nathan
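
For readers following the thread, the loop that patch 3/4 puts into
setup_system can be pictured with a small user-space mock. This is only
a sketch under stated assumptions: present_mask, mock_stopped,
mock_query_cpu_stopped() and mock_start_cpu() are made-up stand-ins for
cpu_present_map, firmware state, query_cpu_stopped() and the start-cpu
RTAS call; none of them are kernel interfaces. The only convention taken
from the patch is that a return value of 0 means "in the RTAS stopped
state"; the nonzero value used here for a running thread is arbitrary.

```c
#include <stdio.h>

#define NR_CPUS 8	/* hypothetical small value for the demo */

/* Hypothetical stand-in for cpu_present_map: one bit per logical cpu. */
static unsigned long present_mask;

/* Mock firmware state: nonzero if the thread sits in the stopped state. */
static int mock_stopped[NR_CPUS];

/* Mock of query_cpu_stopped(): 0 means stopped, as in the patch comment.
 * The value 1 for "not stopped" is an arbitrary choice for this sketch. */
static int mock_query_cpu_stopped(unsigned int cpu)
{
	return mock_stopped[cpu] ? 0 : 1;
}

/* Mock of the start-cpu call: the thread leaves the stopped state. */
static void mock_start_cpu(unsigned int cpu)
{
	mock_stopped[cpu] = 0;
}

/* Walk the present cpus and start only those firmware reports as stopped.
 * Returns how many threads were started. */
static int start_stopped_threads(void)
{
	int started = 0;
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(present_mask & (1UL << cpu)))
			continue;	/* not present: never touched */
		if (mock_query_cpu_stopped(cpu) != 0)
			continue;	/* already running, or query failed */
		printf("%16.16x : starting thread\n", cpu);
		mock_start_cpu(cpu);
		started++;
	}
	return started;
}
```

The point of the sketch is that no separate "available" mask is needed:
the present mask bounds the walk, and the firmware query decides which
threads still have to be started. A thread that is stopped but not
marked present is left alone, and running the loop a second time starts
nothing.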