From sfr at canb.auug.org.au  Mon Dec  1 13:42:11 2003
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Mon, 1 Dec 2003 13:42:11 +1100
Subject: Remove flight recorder
Message-ID: <20031201134211.0db3853a.sfr@canb.auug.org.au>


Hi all,

During the porting of the 2.6 kernel to iSeries (or the iSeries kernel
to 2.6 :-)) I have been #defining out the flight recorder stuff as I
had no idea what it was for.  Anton suggested that it may have been a
useful bringup tool, but that we don't have the necessary tools to use
it.

Does anyone mind if I just excise the flight recorder code (from iSeries
in particular)?

Please don't bite my head off, I am just asking. :-)
--
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Mon Dec  1 16:15:05 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 01 Dec 2003 16:15:05 +1100
Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext &
	signal32 rework]
Message-ID: <1070255704.658.108.camel@gaston>

(Re-post with the patch bzip2'ed...)

Hi !

It wasn't very easy to split this patch, so here it is in one piece
for now. It does a few things:

 - Adds basic Altivec support to the context switching code
 - Adds Altivec support to the 64 bits sigcontext
 - Rewrite part of the signal32 compat code based on the new
   ppc32 implementation, with Altivec support in the contexts,
   factors out some sigset flipping code, and fixes a long standing
   bug where an 32 bits RT context would have an incorrectly flipped
   sigset (a 64 bits one instead of a 32 bits one).
 - Adds sys_swapcontext syscall (and sys32_swapcontext) for kernel
   based implementation of {set,get,swap}_context with Altivec
   support

So far, it appears to work fine (when run as part of my G5 kernel which
contains a bunch of other changes though). It would need some more testing
hopefully.

What is needed now is a glibc implementation of the ucontext calls for
both 32 and 64 bits that makes use of the new sys_swapcontext, at least
with 2.6. I may give it a try, but I'm sure somebody more familiar with
glibc than I am would get this done much much more quickly... So if you
are that person, please speak up ;)

Regarding the details of the Altivec stuff: On a ppc64 signal frame, a
kernel that supports altivec (AT_HWCAP) will always fill properly the
pointer to the altivec context. In there, VRSAVE is always set, regardless
of the usage of altivec done by the process. MSR:VEC in the regs context
will be set is the other altivec registers (vr0..31 and vscr) have valid
values in the context.

It is important to split vrsave from the rest of the context as a process
may set vrsave prior to doing its first vector operation, and get preempted
with a signal in between those, thus having a valid vrsave context that
needs to be saved & restored without having taken its first altivec
exception yet, thus not having an altivec context to save yet.

I'm not sure what's the best way to deal with the availability of vrsave
on ppc32 contexts, it's a bit more nasty here. (kernel version ?). Some
ppc32 kernels will report supporting altivec via AT_HWCAP without actually
implementing altivec sig/u context stuff. We may need to based ourself on
some kernel versioning here, maybe consider 2.4.23 as the minimum version
to rely on kernel sigcontext containing proper vrsave for ppc32 ?

Ben.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ppc64-altivec_and_sig.diff.bz2
Type: application/x-bzip
Size: 17700 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031201/2e38bba6/attachment.bin 

From boutcher at us.ibm.com  Tue Dec  2 00:09:46 2003
From: boutcher at us.ibm.com (David Boutcher)
Date: Mon, 1 Dec 2003 07:09:46 -0600
Subject: Remove flight recorder
In-Reply-To: <20031201134211.0db3853a.sfr@canb.auug.org.au>
Message-ID: <OF3081EF4A.A91CCCC2-ON86256DEF.0047DA67-86256DEF.0048251E@rchland.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 11/30/2003 08:42:11 PM:
> Hi all,
>
> During the porting of the 2.6 kernel to iSeries (or the iSeries kernel
> to 2.6 :-)) I have been #defining out the flight recorder stuff as I
> had no idea what it was for.  Anton suggested that it may have been a
> useful bringup tool, but that we don't have the necessary tools to use
> it.
>
> Does anyone mind if I just excise the flight recorder code (from iSeries
> in particular)?

Hi Stephen,

Which flight recorder stuff?  Do you mean the HvCall_WriteLogBuffer stuff?
In that case, it probably should be left in.  Those calls write the console
output to a hypervisor buffer that can be retreived later through a couple
of different mechanisms.  One of the key uses for that is if you don't have
a console connected when your linux crashes, you can come in later and dump
the last console output.  It is also handy when linux crashes, because
frequently not all output makes it out through the tortuous console
connection, and you can dump the tail end of any kernel output.

To dump the hypervisor log buffer, type ctrl-x ctrl-x on the console
screen, or there is a path through the green-screen SST screens if you
really want to use it.

If I just answered the wrong question, remind me which flight recorder you
are pulling out?

Thanks,

Dave Boutcher
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From sfr at canb.auug.org.au  Tue Dec  2 01:23:02 2003
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Tue, 2 Dec 2003 01:23:02 +1100
Subject: Remove flight recorder
In-Reply-To: <OF3081EF4A.A91CCCC2-ON86256DEF.0047DA67-86256DEF.0048251E@rchland.ibm.com>
References: <20031201134211.0db3853a.sfr@canb.auug.org.au>
	<OF3081EF4A.A91CCCC2-ON86256DEF.0047DA67-86256DEF.0048251E@rchland.ibm.com>
Message-ID: <20031202012302.6ad03b7f.sfr@canb.auug.org.au>


On Mon, 1 Dec 2003 07:09:46 -0600 "David Boutcher" <boutcher at us.ibm.com> wrote:
>
> If I just answered the wrong question, remind me which flight recorder you
> are pulling out?

Wrong question :-) sorry.  In 2.4 there is arch/ppc64/flight_recorder.c
which allows you to log to a buffer that is accessible through the proc
file system.  It doesn't exist in the 2.6 kernel (so I had to ifdef out
the places in the 2.4 iSeries code that I am forward porting) so
presumably the pSeries guys won't miss it :-)

I know about the Hypervisor log and have used it quite a lot so far.

--
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From boutcher at us.ibm.com  Tue Dec  2 01:45:03 2003
From: boutcher at us.ibm.com (David Boutcher)
Date: Mon, 1 Dec 2003 08:45:03 -0600
Subject: Remove flight recorder
In-Reply-To: <20031202012302.6ad03b7f.sfr@canb.auug.org.au>
Message-ID: <OF4E20E101.9F854C58-ON86256DEF.0050E303-86256DEF.0050DE61@rchland.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/01/2003 08:23:02 AM:
> On Mon, 1 Dec 2003 07:09:46 -0600 "David Boutcher" <boutcher at us.ibm.
> com> wrote:

> Wrong question :-) sorry.  In 2.4 there is arch/ppc64/flight_recorder.c
> which allows you to log to a buffer that is accessible through the proc
> file system.  It doesn't exist in the 2.6 kernel (so I had to ifdef out
> the places in the 2.4 iSeries code that I am forward porting) so
> presumably the pSeries guys won't miss it :-)

Oh THAT flight recorder.  Ya, I don't think anyone is using that flight
recorder.  Blow it away.

Dave Boutcher
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Tue Dec  2 08:09:40 2003
From: anton at samba.org (Anton Blanchard)
Date: Tue, 2 Dec 2003 08:09:40 +1100
Subject: [PATCH] nvram buffering/error logging port to 2.6
In-Reply-To: <1068210343.21219.17.camel@tin.ibm.com>
References: <F29B30D4-0FC5-11D8-9B6E-000A95A0560C@us.ibm.com> <1068210343.21219.17.camel@tin.ibm.com>
Message-ID: <20031201210940.GD22620@krispykreme>


Hi,

> This is a port of the nvram buffering/error logging code from 2.4 to
> 2.6.  I should also note that I included moving /proc/rtas to
> /proc/ppc64/rtas.

Taking a step back, why do we need to buffer error log entries in NVRAM?

My thoughts when I added the original event-scan userspace interface were:

1. Machine boots, we execute event-scans but dont request error log
entries.
2. When the rtas proc file is opened we then start requesting error log
information.

Is this less reliable? Well we already have a window between where we do
the event scan and when we write the information to NVRAM. Im guessing
writing NVRAM isnt fast, we could easily lose or get corrupted event
scan data if the machine locked up in this window.

NVRAM is a limited resource, how do we avoid overflowing it during boot?
Could we lose error log information if we end up with a bunch of event-scan
error logs?

The real way to fix this window is to have a better interface to the error
log information (ie a read error log RTAS call and a discard error log
RTAS call, you call discard error log once you have successfully
committed the error log to disk).

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From moilanen at austin.ibm.com  Tue Dec  2 10:19:06 2003
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Mon, 01 Dec 2003 17:19:06 -0600
Subject: [PATCH] nvram buffering/error logging port to 2.6
In-Reply-To: <20031201210940.GD22620@krispykreme>
References: <F29B30D4-0FC5-11D8-9B6E-000A95A0560C@us.ibm.com>
	 <1068210343.21219.17.camel@tin.ibm.com>
	 <20031201210940.GD22620@krispykreme>
Message-ID: <1070320746.1129.676.camel@tin.ibm.com	>


> 1. Machine boots, we execute event-scans but dont request error log
> entries.
> 2. When the rtas proc file is opened we then start requesting error log
> information.
>
> Is this less reliable?

In a little more detail:
1.) On boot, what was in NVRAM is store into memory.
2.) Event-scans wills start pulling error logs and if there is an
error-log entry from rtas, that data overwrites what was in NVRAM.
3.) rtas_errd will pull from /proc/ppc64/rtas/error_log
4.) When the data is stored on disk, rtas_errd will go and read from
error_log again and this signals that it is safe to clear NVRAM of the
event that was stored.

So it is possible to lose the event stored from last boot if on the
current boot the system goes down inbetween the first event-scan (and
the case that there is a new event-log entry) and when the rtas_errd
runs for the first time.

I do not feel that this is a big hole, but this hole could be closed by
not starting event-scans until rtas_errd has started.  This does not
seem smart, as if rtas_errd is not installed on the system we will get a
surveillance timeout.

>  Well we already have a window between where we do
> the event scan and when we write the information to NVRAM. Im guessing
> writing NVRAM isnt fast, we could easily lose or get corrupted event
> scan data if the machine locked up in this window.

There is nothing that can be done about this.

> NVRAM is a limited resource, how do we avoid overflowing it during boot?

The OS is guaranteed 1K of NVRAM per partition.  If for some reason we
do not have the space we should not do the NVRAM buffering of the events
coming in.

> Could we lose error log information if we end up with a bunch of     >
event-scan error logs?

Yes, if we are over 64 error-logs and rtas_errd is not processing them
fast enough it is possible.  The most I have ever seen is 3 come in at
once.  If 64 come in at one time, then there is something severly
broken.

> The real way to fix this window is to have a better interface to the error
> log information (ie a read error log RTAS call and a discard error log
> RTAS call, you call discard error log once you have successfully
> committed the error log to disk).

I'm not clear.  How is this different then what is currently there?  Do
you mean storing every single error-log in NVRAM until it is on disk?
Currently we only store 1 error log because we are only guaranteed that
much space in NVRAM (i.e. could lose that NVRAM space on the next boot
and nullify the buffering of error-logs in NVRAM).  So the last fatal
error-log received is what is stored into NVRAM or if there was no fatal
then just the last error-log received.

Thanks,
Jake

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From moilanen at austin.ibm.com  Wed Dec  3 04:43:07 2003
From: moilanen at austin.ibm.com (Jake Moilanen)
Date: Tue, 02 Dec 2003 11:43:07 -0600
Subject: bogus changes for generic pci_scan_slot() in ameslab-2.5
In-Reply-To: <20031128154151.GA30606@suse.de>
References: <20031128154151.GA30606@suse.de>
Message-ID: <1070385249.1123.1659.camel@tin.ibm.com	>


This looks like the workaround for the pci multifunc problem that is
seen on LPARs.  The linux-2.6-pcibios-scan-all-fns-1.patch I posted in
the JS20 support email fixes the same problem, does not break ppc32, and
should probably used instead.

Thanks,
Jake

On Fri, 2003-11-28 at 09:41, Olaf Hering wrote:
> Good morning,
>
> what is the purpose of this change?
>
>
> diff -purN linux-2.5/drivers/pci/probe.c linuxppc64-2.5/drivers/pci/probe.c
> --- linux-2.5/drivers/pci/probe.c	2003-08-06 15:34:30.000000000 +0000
> +++ linuxppc64-2.5/drivers/pci/probe.c	2003-11-05 22:12:33.000000000 +0000
> @@ -552,6 +552,7 @@ int __devinit pci_scan_slot(struct pci_b
>  		struct pci_dev *dev;
>
>  		dev = pci_scan_device(bus, devfn);
> +#if 0
>  		if (func == 0) {
>  			if (!dev)
>  				break;
> @@ -560,6 +561,10 @@ int __devinit pci_scan_slot(struct pci_b
>  				continue;
>  			dev->multifunction = 1;
>  		}
> +#else
> +		if (!dev)
> +			continue;
> +#endif
>
>  		/* Fix up broken headers */
>  		pci_fixup_device(PCI_FIXUP_HEADER, dev);
>
> It breaks on ppc32, B&W G3, dies in indirect_read_config() because the
> pointer *cfg_data becomes bogus, devfn is > 0xff (no idea if that
> matters).
>
> turning #if 0 into #if 1 cures it. I havent tried it on other systems
> yet, but at least a PReP MTX+ works with the patch above.
>
>
>
> --
> USB is for mice, FireWire is for men!
>
> sUse lINUX ag, n?RNBERG
>

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olh at suse.de  Wed Dec  3 04:51:15 2003
From: olh at suse.de (Olaf Hering)
Date: Tue, 2 Dec 2003 18:51:15 +0100
Subject: bogus changes for generic pci_scan_slot() in ameslab-2.5
In-Reply-To: <1070385249.1123.1659.camel@tin.ibm.com	>
References: <20031128154151.GA30606@suse.de> <1070385249.1123.1659.camel@tin.ibm.com	>
Message-ID: <20031202175115.GA3508@suse.de>


 On Tue, Dec 02, Jake Moilanen wrote:

> This looks like the workaround for the pci multifunc problem that is
> seen on LPARs.  The linux-2.6-pcibios-scan-all-fns-1.patch I posted in
> the JS20 support email fixes the same problem, does not break ppc32, and
> should probably used instead.

I will give it a try on the beige G3, thanks.

--
USB is for mice, FireWire is for men!

sUse lINUX ag, n?RNBERG

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From jimix at watson.ibm.com  Sat Dec  6 02:47:29 2003
From: jimix at watson.ibm.com (Jimi Xenidis)
Date: Fri, 5 Dec 2003 10:47:29 -0500
Subject: alignment and correction bug in glibc
Message-ID: <16336.43153.27131.648204@kitch0.watson.ibm.com>


File linuxthreads/sysdeps/unix/sysv/linux/powerpc/powerpc64/sysdep-cancel.h

performs a ld, cmpdi with 0 on a 32 bit value before every system call
in a threaded app.

diff of proposed fixe below.
-JX


--- sysdep-cancel.h	Tue Jun 17 18:22:57 2003
+++ /tmp/fix.S	Fri Dec  5 10:43:37 2003
@@ -103,8 +103,8 @@
 	.tc __local_multiple_threads[TC],__local_multiple_threads; \
   .previous;              \
   ld    10,.LC__local_multiple_threads at toc(2);				\
-  ld    10,0(10);								\
-  cmpdi 10,0
+  lwz   10,0(10);							\
+  cmpwi 10,0
 # endif

 #elif !defined __ASSEMBLER__

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From aprasad at in.ibm.com  Sun Dec  7 11:58:06 2003
From: aprasad at in.ibm.com (Anil K Prasad)
Date: Sun, 7 Dec 2003 06:28:06 +0530
Subject: [PATCH] ppc64 kdb to print SDR1 along with other SPRs
Message-ID: <OF312A2141.D294045F-ON86256DF5.0003D1E4-86256DF5.00045B3F@in.ibm.com>


Hi,
I am not sure why there is no option to see SDR1 value in kdb. I had
expected it to be under 'superreg' option.
Anyway, I have just added extra printf for SDR1.. and below is patch for
the same.

Thanks,
Anil.

--------------------------------------------------------------------------PATCH

BEGIN---------------------------------------------------------------------------------
--- linux-2.4.21/arch/ppc64/kdb/kdbasupport.c   2003-12-06
16:30:53.000000000 -0800
+++ linux-myfix/arch/ppc64/kdb/kdbasupport.c    2003-12-06
15:13:10.000000000 -0800
@@ -1661,6 +1661,7 @@
            kdb_printf("toc  = %.16lx  dar  = %.16lx\n", toc, get_dar());
            kdb_printf("srr0 = %.16lx  srr1 = %.16lx\n", get_srr0(),
get_srr1());
            kdb_printf("asr  = %.16lx\n", mfasr());
+           kdb_printf("sdr1 = %.16lx\n", mfsdr1());
            for (i = 0; i < 8; ++i)
                  kdb_printf("sr%.2ld = %.16lx  sr%.2ld = %.16lx\n", (long
int)i, (unsigned long)get_sr(i), (long int)(i+8), (long unsigned int)
get_sr(i+8));

--- linux-2.4.21/include/asm-ppc64/processor.h  2003-12-06
16:29:13.000000000 -0800
+++ linux-myfix/include/asm-ppc64/processor.h   2003-12-06
16:28:56.000000000 -0800
@@ -594,6 +594,8 @@

 #define mfasr()        ({unsigned long rval; \
                  asm volatile("mfasr %0" : "=r" (rval)); rval;})
+#define mfsdr1()             ({unsigned long rval; \
+                 asm volatile("mfsdr1 %0" : "=r" (rval)); rval;})

 #ifndef __ASSEMBLY__
 extern int have_of;


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From sjmunroe at us.ibm.com  Mon Dec  8 07:40:55 2003
From: sjmunroe at us.ibm.com (Steve Munroe)
Date: Sun, 7 Dec 2003 14:40:55 -0600
Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext &	signal32
 rework]
Message-ID: <OFD7634017.BB56ADFB-ON86256DF5.00711DC4-86256DF5.00719C31@us.ibm.com>


Ben thanks.

With Tom Gall's I have your kernel running on my G5. This was pull from
your BK about our Friday noon. We had a few glitches and had to deconfigure
NVRAM and pmac_seriel.

The plan is to build glibc with some VMX patches and start testing next
week.

Steven J. Munroe
Power Linux Toolchain Architect
IBM Corporation, Linux Technology Center


                      Benjamin Herrenschmidt
                      <benh at kernel.crashing.org>          To:       linuxppc64-dev at lists.linuxppc.org
                      Sent by:                            cc:
                      owner-linuxppc64-dev at lists.l        Subject:  [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext &       signal32
                      inuxppc.org                          rework]


                      11/30/03 11:15 PM


(Re-post with the patch bzip2'ed...)

Hi !

It wasn't very easy to split this patch, so here it is in one piece
for now. It does a few things:

 - Adds basic Altivec support to the context switching code
 - Adds Altivec support to the 64 bits sigcontext
 - Rewrite part of the signal32 compat code based on the new
   ppc32 implementation, with Altivec support in the contexts,
   factors out some sigset flipping code, and fixes a long standing
   bug where an 32 bits RT context would have an incorrectly flipped
   sigset (a 64 bits one instead of a 32 bits one).
 - Adds sys_swapcontext syscall (and sys32_swapcontext) for kernel
   based implementation of {set,get,swap}_context with Altivec
   support

So far, it appears to work fine (when run as part of my G5 kernel which
contains a bunch of other changes though). It would need some more testing
hopefully.

What is needed now is a glibc implementation of the ucontext calls for
both 32 and 64 bits that makes use of the new sys_swapcontext, at least
with 2.6. I may give it a try, but I'm sure somebody more familiar with
glibc than I am would get this done much much more quickly... So if you
are that person, please speak up ;)

Regarding the details of the Altivec stuff: On a ppc64 signal frame, a
kernel that supports altivec (AT_HWCAP) will always fill properly the
pointer to the altivec context. In there, VRSAVE is always set, regardless
of the usage of altivec done by the process. MSR:VEC in the regs context
will be set is the other altivec registers (vr0..31 and vscr) have valid
values in the context.

It is important to split vrsave from the rest of the context as a process
may set vrsave prior to doing its first vector operation, and get preempted
with a signal in between those, thus having a valid vrsave context that
needs to be saved & restored without having taken its first altivec
exception yet, thus not having an altivec context to save yet.

I'm not sure what's the best way to deal with the availability of vrsave
on ppc32 contexts, it's a bit more nasty here. (kernel version ?). Some
ppc32 kernels will report supporting altivec via AT_HWCAP without actually
implementing altivec sig/u context stuff. We may need to based ourself on
some kernel versioning here, maybe consider 2.4.23 as the minimum version
to rely on kernel sigcontext containing proper vrsave for ppc32 ?

Ben.


#### ppc64-altivec_and_sig.diff.bz2 has been removed from this note on
December 07, 2003 by Steve Munroe


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Mon Dec  8 10:28:19 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 08 Dec 2003 10:28:19 +1100
Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext
	&	signal32 rework]
In-Reply-To: <OFD7634017.BB56ADFB-ON86256DF5.00711DC4-86256DF5.00719C31@us.ibm.com>
References: 
	 <OFD7634017.BB56ADFB-ON86256DF5.00711DC4-86256DF5.00719C31@us.ibm.com>
Message-ID: <1070839698.12501.58.camel@gaston>


On Mon, 2003-12-08 at 07:40, Steve Munroe wrote:
> Ben thanks.
>
> With Tom Gall's I have your kernel running on my G5. This was pull from
> your BK about our Friday noon. We had a few glitches and had to deconfigure
> NVRAM and pmac_seriel.

Ah ? I have both working here. I use pmac_zilog for serial console
using a stealth serial adapter and nvram works fine so far. What kind
of glitches did you have ?

> The plan is to build glibc with some VMX patches and start testing next
> week.

Great !

Note that paulus also noticed that the saved_msr & saved_ee thingy
in the ppc64 signal handling appear to be broken (and makes little
sense in the first place). The plan is to remove it completely. I'll
do that later this week.

Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From johnrose at austin.ibm.com  Tue Dec  9 02:58:00 2003
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 08 Dec 2003 09:58:00 -0600
Subject: [PATCH] 2.6 - OF dynamic update __init funcs
Message-ID: <1070899080.18287.10.camel@verve>


The current of_finish_dynamic_node() calls some prom.c functions that are
marked __init.  Since this function is for use after boot, the functions
should be changed to __devinit.  If there are no comments, I'll push this
to 2.6 shortly.

Thanks-
John

diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c
--- a/arch/ppc64/kernel/prom.c	Sun Dec  7 20:18:16 2003
+++ b/arch/ppc64/kernel/prom.c	Sun Dec  7 20:18:16 2003
@@ -1701,7 +1701,7 @@
 /*
  * Find the interrupt parent of a node.
  */
-static struct device_node * __init
+static struct device_node * __devinit
 intr_parent(struct device_node *p)
 {
 	phandle *parp;
@@ -1716,7 +1716,7 @@
  * Find out the size of each entry of the interrupts property
  * for a node.
  */
-static int __init
+static int __devinit
 prom_n_intr_cells(struct device_node *np)
 {
 	struct device_node *p;
@@ -1744,7 +1744,7 @@
  * Map an interrupt from a device up to the platform interrupt
  * descriptor.
  */
-static int __init
+static int __devinit
 map_interrupt(unsigned int **irq, struct device_node **ictrler,
 	      struct device_node *np, unsigned int *ints, int nintrc)
 {


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From sjmunroe at us.ibm.com  Tue Dec  9 03:29:24 2003
From: sjmunroe at us.ibm.com (Steve Munroe)
Date: Mon, 8 Dec 2003 10:29:24 -0600
Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support, sys_swapcontext	&	signal32
 rework]
Message-ID: <OFB685E26A.1B93D5CE-ON86256DF6.005A2129-86256DF6.005A96EA@us.ibm.com>


Ben Herrenschmidt writes:

>> sjmunroe writes:
>> With Tom Gall's I have your kernel running on my G5. This was pull from
>> your BK about our Friday noon. We had a few glitches and had to
deconfigure
>> NVRAM and pmac_seriel.
>
>Ah ? I have both working here. I use pmac_zilog for serial console
>using a stealth serial adapter and nvram works fine so far. What kind
>of glitches did you have ?

These where compile time fails: I think in one case NVRAM_BYTES? was not
defined. I don't remember the specifics of why pmac_zilog failed.

Steven J. Munroe
Power Linux Toolchain Architect
IBM Corporation, Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olh at suse.de  Tue Dec  9 09:42:30 2003
From: olh at suse.de (Olaf Hering)
Date: Mon, 8 Dec 2003 23:42:30 +0100
Subject: stopping all cpus in one go
Message-ID: <20031208224230.GA22205@suse.de>


Is there a good reason to leave the other cpus running?
could this change deadlock?


diff -p -purNX kernel_exclude.txt orig/linux-2.6.0-test11/arch/ppc64/xmon/xmon.c linux-2.6.0-test11/arch/ppc64/xmon/xmon.c
--- orig/linux-2.6.0-test11/arch/ppc64/xmon/xmon.c      2003-11-26 20:45:27.000000000 +0000
+++ linux-2.6.0-test11/arch/ppc64/xmon/xmon.c   2003-12-08 22:15:43.000000000 +0000
@@ -228,7 +228,7 @@ xmon(struct pt_regs *excp)
 {
        struct pt_regs regs;
        int cmd;
-       unsigned long msr;
+       unsigned long msr, cpu;

        if (excp == NULL) {
                /* Ok, grab regs as they are now.
@@ -300,6 +300,8 @@ xmon(struct pt_regs *excp)
 #endif /* CONFIG_SMP */
        remove_bpts();
        disable_surveillance();
+       cpu = MSG_ALL_BUT_SELF;
+       smp_send_xmon_break(cpu);
        cmd = cmds(excp);
        if (cmd == 's') {
                xmon_trace[smp_processor_id()] = SSTEP;

--
USB is for mice, FireWire is for men!

sUse lINUX ag, n?RNBERG

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Tue Dec  9 10:20:06 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 09 Dec 2003 10:20:06 +1100
Subject: [Fwd: [PATCH] ppc64: 2.6 altivec support,
	sys_swapcontext	&	signal32 rework]
In-Reply-To: <OFB685E26A.1B93D5CE-ON86256DF6.005A2129-86256DF6.005A96EA@us.ibm.com>
References: 
	 <OFB685E26A.1B93D5CE-ON86256DF6.005A2129-86256DF6.005A96EA@us.ibm.com>
Message-ID: <1070925605.11006.138.camel@gaston>


On Tue, 2003-12-09 at 03:29, Steve Munroe wrote:
> Ben Herrenschmidt writes:
>
> >> sjmunroe writes:
> >> With Tom Gall's I have your kernel running on my G5. This was pull from
> >> your BK about our Friday noon. We had a few glitches and had to
> deconfigure
> >> NVRAM and pmac_seriel.
> >
> >Ah ? I have both working here. I use pmac_zilog for serial console
> >using a stealth serial adapter and nvram works fine so far. What kind
> >of glitches did you have ?
>
> These where compile time fails: I think in one case NVRAM_BYTES? was not
> defined. I don't remember the specifics of why pmac_zilog failed.

Weird. I'll check that with Tom.

Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Tue Dec  9 18:56:35 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 09 Dec 2003 18:56:35 +1100
Subject: hash table
Message-ID: <1070956595.11006.183.camel@gaston>


Here's a first shot at my rework of __hash_page, if you want
to have a quick look... It did a few tests but didn't really
stress the box that badly, so there may be bugs in there.

At this point, the goal isn't (yet) perfs, it is to get rid
of the page table lock in hash_page().

(though my simple tests showed an approximate 10% improvement of hash_page duration).

There is still room for optimisation. It would be nice for example
to move the lazy cache flush to asm to avoid the overhead of function
calls & additional stackframe, and I could rewrite the non-HV
verions of the low level ppc_md. functions in asm with some wins
I supposed looking at the C code... There is also room for
optimisation in my asm code (like some bit manipulations
or better scheduling).

I think there is no race with the flush code. The reason is that
the case where flush is called on a present page seem to be strictly
limited to a PP bits update (or an accessed bits update in some
error case, but we can dismiss that one completely I beleive).

Since flush uses pte_update, it will not race with a pending
_hash_page. The only possible race is a __hash_page occuring during
a flush. But in this case, the PTE will have the new PP bits already
so at worst, we exit flush with an entry present... but that has the
new bits. So it's ok. I don't think we can race on the content of
the HPTE neither as we have the HPTE lock bit there. I'd still
like your point of view though.

Ben.


diff -urN linux-g5-ppc64/arch/ppc64/kernel/htab.c linux-g5-htab/arch/ppc64/kernel/htab.c
--- linux-g5-ppc64/arch/ppc64/kernel/htab.c	2003-12-08 20:27:20.084329896 +1100
+++ linux-g5-htab/arch/ppc64/kernel/htab.c	2003-12-09 18:15:10.064315512 +1100
@@ -27,6 +27,7 @@
 #include <linux/sysctl.h>
 #include <linux/ctype.h>
 #include <linux/cache.h>
+#include <linux/init.h>

 #include <asm/ppcdebug.h>
 #include <asm/processor.h>
@@ -129,7 +130,7 @@
 	}
 }

-void
+void __init
 htab_initialize(void)
 {
 	unsigned long table, htab_size_bytes;
@@ -231,6 +232,47 @@
 }

 /*
+ * Called by asm hashtable.S for doing lazy icache flush
+ */
+unsigned int hash_page_do_lazy_icache(unsigned int pp, pte_t pte, int trap)
+{
+	struct page *page;
+
+#define PPC64_HWNOEXEC (1 << 2)
+
+	if (!pfn_valid(pte_pfn(pte)))
+		return pp;
+
+	page = pte_page(pte);
+
+	/* page is dirty */
+	if (!PageReserved(page) && !test_bit(PG_arch_1, &page->flags)) {
+		if (trap == 0x400) {
+			__flush_dcache_icache(page_address(page));
+			set_bit(PG_arch_1, &page->flags);
+		} else
+			pp |= PPC64_HWNOEXEC;
+	}
+	return pp;
+}
+
+/*
+ * Called by asm hashtable.S in case of critical insert failure
+ */
+void htab_insert_failure(void)
+{
+	panic("hash_page: pte_insert failed\n");
+}
+
+/*
+ * Handle a fault by adding an HPTE. If the address can't be determined
+ * to be valid via Linux page tables, return 1. If handled return 0
+ */
+extern int __hash_page(unsigned long ea, unsigned long access, unsigned long vsid,
+		       pte_t *ptep, unsigned long trap, int local);
+
+#if 0
+/*
  * Handle a fault by adding an HPTE. If the address can't be determined
  * to be valid via Linux page tables, return 1. If handled return 0
  */
@@ -380,6 +422,7 @@

 	return 0;
 }
+#endif

 int hash_page(unsigned long ea, unsigned long access, unsigned long trap)
 {
@@ -444,24 +487,20 @@
 	if (pgdir == NULL)
 		return 1;

-	/*
-	 * Lock the Linux page table to prevent mmap and kswapd
-	 * from modifying entries while we search and update
-	 */
-	spin_lock(&mm->page_table_lock);
-
 	tmp = cpumask_of_cpu(smp_processor_id());
 	if (user_region && cpus_equal(mm->cpu_vm_mask, tmp))
 		local = 1;

-	ret = hash_huge_page(mm, access, ea, vsid, local);
-	if (ret < 0) {
+	/* Is this a huge page ? */
+	if (unlikely(in_hugepage_area(mm->context, ea)))
+		ret = hash_huge_page(mm, access, ea, vsid, local);
+	else {
 		ptep = find_linux_pte(pgdir, ea);
+		if (ptep == NULL)
+			return 1;
 		ret = __hash_page(ea, access, vsid, ptep, trap, local);
 	}

-	spin_unlock(&mm->page_table_lock);
-
 #ifdef CONFIG_HTABLE_STATS
 	if (ret == 0) {
 		duration = mftb() - duration;
@@ -519,3 +558,26 @@
 					local);
 	}
 }
+
+static inline void make_bl(unsigned int *insn_addr, void *func)
+{
+	unsigned long funcp = *((unsigned long *)func);
+	int offset = funcp - (unsigned long)insn_addr;
+
+	*insn_addr = (unsigned int)(0x48000001 | (offset & 0x03fffffc));
+	flush_icache_range((unsigned long)insn_addr, 4+
+			   (unsigned long)insn_addr);
+}
+
+void __init htab_finish_init(void)
+{
+	extern unsigned int *htab_call_hpte_insert1;
+	extern unsigned int *htab_call_hpte_insert2;
+	extern unsigned int *htab_call_hpte_remove;
+	extern unsigned int *htab_call_hpte_updatepp;
+
+	make_bl(htab_call_hpte_insert1, ppc_md.hpte_insert);
+	make_bl(htab_call_hpte_insert2, ppc_md.hpte_insert);
+	make_bl(htab_call_hpte_remove, ppc_md.hpte_remove);
+	make_bl(htab_call_hpte_updatepp, ppc_md.hpte_updatepp);
+}
diff -urN linux-g5-ppc64/arch/ppc64/kernel/setup.c linux-g5-htab/arch/ppc64/kernel/setup.c
--- linux-g5-ppc64/arch/ppc64/kernel/setup.c	2003-12-08 20:15:07.922635392 +1100
+++ linux-g5-htab/arch/ppc64/kernel/setup.c	2003-12-09 18:14:21.331723992 +1100
@@ -246,6 +246,10 @@
 		pmac_init(r3, r4, r5, r6, r7);
 	}
 #endif
+	/* Finish initializing the hash table (do the dynamic
+	 * patching for the fast-path hashtable.S code)
+	 */
+	htab_finish_init();

 	printk("Starting Linux PPC64 %s\n", UTS_RELEASE);

diff -urN linux-g5-ppc64/arch/ppc64/mm/Makefile linux-g5-htab/arch/ppc64/mm/Makefile
--- linux-g5-ppc64/arch/ppc64/mm/Makefile	2003-11-19 21:20:09.000000000 +1100
+++ linux-g5-htab/arch/ppc64/mm/Makefile	2003-12-08 17:33:31.452722880 +1100
@@ -4,6 +4,6 @@

 EXTRA_CFLAGS += -mno-minimal-toc

-obj-y := fault.o init.o extable.o imalloc.o
+obj-y := fault.o init.o extable.o imalloc.o hashtable.o
 obj-$(CONFIG_DISCONTIGMEM) += numa.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
diff -urN linux-g5-ppc64/arch/ppc64/mm/hashtable.S linux-g5-htab/arch/ppc64/mm/hashtable.S
--- linux-g5-ppc64/arch/ppc64/mm/hashtable.S	Thu Jan 01 10:00:00 1970
+++ linux-g5-htab/arch/ppc64/mm/hashtable.S	Tue Dec 09 18:54:52 2003
@@ -0,0 +1,289 @@
+/*
+ * ppc64 MMU hashtable management routines
+ *
+ * (c) Copyright IBM Corp. 2003
+ *
+ * Maintained by: Benjamin Herrenschmidt
+ *                <benh at kernel.crashing.org>
+ *
+ * This file is covered by the GNU Public Licence v2 as
+ * described in the kernel's COPYING file.
+ */
+
+#include <linux/config.h>
+#include <asm/processor.h>
+#include <asm/pgtable.h>
+#include <asm/mmu.h>
+#include <asm/page.h>
+#include <asm/types.h>
+#include <asm/ppc_asm.h>
+#include <asm/offsets.h>
+#include <asm/cputable.h>
+
+	.text
+
+/*
+ * Stackframe:
+ *
+ *         +-> Back chain			(SP + 256)
+ *         |   General register save area	(SP + 112)
+ *         |   Parameter save area		(SP + 48)
+ *         |   TOC save area			(SP + 40)
+ *         |   link editor doubleword		(SP + 32)
+ *         |   compiler doubleword		(SP + 24)
+ *         |   LR save area			(SP + 16)
+ *         |   CR save area			(SP + 8)
+ * SP ---> +-- Back chain			(SP + 0)
+ */
+#define STACKFRAMESIZE	256
+
+/* Save parameters offsets */
+#define STK_PARM(i)	(STACKFRAMESIZE + 48 + ((i)-3)*8)
+
+/* Save non-volatile offsets */
+#define STK_REG(i)	(112 + ((i)-14)*8)
+
+/*
+ * _hash_page(unsigned long ea, unsigned long access, unsigned long vsid,
+ *		pte_t *ptep, unsigned long trap, int local)
+ *
+ * Adds a page to the hash table. This is the non-LPAR version for now
+ */
+
+_GLOBAL(__hash_page)
+	mflr	r0
+	std	r0,16(r1)
+	stdu	r1,-STACKFRAMESIZE(r1)
+	/* Save all params that we need after a function call */
+	std	r6,STK_PARM(r6)(r1)
+	std	r8,STK_PARM(r8)(r1)
+
+	/* Add _PAGE_PRESENT to access */
+	ori	r4,r4,_PAGE_PRESENT
+
+	/* Save non-volatile registers.
+	 * r31 will hold "old PTE"
+	 * r30 is "new PTE"
+	 * r29 is "va"
+	 * r28 is a hash value
+	 * r27 is hashtab mask (maybe dynamic patched instead ?)
+	 */
+	std	r27,STK_REG(r27)(r1)
+	std	r28,STK_REG(r28)(r1)
+	std	r29,STK_REG(r29)(r1)
+	std	r30,STK_REG(r30)(r1)
+	std	r31,STK_REG(r31)(r1)
+
+	/* Step 1:
+	 *
+	 * Check permissions, atomically mark the linux PTE busy
+	 * and hashed.
+	 */
+1:
+	ldarx	r31,0,r6
+	/* Check access rights (access & ~(pte_val(*ptep))) */
+	andc.	r0,r4,r31
+	bne-	htab_wrong_access
+	/* Check if PTE is busy */
+	andi.	r0,r31,_PAGE_BUSY
+	bne-	1b
+	/* Prepare new PTE value (turn access RW into DIRTY, then
+	 * add BUSY,HASHPTE and ACCESSED)
+	 */
+	rlwinm	r30,r4,5,24,24	/* _PAGE_RW -> _PAGE_DIRTY */
+	or	r30,r30,r31
+	ori	r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE
+	/* Write the linux PTE atomically (setting busy) */
+	stdcx.	r30,0,r6
+	bne-	1b
+
+
+	/* Step 2:
+	 *
+	 * Insert/Update the HPTE in the hash table. At this point,
+	 * r4 (access) is re-useable, we use it for the new HPTE flags
+	 */
+
+	/* Calc va and put it in r29 */
+	rldicr	r29,r5,28,63-28
+	rldicl	r3,r3,0,36
+	or	r29,r3,r29
+
+	/* Calculate hash value for primary slot and store it in r28 */
+	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
+	rldicl	r0,r3,64-12,48		/* (ea >> 12) & 0xffff */
+	xor	r28,r5,r0
+
+	/* Convert linux PTE bits into HW equivalents. Fix using
+	 * mask inserts instead
+	 */
+	rlwinm	r3,r30,32-1,31,31	/* _PAGE_USER -> PP lsb */
+	rlwinm	r0,r30,32-2,31,31	/* _PAGE_RW -> r0 lsb */
+	rlwinm	r4,r30,32-7,31,31	/* _PAGE_DIRTY -> r4 lsb */
+	and	r0,r0,r4		/* _PAGE_RW & _PAGE_DIRTY -> r0 lsb */
+	andc	r3,r3,r0		/* PP lsb &=  ~(PAGE_RW & _PAGE_DIRTY) */
+	andi.	r4,r30,_PAGE_USER	/* _PAGE_USER -> r4 msb */
+	or	r3,r3,r4		/* PP msb = r4 msb */
+	andi.	r0,r30,0x1f8		/* Add in other flags */
+	or	r3,r3,r0
+
+	/* We eventually do the icache sync here (maybe inline that
+	 * code rather than call a C function...
+	 */
+BEGIN_FTR_SECTION
+	mr	r4,r30
+	mr	r5,r7
+	bl	.hash_page_do_lazy_icache
+END_FTR_SECTION_IFSET(CPU_FTR_NOEXECUTE)
+
+	/* At this point, r3 contains new PP bits, save them in
+	 * place of "access" in the param area (sic)
+	 */
+	std	r3,STK_PARM(r4)(r1)
+
+	/* Get htab_hash_mask */
+	ld	r4,htab_data at got(2)
+	ld	r27,16(r4)	/* htab_data.htab_hash_mask -> r27 */
+
+	/* Check if we may already be in the hashtable, in this case, we
+	 * go to out-of-line code to try to modify the HPTE
+	 */
+	andi.	r0,r31,_PAGE_HASHPTE
+	bne	htab_modify_pte
+
+htab_insert_pte:
+	/* Clear hpte bits in new pte (we also clear BUSY btw) and
+	 * add _PAGE_HASHPTE
+	 */
+	lis	r0,_PAGE_HPTEFLAGS at h
+	ori	r0,r0,_PAGE_HPTEFLAGS at l
+	andc	r30,r30,r0
+	ori	r30,r30,_PAGE_HASHPTE
+
+1:
+	/* page number in r5 */
+	rldicl	r5,r31,64-PTE_SHIFT,PTE_SHIFT
+
+	/* Calculate primary group hash */
+	and	r0,r28,r27
+	rldicr	r3,r0,3,63-3	/* r0 = (hash & mask) << 3 */
+
+	/* Call ppc_md.hpte_insert */
+	ld	r7,STK_PARM(r4)(r1)	/* Retreive new pp bits */
+	mr	r4,r29			/* Retreive va */
+	li	r6,0			/* primary slot *
+	li	r8,0			/* not bolted and not large */
+	li	r9,0
+_GLOBAL(htab_call_hpte_insert1)
+	bl	.			/* Will be patched by htab_finish_init() */
+	cmpi	0,r3,0
+	bge	htab_pte_insert_ok	/* Insertion successful */
+	cmpi	0,r3,-2			/* Critical failure */
+	beq-	htab_pte_insert_failure
+
+	/* Now try secondary slot */
+	ori	r30,r30,_PAGE_SECONDARY
+
+	/* page number in r5 */
+	rldicl	r5,r31,64-PTE_SHIFT,PTE_SHIFT
+
+	/* Calculate secondary group hash */
+	not	r3,r28
+	and	r0,r3,r27
+	rldicr	r3,r0,3,63-3	/* r0 = (~hash & mask) << 3 */
+
+	/* Call ppc_md.hpte_insert */
+	ld	r7,STK_PARM(r4)(r1)	/* Retreive new pp bits */
+	mr	r4,r29			/* Retreive va */
+	li	r6,1			/* secondary slot *
+	li	r8,0			/* not bolted and not large */
+	li	r9,0
+_GLOBAL(htab_call_hpte_insert2)
+	bl	.			/* Will be patched by htab_finish_init() */
+	cmpi	0,r3,0
+	bge+	htab_pte_insert_ok	/* Insertion successful */
+	cmpi	0,r3,-2			/* Critical failure */
+	beq-	htab_pte_insert_failure
+
+	/* Both are full, we need to evict something */
+	mftb	r0
+	/* Pick a random group based on TB */
+	andi.	r0,r0,1
+	mr	r5,r28
+	bne	2f
+	not	r5,r5
+2:	and	r0,r5,r27
+	rldicr	r3,r0,3,63-3	/* r0 = (hash & mask) << 3 */
+	/* Call ppc_md.hpte_remove */
+_GLOBAL(htab_call_hpte_remove)
+	bl	.			/* Will be patched by htab_finish_init() */
+
+	/* Try all again */
+	b	1b
+
+htab_pte_insert_ok:
+	/* Insert slot number in PTE */
+	rldimi	r30,r3,12,63-14
+
+	/* Write out the PTE with a normal write
+	 * (maybe add eieio may be good still ?)
+	 */
+htab_write_out_pte:
+	ld	r6,STK_PARM(r6)(r1)
+	std	r30,0(r6)
+	li	r3, 0
+bail:
+	ld	r27,STK_REG(r27)(r1)
+	ld	r28,STK_REG(r28)(r1)
+	ld	r29,STK_REG(r29)(r1)
+	ld      r30,STK_REG(r30)(r1)
+	ld      r31,STK_REG(r31)(r1)
+	addi    r1,r1,STACKFRAMESIZE
+	ld      r0,16(r1)
+	mtlr    r0
+	blr
+
+htab_modify_pte:
+	/* Keep PP bits in r4 and slot idx from the PTE around in r3 */
+	mr	r4,r3
+	rlwinm	r3,r31,32-12,29,31
+
+	/* Secondary group ? if yes, get a inverted hash value */
+	mr	r5,r28
+	andi.	r0,r31,_PAGE_SECONDARY
+	beq	1f
+	not	r5,r5
+1:
+	/* Calculate proper slot value for ppc_md.hpte_updatepp */
+	and	r0,r5,r27
+	rldicr	r0,r0,3,63-3	/* r0 = (hash & mask) << 3 */
+	add	r3,r0,r3	/* add slot idx */
+
+	/* Call ppc_md.hpte_updatepp */
+	mr	r5,r29			/* va */
+	li	r6,0			/* large is 0 */
+	ld	r7,STK_PARM(r8)(r1)	/* get "local" param */
+_GLOBAL(htab_call_hpte_updatepp)
+	bl	.			/* Will be patched by htab_finish_init() */
+
+	/* if we failed because typically the HPTE wasn't really here
+	 * we try an insertion.
+	 */
+	cmpi	0,r3,-1
+	beq-	htab_insert_pte
+
+	/* Clear the BUSY bit and Write out the PTE */
+	li	r0,_PAGE_BUSY
+	andc	r30,r30,r0
+	b	htab_write_out_pte
+
+htab_wrong_access:
+	/* Bail out clearing reservation */
+	stdcx.	r31,0,r6
+	li	r3,1
+	b	bail
+
+htab_pte_insert_failure:
+	b	.htab_insert_failure
+
+
diff -urN linux-g5-ppc64/arch/ppc64/mm/hugetlbpage.c linux-g5-htab/arch/ppc64/mm/hugetlbpage.c
--- linux-g5-ppc64/arch/ppc64/mm/hugetlbpage.c	2003-12-02 13:11:59.000000000 +1100
+++ linux-g5-htab/arch/ppc64/mm/hugetlbpage.c	2003-12-08 15:52:18.100012832 +1100
@@ -655,10 +655,6 @@
 	unsigned long hpteflags, prpn;
 	long slot;

-	/* Is this for us? */
-	if (!in_hugepage_area(mm->context, ea))
-		return -1;
-
 	ea &= ~(HPAGE_SIZE-1);

 	/* We have to find the first hugepte in the batch, since
diff -urN linux-g5-ppc64/include/asm-ppc64/mmu.h linux-g5-htab/include/asm-ppc64/mmu.h
--- linux-g5-ppc64/include/asm-ppc64/mmu.h	2003-12-01 14:40:29.000000000 +1100
+++ linux-g5-htab/include/asm-ppc64/mmu.h	2003-12-09 17:23:43.436554256 +1100
@@ -225,6 +225,8 @@
 	asm volatile("ptesync": : :"memory");
 }

+extern void htab_finish_init(void);
+
 #endif /* __ASSEMBLY__ */

 /*
diff -urN linux-g5-ppc64/include/asm-ppc64/pgtable.h linux-g5-htab/include/asm-ppc64/pgtable.h
--- linux-g5-ppc64/include/asm-ppc64/pgtable.h	2003-12-05 13:58:59.000000000 +1100
+++ linux-g5-htab/include/asm-ppc64/pgtable.h	2003-12-09 13:57:28.174879992 +1100
@@ -74,22 +74,23 @@
  * Bits in a linux-style PTE.  These match the bits in the
  * (hardware-defined) PowerPC PTE as closely as possible.
  */
-#define _PAGE_PRESENT	0x001UL	/* software: pte contains a translation */
-#define _PAGE_USER	0x002UL	/* matches one of the PP bits */
-#define _PAGE_RW	0x004UL	/* software: user write access allowed */
-#define _PAGE_GUARDED	0x008UL
-#define _PAGE_COHERENT	0x010UL	/* M: enforce memory coherence (SMP systems) */
-#define _PAGE_NO_CACHE	0x020UL	/* I: cache inhibit */
-#define _PAGE_WRITETHRU	0x040UL	/* W: cache write-through */
-#define _PAGE_DIRTY	0x080UL	/* C: page changed */
-#define _PAGE_ACCESSED	0x100UL	/* R: page referenced */
-#define _PAGE_FILE	0x200UL /* software: pte holds file offset */
-#define _PAGE_HASHPTE	0x400UL	/* software: pte has an associated HPTE */
-#define _PAGE_EXEC	0x800UL	/* software: i-cache coherence required */
-#define _PAGE_SECONDARY 0x8000UL /* software: HPTE is in secondary group */
-#define _PAGE_GROUP_IX  0x7000UL /* software: HPTE index within group */
+#define _PAGE_PRESENT	0x0001 /* software: pte contains a translation */
+#define _PAGE_USER	0x0002 /* matches one of the PP bits */
+#define _PAGE_FILE	0x0002 /* (!present only) software: pte holds file offset */
+#define _PAGE_RW	0x0004 /* software: user write access allowed */
+#define _PAGE_GUARDED	0x0008
+#define _PAGE_COHERENT	0x0010 /* M: enforce memory coherence (SMP systems) */
+#define _PAGE_NO_CACHE	0x0020 /* I: cache inhibit */
+#define _PAGE_WRITETHRU	0x0040 /* W: cache write-through */
+#define _PAGE_DIRTY	0x0080 /* C: page changed */
+#define _PAGE_ACCESSED	0x0100 /* R: page referenced */
+#define _PAGE_EXEC	0x0200 /* software: i-cache coherence required */
+#define _PAGE_HASHPTE	0x0400 /* software: pte has an associated HPTE */
+#define _PAGE_BUSY	0x0800 /* software: PTE & hash are busy */
+#define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */
+#define _PAGE_GROUP_IX  0x7000 /* software: HPTE index within group */
 /* Bits 0x7000 identify the index within an HPT Group */
-#define _PAGE_HPTEFLAGS (_PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
+#define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
 /* PAGE_MASK gives the right answer below, but only by accident */
 /* It should be preserving the high 48 bits and then specifically */
 /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */
@@ -157,8 +158,10 @@
 #define _PMD_HUGEPAGE	0x00000001U
 #define HUGEPTE_BATCH_SIZE (1<<(HPAGE_SHIFT-PMD_SHIFT))

+#ifndef __ASSEMBLY__
 int hash_huge_page(struct mm_struct *mm, unsigned long access,
 		   unsigned long ea, unsigned long vsid, int local);
+#endif /* __ASSEMBLY__ */

 #define HAVE_ARCH_UNMAPPED_AREA
 #else
@@ -288,9 +291,12 @@
 					unsigned long set )
 {
 	unsigned long old, tmp;
-
+	extern void udbg_putc(unsigned char c);
+
 	__asm__ __volatile__(
 	"1:	ldarx	%0,0,%3		# pte_update\n\
+	andi.	%1,%0,0x0800 \n\
+	bne-	1b \n\
 	andc	%1,%0,%4 \n\
 	or	%1,%1,%5 \n\
 	stdcx.	%1,0,%3 \n\


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Tue Dec  9 19:02:41 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 09 Dec 2003 19:02:41 +1100
Subject: hash table
In-Reply-To: <1070956595.11006.183.camel@gaston>
References: <1070956595.11006.183.camel@gaston>
Message-ID: <1070956960.11009.186.camel@gaston>


On Tue, 2003-12-09 at 18:56, Benjamin Herrenschmidt wrote:
> Here's a first shot at my rework of __hash_page, if you want
> to have a quick look... It did a few tests but didn't really
> stress the box that badly, so there may be bugs in there.
>
> .../...

And we also want that patch (which can be included in ameslab
asap I suppose). It fixes the .got to be right before the .toc
so that @got accesses done from assembly work properly. According
to Alan Modra, the old stuff with .got in data segment was bogus.

Ben.


===== arch/ppc64/kernel/vmlinux.lds.S 1.18 vs edited =====
--- 1.18/arch/ppc64/kernel/vmlinux.lds.S	Fri Sep 12 21:01:40 2003
+++ edited/arch/ppc64/kernel/vmlinux.lds.S	Mon Dec  8 19:04:13 2003
@@ -53,7 +53,6 @@
     *(.data1)
     *(.sdata)
     *(.sdata2)
-    *(.got.plt) *(.got)
     *(.dynamic)
     CONSTRUCTORS
   }
@@ -126,6 +125,7 @@
   /* freed after init ends here */

   __toc_start = .;
+  .got : { *(.got.plt) *(.got) }
   .toc : { *(.toc) }
   . = ALIGN(4096);
   __toc_end = .;


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olof at austin.ibm.com  Wed Dec 10 07:20:19 2003
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 09 Dec 2003 14:20:19 -0600
Subject: hash table
In-Reply-To: <1070956595.11006.183.camel@gaston>
References: <1070956595.11006.183.camel@gaston>
Message-ID: <3FD62E83.6070207@austin.ibm.com>


Benjamin Herrenschmidt wrote:

>I think there is no race with the flush code. The reason is that
>the case where flush is called on a present page seem to be strictly
>limited to a PP bits update (or an accessed bits update in some
>error case, but we can dismiss that one completely I beleive).
>
>
>Since flush uses pte_update, it will not race with a pending
>_hash_page. The only possible race is a __hash_page occuring during
>a flush. But in this case, the PTE will have the new PP bits already
>so at worst, we exit flush with an entry present... but that has the
>new bits. So it's ok. I don't think we can race on the content of
>the HPTE neither as we have the HPTE lock bit there. I'd still
>like your point of view though.
>
>
I can see a race between the find_linux_pte() and the use of ptep in
__hash_page. Another CPU can come in during that window and deallocate
the PTE, can't it? One solution for this is to set _PAGE_BUSY in
find_linux_pte() atomically during lookup. There's even more subtle
races in the sense that the tree is walked while someone might update it
underneath of the lookup, but maybe they can be ignored?

Also two minor comments:

* in pte_update, use _PAGE_BUSY instead of hardcoded 0x0800? Would
increase readability a little.

* in __hash_page / htab_wrong_access: There's no check for failed stdcx.


-Olof


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olof at austin.ibm.com  Wed Dec 10 07:39:05 2003
From: olof at austin.ibm.com (Olof Johansson)
Date: Tue, 09 Dec 2003 14:39:05 -0600
Subject: hash table
In-Reply-To: <3FD62E83.6070207@austin.ibm.com>
References: <1070956595.11006.183.camel@gaston> <3FD62E83.6070207@austin.ibm.com>
Message-ID: <3FD632E9.2030606@austin.ibm.com>


Olof Johansson wrote:

>
> There's even more subtle races in the sense that the tree is walked
> while someone might update it
> underneath of the lookup, but maybe they can be ignored?

Hmm, I'm used to thinking about this in 2.4, where we don't have the
HPTE lock bit. I'm guessing that will protect us here.


-Olof


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Wed Dec 10 11:02:52 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 10 Dec 2003 11:02:52 +1100
Subject: hash table
In-Reply-To: <3FD62E83.6070207@austin.ibm.com>
References: <1070956595.11006.183.camel@gaston>
	 <3FD62E83.6070207@austin.ibm.com>
Message-ID: <1071014571.12500.215.camel@gaston>


> I can see a race between the find_linux_pte() and the use of ptep in
> __hash_page. Another CPU can come in during that window and deallocate
> the PTE, can't it? One solution for this is to set _PAGE_BUSY in
> find_linux_pte() atomically during lookup. There's even more subtle
> races in the sense that the tree is walked while someone might update it
> underneath of the lookup, but maybe they can be ignored?

Yup, this race is on my list already ;)

I want to move find_linux_pte down into __hash_page anyway, but that's
not how to fix this race.

AFAIK, the only race is (very unlikely but definitely there) if we free
a PTE page on one CPU while we are in hash_page() on another CPU.

Paulus proposed a fix for this which consist of delaying the actual
freeing of PTE pages. We gather them into a list that we free either
after a given threshold or after a while at idle time.

When we actually go to free it, we use an IPI to sync with othe CPUs,
making sure they aren't in hash_page(). At that point, we'll have
already cleared the pmd entries, so we know no CPU will go down to the
PTE any more on a further hash_page().

>Also two minor comments:
>
> * in pte_update, use _PAGE_BUSY instead of hardcoded 0x0800? Would
> increase readability a little.

Yah, maybe, I didn't feel like adding another argument to the asm
statement,  I hate that syntax, but you are probably right ;)

> * in __hash_page / htab_wrong_access: There's no check for failed stdcx.

That's normal, the only point of this stdcx. is to not leave a dangling
reservation, I don't care if it succeed as the value I'm writing back is
the original value intact.

Thanks for your comments,
Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From brking at us.ibm.com  Thu Dec 11 01:51:41 2003
From: brking at us.ibm.com (Brian King)
Date: Wed, 10 Dec 2003 08:51:41 -0600
Subject: pci_map_single return value
Message-ID: <3FD732FD.10903@us.ibm.com>


Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0
on failure on IA64? Which one is correct? If NO_TCE is correct, then why
is it defined in a ppc64 include? This makes it difficult for device
drivers to actually check for it.


--
Brian King
eServer Storage I/O
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From boutcher at us.ibm.com  Thu Dec 11 02:27:04 2003
From: boutcher at us.ibm.com (David Boutcher)
Date: Wed, 10 Dec 2003 09:27:04 -0600
Subject: pci_map_single return value
In-Reply-To: <3FD732FD.10903@us.ibm.com>
Message-ID: <OF2574EB93.2194319E-ON86256DF8.00549ACD-86256DF8.0054B803@us.ibm.com>


Because 0 is a valid TCE in the current ppc64 implementation :-)  Though
actually Dave E made some changes lately that may make that not true any
more, I'm not sure.

owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/10/2003 08:51:41 AM:
> Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0
> on failure on IA64? Which one is correct? If NO_TCE is correct, then why
> is it defined in a ppc64 include? This makes it difficult for device
> drivers to actually check for it.

Dave Boutcher
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olh at suse.de  Thu Dec 11 03:45:53 2003
From: olh at suse.de (Olaf Hering)
Date: Wed, 10 Dec 2003 17:45:53 +0100
Subject: bogus changes for generic pci_scan_slot() in ameslab-2.5
In-Reply-To: <1070385249.1123.1659.camel@tin.ibm.com	>
References: <20031128154151.GA30606@suse.de> <1070385249.1123.1659.camel@tin.ibm.com	>
Message-ID: <20031210164553.GB29983@suse.de>


 On Tue, Dec 02, Jake Moilanen wrote:

> This looks like the workaround for the pci multifunc problem that is
> seen on LPARs.  The linux-2.6-pcibios-scan-all-fns-1.patch I posted in
> the JS20 support email fixes the same problem, does not break ppc32, and
> should probably used instead.

I have tried this patch on the beige g3 and it booted ok. The one below
should be reverted from ameslab.

> On Fri, 2003-11-28 at 09:41, Olaf Hering wrote:
> > Good morning,
> >
> > what is the purpose of this change?
> >
> >
> > diff -purN linux-2.5/drivers/pci/probe.c linuxppc64-2.5/drivers/pci/probe.c
> > --- linux-2.5/drivers/pci/probe.c	2003-08-06 15:34:30.000000000 +0000
> > +++ linuxppc64-2.5/drivers/pci/probe.c	2003-11-05 22:12:33.000000000 +0000
> > @@ -552,6 +552,7 @@ int __devinit pci_scan_slot(struct pci_b
> >  		struct pci_dev *dev;
> >
> >  		dev = pci_scan_device(bus, devfn);
> > +#if 0
> >  		if (func == 0) {
> >  			if (!dev)
> >  				break;
> > @@ -560,6 +561,10 @@ int __devinit pci_scan_slot(struct pci_b
> >  				continue;
> >  			dev->multifunction = 1;
> >  		}
> > +#else
> > +		if (!dev)
> > +			continue;
> > +#endif
> >
> >  		/* Fix up broken headers */
> >  		pci_fixup_device(PCI_FIXUP_HEADER, dev);
> >
> > It breaks on ppc32, B&W G3, dies in indirect_read_config() because the
> > pointer *cfg_data becomes bogus, devfn is > 0xff (no idea if that
> > matters).
> >
> > turning #if 0 into #if 1 cures it. I havent tried it on other systems
> > yet, but at least a PReP MTX+ works with the patch above.
> >
> >
> >
> > --
> > USB is for mice, FireWire is for men!
> >
> > sUse lINUX ag, n?RNBERG
> >
>

--
USB is for mice, FireWire is for men!

sUse lINUX ag, n?RNBERG

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From nathanl at austin.ibm.com  Thu Dec 11 11:52:06 2003
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Wed, 10 Dec 2003 18:52:06 -0600
Subject: lparcfg code
In-Reply-To: <20031210215427.CDC3E24064@source.scl.ameslab.gov>
References: <20031210215427.CDC3E24064@source.scl.ameslab.gov>
Message-ID: <3FD7BFB6.1000102@austin.ibm.com>


Hi-

I noticed this just got committed.

ppc64 at source.scl.ameslab.gov wrote:
> full patch URL:
> http://source.scl.ameslab.gov:14690//linux-2.5/patch at 1.1343
>
> ChangeSet
>   1.1343 03/12/10 15:46:38 will_schmidt at vnet.ibm.com +5 -0
>   add/forward port of lparcfg
>
>   arch/ppc64/kernel/lparcfg.c
>     1.1 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +474 -0
>
>   include/asm-ppc64/hvcall.h
>     1.10 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +17 -0
>     add hcall for 4 output parms
>
>   arch/ppc64/kernel/pSeries_hvCall.S
>     1.6 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +38 -0
>     add hcall for 4 output parms
>
>   arch/ppc64/kernel/lparcfg.c
>     1.0 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +0 -0
>     BitKeeper file /development/willschm/kernels/bk25.dec10/linux-2.5/arch/ppc64/kernel/lparcfg.c
>
>   arch/ppc64/kernel/Makefile
>     1.30 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +2 -0
>     add/forward port of lparcfg
>
>   arch/ppc64/Kconfig
>     1.31 03/12/10 15:43:51 will_schmidt at vnet.ibm.com +4 -0
>     add/forward port of lparcfg
>
> ======== ChangeSet 1.1343 ========
> will_schmidt at vnet.ibm.com|ChangeSet|20031210214638|52756


The lparcfg thing doesn't want to build as a module:

*** Warning: "cur_cpu_spec" [arch/ppc64/kernel/lparcfg.ko] undefined!
*** Warning: "systemcfg" [arch/ppc64/kernel/lparcfg.ko] undefined!
*** Warning: ".plpar_hcall_4out" [arch/ppc64/kernel/lparcfg.ko] undefined!
   CC      arch/ppc64/kernel/lparcfg.mod.o
   LD [M]  arch/ppc64/kernel/lparcfg.ko

Do we really need it to be a module?  To me, it seems like something
that could just be a boolean config option instead of tristate.

Some other points and questions:

- the init function in arch/ppc64/kernel/lparcfg.c should check whether
this is an lpar system; if not, it should return without creating
/proc/ppc64/lparcfg.

- the way the lparcfg code uses the proc_dir_entry->data pointer as a
"scratch" buffer before copying to user space is pretty weird.  The read
function should allocate a separate buffer for each invocation, or use
the stack.  The data member of proc_dir_entry is usually used for
symlinks or static data, such as device tree properties.

- the lparcfg_open function seems unnecessary.

- the h_get_ppp function should be declared static, I suspect.  It
should also check the return value of plpar_hcall_4out.

- functions' opening braces should be on their own lines.

- why is the pSeries version of lparcfg_data gathering information (e.g.
serial number, system type) that is already available to users in
/proc/device-tree?

- in 2.5, of_find_node_by_path should be used instead of find_path_device.

- I noticed a few lines like this:

   if (cur_cpu_spec->firmware_features && FW_FEATURE_SPLPAR)

   I think these should be bitwise-and'ing instead of logical.

- Is /proc/ppc64/lparcfg going to be writable in order to change certain
parameters?  I am doing some work along these lines and I don't want to
duplicate effort.

Thanks,
Nathan


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From paulus at samba.org  Thu Dec 11 15:39:39 2003
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 11 Dec 2003 15:39:39 +1100
Subject: lparcfg code
In-Reply-To: <3FD7BFB6.1000102@austin.ibm.com>
References: <20031210215427.CDC3E24064@source.scl.ameslab.gov>
	<3FD7BFB6.1000102@austin.ibm.com>
Message-ID: <16343.62731.254445.650129@cargo.ozlabs.ibm.com>


Nathan Lynch writes:

> I noticed this just got committed.
>
> ppc64 at source.scl.ameslab.gov wrote:
> > full patch URL:
> > http://source.scl.ameslab.gov:14690//linux-2.5/patch at 1.1343
> >
> > ChangeSet
> >   1.1343 03/12/10 15:46:38 will_schmidt at vnet.ibm.com +5 -0
> >   add/forward port of lparcfg

I noticed that the config option doesn't have any help text.  It's not
necessarily obvious to people what "LPAR configuration data" is or
whether they might want it.  Will, please add some reasonable help
text.

> - functions' opening braces should be on their own lines.

Yes.

Paul.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olof at austin.ibm.com  Fri Dec 12 04:47:47 2003
From: olof at austin.ibm.com (Olof Johansson)
Date: Thu, 11 Dec 2003 11:47:47 -0600
Subject: [2.5] [PATCH] Don't loop looking for free SLB entries in do_slb_bolted
Message-ID: <3FD8ADC3.4090401@austin.ibm.com>

Anton and I have done some work in 2.4 to optimize the SLB reload path.
One big time waster for a busy system is the search for a free entry.
This is the corresponding patch for 2.5/2.6. It's smaller since we don't
have to do slbie's on 2.6.

For a system with smaller working set than the SLB can fit (<16GB),
evicting a used entry is no biggie: It'll be fauled in again and we'll
reach a stable state after a few iterations through this. For systems
with larger working sets, there is no stable state and we will save alot
of time by not scanning all 63 entries on every fault just to find them
all valid and fall back to round-robin.

Comments are welcome. I haven't tested the 2.6 patch as much as 2.4 due
to lack of big hardware and good workloads.

-Olof
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: slb-noloop-patch.25
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031211/da429f90/attachment.txt 

From anton at samba.org  Fri Dec 12 06:40:05 2003
From: anton at samba.org (Anton Blanchard)
Date: Fri, 12 Dec 2003 06:40:05 +1100
Subject: [2.5] [PATCH] Don't loop looking for free SLB entries in do_slb_bolted
In-Reply-To: <3FD8ADC3.4090401@austin.ibm.com>
References: <3FD8ADC3.4090401@austin.ibm.com>
Message-ID: <20031211194005.GB17683@krispykreme>

Hi Olof,

> Anton and I have done some work in 2.4 to optimize the SLB reload
> path. One big time waster for a busy system is the search for a free
> entry. This is the corresponding patch for 2.5/2.6. It's smaller since
> we don't have to do slbie's on 2.6.

Heres what Im testing at the moment. Its a fairly big patch but Im
really hoping to get our SLB reload overhead under control in 2.6 :)

- nop out some stuff that is POWER3/RS64 specific
- we were checking some bits in the DSISR in DataAccess_common that I
  cant find in the architecture manual so I nuked it. (0xa4500000 ->
  0x04500000)
- put do_slb_bolted on a diet, dont do a search for empty entries,
  similar to Olofs patch. Use the POWER4 optimsed mtcrf instruction.
- flush the kernel segment out of the SLB on context switch to avoid
  the race where the translation is in the ERAT but not in the SLB and
  it gets invalidated by another cpu doing tlbie at just the wrong time
  (eg exception exit after srr0/srr1 has been loaded)
- split segment handling and slb handling code apart.
- preload PC, SP and TASK_UNMAPPED_BASE segments on a context switch.
- create an SLB cache and only flush those segments if possible on a
  context switch.
- optimise switch_mm, we were flushing the stab/slb more often than we
  needed to (eg when switching between a user task and a kernel thread).

Its been soaking on a large box for a while now. Its completely untested
on POWER3 however :)

Anton
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: reload.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031212/358726fc/attachment.txt 

From willschm at us.ibm.com  Fri Dec 12 09:06:39 2003
From: willschm at us.ibm.com (Will Schmidt)
Date: Thu, 11 Dec 2003 16:06:39 -0600
Subject: lparcfg code
Message-ID: <OF8AE21311.C39821C2-ON86256DF9.0078B969-86256DF9.00797633@us.ibm.com>


I've just pushed up some changes that include help text, got rid of those
extra '&' chars, and moved the function opening braces.
I will need to revisit the code before too long to fill in the guts of the
get_splpar_potential_characteristics() function, and will  see what I can
do to resolve the rest of the comments.

The purpose of the lparcfg interface was initially intended to be an
interface for a License Manager  tool to determine what sort of system
capabilities exist.     Some of the contents are available elsewhere in the
device tree, but this is meant to be a 'one-stop shopping' interface. :-)

-Will

willschm at us.ibm.com
Linux on PowerPC-64 Development
IBM Rochester


|---------+--------------------------------------->
|         |           Paul Mackerras              |
|         |           <paulus at samba.org>          |
|         |           Sent by:                    |
|         |           owner-linuxppc64-dev at lists.l|
|         |           inuxppc.org                 |
|         |                                       |
|         |                                       |
|         |           12/10/2003 10:39 PM         |
|         |                                       |
|---------+--------------------------------------->
  >------------------------------------------------------------------------------------------------------------|
  |                                                                                                            |
  |       To:       Nathan Lynch <nathanl at austin.ibm.com>, Will Schmidt/Rochester/IBM at IBMUS                    |
  |       cc:       linuxppc64-dev at lists.linuxppc.org                                                          |
  |       Subject:  Re: lparcfg code                                                                           |
  |                                                                                                            |
  >------------------------------------------------------------------------------------------------------------|


Nathan Lynch writes:

> I noticed this just got committed.
>
> ppc64 at source.scl.ameslab.gov wrote:
> > full patch URL:
> > http://source.scl.ameslab.gov:14690//linux-2.5/patch at 1.1343
> >
> > ChangeSet
> >   1.1343 03/12/10 15:46:38 will_schmidt at vnet.ibm.com +5 -0
> >   add/forward port of lparcfg

I noticed that the config option doesn't have any help text.  It's not
necessarily obvious to people what "LPAR configuration data" is or
whether they might want it.  Will, please add some reasonable help
text.

> - functions' opening braces should be on their own lines.

Yes.

Paul.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From nathanl at austin.ibm.com  Fri Dec 12 10:11:00 2003
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Thu, 11 Dec 2003 17:11:00 -0600
Subject: lparcfg code
In-Reply-To: <OF8AE21311.C39821C2-ON86256DF9.0078B969-86256DF9.00797633@us.ibm.com>
References: <OF8AE21311.C39821C2-ON86256DF9.0078B969-86256DF9.00797633@us.ibm.com>
Message-ID: <3FD8F984.7050104@austin.ibm.com>

Will Schmidt wrote:
> I've just pushed up some changes that include help text, got rid of those
> extra '&' chars, and moved the function opening braces.
> I will need to revisit the code before too long to fill in the guts of the
> get_splpar_potential_characteristics() function, and will  see what I can
> do to resolve the rest of the comments.
>
> The purpose of the lparcfg interface was initially intended to be an
> interface for a License Manager  tool to determine what sort of system
> capabilities exist.     Some of the contents are available elsewhere in the
> device tree, but this is meant to be a 'one-stop shopping' interface. :-)

Thanks, Will.  I noticed a couple other little things (compiler
warnings, some more && vs & stuff, and struct initializers) in the
meantime, patch is attached.

Look ok?

Nathan

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lparcfg_cleanup.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031211/69ae8ed8/attachment.txt 

From nathanl at austin.ibm.com  Fri Dec 12 13:13:18 2003
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Thu, 11 Dec 2003 20:13:18 -0600
Subject: [PATCH] make debugger optional, depend on CONFIG_DEBUG_KERNEL
Message-ID: <3FD9243E.6030600@austin.ibm.com>

Hi-

Currently in the 2.5 tree, selecting "Kernel hacking" in the top level
config menu entails enabling a debugger (either xmon or kdb).  Patch
allows one to not use a debugger at all, and ensures that one cannot
enable a debugger without selecting CONFIG_DEBUG_KERNEL.

Nathan


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: debugger_optional.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031211/405bc3c1/attachment.txt 

From paulus at au1.ibm.com  Fri Dec 12 16:08:03 2003
From: paulus at au1.ibm.com (Paul Mackerras)
Date: Fri, 12 Dec 2003 16:08:03 +1100
Subject: srp.h and viosrp.h
Message-ID: <16345.19763.163666.223631@cargo.ozlabs.ibm.com>


I just had a look at srp.h and viosrp.h in drivers/scsi in the ameslab
2.4 tree.  These can't be submitted in their present form.  Firstly,
they have no copyright notice, and secondly, they have no comments at
the top explaining what the files are and what they relate to (and
very few comments explaining the definitions mean).  Finally, why are
these files in drivers/scsi rather than include/something?

Paul.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Fri Dec 12 16:51:11 2003
From: anton at samba.org (Anton Blanchard)
Date: Fri, 12 Dec 2003 16:51:11 +1100
Subject: [2.5] [PATCH] Don't loop looking for free SLB entries in do_slb_bolted
In-Reply-To: <20031211194005.GB17683@krispykreme>
References: <3FD8ADC3.4090401@austin.ibm.com> <20031211194005.GB17683@krispykreme>
Message-ID: <20031212055111.GC17683@krispykreme>


> Its been soaking on a large box for a while now. Its completely untested
> on POWER3 however :)

And we popped a bug in the patch. The valid bit is in the correct spot,
no idea what we were moving from 2^11 (I suspect it was my bad spec
reading).

Anton


diff -puN arch/ppc64/kernel/head.S~debug_slb_rewrite arch/ppc64/kernel/head.S
--- foo_work/arch/ppc64/kernel/head.S~debug_slb_rewrite	2003-12-11 15:51:17.110465423 -0600
+++ foo_work-anton/arch/ppc64/kernel/head.S	2003-12-11 15:51:31.704685017 -0600
@@ -997,7 +997,9 @@ SLB_NUM_ENTRIES = 64
 	 * non recoverable point (after setting srr0/1) - Anton
 	 */
 	slbmfee	r21,r22
+#if 0
 	insrdi	r21,r21,12,36		/* move valid bit 2^11 to 2^27 */
+#endif
 	srdi	r21,r21,27
 	/*
 	 * This is incorrect (r1 is not the kernel stack) if we entered

_

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Fri Dec 12 17:12:03 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 12 Dec 2003 17:12:03 +1100
Subject: [PATCH] Fix race between pte_free and hash_page
Message-ID: <1071209522.12496.344.camel@gaston>


Hi !

As discussed earlier with Olof, there is (and has always been afaik)
a race between freeing a PTE page and dereferencing PTEs in that page
from hash_page on another CPU. Typically can happen with a threaded
application if one thread unmap's something that the other thread
still uses.

After much thinking and at least 2 different implementations and
Rusty wise advice, here is an implementation that should be both
fast and scalable. The idea is to defer the freeing until some
safe point where we know the PTE page will no longer be used
(since the PMD has been cleared, since a point where we know all
pending hash_page are completed). This is just what RCU gives us
so let's use it.

Unless I missed something, this should probably be applied to
ameslab-2.5 now.

===== include/asm/pgalloc.h 1.11 vs edited =====
--- 1.11/include/asm-ppc64/pgalloc.h	Fri Sep 19 16:55:11 2003
+++ edited/include/asm/pgalloc.h	Fri Dec 12 17:07:17 2003
@@ -3,7 +3,10 @@

 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
 #include <asm/processor.h>
+#include <asm/tlb.h>

 extern kmem_cache_t *zero_cache;

@@ -62,15 +65,54 @@

 	return NULL;
 }
-
-static inline void
-pte_free_kernel(pte_t *pte)
+
+static inline void pte_free_kernel(pte_t *pte)
 {
 	kmem_cache_free(zero_cache, pte);
 }

 #define pte_free(pte_page)	pte_free_kernel(page_address(pte_page))
-#define __pte_free_tlb(tlb, pte)	pte_free(pte)
+
+struct pte_freelist_batch
+{
+	struct rcu_head	rcu;
+	unsigned int	index;
+	struct page *	pages[0];
+};
+
+#define PTE_FREELIST_SIZE	((PAGE_SIZE - sizeof(struct pte_freelist_batch) / \
+				  sizeof(struct page *)))
+
+extern void pte_free_now(struct page *ptepage);
+extern void pte_free_submit(struct pte_freelist_batch *batch);
+
+DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage)
+{
+	/* This is safe as we are holding page_table_lock */
+        cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id());
+	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+
+	if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask) ||
+	    cpus_equal(tlb->mm->cpu_vm_mask, CPU_MASK_NONE)) {
+		pte_free(ptepage);
+		return;
+	}
+
+	if (*batchp == NULL) {
+		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batchp == NULL) {
+			pte_free_now(ptepage);
+			return;
+		}
+	}
+	(*batchp)->pages[(*batchp)->index++] = ptepage;
+	if ((*batchp)->index == PTE_FREELIST_SIZE) {
+		pte_free_submit(*batchp);
+		*batchp = NULL;
+	}
+}

 #define check_pgt_cache()	do { } while (0)

===== include/asm/tlb.h 1.9 vs edited =====
--- 1.9/include/asm-ppc64/tlb.h	Tue Aug 19 12:46:23 2003
+++ edited/include/asm/tlb.h	Fri Dec 12 13:48:28 2003
@@ -74,6 +74,8 @@
 	batch->index = i;
 }

+extern void pte_free_finish(void);
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	int cpu = smp_processor_id();
@@ -86,6 +88,8 @@

 	flush_hash_range(tlb->mm->context, batch->index, local);
 	batch->index = 0;
+
+	pte_free_finish();
 }

 #endif /* _PPC64_TLB_H */
===== arch/ppc64/mm/init.c 1.52 vs edited =====
--- 1.52/arch/ppc64/mm/init.c	Fri Oct 24 00:10:29 2003
+++ edited/arch/ppc64/mm/init.c	Fri Dec 12 17:09:58 2003
@@ -94,6 +94,52 @@
  * include/asm-ppc64/tlb.h file -- tgall
  */
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
+unsigned long pte_freelist_forced_free;
+
+static void pte_free_smp_sync(void *arg)
+{
+	/* Do nothing, just ensure we sync with all CPUs */
+}
+
+/* This is only called when we are critically out of memory
+ * (and fail to get a page in pte_free_tlb).
+ */
+void pte_free_now(struct page *ptepage)
+{
+	pte_freelist_forced_free++;
+
+	smp_call_function(pte_free_smp_sync, NULL, 0, 1);
+
+	pte_free(ptepage);
+}
+
+static void pte_free_rcu_callback(void *arg)
+{
+	struct pte_freelist_batch *batch = arg;
+	unsigned int i;
+
+	for (i = 0; i < batch->index; i++)
+		pte_free(batch->pages[i]);
+	free_page((unsigned long)batch);
+}
+
+void pte_free_submit(struct pte_freelist_batch *batch)
+{
+	INIT_RCU_HEAD(&batch->rcu);
+	call_rcu(&batch->rcu, pte_free_rcu_callback, batch);
+}
+
+void pte_free_finish(void)
+{
+	/* This is safe as we are holding page_table_lock */
+	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+
+	if (*batchp == NULL)
+		return;
+	pte_free_submit(*batchp);
+	*batchp = NULL;
+}

 void show_mem(void)
 {


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From boutcher at us.ibm.com  Fri Dec 12 18:38:46 2003
From: boutcher at us.ibm.com (David Boutcher)
Date: Fri, 12 Dec 2003 01:38:46 -0600
Subject: srp.h and viosrp.h
In-Reply-To: <16345.19763.163666.223631@cargo.ozlabs.ibm.com>
Message-ID: <OFA7DBE0CC.DDF4AAA4-ON86256DFA.0029A872-86256DFA.0029D817@us.ibm.com>


owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/11/2003 11:08:03 PM:
> I just had a look at srp.h and viosrp.h in drivers/scsi in the ameslab
> 2.4 tree.  These can't be submitted in their present form.  Firstly,
> they have no copyright notice, and secondly, they have no comments at
> the top explaining what the files are and what they relate to (and
> very few comments explaining the definitions mean).  Finally, why are
> these files in drivers/scsi rather than include/something?

Ya, Hollis already made similar comments.  I'm in Germany right now with
limited access to the network, so I'll fix them up next week.

I debated where they should live....they are very SCSI oriented, so
drivers/scsi seemed like a reasonable place.  A fact that will be clearer
once I put comments in them :-)

Dave Boutcher
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olh at suse.de  Sat Dec 13 02:07:31 2003
From: olh at suse.de (Olaf Hering)
Date: Fri, 12 Dec 2003 16:07:31 +0100
Subject: [PATCH] boot Makefile dependency for ld.script
Message-ID: <20031212150731.GA12935@suse.de>


A ld.script update will not trigger a rebuild of the boot files.


diff -purN linux-2.6.0-test11/arch/ppc64/boot/Makefile linux-2.6.0-test11.ppc32/arch/ppc64/boot/Makefile
--- linux-2.6.0-test11/arch/ppc64/boot/Makefile	2003-11-26 21:45:27.000000000 +0100
+++ linux-2.6.0-test11.ppc32/arch/ppc64/boot/Makefile	2003-12-12 16:04:16.000000000 +0100
@@ -106,11 +106,11 @@ $(call obj-sec, $(required) $(initrd)):
 	$(call addsection, $@)

 $(obj)/zImage: obj-boot += $(call obj-sec, $(required))
-$(obj)/zImage: $(call obj-sec, $(required)) $(obj-boot) $(obj)/addnote FORCE
+$(obj)/zImage: $(call obj-sec, $(required)) $(obj-boot) $(obj)/zImage.lds $(obj)/addnote FORCE
 	$(call if_changed,addnote)

 $(obj)/zImage.initrd: obj-boot += $(call obj-sec, $(required) $(initrd))
-$(obj)/zImage.initrd: $(call obj-sec, $(required) $(initrd)) $(obj-boot) $(obj)/addnote FORCE
+$(obj)/zImage.initrd: $(call obj-sec, $(required) $(initrd)) $(obj-boot) $(obj)/zImage.lds $(obj)/addnote FORCE
 	$(call if_changed,addnote)

 $(obj)/imagesize.c: vmlinux
--
USB is for mice, FireWire is for men!

sUse lINUX ag, n?RNBERG

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Sat Dec 13 03:17:01 2003
From: anton at samba.org (Anton Blanchard)
Date: Sat, 13 Dec 2003 03:17:01 +1100
Subject: [PATCH] Fix race between pte_free and hash_page
In-Reply-To: <1071209522.12496.344.camel@gaston>
References: <1071209522.12496.344.camel@gaston>
Message-ID: <20031212161701.GE17683@krispykreme>


Hi Ben,

> +static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage)
> +{
> +	/* This is safe as we are holding page_table_lock */
> +        cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id());
> +	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
> +
> +	if (cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask) ||
> +	    cpus_equal(tlb->mm->cpu_vm_mask, CPU_MASK_NONE)) {
> +		pte_free(ptepage);
> +		return;
> +	}

Looks good. Since we hold the pagetable lock, can we also check for
mm users == 1 and take the fast path?

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From olof at austin.ibm.com  Sat Dec 13 04:20:12 2003
From: olof at austin.ibm.com (Olof Johansson)
Date: Fri, 12 Dec 2003 11:20:12 -0600
Subject: [PATCH] Fix race between pte_free and hash_page
In-Reply-To: <1071209522.12496.344.camel@gaston>
References: <1071209522.12496.344.camel@gaston>
Message-ID: <3FD9F8CC.2020003@austin.ibm.com>


Benjamin Herrenschmidt wrote:

>Unless I missed something, this should probably be applied to
>ameslab-2.5 now.
>
>
>
Ben,

I like it!  Two comments and one question:

* Unless I'm missing something myself, (*batchp)->index is never
initialized. I guess we might get lucky most the time, but it could
cause badness.

* pte_freelist_forced_free is unprotected/nonatomic. It only seems to be
used as an indicator of memory pressure so it's not a big problem. I'm
guessing we don't really want to waste cycles on syncronization and can
live with it not being exact.

* Do we know how much extra IPI activity this causes on a fairly loaded
system? It would be interesting to know.


Thanks,

-Olof


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Sat Dec 13 15:32:48 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 13 Dec 2003 15:32:48 +1100
Subject: [PATCH] Fix race between pte_free and hash_page
In-Reply-To: <3FD9F8CC.2020003@austin.ibm.com>
References: <1071209522.12496.344.camel@gaston>
	 <3FD9F8CC.2020003@austin.ibm.com>
Message-ID: <1071289968.12499.354.camel@gaston>


On Sat, 2003-12-13 at 04:20, Olof Johansson wrote:
> Benjamin Herrenschmidt wrote:
>
> >Unless I missed something, this should probably be applied to
> >ameslab-2.5 now.
> >
> >
> >
> Ben,
>
> I like it!  Two comments and one question:
>
> * Unless I'm missing something myself, (*batchp)->index is never
> initialized. I guess we might get lucky most the time, but it could
> cause badness.

Right. New patch on the way.

> * pte_freelist_forced_free is unprotected/nonatomic. It only seems to be
> used as an indicator of memory pressure so it's not a big problem. I'm
> guessing we don't really want to waste cycles on syncronization and can
> live with it not being exact.

Yup.

> * Do we know how much extra IPI activity this causes on a fairly loaded
> system? It would be interesting to know.

I don't know at this point, I expect no much though, a system that can't
get one page with GFP_ATOMIC must be close to total oom...

Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Sat Dec 13 15:52:40 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 13 Dec 2003 15:52:40 +1100
Subject: [PATCH] Fix race between pte_free and hash_page
In-Reply-To: <20031212161701.GE17683@krispykreme>
References: <1071209522.12496.344.camel@gaston>
	 <20031212161701.GE17683@krispykreme>
Message-ID: <1071291159.12499.365.camel@gaston>


On Sat, 2003-12-13 at 03:17, Anton Blanchard wrote:

> Looks good. Since we hold the pagetable lock, can we also check for
> mm users == 1 and take the fast path?

Yup. Here's a new one doing that. I also removed the test with
CPU_MASK_NONE (seems useless especially with the mm_users one)
and added proper initialization of the "index" field in the
batch as pointed out by Olof.

===== include/asm/pgalloc.h 1.11 vs edited =====
--- 1.11/include/asm-ppc64/pgalloc.h	Fri Sep 19 16:55:11 2003
+++ edited/include/asm/pgalloc.h	Sat Dec 13 15:49:57 2003
@@ -3,7 +3,10 @@

 #include <linux/mm.h>
 #include <linux/slab.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
 #include <asm/processor.h>
+#include <asm/tlb.h>

 extern kmem_cache_t *zero_cache;

@@ -62,15 +65,55 @@

 	return NULL;
 }
-
-static inline void
-pte_free_kernel(pte_t *pte)
+
+static inline void pte_free_kernel(pte_t *pte)
 {
 	kmem_cache_free(zero_cache, pte);
 }

 #define pte_free(pte_page)	pte_free_kernel(page_address(pte_page))
-#define __pte_free_tlb(tlb, pte)	pte_free(pte)
+
+struct pte_freelist_batch
+{
+	struct rcu_head	rcu;
+	unsigned int	index;
+	struct page *	pages[0];
+};
+
+#define PTE_FREELIST_SIZE	((PAGE_SIZE - sizeof(struct pte_freelist_batch) / \
+				  sizeof(struct page *)))
+
+extern void pte_free_now(struct page *ptepage);
+extern void pte_free_submit(struct pte_freelist_batch *batch);
+
+DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage)
+{
+	/* This is safe as we are holding page_table_lock */
+        cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id());
+	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+
+	if (atomic_read(&tlb->mm->mm_users) < 2 ||
+	    cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) {
+		pte_free(ptepage);
+		return;
+	}
+
+	if (*batchp == NULL) {
+		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
+		if (*batchp == NULL) {
+			pte_free_now(ptepage);
+			return;
+		}
+		(*batchp)->index = 0;
+	}
+	(*batchp)->pages[(*batchp)->index++] = ptepage;
+	if ((*batchp)->index == PTE_FREELIST_SIZE) {
+		pte_free_submit(*batchp);
+		*batchp = NULL;
+	}
+}

 #define check_pgt_cache()	do { } while (0)

===== include/asm/tlb.h 1.9 vs edited =====
--- 1.9/include/asm-ppc64/tlb.h	Tue Aug 19 12:46:23 2003
+++ edited/include/asm/tlb.h	Fri Dec 12 13:48:28 2003
@@ -74,6 +74,8 @@
 	batch->index = i;
 }

+extern void pte_free_finish(void);
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	int cpu = smp_processor_id();
@@ -86,6 +88,8 @@

 	flush_hash_range(tlb->mm->context, batch->index, local);
 	batch->index = 0;
+
+	pte_free_finish();
 }

 #endif /* _PPC64_TLB_H */
===== arch/ppc64/mm/init.c 1.52 vs edited =====
--- 1.52/arch/ppc64/mm/init.c	Fri Oct 24 00:10:29 2003
+++ edited/arch/ppc64/mm/init.c	Fri Dec 12 17:09:58 2003
@@ -94,6 +94,52 @@
  * include/asm-ppc64/tlb.h file -- tgall
  */
 DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
+DEFINE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur);
+unsigned long pte_freelist_forced_free;
+
+static void pte_free_smp_sync(void *arg)
+{
+	/* Do nothing, just ensure we sync with all CPUs */
+}
+
+/* This is only called when we are critically out of memory
+ * (and fail to get a page in pte_free_tlb).
+ */
+void pte_free_now(struct page *ptepage)
+{
+	pte_freelist_forced_free++;
+
+	smp_call_function(pte_free_smp_sync, NULL, 0, 1);
+
+	pte_free(ptepage);
+}
+
+static void pte_free_rcu_callback(void *arg)
+{
+	struct pte_freelist_batch *batch = arg;
+	unsigned int i;
+
+	for (i = 0; i < batch->index; i++)
+		pte_free(batch->pages[i]);
+	free_page((unsigned long)batch);
+}
+
+void pte_free_submit(struct pte_freelist_batch *batch)
+{
+	INIT_RCU_HEAD(&batch->rcu);
+	call_rcu(&batch->rcu, pte_free_rcu_callback, batch);
+}
+
+void pte_free_finish(void)
+{
+	/* This is safe as we are holding page_table_lock */
+	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+
+	if (*batchp == NULL)
+		return;
+	pte_free_submit(*batchp);
+	*batchp = NULL;
+}

 void show_mem(void)
 {


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From dwm at austin.ibm.com  Tue Dec 16 04:06:58 2003
From: dwm at austin.ibm.com (Doug Maxey)
Date: Mon, 15 Dec 2003 11:06:58 -0600
Subject: [RFC] JS20 IDE perf patch
Message-ID: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com>


Howdy,

  Have done some investigation of the AMD IDE implementation on the
  JS20 and have come up with a set of patches to address this.  The
  underlying assumption that the PCI bus can only do 66mhz is no
  longer true with the HT implementation on this platform.

  The specific issues seen were:
  1) ide.c has code that assumes the PCI bus only runs between 20 and
     66 mhz.
  2) FW has not completely configured the chipset to run at the higher
     IDE bus speeds.
  3) amd74xx.c does not recognize and cannot use the higher PCI bus
     speeds to enable the higher IDE speeds.

  Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in
  linux/ide.h.  These values are used in ide/ide.c and amd74xx.c.  If
  there are arch specific values, they override the the values set in
  linux/ide.h via the arch specific header.

  Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c.

  Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in
  amd74xx.c and is used to determine the UDMA settings for the
  drives.

  Please review these changes.  Comments are welcome.

++doug

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/octet-stream
Size: 6001 bytes
Desc: amd-js20-ide-perf.diff
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031215/c7e40981/attachment.obj 

From lxiep at us.ibm.com  Tue Dec 16 11:26:32 2003
From: lxiep at us.ibm.com (Linda Xie)
Date: Mon, 15 Dec 2003 18:26:32 -0600
Subject: [PATCH] Add unlimited name length support to PHP core
Message-ID: <3FDE5138.7000507@us.ltcfwd.linux.ibm.com>

Hi All,


The attached patch adds unlimited name length support to PHP core
and some fixes needed in kobject.c and symlink.c.

Please review it and send your comments by 12/19/03.


Thanks,


Linda

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: kobj-patch.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031215/300f2ff1/attachment.txt 

From benh at kernel.crashing.org  Tue Dec 16 12:59:14 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 16 Dec 2003 12:59:14 +1100
Subject: [RFC] JS20 IDE perf patch
In-Reply-To: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com>
References: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com>
Message-ID: <1071539954.12497.443.camel@gaston>


>   Item 1 is addressed by creating ATA_PCIMHZ_MIN and ATA_PCIMHZ_MAX in
>   linux/ide.h.  These values are used in ide/ide.c and amd74xx.c.  If
>   there are arch specific values, they override the the values set in
>   linux/ide.h via the arch specific header.
>
>   Item 2 is addressed by fixup_amd_ide in arch/ppc64/kernel/pci.c.
>
>   Item 3 uses the ATA_PCIMHZ* to calculate the PCI bus frequency in
>   amd74xx.c and is used to determine the UDMA settings for the
>   drives.
>
>   Please review these changes.  Comments are welcome.

I'd rather keep generic drivers/ide.c out of the picture and do things
locally to the AMD IDE driver... Wouldn't it be possible to get the
proper values in the OF device-tree and add code to the AMD driver
to pick them from there ?

Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From segher at kernel.crashing.org  Wed Dec 17 01:54:26 2003
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 16 Dec 2003 15:54:26 +0100
Subject: [RFC] JS20 IDE perf patch
In-Reply-To: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com>
References: <200312151706.hBFH6wv1008098@falcon30.maxey.austin.rr.com>
Message-ID: <BE7A3C9C-2FD7-11D8-9472-000A95A4DC02@kernel.crashing.org>


On 15-dec-03, at 18:06, Doug Maxey wrote:
>   Please review these changes.  Comments are welcome.

+#ifdef CONFIG_PPC_PSERIES
+#define ATA_PCIMHZ_MIN 20
+#define ATA_PCIMHZ_MAX 512
+#endif /* CONFIG_PPC_PSERIES */


The 8111 HT is an 8-bit link at 200MHz, i.e. 800MB/s total or 400MB/s
each way;
that would translate into a 200MHz or a 100MHz PCI bus equivalent,
whichever
value you like better (and they're both somewhat bogus anyway), but 512
seems
a bit high.  Anything I'm missing here?


Thanks,

Segher


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From dwm at austin.ibm.com  Wed Dec 17 02:59:54 2003
From: dwm at austin.ibm.com (Doug Maxey)
Date: Tue, 16 Dec 2003 09:59:54 -0600
Subject: [RFC] JS20 IDE perf patch
Message-ID: <200312161559.hBGFxsdd024620@localhost.localdomain>


Ben,

  First, my bad, the "hack at falcon30..." was a garbled address, not
  using the same alias file on both machines...  Correct addr here.

  Did not see a way to pass any interesting data, in particular the bus
  speed via pci_dev.  Do you have a suggestion on that?

  The primary reason to change the generic ide.c was to allow the
  setting on the append line to turn on the changes.

  Cannot answer regarding the device tree contents.

++doug

On Tue, 16 Dec 2003 12:59:14 +1100, Benjamin Herrenschmidt wrote:
[snip]
>
>I'd rather keep generic drivers/ide.c out of the picture and do things
>locally to the AMD IDE driver... Wouldn't it be possible to get the
>proper values in the OF device-tree and add code to the AMD driver
>to pick them from there ?
>
>Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From dwm at austin.ibm.com  Wed Dec 17 03:19:19 2003
From: dwm at austin.ibm.com (Doug Maxey)
Date: Tue, 16 Dec 2003 10:19:19 -0600
Subject: [RFC] JS20 IDE perf patch
Message-ID: <200312161619.hBGGJJ8X024652@localhost.localdomain>


Segher,

  Thanks for the info.  Have not spent as much time with the 8111 spec
  as I probably should(:

  I was extrapolating the ATA_PCIMHZ_MAX for future systems that might
  have the maxed out PCIX bus.  You never know what HW throws at you
  here.

  For this particular instance, the 100 would be good. I need to
  rework the amd74xx.c changes for that one.

++doug

On Tue, 16 Dec 2003 15:54:26 +0100, Segher Boessenkool wrote:
>
>On 15-dec-03, at 18:06, Doug Maxey wrote:
>>   Please review these changes.  Comments are welcome.
>
>+#ifdef CONFIG_PPC_PSERIES
>+#define ATA_PCIMHZ_MIN 20
>+#define ATA_PCIMHZ_MAX 512
>+#endif /* CONFIG_PPC_PSERIES */
>
>
>The 8111 HT is an 8-bit link at 200MHz, i.e. 800MB/s total or 400MB/s
>each way;
>that would translate into a 200MHz or a 100MHz PCI bus equivalent,
>whichever
>value you like better (and they're both somewhat bogus anyway), but 512
>seems
>a bit high.  Anything I'm missing here?
>
>
>Thanks,
>
>Segher
>


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From gregkh at us.ibm.com  Wed Dec 17 03:55:25 2003
From: gregkh at us.ibm.com (Greg KH)
Date: Tue, 16 Dec 2003 08:55:25 -0800
Subject: [PATCH] Add unlimited name length support to PHP core
In-Reply-To: <3FDE5138.7000507@us.ltcfwd.linux.ibm.com>
References: <3FDE5138.7000507@us.ltcfwd.linux.ibm.com>
Message-ID: <20031216165525.GA16517@us.ibm.com>


On Mon, Dec 15, 2003 at 06:26:32PM -0600, Linda Xie wrote:
> Hi All,
>
>
> The attached patch adds unlimited name length support to PHP core
> and some fixes needed in kobject.c and symlink.c.

Please _ALWAYS_ send patches like this, that are not PPC specific, to
the proper mailing lists (in this case, the pci hotplug one, and
linux-kernel, or just linux-kernel, that would work.)

Also, please explain why the change is needed to kobjects and sysfs,
breaking them all up into individual patches if they do different
things.

> Please review it and send your comments by 12/19/03.

Heh, asking for review by a specific date is not the nicest thing to do
in the open source community :)

thanks,

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From johnrose at us.ibm.com  Wed Dec 17 04:17:10 2003
From: johnrose at us.ibm.com (John H Rose)
Date: Tue, 16 Dec 2003 11:17:10 -0600
Subject: [PATCH] Add unlimited name length support to PHP core
In-Reply-To: <20031216165525.GA16517@us.ibm.com>
Message-ID: <OF3181D9F7.0676F3EF-ON85256DFE.005E53CB-86256DFE.005EF71E@us.ibm.com>


Hi Greg-

> Heh, asking for review by a specific date is not the nicest thing to do
> in the open source community :)

How can one convey the importance of timing for review comments without
sounding inconsiderate?  Given the reality of project deadlines and code
drop dates, such requests seem like a necessary evil.  Thoughts?

Just curious-
John

-----------------------
John Rose
pSeries Linux Development
johnrose at austin.ibm.com
Office: 512-838-0298
Tieline: 678-0298

gregkh at us.ltcfwd.linux.ibm.com@lists.linuxppc.org on 12/16/2003 10:55:25 AM

Sent by:    owner-linuxppc64-dev at lists.linuxppc.org


To:    lxiep at us.ltcfwd.linux.ibm.com
cc:    linuxppc64-dev at lists.linuxppc.org, Linda Xie/Austin/IBM at IBMUS
Subject:    Re: [PATCH] Add unlimited name length support to PHP core


On Mon, Dec 15, 2003 at 06:26:32PM -0600, Linda Xie wrote:
> Hi All,
>
>
> The attached patch adds unlimited name length support to PHP core
> and some fixes needed in kobject.c and symlink.c.

Please _ALWAYS_ send patches like this, that are not PPC specific, to
the proper mailing lists (in this case, the pci hotplug one, and
linux-kernel, or just linux-kernel, that would work.)

Also, please explain why the change is needed to kobjects and sysfs,
breaking them all up into individual patches if they do different
things.

> Please review it and send your comments by 12/19/03.

Heh, asking for review by a specific date is not the nicest thing to do
in the open source community :)

thanks,

greg k-h


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From segher at kernel.crashing.org  Wed Dec 17 04:22:56 2003
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 16 Dec 2003 18:22:56 +0100
Subject: [RFC] JS20 IDE perf patch
In-Reply-To: <OFB26704A7.E32EA5A1-ON87256DFE.005C497E-86256DFE.005C631C@us.ibm.com>
References: <OFB26704A7.E32EA5A1-ON87256DFE.005C497E-86256DFE.005C631C@us.ibm.com>
Message-ID: <7D17378E-2FEC-11D8-9472-000A95A4DC02@kernel.crashing.org>


On 16-dec-03, at 17:51, Mark Hack wrote:

> The HT link is 400Mhz DDR 8 bit. The 200Mhz bus is the default but
> this is increased during init.
>
> The bus speed is therefore 400 x 2 MBytes /sec

That's not what the AMD spec
(http://www.amd.com/us-en/assets/content_type/
white_papers_and_tech_docs/24674.pdf , sorry if this link gets broken
up) says -- right on the
front page it says 200MHz DDR is max speed, and again it says so at
page 39 (bottom).

This DDR stuff is confusing sometimes ;-)


Segher


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From gregkh at us.ibm.com  Wed Dec 17 04:29:27 2003
From: gregkh at us.ibm.com (Greg KH)
Date: Tue, 16 Dec 2003 09:29:27 -0800
Subject: [PATCH] Add unlimited name length support to PHP core
In-Reply-To: <OF3181D9F7.0676F3EF-ON85256DFE.005E53CB-86256DFE.005EF71E@us.ibm.com>
References: <20031216165525.GA16517@us.ibm.com> <OF3181D9F7.0676F3EF-ON85256DFE.005E53CB-86256DFE.005EF71E@us.ibm.com>
Message-ID: <20031216172927.GA16655@us.ibm.com>


On Tue, Dec 16, 2003 at 11:17:10AM -0600, John H Rose wrote:
>
>
> Hi Greg-
>
> > Heh, asking for review by a specific date is not the nicest thing to do
> > in the open source community :)
>
> How can one convey the importance of timing for review comments without
> sounding inconsiderate?  Given the reality of project deadlines and code
> drop dates, such requests seem like a necessary evil.  Thoughts?

First off, don't say anything.  Almost all maintainers are very
responsive.  If you don't hear back after a while, a repost is
acceptable, and usually expected.  If you still don't hear back, oh
well...

But for you to try to impose your project deadlines on community members
is not a very nice thing to do.  Odds are, none of them really care
about your deadlines, as they have their own, differing schedules, and
priorities :)

Hope this helps,

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From hollisb at us.ibm.com  Wed Dec 17 06:17:26 2003
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 16 Dec 2003 13:17:26 -0600
Subject: OpenAFS?
Message-ID: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com>


I just got a random call yesterday asking about AFS on ppc64, and I
don't think I've seen the issue(s) described anywhere...

I remember hearing people complain OpenAFS didn't work on ppc64. One
issue was the code trying to insert a function pointer into the syscall
table at runtime, which doesn't work because it ends up inserting a
pointer to the ppc64 function descriptor.

Are there other issues? Has anyone seriously tried to make it work?

--
Hollis Blanchard
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From hollisb at us.ibm.com  Wed Dec 17 07:16:02 2003
From: hollisb at us.ibm.com (Hollis Blanchard)
Date: Tue, 16 Dec 2003 14:16:02 -0600
Subject: code review requests
In-Reply-To: <20031216172927.GA16655@us.ibm.com>
Message-ID: <AB9C5136-3004-11D8-BB49-000A95A0560C@us.ibm.com>


On Tuesday, Dec 16, 2003, at 11:29 US/Central, Greg KH wrote:
>
> First off, don't say anything.  Almost all maintainers are very
> responsive.

That hasn't been my experience, at least... Maybe you've found some
maintainers who aren't overworked?! :)

Seriously, it seems to me that the "gets dropped" option ends up
happening pretty frequently, unless you're working in an area that's
personally interesting to the maintainer. (That goes for email RFCs as
well, private or public.) Those areas of interest may not coincide with
work assignments you've been given...

--
Hollis Blanchard
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From jschopp at austin.ibm.com  Wed Dec 17 07:22:09 2003
From: jschopp at austin.ibm.com (jschopp at austin.ibm.com)
Date: Tue, 16 Dec 2003 14:22:09 -0600 (CST)
Subject: [PATCH] minor lock improvements
Message-ID: <Pine.A41.4.44.0312161402300.48844-100000@forte.austin.ibm.com>

I've attached a patch that should help locks be just a smidge faster on
ppc64 machines.  I am not a performance guy so I ran the only benchmark I
had handy (sdet) which I am unfortunatly not allowed to publish a number
on to show the increase.  I got a overall throughput increase of .436%,
with a confidence of 95% that the increase is between .232% and .639%.  I
would expect other tests to show larger improvements (performance guys
welcome to help me out here).

The patch needs some feedback (comments in code show where) on how to do
a couple things correctly.
-------------- next part --------------
diff -Nru a/include/asm-ppc64/ppc_asm.h b/include/asm-ppc64/ppc_asm.h
--- a/include/asm-ppc64/ppc_asm.h	Tue Dec 16 14:15:20 2003
+++ b/include/asm-ppc64/ppc_asm.h	Tue Dec 16 14:15:20 2003
@@ -44,10 +44,25 @@
 	ld	ra,PACALPPACA+LPPACAANYINT(rb); /* Get pending interrupt flags */\
 	cmpldi	0,ra,0;

-/* Macros to adjust thread priority for Iseries hardware multithreading */
+/* Macros to adjust thread priority for RPA hardware multithreading
+ * and iSeries hardware nultithreading.  This way is kidof hackish,
+ * looking for suggestions on how to do it better. Joel S.
+ */
+#ifdef CONFIG_HMT
 #define HMT_LOW		or 1,1,1
 #define HMT_MEDIUM	or 2,2,2
 #define HMT_HIGH	or 3,3,3
+#else /* CONFIG_HMT */
+#ifdef CONFIG_PPC_ISERIES
+#define HMT_LOW		or 1,1,1
+#define HMT_MEDIUM	or 2,2,2
+#define HMT_HIGH	or 3,3,3
+#else /* CONFIG_PPC_ISERES */
+#define HMT_LOW
+#define HMT_MEDIUM
+#define HMT_HIGH
+#endif /* CONFIG_PPC_ISERIES */
+#endif /* CONFIG_HMT */

 /* Insert the high 32 bits of the MSR into what will be the new
    MSR (via SRR1 and rfid)  This preserves the MSR.SF and MSR.ISF
diff -Nru a/include/asm-ppc64/spinlock.h b/include/asm-ppc64/spinlock.h
--- a/include/asm-ppc64/spinlock.h	Tue Dec 16 14:15:20 2003
+++ b/include/asm-ppc64/spinlock.h	Tue Dec 16 14:15:20 2003
@@ -22,7 +22,18 @@
  * locking when running on an RPA platform.  As we do more performance
  * tuning, I would expect this selection mechanism to change.  Dave E.
  */
+/* XXX- Need some way to test if SPLPAR is possible on this machine
+ * this way is kindof hackish. HMT and SPLPAR don't really have anything
+ * to do with eachother.  Open for suggestions.  Joel S.
+ */
+#ifdef CONFIG_PPC_PSERIES
+#ifndef CONFIG_HMT
+#undef SPLPAR_LOCKS
+#else /* CONIFG_HMT is defined */
 #define SPLPAR_LOCKS
+#endif /* CONFIG_HMT */
+#endif /* CONFIG_PPC_PSERIES */
+
 #define HVSC			".long 0x44000022\n"

 typedef struct {
@@ -107,7 +118,7 @@
 	unsigned long tmp, tmp2;

 	__asm__ __volatile__(
-	"b		2f		# spin_lock\n\
+	"b		3f		# spin_lock\n\
 1:"
 	HMT_LOW
 "       ldx		%0,0,%2         # load the lock value\n\
@@ -127,11 +138,12 @@
 "        b               1b\n\
 2: \n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%2\n\
+"3: \n\
+ 	ldarx		%0,0,%2\n\
 	cmpdi		0,%0,0\n\
 	bne-		1b\n\
 	stdcx.		13,0,%2\n\
-	bne-		2b\n\
+	bne-		3b\n\
 	isync"
 	: "=&r"(tmp), "=&r"(tmp2)
 	: "r"(&lock->lock)
@@ -148,10 +160,10 @@
 	HMT_LOW
 "       ldx		%0,0,%1         # load the lock value\n\
 	cmpdi		0,%0,0          # if not locked, try to acquire\n\
-	bne+		1b\n\
-2: \n"
+	bne+		1b\n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%1\n\
+"2: \n\
+ 	ldarx		%0,0,%1\n\
 	cmpdi		0,%0,0\n\
 	bne-		1b\n\
 	stdcx.		13,0,%1\n\
@@ -224,7 +236,7 @@
 	unsigned long tmp, tmp2;

 	__asm__ __volatile__(
-	"b		2f		# read_lock\n\
+	"b		3f		# read_lock\n\
 1:"
 	HMT_LOW
 "	ldx		%0,0,%2\n\
@@ -247,11 +259,12 @@
         sc                              # do the hcall \n\
 2: \n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%2\n\
+"3:\n\
+ 	ldarx		%0,0,%2\n\
 	addic.		%0,%0,1\n\
 	ble-		1b\n\
 	stdcx.		%0,0,%2\n\
-	bne-		2b\n\
+	bne-		3b\n\
 	isync"
 	: "=&r"(tmp), "=&r"(tmp2)
 	: "r"(&rw->lock)
@@ -265,7 +278,7 @@
 	unsigned long tmp, tmp2;

 	__asm__ __volatile__(
-	"b		2f		# read_lock\n\
+	"b		3f		# read_lock\n\
 1:"
 	HMT_LOW
 "	ldx		%0,0,%2\n\
@@ -284,11 +297,12 @@
 	HVSC
 "2: \n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%2\n\
+"3: \n\
+ 	ldarx		%0,0,%2\n\
 	addic.		%0,%0,1\n\
 	ble-		1b\n\
 	stdcx.		%0,0,%2\n\
-	bne-		2b\n\
+	bne-		3b\n\
 	isync"
 	: "=&r"(tmp), "=&r"(tmp2)
 	: "r"(&rw->lock)
@@ -305,10 +319,10 @@
 	HMT_LOW
 "	ldx		%0,0,%1\n\
 	cmpdi		0,%0,0\n\
-	blt+		1b\n\
-2: \n"
+	blt+		1b\n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%1\n\
+"2: \n\
+ 	ldarx		%0,0,%1\n\
 	addic.		%0,%0,1\n\
 	ble-		1b\n\
 	stdcx.		%0,0,%1\n\
@@ -363,7 +377,7 @@
 	unsigned long tmp, tmp2;

 	__asm__ __volatile__(
-	"b		2f		# spin_lock\n\
+	"b		3f		# spin_lock\n\
 1:"
 	HMT_LOW
 "       ldx		%0,0,%2         # load the lock value\n\
@@ -387,11 +401,12 @@
         sc                              # do the hcall \n\
 2: \n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%2\n\
+"3: \n\
+ 	ldarx		%0,0,%2\n\
 	cmpdi		0,%0,0\n\
 	bne-		1b\n\
 	stdcx.		13,0,%2\n\
-	bne-		2b\n\
+	bne-		3b\n\
 	isync"
 	: "=&r"(tmp), "=&r"(tmp2)
 	: "r"(&rw->lock)
@@ -405,7 +420,7 @@
 	unsigned long tmp, tmp2;

 	__asm__ __volatile__(
-	"b		2f		# spin_lock\n\
+	"b		3f		# spin_lock\n\
 1:"
 	HMT_LOW
 "       ldx		%0,0,%2         # load the lock value\n\
@@ -427,11 +442,12 @@
 "        b               1b\n\
 2: \n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%2\n\
+"3: \n\
+ 	ldarx		%0,0,%2\n\
 	cmpdi		0,%0,0\n\
 	bne-		1b\n\
 	stdcx.		13,0,%2\n\
-	bne-		2b\n\
+	bne-		3b\n\
 	isync"
 	: "=&r"(tmp), "=&r"(tmp2)
 	: "r"(&rw->lock)
@@ -443,7 +459,7 @@
 	unsigned long tmp;

 	__asm__ __volatile__(
-	"b		2f		# spin_lock\n\
+	"b		3f		# spin_lock\n\
 1:"
 	HMT_LOW
 "       ldx		%0,0,%1         # load the lock value\n\
@@ -451,11 +467,12 @@
 	bne+		1b\n\
 2: \n"
 	HMT_MEDIUM
-" 	ldarx		%0,0,%1\n\
+"3: \n\
+ 	ldarx		%0,0,%1\n\
 	cmpdi		0,%0,0\n\
 	bne-		1b\n\
 	stdcx.		13,0,%1\n\
-	bne-		2b\n\
+	bne-		3b\n\
 	isync"
 	: "=&r"(tmp)
 	: "r"(&rw->lock)

From gregkh at us.ibm.com  Wed Dec 17 08:04:18 2003
From: gregkh at us.ibm.com (Greg KH)
Date: Tue, 16 Dec 2003 13:04:18 -0800
Subject: code review requests
In-Reply-To: <AB9C5136-3004-11D8-BB49-000A95A0560C@us.ibm.com>
References: <20031216172927.GA16655@us.ibm.com> <AB9C5136-3004-11D8-BB49-000A95A0560C@us.ibm.com>
Message-ID: <20031216210418.GA29601@us.ibm.com>


On Tue, Dec 16, 2003 at 02:16:02PM -0600, Hollis Blanchard wrote:
> Seriously, it seems to me that the "gets dropped" option ends up
> happening pretty frequently, unless you're working in an area that's
> personally interesting to the maintainer. (That goes for email RFCs as
> well, private or public.) Those areas of interest may not coincide with
> work assignments you've been given...

That's the joy/headache of dealing with the opensource community :)

But telling someone that they need to review your patch by a specific
date would do nothing more than piss off a overworked maintainer.  It's
just bad form, just like sending email questions privately to them:
	http://www.arm.linux.org.uk/news/?newsitem=11

greg k-h

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From peter at bergner.org  Wed Dec 17 08:55:46 2003
From: peter at bergner.org (Peter Bergner)
Date: Tue, 16 Dec 2003 15:55:46 -0600
Subject: OpenAFS?
In-Reply-To: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com>
References: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com>
Message-ID: <1071611744.3077.29.camel@otta.rchland.ibm.com>


On Tue, 2003-12-16 at 13:17, Hollis Blanchard wrote:
> I just got a random call yesterday asking about AFS on ppc64, and I
> don't think I've seen the issue(s) described anywhere...
>
> I remember hearing people complain OpenAFS didn't work on ppc64. One
> issue was the code trying to insert a function pointer into the syscall
> table at runtime, which doesn't work because it ends up inserting a
> pointer to the ppc64 function descriptor.
>
> Are there other issues? Has anyone seriously tried to make it work?

That's not enough of a problem for you!?!? ;-)  Actually, I think SUSE
is working on getting it to work.  Try asking _Marcus_ on #ppc64 on how
far he's gotten.


Peter


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From dje at watson.ibm.com  Wed Dec 17 09:27:13 2003
From: dje at watson.ibm.com (David Edelsohn)
Date: Tue, 16 Dec 2003 17:27:13 -0500
Subject: OpenAFS? 
In-Reply-To: Message from Hollis Blanchard <hollisb@us.ibm.com> 
   of "Tue, 16 Dec 2003 13:17:26 CST." <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> 
References: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> 
Message-ID: <200312162227.hBGMRDT34098@makai.watson.ibm.com>


>>>>> Hollis Blanchard writes:

Hollis> I just got a random call yesterday asking about AFS on ppc64, and I
Hollis> don't think I've seen the issue(s) described anywhere...

Hollis> I remember hearing people complain OpenAFS didn't work on ppc64. One
Hollis> issue was the code trying to insert a function pointer into the syscall
Hollis> table at runtime, which doesn't work because it ends up inserting a
Hollis> pointer to the ppc64 function descriptor.

	A port to IA-64 should have a similar problem.  Has OpenAFS been
ported to IA-64 Linux?

David

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Wed Dec 17 10:23:16 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 17 Dec 2003 10:23:16 +1100
Subject: [RFC] JS20 IDE perf patch
In-Reply-To: <200312161559.hBGFxsdd024620@localhost.localdomain>
References: <200312161559.hBGFxsdd024620@localhost.localdomain>
Message-ID: <1071616996.753.26.camel@gaston>


On Wed, 2003-12-17 at 02:59, Doug Maxey wrote:
> Ben,
>
>   First, my bad, the "hack at falcon30..." was a garbled address, not
>   using the same alias file on both machines...  Correct addr here.
>
>   Did not see a way to pass any interesting data, in particular the bus
>   speed via pci_dev.  Do you have a suggestion on that?
>
>   The primary reason to change the generic ide.c was to allow the
>   setting on the append line to turn on the changes.
>
>   Cannot answer regarding the device tree contents.

No, we don't have a good way yet to provide the bus speed per pci
dev in a generic way afaik, though that would be a good addition
I beleive. We could add a ppc-specific pcibios_get_bus_speed()
thingy eventually... If the information is available in the
device-tree, you can peek there too from the driver.

The append line is fine for testing, I don't like the idea of
having to rely on that for proper IDE timings in a product though

Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From johnrose at austin.ibm.com  Wed Dec 17 10:54:43 2003
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 16 Dec 2003 17:54:43 -0600
Subject: [PATCH] xmon fixes for 2.6
Message-ID: <1071618882.7664.37.camel@verve>


This patch was written mostly by Jake Moilanen.  It fixes problems
that prevented the use of xmon breakpoints on pSeries LPAR systems.
Also updates help text to reflect obsolete subcommands, and adds
some necessary lines for single stepping.

If there are no comments on this, will push soon (thurs?).

Thanks-
John

diff -Nru a/arch/ppc64/xmon/xmon.c b/arch/ppc64/xmon/xmon.c
--- a/arch/ppc64/xmon/xmon.c	Tue Dec 16 17:49:24 2003
+++ b/arch/ppc64/xmon/xmon.c	Tue Dec 16 17:49:24 2003
@@ -37,9 +37,10 @@
 #define skipbl	xmon_skipbl

 #ifdef CONFIG_SMP
-unsigned long cpus_in_xmon = 0;
+volatile unsigned long cpus_in_xmon = 0;
 static unsigned long got_xmon = 0;
 static volatile int take_xmon = -1;
+static volatile int leaving_xmon = 0;
 #endif /* CONFIG_SMP */

 static unsigned long adrs;
@@ -162,7 +163,6 @@
   mz	zero a block of memory\n\
   mx	translation information for an effective address\n\
   mi	show information about memory allocation\n\
-  M	print System.map\n\
   p 	show the task list\n\
   r	print registers\n\
   s	single step\n\
@@ -227,7 +227,7 @@
 xmon(struct pt_regs *excp)
 {
 	struct pt_regs regs;
-	int cmd;
+	int cmd = 0;
 	unsigned long msr;

 	if (excp == NULL) {
@@ -285,9 +285,15 @@
 	xmon_regs[smp_processor_id()] = excp;
 	excprint(excp);
 #ifdef CONFIG_SMP
-	if (test_and_set_bit(smp_processor_id(), &cpus_in_xmon))
+	leaving_xmon = 0;
+	/* possible race condition here if a CPU is held up and gets
+	 * here while we are exiting */
+	if (test_and_set_bit(smp_processor_id(), &cpus_in_xmon)) {
+		/* xmon probably caused an exception itself */
+		printf("We are already in xmon\n");
 		for (;;)
 			;
+	}
 	while (test_and_set_bit(0, &got_xmon)) {
 		if (take_xmon == smp_processor_id()) {
 			take_xmon = -1;
@@ -304,6 +310,9 @@
 	if (cmd == 's') {
 		xmon_trace[smp_processor_id()] = SSTEP;
 		excp->msr |= MSR_SE;
+#ifdef CONFIG_SMP
+		take_xmon = smp_processor_id();
+#endif
 	} else if (at_breakpoint(excp->nip)) {
 		xmon_trace[smp_processor_id()] = BRSTEP;
 		excp->msr |= MSR_SE;
@@ -313,7 +322,9 @@
 	}
 	xmon_regs[smp_processor_id()] = 0;
 #ifdef CONFIG_SMP
-	clear_bit(0, &got_xmon);
+	leaving_xmon = 1;
+	if (cmd != 's')
+		clear_bit(0, &got_xmon);
 	clear_bit(smp_processor_id(), &cpus_in_xmon);
 #endif /* CONFIG_SMP */
 	set_msrd(msr);		/* restore interrupt enable */
@@ -421,8 +432,6 @@
 	int i;
 	struct bpt *bp;

-	if (systemcfg->platform != PLATFORM_PSERIES)
-		return;
 	bp = bpts;
 	for (i = 0; i < NBPTS; ++i, ++bp) {
 		if (!bp->enabled)
@@ -450,9 +459,6 @@
 	struct bpt *bp;
 	unsigned instr;

-	if (systemcfg->platform != PLATFORM_PSERIES)
-		return;
-
 	if ((cur_cpu_spec->cpu_features & CPU_FTR_DABR))
 		set_dabr(0);
 	if ((cur_cpu_spec->cpu_features & CPU_FTR_IABR))
@@ -478,11 +484,15 @@
 static int
 cmds(struct pt_regs *excp)
 {
-	int cmd;
+	int cmd = 0;

 	last_cmd = NULL;
 	for(;;) {
 #ifdef CONFIG_SMP
+		/* Need to check if we should take any commands on
+		   this CPU. */
+		if (leaving_xmon)
+			return cmd;
 		printf("%d:", smp_processor_id());
 #endif /* CONFIG_SMP */
 		printf("mon> ");
@@ -832,6 +842,12 @@
 				}
 			break;
 		}
+
+		if (!(systemcfg->platform & PLATFORM_PSERIES)) {
+			printf("Not supported for this platform\n");
+			break;
+		}
+
 		bp = at_breakpoint(a);
 		if (bp == 0) {
 			for (bp = bpts; bp < &bpts[NBPTS]; ++bp)


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From paulus at samba.org  Wed Dec 17 11:45:08 2003
From: paulus at samba.org (Paul Mackerras)
Date: Wed, 17 Dec 2003 11:45:08 +1100
Subject: [PATCH] xmon fixes for 2.6
In-Reply-To: <1071618882.7664.37.camel@verve>
References: <1071618882.7664.37.camel@verve>
Message-ID: <16351.42772.717656.989995@cargo.ozlabs.ibm.com>


John Rose writes:

> This patch was written mostly by Jake Moilanen.  It fixes problems
> that prevented the use of xmon breakpoints on pSeries LPAR systems.
> Also updates help text to reflect obsolete subcommands, and adds
> some necessary lines for single stepping.

Hmmmm.  I am leaning towards the view that on entering xmon, we should
send an IPI to try to get all cpus into xmon, and that on exiting
xmon, we release all cpus.  At the moment the breakpoint stuff is a
real mess on SMP systems - we can easily get into the situation where
xmon thinks that the trap instruction that it put at a particular
place is the real instruction that should be there.  I also think it
is wrong to hang if we try to re-enter xmon.

Regards,
Paul.

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Thu Dec 18 09:53:28 2003
From: anton at samba.org (Anton Blanchard)
Date: Thu, 18 Dec 2003 09:53:28 +1100
Subject: pci_map_single return value
In-Reply-To: <3FD732FD.10903@us.ibm.com>
References: <3FD732FD.10903@us.ibm.com>
Message-ID: <20031217225328.GB25456@krispykreme>


Hi,

> Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0
> on failure on IA64? Which one is correct? If NO_TCE is correct, then why
> is it defined in a ppc64 include? This makes it difficult for device
> drivers to actually check for it.

I brought this up with the architecture maintainers and the concensus
seems to be having a per arch pci_dma_error() function, like x86-64 does:

extern dma_addr_t bad_dma_address;
#define pci_dma_error(x) ((x) == bad_dma_address)

We also need a consistent error return value from pci_map_sg().

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From nathanl at austin.ibm.com  Thu Dec 18 13:28:37 2003
From: nathanl at austin.ibm.com (Nathan Lynch)
Date: Wed, 17 Dec 2003 20:28:37 -0600
Subject: [PATCH] Re: lparcfg code
In-Reply-To: <OF8AE21311.C39821C2-ON86256DF9.0078B969-86256DF9.00797633@us.ibm.com>
References: <OF8AE21311.C39821C2-ON86256DF9.0078B969-86256DF9.00797633@us.ibm.com>
Message-ID: <3FE110D5.5060601@austin.ibm.com>

I've attached a patch which gives /proc/ppc64/lparcfg a write() function
for changing certain of the system parameters
("partition_entitled_capacity" and "capacity_weight").

Nathan
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lparcfg_write.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031217/79e39e51/attachment.txt 

From boutcher at us.ibm.com  Fri Dec 19 00:05:13 2003
From: boutcher at us.ibm.com (David Boutcher)
Date: Thu, 18 Dec 2003 07:05:13 -0600
Subject: Removing ibmveth /proc stuff
Message-ID: <OF60AB9B69.DF472A34-ON86256E00.0047A870-86256E00.0047BAC7@us.ibm.com>


I noticed the following go by in a changeset:

1.1299 03/12/18 14:19:45 dgibson at superego.ozlabs.ibm.com +2 -0
  Substantial cleanups to iSeries virtual ethernet driver (backported from
  ameslab 2.5).  Removes the barely useful /proc stuff, amongst other
  things.

There is iSeries documentation as well as testcases that rely on ibmveth
information in /proc.  Could you describe what you removed?

Thanks,

Dave B


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From meissner at suse.de  Fri Dec 19 01:14:55 2003
From: meissner at suse.de (Marcus Meissner)
Date: Thu, 18 Dec 2003 15:14:55 +0100
Subject: OpenAFS?
In-Reply-To: <1071611744.3077.29.camel@otta.rchland.ibm.com>
References: <7BD4DC5B-2FFC-11D8-BB49-000A95A0560C@us.ibm.com> <1071611744.3077.29.camel@otta.rchland.ibm.com>
Message-ID: <20031218141454.GA27975@suse.de>


On Tue, Dec 16, 2003 at 03:55:46PM -0600, Peter Bergner wrote:
>
> On Tue, 2003-12-16 at 13:17, Hollis Blanchard wrote:
> > I just got a random call yesterday asking about AFS on ppc64, and I
> > don't think I've seen the issue(s) described anywhere...
> >
> > I remember hearing people complain OpenAFS didn't work on ppc64.
> > One issue was the code trying to insert a function pointer into the
> > syscall table at runtime, which doesn't work because it ends up
> > inserting a pointer to the ppc64 function descriptor.
> >
> > Are there other issues? Has anyone seriously tried to make it work?
>
> That's not enough of a problem for you!?!? ;-) Actually, I think SUSE
> is working on getting it to work. Try asking _Marcus_ on #ppc64 on how
> far he's gotten.

Looks good so far, but I had to employ some pretty bad hacks, especial
for the kernel system calltable hooking (as mentioned).

Check the openafs*.patch files on:
	ftp://ftp.suse.com/pub/people/aj/afs-test/

# uname -a
Linux tamarillo 2.4.21-128-pseries64 #1 SMP Tue Dec 16 14:11:31 UTC 2003 ppc64 unknown
# ls /afs/suse.de/
.  ..  afsdoc  space  user  usr
#

Ciao, Marcus

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From lxiep at us.ibm.com  Fri Dec 19 08:05:14 2003
From: lxiep at us.ibm.com (Linda Xie)
Date: Thu, 18 Dec 2003 15:05:14 -0600
Subject: [PATCH] -- export symbols needed by rpadlpar module
Message-ID: <3FE2168A.1030503@us.ltcfwd.linux.ibm.com>


Hi,

The attached patch exports of_find_node_by_name and of_get_next_child
needed by rpadlpar module.

Comments are welcome.


Linda


diff -Nru a/arch/ppc64/kernel/prom.c b/arch/ppc64/kernel/prom.c
--- a/arch/ppc64/kernel/prom.c  Thu Dec 18 14:39:08 2003
+++ b/arch/ppc64/kernel/prom.c  Thu Dec 18 14:39:08 2003
@@ -2185,6 +2185,7 @@
         read_unlock(&devtree_lock);
         return np;
  }
+EXPORT_SYMBOL(of_find_node_by_name);

  /**
   *     of_find_node_by_type - Find a node by its "device_type" property
@@ -2335,6 +2336,7 @@
         read_unlock(&devtree_lock);
         return next;
  }
+EXPORT_SYMBOL(of_get_next_child);

  /**
   *     of_node_get - Increment refcount of a node


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From hozer at hozed.org  Sat Dec 20 02:32:40 2003
From: hozer at hozed.org (Troy Benjegerdes)
Date: Fri, 19 Dec 2003 09:32:40 -0600
Subject: srp.h and viosrp.h
In-Reply-To: <OFA7DBE0CC.DDF4AAA4-ON86256DFA.0029A872-86256DFA.0029D817@us.ibm.com>
References: <16345.19763.163666.223631@cargo.ozlabs.ibm.com> <OFA7DBE0CC.DDF4AAA4-ON86256DFA.0029A872-86256DFA.0029D817@us.ibm.com>
Message-ID: <20031219153239.GH18377@kalmia.hozed.org>


On Fri, Dec 12, 2003 at 01:38:46AM -0600, David Boutcher wrote:
>
> owner-linuxppc64-dev at lists.linuxppc.org wrote on 12/11/2003 11:08:03 PM:
> > I just had a look at srp.h and viosrp.h in drivers/scsi in the ameslab
> > 2.4 tree.  These can't be submitted in their present form.  Firstly,
> > they have no copyright notice, and secondly, they have no comments at
> > the top explaining what the files are and what they relate to (and
> > very few comments explaining the definitions mean).  Finally, why are
> > these files in drivers/scsi rather than include/something?
>
> Ya, Hollis already made similar comments.  I'm in Germany right now with
> limited access to the network, so I'll fix them up next week.
>
> I debated where they should live....they are very SCSI oriented, so
> drivers/scsi seemed like a reasonable place.  A fact that will be clearer
> once I put comments in them :-)
>
> Dave Boutcher
> IBM Linux Technology Center

Is this the same SRP I keep hearing about in Infiniband? If so, are we
going to wind up with conflicts with infiniband drivers?

Have you take a look at the sourceforge infiniband stack at all? (the
only open source stack I know of so far)

This sounds like something that should go in include/ somewhere, and we
ought to try and work with other people working on it elsewhere.

--
--------------------------------------------------------------------------
Troy Benjegerdes                'da hozer'                hozer at drgw.net


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From boutcher at us.ibm.com  Sat Dec 20 03:22:38 2003
From: boutcher at us.ibm.com (David Boutcher)
Date: Fri, 19 Dec 2003 10:22:38 -0600
Subject: srp.h and viosrp.h
In-Reply-To: <20031219153239.GH18377@kalmia.hozed.org>
Message-ID: <OFD2D0B7FE.733EF8CC-ON86256E01.00598A54-86256E01.0059CD52@us.ibm.com>


Troy Benjegerdes <hozer at hozed.org> wrote on 12/19/2003 09:32:40 AM:
> Is this the same SRP I keep hearing about in Infiniband? If so, are we
> going to wind up with conflicts with infiniband drivers?
>
> Have you take a look at the sourceforge infiniband stack at all? (the
> only open source stack I know of so far)

Yes, this is exactly the same SRP, and all the definitions in these files
are from the SRP specification that will be used for Infiniband.  Since the
"S" in SRP stands for SCSI, /drivers/scsi SEEMED like a good place....and
probably a better place than /include/scsi, since those header files are
mainly concerned with the interface from user to kernel for SCSI ops.  The
SRP definitions define the interface out the bottom of the kernel to the
device (over some interface like infiniband.)

I looked at the sourceforge code a while ago, I should go back and see if
it has changed lately....it wasn't moving too fast last time I looked at
it.

Dave Boutcher
IBM Linux Technology Center


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From brking at us.ibm.com  Wed Dec 24 08:34:13 2003
From: brking at us.ibm.com (Brian King)
Date: Tue, 23 Dec 2003 15:34:13 -0600
Subject: pci_map_single return value
References: <3FD732FD.10903@us.ibm.com> <20031217225328.GB25456@krispykreme>
Message-ID: <3FE8B4D5.2070405@us.ibm.com>

Anton Blanchard wrote:
>>Why is it that pci_map_single returns NO_TCE on failure on PPC64 and 0
>>on failure on IA64? Which one is correct? If NO_TCE is correct, then why
>>is it defined in a ppc64 include? This makes it difficult for device
>>drivers to actually check for it.
>
>
> I brought this up with the architecture maintainers and the concensus
> seems to be having a per arch pci_dma_error() function, like x86-64 does:
>
> extern dma_addr_t bad_dma_address;
> #define pci_dma_error(x) ((x) == bad_dma_address)
>

How does this look?


--
Brian King
eServer Storage I/O
IBM Linux Technology Center
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dma_mapping_error.patch
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20031223/00570dcf/attachment.txt 

From anton at samba.org  Wed Dec 24 10:56:32 2003
From: anton at samba.org (Anton Blanchard)
Date: Wed, 24 Dec 2003 10:56:32 +1100
Subject: ppc64 PTE hacks
Message-ID: <20031223235632.GE934@krispykreme>


Hi,

I just remembered we never merged this patch from Paul. It would be
great to get rid of the flush_tlb_* functions.

Anton

----- Forwarded message from Paul Mackerras <paulus at samba.org> -----

From: Paul Mackerras <paulus at samba.org>
To: anton at samba.org
Subject: ppc64 PTE hacks

Anton,

Here is the patch that changes the HPTE handling so that we queue up a
HPTE invalidation at the time when we change the linux PTE, instead of
later in the flush_tlb_* call.

Could you run some benchmarks for me with and without this patch on a
decent-sized POWER4 box sometime?

(I just noticed that this patch gives a net removal of 66 lines from
the kernel, which is nice. :)

Thanks,
Paul.

diff -urN linux-2.5/arch/ppc64/kernel/process.c ppc64/arch/ppc64/kernel/process.c
--- linux-2.5/arch/ppc64/kernel/process.c	2003-02-23 21:45:50.000000000 +1100
+++ ppc64/arch/ppc64/kernel/process.c	2003-03-19 16:37:25.000000000 +1100
@@ -45,6 +45,7 @@
 #include <asm/machdep.h>
 #include <asm/iSeries/HvCallHpt.h>
 #include <asm/hardirq.h>
+#include <asm/tlbflush.h>

 struct task_struct *last_task_used_math = NULL;

@@ -103,6 +104,8 @@
 		giveup_fpu(prev);
 #endif /* CONFIG_SMP */

+	flush_tlb_pending();
+
 	new_thread = &new->thread;
 	old_thread = &current->thread;

diff -urN linux-2.5/arch/ppc64/mm/Makefile ppc64/arch/ppc64/mm/Makefile
--- linux-2.5/arch/ppc64/mm/Makefile	2002-12-16 10:50:39.000000000 +1100
+++ ppc64/arch/ppc64/mm/Makefile	2003-02-24 17:14:52.000000000 +1100
@@ -4,5 +4,5 @@

 EXTRA_CFLAGS += -mno-minimal-toc

-obj-y := fault.o init.o extable.o imalloc.o
+obj-y := fault.o init.o extable.o imalloc.o tlb.o
 obj-$(CONFIG_DISCONTIGMEM) += numa.o
diff -urN linux-2.5/arch/ppc64/mm/init.c ppc64/arch/ppc64/mm/init.c
--- linux-2.5/arch/ppc64/mm/init.c	2003-02-23 21:45:50.000000000 +1100
+++ ppc64/arch/ppc64/mm/init.c	2003-02-24 17:15:30.000000000 +1100
@@ -242,147 +242,6 @@
 	}
 }

-void
-flush_tlb_mm(struct mm_struct *mm)
-{
-	struct vm_area_struct *mp;
-
-	spin_lock(&mm->page_table_lock);
-
-	for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
-		__flush_tlb_range(mm, mp->vm_start, mp->vm_end);
-
-	/* XXX are there races with checking cpu_vm_mask? - Anton */
-	mm->cpu_vm_mask = 0;
-
-	spin_unlock(&mm->page_table_lock);
-}
-
-/*
- * Callers should hold the mm->page_table_lock
- */
-void
-flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr)
-{
-	unsigned long context = 0;
-	pgd_t *pgd;
-	pmd_t *pmd;
-	pte_t *ptep;
-	pte_t pte;
-	int local = 0;
-
-	switch( REGION_ID(vmaddr) ) {
-	case VMALLOC_REGION_ID:
-		pgd = pgd_offset_k( vmaddr );
-		break;
-	case IO_REGION_ID:
-		pgd = pgd_offset_i( vmaddr );
-		break;
-	case USER_REGION_ID:
-		pgd = pgd_offset( vma->vm_mm, vmaddr );
-		context = vma->vm_mm->context;
-
-		/* XXX are there races with checking cpu_vm_mask? - Anton */
-		if (vma->vm_mm->cpu_vm_mask == (1 << smp_processor_id()))
-			local = 1;
-
-		break;
-	default:
-		panic("flush_tlb_page: invalid region 0x%016lx", vmaddr);
-
-	}
-
-	if (!pgd_none(*pgd)) {
-		pmd = pmd_offset(pgd, vmaddr);
-		if (!pmd_none(*pmd)) {
-			ptep = pte_offset_kernel(pmd, vmaddr);
-			/* Check if HPTE might exist and flush it if so */
-			pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0));
-			if ( pte_val(pte) & _PAGE_HASHPTE ) {
-				flush_hash_page(context, vmaddr, pte, local);
-			}
-		}
-	}
-}
-
-struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS];
-
-void
-__flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end)
-{
-	pgd_t *pgd;
-	pmd_t *pmd;
-	pte_t *ptep;
-	pte_t pte;
-	unsigned long pgd_end, pmd_end;
-	unsigned long context = 0;
-	struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()];
-	unsigned long i = 0;
-	int local = 0;
-
-	switch(REGION_ID(start)) {
-	case VMALLOC_REGION_ID:
-		pgd = pgd_offset_k(start);
-		break;
-	case IO_REGION_ID:
-		pgd = pgd_offset_i(start);
-		break;
-	case USER_REGION_ID:
-		pgd = pgd_offset(mm, start);
-		context = mm->context;
-
-		/* XXX are there races with checking cpu_vm_mask? - Anton */
-		if (mm->cpu_vm_mask == (1 << smp_processor_id()))
-			local = 1;
-
-		break;
-	default:
-		panic("flush_tlb_range: invalid region for start (%016lx) and end (%016lx)\n", start, end);
-	}
-
-	do {
-		pgd_end = (start + PGDIR_SIZE) & PGDIR_MASK;
-		if (pgd_end > end)
-			pgd_end = end;
-		if (!pgd_none(*pgd)) {
-			pmd = pmd_offset(pgd, start);
-			do {
-				pmd_end = (start + PMD_SIZE) & PMD_MASK;
-				if (pmd_end > end)
-					pmd_end = end;
-				if (!pmd_none(*pmd)) {
-					ptep = pte_offset_kernel(pmd, start);
-					do {
-						if (pte_val(*ptep) & _PAGE_HASHPTE) {
-							pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0));
-							if (pte_val(pte) & _PAGE_HASHPTE) {
-								batch->pte[i] = pte;
-								batch->addr[i] = start;
-								i++;
-								if (i == PPC64_TLB_BATCH_NR) {
-									flush_hash_range(context, i, local);
-									i = 0;
-								}
-							}
-						}
-						start += PAGE_SIZE;
-						++ptep;
-					} while (start < pmd_end);
-				} else {
-					start = pmd_end;
-				}
-				++pmd;
-			} while (start < pgd_end);
-		} else {
-			start = pgd_end;
-		}
-		++pgd;
-	} while (start < end);
-
-	if (i)
-		flush_hash_range(context, i, local);
-}
-
 void free_initmem(void)
 {
 	unsigned long addr;
diff -urN linux-2.5/arch/ppc64/mm/tlb.c ppc64/arch/ppc64/mm/tlb.c
--- linux-2.5/arch/ppc64/mm/tlb.c	Thu Jan 01 10:00:00 1970
+++ ppc64/arch/ppc64/mm/tlb.c	Tue Feb 25 15:51:52 2003
@@ -0,0 +1,96 @@
+/*
+ * This file contains the routines for flushing entries from the
+ * TLB and MMU hash table.
+ *
+ *  Derived from arch/ppc64/mm/init.c:
+ *    Copyright (C) 1995-1996 Gary Thomas (gdt at linuxppc.org)
+ *
+ *  Modifications by Paul Mackerras (PowerMac) (paulus at cs.anu.edu.au)
+ *  and Cort Dougan (PReP) (cort at cs.nmt.edu)
+ *    Copyright (C) 1996 Paul Mackerras
+ *  Amiga/APUS changes by Jesper Skov (jskov at cygnus.co.uk).
+ *
+ *  Derived from "arch/i386/mm/init.c"
+ *    Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
+ *
+ *  Dave Engebretsen <engebret at us.ibm.com>
+ *      Rework for PPC64 port.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ */
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/highmem.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/tlb.h>
+
+#if 0
+struct ppc64_tlb_batch {
+	unsigned long index;
+	unsigned long context;
+	struct mm_struct *mm;
+	pte_t pte[PPC64_TLB_BATCH_NR];
+	unsigned long addr[PPC64_TLB_BATCH_NR];
+	//unsigned long vaddr[PPC64_TLB_BATCH_NR];
+};
+#endif
+
+struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS];
+
+/*
+ * Update the MMU hash table to correspond with a change to
+ * a Linux PTE.  If wrprot is true, it is permissible to
+ * change the existing HPTE to read-only rather than removing it
+ * (if we remove it we should clear the _PTE_HPTEFLAGS bits).
+ */
+void hpte_update(pte_t *ptep, unsigned long pte, int wrprot)
+{
+	struct page *ptepage;
+	struct mm_struct *mm;
+	unsigned long addr;
+	int i;
+	unsigned long context = 0;
+	struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()];
+
+	ptepage = virt_to_page(ptep);
+	mm = (struct mm_struct *) ptepage->mapping;
+	addr = ptepage->index + (((unsigned long)ptep & ~PAGE_MASK) << 9);
+	if (REGION_ID(addr) == USER_REGION_ID)
+		context = mm->context;
+	i = batch->index;
+	if (unlikely(i != 0 && context != batch->context)) {
+		flush_tlb_pending();
+		i = 0;
+	}
+	if (i == 0) {
+		batch->context = context;
+		batch->mm = mm;
+	}
+	batch->pte[i] = __pte(pte);
+	batch->addr[i] = addr;
+	batch->index = ++i;
+	if (i >= PPC64_TLB_BATCH_NR)
+		flush_tlb_pending();
+}
+
+void __flush_tlb_pending(struct ppc64_tlb_batch *batch)
+{
+	int i;
+	int local = 0;
+
+	i = batch->index;
+	if (batch->mm->cpu_vm_mask == (1 << smp_processor_id()))
+		local = 1;
+	if (i == 1)
+		flush_hash_page(batch->context, batch->addr[0], batch->pte[0],
+				local);
+	else
+		flush_hash_range(batch->context, i, local);
+	batch->index = 0;
+}
diff -urN linux-2.5/include/asm-ppc64/pgtable.h ppc64/include/asm-ppc64/pgtable.h
--- linux-2.5/include/asm-ppc64/pgtable.h	2003-02-27 08:12:37.000000000 +1100
+++ ppc64/include/asm-ppc64/pgtable.h	2003-03-19 16:03:12.000000000 +1100
@@ -10,6 +10,7 @@
 #include <asm/processor.h>		/* For TASK_SIZE */
 #include <asm/mmu.h>
 #include <asm/page.h>
+#include <asm/tlbflush.h>
 #endif /* __ASSEMBLY__ */

 /* PMD_SHIFT determines what a second-level page table entry can map */
@@ -262,64 +263,85 @@

 /* Atomic PTE updates */

-static inline unsigned long pte_update( pte_t *p, unsigned long clr,
-					unsigned long set )
+static inline unsigned long pte_update(pte_t *p, unsigned long clr)
 {
 	unsigned long old, tmp;

 	__asm__ __volatile__(
 	"1:	ldarx	%0,0,%3		# pte_update\n\
 	andc	%1,%0,%4 \n\
-	or	%1,%1,%5 \n\
 	stdcx.	%1,0,%3 \n\
 	bne-	1b"
 	: "=&r" (old), "=&r" (tmp), "=m" (*p)
-	: "r" (p), "r" (clr), "r" (set), "m" (*p)
+	: "r" (p), "r" (clr), "m" (*p)
 	: "cc" );
 	return old;
 }

+/* PTE updating functions */
+extern void hpte_update(pte_t *ptep, unsigned long pte, int wrprot);
+
 static inline int ptep_test_and_clear_young(pte_t *ptep)
 {
-	return (pte_update(ptep, _PAGE_ACCESSED, 0) & _PAGE_ACCESSED) != 0;
+	unsigned long old;
+
+	old = pte_update(ptep, _PAGE_ACCESSED | _PAGE_HPTEFLAGS);
+	if (old & _PAGE_HASHPTE) {
+		hpte_update(ptep, old, 0);
+		flush_tlb_pending();	/* XXX generic code doesn't flush */
+	}
+	return (old & _PAGE_ACCESSED) != 0;
 }

 static inline int ptep_test_and_clear_dirty(pte_t *ptep)
 {
-	return (pte_update(ptep, _PAGE_DIRTY, 0) & _PAGE_DIRTY) != 0;
-}
+	unsigned long old;

-static inline pte_t ptep_get_and_clear(pte_t *ptep)
-{
-	return __pte(pte_update(ptep, ~_PAGE_HPTEFLAGS, 0));
+	old = pte_update(ptep, _PAGE_DIRTY);
+	if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0)
+		hpte_update(ptep, old, 1);
+	return (old & _PAGE_DIRTY) != 0;
 }

 static inline void ptep_set_wrprotect(pte_t *ptep)
 {
-	pte_update(ptep, _PAGE_RW, 0);
+	unsigned long old;
+
+	old = pte_update(ptep, _PAGE_RW);
+	if ((~old & (_PAGE_HASHPTE | _PAGE_RW | _PAGE_DIRTY)) == 0)
+		hpte_update(ptep, old, 1);
 }

-static inline void ptep_mkdirty(pte_t *ptep)
+static inline pte_t ptep_get_and_clear(pte_t *ptep)
 {
-	pte_update(ptep, 0, _PAGE_DIRTY);
+	unsigned long old = pte_update(ptep, ~0UL);
+
+	if (old & _PAGE_HASHPTE)
+		hpte_update(ptep, old, 0);
+	return __pte(old);
 }

-#define pte_same(A,B)	(((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0)
+static inline void pte_clear(pte_t * ptep)
+{
+	unsigned long old = pte_update(ptep, ~0UL);
+
+	if (old & _PAGE_HASHPTE)
+		hpte_update(ptep, old, 0);
+}

 /*
  * set_pte stores a linux PTE into the linux page table.
- * On machines which use an MMU hash table we avoid changing the
- * _PAGE_HASHPTE bit.
  */
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	pte_update(ptep, ~_PAGE_HPTEFLAGS, pte_val(pte) & ~_PAGE_HPTEFLAGS);
+	if (pte_present(*ptep))
+		pte_clear(ptep);
+	if (pte_present(pte))
+		flush_tlb_pending();
+	*ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
 }

-static inline void pte_clear(pte_t * ptep)
-{
-	pte_update(ptep, ~_PAGE_HPTEFLAGS, 0);
-}
+#define pte_same(A,B)	(((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0)

 extern unsigned long ioremap_bot, ioremap_base;

diff -urN linux-2.5/include/asm-ppc64/tlb.h ppc64/include/asm-ppc64/tlb.h
--- linux-2.5/include/asm-ppc64/tlb.h	2003-01-12 18:45:40.000000000 +1100
+++ ppc64/include/asm-ppc64/tlb.h	2003-02-25 15:52:01.000000000 +1100
@@ -13,11 +13,10 @@
 #define _PPC64_TLB_H

 #include <asm/pgtable.h>
-#include <asm/tlbflush.h>
 #include <asm/page.h>
 #include <asm/mmu.h>

-static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_flush(tlb)	flush_tlb_pending()

 /* Avoid pulling in another include just for this */
 #define check_pgt_cache()	do { } while (0)
@@ -29,61 +28,6 @@
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)

-/* Should make this at least as large as the generic batch size, but it
- * takes up too much space */
-#define PPC64_TLB_BATCH_NR 192
-
-struct ppc64_tlb_batch {
-	unsigned long index;
-	pte_t pte[PPC64_TLB_BATCH_NR];
-	unsigned long addr[PPC64_TLB_BATCH_NR];
-	unsigned long vaddr[PPC64_TLB_BATCH_NR];
-};
-
-extern struct ppc64_tlb_batch ppc64_tlb_batch[NR_CPUS];
-
-static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
-					unsigned long address)
-{
-	int cpu = smp_processor_id();
-	struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu];
-	unsigned long i = batch->index;
-	pte_t pte;
-
-	if (pte_val(*ptep) & _PAGE_HASHPTE) {
-		pte = __pte(pte_update(ptep, _PAGE_HPTEFLAGS, 0));
-		if (pte_val(pte) & _PAGE_HASHPTE) {
-
-			batch->pte[i] = pte;
-			batch->addr[i] = address;
-			i++;
-
-			if (i == PPC64_TLB_BATCH_NR) {
-				int local = 0;
-
-				if (tlb->mm->cpu_vm_mask == (1UL << cpu))
-					local = 1;
-
-				flush_hash_range(tlb->mm->context, i, local);
-				i = 0;
-			}
-		}
-	}
-
-	batch->index = i;
-}
-
-static inline void tlb_flush(struct mmu_gather *tlb)
-{
-	int cpu = smp_processor_id();
-	struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[cpu];
-	int local = 0;
-
-	if (tlb->mm->cpu_vm_mask == (1UL << smp_processor_id()))
-		local = 1;
-
-	flush_hash_range(tlb->mm->context, batch->index, local);
-	batch->index = 0;
-}
+#define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0)

 #endif /* _PPC64_TLB_H */
diff -urN linux-2.5/include/asm-ppc64/tlbflush.h ppc64/include/asm-ppc64/tlbflush.h
--- linux-2.5/include/asm-ppc64/tlbflush.h	2002-06-07 18:21:41.000000000 +1000
+++ ppc64/include/asm-ppc64/tlbflush.h	2003-02-25 15:51:59.000000000 +1100
@@ -1,10 +1,6 @@
 #ifndef _PPC64_TLBFLUSH_H
 #define _PPC64_TLBFLUSH_H

-#include <linux/threads.h>
-#include <linux/mm.h>
-#include <asm/page.h>
-
 /*
  * TLB flushing:
  *
@@ -15,22 +11,36 @@
  *  - flush_tlb_pgtables(mm, start, end) flushes a range of page tables
  */

-extern void flush_tlb_mm(struct mm_struct *mm);
-extern void flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
-extern void __flush_tlb_range(struct mm_struct *mm,
-			    unsigned long start, unsigned long end);
-#define flush_tlb_range(vma, start, end) \
-	__flush_tlb_range(vma->vm_mm, start, end)
+struct mm_struct;
+
+#define PPC64_TLB_BATCH_NR 192
+
+struct ppc64_tlb_batch {
+	unsigned long index;
+	unsigned long context;
+	struct mm_struct *mm;
+	pte_t pte[PPC64_TLB_BATCH_NR];
+	unsigned long addr[PPC64_TLB_BATCH_NR];
+	unsigned long vaddr[PPC64_TLB_BATCH_NR];
+};

-#define flush_tlb_kernel_range(start, end) \
-	__flush_tlb_range(&init_mm, (start), (end))
+extern struct ppc64_tlb_batch ppc64_tlb_batch[];
+extern void __flush_tlb_pending(struct ppc64_tlb_batch *batch);

-static inline void flush_tlb_pgtables(struct mm_struct *mm,
-				      unsigned long start, unsigned long end)
+static inline void flush_tlb_pending(void)
 {
-	/* PPC has hw page tables. */
+	struct ppc64_tlb_batch *batch = &ppc64_tlb_batch[smp_processor_id()];
+
+	if (batch->index)
+		__flush_tlb_pending(batch);
 }

+#define flush_tlb_mm(mm)			flush_tlb_pending()
+#define flush_tlb_page(vma, addr)		flush_tlb_pending()
+#define flush_tlb_range(vma, start, end)	flush_tlb_pending()
+#define flush_tlb_kernel_range(start, end)	flush_tlb_pending()
+#define flush_tlb_pgtables(mm, start, end)	do { } while (0)
+
 extern void flush_hash_page(unsigned long context, unsigned long ea, pte_t pte,
 			    int local);
 void flush_hash_range(unsigned long context, unsigned long number, int local);

----- End forwarded message -----

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From benh at kernel.crashing.org  Wed Dec 24 15:58:48 2003
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 24 Dec 2003 15:58:48 +1100
Subject: ppc64 PTE hacks
In-Reply-To: <20031223235632.GE934@krispykreme>
References: <20031223235632.GE934@krispykreme>
Message-ID: <1072241928.820.39.camel@gaston>


On Wed, 2003-12-24 at 10:56, Anton Blanchard wrote:
> Hi,
>
> I just remembered we never merged this patch from Paul. It would be
> great to get rid of the flush_tlb_* functions.

We actually _need_ it to fix the problem you discovered where we
never flush HPTEs on ptep_test_and_clear_young(), thus defeating
page aging.

paul: same problem with ppc32 I suppose

Ben.


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Sat Dec 27 23:15:25 2003
From: anton at samba.org (Anton Blanchard)
Date: Sat, 27 Dec 2003 23:15:25 +1100
Subject: per page execute
Message-ID: <20031227121524.GA24358@krispykreme>


Hi,

We need to move towards enforcing exec permission on all mappings.
Heres a start:

- Switch _PAGE_EXEC and _PAGE_RW so _PAGE_EXEC matches up with the
  hardware noexec bit
- Add _PAGE_EXEC to hash_page permission check
- Check for exec permission in do_page_fault
- Remove redundant set of _PAGE_PRESENT in do_hash_page_DSI, we do it
  again in __hash_page
- Invert linux _PAGE_EXEC bit and enter it in the ppc64 hpte
- Awful bss hack to force program BSS sections to be marked exec
- Split 32 and 64bit data and stack flags, only enforce exec permission
  on mmap/brk for the moment. (ie always mark stack executable)
- More rigid pte_modify, avoid turning off no cache bits when doing
  mprotect. (not related to the rest of this patch, I took the
  opportunity to fix it while I was in the area)
- Our PXXX and SXXX were backwards :) They are in XWR order due to the
  way our mmap flags are laid out. Wow, that bug must date back a few
  years.
- Kill unused PAGE_KERNEL_CI, pte_cache, pte_uncache

When mapping in an executable, it seems the kernel doesnt look at
permissions of non load segments. We explode pretty early because our
plt in ppc32 is in such an area. The awful hack was an attempt to fix
this problem quickly without marking the entire bss as exec by default.
Its crying for a proper fix :)

Another thing that worries me:

  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [25] .plt              NOBITS          10010c08 000c00 0000c0 00 WAX 0   0  4
  [26] .bss              NOBITS          10010cc8 000c00 000004 00  WA 0   0  1

Look how the non executable bss butts right onto the executable plt.
Even with the patch below, we are failing some security tests that try
and exec stuff out of the bss. Thats because the stuff ends up in the same
page as the plt. Alan, could this be considered a toolchain bug?

We also need to fix the kernel signal trampoline code before turning off
exec permission on the stack. If we did the fixmap trick that x86 does and
the trampoline always ended up in this page, that would work well.

Does glibc rely on the stack being executable? We may need a boot option
for people on old toolchains/glibcs (eg the bug where the toolchain forgot
to mark sections executable or the other bug where our GOT was not marked
executable).

Anton

===== arch/ppc64/kernel/head.S 1.42 vs edited =====
--- 1.42/arch/ppc64/kernel/head.S	Wed Dec 17 15:27:52 2003
+++ edited/arch/ppc64/kernel/head.S	Sat Dec 27 11:23:37 2003
@@ -35,6 +35,7 @@
 #include <asm/offsets.h>
 #include <asm/bug.h>
 #include <asm/cputable.h>
+#include <asm/pgtable.h>

 #ifdef CONFIG_PPC_ISERIES
 #define DO_SOFT_DISABLE
@@ -658,7 +659,7 @@
 	andis.	r0,r3,0xa450		/* weird error? */
 	bne	1f			/* if not, try to put a PTE */
 	andis.	r0,r3,0x0020		/* Is it a page table fault? */
-	rlwinm	r4,r3,32-23,29,29	/* DSISR_STORE -> _PAGE_RW */
+	rlwinm	r4,r3,32-25+9,31-9,31-9 /* DSISR_STORE -> _PAGE_RW */
 	ld      r3,_DAR(r1)             /* into the hash table */

 	beq+	2f			/* If so handle it */
@@ -818,10 +819,9 @@
 	b       .ret_from_except

 _GLOBAL(do_hash_page_ISI)
-	li	r4,0
+	li	r4,_PAGE_EXEC
 _GLOBAL(do_hash_page_DSI)
 	rlwimi	r4,r23,32-13,30,30	/* Insert MSR_PR as _PAGE_USER */
-	ori	r4,r4,1			/* add _PAGE_PRESENT */

 	mflr	r21			/* Save LR in r21 */

===== arch/ppc64/mm/fault.c 1.14 vs edited =====
--- 1.14/arch/ppc64/mm/fault.c	Fri Sep 12 21:01:40 2003
+++ edited/arch/ppc64/mm/fault.c	Sat Dec 27 16:32:23 2003
@@ -59,6 +59,7 @@
 	siginfo_t info;
 	unsigned long code = SEGV_MAPERR;
 	unsigned long is_write = error_code & 0x02000000;
+	unsigned long is_exec = regs->trap == 0x400;

 #ifdef CONFIG_DEBUG_KERNEL
 	if (debugger_fault_handler && (regs->trap == 0x300 ||
@@ -102,16 +103,20 @@
 good_area:
 	code = SEGV_ACCERR;

+	if (is_exec) {
+		/* XXX huh? */
+		/* protection fault */
+		if (error_code & 0x08000000)
+			goto bad_area;
+		if (!(vma->vm_flags & VM_EXEC))
+			goto bad_area;
 	/* a write */
-	if (is_write) {
+	} else if (is_write) {
 		if (!(vma->vm_flags & VM_WRITE))
 			goto bad_area;
 	/* a read */
 	} else {
-		/* protection fault */
-		if (error_code & 0x08000000)
-			goto bad_area;
-		if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
+		if (!(vma->vm_flags & VM_READ))
 			goto bad_area;
 	}

===== arch/ppc64/mm/hash_low.S 1.1 vs edited =====
--- 1.1/arch/ppc64/mm/hash_low.S	Wed Dec 17 15:55:14 2003
+++ edited/arch/ppc64/mm/hash_low.S	Sat Dec 27 11:25:59 2003
@@ -90,7 +90,7 @@
 	/* Prepare new PTE value (turn access RW into DIRTY, then
 	 * add BUSY,HASHPTE and ACCESSED)
 	 */
-	rlwinm	r30,r4,5,24,24	/* _PAGE_RW -> _PAGE_DIRTY */
+	rlwinm	r30,r4,32-9+7,31-7,31-7	/* _PAGE_RW -> _PAGE_DIRTY */
 	or	r30,r30,r31
 	ori	r30,r30,_PAGE_BUSY | _PAGE_ACCESSED | _PAGE_HASHPTE
 	/* Write the linux PTE atomically (setting busy) */
@@ -113,11 +113,11 @@
 	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
 	rldicl	r0,r3,64-12,48		/* (ea >> 12) & 0xffff */
 	xor	r28,r5,r0
-
-	/* Convert linux PTE bits into HW equivalents
-	 */
-	andi.	r3,r30,0x1fa		/* Get basic set of flags */
-	rlwinm	r0,r30,32-2+1,30,30	/* _PAGE_RW -> _PAGE_USER (r0) */
+
+	/* Convert linux PTE bits into HW equivalents */
+	andi.	r3,r30,0x1fe		/* Get basic set of flags */
+	xori	r3,r3,_PAGE_EXEC	/* _PAGE_EXEC -> NOEXEC */
+	rlwinm	r0,r30,32-9+1,30,30	/* _PAGE_RW -> _PAGE_USER (r0) */
 	rlwinm	r4,r30,32-7+1,30,30	/* _PAGE_DIRTY -> _PAGE_USER (r4) */
 	and	r0,r0,r4		/* _PAGE_RW & _PAGE_DIRTY -> r0 bit 30 */
 	andc	r0,r30,r0		/* r0 = pte & ~r0 */
===== fs/binfmt_elf.c 1.54 vs edited =====
--- 1.54/fs/binfmt_elf.c	Thu Oct 23 08:29:22 2003
+++ edited/fs/binfmt_elf.c	Sat Dec 27 22:00:22 2003
@@ -86,8 +86,10 @@
 {
 	start = ELF_PAGEALIGN(start);
 	end = ELF_PAGEALIGN(end);
-	if (end > start)
+	if (end > start) {
 		do_brk(start, end - start);
+		sys_mprotect(start, end-start, PROT_READ|PROT_WRITE|PROT_EXEC);
+	}
 	current->mm->start_brk = current->mm->brk = end;
 }

===== include/asm-ppc64/page.h 1.22 vs edited =====
--- 1.22/include/asm-ppc64/page.h	Fri Sep 12 21:06:51 2003
+++ edited/include/asm-ppc64/page.h	Sat Dec 27 17:43:57 2003
@@ -234,8 +234,25 @@

 #define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)

-#define VM_DATA_DEFAULT_FLAGS	(VM_READ | VM_WRITE | VM_EXEC | \
+#define VM_DATA_DEFAULT_FLAGS32	(VM_READ | VM_WRITE | \
 				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_STACK_DEFAULT_FLAGS32 (VM_READ | VM_WRITE | VM_EXEC | \
+				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_DATA_DEFAULT_FLAGS64	(VM_READ | VM_WRITE | \
+				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_STACK_DEFAULT_FLAGS64 (VM_READ | VM_WRITE | VM_EXEC | \
+				 VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
+
+#define VM_DATA_DEFAULT_FLAGS \
+	(test_thread_flag(TIF_32BIT) ? \
+	 VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64)
+
+#define VM_STACK_DEFAULT_FLAGS \
+	(test_thread_flag(TIF_32BIT) ? \
+	 VM_STACK_DEFAULT_FLAGS32 : VM_STACK_DEFAULT_FLAGS64)

 #endif /* __KERNEL__ */
 #endif /* _PPC64_PAGE_H */
===== include/asm-ppc64/pgtable.h 1.30 vs edited =====
--- 1.30/include/asm-ppc64/pgtable.h	Wed Dec 17 16:08:23 2003
+++ edited/include/asm-ppc64/pgtable.h	Sat Dec 27 15:07:05 2003
@@ -78,24 +78,25 @@
 #define _PAGE_PRESENT	0x0001 /* software: pte contains a translation */
 #define _PAGE_USER	0x0002 /* matches one of the PP bits */
 #define _PAGE_FILE	0x0002 /* (!present only) software: pte holds file offset */
-#define _PAGE_RW	0x0004 /* software: user write access allowed */
+#define _PAGE_EXEC	0x0004 /* No execute on POWER4 and newer (we invert) */
 #define _PAGE_GUARDED	0x0008
 #define _PAGE_COHERENT	0x0010 /* M: enforce memory coherence (SMP systems) */
 #define _PAGE_NO_CACHE	0x0020 /* I: cache inhibit */
 #define _PAGE_WRITETHRU	0x0040 /* W: cache write-through */
 #define _PAGE_DIRTY	0x0080 /* C: page changed */
 #define _PAGE_ACCESSED	0x0100 /* R: page referenced */
-#define _PAGE_EXEC	0x0200 /* software: i-cache coherence required */
+#define _PAGE_RW	0x0200 /* software: user write access allowed */
 #define _PAGE_HASHPTE	0x0400 /* software: pte has an associated HPTE */
 #define _PAGE_BUSY	0x0800 /* software: PTE & hash are busy */
 #define _PAGE_SECONDARY 0x8000 /* software: HPTE is in secondary group */
 #define _PAGE_GROUP_IX  0x7000 /* software: HPTE index within group */
 /* Bits 0x7000 identify the index within an HPT Group */
 #define _PAGE_HPTEFLAGS (_PAGE_BUSY | _PAGE_HASHPTE | _PAGE_SECONDARY | _PAGE_GROUP_IX)
+
 /* PAGE_MASK gives the right answer below, but only by accident */
 /* It should be preserving the high 48 bits and then specifically */
 /* preserving _PAGE_SECONDARY | _PAGE_GROUP_IX */
-#define _PAGE_CHG_MASK	(PAGE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_HPTEFLAGS)
+#define _PAGE_CHG_MASK (_PAGE_GUARDED | _PAGE_COHERENT | _PAGE_NO_CACHE | _PAGE_WRITETHRU | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_HPTEFLAGS | PAGE_MASK)

 #define _PAGE_BASE	(_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_COHERENT)

@@ -111,31 +112,32 @@
 #define PAGE_READONLY	__pgprot(_PAGE_BASE | _PAGE_USER)
 #define PAGE_READONLY_X	__pgprot(_PAGE_BASE | _PAGE_USER | _PAGE_EXEC)
 #define PAGE_KERNEL	__pgprot(_PAGE_BASE | _PAGE_WRENABLE)
-#define PAGE_KERNEL_CI	__pgprot(_PAGE_PRESENT | _PAGE_ACCESSED | \
-			       _PAGE_WRENABLE | _PAGE_NO_CACHE | _PAGE_GUARDED)

 /*
- * The PowerPC can only do execute protection on a segment (256MB) basis,
- * not on a page basis.  So we consider execute permission the same as read.
+ * POWER4 and newer have per page execute protection, older chips can only
+ * do this on a segment (256MB) basis.
+ *
  * Also, write permissions imply read permissions.
  * This is the closest we can get..
+ *
+ * Note due to the way vm flags are laid out, the bits are XWR
  */
 #define __P000	PAGE_NONE
-#define __P001	PAGE_READONLY_X
+#define __P001	PAGE_READONLY
 #define __P010	PAGE_COPY
-#define __P011	PAGE_COPY_X
-#define __P100	PAGE_READONLY
+#define __P011	PAGE_COPY
+#define __P100	PAGE_READONLY_X
 #define __P101	PAGE_READONLY_X
-#define __P110	PAGE_COPY
+#define __P110	PAGE_COPY_X
 #define __P111	PAGE_COPY_X

 #define __S000	PAGE_NONE
-#define __S001	PAGE_READONLY_X
+#define __S001	PAGE_READONLY
 #define __S010	PAGE_SHARED
-#define __S011	PAGE_SHARED_X
-#define __S100	PAGE_READONLY
+#define __S011	PAGE_SHARED
+#define __S100	PAGE_READONLY_X
 #define __S101	PAGE_READONLY_X
-#define __S110	PAGE_SHARED
+#define __S110	PAGE_SHARED_X
 #define __S111	PAGE_SHARED_X

 #ifndef __ASSEMBLY__
@@ -191,7 +193,8 @@
 })

 #define pte_modify(_pte, newprot) \
-  (__pte((pte_val(_pte) & _PAGE_CHG_MASK) | pgprot_val(newprot)))
+	(__pte((pte_val(_pte) & _PAGE_CHG_MASK) | \
+	       (pgprot_val(newprot) & ~_PAGE_CHG_MASK)))

 #define pte_none(pte)		((pte_val(pte) & ~_PAGE_HPTEFLAGS) == 0)
 #define pte_present(pte)	(pte_val(pte) & _PAGE_PRESENT)
@@ -260,9 +263,6 @@
 static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
 static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
 static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
-
-static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
-static inline void pte_cache(pte_t pte)   { pte_val(pte) &= ~_PAGE_NO_CACHE; }

 static inline pte_t pte_rdprotect(pte_t pte) {
 	pte_val(pte) &= ~_PAGE_USER; return pte; }

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Sun Dec 28 16:29:55 2003
From: anton at samba.org (Anton Blanchard)
Date: Sun, 28 Dec 2003 16:29:55 +1100
Subject: spinlocks
Message-ID: <20031228052954.GD24358@krispykreme>


Hi,

We really have to get the new spinlocks beaten into shape...

1. They are massive: 24 inline instructions. They eat hot icache for
breakfast.

2. They add a bunch of clobbers:

        : "=&r"(tmp), "=&r"(tmp2)
        : "r"(&lock->lock)
        : "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory");

We tie gcc's hands behind its back for the unlikely case that we have
to call into the hypervisor.

3. Separate spinlocks for iseries and pseries where most of it is
duplicated.

4. They add more reliance on the paca. We have to stop using the paca
for everything that isnt architecturally required and move to per cpu
data. In the end we may have to put the processor virtual area in the
paca, but we need to be thinking about this issue.

As an aside, can someone explain why we reread the lock holder:

        lwsync                          # if odd, give up cycles\n\
        ldx             %1,0,%2         # reverify the lock holder\n\
        cmpd            %0,%1\n\
        bne             1b              # new holder so restart\n\

Wont there be a race regardless of whether this code is there?

4. I like how we store r13 into the lock since it could save one
register and will make the guys wanting debug spinlocks a bit happier
(you can work out which cpu has the lock via the spinlock value)

Im proposing a few things:

1. Recognise that once we are in SPLPAR mode, all performance bets are
off and we can burn more cycles. If we are calling into the hypervisor,
the path length there is going to dwarf us so why optimise for it?

2. Move the slow path out of line. We had problems with this due to the
limited reach of a conditional branch but we can fix this by compiling
with -ffunction-sections. We only then encounter problems if we get a
function that is larger than 32kB. If that happens, something is really
wrong :)

3. In the slow path call a single out of line function when calling
into the hypervisor that saves/restores all relevant registers. The call
will be nop'ed out by the cpufeature fixup stuff on non SPLPAR. With
the new module interface we should be able to handle cpufeature fixups
in modules.

Outstanding stuff:
- implement the out of line splpar_spinlock code
- fix cpu features to fixup stuff in modules
- work out how to use FW_FEATURE_SPLPAR in the FTR_SECTION code

Here is what Im thinking the spinlocks should look like:

static inline void _raw_spin_lock(spinlock_t *lock)
{
        unsigned long tmp;

        asm volatile(
"1:     ldarx           %0,0,%1         # spin_lock\n\
        cmpdi           0,%0,0\n\
        bne-            2f\n\
        stdcx.          13,0,%1\n\
        bne-            1b\n\
        isync\n\
        .subsection 1\n\
2:"
        HMT_LOW
BEGIN_FTR_SECTION
"       mflr            %0\n\
        bl              .splpar_spinlock\n"
END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR)
"       ldx             %0,0,%1\n\
        cmpdi           0,%0,0\n\
        bne-            2b\n"
        HMT_MEDIUM
"       b               1b\n\
        .previous"
        : "=&r"(tmp)
        : "r"(&lock->lock)
        : "cr0", "memory");
}

Anton

===== arch/ppc64/Makefile 1.39 vs edited =====
--- 1.39/arch/ppc64/Makefile	Tue Dec  9 03:23:33 2003
+++ edited/arch/ppc64/Makefile	Sun Dec 28 13:41:49 2003
@@ -28,7 +28,8 @@

 LDFLAGS		:= -m elf64ppc
 LDFLAGS_vmlinux	:= -Bstatic -e $(KERNELLOAD) -Ttext $(KERNELLOAD)
-CFLAGS		+= -msoft-float -pipe -Wno-uninitialized -mminimal-toc
+CFLAGS		+= -msoft-float -pipe -Wno-uninitialized -mminimal-toc \
+		   -mtraceback=none -ffunction-sections

 ifeq ($(CONFIG_POWER4_ONLY),y)
 CFLAGS		+= -mcpu=power4
===== include/asm-ppc64/spinlock.h 1.7 vs edited =====
--- 1.7/include/asm-ppc64/spinlock.h	Sat Nov 15 05:45:32 2003
+++ edited/include/asm-ppc64/spinlock.h	Sun Dec 28 13:50:18 2003
@@ -15,14 +15,14 @@
  * 2 of the License, or (at your option) any later version.
  */

-#include <asm/hvcall.h>
+#include <asm/cputable.h>

 /*
  * The following define is being used to select basic or shared processor
  * locking when running on an RPA platform.  As we do more performance
  * tuning, I would expect this selection mechanism to change.  Dave E.
  */
-#define SPLPAR_LOCKS
+#undef SPLPAR_LOCKS
 #define HVSC			".long 0x44000022\n"

 typedef struct {
@@ -138,25 +138,33 @@
 	: "r3", "r4", "r5", "cr0", "cr1", "ctr", "xer", "memory");
 }
 #else
-static __inline__ void _raw_spin_lock(spinlock_t *lock)
+
+static inline void _raw_spin_lock(spinlock_t *lock)
 {
 	unsigned long tmp;

-	__asm__ __volatile__(
-       "b		2f		# spin_lock\n\
-1:"
-	HMT_LOW
-"       ldx		%0,0,%1         # load the lock value\n\
-	cmpdi		0,%0,0          # if not locked, try to acquire\n\
-	bne+		1b\n\
-2: \n"
-	HMT_MEDIUM
-" 	ldarx		%0,0,%1\n\
+	asm volatile(
+"1:	ldarx		%0,0,%1		# spin_lock\n\
 	cmpdi		0,%0,0\n\
-	bne-		1b\n\
+	bne-		2f\n\
 	stdcx.		13,0,%1\n\
-	bne-		2b\n\
-	isync"
+	bne-		1b\n\
+	isync\n\
+	.subsection 1\n\
+2:"
+	HMT_LOW
+#if 0
+BEGIN_FTR_SECTION
+"	mflr		%0\n\
+	bl		.splpar_spinlock\n"
+END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR)
+#endif
+"	ldx		%0,0,%1\n\
+	cmpdi		0,%0,0\n\
+	bne-		2b\n"
+	HMT_MEDIUM
+"	b		1b\n\
+	.previous"
 	: "=&r"(tmp)
 	: "r"(&lock->lock)
 	: "cr0", "memory");


** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Wed Dec 31 01:36:58 2003
From: anton at samba.org (Anton Blanchard)
Date: Wed, 31 Dec 2003 01:36:58 +1100
Subject: 2.6.0 oops
Message-ID: <20031230143658.GA28023@krispykreme>


Hi,

Running bash-shared-mappings on ameslab BK from last night I got the
following oops:

cpu 0: Vector: 300 (Data Access) at [c00000000ee63760]
    pc: c00000000000e158 (.__hash_page+0x2c/0xa4)
    lr: c00000000004af14 (.update_mmu_cache+0x15c/0x1d4)
    sp: c00000000ee639e0
   msr: a000000000009032
   dar: 0
 dsisr: 40000000
  current = 0xc000000001e18300
  paca    = 0xc00000000051a000
    pid   = 17778, comm = init
0:mon> t
c00000000ee63ae0  c00000000004af14  .update_mmu_cache+0x15c/0x1d4
c00000000ee63b90  c00000000008fe40  .do_no_page+0x288/0x554
c00000000ee63c60  c0000000000903c8  .handle_mm_fault+0x194/0x268
c00000000ee63d00  c000000000049734  .do_page_fault+0x160/0x48c
c00000000ee63e30  c00000000000acb0  InstructionAccess_common+0x10c/0x110

0:mon> di %pc
c00000000000e158  7fe030a8      ldarx   r31,r0,r6

We died nice and early in __hash_page, all arguments should be preserved:

int __hash_page(unsigned long ea, unsigned long access, unsigned long vsid,
		pte_t *ptep, unsigned long trap, int local);

R03 = 000000000ff4e7d8
R04 = 0000000000000003
R05 = 0000000e3779b970
R06 = 0000000000000000
R07 = 0000000000000300
R08 = 0000000000000000

ea is probably somewhere in shared libraries. access is read/write,
vsid looks ok but ptep is NULL. How did this happen? We hold the page
table spinlock around the whole region that does the set_pte in
do_no_page right through to update_mmu_cache where we died. So Im
guessing we did put a pte in there but why didnt find_linux_pte find it?

There are 2 ways find_linux_pte (in update_mmu_cache) could return NULL.
Either the pgd/pmd is empty or the pte doesnt have _PAGE_PRESENT set.
We did check that the pte had _PAGE_ACCESSED set earlier in update_mmu_cache
so its sounding like we are asking to insert a pte with _PAGE_ACCESSED set
and _PAGE_PRESENT not set.

The safe thing to do is add a check for NULL after find_linux_pte,
update_mmu_cache is just an optimisation and it makes sense to guard
against weird things like this. However I would like to understand how
we ended up in this scenario in the first place...

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From amodra at bigpond.net.au  Wed Dec 31 09:18:32 2003
From: amodra at bigpond.net.au (Alan Modra)
Date: Wed, 31 Dec 2003 08:48:32 +1030
Subject: spinlocks
In-Reply-To: <20031228052954.GD24358@krispykreme>
References: <20031228052954.GD24358@krispykreme>
Message-ID: <20031230221832.GA22998@bubble.sa.bigpond.net.au>


On Sun, Dec 28, 2003 at 04:29:55PM +1100, Anton Blanchard wrote:
> static inline void _raw_spin_lock(spinlock_t *lock)
> {
>         unsigned long tmp;
>
>         asm volatile(
> "1:     ldarx           %0,0,%1         # spin_lock\n\
>         cmpdi           0,%0,0\n\
>         bne-            2f\n\
>         stdcx.          13,0,%1\n\
>         bne-            1b\n\
>         isync\n\
>         .subsection 1\n\
> 2:"
>         HMT_LOW
> BEGIN_FTR_SECTION
> "       mflr            %0\n\
>         bl              .splpar_spinlock\n"
> END_FTR_SECTION_IFSET(CPU_FTR_SPLPAR)
> "       ldx             %0,0,%1\n\
>         cmpdi           0,%0,0\n\
>         bne-            2b\n"
>         HMT_MEDIUM
> "       b               1b\n\
>         .previous"
>         : "=&r"(tmp)
>         : "r"(&lock->lock)
>         : "cr0", "memory");
> }

You might want to restore lr somewhere in there, unless there's
something magic about those FTR_SECTION macros.  :)

Do you really want to tell gcc that all memory is potentially changed
by _raw_spin_lock?  Hmm, I guess if you're accessing something
protected by a lock then you want to say that old values of the
"something" are stale.  However, I think it would be better to
explicitly say that &lock->lock is an output of the asm, rather than
relying on the "memory" clobber to do that.

Also, you might find it a little tricky to write splpar_spinlock.  The
problem is that you can't use any registers (since you haven't told
gcc about any), and you'll need to be careful about using the stack.
If _raw_spin_lock is called from a leaf function foo, then gcc may not
set up a stack frame for foo.  As per the ABI, gcc may use 288 bytes
below r1 as scratch that isn't saved over calls.  Since you haven't
told gcc that you're making a call, you need to skip this area if
using the stack in splpar_spinlock.

I wonder if you wouldn't do better by making _raw_spin_lock a function
written in asm.  OK, that would mean the overhead of a function call,
but I reckon many people forget that inline code blows icache, which
probably hurts more..

--
Alan Modra
IBM OzLabs - Linux Technology Centre

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From amodra at bigpond.net.au  Wed Dec 31 09:18:41 2003
From: amodra at bigpond.net.au (Alan Modra)
Date: Wed, 31 Dec 2003 08:48:41 +1030
Subject: per page execute
In-Reply-To: <20031227121524.GA24358@krispykreme>
References: <20031227121524.GA24358@krispykreme>
Message-ID: <20031230221841.GB22998@bubble.sa.bigpond.net.au>


On Sat, Dec 27, 2003 at 11:15:25PM +1100, Anton Blanchard wrote:
>   [25] .plt NOBITS 10010c08 000c00 0000c0 00 WAX 0   0  4
>   [26] .bss NOBITS 10010cc8 000c00 000004 00  WA 0   0  1
>
> Look how the non executable bss butts right onto the executable plt.
> Even with the patch below, we are failing some security tests that try
> and exec stuff out of the bss. Thats because the stuff ends up in the same
> page as the plt. Alan, could this be considered a toolchain bug?

Possibly.  What about .got (exec) and adjacent .sdata (non-exec)?  The
ABI says that shared libs access .sdata via the got pointer, so
there's no hope of separating them.

--
Alan Modra
IBM OzLabs - Linux Technology Centre

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/


From anton at samba.org  Wed Dec 31 10:58:36 2003
From: anton at samba.org (Anton Blanchard)
Date: Wed, 31 Dec 2003 10:58:36 +1100
Subject: spinlocks
In-Reply-To: <20031230221832.GA22998@bubble.sa.bigpond.net.au>
References: <20031228052954.GD24358@krispykreme> <20031230221832.GA22998@bubble.sa.bigpond.net.au>
Message-ID: <20031230235836.GC28023@krispykreme>


Hi,

> You might want to restore lr somewhere in there, unless there's
> something magic about those FTR_SECTION macros.  :)

No magic just not enough thought has gone into my code yet :)

> Do you really want to tell gcc that all memory is potentially changed
> by _raw_spin_lock?  Hmm, I guess if you're accessing something
> protected by a lock then you want to say that old values of the
> "something" are stale.  However, I think it would be better to
> explicitly say that &lock->lock is an output of the asm, rather than
> relying on the "memory" clobber to do that.

Yeah we need to force a full gcc memory barrier there. If you think we
should add the explicit clobber as well I can, we have a lot of code
that does that however (atomic and bitop code).

> Also, you might find it a little tricky to write splpar_spinlock.  The
> problem is that you can't use any registers (since you haven't told
> gcc about any), and you'll need to be careful about using the stack.
> If _raw_spin_lock is called from a leaf function foo, then gcc may not
> set up a stack frame for foo.  As per the ABI, gcc may use 288 bytes
> below r1 as scratch that isn't saved over calls.  Since you haven't
> told gcc that you're making a call, you need to skip this area if
> using the stack in splpar_spinlock.

Yeah I was thinking we force tmp to be an explicit register in the
clobbers, then we have something to start from. Id expect
splpar_spinlock will allocate a stackframe and go from there.

> I wonder if you wouldn't do better by making _raw_spin_lock a function
> written in asm.  OK, that would mean the overhead of a function call,
> but I reckon many people forget that inline code blows icache, which
> probably hurts more..

Well Id do that if we could specify clobbers in function prototypes in
gcc :) Otherwise the overhead of a function call is reasonably high.
Also it makes profiling a bitch when you spend 50% of your time in
the spinlock function and have no idea how that is broken up.

FYI enable -ffunction-sections and notice how it takes a few minutes to
do the final link stage... The profile looks like (numbers are % of cpu
time):

22.9499  ld                       __udivmoddi4
 8.0067  libc-2.3.2.so            strcmp
 7.8211  ld                       lang_check_section_addresses
 5.3252  ld                       lang_output_section_find
 4.1369  ld                       gldelf64ppc_place_orphan
 3.8997  make                     (no symbols)
 2.7411  libpthread-0.10.so       __pthread_alt_unlock
 1.2113  libpthread-0.10.so       __pthread_alt_lock
 1.1746  ld                       __udivdi3
 0.8079  libc-2.3.2.so            __ctype_b_loc

Ouch, really ld doesnt like 10,000 sections :)

GNU ld version 2.14.90 20030814

Anton

** Sent via the linuxppc64-dev mail list. See http://lists.linuxppc.org/