From anton at samba.org Sun May 1 19:26:42 2005
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 May 2005 19:26:42 +1000
Subject: [PATCH] ppc64: remove hidden -fno-omit-frame-pointer for schedule.c
Message-ID: <20050501092641.GL19662@krispykreme>

Hi,

While looking at code generated by gcc4.0 I noticed some functions still
had frame pointers, even after we stopped ppc64 from defining
CONFIG_FRAME_POINTER. It turns out kernel/Makefile hardwires
-fno-omit-frame-pointer on when compiling sched.c.

It was already disabled on ia64, disable it on ppc64 as well.

Signed-off-by: Anton Blanchard

Index: linux-2.6.12-rc2/kernel/Makefile
===================================================================
--- linux-2.6.12-rc2.orig/kernel/Makefile 2005-04-19 13:37:40.599016667 +1000
+++ linux-2.6.12-rc2/kernel/Makefile 2005-05-01 05:48:00.689299680 +1000
@@ -33,6 +33,7 @@
 obj-$(CONFIG_SECCOMP) += seccomp.o
 
 ifneq ($(CONFIG_IA64),y)
+ifneq ($(CONFIG_PPC64),y)
 # According to Alan Modra , the -fno-omit-frame-pointer is
 # needed for x86 only. Why this used to be enabled for all architectures is beyond
 # me. I suspect most platforms don't need this, but until we know that for sure
@@ -40,6 +41,7 @@
 # to get a correct value for the wait-channel (WCHAN in ps). --davidm
 CFLAGS_sched.o := $(PROFILING) -fno-omit-frame-pointer
 endif
+endif
 
 $(obj)/configs.o: $(obj)/config_data.h

From akpm at osdl.org Sun May 1 19:37:59 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sun, 1 May 2005 02:37:59 -0700
Subject: [PATCH] ppc64: remove hidden -fno-omit-frame-pointer for schedule.c
In-Reply-To: <20050501092641.GL19662@krispykreme>
References: <20050501092641.GL19662@krispykreme>
Message-ID: <20050501023759.15d98aea.akpm@osdl.org>

Anton Blanchard wrote:
>
> --- linux-2.6.12-rc2.orig/kernel/Makefile 2005-04-19 13:37:40.599016667 +1000
> +++ linux-2.6.12-rc2/kernel/Makefile 2005-05-01 05:48:00.689299680 +1000
> @@ -33,6 +33,7 @@
>  obj-$(CONFIG_SECCOMP) += seccomp.o
>
>  ifneq ($(CONFIG_IA64),y)
> +ifneq ($(CONFIG_PPC64),y)

Could we please use a new CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER, define
that in the arch's Kconfig?

(is that a triple negative I see?)

From anton at samba.org Sun May 1 19:56:39 2005
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 May 2005 19:56:39 +1000
Subject: [PATCH] ppc64: remove hidden -fno-omit-frame-pointer for schedule.c
In-Reply-To: <20050501023759.15d98aea.akpm@osdl.org>
References: <20050501092641.GL19662@krispykreme> <20050501023759.15d98aea.akpm@osdl.org>
Message-ID: <20050501095639.GM19662@krispykreme>

> Could we please use a new CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER, define
> that in the arch's Kconfig?
>
> (is that a triple negative I see?)

I like it. Hopefully someone suitably annoyed by this triple negative
will go and work out which damn architectures actually need
-fno-omit-frame-pointer and reverse the test.

For now ppc32, ppc64, ia64 don't need it.

Anton

--

While looking at code generated by gcc4.0 I noticed some functions still
had frame pointers, even after we stopped ppc64 from defining
CONFIG_FRAME_POINTER. It turns out kernel/Makefile hardwires
-fno-omit-frame-pointer on when compiling sched.c.

Create CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER and define it on architectures
that don't require frame pointers in sched.c code.

Signed-off-by: Anton Blanchard

Index: linux-2.6.12-rc2/kernel/Makefile
===================================================================
--- linux-2.6.12-rc2.orig/kernel/Makefile 2005-04-19 13:37:40.599016667 +1000
+++ linux-2.6.12-rc2/kernel/Makefile 2005-05-01 19:48:44.471448005 +1000
@@ -32,7 +32,7 @@
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 
-ifneq ($(CONFIG_IA64),y)
+ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra , the -fno-omit-frame-pointer is
 # needed for x86 only. Why this used to be enabled for all architectures is beyond
 # me. I suspect most platforms don't need this, but until we know that for sure
Index: linux-2.6.12-rc2/arch/ia64/Kconfig
===================================================================
--- linux-2.6.12-rc2.orig/arch/ia64/Kconfig 2005-04-19 13:37:33.173418325 +1000
+++ linux-2.6.12-rc2/arch/ia64/Kconfig 2005-05-01 19:49:35.060202590 +1000
@@ -46,6 +46,10 @@
 	bool
 	default y
 
+config SCHED_NO_NO_OMIT_FRAME_POINTER
+	bool
+	default y
+
 choice
 	prompt "System type"
 	default IA64_GENERIC
Index: linux-2.6.12-rc2/arch/ppc/Kconfig
===================================================================
--- linux-2.6.12-rc2.orig/arch/ppc/Kconfig 2005-04-19 13:37:33.450396856 +1000
+++ linux-2.6.12-rc2/arch/ppc/Kconfig 2005-05-01 19:49:24.699414050 +1000
@@ -43,6 +43,10 @@
 	bool
 	default y
 
+config SCHED_NO_NO_OMIT_FRAME_POINTER
+	bool
+	default y
+
 source "init/Kconfig"
 
 menu "Processor"
Index: linux-2.6.12-rc2/arch/ppc64/Kconfig
===================================================================
--- linux-2.6.12-rc2.orig/arch/ppc64/Kconfig 2005-05-01 05:39:38.017058150 +1000
+++ linux-2.6.12-rc2/arch/ppc64/Kconfig 2005-05-01 19:50:47.878561880 +1000
@@ -40,6 +40,10 @@
 	bool
 	default y
 
+config SCHED_NO_NO_OMIT_FRAME_POINTER
+	bool
+	default y
+
 # We optimistically allocate largepages from the VM, so make the limit
 # large enough (16MB). This badly named config option is actually
 # max order + 1

From anton at samba.org Sun May 1 20:30:13 2005
From: anton at samba.org (Anton Blanchard)
Date: Sun, 1 May 2005 20:30:13 +1000
Subject: [PATCH] ppc64: add missing Kconfig help text
Message-ID: <20050501103013.GN19662@krispykreme>

From: Jesper Juhl

There's no help text for CONFIG_DEBUG_STACKOVERFLOW - add one.

Signed-off-by: Jesper Juhl
Signed-off-by: Anton Blanchard

Index: linux-2.6.12-rc2/arch/ppc64/Kconfig.debug
===================================================================
--- linux-2.6.12-rc2.orig/arch/ppc64/Kconfig.debug 2005-02-04 04:10:36.000000000 +1100
+++ linux-2.6.12-rc2/arch/ppc64/Kconfig.debug 2005-05-01 20:27:18.760365099 +1000
@@ -5,6 +5,9 @@
 config DEBUG_STACKOVERFLOW
 	bool "Check for stack overflows"
 	depends on DEBUG_KERNEL
+	help
+	  This option will cause messages to be printed if free stack space
+	  drops below a certain limit.
 
 config KPROBES
 	bool "Kprobes"

From peter at chubb.wattle.id.au Mon May 2 10:17:51 2005
From: peter at chubb.wattle.id.au (Peter Chubb)
Date: Mon, 2 May 2005 10:17:51 +1000
Subject: [PATCH] ppc64: update to use the new 4L headers
In-Reply-To: <4270472E.9050708@yahoo.com.au>
References: <1114652039.7112.213.camel@gaston> <42704130.9050005@yahoo.com.au> <427044AA.5030402@nortel.com> <4270472E.9050708@yahoo.com.au>
Message-ID: <17013.29103.249971.866326@wombat.chubb.wattle.id.au>

>>>>> "Nick" == Nick Piggin writes:

Nick> Chris Friesen wrote:
>> I needed something like:
>>
>> pte_t *va_to_ptep_map(struct mm_struct *mm, unsigned int addr)
>>
>> There was code in follow_page() that did basically what I needed,
>> but it was all contained within that function so I had to
>> re-implement it.
>>

Nick> If you can break out exactly what you need, and make that inline
Nick> or otherwise available via the correct header, I'm sure it would
Nick> have a good chance of being merged.

We're currently working on this, so as to be able to provide
interfaces to alternative page tables. We want to be able to slot in
Liedtke's `Guarded Page Tables', or B-trees, or a hash table to see
what happens.

Except we've called the function:

	pte_t * lookup_page_table(unsigned long address, struct mm_struct *mm);

follow_page() is essentially the same after inline expansion happens;
but we're seeing a regression in clear_page_range() that we want to
fix before release.

If you want to take a look (warning: it's still fairly rough
work-in-progress) there's high level design being worked on at
http://www.gelato.unsw.edu.au/IA64wiki/PageTableInterface
and patches from our CVS repository. The only patch of interest is
pti.patch.

cvs -d :pserver:anoncvs at gelato.unsw.edu.au:/gelato login
Logging in to :pserver:anoncvs at lemon:2401/gelato
CVS password:[enter anoncvs]
$ cvs -d:pserver:anoncvs at gelato.unsw.edu.au:/gelato co kernel/page_table_interface

or from
http://www.gelato.unsw.edu.au/cgi-bin/viewcvs.cgi/cvs/kernel/page_table_interface/

Peter C

From miltonm at bga.com Mon May 2 16:43:40 2005
From: miltonm at bga.com (Milton Miller)
Date: Mon, 2 May 2005 01:43:40 -0500
Subject: [PATCH 1/2] ppc64: fix read/write on large /dev/nvram
Message-ID: <7845758a806ed6769cea59a9df344d39@bga.com>

On Fri Apr 22 16:49:59 EST 2005, Arnd wrote a patch with the following
lines (among several others).

- len = ppc_md.nvram_read(tmp_buffer, count, ppos);
+ ret = ppc_md.nvram_read(tmp, count, ppos);

- len = ppc_md.nvram_write(tmp_buffer, count, ppos);
+ ret = ppc_md.nvram_read(tmp, count, ppos);

Even though I am just scanning, I am guessing this is not quite right.

milton

From david at gibson.dropbear.id.au Tue May 3 10:26:08 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 3 May 2005 10:26:08 +1000
Subject: [PPC64] pgtable.h and other header cleanups
Message-ID: <20050503002608.GA22453@localhost.localdomain>

Andrew, please apply.

This patch started as simply removing a few never-used macros from
asm-ppc64/pgtable.h, then kind of grew. It now makes a bunch of
cleanups to the ppc64 low-level header files (with corresponding
changes to .c files where necessary) such as:
	- Abolishing never-used macros
	- Eliminating multiple #defines with the same purpose
	- Removing pointless macros (cases where just expanding the
	  macro everywhere turns out clearer and more sensible)
	- Removing some cases where macros which could be defined in
	  terms of each other weren't
	- Moving imalloc() related definitions from pgtable.h to their
	  own header file (imalloc.h)
	- Re-arranging headers to group things more logically
	- Moving all VSID allocation related things to mmu.h, instead
	  of being split between mmu.h and mmu_context.h
	- Removing some reserved space for flags from the PMD - we're
	  not using it.

Signed-off-by: David Gibson

Index: working-2.6/include/asm-ppc64/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-02 16:21:09.000000000 +1000
+++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-02 17:58:29.000000000 +1000
@@ -17,16 +17,6 @@
 #include
 
-/* PMD_SHIFT determines what a second-level page table entry can map */
-#define PMD_SHIFT (PAGE_SHIFT + PAGE_SHIFT - 3)
-#define PMD_SIZE (1UL << PMD_SHIFT)
-#define PMD_MASK (~(PMD_SIZE-1))
-
-/* PGDIR_SHIFT determines what a third-level page table entry can map */
-#define PGDIR_SHIFT (PAGE_SHIFT + (PAGE_SHIFT - 3) + (PAGE_SHIFT - 2))
-#define PGDIR_SIZE (1UL << PGDIR_SHIFT)
-#define PGDIR_MASK (~(PGDIR_SIZE-1))
-
 /*
  * Entries per page directory level. The PTE level must use a 64b record
  * for each page table entry.
The PMD and PGD level use a 32b record for @@ -40,40 +30,30 @@ #define PTRS_PER_PMD (1 << PMD_INDEX_SIZE) #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE) -#define USER_PTRS_PER_PGD (1024) -#define FIRST_USER_ADDRESS 0 +/* PMD_SHIFT determines what a second-level page table entry can map */ +#define PMD_SHIFT (PAGE_SHIFT + PTE_INDEX_SIZE) +#define PMD_SIZE (1UL << PMD_SHIFT) +#define PMD_MASK (~(PMD_SIZE-1)) -#define EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ - PGD_INDEX_SIZE + PAGE_SHIFT) +/* PGDIR_SHIFT determines what a third-level page table entry can map */ +#define PGDIR_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +#define PGDIR_SIZE (1UL << PGDIR_SHIFT) +#define PGDIR_MASK (~(PGDIR_SIZE-1)) + +#define FIRST_USER_ADDRESS 0 /* * Size of EA range mapped by our pagetables. */ -#define PGTABLE_EA_BITS 41 -#define PGTABLE_EA_MASK ((1UL<> PMD_TO_PTEPAGE_SHIFT)) +#define pmd_page_kernel(pmd) (__bpn_to_ba(pmd_val(pmd))) #define pmd_page(pmd) virt_to_page(pmd_page_kernel(pmd)) #define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (__ba_to_bpn(pmdp))) @@ -266,8 +241,6 @@ /* to find an entry in the ioremap page-table-directory */ #define pgd_offset_i(address) (ioremap_pgd + pgd_index(address)) -#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT)) - /* * The following only work if pte_present() is true. * Undefined behaviour if not.. 
@@ -487,18 +460,13 @@ extern unsigned long ioremap_bot, ioremap_base; -#define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT) -#define KERNEL_PGD_PTRS (PTRS_PER_PGD-USER_PGD_PTRS) - -#define pte_ERROR(e) \ - printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e)) #define pmd_ERROR(e) \ printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) #define pgd_ERROR(e) \ printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) -extern pgd_t swapper_pg_dir[1024]; -extern pgd_t ioremap_dir[1024]; +extern pgd_t swapper_pg_dir[]; +extern pgd_t ioremap_dir[]; extern void paging_init(void); @@ -540,43 +508,11 @@ */ #define kern_addr_valid(addr) (1) -#define io_remap_page_range(vma, vaddr, paddr, size, prot) \ - remap_pfn_range(vma, vaddr, (paddr) >> PAGE_SHIFT, size, prot) - #define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \ remap_pfn_range(vma, vaddr, pfn, size, prot) -#define MK_IOSPACE_PFN(space, pfn) (pfn) -#define GET_IOSPACE(pfn) 0 -#define GET_PFN(pfn) (pfn) - void pgtable_cache_init(void); -extern void hpte_init_native(void); -extern void hpte_init_lpar(void); -extern void hpte_init_iSeries(void); - -/* imalloc region types */ -#define IM_REGION_UNUSED 0x1 -#define IM_REGION_SUBSET 0x2 -#define IM_REGION_EXISTS 0x4 -#define IM_REGION_OVERLAP 0x8 -#define IM_REGION_SUPERSET 0x10 - -extern struct vm_struct * im_get_free_area(unsigned long size); -extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, - int region_type); -unsigned long im_free(void *addr); - -extern long pSeries_lpar_hpte_insert(unsigned long hpte_group, - unsigned long va, unsigned long prpn, - int secondary, unsigned long hpteflags, - int bolted, int large); - -extern long native_hpte_insert(unsigned long hpte_group, unsigned long va, - unsigned long prpn, int secondary, - unsigned long hpteflags, int bolted, int large); - /* * find_linux_pte returns the address of a linux pte for a given * effective address and directory. 
If not found, it returns zero. Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-02 16:21:09.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-02 16:21:43.000000000 +1000 @@ -23,7 +23,6 @@ #define PAGE_SHIFT 12 #define PAGE_SIZE (ASM_CONST(1) << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE-1)) -#define PAGE_OFFSET_MASK (PAGE_SIZE-1) #define SID_SHIFT 28 #define SID_MASK 0xfffffffffUL @@ -85,9 +84,6 @@ /* align addr on a size boundary - adjust address up if needed */ #define _ALIGN(addr,size) _ALIGN_UP(addr,size) -/* to align the pointer to the (next) double word boundary */ -#define DOUBLEWORD_ALIGN(addr) _ALIGN(addr,sizeof(unsigned long)) - /* to align the pointer to the (next) page boundary */ #define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE) @@ -100,7 +96,6 @@ #define REGION_SIZE 4UL #define REGION_SHIFT 60UL #define REGION_MASK (((1UL<>REGION_SHIFT) -#define VMALLOC_REGION_ID (VMALLOCBASE>>REGION_SHIFT) -#define KERNEL_REGION_ID (KERNELBASE>>REGION_SHIFT) +#define IO_REGION_ID (IOREGIONBASE >> REGION_SHIFT) +#define VMALLOC_REGION_ID (VMALLOCBASE >> REGION_SHIFT) +#define KERNEL_REGION_ID (KERNELBASE >> REGION_SHIFT) #define USER_REGION_ID (0UL) -#define REGION_ID(X) (((unsigned long)(X))>>REGION_SHIFT) +#define REGION_ID(ea) (((unsigned long)(ea)) >> REGION_SHIFT) -#define __bpn_to_ba(x) ((((unsigned long)(x))<> PAGE_SHIFT) #define __va(x) ((void *)((unsigned long)(x) + KERNELBASE)) Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-02 16:59:40.000000000 +1000 @@ -14,6 +14,7 @@ #include #include #include +#include static DECLARE_MUTEX(imlist_sem); struct vm_struct * imlist = NULL; @@ -23,11 +24,11 @@ unsigned long addr; struct vm_struct **p, *tmp; - 
addr = IMALLOC_START; + addr = ioremap_bot; for (p = &imlist; (tmp = *p) ; p = &tmp->next) { if (size + addr < (unsigned long) tmp->addr) break; - if ((unsigned long)tmp->addr >= IMALLOC_START) + if ((unsigned long)tmp->addr >= ioremap_bot) addr = tmp->size + (unsigned long) tmp->addr; if (addr > IMALLOC_END-size) return 1; Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-02 16:56:17.000000000 +1000 @@ -298,24 +298,23 @@ int local = 0; cpumask_t tmp; + if ((ea & ~REGION_MASK) > EADDR_MASK) + return 1; + switch (REGION_ID(ea)) { case USER_REGION_ID: user_region = 1; mm = current->mm; - if ((ea > USER_END) || (! mm)) + if (! mm) return 1; vsid = get_vsid(mm->context.id, ea); break; case IO_REGION_ID: - if (ea > IMALLOC_END) - return 1; mm = &ioremap_mm; vsid = get_kernel_vsid(ea); break; case VMALLOC_REGION_ID: - if (ea > VMALLOC_END) - return 1; mm = &init_mm; vsid = get_kernel_vsid(ea); break; @@ -362,7 +361,7 @@ unsigned long vsid, vpn, va, hash, secondary, slot; unsigned long huge = pte_huge(pte); - if ((ea >= USER_START) && (ea <= USER_END)) + if (ea < KERNELBASE) vsid = get_vsid(context, ea); else vsid = get_kernel_vsid(ea); Index: working-2.6/arch/ppc64/mm/hash_native.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_native.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_native.c 2005-05-02 16:51:44.000000000 +1000 @@ -320,8 +320,7 @@ j = 0; for (i = 0; i < number; i++) { - if ((batch->addr[i] >= USER_START) && - (batch->addr[i] <= USER_END)) + if (batch->addr[i] < KERNELBASE) vsid = get_vsid(context, batch->addr[i]); else vsid = get_kernel_vsid(batch->addr[i]); Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- 
working-2.6.orig/arch/ppc64/mm/init.c 2005-05-02 08:57:20.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-02 16:38:18.000000000 +1000 @@ -64,6 +64,7 @@ #include #include #include +#include int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-04-26 15:38:02.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2005-05-02 17:45:59.000000000 +1000 @@ -15,19 +15,10 @@ #include #include -#include -#ifndef __ASSEMBLY__ - -/* Time to allow for more things here */ -typedef unsigned long mm_context_id_t; -typedef struct { - mm_context_id_t id; -#ifdef CONFIG_HUGETLB_PAGE - pgd_t *huge_pgdir; - u16 htlb_segs; /* bitmask */ -#endif -} mm_context_t; +/* + * Segment table + */ #define STE_ESID_V 0x80 #define STE_ESID_KS 0x20 @@ -36,15 +27,48 @@ #define STE_VSID_SHIFT 12 -struct stab_entry { - unsigned long esid_data; - unsigned long vsid_data; -}; +/* Location of cpu0's segment table */ +#define STAB0_PAGE 0x9 +#define STAB0_PHYS_ADDR (STAB0_PAGE<> VSID_BITS) + (x & VSID_MODULUS); + return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS; +#endif /* 1 */ +} + +/* This is only valid for addresses >= KERNELBASE */ +static inline unsigned long get_kernel_vsid(unsigned long ea) +{ + return vsid_scramble(ea >> SID_SHIFT); +} + +/* This is only valid for user addresses (which are below 2^41) */ +static inline unsigned long get_vsid(unsigned long context, unsigned long ea) +{ + return vsid_scramble((context << USER_ESID_BITS) + | (ea >> SID_SHIFT)); +} + +#endif /* __ASSEMBLY */ + #endif /* _PPC64_MMU_H_ */ Index: working-2.6/arch/ppc64/mm/stab.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/stab.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/stab.c 2005-05-02 17:29:03.000000000 +1000 @@ -19,6 +19,11 @@ #include #include +struct 
stab_entry { + unsigned long esid_data; + unsigned long vsid_data; +}; + /* Both the segment table and SLB code uses the following cache */ #define NR_STAB_CACHE_ENTRIES 8 DEFINE_PER_CPU(long, stab_cache_ptr); Index: working-2.6/include/asm-ppc64/mmu_context.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu_context.h 2005-04-26 15:38:02.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu_context.h 2005-05-02 17:41:49.000000000 +1000 @@ -84,86 +84,4 @@ local_irq_restore(flags); } -/* VSID allocation - * =============== - * - * We first generate a 36-bit "proto-VSID". For kernel addresses this - * is equal to the ESID, for user addresses it is: - * (context << 15) | (esid & 0x7fff) - * - * The two forms are distinguishable because the top bit is 0 for user - * addresses, whereas the top two bits are 1 for kernel addresses. - * Proto-VSIDs with the top two bits equal to 0b10 are reserved for - * now. - * - * The proto-VSIDs are then scrambled into real VSIDs with the - * multiplicative hash: - * - * VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS - * where VSID_MULTIPLIER = 268435399 = 0xFFFFFC7 - * VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF - * - * This scramble is only well defined for proto-VSIDs below - * 0xFFFFFFFFF, so both proto-VSID and actual VSID 0xFFFFFFFFF are - * reserved. VSID_MULTIPLIER is prime, so in particular it is - * co-prime to VSID_MODULUS, making this a 1:1 scrambling function. - * Because the modulus is 2^n-1 we can compute it efficiently without - * a divide or extra multiply (see below). - * - * This scheme has several advantages over older methods: - * - * - We have VSIDs allocated for every kernel address - * (i.e. everything above 0xC000000000000000), except the very top - * segment, which simplifies several things. - * - * - We allow for 15 significant bits of ESID and 20 bits of - * context for user addresses. i.e. 
8T (43 bits) of address space for - * up to 1M contexts (although the page table structure and context - * allocation will need changes to take advantage of this). - * - * - The scramble function gives robust scattering in the hash - * table (at least based on some initial results). The previous - * method was more susceptible to pathological cases giving excessive - * hash collisions. - */ - -/* - * WARNING - If you change these you must make sure the asm - * implementations in slb_allocate(), do_stab_bolted and mmu.h - * (ASM_VSID_SCRAMBLE macro) are changed accordingly. - * - * You'll also need to change the precomputed VSID values in head.S - * which are used by the iSeries firmware. - */ - -static inline unsigned long vsid_scramble(unsigned long protovsid) -{ -#if 0 - /* The code below is equivalent to this function for arguments - * < 2^VSID_BITS, which is all this should ever be called - * with. However gcc is not clever enough to compute the - * modulus (2^n-1) without a second multiply. */ - return ((protovsid * VSID_MULTIPLIER) % VSID_MODULUS); -#else /* 1 */ - unsigned long x; - - x = protovsid * VSID_MULTIPLIER; - x = (x >> VSID_BITS) + (x & VSID_MODULUS); - return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS; -#endif /* 1 */ -} - -/* This is only valid for addresses >= KERNELBASE */ -static inline unsigned long get_kernel_vsid(unsigned long ea) -{ - return vsid_scramble(ea >> SID_SHIFT); -} - -/* This is only valid for user addresses (which are below 2^41) */ -static inline unsigned long get_vsid(unsigned long context, unsigned long ea) -{ - return vsid_scramble((context << USER_ESID_BITS) - | (ea >> SID_SHIFT)); -} - #endif /* __PPC64_MMU_CONTEXT_H */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/people/dgibson

From omkhar at gentoo.org Tue May 3 10:44:41 2005
From: omkhar at gentoo.org (Omkhar Arasaratnam)
Date: Mon, 02 May 2005 20:44:41 -0400
Subject: [BUG] 2.4.30 - Bring up on JS20 Fails
Message-ID: <4276C979.3020300@gentoo.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

At first the kernel wouldn't compile as it was missing a header, which I
have resolved : http://dev.gentoo.org/~omkhar/ppc64-autofs_4.patch

After including this header I was able to compile, but on bring up i see
the following:

[boot]0012 Setup Arch
pSeries_pci: this system has large bus numbers and the kernel was not
built with the patch that fixes include/linux/pci.h struct pci_bus so
number, primary, secondary and subordinate are ints.
Kernel panic: pSeries_pci: this system has large bus numbers and the
kernel was not
built with the patch that fixes include/linux/pci.h struct pci_bus so
number, primary, secondary and subordinate are ints.
In idle task - not syncing

Ideas?

- --
Omkhar Arasaratnam - Gentoo PPC64 Developer
omkhar at gentoo.org - http://dev.gentoo.org/~omkhar
Gentoo Linux / PPC64 Linux: http://ppc64.gentoo.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (MingW32)

iD8DBQFCdsl59msUWjh2lHURAm9hAKCdfkUB5p+qx8hlQvzt7PnHgaLKqACeKe8y
6kUP8tOuOF+Zgi1OxkzOXKc=
=QkrG
-----END PGP SIGNATURE-----

From benh at kernel.crashing.org Tue May 3 11:16:53 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 03 May 2005 11:16:53 +1000
Subject: [BUG] 2.4.30 - Bring up on JS20 Fails
In-Reply-To: <4276C979.3020300@gentoo.org>
References: <4276C979.3020300@gentoo.org>
Message-ID: <1115083013.6155.37.camel@gaston>

On Mon, 2005-05-02 at 20:44 -0400, Omkhar Arasaratnam wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> At first the kernel wouldn't compile as it was missing a header, which I
> have resolved : http://dev.gentoo.org/~omkhar/ppc64-autofs_4.patch
>
> After including this header I was able to compile, but on bring up i see
> the following:
>
> [boot]0012 Setup Arch
> pSeries_pci: this system has large bus numbers and the kernel was not
> built with the patch that fixes include/linux/pci.h struct pci_bus so
> number, primary, secondary and subordinate are ints.
> Kernel panic: pSeries_pci: this system has large bus numbers and the
> kernel was not
> built with the patch that fixes include/linux/pci.h struct pci_bus so
> number, primary, secondary and subordinate are ints.
> In idle task - not syncing

Why 2.4 ?

Ben.

From david at gibson.dropbear.id.au Tue May 3 11:23:43 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 3 May 2005 11:23:43 +1000
Subject: [PPC64] pgtable.h and other header cleanups
In-Reply-To: <20050503002608.GA22453@localhost.localdomain>
References: <20050503002608.GA22453@localhost.localdomain>
Message-ID: <20050503012343.GB22453@localhost.localdomain>

On Tue, May 03, 2005 at 10:26:08AM +1000, David Gibson wrote:
> Andrew, please apply.
>
> This patch started as simply removing a few never-used macros from
> asm-ppc64/pgtable.h, then kind of grew. It now makes a bunch of
> cleanups to the ppc64 low-level header files (with corresponding
> changes to .c files where necessary) such as:
> 	- Abolishing never-used macros
> 	- Eliminating multiple #defines with the same purpose
> 	- Removing pointless macros (cases where just expanding the
> 	  macro everywhere turns out clearer and more sensible)
> 	- Removing some cases where macros which could be defined in
> 	  terms of each other weren't
> 	- Moving imalloc() related definitions from pgtable.h to their
> 	  own header file (imalloc.h)
> 	- Re-arranging headers to group things more logically
> 	- Moving all VSID allocation related things to mmu.h, instead
> 	  of being split between mmu.h and mmu_context.h
> 	- Removing some reserved space for flags from the PMD - we're
> 	  not using it.

Aargh! Don't apply, patch is broken (missing imalloc.h). Grr... I
could have sworn I'd quilt added it. Fixed version coming shortly.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/people/dgibson

From anton at samba.org Tue May 3 13:13:22 2005
From: anton at samba.org (Anton Blanchard)
Date: Tue, 3 May 2005 13:13:22 +1000
Subject: [BUG] 2.4.30 - Bring up on JS20 Fails
In-Reply-To: <4276C979.3020300@gentoo.org>
References: <4276C979.3020300@gentoo.org>
Message-ID: <20050503031322.GG12682@krispykreme>

Hi,

> After including this header I was able to compile, but on bring up i see
> the following:
>
> [boot]0012 Setup Arch
> pSeries_pci: this system has large bus numbers and the kernel was not
> built with the patch that fixes include/linux/pci.h struct pci_bus so
> number, primary, secondary and subordinate are ints.
> Kernel panic: pSeries_pci: this system has large bus numbers and the
> kernel was not
> built with the patch that fixes
> include/linux/pci.h struct pci_bus so
> number, primary, secondary and subordinate are ints.

Do that and it should work :) It's to do with PCI domains and is fixed
properly in 2.6.

Anton

From david at gibson.dropbear.id.au Tue May 3 13:33:32 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 3 May 2005 13:33:32 +1000
Subject: [PPC64] pgtable.h and other header cleanups
In-Reply-To: <20050503012343.GB22453@localhost.localdomain>
References: <20050503002608.GA22453@localhost.localdomain> <20050503012343.GB22453@localhost.localdomain>
Message-ID: <20050503033332.GC22453@localhost.localdomain>

On Tue, May 03, 2005 at 11:23:43AM +1000, David Gibson wrote:
> On Tue, May 03, 2005 at 10:26:08AM +1000, David Gibson wrote:
> > Andrew, please apply.
> >
> > This patch started as simply removing a few never-used macros from
> > asm-ppc64/pgtable.h, then kind of grew. It now makes a bunch of
> > cleanups to the ppc64 low-level header files (with corresponding
> > changes to .c files where necessary) such as:
> > 	- Abolishing never-used macros
> > 	- Eliminating multiple #defines with the same purpose
> > 	- Removing pointless macros (cases where just expanding the
> > 	  macro everywhere turns out clearer and more sensible)
> > 	- Removing some cases where macros which could be defined in
> > 	  terms of each other weren't
> > 	- Moving imalloc() related definitions from pgtable.h to their
> > 	  own header file (imalloc.h)
> > 	- Re-arranging headers to group things more logically
> > 	- Moving all VSID allocation related things to mmu.h, instead
> > 	  of being split between mmu.h and mmu_context.h
> > 	- Removing some reserved space for flags from the PMD - we're
> > 	  not using it.
>
> Aargh! Don't apply, patch is broken (missing imalloc.h). Grr... I
> could have sworn I'd quilt added it. Fixed version coming shortly.

Ok, this time for sure. Andrew, please apply:

This patch started as simply removing a few never-used macros from
asm-ppc64/pgtable.h, then kind of grew. It now makes a bunch of
cleanups to the ppc64 low-level header files (with corresponding
changes to .c files where necessary) such as:
	- Abolishing never-used macros
	- Eliminating multiple #defines with the same purpose
	- Removing pointless macros (cases where just expanding the
	  macro everywhere turns out clearer and more sensible)
	- Removing some cases where macros which could be defined in
	  terms of each other weren't
	- Moving imalloc() related definitions from pgtable.h to their
	  own header file (imalloc.h)
	- Re-arranging headers to group things more logically
	- Moving all VSID allocation related things to mmu.h, instead
	  of being split between mmu.h and mmu_context.h
	- Removing some reserved space for flags from the PMD - we're
	  not using it.
	- Fix some bugs which broke compile with STRICT_MM_TYPECHECKS.

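[Editorial illustration of the last item, not part of the patch.] With
STRICT_MM_TYPECHECKS, pte_t is a struct wrapper rather than a bare
integer, so a flag mask has to be applied to the unwrapped value before
it is wrapped again with __pte(). What follows is a minimal userspace
sketch of that pattern, not the kernel headers: the struct layout is
assumed from the usual strict-typecheck idiom, the flag values and the
helpers strip_hpte_flags() and strip_hpte_flags_check() are made up for
the example, and only the names pte_t, pte_val(), __pte() and
_PAGE_HPTEFLAGS come from the patch.

```c
#include <assert.h>

/* Strict flavour (assumed layout): pte_t is a one-member struct, so
 * the compiler rejects accidental arithmetic on it. */
typedef struct { unsigned long pte; } pte_t;
#define pte_val(x)	((x).pte)
#define __pte(x)	((pte_t) { (x) })

/* Hypothetical flag values, chosen only for this example. */
#define _PAGE_PRESENT	0x001UL
#define _PAGE_HPTEFLAGS	0xf000UL

/* Hypothetical helper: clear the hash-PTE bookkeeping bits. */
static inline pte_t strip_hpte_flags(pte_t pte)
{
	/*
	 * This form does not compile with the struct type, because '&'
	 * cannot take a struct operand:
	 *	return __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS;
	 * Mask the raw value first, then wrap it.
	 */
	return __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
}

/* Small self-check of the helper above. */
static inline void strip_hpte_flags_check(void)
{
	pte_t in = __pte(0x3000UL | _PAGE_PRESENT);

	assert(pte_val(strip_hpte_flags(in)) == _PAGE_PRESENT);
}
```

When pte_t is a plain integer type, both forms compile and give the same
value, which is presumably how a misplaced parenthesis of this kind can
go unnoticed until a strict build is attempted.
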
Signed-off-by: David Gibson Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-02 08:57:22.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-03 12:56:34.000000000 +1000 @@ -17,16 +17,6 @@ #include -/* PMD_SHIFT determines what a second-level page table entry can map */ -#define PMD_SHIFT (PAGE_SHIFT + PAGE_SHIFT - 3) -#define PMD_SIZE (1UL << PMD_SHIFT) -#define PMD_MASK (~(PMD_SIZE-1)) - -/* PGDIR_SHIFT determines what a third-level page table entry can map */ -#define PGDIR_SHIFT (PAGE_SHIFT + (PAGE_SHIFT - 3) + (PAGE_SHIFT - 2)) -#define PGDIR_SIZE (1UL << PGDIR_SHIFT) -#define PGDIR_MASK (~(PGDIR_SIZE-1)) - /* * Entries per page directory level. The PTE level must use a 64b record * for each page table entry. The PMD and PGD level use a 32b record for @@ -40,40 +30,30 @@ #define PTRS_PER_PMD (1 << PMD_INDEX_SIZE) #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE) -#define USER_PTRS_PER_PGD (1024) -#define FIRST_USER_ADDRESS 0 +/* PMD_SHIFT determines what a second-level page table entry can map */ +#define PMD_SHIFT (PAGE_SHIFT + PTE_INDEX_SIZE) +#define PMD_SIZE (1UL << PMD_SHIFT) +#define PMD_MASK (~(PMD_SIZE-1)) -#define EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ - PGD_INDEX_SIZE + PAGE_SHIFT) +/* PGDIR_SHIFT determines what a third-level page table entry can map */ +#define PGDIR_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +#define PGDIR_SIZE (1UL << PGDIR_SHIFT) +#define PGDIR_MASK (~(PGDIR_SIZE-1)) + +#define FIRST_USER_ADDRESS 0 /* * Size of EA range mapped by our pagetables. 
*/ -#define PGTABLE_EA_BITS 41 -#define PGTABLE_EA_MASK ((1UL<> PMD_TO_PTEPAGE_SHIFT)) +#define pmd_page_kernel(pmd) (__bpn_to_ba(pmd_val(pmd))) #define pmd_page(pmd) virt_to_page(pmd_page_kernel(pmd)) #define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (__ba_to_bpn(pmdp))) @@ -266,8 +242,6 @@ /* to find an entry in the ioremap page-table-directory */ #define pgd_offset_i(address) (ioremap_pgd + pgd_index(address)) -#define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT)) - /* * The following only work if pte_present() is true. * Undefined behaviour if not.. @@ -442,7 +416,7 @@ pte_clear(mm, addr, ptep); flush_tlb_pending(); } - *ptep = __pte(pte_val(pte)) & ~_PAGE_HPTEFLAGS; + *ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS); } /* Set the dirty and/or accessed bits atomically in a linux PTE, this @@ -487,18 +461,13 @@ extern unsigned long ioremap_bot, ioremap_base; -#define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT) -#define KERNEL_PGD_PTRS (PTRS_PER_PGD-USER_PGD_PTRS) - -#define pte_ERROR(e) \ - printk("%s:%d: bad pte %016lx.\n", __FILE__, __LINE__, pte_val(e)) #define pmd_ERROR(e) \ printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) #define pgd_ERROR(e) \ printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) -extern pgd_t swapper_pg_dir[1024]; -extern pgd_t ioremap_dir[1024]; +extern pgd_t swapper_pg_dir[]; +extern pgd_t ioremap_dir[]; extern void paging_init(void); @@ -540,43 +509,11 @@ */ #define kern_addr_valid(addr) (1) -#define io_remap_page_range(vma, vaddr, paddr, size, prot) \ - remap_pfn_range(vma, vaddr, (paddr) >> PAGE_SHIFT, size, prot) - #define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \ remap_pfn_range(vma, vaddr, pfn, size, prot) -#define MK_IOSPACE_PFN(space, pfn) (pfn) -#define GET_IOSPACE(pfn) 0 -#define GET_PFN(pfn) (pfn) - void pgtable_cache_init(void); -extern void hpte_init_native(void); -extern void hpte_init_lpar(void); -extern void hpte_init_iSeries(void); - -/* imalloc region types */ -#define IM_REGION_UNUSED 0x1 
-#define IM_REGION_SUBSET 0x2 -#define IM_REGION_EXISTS 0x4 -#define IM_REGION_OVERLAP 0x8 -#define IM_REGION_SUPERSET 0x10 - -extern struct vm_struct * im_get_free_area(unsigned long size); -extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, - int region_type); -unsigned long im_free(void *addr); - -extern long pSeries_lpar_hpte_insert(unsigned long hpte_group, - unsigned long va, unsigned long prpn, - int secondary, unsigned long hpteflags, - int bolted, int large); - -extern long native_hpte_insert(unsigned long hpte_group, unsigned long va, - unsigned long prpn, int secondary, - unsigned long hpteflags, int bolted, int large); - /* * find_linux_pte returns the address of a linux pte for a given * effective address and directory. If not found, it returns zero. Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-02 08:57:22.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-03 13:08:06.000000000 +1000 @@ -23,7 +23,6 @@ #define PAGE_SHIFT 12 #define PAGE_SIZE (ASM_CONST(1) << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE-1)) -#define PAGE_OFFSET_MASK (PAGE_SIZE-1) #define SID_SHIFT 28 #define SID_MASK 0xfffffffffUL @@ -85,9 +84,6 @@ /* align addr on a size boundary - adjust address up if needed */ #define _ALIGN(addr,size) _ALIGN_UP(addr,size) -/* to align the pointer to the (next) double word boundary */ -#define DOUBLEWORD_ALIGN(addr) _ALIGN(addr,sizeof(unsigned long)) - /* to align the pointer to the (next) page boundary */ #define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE) @@ -100,7 +96,6 @@ #define REGION_SIZE 4UL #define REGION_SHIFT 60UL #define REGION_MASK (((1UL<>REGION_SHIFT) -#define VMALLOC_REGION_ID (VMALLOCBASE>>REGION_SHIFT) -#define KERNEL_REGION_ID (KERNELBASE>>REGION_SHIFT) +#define IO_REGION_ID (IOREGIONBASE >> REGION_SHIFT) +#define VMALLOC_REGION_ID (VMALLOCBASE >> REGION_SHIFT) +#define 
KERNEL_REGION_ID (KERNELBASE >> REGION_SHIFT) #define USER_REGION_ID (0UL) -#define REGION_ID(X) (((unsigned long)(X))>>REGION_SHIFT) +#define REGION_ID(ea) (((unsigned long)(ea)) >> REGION_SHIFT) -#define __bpn_to_ba(x) ((((unsigned long)(x))<> PAGE_SHIFT) #define __va(x) ((void *)((unsigned long)(x) + KERNELBASE)) Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-03 12:56:34.000000000 +1000 @@ -14,6 +14,7 @@ #include #include #include +#include static DECLARE_MUTEX(imlist_sem); struct vm_struct * imlist = NULL; @@ -23,11 +24,11 @@ unsigned long addr; struct vm_struct **p, *tmp; - addr = IMALLOC_START; + addr = ioremap_bot; for (p = &imlist; (tmp = *p) ; p = &tmp->next) { if (size + addr < (unsigned long) tmp->addr) break; - if ((unsigned long)tmp->addr >= IMALLOC_START) + if ((unsigned long)tmp->addr >= ioremap_bot) addr = tmp->size + (unsigned long) tmp->addr; if (addr > IMALLOC_END-size) return 1; Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-03 12:56:34.000000000 +1000 @@ -298,24 +298,23 @@ int local = 0; cpumask_t tmp; + if ((ea & ~REGION_MASK) > EADDR_MASK) + return 1; + switch (REGION_ID(ea)) { case USER_REGION_ID: user_region = 1; mm = current->mm; - if ((ea > USER_END) || (! mm)) + if (! 
mm) return 1; vsid = get_vsid(mm->context.id, ea); break; case IO_REGION_ID: - if (ea > IMALLOC_END) - return 1; mm = &ioremap_mm; vsid = get_kernel_vsid(ea); break; case VMALLOC_REGION_ID: - if (ea > VMALLOC_END) - return 1; mm = &init_mm; vsid = get_kernel_vsid(ea); break; @@ -362,7 +361,7 @@ unsigned long vsid, vpn, va, hash, secondary, slot; unsigned long huge = pte_huge(pte); - if ((ea >= USER_START) && (ea <= USER_END)) + if (ea < KERNELBASE) vsid = get_vsid(context, ea); else vsid = get_kernel_vsid(ea); Index: working-2.6/arch/ppc64/mm/hash_native.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_native.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_native.c 2005-05-03 12:56:34.000000000 +1000 @@ -320,8 +320,7 @@ j = 0; for (i = 0; i < number; i++) { - if ((batch->addr[i] >= USER_START) && - (batch->addr[i] <= USER_END)) + if (batch->addr[i] < KERNELBASE) vsid = get_vsid(context, batch->addr[i]); else vsid = get_kernel_vsid(batch->addr[i]); Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-02 08:57:20.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-03 12:56:34.000000000 +1000 @@ -64,6 +64,7 @@ #include #include #include +#include int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-04-26 15:38:02.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2005-05-03 12:56:34.000000000 +1000 @@ -15,19 +15,10 @@ #include #include -#include -#ifndef __ASSEMBLY__ - -/* Time to allow for more things here */ -typedef unsigned long mm_context_id_t; -typedef struct { - mm_context_id_t id; -#ifdef CONFIG_HUGETLB_PAGE - pgd_t *huge_pgdir; - u16 htlb_segs; /* bitmask */ -#endif -} mm_context_t; +/* + * 
Segment table + */ #define STE_ESID_V 0x80 #define STE_ESID_KS 0x20 @@ -36,15 +27,48 @@ #define STE_VSID_SHIFT 12 -struct stab_entry { - unsigned long esid_data; - unsigned long vsid_data; -}; +/* Location of cpu0's segment table */ +#define STAB0_PAGE 0x9 +#define STAB0_PHYS_ADDR (STAB0_PAGE<> VSID_BITS) + (x & VSID_MODULUS); + return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS; +#endif /* 1 */ +} + +/* This is only valid for addresses >= KERNELBASE */ +static inline unsigned long get_kernel_vsid(unsigned long ea) +{ + return vsid_scramble(ea >> SID_SHIFT); +} + +/* This is only valid for user addresses (which are below 2^41) */ +static inline unsigned long get_vsid(unsigned long context, unsigned long ea) +{ + return vsid_scramble((context << USER_ESID_BITS) + | (ea >> SID_SHIFT)); +} + +#endif /* __ASSEMBLY */ + #endif /* _PPC64_MMU_H_ */ Index: working-2.6/arch/ppc64/mm/stab.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/stab.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/mm/stab.c 2005-05-03 12:56:34.000000000 +1000 @@ -19,6 +19,11 @@ #include #include +struct stab_entry { + unsigned long esid_data; + unsigned long vsid_data; +}; + /* Both the segment table and SLB code uses the following cache */ #define NR_STAB_CACHE_ENTRIES 8 DEFINE_PER_CPU(long, stab_cache_ptr); Index: working-2.6/include/asm-ppc64/mmu_context.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu_context.h 2005-04-26 15:38:02.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu_context.h 2005-05-03 12:56:34.000000000 +1000 @@ -84,86 +84,4 @@ local_irq_restore(flags); } -/* VSID allocation - * =============== - * - * We first generate a 36-bit "proto-VSID". 
For kernel addresses this - * is equal to the ESID, for user addresses it is: - * (context << 15) | (esid & 0x7fff) - * - * The two forms are distinguishable because the top bit is 0 for user - * addresses, whereas the top two bits are 1 for kernel addresses. - * Proto-VSIDs with the top two bits equal to 0b10 are reserved for - * now. - * - * The proto-VSIDs are then scrambled into real VSIDs with the - * multiplicative hash: - * - * VSID = (proto-VSID * VSID_MULTIPLIER) % VSID_MODULUS - * where VSID_MULTIPLIER = 268435399 = 0xFFFFFC7 - * VSID_MODULUS = 2^36-1 = 0xFFFFFFFFF - * - * This scramble is only well defined for proto-VSIDs below - * 0xFFFFFFFFF, so both proto-VSID and actual VSID 0xFFFFFFFFF are - * reserved. VSID_MULTIPLIER is prime, so in particular it is - * co-prime to VSID_MODULUS, making this a 1:1 scrambling function. - * Because the modulus is 2^n-1 we can compute it efficiently without - * a divide or extra multiply (see below). - * - * This scheme has several advantages over older methods: - * - * - We have VSIDs allocated for every kernel address - * (i.e. everything above 0xC000000000000000), except the very top - * segment, which simplifies several things. - * - * - We allow for 15 significant bits of ESID and 20 bits of - * context for user addresses. i.e. 8T (43 bits) of address space for - * up to 1M contexts (although the page table structure and context - * allocation will need changes to take advantage of this). - * - * - The scramble function gives robust scattering in the hash - * table (at least based on some initial results). The previous - * method was more susceptible to pathological cases giving excessive - * hash collisions. - */ - -/* - * WARNING - If you change these you must make sure the asm - * implementations in slb_allocate(), do_stab_bolted and mmu.h - * (ASM_VSID_SCRAMBLE macro) are changed accordingly. - * - * You'll also need to change the precomputed VSID values in head.S - * which are used by the iSeries firmware. 
- */ - -static inline unsigned long vsid_scramble(unsigned long protovsid) -{ -#if 0 - /* The code below is equivalent to this function for arguments - * < 2^VSID_BITS, which is all this should ever be called - * with. However gcc is not clever enough to compute the - * modulus (2^n-1) without a second multiply. */ - return ((protovsid * VSID_MULTIPLIER) % VSID_MODULUS); -#else /* 1 */ - unsigned long x; - - x = protovsid * VSID_MULTIPLIER; - x = (x >> VSID_BITS) + (x & VSID_MODULUS); - return (x + ((x+1) >> VSID_BITS)) & VSID_MODULUS; -#endif /* 1 */ -} - -/* This is only valid for addresses >= KERNELBASE */ -static inline unsigned long get_kernel_vsid(unsigned long ea) -{ - return vsid_scramble(ea >> SID_SHIFT); -} - -/* This is only valid for user addresses (which are below 2^41) */ -static inline unsigned long get_vsid(unsigned long context, unsigned long ea) -{ - return vsid_scramble((context << USER_ESID_BITS) - | (ea >> SID_SHIFT)); -} - #endif /* __PPC64_MMU_CONTEXT_H */ Index: working-2.6/include/asm-ppc64/imalloc.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ working-2.6/include/asm-ppc64/imalloc.h 2005-05-03 12:56:34.000000000 +1000 @@ -0,0 +1,24 @@ +#ifndef _PPC64_IMALLOC_H +#define _PPC64_IMALLOC_H + +/* + * Define the address range of the imalloc VM area. 
+ */ +#define PHBS_IO_BASE IOREGIONBASE +#define IMALLOC_BASE (IOREGIONBASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ +#define IMALLOC_END (IOREGIONBASE + EADDR_MASK) + + +/* imalloc region types */ +#define IM_REGION_UNUSED 0x1 +#define IM_REGION_SUBSET 0x2 +#define IM_REGION_EXISTS 0x4 +#define IM_REGION_OVERLAP 0x8 +#define IM_REGION_SUPERSET 0x10 + +extern struct vm_struct * im_get_free_area(unsigned long size); +extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, + int region_type); +unsigned long im_free(void *addr); + +#endif /* _PPC64_IMALLOC_H */ Index: working-2.6/arch/ppc64/kernel/pci.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/pci.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/pci.c 2005-05-03 12:56:34.000000000 +1000 @@ -438,7 +438,7 @@ int i; if (page_is_ram(offset >> PAGE_SHIFT)) - return prot; + return __pgprot(prot); prot |= _PAGE_NO_CACHE | _PAGE_GUARDED; -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From paulus at samba.org Tue May 3 14:14:00 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 3 May 2005 14:14:00 +1000 Subject: [PPC64] pgtable.h and other header cleanups In-Reply-To: <20050503033332.GC22453@localhost.localdomain> References: <20050503002608.GA22453@localhost.localdomain> <20050503012343.GB22453@localhost.localdomain> <20050503033332.GC22453@localhost.localdomain> Message-ID: <17014.64136.49697.910612@cargo.ozlabs.ibm.com> David Gibson writes: > This patch started as simply removing a few never-used macros from > asm-ppc64/pgtable.h, then kind of grew. 
It now makes a bunch of > cleanups to the ppc64 low-level header files (with corresponding > changes to .c files where necessary) such as: > - Abolishing never-used macros > - Eliminating multiple #defines with the same purpose > - Removing pointless macros (cases where just expanding the > macro everywhere turns out clearer and more sensible) > - Removing some cases where macros which could be defined in > terms of each other weren't > - Moving imalloc() related definitions from pgtable.h to their > own header file (imalloc.h) > - Re-arranging headers to group things more logically > - Moving all VSID allocation related things to mmu.h, instead > of being split between mmu.h and mmu_context.h > - Removing some reserved space for flags from the PMD - we're > not using it. > - Fix some bugs which broke compile with STRICT_MM_TYPECHECKS. > > Signed-off-by: David Gibson Acked-by: Paul Mackerras From sonny at burdell.org Tue May 3 15:02:12 2005 From: sonny at burdell.org (Sonny Rao) Date: Tue, 3 May 2005 01:02:12 -0400 Subject: 2.6.11 e1000 EEH MMIO failure Message-ID: <20050503050212.GA22395@kevlar.burdell.org> I'm guessing this means a bad e1000 card but I wanted to check with the experts. The box is a p690 w/ some expansion drawers attached, and is running a pretty-much stock 2.6.11 kernel, system is booted in SMP mode. Could it be related to e1000 errata "23" mentioned earlier on the mailing list? Here are the messages: Intel(R) PRO/1000 Network Driver - version 5.6.10.1-k2 Copyright (c) 1999-2004 Intel Corporation. 
e1000: eth3: e1000_probe: Intel(R) PRO/1000 Network Connection PCI: Enabling device: (000a:01:01.0), cmd 143 e1000: eth4: e1000_probe: Intel(R) PRO/1000 Network Connection PCI: Enabling device: (000a:01:01.1), cmd 143 e1000: eth5: e1000_probe: Intel(R) PRO/1000 Network Connection PCI: Enabling device: (000e:21:01.0), cmd 143 e1000: eth6: e1000_probe: Intel(R) PRO/1000 Network Connection PCI: Enabling device: (0011:21:01.0), cmd 143 e1000: eth7: e1000_probe: Intel(R) PRO/1000 Network Connection RTAS: event: 15, Type: Retry, Severity: 2 EEH: MMIO failure (2) on device: ethernet /pci at 3ffe7f0a000/pci at 2,2/ethernet at 1 Call Trace: [c00000103873a910] [c000000000631630] 0xc000000000631630 (unreliable) [c00000103873a990] [c000000000036a6c] .eeh_dn_check_failure+0x2e4/0x334 [c00000103873aa70] [c000000000036c20] .eeh_check_failure+0x164/0x1b0 [c00000103873ab10] [d0000000002a6b04] .e1000_check_for_link+0x5ac/0x664 [e1000] [c00000103873abd0] [d00000000029a5e0] .e1000_watchdog+0x48/0x79c [e1000] [c00000103873ac90] [c00000000005f558] .run_timer_softirq+0x15c/0x280 [c00000103873ad60] [c00000000005a3c4] .__do_softirq+0xdc/0x1c8 [c00000103873ae20] [c00000000005a538] .do_softirq+0x88/0x8c [c00000103873aeb0] [c000000000011520] .timer_interrupt+0x294/0x35c [c00000103873afb0] [c00000000000a2b8] decrementer_common+0xb8/0x100 --- Exception: 901 at ._spin_unlock_irqrestore+0x1c/0x28 LR = .rtas_call+0x1a4/0x2b4 [c00000103873b2a0] [c0000000001e8128] .snprintf+0x30/0x44 (unreliable) [c00000103873b2e0] [c00000000003421c] .rtas_call+0x110/0x2b4 [c00000103873b3a0] [c0000000000366ec] .read_slot_reset_state+0x94/0xac [c00000103873b420] [c000000000036890] .eeh_dn_check_failure+0x108/0x334 [c00000103873b500] [c000000000036c20] .eeh_check_failure+0x164/0x1b0 [c00000103873b5a0] [d00000000029f174] .e1000_up+0x404/0x40c [e1000] [c00000103873b650] [d00000000029f5cc] .e1000_open+0x54/0xc0 [e1000] [c00000103873b6e0] [c0000000002fec84] .dev_open+0x118/0x13c [c00000103873b780] [c0000000002fcef8] 
.dev_change_flags+0x19c/0x1d4 [c00000103873b820] [c000000000357878] .devinet_ioctl+0x66c/0x820 [c00000103873b930] [c000000000358794] .inet_ioctl+0x260/0x2e0 [c00000103873b9c0] [c0000000002f03a0] .sock_ioctl+0x28c/0x418 [c00000103873ba70] [c0000000000c7564] .do_ioctl+0x124/0x13c [c00000103873bb10] [c0000000000c777c] .vfs_ioctl+0x200/0x4e0 [c00000103873bbc0] [c0000000000c7ab8] .sys_ioctl+0x5c/0xa4 [c00000103873bc70] [c00000000001e8c0] .dev_ifsioc+0x8c/0x348 [c00000103873bd50] [c0000000000e7d24] .compat_sys_ioctl+0x46c/0x4c4 [c00000103873be30] [c00000000000d500] syscall_exit+0x0/0x18 e1000: eth7: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex RTAS: event: 16, Type: Retry, Severity: 2 EEH: MMIO failure (2) on device: ethernet /pci at 3ffe7f0a000/pci at 2,2/ethernet at 1 Call Trace: [c00000103873b3a0] [c000000000631630] 0xc000000000631630 (unreliable) [c00000103873b420] [c000000000036a6c] .eeh_dn_check_failure+0x2e4/0x334 [c00000103873b500] [c000000000036c20] .eeh_check_failure+0x164/0x1b0 [c00000103873b5a0] [d00000000029f174] .e1000_up+0x404/0x40c [e1000] [c00000103873b650] [d00000000029f5cc] .e1000_open+0x54/0xc0 [e1000] [c00000103873b6e0] [c0000000002fec84] .dev_open+0x118/0x13c [c00000103873b780] [c0000000002fcef8] .dev_change_flags+0x19c/0x1d4 [c00000103873b820] [c000000000357878] .devinet_ioctl+0x66c/0x820 [c00000103873b930] [c000000000358794] .inet_ioctl+0x260/0x2e0 [c00000103873b9c0] [c0000000002f03a0] .sock_ioctl+0x28c/0x418 [c00000103873ba70] [c0000000000c7564] .do_ioctl+0x124/0x13c [c00000103873bb10] [c0000000000c777c] .vfs_ioctl+0x200/0x4e0 [c00000103873bbc0] [c0000000000c7ab8] .sys_ioctl+0x5c/0xa4 [c00000103873bc70] [c00000000001e8c0] .dev_ifsioc+0x8c/0x348 [c00000103873bd50] [c0000000000e7d24] .compat_sys_ioctl+0x46c/0x4c4 [c00000103873be30] [c00000000000d500] syscall_exit+0x0/0x18 EEH: MMIO failure (2), notifiying device 0011:21:01.0 EEH: MMIO failure (2), notifiying device 0011:21:01.0 PCI: Enabling device: (0014:01:01.0), cmd 143 e1000: eth8: 
e1000_probe: Intel(R) PRO/1000 Network Connection PCI: Enabling device: (0014:01:01.1), cmd 143 e1000: eth9: e1000_probe: Intel(R) PRO/1000 Network Connection PCI: Enabling device: (0017:01:01.0), cmd 143 e1000: eth10: e1000_probe: Intel(R) PRO/1000 Network Connection Sonny From benh at kernel.crashing.org Tue May 3 15:34:58 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 03 May 2005 15:34:58 +1000 Subject: [PATCH] Fix gcc 4.0 vs CONFIG_ALTIVEC Message-ID: <1115098498.6030.82.camel@gaston> Hi ! gcc-4.0 generates altivec code implicitely when -mcpu insidates an altivec capable CPU which is not suitable for the kenrel. However, we used to set -mcpu=970 when CONFIG_ALTIVEC was set because a gcc-3.x bug prevented from using -maltivec along with -mcpu=power4, thus prevented building the RAID6 altivec code. This patch fixes all of this by testing for the gcc version. If 4.0 or later, just normally use -mcpu=power4 and let the RAID6 code add -maltivec to the few files it needs to be compiled with altivec support. For 3.x, we still use -mcpu=970 to work around the above problem, which is fine as 3.x will never implicitely generate altivec code. The Makefile hackery may not be the most lovely, I welcome anybody more skilled than me to improve it. 
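The selection described above can be restated as a small decision function, which may be easier to read than nested ifeq blocks. This is only a C paraphrase of the Makefile conditionals in the patch, not something the build uses. pick_cpu_flag and its integer arguments are hypothetical, and the version encoding (major*100 + minor, so 4.0 is 400) is an assumption about what the 0400 compared against cc-version stands for. The sketch also ignores cc-option, which in the real Makefile checks that the compiler accepts the flag at all.

```c
/* Hypothetical helper mirroring the ifeq/else nesting in the patch.
 * gcc_version is major*100 + minor, e.g. 304 for 3.4 and 400 for 4.0
 * (written in decimal on purpose: a literal 0400 would be octal in C).
 * power4_only and altivec stand in for CONFIG_POWER4_ONLY=y and
 * CONFIG_ALTIVEC=y. */
static const char *pick_cpu_flag(int gcc_version, int power4_only, int altivec)
{
	/* Corresponds to GCC_BROKEN_VEC: gcc 3.x could not combine
	 * -maltivec with -mcpu=power4. */
	int gcc_broken_vec = gcc_version < 400;

	if (!power4_only)
		return "-mtune=power4";
	if (altivec && gcc_broken_vec)
		return "-mcpu=970";	/* workaround kept for 3.x only */
	return "-mcpu=power4";
}
```

With these inputs, only the combination of POWER4-only, AltiVec enabled and a pre-4.0 compiler still gets -mcpu=970; every other POWER4-only build gets -mcpu=power4, and the RAID6 code adds -maltivec for its own files.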
Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/Makefile =================================================================== --- linux-work.orig/arch/ppc64/Makefile 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc64/Makefile 2005-05-03 14:38:43.000000000 +1000 @@ -56,13 +56,20 @@ CFLAGS += -msoft-float -pipe -mminimal-toc -mtraceback=none \ -mcall-aixdesc +GCC_VERSION := $(call cc-version) +GCC_BROKEN_VEC := $(shell if [ $(GCC_VERSION) -lt 0400 ] ; then echo "y"; fi ;) + ifeq ($(CONFIG_POWER4_ONLY),y) ifeq ($(CONFIG_ALTIVEC),y) +ifeq ($(GCC_BROKEN_VEC),y) CFLAGS += $(call cc-option,-mcpu=970) else CFLAGS += $(call cc-option,-mcpu=power4) endif else + CFLAGS += $(call cc-option,-mcpu=power4) +endif +else CFLAGS += $(call cc-option,-mtune=power4) endif From benh at kernel.crashing.org Tue May 3 16:08:25 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 03 May 2005 16:08:25 +1000 Subject: [PATCH] Fix gcc 4.0 vs CONFIG_ALTIVEC In-Reply-To: <1115098498.6030.82.camel@gaston> References: <1115098498.6030.82.camel@gaston> Message-ID: <1115100505.6030.91.camel@gaston> On Tue, 2005-05-03 at 15:35 +1000, Benjamin Herrenschmidt wrote: > Hi ! > > gcc-4.0 generates altivec code implicitely when -mcpu insidates an > altivec capable CPU which is not suitable for the kenrel. Damn ! I should stay away from a keyboard today ! Oh well, at least the patch itself looks ok. Ben. From arnd at arndb.de Tue May 3 19:46:03 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 3 May 2005 11:46:03 +0200 Subject: [PATCH 1/2] ppc64: fix read/write on large /dev/nvram In-Reply-To: <7845758a806ed6769cea59a9df344d39@bga.com> References: <7845758a806ed6769cea59a9df344d39@bga.com> Message-ID: <200505031146.04824.arnd@arndb.de> On Maandag 02 Mai 2005 08:43, Milton Miller wrote: > On Fri Apr 22 16:49:59 EST 2005, Arnd wrote a patch with the following > lines (among several others). 
> > - len = ppc_md.nvram_read(tmp_buffer, count, ppos); > + ret = ppc_md.nvram_read(tmp, count, ppos); > > - len = ppc_md.nvram_write(tmp_buffer, count, ppos); > + ret = ppc_md.nvram_read(tmp, count, ppos); > > > Even though I am just scanning, I am guessing this is not quite right. Good catch. I only tested the read path because I did not want to mess with the contents of the nvram. I'll do a new patch when I come back to Germany, unless someone else (Utz?) does one first. Arnd <>< From omkhar at gentoo.org Wed May 4 02:27:51 2005 From: omkhar at gentoo.org (Omkhar Arasaratnam) Date: Tue, 03 May 2005 12:27:51 -0400 Subject: [BUG] 2.4.30 - Bring up on JS20 Fails In-Reply-To: <20050503031322.GG12682@krispykreme> References: <4276C979.3020300@gentoo.org> <20050503031322.GG12682@krispykreme> Message-ID: <4277A687.8010806@gentoo.org> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Anton Blanchard wrote: > > Hi, > >> After including this header I was able to compile, but on bring >> up i see the following: >> >> [boot]0012 Setup Arch pSeries_pci: this system has large bus >> numbers and the kernel was not built with the patch that fixes >> include/linux/pci.h struct pci_bus so number, primary, secondary >> and subordinate are ints. Kernel panic: pSeries_pci: this system >> has large bus numbers and the kernel was not built with the patch >> that fixes > > >> include/linux/pci.h struct pci_bus so number, primary, secondary >> and subordinate are ints. > > > Do that and it should work :) Its to do with PCI domains and is > fixed properly in 2.6. 
> > Anton > Long story short - someone asked - I'll try and quantify thier requirements but thats about it for now - -- Omkhar Arasaratnam - Gentoo PPC64 Developer omkhar at gentoo.org - http://dev.gentoo.org/~omkhar Gentoo Linux / PPC64 Linux: http://ppc64.gentoo.org -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.0 (MingW32) iD8DBQFCd6aH9msUWjh2lHURArUrAJ0afKK091NgJV/3J9TbmTFreT+gLgCghO9B sh6ijo7Mmkg28spwNR06MvY= =Wq8s -----END PGP SIGNATURE----- From sonny at burdell.org Wed May 4 04:17:47 2005 From: sonny at burdell.org (Sonny Rao) Date: Tue, 3 May 2005 14:17:47 -0400 Subject: 2.6.11 e1000 EEH MMIO failure In-Reply-To: <20050503050212.GA22395@kevlar.burdell.org> References: <20050503050212.GA22395@kevlar.burdell.org> Message-ID: <20050503181747.GB7870@kevlar.burdell.org> On Tue, May 03, 2005 at 01:02:12AM -0400, Sonny Rao wrote: > I'm guessing this means a bad e1000 card but I wanted to check with > the experts. The box is a p690 w/ some expansion drawers attached, > and is running a pretty-much stock 2.6.11 kernel, system is booted in > SMP mode. > > Could it be related to e1000 errata "23" mentioned earlier on the > mailing list? > This little bugger is causing a lot of spew into my logs, is there a way to tell EEH to just offline that PCI device ? Isn't that what it's supposed to do? Is there a PCI hotplug FAQ or README somewhere that I can read (and stop posting this crap to the list :) ) Thanks, Sonny From linas at austin.ibm.com Wed May 4 08:46:32 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Tue, 3 May 2005 17:46:32 -0500 Subject: 2.6.11 e1000 EEH MMIO failure In-Reply-To: <20050503050212.GA22395@kevlar.burdell.org> References: <20050503050212.GA22395@kevlar.burdell.org> Message-ID: <20050503224632.GF11745@austin.ibm.com> Recent e1000 code has some new kind of whiz-bang watchdog timer code that is causing the device to DMA off into hyperspace, thus triggering the EEH code. It's not clear to me if the 2.6.11 kernel has this code. 
Am cc'ing two people who should know.... --linas On Tue, May 03, 2005 at 01:02:12AM -0400, Sonny Rao was heard to remark: > I'm guessing this means a bad e1000 card but I wanted to check with > the experts. The box is a p690 w/ some expansion drawers attached, > and is running a pretty-much stock 2.6.11 kernel, system is booted in > SMP mode. > > Could it be related to e1000 errata "23" mentioned earlier on the > mailing list?
From linas at austin.ibm.com Wed May 4 08:55:08 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Tue, 3 May 2005 17:55:08 -0500 Subject: 2.6.11 e1000 EEH MMIO failure In-Reply-To: <20050503181747.GB7870@kevlar.burdell.org> References: <20050503050212.GA22395@kevlar.burdell.org> <20050503181747.GB7870@kevlar.burdell.org> Message-ID: <20050503225508.GG11745@austin.ibm.com> On Tue, May 03, 2005 at 02:17:47PM -0400, Sonny Rao was heard to remark: > > This little bugger is causing a lot of spew into my logs, is there a > way to tell EEH to just offline that PCI device ? Isn't that what > it's supposed to do?
Is there a PCI hotplug FAQ or README somewhere > that I can read (and stop posting this crap to the list :) ) You can prevent it from panicing by setting "panic_on_oops" to 0 echo 0 > /proc/sys/kernel/panic_on_oops Unfortunately, there is no boot-prompt option for this; there should be a __setup(panic_on_oops) added to kernel/panic.c As to actually recovering from that error-- you might try applying one of the earlier posted EEH patches; it should work. These earlier patches aren't in the mainline kernel because they have deficiencies. I'm supposed to be re-writing the code to make an EEH patch that is generally acceptable as a real patch, but am currently snowed under with other activities. --linas From sonny at burdell.org Wed May 4 10:29:03 2005 From: sonny at burdell.org (Sonny Rao) Date: Tue, 3 May 2005 20:29:03 -0400 Subject: 2.6.11 e1000 EEH MMIO failure In-Reply-To: <20050503224632.GF11745@austin.ibm.com> References: <20050503050212.GA22395@kevlar.burdell.org> <20050503224632.GF11745@austin.ibm.com> Message-ID: <20050504002903.GA11855@kevlar.burdell.org> On Tue, May 03, 2005 at 05:46:32PM -0500, Linas Vepstas wrote: > > Recent e1000 code has some new kind of whiz-bang watchdog timer > code that is causing the device to DMA off into hyperspace, > thus triggering the EEH code. It's not clear to me if the > 2.6.11 kernel has this code. > > Am cc'ing two people who should know.... > > --linas > Well that machine has other e1000 cards in it that aren't doing this, so I'm thinking it really is bad hardware in my case. 
If you want this card for testing EEH code or something, I just found it and ripped it out earlier today, and you can have it :-) Sonny From sonny at burdell.org Wed May 4 10:33:05 2005 From: sonny at burdell.org (Sonny Rao) Date: Tue, 3 May 2005 20:33:05 -0400 Subject: 2.6.11 e1000 EEH MMIO failure In-Reply-To: <20050503225508.GG11745@austin.ibm.com> References: <20050503050212.GA22395@kevlar.burdell.org> <20050503181747.GB7870@kevlar.burdell.org> <20050503225508.GG11745@austin.ibm.com> Message-ID: <20050504003305.GB11855@kevlar.burdell.org> On Tue, May 03, 2005 at 05:55:08PM -0500, Linas Vepstas wrote: > On Tue, May 03, 2005 at 02:17:47PM -0400, Sonny Rao was heard to remark: > > > > This little bugger is causing a lot of spew into my logs, is there a > > way to tell EEH to just offline that PCI device ? Isn't that what > > it's supposed to do? Is there a PCI hotplug FAQ or README somewhere > > that I can read (and stop posting this crap to the list :) ) > > You can prevent it from panicing by setting "panic_on_oops" to 0 > echo 0 > /proc/sys/kernel/panic_on_oops > > Unfortunately, there is no boot-prompt option for this; > there should be a __setup(panic_on_oops) added to kernel/panic.c Hmm okay, so it isn't actually causing a panic in my case, which I think is good mind you :) I didn't actually try and use it though, it was just in that machine among other e1000s. > As to actually recovering from that error-- you might try applying > one of the earlier posted EEH patches; it should work. These earlier > patches aren't in the mainline kernel because they have deficiencies. > > I'm supposed to be re-writing the code to make an EEH patch that is > generally acceptable as a real patch, but am currently snowed under > with other activities. Ah okay cool, so in the future Linux will be able to smartly handle it, very nice. 
Unfortunately I can't really test your patch because several other people need to use the machine which is normally partitioned up (and that particular device is left out of any LPAR config). I just happened to boot the full-system partition to do some tests and noticed the problem. Again, if someone wants to do something with that card, let me know, otherwise I'm going to toss it out. Sonny From anton at samba.org Wed May 4 14:37:45 2005 From: anton at samba.org (Anton Blanchard) Date: Wed, 4 May 2005 14:37:45 +1000 Subject: [PATCH] remove io_page_mask Message-ID: <20050504043745.GJ13590@krispykreme> Hi Jake, I found an issue with the io_page_mask code when pci_probe_only is not set (we don't initialise io_page_mask and bad things happen). I was about to fix it up when I wondered if we can remove it now. Ben changed the serial code to check before it goes pounding on addresses. I'm not sure if there were other issues with badly behaving drivers but my js20 boots here with the following removal patch. Thoughts?
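[Editorial sketch, not part of the original mail.] The check this patch removes can be modelled in userspace C roughly as below. The function names io_is_valid() and allow_io_range() are hypothetical stand-ins for the _IO_IS_VALID macro and the mask-building loop in arch/ppc64/kernel/pci.c; PAGE_SHIFT is assumed to be 12 (4K pages), and the end of the range is clamped to MAX_ISA_PORT - 1 here rather than to MAX_ISA_PORT as in the kernel code, so that the shift stays inside the 16 ISA pages.

```c
#include <stdbool.h>

#define PAGE_SHIFT   12          /* assumption: 4K pages */
#define MAX_ISA_PORT 0x10000UL   /* value taken from the patch */

/* One bit per I/O page below MAX_ISA_PORT; zero means "nothing allowed". */
static unsigned long io_page_mask;

/* Hypothetical function form of _IO_IS_VALID(): ports at or above
 * MAX_ISA_PORT always pass, lower ports need their page bit set. */
static bool io_is_valid(unsigned long port)
{
    return port >= MAX_ISA_PORT ||
           ((1UL << (port >> PAGE_SHIFT)) & io_page_mask) != 0;
}

/* Hypothetical helper mirroring the mask update: set one bit for
 * every page touched by the inclusive port range [start, end]. */
static void allow_io_range(unsigned long start, unsigned long end)
{
    if (start >= MAX_ISA_PORT)
        return;
    if (end >= MAX_ISA_PORT)
        end = MAX_ISA_PORT - 1;   /* deviation from the kernel code, see above */

    start >>= PAGE_SHIFT;
    end >>= PAGE_SHIFT;

    /* bits start..end inclusive */
    io_page_mask |= ((1UL << (end + 1)) - 1) ^ ((1UL << start) - 1);
}
```

With the mask left at zero, every port below MAX_ISA_PORT is rejected in this model, which may be the failure mode described above when nothing initialises io_page_mask.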
Anton Index: foobar2/include/asm-ppc64/io.h =================================================================== --- foobar2.orig/include/asm-ppc64/io.h 2005-05-04 13:51:41.245647479 +1000 +++ foobar2/include/asm-ppc64/io.h 2005-05-04 13:55:14.823405718 +1000 @@ -33,12 +33,6 @@ extern unsigned long isa_io_base; extern unsigned long pci_io_base; -extern unsigned long io_page_mask; - -#define MAX_ISA_PORT 0x10000 - -#define _IO_IS_VALID(port) ((port) >= MAX_ISA_PORT || (1 << (port>>PAGE_SHIFT)) \ - & io_page_mask) #ifdef CONFIG_PPC_ISERIES /* __raw_* accessors aren't supported on iSeries */ Index: foobar2/include/asm-ppc64/eeh.h =================================================================== --- foobar2.orig/include/asm-ppc64/eeh.h 2005-05-04 13:51:41.246647403 +1000 +++ foobar2/include/asm-ppc64/eeh.h 2005-05-04 13:55:14.825405566 +1000 @@ -310,8 +310,6 @@ static inline u8 eeh_inb(unsigned long port) { u8 val; - if (!_IO_IS_VALID(port)) - return ~0; val = in_8((u8 __iomem *)(port+pci_io_base)); if (EEH_POSSIBLE_ERROR(val, u8)) return eeh_check_failure((void __iomem *)(port), val); @@ -320,15 +318,12 @@ static inline void eeh_outb(u8 val, unsigned long port) { - if (_IO_IS_VALID(port)) - out_8((u8 __iomem *)(port+pci_io_base), val); + out_8((u8 __iomem *)(port+pci_io_base), val); } static inline u16 eeh_inw(unsigned long port) { u16 val; - if (!_IO_IS_VALID(port)) - return ~0; val = in_le16((u16 __iomem *)(port+pci_io_base)); if (EEH_POSSIBLE_ERROR(val, u16)) return eeh_check_failure((void __iomem *)(port), val); @@ -337,15 +332,12 @@ static inline void eeh_outw(u16 val, unsigned long port) { - if (_IO_IS_VALID(port)) - out_le16((u16 __iomem *)(port+pci_io_base), val); + out_le16((u16 __iomem *)(port+pci_io_base), val); } static inline u32 eeh_inl(unsigned long port) { u32 val; - if (!_IO_IS_VALID(port)) - return ~0; val = in_le32((u32 __iomem *)(port+pci_io_base)); if (EEH_POSSIBLE_ERROR(val, u32)) return eeh_check_failure((void __iomem *)(port), val); @@ -354,8 
+346,7 @@ static inline void eeh_outl(u32 val, unsigned long port) { - if (_IO_IS_VALID(port)) - out_le32((u32 __iomem *)(port+pci_io_base), val); + out_le32((u32 __iomem *)(port+pci_io_base), val); } /* in-string eeh macros */ Index: foobar2/arch/ppc64/kernel/iSeries_pci.c =================================================================== --- foobar2.orig/arch/ppc64/kernel/iSeries_pci.c 2005-05-04 13:55:12.042223389 +1000 +++ foobar2/arch/ppc64/kernel/iSeries_pci.c 2005-05-04 13:55:52.221213083 +1000 @@ -47,8 +47,6 @@ #include "pci.h" -extern unsigned long io_page_mask; - /* * Forward declares of prototypes. */ @@ -291,7 +289,6 @@ PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Entry.\n"); iomm_table_initialize(); find_and_init_phbs(); - io_page_mask = -1; PPCDBG(PPCDBG_BUSWALK, "iSeries_pcibios_init Exit.\n"); } Index: foobar2/arch/ppc64/kernel/pci.c =================================================================== --- foobar2.orig/arch/ppc64/kernel/pci.c 2005-05-04 13:55:12.047223007 +1000 +++ foobar2/arch/ppc64/kernel/pci.c 2005-05-04 13:55:52.226212702 +1000 @@ -42,15 +42,6 @@ unsigned long pci_probe_only = 1; unsigned long pci_assign_all_buses = 0; -/* - * legal IO pages under MAX_ISA_PORT. This is to ensure we don't touch - * devices we don't have access to. 
- */ -unsigned long io_page_mask; - -EXPORT_SYMBOL(io_page_mask); - - unsigned int pcibios_assign_all_busses(void) { return pci_assign_all_buses; @@ -674,8 +665,6 @@ pci_process_ISA_OF_ranges(isa_dn, hose->io_base_phys, hose->io_base_virt); of_node_put(isa_dn); - /* Allow all IO */ - io_page_mask = -1; } } @@ -837,24 +826,9 @@ if (dev->resource[i].flags & IORESOURCE_IO) { unsigned long offset = (unsigned long)hose->io_base_virt - pci_io_base; - unsigned long start, end, mask; - - start = dev->resource[i].start += offset; - end = dev->resource[i].end += offset; - /* Need to allow IO access to pages that are in the - ISA range */ - if (start < MAX_ISA_PORT) { - if (end > MAX_ISA_PORT) - end = MAX_ISA_PORT; - - start >>= PAGE_SHIFT; - end >>= PAGE_SHIFT; - - /* get the range of pages for the map */ - mask = ((1 << (end+1))-1) ^ ((1 << start)-1); - io_page_mask |= mask; - } + dev->resource[i].start += offset; + dev->resource[i].end += offset; } else if (dev->resource[i].flags & IORESOURCE_MEM) { dev->resource[i].start += hose->pci_mem_offset; Index: foobar2/arch/ppc64/kernel/maple_pci.c =================================================================== --- foobar2.orig/arch/ppc64/kernel/maple_pci.c 2005-05-04 13:55:12.044223236 +1000 +++ foobar2/arch/ppc64/kernel/maple_pci.c 2005-05-04 13:55:52.223212931 +1000 @@ -454,9 +454,6 @@ /* Tell pci.c to use the common resource allocation mecanism */ pci_probe_only = 0; - - /* Allow all IO */ - io_page_mask = -1; } int maple_pci_get_legacy_ide_irq(struct pci_dev *pdev, int channel) Index: foobar2/arch/ppc64/kernel/pmac_pci.c =================================================================== --- foobar2.orig/arch/ppc64/kernel/pmac_pci.c 2005-05-04 13:55:12.050222778 +1000 +++ foobar2/arch/ppc64/kernel/pmac_pci.c 2005-05-04 13:55:52.228212549 +1000 @@ -755,9 +755,6 @@ /* Tell pci.c to not use the common resource allocation mecanism */ pci_probe_only = 1; - - /* Allow all IO */ - io_page_mask = -1; } /* Index: 
foobar2/arch/ppc64/kernel/iomap.c =================================================================== --- foobar2.orig/arch/ppc64/kernel/iomap.c 2005-05-04 13:16:13.000000000 +1000 +++ foobar2/arch/ppc64/kernel/iomap.c 2005-05-04 14:02:43.887671757 +1000 @@ -88,8 +88,6 @@ void __iomem *ioport_map(unsigned long port, unsigned int len) { - if (!_IO_IS_VALID(port)) - return NULL; return (void __iomem *) (port+pci_io_base); } From anton at samba.org Wed May 4 15:42:58 2005 From: anton at samba.org (Anton Blanchard) Date: Wed, 4 May 2005 15:42:58 +1000 Subject: [PATCH] remove io_page_mask In-Reply-To: <20050504043745.GJ13590@krispykreme> References: <20050504043745.GJ13590@krispykreme> Message-ID: <20050504054258.GK13590@krispykreme> > I found an issue with the io_page_mask code when pci_probe_only is not > set (we dont initialise io_page_mask and bad things happen). I was > about to fix it up when I wondered if we can remove it now. > > Ben changed the serial code to check before it goes pounding on > addresses. Im not sure if there were other issues with badly behaving > drivers but my js20 boots here with the following removal patch. First fallout: parport_pc_probe_port+0x318/0xbd0 parport_pc_init+0x418/0x520 sys_init_module+0x1bc/0x4b0 The parallel port code is going out and touching stuff it shouldnt. Anton From will_schmidt at vnet.ibm.com Wed May 4 23:12:27 2005 From: will_schmidt at vnet.ibm.com (will schmidt) Date: Wed, 04 May 2005 08:12:27 -0500 Subject: (resend) RFC/Patch xmon pte/pgd/ userspace address additions Message-ID: <4278CA3B.6060700@vnet.ibm.com> Hi Folks, This is a resend. I didnt see this on the patch page, so suspect it got lost. (Do updated patches need to be sent to the list under a different thread?) 
-Will -------- Original Message -------- Subject: Re: RFC/Patch more xmon additions Date: Thu, 07 Apr 2005 08:39:36 -0500 From: will schmidt To: will schmidt CC: linuxppc64-dev at ozlabs.org References: <421E3BE3.90301 at vnet.ibm.com> Hi All, here's a revised version of my initial patch. - I've removed the try_spinlock code; - As an alternative to duplicating lots of function to add mread calls in place of references, I've added setjmp(bus_error_jmp) {} around what seem more likely to be critical areas. - cleaned up spacing - changed most of the function names to be xmon_xxx instead of wm_xxx. these functions show up under a submenu 'w'. use "w?" at xmon> prompt to get the help blurb. -Will will schmidt wrote: > > Hi Folks, > Am looking for comments on this additional function i've added to xmon > on the side.. > > the bulk of my intent was to make it easier for me to poke at memory > within a particular user process. > > I realize that the spacing is a bit screwed up, and the function names > should eventually change. Because i couldnt decide on letters for the > new functions, i put them under a submenu 'w'. > > wP will dump info on all processes. > > wp 0xabc will make process with pid 0xabc the active pid. <- active > only with respect to xmon poking into memory. > > wd 0xabcd1234 - will call through the pdg/pmd functions and return the > kernel address corresponding to 0xabcd1234 within the processes memory > space location. > > wg will dump gprs of the process/thread. > > -Will > > > ------------------------------------------------------------------------ > > > _______________________________________________ > Linuxppc64-dev mailing list > Linuxppc64-dev at ozlabs.org > https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... 
Name: xmon_pxd_code_apr7.diff Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050504/e50cfe05/attachment.txt From linas at austin.ibm.com Thu May 5 02:28:55 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Wed, 4 May 2005 11:28:55 -0500 Subject: 2.6.11 e1000 EEH MMIO failure In-Reply-To: <20050504003305.GB11855@kevlar.burdell.org> References: <20050503050212.GA22395@kevlar.burdell.org> <20050503181747.GB7870@kevlar.burdell.org> <20050503225508.GG11745@austin.ibm.com> <20050504003305.GB11855@kevlar.burdell.org> Message-ID: <20050504162855.GH11745@austin.ibm.com> On Tue, May 03, 2005 at 08:33:05PM -0400, Sonny Rao was heard to remark: > > Ah okay cool, so in the future Linux will be able to smartly handle > it, very nice. Unfortunately I can't really test your patch because > several other people need to use the machine which is normally > partitioned up (and that particular device is left out of any LPAR > config) I just happend to boot the full-system partition to do some > tests and noticed the problem. There's supposed to be some code that allows slots to be dynamically added and removed from running partitions, but I've never tried it myself. > Again, if someone wants to do something with that card, let me know, > otherwise I'm going to toss it out. FWIW, field experience shows that nine out of ten failures are due to poorly seated PCI cards. Before you chuck it, you might want to remove it, make sure there are no iron filings in the slot, and try again. Let me know how that goes; I'd like to add this to my bag of "real world" experience with this thing. --linas
From moilanen at austin.ibm.com Thu May 5 06:35:59 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 4 May 2005 15:35:59 -0500 Subject: [PATCH] ppc64: enforce medium thread priority in hypervisor calls In-Reply-To: <20050429135446.GF19662@krispykreme> References: <20050429135446.GF19662@krispykreme> Message-ID: <20050504153559.2aa85753.moilanen@austin.ibm.com> > Calls into the hypervisor do not raise the thread priority. Ensure we > are running at medium priority upon entry to the hypervisor. Anton, what's the purpose of this patch. I thought only RS64 had HMT, and those boxes don't make hypervisor calls. Jake From moilanen at austin.ibm.com Thu May 5 06:48:35 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 4 May 2005 15:48:35 -0500 Subject: [PATCH] ppc64: enforce medium thread priority in hypervisor calls In-Reply-To: <20050504153559.2aa85753.moilanen@austin.ibm.com> References: <20050429135446.GF19662@krispykreme> <20050504153559.2aa85753.moilanen@austin.ibm.com> Message-ID: <20050504154835.1e67686b.moilanen@austin.ibm.com> On Wed, 4 May 2005 15:35:59 -0500 Jake Moilanen wrote: > > Calls into the hypervisor do not raise the thread priority. Ensure we > > are running at medium priority upon entry to the hypervisor. > > Anton, what's the purpose of this patch. I thought only RS64 had HMT, > and those boxes don't make hypervisor calls. Disregard...Olof reminded me that SMT uses the same scheme.
Jake From moilanen at austin.ibm.com Thu May 5 04:50:07 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 4 May 2005 13:50:07 -0500 Subject: [PATCH] remove io_page_mask In-Reply-To: <20050504043745.GJ13590@krispykreme> References: <20050504043745.GJ13590@krispykreme> Message-ID: <20050504135007.78f449d2.moilanen@austin.ibm.com> > I found an issue with the io_page_mask code when pci_probe_only is not > set (we dont initialise io_page_mask and bad things happen). I was > about to fix it up when I wondered if we can remove it now. > > Ben changed the serial code to check before it goes pounding on > addresses. Im not sure if there were other issues with badly behaving > drivers but my js20 boots here with the following removal patch. As long as the serial code is fixed up, then the JS20 shouldn't need the io_page_mask. Jake From apw at shadowen.org Thu May 5 06:30:57 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Wed, 04 May 2005 21:30:57 +0100 Subject: [3/3] sparsemem memory model for ppc64 Message-ID: Provide the architecture specific implementation for SPARSEMEM for PPC64 systems. Signed-off-by: Andy Whitcroft Signed-off-by: Dave Hansen Signed-off-by: Mike Kravetz (in part) Signed-off-by: Martin Bligh --- arch/ppc64/Kconfig | 13 ++++++++++++- arch/ppc64/kernel/setup.c | 1 + arch/ppc64/mm/Makefile | 2 +- arch/ppc64/mm/init.c | 24 +++++++++++++++++++----- include/asm-ppc64/mmzone.h | 36 +++++++++++++++++++++++------------- include/asm-ppc64/page.h | 3 ++- include/asm-ppc64/sparsemem.h | 16 ++++++++++++++++ 7 files changed, 74 insertions(+), 21 deletions(-) diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/Kconfig current/arch/ppc64/Kconfig --- reference/arch/ppc64/Kconfig 2005-05-04 20:54:52.000000000 +0100 +++ current/arch/ppc64/Kconfig 2005-05-04 20:54:54.000000000 +0100 @@ -198,6 +198,13 @@ config HMT This option enables hardware multithreading on RS64 cpus. pSeries systems p620 and p660 have such a cpu type. 
+config ARCH_SELECT_MEMORY_MODEL + def_bool y + +config ARCH_FLATMEM_ENABLE + def_bool y + depends on !NUMA + config ARCH_DISCONTIGMEM_ENABLE def_bool y depends on SMP && PPC_PSERIES @@ -209,6 +216,10 @@ config ARCH_DISCONTIGMEM_DEFAULT config ARCH_FLATMEM_ENABLE def_bool y +config ARCH_SPARSEMEM_ENABLE + def_bool y + depends on ARCH_DISCONTIGMEM_ENABLE + source "mm/Kconfig" config HAVE_ARCH_EARLY_PFN_TO_NID @@ -229,7 +240,7 @@ config NODES_SPAN_OTHER_NODES config NUMA bool "NUMA support" - depends on DISCONTIGMEM + default y if DISCONTIGMEM || SPARSEMEM config SCHED_SMT bool "SMT (Hyperthreading) scheduler support" diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/kernel/setup.c current/arch/ppc64/kernel/setup.c --- reference/arch/ppc64/kernel/setup.c 2005-04-11 19:33:15.000000000 +0100 +++ current/arch/ppc64/kernel/setup.c 2005-05-04 20:54:53.000000000 +0100 @@ -1059,6 +1059,7 @@ void __init setup_arch(char **cmdline_p) /* set up the bootmem stuff with available memory */ do_init_bootmem(); + sparse_init(); /* initialize the syscall map in systemcfg */ setup_syscall_map(); diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/mm/init.c current/arch/ppc64/mm/init.c --- reference/arch/ppc64/mm/init.c 2005-05-04 20:54:20.000000000 +0100 +++ current/arch/ppc64/mm/init.c 2005-05-04 20:54:54.000000000 +0100 @@ -601,13 +601,21 @@ EXPORT_SYMBOL(page_is_ram); * Initialize the bootmem system and give it all the memory we * have available. 
*/ -#ifndef CONFIG_DISCONTIGMEM +#ifndef CONFIG_NEED_MULTIPLE_NODES void __init do_init_bootmem(void) { unsigned long i; unsigned long start, bootmap_pages; unsigned long total_pages = lmb_end_of_DRAM() >> PAGE_SHIFT; int boot_mapsize; + unsigned long start_pfn, end_pfn; + /* + * Note presence of first (logical/coalasced) LMB which will + * contain RMO region + */ + start_pfn = lmb.memory.region[0].physbase >> PAGE_SHIFT; + end_pfn = start_pfn + (lmb.memory.region[0].size >> PAGE_SHIFT); + memory_present(0, start_pfn, end_pfn); /* * Find an area to use for the bootmem bitmap. Calculate the size of @@ -623,12 +631,18 @@ void __init do_init_bootmem(void) max_pfn = max_low_pfn; - /* add all physical memory to the bootmem map. Also find the first */ + /* add all physical memory to the bootmem map. Also, find the first + * presence of all LMBs*/ for (i=0; i < lmb.memory.cnt; i++) { unsigned long physbase, size; physbase = lmb.memory.region[i].physbase; size = lmb.memory.region[i].size; + if (i) { /* already created mappings for first LMB */ + start_pfn = physbase >> PAGE_SHIFT; + end_pfn = start_pfn + (size >> PAGE_SHIFT); + } + memory_present(0, start_pfn, end_pfn); free_bootmem(physbase, size); } @@ -667,7 +681,7 @@ void __init paging_init(void) free_area_init_node(0, &contig_page_data, zones_size, __pa(PAGE_OFFSET) >> PAGE_SHIFT, zholes_size); } -#endif /* CONFIG_DISCONTIGMEM */ +#endif /* ! 
CONFIG_NEED_MULTIPLE_NODES */ static struct kcore_list kcore_vmem; @@ -698,7 +712,7 @@ module_init(setup_kcore); void __init mem_init(void) { -#ifdef CONFIG_DISCONTIGMEM +#ifdef CONFIG_NEED_MULTIPLE_NODES int nid; #endif pg_data_t *pgdat; @@ -709,7 +723,7 @@ void __init mem_init(void) num_physpages = max_low_pfn; /* RAM is assumed contiguous */ high_memory = (void *) __va(max_low_pfn * PAGE_SIZE); -#ifdef CONFIG_DISCONTIGMEM +#ifdef CONFIG_NEED_MULTIPLE_NODES for_each_online_node(nid) { if (NODE_DATA(nid)->node_spanned_pages != 0) { printk("freeing bootmem node %x\n", nid); diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/mm/Makefile current/arch/ppc64/mm/Makefile --- reference/arch/ppc64/mm/Makefile 2005-01-21 14:04:09.000000000 +0000 +++ current/arch/ppc64/mm/Makefile 2005-05-04 20:54:54.000000000 +0100 @@ -6,6 +6,6 @@ EXTRA_CFLAGS += -mno-minimal-toc obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o tlb.o \ slb_low.o slb.o stab.o mmap.o -obj-$(CONFIG_DISCONTIGMEM) += numa.o +obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o obj-$(CONFIG_PPC_MULTIPLATFORM) += hash_native.o diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/include/asm-ppc64/mmzone.h current/include/asm-ppc64/mmzone.h --- reference/include/asm-ppc64/mmzone.h 2005-05-04 20:54:50.000000000 +0100 +++ current/include/asm-ppc64/mmzone.h 2005-05-04 20:54:54.000000000 +0100 @@ -10,9 +10,20 @@ #include #include -#ifdef CONFIG_DISCONTIGMEM +/* generic non-linear memory support: + * + * 1) we will not split memory into more chunks than will fit into the + * flags field of the struct page + */ + + +#ifdef CONFIG_NEED_MULTIPLE_NODES extern struct pglist_data *node_data[]; +/* + * Return a pointer to the node data for node n. + */ +#define NODE_DATA(nid) (node_data[nid]) /* * Following are specific to this numa platform. 
@@ -47,30 +58,27 @@ static inline int pa_to_nid(unsigned lon return nid; } -#define pfn_to_nid(pfn) pa_to_nid((pfn) << PAGE_SHIFT) - -/* - * Return a pointer to the node data for node n. - */ -#define NODE_DATA(nid) (node_data[nid]) - #define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn) /* * Following are macros that each numa implmentation must define. */ -/* - * Given a kernel address, find the home node of the underlying memory. - */ -#define kvaddr_to_nid(kaddr) pa_to_nid(__pa(kaddr)) - #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) #define node_end_pfn(nid) (NODE_DATA(nid)->node_end_pfn) #define local_mapnr(kvaddr) \ ( (__pa(kvaddr) >> PAGE_SHIFT) - node_start_pfn(kvaddr_to_nid(kvaddr)) +#ifdef CONFIG_DISCONTIGMEM + +/* + * Given a kernel address, find the home node of the underlying memory. + */ +#define kvaddr_to_nid(kaddr) pa_to_nid(__pa(kaddr)) + +#define pfn_to_nid(pfn) pa_to_nid((unsigned long)(pfn) << PAGE_SHIFT) + /* Written this way to avoid evaluating arguments twice */ #define discontigmem_pfn_to_page(pfn) \ ({ \ @@ -91,6 +99,8 @@ static inline int pa_to_nid(unsigned lon #endif /* CONFIG_DISCONTIGMEM */ +#endif /* CONFIG_NEED_MULTIPLE_NODES */ + #ifdef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID #define early_pfn_to_nid(pfn) pa_to_nid(((unsigned long)pfn) << PAGE_SHIFT) #endif diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/include/asm-ppc64/page.h current/include/asm-ppc64/page.h --- reference/include/asm-ppc64/page.h 2005-04-11 19:33:45.000000000 +0100 +++ current/include/asm-ppc64/page.h 2005-05-04 20:54:54.000000000 +0100 @@ -224,7 +224,8 @@ extern u64 ppc64_pft_size; /* Log 2 of #define page_to_pfn(page) discontigmem_page_to_pfn(page) #define pfn_to_page(pfn) discontigmem_pfn_to_page(pfn) #define pfn_valid(pfn) discontigmem_pfn_valid(pfn) -#else +#endif +#ifdef CONFIG_FLATMEM #define pfn_to_page(pfn) (mem_map + (pfn)) #define page_to_pfn(page) ((unsigned long)((page) - mem_map)) #define pfn_valid(pfn) ((pfn) < 
max_mapnr) diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/include/asm-ppc64/sparsemem.h current/include/asm-ppc64/sparsemem.h --- reference/include/asm-ppc64/sparsemem.h 1970-01-01 01:00:00.000000000 +0100 +++ current/include/asm-ppc64/sparsemem.h 2005-05-04 20:54:54.000000000 +0100 @@ -0,0 +1,16 @@ +#ifndef _ASM_PPC64_SPARSEMEM_H +#define _ASM_PPC64_SPARSEMEM_H 1 + +#ifdef CONFIG_SPARSEMEM +/* + * SECTION_SIZE_BITS 2^N: how big each section will be + * MAX_PHYSADDR_BITS 2^N: how much physical address space we have + * MAX_PHYSMEM_BITS 2^N: how much memory we can have in that space + */ +#define SECTION_SIZE_BITS 24 +#define MAX_PHYSADDR_BITS 38 +#define MAX_PHYSMEM_BITS 36 + +#endif /* CONFIG_SPARSEMEM */ + +#endif /* _ASM_PPC64_SPARSEMEM_H */ From apw at shadowen.org Thu May 5 06:28:57 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Wed, 04 May 2005 21:28:57 +0100 Subject: [1/3] add early_pfn_to_nid for ppc64 Message-ID: Provide an implementation of early_pfn_to_nid for PPC64. This is used by memory models to determine the node from which to take allocations before the memory allocators are fully initialised. Signed-off-by: Andy Whitcroft Signed-off-by: Dave Hansen Signed-off-by: Martin Bligh --- arch/ppc64/Kconfig | 4 ++++ include/asm-ppc64/mmzone.h | 5 +++++ 2 files changed, 9 insertions(+) diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/Kconfig current/arch/ppc64/Kconfig --- reference/arch/ppc64/Kconfig 2005-05-04 20:54:41.000000000 +0100 +++ current/arch/ppc64/Kconfig 2005-05-04 20:54:48.000000000 +0100 @@ -211,6 +211,10 @@ config ARCH_FLATMEM_ENABLE source "mm/Kconfig" +config HAVE_ARCH_EARLY_PFN_TO_NID + bool + default y + # Some NUMA nodes have memory ranges that span # other nodes. 
Even though a pfn is valid and # between a node's start and end pfns, it may not diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/include/asm-ppc64/mmzone.h current/include/asm-ppc64/mmzone.h --- reference/include/asm-ppc64/mmzone.h 2005-05-04 20:54:41.000000000 +0100 +++ current/include/asm-ppc64/mmzone.h 2005-05-04 20:54:48.000000000 +0100 @@ -90,4 +90,9 @@ static inline int pa_to_nid(unsigned lon #define discontigmem_pfn_valid(pfn) ((pfn) < num_physpages) #endif /* CONFIG_DISCONTIGMEM */ + +#ifdef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID +#define early_pfn_to_nid(pfn) pa_to_nid(((unsigned long)pfn) << PAGE_SHIFT) +#endif + #endif /* _ASM_MMZONE_H_ */ From apw at shadowen.org Thu May 5 06:27:57 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Wed, 04 May 2005 21:27:57 +0100 Subject: [0/3] SPARSEMEM memory model patches for ppc64 Message-ID: After long testing outside -mm we believe that the SPARSEMEM patches are ready for wider testing, please consider for -mm. SPARSEMEM is essentially a replacement for DISCONTIGMEM, providing support for non-contiguous memory but with the advantage of handling both inter- and intra-node memory holes. The goal of the implementation was to design a clean memory model covering the needs of both UMA and NUMA discontiguous memory layouts whilst providing a basis for hotplug. This should allow us to consolidate the implementation of the various "discontiguous" memory models whilst trying to fix their shortcomings. Ultimately it should allow us to remove DISCONTIGMEM.
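[Editorial sketch, not part of the original mail.] The hole handling described above can be illustrated with the section arithmetic implied by the sparsemem.h header in patch 3/3. SECTION_SIZE_BITS and MAX_PHYSMEM_BITS are the values from that patch; PAGE_SHIFT = 12 is an assumption, the macro names are modelled on the generic sparsemem code, and the sketch_* functions are hypothetical simplifications (the real memory_present() also takes a node id and does more than set a flag).

```c
#include <stdbool.h>

#define PAGE_SHIFT        12   /* assumption: 4K pages */
#define SECTION_SIZE_BITS 24   /* from patch 3/3: 16MB sections */
#define MAX_PHYSMEM_BITS  36   /* from patch 3/3: 64GB of memory */

/* A section covers 2^(24-12) = 4096 pages. */
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
/* 2^(36-24) = 4096 sections in total. */
#define NR_MEM_SECTIONS   (1UL << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS))

/* Simplified stand-in for the per-section state kept by sparsemem. */
static bool section_present[NR_MEM_SECTIONS];

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}

/* Mark every section overlapping the pfn range [start_pfn, end_pfn). */
static void sketch_memory_present(unsigned long start_pfn,
                                  unsigned long end_pfn)
{
    unsigned long nr, last;

    if (end_pfn <= start_pfn)
        return;

    last = pfn_to_section_nr(end_pfn - 1);
    for (nr = pfn_to_section_nr(start_pfn);
         nr <= last && nr < NR_MEM_SECTIONS; nr++)
        section_present[nr] = true;
}

/* A pfn is usable only if its section was reported present, so a
 * hole inside a node simply shows up as sections never marked. */
static bool sketch_pfn_valid(unsigned long pfn)
{
    unsigned long nr = pfn_to_section_nr(pfn);

    return nr < NR_MEM_SECTIONS && section_present[nr];
}
```

In this model, marking pfns 0x0-0x2000 and 0x10000-0x11000 leaves everything in between invalid without needing a separate node per range, which is the intra-node hole case mentioned above.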
Following this mail are 3 patches which provide SPARSEMEM for ppc64: [1/3] add early_pfn_to_nid for ppc64 [2/3] add memory present for ppc64 [3/3] sparsemem memory model for ppc64 These apply on top of the generic/i386 patches recently sent out to linux-mm: [1/6] generify early_pfn_to_nid [2/6] generify memory present [3/6] sparsemem memory model [4/6] sparsemem memory model for i386 [5/6] sparsemem swiss cheese numa layouts [6/6] sparsemem hotplug base These patches have been compiled, booted and tested on 2.6.12-rc2 (plus the -mm patches listed below). They have been compile and boot tested against 2.6.12-rc3-mm2. They do assume a number of patches already incorporated into -mm including the latest configuration updates from Dave Hansen . remove-non-discontig-use-of-pgdat-node_mem_map.patch resubmit-sparsemem-base-early_pfn_to_nid-works-before-sparse-is-initialized.patch resubmit-sparsemem-base-simple-numa-remap-space-allocator.patch resubmit-sparsemem-base-reorganize-page-flags-bit-operations.patch resubmit-sparsemem-base-teach-discontig-about-sparse-ranges.patch create-mm-kconfig-for-arch-independent-memory-options.patch make-each-arch-use-mm-kconfig.patch update-all-defconfigs-for-arch_discontigmem_enable.patch introduce-new-kconfig-option-for-numa-or-discontig.patch sparsemem-fix-minor-defaults-issue-in-mm-kconfig.patch mm-kconfig-kill-unused-arch_flatmem_disable.patch mm-kconfig-hide-memory-model-selection-menu.patch Comments/feedback appreciated. -apw From apw at shadowen.org Thu May 5 06:29:57 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Wed, 04 May 2005 21:29:57 +0100 Subject: [2/3] add memory present for ppc64 Message-ID: Provide hooks for PPC64 to allow memory models to be informed of installed memory areas. This allows SPARSEMEM to instantiate mem_map for the populated areas. 
Signed-off-by: Andy Whitcroft Signed-off-by: Dave Hansen Signed-off-by: Martin Bligh --- Kconfig | 4 ++-- mm/numa.c | 3 +++ 2 files changed, 5 insertions(+), 2 deletions(-) diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/Kconfig current/arch/ppc64/Kconfig --- reference/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 +++ current/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 @@ -212,8 +212,8 @@ config ARCH_FLATMEM_ENABLE source "mm/Kconfig" config HAVE_ARCH_EARLY_PFN_TO_NID - bool - default y + def_bool y + depends on NEED_MULTIPLE_NODES # Some NUMA nodes have memory ranges that span # other nodes. Even though a pfn is valid and diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/mm/numa.c current/arch/ppc64/mm/numa.c --- reference/arch/ppc64/mm/numa.c 2005-04-11 19:33:15.000000000 +0100 +++ current/arch/ppc64/mm/numa.c 2005-05-04 20:54:50.000000000 +0100 @@ -440,6 +440,8 @@ new_range: for (i = start ; i < (start+size); i += MEMORY_INCREMENT) numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = numa_domain; + memory_present(numa_domain, start >> PAGE_SHIFT, + (start + size) >> PAGE_SHIFT); if (--ranges) goto new_range; @@ -481,6 +483,7 @@ static void __init setup_nonnuma(void) for (i = 0 ; i < top_of_ram; i += MEMORY_INCREMENT) numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = 0; + memory_present(0, 0, init_node_data[0].node_end_pfn); } static void __init dump_numa_topology(void) From olof at lixom.net Thu May 5 08:08:19 2005 From: olof at lixom.net (Olof Johansson) Date: Wed, 4 May 2005 17:08:19 -0500 Subject: [PATCH] remove io_page_mask In-Reply-To: <20050504135007.78f449d2.moilanen@austin.ibm.com> References: <20050504043745.GJ13590@krispykreme> <20050504135007.78f449d2.moilanen@austin.ibm.com> Message-ID: <20050504220815.GA29571@austin.ibm.com> On Wed, May 04, 2005 at 01:50:07PM -0500, Jake Moilanen wrote: > > I found an issue with the io_page_mask code when pci_probe_only is not > > set (we dont initialise 
io_page_mask and bad things happen). I was > > about to fix it up when I wondered if we can remove it now. > > > > Ben changed the serial code to check before it goes pounding on > > addresses. I'm not sure if there were other issues with badly behaving > > drivers but my js20 boots here with the following removal patch. > > As long as the serial code is fixed up, then the JS20 shouldn't need the > io_page_mask. I tried booting 2.6.12-rc3-mm2 + this patch on a JS20 here and it seemed to go well. Serial options enabled were: CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_SERIAL_8250_NR_UARTS=4 # CONFIG_SERIAL_8250_EXTENDED is not set # Non-8250 serial port support CONFIG_SERIAL_CORE=y CONFIG_SERIAL_CORE_CONSOLE=y -Olof From sonny at burdell.org Thu May 5 10:39:54 2005 From: sonny at burdell.org (Sonny Rao) Date: Wed, 4 May 2005 20:39:54 -0400 Subject: 2.6.11 e1000 EEH MMIO failure In-Reply-To: <20050504162855.GH11745@austin.ibm.com> References: <20050503050212.GA22395@kevlar.burdell.org> <20050503181747.GB7870@kevlar.burdell.org> <20050503225508.GG11745@austin.ibm.com> <20050504003305.GB11855@kevlar.burdell.org> <20050504162855.GH11745@austin.ibm.com> Message-ID: <20050505003954.GB7367@kevlar.burdell.org> On Wed, May 04, 2005 at 11:28:55AM -0500, Linas Vepstas wrote: > On Tue, May 03, 2005 at 08:33:05PM -0400, Sonny Rao was heard to remark: > > > > Ah okay cool, so in the future Linux will be able to smartly handle > > it, very nice. Unfortunately I can't really test your patch because > > several other people need to use the machine which is normally > > partitioned up (and that particular device is left out of any LPAR > > config) I just happened to boot the full-system partition to do some > > tests and noticed the problem. > > There's supposed to be some code that allows slots to be dynamically > added and removed from running partitions, but I've never tried it > myself.
> > > Again, if someone wants to do something with that card, let me know, > > otherwise I'm going to toss it out. > > FWIW, field experience shows that nine out of ten failures are due to > poorly seated PCI cards. Before you chuck it, you might want to remove > it, make sure there are no iron filings in the slot, and try again. > > Let me know how that goes; I'd like to add this to my bag of "real > world" experience with this thing. Well, I tried to hot-plug that thing back in and boot a partition with it, and as far as I can tell the card seems to be dead. Not sure how to interpret this. Oh well, thanks. Sonny From david at gibson.dropbear.id.au Thu May 5 11:42:56 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 5 May 2005 11:42:56 +1000 Subject: Patch to kill ioremap_mm Message-ID: <20050505014256.GE18270@localhost.localdomain> Can anyone see any problems with this patch? If not, I'll send it on to akpm. Currently ppc64 has two mm_structs for the kernel, init_mm and also ioremap_mm. The latter really isn't necessary: this patch abolishes it, instead restricting vmallocs to the lower 1TB of the init_mm's range and placing io mappings in the upper 1TB. This simplifies the code in a number of places, and gets rid of an unnecessary set of pagetables. Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-05 10:58:04.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-05 11:12:59.000000000 +1000 @@ -53,7 +53,8 @@ * Define the address range of the vmalloc VM area. */ #define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_END (VMALLOC_START + EADDR_MASK) +#define VMALLOC_SIZE (0x10000000000UL) +#define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) /* * Bits in a linux-style PTE.
These match the bits in the @@ -239,9 +240,6 @@ /* This now only contains the vmalloc pages */ #define pgd_offset_k(address) pgd_offset(&init_mm, address) -/* to find an entry in the ioremap page-table-directory */ -#define pgd_offset_i(address) (ioremap_pgd + pgd_index(address)) - /* * The following only work if pte_present() is true. * Undefined behaviour if not.. @@ -459,15 +457,12 @@ #define __HAVE_ARCH_PTE_SAME #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) -extern unsigned long ioremap_bot, ioremap_base; - #define pmd_ERROR(e) \ printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) #define pgd_ERROR(e) \ printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) extern pgd_t swapper_pg_dir[]; -extern pgd_t ioremap_dir[]; extern void paging_init(void); Index: working-2.6/include/asm-ppc64/imalloc.h =================================================================== --- working-2.6.orig/include/asm-ppc64/imalloc.h 2005-05-05 10:58:04.000000000 +1000 +++ working-2.6/include/asm-ppc64/imalloc.h 2005-05-05 11:13:39.000000000 +1000 @@ -4,9 +4,9 @@ /* * Define the address range of the imalloc VM area. 
*/ -#define PHBS_IO_BASE IOREGIONBASE -#define IMALLOC_BASE (IOREGIONBASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ -#define IMALLOC_END (IOREGIONBASE + EADDR_MASK) +#define PHBS_IO_BASE VMALLOC_END +#define IMALLOC_BASE (PHBS_IO_BASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ +#define IMALLOC_END (VMALLOC_START + EADDR_MASK) /* imalloc region types */ @@ -21,4 +21,6 @@ int region_type); unsigned long im_free(void *addr); +extern unsigned long ioremap_bot; + #endif /* _PPC64_IMALLOC_H */ Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-05 10:58:04.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-05 11:14:02.000000000 +1000 @@ -202,9 +202,7 @@ #define PAGE_OFFSET ASM_CONST(0xC000000000000000) #define KERNELBASE PAGE_OFFSET #define VMALLOCBASE ASM_CONST(0xD000000000000000) -#define IOREGIONBASE ASM_CONST(0xE000000000000000) -#define IO_REGION_ID (IOREGIONBASE >> REGION_SHIFT) #define VMALLOC_REGION_ID (VMALLOCBASE >> REGION_SHIFT) #define KERNEL_REGION_ID (KERNELBASE >> REGION_SHIFT) #define USER_REGION_ID (0UL) Index: working-2.6/arch/ppc64/kernel/eeh.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/eeh.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/eeh.c 2005-05-05 11:23:40.000000000 +1000 @@ -505,7 +505,7 @@ pte_t *ptep; unsigned long pa; - ptep = find_linux_pte(ioremap_mm.pgd, token); + ptep = find_linux_pte(init_mm.pgd, token); if (!ptep) return token; pa = pte_pfn(*ptep) << PAGE_SHIFT; Index: working-2.6/arch/ppc64/kernel/process.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/process.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/process.c 2005-05-05 11:16:20.000000000 +1000 @@ -58,14 +58,6 @@ struct task_struct *last_task_used_altivec = NULL; #endif 
-struct mm_struct ioremap_mm = { - .pgd = ioremap_dir, - .mm_users = ATOMIC_INIT(2), - .mm_count = ATOMIC_INIT(1), - .cpu_vm_mask = CPU_MASK_ALL, - .page_table_lock = SPIN_LOCK_UNLOCKED, -}; - /* * Make sure the floating-point register state in the * the thread_struct is up to date for task tsk. Index: working-2.6/include/asm-ppc64/processor.h =================================================================== --- working-2.6.orig/include/asm-ppc64/processor.h 2005-04-26 15:38:02.000000000 +1000 +++ working-2.6/include/asm-ppc64/processor.h 2005-05-05 11:24:46.000000000 +1000 @@ -590,16 +590,6 @@ } /* - * Note: the vm_start and vm_end fields here should *not* - * be in kernel space. (Could vm_end == vm_start perhaps?) - */ -#define IOREMAP_MMAP { &ioremap_mm, 0, 0x1000, NULL, \ - PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC, \ - 1, NULL, NULL } - -extern struct mm_struct ioremap_mm; - -/* * Return saved PC of a blocked thread. For now, this is the "user" PC */ #define thread_saved_pc(tsk) \ Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-05 10:58:04.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-05 11:17:03.000000000 +1000 @@ -310,10 +310,6 @@ vsid = get_vsid(mm->context.id, ea); break; - case IO_REGION_ID: - mm = &ioremap_mm; - vsid = get_kernel_vsid(ea); - break; case VMALLOC_REGION_ID: mm = &init_mm; vsid = get_kernel_vsid(ea); Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-05 10:58:04.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-05 11:22:54.000000000 +1000 @@ -144,7 +144,7 @@ pte = pte_offset_kernel(pmd, addr); do { - pte_t ptent = ptep_get_and_clear(&ioremap_mm, addr, pte); + pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte); WARN_ON(!pte_none(ptent) && !pte_present(ptent)); } while (pte++, addr 
+= PAGE_SIZE, addr != end); } @@ -181,13 +181,13 @@ static void unmap_im_area(unsigned long addr, unsigned long end) { - struct mm_struct *mm = &ioremap_mm; + struct mm_struct *mm = &init_mm; unsigned long next; pgd_t *pgd; spin_lock(&mm->page_table_lock); - pgd = pgd_offset_i(addr); + pgd = pgd_offset_k(addr); flush_cache_vunmap(addr, end); do { next = pgd_addr_end(addr, end); @@ -214,21 +214,21 @@ unsigned long vsid; if (mem_init_done) { - spin_lock(&ioremap_mm.page_table_lock); - pgdp = pgd_offset_i(ea); - pudp = pud_alloc(&ioremap_mm, pgdp, ea); + spin_lock(&init_mm.page_table_lock); + pgdp = pgd_offset_k(ea); + pudp = pud_alloc(&init_mm, pgdp, ea); if (!pudp) return -ENOMEM; - pmdp = pmd_alloc(&ioremap_mm, pudp, ea); + pmdp = pmd_alloc(&init_mm, pudp, ea); if (!pmdp) return -ENOMEM; - ptep = pte_alloc_kernel(&ioremap_mm, pmdp, ea); + ptep = pte_alloc_kernel(&init_mm, pmdp, ea); if (!ptep) return -ENOMEM; pa = abs_to_phys(pa); - set_pte_at(&ioremap_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, + set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, __pgprot(flags))); - spin_unlock(&ioremap_mm.page_table_lock); + spin_unlock(&init_mm.page_table_lock); } else { unsigned long va, vpn, hash, hpteg; -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From olof at lixom.net Thu May 5 12:31:32 2005 From: olof at lixom.net (Olof Johansson) Date: Wed, 4 May 2005 21:31:32 -0500 Subject: [3/3] sparsemem memory model for ppc64 In-Reply-To: References: Message-ID: <20050505023132.GB20283@austin.ibm.com> Hi, Just two formatting nitpicks below. 
-Olof On Wed, May 04, 2005 at 09:30:57PM +0100, Andy Whitcroft wrote: > diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/mm/init.c current/arch/ppc64/mm/init.c > --- reference/arch/ppc64/mm/init.c 2005-05-04 20:54:20.000000000 +0100 > +++ current/arch/ppc64/mm/init.c 2005-05-04 20:54:54.000000000 +0100 [...] > @@ -623,12 +631,18 @@ void __init do_init_bootmem(void) > > max_pfn = max_low_pfn; > > - /* add all physical memory to the bootmem map. Also find the first */ > + /* add all physical memory to the bootmem map. Also, find the first > + * presence of all LMBs*/ CodingStyle: */ on new line > for (i=0; i < lmb.memory.cnt; i++) { > unsigned long physbase, size; > > physbase = lmb.memory.region[i].physbase; > size = lmb.memory.region[i].size; > + if (i) { /* already created mappings for first LMB */ > + start_pfn = physbase >> PAGE_SHIFT; > + end_pfn = start_pfn + (size >> PAGE_SHIFT); Comment on new line indented, please -Olof From olof at lixom.net Thu May 5 12:31:19 2005 From: olof at lixom.net (Olof Johansson) Date: Wed, 4 May 2005 21:31:19 -0500 Subject: [2/3] add memory present for ppc64 In-Reply-To: References: Message-ID: <20050505023119.GA20283@austin.ibm.com> On Wed, May 04, 2005 at 09:29:57PM +0100, Andy Whitcroft wrote: > diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/Kconfig current/arch/ppc64/Kconfig > --- reference/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 > +++ current/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 > @@ -212,8 +212,8 @@ config ARCH_FLATMEM_ENABLE > source "mm/Kconfig" > > config HAVE_ARCH_EARLY_PFN_TO_NID > - bool > - default y > + def_bool y > + depends on NEED_MULTIPLE_NODES Ok, time to show my lack of understanding here, but when can we ever be CONFIG_NUMA and NOT need multiple nodes?
> @@ -481,6 +483,7 @@ static void __init setup_nonnuma(void) > > for (i = 0 ; i < top_of_ram; i += MEMORY_INCREMENT) > numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = 0; > + memory_present(0, 0, init_node_data[0].node_end_pfn); Isn't the memory_present stuff and numa_memory_lookup_table two implementations doing the same thing (mapping memory to nodes)? Can we kill numa_memory_lookup_table with this? -Olof From haveblue at us.ibm.com Thu May 5 14:43:18 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Wed, 04 May 2005 21:43:18 -0700 Subject: [2/3] add memory present for ppc64 In-Reply-To: <20050505023119.GA20283@austin.ibm.com> References: <20050505023119.GA20283@austin.ibm.com> Message-ID: <1115268198.9286.11.camel@localhost> On Wed, 2005-05-04 at 21:31 -0500, Olof Johansson wrote: > On Wed, May 04, 2005 at 09:29:57PM +0100, Andy Whitcroft wrote: > > diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/Kconfig current/arch/ppc64/Kconfig > > --- reference/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 > > +++ current/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 > > @@ -212,8 +212,8 @@ config ARCH_FLATMEM_ENABLE > > source "mm/Kconfig" > > > > config HAVE_ARCH_EARLY_PFN_TO_NID > > - bool > > - default y > > + def_bool y > > + depends on NEED_MULTIPLE_NODES > > Ok, time to show my lack of understanding here, but when can we ever be > CONFIG_NUMA and NOT need multiple nodes? NEED_MULTIPLE_NODES is for DISCONTIG || NUMA. It is a blanket config option that helps us separate those two very intertwined options. > > @@ -481,6 +483,7 @@ static void __init setup_nonnuma(void) > > > > for (i = 0 ; i < top_of_ram; i += MEMORY_INCREMENT) > > numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = 0; > > + memory_present(0, 0, init_node_data[0].node_end_pfn); > > Isn't the memory_present stuff and numa_memory_lookup_table two > implementations doing the same thing (mapping memory to nodes)?
They have similar functions: record the physical layout of the system. But, memory_present() is for sparsemem, which basically implements pfn_to_page() and page_to_pfn(). The numa_memory_lookup_table[] is used for pfn_to_nid(), which is actually orthogonal to what sparsemem needs. > Can we kill numa_memory_lookup_table with this? Nope, we still need it for pfn_to_nid(). We could possibly replace the current implementation like this: #define pfn_to_nid(pfn) page_zone(__pfn_to_section(pfn)->section_mem_map[pfn])->zone_pgdat->node_id But, that might have a few performance implications :) There are certainly some options that sparsemem opens up here, and I hope that we explore them further as we move away from discontig. We could even do something like store the nid directly in the mem_section. But, as I said, that's an optimization that can come later. -- Dave From benh at kernel.crashing.org Thu May 5 17:18:22 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 05 May 2005 17:18:22 +1000 Subject: [PATCH] SMU update #2 Message-ID: <1115277502.7568.178.camel@gaston> Ok, here's more SMU work. It's not yet ready for upstream though. This patch rewrites completely the SMU driver (System Management Unit I suppose) which is used currently on iMac G5 and single CPU desktop G5, and will be used by future Apple machines. This new version adds interrupt-driven asynchronous command processing, userland interface to send SMU commands, a low level asynchronous i2c interface, a driver hooking this to the linux-i2c layer (with various "bugs" or rather "deviances" to what a linux-i2c driver is supposed to do, in a similar vein to keywest-i2c, but then, I personally consider the linux-i2c layer as a pile of barely useable crap). You can play with the userland interface to the SMU with this tool: http://gate.crashing.org/~benh/smu_wink.c That will flash your front LED.
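For readers who want the shape of the queued, callback-driven command handling without reading the whole patch, here is a rough userspace model in C. It is only a sketch under assumptions: every model_* name is made up for illustration and is not part of the driver, the doorbell interrupt is replaced by a plain function call, locking is left out, and a bad ack is reported as -1 rather than -EIO. What it does take from the patch is that a command is marked pending with status 1 when queued, that the expected ack is the complement of the command byte, and that the next queued command is started before the completion callback runs.

```c
/* Userspace model of an asynchronous command queue. Not the driver code:
 * all model_* names are hypothetical and exist only for this sketch. */
#include <stddef.h>
#include <stdint.h>

#define MODEL_PENDING 1	/* status while a command is queued or in flight */

struct model_cmd {
	uint8_t cmd;		/* command byte sent to the device */
	int status;		/* MODEL_PENDING, 0 on success, -1 on bad ack */
	void (*done)(struct model_cmd *cmd, void *misc);
	void *misc;		/* opaque argument handed to done() */
	struct model_cmd *next;
};

struct model_queue {
	struct model_cmd *head, *tail;	/* commands waiting to be sent */
	struct model_cmd *cur;		/* command currently in flight */
};

/* Take the first waiting command and make it the in-flight one. */
static void model_start_next(struct model_queue *q)
{
	struct model_cmd *cmd = q->head;

	if (cmd == NULL)
		return;
	q->head = cmd->next;
	if (q->head == NULL)
		q->tail = NULL;
	cmd->next = NULL;
	q->cur = cmd;
}

/* Queue a command; start it right away if nothing is in flight. */
static void model_queue_cmd(struct model_queue *q, struct model_cmd *cmd)
{
	cmd->status = MODEL_PENDING;
	cmd->next = NULL;
	if (q->tail)
		q->tail->next = cmd;
	else
		q->head = cmd;
	q->tail = cmd;
	if (q->cur == NULL)
		model_start_next(q);
}

/* Stand-in for the doorbell interrupt; ack_byte is what the device wrote back. */
static void model_complete(struct model_queue *q, uint8_t ack_byte)
{
	struct model_cmd *cmd = q->cur;
	void (*done)(struct model_cmd *, void *);
	void *misc;

	if (cmd == NULL)
		return;
	q->cur = NULL;
	/* Read the callback first: the submitter may reuse cmd once status changes. */
	done = cmd->done;
	misc = cmd->misc;
	/* The expected ack is the bitwise complement of the command byte. */
	cmd->status = (ack_byte == (uint8_t)~cmd->cmd) ? 0 : -1;
	model_start_next(q);
	if (done)
		done(cmd, misc);
}

/* Example completion callback: counts completed commands through misc. */
static void model_count_done(struct model_cmd *cmd, void *misc)
{
	(void)cmd;
	++*(int *)misc;
}
```

Queueing command 0x8e and then 0xaa starts 0x8e at once; completing it with ack byte 0x71 marks it successful and starts 0xaa, while any other ack byte marks the command failed.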
I have successfully verified that I could use the i2c interface to read the hard disk temperature sensors. The other sensors (CPU temperature, voltage and current) are on the SMU itself and require specific commands. I have played with those in OF, it will be fairly easy to use this driver to access them. From that point, I'll finish my decrypting of the Darwin thermal control driver for the iMac G5 and implement something. I'm posting this for now as I can't tell when I'll be able to resume work on it and somebody else may want to play a bit in the meantime. Index: linux-work/drivers/macintosh/smu.c =================================================================== --- linux-work.orig/drivers/macintosh/smu.c 2005-05-05 14:57:39.000000000 +1000 +++ linux-work/drivers/macintosh/smu.c 2005-05-05 16:28:11.000000000 +1000 @@ -8,21 +8,22 @@ */ /* - * For now, this driver includes: - * - RTC get & set - * - reboot & shutdown commands - * all synchronous with IRQ disabled (ugh) + * ??/??/05 - BenH - 0.1 : Initial released based on J. Mayer port * + * ??/??/05 - BenH - 0.4 : Rewritten version with async support, not merged + * but posted to some mailing lists + * + * ??/??/05 - BenH - 0.5 : Add i2c support + * TODO: - * rework in a way the PMU driver works, that is asynchronous - * with a queue of commands. I'll do that as soon as I have an - * SMU based machine at hand. Some more cleanup is needed too, - * like maybe fitting it into a platform device, etc... - * Also check what's up with cache coherency, and if we really - * can't do better than flushing the cache, maybe build a table - * of command len/reply len like the PMU driver to only flush - * what is actually necessary. - * --BenH. + * - maybe add timeout to commands ? 
+ * - blocking version of time functions + * - polling version of i2c commands (including timer that works with + * interrutps off) + * - maybe avoid some data copies with i2c by directly using the smu cmd + * buffer and a lower level internal interface + * - understand SMU -> CPU events and implement reception of them via + * the userland interface */ #include @@ -36,6 +37,10 @@ #include #include #include +#include +#include +#include +#include #include #include @@ -45,6 +50,11 @@ #include #include #include +#include +#include + +#define VERSION "0.5" +#define AUTHOR "(c) 2005 Benjamin Herrenschmidt, IBM Corp." #define DEBUG_SMU 1 @@ -57,20 +67,30 @@ /* * This is the command buffer passed to the SMU hardware */ +#define SMU_MAX_DATA 254 + struct smu_cmd_buf { u8 cmd; u8 length; - u8 data[0x0FFE]; + u8 data[SMU_MAX_DATA]; }; struct smu_device { spinlock_t lock; struct device_node *of_node; - int db_ack; /* doorbell ack GPIO */ - int db_req; /* doorbell req GPIO */ + struct of_device *of_dev; + int doorbell; /* doorbell gpio */ u32 __iomem *db_buf; /* doorbell buffer */ + int db_irq; + int msg; + int msg_irq; struct smu_cmd_buf *cmd_buf; /* command buffer virtual */ u32 cmd_buf_abs; /* command buffer absolute */ + struct list_head cmd_list; + struct smu_cmd *cmd_cur; /* pending command */ + struct list_head cmd_i2c_list; + struct smu_i2c_cmd *cmd_i2c_cur; /* pending i2c command */ + struct timer_list i2c_timer; }; /* @@ -79,113 +99,243 @@ */ static struct smu_device *smu; + /* - * SMU low level communication stuff + * SMU driver low level stuff */ -static inline int smu_cmd_stat(struct smu_cmd_buf *cmd_buf, u8 cmd_ack) + +static void smu_start_cmd(void) { - rmb(); - return cmd_buf->cmd == cmd_ack && cmd_buf->length != 0; + unsigned long faddr, fend; + struct smu_cmd *cmd; + + if (list_empty(&smu->cmd_list)) + return; + + /* Fetch first command in queue */ + cmd = list_entry(smu->cmd_list.next, struct smu_cmd, link); + smu->cmd_cur = cmd; + list_del(&cmd->link); + + 
DPRINTK("SMU: starting cmd %x, %d bytes data\n", cmd->cmd, + cmd->data_len); + DPRINTK("SMU: data buffer: %02x %02x %02x %02x ...\n", + ((u8 *)cmd->data_buf)[0], ((u8 *)cmd->data_buf)[1], + ((u8 *)cmd->data_buf)[2], ((u8 *)cmd->data_buf)[3]); + + /* Fill the SMU command buffer */ + smu->cmd_buf->cmd = cmd->cmd; + smu->cmd_buf->length = cmd->data_len; + memcpy(smu->cmd_buf->data, cmd->data_buf, cmd->data_len); + + /* Flush command and data to RAM */ + faddr = (unsigned long)smu->cmd_buf; + fend = faddr + smu->cmd_buf->length + 2; + flush_inval_dcache_range(faddr, fend); + + /* This isn't exactly a DMA mapping here, I suspect + * the SMU is actually communicating with us via i2c to the + * northbridge or the CPU to access RAM. + */ + writel(smu->cmd_buf_abs, smu->db_buf); + + /* Ring the SMU doorbell */ + pmac_do_feature_call(PMAC_FTR_WRITE_GPIO, NULL, smu->doorbell, 4); } -static inline u8 smu_save_ack_cmd(struct smu_cmd_buf *cmd_buf) + +static irqreturn_t smu_db_intr(int irq, void *arg, struct pt_regs *regs) { - return (~cmd_buf->cmd) & 0xff; + unsigned long flags; + struct smu_cmd *cmd; + void (*done)(struct smu_cmd *cmd, void *misc) = NULL; + void *misc = NULL; + u8 gpio; + int rc = 0; + + /* SMU completed the command, well, we hope, let's make sure + * of it + */ + spin_lock_irqsave(&smu->lock, flags); + + gpio = pmac_do_feature_call(PMAC_FTR_READ_GPIO, NULL, smu->doorbell); + if ((gpio & 7) != 7) + return IRQ_HANDLED; + + cmd = smu->cmd_cur; + smu->cmd_cur = NULL; + if (cmd == NULL) + goto bail; + + if (rc == 0) { + unsigned long faddr; + int reply_len; + u8 ack; + + /* CPU might have brought back the cache line, so we need + * to flush again before peeking at the SMU response. 
We + * flush the entire buffer for now as we haven't read the + * reply lenght (it's only 2 cache lines anyway) + */ + faddr = (unsigned long)smu->cmd_buf; + flush_inval_dcache_range(faddr, faddr + 256); + + /* Now check ack */ + ack = (~cmd->cmd) & 0xff; + if (ack != smu->cmd_buf->cmd) { + DPRINTK("SMU: incorrect ack, want %x got %x\n", + ack, smu->cmd_buf->cmd); + rc = -EIO; + } + reply_len = rc == 0 ? smu->cmd_buf->length : 0; + DPRINTK("SMU: reply len: %d\n", reply_len); + if (reply_len > cmd->reply_len) { + printk(KERN_WARNING "SMU: reply buffer too small," + "got %d bytes for a %d bytes buffer\n", + reply_len, cmd->reply_len); + reply_len = cmd->reply_len; + } + cmd->reply_len = reply_len; + if (cmd->reply_buf && reply_len) + memcpy(cmd->reply_buf, smu->cmd_buf->data, reply_len); + } + + /* Now complete the command. Write status last in order as we lost + * ownership of the command structure as soon as it's no longer -1 + */ + done = cmd->done; + misc = cmd->misc; + mb(); + cmd->status = rc; + bail: + /* Start next command if any */ + smu_start_cmd(); + spin_unlock_irqrestore(&smu->lock, flags); + + /* Call command completion handler if any */ + if (done) + done(cmd, misc); + + /* It's an edge interrupt, nothing to do */ + return IRQ_HANDLED; } -static void smu_send_cmd(struct smu_device *dev) + +static irqreturn_t smu_msg_intr(int irq, void *arg, struct pt_regs *regs) { - /* SMU command buf is currently cacheable, we need a physical - * address. This isn't exactly a DMA mapping here, I suspect - * the SMU is actually communicating with us via i2c to the - * northbridge or the CPU to access RAM. + /* I don't quite know what to do with this one, we seem to never + * receive it, so I suspect we have to arm it someway in the SMU + * to start getting events that way. 
*/ - writel(dev->cmd_buf_abs, dev->db_buf); - /* Ring the SMU doorbell */ - pmac_do_feature_call(PMAC_FTR_WRITE_GPIO, NULL, dev->db_req, 4); - pmac_do_feature_call(PMAC_FTR_READ_GPIO, NULL, dev->db_req, 4); + printk(KERN_INFO "SMU: message interrupt !\n"); + + /* It's an edge interrupt, nothing to do */ + return IRQ_HANDLED; } -static int smu_cmd_done(struct smu_device *dev) + +/* + * Queued command management. + * + */ + +int smu_queue_cmd(struct smu_cmd *cmd) { - unsigned long wait = 0; - int gpio; + unsigned long flags; - /* Check the SMU doorbell */ - do { - gpio = pmac_do_feature_call(PMAC_FTR_READ_GPIO, - NULL, dev->db_ack); - if ((gpio & 7) == 7) - return 0; - udelay(100); - } while(++wait < 10000); + if (smu == NULL) + return -ENODEV; + if (cmd->data_len > SMU_MAX_DATA || + cmd->reply_len > SMU_MAX_DATA) + return -EINVAL; - printk(KERN_ERR "SMU timeout !\n"); - return -ENXIO; + cmd->status = 1; + spin_lock_irqsave(&smu->lock, flags); + list_add_tail(&cmd->link, &smu->cmd_list); + if (smu->cmd_cur == NULL) + smu_start_cmd(); + spin_unlock_irqrestore(&smu->lock, flags); + + return 0; } +EXPORT_SYMBOL(smu_queue_cmd); -static int smu_do_cmd(struct smu_device *dev) + +int smu_queue_simple(struct smu_simple_cmd *scmd, u8 command, + unsigned int data_len, + void (*done)(struct smu_cmd *cmd, void *misc), + void *misc, ...) { - int rc; - u8 cmd_ack; + struct smu_cmd *cmd = &scmd->cmd; + va_list list; + int i; - DPRINTK("SMU do_cmd %02x len=%d %02x\n", - dev->cmd_buf->cmd, dev->cmd_buf->length, - dev->cmd_buf->data[0]); - - cmd_ack = smu_save_ack_cmd(dev->cmd_buf); - - /* Clear cmd_buf cache lines */ - flush_inval_dcache_range((unsigned long)dev->cmd_buf, - ((unsigned long)dev->cmd_buf) + - sizeof(struct smu_cmd_buf)); - smu_send_cmd(dev); - rc = smu_cmd_done(dev); - if (rc == 0) - rc = smu_cmd_stat(dev->cmd_buf, cmd_ack) ? 
0 : -1; - - DPRINTK("SMU do_cmd %02x len=%d %02x => %d (%02x)\n", - dev->cmd_buf->cmd, dev->cmd_buf->length, - dev->cmd_buf->data[0], rc, cmd_ack); + if (data_len > sizeof(scmd->buffer)) + return -EINVAL; - return rc; + memset(scmd, 0, sizeof(*scmd)); + cmd->cmd = command; + cmd->data_len = data_len; + cmd->data_buf = scmd->buffer; + cmd->reply_len = sizeof(scmd->buffer); + cmd->reply_buf = scmd->buffer; + cmd->done = done; + cmd->misc = misc; + + va_start(list, misc); + for (i = 0; i < data_len; ++i) + scmd->buffer[i] = (u8)va_arg(list, int); + va_end(list); + + return smu_queue_cmd(cmd); } +EXPORT_SYMBOL(smu_queue_simple); -/* RTC low level commands */ -static inline int bcd2hex (int n) + +void smu_poll(void) { - return (((n & 0xf0) >> 4) * 10) + (n & 0xf); + u8 gpio; + + if (smu == NULL) + return; + + gpio = pmac_do_feature_call(PMAC_FTR_READ_GPIO, NULL, smu->doorbell); + if ((gpio & 7) == 7) + smu_db_intr(smu->db_irq, smu, NULL); } +EXPORT_SYMBOL(smu_poll); -static inline int hex2bcd (int n) + +void smu_done_complete(struct smu_cmd *cmd, void *misc) { - return ((n / 10) << 4) + (n % 10); + struct completion *comp = misc; + + complete(comp); } +EXPORT_SYMBOL(smu_done_complete); + -#if 0 -static inline void smu_fill_set_pwrup_timer_cmd(struct smu_cmd_buf *cmd_buf) +void smu_spinwait_cmd(struct smu_cmd *cmd) { - cmd_buf->cmd = 0x8e; - cmd_buf->length = 8; - cmd_buf->data[0] = 0x00; - memset(cmd_buf->data + 1, 0, 7); + while(cmd->status == 1) + smu_poll(); } +EXPORT_SYMBOL(smu_spinwait_cmd); -static inline void smu_fill_get_pwrup_timer_cmd(struct smu_cmd_buf *cmd_buf) + +/* RTC low level commands */ +static inline int bcd2hex (int n) { - cmd_buf->cmd = 0x8e; - cmd_buf->length = 1; - cmd_buf->data[0] = 0x01; + return (((n & 0xf0) >> 4) * 10) + (n & 0xf); } -static inline void smu_fill_dis_pwrup_timer_cmd(struct smu_cmd_buf *cmd_buf) + +static inline int hex2bcd (int n) { - cmd_buf->cmd = 0x8e; - cmd_buf->length = 1; - cmd_buf->data[0] = 0x02; + return ((n / 10) << 
4) + (n % 10); } -#endif + static inline void smu_fill_set_rtc_cmd(struct smu_cmd_buf *cmd_buf, struct rtc_time *time) @@ -202,100 +352,96 @@ cmd_buf->data[7] = hex2bcd(time->tm_year - 100); } -static inline void smu_fill_get_rtc_cmd(struct smu_cmd_buf *cmd_buf) -{ - cmd_buf->cmd = 0x8e; - cmd_buf->length = 1; - cmd_buf->data[0] = 0x81; -} - -static void smu_parse_get_rtc_reply(struct smu_cmd_buf *cmd_buf, - struct rtc_time *time) -{ - time->tm_sec = bcd2hex(cmd_buf->data[0]); - time->tm_min = bcd2hex(cmd_buf->data[1]); - time->tm_hour = bcd2hex(cmd_buf->data[2]); - time->tm_wday = bcd2hex(cmd_buf->data[3]); - time->tm_mday = bcd2hex(cmd_buf->data[4]); - time->tm_mon = bcd2hex(cmd_buf->data[5]) - 1; - time->tm_year = bcd2hex(cmd_buf->data[6]) + 100; -} -int smu_get_rtc_time(struct rtc_time *time) +int smu_get_rtc_time(struct rtc_time *time, int spinwait) { - unsigned long flags; + struct smu_simple_cmd cmd; int rc; if (smu == NULL) return -ENODEV; memset(time, 0, sizeof(struct rtc_time)); - spin_lock_irqsave(&smu->lock, flags); - smu_fill_get_rtc_cmd(smu->cmd_buf); - rc = smu_do_cmd(smu); - if (rc == 0) - smu_parse_get_rtc_reply(smu->cmd_buf, time); - spin_unlock_irqrestore(&smu->lock, flags); - - return rc; + rc = smu_queue_simple(&cmd, SMU_CMD_RTC_COMMAND, 1, NULL, NULL, + SMU_CMD_RTC_GET_DATETIME); + if (rc) + return rc; + smu_spinwait_simple(&cmd); + + time->tm_sec = bcd2hex(cmd.buffer[0]); + time->tm_min = bcd2hex(cmd.buffer[1]); + time->tm_hour = bcd2hex(cmd.buffer[2]); + time->tm_wday = bcd2hex(cmd.buffer[3]); + time->tm_mday = bcd2hex(cmd.buffer[4]); + time->tm_mon = bcd2hex(cmd.buffer[5]) - 1; + time->tm_year = bcd2hex(cmd.buffer[6]) + 100; + + return 0; } -int smu_set_rtc_time(struct rtc_time *time) + +int smu_set_rtc_time(struct rtc_time *time, int spinwait) { - unsigned long flags; + struct smu_simple_cmd cmd; int rc; if (smu == NULL) return -ENODEV; - spin_lock_irqsave(&smu->lock, flags); - smu_fill_set_rtc_cmd(smu->cmd_buf, time); - rc = 
smu_do_cmd(smu); - spin_unlock_irqrestore(&smu->lock, flags); + rc = smu_queue_simple(&cmd, SMU_CMD_RTC_COMMAND, 8, NULL, NULL, + SMU_CMD_RTC_SET_DATETIME, + hex2bcd(time->tm_sec), + hex2bcd(time->tm_min), + hex2bcd(time->tm_hour), + time->tm_wday, + hex2bcd(time->tm_mday), + hex2bcd(time->tm_mon) + 1, + hex2bcd(time->tm_year - 100)); + if (rc) + return rc; + smu_spinwait_simple(&cmd); - return rc; + return 0; } + void smu_shutdown(void) { - const unsigned char *command = "SHUTDOWN"; - unsigned long flags; + struct smu_simple_cmd cmd; if (smu == NULL) return; - spin_lock_irqsave(&smu->lock, flags); - smu->cmd_buf->cmd = 0xaa; - smu->cmd_buf->length = strlen(command); - strcpy(smu->cmd_buf->data, command); - smu_do_cmd(smu); + if (smu_queue_simple(&cmd, SMU_CMD_POWER_COMMAND, 9, NULL, NULL, + 'S', 'H', 'U', 'T', 'D', 'O', 'W', 'N', 0)) + return; + smu_spinwait_simple(&cmd); for (;;) ; - spin_unlock_irqrestore(&smu->lock, flags); } + void smu_restart(void) { - const unsigned char *command = "RESTART"; - unsigned long flags; + struct smu_simple_cmd cmd; if (smu == NULL) return; - spin_lock_irqsave(&smu->lock, flags); - smu->cmd_buf->cmd = 0xaa; - smu->cmd_buf->length = strlen(command); - strcpy(smu->cmd_buf->data, command); - smu_do_cmd(smu); + if (smu_queue_simple(&cmd, SMU_CMD_POWER_COMMAND, 8, NULL, NULL, + 'R', 'E', 'S', 'T', 'A', 'R', 'T', 0)) + return; + smu_spinwait_simple(&cmd); for (;;) ; - spin_unlock_irqrestore(&smu->lock, flags); } + int smu_present(void) { return smu != NULL; } +EXPORT_SYMBOL(smu_present); int smu_init (void) @@ -307,6 +453,8 @@ if (np == NULL) return -ENODEV; + printk(KERN_INFO "SMU driver %s %s\n", VERSION, AUTHOR); + if (smu_cmdbuf_abs == 0) { printk(KERN_ERR "SMU: Command buffer not allocated !\n"); return -EINVAL; @@ -318,7 +466,13 @@ memset(smu, 0, sizeof(*smu)); spin_lock_init(&smu->lock); + INIT_LIST_HEAD(&smu->cmd_list); + INIT_LIST_HEAD(&smu->cmd_i2c_list); smu->of_node = np; + smu->db_irq = NO_IRQ; + smu->msg_irq = NO_IRQ; + 
init_timer(&smu->i2c_timer); + /* smu_cmdbuf_abs is in the low 2G of RAM, can be converted to a * 32 bits value safely */ @@ -331,8 +485,8 @@ goto fail; } data = (u32 *)get_property(np, "reg", NULL); - of_node_put(np); if (data == NULL) { + of_node_put(np); printk(KERN_ERR "SMU: Can't find doorbell GPIO address !\n"); goto fail; } @@ -341,8 +495,31 @@ * and ack. GPIOs are at 0x50, best would be to find that out * in the device-tree though. */ - smu->db_req = 0x50 + *data; - smu->db_ack = 0x50 + *data; + smu->doorbell = *data; + if (smu->doorbell < 0x50) + smu->doorbell += 0x50; + if (np->n_intrs > 0) + smu->db_irq = np->intrs[0].line; + + of_node_put(np); + + /* Now look for the smu-interrupt GPIO */ + do { + np = of_find_node_by_name(NULL, "smu-interrupt"); + if (np == NULL) + break; + data = (u32 *)get_property(np, "reg", NULL); + if (data == NULL) { + of_node_put(np); + break; + } + smu->msg = *data; + if (smu->msg < 0x50) + smu->msg += 0x50; + if (np->n_intrs > 0) + smu->msg_irq = np->intrs[0].line; + of_node_put(np); + } while(0); /* Doorbell buffer is currently hard-coded, I didn't find a proper * device-tree entry giving the address. 
Best would probably to use @@ -362,3 +539,536 @@ return -ENXIO; } + + +static int smu_late_init(void) +{ + if (!smu) + return 0; + + /* + * Try to request the interrupts + */ + + if (smu->db_irq != NO_IRQ) { + if (request_irq(smu->db_irq, smu_db_intr, + SA_SHIRQ, "SMU doorbell", smu) < 0) { + printk(KERN_WARNING "SMU: can't " + "request interrupt %d\n", + smu->db_irq); + smu->db_irq = NO_IRQ; + } + } + + if (smu->msg_irq != NO_IRQ) { + if (request_irq(smu->msg_irq, smu_msg_intr, + SA_SHIRQ, "SMU message", smu) < 0) { + printk(KERN_WARNING "SMU: can't " + "request interrupt %d\n", + smu->msg_irq); + smu->msg_irq = NO_IRQ; + } + } + + return 0; +} +arch_initcall(smu_late_init); + +/* + * sysfs visibility + */ + +static void smu_expose_childs(void *unused) +{ + struct device_node *np; + + for (np = NULL; (np = of_get_next_child(smu->of_node, np)) != NULL;) { + if (device_is_compatible(np, "smu-i2c")) { + char name[32]; + u32 *reg = (u32 *)get_property(np, "reg", NULL); + + if (reg == NULL) + continue; + sprintf(name, "smu-i2c-%02x", *reg); + of_platform_device_create(np, name, &smu->of_dev->dev); + } + } + +} + +static DECLARE_WORK(smu_expose_childs_work, smu_expose_childs, NULL); + +static int smu_platform_probe(struct of_device* dev, + const struct of_match *match) +{ + if (!smu) + return -ENODEV; + smu->of_dev = dev; + + /* + * Ok, we are matched, now expose all i2c busses. 
We have to defer + * that unfortunately or it would deadlock inside the device model + */ + schedule_work(&smu_expose_childs_work); + + return 0; +} + +static struct of_match smu_platform_match[] = +{ + { + .name = OF_ANY_MATCH, + .type = "smu", + .compatible = OF_ANY_MATCH + }, + {}, +}; + +static struct of_platform_driver smu_of_platform_driver = +{ + .name = "smu", + .match_table = smu_platform_match, + .probe = smu_platform_probe, +}; + +static int __init smu_init_sysfs(void) +{ + int rc; + + /* + * Due to sysfs bogosity, a sysdev is not a real device, so + * we should in fact create both if we want sysdev semantics + * for power management. + * For now, we don't power manage machines with an SMU chip, + * I'm a bit too far from figuring out how that works with those + * new chipsets, but that will come back and bite us + */ + rc = of_register_driver(&smu_of_platform_driver); + return 0; +} + +device_initcall(smu_init_sysfs); + +struct of_device *smu_get_ofdev(void) +{ + if (!smu) + return NULL; + return smu->of_dev; +} + +EXPORT_SYMBOL_GPL(smu_get_ofdev); + +/* + * i2c interface + */ + +static void smu_i2c_complete_command(struct smu_i2c_cmd *cmd, int fail) +{ + void (*done)(struct smu_i2c_cmd *cmd, void *misc) = cmd->done; + void *misc = cmd->misc; + unsigned long flags; + + /* Check for read case */ + if (!fail && cmd->read) { + if (cmd->pdata[0] < 1) + fail = 1; + else + memcpy(cmd->info.data, &cmd->pdata[1], + cmd->info.datalen); + } + + DPRINTK("SMU: completing, success: %d\n", !fail); + + /* Update status and mark no pending i2c command with lock + * held so nobody comes in while we dequeue an eventual + * pending next i2c command + */ + spin_lock_irqsave(&smu->lock, flags); + smu->cmd_i2c_cur = NULL; + wmb(); + cmd->status = fail ? -EIO : 0; + + /* Is there another i2c command waiting ? 
*/ + if (!list_empty(&smu->cmd_i2c_list)) { + struct smu_i2c_cmd *newcmd; + + /* Fetch it, new current, remove from list */ + newcmd = list_entry(smu->cmd_i2c_list.next, + struct smu_i2c_cmd, link); + smu->cmd_i2c_cur = newcmd; + list_del(&cmd->link); + + /* Queue with low level smu */ + list_add_tail(&cmd->scmd.link, &smu->cmd_list); + if (smu->cmd_cur == NULL) + smu_start_cmd(); + } + spin_unlock_irqrestore(&smu->lock, flags); + + /* Call command completion handler if any */ + if (done) + done(cmd, misc); + +} + + +static void smu_i2c_retry(unsigned long data) +{ + struct smu_i2c_cmd *cmd = (struct smu_i2c_cmd *)data; + + DPRINTK("SMU: i2c failure, requeuing...\n"); + + /* requeue command simply by resetting reply_len */ + cmd->pdata[0] = 0xff; + cmd->scmd.reply_len = 0x10; + smu_queue_cmd(&cmd->scmd); +} + + +static void smu_i2c_low_completion(struct smu_cmd *scmd, void *misc) +{ + struct smu_i2c_cmd *cmd = misc; + int fail = 0; + + DPRINTK("SMU: i2c compl. stage=%d status=%x pdata[0]=%x rlen: %x\n", + cmd->stage, scmd->status, cmd->pdata[0], scmd->reply_len); + + /* Check for possible status */ + if (scmd->status < 0) + fail = 1; + else if (cmd->read) { + if (cmd->stage == 0) + fail = cmd->pdata[0] != 0; + else + fail = cmd->pdata[0] >= 0x80; + } else { + fail = cmd->pdata[0] != 0; + } + + /* Handle failures by requeuing command, after 5ms interval + */ + if (fail && --cmd->retries > 0) { + DPRINTK("SMU: i2c failure, starting timer...\n"); + smu->i2c_timer.function = smu_i2c_retry; + smu->i2c_timer.data = (unsigned long)cmd; + smu->i2c_timer.expires = jiffies + msecs_to_jiffies(5); + add_timer(&smu->i2c_timer); + return; + } + + /* If failure or stage 1, command is complete */ + if (fail || cmd->stage != 0) { + smu_i2c_complete_command(cmd, fail); + return; + } + + DPRINTK("SMU: going to stage 1\n"); + + /* Ok, initial command complete, now poll status */ + scmd->reply_buf = cmd->pdata; + scmd->reply_len = 0x10; + scmd->data_buf = cmd->pdata; + scmd->data_len = 
1; + cmd->pdata[0] = 0; + cmd->stage = 1; + cmd->retries = 20; + smu_queue_cmd(scmd); +} + + +int smu_queue_i2c(struct smu_i2c_cmd *cmd) +{ + unsigned long flags; + + if (smu == NULL) + return -ENODEV; + + /* Fill most fields of scmd */ + cmd->scmd.cmd = SMU_CMD_I2C_COMMAND; + cmd->scmd.done = smu_i2c_low_completion; + cmd->scmd.misc = cmd; + cmd->scmd.reply_buf = cmd->pdata; + cmd->scmd.reply_len = 0x10; + cmd->scmd.data_buf = (u8 *)(char *)&cmd->info; + cmd->scmd.status = 1; + cmd->stage = 0; + cmd->pdata[0] = 0xff; + cmd->retries = 20; + cmd->status = 1; + + /* Check transfer type, sanitize some "info" fields + * based on transfer type and do more checking + */ + cmd->info.caddr = cmd->info.devaddr; + cmd->read = cmd->info.devaddr & 0x01; + switch(cmd->info.type) { + case SMU_I2C_TRANSFER_SIMPLE: + memset(&cmd->info.sublen, 0, 4); + break; + case SMU_I2C_TRANSFER_COMBINED: + cmd->info.devaddr &= 0xfe; + case SMU_I2C_TRANSFER_STDSUB: + if (cmd->info.sublen > 3) + return -EINVAL; + break; + default: + return -EINVAL; + } + + /* Finish setting up command based on transfer direction + */ + if (cmd->read) { + if (cmd->info.datalen > SMU_I2C_READ_MAX) + return -EINVAL; + memset(cmd->info.data, 0xff, cmd->info.datalen); + cmd->scmd.data_len = 9; + } else { + if (cmd->info.datalen > SMU_I2C_WRITE_MAX) + return -EINVAL; + cmd->scmd.data_len = 9 + cmd->info.datalen; + } + + DPRINTK("SMU: i2c enqueuing command\n"); + DPRINTK("SMU: %s, len=%d bus=%x addr=%x sub0=%x type=%x\n", + cmd->read ? 
"read" : "write", cmd->info.datalen, + cmd->info.bus, cmd->info.caddr, + cmd->info.subaddr[0], cmd->info.type); + + + /* Enqueue command in i2c list, and if empty, enqueue also in + * main command list + */ + spin_lock_irqsave(&smu->lock, flags); + if (smu->cmd_i2c_cur == NULL) { + smu->cmd_i2c_cur = cmd; + list_add_tail(&cmd->scmd.link, &smu->cmd_list); + if (smu->cmd_cur == NULL) + smu_start_cmd(); + } else + list_add_tail(&cmd->link, &smu->cmd_i2c_list); + spin_unlock_irqrestore(&smu->lock, flags); + + return 0; +} + + + +/* + * Userland driver interface + */ + + +static LIST_HEAD(smu_clist); +static DEFINE_SPINLOCK(smu_clist_lock); + +enum smu_file_mode { + smu_file_commands, + smu_file_events, + smu_file_closing +}; + +struct smu_private +{ + struct list_head list; + enum smu_file_mode mode; + int busy; + struct smu_cmd cmd; + spinlock_t lock; + wait_queue_head_t wait; + u8 buffer[SMU_MAX_DATA]; +}; + + +static int smu_open(struct inode *inode, struct file *file) +{ + struct smu_private *pp; + unsigned long flags; + + pp = kmalloc(sizeof(struct smu_private), GFP_KERNEL); + if (pp == 0) + return -ENOMEM; + memset(pp, 0, sizeof(struct smu_private)); + spin_lock_init(&pp->lock); + pp->mode = smu_file_commands; + init_waitqueue_head(&pp->wait); + + spin_lock_irqsave(&smu_clist_lock, flags); + list_add(&pp->list, &smu_clist); + spin_unlock_irqrestore(&smu_clist_lock, flags); + file->private_data = pp; + + return 0; +} + + +static void smu_user_cmd_done(struct smu_cmd *cmd, void *misc) +{ + struct smu_private *pp = misc; + + wake_up_interruptible(&pp->wait); +} + + +static ssize_t smu_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct smu_private *pp = file->private_data; + unsigned long flags; + struct smu_user_cmd_hdr hdr; + int rc = 0; + + if (pp->busy) + return -EBUSY; + else if (copy_from_user(&hdr, buf, sizeof(hdr))) + return -EFAULT; + else if (hdr.cmdtype == SMU_CMDTYPE_WANTS_EVENTS) { + pp->mode = smu_file_events; + 
return 0; + } else if (hdr.cmdtype != SMU_CMDTYPE_SMU) + return -EINVAL; + else if (pp->mode != smu_file_commands) + return -EBADFD; + else if (hdr.data_len > SMU_MAX_DATA) + return -EINVAL; + + spin_lock_irqsave(&pp->lock, flags); + if (pp->busy) { + spin_unlock_irqrestore(&pp->lock, flags); + return -EBUSY; + } + pp->busy = 1; + pp->cmd.status = 1; + spin_unlock_irqrestore(&pp->lock, flags); + + if (copy_from_user(pp->buffer, buf + sizeof(hdr), hdr.data_len)) { + pp->busy = 0; + return -EFAULT; + } + + pp->cmd.cmd = hdr.cmd; + pp->cmd.data_len = hdr.data_len; + pp->cmd.reply_len = SMU_MAX_DATA; + pp->cmd.data_buf = pp->buffer; + pp->cmd.reply_buf = pp->buffer; + pp->cmd.done = smu_user_cmd_done; + pp->cmd.misc = pp; + rc = smu_queue_cmd(&pp->cmd); + if (rc < 0) + return rc; + return count; +} + + +static ssize_t smu_read_command(struct file *file, struct smu_private *pp, + char __user *buf, size_t count) +{ + DECLARE_WAITQUEUE(wait, current); + struct smu_user_reply_hdr hdr; + int size, rc = 0; + + if (!pp->busy) + return 0; + if (count < sizeof(struct smu_user_reply_hdr)) + return -EOVERFLOW; + if (pp->cmd.status == 1) { + if (file->f_flags & O_NONBLOCK) + return -EAGAIN; + add_wait_queue(&pp->wait, &wait); + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + rc = 0; + if (pp->cmd.status != 1) + break; + rc = -ERESTARTSYS; + if (signal_pending(current)) + break; + schedule(); + } + set_current_state(TASK_RUNNING); + remove_wait_queue(&pp->wait, &wait); + if (rc) + return rc; + } + if (pp->cmd.status != 0) + pp->cmd.reply_len = 0; + size = sizeof(hdr) + pp->cmd.reply_len; + if (count < size) + size = count; + rc = size; + hdr.status = pp->cmd.status; + hdr.reply_len = pp->cmd.reply_len; + if (copy_to_user(buf, &hdr, sizeof(hdr))) + return -EFAULT; + size -= sizeof(hdr); + if (size && copy_to_user(buf + sizeof(hdr), pp->buffer, size)) + return -EFAULT; + pp->busy = 0; + + return rc; +} + + +static ssize_t smu_read_events(struct file *file, struct smu_private 
*pp, + char __user *buf, size_t count) +{ + /* Not implemented */ + msleep_interruptible(1000); + return 0; +} + + +static ssize_t smu_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + struct smu_private *pp = file->private_data; + + if (pp->mode == smu_file_commands) + return smu_read_command(file, pp, buf, count); + if (pp->mode == smu_file_events) + return smu_read_events(file, pp, buf, count); + + return -EBADFD; +} + + +static int smu_release(struct inode *inode, struct file *file) +{ + struct smu_private *pp = file->private_data; + unsigned long flags; + + if (pp == 0) + return 0; + + file->private_data = NULL; + pp->mode = smu_file_closing; + // XXX wait completion + spin_lock_irqsave(&smu_clist_lock, flags); + list_del(&pp->list); + spin_unlock_irqrestore(&smu_clist_lock, flags); + kfree(pp); + + return 0; +} + + +static struct file_operations smu_device_fops __pmacdata = { + .llseek = no_llseek, + .read = smu_read, + .write = smu_write, + .open = smu_open, + .release = smu_release, +}; + +static struct miscdevice pmu_device __pmacdata = { + MISC_DYNAMIC_MINOR, "smu", &smu_device_fops +}; + +static int smu_device_init(void) +{ + if (!smu) + return -ENODEV; + if (misc_register(&pmu_device) < 0) + printk(KERN_ERR "via-pmu: cannot register misc device.\n"); + return 0; +} +device_initcall(smu_device_init); Index: linux-work/include/asm-ppc64/smu.h =================================================================== --- linux-work.orig/include/asm-ppc64/smu.h 2005-05-05 14:57:39.000000000 +1000 +++ linux-work/include/asm-ppc64/smu.h 2005-05-05 14:58:12.000000000 +1000 @@ -1,22 +1,379 @@ +#ifndef _SMU_H +#define _SMU_H + /* * Definitions for talking to the SMU chip in newer G5 PowerMacs */ #include +#include + +/* + * Known SMU commands + * + * Most of what is below comes from looking at the Open Firmware driver, + * though this is still incomplete and could use better documentation here + * or there... 
+ */ + + +/* + * Partition info commands + * + * I do not know what those are for at this point + */ +#define SMU_CMD_PARTITION_COMMAND 0x3e + /* - * Basic routines for use by architecture. To be extended as - * we understand more of the chip + * Fan control + * + * This is a "mux" for fan control commands, first byte is the + * "sub" command. + */ +#define SMU_CMD_FAN_COMMAND 0x4a + + +/* + * Battery access + * + * Same command number as the PMU, could it be same syntax ? + */ +#define SMU_CMD_BATTERY_COMMAND 0x6f +#define SMU_CMD_GET_BATTERY_INFO 0x00 + +/* + * Real time clock control + * + * This is a "mux", first data byte contains the "sub" command. + * The "RTC" part of the SMU controls the date, time, powerup + * timer, but also a PRAM + * + * Dates are in BCD format on 7 bytes: + * [sec] [min] [hour] [weekday] [month day] [month] [year] + * with month being 1 based and year minus 100 + */ +#define SMU_CMD_RTC_COMMAND 0x8e +#define SMU_CMD_RTC_SET_PWRUP_TIMER 0x00 /* i: 7 bytes date */ +#define SMU_CMD_RTC_GET_PWRUP_TIMER 0x01 /* o: 7 bytes date */ +#define SMU_CMD_RTC_STOP_PWRUP_TIMER 0x02 +#define SMU_CMD_RTC_SET_PRAM_BYTE_ACC 0x20 /* i: 1 byte (address?) */ +#define SMU_CMD_RTC_SET_PRAM_AUTOINC 0x21 /* i: 1 byte (data?) */ +#define SMU_CMD_RTC_SET_PRAM_LO_BYTES 0x22 /* i: 10 bytes */ +#define SMU_CMD_RTC_SET_PRAM_HI_BYTES 0x23 /* i: 10 bytes */ +#define SMU_CMD_RTC_GET_PRAM_BYTE 0x28 /* i: 1 bytes (address?) */ +#define SMU_CMD_RTC_GET_PRAM_LO_BYTES 0x29 /* o: 10 bytes */ +#define SMU_CMD_RTC_GET_PRAM_HI_BYTES 0x2a /* o: 10 bytes */ +#define SMU_CMD_RTC_SET_DATETIME 0x80 /* i: 7 bytes date */ +#define SMU_CMD_RTC_GET_DATETIME 0x81 /* o: 7 bytes date */ + + /* + * i2c commands + * + * To issue an i2c command, first is to send a parameter block to the + * the SMU. This is a command of type 0x9a with 9 bytes of header + * eventually followed by data for a write: + * + * 0: bus number (from device-tree usually, SMU has lots of busses !) 
+ * 1: transfer type/format (see below) + * 2: device address. For combined and combined4 type transfers, this + * is the "write" version of the address (bit 0x01 cleared) + * 3: subaddress length (0..3) + * 4: subaddress byte 0 (or only byte for subaddress length 1) + * 5: subaddress byte 1 + * 6: subaddress byte 2 + * 7: combined address (device address for combined mode data phase) + * 8: data length + * + * The transfer types are the same good old Apple ones it seems, + * that is: + * - 0x00: Simple transfer + * - 0x01: Subaddress transfer (addr write + data tx, no restart) + * - 0x02: Combined transfer (addr write + restart + data tx) + * + * This is then followed by actual data for a write. + * + * At this point, the OF driver seems to have a limitation on transfer + * sizes of 0xd bytes on reads and 0x5 bytes on writes. I do not know + * whether this is just an OF limit due to some temporary buffer size + * or if this is an SMU imposed limit. This driver has the same limitation + * for now as I use a 0x10 bytes temporary buffer as well + * + * Once that is completed, a response is expected from the SMU. This is + * obtained via a command of type 0x9a with a length of 1 byte containing + * 0 as the data byte. OF also fills the rest of the data buffer with 0xff's + * though I can't tell yet if this is actually necessary. Once this command + * is complete, at this point, all I can tell is what OF does. OF tests + * byte 0 of the reply: + * - on read, 0xfe or 0xfc : bus is busy, wait (see below) or nak ? + * - on read, 0x00 or 0x01 : reply is in buffer (after the byte 0) + * - on write, < 0 -> failure (immediate exit) + * - else, OF just exits (without error, weird) + * + * So on read, there is this wait-for-busy thing when getting a 0xfc or + * 0xfe result. OF does a loop of up to 64 retries, waiting 20ms and + * doing the above again until either the retries expire or the result + * is no longer 0xfe or 0xfc + * + * The Darwin I2C driver is less subtle though. 
On any non-success status + * from the response command, it waits 5ms and tries again up to 20 times, + * it doesn't differentiate between fatal errors or "busy" status. + * + * This driver provides an asynchronous paramblock based i2c command + * interface to be used either directly by low level code or by a higher + * level driver interfacing to the linux i2c layer. The current + * implementation of this relies on working timers & timer interrupts + * though, so be careful of calling context for now. This may be "fixed" + * in the future by adding a polling facility. + */ +#define SMU_CMD_I2C_COMMAND 0x9a + /* transfer types */ +#define SMU_I2C_TRANSFER_SIMPLE 0x00 +#define SMU_I2C_TRANSFER_STDSUB 0x01 +#define SMU_I2C_TRANSFER_COMBINED 0x02 + +/* + * Power supply control + * + * The "sub" command is an ASCII string in the data, the + * data length is that of the string. + * + * The VSLEW command can be used to get or set the voltage slewing. + * - length 5 (only "VSLEW") : it returns "DONE" and 3 bytes of + * reply at data offset 6, 7 and 8. + * - length 8 ("VSLEWxyz") has 3 additional bytes appended, and is + * used to set the voltage slewing point. The SMU replies with "DONE" + * I have yet to figure out the exact meaning of those 3 bytes in + * both cases. 
+ * + */ +#define SMU_CMD_POWER_COMMAND 0xaa +#define SMU_CMD_POWER_RESTART "RESTART" +#define SMU_CMD_POWER_SHUTDOWN "SHUTDOWN" +#define SMU_CMD_POWER_VOLTAGE_SLEW "VSLEW" + +/* Misc commands + * + * This command seem to be a grab bag of various things + */ +#define SMU_CMD_MISC_df_COMMAND 0xdf +#define SMU_CMD_MISC_df_SET_DISPLAY_LIT 0x02 /* i: 1 byte */ +#define SMU_CMD_MISC_df_NMI_OPTION 0x04 + +/* + * Version info commands + * + * I haven't quite tried to figure out how these work + */ +#define SMU_CMD_VERSION_COMMAND 0xea + + +/* + * Misc commands + * + * This command seem to be a grab bag of various things + */ +#define SMU_CMD_MISC_ee_COMMAND 0xee +#define SMU_CMD_MISC_ee_GET_DATABLOCK_REC 0x02 +#define SMU_CMD_MISC_ee_LEDS_CTRL 0x04 /* i: 00 (00,01) [00] */ +#define SMU_CMD_MISC_ee_GET_DATA 0x05 /* i: 00 , o: ?? */ + + + +/* + * - Kernel side interface - + */ + +#ifdef __KERNEL__ + +/* + * Asynchronous SMU commands + * + * Fill up this structure and submit it via smu_queue_command(), + * and get notified by the optional done() callback, or because + * status becomes != 1 + */ + +struct smu_cmd; + +struct smu_cmd +{ + /* public */ + u8 cmd; /* command */ + int data_len; /* data len */ + int reply_len; /* reply len */ + void *data_buf; /* data buffer */ + void *reply_buf; /* reply buffer */ + int status; /* command status */ + void (*done)(struct smu_cmd *cmd, void *misc); + void *misc; + + /* private */ + struct list_head link; +}; + +/* + * Queues an SMU command, all fields have to be initialized + */ +extern int smu_queue_cmd(struct smu_cmd *cmd); + +/* + * Simple command wrapper. This structure embeds a small buffer + * to ease sending simple SMU commands from the stack + */ +struct smu_simple_cmd +{ + struct smu_cmd cmd; + u8 buffer[16]; +}; + +/* + * Queues a simple command. 
All fields will be initialized by that + * function + */ +extern int smu_queue_simple(struct smu_simple_cmd *scmd, u8 command, + unsigned int data_len, + void (*done)(struct smu_cmd *cmd, void *misc), + void *misc, + ...); + +/* + * Completion helper. Pass it to smu_queue_simple or as 'done' + * member to smu_queue_cmd, it will call complete() on the struct + * completion passed in the "misc" argument + */ +extern void smu_done_complete(struct smu_cmd *cmd, void *misc); + +/* + * Synchronous helpers. Will spin-wait for completion of a command + */ +extern void smu_spinwait_cmd(struct smu_cmd *cmd); + +static inline void smu_spinwait_simple(struct smu_simple_cmd *scmd) +{ + smu_spinwait_cmd(&scmd->cmd); +} + +/* + * Poll routine to call if blocked with irqs off + */ +extern void smu_poll(void); + + +/* + * Init routine, presence check.... */ extern int smu_init(void); extern int smu_present(void); +struct of_device; +extern struct of_device *smu_get_ofdev(void); + + +/* + * Common command wrappers + */ extern void smu_shutdown(void); extern void smu_restart(void); -extern int smu_get_rtc_time(struct rtc_time *time); -extern int smu_set_rtc_time(struct rtc_time *time); +struct rtc_time; +extern int smu_get_rtc_time(struct rtc_time *time, int spinwait); +extern int smu_set_rtc_time(struct rtc_time *time, int spinwait); /* * SMU command buffer absolute address, exported by pmac_setup, * this is allocated very early during boot. 
*/ extern unsigned long smu_cmdbuf_abs; + + +/* + * Kernel asynchronous i2c interface + */ + +/* SMU i2c header, exactly matches i2c header on wire */ +struct smu_i2c_param +{ + u8 bus; /* SMU bus ID (from device tree) */ + u8 type; /* i2c transfer type */ + u8 devaddr; /* device address (includes direction) */ + u8 sublen; /* subaddress length */ + u8 subaddr[3]; /* subaddress */ + u8 caddr; /* combined address, filled by SMU driver */ + u8 datalen; /* length of transfer */ + u8 data[7]; /* data */ +}; + +#define SMU_I2C_READ_MAX 0x0d +#define SMU_I2C_WRITE_MAX 0x05 + +struct smu_i2c_cmd +{ + /* public */ + struct smu_i2c_param info; + void (*done)(struct smu_i2c_cmd *cmd, void *misc); + void *misc; + int status; /* 1 = pending, 0 = ok, <0 = fail */ + + /* private */ + struct smu_cmd scmd; + int read; + int stage; + int retries; + u8 pdata[0x10]; + struct list_head link; +}; + +/* + * Call this to queue an i2c command to the SMU. You must fill info, + * including info.data for a write, done and misc. + * For now, no polling interface is provided so you have to use completion + * callback. + */ +extern int smu_queue_i2c(struct smu_i2c_cmd *cmd); + + +#endif /* __KERNEL__ */ + +/* + * - Userland interface - + */ + +/* + * A given instance of the device can be configured for 2 different + * things at the moment: + * + * - sending SMU commands (default at open() time) + * - receiving SMU events (not yet implemented) + * + * Commands are written with write() of a command block. They can be + * "driver" commands (for example to switch to event reception mode) + * or real SMU commands. They are made of a header followed by command + * data if any. + * + * For SMU commands (not for driver commands), you can then read() back + * a reply. The reader will be blocked or not depending on how the device + * file is opened. poll() isn't implemented yet. The reply will consist + * of a header as well, followed by the reply data if any. 
You should + * always provide a buffer large enough for the maximum reply data, I + * recommend one page. + * + * It is illegal to send SMU commands through a file descriptor configured + * for events reception + * + */ +struct smu_user_cmd_hdr +{ + __u32 cmdtype; +#define SMU_CMDTYPE_SMU 0 /* SMU command */ +#define SMU_CMDTYPE_WANTS_EVENTS 1 /* switch fd to events mode */ + + __u8 cmd; /* SMU command byte */ + __u32 data_len; /* Length of data following */ +}; + +struct smu_user_reply_hdr +{ + __u32 status; /* Command status */ + __u32 reply_len; /* Length of data following */ +}; + +#endif /* _SMU_H */ Index: linux-work/arch/ppc64/kernel/pmac_time.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/pmac_time.c 2005-05-05 14:57:39.000000000 +1000 +++ linux-work/arch/ppc64/kernel/pmac_time.c 2005-05-05 14:58:12.000000000 +1000 @@ -89,7 +89,7 @@ #ifdef CONFIG_PMAC_SMU case SYS_CTRLER_SMU: - smu_get_rtc_time(tm); + smu_get_rtc_time(tm, 1); break; #endif /* CONFIG_PMAC_SMU */ default: @@ -133,7 +133,7 @@ #ifdef CONFIG_PMAC_SMU case SYS_CTRLER_SMU: - return smu_set_rtc_time(tm); + return smu_set_rtc_time(tm, 1); #endif /* CONFIG_PMAC_SMU */ default: return -ENODEV; Index: linux-work/drivers/i2c/busses/i2c-pmac-smu.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/drivers/i2c/busses/i2c-pmac-smu.c 2005-05-05 16:23:19.000000000 +1000 @@ -0,0 +1,298 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static int probe; + +MODULE_AUTHOR("Benjamin Herrenschmidt "); +MODULE_DESCRIPTION("I2C driver for Apple's SMU"); +MODULE_LICENSE("GPL"); +module_param(probe, bool, 0); + + +/* Physical interface */ +struct smu_iface +{ + struct i2c_adapter adapter; + struct completion complete; + u32 busid; +}; + +static void smu_i2c_done(struct smu_i2c_cmd *cmd, void *misc) +{ + struct smu_iface 
*iface = misc; + complete(&iface->complete); +} + +/* + * SMBUS-type transfer entrypoint + */ +static s32 smu_smbus_xfer( struct i2c_adapter* adap, + u16 addr, + unsigned short flags, + char read_write, + u8 command, + int size, + union i2c_smbus_data* data) +{ + struct smu_iface *iface = i2c_get_adapdata(adap); + struct smu_i2c_cmd cmd; + int rc = 0; + int read = (read_write == I2C_SMBUS_READ); + + cmd.info.bus = iface->busid; + cmd.info.devaddr = (addr << 1) | (read ? 0x01 : 0x00); + + /* Prepare data & select mode */ + switch (size) { + case I2C_SMBUS_QUICK: + cmd.info.type = SMU_I2C_TRANSFER_SIMPLE; + cmd.info.datalen = 0; + break; + case I2C_SMBUS_BYTE: + cmd.info.type = SMU_I2C_TRANSFER_SIMPLE; + cmd.info.datalen = 1; + if (!read) + cmd.info.data[0] = data->byte; + break; + case I2C_SMBUS_BYTE_DATA: + cmd.info.type = SMU_I2C_TRANSFER_STDSUB; + cmd.info.datalen = 1; + cmd.info.sublen = 1; + cmd.info.subaddr[0] = command; + cmd.info.subaddr[1] = 0; + cmd.info.subaddr[2] = 0; + if (!read) + cmd.info.data[0] = data->byte; + break; + case I2C_SMBUS_WORD_DATA: + cmd.info.type = SMU_I2C_TRANSFER_STDSUB; + cmd.info.datalen = 2; + cmd.info.sublen = 1; + cmd.info.subaddr[0] = command; + cmd.info.subaddr[1] = 0; + cmd.info.subaddr[2] = 0; + if (!read) { + cmd.info.data[0] = data->word & 0xff; + cmd.info.data[1] = (data->word >> 8) & 0xff; + } + break; + /* Note that these are broken vs. the expected smbus API where + * on reads, the length is actually returned from the function, + * but I think the current API makes no sense and I don't want + * any driver that I haven't verified for correctness to go + * anywhere near a pmac i2c bus anyway ... 
+ */ + case I2C_SMBUS_BLOCK_DATA: + cmd.info.type = SMU_I2C_TRANSFER_STDSUB; + cmd.info.datalen = data->block[0] + 1; + if (cmd.info.datalen > 6) + return -EINVAL; + if (!read) + memcpy(cmd.info.data, data->block, cmd.info.datalen); + cmd.info.sublen = 1; + cmd.info.subaddr[0] = command; + cmd.info.subaddr[1] = 0; + cmd.info.subaddr[2] = 0; + break; + case I2C_SMBUS_I2C_BLOCK_DATA: + cmd.info.type = SMU_I2C_TRANSFER_STDSUB; + cmd.info.datalen = data->block[0]; + if (cmd.info.datalen > 7) + return -EINVAL; + if (!read) + memcpy(cmd.info.data, &data->block[1], + cmd.info.datalen); + cmd.info.sublen = 1; + cmd.info.subaddr[0] = command; + cmd.info.subaddr[1] = 0; + cmd.info.subaddr[2] = 0; + break; + + default: + return -EINVAL; + } + + /* Turn a standardsub read into a combined mode access */ + if (read_write == I2C_SMBUS_READ && + cmd.info.type == SMU_I2C_TRANSFER_STDSUB) + cmd.info.type = SMU_I2C_TRANSFER_COMBINED; + + /* Finish filling command and submit it */ + cmd.done = smu_i2c_done; + cmd.misc = iface; + rc = smu_queue_i2c(&cmd); + if (rc < 0) + return rc; + wait_for_completion(&iface->complete); + rc = cmd.status; + + if (!read || rc < 0) + return rc; + + switch (size) { + case I2C_SMBUS_BYTE: + case I2C_SMBUS_BYTE_DATA: + data->byte = cmd.info.data[0]; + break; + case I2C_SMBUS_WORD_DATA: + data->word = ((u16)cmd.info.data[1]) << 8; + data->word |= cmd.info.data[0]; + break; + /* Note that these are broken vs. the expected smbus API where + * on reads, the length is actually returned from the function, + * but I think the current API makes no sense and I don't want + * any driver that I haven't verified for correctness to go + * anywhere near a pmac i2c bus anyway ... 
+ */ + case I2C_SMBUS_BLOCK_DATA: + case I2C_SMBUS_I2C_BLOCK_DATA: + memcpy(&data->block[0], cmd.info.data, cmd.info.datalen); + break; + } + + return rc; +} + +static u32 +smu_smbus_func(struct i2c_adapter * adapter) +{ + return I2C_FUNC_SMBUS_QUICK | I2C_FUNC_SMBUS_BYTE | + I2C_FUNC_SMBUS_BYTE_DATA | I2C_FUNC_SMBUS_WORD_DATA | + I2C_FUNC_SMBUS_BLOCK_DATA; +} + +/* For now, we only handle combined mode (smbus) */ +static struct i2c_algorithm smu_algorithm = { + .name = "SMU i2c", + .id = I2C_ALGO_SMBUS, + .smbus_xfer = smu_smbus_xfer, + .functionality = smu_smbus_func, +}; + +static int create_iface(struct device_node *np, struct device *dev) +{ + struct smu_iface* iface; + u32 *reg, busid; + int rc; + + reg = (u32 *)get_property(np, "reg", NULL); + if (reg == NULL) { + printk(KERN_ERR "i2c-pmac-smu: can't find bus number !\n"); + return -ENXIO; + } + busid = *reg; + + iface = kmalloc(sizeof(struct smu_iface), GFP_KERNEL); + if (iface == NULL) { + printk(KERN_ERR "i2c-pmac-smu: can't allocate interface !\n"); + return -ENOMEM; + } + memset(iface, 0, sizeof(struct smu_iface)); + init_completion(&iface->complete); + iface->busid = busid; + + dev_set_drvdata(dev, iface); + + sprintf(iface->adapter.name, "smu-i2c-%02x", busid); + iface->adapter.id = I2C_ALGO_SMBUS; + iface->adapter.algo = &smu_algorithm; + iface->adapter.algo_data = NULL; + iface->adapter.client_register = NULL; + iface->adapter.client_unregister = NULL; + i2c_set_adapdata(&iface->adapter, iface); + iface->adapter.dev.parent = dev; + + rc = i2c_add_adapter(&iface->adapter); + if (rc) { + printk(KERN_ERR "i2c-pmac-smu.c: Adapter %s registration " + "failed\n", iface->adapter.name); + i2c_set_adapdata(&iface->adapter, NULL); + } + + if (probe) { + unsigned char addr; + printk("Probe: "); + for (addr = 0x00; addr <= 0x7f; addr++) { + if (i2c_smbus_xfer(&iface->adapter,addr, + 0,0,0,I2C_SMBUS_QUICK,NULL) >= 0) + printk("%02x ", addr); + } + printk("\n"); + } + + printk(KERN_INFO "SMU i2c bus %x 
registered\n", busid); + + return 0; +} + +static int dispose_iface(struct device *dev) +{ + struct smu_iface *iface = dev_get_drvdata(dev); + int rc; + + rc = i2c_del_adapter(&iface->adapter); + i2c_set_adapdata(&iface->adapter, NULL); + /* We aren't that prepared to deal with this... */ + if (rc) + printk("i2c-pmac-smu.c: Failed to remove bus %s !\n", + iface->adapter.name); + dev_set_drvdata(dev, NULL); + kfree(iface); + + return 0; +} + + +static int create_iface_of_platform(struct of_device* dev, + const struct of_match *match) +{ + return create_iface(dev->node, &dev->dev); +} + + +static int dispose_iface_of_platform(struct of_device* dev) +{ + return dispose_iface(&dev->dev); +} + + +static struct of_match i2c_smu_match[] = +{ + { + .name = OF_ANY_MATCH, + .type = OF_ANY_MATCH, + .compatible = "smu-i2c" + }, + {}, +}; +static struct of_platform_driver i2c_smu_of_platform_driver = +{ + .name = "i2c-smu", + .match_table = i2c_smu_match, + .probe = create_iface_of_platform, + .remove = dispose_iface_of_platform +}; + + +static int __init i2c_pmac_smu_init(void) +{ + of_register_driver(&i2c_smu_of_platform_driver); + return 0; +} + + +static void __exit i2c_pmac_smu_cleanup(void) +{ + of_unregister_driver(&i2c_smu_of_platform_driver); +} + +module_init(i2c_pmac_smu_init); +module_exit(i2c_pmac_smu_cleanup); Index: linux-work/drivers/i2c/busses/Kconfig =================================================================== --- linux-work.orig/drivers/i2c/busses/Kconfig 2005-05-05 14:57:39.000000000 +1000 +++ linux-work/drivers/i2c/busses/Kconfig 2005-05-05 14:58:12.000000000 +1000 @@ -235,6 +235,18 @@ This support is also available as a module. If so, the module will be called i2c-keywest. +config I2C_PMAC_SMU + tristate "Powermac SMU I2C interface" + depends on I2C && PPC_PMAC64 + help + This supports the use of the I2C interface in the SMU + chip on recent Apple machines like the iMac G5. It is used + among others by the thermal control driver for those machines. 
+ Say Y if you have such a machine. + + This support is also available as a module. If so, the module + will be called i2c-pmac-smu. + config I2C_MPC tristate "MPC107/824x/85xx/52xx" depends on I2C && PPC32 Index: linux-work/drivers/i2c/busses/Makefile =================================================================== --- linux-work.orig/drivers/i2c/busses/Makefile 2005-05-05 14:57:39.000000000 +1000 +++ linux-work/drivers/i2c/busses/Makefile 2005-05-05 14:58:12.000000000 +1000 @@ -20,6 +20,7 @@ obj-$(CONFIG_I2C_IXP2000) += i2c-ixp2000.o obj-$(CONFIG_I2C_IXP4XX) += i2c-ixp4xx.o obj-$(CONFIG_I2C_KEYWEST) += i2c-keywest.o +obj-$(CONFIG_I2C_PMAC_SMU) += i2c-pmac-smu.o obj-$(CONFIG_I2C_MPC) += i2c-mpc.o obj-$(CONFIG_I2C_MV64XXX) += i2c-mv64xxx.o obj-$(CONFIG_I2C_NFORCE2) += i2c-nforce2.o Index: linux-work/arch/ppc64/kernel/pmac_setup.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/pmac_setup.c 2005-05-05 14:57:39.000000000 +1000 +++ linux-work/arch/ppc64/kernel/pmac_setup.c 2005-05-05 16:14:18.000000000 +1000 @@ -444,15 +444,23 @@ static int __init pmac_declare_of_platform_devices(void) { - struct device_node *np; + struct device_node *np, *npp; - np = find_devices("u3"); - if (np) { - for (np = np->child; np != NULL; np = np->sibling) + npp = of_find_node_by_name(NULL, "u3"); + if (npp) { + for (np = NULL; (np = of_get_next_child(npp, np)) != NULL;) { if (strncmp(np->name, "i2c", 3) == 0) { - of_platform_device_create(np, "u3-i2c"); + of_platform_device_create(np, "u3-i2c", NULL); + of_node_put(np); break; } + } + of_node_put(npp); + } + npp = of_find_node_by_type(NULL, "smu"); + if (npp) { + of_platform_device_create(npp, "smu", NULL); + of_node_put(npp); } return 0; Index: linux-work/drivers/macintosh/therm_adt746x.c =================================================================== --- linux-work.orig/drivers/macintosh/therm_adt746x.c 2005-05-02 10:48:11.000000000 +1000 +++ 
linux-work/drivers/macintosh/therm_adt746x.c 2005-05-05 16:17:03.000000000 +1000 @@ -563,7 +563,7 @@ "limit_adjust: %d, fan_speed: %d\n", therm_bus, therm_address, limit_adjust, fan_speed); - of_dev = of_platform_device_create(np, "temperatures"); + of_dev = of_platform_device_create(np, "temperatures", NULL); if (of_dev == NULL) { printk(KERN_ERR "Can't register temperatures device !\n"); Index: linux-work/drivers/macintosh/therm_pm72.c =================================================================== --- linux-work.orig/drivers/macintosh/therm_pm72.c 2005-05-02 10:48:11.000000000 +1000 +++ linux-work/drivers/macintosh/therm_pm72.c 2005-05-05 16:16:44.000000000 +1000 @@ -2052,7 +2052,7 @@ return -ENODEV; } } - of_dev = of_platform_device_create(np, "temperature"); + of_dev = of_platform_device_create(np, "temperature", NULL); if (of_dev == NULL) { printk(KERN_ERR "Can't register FCU platform device !\n"); return -ENODEV; Index: linux-work/arch/ppc64/kernel/of_device.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/of_device.c 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc64/kernel/of_device.c 2005-05-05 16:21:46.000000000 +1000 @@ -232,7 +232,9 @@ device_unregister(&ofdev->dev); } -struct of_device* of_platform_device_create(struct device_node *np, const char *bus_id) +struct of_device* of_platform_device_create(struct device_node *np, + const char *bus_id, + struct device *parent) { struct of_device *dev; u32 *reg; @@ -245,7 +247,7 @@ dev->node = np; dev->dma_mask = 0xffffffffUL; dev->dev.dma_mask = &dev->dma_mask; - dev->dev.parent = NULL; + dev->dev.parent = parent; dev->dev.bus = &of_platform_bus_type; dev->dev.release = of_release_dev; @@ -260,6 +262,7 @@ return dev; } + EXPORT_SYMBOL(of_match_device); EXPORT_SYMBOL(of_platform_bus_type); EXPORT_SYMBOL(of_register_driver); Index: linux-work/arch/ppc/platforms/pmac_setup.c 
=================================================================== --- linux-work.orig/arch/ppc/platforms/pmac_setup.c 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc/platforms/pmac_setup.c 2005-05-05 16:04:32.000000000 +1000 @@ -719,7 +719,8 @@ if (np) { for (np = np->child; np != NULL; np = np->sibling) if (strncmp(np->name, "i2c", 3) == 0) { - of_platform_device_create(np, "uni-n-i2c"); + of_platform_device_create(np, "uni-n-i2c", + NULL); break; } } @@ -727,17 +728,18 @@ if (np) { for (np = np->child; np != NULL; np = np->sibling) if (strncmp(np->name, "i2c", 3) == 0) { - of_platform_device_create(np, "u3-i2c"); + of_platform_device_create(np, "u3-i2c", + NULL); break; } } np = find_devices("valkyrie"); if (np) - of_platform_device_create(np, "valkyrie"); + of_platform_device_create(np, "valkyrie", NULL); np = find_devices("platinum"); if (np) - of_platform_device_create(np, "platinum"); + of_platform_device_create(np, "platinum", NULL); return 0; } Index: linux-work/include/asm-ppc/of_device.h =================================================================== --- linux-work.orig/include/asm-ppc/of_device.h 2005-05-02 10:49:57.000000000 +1000 +++ linux-work/include/asm-ppc/of_device.h 2005-05-05 16:03:54.000000000 +1000 @@ -67,7 +67,9 @@ extern void of_unregister_driver(struct of_platform_driver *drv); extern int of_device_register(struct of_device *ofdev); extern void of_device_unregister(struct of_device *ofdev); -extern struct of_device *of_platform_device_create(struct device_node *np, const char *bus_id); +extern struct of_device *of_platform_device_create(struct device_node *np, + const char *bus_id, + struct device *parent); extern void of_release_dev(struct device *dev); #endif /* __OF_DEVICE_H__ */ Index: linux-work/arch/ppc/syslib/of_device.c =================================================================== --- linux-work.orig/arch/ppc/syslib/of_device.c 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc/syslib/of_device.c 
2005-05-05 17:14:01.000000000 +1000 @@ -233,7 +233,9 @@ device_unregister(&ofdev->dev); } -struct of_device* of_platform_device_create(struct device_node *np, const char *bus_id) +struct of_device* of_platform_device_create(struct device_node *np, + const char *bus_id, + struct device *parent) { struct of_device *dev; u32 *reg; @@ -246,7 +248,7 @@ dev->node = of_node_get(np); dev->dma_mask = 0xffffffffUL; dev->dev.dma_mask = &dev->dma_mask; - dev->dev.parent = NULL; + dev->dev.parent = parent; dev->dev.bus = &of_platform_bus_type; dev->dev.release = of_release_dev; From apw at shadowen.org Thu May 5 22:04:10 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Thu, 05 May 2005 13:04:10 +0100 Subject: [2/3] add memory present for ppc64 In-Reply-To: <20050505023119.GA20283@austin.ibm.com> References: <20050505023119.GA20283@austin.ibm.com> Message-ID: <427A0BBA.1080803@shadowen.org> Olof Johansson wrote: > On Wed, May 04, 2005 at 09:29:57PM +0100, Andy Whitcroft wrote: > > >>diff -X /home/apw/brief/lib/vdiff.excl -rupN reference/arch/ppc64/Kconfig current/arch/ppc64/Kconfig >>--- reference/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 >>+++ current/arch/ppc64/Kconfig 2005-05-04 20:54:50.000000000 +0100 >>@@ -212,8 +212,8 @@ config ARCH_FLATMEM_ENABLE >> source "mm/Kconfig" >> >> config HAVE_ARCH_EARLY_PFN_TO_NID >>- bool >>- default y >>+ def_bool y >>+ depends on NEED_MULTIPLE_NODES > > > Ok, time to show my lack of undestanding here, but when can we ever be > CONFIG_NUMA and NOT need multiple nodes? > > >>@@ -481,6 +483,7 @@ static void __init setup_nonnuma(void) >> >> for (i = 0 ; i < top_of_ram; i += MEMORY_INCREMENT) >> numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = 0; >>+ memory_present(0, 0, init_node_data[0].node_end_pfn); > > > Isn't the memory_present stuff and numa_memory_lookup_table two > implementations doing the same thing (mapping memory to nodes)? > Can we kill numa_memory_lookup_table with this? 
This table is basically part of the DISCONTIGMEM implementation and is used lightly by SPARSEMEM. In the i386 port we have already pushed that out into a discontigmem implementation of memory_present. That is a logical next step in this port and I've got some of it already done. That should sit nicely on this lot. I'll work on this one. -apw From jschopp at austin.ibm.com Fri May 6 01:44:42 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Thu, 05 May 2005 10:44:42 -0500 Subject: [1/3] add early_pfn_to_nid for ppc64 In-Reply-To: References: Message-ID: <427A3F6A.6060405@austin.ibm.com> > +#ifdef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID > +#define early_pfn_to_nid(pfn) pa_to_nid(((unsigned long)pfn) << PAGE_SHIFT) > +#endif Is there a reason we didn't just use pfn_to_nid() directly here instead of pa_to_nid()? I'm just thinking of having DISCONTIG/NUMA off and pfn_to_nid() being #defined to zero for those cases. From apw at shadowen.org Fri May 6 02:19:19 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Thu, 05 May 2005 17:19:19 +0100 Subject: [1/3] add early_pfn_to_nid for ppc64 In-Reply-To: <427A3F6A.6060405@austin.ibm.com> References: <427A3F6A.6060405@austin.ibm.com> Message-ID: <427A4787.4030802@shadowen.org> Joel Schopp wrote: >> +#ifdef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID >> +#define early_pfn_to_nid(pfn) pa_to_nid(((unsigned long)pfn) << >> PAGE_SHIFT) >> +#endif > > > Is there a reason we didn't just use pfn_to_nid() directly here instead > of pa_to_nid()? I'm just thinking of having DISCONTIG/NUMA off and > pfn_to_nid() being #defined to zero for those cases. The problem is that pfn_to_nid is defined by the memory model. In the SPARSEMEM case it isn't always usable until after we have initialised and allocated the sparse mem_maps. It is the allocations during this phase that need this early_pfn_to_nid() form, to guide placement of the mem_map so that it is local to the physical memory blocks.
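The distinction can be sketched in user space. This is only an illustration, not the kernel code: it assumes a table indexed by 16MB physical chunks (as numa_memory_lookup_table is with MEMORY_INCREMENT_SHIFT 24 elsewhere in this thread) and 4K pages. The table contents and every demo_ name are hypothetical.

```c
#include <stddef.h>

#define PAGE_SHIFT             12  /* assumption: 4K pages */
#define MEMORY_INCREMENT_SHIFT 24  /* 16MB chunks, as in asm-ppc64/mmzone.h */
#define DEMO_CHUNKS            8   /* hypothetical machine: 128MB of memory */

/* Hypothetical layout: first 64MB on node 0, next 64MB on node 1. */
static const int demo_lookup_table[DEMO_CHUNKS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Physical address to node id; -1 if the address is outside the table. */
static int demo_pa_to_nid(unsigned long pa)
{
	unsigned long chunk = pa >> MEMORY_INCREMENT_SHIFT;

	if (chunk >= DEMO_CHUNKS)
		return -1;
	return demo_lookup_table[chunk];
}

/*
 * "Early" lookup: depends only on the static table above, so it can be
 * called before any per-section mem_map has been allocated.
 */
static int demo_early_pfn_to_nid(unsigned long pfn)
{
	return demo_pa_to_nid(pfn << PAGE_SHIFT);
}
```

With this layout, demo_early_pfn_to_nid(0x3fff) falls in chunk 3 (node 0) and demo_early_pfn_to_nid(0x4000) falls in chunk 4 (node 1); neither call touches anything the memory model has to set up first.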
This is clearer in the i386 port where the early_pfn_to_nid() implementation uses a low-level table to determine the location. As has been mentioned in another thread, we are using what is effectively a DISCONTIGMEM data structure here. I have some work in progress to split that last part and move to a true early implementation on ppc64 too. -apw From johnrose at austin.ibm.com Fri May 6 02:21:30 2005 From: johnrose at austin.ibm.com (John Rose) Date: Thu, 05 May 2005 11:21:30 -0500 Subject: Patch to kill ioremap_mm In-Reply-To: <20050505014256.GE18270@localhost.localdomain> References: <20050505014256.GE18270@localhost.localdomain> Message-ID: <1115310090.6011.21.camel@sinatra.austin.ibm.com> Hi David- Given that we use a separate allocation scheme for imalloc mappings, does it make sense to lump these into the vmalloc mm_struct, and to share the vmalloc address space? This saves lines of code, but is it as clear as the existing (separate) layout? Thanks- John On Wed, 2005-05-04 at 20:42, David Gibson wrote: > Can anyone see any problems with this patch. If not, I'll send it on > to akpm. > > Currently ppc64 has two mm_structs for the kernel, init_mm and also > ioremap_mm. The latter really isn't necessary: this patch abolishes > it, instead restricting vmallocs to the lower 1TB of the init_mm's > range and placing io mappings in the upper 1TB. This simplifies the > code in a number of places, and gets rid of an unnecessary set of > pagetables. > > Index: working-2.6/include/asm-ppc64/pgtable.h > =================================================================== > --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-05 10:58:04.000000000 +1000 > +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-05 11:12:59.000000000 +1000 > @@ -53,7 +53,8 @@ > * Define the address range of the vmalloc VM area.
> */ > #define VMALLOC_START (0xD000000000000000ul) > -#define VMALLOC_END (VMALLOC_START + EADDR_MASK) > +#define VMALLOC_SIZE (0x10000000000UL) > +#define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) > > /* > * Bits in a linux-style PTE. These match the bits in the > @@ -239,9 +240,6 @@ > /* This now only contains the vmalloc pages */ > #define pgd_offset_k(address) pgd_offset(&init_mm, address) > > -/* to find an entry in the ioremap page-table-directory */ > -#define pgd_offset_i(address) (ioremap_pgd + pgd_index(address)) > - > /* > * The following only work if pte_present() is true. > * Undefined behaviour if not.. > @@ -459,15 +457,12 @@ > #define __HAVE_ARCH_PTE_SAME > #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) > > -extern unsigned long ioremap_bot, ioremap_base; > - > #define pmd_ERROR(e) \ > printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) > #define pgd_ERROR(e) \ > printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) > > extern pgd_t swapper_pg_dir[]; > -extern pgd_t ioremap_dir[]; > > extern void paging_init(void); > > Index: working-2.6/include/asm-ppc64/imalloc.h > =================================================================== > --- working-2.6.orig/include/asm-ppc64/imalloc.h 2005-05-05 10:58:04.000000000 +1000 > +++ working-2.6/include/asm-ppc64/imalloc.h 2005-05-05 11:13:39.000000000 +1000 > @@ -4,9 +4,9 @@ > /* > * Define the address range of the imalloc VM area. 
> */ > -#define PHBS_IO_BASE IOREGIONBASE > -#define IMALLOC_BASE (IOREGIONBASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ > -#define IMALLOC_END (IOREGIONBASE + EADDR_MASK) > +#define PHBS_IO_BASE VMALLOC_END > +#define IMALLOC_BASE (PHBS_IO_BASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ > +#define IMALLOC_END (VMALLOC_START + EADDR_MASK) > > > /* imalloc region types */ > @@ -21,4 +21,6 @@ > int region_type); > unsigned long im_free(void *addr); > > +extern unsigned long ioremap_bot; > + > #endif /* _PPC64_IMALLOC_H */ > Index: working-2.6/include/asm-ppc64/page.h > =================================================================== > --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-05 10:58:04.000000000 +1000 > +++ working-2.6/include/asm-ppc64/page.h 2005-05-05 11:14:02.000000000 +1000 > @@ -202,9 +202,7 @@ > #define PAGE_OFFSET ASM_CONST(0xC000000000000000) > #define KERNELBASE PAGE_OFFSET > #define VMALLOCBASE ASM_CONST(0xD000000000000000) > -#define IOREGIONBASE ASM_CONST(0xE000000000000000) > > -#define IO_REGION_ID (IOREGIONBASE >> REGION_SHIFT) > #define VMALLOC_REGION_ID (VMALLOCBASE >> REGION_SHIFT) > #define KERNEL_REGION_ID (KERNELBASE >> REGION_SHIFT) > #define USER_REGION_ID (0UL) > Index: working-2.6/arch/ppc64/kernel/eeh.c > =================================================================== > --- working-2.6.orig/arch/ppc64/kernel/eeh.c 2005-04-26 15:37:55.000000000 +1000 > +++ working-2.6/arch/ppc64/kernel/eeh.c 2005-05-05 11:23:40.000000000 +1000 > @@ -505,7 +505,7 @@ > pte_t *ptep; > unsigned long pa; > > - ptep = find_linux_pte(ioremap_mm.pgd, token); > + ptep = find_linux_pte(init_mm.pgd, token); > if (!ptep) > return token; > pa = pte_pfn(*ptep) << PAGE_SHIFT; > Index: working-2.6/arch/ppc64/kernel/process.c > =================================================================== > --- working-2.6.orig/arch/ppc64/kernel/process.c 2005-04-26 15:37:55.000000000 +1000 > +++ working-2.6/arch/ppc64/kernel/process.c 2005-05-05 
11:16:20.000000000 +1000 > @@ -58,14 +58,6 @@ > struct task_struct *last_task_used_altivec = NULL; > #endif > > -struct mm_struct ioremap_mm = { > - .pgd = ioremap_dir, > - .mm_users = ATOMIC_INIT(2), > - .mm_count = ATOMIC_INIT(1), > - .cpu_vm_mask = CPU_MASK_ALL, > - .page_table_lock = SPIN_LOCK_UNLOCKED, > -}; > - > /* > * Make sure the floating-point register state in the > * the thread_struct is up to date for task tsk. > Index: working-2.6/include/asm-ppc64/processor.h > =================================================================== > --- working-2.6.orig/include/asm-ppc64/processor.h 2005-04-26 15:38:02.000000000 +1000 > +++ working-2.6/include/asm-ppc64/processor.h 2005-05-05 11:24:46.000000000 +1000 > @@ -590,16 +590,6 @@ > } > > /* > - * Note: the vm_start and vm_end fields here should *not* > - * be in kernel space. (Could vm_end == vm_start perhaps?) > - */ > -#define IOREMAP_MMAP { &ioremap_mm, 0, 0x1000, NULL, \ > - PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC, \ > - 1, NULL, NULL } > - > -extern struct mm_struct ioremap_mm; > - > -/* > * Return saved PC of a blocked thread. 
For now, this is the "user" PC > */ > #define thread_saved_pc(tsk) \ > Index: working-2.6/arch/ppc64/mm/hash_utils.c > =================================================================== > --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-05 10:58:04.000000000 +1000 > +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-05 11:17:03.000000000 +1000 > @@ -310,10 +310,6 @@ > > vsid = get_vsid(mm->context.id, ea); > break; > - case IO_REGION_ID: > - mm = &ioremap_mm; > - vsid = get_kernel_vsid(ea); > - break; > case VMALLOC_REGION_ID: > mm = &init_mm; > vsid = get_kernel_vsid(ea); > Index: working-2.6/arch/ppc64/mm/init.c > =================================================================== > --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-05 10:58:04.000000000 +1000 > +++ working-2.6/arch/ppc64/mm/init.c 2005-05-05 11:22:54.000000000 +1000 > @@ -144,7 +144,7 @@ > > pte = pte_offset_kernel(pmd, addr); > do { > - pte_t ptent = ptep_get_and_clear(&ioremap_mm, addr, pte); > + pte_t ptent = ptep_get_and_clear(&init_mm, addr, pte); > WARN_ON(!pte_none(ptent) && !pte_present(ptent)); > } while (pte++, addr += PAGE_SIZE, addr != end); > } > @@ -181,13 +181,13 @@ > > static void unmap_im_area(unsigned long addr, unsigned long end) > { > - struct mm_struct *mm = &ioremap_mm; > + struct mm_struct *mm = &init_mm; > unsigned long next; > pgd_t *pgd; > > spin_lock(&mm->page_table_lock); > > - pgd = pgd_offset_i(addr); > + pgd = pgd_offset_k(addr); > flush_cache_vunmap(addr, end); > do { > next = pgd_addr_end(addr, end); > @@ -214,21 +214,21 @@ > unsigned long vsid; > > if (mem_init_done) { > - spin_lock(&ioremap_mm.page_table_lock); > - pgdp = pgd_offset_i(ea); > - pudp = pud_alloc(&ioremap_mm, pgdp, ea); > + spin_lock(&init_mm.page_table_lock); > + pgdp = pgd_offset_k(ea); > + pudp = pud_alloc(&init_mm, pgdp, ea); > if (!pudp) > return -ENOMEM; > - pmdp = pmd_alloc(&ioremap_mm, pudp, ea); > + pmdp = pmd_alloc(&init_mm, pudp, ea); > if (!pmdp) > return -ENOMEM; > - ptep = 
pte_alloc_kernel(&ioremap_mm, pmdp, ea); > + ptep = pte_alloc_kernel(&init_mm, pmdp, ea); > if (!ptep) > return -ENOMEM; > pa = abs_to_phys(pa); > - set_pte_at(&ioremap_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, > + set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, > __pgprot(flags))); > - spin_unlock(&ioremap_mm.page_table_lock); > + spin_unlock(&init_mm.page_table_lock); > } else { > unsigned long va, vpn, hash, hpteg; > > > From apw at shadowen.org Fri May 6 03:37:00 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Thu, 05 May 2005 18:37:00 +0100 Subject: [3/3] sparsemem memory model for ppc64 In-Reply-To: <20050505023132.GB20283@austin.ibm.com> References: <20050505023132.GB20283@austin.ibm.com> Message-ID: <427A59BC.1020208@shadowen.org> Olof Johansson wrote: > Hi, > > Just two formatting nitpicks below. Thanks, this would be better served by rewriting the first comment and removing the second altogether. /* Add all physical memory to the bootmem map, mark each area * present. The first block has already been marked present above. */ I note that the diff in question has sneaked into the wrong patch; that segment represents memory_present. So I'll rediff them with it there. No overall change to the code. -apw From kravetz at us.ibm.com Fri May 6 03:53:20 2005 From: kravetz at us.ibm.com (mike kravetz) Date: Thu, 5 May 2005 10:53:20 -0700 Subject: [3/3] sparsemem memory model for ppc64 In-Reply-To: References: Message-ID: <20050505175320.GC3930@w-mikek2.ibm.com> On Wed, May 04, 2005 at 09:30:57PM +0100, Andy Whitcroft wrote: > + /* > + * Note presence of first (logical/coalasced) LMB which will > + * contain RMO region > + */ > + start_pfn = lmb.memory.region[0].physbase >> PAGE_SHIFT; > + end_pfn = start_pfn + (lmb.memory.region[0].size >> PAGE_SHIFT); > + memory_present(0, start_pfn, end_pfn); I need to take a close look at this again, but I think this special handling for the RMO region is unnecessary.
I added it in the 'early days of SPARSE' when there were some 'bootstrap' issues and we needed to initialize some memory before setting up the bootmem bitmap. I'm pretty sure all those issues have gone away. -- Mike From ntl at pobox.com Fri May 6 04:32:02 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 5 May 2005 13:32:02 -0500 Subject: [PATCH] ppc64: don't create spurious symlinks under node0 sysdev Message-ID: <20050505183202.GA3614@otto> On partitioned systems we can wind up creating spurious symlinks in /sys/devices/system/node/node0 to non-present cpus. The symlinks are not broken; the problem is that we're potentially misinforming userspace that there is a relationship between node0 and cpus which are to be added later. There's no guarantee at all that a cpu which is added later will belong to node 0. sysfs.c | 7 ++++++- 1 files changed, 6 insertions(+), 1 deletion(-) Signed-off-by: Nathan Lynch Index: linux-2.6.12-rc3-mm3/arch/ppc64/kernel/sysfs.c =================================================================== --- linux-2.6.12-rc3-mm3.orig/arch/ppc64/kernel/sysfs.c +++ linux-2.6.12-rc3-mm3/arch/ppc64/kernel/sysfs.c @@ -404,7 +404,12 @@ static int __init topology_init(void) struct cpu *c = &per_cpu(cpu_devices, cpu); #ifdef CONFIG_NUMA - parent = &node_devices[cpu_to_node(cpu)]; + /* The node to which a cpu belongs can't be known + * until the cpu is made present. 
+ */ + parent = NULL; + if (cpu_present(cpu)) + parent = &node_devices[cpu_to_node(cpu)]; #endif /* * For now, we just see if the system supports making From linas at austin.ibm.com Fri May 6 07:57:25 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Thu, 5 May 2005 16:57:25 -0500 Subject: PATCH [PPC64]: dead processes never reaped In-Reply-To: <1113975850.5515.377.camel@gaston> References: <20050418193833.GW15596@austin.ibm.com> <1113975850.5515.377.camel@gaston> Message-ID: <20050505215725.GJ11745@austin.ibm.com> On Wed, Apr 20, 2005 at 03:44:10PM +1000, Benjamin Herrenschmidt was heard to remark: > On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: > > > > The patch below appears to fix a problem where a number of dead processes > > linger on the system. On a highly loaded system, dozens of processes > > were found stuck in do_exit(), calling thier very last schedule(), and > > then being lost forever. And this problem seems to be unreproducible. Dang, it was one of the more interesting ones I've seen. --linas From ntl at pobox.com Fri May 6 08:15:20 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 5 May 2005 17:15:20 -0500 Subject: [PATCH 1/2] logical numbering for numa nodes (2nd try) Message-ID: <20050505221520.GB3614@otto> (version 2) This patch fixes the ppc64 numa code to be more consistent with the conversion from numnodes to node_online_mask etc. and removes the dependence on the platform numa numbering by setting up a mapping between the platform ids found in the ibm,associativity properties and logical node numbers. The main reason I want to make this change is that the numbering scheme of the platform ids is unspecified and we really can't rely on the values being below MAX_NUMNODES. I know you weren't really keen on having this mapping but I think in the long term this is what we'll wind up having to do anyway. I've also ripped out DEBUG_NUMA -- the effect is that it's now as if DEBUG_NUMA is always on. 
This means that resources have to be explicitly associated with their nodes. As Dave Hansen suggested in response to the original version of the patch, I've made it so that establishing a mapping between the domain and logical node has to be done explicitly instead of implicitly on the first lookup. This patch exposes some latent issues in the interaction of cpu hotplug, numa, and sched domains which are addressed in the next patch. arch/ppc64/mm/numa.c | 208 ++++++++++++++++++++++++++----------------- include/asm-ppc64/mmzone.h | 17 --- include/asm-ppc64/topology.h | 10 -- 3 files changed, 130 insertions(+), 105 deletions(-) Signed-off-by: Nathan Lynch Index: linux-2.6.12-rc3-mm3/arch/ppc64/mm/numa.c =================================================================== --- linux-2.6.12-rc3-mm3.orig/arch/ppc64/mm/numa.c +++ linux-2.6.12-rc3-mm3/arch/ppc64/mm/numa.c @@ -26,11 +26,7 @@ static int numa_enabled = 1; static int numa_debug; #define dbg(args...) if (numa_debug) { printk(KERN_INFO args); } -#ifdef DEBUG_NUMA #define ARRAY_INITIALISER -1 -#else -#define ARRAY_INITIALISER 0 -#endif int numa_cpu_lookup_table[NR_CPUS] = { [ 0 ... (NR_CPUS - 1)] = ARRAY_INITIALISER}; @@ -58,6 +54,64 @@ EXPORT_SYMBOL(numa_memory_lookup_table); EXPORT_SYMBOL(numa_cpumask_lookup_table); EXPORT_SYMBOL(nr_cpus_in_node); +#define INVALID_DOMAIN (-1) +static int nid_to_domain_tbl[MAX_NUMNODES] = { [0 ... MAX_NUMNODES - 1] = INVALID_DOMAIN }; + +static int nid_to_domain(int nid) +{ + BUG_ON(nid >= MAX_NUMNODES); + BUG_ON(nid < 0); + + return nid_to_domain_tbl[nid]; +} + +/* Returns -1 if domain not mapped */ +static int domain_to_nid(int domain) +{ + int nid; + + WARN_ON(domain == INVALID_DOMAIN); + + for (nid = 0; nid < MAX_NUMNODES; nid++) { + int tmp = nid_to_domain(nid); + if (tmp == domain) + return nid; + } + + return -1; +} + +/* Map the given domain to the next available node id if it is not + * already mapped. If this is a new mapping, set the node online. 
+ */ +static int __init establish_domain_mapping(int domain) +{ + int nid; + + WARN_ON(domain == INVALID_DOMAIN); + + for (nid = 0; nid < MAX_NUMNODES; nid++) { + if (nid_to_domain_tbl[nid] == domain) { + WARN_ON(!node_online(nid)); + return nid; + } + else if (nid_to_domain_tbl[nid] != INVALID_DOMAIN) + continue; + + printk(KERN_INFO + "Mapping platform domain %i to logical node %i\n", + domain, nid); + + nid_to_domain_tbl[nid] = domain; + node_set_online(nid); + return nid; + } + printk(KERN_WARNING "nid_to_domain_tbl full; time to increase" + " NODES_SHIFT?\n"); + + return -1; +} + static inline void map_cpu_to_node(int cpu, int node) { numa_cpu_lookup_table[cpu] = node; @@ -126,16 +180,23 @@ static int of_node_numa_domain(struct de unsigned int *tmp; if (min_common_depth == -1) - return 0; + return INVALID_DOMAIN; tmp = of_get_associativity(device); if (tmp && (tmp[0] >= min_common_depth)) { numa_domain = tmp[min_common_depth]; } else { - dbg("WARNING: no NUMA information for %s\n", + dbg("no NUMA information for %s\n", device->full_name); - numa_domain = 0; + numa_domain = INVALID_DOMAIN; } + + /* POWER4 LPAR uses 0xffff for invalid domain; + * fix that up here so callers don't have to worry about it. + */ + if (numa_domain == 0xffff) + numa_domain = INVALID_DOMAIN; + return numa_domain; } @@ -223,12 +284,12 @@ static unsigned long read_n_cells(int n, } /* - * Figure out to which domain a cpu belongs and stick it there. - * Return the id of the domain used. + * Figure out to which node a cpu belongs and stick it there. + * Return the id of the node used. */ static int numa_setup_cpu(unsigned long lcpu) { - int numa_domain = 0; + int nid = 0, numa_domain = INVALID_DOMAIN; struct device_node *cpu = find_cpu_node(lcpu); if (!cpu) { @@ -238,25 +299,17 @@ static int numa_setup_cpu(unsigned long numa_domain = of_node_numa_domain(cpu); - if (numa_domain >= num_online_nodes()) { - /* - * POWER4 LPAR uses 0xffff as invalid node, - * dont warn in this case. 
- */ - if (numa_domain != 0xffff) - printk(KERN_ERR "WARNING: cpu %ld " - "maps to invalid NUMA node %d\n", - lcpu, numa_domain); - numa_domain = 0; - } -out: - node_set_online(numa_domain); + if (numa_domain != INVALID_DOMAIN) + nid = domain_to_nid(numa_domain); - map_cpu_to_node(lcpu, numa_domain); + if (nid < 0) + nid = 0; +out: + map_cpu_to_node(lcpu, nid); of_node_put(cpu); - return numa_domain; + return nid; } static int cpu_numa_callback(struct notifier_block *nfb, @@ -278,8 +331,8 @@ static int cpu_numa_callback(struct noti case CPU_DEAD: case CPU_UP_CANCELED: unmap_cpu_from_node(lcpu); - break; ret = NOTIFY_OK; + break; #endif } return ret; @@ -319,7 +372,6 @@ static int __init parse_numa_properties( struct device_node *cpu = NULL; struct device_node *memory = NULL; int addr_cells, size_cells; - int max_domain = 0; long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT; unsigned long i; @@ -341,37 +393,13 @@ static int __init parse_numa_properties( if (min_common_depth < 0) return min_common_depth; - max_domain = numa_setup_cpu(boot_cpuid); - - /* - * Even though we connect cpus to numa domains later in SMP init, - * we need to know the maximum node id now. This is because each - * node id must have NODE_DATA etc backing it. - * As a result of hotplug we could still have cpus appear later on - * with larger node ids. In that case we force the cpu into node 0. 
- */ - for_each_cpu(i) { - int numa_domain; - - cpu = find_cpu_node(i); - - if (cpu) { - numa_domain = of_node_numa_domain(cpu); - of_node_put(cpu); - - if (numa_domain < MAX_NUMNODES && - max_domain < numa_domain) - max_domain = numa_domain; - } - } - addr_cells = get_mem_addr_cells(); size_cells = get_mem_size_cells(); memory = NULL; while ((memory = of_find_node_by_type(memory, "memory")) != NULL) { unsigned long start; unsigned long size; - int numa_domain; + int numa_domain, nid; int ranges; unsigned int *memcell_buf; unsigned int len; @@ -391,17 +419,16 @@ new_range: numa_domain = of_node_numa_domain(memory); - if (numa_domain >= MAX_NUMNODES) { - if (numa_domain != 0xffff) - printk(KERN_ERR "WARNING: memory at %lx maps " - "to invalid NUMA node %d\n", start, - numa_domain); - numa_domain = 0; + if (numa_domain < 0) + nid = 0; + else { + nid = domain_to_nid(numa_domain); + if (nid < 0) + nid = establish_domain_mapping(numa_domain); + if (nid < 0) + nid = 0; } - if (max_domain < numa_domain) - max_domain = numa_domain; - if (! (size = numa_enforce_memory_limit(start, size))) { if (--ranges) goto new_range; @@ -412,41 +439,53 @@ new_range: /* * Initialize new node struct, or add to an existing one. 
*/ - if (init_node_data[numa_domain].node_end_pfn) { + if (init_node_data[nid].node_end_pfn) { if ((start / PAGE_SIZE) < - init_node_data[numa_domain].node_start_pfn) - init_node_data[numa_domain].node_start_pfn = + init_node_data[nid].node_start_pfn) + init_node_data[nid].node_start_pfn = start / PAGE_SIZE; if (((start / PAGE_SIZE) + (size / PAGE_SIZE)) > - init_node_data[numa_domain].node_end_pfn) - init_node_data[numa_domain].node_end_pfn = + init_node_data[nid].node_end_pfn) + init_node_data[nid].node_end_pfn = (start / PAGE_SIZE) + (size / PAGE_SIZE); - init_node_data[numa_domain].node_present_pages += + init_node_data[nid].node_present_pages += size / PAGE_SIZE; } else { - node_set_online(numa_domain); - - init_node_data[numa_domain].node_start_pfn = + init_node_data[nid].node_start_pfn = start / PAGE_SIZE; - init_node_data[numa_domain].node_end_pfn = - init_node_data[numa_domain].node_start_pfn + + init_node_data[nid].node_end_pfn = + init_node_data[nid].node_start_pfn + size / PAGE_SIZE; - init_node_data[numa_domain].node_present_pages = + init_node_data[nid].node_present_pages = size / PAGE_SIZE; } for (i = start ; i < (start+size); i += MEMORY_INCREMENT) numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = - numa_domain; + nid; if (--ranges) goto new_range; } - for (i = 0; i <= max_domain; i++) - node_set_online(i); + /* We need to establish domain<->nid mappings for any + * cpu nodes in the device tree with domains which were not + * encountered in the memory loop above. + */ + while ((cpu = of_find_node_by_type(cpu, "cpu"))) { + int domain = of_node_numa_domain(cpu); + if (domain < 0) + continue; + if (domain_to_nid(domain) < 0) + establish_domain_mapping(domain); + } + + /* Secondary logical cpus are associated with nids later in + * boot, but we need to explicitly set up the boot cpu. 
+ */ + numa_setup_cpu(boot_cpuid); return 0; } @@ -541,7 +580,7 @@ static unsigned long careful_allocation( * If the memory came from a previously allocated node, we must * retry with the bootmem allocator. */ - if (pa_to_nid(ret) < nid) { + if (pa_to_nid(ret) != nid) { nid = pa_to_nid(ret); ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(nid), size, align, 0); @@ -632,7 +671,7 @@ void __init do_init_bootmem(void) memory = NULL; while ((memory = of_find_node_by_type(memory, "memory")) != NULL) { unsigned long mem_start, mem_size; - int numa_domain, ranges; + int numa_domain, ranges, thisnid; unsigned int *memcell_buf; unsigned int len; @@ -644,9 +683,18 @@ void __init do_init_bootmem(void) new_range: mem_start = read_n_cells(addr_cells, &memcell_buf); mem_size = read_n_cells(size_cells, &memcell_buf); - numa_domain = numa_enabled ? of_node_numa_domain(memory) : 0; - if (numa_domain != nid) + if (numa_enabled) + numa_domain = of_node_numa_domain(memory); + else + numa_domain = -1; + + if (numa_domain < 0) + thisnid = 0; + else + thisnid = domain_to_nid(numa_domain); + + if (thisnid != nid) continue; mem_size = numa_enforce_memory_limit(mem_start, mem_size); Index: linux-2.6.12-rc3-mm3/include/asm-ppc64/mmzone.h =================================================================== --- linux-2.6.12-rc3-mm3.orig/include/asm-ppc64/mmzone.h +++ linux-2.6.12-rc3-mm3/include/asm-ppc64/mmzone.h @@ -27,24 +27,9 @@ extern int nr_cpus_in_node[]; #define MEMORY_INCREMENT_SHIFT 24 #define MEMORY_INCREMENT (1UL << MEMORY_INCREMENT_SHIFT) -/* NUMA debugging, will not work on a DLPAR machine */ -#undef DEBUG_NUMA - static inline int pa_to_nid(unsigned long pa) { - int nid; - - nid = numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT]; - -#ifdef DEBUG_NUMA - /* the physical address passed in is not in the map for the system */ - if (nid == -1) { - printk("bad address: %lx\n", pa); - BUG(); - } -#endif - - return nid; + return numa_memory_lookup_table[pa >> 
MEMORY_INCREMENT_SHIFT]; } #define pfn_to_nid(pfn) pa_to_nid((pfn) << PAGE_SHIFT) Index: linux-2.6.12-rc3-mm3/include/asm-ppc64/topology.h =================================================================== --- linux-2.6.12-rc3-mm3.orig/include/asm-ppc64/topology.h +++ linux-2.6.12-rc3-mm3/include/asm-ppc64/topology.h @@ -8,15 +8,7 @@ static inline int cpu_to_node(int cpu) { - int node; - - node = numa_cpu_lookup_table[cpu]; - -#ifdef DEBUG_NUMA - BUG_ON(node == -1); -#endif - - return node; + return numa_cpu_lookup_table[cpu]; } #define parent_node(node) (node) From ntl at pobox.com Fri May 6 08:15:39 2005 From: ntl at pobox.com (Nathan Lynch) Date: Thu, 5 May 2005 17:15:39 -0500 Subject: [PATCH 2/2] update cpu-to-node mappings during dlpar Message-ID: <20050505221539.GC3614@otto> This fixes up some fallout from the preceding patch. The sched domains reinitialization code which runs at cpu hotplug time expects the cpu-to-node mappings to have been set up earlier than we were doing. Seems that things such as cpu_to_node() are expected to return sane values regardless of whether the cpu is online. It makes sense, I suppose: we should be updating the cpu<->node mappings when the topology changes, instead of keying on the state of cpus. Map logical cpus to nodes when a processor is added to the system; tear down the mapping(s) when a processor is going away. Get rid of the numa cpu hotplug notifier stuff. 
arch/ppc64/kernel/pSeries_smp.c | 3 + arch/ppc64/mm/numa.c | 61 +++++++++++++--------------------------- include/asm-ppc64/topology.h | 13 ++++++++ 3 files changed, 37 insertions(+), 40 deletions(-) Signed-off-by: Nathan Lynch Index: linux-2.6.12-rc3-mm3/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linux-2.6.12-rc3-mm3.orig/arch/ppc64/kernel/pSeries_smp.c +++ linux-2.6.12-rc3-mm3/arch/ppc64/kernel/pSeries_smp.c @@ -45,6 +45,7 @@ #include #include #include +#include #include "mpic.h" @@ -187,6 +188,7 @@ static int pSeries_add_processor(struct BUG_ON(cpu_isset(cpu, cpu_present_map)); cpu_set(cpu, cpu_present_map); set_hard_smp_processor_id(cpu, *intserv++); + numa_setup_cpu(cpu, np); } err = 0; out_unlock: @@ -218,6 +220,7 @@ static void pSeries_remove_processor(str continue; BUG_ON(cpu_online(cpu)); cpu_clear(cpu, cpu_present_map); + numa_teardown_cpu(cpu); set_hard_smp_processor_id(cpu, -1); break; } Index: linux-2.6.12-rc3-mm3/arch/ppc64/mm/numa.c =================================================================== --- linux-2.6.12-rc3-mm3.orig/arch/ppc64/mm/numa.c +++ linux-2.6.12-rc3-mm3/arch/ppc64/mm/numa.c @@ -136,6 +136,11 @@ static void unmap_cpu_from_node(unsigned cpu, node); } } +#else +static void unmap_cpu_from_node(unsigned long cpu) +{ + return; +} #endif /* CONFIG_HOTPLUG_CPU */ static struct device_node * __devinit find_cpu_node(unsigned int cpu) @@ -284,19 +289,25 @@ static unsigned long read_n_cells(int n, } /* - * Figure out to which node a cpu belongs and stick it there. - * Return the id of the node used. + * Figure out to which node a cpu belongs and stick it there. Return + * the id of the node used. We allow the caller to optionally pass + * the device_node which corresponds to the logical cpu, since at + * DLPAR time the new node may not have been added to the device tree + * yet. 
*/ -static int numa_setup_cpu(unsigned long lcpu) +int numa_setup_cpu(unsigned long lcpu, struct device_node *np) { int nid = 0, numa_domain = INVALID_DOMAIN; - struct device_node *cpu = find_cpu_node(lcpu); + struct device_node *cpu = np ? of_node_get(np) : find_cpu_node(lcpu); if (!cpu) { WARN_ON(1); goto out; } + if (!numa_enabled) + goto out; + numa_domain = of_node_numa_domain(cpu); if (numa_domain != INVALID_DOMAIN) @@ -312,32 +323,10 @@ out: return nid; } -static int cpu_numa_callback(struct notifier_block *nfb, - unsigned long action, - void *hcpu) -{ - unsigned long lcpu = (unsigned long)hcpu; - int ret = NOTIFY_DONE; - - switch (action) { - case CPU_UP_PREPARE: - if (min_common_depth == -1 || !numa_enabled) - map_cpu_to_node(lcpu, 0); - else - numa_setup_cpu(lcpu); - ret = NOTIFY_OK; - break; -#ifdef CONFIG_HOTPLUG_CPU - case CPU_DEAD: - case CPU_UP_CANCELED: - unmap_cpu_from_node(lcpu); - ret = NOTIFY_OK; - break; -#endif - } - return ret; +void numa_teardown_cpu(unsigned long lcpu) +{ + unmap_cpu_from_node(lcpu); } - /* * Check and possibly modify a memory region to enforce the memory limit. * @@ -373,7 +362,7 @@ static int __init parse_numa_properties( struct device_node *memory = NULL; int addr_cells, size_cells; long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT; - unsigned long i; + unsigned long i, lcpu; if (numa_enabled == 0) { printk(KERN_WARNING "NUMA disabled by user\n"); @@ -482,10 +471,8 @@ new_range: establish_domain_mapping(domain); } - /* Secondary logical cpus are associated with nids later in - * boot, but we need to explicitly set up the boot cpu. - */ - numa_setup_cpu(boot_cpuid); + for_each_present_cpu(lcpu) + numa_setup_cpu(lcpu, NULL); return 0; } @@ -602,10 +589,6 @@ void __init do_init_bootmem(void) int nid; int addr_cells, size_cells; struct device_node *memory = NULL; - static struct notifier_block ppc64_numa_nb = { - .notifier_call = cpu_numa_callback, - .priority = 1 /* Must run before sched domains notifier. 
*/ - }; min_low_pfn = 0; max_low_pfn = lmb_end_of_DRAM() >> PAGE_SHIFT; @@ -616,8 +599,6 @@ void __init do_init_bootmem(void) else dump_numa_topology(); - register_cpu_notifier(&ppc64_numa_nb); - for_each_online_node(nid) { unsigned long start_paddr, end_paddr; int i; Index: linux-2.6.12-rc3-mm3/include/asm-ppc64/topology.h =================================================================== --- linux-2.6.12-rc3-mm3.orig/include/asm-ppc64/topology.h +++ linux-2.6.12-rc3-mm3/include/asm-ppc64/topology.h @@ -4,6 +4,8 @@ #include #include +struct device_node; /* for numa_setup_cpu() */ + #ifdef CONFIG_NUMA static inline int cpu_to_node(int cpu) @@ -51,10 +53,21 @@ static inline int node_to_first_cpu(int .nr_balance_failed = 0, \ } +int numa_setup_cpu(unsigned long lcpu, struct device_node *); +void numa_teardown_cpu(unsigned long lcpu); #else /* !CONFIG_NUMA */ #include +static int inline numa_setup_cpu(unsigned long lcpu, struct device_node *np) +{ + return 0; +} + +static void inline numa_teardown_cpu(unsigned long lcpu) +{ + return; +} #endif /* CONFIG_NUMA */ #endif /* _ASM_PPC64_TOPOLOGY_H */ From paulus at samba.org Fri May 6 15:10:54 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 6 May 2005 15:10:54 +1000 Subject: [PATCH 1/4] ppc64: rename arch/ppc64/kernel/pSeries_pci.c In-Reply-To: <200504200152.58965.arnd@arndb.de> References: <200504200149.22063.arnd@arndb.de> <200504200152.58965.arnd@arndb.de> Message-ID: <17018.64606.662481.104228@cargo.ozlabs.ibm.com> Arnd Bergmann writes: > Rename pSeries_pci.c to rtas_pci.c as a preparation to generalize it > for use by BPA. Most of the file can be used by any machine that > implements rtas. Hmmm, you rename pSeries_pci.c to rtas_pci.c and then in the next patch you recreate pSeries_pci.c and move some stuff from rtas_pci.c into it. Could we have one patch that creates rtas_pci.c and just moves stuff from pSeries_pci.c to it? Paul. 
From paulus at samba.org Fri May 6 15:33:21 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 6 May 2005 15:33:21 +1000 Subject: [PATCH 2/3] native hash clear In-Reply-To: <20050413140850.GC5081@in.ibm.com> References: <20050413140605.GA5081@in.ibm.com> <20050413140807.GB5081@in.ibm.com> <20050413140850.GC5081@in.ibm.com> Message-ID: <17019.417.816952.686163@cargo.ozlabs.ibm.com> R Sharada writes: > Add code to clear the hash table and invalidate the tlb for native (SMP, > non-LPAR) mode. Supports 16M and 4k pages. > + /* we take the tlbie lock and hold it. Some hardware will > + * deadlock if we try to tlbie from two processors at once. > + */ > + spin_lock(&native_tlbie_lock); ... > + /* > + * we could lock the pte here, but we are the only cpu > + * running, right? and for crash dump, we probably > + * don't want to wait for a maybe bad cpu. > + */ So which is it? Are we locking things, and possibly waiting forever for a bad cpu, or are we not locking things? Or is there a reason why we lock in one case but not the other? Paul. From sharada at in.ibm.com Fri May 6 16:08:15 2005 From: sharada at in.ibm.com (R Sharada) Date: Fri, 6 May 2005 11:38:15 +0530 Subject: 2.6.12-rc3-mm3 pcibus_to_node patch breaks ppc64 CONFIG_NUMA case Message-ID: <20050506060815.GA2282@in.ibm.com> Hello, The patches in 2.6.12-rc3-mm3, that introduce pcibus_to_node, in the ide driver code, break the ppc64 CONFIG_NUMA case, as there is no definition for pcibus_to_node for the CONFIG_NUMA case in ppc64. The asm-generic/topology.h definition for pcibus_to_node gets included for the !CONFIG_NUMA case and works ok. There is a patch in mm3 for x86 that adds this definition on i386 and x86_64 platforms, but none for ppc64. 
The following are the patches that seem to cause this problem: numa-aware-block-device-control-structure-allocation.patch numa-aware-block-device-control-structure-allocation-tidy.patch - The above two patches introduce the pcibus_to_node call Note: x86-x86_64-pcibus_to_node.patch - this patches x86 code for pcibus_to_node alone Thanks and Regards, Sharada From miltonm at bga.com Fri May 6 18:39:33 2005 From: miltonm at bga.com (Milton Miller) Date: Fri, 6 May 2005 03:39:33 -0500 Subject: [PATCH 2/3] native hash clear In-Reply-To: <17019.417.816952.686163@cargo.ozlabs.ibm.com> References: <20050413140605.GA5081@in.ibm.com> <20050413140807.GB5081@in.ibm.com> <20050413140850.GC5081@in.ibm.com> <17019.417.816952.686163@cargo.ozlabs.ibm.com> Message-ID: <530ae20d9cfe9d8ee5771f6c55964e20@bga.com> On May 6, 2005, at 12:33 AM, Paul Mackerras wrote: > R Sharada writes: > >> Add code to clear the hash table and invalidate the tlb for native >> (SMP, >> non-LPAR) mode. Supports 16M and 4k pages. > >> + /* we take the tlbie lock and hold it. Some hardware will >> + * deadlock if we try to tlbie from two processors at once. >> + */ >> + spin_lock(&native_tlbie_lock); > ... >> + /* >> + * we could lock the pte here, but we are the only cpu >> + * running, right? and for crash dump, we probably >> + * don't want to wait for a maybe bad cpu. >> + */ > > So which is it? Are we locking things, and possibly waiting forever > for a bad cpu, or are we not locking things? Or is there a reason why > we lock in one case but not the other? Right now, a bit of both.
We don't worry about software interlocks, but we do use the lock over the smaller area that protects us from causing a hardware deadlock. So, the question is, is this the right tradeoff? milton From amavin at redhat.com Sat May 7 00:35:56 2005 From: amavin at redhat.com (Ananth N Mavinakayanahalli) Date: Fri, 06 May 2005 10:35:56 -0400 Subject: [PATCH] Kprobes: don't eat dabr/iabr exceptions Message-ID: <427B80CC.8010602@redhat.com> Hi Anton, Please find patch below that should fix the issue of kprobes eating up IABR/DABR exceptions. Please let me know if this works for you. Thanks, Ananth Signed-off-by: Ananth N Mavinakayanahalli -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: do-not-eat-dabr.patch Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050506/05d78b05/attachment.txt From christoph at lameter.com Sat May 7 00:11:20 2005 From: christoph at lameter.com (Christoph Lameter) Date: Fri, 6 May 2005 07:11:20 -0700 (PDT) Subject: 2.6.12-rc3-mm3 pcibus_to_node patch breaks ppc64 CONFIG_NUMA case In-Reply-To: <20050506060815.GA2282@in.ibm.com> References: <20050506060815.GA2282@in.ibm.com> Message-ID: On Fri, 6 May 2005, R Sharada wrote: > Hello, > The patches in 2.6.12-rc3-mm3, that introduce pcibus_to_node, in the > ide driver code, break the ppc64 CONFIG_NUMA case, as there is no definition > for pcibus_to_node for the CONFIG_NUMA case in ppc64. The asm-generic/topology.h > definition for pcibus_to_node gets included for the !CONFIG_NUMA case and works > ok. There is a patch in mm3 for x86 that adds this definition on i386 and > x86_64 platforms, but none for ppc64. It seems that the asm-generic/topology.h should always be included so that fallback mechanism can be defined for all platforms. Could you change that for ppc64? 
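
As an illustration of the fallback idea raised in the reply above, here is a minimal user-space sketch, not code from this thread or from any kernel tree. It assumes a stand-in struct pci_bus, a hypothetical ARCH_HAS_PCIBUS_TO_NODE switch, and a hypothetical helper alloc_node_for(); the only point is the "define a default unless the architecture already did" pattern.

```c
/* Stand-in for the kernel's struct pci_bus; only what this sketch needs. */
struct pci_bus {
	int number;
};

/*
 * An architecture that knows its bus topology would define the macro
 * first. ARCH_HAS_PCIBUS_TO_NODE is a hypothetical switch used only
 * in this sketch, and the mapping below is made up.
 */
#ifdef ARCH_HAS_PCIBUS_TO_NODE
#define pcibus_to_node(bus)	((bus)->number & 1)
#endif

/*
 * Generic fallback, used only when the architecture supplied nothing.
 * Here -1 is taken to mean "no node preference" (an assumption).
 */
#ifndef pcibus_to_node
#define pcibus_to_node(bus)	((void)(bus), -1)
#endif

/* Hypothetical caller: fall back to the local node when the bus has none. */
static int alloc_node_for(struct pci_bus *bus, int local_node)
{
	int node = pcibus_to_node(bus);

	return node < 0 ? local_node : node;
}
```

With the fallback always visible, a caller such as alloc_node_for() compiles on every platform, and an architecture can override the mapping later without touching the callers.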
From johnrose at austin.ibm.com Sat May 7 07:49:11 2005 From: johnrose at austin.ibm.com (John Rose) Date: Fri, 06 May 2005 16:49:11 -0500 Subject: Patch to kill ioremap_mm In-Reply-To: <1115335822.7627.189.camel@gaston> References: <20050505014256.GE18270@localhost.localdomain> <1115306696.6011.6.camel@sinatra.austin.ibm.com> <1115335822.7627.189.camel@gaston> Message-ID: <1115416151.15458.15.camel@sinatra.austin.ibm.com> > We discussed that, it's sort of step #2 :) The idea I have is to fold > normal ioremap into the vmalloc space like other archs, ditch imalloc, > and for now to keep the top 1TB for the explicit mappings (PHB IO > space). David or I will do that after we are finished with other things. I would assume that we'll still have a path for creating mappings before mem_init_done? Ack, I'm already imagining the slot/phb DLPAR problems that vmalloc may not be able to handle. For example splitting existing regions, removing mappings within a given range, etc. John From linas at austin.ibm.com Sat May 7 09:05:06 2005 From: linas at austin.ibm.com (Linas Vepstas) Date: Fri, 6 May 2005 18:05:06 -0500 Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated) In-Reply-To: <1112685311.9518.35.camel@gaston> References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com> <1112685311.9518.35.camel@gaston> Message-ID: <20050506230506.GL11745@austin.ibm.com> Hi, This is an "FYI" patch partially implementing the PCI error recovery API previously detailed by BenH in an earlier email. It's an "FYI patch" because this patch has numerous flaws and limitations which I'm hoping to address any day now.
I've been busy with other things, but have recently been able to carve out a chunk of time to work on this. This patch is almost identical to a previous patch I'd mailed out before, with only minor changes made to bring it into line with BenH's proposed API. Basically, I'm just dusting off the old patch, prior to making more serious changes. I hope to send a more serious patch in a few days/week. Meanwhile, criticism invited. This patch does actually recover from PCI errors on ethernet cards plugged into ppc64 hotplug slots, and from PCI errors on the IPR scsi controller. --linas -------------- next part -------------- --- include/linux/pci.h.linas-orig 2005-04-29 20:27:22.000000000 -0500 +++ include/linux/pci.h 2005-05-06 16:34:02.000000000 -0500 @@ -659,6 +659,80 @@ struct pci_dynids { unsigned int use_driver_data:1; /* pci_driver->driver_data is used */ }; +/* ---------------------------------------------------------------- */ +/** PCI error recovery state. Whenever the PCI bus state changes, + * the io_state_change() callback will be called to notify the + * device driver of state changes. + */ + +enum pci_channel_state { + pci_channel_io_normal = 0, /* I/O channel is in normal state */ + pci_channel_io_frozen = 1, /* I/O to channel is blocked */ + pci_channel_io_perm_failure, /* pci card is dead */ +}; + +enum pcierr_result { + PCIERR_RESULT_CAN_RECOVER=1, + PCIERR_RESULT_NEED_RESET, + PCIERR_RESULT_DISCONNECT, + PCIERR_RESULT_RECOVERED, +}; + +/* PCI bus error event callbacks */ +struct pci_error_handlers +{ + int (*error_detected)(struct pci_dev *dev, enum pci_channel_state error); + int (*error_recover)(struct pci_dev *dev); + int (*error_restart)(struct pci_dev *dev); + int (*link_reset)(struct pci_dev *dev); + int (*slot_reset)(struct pci_dev *dev); +}; + +/** + * PCI Error notifier event flags. + */ +#define PEH_NOTIFY_ERROR 1 + +/** PEH event -- structure holding pci controller data that describes + * a change in the isolation status of a PCI slot.
A pointer + * to this struct is passed as the data pointer in a notify callback. + */ +struct peh_event { + struct list_head list; + struct pci_dev *dev; /* affected device */ + enum pci_channel_state state; /* PCI bus state for the affected device */ + int time_unavail; /* milliseconds until device might be available */ +}; + +/** + * peh_send_failure_event - generate a PCI error event + * @dev pci device + * + * This routine builds a PCI error event which will be delivered + * to all listeners on the peh_notifier_chain. + * + * This routine can be called within an interrupt context; + * the actual event will be delivered in a normal context + * (from a workqueue). + */ +int peh_send_failure_event (struct pci_dev *dev, + enum pci_channel_state state, + int time_unavail); + +/** + * peh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int peh_register_notifier(struct notifier_block *nb); + +/** + * peh_unregister_notifier - Unregister to an EEH event notifier. + * @nb: notifier block to callback on events + */ +int peh_unregister_notifier(struct notifier_block *nb); + +/* ---------------------------------------------------------------- */ + struct module; struct pci_driver { struct list_head node; @@ -671,6 +745,7 @@ struct pci_driver { int (*resume) (struct pci_dev *dev); /* Device woken up */ int (*enable_wake) (struct pci_dev *dev, u32 state, int enable); /* Enable wake event */ + struct pci_error_handlers err_handler; struct device_driver driver; struct pci_dynids dynids; }; --- Documentation/pci-error-recovery.txt.linas-orig 2005-05-06 17:44:41.000000000 -0500 +++ Documentation/pci-error-recovery.txt 2005-05-06 17:39:19.000000000 -0500 @@ -0,0 +1,192 @@ + + PCI Error Recovery + ------------------ + + +Preliminary sketch of API, cut n pasted from email from BenH. 
+circa 5 April 2005
+
+The error recovery API support is exposed by the driver in the form of
+a structure of function pointers pointed to by a new field in struct
+pci_driver. The absence of this pointer in pci_driver denotes a
+"non-aware" driver; behaviour on these is platform dependent. Platforms
+like ppc64 can try to simulate hotplug remove/add.
+
+The definition of "pci_error_token" is not covered here. It is based on
+Seto's work on the synchronous error detection. We still need to define
+functions for extracting information out of an opaque error token. This is
+separate from this API.
+
+This structure has the form:
+
+struct pci_error_handlers
+{
+	int (*error_detected)(struct pci_dev *dev, pci_error_token error);
+	int (*error_recover)(struct pci_dev *dev);
+	int (*error_restart)(struct pci_dev *dev);
+	int (*link_reset)(struct pci_dev *dev);
+	int (*slot_reset)(struct pci_dev *dev);
+};
+
+A driver doesn't have to implement all of these callbacks. The only mandatory
+one is error_detected. If a callback is not implemented, the corresponding
+feature is considered unsupported. For example, if error_recover and
+error_restart (they really go together, see description to understand why)
+aren't there, then the driver is assumed not to do any direct recovery and
+requires a reset. If link_reset is not implemented, the card is assumed
+not to care about link resets, in which case, if recover is supported, the core
+can try to recover (but not slot_reset unless it really did reset the slot). If slot
+reset is not supported, link reset can be called instead on a slot reset.
+
+At first, the call will always be:
+
+	1) error_detected()
+
+	Error detected. This is sent once after an error has been detected. At
+this point, the device might not be accessible anymore depending on the
+platform (the slot will be isolated on ppc64).
The driver may already
+have "noticed" the error because of a failing IO, but this is the proper
+"synchronisation point", that is, it gives the driver a chance to
+clean up, waiting for pending stuff (timers, whatever, etc.) to
+complete; it can take semaphores, schedule, etc., everything but touch
+the device. Within this function and after it returns, the driver
+shouldn't do any new IOs. Called in task context. This is sort of a
+"quiesce" point. See note about interrupts at the end of this doc.
+
+	Result codes:
+		- PCIERR_RESULT_CAN_RECOVER:
+			Return this if you think you might be able to recover
+			the HW by just banging IOs or if you want to be given
+			a chance to extract some diagnostic information (see
+			below).
+		- PCIERR_RESULT_NEED_RESET:
+			Return this if you think you can't recover unless the
+			slot is reset.
+		- PCIERR_RESULT_DISCONNECT:
+			Return this if you think you won't recover at all
+			(this will detach the driver? or just leave it
+			dangling? to be decided)
+
+
+So at this point, we have called error_detected() for all drivers
+on the segment that had the error. On ppc64, the slot is isolated. What
+happens now typically depends on the result from the drivers. If all
+drivers on the segment/slot return PCIERR_RESULT_CAN_RECOVER, we would
+re-enable IOs on the slot (or do nothing special if the platform doesn't
+isolate slots) and call 2). If not and we can reset slots, we go to 4);
+if neither, we have a dead slot. If it's a hotplug slot, we might
+"simulate" reset by triggering HW unplug/replug though.
+
+	2) error_recover()
+
+	This is the "early recovery" call. IOs are allowed again, but DMA is
+not (hrm... to be discussed, I prefer not), with some restrictions. This
+is NOT a callback for the driver to start operations again, only to
+peek/poke at the device, extract diagnostic information if any, and
+eventually do things like trigger a device local reset or such things,
+but not restart operations.
This is sent if all drivers on a segment
+agree that they can try to recover and no automatic link reset was performed
+by the HW. If the platform can't just re-enable IOs without a slot reset or a
+link reset, it doesn't call this callback and goes directly to 3) or 4). All IOs
+should be done _synchronously_ from within this callback; errors triggered by
+them will be returned via the normal pci_check_whatever() API; no new
+error_detected() callback will be issued due to an error happening here. However,
+such an error might cause IOs to be re-blocked for the whole segment, and thus
+invalidate the recovery that other devices on the same segment might have done,
+forcing the whole segment into one of the next states, that is link reset or
+slot reset.
+
+	Result codes:
+		- PCIERR_RESULT_RECOVERED
+			Return this if you think your device is fully
+			functional and think you are ready to start
+			to do your normal driver job again. There is no
+			guarantee that because you returned that, you'll be
+			allowed to actually proceed as another driver on the
+			same segment might have failed and thus triggered a
+			slot reset on platforms that support it.
+
+		- PCIERR_RESULT_NEED_RESET
+			Return this if you think your device is not
+			recoverable in its current state and you need a slot
+			reset to proceed.
+
+		- PCIERR_RESULT_DISCONNECT
+			Same as above. Total failure, no recovery even after
+			reset; driver dead. (To be defined more precisely)
+
+	3) link_reset()
+
+	This is called after the link has been reset. This is typically a
+PCI Express specific state at this point and is done whenever a non-fatal error
+has been detected that can be "solved" by resetting the link. The driver is
+informed here of that reset and should check if the device appears to be in
+working condition. This function acts a bit like 2) error_recover(), that is
+it is not supposed to restart normal driver IO operations right away, just
+"probe" the device to check its recoverability status.
If all is right, then
+the core will call error_restart() once all drivers have ack'd link_reset().
+
+	Result codes:
+		(identical to error_recover)
+
+	4) slot_reset()
+
+	This is called after the slot has been hard reset (and PCI BARs
+re-configured by the platform). If the platform supports PCI hotplug,
+it can implement this by toggling power on the slot off/on. Drivers here
+have a chance to re-initialize the hardware (re-download firmware etc.),
+but drivers shouldn't restart normal IO processing operations at this point.
+(See the note about interrupts; they aren't guaranteed to be delivered until the
+restart callback has been called.) Upon success from this callback, the
+platform will call error_restart() to complete the error handling and let
+the driver restart normal IO request processing.
+
+However, a driver can still return a critical failure from here in case
+it just can't get its device back from reset. There is just nothing we
+can do about it though. The driver will just be considered "dead" in this case.
+
+	Result codes:
+		- PCIERR_RESULT_DISCONNECT
+			Same as above.
+
+	5) error_restart()
+
+	This is called if all drivers on the segment have returned
+PCIERR_RESULT_RECOVERED from one of the 3 previous callbacks. That basically
+tells the driver to restart activity; everything is back & running. No result
+code is taken into account here. If a new error happens, it will restart
+a new error handling process.
+
+That's it. I think this covers all the possibilities. The way those
+callbacks are called is platform policy. A platform with no slot reset
+capability for example may want to just "ignore" drivers that can't
+recover (disconnect them) and try to let other cards on the same segment
+recover. Keep in mind that in most real life cases, though, there will
+be only one driver per segment.
+
+Now, there is a note about interrupts.
If you get an interrupt and your
+device is dead or has been isolated, there is a problem :)
+
+After much thinking, I decided to leave that to the platform. That is,
+the recovery API only specifies that:
+
+ - There is no guarantee that interrupt delivery can proceed from any
+device on the segment starting from the error detection and until the
+restart callback is sent, at which point interrupts are expected to be
+fully operational.
+
+ - There is no guarantee that interrupt delivery is stopped; that is, a
+driver that gets an interrupt after detecting an error, or that detects
+an error within the interrupt handler such that it prevents proper
+ack'ing of the interrupt (and thus removal of the source) should just
+return IRQ_NONE. It's up to the platform to deal with that
+condition, typically by masking the irq source during the duration of
+the error handling. It is expected that the platform "knows" which
+interrupts are routed to error-management capable slots and can deal
+with temporarily disabling that irq number during error processing (this
+isn't terribly complex). That means some IRQ latency for other devices
+sharing the interrupt, but there is simply no other way.
High end +platforms aren't supposed to share interrupts between many devices +anyway :) + + --- drivers/pci/Makefile.linas-orig 2005-04-29 20:31:33.000000000 -0500 +++ drivers/pci/Makefile 2005-05-06 12:28:43.000000000 -0500 @@ -3,7 +3,7 @@ # obj-y += access.o bus.o probe.o remove.o pci.o quirks.o \ - names.o pci-driver.o search.o pci-sysfs.o \ + names.o pci-driver.o pci-error.o search.o pci-sysfs.o \ rom.o obj-$(CONFIG_PROC_FS) += proc.o --- drivers/pci/pci-error.c.linas-orig 2005-05-06 17:44:47.000000000 -0500 +++ drivers/pci/pci-error.c 2005-05-06 16:56:02.000000000 -0500 @@ -0,0 +1,152 @@ +/* + * pci-error.c + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include + +#undef DEBUG + +/** Overview: + * PEH, or "PCI Error Handling" is a PCI bridge technology for + * dealing with PCI bus errors that can't be dealt with within the + * usual PCI framework, except by check-stopping the CPU. Systems + * that are designed for high-availability/reliability cannot afford + * to crash due to a "mere" PCI error, thus the need for PEH. 
+ * A PEH-capable bridge operates by converting a detected error + * into a "slot freeze", taking the PCI adapter off-line, making + * the slot behave, from the OS's point of view, as if the slot + * were "empty": all reads return 0xff's and all writes are silently + * ignored. PEH slot isolation events can be triggered by parity + * errors on the address or data busses (e.g. during posted writes), + * which in turn might be caused by low voltage on the bus, dust, + * vibration, humidity, radioactivity or plain-old failed hardware. + * + * Note, however, that one of the leading causes of PEH slot + * freeze events is buggy device drivers, buggy device microcode, + * or buggy device hardware. This is because any attempt by the + * device to bus-master data to a memory address that is not + * assigned to the device will trigger a slot freeze. (The idea + * is to prevent devices-gone-wild from corrupting system memory). + * Buggy hardware/drivers will have a miserable time co-existing + * with PEH. + */ + +/* PEH event workqueue setup. */ +static spinlock_t peh_eventlist_lock = SPIN_LOCK_UNLOCKED; +LIST_HEAD(peh_eventlist); +static void peh_event_handler(void *); +DECLARE_WORK(peh_event_wq, peh_event_handler, NULL); + +static struct notifier_block *peh_notifier_chain; + +/** + * peh_event_handler - dispatch PEH events. The detection of a frozen + * slot can occur inside an interrupt, where it can be hard to do + * anything about it. The goal of this routine is to pull these + * detection events out of the context of the interrupt handler, and + * re-dispatch them for processing at a later time in a normal context.
+ * + * @dummy - unused + */ +static void peh_event_handler(void *dummy) +{ + unsigned long flags; + struct peh_event *event; + + while (1) { + spin_lock_irqsave(&peh_eventlist_lock, flags); + event = NULL; + if (!list_empty(&peh_eventlist)) { + event = list_entry(peh_eventlist.next, struct peh_event, list); + list_del(&event->list); + } + spin_unlock_irqrestore(&peh_eventlist_lock, flags); + if (event == NULL) + break; + + printk(KERN_INFO "PEH: Detected PCI bus error on device " + "%s %s\n", + pci_name(event->dev), pci_pretty_name(event->dev)); + + notifier_call_chain (&peh_notifier_chain, + PEH_NOTIFY_ERROR, event); + + pci_dev_put(event->dev); + kfree(event); + } +} + + +/** + * peh_send_failure_event - generate a PCI error event + * @dev pci device + * + * This routine builds a PCI error event which will be delivered + * to all listeners on the peh_notifier_chain. + * + * This routine can be called within an interrupt context; + * the actual event will be delivered in a normal context + * (from a workqueue). + */ +int peh_send_failure_event (struct pci_dev *dev, + enum pci_channel_state state, + int time_unavail) +{ + unsigned long flags; + struct peh_event *event; + + event = kmalloc(sizeof(*event), GFP_ATOMIC); + if (event == NULL) { + printk (KERN_ERR "PEH: out of memory, event not handled\n"); + return 1; + } + + event->dev = dev; + event->state = state; + event->time_unavail = time_unavail; + + /* We may or may not be called in an interrupt context */ + spin_lock_irqsave(&peh_eventlist_lock, flags); + list_add(&event->list, &peh_eventlist); + spin_unlock_irqrestore(&peh_eventlist_lock, flags); + + schedule_work(&peh_event_wq); + + return 0; +} + +/** + * peh_register_notifier - Register to find out about EEH events. + * @nb: notifier block to callback on events + */ +int peh_register_notifier(struct notifier_block *nb) +{ + return notifier_chain_register(&peh_notifier_chain, nb); +} + +/** + * peh_unregister_notifier - Unregister to an EEH event notifier. 
+ * @nb: notifier block to callback on events + */ +int peh_unregister_notifier(struct notifier_block *nb) +{ + return notifier_chain_unregister(&peh_notifier_chain, nb); +} + + --- drivers/scsi/ipr.c.linas-orig 2005-04-29 20:33:36.000000000 -0500 +++ drivers/scsi/ipr.c 2005-05-06 17:28:15.000000000 -0500 @@ -80,6 +80,11 @@ #include #include #include + +#ifdef CONFIG_PPC64 +#define CONFIG_SCSI_IPR_EEH +#endif /* CONFIG_PPC64 */ + #include "ipr.h" /* @@ -4993,6 +4998,7 @@ static int ipr_reset_start_bist(struct i return rc; } + /** * ipr_reset_allowed - Query whether or not IOA can be reset * @ioa_cfg: ioa config struct @@ -5306,6 +5312,69 @@ static void ipr_initiate_ioa_reset(struc shutdown_type); } +#ifdef CONFIG_SCSI_IPR_EEH + +/** If the PCI slot is frozen, hold off all i/o + * activity; then, as soon as the slot is available again, + * initiate an adapter reset. + */ +static int ipr_reset_freeze(struct ipr_cmnd *ipr_cmd) +{ + list_add_tail(&ipr_cmd->queue, &ipr_cmd->ioa_cfg->pending_q); + ipr_cmd->done = ipr_reset_ioa_job; + return IPR_RC_JOB_RETURN; +} + +static void ipr_eeh_frozen (struct pci_dev *pdev) +{ + unsigned long flags = 0; + struct ipr_ioa_cfg *ioa_cfg = pci_get_drvdata(pdev); + + spin_lock_irqsave(ioa_cfg->host->host_lock, flags); + _ipr_initiate_ioa_reset(ioa_cfg, ipr_reset_freeze, IPR_SHUTDOWN_NONE); + spin_unlock_irqrestore(ioa_cfg->host->host_lock, flags); +} + +static int ipr_eeh_thawed (struct pci_dev *pdev) +{ + unsigned long flags = 0; + struct ipr_ioa_cfg *ioa_cfg = pci_get_drvdata(pdev); + + spin_lock_irqsave(ioa_cfg->host->host_lock, flags); + _ipr_initiate_ioa_reset(ioa_cfg, ipr_reset_restore_cfg_space, + IPR_SHUTDOWN_NONE); + spin_unlock_irqrestore(ioa_cfg->host->host_lock, flags); + + return PCIERR_RESULT_RECOVERED; +} + +static void ipr_eeh_perm_failure (struct pci_dev *pdev) +{ +#if 0 // XXXXXXXXXXXXXXXXXXXXXXX + ipr_cmd->job_step = ipr_reset_shutdown_ioa; + rc = IPR_RC_JOB_CONTINUE; +#endif +} + +static int ipr_eeh_error_detected 
(struct pci_dev *pdev, + enum pci_channel_state state) +{ + switch (state) { + case pci_channel_io_frozen: + ipr_eeh_frozen (pdev); + return PCIERR_RESULT_NEED_RESET; + + case pci_channel_io_perm_failure: + ipr_eeh_perm_failure (pdev); + return PCIERR_RESULT_DISCONNECT; + break; + default: + break; + } + return PCIERR_RESULT_NEED_RESET; +} +#endif + /** * ipr_probe_ioa_part2 - Initializes IOAs found in ipr_probe_ioa(..) * @ioa_cfg: ioa cfg struct @@ -6015,6 +6084,10 @@ static struct pci_driver ipr_driver = { .id_table = ipr_pci_table, .probe = ipr_probe, .remove = ipr_remove, + .err_handler = { + .error_detected = ipr_eeh_error_detected, + .slot_reset = ipr_eeh_thawed, + }, .driver = { .shutdown = ipr_shutdown, }, --- drivers/scsi/sym53c8xx_2/sym_glue.c.linas-orig 2005-04-29 20:33:12.000000000 -0500 +++ drivers/scsi/sym53c8xx_2/sym_glue.c 2005-05-06 16:55:02.000000000 -0500 @@ -49,6 +49,10 @@ #include #include +#ifdef CONFIG_PPC64 +#define CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +#endif + #include "sym_glue.h" #include "sym_nvram.h" @@ -770,6 +774,10 @@ static irqreturn_t sym53c8xx_intr(int ir struct sym_hcb *np = (struct sym_hcb *)dev_id; if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("["); +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY + if (np->s.io_state != pci_channel_io_normal) + return IRQ_HANDLED; +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ spin_lock_irqsave(np->s.host->host_lock, flags); sym_interrupt(np); @@ -844,6 +852,27 @@ static void sym_eh_done(struct scsi_cmnd */ static void sym_eh_timeout(u_long p) { __sym_eh_done((struct scsi_cmnd *)p, 1); } +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +static void sym_eeh_timeout(u_long p) +{ + struct sym_eh_wait *ep = (struct sym_eh_wait *) p; + if (!ep) + return; + complete(&ep->done); +} + +static void sym_eeh_done(struct sym_eh_wait *ep) +{ + if (!ep) + return; + ep->timed_out = 0; + if (!del_timer(&ep->timer)) + return; + + complete(&ep->done); +} +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ + /* * Generic method 
for our eh processing. * The 'op' argument tells what we have to do. @@ -905,6 +934,35 @@ prepare: sts = 0; break; case SYM_EH_HOST_RESET: +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +printk("duuuuuude attempting symbios recovery\n"); +dump_stack(); + int rc = eeh_slot_is_isolated (np->s.device); + +printk ("duude symbios is isolated ??=%d\n", rc); +printk ("duuude the current io state is %d\n", np->s.io_state); + if (rc) { + struct sym_eh_wait eeh, *eep = &eeh; + np->s.io_reset_wait = eep; + init_completion(&eep->done); + init_timer(&eep->timer); + eep->to_do = SYM_EH_DO_WAIT; + eep->timer.expires = jiffies + (10*HZ); + eep->timer.function = sym_eeh_timeout; + eep->timer.data = (u_long)eep; + eep->timed_out = 1; /* Be pessimistic for once :) */ + add_timer(&eep->timer); + spin_unlock_irq(np->s.host->host_lock); + wait_for_completion(&eep->done); + spin_lock_irq(np->s.host->host_lock); + if (eep->timed_out) { +printk ("duude symbios timed out\n"); + } else { +printk ("duude symbios waited for completion\n"); + } + np->s.io_reset_wait = NULL; + } +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ sym_reset_scsi_bus(np, 0); sym_start_up (np, 1); sts = 0; @@ -1577,6 +1635,30 @@ static int sym_setup_bus_dma_mask(struct return -1; } +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +int sym2_io_error_detected (struct pci_dev *pdev, enum pci_channel_state state) +{ + struct sym_hcb *np = pci_get_drvdata(pdev); +printk ("duude symbios got this state change %d jiffies=%ld\n", state, jiffies); + + np->s.io_state = state; + // XXX if perm frozen, then ...? + + return 0; +} + +int sym2_io_slot_reset (struct pci_dev *pdev) +{ + struct sym_hcb *np = pci_get_drvdata(pdev); +printk ("duude symbios got slot reset done jiffies=%ld\n", jiffies); + + np->s.io_state = pci_channel_io_normal; + sym_eeh_done (np->s.io_reset_wait); + + return 0; +} +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ + /* * Host attach and initialisations. 
* @@ -1625,6 +1707,8 @@ static struct Scsi_Host * __devinit sym_ if (!np) goto attach_failed; np->s.device = dev->pdev; + np->s.io_state = pci_channel_io_normal; + np->s.io_reset_wait = NULL; np->bus_dmat = dev->pdev; /* Result in 1 DMA pool per HBA */ host_data->ncb = np; np->s.host = instance; @@ -2359,6 +2443,10 @@ static struct pci_driver sym2_driver = { .id_table = sym2_id_table, .probe = sym2_probe, .remove = __devexit_p(sym2_remove), + .err_handler = { + .error_detected = sym2_io_error_detected, + .slot_reset = sym2_io_slot_reset, + }, }; static int __init sym2_init(void) --- drivers/scsi/sym53c8xx_2/sym_glue.h.linas-orig 2005-04-29 20:32:45.000000000 -0500 +++ drivers/scsi/sym53c8xx_2/sym_glue.h 2005-05-06 16:29:39.000000000 -0500 @@ -358,6 +358,10 @@ struct sym_shcb { char chip_name[8]; struct pci_dev *device; + /* pci bus i/o state; waiter for clearing of i/o state */ + enum pci_channel_state io_state; + struct sym_eh_wait *io_reset_wait; + struct Scsi_Host *host; void __iomem * mmio_va; /* MMIO kernel virtual address */ --- drivers/scsi/sym53c8xx_2/sym_hipd.c.linas-orig 2005-04-29 20:22:45.000000000 -0500 +++ drivers/scsi/sym53c8xx_2/sym_hipd.c 2005-05-06 12:28:43.000000000 -0500 @@ -2836,6 +2836,7 @@ void sym_interrupt (struct sym_hcb *np) u_char istat, istatc; u_char dstat; u_short sist; + u_int icnt; /* * interrupt on the fly ? 
@@ -2877,6 +2878,7 @@ void sym_interrupt (struct sym_hcb *np) sist = 0; dstat = 0; istatc = istat; + icnt = 0; do { if (istatc & SIP) sist |= INW (nc_sist); @@ -2884,6 +2886,14 @@ void sym_interrupt (struct sym_hcb *np) dstat |= INB (nc_dstat); istatc = INB (nc_istat); istat |= istatc; + icnt ++; + if (100 < icnt) { +#define CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY +#ifdef CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY + if(eeh_slot_is_isolated (np->s.device)) + return; +#endif /* CONFIG_SCSI_SYM53C8XX_EEH_RECOVERY */ + } } while (istatc & (SIP|DIP)); if (DEBUG_FLAGS & DEBUG_TINY) --- include/asm-ppc64/eeh.h.linas-orig 2005-04-29 20:34:03.000000000 -0500 +++ include/asm-ppc64/eeh.h 2005-05-06 12:28:43.000000000 -0500 @@ -23,6 +23,7 @@ #include #include #include +#include #include struct pci_dev; @@ -36,6 +37,11 @@ struct notifier_block; #define EEH_MODE_SUPPORTED (1<<0) #define EEH_MODE_NOCHECK (1<<1) #define EEH_MODE_ISOLATED (1<<2) +#define EEH_MODE_RECOVERING (1<<3) + +/* Max number of EEH freezes allowed before we consider the device + * to be permanently disabled. */ +#define EEH_MAX_ALLOWED_FREEZES 5 void __init eeh_init(void); unsigned long eeh_check_failure(const volatile void __iomem *token, @@ -59,35 +65,82 @@ void eeh_add_device_late(struct pci_dev * eeh_remove_device - undo EEH setup for the indicated pci device * @dev: pci device to be removed * - * This routine should be when a device is removed from a running - * system (e.g. by hotplug or dlpar). + * This routine should be called when a device is removed from + * a running system (e.g. by hotplug or dlpar). It unregisters + * the PCI device from the EEH subsystem. I/O errors affecting + * this device will no longer be detected after this call; thus, + * i/o errors affecting this slot may leave this device unusable. 
*/ void eeh_remove_device(struct pci_dev *); -#define EEH_DISABLE 0 -#define EEH_ENABLE 1 -#define EEH_RELEASE_LOADSTORE 2 -#define EEH_RELEASE_DMA 3 +/** + * eeh_slot_is_isolated -- return non-zero value if slot is frozen + */ +int eeh_slot_is_isolated (struct pci_dev *dev); /** - * Notifier event flags. + * eeh_ioaddr_is_isolated -- return non-zero value if device at + * io address is frozen. */ -#define EEH_NOTIFY_FREEZE 1 +int eeh_ioaddr_is_isolated(const volatile void __iomem *token); -/** EEH event -- structure holding pci slot data that describes - * a change in the isolation status of a PCI slot. A pointer - * to this struct is passed as the data pointer in a notify callback. - */ -struct eeh_event { - struct list_head list; - struct pci_dev *dev; - struct device_node *dn; - int reset_state; -}; - -/** Register to find out about EEH events. */ -int eeh_register_notifier(struct notifier_block *nb); -int eeh_unregister_notifier(struct notifier_block *nb); +/** + * eeh_slot_error_detail -- record an EEH error condition to the log + * @severity: 1 if temporary, 2 if permanent failure. + * + * Obtains the EEH error details from the RTAS subsystem, + * and then logs these details with the RTAS error log system. + */ +void eeh_slot_error_detail (struct device_node *dn, int severity); + +/** + * rtas_set_slot_reset -- unfreeze a frozen slot + * + * Clear the EEH-frozen condition on a slot. This routine + * does this by asserting the PCI #RST line for 1/4 of + * a second; this routine will sleep while the adapter is + * being reset. + */ +void rtas_set_slot_reset (struct device_node *dn); + +/** rtas_pci_slot_reset raises/lowers the pci #RST line + * state: 1/0 to raise/lower the #RST + * + * Clear the EEH-frozen condition on a slot. This routine + * asserts the PCI #RST line if the 'state' argument is '1', + * and drops the #RST line if 'state' is '0'. This routine is + * safe to call in an interrupt context. 
+ * + */ +void rtas_pci_slot_reset(struct device_node *dn, int state); +void eeh_pci_slot_reset(struct pci_dev *dev, int state); + +/** eeh_pci_slot_availability -- Indicates whether a PCI + * slot is ready to be used. After a PCI reset, it may take a while + * for the PCI fabric to fully reset the communications path to the + * given PCI card. This routine can be used to determine how long + * to wait before a PCI slot might become usable. + * + * This routine returns how long to wait (in milliseconds) before + * the slot is expected to be usable. A value of zero means the + * slot is immediately usable. A negative value means that the + * slot is permanently disabled. + */ +int eeh_pci_slot_availability(struct pci_dev *dev); + +/** Restore device configuration info across device resets. + */ +void eeh_restore_bars(struct device_node *); +void eeh_pci_restore_bars(struct pci_dev *dev); + +/** + * rtas_configure_bridge -- firmware initialization of pci bridge + * + * Ask the firmware to configure any PCI bridge devices + * located behind the indicated node. Required after a + * pci device reset. + */ +void rtas_configure_bridge(struct device_node *dn); /** * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure. --- include/asm-ppc64/prom.h.linas-orig 2005-04-29 20:32:46.000000000 -0500 +++ include/asm-ppc64/prom.h 2005-05-06 12:28:43.000000000 -0500 @@ -119,6 +119,7 @@ struct property { */ struct pci_controller; struct iommu_table; +struct eeh_recovery_ops; struct device_node { char *name; @@ -137,8 +138,12 @@ struct device_node { int devfn; /* for pci devices */ int eeh_mode; /* See eeh.h for possible EEH_MODEs */ int eeh_config_addr; + int eeh_check_count; /* number of times device driver ignored error */ + int eeh_freeze_count; /* number of times this device froze up. 
*/ + int eeh_is_bridge; /* device is pci-to-pci bridge */ struct pci_controller *phb; /* for pci devices */ struct iommu_table *iommu_table; /* for phb's or bridges */ + u32 config_space[16]; /* saved PCI config space */ struct property *properties; struct device_node *parent; --- include/asm-ppc64/rtas.h.linas-orig 2005-04-29 20:32:32.000000000 -0500 +++ include/asm-ppc64/rtas.h 2005-05-06 12:28:43.000000000 -0500 @@ -243,4 +243,6 @@ extern unsigned long rtas_rmo_buf; #define GLOBAL_INTERRUPT_QUEUE 9005 +extern int rtas_write_config(struct device_node *dn, int where, int size, u32 val); + #endif /* _PPC64_RTAS_H */ --- arch/ppc64/kernel/eeh.c.linas-orig 2005-04-29 20:29:19.000000000 -0500 +++ arch/ppc64/kernel/eeh.c 2005-05-06 16:52:39.000000000 -0500 @@ -17,16 +17,17 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ -#include +#include #include +#include #include -#include #include #include #include #include #include #include +#include #include #include #include @@ -49,8 +50,8 @@ * were "empty": all reads return 0xff's and all writes are silently * ignored. EEH slot isolation events can be triggered by parity * errors on the address or data busses (e.g. during posted writes), - * which in turn might be caused by dust, vibration, humidity, - * radioactivity or plain-old failed hardware. + * which in turn might be caused by low voltage on the bus, dust, + * vibration, humidity, radioactivity or plain-old failed hardware. * * Note, however, that one of the leading causes of EEH slot * freeze events are buggy device drivers, buggy device microcode, @@ -75,22 +76,13 @@ #define BUID_HI(buid) ((buid) >> 32) #define BUID_LO(buid) ((buid) & 0xffffffff) -/* EEH event workqueue setup. 
*/ -static DEFINE_SPINLOCK(eeh_eventlist_lock); -LIST_HEAD(eeh_eventlist); -static void eeh_event_handler(void *); -DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL); - -static struct notifier_block *eeh_notifier_chain; - /* * If a device driver keeps reading an MMIO register in an interrupt * handler after a slot isolation event has occurred, we assume it * is broken and panic. This sets the threshold for how many read * attempts we allow before panicking. */ -#define EEH_MAX_FAILS 1000 -static atomic_t eeh_fail_count; +#define EEH_MAX_FAILS 100000 /* RTAS tokens */ static int ibm_set_eeh_option; @@ -107,6 +99,10 @@ static DEFINE_SPINLOCK(slot_errbuf_lock) static int eeh_error_buf_size; /* System monitoring statistics */ +static DEFINE_PER_CPU(unsigned long, no_device); +static DEFINE_PER_CPU(unsigned long, no_dn); +static DEFINE_PER_CPU(unsigned long, no_cfg_addr); +static DEFINE_PER_CPU(unsigned long, ignored_check); static DEFINE_PER_CPU(unsigned long, total_mmio_ffs); static DEFINE_PER_CPU(unsigned long, false_positives); static DEFINE_PER_CPU(unsigned long, ignored_failures); @@ -225,9 +221,9 @@ pci_addr_cache_insert(struct pci_dev *de while (*p) { parent = *p; piar = rb_entry(parent, struct pci_io_addr_range, rb_node); - if (alo < piar->addr_lo) { + if (ahi < piar->addr_lo) { p = &parent->rb_left; - } else if (ahi > piar->addr_hi) { + } else if (alo > piar->addr_hi) { p = &parent->rb_right; } else { if (dev != piar->pcidev || @@ -245,6 +241,11 @@ pci_addr_cache_insert(struct pci_dev *de piar->addr_hi = ahi; piar->pcidev = dev; piar->flags = flags; + +#ifdef DEBUG + printk (KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n", + alo, ahi, pci_name (dev)); +#endif rb_link_node(&piar->rb_node, parent, p); rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root); @@ -369,8 +370,12 @@ void pci_addr_cache_remove_device(struct */ void __init pci_addr_cache_build(void) { + struct device_node *dn; struct pci_dev *dev = NULL; + if (!eeh_subsystem_enabled) + 
return; + spin_lock_init(&pci_io_addr_cache_root.piar_lock); while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) { @@ -379,6 +384,17 @@ void __init pci_addr_cache_build(void) continue; } pci_addr_cache_insert_device(dev); + + /* Save the BAR's; firmware doesn't restore these after EEH reset */ + dn = pci_device_to_OF_node(dev); + if (dn) { + int i; + for (i = 0; i < 16; i++) + pci_read_config_dword(dev, i * 4, &dn->config_space[i]); + + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) + dn->eeh_is_bridge = 1; + } } #ifdef DEBUG @@ -390,24 +406,32 @@ void __init pci_addr_cache_build(void) /* --------------------------------------------------------------- */ /* Above lies the PCI Address Cache. Below lies the EEH event infrastructure */ -/** - * eeh_register_notifier - Register to find out about EEH events. - * @nb: notifier block to callback on events - */ -int eeh_register_notifier(struct notifier_block *nb) +void eeh_slot_error_detail (struct device_node *dn, int severity) { - return notifier_chain_register(&eeh_notifier_chain, nb); -} + unsigned long flags; + int rc; -/** - * eeh_unregister_notifier - Unregister to an EEH event notifier. 
- * @nb: notifier block to callback on events - */ -int eeh_unregister_notifier(struct notifier_block *nb) -{ - return notifier_chain_unregister(&eeh_notifier_chain, nb); + if (!dn) return; + + /* Log the error with the rtas logger */ + spin_lock_irqsave(&slot_errbuf_lock, flags); + memset(slot_errbuf, 0, eeh_error_buf_size); + + rc = rtas_call(ibm_slot_error_detail, + 8, 1, NULL, dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid), NULL, 0, + virt_to_phys(slot_errbuf), + eeh_error_buf_size, + severity); + + if (rc == 0) + log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); + spin_unlock_irqrestore(&slot_errbuf_lock, flags); } +EXPORT_SYMBOL(eeh_slot_error_detail); + /** * read_slot_reset_state - Read the reset state of a device node's slot * @dn: device node to read @@ -422,6 +446,7 @@ static int read_slot_reset_state(struct outputs = 4; } else { token = ibm_read_slot_reset_state; + rets[2] = 0; /* fake PE Unavailable info */ outputs = 3; } @@ -430,75 +455,8 @@ static int read_slot_reset_state(struct } /** - * eeh_panic - call panic() for an eeh event that cannot be handled. - * The philosophy of this routine is that it is better to panic and - * halt the OS than it is to risk possible data corruption by - * oblivious device drivers that don't know better. - * - * @dev pci device that had an eeh event - * @reset_state current reset state of the device slot - */ -static void eeh_panic(struct pci_dev *dev, int reset_state) -{ - /* - * XXX We should create a separate sysctl for this. - * - * Since the panic_on_oops sysctl is used to halt the system - * in light of potential corruption, we can use it here. 
- */ - if (panic_on_oops) - panic("EEH: MMIO failure (%d) on device:%s %s\n", reset_state, - pci_name(dev), pci_pretty_name(dev)); - else { - __get_cpu_var(ignored_failures)++; - printk(KERN_INFO "EEH: Ignored MMIO failure (%d) on device:%s %s\n", - reset_state, pci_name(dev), pci_pretty_name(dev)); - } -} - -/** - * eeh_event_handler - dispatch EEH events. The detection of a frozen - * slot can occur inside an interrupt, where it can be hard to do - * anything about it. The goal of this routine is to pull these - * detection events out of the context of the interrupt handler, and - * re-dispatch them for processing at a later time in a normal context. - * - * @dummy - unused - */ -static void eeh_event_handler(void *dummy) -{ - unsigned long flags; - struct eeh_event *event; - - while (1) { - spin_lock_irqsave(&eeh_eventlist_lock, flags); - event = NULL; - if (!list_empty(&eeh_eventlist)) { - event = list_entry(eeh_eventlist.next, struct eeh_event, list); - list_del(&event->list); - } - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); - if (event == NULL) - break; - - printk(KERN_INFO "EEH: MMIO failure (%d), notifiying device " - "%s %s\n", event->reset_state, - pci_name(event->dev), pci_pretty_name(event->dev)); - - atomic_set(&eeh_fail_count, 0); - notifier_call_chain (&eeh_notifier_chain, - EEH_NOTIFY_FREEZE, event); - - __get_cpu_var(slot_resets)++; - - pci_dev_put(event->dev); - kfree(event); - } -} - -/** - * eeh_token_to_phys - convert EEH address token to phys address - * @token i/o token, should be address in the form 0xE.... + * eeh_token_to_phys - convert I/O address to phys address + * @token i/o address, should be address in the form 0xA.... 
*/ static inline unsigned long eeh_token_to_phys(unsigned long token) { @@ -513,6 +471,18 @@ static inline unsigned long eeh_token_to return pa | (token & (PAGE_SIZE-1)); } + +static inline struct pci_dev * eeh_find_pci_dev(struct device_node *dn) +{ + struct pci_dev *dev = NULL; + for_each_pci_dev(dev) { + if (pci_device_to_OF_node(dev) == dn) + return dev; + } + return NULL; +} + + /** * eeh_dn_check_failure - check if all 1's data is due to EEH slot freeze * @dn device node @@ -528,29 +498,33 @@ static inline unsigned long eeh_token_to * * It is safe to call this routine in an interrupt context. */ +extern void disable_irq_nosync(unsigned int); + int eeh_dn_check_failure(struct device_node *dn, struct pci_dev *dev) { int ret; int rets[3]; - unsigned long flags; - int rc, reset_state; - struct eeh_event *event; + enum pci_channel_state state; __get_cpu_var(total_mmio_ffs)++; if (!eeh_subsystem_enabled) return 0; - if (!dn) + if (!dn) { + __get_cpu_var(no_dn)++; return 0; + } /* Access to IO BARs might get this far and still not want checking. */ if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) || dn->eeh_mode & EEH_MODE_NOCHECK) { + __get_cpu_var(ignored_check)++; return 0; } if (!dn->eeh_config_addr) { + __get_cpu_var(no_cfg_addr)++; return 0; } @@ -559,12 +533,18 @@ int eeh_dn_check_failure(struct device_n * slot, we know it's bad already, we don't need to check... */ if (dn->eeh_mode & EEH_MODE_ISOLATED) { - atomic_inc(&eeh_fail_count); - if (atomic_read(&eeh_fail_count) >= EEH_MAX_FAILS) { + dn->eeh_check_count ++; + if (dn->eeh_check_count >= EEH_MAX_FAILS) { + printk (KERN_ERR "EEH: Device driver ignored %d bad reads, panicing\n", + dn->eeh_check_count); + dump_stack(); /* re-read the slot reset state */ if (read_slot_reset_state(dn, rets) != 0) rets[0] = -1; /* reset state unknown */ - eeh_panic(dev, rets[0]); + + /* If we are here, then we hit an infinite loop. Stop. 
*/ + panic("EEH: MMIO halt (%d) on device:%s %s\n", rets[0], + pci_name(dev), pci_pretty_name(dev)); } return 0; } @@ -577,53 +557,41 @@ int eeh_dn_check_failure(struct device_n * In any case they must share a common PHB. */ ret = read_slot_reset_state(dn, rets); - if (!(ret == 0 && rets[1] == 1 && (rets[0] == 2 || rets[0] == 4))) { + if (!(ret == 0 && ((rets[1] == 1 && (rets[0] == 2 || rets[0] >= 4)) + || (rets[0] == 5)))) { __get_cpu_var(false_positives)++; return 0; } - /* prevent repeated reports of this failure */ - dn->eeh_mode |= EEH_MODE_ISOLATED; - - reset_state = rets[0]; + /* Note that empty slots will fail; empty slots don't have children... */ + if ((rets[0] == 5) && (dn->child == NULL)) { + __get_cpu_var(false_positives)++; + return 0; + } - spin_lock_irqsave(&slot_errbuf_lock, flags); - memset(slot_errbuf, 0, eeh_error_buf_size); + /* Prevent repeated reports of this failure */ + dn->eeh_mode |= EEH_MODE_ISOLATED; + __get_cpu_var(slot_resets)++; - rc = rtas_call(ibm_slot_error_detail, - 8, 1, NULL, dn->eeh_config_addr, - BUID_HI(dn->phb->buid), - BUID_LO(dn->phb->buid), NULL, 0, - virt_to_phys(slot_errbuf), - eeh_error_buf_size, - 1 /* Temporary Error */); + if (!dev) + dev = eeh_find_pci_dev (dn); - if (rc == 0) - log_error(slot_errbuf, ERR_TYPE_RTAS_LOG, 0); - spin_unlock_irqrestore(&slot_errbuf_lock, flags); + /* Some devices go crazy if irq's are not ack'ed; disable irq now */ + if (dev) + disable_irq_nosync (dev->irq); + + state = pci_channel_io_normal; + if ((rets[0] == 2) || (rets[0] == 4)) + state = pci_channel_io_frozen; + if (rets[0] == 5) + state = pci_channel_io_perm_failure; - printk(KERN_INFO "EEH: MMIO failure (%d) on device: %s %s\n", - rets[0], dn->name, dn->full_name); - event = kmalloc(sizeof(*event), GFP_ATOMIC); - if (event == NULL) { - eeh_panic(dev, reset_state); - return 1; - } - - event->dev = dev; - event->dn = dn; - event->reset_state = reset_state; - - /* We may or may not be called in an interrupt context */ - 
spin_lock_irqsave(&eeh_eventlist_lock, flags); - list_add(&event->list, &eeh_eventlist); - spin_unlock_irqrestore(&eeh_eventlist_lock, flags); + peh_send_failure_event (dev, state, rets[2]); /* Most EEH events are due to device driver bugs. Having * a stack trace will help the device-driver authors figure * out what happened. So print that out. */ - dump_stack(); - schedule_work(&eeh_event_wq); + if (rets[0] != 5) dump_stack(); return 0; } @@ -635,7 +603,6 @@ EXPORT_SYMBOL(eeh_dn_check_failure); * @token i/o token, should be address in the form 0xA.... * @val value, should be all 1's (XXX why do we need this arg??) * - * Check for an eeh failure at the given token address. * Check for an EEH failure at the given token address. Call this * routine if the result of a read was all 0xff's and you want to * find out if this is due to an EEH slot freeze event. This routine @@ -643,6 +610,7 @@ EXPORT_SYMBOL(eeh_dn_check_failure); * * Note this routine is safe to call in an interrupt context. */ + unsigned long eeh_check_failure(const volatile void __iomem *token, unsigned long val) { unsigned long addr; @@ -652,8 +620,10 @@ unsigned long eeh_check_failure(const vo /* Finding the phys addr + pci device; this is pretty quick. 
*/ addr = eeh_token_to_phys((unsigned long __force) token); dev = pci_get_device_by_addr(addr); - if (!dev) + if (!dev) { + __get_cpu_var(no_device)++; return val; + } dn = pci_device_to_OF_node(dev); eeh_dn_check_failure (dn, dev); @@ -664,6 +634,249 @@ unsigned long eeh_check_failure(const vo EXPORT_SYMBOL(eeh_check_failure); +/* ------------------------------------------------------------- */ +/* The code below deals with error recovery */ + +int +eeh_slot_is_isolated(struct pci_dev *dev) +{ + struct device_node *dn; + dn = pci_device_to_OF_node(dev); + return (dn->eeh_mode & EEH_MODE_ISOLATED); +} + +int +eeh_ioaddr_is_isolated(const volatile void __iomem *token) +{ + unsigned long addr; + struct pci_dev *dev; + int rc; + + addr = eeh_token_to_phys((unsigned long __force) token); + dev = pci_get_device_by_addr(addr); + if (!dev) + return 0; + rc = eeh_slot_is_isolated(dev); + pci_dev_put(dev); + return rc; +} + +/** eeh_pci_slot_reset -- raises/lowers the pci #RST line + * state: 1/0 to raise/lower the #RST + */ +void +eeh_pci_slot_reset(struct pci_dev *dev, int state) +{ + struct device_node *dn = pci_device_to_OF_node(dev); + rtas_pci_slot_reset (dn, state); +} + +/** Return negative value if a permanent error, else return + * a number of milliseconds to wait until the PCI slot is + * ready to be used. 
+ */ +static int +eeh_slot_availability(struct device_node *dn) +{ + int rc; + int rets[3]; + + rc = read_slot_reset_state(dn, rets); +printk ("duuude dn=%s read slot reset state rc=%d rets=%d--%d--%d\n", dn->full_name, rc, rets[0], rets[1], rets[2]); + + if (rc) return rc; + + if (rets[1] == 0) return -1; /* EEH is not supported */ + if (rets[0] == 0) return 0; /* Oll Korrect */ + if (rets[0] == 5) { + if (rets[2] == 0) return -1; /* permanently unavailable */ + return rets[2]; /* number of millisecs to wait */ + } + return -1; +} + +int +eeh_pci_slot_availability(struct pci_dev *dev) +{ + struct device_node *dn = pci_device_to_OF_node(dev); + if (!dn) return -1; + + BUG_ON (dn->phb==NULL); + if (dn->phb==NULL) { + printk (KERN_ERR "EEH, checking on slot with no phb dn=%s dev=%s:%s\n", + dn->full_name, pci_name(dev), pci_pretty_name (dev)); + return -1; + } + return eeh_slot_availability (dn); +} + +void +rtas_pci_slot_reset(struct device_node *dn, int state) +{ + int rc; + + if (!dn) + return; + if (!dn->phb) { + printk (KERN_WARNING "EEH: in slot reset, device node %s has no phb\n", dn->full_name); + return; + } + + dn->eeh_mode |= EEH_MODE_RECOVERING; + rc = rtas_call(ibm_set_slot_reset,4,1, NULL, + dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid), + state); + if (rc) { + printk (KERN_WARNING "EEH: Unable to reset the failed slot, (%d) #RST=%d\n", rc, state); + return; + } + + if (state == 0) + dn->eeh_mode &= ~(EEH_MODE_RECOVERING|EEH_MODE_ISOLATED); +} + +/** rtas_set_slot_reset -- assert the pci #RST line for 1/4 second + * dn -- device node to be reset. + */ + +void +rtas_set_slot_reset(struct device_node *dn) +{ + int i, rc; + +printk ("duude going to reset device %s\n", dn->full_name); +eeh_slot_availability(dn); + rtas_pci_slot_reset (dn, 1); + + /* The PCI bus requires that the reset be held high for at least + * 100 milliseconds. We wait a bit longer 'just in case'. 
*/ + +#define PCI_BUS_RST_HOLD_TIME_MSEC 250 + msleep (PCI_BUS_RST_HOLD_TIME_MSEC); + rtas_pci_slot_reset (dn, 0); + + /* After a PCI slot has been reset, the PCI Express spec requires + * a 1.5 second idle time for the bus to stabilize, before starting + * up traffic. */ +#define PCI_BUS_SETTLE_TIME_MSEC 1800 + msleep (PCI_BUS_SETTLE_TIME_MSEC); + + /* Now double check with the firmware to make sure the device is + * ready to be used; if not, wait for recovery. */ + for (i=0; i<10; i++) { + rc = eeh_slot_availability (dn); + if (rc <= 0) return; + + msleep (rc+100); + } +eeh_slot_availability (dn); +printk ("duuude WTFFFFFFFFFFFFFFFFFFFFFFF done reseting %s\n", dn->full_name); +extern int rtas_read_config(struct device_node *dn, int where, int size, u32 *val); +u32 val; +for(i=0;i<16;i++) { +rc = rtas_read_config (dn, i*4,4,&val); +printk ("duude read config %d rc=%d val=%x expect=%x\n", i, rc, val,dn->config_space[i]); +} + +} + +EXPORT_SYMBOL(rtas_set_slot_reset); + +void +rtas_configure_bridge(struct device_node *dn) +{ + int token = rtas_token ("ibm,configure-bridge"); + int rc; + + if (token == RTAS_UNKNOWN_SERVICE) + return; + rc = rtas_call(token,3,1, NULL, + dn->eeh_config_addr, + BUID_HI(dn->phb->buid), + BUID_LO(dn->phb->buid)); + if (rc) { + printk (KERN_WARNING "EEH: Unable to configure device bridge (%d) for %s\n", + rc, dn->full_name); + } +} + +EXPORT_SYMBOL(rtas_configure_bridge); + +/* ------------------------------------------------------- */ +/** Save and restore of PCI BARs + * + * Although firmware will set up BARs during boot, it doesn't + * set up device BAR's after a device reset, although it will, + * if requested, set up bridge configuration. Thus, we need to + * configure the PCI devices ourselves. Config-space setup is + * stored in the PCI structures which are normally deleted during + * device removal. Thus, the "save" routine references the + * structures so that they aren't deleted. 
+ */ + +/** + * __restore_bars - Restore the Base Address Registers + * Loads the PCI configuration space base address registers, + * the expansion ROM base address, the latency timer, and etc. + * from the saved values in the device node. + */ +static inline void __restore_bars (struct device_node *dn) +{ + int i; + + if (NULL==dn->phb) return; + for (i=4; i<10; i++) { + rtas_write_config(dn, i*4, 4, dn->config_space[i]); + } + + /* 12 == Expansion ROM Address */ + rtas_write_config(dn, 12*4, 4, dn->config_space[12]); + +#define BYTE_SWAP(OFF) (8*((OFF)/4)+3-(OFF)) +#define SAVED_BYTE(OFF) (((u8 *)(dn->config_space))[BYTE_SWAP(OFF)]) + + rtas_write_config (dn, PCI_CACHE_LINE_SIZE, 1, + SAVED_BYTE(PCI_CACHE_LINE_SIZE)); + + rtas_write_config (dn, PCI_LATENCY_TIMER, 1, + SAVED_BYTE(PCI_LATENCY_TIMER)); + + /* max latency, min grant, interrupt pin and line */ + rtas_write_config(dn, 15*4, 4, dn->config_space[15]); +} + +/** + * eeh_restore_bars - restore the PCI config space info + */ +void eeh_restore_bars(struct device_node *dn) +{ + if (! dn->eeh_is_bridge) + __restore_bars (dn); + + if (dn->child) + eeh_restore_bars (dn->child); +#if DO_SIBLINGS + if (dn->sibling) + eeh_restore_bars (dn->sibling); +#endif +} + +void eeh_pci_restore_bars(struct pci_dev *dev) +{ + struct device_node *dn = pci_device_to_OF_node(dev); + eeh_restore_bars (dn); +} + +/* ------------------------------------------------------------- */ +/* The code below deals with enabling EEH for devices during the + * early boot sequence. EEH must be enabled before any PCI probing + * can be done. 
+ */ + +#define EEH_ENABLE 1 + struct eeh_early_enable_info { unsigned int buid_hi; unsigned int buid_lo; @@ -682,6 +895,8 @@ static void *early_enable_eeh(struct dev int enable; dn->eeh_mode = 0; + dn->eeh_check_count = 0; + dn->eeh_freeze_count = 0; if (status && strcmp(status, "ok") != 0) return NULL; /* ignore devices with bad status */ @@ -743,7 +958,7 @@ static void *early_enable_eeh(struct dev dn->full_name); } - return NULL; + return NULL; } /* @@ -824,11 +1039,13 @@ void eeh_add_device_early(struct device_ struct pci_controller *phb; struct eeh_early_enable_info info; - if (!dn || !eeh_subsystem_enabled) + if (!dn) return; phb = dn->phb; if (NULL == phb || 0 == phb->buid) { - printk(KERN_WARNING "EEH: Expected buid but found none\n"); + printk(KERN_WARNING "EEH: Expected buid but found none for %s\n", + dn->full_name); + dump_stack(); return; } @@ -847,6 +1064,9 @@ EXPORT_SYMBOL(eeh_add_device_early); */ void eeh_add_device_late(struct pci_dev *dev) { + int i; + struct device_node *dn; + if (!dev || !eeh_subsystem_enabled) return; @@ -856,6 +1076,14 @@ void eeh_add_device_late(struct pci_dev #endif pci_addr_cache_insert_device (dev); + + /* Save the BAR's; firmware doesn't restore these after EEH reset */ + dn = pci_device_to_OF_node(dev); + for (i = 0; i < 16; i++) + pci_read_config_dword(dev, i * 4, &dn->config_space[i]); + + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) + dn->eeh_is_bridge = 1; } EXPORT_SYMBOL(eeh_add_device_late); @@ -885,12 +1113,17 @@ static int proc_eeh_show(struct seq_file unsigned int cpu; unsigned long ffs = 0, positives = 0, failures = 0; unsigned long resets = 0; + unsigned long no_dev = 0, no_dn = 0, no_cfg = 0, no_check = 0; for_each_cpu(cpu) { ffs += per_cpu(total_mmio_ffs, cpu); positives += per_cpu(false_positives, cpu); failures += per_cpu(ignored_failures, cpu); resets += per_cpu(slot_resets, cpu); + no_dev += per_cpu(no_device, cpu); + no_dn += per_cpu(no_dn, cpu); + no_cfg += per_cpu(no_cfg_addr, cpu); + no_check += 
per_cpu(ignored_check, cpu); } if (0 == eeh_subsystem_enabled) { @@ -898,13 +1131,17 @@ static int proc_eeh_show(struct seq_file seq_printf(m, "eeh_total_mmio_ffs=%ld\n", ffs); } else { seq_printf(m, "EEH Subsystem is enabled\n"); - seq_printf(m, "eeh_total_mmio_ffs=%ld\n" + seq_printf(m, + "no device=%ld\n" + "no device node=%ld\n" + "no config address=%ld\n" + "check not wanted=%ld\n" + "eeh_total_mmio_ffs=%ld\n" "eeh_false_positives=%ld\n" "eeh_ignored_failures=%ld\n" - "eeh_slot_resets=%ld\n" - "eeh_fail_count=%d\n", - ffs, positives, failures, resets, - eeh_fail_count.counter); + "eeh_slot_resets=%ld\n", + no_dev, no_dn, no_cfg, no_check, + ffs, positives, failures, resets); } return 0; --- arch/ppc64/kernel/pSeries_pci.c.linas-orig 2005-04-29 20:33:03.000000000 -0500 +++ arch/ppc64/kernel/pSeries_pci.c 2005-05-06 12:28:43.000000000 -0500 @@ -52,7 +52,7 @@ static int s7a_workaround; extern struct mpic *pSeries_mpic; -static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) { int returnval = -1; unsigned long buid, addr; @@ -101,7 +101,7 @@ static int rtas_pci_read_config(struct p return PCIBIOS_DEVICE_NOT_FOUND; } -static int rtas_write_config(struct device_node *dn, int where, int size, u32 val) +int rtas_write_config(struct device_node *dn, int where, int size, u32 val) { unsigned long buid, addr; int ret; --- drivers/pci/hotplug/rpaphp.h.linas-orig 2005-04-29 20:26:21.000000000 -0500 +++ drivers/pci/hotplug/rpaphp.h 2005-05-06 12:28:43.000000000 -0500 @@ -118,7 +118,8 @@ extern int rpaphp_enable_pci_slot(struct extern int register_pci_slot(struct slot *slot); extern int rpaphp_unconfig_pci_adapter(struct slot *slot); extern int rpaphp_get_pci_adapter_status(struct slot *slot, int is_init, u8 * value); -extern struct hotplug_slot *rpaphp_find_hotplug_slot(struct pci_dev *dev); +extern void init_eeh_handler (void); +extern void exit_eeh_handler (void); /* 
rpaphp_core.c */ extern int rpaphp_add_slot(struct device_node *dn); --- drivers/pci/hotplug/rpaphp_core.c.linas-orig 2005-04-29 20:32:16.000000000 -0500 +++ drivers/pci/hotplug/rpaphp_core.c 2005-05-06 12:28:43.000000000 -0500 @@ -460,12 +460,18 @@ static int __init rpaphp_init(void) { info(DRIVER_DESC " version: " DRIVER_VERSION "\n"); + /* Get set to handle EEH events. */ + init_eeh_handler(); + /* read all the PRA info from the system */ return init_rpa(); } static void __exit rpaphp_exit(void) { + /* Let EEH know we are going away. */ + exit_eeh_handler(); + cleanup_slots(); } --- drivers/pci/hotplug/rpaphp_pci.c.linas-orig 2005-04-29 20:22:38.000000000 -0500 +++ drivers/pci/hotplug/rpaphp_pci.c 2005-05-06 17:19:33.000000000 -0500 @@ -22,8 +22,13 @@ * Send feedback to * */ +#include +#include +#include #include +#include #include +#include #include #include #include "../pci.h" /* for pci_add_new_bus */ @@ -63,6 +68,7 @@ int rpaphp_claim_resource(struct pci_dev root ? "Address space collision on" : "No parent found for", resource, dtype, pci_name(dev), res->start, res->end); + dump_stack(); } return err; } @@ -188,6 +194,19 @@ rpaphp_fixup_new_pci_devices(struct pci_ static int rpaphp_pci_config_bridge(struct pci_dev *dev); +static void rpaphp_eeh_add_bus_device(struct pci_bus *bus) +{ + struct pci_dev *dev; + list_for_each_entry(dev, &bus->devices, bus_list) { + eeh_add_device_late(dev); + if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) { + struct pci_bus *subbus = dev->subordinate; + if (bus) + rpaphp_eeh_add_bus_device (subbus); + } + } +} + /***************************************************************************** rpaphp_pci_config_slot() will configure all devices under the given slot->dn and return the the first pci_dev. 
@@ -215,6 +234,8 @@ rpaphp_pci_config_slot(struct device_nod } if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) rpaphp_pci_config_bridge(dev); + + rpaphp_eeh_add_bus_device(bus); } return dev; } @@ -223,7 +244,6 @@ static int rpaphp_pci_config_bridge(stru { u8 sec_busno; struct pci_bus *child_bus; - struct pci_dev *child_dev; dbg("Enter %s: BRIDGE dev=%s\n", __FUNCTION__, pci_name(dev)); @@ -240,11 +260,7 @@ static int rpaphp_pci_config_bridge(stru /* do pci_scan_child_bus */ pci_scan_child_bus(child_bus); - list_for_each_entry(child_dev, &child_bus->devices, bus_list) { - eeh_add_device_late(child_dev); - } - - /* fixup new pci devices without touching bus struct */ + /* Fixup new pci devices without touching bus struct */ rpaphp_fixup_new_pci_devices(child_bus, 0); /* Make the discovered devices available */ @@ -282,7 +298,7 @@ static void print_slot_pci_funcs(struct return; } #else -static void print_slot_pci_funcs(struct slot *slot) +static inline void print_slot_pci_funcs(struct slot *slot) { return; } @@ -364,7 +380,6 @@ static void rpaphp_eeh_remove_bus_device if (pdev) rpaphp_eeh_remove_bus_device(pdev); } - } return; } @@ -566,36 +581,280 @@ exit: return retval; } -struct hotplug_slot *rpaphp_find_hotplug_slot(struct pci_dev *dev) +/** + * rpaphp_search_bus_for_dev - return 1 if device is under this bus, else 0 + * @bus: the bus to search for this device. + * @dev: the pci device we are looking for. + */ +static int rpaphp_search_bus_for_dev (struct pci_bus *bus, struct pci_dev *dev) +{ + struct list_head *ln; + + if (!bus) return 0; + + for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) { + struct pci_dev *pdev = pci_dev_b(ln); + if (pdev == dev) + return 1; + if (pdev->subordinate) { + int rc; + rc = rpaphp_search_bus_for_dev (pdev->subordinate, dev); + if (rc) + return 1; + } + } + return 0; +} + +/** + * rpaphp_find_slot - find and return the slot holding the device + * @dev: pci device for which we want the slot structure. 
+ */ +static struct slot *rpaphp_find_slot(struct pci_dev *dev) { - struct list_head *tmp, *n; - struct slot *slot; + struct list_head *tmp, *n; + struct slot *slot; list_for_each_safe(tmp, n, &rpaphp_slot_head) { struct pci_bus *bus; - struct list_head *ln; slot = list_entry(tmp, struct slot, rpaphp_slot_list); - if (slot->bridge == NULL) { - if (slot->dev_type == PCI_DEV) { - printk(KERN_WARNING "PCI slot missing bridge %s %s \n", - slot->name, slot->location); - } + + /* PHB's don't have bridges. */ + if (slot->bridge == NULL) continue; - } + + /* The PCI device could be the slot itself. */ + if (slot->bridge == dev) + return slot; bus = slot->bridge->subordinate; if (!bus) { + printk (KERN_WARNING "PCI bridge is missing bus: %s %s\n", + pci_name (slot->bridge), pci_pretty_name (slot->bridge)); continue; /* should never happen? */ } - for (ln = bus->devices.next; ln != &bus->devices; ln = ln->next) { - struct pci_dev *pdev = pci_dev_b(ln); - if (pdev == dev) - return slot->hotplug_slot; - } + + if (rpaphp_search_bus_for_dev (bus, dev)) + return slot; } + return NULL; +} + +/** get_phb_of_device -- find the pci controller for the device + * @dev the pci device + * This routine returns a pointer to the device node that + * describes the pci controller for the indicated slot. + */ +static struct device_node * +get_phb_of_device (struct pci_dev *dev) +{ + struct device_node *dn; + struct pci_bus *bus; + + while (1) { + bus = dev->bus; + if (!bus) + break; + dn = pci_bus_to_OF_node(bus); + + if (dn->phb) + return dn; + + dev = bus->self; + BUG_ON (dev==NULL); + if (dev == NULL) + return NULL; + } return NULL; } -EXPORT_SYMBOL_GPL(rpaphp_find_hotplug_slot); +/* ------------------------------------------------------- */ +/** + * handle_eeh_events -- reset a PCI device after hard lockup. 
+ * + * pSeries systems will isolate a PCI slot if the PCI-Host + * bridge detects address or data parity errors, DMA's + * occuring to wild addresses (which usually happen due to + * bugs in device drivers or in PCI adapter firmware). + * Slot isolations also occur if #SERR, #PERR or other misc + * PCI-related errors are detected. + * + * Recovery process consists of unplugging the device driver + * (which generated hotplug events to userspace), then issuing + * a PCI #RST to the device, then reconfiguring the PCI config + * space for all bridges & devices under this slot, and then + * finally restarting the device drivers (which cause a second + * set of hotplug events to go out to userspace). + */ + +int eeh_reset_device (struct pci_dev *dev, struct device_node *dn, int reconfig) +{ + struct slot *frozen_slot= NULL; + + if (!dev) + return 1; + + if (reconfig) + frozen_slot = rpaphp_find_slot(dev); + + if (reconfig && frozen_slot) rpaphp_unconfig_pci_adapter (frozen_slot); + + /* Reset the pci controller. (Asserts RST#; resets config space). + * Reconfigure bridges and devices */ + rtas_set_slot_reset (dn->child); + rtas_configure_bridge(dn); + eeh_restore_bars(dn->child); +printk ("duude, post restore bars, for %s here's the dump\n", dn->full_name); +{ +extern int rtas_read_config(struct device_node *dn, int where, int size, u32 *val); +int i, rc; +u32 val; +struct device_node *xn=dn->child; +for(i=0;i<16;i++) { +rc = rtas_read_config (xn, i*4,4,&val); +printk ("duude read config %d rc=%d val=%x expect=%x\n", i, rc, val,xn->config_space[i]); +}} + + enable_irq (dev->irq); + + /* Give the system 5 seconds to finish running the user-space + * hotplug scripts, e.g. ifdown for ethernet. Yes, this is a hack, + * but if we don't do this, weird things happen. + */ + if (reconfig && frozen_slot) { + ssleep (5); + rpaphp_enable_pci_slot (frozen_slot); + } + return 0; +} + +/* The longest amount of time to wait for a pci device + * to come back on line, in seconds. 
+ */ +#define MAX_WAIT_FOR_RECOVERY 15 + +int handle_eeh_events (struct notifier_block *self, + unsigned long reason, void *ev) +{ + int freeze_count=0; + struct device_node *frozen_device; + struct peh_event *event = ev; + struct pci_dev *dev = event->dev; + int perm_failure = 0; + int rc; + + if (!dev) + { + printk ("EEH: EEH error caught, but no PCI device specified!\n"); + return 1; + } + + frozen_device = get_phb_of_device (dev); + + if (!frozen_device) + { + printk (KERN_ERR "EEH: Cannot find PCI conroller for %s %s\n", + pci_name(dev), pci_pretty_name (dev)); + + return 1; + } + + /* We get "permanent failure" messages on empty slots. + * These are false alarms. Empty slots have no child dn. */ + if ((event->state == pci_channel_io_perm_failure) && (frozen_device == NULL)) + return 0; + + if (frozen_device) + freeze_count = frozen_device->eeh_freeze_count; + freeze_count ++; + if (freeze_count > EEH_MAX_ALLOWED_FREEZES) + perm_failure = 1; + + /* If the reset state is a '5' and the time to reset is 0 (infinity) + * or is more then 15 seconds, then mark this as a permanent failure. + */ + if ((event->state == pci_channel_io_perm_failure) && + ((event->time_unavail <= 0) || + (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000))) + perm_failure = 1; + + /* Log the error with the rtas logger. */ + if (perm_failure) { + /* + * About 90% of all real-life EEH failures in the field + * are due to poorly seated PCI cards. Only 10% or so are + * due to actual, failed cards. + */ + printk (KERN_ERR + "EEH: device %s:%s has failed %d times \n" + "and has been permanently disabled. Please try reseating\n" + "this device or replacing it.\n", + pci_name (dev), + pci_pretty_name (dev), + freeze_count); + + eeh_slot_error_detail (frozen_device, 2 /* Permanent Error */); + + /* Notify the device that its about to go down. 
*/ + /* XXX this should be a recursive walk to children for + * multi-function devices */ + if (dev->driver->err_handler.error_detected) { + dev->driver->err_handler.error_detected (dev, pci_channel_io_perm_failure); + } + + /* If there's a hotplug slot, unconfigure it */ + struct slot * frozen_slot = rpaphp_find_slot(dev); + rpaphp_unconfig_pci_adapter (frozen_slot); + return 1; + } else { + eeh_slot_error_detail (frozen_device, 1 /* Temporary Error */); + } + + printk (KERN_WARNING + "EEH: This device has failed %d times since last reboot: %s:%s\n", + freeze_count, + pci_name (dev), + pci_pretty_name (dev)); + + /* Walk the various device drivers attached to this slot through + * a reset sequence, giving each an opportunity to do what it needs + * to accomplish the reset */ + /* XXX this should be a recursive walk to children for + * multi-function devices; each child should get to report + * status too, if needed ... if any child can't handle the reset, + * then need to hotplug it. + * XXX This does not follow flow of BenH's last email at all. + * XXX will be fixed later XXX + */ + if (dev->driver->err_handler.error_detected) { + dev->driver->err_handler.error_detected (dev, pci_channel_io_frozen); + rc = eeh_reset_device (dev, frozen_device, 0); + if (dev->driver->err_handler.slot_reset) + dev->driver->err_handler.slot_reset (dev); + } else { + rc = eeh_reset_device (dev, frozen_device, 1); + } + + /* Store the freeze count with the pci adapter, and not the slot. + * This way, if the device is replaced, the count is cleared. 
+ */ + frozen_device->eeh_freeze_count = freeze_count; + + return rc; +} + +static struct notifier_block eeh_block; + +void __init init_eeh_handler (void) +{ + eeh_block.notifier_call = handle_eeh_events; + peh_register_notifier (&eeh_block); +} + +void __exit exit_eeh_handler (void) +{ + peh_unregister_notifier (&eeh_block); +} + --- kernel/printk.c.linas-orig 2005-04-29 20:32:46.000000000 -0500 +++ kernel/printk.c 2005-05-06 12:28:43.000000000 -0500 @@ -383,6 +383,23 @@ asmlinkage long sys_syslog(int type, cha return do_syslog(type, buf, len); } +#ifdef CONFIG_DEBUG_KERNEL +/** + * Its very handy to be able to view the syslog buffer during debug. + * But do_syslog() uses locks and so it will deadlock if called during + * a debugging session. The routine provides the start and end of the + * physical and logical logs, and is equivalent to do_syslog(3). + */ + +void debugger_syslog_data(char *syslog_data[4]) +{ + syslog_data[0] = log_buf; + syslog_data[1] = log_buf + __LOG_BUF_LEN; + syslog_data[2] = log_buf + log_end - (logged_chars < __LOG_BUF_LEN ? 
logged_chars : __LOG_BUF_LEN); + syslog_data[3] = log_buf + log_end; +} +#endif /* CONFIG_DEBUG_KERNEL */ + /* * Call the console drivers on a range of log_buf */ --- arch/ppc64/xmon/xmon.c.linas-orig 2005-04-29 20:31:03.000000000 -0500 +++ arch/ppc64/xmon/xmon.c 2005-05-06 12:28:43.000000000 -0500 @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -100,6 +101,7 @@ static void prdump(unsigned long, long); static int ppc_inst_dump(unsigned long, long, int); void print_address(unsigned long); static void backtrace(struct pt_regs *); +static void xmon_show_stack(unsigned long sp, unsigned long lr, unsigned long pc); static void excprint(struct pt_regs *); static void prregs(struct pt_regs *); static void memops(int); @@ -131,6 +133,7 @@ static void csum(void); static void bootcmds(void); void dump_segments(void); static void symbol_lookup(void); +static void xmon_show_dmesg(void); static void xmon_print_symbol(unsigned long address, const char *mid, const char *after); static const char *getvecname(unsigned long vec); @@ -170,6 +173,7 @@ Commands:\n\ #endif "\ C checksum\n\ + D show dmesg (printk) buffer\n\ d dump bytes\n\ di dump instructions\n\ df dump float values\n\ @@ -186,6 +190,7 @@ Commands:\n\ mz zero a block of memory\n\ mi show information about memory allocation\n\ p show the task list\n\ + P show the task list and stacks\n\ r print registers\n\ s single step\n\ S print special registers\n\ @@ -310,6 +315,7 @@ int xmon_core(struct pt_regs *regs, int #endif msr = get_msr(); + msr |= MSR_SF|MSR_IR|MSR_DR; set_msrd(msr & ~MSR_EE); /* disable interrupts */ bp = in_breakpoint_table(regs->nip, &offset); @@ -323,15 +329,39 @@ int xmon_core(struct pt_regs *regs, int #ifdef CONFIG_SMP cpu = smp_processor_id(); if (cpu_isset(cpu, cpus_in_xmon)) { + int recursive = 1; get_output_lock(); excprint(regs); printf("cpu 0x%x: Exception %lx %s in xmon, " "returning to main loop\n", cpu, regs->trap, getvecname(TRAP(regs))); - 
longjmp(xmon_fault_jmp[cpu], 1); + + /* If crash occured in firmware, then saved stack pointer + * is bad, and we get recursive fault. Switch to using + * emergency stack in this case. + */ + unsigned long *sp = ((unsigned long *) xmon_fault_jmp[cpu]) + 1; + if (*sp < 0xc000000000000000) + { + printf("Bad stack pointer %lx in xmon, using emergency stack\n", *sp); + *sp = (unsigned long ) (get_paca()->emergency_sp); + sp = (unsigned long *) *sp; + *sp = (unsigned long ) (get_paca()->emergency_sp); + recursive = -1; + } + sp = (unsigned long *) *sp; + if (*sp < 0xc000000000000000) + { + printf("Bad stack frame %lx in xmon, using emergency stack\n", *sp); + *sp = (unsigned long ) (get_paca()->emergency_sp); + recursive = -1; + } +printf ("duude planing on returning with setjmp=%p\n", xmon_fault_jmp[cpu]); +printf ("duude planing on returning to %p w/stack=%p or %p\n", xmon_fault_jmp[cpu][0], sp, xmon_fault_jmp[cpu][1]); + longjmp(xmon_fault_jmp[cpu], recursive); } - if (setjmp(recurse_jmp) != 0) { + if (setjmp(recurse_jmp) > 0) { if (!in_xmon || !xmon_gate) { printf("xmon: WARNING: bad recursive fault " "on cpu 0x%x\n", cpu); @@ -353,6 +383,11 @@ int xmon_core(struct pt_regs *regs, int if (!fromipi) { get_output_lock(); excprint(regs); +printf ("duude this was a normal entry\n"); +printf ("duude saved return addr=%p, saves stackp=%p stack=%p\n", recurse_jmp[0], recurse_jmp[1], *((long **)(recurse_jmp[1]))); +printf ("duude my stack really really is %p\n", &msr); +printf ("duude my my setjmp is %p\n", recurse_jmp); + if (bp) { printf("cpu 0x%x stopped at breakpoint 0x%x (", cpu, BP_NUM(bp)); @@ -386,7 +421,7 @@ int xmon_core(struct pt_regs *regs, int smp_send_debugger_break(MSG_ALL_BUT_SELF); /* wait for other cpus to come in */ for (timeout = 100000000; timeout != 0; --timeout) { - if (cpus_weight(cpus_in_xmon) >= ncpus) + if (cpus_weight(*((cpumask_t *) &cpus_in_xmon)) >= ncpus) break; barrier(); } @@ -757,6 +792,64 @@ static void remove_cpu_bpts(void) set_iabr(0); } 
+static inline int +xmon_process_cpu(const task_t *p) +{ + return p->thread_info->cpu; +} + +#define xmon_task_has_cpu(p) (task_curr(p)) + +static void +xmon_show_task(task_t *p) +{ + printf("0x%p %8d %8d %d %4d %c 0x%p %c%s\n", + (void *)p, p->pid, p->parent->pid, + xmon_task_has_cpu(p), xmon_process_cpu(p), + (p->state == 0) ? 'R' : + (p->state < 0) ? 'U' : + (p->state & TASK_UNINTERRUPTIBLE) ? 'D' : + (p->state & TASK_STOPPED || p->ptrace & PT_PTRACED) ? 'T' : + (p->state & EXIT_ZOMBIE) ? 'Z' : + (p->state & EXIT_DEAD) ? 'X' : + (p->state & TASK_INTERRUPTIBLE) ? 'S' : '?', + (void *)(&p->thread), + (p == current) ? '*': ' ', + p->comm); +} + +static task_t *xmon_next_thread(const task_t *p) +{ + return pid_task(p->pids[PIDTYPE_TGID].pid_list.next, PIDTYPE_TGID); +} + +static void +xmon_show_state(int prt_stacks) +{ + task_t *g, *p; + + printf("%-*s Pid Parent [*] cpu State %-*s Command\n", + (int)(2*sizeof(void *))+2, "Task Addr", + (int)(2*sizeof(void *))+2, "Thread"); + +#ifdef PER_CPU_RUNQUEUES_NO_LONGER_DECLARED_STATIC_IN_SCHED_C + /* Run the active tasks first */ + for (cpu = 0; cpu < NR_CPUS; ++cpu) + if (cpu_online(cpu)) { + p = cpu_curr(cpu); + xmon_show_task(p); + } +#endif + + /* Now the real tasks */ + do_each_thread(g, p) { + xmon_show_task(p); + if (prt_stacks) + xmon_show_stack(p->thread.ksp, 0, 0); + } while ((p = xmon_next_thread(p)) != g); +} + + /* Command interpreting routine */ static char *last_cmd; @@ -809,6 +902,9 @@ cmds(struct pt_regs *excp) case 'd': dump(); break; + case 'D': + xmon_show_dmesg(); + break; case 'l': symbol_lookup(); break; @@ -839,7 +935,10 @@ cmds(struct pt_regs *excp) printf(help_string); break; case 'p': - show_state(); + xmon_show_state(0); + break; + case 'P': + xmon_show_state(1); break; case 'b': bpt_cmds(); @@ -2400,6 +2499,58 @@ static void xmon_print_symbol(unsigned l printf("%s", after); } +extern void debugger_syslog_data(char *syslog_data[4]); +#define SYSLOG_WRAP(p) if (p < syslog_data[0]) p = 
syslog_data[1]-1; \ + else if (p >= syslog_data[1]) p = syslog_data[0]; + +static void xmon_show_dmesg(void) +{ + char *syslog_data[4], *start, *end, c; + int logsize; + + /* syslog_data[0,1] physical start, end+1. + * syslog_data[2,3] logical start, end+1. + */ + debugger_syslog_data(syslog_data); + if (syslog_data[2] == syslog_data[3]) + return; + logsize = syslog_data[1] - syslog_data[0]; + start = syslog_data[0] + (syslog_data[2] - syslog_data[0]) % logsize; + end = syslog_data[0] + (syslog_data[3] - syslog_data[0]) % logsize; + + /* Do a line at a time (max 200 chars) to reduce overhead */ + c = '\0'; + while(1) { + char *p; + int chars = 0; + if (!*start) { + while (!*start) { + ++start; + SYSLOG_WRAP(start); + if (start == end) + break; + } + if (start == end) + break; + } + p = start; + while (*start && chars < 200) { + c = *start; + ++chars; + ++start; + SYSLOG_WRAP(start); + if (start == end || c == '\n') + break; + } + if (chars) + printf("%.*s", chars, p); + if (start == end) + break; + } + if (c != '\n') + printf("\n"); +} + static void debug_trace(void) { unsigned long val, cmd, on; From michael at ellerman.id.au Sat May 7 12:01:19 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Sat, 7 May 2005 12:01:19 +1000 Subject: [PATCH 1/3] ppc64: iseries_veth: Don't send packets to LPARs which aren't up Message-ID: <200505071201.20123.michael@ellerman.id.au> Hi Everybody, The iseries_veth driver has a logic bug which means it will erroneously send packets to LPARs for which we don't have a connection. This usually isn't a big problem because the Hypervisor call fails gracefully and we return, but if packets are TX'ed during the negotiation of the connection bad things might happen. Regardless, the right thing is to bail early if we know there's no connection. 
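To make the precedence problem concrete, here is a minimal standalone sketch (not the driver code). The flag value EXAMPLE_STATE_READY and the helper names are hypothetical, chosen only for illustration; the real VETH_STATE_READY lives in the driver and may have a different value. The point is that `!` binds tighter than `&`, so the unparenthesised test negates the whole state word first and only then masks the resulting 0 or 1.

```c
#include <assert.h>

/* Hypothetical flag value, for illustration only; the real
 * VETH_STATE_READY is defined in the driver and may differ. */
#define EXAMPLE_STATE_READY 0x80u

/* Old form: the compiler parses this as (!state) & EXAMPLE_STATE_READY. */
static int should_drop_buggy(unsigned int state)
{
	return (! state & EXAMPLE_STATE_READY) != 0;
}

/* Fixed form: ask for a drop whenever the READY bit is clear. */
static int should_drop_fixed(unsigned int state)
{
	return ! (state & EXAMPLE_STATE_READY);
}

/* Self-check on a state that has some other bit set but not READY:
 * only the fixed test asks for the packet to be dropped. */
static void precedence_selfcheck(void)
{
	unsigned int not_ready = 0x01u;

	assert(should_drop_buggy(not_ready) == 0);
	assert(should_drop_fixed(not_ready) == 1);
}
```

Under these assumptions, a flag whose bit 0 is clear can never make the old test true, because (!state) is only ever 0 or 1. That would make the early "goto drop" unreachable, which matches the behaviour described above.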
Signed-off-by: Michael Ellerman -- iseries_veth.c | 2 +- 1 files changed, 1 insertion(+), 1 deletion(-) Index: veth-fixes/drivers/net/iseries_veth.c =================================================================== --- veth-fixes.orig/drivers/net/iseries_veth.c +++ veth-fixes/drivers/net/iseries_veth.c @@ -924,7 +924,7 @@ static int veth_transmit_to_one(struct s spin_lock_irqsave(&cnx->lock, flags); - if (! cnx->state & VETH_STATE_READY) + if (! (cnx->state & VETH_STATE_READY)) goto drop; if ((skb->len - 14) > VETH_MAX_MTU) From michael at ellerman.id.au Sat May 7 12:01:25 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Sat, 7 May 2005 12:01:25 +1000 Subject: [PATCH 2/3] ppc64: iseries_veth: Set dev->trans_start so watchdog timer works right Message-ID: <200505071201.25357.michael@ellerman.id.au> Hi Everybody, The iseries_veth driver doesn't set dev->trans_start in its TX path. This will cause the net device watchdog timer to fire earlier than we want it to, which causes the driver to needlessly reset its connections to other LPARs. Signed-off-by: Michael Ellerman -- iseries_veth.c | 2 ++ 1 files changed, 2 insertions(+) Index: veth-fixes/drivers/net/iseries_veth.c =================================================================== --- veth-fixes.orig/drivers/net/iseries_veth.c +++ veth-fixes/drivers/net/iseries_veth.c @@ -1023,6 +1023,8 @@ static int veth_start_xmit(struct sk_buf lpmask = veth_transmit_to_many(skb, lpmask, dev); + dev->trans_start = jiffies; + if (! lpmask) { dev_kfree_skb(skb); } else { From michael at ellerman.id.au Sat May 7 12:01:33 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Sat, 7 May 2005 12:01:33 +1000 Subject: [PATCH 3/3] ppc64: iseries_veth: Don't leak skbs in RX path Message-ID: <200505071201.33881.michael@ellerman.id.au> Hi Everybody, In some strange circumstances the iseries_veth driver will leak skbs in its RX path. Fix is simply to call dev_kfree_skb() in the right place.
Fix up the comment as well. Signed-off-by: Michael Ellerman -- iseries_veth.c | 17 +++++++++++------ 1 files changed, 11 insertions(+), 6 deletions(-) Index: veth-fixes/drivers/net/iseries_veth.c =================================================================== --- veth-fixes.orig/drivers/net/iseries_veth.c +++ veth-fixes/drivers/net/iseries_veth.c @@ -1264,13 +1264,18 @@ static void veth_receive(struct veth_lpa vlan = skb->data[9]; dev = veth_dev[vlan]; - if (! dev) - /* Some earlier versions of the driver sent - broadcasts down all connections, even to - lpars that weren't on the relevant vlan. - So ignore packets belonging to a vlan we're - not on. */ + if (! dev) { + /* + * Some earlier versions of the driver sent + * broadcasts down all connections, even to lpars + * that weren't on the relevant vlan. So ignore + * packets belonging to a vlan we're not on. + * We can also be here if we receive packets while + * the driver is going down, because then dev is NULL. + */ + dev_kfree_skb_irq(skb); continue; + } port = (struct veth_port *)dev->priv; dest = *((u64 *) skb->data) & 0xFFFFFFFFFFFF0000; From david at gibson.dropbear.id.au Sat May 7 13:37:13 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Sat, 7 May 2005 13:37:13 +1000 Subject: Patch to kill ioremap_mm In-Reply-To: <1115310090.6011.21.camel@sinatra.austin.ibm.com> References: <20050505014256.GE18270@localhost.localdomain> <1115310090.6011.21.camel@sinatra.austin.ibm.com> Message-ID: <20050507033712.GC19538@localhost.localdomain> On Thu, May 05, 2005 at 11:21:30AM -0500, John Rose wrote: > > > Hi David- > > Given that we use a separate allocation scheme for imalloc mappings, > does it make sense to lump these into the vmalloc mm_struct, and to > share the vmalloc address space? This saves lines of code, but is it as > clear as the existing (separate) layout? In a word: yes. As Ben's explained this can be a first step to greatly simplifying the imalloc stuff. 
But even without that, there's absolutely no good reason to have two different sets of pagetables for the kernel unlike every other architecture. init_mm provides a perfectly good virtual memory space, let's use it, instead of pointlessly duplicating stuff and complexifying the hash path. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From arnd at arndb.de Sat May 7 21:31:52 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Sat, 7 May 2005 13:31:52 +0200 Subject: [PATCH 1/4] ppc64: rename arch/ppc64/kernel/pSeries_pci.c In-Reply-To: <17018.64606.662481.104228@cargo.ozlabs.ibm.com> References: <200504200149.22063.arnd@arndb.de> <200504200152.58965.arnd@arndb.de> <17018.64606.662481.104228@cargo.ozlabs.ibm.com> Message-ID: <200505071331.53944.arnd@arndb.de> On Friday 06 May 2005 07:10, Paul Mackerras wrote: > Hmmm, you rename pSeries_pci.c to rtas_pci.c and then in the next > patch you recreate pSeries_pci.c and move some stuff from rtas_pci.c > into it. Could we have one patch that creates rtas_pci.c and just > moves stuff from pSeries_pci.c to it? Sure. I wanted to make it easier to review as the rename patch is trivial and the second patch is less to read than the combined one. In my next update I will fold the two together.
Arnd <>< From olh at suse.de Sun May 8 03:04:34 2005 From: olh at suse.de (Olaf Hering) Date: Sat, 7 May 2005 19:04:34 +0200 Subject: [PATCH] remove unused arch/ppc64/boot/piggyback.c Message-ID: <20050507170434.GA25407@suse.de> piggyback is not called in arch/ppc64/boot/Makefile Signed-off-by: Olaf Hering Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/Makefile +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile @@ -52,7 +52,7 @@ obj-sec = $(foreach section, $(1), $(pat src-sec = $(foreach section, $(1), $(patsubst %,$(obj)/kernel-%.c, $(section))) gz-sec = $(foreach section, $(1), $(patsubst %,$(obj)/kernel-%.gz, $(section))) -hostprogs-y := piggy addnote addRamDisk +hostprogs-y := addnote addRamDisk targets += zImage zImage.initrd imagesize.c \ $(patsubst $(obj)/%,%, $(call obj-sec, $(required) $(initrd))) \ $(patsubst $(obj)/%,%, $(call src-sec, $(required) $(initrd))) \ @@ -78,9 +78,6 @@ addsection = $(CROSS32OBJCOPY) $(1) \ quiet_cmd_addnote = ADDNOTE $@ cmd_addnote = $(CROSS32LD) $(BOOTLFLAGS) -o $@ $(obj-boot) && $(obj)/addnote $@ -quiet_cmd_piggy = PIGGY $@ - cmd_piggy = $(obj)/piggyback $(@:.o=) < $< | $(CROSS32AS) -o $@ - $(call gz-sec, $(required)): $(obj)/kernel-%.gz: % FORCE $(call if_changed,gzip) Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/piggyback.c =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/piggyback.c +++ /dev/null @@ -1,83 +0,0 @@ -/* - * Copyright 2001 IBM Corp - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. 
- */ -#include -#include -#include - -extern long ce_exec_config[]; - -int main(int argc, char *argv[]) -{ - int i, cnt, pos, len; - unsigned int cksum, val; - unsigned char *lp; - unsigned char buf[8192]; - char *varname; - if (argc != 2) - { - fprintf(stderr, "usage: %s name out-file\n", - argv[0]); - exit(1); - } - - varname = strrchr(argv[1], '/'); - if (varname) - varname++; - else - varname = argv[1]; - - fprintf(stdout, "#\n"); - fprintf(stdout, "# Miscellaneous data structures:\n"); - fprintf(stdout, "# WARNING - this file is automatically generated!\n"); - fprintf(stdout, "#\n"); - fprintf(stdout, "\n"); - fprintf(stdout, "\t.data\n"); - fprintf(stdout, "\t.globl %s_data\n", varname); - fprintf(stdout, "%s_data:\n", varname); - pos = 0; - cksum = 0; - while ((len = read(0, buf, sizeof(buf))) > 0) - { - cnt = 0; - lp = (unsigned char *)buf; - len = (len + 3) & ~3; /* Round up to longwords */ - for (i = 0; i < len; i += 4) - { - if (cnt == 0) - { - fprintf(stdout, "\t.long\t"); - } - fprintf(stdout, "0x%02X%02X%02X%02X", lp[0], lp[1], lp[2], lp[3]); - val = *(unsigned long *)lp; - cksum ^= val; - lp += 4; - if (++cnt == 4) - { - cnt = 0; - fprintf(stdout, " # %x \n", pos+i-12); - fflush(stdout); - } else - { - fprintf(stdout, ","); - } - } - if (cnt) - { - fprintf(stdout, "0\n"); - } - pos += len; - } - fprintf(stdout, "\t.globl %s_len\n", varname); - fprintf(stdout, "%s_len:\t.long\t0x%x\n", varname, pos); - fflush(stdout); - fclose(stdout); - fprintf(stderr, "cksum = %x\n", cksum); - exit(0); -} - From olh at suse.de Sun May 8 03:05:51 2005 From: olh at suse.de (Olaf Hering) Date: Sat, 7 May 2005 19:05:51 +0200 Subject: [PATCH] remove unused arch/ppc64/boot/mknote.c Message-ID: <20050507170551.GB25407@suse.de> mknote is not called in arch/ppc64/boot/Makefile Signed-off-by: Olaf Hering Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/mknote.c =================================================================== --- 
linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/mknote.c +++ /dev/null @@ -1,43 +0,0 @@ -/* - * Copyright (C) Cort Dougan 1999. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - * - * Generate a note section as per the CHRP specification. - * - */ - -#include - -#define PL(x) printf("%c%c%c%c", ((x)>>24)&0xff, ((x)>>16)&0xff, ((x)>>8)&0xff, (x)&0xff ); - -int main(void) -{ -/* header */ - /* namesz */ - PL(strlen("PowerPC")+1); - /* descrsz */ - PL(6*4); - /* type */ - PL(0x1275); - /* name */ - printf("PowerPC"); printf("%c", 0); - -/* descriptor */ - /* real-mode */ - PL(0xffffffff); - /* real-base */ - PL(0x00c00000); - /* real-size */ - PL(0xffffffff); - /* virt-base */ - PL(0xffffffff); - /* virt-size */ - PL(0xffffffff); - /* load-base */ - PL(0x4000); - return 0; -} From markus at unixforces.net Sun May 8 03:09:04 2005 From: markus at unixforces.net (Markus Rothe) Date: Sat, 7 May 2005 17:09:04 +0000 Subject: [BUG] linux-2.6.12_rc4: Oops: Kernel access of bad area, sig: 11 [#1] Message-ID: <20050507170904.GA9488@unixforces.net> Hello, I'm running Linux on my G5. 
I just compiled linux-2.6.12_rc4 and got this when loading my sound modules at boot time (linux-2.6.12_rc3 worked just fine): ---- SNIP ---- Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2 POWERMAC Modules linked in: snd_powermac snd_pcm snd_page_alloc snd_timer snd soundcore NIP: C0000000002E4030 XER: 20000000 LR: D0000000001B4AC8 CTR: C0000000002E4004 REGS: c000000237783890 TRAP: 0300 Not tainted (2.6.12-rc4) MSR: 9000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22022484 DAR: 0000000000000002 DSISR: 0000000040000000 TASK: c0000002375ba030[7007] 'modprobe' THREAD: c000000237780000 CPU: 0 GPR00: D0000000001B4AC8 C000000237783B10 C0000000005A5E70 0000000000000000 GPR04: 0000000000000001 0000000000000060 0000000000000000 0000000000000001 GPR08: 0000000000000002 0000000000000000 C00000000047AAC0 C0000000002E4004 GPR12: D0000000001B9C58 C00000000047B800 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000010028790 0000000000000000 000000001002A4A0 GPR20: 0000000000000000 0000000000000000 0000000000000000 00000080001B1010 GPR24: 000000001002AA90 000000001002ABF8 000000001002ABE0 C000000000478958 GPR28: C00000000F70D580 000000000000000A D0000000001CACD8 D0000000001C137C NIP [c0000000002e4030] .i2c_smbus_write_byte_data+0x2c/0x50 LR [d0000000001b4ac8] .send_init_client+0x50/0x110 [snd_powermac] Call Trace: [c000000237783b10] [d0000000001c2030] tumbler_mixers+0x0/0xffffffffffff8b28 [snd_powermac] (unreliable) [c000000237783bc0] [d0000000001b4ac8] .send_init_client+0x50/0x110 [snd_powermac] [c000000237783c60] [d0000000001b9514] .snd_pmac_tumbler_post_init+0x3c/0x94 [snd_powermac] [c000000237783ce0] [d0000000001b74fc] .alsa_card_pmac_init+0x174/0x3cc [snd_powermac] [c000000237783d90] [c000000000063988] .sys_init_module+0x2cc/0x4a8 [c000000237783e30] [c00000000000d980] syscall_exit+0x0/0x18 Instruction dump: 4e800020 7c0802a6 7c691b78 7c872378 38c00000 39000002 f8010010 f821ff51 98a10070 60000000 60000000 60000000 a0890006 e8630008 39210070 
<6>usbcore: registered new driver snd-usb-audio ---- SNIP ---- Could someone please take a look at that? Best regards, Markus -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050507/1305e517/attachment.pgp From olh at suse.de Sun May 8 07:24:49 2005 From: olh at suse.de (Olaf Hering) Date: Sat, 7 May 2005 23:24:49 +0200 Subject: missing deps in arch/ppc64/boot Message-ID: <20050507212449.GA26741@suse.de> Sam, touching arch/ppc64/boot/zlib.h will not cause a rebuild of arch/ppc64/boot/zlib.o. Any ideas what is missing? I use 'make ARCH=ppc64 O=../O-2.6.12-rc4-ppc64-defconfig-boot' ../O-2.6.12-rc4-ppc64-defconfig-boot/arch/ppc64/boot/.zlib.o.cmd cmd_arch/ppc64/boot/zlib.o := gcc -m32 -Wp,-MD,arch/ppc64/boot/.zlib.o.d -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -I/home/olaf/kernel/linux-2.6.12-rc4-olh/arch/ppc64/boot/include -fno-builtin -nostdinc -isystem /usr/lib/gcc-lib/powerpc-suse-linux/3.3.3/include -c -o arch/ppc64/boot/zlib.o /home/olaf/kernel/linux-2.6.12-rc4-olh/arch/ppc64/boot/zlib.c deps_arch/ppc64/boot/zlib.o := \ /home/olaf/kernel/linux-2.6.12-rc4-olh/arch/ppc64/boot/zlib.c \ /home/olaf/kernel/linux-2.6.12-rc4-olh/arch/ppc64/boot/zlib.h \ arch/ppc64/boot/zlib.o: $(deps_arch/ppc64/boot/zlib.o) $(deps_arch/ppc64/boot/zlib.o): From olh at suse.de Sun May 8 07:50:11 2005 From: olh at suse.de (Olaf Hering) Date: Sat, 7 May 2005 23:50:11 +0200 Subject: [PATCH] remove printk usage in arch/ppc64/boot/prom.c Message-ID: <20050507215011.GA26918@suse.de> remove the printk usage in the zImage. we are not there, yet. 
Signed-off-by: Olaf Hering Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/main.c =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/main.c +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/main.c @@ -17,7 +17,6 @@ extern void *finddevice(const char *); extern int getprop(void *, const char *, void *, int); -extern void printk(char *fmt, ...); extern void printf(const char *fmt, ...); extern int sprintf(char *buf, const char *fmt, ...); void gunzip(void *, int, unsigned char *, int *); Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/prom.c +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c @@ -23,7 +23,7 @@ void *finddevice(const char *name); int getprop(void *phandle, const char *name, void *buf, int buflen); void chrpboot(int a1, int a2, void *prom); /* in main.c */ -void printk(char *fmt, ...); +int printf(char *fmt, ...); /* there is no convenient header to get this from... -- paulus */ extern unsigned long strlen(const char *); @@ -203,7 +203,7 @@ readchar(void) case 1: return ch; case -1: - printk("read(stdin) returned -1\r\n"); + printf("read(stdin) returned -1\r\n"); return -1; } } @@ -611,18 +611,6 @@ int sprintf(char * buf, const char *fmt, static char sprint_buf[1024]; -void -printk(char *fmt, ...) -{ - va_list args; - int n; - - va_start(args, fmt); - n = vsprintf(sprint_buf, fmt, args); - va_end(args); - write(stdout, sprint_buf, n); -} - int printf(char *fmt, ...) { From olh at suse.de Sun May 8 07:52:20 2005 From: olh at suse.de (Olaf Hering) Date: Sat, 7 May 2005 23:52:20 +0200 Subject: [PATCH] remove duplicate printf in arch/ppc64/boot/main.c Message-ID: <20050507215220.GB26918@suse.de> initrd size is printed as hex, add a missing 0x remove a duplicate printf when initrd is used. remove use of kernel type to access the first bytes of the initrd memarea. 
Signed-off-by: Olaf Hering Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/main.c =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/main.c +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/main.c @@ -146,10 +146,10 @@ void start(unsigned long a1, unsigned lo } a1 = initrd.addr; a2 = initrd.size; - printf("initial ramdisk moving 0x%lx <- 0x%lx (%lx bytes)\n\r", + printf("initial ramdisk moving 0x%lx <- 0x%lx (0x%lx bytes)\n\r", initrd.addr, (unsigned long)_initrd_start, initrd.size); memmove((void *)initrd.addr, (void *)_initrd_start, initrd.size); - printf("initrd head: 0x%lx\n\r", *((u32 *)initrd.addr)); + printf("initrd head: 0x%lx\n\r", *((unsigned long *)initrd.addr)); } /* Eventually gunzip the kernel */ @@ -200,9 +200,6 @@ void start(unsigned long a1, unsigned lo flush_cache((void *)vmlinux.addr, vmlinux.size); - if (a1) - printf("initrd head: 0x%lx\n\r", *((u32 *)initrd.addr)); - kernel_entry = (kernel_entry_t)vmlinux.addr; #ifdef DEBUG printf( "kernel:\n\r" From olh at suse.de Sun May 8 07:58:01 2005 From: olh at suse.de (Olaf Hering) Date: Sat, 7 May 2005 23:58:01 +0200 Subject: [PATCH] make arch/ppc64/boot standalone Message-ID: <20050507215801.GC26918@suse.de> make the bootheader for ppc64 independent from kernel and libc headers add -nostdinc -isystem $gccincludes to not include libc headers declare all functions in header files, also the stuff from string.S declare some functions static use stddef.h to get size_t (hopefully ok) remove ppc32-types.h, only elf.h used the __NN types Signed-off-by: Olaf Hering arch/ppc64/boot/ppc32-types.h | 36 ------ arch/ppc64/boot/Makefile | 4 arch/ppc64/boot/crt0.S | 2 arch/ppc64/boot/div64.S | 2 arch/ppc64/boot/include/ctype.h | 54 +++++++++ arch/ppc64/boot/include/elf.h | 149 ++++++++++++++++++++++++ arch/ppc64/boot/include/page.h | 34 +++++ arch/ppc64/boot/include/ppc_asm.h | 228 ++++++++++++++++++++++++++++++++++++++ arch/ppc64/boot/include/prom.h | 18 +++ 
arch/ppc64/boot/include/stdio.h | 16 ++ arch/ppc64/boot/include/string.h | 20 +++ arch/ppc64/boot/main.c | 55 +++------ arch/ppc64/boot/prom.c | 25 +--- arch/ppc64/boot/string.S | 2 14 files changed, 552 insertions(+), 93 deletions(-) Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/include/prom.h =================================================================== --- /dev/null +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/include/prom.h @@ -0,0 +1,18 @@ +#ifndef _PPC_BOOT_PROM_H_ +#define _PPC_BOOT_PROM_H_ + +extern int (*prom) (void *); +extern void *chosen_handle; + +extern void *stdin; +extern void *stdout; +extern void *stderr; + +extern int write(void *handle, void *ptr, int nb); +extern int read(void *handle, void *ptr, int nb); +extern void exit(void); +extern void pause(void); +extern void *finddevice(const char *); +extern void *claim(unsigned long virt, unsigned long size, unsigned long align); +extern int getprop(void *phandle, const char *name, void *buf, int buflen); +#endif /* _PPC_BOOT_PROM_H_ */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/include/stdio.h =================================================================== --- /dev/null +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/include/stdio.h @@ -0,0 +1,16 @@ +#ifndef _PPC_BOOT_STDIO_H_ +#define _PPC_BOOT_STDIO_H_ + +extern int printf(const char *fmt, ...); + +extern int sprintf(char *buf, const char *fmt, ...); + +extern int vsprintf(char *buf, const char *fmt, va_list args); + +extern int putc(int c, void *f); +extern int putchar(int c); +extern int getchar(void); + +extern int fputs(char *str, void *f); + +#endif /* _PPC_BOOT_STDIO_H_ */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/include/string.h =================================================================== --- /dev/null +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/include/string.h @@ -0,0 +1,20 @@ +#ifndef _PPC_BOOT_STRING_H_ +#define _PPC_BOOT_STRING_H_ + +extern char *strcpy(char *dest, const char *src); +extern char *strncpy(char *dest, const char 
*src, size_t n); +extern char *strcat(char *dest, const char *src); +extern int strcmp(const char *s1, const char *s2); +extern size_t strlen(const char *s); +extern size_t strnlen(const char *s, size_t count); + +extern unsigned long simple_strtoul(const char *cp, char **endp, + unsigned int base); +extern long simple_strtol(const char *cp, char **endp, unsigned int base); + +extern void *memset(void *s, int c, size_t n); +extern void *memmove(void *dest, const void *src, unsigned long n); +extern void *memcpy(void *dest, const void *src, unsigned long n); +extern int memcmp(const void *s1, const void *s2, size_t n); + +#endif /* _PPC_BOOT_STRING_H_ */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/main.c =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/main.c +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/main.c @@ -8,36 +8,28 @@ * as published by the Free Software Foundation; either version * 2 of the License, or (at your option) any later version. 
*/ -#include "ppc32-types.h" +#include +#include +#include +#include +#include +#include +#include #include "zlib.h" -#include -#include -#include -#include - -extern void *finddevice(const char *); -extern int getprop(void *, const char *, void *, int); -extern void printf(const char *fmt, ...); -extern int sprintf(char *buf, const char *fmt, ...); -void gunzip(void *, int, unsigned char *, int *); -void *claim(unsigned int, unsigned int, unsigned int); -void flush_cache(void *, unsigned long); -void pause(void); -extern void exit(void); - -unsigned long strlen(const char *s); -void *memmove(void *dest, const void *src, unsigned long n); -void *memcpy(void *dest, const void *src, unsigned long n); + +static void gunzip(void *, int, unsigned char *, int *); +extern void flush_cache(void *, unsigned long); + /* Value picked to match that used by yaboot */ #define PROG_START 0x01400000 #define RAM_END (256<<20) // Fixme: use OF */ -char *avail_ram; -char *begin_avail, *end_avail; -char *avail_high; -unsigned int heap_use; -unsigned int heap_max; +static char *avail_ram; +static char *begin_avail, *end_avail; +static char *avail_high; +static unsigned int heap_use; +static unsigned int heap_max; extern char _start[]; extern char _vmlinux_start[]; @@ -52,9 +44,9 @@ struct addr_range { unsigned long size; unsigned long memsize; }; -struct addr_range vmlinux = {0, 0, 0}; -struct addr_range vmlinuz = {0, 0, 0}; -struct addr_range initrd = {0, 0, 0}; +static struct addr_range vmlinux = {0, 0, 0}; +static struct addr_range vmlinuz = {0, 0, 0}; +static struct addr_range initrd = {0, 0, 0}; static char scratch[128<<10]; /* 128kB of scratch space for gunzip */ @@ -64,13 +56,6 @@ typedef void (*kernel_entry_t)( unsigned void *); -int (*prom)(void *); - -void *chosen_handle; -void *stdin; -void *stdout; -void *stderr; - #undef DEBUG static unsigned long claim_base = PROG_START; @@ -277,7 +262,7 @@ void zfree(void *x, void *addr, unsigned #define DEFLATED 8 -void gunzip(void 
*dst, int dstlen, unsigned char *src, int *lenp) +static void gunzip(void *dst, int dstlen, unsigned char *src, int *lenp) { z_stream s; int r, i, flags; Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/prom.c +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c @@ -7,26 +7,20 @@ * 2 of the License, or (at your option) any later version. */ #include -#include -#include -#include +#include +#include +#include +#include +#include int (*prom)(void *); void *chosen_handle; + void *stdin; void *stdout; void *stderr; -void exit(void); -void *finddevice(const char *name); -int getprop(void *phandle, const char *name, void *buf, int buflen); -void chrpboot(int a1, int a2, void *prom); /* in main.c */ - -int printf(char *fmt, ...); - -/* there is no convenient header to get this from... -- paulus */ -extern unsigned long strlen(const char *); int write(void *handle, void *ptr, int nb) @@ -193,7 +187,7 @@ fputs(char *str, void *f) return write(f, str, n) == n? 0: -1; } -int +static int readchar(void) { char ch; @@ -420,9 +414,6 @@ static char * number(char * str, long nu return str; } -/* Forward decl. needed for IP address printing stuff... */ -int sprintf(char * buf, const char *fmt, ...); - int vsprintf(char *buf, const char *fmt, va_list args) { int len; @@ -612,7 +603,7 @@ int sprintf(char * buf, const char *fmt, static char sprint_buf[1024]; int -printf(char *fmt, ...) +printf(const char *fmt, ...) 
{ va_list args; int n; Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/Makefile +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile @@ -22,8 +22,8 @@ HOSTCC := gcc -BOOTCFLAGS := $(HOSTCFLAGS) $(LINUXINCLUDE) -fno-builtin -BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAGS) -traditional +BOOTCFLAGS := $(HOSTCFLAGS) -I$(srctree)/$(src)/include -fno-builtin -nostdinc -isystem $(shell $(CROSS32CC) -print-file-name=include) +BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAGS) -traditional -nostdinc BOOTLFLAGS := -Ttext 0x00400000 -e _start -T $(srctree)/$(src)/zImage.lds OBJCOPYFLAGS := contents,alloc,load,readonly,data Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/ppc32-types.h =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/ppc32-types.h +++ /dev/null @@ -1,36 +0,0 @@ -#ifndef _PPC64_TYPES_H -#define _PPC64_TYPES_H - -typedef __signed__ char __s8; -typedef unsigned char __u8; - -typedef __signed__ short __s16; -typedef unsigned short __u16; - -typedef __signed__ int __s32; -typedef unsigned int __u32; - -typedef __signed__ long long __s64; -typedef unsigned long long __u64; - -typedef signed char s8; -typedef unsigned char u8; - -typedef signed short s16; -typedef unsigned short u16; - -typedef signed int s32; -typedef unsigned int u32; - -typedef signed long long s64; -typedef unsigned long long u64; - -typedef struct { - __u32 u[4]; -} __attribute((aligned(16))) __vector128; - -#define BITS_PER_LONG 32 - -typedef __vector128 vector128; - -#endif /* _PPC64_TYPES_H */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/include/ctype.h =================================================================== --- /dev/null +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/include/ctype.h @@ -0,0 +1,54 @@ +#ifndef _PPC_BOOT_CTYPE_H +#define _PPC_BOOT_CTYPE_H + +/* + * NOTE! 
This ctype does not handle EOF like the standard C + * library is required to. + */ + +#define _U 0x01 /* upper */ +#define _L 0x02 /* lower */ +#define _D 0x04 /* digit */ +#define _C 0x08 /* cntrl */ +#define _P 0x10 /* punct */ +#define _S 0x20 /* white space (space/lf/tab) */ +#define _X 0x40 /* hex digit */ +#define _SP 0x80 /* hard space (0x20) */ + +extern unsigned char _ctype[]; + +#define __ismask(x) (_ctype[(int)(unsigned char)(x)]) + +#define isalnum(c) ((__ismask(c)&(_U|_L|_D)) != 0) +#define isalpha(c) ((__ismask(c)&(_U|_L)) != 0) +#define iscntrl(c) ((__ismask(c)&(_C)) != 0) +#define isdigit(c) ((__ismask(c)&(_D)) != 0) +#define isgraph(c) ((__ismask(c)&(_P|_U|_L|_D)) != 0) +#define islower(c) ((__ismask(c)&(_L)) != 0) +#define isprint(c) ((__ismask(c)&(_P|_U|_L|_D|_SP)) != 0) +#define ispunct(c) ((__ismask(c)&(_P)) != 0) +#define isspace(c) ((__ismask(c)&(_S)) != 0) +#define isupper(c) ((__ismask(c)&(_U)) != 0) +#define isxdigit(c) ((__ismask(c)&(_D|_X)) != 0) + +#define isascii(c) (((unsigned char)(c))<=0x7f) +#define toascii(c) (((unsigned char)(c))&0x7f) + +static inline unsigned char __tolower(unsigned char c) +{ + if (isupper(c)) + c -= 'A' - 'a'; + return c; +} + +static inline unsigned char __toupper(unsigned char c) +{ + if (islower(c)) + c -= 'a' - 'A'; + return c; +} + +#define tolower(c) __tolower(c) +#define toupper(c) __toupper(c) + +#endif /* _PPC_BOOT_CTYPE_H */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/include/elf.h =================================================================== --- /dev/null +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/include/elf.h @@ -0,0 +1,149 @@ +#ifndef _PPC_BOOT_ELF_H_ +#define _PPC_BOOT_ELF_H_ + +/* 32-bit ELF base types. */ +typedef unsigned int Elf32_Addr; +typedef unsigned short Elf32_Half; +typedef unsigned int Elf32_Off; +typedef signed int Elf32_Sword; +typedef unsigned int Elf32_Word; + +/* 64-bit ELF base types. 
*/ +typedef unsigned long long Elf64_Addr; +typedef unsigned short Elf64_Half; +typedef signed short Elf64_SHalf; +typedef unsigned long long Elf64_Off; +typedef signed int Elf64_Sword; +typedef unsigned int Elf64_Word; +typedef unsigned long long Elf64_Xword; +typedef signed long long Elf64_Sxword; + +/* These constants are for the segment types stored in the image headers */ +#define PT_NULL 0 +#define PT_LOAD 1 +#define PT_DYNAMIC 2 +#define PT_INTERP 3 +#define PT_NOTE 4 +#define PT_SHLIB 5 +#define PT_PHDR 6 +#define PT_TLS 7 /* Thread local storage segment */ +#define PT_LOOS 0x60000000 /* OS-specific */ +#define PT_HIOS 0x6fffffff /* OS-specific */ +#define PT_LOPROC 0x70000000 +#define PT_HIPROC 0x7fffffff +#define PT_GNU_EH_FRAME 0x6474e550 + +#define PT_GNU_STACK (PT_LOOS + 0x474e551) + +/* These constants define the different elf file types */ +#define ET_NONE 0 +#define ET_REL 1 +#define ET_EXEC 2 +#define ET_DYN 3 +#define ET_CORE 4 +#define ET_LOPROC 0xff00 +#define ET_HIPROC 0xffff + +/* These constants define the various ELF target machines */ +#define EM_NONE 0 +#define EM_PPC 20 /* PowerPC */ +#define EM_PPC64 21 /* PowerPC64 */ + +#define EI_NIDENT 16 + +typedef struct elf32_hdr { + unsigned char e_ident[EI_NIDENT]; + Elf32_Half e_type; + Elf32_Half e_machine; + Elf32_Word e_version; + Elf32_Addr e_entry; /* Entry point */ + Elf32_Off e_phoff; + Elf32_Off e_shoff; + Elf32_Word e_flags; + Elf32_Half e_ehsize; + Elf32_Half e_phentsize; + Elf32_Half e_phnum; + Elf32_Half e_shentsize; + Elf32_Half e_shnum; + Elf32_Half e_shstrndx; +} Elf32_Ehdr; + +typedef struct elf64_hdr { + unsigned char e_ident[16]; /* ELF "magic number" */ + Elf64_Half e_type; + Elf64_Half e_machine; + Elf64_Word e_version; + Elf64_Addr e_entry; /* Entry point virtual address */ + Elf64_Off e_phoff; /* Program header table file offset */ + Elf64_Off e_shoff; /* Section header table file offset */ + Elf64_Word e_flags; + Elf64_Half e_ehsize; + Elf64_Half e_phentsize; + Elf64_Half 
e_phnum; + Elf64_Half e_shentsize; + Elf64_Half e_shnum; + Elf64_Half e_shstrndx; +} Elf64_Ehdr; + +/* These constants define the permissions on sections in the program + header, p_flags. */ +#define PF_R 0x4 +#define PF_W 0x2 +#define PF_X 0x1 + +typedef struct elf32_phdr { + Elf32_Word p_type; + Elf32_Off p_offset; + Elf32_Addr p_vaddr; + Elf32_Addr p_paddr; + Elf32_Word p_filesz; + Elf32_Word p_memsz; + Elf32_Word p_flags; + Elf32_Word p_align; +} Elf32_Phdr; + +typedef struct elf64_phdr { + Elf64_Word p_type; + Elf64_Word p_flags; + Elf64_Off p_offset; /* Segment file offset */ + Elf64_Addr p_vaddr; /* Segment virtual address */ + Elf64_Addr p_paddr; /* Segment physical address */ + Elf64_Xword p_filesz; /* Segment size in file */ + Elf64_Xword p_memsz; /* Segment size in memory */ + Elf64_Xword p_align; /* Segment alignment, file & memory */ +} Elf64_Phdr; + +#define EI_MAG0 0 /* e_ident[] indexes */ +#define EI_MAG1 1 +#define EI_MAG2 2 +#define EI_MAG3 3 +#define EI_CLASS 4 +#define EI_DATA 5 +#define EI_VERSION 6 +#define EI_OSABI 7 +#define EI_PAD 8 + +#define ELFMAG0 0x7f /* EI_MAG */ +#define ELFMAG1 'E' +#define ELFMAG2 'L' +#define ELFMAG3 'F' +#define ELFMAG "\177ELF" +#define SELFMAG 4 + +#define ELFCLASSNONE 0 /* EI_CLASS */ +#define ELFCLASS32 1 +#define ELFCLASS64 2 +#define ELFCLASSNUM 3 + +#define ELFDATANONE 0 /* e_ident[EI_DATA] */ +#define ELFDATA2LSB 1 +#define ELFDATA2MSB 2 + +#define EV_NONE 0 /* e_version, EI_VERSION */ +#define EV_CURRENT 1 +#define EV_NUM 2 + +#define ELFOSABI_NONE 0 +#define ELFOSABI_LINUX 3 + +#endif /* _PPC_BOOT_ELF_H_ */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/include/page.h =================================================================== --- /dev/null +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/include/page.h @@ -0,0 +1,34 @@ +#ifndef _PPC_BOOT_PAGE_H +#define _PPC_BOOT_PAGE_H +/* + * Copyright (C) 2001 PPC64 Team, IBM Corp + * + * This program is free software; you can redistribute it and/or + * modify it 
under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#ifdef __ASSEMBLY__ +#define ASM_CONST(x) x +#else +#define __ASM_CONST(x) x##UL +#define ASM_CONST(x) __ASM_CONST(x) +#endif + +/* PAGE_SHIFT determines the page size */ +#define PAGE_SHIFT 12 +#define PAGE_SIZE (ASM_CONST(1) << PAGE_SHIFT) +#define PAGE_MASK (~(PAGE_SIZE-1)) + +/* align addr on a size boundary - adjust address up/down if needed */ +#define _ALIGN_UP(addr,size) (((addr)+((size)-1))&(~((size)-1))) +#define _ALIGN_DOWN(addr,size) ((addr)&(~((size)-1))) + +/* align addr on a size boundary - adjust address up if needed */ +#define _ALIGN(addr,size) _ALIGN_UP(addr,size) + +/* to align the pointer to the (next) page boundary */ +#define PAGE_ALIGN(addr) _ALIGN(addr, PAGE_SIZE) + +#endif /* _PPC_BOOT_PAGE_H */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/include/ppc_asm.h =================================================================== --- /dev/null +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/include/ppc_asm.h @@ -0,0 +1,228 @@ +#ifndef _PPC64_PPC_ASM_H +#define _PPC64_PPC_ASM_H +/* + * + * Definitions used by various bits of low-level assembly code on PowerPC. + * + * Copyright (C) 1995-1999 Gary Thomas, Paul Mackerras, Cort Dougan. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +/* + * Macros for storing registers into and loading registers from + * exception frames. 
+ */ +#define SAVE_GPR(n, base) std n,GPR0+8*(n)(base) +#define SAVE_2GPRS(n, base) SAVE_GPR(n, base); SAVE_GPR(n+1, base) +#define SAVE_4GPRS(n, base) SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base) +#define SAVE_8GPRS(n, base) SAVE_4GPRS(n, base); SAVE_4GPRS(n+4, base) +#define SAVE_10GPRS(n, base) SAVE_8GPRS(n, base); SAVE_2GPRS(n+8, base) +#define REST_GPR(n, base) ld n,GPR0+8*(n)(base) +#define REST_2GPRS(n, base) REST_GPR(n, base); REST_GPR(n+1, base) +#define REST_4GPRS(n, base) REST_2GPRS(n, base); REST_2GPRS(n+2, base) +#define REST_8GPRS(n, base) REST_4GPRS(n, base); REST_4GPRS(n+4, base) +#define REST_10GPRS(n, base) REST_8GPRS(n, base); REST_2GPRS(n+8, base) + +#define SAVE_NVGPRS(base) SAVE_8GPRS(14, base); SAVE_10GPRS(22, base) +#define REST_NVGPRS(base) REST_8GPRS(14, base); REST_10GPRS(22, base) + +#define SAVE_FPR(n, base) stfd n,THREAD_FPR0+8*(n)(base) +#define SAVE_2FPRS(n, base) SAVE_FPR(n, base); SAVE_FPR(n+1, base) +#define SAVE_4FPRS(n, base) SAVE_2FPRS(n, base); SAVE_2FPRS(n+2, base) +#define SAVE_8FPRS(n, base) SAVE_4FPRS(n, base); SAVE_4FPRS(n+4, base) +#define SAVE_16FPRS(n, base) SAVE_8FPRS(n, base); SAVE_8FPRS(n+8, base) +#define SAVE_32FPRS(n, base) SAVE_16FPRS(n, base); SAVE_16FPRS(n+16, base) +#define REST_FPR(n, base) lfd n,THREAD_FPR0+8*(n)(base) +#define REST_2FPRS(n, base) REST_FPR(n, base); REST_FPR(n+1, base) +#define REST_4FPRS(n, base) REST_2FPRS(n, base); REST_2FPRS(n+2, base) +#define REST_8FPRS(n, base) REST_4FPRS(n, base); REST_4FPRS(n+4, base) +#define REST_16FPRS(n, base) REST_8FPRS(n, base); REST_8FPRS(n+8, base) +#define REST_32FPRS(n, base) REST_16FPRS(n, base); REST_16FPRS(n+16, base) + +#define SAVE_VR(n,b,base) li b,THREAD_VR0+(16*(n)); stvx n,b,base +#define SAVE_2VRS(n,b,base) SAVE_VR(n,b,base); SAVE_VR(n+1,b,base) +#define SAVE_4VRS(n,b,base) SAVE_2VRS(n,b,base); SAVE_2VRS(n+2,b,base) +#define SAVE_8VRS(n,b,base) SAVE_4VRS(n,b,base); SAVE_4VRS(n+4,b,base) +#define SAVE_16VRS(n,b,base) SAVE_8VRS(n,b,base); 
SAVE_8VRS(n+8,b,base) +#define SAVE_32VRS(n,b,base) SAVE_16VRS(n,b,base); SAVE_16VRS(n+16,b,base) +#define REST_VR(n,b,base) li b,THREAD_VR0+(16*(n)); lvx n,b,base +#define REST_2VRS(n,b,base) REST_VR(n,b,base); REST_VR(n+1,b,base) +#define REST_4VRS(n,b,base) REST_2VRS(n,b,base); REST_2VRS(n+2,b,base) +#define REST_8VRS(n,b,base) REST_4VRS(n,b,base); REST_4VRS(n+4,b,base) +#define REST_16VRS(n,b,base) REST_8VRS(n,b,base); REST_8VRS(n+8,b,base) +#define REST_32VRS(n,b,base) REST_16VRS(n,b,base); REST_16VRS(n+16,b,base) + +/* Macros to adjust thread priority for Iseries hardware multithreading */ +#define HMT_LOW or 1,1,1 +#define HMT_MEDIUM or 2,2,2 +#define HMT_HIGH or 3,3,3 + +/* Insert the high 32 bits of the MSR into what will be the new + MSR (via SRR1 and rfid) This preserves the MSR.SF and MSR.ISF + bits. */ + +#define FIX_SRR1(ra, rb) \ + mr rb,ra; \ + mfmsr ra; \ + rldimi ra,rb,0,32 + +#define CLR_TOP32(r) rlwinm (r),(r),0,0,31 /* clear top 32 bits */ + +/* + * LOADADDR( rn, name ) + * loads the address of 'name' into 'rn' + * + * LOADBASE( rn, name ) + * loads the address (less the low 16 bits) of 'name' into 'rn' + * suitable for base+disp addressing + */ +#define LOADADDR(rn,name) \ + lis rn,name##@highest; \ + ori rn,rn,name##@higher; \ + rldicr rn,rn,32,31; \ + oris rn,rn,name##@h; \ + ori rn,rn,name##@l + +#define LOADBASE(rn,name) \ + lis rn,name@highest; \ + ori rn,rn,name@higher; \ + rldicr rn,rn,32,31; \ + oris rn,rn,name@ha + + +#define SET_REG_TO_CONST(reg, value) \ + lis reg,(((value)>>48)&0xFFFF); \ + ori reg,reg,(((value)>>32)&0xFFFF); \ + rldicr reg,reg,32,31; \ + oris reg,reg,(((value)>>16)&0xFFFF); \ + ori reg,reg,((value)&0xFFFF); + +#define SET_REG_TO_LABEL(reg, label) \ + lis reg,(label)@highest; \ + ori reg,reg,(label)@higher; \ + rldicr reg,reg,32,31; \ + oris reg,reg,(label)@h; \ + ori reg,reg,(label)@l; + + +/* Condition Register Bit Fields */ + +#define cr0 0 +#define cr1 1 +#define cr2 2 +#define cr3 3 +#define cr4 4
+#define cr5 5 +#define cr6 6 +#define cr7 7 + + +/* General Purpose Registers (GPRs) */ + +#define r0 0 +#define r1 1 +#define r2 2 +#define r3 3 +#define r4 4 +#define r5 5 +#define r6 6 +#define r7 7 +#define r8 8 +#define r9 9 +#define r10 10 +#define r11 11 +#define r12 12 +#define r13 13 +#define r14 14 +#define r15 15 +#define r16 16 +#define r17 17 +#define r18 18 +#define r19 19 +#define r20 20 +#define r21 21 +#define r22 22 +#define r23 23 +#define r24 24 +#define r25 25 +#define r26 26 +#define r27 27 +#define r28 28 +#define r29 29 +#define r30 30 +#define r31 31 + + +/* Floating Point Registers (FPRs) */ + +#define fr0 0 +#define fr1 1 +#define fr2 2 +#define fr3 3 +#define fr4 4 +#define fr5 5 +#define fr6 6 +#define fr7 7 +#define fr8 8 +#define fr9 9 +#define fr10 10 +#define fr11 11 +#define fr12 12 +#define fr13 13 +#define fr14 14 +#define fr15 15 +#define fr16 16 +#define fr17 17 +#define fr18 18 +#define fr19 19 +#define fr20 20 +#define fr21 21 +#define fr22 22 +#define fr23 23 +#define fr24 24 +#define fr25 25 +#define fr26 26 +#define fr27 27 +#define fr28 28 +#define fr29 29 +#define fr30 30 +#define fr31 31 + +#define vr0 0 +#define vr1 1 +#define vr2 2 +#define vr3 3 +#define vr4 4 +#define vr5 5 +#define vr6 6 +#define vr7 7 +#define vr8 8 +#define vr9 9 +#define vr10 10 +#define vr11 11 +#define vr12 12 +#define vr13 13 +#define vr14 14 +#define vr15 15 +#define vr16 16 +#define vr17 17 +#define vr18 18 +#define vr19 19 +#define vr20 20 +#define vr21 21 +#define vr22 22 +#define vr23 23 +#define vr24 24 +#define vr25 25 +#define vr26 26 +#define vr27 27 +#define vr28 28 +#define vr29 29 +#define vr30 30 +#define vr31 31 + +#endif /* _PPC64_PPC_ASM_H */ Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/crt0.S =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/crt0.S +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/crt0.S @@ -9,7 +9,7 @@ * NOTE: this code runs in 32 bit mode and is 
packaged as ELF32. */ -#include +#include .text .globl _start Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/div64.S =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/div64.S +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/div64.S @@ -13,7 +13,7 @@ * as published by the Free Software Foundation; either version * 2 of the License, or (at your option) any later version. */ -#include +#include .globl __div64_32 __div64_32: Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/string.S =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/string.S +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/string.S @@ -9,7 +9,7 @@ * NOTE: this code runs in 32 bit mode and is packaged as ELF32. */ -#include +#include .text .globl strcpy From benh at kernel.crashing.org Sun May 8 14:49:04 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 08 May 2005 14:49:04 +1000 Subject: [BUG] linux-2.6.12_rc4: Oops: Kernel access of bad area, sig: 11 [#1] In-Reply-To: <20050507170904.GA9488@unixforces.net> References: <20050507170904.GA9488@unixforces.net> Message-ID: <1115527744.6304.47.camel@gaston> On Sat, 2005-05-07 at 17:09 +0000, Markus Rothe wrote: > Hello, > > I'm running Linux on my G5. > > I just compiled linux-2.6.12_rc4 and got this when loading my sound > modules at boot time (linux-2.6.12_rc3 worked just fine): Hrm... weird. I'll have a look on monday. It looks like something is blowing up in the i2c layer... Ben. 
> ---- SNIP ----
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2 POWERMAC
> Modules linked in: snd_powermac snd_pcm snd_page_alloc snd_timer snd
> soundcore
> NIP: C0000000002E4030 XER: 20000000 LR: D0000000001B4AC8 CTR:
> C0000000002E4004
> REGS: c000000237783890 TRAP: 0300 Not tainted (2.6.12-rc4)
> MSR: 9000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22022484
> DAR: 0000000000000002 DSISR: 0000000040000000
> TASK: c0000002375ba030[7007] 'modprobe' THREAD: c000000237780000 CPU: 0
> GPR00: D0000000001B4AC8 C000000237783B10 C0000000005A5E70 0000000000000000
> GPR04: 0000000000000001 0000000000000060 0000000000000000 0000000000000001
> GPR08: 0000000000000002 0000000000000000 C00000000047AAC0 C0000000002E4004
> GPR12: D0000000001B9C58 C00000000047B800 0000000000000000 0000000000000000
> GPR16: 0000000000000000 0000000010028790 0000000000000000 000000001002A4A0
> GPR20: 0000000000000000 0000000000000000 0000000000000000 00000080001B1010
> GPR24: 000000001002AA90 000000001002ABF8 000000001002ABE0 C000000000478958
> GPR28: C00000000F70D580 000000000000000A D0000000001CACD8 D0000000001C137C
> NIP [c0000000002e4030] .i2c_smbus_write_byte_data+0x2c/0x50
> LR [d0000000001b4ac8] .send_init_client+0x50/0x110 [snd_powermac]
> Call Trace:
> [c000000237783b10] [d0000000001c2030]
> tumbler_mixers+0x0/0xffffffffffff8b28 [snd_powermac] (unreliable)
> [c000000237783bc0] [d0000000001b4ac8] .send_init_client+0x50/0x110
> [snd_powermac]
> [c000000237783c60] [d0000000001b9514]
> .snd_pmac_tumbler_post_init+0x3c/0x94 [snd_powermac]
> [c000000237783ce0] [d0000000001b74fc] .alsa_card_pmac_init+0x174/0x3cc
> [snd_powermac]
> [c000000237783d90] [c000000000063988] .sys_init_module+0x2cc/0x4a8
> [c000000237783e30] [c00000000000d980] syscall_exit+0x0/0x18
> Instruction dump:
> 4e800020 7c0802a6 7c691b78 7c872378 38c00000 39000002 f8010010 f821ff51
> 98a10070 60000000 60000000 60000000 a0890006 e8630008 39210070
> <6>usbcore: registered new driver snd-usb-audio
> ----
SNIP ----
>
> Could someone please take a look at that?
>
> Best regards,
>
> Markus
> _______________________________________________
> Linuxppc64-dev mailing list
> Linuxppc64-dev at ozlabs.org
> https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev
-- 
Benjamin Herrenschmidt

From markus at unixforces.net Sun May 8 17:22:00 2005
From: markus at unixforces.net (Markus Rothe)
Date: Sun, 8 May 2005 07:22:00 +0000
Subject: [BUG] linux-2.6.12_rc4: Oops: Kernel access of bad area, sig: 11 [#1]
In-Reply-To: <20050507170904.GA9488@unixforces.net>
References: <20050507170904.GA9488@unixforces.net>
Message-ID: <20050508072159.GA10031@unixforces.net>

Hi,

I've just noticed, that I've changed my config slightly from rc3 to rc4.
Here is the according diff part of the configs.

---- SNIP ----

--- /usr/src/linux-2.6.12-rc3/.config 2005-05-08 06:54:38.000000000 +0000
+++ /usr/src/linux-2.6.12-rc4/.config 2005-05-08 07:16:32.000000000 +0000
@@ -602,7 +607,7 @@
 # I2C support
 #
 CONFIG_I2C=y
-# CONFIG_I2C_CHARDEV is not set
+CONFIG_I2C_CHARDEV=m

 #
 # I2C Algorithms

---- SNIP ----

I've compiled rc3 with CONFIG_I2C_CHARDEV=m and now I get the Oops with
this version, too.

Markus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050508/70a14b61/attachment.pgp

From sam at ravnborg.org Sun May 8 17:32:20 2005
From: sam at ravnborg.org (Sam Ravnborg)
Date: Sun, 8 May 2005 09:32:20 +0200 (CEST)
Subject: missing deps in arch/ppc64/boot
In-Reply-To: <20050507212449.GA26741@suse.de>
References: <20050507212449.GA26741@suse.de>
Message-ID: <40912.194.237.142.21.1115537540.squirrel@194.237.142.21>

>
> Sam,
>
> touching arch/ppc64/boot/zlib.h will not cause a rebuild of
> arch/ppc64/boot/zlib.o. Any ideas what is missing?

Browsing the Makefile it looks like none of the targets derived from
src-boot are assigned to "targets".
You need to tell kbuild which targets exist in a given directory, this
is done by assigning the .o name to targets.

Something like this:
src-boot := crt0.S string.S prom.c main.c zlib.c imagesize.c div64.S
targets += $(addsuffix .o, $(basename $(src-boot)))
src-boot := $(addprefix $(obj)/, $(src-boot))
obj-boot := $(addsuffix .o, $(basename $(src-boot)))

Let me know if this cures it.

Sam

From olh at suse.de Sun May 8 18:33:31 2005
From: olh at suse.de (Olaf Hering)
Date: Sun, 8 May 2005 10:33:31 +0200
Subject: panic reboot stuck in rtas_os_term
Message-ID: <20050508083331.GA30329@suse.de>

A panic does not trigger a reboot anymore on JS20, rtas_os_term() is stuck in
RTAS. .config is the defconfig.
The panic reboot works on a p630, but I miss the 'rebooting in 180 seconds' message.
Any ideas how to fix that?

VFS: Cannot open root device "" or unknown-block(8,2)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,2)
rtas [swapper]: Entering rtas_call
rtas [swapper]: token = 0x1a
rtas [swapper]: nargs = 1
rtas [swapper]: nret = 1
rtas [swapper]: &outputs = 0x0
rtas [swapper]: narg[0] = 0x72cfa8
rtas [swapper]: entering rtas with 0x72c728

...

zImage starting: loaded at 0x400000
Allocating 0x893000 bytes for kernel ...
gunzipping (0x1c00000 <- 0x407000:0x6addcb)...done 0x73da20 bytes
0xdb6c bytes of heap consumed, max in use 0xa1e4
OF stdout device is: /vdevice/vty at 0
Hypertas detected, assuming LPAR !
command line:
memory layout at init:
  memory_limit : 0000000000000000 (16 MB aligned)
  alloc_bottom : 00000000023a7000
  alloc_top : 0000000008000000
  alloc_top_hi : 000000001e000000
  rmo_top : 0000000008000000
  ram_top : 000000001e000000
Looking for displays
instantiating rtas at 0x0000000007a70000...rtas_ram_size = 2c8000
fixed_base_addr = 7a70000
code_base_addr = 7afa000
Code Image Load Complete.
registered vars:
name                             addr             size  hash align
-------------------------------- ---------------- ----  ---- -----
glob_rtas_trace_buf            : 0000000007ab9100 65552 7    0
prtas_was_interrupted          : 0000000007aca100 4     9    1
callperf                       : 0000000007aca400 12496 9    1
pglob_os_term_state            : 0000000007acd700 4     12   1
hypStopWatch                   : 0000000007ac9400 1800  14   8
prtas_in_progress              : 0000000007ac9e00 4     20   1
last_error_log                 : 0000000007acdc00 1024  30   0
nmi_work_buffer                : 0000000007ace000 4096  31   12
 done
0000000000000000 : boot cpu 0000000000000000
0000000000000001 : starting cpu hw idx 0000000000000001... done
copying OF device tree ...
Building dt strings...
Building dt structure...
Device tree strings 0x00000000024a8000 -> 0x00000000024a8e13
Device tree struct 0x00000000024a9000 -> 0x00000000024af000
Calling quiesce ...
returning from prom_init
firmware_features = 0x55f
Starting Linux PPC64 2.6.12-rc4
-----------------------------------------------------
ppc64_pft_size = 0x17
ppc64_debug_switch = 0x0
ppc64_interrupt_controller = 0x2
systemcfg = 0xc0000000005a8000
systemcfg->platform = 0x101
systemcfg->processorCount = 0x2
systemcfg->physicalMemorySize = 0x1e000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0xffff
-----------------------------------------------------
[boot]0100 MM Init
[boot]0100 MM Init Done
Linux version 2.6.12-rc4 (olaf at mac) (gcc version 3.3.3 (SuSE Linux)) #12 SMP Sun May 8 10:08:56 CEST 2005
[boot]0012 Setup Arch
Top of RAM: 0x1e000000, Total RAM: 0x1e000000
Memory hole size: 0MB
Syscall map setup, 236 32 bits and 212 64 bits syscalls
No ramdisk, default root is /dev/sda2
PPC64 nvram contains 16384 bytes
Using default idle loop
[boot]0015 Setup Done
Built 1 zonelists
Kernel command line:
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 2048 (order: 11, 65536 bytes)
time_init: decrementer frequency = 199.840527 MHz
time_init: processor frequency =
1600.000000 MHz
firmware_features = 0x55f
Starting Linux PPC64 2.6.12-rc4
-----------------------------------------------------
ppc64_pft_size = 0x17
ppc64_debug_switch = 0x0
ppc64_interrupt_controller = 0x2
systemcfg = 0xc0000000005a8000
systemcfg->platform = 0x101
systemcfg->processorCount = 0x2
systemcfg->physicalMemorySize = 0x1e000000
ppc64_caches.dcache_line_size = 0x80
ppc64_caches.icache_line_size = 0x80
htab_address = 0x0000000000000000
htab_hash_mask = 0xffff
-----------------------------------------------------
[boot]0100 MM Init
[boot]0100 MM Init Done
Linux version 2.6.12-rc4 (olaf at mac) (gcc version 3.3.3 (SuSE Linux)) #12 SMP Sun May 8 10:08:56 CEST 2005
[boot]0012 Setup Arch
Top of RAM: 0x1e000000, Total RAM: 0x1e000000
Memory hole size: 0MB
Syscall map setup, 236 32 bits and 212 64 bits syscalls
No ramdisk, default root is /dev/sda2
PPC64 nvram contains 16384 bytes
Using default idle loop
[boot]0015 Setup Done
Built 1 zonelists
Kernel command line:
[boot]0020 XICS Init
xics: no ISA interrupt controller
[boot]0021 XICS Done
PID hash table entries: 2048 (order: 11, 65536 bytes)
time_init: decrementer frequency = 199.840527 MHz
time_init: processor frequency = 1600.000000 MHz
Console: colour dummy device 80x25
Dentry cache hash table entries: 65536 (order: 7, 524288 bytes)
Inode-cache hash table entries: 32768 (order: 6, 262144 bytes)
freeing bootmem node 0
Memory: 468604k/491520k available (4752k kernel code, 22276k reserved, 1548k data, 430k bss, 356k init)
Mount-cache hash table entries: 256
Processor 1 found.
Brought up 2 CPUs
NET: Registered protocol family 16
PCI: Probing PCI hardware
IDE Fixup IRQ: Can't find IO-APIC !
IOMMU table initialized, virtual merging enabled
mapping IO 100f4000000 -> e000000000000000, size: 400000
PCI: Probing PCI hardware done
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
i/pSeries Real Time Clock Driver v1.1
RTAS daemon started
Total HugeTLB memory allocated, 0
JFS: nTxBlock = 3665, nTxLock = 29327
Initializing Cryptographic API
HVSI: registered 0 devices
Linux agpgart interface v0.101 (c) Dave Jones
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
Floppy drive(s): fd0 is 2.88M
RAMDISK driver initialized: 16 RAM disks of 65536K size 1024 blocksize
loop: loaded (max 8 devices)
Intel(R) PRO/1000 Network Driver - version 5.7.6-k2
Copyright (c) 1999-2004 Intel Corporation.
pcnet32.c:v1.30i 06.28.2004 tsbogend at alpha.franken.de
e100: Intel(R) PRO/100 Network Driver, 3.3.6-k2-NAPI
e100: Copyright(c) 1999-2004 Intel Corporation
tg3.c:v3.27 (May 5, 2005)
eth0: Tigon3 [partno(none) rev 2003 PHY(serdes)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:0d:60:1e:ff:32
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[1] TSOcap[0]
eth1: Tigon3 [partno(none) rev 2003 PHY(serdes)] (PCIX:133MHz:64-bit) 10/100/1000BaseT Ethernet 00:0d:60:1e:ff:33
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
netconsole: not configured, aborting
Warning: no ADB interface detected
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
AMD8111: IDE controller at PCI slot 0000:00:04.1
AMD8111: chipset revision 3
AMD8111: 0000:00:04.1 (rev 03) UDMA133 controller
AMD8111: 100% native mode on irq 32
ide0: BM-DMA at 0x7c00-0x7c07, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0x7c08-0x7c0f, BIOS settings: hdc:pio, hdd:pio
hda: TOSHIBA MK4019GAXB, ATA DISK drive
ide0 at 0x7400-0x7407,0x6c02 on irq 32
hda: max request size: 128KiB
hda: 78140160 sectors (40007 MB), CHS=65535/16/63, UDMA(33)
hda: cache flushes supported
hda: hda1 hda2 hda3
ipr: IBM Power RAID SCSI Device Driver version: 2.0.13 (February 21, 2005)
st: Version 20050312, fixed bufsize 32768, s/g segs 256
ieee1394: raw1394: /dev/raw1394 device initialized
ohci_hcd 0000:21:00.0: OHCI Host Controller
ohci_hcd 0000:21:00.0: new USB bus registered, assigned bus number 1
ohci_hcd 0000:21:00.0: irq 35, io mem 0x100e0001000
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 3 ports detected
ohci_hcd 0000:21:00.1: OHCI Host Controller
ohci_hcd 0000:21:00.1: new USB bus registered, assigned bus number 2
ohci_hcd 0000:21:00.1: irq 35, io mem 0x100e0000000
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
/home/olaf/kernel/linux-2.6.12-rc4-olh/drivers/usb/input/hid-core.c: v2.01:USB HID core driver
pegasus: v0.6.12 (2005/01/13), Pegasus/Pegasus II USB Ethernet driver
usbcore: registered new driver pegasus
mice: PS/2 mouse device common for all mice
i2c /dev entries driver
md: linear personality registered as nr 1
md: raid0 personality registered as nr 2
md: raid1 personality registered as nr 3
md: raid10 personality registered as nr 9
md: raid5 personality registered as nr 4
raid5: measuring checksumming speed
8regs : 3836.000 MB/sec
8regs_prefetch: 3156.000 MB/sec
32regs : 4640.000 MB/sec
32regs_prefetch: 3848.000 MB/sec
raid5: using function: 32regs (4640.000 MB/sec)
md: md driver 0.90.1 MAX_MD_DEVS=256, MD_SB_DISKS=27
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel at redhat.com
oprofile: using ppc64/970 performance monitoring.
NET: Registered protocol family 2
IP: routing cache hash table of 2048 buckets, 32Kbytes
TCP established hash table entries: 16384 (order: 6, 262144 bytes)
TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 16384 bind 16384)
IPv4 over IPv4 tunneling driver
NET: Registered protocol family 1
NET: Registered protocol family 17
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
VFS: Cannot open root device "" or unknown-block(8,2)
Please append a correct "root=" boot option
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,2)
rtas [swapper]: Entering rtas_call
rtas [swapper]: token = 0x1a
rtas [swapper]: nargs = 1
rtas [swapper]: nret = 1
rtas [swapper]: &outputs = 0x0
rtas [swapper]: narg[0] = 0x72cfa8
rtas [swapper]: entering rtas with 0x72c728

From olh at suse.de Sun May 8 19:15:33 2005
From: olh at suse.de (Olaf Hering)
Date: Sun, 8 May 2005 11:15:33 +0200
Subject: panic reboot stuck in rtas_os_term
In-Reply-To: <20050508083331.GA30329@suse.de>
References: <20050508083331.GA30329@suse.de>
Message-ID: <20050508091533.GA30450@suse.de>

On Sun, May 08, Olaf Hering wrote:

>
> A panic does not trigger a reboot anymore on JS20, rtas_os_term() is stuck in
> RTAS. .config is the defconfig.
> The panic reboot works on a p630, but I miss the 'rebooting in 180 seconds' message.
> Any ideas how to fix that?
>
> VFS: Cannot open root device "" or unknown-block(8,2)
> Please append a correct "root=" boot option
> Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,2)
> rtas [swapper]: Entering rtas_call
> rtas [swapper]: token = 0x1a
> rtas [swapper]: nargs = 1
> rtas [swapper]: nret = 1
> rtas [swapper]: &outputs = 0x0
> rtas [swapper]: narg[0] = 0x72cfa8
> rtas [swapper]: entering rtas with 0x72c728

This appears to be a firmware bug, fails now also with sles9 sp1 kernel.
ibm,os-term was apparently added with recent firmware updates.

http://linux.bkbits.net:8080/linux-2.5/gnupatch at 41997abd4TXSpY49vjgObO6N2R2MxA

From olh at suse.de Sun May 8 20:07:02 2005
From: olh at suse.de (Olaf Hering)
Date: Sun, 8 May 2005 12:07:02 +0200
Subject: missing deps in arch/ppc64/boot
In-Reply-To: <40912.194.237.142.21.1115537540.squirrel@194.237.142.21>
References: <20050507212449.GA26741@suse.de> <40912.194.237.142.21.1115537540.squirrel@194.237.142.21>
Message-ID: <20050508100702.GA30759@suse.de>

On Sun, May 08, Sam Ravnborg wrote:

> >
> > Sam,
> >
> > touching arch/ppc64/boot/zlib.h will not cause a rebuild of
> > arch/ppc64/boot/zlib.o. Any ideas what is missing?
>
> Browsing the Makefile it looks like none of the targets derived from
> src-boot are assigned to "targets".
> You need to tell kbuild which targets exist in a given directory, this
> is done by assigning the .o name to targets.

This does not help. Maybe targets is reset for some reason?
But this patch works:

Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile
===================================================================
--- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/Makefile
+++ linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile
@@ -28,6 +28,7 @@ BOOTLFLAGS := -Ttext 0x00400000 -e _star
 OBJCOPYFLAGS := contents,alloc,load,readonly,data

 src-boot := crt0.S string.S prom.c main.c zlib.c imagesize.c div64.S
+targets-boot := $(addsuffix .o, $(basename $(src-boot)))
 src-boot := $(addprefix $(obj)/, $(src-boot))
 obj-boot := $(addsuffix .o, $(basename $(src-boot)))

@@ -54,6 +55,7 @@ gz-sec = $(foreach section, $(1), $(pat
 hostprogs-y := addnote addRamDisk

 targets += zImage zImage.initrd imagesize.c \
+	$(targets-boot) \
 	$(patsubst $(obj)/%,%, $(call obj-sec, $(required) $(initrd))) \
 	$(patsubst $(obj)/%,%, $(call src-sec, $(required) $(initrd))) \
 	$(patsubst $(obj)/%,%, $(call gz-sec, $(required) $(initrd))) \

From olh at suse.de Mon May 9 02:48:51 2005
From: olh at suse.de (Olaf Hering)
Date: Sun, 8 May 2005 18:48:51 +0200
Subject: [PATCH] remove
unused arch/ppc64/boot/div64.S
Message-ID: <20050508164851.GA1707@suse.de>

remove unused arch/ppc64/boot/div64.S, it is built but not called

Signed-off-by: Olaf Hering

Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile
===================================================================
--- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/Makefile
+++ linux-2.6.12-rc4-olh/arch/ppc64/boot/Makefile
@@ -27,7 +27,7 @@ BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAG
 BOOTLFLAGS := -Ttext 0x00400000 -e _start -T $(srctree)/$(src)/zImage.lds
 OBJCOPYFLAGS := contents,alloc,load,readonly,data

-src-boot := crt0.S string.S prom.c main.c zlib.c imagesize.c div64.S
+src-boot := crt0.S string.S prom.c main.c zlib.c imagesize.c
 src-boot := $(addprefix $(obj)/, $(src-boot))
 obj-boot := $(addsuffix .o, $(basename $(src-boot)))

Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/div64.S
===================================================================
--- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/div64.S
+++ /dev/null
@@ -1,58 +0,0 @@
-/*
- * Divide a 64-bit unsigned number by a 32-bit unsigned number.
- * This routine assumes that the top 32 bits of the dividend are
- * non-zero to start with.
- * On entry, r3 points to the dividend, which get overwritten with
- * the 64-bit quotient, and r4 contains the divisor.
- * On exit, r3 contains the remainder.
- *
- * Copyright (C) 2002 Paul Mackerras, IBM Corp.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-#include
-
-	.globl __div64_32
-__div64_32:
-	lwz	r5,0(r3)	# get the dividend into r5/r6
-	lwz	r6,4(r3)
-	cmplw	r5,r4
-	li	r7,0
-	li	r8,0
-	blt	1f
-	divwu	r7,r5,r4	# if dividend.hi >= divisor,
-	mullw	r0,r7,r4	# quotient.hi = dividend.hi / divisor
-	subf.
r5,r0,r5	# dividend.hi %= divisor
-	beq	3f
-1:	mr	r11,r5		# here dividend.hi != 0
-	andis.	r0,r5,0xc000
-	bne	2f
-	cntlzw	r0,r5		# we are shifting the dividend right
-	li	r10,-1		# to make it < 2^32, and shifting
-	srw	r10,r10,r0	# the divisor right the same amount,
-	add	r9,r4,r10	# rounding up (so the estimate cannot
-	andc	r11,r6,r10	# ever be too large, only too small)
-	andc	r9,r9,r10
-	or	r11,r5,r11
-	rotlw	r9,r9,r0
-	rotlw	r11,r11,r0
-	divwu	r11,r11,r9	# then we divide the shifted quantities
-2:	mullw	r10,r11,r4	# to get an estimate of the quotient,
-	mulhwu	r9,r11,r4	# multiply the estimate by the divisor,
-	subfc	r6,r10,r6	# take the product from the divisor,
-	add	r8,r8,r11	# and add the estimate to the accumulated
-	subfe.	r5,r9,r5	# quotient
-	bne	1b
-3:	cmplw	r6,r4
-	blt	4f
-	divwu	r0,r6,r4	# perform the remaining 32-bit division
-	mullw	r10,r0,r4	# and get the remainder
-	add	r8,r8,r0
-	subf	r6,r10,r6
-4:	stw	r7,0(r3)	# return the quotient in *r3
-	stw	r8,4(r3)
-	mr	r3,r6		# return the remainder in r3
-	blr

From benh at kernel.crashing.org Mon May 9 08:15:13 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 09 May 2005 08:15:13 +1000
Subject: [BUG] linux-2.6.12_rc4: Oops: Kernel access of bad area, sig: 11 [#1]
In-Reply-To: <20050508072159.GA10031@unixforces.net>
References: <20050507170904.GA9488@unixforces.net> <20050508072159.GA10031@unixforces.net>
Message-ID: <1115590513.6305.52.camel@gaston>

On Sun, 2005-05-08 at 07:22 +0000, Markus Rothe wrote:
> Hi,
>
> I've just noticed, that I've changed my config slightly from rc3 to rc4.
> Here is the according diff part of the configs.
>
> ---- SNIP ----
>
> --- /usr/src/linux-2.6.12-rc3/.config 2005-05-08 06:54:38.000000000 +0000
> +++ /usr/src/linux-2.6.12-rc4/.config 2005-05-08 07:16:32.000000000 +0000
> @@ -602,7 +607,7 @@
>  # I2C support
>  #
>  CONFIG_I2C=y
> -# CONFIG_I2C_CHARDEV is not set
> +CONFIG_I2C_CHARDEV=m
>
>  #
>  # I2C Algorithms
>
> ---- SNIP ----
>
> I've compiled rc3 with CONFIG_I2C_CHARDEV=m and now I get the Oops with
> this version, too.

So something is screwing with the i2c bus .. interesting...

Ben.

From benh at kernel.crashing.org Mon May 9 09:09:08 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 09 May 2005 09:09:08 +1000
Subject: [Fwd: PATCH - enables software RAID on Linux]
Message-ID: <1115593748.6305.61.camel@gaston>

-------- Forwarded Message --------
From: Dustin Kirkland
Reply-To: dustin.kirkland at us.ibm.com
To: yaboot-devel at lists.penguinppc.org
Cc: dustin.kirkland at us.ibm.com
Subject: PATCH - enables software RAID on Linux
Date: Fri, 06 May 2005 10:17:42 -0400

Hi-

I've made a few minor changes to yaboot that provide support for
software RAID on Linux. Many thanks to Paul Nasrat who provided key
assistance.

The basic changes that were needed in yaboot:
1) the ability to read partitions marked "Linux RAID" (0xfd) in addition
   to regular "Linux" filesystem partitions when looking
   for /etc/yaboot.conf
2) the additional protection to prevent of_open() from reading Linux
   RAID partitions (such as RAID swap space)
3) new functionality in ybin to automatically write yaboot to multiple
   available PReP partitions (so that you can boot from multiple discs)

--

#1 is accomplished by adding a LINUX_RAID macro in include/fdisk-part.h
and allowing such partitions to be added to the partition_t list in
partition_fdisk_lookup().

#2 means that the partition_t structure needs a new integer field
(sys_ind) to hold the partition type, and the add_new_partition()
function needs to pass this information when called.
Then, of_open() simply needs to test part->sys_ind against the
LINUX_RAID value before allowing an ok return.

#3 is solved via recursion. ybin is augmented such that it will look
for any PReP partitions available on the system and will recursively
call itself targeting each individual PReP partition. The user will be
prompted for each partition in the default interactive mode, and yaboot
will be written to all PReP partitions in --force mode. As I've coded it
here, it depends on fdisk/awk/xargs. I found uses of grep elsewhere in
the script so I figured this should be legit.

Please see the attached patch and consider for inclusion in subsequent
yaboot releases.

-- 
Benjamin Herrenschmidt

From olh at suse.de Mon May 9 11:02:55 2005
From: olh at suse.de (Olaf Hering)
Date: Mon, 9 May 2005 03:02:55 +0200
Subject: [PATCH] print negative numbers correctly via vsprintf in arch/ppc64/boot/prom.c
In-Reply-To: <20050508164851.GA1707@suse.de>
References: <20050508164851.GA1707@suse.de>
Message-ID: <20050509010255.GA6759@suse.de>

On Sun, May 08, Olaf Hering wrote:

>
> remove unused arch/ppc64/boot/div64.S, it is built but not called

The other way round:

if num has a value of -1, accessing the digits[] array will fail and the
format string will be printed in a funny way, or not at all.
This happens if one prints negative numbers.
Just change the code to match lib/vsprintf.c
asm/div64.h can't be used because u64 maps to u32 for this build.
uint64_t -> __u64 -> unsigned long

Signed-off-by: Olaf Hering

Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c
===================================================================
--- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/prom.c
+++ linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c
@@ -11,6 +11,23 @@
 #include
 #include

+extern __u32 __div64_32(unsigned long long *dividend, __u32 divisor);
+
+/* The unnecessary pointer compare is there
+ * to check for type safety (n must be 64bit)
+ */
+# define do_div(n,base) ({ \
+	__u32 __base = (base); \
+	__u32 __rem; \
+	(void)(((typeof((n)) *)0) == ((unsigned long long *)0)); \
+	if (((n) >> 32) == 0) { \
+		__rem = (__u32)(n) % __base; \
+		(n) = (__u32)(n) / __base; \
+	} else \
+		__rem = __div64_32(&(n), __base); \
+	__rem; \
+ })
+
 int (*prom)(void *);

 void *chosen_handle;
@@ -352,7 +369,7 @@ static int skip_atoi(const char **s)
 #define SPECIAL 32 /* 0x */
 #define LARGE 64 /* use 'ABCDEF' instead of 'abcdef' */

-static char * number(char * str, long num, int base, int size, int precision, int type)
+static char * number(char * str, unsigned long long num, int base, int size, int precision, int type)
 {
 	char c,sign,tmp[66];
 	const char *digits="0123456789abcdefghijklmnopqrstuvwxyz";
@@ -367,9 +384,9 @@ static char * number(char * str, long nu
 	c = (type & ZEROPAD) ?
'0' : ' ';
 	sign = 0;
 	if (type & SIGN) {
-		if (num < 0) {
+		if ((signed long long)num < 0) {
 			sign = '-';
-			num = -num;
+			num = - (unsigned long long)num;
 			size--;
 		} else if (type & PLUS) {
 			sign = '+';
@@ -389,8 +406,7 @@ static char * number(char * str, long nu
 	if (num == 0)
 		tmp[i++]='0';
 	else while (num != 0) {
-		tmp[i++] = digits[num % base];
-		num /= base;
+		tmp[i++] = digits[do_div(num, base)];
 	}
 	if (i > precision)
 		precision = i;
@@ -426,7 +442,7 @@ int sprintf(char * buf, const char *fmt,
 int vsprintf(char *buf, const char *fmt, va_list args)
 {
 	int len;
-	unsigned long num;
+	unsigned long long num;
 	int i, base;
 	char * str;
 	const char *s;

From schwab at suse.de Mon May 9 19:42:39 2005
From: schwab at suse.de (Andreas Schwab)
Date: Mon, 09 May 2005 11:42:39 +0200
Subject: [PATCH] print negative numbers correctly via vsprintf in arch/ppc64/boot/prom.c
In-Reply-To: <20050509010255.GA6759@suse.de> (Olaf Hering's message of "Mon, 9 May 2005 03:02:55 +0200")
References: <20050508164851.GA1707@suse.de> <20050509010255.GA6759@suse.de>
Message-ID:

Olaf Hering writes:

> @@ -352,7 +369,7 @@ static int skip_atoi(const char **s)
>  #define SPECIAL 32 /* 0x */
>  #define LARGE 64 /* use 'ABCDEF' instead of 'abcdef' */
>
> -static char * number(char * str, long num, int base, int size, int precision, int type)
> +static char * number(char * str, unsigned long long num, int base, int size, int precision, int type)
>  {
>  	char c,sign,tmp[66];
>  	const char *digits="0123456789abcdefghijklmnopqrstuvwxyz";
> @@ -367,9 +384,9 @@ static char * number(char * str, long nu
>  	c = (type & ZEROPAD) ? '0' : ' ';
>  	sign = 0;
>  	if (type & SIGN) {
> -		if (num < 0) {
> +		if ((signed long long)num < 0) {
>  			sign = '-';
> -			num = -num;
> +			num = - (unsigned long long)num;

I think the latter cast is useless.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab at suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

From johnrose at austin.ibm.com Tue May 10 01:43:42 2005
From: johnrose at austin.ibm.com (John Rose)
Date: Mon, 09 May 2005 10:43:42 -0500
Subject: Patch to kill ioremap_mm
In-Reply-To: <1115423232.23610.5.camel@gaston>
References: <20050505014256.GE18270@localhost.localdomain> <1115306696.6011.6.camel@sinatra.austin.ibm.com> <1115335822.7627.189.camel@gaston> <1115392690.15458.12.camel@sinatra.austin.ibm.com> <1115423232.23610.5.camel@gaston>
Message-ID: <1115653422.14937.41.camel@sinatra.austin.ibm.com>

Hi Ben-

> Do we need that for normal mmio ioremap mappings or only for PHB IO
> space ?

Just the latter.

> As I said, the later would stay separate, but I feel we don't
> need the imalloc infrastructure to handle it. We can probably just
> directly set/invalidate PTEs for the ranges when needed.

So you won't use vmalloc to manage the explicit mappings for the PHB
ranges? How will you keep up with what's been mapped for a PHB? Seems
like this would be necessary to know upon removal. Or will you simply
invalidate for the entire PHB range, even for subranges that aren't
mapped?

Sweating the details :)
John

From apw at shadowen.org Tue May 10 01:49:04 2005
From: apw at shadowen.org (Andy Whitcroft)
Date: Mon, 09 May 2005 16:49:04 +0100
Subject: sparsemem ppc64 tidy flat memory comments and fix benign mempresent call
In-Reply-To: <427A59BC.1020208@shadowen.org>
Message-ID:

I was going to rediff the memory present patches, but as -mm has picked
these up already here is a simple patch to clean up this errant comment
and address a benign call to memory_present(). Applies onto the
existing patches.

-apw

Tidy up the comments for the ppc64 flat memory support and fix a
currently benign double call to memory_present() for the first
memory block.
Signed-off-by: Andy Whitcroft
---
 init.c | 9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff -upN reference/arch/ppc64/mm/init.c current/arch/ppc64/mm/init.c
--- reference/arch/ppc64/mm/init.c
+++ current/arch/ppc64/mm/init.c
@@ -631,18 +631,19 @@ void __init do_init_bootmem(void)

 	max_pfn = max_low_pfn;

-	/* add all physical memory to the bootmem map. Also, find the first
-	 * presence of all LMBs*/
+	/* Add all physical memory to the bootmem map, mark each area
+	 * present. The first block has already been marked present above.
+	 */
 	for (i=0; i < lmb.memory.cnt; i++) {
 		unsigned long physbase, size;

 		physbase = lmb.memory.region[i].physbase;
 		size = lmb.memory.region[i].size;

-		if (i) { /* already created mappings for first LMB */
+		if (i) {
 			start_pfn = physbase >> PAGE_SHIFT;
 			end_pfn = start_pfn + (size >> PAGE_SHIFT);
+			memory_present(0, start_pfn, end_pfn);
 		}
-		memory_present(0, start_pfn, end_pfn);
 		free_bootmem(physbase, size);
 	}

From jschopp at austin.ibm.com Tue May 10 01:56:36 2005
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Mon, 09 May 2005 10:56:36 -0500
Subject: [Fwd: PATCH - enables software RAID on Linux]
In-Reply-To: <1115593748.6305.61.camel@gaston>
References: <1115593748.6305.61.camel@gaston>
Message-ID: <427F8834.9060509@austin.ibm.com>

> Please see the attached patch and consider for inclusion in subsequent
> yaboot releases.

In the forwarding no patch was attached.

From dhowells at redhat.com Tue May 10 03:27:07 2005
From: dhowells at redhat.com (David Howells)
Date: Mon, 09 May 2005 18:27:07 +0100
Subject: [PATCH 1/3] ppc64: iseries_veth: Don't send packets to LPARs which aren't up
In-Reply-To: <200505071201.20123.michael@ellerman.id.au>
References: <200505071201.20123.michael@ellerman.id.au>
Message-ID: <10647.1115659627@redhat.com>

Michael Ellerman wrote:

>
> The iseries_veth driver has a logic bug which means it will erroneously
> send packets to LPARs for which we don't have a connection.

Any particular versions of the kernel?

David

From linas at austin.ibm.com Tue May 10 04:24:45 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Mon, 9 May 2005 13:24:45 -0500
Subject: [PATCH 2.6] PPC64: janitorial HVSI use wait_event_timeout()
Message-ID: <20050509182445.GP11745@austin.ibm.com>

Hi,

I seem to be dragging around the following janitorial patch,
it was submitted upstream to LKML on March 6 2005, but seems
not to have made it in.

Hollis, please review; I've been running with it for months.

--linas

Use wait_event_timeout() in place of custom wait-queue code. The
code is not changed in any way (I don't think), but is cleaned up
quite a bit (will get expanded to almost identical code).

Acked-by: Linas Vepstas
Signed-off-by: Nishanth Aravamudan
Signed-off-by: Domen Puncer

--- linux-2.6.11.8/drivers/char/hvsi.c.linas-orig 2005-04-29 20:29:25.000000000 -0500
+++ linux-2.6.11.8/drivers/char/hvsi.c 2005-05-06 12:28:43.000000000 -0500
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -631,27 +632,9 @@ static int __init poll_for_state(struct
 /* wait for irq handler to change our state */
 static int wait_for_state(struct hvsi_struct *hp, int state)
 {
-	unsigned long end_jiffies = jiffies + HVSI_TIMEOUT;
-	unsigned long timeout;
-	int ret = 0;
-
-	DECLARE_WAITQUEUE(myself, current);
-	set_current_state(TASK_INTERRUPTIBLE);
-	add_wait_queue(&hp->stateq, &myself);
-
-	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (hp->state == state)
-			break;
-		timeout = end_jiffies - jiffies;
-		if (time_after(jiffies, end_jiffies)) {
-			ret = -EIO;
-			break;
-		}
-		schedule_timeout(timeout);
-	}
-	remove_wait_queue(&hp->stateq, &myself);
-	set_current_state(TASK_RUNNING);
+	int ret=0;
+	if(!wait_event_timeout(hp->stateq, (hp->state == state), jiffies +
+			HVSI_TIMEOUT)) ret = -EIO;
 	return ret;
 }

@@ -868,24 +851,8 @@ static int hvsi_open(struct tty_struct *
 /* wait for hvsi_write_worker to empty hp->outbuf */
 static void
hvsi_flush_output(struct hvsi_struct *hp) { - unsigned long end_jiffies = jiffies + HVSI_TIMEOUT; - unsigned long timeout; - - DECLARE_WAITQUEUE(myself, current); - set_current_state(TASK_UNINTERRUPTIBLE); - add_wait_queue(&hp->emptyq, &myself); - - for (;;) { - set_current_state(TASK_UNINTERRUPTIBLE); - if (hp->n_outbuf <= 0) - break; - timeout = end_jiffies - jiffies; - if (time_after(jiffies, end_jiffies)) - break; - schedule_timeout(timeout); - } - remove_wait_queue(&hp->emptyq, &myself); - set_current_state(TASK_RUNNING); + wait_event_timeout(hp->emptyq, (hp->n_outbuf <= 0), + HVSI_TIMEOUT); /* 'writer' could still be pending if it didn't see n_outbuf = 0 yet */ cancel_delayed_work(&hp->writer); From olh at suse.de Tue May 10 04:26:29 2005 From: olh at suse.de (Olaf Hering) Date: Mon, 9 May 2005 20:26:29 +0200 Subject: [PATCH] print negative numbers correctly via vsprintf in arch/ppc64/boot/prom.c In-Reply-To: References: <20050508164851.GA1707@suse.de> <20050509010255.GA6759@suse.de> Message-ID: <20050509182629.GA25582@suse.de> On Mon, May 09, Andreas Schwab wrote: > > + num = - (unsigned long long)num; > > I think the latter cast is useless. This was a typo. If num has a value of -1, accessing the digits[] array will fail and the format string will be printed in a funny way, or not at all. This happens if one prints negative numbers. Just change the code to match lib/vsprintf.c. asm/div64.h can't be used because u64 maps to u32 for this build.
Signed-off-by: Olaf Hering Index: linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c =================================================================== --- linux-2.6.12-rc4-olh.orig/arch/ppc64/boot/prom.c +++ linux-2.6.12-rc4-olh/arch/ppc64/boot/prom.c @@ -11,6 +11,23 @@ #include #include +extern __u32 __div64_32(unsigned long long *dividend, __u32 divisor); + +/* The unnecessary pointer compare is there + * to check for type safety (n must be 64bit) + */ +# define do_div(n,base) ({ \ + __u32 __base = (base); \ + __u32 __rem; \ + (void)(((typeof((n)) *)0) == ((unsigned long long *)0)); \ + if (((n) >> 32) == 0) { \ + __rem = (__u32)(n) % __base; \ + (n) = (__u32)(n) / __base; \ + } else \ + __rem = __div64_32(&(n), __base); \ + __rem; \ + }) + int (*prom)(void *); void *chosen_handle; @@ -352,7 +369,7 @@ static int skip_atoi(const char **s) #define SPECIAL 32 /* 0x */ #define LARGE 64 /* use 'ABCDEF' instead of 'abcdef' */ -static char * number(char * str, long num, int base, int size, int precision, int type) +static char * number(char * str, unsigned long long num, int base, int size, int precision, int type) { char c,sign,tmp[66]; const char *digits="0123456789abcdefghijklmnopqrstuvwxyz"; @@ -367,9 +384,9 @@ static char * number(char * str, long nu c = (type & ZEROPAD) ? 
'0' : ' '; sign = 0; if (type & SIGN) { - if (num < 0) { + if ((signed long long)num < 0) { sign = '-'; - num = -num; + num = - (signed long long)num; size--; } else if (type & PLUS) { sign = '+'; @@ -389,8 +406,7 @@ static char * number(char * str, long nu if (num == 0) tmp[i++]='0'; else while (num != 0) { - tmp[i++] = digits[num % base]; - num /= base; + tmp[i++] = digits[do_div(num, base)]; } if (i > precision) precision = i; @@ -426,7 +442,7 @@ int sprintf(char * buf, const char *fmt, int vsprintf(char *buf, const char *fmt, va_list args) { int len; - unsigned long num; + unsigned long long num; int i, base; char * str; const char *s; From schwab at suse.de Tue May 10 05:54:08 2005 From: schwab at suse.de (Andreas Schwab) Date: Mon, 09 May 2005 21:54:08 +0200 Subject: [PATCH] print negative numbers correctly via vsprintf in arch/ppc64/boot/prom.c In-Reply-To: <20050509182629.GA25582@suse.de> (Olaf Hering's message of "Mon, 9 May 2005 20:26:29 +0200") References: <20050508164851.GA1707@suse.de> <20050509010255.GA6759@suse.de> <20050509182629.GA25582@suse.de> Message-ID: Olaf Hering writes: > @@ -367,9 +384,9 @@ static char * number(char * str, long nu > c = (type & ZEROPAD) ? '0' : ' '; > sign = 0; > if (type & SIGN) { > - if (num < 0) { > + if ((signed long long)num < 0) { > sign = '-'; > - num = -num; > + num = - (signed long long)num; The cast is still not needed. In 2's complement representation the negation of a signed number and an unsigned number are the same operation. Andreas. -- Andreas Schwab, SuSE Labs, schwab at suse.de SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different."
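Andreas's point about two's complement can be checked with a small standalone C sketch. It is not part of the posted patch: the helper names (negate_as_unsigned, negate_via_signed, negations_agree) are made up for illustration, and the unsigned-to-signed conversion it relies on is implementation-defined in C (gcc on two's complement targets treats it as a plain reinterpretation of the bits).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: negate directly in unsigned (modular) arithmetic. */
static unsigned long long negate_as_unsigned(unsigned long long num)
{
	return -num;
}

/*
 * Hypothetical helper: negate after casting to signed, as the patch does.
 * Assumes num is not the bit pattern of LLONG_MIN, whose signed negation
 * would overflow.
 */
static unsigned long long negate_via_signed(unsigned long long num)
{
	return (unsigned long long)(-(signed long long)num);
}

/* Print both results for one value and return 1 if they agree. */
static int negations_agree(long long value)
{
	unsigned long long num = (unsigned long long)value;
	unsigned long long a = negate_as_unsigned(num);
	unsigned long long b = negate_via_signed(num);

	printf("%lld: unsigned negation %llu, signed negation %llu\n",
	       value, a, b);
	return a == b;
}

/* Check a few negative inputs of the kind number() would be handed. */
static void check_examples(void)
{
	int ok = negations_agree(-1) && negations_agree(-42) &&
		 negations_agree(-1234567890123LL);

	assert(ok);
	(void)ok;	/* keeps the variable used when NDEBUG is defined */
}
```

Calling check_examples() from a test main() should pass on such targets, which is consistent with the cast being a matter of style rather than behaviour.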
From olh at suse.de Tue May 10 05:56:42 2005 From: olh at suse.de (Olaf Hering) Date: Mon, 9 May 2005 21:56:42 +0200 Subject: [PATCH] print negative numbers correctly via vsprintf in arch/ppc64/boot/prom.c In-Reply-To: References: <20050508164851.GA1707@suse.de> <20050509010255.GA6759@suse.de> <20050509182629.GA25582@suse.de> Message-ID: <20050509195642.GA28538@suse.de> On Mon, May 09, Andreas Schwab wrote: > The cast is still not needed. In 2's complement representation the > negation of a signed number and an unsigned number are the same operation. Linus wanted it that way a few months/years ago, in lib/vsprintf.c From benh at kernel.crashing.org Tue May 10 08:56:44 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 10 May 2005 08:56:44 +1000 Subject: Patch to kill ioremap_mm In-Reply-To: <1115653422.14937.41.camel@sinatra.austin.ibm.com> References: <20050505014256.GE18270@localhost.localdomain> <1115306696.6011.6.camel@sinatra.austin.ibm.com> <1115335822.7627.189.camel@gaston> <1115392690.15458.12.camel@sinatra.austin.ibm.com> <1115423232.23610.5.camel@gaston> <1115653422.14937.41.camel@sinatra.austin.ibm.com> Message-ID: <1115679405.7339.7.camel@gaston> On Mon, 2005-05-09 at 10:43 -0500, John Rose wrote: > Hi Ben- > > > Do we need that for normal mmio ioremap mappings or only for PHB IO > > space ? > > Just the latter. > > > As I said, the later would stay separate, but I feel we don't > > need the imalloc infrastructure to handle it. We can probably just > > directly set/invalidate PTEs for the ranges when needed. > > So you won't use vmalloc to manage the explicit mappings for the PHB > ranges? How will you keep up with what's been mapped for a PHB? Seems > like this would be necessary to know upon removal. Or will you simply > invalidate for the entire PHB range, even for subranges that aren't > mapped? > > Sweating the details :) Well, for one, I'm not 100% sure we actually need to remove the linux PTEs on removal. 
We only need to remove the hash entries, which can be done "preventively" easily enough. When a segment is removed, we kick out hash entries for that segment. If somebody tries to access that later one, it gets a sigbus since HV will fail inserting an entry. Ben. From benh at kernel.crashing.org Tue May 10 08:57:30 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 10 May 2005 08:57:30 +1000 Subject: [Fwd: PATCH - enables software RAID on Linux] In-Reply-To: <427F8834.9060509@austin.ibm.com> References: <1115593748.6305.61.camel@gaston> <427F8834.9060509@austin.ibm.com> Message-ID: <1115679451.7339.9.camel@gaston> On Mon, 2005-05-09 at 10:56 -0500, Joel Schopp wrote: > > Please see the attached patch and consider for inclusion in subsequent > > yaboot releases. > > In the forwarding no patch was attached Yup, I noticed. I've asked him to re-post to linuxppc64-dev Ben. From jschopp at austin.ibm.com Tue May 10 09:03:51 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Mon, 09 May 2005 18:03:51 -0500 Subject: sparsemem ppc64 tidy flat memory comments and fix benign mempresent call In-Reply-To: References: Message-ID: <427FEC57.8060505@austin.ibm.com> > diff -upN reference/arch/ppc64/mm/init.c current/arch/ppc64/mm/init.c > --- reference/arch/ppc64/mm/init.c > +++ current/arch/ppc64/mm/init.c > @@ -631,18 +631,19 @@ void __init do_init_bootmem(void) > > max_pfn = max_low_pfn; > > - /* add all physical memory to the bootmem map. Also, find the first > - * presence of all LMBs*/ > + /* Add all physical memory to the bootmem map, mark each area > + * present. The first block has already been marked present above. 
> + */ > for (i=0; i < lmb.memory.cnt; i++) { > unsigned long physbase, size; > > physbase = lmb.memory.region[i].physbase; > size = lmb.memory.region[i].size; > - if (i) { /* already created mappings for first LMB */ > + if (i) { > start_pfn = physbase >> PAGE_SHIFT; > end_pfn = start_pfn + (size >> PAGE_SHIFT); > + memory_present(0, start_pfn, end_pfn); > } > - memory_present(0, start_pfn, end_pfn); > free_bootmem(physbase, size); > } Instead of moving all that around why don't we just drop the duplicate and the if altogether? I tested and sent a patch back in March that cleaned up the non-numa case pretty well. http://sourceforge.net/mailarchive/message.php?msg_id=11320001 From michael at ellerman.id.au Tue May 10 09:04:09 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 10 May 2005 09:04:09 +1000 Subject: [PATCH 1/3] ppc64: iseries_veth: Don't send packets to LPARs which aren't up In-Reply-To: <10647.1115659627@redhat.com> References: <200505071201.20123.michael@ellerman.id.au> <10647.1115659627@redhat.com> Message-ID: <200505100904.09850.michael@ellerman.id.au> On Tue, 10 May 2005 03:27, David Howells wrote: > Michael Ellerman wrote: > > The iseries_veth driver has a logic bug which means it will erroneously > > send packets to LPARs for which we don't have a connection. > > Any particular versions of the kernel? 2.6.* I believe. Certainly RHEL4's version is broken. I haven't got an RHEL3 kernel handy, but the 2.6 driver was based on the 2.4 version I believe, so it's possible the bug is in 2.4 too. cheers -- Michael Ellerman IBM OzLabs email: michael:ellerman.id.au inmsg: mpe:jabber.org wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050510/69a1e8dd/attachment.pgp From hollisb at us.ibm.com Tue May 10 09:23:13 2005 From: hollisb at us.ibm.com (Hollis Blanchard) Date: Mon, 9 May 2005 18:23:13 -0500 Subject: [PATCH 2.6] PPC64: janitorial HVSI use wait_event_timeout() In-Reply-To: <20050509182445.GP11745@austin.ibm.com> References: <20050509182445.GP11745@austin.ibm.com> Message-ID: On May 9, 2005, at 1:24 PM, Linas Vepstas wrote: > > I seem to be dragging around the following janitorial patch, it was > submitted upstream to LKML on March 6 2005, but seems not to have made > it in. Hollis, please review; I've been running with it for months. This patch has some formatting issues that need correcting. If it works for you I'm fine with the principle. (Also, please don't email my Lotus Notes account.) Thanks. -- Hollis Blanchard IBM Linux Technology Center From dustin.kirkland at us.ibm.com Tue May 10 09:21:37 2005 From: dustin.kirkland at us.ibm.com (Dustin Kirkland) Date: Mon, 09 May 2005 19:21:37 -0400 Subject: [Fwd: PATCH - enables software RAID on Linux] Message-ID: <1115680898.7274.15.camel@t41p> Hello all- I have been advised to post this patch to this list, as the yaboot-devel list appears to be somewhat stale. You might also be interested in the thread regarding yaboot 1.x ownership spurred by this post to yaboot-devel: http://lists.penguinppc.org/yaboot-devel/2005/yaboot-devel-200505/threads.html Thanks, Dustin -------------- next part -------------- An embedded message was scrubbed... From: Dustin Kirkland Subject: PATCH - enables software RAID on Linux Date: Fri, 06 May 2005 10:17:42 -0400 Size: 9319 Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050509/a54fd260/attachment.eml -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050509/a54fd260/attachment.pgp From michael at ellerman.id.au Tue May 10 16:00:08 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Tue, 10 May 2005 16:00:08 +1000 Subject: [PATCH 1/3] ppc64: iseries_veth: Don't send packets to LPARs which aren't up In-Reply-To: <200505100904.09850.michael@ellerman.id.au> References: <200505071201.20123.michael@ellerman.id.au> <10647.1115659627@redhat.com> <200505100904.09850.michael@ellerman.id.au> Message-ID: <200505101600.17146.michael@ellerman.id.au> On Tue, 10 May 2005 09:04, Michael Ellerman wrote: > On Tue, 10 May 2005 03:27, David Howells wrote: > > Michael Ellerman wrote: > > > The iseries_veth driver has a logic bug which means it will erroneously > > > send packets to LPARs for which we don't have a connection. > > > > Any particular versions of the kernel? > > 2.6.* I believe. Certainly RHEL4's version is broken. > > I haven't got an RHEL3 kernel handy, but the 2.6 driver was based on the > 2.4 version I believe, so it's possible the bug is in 2.4 too. Well something was based on something, no one's quite sure. But the 2.4 driver looks ok, it doesn't have this code at all. cheers -- Michael Ellerman IBM OzLabs email: michael:ellerman.id.au inmsg: mpe:jabber.org wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050510/f5f19f0d/attachment.pgp From apw at shadowen.org Wed May 11 01:45:48 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Tue, 10 May 2005 16:45:48 +0100 Subject: sparsemem ppc64 tidy flat memory comments and fix benign mempresent call In-Reply-To: <427FEC57.8060505@austin.ibm.com> References: <427FEC57.8060505@austin.ibm.com> Message-ID: <4280D72C.4090203@shadowen.org> > Instead of moving all that around why don't we just drop the duplicate > and the if altogether? I tested and sent a patch back in March that > cleaned up the non-numa case pretty well. > > http://sourceforge.net/mailarchive/message.php?msg_id=11320001 Ok, Mike also expressed the feeling that it was no longer necessary to handle the first block separately. I've tested the attached patch on the machines I have to hand and it seems to boot just fine in the flat memory modes with this applied. Joel, Mike, Dave, could you test this one on your platforms to confirm it's widely applicable? If so, we can push it up to -mm. The patch attached applies to the patches proposed for the next -mm. A full stack on top of 2.6.12-rc3-mm2 can be found at the URL below (see the series file): http://www.shadowen.org/~apw/linux/sparsemem/sparsemem-2.6.12-rc3-mm2-V3/ Cheers. -apw -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: sparsemem-ppc64-flat-first-block-is-not-special Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050510/b0adf4c9/attachment.txt From kravetz at us.ibm.com Wed May 11 05:42:25 2005 From: kravetz at us.ibm.com (mike kravetz) Date: Tue, 10 May 2005 12:42:25 -0700 Subject: sparsemem ppc64 tidy flat memory comments and fix benign mempresent call In-Reply-To: <4280D72C.4090203@shadowen.org> References: <427FEC57.8060505@austin.ibm.com> <4280D72C.4090203@shadowen.org> Message-ID: <20050510194225.GD3915@w-mikek2.ibm.com> On Tue, May 10, 2005 at 04:45:48PM +0100, Andy Whitcroft wrote: > Joel, Mike, Dave could you test this one on your platforms to confirm > its widly applicable, if so we can push it up to -mm. It works on my machine with various config options. -- Mike From david at gibson.dropbear.id.au Wed May 11 11:08:09 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Wed, 11 May 2005 11:08:09 +1000 Subject: [PATCH 1/3] ppc64: iseries_veth: Don't send packets to LPARs which aren't up In-Reply-To: <200505101600.17146.michael@ellerman.id.au> References: <200505071201.20123.michael@ellerman.id.au> <10647.1115659627@redhat.com> <200505100904.09850.michael@ellerman.id.au> <200505101600.17146.michael@ellerman.id.au> Message-ID: <20050511010809.GB18715@localhost.localdomain> On Tue, May 10, 2005 at 04:00:08PM +1000, Michael Ellerman wrote: > On Tue, 10 May 2005 09:04, Michael Ellerman wrote: > > On Tue, 10 May 2005 03:27, David Howells wrote: > > > Michael Ellerman wrote: > > > > The iseries_veth driver has a logic bug which means it will erroneously > > > > send packets to LPARs for which we don't have a connection. > > > > > > Any particular versions of the kernel? > > > > 2.6.* I believe. Certainly RHEL4's version is broken. > > > > I haven't got an RHEL3 kernel handy, but the 2.6 driver was based on the > > 2.4 version I believe, so it's possible the bug is in 2.4 too. > > Well something was based on something, no one's quite sure. 
But the 2.4 driver > looks ok, it doesn't have this code at all. The 2.6 driver is based on the 2.4 driver in the sense that the 2.4 driver was the only source of information about the virtual ethernet protocol. However, the driver was pretty much completely rewritten in the process. Unfortunately, this seems to have introduced quite a few bugs like these ones. On the other hand, the code is now comprehensible so the bugs are actually fixable. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050511/7fcaaf61/attachment.pgp From david at gibson.dropbear.id.au Thu May 12 13:39:53 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 12 May 2005 13:39:53 +1000 Subject: First cut at four-level pagetables Message-ID: <20050512033953.GA29780@localhost.localdomain> Here's a first shot at a patch which implements true four-level page tables for ppc64. It uses full page tables at the bottom and top levels, and quarter-page tables at the middle two levels. This gives a total usable address space of 44 bits (16T). I've also tweaked the VSID allocation to let us use all that space (thereby halving the number of available contexts) and added some #if and BUILD_BUG sanity checks. Hugepages are presently completely broken, working on that now. This patch applies on top of the patch posted earlier eliminating ioremap_dir.
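The 44-bit figure and the "full page" versus "quarter-page" table sizes in the description follow from the index sizes the patch defines in pgtable.h. A minimal standalone sketch of that arithmetic is below. It is not kernel code: PAGE_SHIFT of 12 (4K pages) and the 8-byte entry size are assumptions stated in the comments, and the helper names table_bytes, pgtable_eaddr_size and print_layout are invented for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Index sizes copied from the patch's pgtable.h hunk. */
#define PTE_INDEX_SIZE 9
#define PMD_INDEX_SIZE 7
#define PUD_INDEX_SIZE 7
#define PGD_INDEX_SIZE 9

/* Assumption: 4K base pages, i.e. a PAGE_SHIFT of 12. */
#define PAGE_SHIFT 12

/*
 * Assumption: every entry is one 8-byte unsigned long, matching the
 * patch's pte_t/pmd_t/pud_t/pgd_t definitions on a 64-bit build.
 */
#define ENTRY_BYTES 8ULL

/* Bytes occupied by one table with 2^index_size entries. */
static unsigned long long table_bytes(unsigned int index_size)
{
	return ENTRY_BYTES << index_size;
}

/* Number of effective-address bits covered by the four levels. */
static unsigned int pgtable_eaddr_size(void)
{
	return PTE_INDEX_SIZE + PMD_INDEX_SIZE + PUD_INDEX_SIZE +
	       PGD_INDEX_SIZE + PAGE_SHIFT;
}

/* Print the per-level table sizes and the total mapped range. */
static void print_layout(void)
{
	unsigned int bits = pgtable_eaddr_size();

	assert(bits < 64);
	printf("PTE table: %llu bytes\n", table_bytes(PTE_INDEX_SIZE));
	printf("PMD table: %llu bytes\n", table_bytes(PMD_INDEX_SIZE));
	printf("PUD table: %llu bytes\n", table_bytes(PUD_INDEX_SIZE));
	printf("PGD table: %llu bytes\n", table_bytes(PGD_INDEX_SIZE));
	printf("mapped range: %u bits (%llu TiB)\n",
	       bits, (1ULL << bits) >> 40);
}
```

Under those assumptions the PTE and PGD tables come to 4096 bytes (one page), the PMD and PUD tables to 1024 bytes (a quarter page), and the mapped range to 2^44 bytes, which is the same value as the new TASK_SIZE_USER64 of 0x0000100000000000UL in the processor.h hunk.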
Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-12 12:08:53.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-12 13:25:23.000000000 +1000 @@ -15,19 +15,24 @@ #include #endif /* __ASSEMBLY__ */ -#include - /* * Entries per page directory level. The PTE level must use a 64b record * for each page table entry. The PMD and PGD level use a 32b record for * each entry by assuming that each entry is page aligned. */ #define PTE_INDEX_SIZE 9 -#define PMD_INDEX_SIZE 10 -#define PGD_INDEX_SIZE 10 +#define PMD_INDEX_SIZE 7 +#define PUD_INDEX_SIZE 7 +#define PGD_INDEX_SIZE 9 + +#define PTE_TABLE_SIZE (sizeof(pte_t) << PTE_INDEX_SIZE) +#define PMD_TABLE_SIZE (sizeof(pmd_t) << PMD_INDEX_SIZE) +#define PUD_TABLE_SIZE (sizeof(pud_t) << PUD_INDEX_SIZE) +#define PGD_TABLE_SIZE (sizeof(pgd_t) << PGD_INDEX_SIZE) #define PTRS_PER_PTE (1 << PTE_INDEX_SIZE) #define PTRS_PER_PMD (1 << PMD_INDEX_SIZE) +#define PTRS_PER_PUD (1 << PUD_INDEX_SIZE) #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE) /* PMD_SHIFT determines what a second-level page table entry can map */ @@ -35,8 +40,13 @@ #define PMD_SIZE (1UL << PMD_SHIFT) #define PMD_MASK (~(PMD_SIZE-1)) -/* PGDIR_SHIFT determines what a third-level page table entry can map */ -#define PGDIR_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +/* PUD_SHIFT determines what a third-level page table entry can map */ +#define PUD_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +#define PUD_SIZE (1UL << PUD_SHIFT) +#define PUD_MASK (~(PUD_SIZE-1)) + +/* PGDIR_SHIFT determines what a fourth-level page table entry can map */ +#define PGDIR_SHIFT (PUD_SHIFT + PUD_INDEX_SIZE) #define PGDIR_SIZE (1UL << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE-1)) @@ -45,15 +55,23 @@ /* * Size of EA range mapped by our pagetables.
*/ -#define EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ - PGD_INDEX_SIZE + PAGE_SHIFT) -#define EADDR_MASK ((1UL << EADDR_SIZE) - 1) +#define PGTABLE_EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ + PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT) +#define PGTABLE_RANGE (1UL << PGTABLE_EADDR_SIZE) + +#if TASK_SIZE_USER64 > PGTABLE_RANGE +#error TASK_SIZE_USER64 exceeds pagetable range +#endif + +#if TASK_SIZE_USER64 > (1UL << (USER_ESID_BITS + SID_SHIFT)) +#error TASK_SIZE_USER64 exceeds user VSID range +#endif /* * Define the address range of the vmalloc VM area. */ #define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_SIZE (0x10000000000UL) +#define VMALLOC_SIZE (0x40000000000UL) #define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) /* @@ -151,6 +169,8 @@ #ifdef CONFIG_HUGETLB_PAGE +#error Hugepages broken for now + #ifndef __ASSEMBLY__ int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local); @@ -197,39 +217,45 @@ #define pte_pfn(x) ((unsigned long)((pte_val(x) >> PTE_SHIFT))) #define pte_page(x) pfn_to_page(pte_pfn(x)) -#define pmd_set(pmdp, ptep) \ - (pmd_val(*(pmdp)) = __ba_to_bpn(ptep)) +#define pmd_set(pmdp, ptep) (pmd_val(*(pmdp)) = (unsigned long)(ptep)) #define pmd_none(pmd) (!pmd_val(pmd)) #define pmd_bad(pmd) (pmd_val(pmd) == 0) #define pmd_present(pmd) (pmd_val(pmd) != 0) #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0) -#define pmd_page_kernel(pmd) (__bpn_to_ba(pmd_val(pmd))) +#define pmd_page_kernel(pmd) (pmd_val(pmd)) #define pmd_page(pmd) virt_to_page(pmd_page_kernel(pmd)) -#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (__ba_to_bpn(pmdp))) +#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (unsigned long)(pmdp)) #define pud_none(pud) (!pud_val(pud)) -#define pud_bad(pud) ((pud_val(pud)) == 0UL) -#define pud_present(pud) (pud_val(pud) != 0UL) -#define pud_clear(pudp) (pud_val(*(pudp)) = 0UL) -#define pud_page(pud) (__bpn_to_ba(pud_val(pud))) +#define pud_bad(pud) ((pud_val(pud)) == 0) 
+#define pud_present(pud) (pud_val(pud) != 0) +#define pud_clear(pudp) (pud_val(*(pudp)) = 0) +#define pud_page(pud) (pud_val(pud)) + +#define pgd_set(pgdp, pudp) ({pgd_val(*(pgdp)) = (unsigned long)(pudp);}) +#define pgd_none(pgd) (!pgd_val(pgd)) +#define pgd_bad(pgd) (pgd_val(pgd) == 0) +#define pgd_present(pgd) (pgd_val(pgd) != 0) +#define pgd_clear(pgdp) (pgd_val(*(pgdp)) = 0) +#define pgd_page(pgd) (pgd_val(pgd)) /* * Find an entry in a page-table-directory. We combine the address region * (the high order N bits) and the pgd portion of the address. */ /* to avoid overflow in free_pgtables we don't use PTRS_PER_PGD here */ -#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x7ff) +#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x1ff) #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address)) -/* Find an entry in the second-level page table.. */ +#define pud_offset(pgdp, addr) \ + (((pud_t *) pgd_page(*(pgdp))) + (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))) + #define pmd_offset(pudp,addr) \ - ((pmd_t *) pud_page(*(pudp)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) + (((pmd_t *) pud_page(*(pudp))) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) -/* Find an entry in the third-level page table.. 
*/ #define pte_offset_kernel(dir,addr) \ - ((pte_t *) pmd_page_kernel(*(dir)) \ - + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) + (((pte_t *) pmd_page_kernel(*(dir))) + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) #define pte_offset_map(dir,addr) pte_offset_kernel((dir), (addr)) #define pte_offset_map_nested(dir,addr) pte_offset_kernel((dir), (addr)) @@ -458,9 +484,11 @@ #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) #define pmd_ERROR(e) \ - printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) + printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e)) +#define pud_ERROR(e) \ + printk("%s:%d: bad pud %08lx.\n", __FILE__, __LINE__, pud_val(e)) #define pgd_ERROR(e) \ - printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) + printk("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e)) extern pgd_t swapper_pg_dir[]; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-12 12:08:53.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-12 12:08:54.000000000 +1000 @@ -35,6 +35,8 @@ #ifdef CONFIG_HUGETLB_PAGE +#error Hugepages broken for now + #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) /* For 64-bit processes the hugepage range is 1T-1.5T */ @@ -91,7 +93,7 @@ #ifndef __ASSEMBLY__ #include -#undef STRICT_MM_TYPECHECKS +#define STRICT_MM_TYPECHECKS #define REGION_SIZE 4UL #define REGION_SHIFT 60UL @@ -125,27 +127,31 @@ * Entries in the pte table are 64b, while entries in the pgd & pmd are 32b.
*/ typedef struct { unsigned long pte; } pte_t; -typedef struct { unsigned int pmd; } pmd_t; -typedef struct { unsigned int pgd; } pgd_t; +typedef struct { unsigned long pmd; } pmd_t; +typedef struct { unsigned long pud; } pud_t; +typedef struct { unsigned long pgd; } pgd_t; typedef struct { unsigned long pgprot; } pgprot_t; #define pte_val(x) ((x).pte) #define pmd_val(x) ((x).pmd) +#define pud_val(x) ((x).pud) #define pgd_val(x) ((x).pgd) #define pgprot_val(x) ((x).pgprot) -#define __pte(x) ((pte_t) { (x) } ) -#define __pmd(x) ((pmd_t) { (x) } ) -#define __pgd(x) ((pgd_t) { (x) } ) -#define __pgprot(x) ((pgprot_t) { (x) } ) +#define __pte(x) ((pte_t) { (x) }) +#define __pmd(x) ((pmd_t) { (x) }) +#define __pud(x) ((pud_t) { (x) }) +#define __pgd(x) ((pgd_t) { (x) }) +#define __pgprot(x) ((pgprot_t) { (x) }) #else /* * .. while these make it easier on the compiler */ typedef unsigned long pte_t; -typedef unsigned int pmd_t; -typedef unsigned int pgd_t; +typedef unsigned long pmd_t; +typedef unsigned long pud_t; +typedef unsigned long pgd_t; typedef unsigned long pgprot_t; #define pte_val(x) (x) @@ -208,9 +214,6 @@ #define USER_REGION_ID (0UL) #define REGION_ID(ea) (((unsigned long)(ea)) >> REGION_SHIFT) -#define __bpn_to_ba(x) ((((unsigned long)(x)) << PAGE_SHIFT) + KERNELBASE) -#define __ba_to_bpn(x) ((((unsigned long)(x)) & ~REGION_MASK) >> PAGE_SHIFT) - #define __va(x) ((void *)((unsigned long)(x) + KERNELBASE)) #ifdef CONFIG_DISCONTIGMEM Index: working-2.6/include/asm-ppc64/pgalloc.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgalloc.h 2005-05-12 12:08:53.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgalloc.h 2005-05-12 12:08:54.000000000 +1000 @@ -6,7 +6,7 @@ #include #include -extern kmem_cache_t *zero_cache; +extern kmem_cache_t *pmd_cache; /* * This program is free software; you can redistribute it and/or @@ -18,13 +18,31 @@ static inline pgd_t * pgd_alloc(struct mm_struct *mm) { - return 
kmem_cache_alloc(zero_cache, GFP_KERNEL); + return (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO); } static inline void pgd_free(pgd_t *pgd) { - kmem_cache_free(zero_cache, pgd); + free_page((unsigned long)pgd); +} + +#define pgd_populate(MM, PGD, PUD) pgd_set(PGD, PUD) + +static inline pud_t * +pud_alloc_one(struct mm_struct *mm, unsigned long addr) +{ + pud_t *pudp; + + pudp = kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pudp, 0, PUD_TABLE_SIZE); + return pudp; +} + +static inline void +pud_free(pud_t *pud) +{ + kmem_cache_free(pmd_cache, pud); } #define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD) @@ -32,13 +50,17 @@ static inline pmd_t * pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + pmd_t *pmdp; + + pmdp = kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pmdp, 0, PMD_TABLE_SIZE); + return pmdp; } static inline void pmd_free(pmd_t *pmd) { - kmem_cache_free(zero_cache, pmd); + kmem_cache_free(pmd_cache, pmd); } #define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, pte) @@ -47,44 +69,54 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) { - pte_t *pte = kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); - if (pte) - return virt_to_page(pte); - return NULL; + return alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline void pte_free_kernel(pte_t *pte) { - kmem_cache_free(zero_cache, pte); + free_page((unsigned long)pte); } static inline void pte_free(struct page *ptepage) { - kmem_cache_free(zero_cache, page_address(ptepage)); + __free_page(ptepage); } -struct pte_freelist_batch +typedef struct pgtable_free { + unsigned long val; +} pgtable_free_t; + +static inline pgtable_free_t 
pgtable_free_page(struct page *page) { - struct rcu_head rcu; - unsigned int index; - struct page * pages[0]; -}; + return (pgtable_free_t){.val = (unsigned long) page}; +} -#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch)) / \ - sizeof(struct page *)) +static inline pgtable_free_t pgtable_free_cache(void *p) +{ + return (pgtable_free_t){.val = ((unsigned long) p) | 1}; +} -extern void pte_free_now(struct page *ptepage); -extern void pte_free_submit(struct pte_freelist_batch *batch); +static inline void pgtable_free(pgtable_free_t pgf) +{ + if (pgf.val & 1) + kmem_cache_free(pmd_cache, (void *)(pgf.val & ~1)); + else + __free_page((struct page *)pgf.val); +} -DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf); -void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage); -#define __pmd_free_tlb(tlb, pmd) __pte_free_tlb(tlb, virt_to_page(pmd)) +#define __pte_free_tlb(tlb, ptepage) \ + pgtable_free_tlb(tlb, pgtable_free_page(ptepage)) +#define __pmd_free_tlb(tlb, pmd) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pmd)) +#define __pud_free_tlb(tlb, pud) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pud)) #define check_pgt_cache() do { } while (0) Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-12 12:08:53.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-12 13:28:03.000000000 +1000 @@ -66,6 +66,14 @@ #include #include +#if PGTABLE_RANGE > USER_VSID_RANGE +#warning Limited user VSID range means pagetable space is wasted +#endif + +#if (TASK_SIZE_USER64 < PGTABLE_RANGE) && (TASK_SIZE_USER64 < USER_VSID_RANGE) +#warning TASK_SIZE is smaller than it needs to be.
+#endif + int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; static unsigned long phbs_io_bot = PHBS_IO_BASE; @@ -292,7 +300,7 @@ * Before that, we map using addresses going * up from ioremap_bot. imalloc will use * the addresses from ioremap_bot through - * IMALLOC_END (0xE000001fffffffff) + * IMALLOC_END * */ pa = addr & PAGE_MASK; @@ -896,23 +904,19 @@ return virt_addr; } -kmem_cache_t *zero_cache; - -static void zero_ctor(void *pte, kmem_cache_t *cache, unsigned long flags) -{ - memset(pte, 0, PAGE_SIZE); -} +kmem_cache_t *pmd_cache; void pgtable_cache_init(void) { - zero_cache = kmem_cache_create("zero", - PAGE_SIZE, - 0, - SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, - zero_ctor, - NULL); - if (!zero_cache) - panic("pgtable_cache_init(): could not create zero_cache!\n"); + BUILD_BUG_ON(PTE_TABLE_SIZE != PAGE_SIZE); + BUILD_BUG_ON(PMD_TABLE_SIZE != PUD_TABLE_SIZE); + BUILD_BUG_ON(PGD_TABLE_SIZE != PAGE_SIZE); + + pmd_cache = kmem_cache_create("pmd", PMD_TABLE_SIZE, PMD_TABLE_SIZE, + SLAB_POISON |SLAB_DEBUG_INITIAL, + NULL, NULL); + if (! 
pmd_cache) + panic("pmd_pud_cache_init(): could not create pmd_pud_cache!\n"); } pgprot_t phys_mem_access_prot(struct file *file, unsigned long addr, Index: working-2.6/include/asm-ppc64/processor.h =================================================================== --- working-2.6.orig/include/asm-ppc64/processor.h 2005-05-12 12:08:53.000000000 +1000 +++ working-2.6/include/asm-ppc64/processor.h 2005-05-12 13:29:50.000000000 +1000 @@ -531,7 +531,7 @@ extern struct task_struct *last_task_used_altivec; /* 64-bit user address space is 41-bits (2TBs user VM) */ -#define TASK_SIZE_USER64 (0x0000020000000000UL) +#define TASK_SIZE_USER64 (0x0000100000000000UL) /* * 32-bit user address space is 4GB - 1 page Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2005-05-12 12:08:53.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2005-05-12 12:08:54.000000000 +1000 @@ -38,6 +38,7 @@ #include #include #include +#include #ifdef CONFIG_PPC_ISERIES #define DO_SOFT_DISABLE @@ -2117,17 +2118,17 @@ empty_zero_page: .space 4096 - .globl swapper_pg_dir -swapper_pg_dir: - .space 4096 - #ifdef CONFIG_SMP /* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */ .globl stab_array stab_array: .space 4096 * 48 #endif - + + .globl swapper_pg_dir +swapper_pg_dir: + .space PAGE_SIZE + /* * This space gets a copy of optional info passed to us by the bootstrap * Used to pass parameters into the kernel like root=/dev/sda1, etc. 
Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-05-11 10:05:49.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-12 13:28:17.000000000 +1000 @@ -30,7 +30,7 @@ break; if ((unsigned long)tmp->addr >= ioremap_bot) addr = tmp->size + (unsigned long) tmp->addr; - if (addr > IMALLOC_END-size) + if (addr >= IMALLOC_END-size) return 1; } *im_addr = addr; Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-12 11:56:11.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-12 13:26:06.000000000 +1000 @@ -298,7 +298,7 @@ int local = 0; cpumask_t tmp; - if ((ea & ~REGION_MASK) > EADDR_MASK) + if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) return 1; switch (REGION_ID(ea)) { Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2005-05-12 13:24:06.000000000 +1000 @@ -259,8 +259,10 @@ #define VSID_BITS 36 #define VSID_MODULUS ((1UL< References: <20050512033953.GA29780@localhost.localdomain> Message-ID: <20050512040804.GB29780@localhost.localdomain> On Thu, May 12, 2005 at 01:39:53PM +1000, David Gibson wrote: > Here's a first shot at patch which implements true four-level page > tables for ppc64. It uses full page tables at the bottom and top > levels, and quarter-page tables at the middle two levels. This gives > a total usable address space of 44 bits (16T). I've also tweaked the > VSID allocation to let us use all that space (thereby halving the > number of available contexts) and added some #if and BUILD_BUG sanity > checks. > > Hugepages are presently completely broken, working on that now. 
This > patch applies on top of the patch posted earlier eliminating > ioremap_dir.

Bah, sorry. Patch broken. Changes to arch/ppc64/mm/tlb.c are
missing. Fixed version later.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/people/dgibson

From david at gibson.dropbear.id.au Thu May 12 14:45:46 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 12 May 2005 14:45:46 +1000
Subject: Revised patch to kill ioremap_mm
Message-ID: <20050512044546.GD29780@localhost.localdomain>

Here's a new version of the patch which removes unmap_im_area() and
its pagetable walking functions, as well as the extra set of tables
themselves.

 arch/ppc64/kernel/eeh.c       |    2
 arch/ppc64/kernel/head.S      |    4 -
 arch/ppc64/kernel/process.c   |    8 ---
 arch/ppc64/mm/hash_utils.c    |    4 -
 arch/ppc64/mm/imalloc.c       |   20 +++++----
 arch/ppc64/mm/init.c          |   93 ++++--------------------------------------
 include/asm-ppc64/imalloc.h   |   12 +++--
 include/asm-ppc64/page.h      |    2
 include/asm-ppc64/pgtable.h   |    9 ----
 include/asm-ppc64/processor.h |   10 ----
 10 files changed, 31 insertions(+), 133 deletions(-)

Currently ppc64 has two mm_structs for the kernel, init_mm and also
ioremap_mm. The latter really isn't necessary: this patch abolishes
it, instead restricting vmallocs to the lower 1TB of the init_mm's
range and placing io mappings in the upper 1TB. This simplifies the
code in a number of places and eliminates an unnecessary set of
pagetables. It also tweaks the unmap/free path a little, allowing us
to also remove the im_ set of page table walkers, replacing them with
unmap_vm_area().
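
[Editorial sketch, not part of the posted patch.] The split described above can be checked with a little arithmetic. The constants below are copied from the hunks in this patch (VMALLOC_START, VMALLOC_SIZE, the 2GB PHB reservation, and the three-level EADDR_SIZE); PAGE_SHIFT = 12 is an assumption based on the 4096-byte tables in head.S, and `classify_ea()` and `enum ea_area` are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

/* Index sizes of the three-level layout this patch applies to.
 * PAGE_SHIFT = 12 is assumed (4K pages). */
#define PAGE_SHIFT      12
#define PTE_INDEX_SIZE  9
#define PMD_INDEX_SIZE  10
#define PGD_INDEX_SIZE  10
#define EADDR_SIZE      (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \
                         PGD_INDEX_SIZE + PAGE_SHIFT)
#define EADDR_MASK      ((UINT64_C(1) << EADDR_SIZE) - 1)

/* Region boundaries as defined in pgtable.h and imalloc.h by the patch. */
#define VMALLOC_START   UINT64_C(0xD000000000000000)
#define VMALLOC_SIZE    UINT64_C(0x10000000000)        /* 1TB */
#define VMALLOC_END     (VMALLOC_START + VMALLOC_SIZE)
#define PHBS_IO_BASE    VMALLOC_END
#define IMALLOC_BASE    (PHBS_IO_BASE + UINT64_C(0x80000000)) /* 2GB for PHBs */
#define IMALLOC_END     (VMALLOC_START + EADDR_MASK)

/* Hypothetical classification of an effective address, for illustration. */
enum ea_area { EA_OTHER, EA_VMALLOC, EA_PHB_IO, EA_IMALLOC };

static enum ea_area classify_ea(uint64_t ea)
{
    if (ea >= VMALLOC_START && ea < VMALLOC_END)
        return EA_VMALLOC;      /* lower 1TB: vmalloc */
    if (ea >= PHBS_IO_BASE && ea < IMALLOC_BASE)
        return EA_PHB_IO;       /* first 2GB of the upper 1TB */
    if (ea >= IMALLOC_BASE && ea <= IMALLOC_END)
        return EA_IMALLOC;      /* rest of the upper 1TB */
    return EA_OTHER;
}
```

Under these assumptions the init_mm range is 41 bits (2TB) above VMALLOC_START, so everything that used to live in the 0xE region now fits in the top half of the 0xD region.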
Signed-off-by: David Gibson Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-12 14:08:37.000000000 +1000 @@ -53,7 +53,8 @@ * Define the address range of the vmalloc VM area. */ #define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_END (VMALLOC_START + EADDR_MASK) +#define VMALLOC_SIZE (0x10000000000UL) +#define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) /* * Bits in a linux-style PTE. These match the bits in the @@ -239,9 +240,6 @@ /* This now only contains the vmalloc pages */ #define pgd_offset_k(address) pgd_offset(&init_mm, address) -/* to find an entry in the ioremap page-table-directory */ -#define pgd_offset_i(address) (ioremap_pgd + pgd_index(address)) - /* * The following only work if pte_present() is true. * Undefined behaviour if not.. @@ -459,15 +457,12 @@ #define __HAVE_ARCH_PTE_SAME #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) -extern unsigned long ioremap_bot, ioremap_base; - #define pmd_ERROR(e) \ printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) #define pgd_ERROR(e) \ printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) extern pgd_t swapper_pg_dir[]; -extern pgd_t ioremap_dir[]; extern void paging_init(void); Index: working-2.6/include/asm-ppc64/imalloc.h =================================================================== --- working-2.6.orig/include/asm-ppc64/imalloc.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/imalloc.h 2005-05-12 14:08:37.000000000 +1000 @@ -4,9 +4,9 @@ /* * Define the address range of the imalloc VM area. 
*/ -#define PHBS_IO_BASE IOREGIONBASE -#define IMALLOC_BASE (IOREGIONBASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ -#define IMALLOC_END (IOREGIONBASE + EADDR_MASK) +#define PHBS_IO_BASE VMALLOC_END +#define IMALLOC_BASE (PHBS_IO_BASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ +#define IMALLOC_END (VMALLOC_START + EADDR_MASK) /* imalloc region types */ @@ -18,7 +18,9 @@ extern struct vm_struct * im_get_free_area(unsigned long size); extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, - int region_type); -unsigned long im_free(void *addr); + int region_type); +extern void im_free(void *addr); + +extern unsigned long ioremap_bot; #endif /* _PPC64_IMALLOC_H */ Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-12 14:08:37.000000000 +1000 @@ -202,9 +202,7 @@ #define PAGE_OFFSET ASM_CONST(0xC000000000000000) #define KERNELBASE PAGE_OFFSET #define VMALLOCBASE ASM_CONST(0xD000000000000000) -#define IOREGIONBASE ASM_CONST(0xE000000000000000) -#define IO_REGION_ID (IOREGIONBASE >> REGION_SHIFT) #define VMALLOC_REGION_ID (VMALLOCBASE >> REGION_SHIFT) #define KERNEL_REGION_ID (KERNELBASE >> REGION_SHIFT) #define USER_REGION_ID (0UL) Index: working-2.6/arch/ppc64/kernel/eeh.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/eeh.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/eeh.c 2005-05-12 14:05:12.000000000 +1000 @@ -505,7 +505,7 @@ pte_t *ptep; unsigned long pa; - ptep = find_linux_pte(ioremap_mm.pgd, token); + ptep = find_linux_pte(init_mm.pgd, token); if (!ptep) return token; pa = pte_pfn(*ptep) << PAGE_SHIFT; Index: working-2.6/arch/ppc64/kernel/process.c =================================================================== --- 
working-2.6.orig/arch/ppc64/kernel/process.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/process.c 2005-05-12 14:05:12.000000000 +1000 @@ -58,14 +58,6 @@ struct task_struct *last_task_used_altivec = NULL; #endif -struct mm_struct ioremap_mm = { - .pgd = ioremap_dir, - .mm_users = ATOMIC_INIT(2), - .mm_count = ATOMIC_INIT(1), - .cpu_vm_mask = CPU_MASK_ALL, - .page_table_lock = SPIN_LOCK_UNLOCKED, -}; - /* * Make sure the floating-point register state in the * the thread_struct is up to date for task tsk. Index: working-2.6/include/asm-ppc64/processor.h =================================================================== --- working-2.6.orig/include/asm-ppc64/processor.h 2005-04-26 15:38:02.000000000 +1000 +++ working-2.6/include/asm-ppc64/processor.h 2005-05-12 14:08:37.000000000 +1000 @@ -590,16 +590,6 @@ } /* - * Note: the vm_start and vm_end fields here should *not* - * be in kernel space. (Could vm_end == vm_start perhaps?) - */ -#define IOREMAP_MMAP { &ioremap_mm, 0, 0x1000, NULL, \ - PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC, \ - 1, NULL, NULL } - -extern struct mm_struct ioremap_mm; - -/* * Return saved PC of a blocked thread. 
For now, this is the "user" PC */ #define thread_saved_pc(tsk) \ Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-11 10:05:49.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-12 14:08:37.000000000 +1000 @@ -310,10 +310,6 @@ vsid = get_vsid(mm->context.id, ea); break; - case IO_REGION_ID: - mm = &ioremap_mm; - vsid = get_kernel_vsid(ea); - break; case VMALLOC_REGION_ID: mm = &init_mm; vsid = get_kernel_vsid(ea); Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-11 10:05:49.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-12 14:11:16.000000000 +1000 @@ -73,9 +73,6 @@ extern pgd_t swapper_pg_dir[]; extern struct task_struct *current_set[NR_CPUS]; -extern pgd_t ioremap_dir[]; -pgd_t * ioremap_pgd = (pgd_t *)&ioremap_dir; - unsigned long klimit = (unsigned long)_end; unsigned long _SDR1=0; @@ -137,69 +134,6 @@ #else -static void unmap_im_area_pte(pmd_t *pmd, unsigned long addr, - unsigned long end) -{ - pte_t *pte; - - pte = pte_offset_kernel(pmd, addr); - do { - pte_t ptent = ptep_get_and_clear(&ioremap_mm, addr, pte); - WARN_ON(!pte_none(ptent) && !pte_present(ptent)); - } while (pte++, addr += PAGE_SIZE, addr != end); -} - -static inline void unmap_im_area_pmd(pud_t *pud, unsigned long addr, - unsigned long end) -{ - pmd_t *pmd; - unsigned long next; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (pmd_none_or_clear_bad(pmd)) - continue; - unmap_im_area_pte(pmd, addr, next); - } while (pmd++, addr = next, addr != end); -} - -static inline void unmap_im_area_pud(pgd_t *pgd, unsigned long addr, - unsigned long end) -{ - pud_t *pud; - unsigned long next; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) - continue; - 
unmap_im_area_pmd(pud, addr, next); - } while (pud++, addr = next, addr != end); -} - -static void unmap_im_area(unsigned long addr, unsigned long end) -{ - struct mm_struct *mm = &ioremap_mm; - unsigned long next; - pgd_t *pgd; - - spin_lock(&mm->page_table_lock); - - pgd = pgd_offset_i(addr); - flush_cache_vunmap(addr, end); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - continue; - unmap_im_area_pud(pgd, addr, next); - } while (pgd++, addr = next, addr != end); - flush_tlb_kernel_range(start, end); - - spin_unlock(&mm->page_table_lock); -} - /* * map_io_page currently only called by __ioremap * map_io_page adds an entry to the ioremap page table @@ -214,21 +148,21 @@ unsigned long vsid; if (mem_init_done) { - spin_lock(&ioremap_mm.page_table_lock); - pgdp = pgd_offset_i(ea); - pudp = pud_alloc(&ioremap_mm, pgdp, ea); + spin_lock(&init_mm.page_table_lock); + pgdp = pgd_offset_k(ea); + pudp = pud_alloc(&init_mm, pgdp, ea); if (!pudp) return -ENOMEM; - pmdp = pmd_alloc(&ioremap_mm, pudp, ea); + pmdp = pmd_alloc(&init_mm, pudp, ea); if (!pmdp) return -ENOMEM; - ptep = pte_alloc_kernel(&ioremap_mm, pmdp, ea); + ptep = pte_alloc_kernel(&init_mm, pmdp, ea); if (!ptep) return -ENOMEM; pa = abs_to_phys(pa); - set_pte_at(&ioremap_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, + set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, __pgprot(flags))); - spin_unlock(&ioremap_mm.page_table_lock); + spin_unlock(&init_mm.page_table_lock); } else { unsigned long va, vpn, hash, hpteg; @@ -267,13 +201,9 @@ for (i = 0; i < size; i += PAGE_SIZE) if (map_io_page(ea+i, pa+i, flags)) - goto failure; + return NULL; return (void __iomem *) (ea + (addr & ~PAGE_MASK)); - failure: - if (mem_init_done) - unmap_im_area(ea, ea + size); - return NULL; } @@ -381,19 +311,14 @@ */ void iounmap(volatile void __iomem *token) { - unsigned long address, size; void *addr; if (!mem_init_done) return; addr = (void *) ((unsigned long __force) token & PAGE_MASK); - - if ((size = 
im_free(addr)) == 0) - return; - address = (unsigned long)addr; - unmap_im_area(address, address + size); + im_free(addr); } static int iounmap_subset_regions(unsigned long addr, unsigned long size) Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2005-05-12 14:08:37.000000000 +1000 @@ -2121,10 +2121,6 @@ swapper_pg_dir: .space 4096 - .globl ioremap_dir -ioremap_dir: - .space 4096 - #ifdef CONFIG_SMP /* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */ .globl stab_array Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-05-11 10:05:49.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-12 14:20:33.000000000 +1000 @@ -15,6 +15,7 @@ #include #include #include +#include static DECLARE_MUTEX(imlist_sem); struct vm_struct * imlist = NULL; @@ -285,29 +286,32 @@ return area; } -unsigned long im_free(void * addr) +void im_free(void * addr) { struct vm_struct **p, *tmp; - unsigned long ret_size = 0; if (!addr) - return ret_size; - if ((PAGE_SIZE-1) & (unsigned long) addr) { + return; + if ((unsigned long) addr & ~PAGE_MASK) { printk(KERN_ERR "Trying to %s bad address (%p)\n", __FUNCTION__, addr); - return ret_size; + return; } down(&imlist_sem); for (p = &imlist ; (tmp = *p) ; p = &tmp->next) { if (tmp->addr == addr) { - ret_size = tmp->size; *p = tmp->next; + + /* XXX: do we need the lock? 
*/ + spin_lock(&init_mm.page_table_lock); + unmap_vm_area(tmp); + spin_unlock(&init_mm.page_table_lock); + kfree(tmp); up(&imlist_sem); - return ret_size; + return; } } up(&imlist_sem); printk(KERN_ERR "Trying to %s nonexistent area (%p)\n", __FUNCTION__, addr); - return ret_size; } -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From david at gibson.dropbear.id.au Thu May 12 15:56:10 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Thu, 12 May 2005 15:56:10 +1000 Subject: First cut at four-level pagetables In-Reply-To: <20050512040804.GB29780@localhost.localdomain> References: <20050512033953.GA29780@localhost.localdomain> <20050512040804.GB29780@localhost.localdomain> Message-ID: <20050512055610.GF29780@localhost.localdomain> On Thu, May 12, 2005 at 02:08:04PM +1000, David Gibson wrote: > On Thu, May 12, 2005 at 01:39:53PM +1000, David Gibson wrote: > > Here's a first shot at patch which implements true four-level page > > tables for ppc64. It uses full page tables at the bottom and top > > levels, and quarter-page tables at the middle two levels. This gives > > a total usable address space of 44 bits (16T). I've also tweaked the > > VSID allocation to let us use all that space (thereby halving the > > number of available contexts) and added some #if and BUILD_BUG sanity > > checks. > > > > Hugepages are presently completely broken, working on that now. This > > patch applies on top of the patch posted earlier eliminating > > ioremap_dir. > > Bah, sorry. Patch broken. Changes to arch/ppc64/mm/tlb.c are > missing. Fixed version later. Ok, better patch below. Hugepages are still broken, but otherwise this seems to work, at least against elementary testing. 
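
[Editorial sketch, not part of the posted patch.] The "44 bits (16T)" figure follows from the index sizes the patch introduces (9, 7, 7 and 9 bits) plus the page offset. The sketch below redoes that arithmetic and splits an address into its four table indices. PAGE_SHIFT = 12, the 8-byte entry size, and PMD_SHIFT = PAGE_SHIFT + PTE_INDEX_SIZE are assumptions (the PMD_SHIFT definition is not visible in the hunks); `split_ea()`, `struct pt_indices` and `TABLE_SIZE()` are hypothetical names used only here.

```c
#include <stdint.h>

#define PAGE_SHIFT      12      /* assumed: 4K pages */
#define PTE_INDEX_SIZE  9
#define PMD_INDEX_SIZE  7
#define PUD_INDEX_SIZE  7
#define PGD_INDEX_SIZE  9

/* PMD_SHIFT is assumed; the other shifts follow the patch's definitions. */
#define PMD_SHIFT       (PAGE_SHIFT + PTE_INDEX_SIZE)
#define PUD_SHIFT       (PMD_SHIFT + PMD_INDEX_SIZE)
#define PGDIR_SHIFT     (PUD_SHIFT + PUD_INDEX_SIZE)

#define PGTABLE_EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \
                            PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT)
#define PGTABLE_RANGE   (UINT64_C(1) << PGTABLE_EADDR_SIZE)

/* With every level holding 8-byte entries, table size depends only on
 * the number of index bits. */
#define ENTRY_SIZE      8
#define TABLE_SIZE(index_bits) ((uint64_t)ENTRY_SIZE << (index_bits))

struct pt_indices {
    unsigned pgd, pud, pmd, pte;
};

/* Split an in-range effective address into its four table indices. */
static struct pt_indices split_ea(uint64_t ea)
{
    struct pt_indices ix;

    ix.pgd = (unsigned)((ea >> PGDIR_SHIFT) & ((1u << PGD_INDEX_SIZE) - 1));
    ix.pud = (unsigned)((ea >> PUD_SHIFT) & ((1u << PUD_INDEX_SIZE) - 1));
    ix.pmd = (unsigned)((ea >> PMD_SHIFT) & ((1u << PMD_INDEX_SIZE) - 1));
    ix.pte = (unsigned)((ea >> PAGE_SHIFT) & ((1u << PTE_INDEX_SIZE) - 1));
    return ix;
}
```

With these values the top and bottom tables come out at 4096 bytes (a full page) and the two middle tables at 1024 bytes (a quarter page), which is what the BUILD_BUG_ON checks in pgtable_cache_init() assert, and PGTABLE_RANGE equals the new TASK_SIZE_USER64 of 0x0000100000000000.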
arch/ppc64/kernel/head.S | 11 ++-- arch/ppc64/mm/hash_utils.c | 2 arch/ppc64/mm/imalloc.c | 2 arch/ppc64/mm/init.c | 34 ++++++++------- arch/ppc64/mm/tlb.c | 95 ++++++++++++++++++++++++------------------ include/asm-ppc64/imalloc.h | 2 include/asm-ppc64/mmu.h | 6 +- include/asm-ppc64/page.h | 27 ++++++----- include/asm-ppc64/pgalloc.h | 80 ++++++++++++++++++++++++----------- include/asm-ppc64/pgtable.h | 80 +++++++++++++++++++++++------------ include/asm-ppc64/processor.h | 2 11 files changed, 213 insertions(+), 128 deletions(-) This patch implements full four-level page tables for ppc64. It uses a full page for the tables at the bottom and top level, and a quarter page for the intermediate levels. This gives a total usable address space of 44 bits (16T). This patch also tweaks the VSID allocation to have a matching range for user addresses (thereby halving the number of available contexts) and adds some #if and BUILD_BUG sanity checks. Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-12 14:24:04.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-12 14:36:48.000000000 +1000 @@ -15,19 +15,24 @@ #include #endif /* __ASSEMBLY__ */ -#include - /* * Entries per page directory level. The PTE level must use a 64b record * for each page table entry. The PMD and PGD level use a 32b record for * each entry by assuming that each entry is page aligned. 
*/ #define PTE_INDEX_SIZE 9 -#define PMD_INDEX_SIZE 10 -#define PGD_INDEX_SIZE 10 +#define PMD_INDEX_SIZE 7 +#define PUD_INDEX_SIZE 7 +#define PGD_INDEX_SIZE 9 + +#define PTE_TABLE_SIZE (sizeof(pte_t) << PTE_INDEX_SIZE) +#define PMD_TABLE_SIZE (sizeof(pmd_t) << PMD_INDEX_SIZE) +#define PUD_TABLE_SIZE (sizeof(pud_t) << PUD_INDEX_SIZE) +#define PGD_TABLE_SIZE (sizeof(pgd_t) << PGD_INDEX_SIZE) #define PTRS_PER_PTE (1 << PTE_INDEX_SIZE) #define PTRS_PER_PMD (1 << PMD_INDEX_SIZE) +#define PTRS_PER_PUD (1 << PUD_INDEX_SIZE) #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE) /* PMD_SHIFT determines what a second-level page table entry can map */ @@ -35,8 +40,13 @@ #define PMD_SIZE (1UL << PMD_SHIFT) #define PMD_MASK (~(PMD_SIZE-1)) -/* PGDIR_SHIFT determines what a third-level page table entry can map */ -#define PGDIR_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +/* PUD_SHIFT determines what a third-level page table entry can map */ +#define PUD_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +#define PUD_SIZE (1UL << PUD_SHIFT) +#define PUD_MASK (~(PUD_SIZE-1)) + +/* PGDIR_SHIFT determines what a fourth-level page table entry can map */ +#define PGDIR_SHIFT (PUD_SHIFT + PUD_INDEX_SIZE) #define PGDIR_SIZE (1UL << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE-1)) @@ -45,15 +55,23 @@ /* * Size of EA range mapped by our pagetables. */ -#define EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ - PGD_INDEX_SIZE + PAGE_SHIFT) -#define EADDR_MASK ((1UL << EADDR_SIZE) - 1) +#define PGTABLE_EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ + PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT) +#define PGTABLE_RANGE (1UL << PGTABLE_EADDR_SIZE) + +#if TASK_SIZE_USER64 > PGTABLE_RANGE +#error TASK_SIZE_USER64 exceeds pagetable range +#endif + +#if TASK_SIZE_USER64 > (1UL << (USER_ESID_BITS + SID_SHIFT)) +#error TASK_SIZE_USER64 exceeds user VSID range +#endif /* * Define the address range of the vmalloc VM area.
*/ #define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_SIZE (0x10000000000UL) +#define VMALLOC_SIZE (0x40000000000UL) #define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) /* @@ -151,6 +169,8 @@ #ifdef CONFIG_HUGETLB_PAGE +#error Hugepages broken for now + #ifndef __ASSEMBLY__ int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local); @@ -197,39 +217,45 @@ #define pte_pfn(x) ((unsigned long)((pte_val(x) >> PTE_SHIFT))) #define pte_page(x) pfn_to_page(pte_pfn(x)) -#define pmd_set(pmdp, ptep) \ - (pmd_val(*(pmdp)) = __ba_to_bpn(ptep)) +#define pmd_set(pmdp, ptep) (pmd_val(*(pmdp)) = (unsigned long)(ptep)) #define pmd_none(pmd) (!pmd_val(pmd)) #define pmd_bad(pmd) (pmd_val(pmd) == 0) #define pmd_present(pmd) (pmd_val(pmd) != 0) #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0) -#define pmd_page_kernel(pmd) (__bpn_to_ba(pmd_val(pmd))) +#define pmd_page_kernel(pmd) (pmd_val(pmd)) #define pmd_page(pmd) virt_to_page(pmd_page_kernel(pmd)) -#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (__ba_to_bpn(pmdp))) +#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (unsigned long)(pmdp)) #define pud_none(pud) (!pud_val(pud)) -#define pud_bad(pud) ((pud_val(pud)) == 0UL) -#define pud_present(pud) (pud_val(pud) != 0UL) -#define pud_clear(pudp) (pud_val(*(pudp)) = 0UL) -#define pud_page(pud) (__bpn_to_ba(pud_val(pud))) +#define pud_bad(pud) ((pud_val(pud)) == 0) +#define pud_present(pud) (pud_val(pud) != 0) +#define pud_clear(pudp) (pud_val(*(pudp)) = 0) +#define pud_page(pud) (pud_val(pud)) + +#define pgd_set(pgdp, pudp) ({pgd_val(*(pgdp)) = (unsigned long)(pudp);}) +#define pgd_none(pgd) (!pgd_val(pgd)) +#define pgd_bad(pgd) (pgd_val(pgd) == 0) +#define pgd_present(pgd) (pgd_val(pgd) != 0) +#define pgd_clear(pgdp) (pgd_val(*(pgdp)) = 0) +#define pgd_page(pgd) (pgd_val(pgd)) /* * Find an entry in a page-table-directory. We combine the address region * (the high order N bits) and the pgd portion of the address. 
*/ /* to avoid overflow in free_pgtables we don't use PTRS_PER_PGD here */ -#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x7ff) +#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x1ff) #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address)) -/* Find an entry in the second-level page table.. */ +#define pud_offset(pgdp, addr) \ + (((pud_t *) pgd_page(*(pgdp))) + (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))) + #define pmd_offset(pudp,addr) \ - ((pmd_t *) pud_page(*(pudp)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) + (((pmd_t *) pud_page(*(pudp))) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) -/* Find an entry in the third-level page table.. */ #define pte_offset_kernel(dir,addr) \ - ((pte_t *) pmd_page_kernel(*(dir)) \ - + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) + (((pte_t *) pmd_page_kernel(*(dir))) + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) #define pte_offset_map(dir,addr) pte_offset_kernel((dir), (addr)) #define pte_offset_map_nested(dir,addr) pte_offset_kernel((dir), (addr)) @@ -458,9 +484,11 @@ #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) #define pmd_ERROR(e) \ - printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) + printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e)) +#define pud_ERROR(e) \ + printk("%s:%d: bad pud %08lx.\n", __FILE__, __LINE__, pud_val(e)) #define pgd_ERROR(e) \ - printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) + printk("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e)) extern pgd_t swapper_pg_dir[]; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-12 14:24:04.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-12 14:36:48.000000000 +1000 @@ -35,6 +35,8 @@ #ifdef CONFIG_HUGETLB_PAGE +#error Hugepages broken for now + #define HUGETLB_PAGE_ORDER (HPAGE_SHIFT - PAGE_SHIFT) /* For 64-bit
processes the hugepage range is 1T-1.5T */ @@ -91,7 +93,7 @@ #ifndef __ASSEMBLY__ #include -#undef STRICT_MM_TYPECHECKS +#define STRICT_MM_TYPECHECKS #define REGION_SIZE 4UL #define REGION_SHIFT 60UL @@ -125,27 +127,31 @@ * Entries in the pte table are 64b, while entries in the pgd & pmd are 32b. */ typedef struct { unsigned long pte; } pte_t; -typedef struct { unsigned int pmd; } pmd_t; -typedef struct { unsigned int pgd; } pgd_t; +typedef struct { unsigned long pmd; } pmd_t; +typedef struct { unsigned long pud; } pud_t; +typedef struct { unsigned long pgd; } pgd_t; typedef struct { unsigned long pgprot; } pgprot_t; #define pte_val(x) ((x).pte) #define pmd_val(x) ((x).pmd) +#define pud_val(x) ((x).pud) #define pgd_val(x) ((x).pgd) #define pgprot_val(x) ((x).pgprot) -#define __pte(x) ((pte_t) { (x) } ) -#define __pmd(x) ((pmd_t) { (x) } ) -#define __pgd(x) ((pgd_t) { (x) } ) -#define __pgprot(x) ((pgprot_t) { (x) } ) +#define __pte(x) ((pte_t) { (x) }) +#define __pmd(x) ((pmd_t) { (x) }) +#define __pud(x) ((pud_t) { (x) }) +#define __pgd(x) ((pgd_t) { (x) }) +#define __pgprot(x) ((pgprot_t) { (x) }) #else /* * .. 
while these make it easier on the compiler */ typedef unsigned long pte_t; -typedef unsigned int pmd_t; -typedef unsigned int pgd_t; +typedef unsigned long pmd_t; +typedef unsigned long pud_t; +typedef unsigned long pgd_t; typedef unsigned long pgprot_t; #define pte_val(x) (x) @@ -208,9 +214,6 @@ #define USER_REGION_ID (0UL) #define REGION_ID(ea) (((unsigned long)(ea)) >> REGION_SHIFT) -#define __bpn_to_ba(x) ((((unsigned long)(x)) << PAGE_SHIFT) + KERNELBASE) -#define __ba_to_bpn(x) ((((unsigned long)(x)) & ~REGION_MASK) >> PAGE_SHIFT) - #define __va(x) ((void *)((unsigned long)(x) + KERNELBASE)) #ifdef CONFIG_DISCONTIGMEM Index: working-2.6/include/asm-ppc64/pgalloc.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgalloc.h 2005-05-02 08:57:22.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgalloc.h 2005-05-12 14:36:48.000000000 +1000 @@ -6,7 +6,7 @@ #include #include -extern kmem_cache_t *zero_cache; +extern kmem_cache_t *pmd_cache; /* * This program is free software; you can redistribute it and/or @@ -18,13 +18,31 @@ static inline pgd_t * pgd_alloc(struct mm_struct *mm) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL); + return (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO); } static inline void pgd_free(pgd_t *pgd) { - kmem_cache_free(zero_cache, pgd); + free_page((unsigned long)pgd); +} + +#define pgd_populate(MM, PGD, PUD) pgd_set(PGD, PUD) + +static inline pud_t * +pud_alloc_one(struct mm_struct *mm, unsigned long addr) +{ + pud_t *pudp; + + pudp = kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pudp, 0, PUD_TABLE_SIZE); + return pudp; +} + +static inline void +pud_free(pud_t *pud) +{ + kmem_cache_free(pmd_cache, pud); } #define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD) @@ -32,13 +50,17 @@ static inline pmd_t * pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + pmd_t *pmdp; + + pmdp = 
kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pmdp, 0, PMD_TABLE_SIZE); + return pmdp; } static inline void pmd_free(pmd_t *pmd) { - kmem_cache_free(zero_cache, pmd); + kmem_cache_free(pmd_cache, pmd); } #define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, pte) @@ -47,44 +69,54 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) { - pte_t *pte = kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); - if (pte) - return virt_to_page(pte); - return NULL; + return alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline void pte_free_kernel(pte_t *pte) { - kmem_cache_free(zero_cache, pte); + free_page((unsigned long)pte); } static inline void pte_free(struct page *ptepage) { - kmem_cache_free(zero_cache, page_address(ptepage)); + __free_page(ptepage); } -struct pte_freelist_batch +typedef struct pgtable_free { + unsigned long val; +} pgtable_free_t; + +static inline pgtable_free_t pgtable_free_page(struct page *page) { - struct rcu_head rcu; - unsigned int index; - struct page * pages[0]; -}; + return (pgtable_free_t){.val = (unsigned long) page}; +} -#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch)) / \ - sizeof(struct page *)) +static inline pgtable_free_t pgtable_free_cache(void *p) +{ + return (pgtable_free_t){.val = ((unsigned long) p) | 1}; +} -extern void pte_free_now(struct page *ptepage); -extern void pte_free_submit(struct pte_freelist_batch *batch); +static inline void pgtable_free(pgtable_free_t pgf) +{ + if (pgf.val & 1) + kmem_cache_free(pmd_cache, (void *)(pgf.val & ~1)); + else + __free_page((struct page *)pgf.val); +} -DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t 
pgf); -void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage); -#define __pmd_free_tlb(tlb, pmd) __pte_free_tlb(tlb, virt_to_page(pmd)) +#define __pte_free_tlb(tlb, ptepage) \ + pgtable_free_tlb(tlb, pgtable_free_page(ptepage)) +#define __pmd_free_tlb(tlb, pmd) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pmd)) +#define __pud_free_tlb(tlb, pud) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pud)) #define check_pgt_cache() do { } while (0) Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-12 14:24:04.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-12 14:36:48.000000000 +1000 @@ -66,6 +66,14 @@ #include #include +#if PGTABLE_RANGE > USER_VSID_RANGE +#warning Limited user VSID range means pagetable space is wasted +#endif + +#if (TASK_SIZE_USER64 < PGTABLE_RANGE) && (TASK_SIZE_USER64 < USER_VSID_RANGE) +#warning TASK_SIZE is smaller than it needs to be. +#endif + int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; static unsigned long phbs_io_bot = PHBS_IO_BASE; @@ -225,7 +233,7 @@ * Before that, we map using addresses going * up from ioremap_bot.
imalloc will use * the addresses from ioremap_bot through - * IMALLOC_END (0xE000001fffffffff) + * IMALLOC_END * */ pa = addr & PAGE_MASK; @@ -824,23 +832,19 @@ return virt_addr; } -kmem_cache_t *zero_cache; - -static void zero_ctor(void *pte, kmem_cache_t *cache, unsigned long flags) -{ - memset(pte, 0, PAGE_SIZE); -} +kmem_cache_t *pmd_cache; void pgtable_cache_init(void) { - zero_cache = kmem_cache_create("zero", - PAGE_SIZE, - 0, - SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, - zero_ctor, - NULL); - if (!zero_cache) - panic("pgtable_cache_init(): could not create zero_cache!\n"); + BUILD_BUG_ON(PTE_TABLE_SIZE != PAGE_SIZE); + BUILD_BUG_ON(PMD_TABLE_SIZE != PUD_TABLE_SIZE); + BUILD_BUG_ON(PGD_TABLE_SIZE != PAGE_SIZE); + + pmd_cache = kmem_cache_create("pmd", PMD_TABLE_SIZE, PMD_TABLE_SIZE, + SLAB_POISON |SLAB_DEBUG_INITIAL, + NULL, NULL); + if (! pmd_cache) + panic("pmd_pud_cache_init(): could not create pmd_pud_cache!\n"); } pgprot_t phys_mem_access_prot(struct file *file, unsigned long addr, Index: working-2.6/include/asm-ppc64/processor.h =================================================================== --- working-2.6.orig/include/asm-ppc64/processor.h 2005-05-12 14:24:04.000000000 +1000 +++ working-2.6/include/asm-ppc64/processor.h 2005-05-12 14:36:48.000000000 +1000 @@ -531,7 +531,7 @@ extern struct task_struct *last_task_used_altivec; /* 64-bit user address space is 41-bits (2TBs user VM) */ -#define TASK_SIZE_USER64 (0x0000020000000000UL) +#define TASK_SIZE_USER64 (0x0000100000000000UL) /* * 32-bit user address space is 4GB - 1 page Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2005-05-12 14:24:04.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2005-05-12 14:36:48.000000000 +1000 @@ -38,6 +38,7 @@ #include #include #include +#include #ifdef CONFIG_PPC_ISERIES #define DO_SOFT_DISABLE @@ -2117,17 +2118,17 @@ empty_zero_page: .space 
4096 - .globl swapper_pg_dir -swapper_pg_dir: - .space 4096 - #ifdef CONFIG_SMP /* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */ .globl stab_array stab_array: .space 4096 * 48 #endif - + + .globl swapper_pg_dir +swapper_pg_dir: + .space PAGE_SIZE + /* * This space gets a copy of optional info passed to us by the bootstrap * Used to pass parameters into the kernel like root=/dev/sda1, etc. Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-05-12 14:24:04.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-12 14:36:48.000000000 +1000 @@ -31,7 +31,7 @@ break; if ((unsigned long)tmp->addr >= ioremap_bot) addr = tmp->size + (unsigned long) tmp->addr; - if (addr > IMALLOC_END-size) + if (addr >= IMALLOC_END-size) return 1; } *im_addr = addr; Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-12 14:24:04.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-12 14:36:48.000000000 +1000 @@ -298,7 +298,7 @@ int local = 0; cpumask_t tmp; - if ((ea & ~REGION_MASK) > EADDR_MASK) + if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) return 1; switch (REGION_ID(ea)) { Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2005-05-12 14:36:48.000000000 +1000 @@ -259,8 +259,10 @@ #define VSID_BITS 36 #define VSID_MODULUS ((1UL<index; i++) + pgtable_free(batch->tables[i]); + + free_page((unsigned long)batch); +} + +static void pte_free_submit(struct pte_freelist_batch *batch) +{ + INIT_RCU_HEAD(&batch->rcu); + call_rcu(&batch->rcu, pte_free_rcu_callback); +} + +void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf) { /* This is 
safe as we are holding page_table_lock */ cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); @@ -49,19 +100,19 @@ if (atomic_read(&tlb->mm->mm_users) < 2 || cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) { - pte_free(ptepage); + pgtable_free(pgf); return; } if (*batchp == NULL) { *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); if (*batchp == NULL) { - pte_free_now(ptepage); + pgtable_free_now(pgf); return; } (*batchp)->index = 0; } - (*batchp)->pages[(*batchp)->index++] = ptepage; + (*batchp)->tables[(*batchp)->index++] = pgf; if ((*batchp)->index == PTE_FREELIST_SIZE) { pte_free_submit(*batchp); *batchp = NULL; @@ -132,42 +183,6 @@ put_cpu(); } -#ifdef CONFIG_SMP -static void pte_free_smp_sync(void *arg) -{ - /* Do nothing, just ensure we sync with all CPUs */ -} -#endif - -/* This is only called when we are critically out of memory - * (and fail to get a page in pte_free_tlb). - */ -void pte_free_now(struct page *ptepage) -{ - pte_freelist_forced_free++; - - smp_call_function(pte_free_smp_sync, NULL, 0, 1); - - pte_free(ptepage); -} - -static void pte_free_rcu_callback(struct rcu_head *head) -{ - struct pte_freelist_batch *batch = - container_of(head, struct pte_freelist_batch, rcu); - unsigned int i; - - for (i = 0; i < batch->index; i++) - pte_free(batch->pages[i]); - free_page((unsigned long)batch); -} - -void pte_free_submit(struct pte_freelist_batch *batch) -{ - INIT_RCU_HEAD(&batch->rcu); - call_rcu(&batch->rcu, pte_free_rcu_callback); -} - void pte_free_finish(void) { /* This is safe as we are holding page_table_lock */ -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/people/dgibson

From michael at ellerman.id.au Thu May 12 18:09:45 2005
From: michael at ellerman.id.au (Michael Ellerman)
Date: Thu, 12 May 2005 18:09:45 +1000
Subject: [PATCH 4/4] iseries_veth: Cleanup skbs to prevent unregister_netdevice() hanging
Message-ID: <200505121809.45419.michael@ellerman.id.au>

Hi Andrew, Jeff,

The iseries_veth driver is badly behaved in that it will keep TX packets
hanging around forever if they're not ACK'ed and the queue never fills up.

This causes the unregister_netdevice code to wait forever when we try to take
the device down, because there's still skbs around with references to our
struct net_device.

There's already code to cleanup any un-ACK'ed packets in veth_stop_connection()
but it's being called after we unregister the net_device, which is too late.

The fix is to rearrange the module exit function so that we cleanup any
outstanding skbs and then unregister the driver.

Signed-off-by: Michael Ellerman
--
 drivers/net/iseries_veth.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

Index: veth-fixes/drivers/net/iseries_veth.c
===================================================================
--- veth-fixes.orig/drivers/net/iseries_veth.c	2005-05-12 16:27:32.000000000 +1000
+++ veth-fixes/drivers/net/iseries_veth.c	2005-05-12 16:27:42.000000000 +1000
@@ -1388,18 +1388,25 @@
 {
 	int i;

-	vio_unregister_driver(&veth_driver);
+	/* Stop the queues first to stop any new packets being sent. */
+	for (i = 0; i < HVMAXARCHITECTEDVIRTUALLANS; i++)
+		if (veth_dev[i])
+			netif_stop_queue(veth_dev[i]);

+	/* Stop the connections before we unregister the driver. This
+	 * ensures there's no skbs lying around holding the device open. */
 	for (i = 0; i < HVMAXARCHITECTEDLPS; ++i)
 		veth_stop_connection(i);

 	HvLpEvent_unregisterHandler(HvLpEvent_Type_VirtualLan);

 	/* Hypervisor callbacks may have scheduled more work while we
-	 * were destroying connections. Now that we've disconnected from
+	 * were stopping connections.
Now that we've disconnected from
 	 * the hypervisor make sure everything's finished. */
 	flush_scheduled_work();

+	vio_unregister_driver(&veth_driver);
+
 	for (i = 0; i < HVMAXARCHITECTEDLPS; ++i)
 		veth_destroy_connection(i);

From david at gibson.dropbear.id.au Thu May 12 21:28:57 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Thu, 12 May 2005 21:28:57 +1000
Subject: [PATCH 4/4] iseries_veth: Cleanup skbs to prevent unregister_netdevice() hanging
In-Reply-To: <200505121809.45419.michael@ellerman.id.au>
References: <200505121809.45419.michael@ellerman.id.au>
Message-ID: <20050512112857.GC32694@localhost.localdomain>

On Thu, May 12, 2005 at 06:09:45PM +1000, Michael Ellerman wrote:
> Hi Andrew, Jeff,
>
> The iseries_veth driver is badly behaved in that it will keep TX packets
> hanging around forever if they're not ACK'ed and the queue never fills up.
>
> This causes the unregister_netdevice code to wait forever when we try to take
> the device down, because there's still skbs around with references to our
> struct net_device.
>
> There's already code to cleanup any un-ACK'ed packets in veth_stop_connection()
> but it's being called after we unregister the net_device, which is too late.
>
> The fix is to rearrange the module exit function so that we cleanup any
> outstanding skbs and then unregister the driver.
>
> Signed-off-by: Michael Ellerman

Nice catch.

Acked-by: David Gibson

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/people/dgibson

From will_schmidt at vnet.ibm.com Fri May 13 05:59:30 2005
From: will_schmidt at vnet.ibm.com (will schmidt)
Date: Thu, 12 May 2005 14:59:30 -0500
Subject: (updated) RFC/Patch xmon pte/pgd/userspace address additions.
Message-ID: <4283B5A2.4030706@vnet.ibm.com>

Hi Folks,

Per Ben's prompting me, :-) this version is updated to handle the
additional pud_offset calls (part of the 4L header stuff).
- I've removed the try_spinlock code;
- As an alternative to duplicating lots of functions to add mread calls in
  place of references, I've added setjmp(bus_error_jmp) {} around what seem
  more likely to be critical areas.
- cleaned up spacing
- changed most of the function names to be xmon_xxx instead of wm_xxx.
  These functions show up under a submenu 'w'. Use "w?" at the xmon> prompt
  to get the help blurb.

> the bulk of my intent was to make it easier for me to poke at memory within a particular user process.
>
> I realize that the spacing is a bit screwed up, and the function names should eventually change. Because I couldn't decide on letters for the new functions, I put them under a submenu 'w'.
>
> wP will dump info on all processes.
>
> wp 0xabc will make process with pid 0xabc the active pid. <- active only with respect to xmon poking into memory.
>
> wd 0xabcd1234 - will call through the pgd/pmd functions and return the kernel address corresponding to 0xabcd1234 within the process's memory space.
>
> wg will dump gprs of the process/thread.

-Will

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: xmon_pgd_may12.diff
Url: http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050512/bf82fdeb/attachment.txt

From david at gibson.dropbear.id.au Fri May 13 11:10:38 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Fri, 13 May 2005 11:10:38 +1000
Subject: [PATCH] ppc64: Abolish ioremap_mm
Message-ID: <20050513011038.GC19269@localhost.localdomain>

Andrew, please apply:

Currently ppc64 has two mm_structs for the kernel, init_mm and also
ioremap_mm. The latter really isn't necessary: this patch abolishes
it, instead restricting vmallocs to the lower 1TB of the init_mm's
range and placing io mappings in the upper 1TB. This simplifies the
code in a number of places and eliminates an unnecessary set of
pagetables.
It also tweaks the unmap/free path a little, allowing us to remove the unmap_im_area() set of page table walkers, replacing them with unmap_vm_area(). arch/ppc64/kernel/eeh.c | 2 arch/ppc64/kernel/head.S | 4 - arch/ppc64/kernel/process.c | 8 --- arch/ppc64/mm/hash_utils.c | 4 - arch/ppc64/mm/imalloc.c | 20 +++++---- arch/ppc64/mm/init.c | 93 ++++-------------------------------------- include/asm-ppc64/imalloc.h | 12 +++-- include/asm-ppc64/page.h | 2 include/asm-ppc64/pgtable.h | 9 ---- include/asm-ppc64/processor.h | 10 ---- 10 files changed, 31 insertions(+), 133 deletions(-) Signed-off-by: David Gibson Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-12 14:08:37.000000000 +1000 @@ -53,7 +53,8 @@ * Define the address range of the vmalloc VM area. */ #define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_END (VMALLOC_START + EADDR_MASK) +#define VMALLOC_SIZE (0x10000000000UL) +#define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) /* * Bits in a linux-style PTE. These match the bits in the @@ -239,9 +240,6 @@ /* This now only contains the vmalloc pages */ #define pgd_offset_k(address) pgd_offset(&init_mm, address) -/* to find an entry in the ioremap page-table-directory */ -#define pgd_offset_i(address) (ioremap_pgd + pgd_index(address)) - /* * The following only work if pte_present() is true. * Undefined behaviour if not.. 
@@ -459,15 +457,12 @@ #define __HAVE_ARCH_PTE_SAME #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) -extern unsigned long ioremap_bot, ioremap_base; - #define pmd_ERROR(e) \ printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) #define pgd_ERROR(e) \ printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) extern pgd_t swapper_pg_dir[]; -extern pgd_t ioremap_dir[]; extern void paging_init(void); Index: working-2.6/include/asm-ppc64/imalloc.h =================================================================== --- working-2.6.orig/include/asm-ppc64/imalloc.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/imalloc.h 2005-05-12 14:08:37.000000000 +1000 @@ -4,9 +4,9 @@ /* * Define the address range of the imalloc VM area. */ -#define PHBS_IO_BASE IOREGIONBASE -#define IMALLOC_BASE (IOREGIONBASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ -#define IMALLOC_END (IOREGIONBASE + EADDR_MASK) +#define PHBS_IO_BASE VMALLOC_END +#define IMALLOC_BASE (PHBS_IO_BASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ +#define IMALLOC_END (VMALLOC_START + EADDR_MASK) /* imalloc region types */ @@ -18,7 +18,9 @@ extern struct vm_struct * im_get_free_area(unsigned long size); extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, - int region_type); -unsigned long im_free(void *addr); + int region_type); +extern void im_free(void *addr); + +extern unsigned long ioremap_bot; #endif /* _PPC64_IMALLOC_H */ Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-11 10:05:51.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-12 14:08:37.000000000 +1000 @@ -202,9 +202,7 @@ #define PAGE_OFFSET ASM_CONST(0xC000000000000000) #define KERNELBASE PAGE_OFFSET #define VMALLOCBASE ASM_CONST(0xD000000000000000) -#define IOREGIONBASE ASM_CONST(0xE000000000000000) -#define IO_REGION_ID 
(IOREGIONBASE >> REGION_SHIFT) #define VMALLOC_REGION_ID (VMALLOCBASE >> REGION_SHIFT) #define KERNEL_REGION_ID (KERNELBASE >> REGION_SHIFT) #define USER_REGION_ID (0UL) Index: working-2.6/arch/ppc64/kernel/eeh.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/eeh.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/eeh.c 2005-05-12 14:05:12.000000000 +1000 @@ -505,7 +505,7 @@ pte_t *ptep; unsigned long pa; - ptep = find_linux_pte(ioremap_mm.pgd, token); + ptep = find_linux_pte(init_mm.pgd, token); if (!ptep) return token; pa = pte_pfn(*ptep) << PAGE_SHIFT; Index: working-2.6/arch/ppc64/kernel/process.c =================================================================== --- working-2.6.orig/arch/ppc64/kernel/process.c 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/process.c 2005-05-12 14:05:12.000000000 +1000 @@ -58,14 +58,6 @@ struct task_struct *last_task_used_altivec = NULL; #endif -struct mm_struct ioremap_mm = { - .pgd = ioremap_dir, - .mm_users = ATOMIC_INIT(2), - .mm_count = ATOMIC_INIT(1), - .cpu_vm_mask = CPU_MASK_ALL, - .page_table_lock = SPIN_LOCK_UNLOCKED, -}; - /* * Make sure the floating-point register state in the * the thread_struct is up to date for task tsk. Index: working-2.6/include/asm-ppc64/processor.h =================================================================== --- working-2.6.orig/include/asm-ppc64/processor.h 2005-04-26 15:38:02.000000000 +1000 +++ working-2.6/include/asm-ppc64/processor.h 2005-05-12 14:08:37.000000000 +1000 @@ -590,16 +590,6 @@ } /* - * Note: the vm_start and vm_end fields here should *not* - * be in kernel space. (Could vm_end == vm_start perhaps?) - */ -#define IOREMAP_MMAP { &ioremap_mm, 0, 0x1000, NULL, \ - PAGE_SHARED, VM_READ | VM_WRITE | VM_EXEC, \ - 1, NULL, NULL } - -extern struct mm_struct ioremap_mm; - -/* * Return saved PC of a blocked thread. 
For now, this is the "user" PC */ #define thread_saved_pc(tsk) \ Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-11 10:05:49.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-12 14:08:37.000000000 +1000 @@ -310,10 +310,6 @@ vsid = get_vsid(mm->context.id, ea); break; - case IO_REGION_ID: - mm = &ioremap_mm; - vsid = get_kernel_vsid(ea); - break; case VMALLOC_REGION_ID: mm = &init_mm; vsid = get_kernel_vsid(ea); Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-11 10:05:49.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-12 14:11:16.000000000 +1000 @@ -73,9 +73,6 @@ extern pgd_t swapper_pg_dir[]; extern struct task_struct *current_set[NR_CPUS]; -extern pgd_t ioremap_dir[]; -pgd_t * ioremap_pgd = (pgd_t *)&ioremap_dir; - unsigned long klimit = (unsigned long)_end; unsigned long _SDR1=0; @@ -137,69 +134,6 @@ #else -static void unmap_im_area_pte(pmd_t *pmd, unsigned long addr, - unsigned long end) -{ - pte_t *pte; - - pte = pte_offset_kernel(pmd, addr); - do { - pte_t ptent = ptep_get_and_clear(&ioremap_mm, addr, pte); - WARN_ON(!pte_none(ptent) && !pte_present(ptent)); - } while (pte++, addr += PAGE_SIZE, addr != end); -} - -static inline void unmap_im_area_pmd(pud_t *pud, unsigned long addr, - unsigned long end) -{ - pmd_t *pmd; - unsigned long next; - - pmd = pmd_offset(pud, addr); - do { - next = pmd_addr_end(addr, end); - if (pmd_none_or_clear_bad(pmd)) - continue; - unmap_im_area_pte(pmd, addr, next); - } while (pmd++, addr = next, addr != end); -} - -static inline void unmap_im_area_pud(pgd_t *pgd, unsigned long addr, - unsigned long end) -{ - pud_t *pud; - unsigned long next; - - pud = pud_offset(pgd, addr); - do { - next = pud_addr_end(addr, end); - if (pud_none_or_clear_bad(pud)) - continue; - 
unmap_im_area_pmd(pud, addr, next); - } while (pud++, addr = next, addr != end); -} - -static void unmap_im_area(unsigned long addr, unsigned long end) -{ - struct mm_struct *mm = &ioremap_mm; - unsigned long next; - pgd_t *pgd; - - spin_lock(&mm->page_table_lock); - - pgd = pgd_offset_i(addr); - flush_cache_vunmap(addr, end); - do { - next = pgd_addr_end(addr, end); - if (pgd_none_or_clear_bad(pgd)) - continue; - unmap_im_area_pud(pgd, addr, next); - } while (pgd++, addr = next, addr != end); - flush_tlb_kernel_range(start, end); - - spin_unlock(&mm->page_table_lock); -} - /* * map_io_page currently only called by __ioremap * map_io_page adds an entry to the ioremap page table @@ -214,21 +148,21 @@ unsigned long vsid; if (mem_init_done) { - spin_lock(&ioremap_mm.page_table_lock); - pgdp = pgd_offset_i(ea); - pudp = pud_alloc(&ioremap_mm, pgdp, ea); + spin_lock(&init_mm.page_table_lock); + pgdp = pgd_offset_k(ea); + pudp = pud_alloc(&init_mm, pgdp, ea); if (!pudp) return -ENOMEM; - pmdp = pmd_alloc(&ioremap_mm, pudp, ea); + pmdp = pmd_alloc(&init_mm, pudp, ea); if (!pmdp) return -ENOMEM; - ptep = pte_alloc_kernel(&ioremap_mm, pmdp, ea); + ptep = pte_alloc_kernel(&init_mm, pmdp, ea); if (!ptep) return -ENOMEM; pa = abs_to_phys(pa); - set_pte_at(&ioremap_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, + set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, __pgprot(flags))); - spin_unlock(&ioremap_mm.page_table_lock); + spin_unlock(&init_mm.page_table_lock); } else { unsigned long va, vpn, hash, hpteg; @@ -267,13 +201,9 @@ for (i = 0; i < size; i += PAGE_SIZE) if (map_io_page(ea+i, pa+i, flags)) - goto failure; + return NULL; return (void __iomem *) (ea + (addr & ~PAGE_MASK)); - failure: - if (mem_init_done) - unmap_im_area(ea, ea + size); - return NULL; } @@ -381,19 +311,14 @@ */ void iounmap(volatile void __iomem *token) { - unsigned long address, size; void *addr; if (!mem_init_done) return; addr = (void *) ((unsigned long __force) token & PAGE_MASK); - - if ((size = 
im_free(addr)) == 0) - return; - address = (unsigned long)addr; - unmap_im_area(address, address + size); + im_free(addr); } static int iounmap_subset_regions(unsigned long addr, unsigned long size) Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2005-04-26 15:37:55.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2005-05-12 14:08:37.000000000 +1000 @@ -2121,10 +2121,6 @@ swapper_pg_dir: .space 4096 - .globl ioremap_dir -ioremap_dir: - .space 4096 - #ifdef CONFIG_SMP /* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */ .globl stab_array Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-05-11 10:05:49.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-12 14:20:33.000000000 +1000 @@ -15,6 +15,7 @@ #include #include #include +#include static DECLARE_MUTEX(imlist_sem); struct vm_struct * imlist = NULL; @@ -285,29 +286,32 @@ return area; } -unsigned long im_free(void * addr) +void im_free(void * addr) { struct vm_struct **p, *tmp; - unsigned long ret_size = 0; if (!addr) - return ret_size; - if ((PAGE_SIZE-1) & (unsigned long) addr) { + return; + if ((unsigned long) addr & ~PAGE_MASK) { printk(KERN_ERR "Trying to %s bad address (%p)\n", __FUNCTION__, addr); - return ret_size; + return; } down(&imlist_sem); for (p = &imlist ; (tmp = *p) ; p = &tmp->next) { if (tmp->addr == addr) { - ret_size = tmp->size; *p = tmp->next; + + /* XXX: do we need the lock? 
*/ + spin_lock(&init_mm.page_table_lock); + unmap_vm_area(tmp); + spin_unlock(&init_mm.page_table_lock); + kfree(tmp); up(&imlist_sem); - return ret_size; + return; } } up(&imlist_sem); printk(KERN_ERR "Trying to %s nonexistent area (%p)\n", __FUNCTION__, addr); - return ret_size; } -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From michael at ellerman.id.au Fri May 13 17:44:10 2005 From: michael at ellerman.id.au (Michael Ellerman) Date: Fri, 13 May 2005 17:44:10 +1000 Subject: [PATCH] Updated: fix-pci-mmap-on-ppc-and-ppc64.patch Message-ID: <200505131744.11095.michael@ellerman.id.au> Hi Andrew, This is an updated version of Ben's fix-pci-mmap-on-ppc-and-ppc64.patch which is in 2.6.12-rc4-mm1. It fixes the patch to work on PPC iSeries, removes some debug printks at Ben's request, and incorporates your fix-pci-mmap-on-ppc-and-ppc64-fix.patch also. cheers Signed-off-by: Michael Ellerman -- From: Benjamin Herrenschmidt This patch was discussed at length on linux-pci and so far, the last iteration of it didn't raise any comment. It's effect is a nop on architecture that don't define the new pci_resource_to_user() callback anyway. It allows architecture like ppc who put weird things inside of PCI resource structures to convert to some different value for user visible ones. It also fixes mmap'ing of IO space on those archs. 
Signed-off-by: Benjamin Herrenschmidt Signed-off-by: Andrew Morton arch/ppc/kernel/pci.c | 21 +++++++++++++++++++-- arch/ppc64/kernel/pci.c | 22 ++++++++++++++++++++-- drivers/pci/pci-sysfs.c | 26 +++++++++++++++++++++----- drivers/pci/proc.c | 14 ++++++++++---- include/asm-ppc/pci.h | 6 ++++++ include/asm-ppc64/pci.h | 7 +++++++ include/linux/pci.h | 14 ++++++++++++++ 7 files changed, 97 insertions(+), 13 deletions(-) Index: 2.6.12-rc4-mm1/arch/ppc64/kernel/pci.c =================================================================== --- 2.6.12-rc4-mm1.orig/arch/ppc64/kernel/pci.c 2005-05-13 15:39:55.000000000 +1000 +++ 2.6.12-rc4-mm1/arch/ppc64/kernel/pci.c 2005-05-13 15:40:19.000000000 +1000 @@ -351,7 +351,7 @@ *offset += hose->pci_mem_offset; res_bit = IORESOURCE_MEM; } else { - io_offset = (unsigned long)hose->io_base_virt; + io_offset = (unsigned long)hose->io_base_virt - pci_io_base; *offset += io_offset; res_bit = IORESOURCE_IO; } @@ -378,7 +378,7 @@ /* found it! construct the final physical address */ if (mmap_state == pci_mmap_io) - *offset += hose->io_base_phys - io_offset; + *offset += hose->io_base_phys - io_offset; return rp; } @@ -941,4 +941,22 @@ } EXPORT_SYMBOL(pci_read_irq_line); +void pci_resource_to_user(const struct pci_dev *dev, int bar, + const struct resource *rsrc, + u64 *start, u64 *end) +{ + struct pci_controller *hose = pci_bus_to_host(dev->bus); + unsigned long offset = 0; + + if (hose == NULL) + return; + + if (rsrc->flags & IORESOURCE_IO) + offset = pci_io_base - (unsigned long)hose->io_base_virt + + hose->io_base_phys; + + *start = rsrc->start + offset; + *end = rsrc->end + offset; +} + #endif /* CONFIG_PPC_MULTIPLATFORM */ Index: 2.6.12-rc4-mm1/arch/ppc/kernel/pci.c =================================================================== --- 2.6.12-rc4-mm1.orig/arch/ppc/kernel/pci.c 2005-05-13 15:39:55.000000000 +1000 +++ 2.6.12-rc4-mm1/arch/ppc/kernel/pci.c 2005-05-13 15:40:19.000000000 +1000 @@ -1495,7 +1495,7 @@ *offset += 
hose->pci_mem_offset; res_bit = IORESOURCE_MEM; } else { - io_offset = (unsigned long)hose->io_base_virt; + io_offset = hose->io_base_virt - ___IO_BASE; *offset += io_offset; res_bit = IORESOURCE_IO; } @@ -1522,7 +1522,7 @@ /* found it! construct the final physical address */ if (mmap_state == pci_mmap_io) - *offset += hose->io_base_phys - _IO_BASE; + *offset += hose->io_base_phys - io_offset; return rp; } @@ -1739,6 +1739,23 @@ return result; } +void pci_resource_to_user(const struct pci_dev *dev, int bar, + const struct resource *rsrc, + u64 *start, u64 *end) +{ + struct pci_controller *hose = pci_bus_to_hose(dev->bus->number); + unsigned long offset = 0; + + if (hose == NULL) + return; + + if (rsrc->flags & IORESOURCE_IO) + offset = ___IO_BASE - hose->io_base_virt + hose->io_base_phys; + + *start = rsrc->start + offset; + *end = rsrc->end + offset; +} + void __init pci_init_resource(struct resource *res, unsigned long start, unsigned long end, int flags, char *name) Index: 2.6.12-rc4-mm1/drivers/pci/pci-sysfs.c =================================================================== --- 2.6.12-rc4-mm1.orig/drivers/pci/pci-sysfs.c 2005-05-13 15:39:55.000000000 +1000 +++ 2.6.12-rc4-mm1/drivers/pci/pci-sysfs.c 2005-05-13 15:40:19.000000000 +1000 @@ -60,15 +60,18 @@ char * str = buf; int i; int max = 7; + u64 start, end; if (pci_dev->subordinate) max = DEVICE_COUNT_RESOURCE; for (i = 0; i < max; i++) { - str += sprintf(str,"0x%016lx 0x%016lx 0x%016lx\n", - pci_resource_start(pci_dev,i), - pci_resource_end(pci_dev,i), - pci_resource_flags(pci_dev,i)); + struct resource *res = &pci_dev->resource[i]; + pci_resource_to_user(pci_dev, i, res, &start, &end); + str += sprintf(str,"0x%016llx 0x%016llx 0x%016llx\n", + (unsigned long long)start, + (unsigned long long)end, + (unsigned long long)res->flags); } return (str - buf); } @@ -301,8 +304,21 @@ struct device, kobj)); struct resource *res = (struct resource *)attr->attr.private; enum pci_mmap_state mmap_type; + u64 start, end; 
+ int i; - vma->vm_pgoff += res->start >> PAGE_SHIFT; + for (i = 0; i < PCI_ROM_RESOURCE; i++) + if (res == &pdev->resource[i]) + break; + if (i >= PCI_ROM_RESOURCE) + return -ENODEV; + + /* pci_mmap_page_range() expects the same kind of entry as coming + * from /proc/bus/pci/ which is a "user visible" value. If this is + * different from the resource itself, arch will do necessary fixup. + */ + pci_resource_to_user(pdev, i, res, &start, &end); + vma->vm_pgoff += start >> PAGE_SHIFT; mmap_type = res->flags & IORESOURCE_MEM ? pci_mmap_mem : pci_mmap_io; return pci_mmap_page_range(pdev, vma, mmap_type, 0); Index: 2.6.12-rc4-mm1/drivers/pci/proc.c =================================================================== --- 2.6.12-rc4-mm1.orig/drivers/pci/proc.c 2005-05-13 15:39:55.000000000 +1000 +++ 2.6.12-rc4-mm1/drivers/pci/proc.c 2005-05-13 15:40:19.000000000 +1000 @@ -355,14 +355,20 @@ dev->device, dev->irq); /* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */ - for(i=0; i<7; i++) + for(i=0; i<7; i++) { + u64 start, end; + pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, LONG_FORMAT, - dev->resource[i].start | + ((unsigned long)start) | (dev->resource[i].flags & PCI_REGION_FLAG_MASK)); - for(i=0; i<7; i++) + } + for(i=0; i<7; i++) { + u64 start, end; + pci_resource_to_user(dev, i, &dev->resource[i], &start, &end); seq_printf(m, LONG_FORMAT, dev->resource[i].start < dev->resource[i].end ? 
- dev->resource[i].end - dev->resource[i].start + 1 : 0); + (unsigned long)(end - start) + 1 : 0); + } seq_putc(m, '\t'); if (drv) seq_printf(m, "%s", drv->name); Index: 2.6.12-rc4-mm1/include/asm-ppc64/pci.h =================================================================== --- 2.6.12-rc4-mm1.orig/include/asm-ppc64/pci.h 2005-05-13 15:39:55.000000000 +1000 +++ 2.6.12-rc4-mm1/include/asm-ppc64/pci.h 2005-05-13 15:41:51.000000000 +1000 @@ -136,6 +136,13 @@ unsigned long size, pgprot_t prot); +#ifdef CONFIG_PPC_MULTIPLATFORM +#define HAVE_ARCH_PCI_RESOURCE_TO_USER +extern void pci_resource_to_user(const struct pci_dev *dev, int bar, + const struct resource *rsrc, + u64 *start, u64 *end); +#endif /* CONFIG_PPC_MULTIPLATFORM */ + #endif /* __KERNEL__ */ Index: 2.6.12-rc4-mm1/include/asm-ppc/pci.h =================================================================== --- 2.6.12-rc4-mm1.orig/include/asm-ppc/pci.h 2005-05-13 15:39:55.000000000 +1000 +++ 2.6.12-rc4-mm1/include/asm-ppc/pci.h 2005-05-13 15:40:19.000000000 +1000 @@ -103,6 +103,12 @@ unsigned long size, pgprot_t prot); +#define HAVE_ARCH_PCI_RESOURCE_TO_USER +extern void pci_resource_to_user(const struct pci_dev *dev, int bar, + const struct resource *rsrc, + u64 *start, u64 *end); + + #endif /* __KERNEL__ */ #endif /* __PPC_PCI_H */ Index: 2.6.12-rc4-mm1/include/linux/pci.h =================================================================== --- 2.6.12-rc4-mm1.orig/include/linux/pci.h 2005-05-13 15:39:55.000000000 +1000 +++ 2.6.12-rc4-mm1/include/linux/pci.h 2005-05-13 15:43:33.000000000 +1000 @@ -1020,6 +1020,20 @@ #define pci_pretty_name(dev) "" #endif + +/* Some archs don't want to expose struct resource to userland as-is + * in sysfs and /proc + */ +#ifndef HAVE_ARCH_PCI_RESOURCE_TO_USER +static inline void pci_resource_to_user(const struct pci_dev *dev, int bar, + const struct resource *rsrc, u64 *start, u64 *end) +{ + *start = rsrc->start; + *end = rsrc->end; +} +#endif /* HAVE_ARCH_PCI_RESOURCE_TO_USER */ 
+
+
 /*
  * The world is not perfect and supplies us with broken PCI devices.
  * For at least a part of these bugs we need a work-around, so both

From benh at kernel.crashing.org Fri May 13 17:59:06 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 13 May 2005 17:59:06 +1000
Subject: [PATCH] Updated: fix-pci-mmap-on-ppc-and-ppc64.patch
In-Reply-To: <200505131744.11095.michael@ellerman.id.au>
References: <200505131744.11095.michael@ellerman.id.au>
Message-ID: <1115971146.5128.16.camel@gaston>

On Fri, 2005-05-13 at 17:44 +1000, Michael Ellerman wrote:
> Hi Andrew,
>
> This is an updated version of Ben's fix-pci-mmap-on-ppc-and-ppc64.patch
> which is in 2.6.12-rc4-mm1.
>
> It fixes the patch to work on PPC iSeries, removes some debug printks
> at Ben's request, and incorporates your
> fix-pci-mmap-on-ppc-and-ppc64-fix.patch also.
>
> cheers
>
> Signed-off-by: Michael Ellerman

Acked-by: Benjamin Herrenschmidt

From tlnguyen at snoqualmie.dp.intel.com Sat May 14 01:02:28 2005
From: tlnguyen at snoqualmie.dp.intel.com (long)
Date: Fri, 13 May 2005 08:02:28 -0700
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
Message-ID: <200505131502.j4DF2S7F015855@snoqualmie.dp.intel.com>

On Fri, 6 May 2005 18:05:06, Linas wrote:
>Hi,
>
>This is an "FYI" patch partially implementing the PCI error recovery API
>previously detailed by BenH in an earlier email.
>
>Its an "FYI patch" because this patch has numerous flaws and limitations
>which I'm hoping to address any day now. I've been busy with other
>things, but have recently been able to carve out a chunk of time to work
>on this.
>
>This patch is almost identical to a previous patch I'd mailed out
>before, with only minor changes made to bring it into line with
>BenH's proposed API. Basically, I'm just dusting off the old patch,
>prior to making more serious changes. I hope to send a more serious
>patch in a few days/week. Meanwhile, criticism invited.
>
>This patch does actually recover from PCI errors on ethernet cards
>plugged into ppc64 hotplug slots, and from PCI errors on the IPR scsi
>controller.

This patch works fine for my AER code. Please post it to LKML.

Thanks,
Long

From johnrose at austin.ibm.com Sat May 14 02:25:56 2005
From: johnrose at austin.ibm.com (John Rose)
Date: Fri, 13 May 2005 11:25:56 -0500
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
In-Reply-To: <20050506230506.GL11745@austin.ibm.com>
References: <20050223002409.GA10909@austin.ibm.com> <20050223174356.GH13081@kroah.com> <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com> <1112685311.9518.35.camel@gaston> <20050506230506.GL11745@austin.ibm.com>
Message-ID: <1116001556.31126.28.camel@sinatra.austin.ibm.com>

Hi Linas-

Sorry for the delay. My first comment is that patches that affect PCI
Hotplug drivers should also be submitted to
pcihpd-discuss at lists.sourceforge.net. We've discussed this offline, and
maybe this will generate some public discussion.

I don't think handle_eeh_events() or eeh_reset_device() belong in the RPA
PCI hotplug driver. I've pasted these functions at the bottom of this
note. These are out of place compared to the rest of the rpaphp code.
They use eeh-specific RTAS calls, structures, and functions, which do not
otherwise appear in RPA PCI Hotplug. These functions are using PCI
Hotplug, rather than implementing it, so they belong elsewhere.

Looking at the patch, all EEH needs from PCI Hotplug are enable and disable
functions. The rpaphp driver could register disable/enable functions with
eeh.c, and we could avoid introducing this unrelated code to the PCI
Hotplug driver.
One other quick comment:

+static struct device_node *
+get_phb_of_device (struct pci_dev *dev)
+{
+	struct device_node *dn;
+	struct pci_bus *bus;
+
+	while (1) {
+		bus = dev->bus;
+		if (!bus)
+			break;
+		dn = pci_bus_to_OF_node(bus);
+
+		if (dn->phb)
+			return dn;
+
+		dev = bus->self;
+		BUG_ON (dev==NULL);
+		if (dev == NULL)
+			return NULL;
+	}

Could you just use pci_device_to_OF_node(), then dn->phb? This way,
you could avoid the loop.

+/* ------------------------------------------------------- */
+/**
+ * handle_eeh_events -- reset a PCI device after hard lockup.
+ *
+ * pSeries systems will isolate a PCI slot if the PCI-Host
+ * bridge detects address or data parity errors, DMA's
+ * occurring to wild addresses (which usually happen due to
+ * bugs in device drivers or in PCI adapter firmware).
+ * Slot isolations also occur if #SERR, #PERR or other misc
+ * PCI-related errors are detected.
+ *
+ * Recovery process consists of unplugging the device driver
+ * (which generated hotplug events to userspace), then issuing
+ * a PCI #RST to the device, then reconfiguring the PCI config
+ * space for all bridges & devices under this slot, and then
+ * finally restarting the device drivers (which cause a second
+ * set of hotplug events to go out to userspace).
+ */
+
+int eeh_reset_device (struct pci_dev *dev, struct device_node *dn, int reconfig)
+{
+	struct slot *frozen_slot= NULL;
+
+	if (!dev)
+		return 1;
+
+	if (reconfig)
+		frozen_slot = rpaphp_find_slot(dev);
+
+	if (reconfig && frozen_slot) rpaphp_unconfig_pci_adapter (frozen_slot);
+
+	/* Reset the pci controller. (Asserts RST#; resets config space).
+	 * Reconfigure bridges and devices */
+	rtas_set_slot_reset (dn->child);
+	rtas_configure_bridge(dn);
+	eeh_restore_bars(dn->child);
+printk ("duude, post restore bars, for %s here's the dump\n", dn->full_name);
+{
+extern int rtas_read_config(struct device_node *dn, int where, int size, u32 *val);
+int i, rc;
+u32 val;
+struct device_node *xn=dn->child;
+for(i=0;i<16;i++) {
+rc = rtas_read_config (xn, i*4,4,&val);
+printk ("duude read config %d rc=%d val=%x expect=%x\n", i, rc, val,xn->config_space[i]);
+}}
+
+	enable_irq (dev->irq);
+
+	/* Give the system 5 seconds to finish running the user-space
+	 * hotplug scripts, e.g. ifdown for ethernet. Yes, this is a hack,
+	 * but if we don't do this, weird things happen.
+	 */
+	if (reconfig && frozen_slot) {
+		ssleep (5);
+		rpaphp_enable_pci_slot (frozen_slot);
+	}
+	return 0;
+}
+
+/* The longest amount of time to wait for a pci device
+ * to come back on line, in seconds.
+ */
+#define MAX_WAIT_FOR_RECOVERY 15
+
+int handle_eeh_events (struct notifier_block *self,
+                       unsigned long reason, void *ev)
+{
+	int freeze_count=0;
+	struct device_node *frozen_device;
+	struct peh_event *event = ev;
+	struct pci_dev *dev = event->dev;
+	int perm_failure = 0;
+	int rc;
+
+	if (!dev)
+	{
+		printk ("EEH: EEH error caught, but no PCI device specified!\n");
+		return 1;
+	}
+
+	frozen_device = get_phb_of_device (dev);
+
+	if (!frozen_device)
+	{
+		printk (KERN_ERR "EEH: Cannot find PCI controller for %s %s\n",
+			pci_name(dev), pci_pretty_name (dev));
+
+		return 1;
+	}
+
+	/* We get "permanent failure" messages on empty slots.
+	 * These are false alarms. Empty slots have no child dn. */
+	if ((event->state == pci_channel_io_perm_failure) && (frozen_device == NULL))
+		return 0;
+
+	if (frozen_device)
+		freeze_count = frozen_device->eeh_freeze_count;
+	freeze_count ++;
+	if (freeze_count > EEH_MAX_ALLOWED_FREEZES)
+		perm_failure = 1;
+
+	/* If the reset state is a '5' and the time to reset is 0 (infinity)
+	 * or is more than 15 seconds, then mark this as a permanent failure.
+	 */
+	if ((event->state == pci_channel_io_perm_failure) &&
+	    ((event->time_unavail <= 0) ||
+	     (event->time_unavail > MAX_WAIT_FOR_RECOVERY*1000)))
+		perm_failure = 1;
+
+	/* Log the error with the rtas logger. */
+	if (perm_failure) {
+		/*
+		 * About 90% of all real-life EEH failures in the field
+		 * are due to poorly seated PCI cards. Only 10% or so are
+		 * due to actual, failed cards.
+		 */
+		printk (KERN_ERR
+			"EEH: device %s:%s has failed %d times \n"
+			"and has been permanently disabled. Please try reseating\n"
+			"this device or replacing it.\n",
+			pci_name (dev),
+			pci_pretty_name (dev),
+			freeze_count);
+
+		eeh_slot_error_detail (frozen_device, 2 /* Permanent Error */);
+
+		/* Notify the device that it's about to go down. */
+		/* XXX this should be a recursive walk to children for
+		 * multi-function devices */
+		if (dev->driver->err_handler.error_detected) {
+			dev->driver->err_handler.error_detected (dev, pci_channel_io_perm_failure);
+		}
+
+		/* If there's a hotplug slot, unconfigure it */
+		struct slot * frozen_slot = rpaphp_find_slot(dev);
+		rpaphp_unconfig_pci_adapter (frozen_slot);
+		return 1;
+	} else {
+		eeh_slot_error_detail (frozen_device, 1 /* Temporary Error */);
+	}
+
+	printk (KERN_WARNING
+		"EEH: This device has failed %d times since last reboot: %s:%s\n",
+		freeze_count,
+		pci_name (dev),
+		pci_pretty_name (dev));
+
+	/* Walk the various device drivers attached to this slot through
+	 * a reset sequence, giving each an opportunity to do what it needs
+	 * to accomplish the reset */
+	/* XXX this should be a recursive walk to children for
+	 * multi-function devices; each child should get to report
+	 * status too, if needed ... if any child can't handle the reset,
+	 * then need to hotplug it.
+	 * XXX This does not follow flow of BenH's last email at all.
+	 * XXX will be fixed later XXX
+	 */
+	if (dev->driver->err_handler.error_detected) {
+		dev->driver->err_handler.error_detected (dev, pci_channel_io_frozen);
+		rc = eeh_reset_device (dev, frozen_device, 0);
+		if (dev->driver->err_handler.slot_reset)
+			dev->driver->err_handler.slot_reset (dev);
+	} else {
+		rc = eeh_reset_device (dev, frozen_device, 1);
+	}
+
+	/* Store the freeze count with the pci adapter, and not the slot.
+	 * This way, if the device is replaced, the count is cleared.
+	 */
+	frozen_device->eeh_freeze_count = freeze_count;
+
+	return rc;
+}

Thanks-
John

From greg at kroah.com Sat May 14 02:18:55 2005
From: greg at kroah.com (Greg KH)
Date: Fri, 13 May 2005 09:18:55 -0700
Subject: [PATCH] Updated: fix-pci-mmap-on-ppc-and-ppc64.patch
In-Reply-To: <200505131744.11095.michael@ellerman.id.au>
References: <200505131744.11095.michael@ellerman.id.au>
Message-ID: <20050513161855.GC11089@kroah.com>

On Fri, May 13, 2005 at 05:44:10PM +1000, Michael Ellerman wrote:
> Hi Andrew,

Hm, why not send this to the pci maintainer?

> This is an updated version of Ben's fix-pci-mmap-on-ppc-and-ppc64.patch
> which is in 2.6.12-rc4-mm1.
>
> It fixes the patch to work on PPC iSeries, removes some debug printks
> at Ben's request, and incorporates your
> fix-pci-mmap-on-ppc-and-ppc64-fix.patch also.

I'll add it to my queue.

greg k-h

From linas at austin.ibm.com Sat May 14 03:28:09 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Fri, 13 May 2005 12:28:09 -0500
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
In-Reply-To: <1116001556.31126.28.camel@sinatra.austin.ibm.com>
References: <20050224011409.GE2088@austin.ibm.com>
	<421DDEF7.7080103@jp.fujitsu.com>
	<20050224231455.GH2088@austin.ibm.com>
	<421E9D16.3000606@jp.fujitsu.com>
	<20050312013251.GA2609@austin.ibm.com>
	<4235847F.3080705@jp.fujitsu.com>
	<20050314181420.GD498@austin.ibm.com>
	<1112685311.9518.35.camel@gaston>
	<20050506230506.GL11745@austin.ibm.com>
	<1116001556.31126.28.camel@sinatra.austin.ibm.com>
Message-ID: <20050513172809.GA4138@austin.ibm.com>

On Fri, May 13, 2005 at 11:25:56AM -0500, John Rose was heard to remark:
>
> I don't think handle_eeh_events() or eeh_reset_device()
> belong in the RPA PCI hotplug driver.

Suggestions where these should go? They got moved from arch/ppc64/kernel
to drivers/pci/hotplug at the urging of Paulus and GregKH; in part because
rpaphp can be built as a module, whereas the ppc64 bits cannot.
Would a distinct file in drivers/pci/hotplug work? Or something else?

> Looking at the patch, all EEH needs from PCI Hotplug are enable and
> disable functions. The rpaphp driver could register disable/enable
> functions with eeh.c, and we could avoid introducing this unrelated
> code to the PCI Hotplug driver.

That's how it used to work; no one liked that idea. But it doesn't hurt
to revisit. I think the biggest problem here is module dependencies.

Alternately, we might struggle to find a generic
pci-error-recovery-on-top-of-pci-hotplug solution, but this may be
premature at this point.

> One other quick comment:
>
> +static struct device_node *
> +get_phb_of_device (struct pci_dev *dev)
> +{
> +	struct device_node *dn;
> +	struct pci_bus *bus;
> +
> +	while (1) {
> +		bus = dev->bus;
> +		if (!bus)
> +			break;
> +		dn = pci_bus_to_OF_node(bus);
> +
> +		if (dn->phb)
> +			return dn;
> +
> +		dev = bus->self;
> +		BUG_ON (dev==NULL);
> +		if (dev == NULL)
> +			return NULL;
> +	}
>
> Could you just use pci_device_to_OF_node(), then dn->phb? This way,
> you could avoid the loop.

Yes, except that not all dn's have phb's. The goal here is to walk up
the tree, until a phb is found.

--linas

From linas at austin.ibm.com Sat May 14 03:29:33 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Fri, 13 May 2005 12:29:33 -0500
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
In-Reply-To: <200505131502.j4DF2S7F015855@snoqualmie.dp.intel.com>
References: <200505131502.j4DF2S7F015855@snoqualmie.dp.intel.com>
Message-ID: <20050513172933.GB4138@austin.ibm.com>

On Fri, May 13, 2005 at 08:02:28AM -0700, long was heard to remark:
> On Fri, 6 May 2005 18:05:06, Linas wrote:
> >Hi,
> >
> >This is an "FYI" patch partially implementing the PCI error recovery API
> >previously detailed by BenH in an earlier email.
>
> This patch works fine for my AER code. Please post it to LKML.

I'll try to do that "real soon now". I've got Symbios recovery
working at this time.
--linas

From johnrose at austin.ibm.com Sat May 14 03:40:28 2005
From: johnrose at austin.ibm.com (John Rose)
Date: Fri, 13 May 2005 12:40:28 -0500
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
In-Reply-To: <20050513172809.GA4138@austin.ibm.com>
References: <20050224011409.GE2088@austin.ibm.com>
	<421DDEF7.7080103@jp.fujitsu.com>
	<20050224231455.GH2088@austin.ibm.com>
	<421E9D16.3000606@jp.fujitsu.com>
	<20050312013251.GA2609@austin.ibm.com>
	<4235847F.3080705@jp.fujitsu.com>
	<20050314181420.GD498@austin.ibm.com>
	<1112685311.9518.35.camel@gaston>
	<20050506230506.GL11745@austin.ibm.com>
	<1116001556.31126.28.camel@sinatra.austin.ibm.com>
	<20050513172809.GA4138@austin.ibm.com>
Message-ID: <1116006028.31126.101.camel@sinatra.austin.ibm.com>

Hi Linas-

> Suggestions where these should go? They got moved from arch/ppc64/kernel
> to drivers/pci/hotplug at the urging of Paulus and GregKH; in part because
> rpaphp can be built as a module, whereas the ppc64 bits cannot.
> Would a distinct file in drivers/pci/hotplug work? Or something else?

Paul, Greg, can you explain the motivation behind this? Why doesn't the
EEH recovery logic belong in eeh.c, with the rest of EEH code?

> > Could you just use pci_device_to_OF_node(), then dn->phb? This way,
> > you could avoid the loop.
>
> Yes, except that not all dn's have phb's. The goal here is to walk up
> the tree, until a phb is found.

If a PCI node had a null phb pointer, that would be a bug :) This
pointer should be set for nodes present at boot as well as dynamically
added ones.
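
The two positions above need not conflict: a walk up the tree returns at
once when the starting node already has a phb, and only climbs when it
does not. A minimal sketch, assuming a simplified struct node with a
parent pointer in place of the kernel's struct device_node (the type and
the function name nearest_node_with_phb() are made up for illustration
and are not part of the patch):

```c
#include <stddef.h>

/* Simplified stand-in for struct device_node; not the kernel definition. */
struct node {
	struct node *parent;	/* NULL at the root of the tree */
	void *phb;		/* non-NULL once a host bridge is attached */
};

/*
 * Return the nearest node, starting at dn itself, that has a phb.
 * Returns NULL if dn is NULL or no ancestor carries a phb.
 */
struct node *nearest_node_with_phb(struct node *dn)
{
	while (dn && !dn->phb)
		dn = dn->parent;
	return dn;
}
```

If every PCI node really does carry a phb pointer, the loop body never
runs and this reduces to the direct lookup; if some do not, the NULL
return gives the caller an error path instead of a BUG_ON.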
Thanks-
John

From greg at kroah.com Sat May 14 04:19:59 2005
From: greg at kroah.com (Greg KH)
Date: Fri, 13 May 2005 11:19:59 -0700
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
In-Reply-To: <1116006028.31126.101.camel@sinatra.austin.ibm.com>
References: <20050224231455.GH2088@austin.ibm.com>
	<421E9D16.3000606@jp.fujitsu.com>
	<20050312013251.GA2609@austin.ibm.com>
	<4235847F.3080705@jp.fujitsu.com>
	<20050314181420.GD498@austin.ibm.com>
	<1112685311.9518.35.camel@gaston>
	<20050506230506.GL11745@austin.ibm.com>
	<1116001556.31126.28.camel@sinatra.austin.ibm.com>
	<20050513172809.GA4138@austin.ibm.com>
	<1116006028.31126.101.camel@sinatra.austin.ibm.com>
Message-ID: <20050513181958.GA13102@kroah.com>

On Fri, May 13, 2005 at 12:40:28PM -0500, John Rose wrote:
> Hi Linas-
>
> > Suggestions where these should go? They got moved from arch/ppc64/kernel
> > to drivers/pci/hotplug at the urging of Paulus and GregKH; in part because
> > rpaphp can be built as a module, whereas the ppc64 bits cannot.
> > Would a distinct file in drivers/pci/hotplug work? Or something else?
>
> Paul, Greg, can you explain the motivation behind this? Why doesn't the
> EEH recovery logic belong in eeh.c, with the rest of EEH code?

So that all platforms can use this logic? Not all the world is a ppc64
box :)

greg k-h

From johnrose at austin.ibm.com Sat May 14 04:42:56 2005
From: johnrose at austin.ibm.com (John Rose)
Date: Fri, 13 May 2005 13:42:56 -0500
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
In-Reply-To: <20050513181958.GA13102@kroah.com>
References: <20050224231455.GH2088@austin.ibm.com>
	<421E9D16.3000606@jp.fujitsu.com>
	<20050312013251.GA2609@austin.ibm.com>
	<4235847F.3080705@jp.fujitsu.com>
	<20050314181420.GD498@austin.ibm.com>
	<1112685311.9518.35.camel@gaston>
	<20050506230506.GL11745@austin.ibm.com>
	<1116001556.31126.28.camel@sinatra.austin.ibm.com>
	<20050513172809.GA4138@austin.ibm.com>
	<1116006028.31126.101.camel@sinatra.austin.ibm.com>
	<20050513181958.GA13102@kroah.com>
Message-ID: <1116009776.2018.17.camel@sinatra.austin.ibm.com>

> So that all platforms can use this logic? Not all the world is a ppc64
> box :)

?? The chunk of code in question is quite PPC64-specific, so it's not
of use to other platforms. The code is being placed in the rpaphp
driver, which is also PPC64 specific.

Why couldn't arch/ppc64/kernel/eeh.c register an error_recover()
implementation with the generic layer?
Thanks-
John

From greg at kroah.com Sat May 14 04:49:50 2005
From: greg at kroah.com (Greg KH)
Date: Fri, 13 May 2005 11:49:50 -0700
Subject: [PATCH] FYI/ Re: PCI Error Recovery API Proposal (updated)
In-Reply-To: <1116009776.2018.17.camel@sinatra.austin.ibm.com>
References: <20050312013251.GA2609@austin.ibm.com>
	<4235847F.3080705@jp.fujitsu.com>
	<20050314181420.GD498@austin.ibm.com>
	<1112685311.9518.35.camel@gaston>
	<20050506230506.GL11745@austin.ibm.com>
	<1116001556.31126.28.camel@sinatra.austin.ibm.com>
	<20050513172809.GA4138@austin.ibm.com>
	<1116006028.31126.101.camel@sinatra.austin.ibm.com>
	<20050513181958.GA13102@kroah.com>
	<1116009776.2018.17.camel@sinatra.austin.ibm.com>
Message-ID: <20050513184950.GA13413@kroah.com>

On Fri, May 13, 2005 at 01:42:56PM -0500, John Rose wrote:
> > So that all platforms can use this logic? Not all the world is a ppc64
> > box :)
>
> ?? The chunk of code in question is quite PPC64-specific, so it's not
> of use to other platforms. The code is being placed in the rpaphp
> driver, which is also PPC64 specific.
>
> Why couldn't arch/ppc64/kernel/eeh.c register an error_recover()
> implementation with the generic layer?

To be honest, I don't really remember anymore, this thread has been
going on for _months_ now :)

greg k-h

From arnd at arndb.de Sat May 14 05:23:49 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Fri, 13 May 2005 21:23:49 +0200
Subject: [PATCH 2/8] ppc64: add a minimal nvram driver
In-Reply-To: <200505132117.37461.arnd@arndb.de>
References: <200505132117.37461.arnd@arndb.de>
Message-ID: <200505132123.50284.arnd@arndb.de>

The firmware provides the location and size of the nvram in the
device tree, so it does not really contain any hardware specific
bits and could be used on other machines as well.
From: Utz Bacher Signed-off-by: Arnd Bergmann Index: linus-2.5/arch/ppc64/kernel/bpa_nvram.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linus-2.5/arch/ppc64/kernel/bpa_nvram.c 2005-04-20 01:55:36.000000000 +0200 @@ -0,0 +1,118 @@ +/* + * NVRAM for CPBW + * + * (C) Copyright IBM Corp. 2005 + * + * Authors : Utz Bacher + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +static void __iomem *bpa_nvram_start; +static long bpa_nvram_len; +static spinlock_t bpa_nvram_lock = SPIN_LOCK_UNLOCKED; + +static ssize_t bpa_nvram_read(char *buf, size_t count, loff_t *index) +{ + unsigned long flags; + + if (*index >= bpa_nvram_len) + return 0; + if (*index + count > bpa_nvram_len) + count = bpa_nvram_len - *index; + + spin_lock_irqsave(&bpa_nvram_lock, flags); + + memcpy_fromio(buf, bpa_nvram_start + *index, count); + + spin_unlock_irqrestore(&bpa_nvram_lock, flags); + + *index += count; + return count; +} + +static ssize_t bpa_nvram_write(char *buf, size_t count, loff_t *index) +{ + unsigned long flags; + + if (*index >= bpa_nvram_len) + return 0; + if (*index + count > bpa_nvram_len) + count = bpa_nvram_len - *index; + + spin_lock_irqsave(&bpa_nvram_lock, flags); + + memcpy_toio(bpa_nvram_start + *index, buf, count); + + spin_unlock_irqrestore(&bpa_nvram_lock, flags); + + *index += count; + return count; +} + +static ssize_t bpa_nvram_get_size(void) +{ + return bpa_nvram_len; +} + +int __init bpa_nvram_init(void) +{ + struct device_node *nvram_node; + unsigned long *buffer; + int proplen; + unsigned long nvram_addr; + int ret; + + ret = -ENODEV; + nvram_node = of_find_node_by_type(NULL, "nvram"); + if (!nvram_node) + goto out; + + ret = -EIO; + buffer = (unsigned long *)get_property(nvram_node, "reg", &proplen); + if (proplen != 2*sizeof(unsigned long)) + goto out; + + ret = -ENODEV; + nvram_addr = buffer[0]; + bpa_nvram_len = buffer[1]; + if ( (!bpa_nvram_len) || (!nvram_addr) ) + goto out; + + bpa_nvram_start = ioremap(nvram_addr, bpa_nvram_len); + if (!bpa_nvram_start) + goto out; + + printk(KERN_INFO "BPA NVRAM, %luk mapped to %p\n", + bpa_nvram_len >> 10, bpa_nvram_start); + + ppc_md.nvram_read = bpa_nvram_read; + ppc_md.nvram_write = bpa_nvram_write; + ppc_md.nvram_size = bpa_nvram_get_size; + +out: + of_node_put(nvram_node); + return ret; +} 
Index: linus-2.5/include/asm-ppc64/nvram.h =================================================================== --- linus-2.5.orig/include/asm-ppc64/nvram.h 2005-04-20 01:54:03.000000000 +0200 +++ linus-2.5/include/asm-ppc64/nvram.h 2005-04-20 01:55:36.000000000 +0200 @@ -70,6 +70,7 @@ extern int pSeries_nvram_init(void); extern int pmac_nvram_init(void); +extern int bpa_nvram_init(void); /* PowerMac specific nvram stuffs */ From arnd at arndb.de Sat May 14 05:21:25 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 13 May 2005 21:21:25 +0200 Subject: [PATCH 1/8] ppc64: split out generic rtas code from pSeries_pci.c In-Reply-To: <200505132117.37461.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> Message-ID: <200505132121.26996.arnd@arndb.de> BPA is using rtas for PCI but should not be confused by pSeries code. This also avoids some #ifdefs. Other platforms that want to use rtas_pci.c could create their own platform_pci.c with platform specific fixups. Signed-off-by: Arnd Bergmann --- linux-cg.orig/arch/ppc64/kernel/Makefile 2005-05-13 14:56:19.016994560 -0400 +++ linux-cg/arch/ppc64/kernel/Makefile 2005-05-13 15:00:05.111971888 -0400 @@ -32,13 +32,14 @@ obj-$(CONFIG_PPC_MULTIPLATFORM) += nvram obj-$(CONFIG_PPC_PSERIES) += pSeries_pci.o pSeries_lpar.o pSeries_hvCall.o \ pSeries_nvram.o rtasd.o ras.o pSeries_reconfig.o \ - xics.o rtas.o pSeries_setup.o pSeries_iommu.o + xics.o pSeries_setup.o pSeries_iommu.o obj-$(CONFIG_EEH) += eeh.o obj-$(CONFIG_PROC_FS) += proc_ppc64.o obj-$(CONFIG_RTAS_FLASH) += rtas_flash.o obj-$(CONFIG_SMP) += smp.o obj-$(CONFIG_MODULES) += module.o ppc_ksyms.o +obj-$(CONFIG_PPC_RTAS) += rtas.o rtas_pci.o obj-$(CONFIG_RTAS_PROC) += rtas-proc.o obj-$(CONFIG_SCANLOG) += scanlog.o obj-$(CONFIG_VIOPATH) += viopath.o --- linux-cg.orig/arch/ppc64/kernel/mpic.h 2005-05-13 14:56:19.018994256 -0400 +++ linux-cg/arch/ppc64/kernel/mpic.h 2005-05-13 15:00:10.785908048 -0400 @@ -265,3 +265,6 @@ extern void mpic_send_ipi(unsigned int i 
extern int mpic_get_one_irq(struct mpic *mpic, struct pt_regs *regs); /* This one gets to the primary mpic */ extern int mpic_get_irq(struct pt_regs *regs); + +/* global mpic for pSeries */ +extern struct mpic *pSeries_mpic; --- linux-cg.orig/arch/ppc64/kernel/pSeries_pci.c 2005-05-13 14:57:09.556898776 -0400 +++ linux-cg/arch/ppc64/kernel/pSeries_pci.c 2005-05-13 15:00:10.786907896 -0400 @@ -1,13 +1,11 @@ /* - * pSeries_pci.c + * arch/ppc64/kernel/pSeries_pci.c * * Copyright (C) 2001 Dave Engebretsen, IBM Corporation * Copyright (C) 2003 Anton Blanchard , IBM * * pSeries specific routines for PCI. * - * Based on code from pci.c and chrp_pci.c - * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -23,430 +21,18 @@ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include +#include #include -#include #include #include -#include -#include -#include -#include -#include -#include -#include #include -#include -#include +#include -#include "mpic.h" #include "pci.h" -/* RTAS tokens */ -static int read_pci_config; -static int write_pci_config; -static int ibm_read_pci_config; -static int ibm_write_pci_config; - -static int s7a_workaround; - -extern struct mpic *pSeries_mpic; - -static int config_access_valid(struct device_node *dn, int where) -{ - if (where < 256) - return 1; - if (where < 4096 && dn->pci_ext_config_space) - return 1; - - return 0; -} - -static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) -{ - int returnval = -1; - unsigned long buid, addr; - int ret; - - if (!dn) - return PCIBIOS_DEVICE_NOT_FOUND; - if (!config_access_valid(dn, where)) - return PCIBIOS_BAD_REGISTER_NUMBER; - - addr = ((where & 0xf00) << 20) | (dn->busno << 16) | - (dn->devfn << 8) | (where & 0xff); - buid = dn->phb->buid; - if (buid) { - ret = 
rtas_call(ibm_read_pci_config, 4, 2, &returnval, - addr, buid >> 32, buid & 0xffffffff, size); - } else { - ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size); - } - *val = returnval; - - if (ret) - return PCIBIOS_DEVICE_NOT_FOUND; - - if (returnval == EEH_IO_ERROR_VALUE(size) - && eeh_dn_check_failure (dn, NULL)) - return PCIBIOS_DEVICE_NOT_FOUND; - - return PCIBIOS_SUCCESSFUL; -} - -static int rtas_pci_read_config(struct pci_bus *bus, - unsigned int devfn, - int where, int size, u32 *val) -{ - struct device_node *busdn, *dn; - - if (bus->self) - busdn = pci_device_to_OF_node(bus->self); - else - busdn = bus->sysdata; /* must be a phb */ - - /* Search only direct children of the bus */ - for (dn = busdn->child; dn; dn = dn->sibling) - if (dn->devfn == devfn) - return rtas_read_config(dn, where, size, val); - return PCIBIOS_DEVICE_NOT_FOUND; -} - -static int rtas_write_config(struct device_node *dn, int where, int size, u32 val) -{ - unsigned long buid, addr; - int ret; - - if (!dn) - return PCIBIOS_DEVICE_NOT_FOUND; - if (!config_access_valid(dn, where)) - return PCIBIOS_BAD_REGISTER_NUMBER; - - addr = ((where & 0xf00) << 20) | (dn->busno << 16) | - (dn->devfn << 8) | (where & 0xff); - buid = dn->phb->buid; - if (buid) { - ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, buid >> 32, buid & 0xffffffff, size, (ulong) val); - } else { - ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val); - } - - if (ret) - return PCIBIOS_DEVICE_NOT_FOUND; - - return PCIBIOS_SUCCESSFUL; -} - -static int rtas_pci_write_config(struct pci_bus *bus, - unsigned int devfn, - int where, int size, u32 val) -{ - struct device_node *busdn, *dn; - - if (bus->self) - busdn = pci_device_to_OF_node(bus->self); - else - busdn = bus->sysdata; /* must be a phb */ - - /* Search only direct children of the bus */ - for (dn = busdn->child; dn; dn = dn->sibling) - if (dn->devfn == devfn) - return rtas_write_config(dn, where, size, val); - return 
PCIBIOS_DEVICE_NOT_FOUND; -} - -struct pci_ops rtas_pci_ops = { - rtas_pci_read_config, - rtas_pci_write_config -}; - -int is_python(struct device_node *dev) -{ - char *model = (char *)get_property(dev, "model", NULL); - - if (model && strstr(model, "Python")) - return 1; - - return 0; -} - -static int get_phb_reg_prop(struct device_node *dev, - unsigned int addr_size_words, - struct reg_property64 *reg) -{ - unsigned int *ui_ptr = NULL, len; - - /* Found a PHB, now figure out where his registers are mapped. */ - ui_ptr = (unsigned int *)get_property(dev, "reg", &len); - if (ui_ptr == NULL) - return 1; - - if (addr_size_words == 1) { - reg->address = ((struct reg_property32 *)ui_ptr)->address; - reg->size = ((struct reg_property32 *)ui_ptr)->size; - } else { - *reg = *((struct reg_property64 *)ui_ptr); - } - - return 0; -} - -static void python_countermeasures(struct device_node *dev, - unsigned int addr_size_words) -{ - struct reg_property64 reg_struct; - void __iomem *chip_regs; - volatile u32 val; - - if (get_phb_reg_prop(dev, addr_size_words, ®_struct)) - return; - - /* Python's register file is 1 MB in size. 
*/ - chip_regs = ioremap(reg_struct.address & ~(0xfffffUL), 0x100000); - - /* - * Firmware doesn't always clear this bit which is critical - * for good performance - Anton - */ - -#define PRG_CL_RESET_VALID 0x00010000 - - val = in_be32(chip_regs + 0xf6030); - if (val & PRG_CL_RESET_VALID) { - printk(KERN_INFO "Python workaround: "); - val &= ~PRG_CL_RESET_VALID; - out_be32(chip_regs + 0xf6030, val); - /* - * We must read it back for changes to - * take effect - */ - val = in_be32(chip_regs + 0xf6030); - printk("reg0: %x\n", val); - } - - iounmap(chip_regs); -} - -void __init init_pci_config_tokens (void) -{ - read_pci_config = rtas_token("read-pci-config"); - write_pci_config = rtas_token("write-pci-config"); - ibm_read_pci_config = rtas_token("ibm,read-pci-config"); - ibm_write_pci_config = rtas_token("ibm,write-pci-config"); -} - -unsigned long __devinit get_phb_buid (struct device_node *phb) -{ - int addr_cells; - unsigned int *buid_vals; - unsigned int len; - unsigned long buid; - - if (ibm_read_pci_config == -1) return 0; - - /* PHB's will always be children of the root node, - * or so it is promised by the current firmware. 
*/ - if (phb->parent == NULL) - return 0; - if (phb->parent->parent) - return 0; - - buid_vals = (unsigned int *) get_property(phb, "reg", &len); - if (buid_vals == NULL) - return 0; - - addr_cells = prom_n_addr_cells(phb); - if (addr_cells == 1) { - buid = (unsigned long) buid_vals[0]; - } else { - buid = (((unsigned long)buid_vals[0]) << 32UL) | - (((unsigned long)buid_vals[1]) & 0xffffffff); - } - return buid; -} - -static int phb_set_bus_ranges(struct device_node *dev, - struct pci_controller *phb) -{ - int *bus_range; - unsigned int len; - - bus_range = (int *) get_property(dev, "bus-range", &len); - if (bus_range == NULL || len < 2 * sizeof(int)) { - return 1; - } - - phb->first_busno = bus_range[0]; - phb->last_busno = bus_range[1]; - - return 0; -} - -static int __devinit setup_phb(struct device_node *dev, - struct pci_controller *phb, - unsigned int addr_size_words) -{ - pci_setup_pci_controller(phb); - - if (is_python(dev)) - python_countermeasures(dev, addr_size_words); - - if (phb_set_bus_ranges(dev, phb)) - return 1; - - phb->arch_data = dev; - phb->ops = &rtas_pci_ops; - phb->buid = get_phb_buid(dev); - - return 0; -} - -static void __devinit add_linux_pci_domain(struct device_node *dev, - struct pci_controller *phb, - struct property *of_prop) -{ - memset(of_prop, 0, sizeof(struct property)); - of_prop->name = "linux,pci-domain"; - of_prop->length = sizeof(phb->global_number); - of_prop->value = (unsigned char *)&of_prop[1]; - memcpy(of_prop->value, &phb->global_number, sizeof(phb->global_number)); - prom_add_property(dev, of_prop); -} - -static struct pci_controller * __init alloc_phb(struct device_node *dev, - unsigned int addr_size_words) -{ - struct pci_controller *phb; - struct property *of_prop; - - phb = alloc_bootmem(sizeof(struct pci_controller)); - if (phb == NULL) - return NULL; - - of_prop = alloc_bootmem(sizeof(struct property) + - sizeof(phb->global_number)); - if (!of_prop) - return NULL; - - if (setup_phb(dev, phb, addr_size_words)) - 
return NULL; - - add_linux_pci_domain(dev, phb, of_prop); - - return phb; -} - -static struct pci_controller * __devinit alloc_phb_dynamic(struct device_node *dev, unsigned int addr_size_words) -{ - struct pci_controller *phb; - - phb = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), - GFP_KERNEL); - if (phb == NULL) - return NULL; - - if (setup_phb(dev, phb, addr_size_words)) - return NULL; - - phb->is_dynamic = 1; - - /* TODO: linux,pci-domain? */ - - return phb; -} - -unsigned long __init find_and_init_phbs(void) -{ - struct device_node *node; - struct pci_controller *phb; - unsigned int root_size_cells = 0; - unsigned int index; - unsigned int *opprop = NULL; - struct device_node *root = of_find_node_by_path("/"); - - if (ppc64_interrupt_controller == IC_OPEN_PIC) { - opprop = (unsigned int *)get_property(root, - "platform-open-pic", NULL); - } - - root_size_cells = prom_n_size_cells(root); - - index = 0; - - for (node = of_get_next_child(root, NULL); - node != NULL; - node = of_get_next_child(root, node)) { - if (node->type == NULL || strcmp(node->type, "pci") != 0) - continue; - - phb = alloc_phb(node, root_size_cells); - if (!phb) - continue; - - pci_process_bridge_OF_ranges(phb, node); - pci_setup_phb_io(phb, index == 0); - - if (ppc64_interrupt_controller == IC_OPEN_PIC && pSeries_mpic) { - int addr = root_size_cells * (index + 2) - 1; - mpic_assign_isu(pSeries_mpic, index, opprop[addr]); - } - - index++; - } - - of_node_put(root); - pci_devs_phb_init(); - - /* - * pci_probe_only and pci_assign_all_buses can be set via properties - * in chosen. 
- */ - if (of_chosen) { - int *prop; - - prop = (int *)get_property(of_chosen, "linux,pci-probe-only", - NULL); - if (prop) - pci_probe_only = *prop; - - prop = (int *)get_property(of_chosen, - "linux,pci-assign-all-buses", NULL); - if (prop) - pci_assign_all_buses = *prop; - } - - return 0; -} - -struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) -{ - struct device_node *root = of_find_node_by_path("/"); - unsigned int root_size_cells = 0; - struct pci_controller *phb; - struct pci_bus *bus; - int primary; - - root_size_cells = prom_n_size_cells(root); - - primary = list_empty(&hose_list); - phb = alloc_phb_dynamic(dn, root_size_cells); - if (!phb) - return NULL; - - pci_process_bridge_OF_ranges(phb, dn); - - pci_setup_phb_io_dynamic(phb, primary); - of_node_put(root); - - pci_devs_phb_init_dynamic(phb); - phb->last_busno = 0xff; - bus = pci_scan_bus(phb->first_busno, phb->ops, phb->arch_data); - phb->bus = bus; - phb->last_busno = bus->subordinate; - - return phb; -} -EXPORT_SYMBOL(init_phb_dynamic); +static int __initdata s7a_workaround; #if 0 void pcibios_name_device(struct pci_dev *dev) @@ -474,7 +60,7 @@ void pcibios_name_device(struct pci_dev DECLARE_PCI_FIXUP_HEADER(PCI_ANY_ID, PCI_ANY_ID, pcibios_name_device); #endif -static void check_s7a(void) +static void __init check_s7a(void) { struct device_node *root; char *model; @@ -488,56 +74,6 @@ static void check_s7a(void) } } -/* RPA-specific bits for removing PHBs */ -int pcibios_remove_root_bus(struct pci_controller *phb) -{ - struct pci_bus *b = phb->bus; - struct resource *res; - int rc, i; - - res = b->resource[0]; - if (!res->flags) { - printk(KERN_ERR "%s: no IO resource for PHB %s\n", __FUNCTION__, - b->name); - return 1; - } - - rc = unmap_bus_range(b); - if (rc) { - printk(KERN_ERR "%s: failed to unmap IO on bus %s\n", - __FUNCTION__, b->name); - return 1; - } - - if (release_resource(res)) { - printk(KERN_ERR "%s: failed to release IO on bus %s\n", - __FUNCTION__, b->name); - 
return 1; - } - - for (i = 1; i < 3; ++i) { - res = b->resource[i]; - if (!res->flags && i == 0) { - printk(KERN_ERR "%s: no MEM resource for PHB %s\n", - __FUNCTION__, b->name); - return 1; - } - if (res->flags && release_resource(res)) { - printk(KERN_ERR - "%s: failed to release IO %d on bus %s\n", - __FUNCTION__, i, b->name); - return 1; - } - } - - list_del(&phb->list_node); - if (phb->is_dynamic) - kfree(phb); - - return 0; -} -EXPORT_SYMBOL(pcibios_remove_root_bus); - static void __init pSeries_request_regions(void) { if (!isa_io_base) --- linux-cg.orig/arch/ppc64/kernel/rtas_pci.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/arch/ppc64/kernel/rtas_pci.c 2005-05-13 15:00:10.788907592 -0400 @@ -0,0 +1,495 @@ +/* + * arch/ppc64/kernel/rtas_pci.c + * + * Copyright (C) 2001 Dave Engebretsen, IBM Corporation + * Copyright (C) 2003 Anton Blanchard , IBM + * + * RTAS specific routines for PCI. + * + * Based on code from pci.c, chrp_pci.c and pSeries_pci.c + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "mpic.h" +#include "pci.h" + +/* RTAS tokens */ +static int read_pci_config; +static int write_pci_config; +static int ibm_read_pci_config; +static int ibm_write_pci_config; + +static int config_access_valid(struct device_node *dn, int where) +{ + if (where < 256) + return 1; + if (where < 4096 && dn->pci_ext_config_space) + return 1; + + return 0; +} + +static int rtas_read_config(struct device_node *dn, int where, int size, u32 *val) +{ + int returnval = -1; + unsigned long buid, addr; + int ret; + + if (!dn) + return PCIBIOS_DEVICE_NOT_FOUND; + if (!config_access_valid(dn, where)) + return PCIBIOS_BAD_REGISTER_NUMBER; + + addr = ((where & 0xf00) << 20) | (dn->busno << 16) | + (dn->devfn << 8) | (where & 0xff); + buid = dn->phb->buid; + if (buid) { + ret = rtas_call(ibm_read_pci_config, 4, 2, &returnval, + addr, buid >> 32, buid & 0xffffffff, size); + } else { + ret = rtas_call(read_pci_config, 2, 2, &returnval, addr, size); + } + *val = returnval; + + if (ret) + return PCIBIOS_DEVICE_NOT_FOUND; + + if (returnval == EEH_IO_ERROR_VALUE(size) + && eeh_dn_check_failure (dn, NULL)) + return PCIBIOS_DEVICE_NOT_FOUND; + + return PCIBIOS_SUCCESSFUL; +} + +static int rtas_pci_read_config(struct pci_bus *bus, + unsigned int devfn, + int where, int size, u32 *val) +{ + struct device_node *busdn, *dn; + + if (bus->self) + busdn = pci_device_to_OF_node(bus->self); + else + busdn = bus->sysdata; /* must be a phb */ + + /* Search only direct children of the bus */ + for (dn = busdn->child; dn; dn = dn->sibling) + if (dn->devfn == devfn) + return rtas_read_config(dn, where, size, val); + return 
PCIBIOS_DEVICE_NOT_FOUND; +} + +static int rtas_write_config(struct device_node *dn, int where, int size, u32 val) +{ + unsigned long buid, addr; + int ret; + + if (!dn) + return PCIBIOS_DEVICE_NOT_FOUND; + if (!config_access_valid(dn, where)) + return PCIBIOS_BAD_REGISTER_NUMBER; + + addr = ((where & 0xf00) << 20) | (dn->busno << 16) | + (dn->devfn << 8) | (where & 0xff); + buid = dn->phb->buid; + if (buid) { + ret = rtas_call(ibm_write_pci_config, 5, 1, NULL, addr, buid >> 32, buid & 0xffffffff, size, (ulong) val); + } else { + ret = rtas_call(write_pci_config, 3, 1, NULL, addr, size, (ulong)val); + } + + if (ret) + return PCIBIOS_DEVICE_NOT_FOUND; + + return PCIBIOS_SUCCESSFUL; +} + +static int rtas_pci_write_config(struct pci_bus *bus, + unsigned int devfn, + int where, int size, u32 val) +{ + struct device_node *busdn, *dn; + + if (bus->self) + busdn = pci_device_to_OF_node(bus->self); + else + busdn = bus->sysdata; /* must be a phb */ + + /* Search only direct children of the bus */ + for (dn = busdn->child; dn; dn = dn->sibling) + if (dn->devfn == devfn) + return rtas_write_config(dn, where, size, val); + return PCIBIOS_DEVICE_NOT_FOUND; +} + +struct pci_ops rtas_pci_ops = { + rtas_pci_read_config, + rtas_pci_write_config +}; + +int is_python(struct device_node *dev) +{ + char *model = (char *)get_property(dev, "model", NULL); + + if (model && strstr(model, "Python")) + return 1; + + return 0; +} + +static int get_phb_reg_prop(struct device_node *dev, + unsigned int addr_size_words, + struct reg_property64 *reg) +{ + unsigned int *ui_ptr = NULL, len; + + /* Found a PHB, now figure out where his registers are mapped. 
*/ + ui_ptr = (unsigned int *)get_property(dev, "reg", &len); + if (ui_ptr == NULL) + return 1; + + if (addr_size_words == 1) { + reg->address = ((struct reg_property32 *)ui_ptr)->address; + reg->size = ((struct reg_property32 *)ui_ptr)->size; + } else { + *reg = *((struct reg_property64 *)ui_ptr); + } + + return 0; +} + +static void python_countermeasures(struct device_node *dev, + unsigned int addr_size_words) +{ + struct reg_property64 reg_struct; + void __iomem *chip_regs; + volatile u32 val; + + if (get_phb_reg_prop(dev, addr_size_words, &reg_struct)) + return; + + /* Python's register file is 1 MB in size. */ + chip_regs = ioremap(reg_struct.address & ~(0xfffffUL), 0x100000); + + /* + * Firmware doesn't always clear this bit which is critical + * for good performance - Anton + */ + +#define PRG_CL_RESET_VALID 0x00010000 + + val = in_be32(chip_regs + 0xf6030); + if (val & PRG_CL_RESET_VALID) { + printk(KERN_INFO "Python workaround: "); + val &= ~PRG_CL_RESET_VALID; + out_be32(chip_regs + 0xf6030, val); + /* + * We must read it back for changes to + * take effect + */ + val = in_be32(chip_regs + 0xf6030); + printk("reg0: %x\n", val); + } + + iounmap(chip_regs); +} + +void __init init_pci_config_tokens (void) +{ + read_pci_config = rtas_token("read-pci-config"); + write_pci_config = rtas_token("write-pci-config"); + ibm_read_pci_config = rtas_token("ibm,read-pci-config"); + ibm_write_pci_config = rtas_token("ibm,write-pci-config"); +} + +unsigned long __devinit get_phb_buid (struct device_node *phb) +{ + int addr_cells; + unsigned int *buid_vals; + unsigned int len; + unsigned long buid; + + if (ibm_read_pci_config == -1) return 0; + + /* PHB's will always be children of the root node, + * or so it is promised by the current firmware.
*/ + if (phb->parent == NULL) + return 0; + if (phb->parent->parent) + return 0; + + buid_vals = (unsigned int *) get_property(phb, "reg", &len); + if (buid_vals == NULL) + return 0; + + addr_cells = prom_n_addr_cells(phb); + if (addr_cells == 1) { + buid = (unsigned long) buid_vals[0]; + } else { + buid = (((unsigned long)buid_vals[0]) << 32UL) | + (((unsigned long)buid_vals[1]) & 0xffffffff); + } + return buid; +} + +static int phb_set_bus_ranges(struct device_node *dev, + struct pci_controller *phb) +{ + int *bus_range; + unsigned int len; + + bus_range = (int *) get_property(dev, "bus-range", &len); + if (bus_range == NULL || len < 2 * sizeof(int)) { + return 1; + } + + phb->first_busno = bus_range[0]; + phb->last_busno = bus_range[1]; + + return 0; +} + +static int __devinit setup_phb(struct device_node *dev, + struct pci_controller *phb, + unsigned int addr_size_words) +{ + pci_setup_pci_controller(phb); + + if (is_python(dev)) + python_countermeasures(dev, addr_size_words); + + if (phb_set_bus_ranges(dev, phb)) + return 1; + + phb->arch_data = dev; + phb->ops = &rtas_pci_ops; + phb->buid = get_phb_buid(dev); + + return 0; +} + +static void __devinit add_linux_pci_domain(struct device_node *dev, + struct pci_controller *phb, + struct property *of_prop) +{ + memset(of_prop, 0, sizeof(struct property)); + of_prop->name = "linux,pci-domain"; + of_prop->length = sizeof(phb->global_number); + of_prop->value = (unsigned char *)&of_prop[1]; + memcpy(of_prop->value, &phb->global_number, sizeof(phb->global_number)); + prom_add_property(dev, of_prop); +} + +static struct pci_controller * __init alloc_phb(struct device_node *dev, + unsigned int addr_size_words) +{ + struct pci_controller *phb; + struct property *of_prop; + + phb = alloc_bootmem(sizeof(struct pci_controller)); + if (phb == NULL) + return NULL; + + of_prop = alloc_bootmem(sizeof(struct property) + + sizeof(phb->global_number)); + if (!of_prop) + return NULL; + + if (setup_phb(dev, phb, addr_size_words)) + 
return NULL; + + add_linux_pci_domain(dev, phb, of_prop); + + return phb; +} + +static struct pci_controller * __devinit alloc_phb_dynamic(struct device_node *dev, unsigned int addr_size_words) +{ + struct pci_controller *phb; + + phb = (struct pci_controller *)kmalloc(sizeof(struct pci_controller), + GFP_KERNEL); + if (phb == NULL) + return NULL; + + if (setup_phb(dev, phb, addr_size_words)) + return NULL; + + phb->is_dynamic = 1; + + /* TODO: linux,pci-domain? */ + + return phb; +} + +unsigned long __init find_and_init_phbs(void) +{ + struct device_node *node; + struct pci_controller *phb; + unsigned int root_size_cells = 0; + unsigned int index; + unsigned int *opprop = NULL; + struct device_node *root = of_find_node_by_path("/"); + + if (ppc64_interrupt_controller == IC_OPEN_PIC) { + opprop = (unsigned int *)get_property(root, + "platform-open-pic", NULL); + } + + root_size_cells = prom_n_size_cells(root); + + index = 0; + + for (node = of_get_next_child(root, NULL); + node != NULL; + node = of_get_next_child(root, node)) { + if (node->type == NULL || strcmp(node->type, "pci") != 0) + continue; + + phb = alloc_phb(node, root_size_cells); + if (!phb) + continue; + + pci_process_bridge_OF_ranges(phb, node); + pci_setup_phb_io(phb, index == 0); +#ifdef CONFIG_PPC_PSERIES + if (ppc64_interrupt_controller == IC_OPEN_PIC && pSeries_mpic) { + int addr = root_size_cells * (index + 2) - 1; + mpic_assign_isu(pSeries_mpic, index, opprop[addr]); + } +#endif + index++; + } + + of_node_put(root); + pci_devs_phb_init(); + + /* + * pci_probe_only and pci_assign_all_buses can be set via properties + * in chosen. 
+ */ + if (of_chosen) { + int *prop; + + prop = (int *)get_property(of_chosen, "linux,pci-probe-only", + NULL); + if (prop) + pci_probe_only = *prop; + + prop = (int *)get_property(of_chosen, + "linux,pci-assign-all-buses", NULL); + if (prop) + pci_assign_all_buses = *prop; + } + + return 0; +} + +struct pci_controller * __devinit init_phb_dynamic(struct device_node *dn) +{ + struct device_node *root = of_find_node_by_path("/"); + unsigned int root_size_cells = 0; + struct pci_controller *phb; + struct pci_bus *bus; + int primary; + + root_size_cells = prom_n_size_cells(root); + + primary = list_empty(&hose_list); + phb = alloc_phb_dynamic(dn, root_size_cells); + if (!phb) + return NULL; + + pci_process_bridge_OF_ranges(phb, dn); + + pci_setup_phb_io_dynamic(phb, primary); + of_node_put(root); + + pci_devs_phb_init_dynamic(phb); + phb->last_busno = 0xff; + bus = pci_scan_bus(phb->first_busno, phb->ops, phb->arch_data); + phb->bus = bus; + phb->last_busno = bus->subordinate; + + return phb; +} +EXPORT_SYMBOL(init_phb_dynamic); + +/* RPA-specific bits for removing PHBs */ +int pcibios_remove_root_bus(struct pci_controller *phb) +{ + struct pci_bus *b = phb->bus; + struct resource *res; + int rc, i; + + res = b->resource[0]; + if (!res->flags) { + printk(KERN_ERR "%s: no IO resource for PHB %s\n", __FUNCTION__, + b->name); + return 1; + } + + rc = unmap_bus_range(b); + if (rc) { + printk(KERN_ERR "%s: failed to unmap IO on bus %s\n", + __FUNCTION__, b->name); + return 1; + } + + if (release_resource(res)) { + printk(KERN_ERR "%s: failed to release IO on bus %s\n", + __FUNCTION__, b->name); + return 1; + } + + for (i = 1; i < 3; ++i) { + res = b->resource[i]; + if (!res->flags && i == 0) { + printk(KERN_ERR "%s: no MEM resource for PHB %s\n", + __FUNCTION__, b->name); + return 1; + } + if (res->flags && release_resource(res)) { + printk(KERN_ERR + "%s: failed to release IO %d on bus %s\n", + __FUNCTION__, i, b->name); + return 1; + } + } + + list_del(&phb->list_node); 
+ if (phb->is_dynamic) + kfree(phb); + + return 0; +} +EXPORT_SYMBOL(pcibios_remove_root_bus); From arnd at arndb.de Sat May 14 05:25:33 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 13 May 2005 21:25:33 +0200 Subject: [PATCH 4/8] ppc64: add BPA platform type In-Reply-To: <200505132117.37461.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> Message-ID: <200505132125.34358.arnd@arndb.de> This adds the basic support for running on BPA machines. So far, this is only the IBM workstation, and it will not run on others without a little more generalization. It should be possible to configure a kernel for any combination of CONFIG_PPC_BPA with any of the other multiplatform targets. Signed-off-by: Arnd Bergmann Index: linus-2.5/MAINTAINERS =================================================================== --- linus-2.5.orig/MAINTAINERS 2005-05-09 08:14:59.000000000 +0200 +++ linus-2.5/MAINTAINERS 2005-05-09 08:17:38.000000000 +0200 @@ -493,6 +493,13 @@ W: http://sourceforge.net/projects/bonding/ S: Supported +BROADBAND PROCESSOR ARCHITECTURE +P: Arnd Bergmann +M: arnd at arndb.de +L: linuxppc64-dev at ozlabs.org +W: http://linuxppc64.org +S: Supported + BTTV VIDEO4LINUX DRIVER P: Gerd Knorr M: kraxel at bytesex.org Index: linus-2.5/arch/ppc64/Kconfig =================================================================== --- linus-2.5.orig/arch/ppc64/Kconfig 2005-05-09 08:15:08.000000000 +0200 +++ linus-2.5/arch/ppc64/Kconfig 2005-05-09 08:17:38.000000000 +0200 @@ -77,6 +77,10 @@ bool " IBM pSeries & new iSeries" default y +config PPC_BPA + bool " Broadband Processor Architecture" + depends on PPC_MULTIPLATFORM + config PPC_PMAC depends on PPC_MULTIPLATFORM bool " Apple G5 based machines" @@ -256,7 +260,7 @@ config PPC_RTAS bool - depends on PPC_PSERIES + depends on PPC_PSERIES || PPC_BPA default y config RTAS_PROC Index: linus-2.5/arch/ppc64/Makefile =================================================================== --- 
linus-2.5.orig/arch/ppc64/Makefile 2005-05-09 08:15:08.000000000 +0200 +++ linus-2.5/arch/ppc64/Makefile 2005-05-09 08:17:38.000000000 +0200 @@ -90,12 +90,14 @@ boottarget-$(CONFIG_PPC_PSERIES) := zImage zImage.initrd boottarget-$(CONFIG_PPC_MAPLE) := zImage zImage.initrd boottarget-$(CONFIG_PPC_ISERIES) := vmlinux.sminitrd vmlinux.initrd vmlinux.sm +boottarget-$(CONFIG_PPC_BPA) := zImage zImage.initrd $(boottarget-y): vmlinux $(Q)$(MAKE) $(build)=$(boot) $(boot)/$@ bootimage-$(CONFIG_PPC_PSERIES) := $(boot)/zImage bootimage-$(CONFIG_PPC_PMAC) := vmlinux bootimage-$(CONFIG_PPC_MAPLE) := $(boot)/zImage +bootimage-$(CONFIG_PPC_BPA) := zImage bootimage-$(CONFIG_PPC_ISERIES) := vmlinux BOOTIMAGE := $(bootimage-y) install: vmlinux Index: linus-2.5/arch/ppc64/kernel/Makefile =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/Makefile 2005-05-09 08:16:57.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/Makefile 2005-05-09 08:17:38.000000000 +0200 @@ -34,6 +34,8 @@ pSeries_nvram.o rtasd.o ras.o pSeries_reconfig.o \ xics.o pSeries_setup.o pSeries_iommu.o +obj-$(CONFIG_PPC_BPA) += bpa_setup.o bpa_nvram.o + obj-$(CONFIG_EEH) += eeh.o obj-$(CONFIG_PROC_FS) += proc_ppc64.o obj-$(CONFIG_RTAS_FLASH) += rtas_flash.o @@ -60,6 +62,7 @@ obj-$(CONFIG_PPC_PMAC) += pmac_smp.o smp-tbsync.o obj-$(CONFIG_PPC_ISERIES) += iSeries_smp.o obj-$(CONFIG_PPC_PSERIES) += pSeries_smp.o +obj-$(CONFIG_PPC_BPA) += pSeries_smp.o obj-$(CONFIG_PPC_MAPLE) += smp-tbsync.o endif Index: linus-2.5/arch/ppc64/kernel/bpa_setup.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linus-2.5/arch/ppc64/kernel/bpa_setup.c 2005-05-09 08:17:38.000000000 +0200 @@ -0,0 +1,207 @@ +/* + * linux/arch/ppc/kernel/bpa_setup.c + * + * Copyright (C) 1995 Linus Torvalds + * Adapted from 'alpha' version by Gary Thomas + * Modified by Cort Dougan (cort at cs.nmt.edu) + * Modified by PPC64 Team, IBM Corp + * 
Modified by BPA Team, IBM Deutschland Entwicklung GmbH + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#undef DEBUG + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pci.h" + +#ifdef DEBUG +#define DBG(fmt...) udbg_printf(fmt) +#else +#define DBG(fmt...) +#endif + +extern void pSeries_get_boot_time(struct rtc_time *rtc_time); +extern void pSeries_get_rtc_time(struct rtc_time *rtc_time); +extern int pSeries_set_rtc_time(struct rtc_time *rtc_time); + +extern unsigned long ppc_proc_freq; +extern unsigned long ppc_tb_freq; + +void bpa_get_cpuinfo(struct seq_file *m) +{ + struct device_node *root; + const char *model = ""; + + root = of_find_node_by_path("/"); + if (root) + model = get_property(root, "model", NULL); + seq_printf(m, "machine\t\t: BPA %s\n", model); + of_node_put(root); +} + +static void __init bpa_progress(char *s, unsigned short hex) +{ + printk("*** %04x : %s\n", hex, s ? s : ""); +} + +extern void setup_default_decr(void); + +/* Some sane defaults: 125 MHz timebase, 1GHz processor */ +#define DEFAULT_TB_FREQ 125000000UL +#define DEFAULT_PROC_FREQ (DEFAULT_TB_FREQ * 8) + +/* FIXME: consolidate this into rtas.c or similar */ +static void __init pSeries_calibrate_decr(void) +{ + struct device_node *cpu; + struct div_result divres; + unsigned int *fp; + int node_found; + + /* + * The cpu node should have a timebase-frequency property + * to tell us the rate at which the decrementer counts. 
+ */ + cpu = of_find_node_by_type(NULL, "cpu"); + + ppc_tb_freq = DEFAULT_TB_FREQ; /* hardcoded default */ + node_found = 0; + if (cpu != 0) { + fp = (unsigned int *)get_property(cpu, "timebase-frequency", + NULL); + if (fp != 0) { + node_found = 1; + ppc_tb_freq = *fp; + } + } + if (!node_found) + printk(KERN_ERR "WARNING: Estimating decrementer frequency " + "(not found)\n"); + + ppc_proc_freq = DEFAULT_PROC_FREQ; + node_found = 0; + if (cpu != 0) { + fp = (unsigned int *)get_property(cpu, "clock-frequency", + NULL); + if (fp != 0) { + node_found = 1; + ppc_proc_freq = *fp; + } + } + if (!node_found) + printk(KERN_ERR "WARNING: Estimating processor frequency " + "(not found)\n"); + + of_node_put(cpu); + + printk(KERN_INFO "time_init: decrementer frequency = %lu.%.6lu MHz\n", + ppc_tb_freq/1000000, ppc_tb_freq%1000000); + printk(KERN_INFO "time_init: processor frequency = %lu.%.6lu MHz\n", + ppc_proc_freq/1000000, ppc_proc_freq%1000000); + + tb_ticks_per_jiffy = ppc_tb_freq / HZ; + tb_ticks_per_sec = tb_ticks_per_jiffy * HZ; + tb_ticks_per_usec = ppc_tb_freq / 1000000; + tb_to_us = mulhwu_scale_factor(ppc_tb_freq, 1000000); + div128_by_32(1024*1024, 0, tb_ticks_per_sec, &divres); + tb_to_xs = divres.result_low; + + setup_default_decr(); +} + +static void __init bpa_setup_arch(void) +{ +#ifdef CONFIG_SMP + smp_init_pSeries(); +#endif + + /* init to some ~sane value until calibrate_delay() runs */ + loops_per_jiffy = 50000000; + + if (ROOT_DEV == 0) { + printk("No ramdisk, default root is /dev/hda2\n"); + ROOT_DEV = Root_HDA2; + } + + /* Find and initialize PCI host bridges */ + init_pci_config_tokens(); + find_and_init_phbs(); + +#ifdef CONFIG_DUMMY_CONSOLE + conswitchp = &dummy_con; +#endif + + // bpa_nvram_init(); +} + +/* + * Early initialization. 
Relocation is on but do not reference unbolted pages + */ +static void __init bpa_init_early(void) +{ + DBG(" -> bpa_init_early()\n"); + + hpte_init_native(); + + pci_direct_iommu_init(); + + ppc64_interrupt_controller = IC_BPA_IIC; + + DBG(" <- bpa_init_early()\n"); +} + + +static int __init bpa_probe(int platform) +{ + if (platform != PLATFORM_BPA) + return 0; + + return 1; +} + +struct machdep_calls __initdata bpa_md = { + .probe = bpa_probe, + .setup_arch = bpa_setup_arch, + .init_early = bpa_init_early, + .get_cpuinfo = bpa_get_cpuinfo, + .restart = rtas_restart, + .power_off = rtas_power_off, + .halt = rtas_halt, + .get_boot_time = pSeries_get_boot_time, + .get_rtc_time = pSeries_get_rtc_time, + .set_rtc_time = pSeries_set_rtc_time, + .calibrate_decr = pSeries_calibrate_decr, + .progress = bpa_progress, +}; Index: linus-2.5/arch/ppc64/kernel/cpu_setup_power4.S =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/cpu_setup_power4.S 2005-05-08 09:51:35.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/cpu_setup_power4.S 2005-05-09 08:17:38.000000000 +0200 @@ -73,7 +73,21 @@ _GLOBAL(__setup_cpu_power4) blr - + +_GLOBAL(__setup_cpu_be) + /* Set large page sizes LP=0: 16MB, LP=1: 64KB */ + addi r3, 0, 0 + ori r3, r3, HID6_LB + sldi r3, r3, 32 + nor r3, r3, r3 + mfspr r4, SPRN_HID6 + and r4, r4, r3 + addi r3, 0, 0x02000 + sldi r3, r3, 32 + or r4, r4, r3 + mtspr SPRN_HID6, r4 + blr + _GLOBAL(__setup_cpu_ppc970) mfspr r0,SPRN_HID0 li r11,5 /* clear DOZE and SLEEP */ Index: linus-2.5/arch/ppc64/kernel/cputable.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/cputable.c 2005-05-08 09:51:35.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/cputable.c 2005-05-09 08:17:38.000000000 +0200 @@ -34,6 +34,7 @@ extern void __setup_cpu_power3(unsigned long offset, struct cpu_spec* spec); extern void __setup_cpu_power4(unsigned long offset, struct cpu_spec* spec); extern void 
__setup_cpu_ppc970(unsigned long offset, struct cpu_spec* spec); +extern void __setup_cpu_be(unsigned long offset, struct cpu_spec* spec); /* We only set the altivec features if the kernel was compiled with altivec @@ -162,6 +163,16 @@ __setup_cpu_power4, COMMON_PPC64_FW }, + { /* BE DD1.x */ + 0xffff0000, 0x00700000, "Broadband Engine", + CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | + CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_ALTIVEC_COMP | + CPU_FTR_SMT, + COMMON_USER_PPC64 | PPC_FEATURE_HAS_ALTIVEC_COMP, + 128, 128, + __setup_cpu_be, + COMMON_PPC64_FW + }, { /* default match */ 0x00000000, 0x00000000, "POWER4 (compatible)", CPU_FTR_SPLIT_ID_CACHE | CPU_FTR_USE_TB | CPU_FTR_HPTE_TABLE | Index: linus-2.5/arch/ppc64/kernel/irq.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/irq.c 2005-05-08 09:51:35.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/irq.c 2005-05-09 08:17:38.000000000 +0200 @@ -395,6 +395,9 @@ if (ppc64_interrupt_controller == IC_OPEN_PIC) return real_irq; /* no mapping for openpic (for now) */ + if (ppc64_interrupt_controller == IC_BPA_IIC) + return real_irq; /* no mapping for iic either */ + /* don't map interrupts < MIN_VIRT_IRQ */ if (real_irq < MIN_VIRT_IRQ) { virt_irq_to_real_map[real_irq] = real_irq; Index: linus-2.5/arch/ppc64/kernel/proc_ppc64.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/proc_ppc64.c 2005-05-08 09:51:35.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/proc_ppc64.c 2005-05-09 08:17:38.000000000 +0200 @@ -53,7 +53,7 @@ if (!root) return 1; - if (!(systemcfg->platform & PLATFORM_PSERIES)) + if (!(systemcfg->platform & (PLATFORM_PSERIES | PLATFORM_BPA))) return 0; if (!proc_mkdir("rtas", root)) Index: linus-2.5/arch/ppc64/kernel/prom_init.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/prom_init.c 2005-05-09 08:15:08.000000000 +0200 +++ 
linus-2.5/arch/ppc64/kernel/prom_init.c 2005-05-09 08:17:38.000000000 +0200 @@ -1844,9 +1844,9 @@ &getprop_rval, sizeof(getprop_rval)); /* - * On pSeries, copy the CPU hold code + * On pSeries and BPA, copy the CPU hold code */ - if (RELOC(of_platform) & PLATFORM_PSERIES) + if (RELOC(of_platform) & (PLATFORM_PSERIES | PLATFORM_BPA)) copy_and_flush(0, KERNELBASE - offset, 0x100, 0); /* Index: linus-2.5/arch/ppc64/kernel/setup.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/setup.c 2005-05-08 09:51:35.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/setup.c 2005-05-09 08:17:38.000000000 +0200 @@ -348,6 +348,7 @@ extern struct machdep_calls pSeries_md; extern struct machdep_calls pmac_md; extern struct machdep_calls maple_md; +extern struct machdep_calls bpa_md; /* Ultimately, stuff them in an elf section like initcalls... */ static struct machdep_calls __initdata *machines[] = { @@ -360,6 +361,9 @@ #ifdef CONFIG_PPC_MAPLE &maple_md, #endif /* CONFIG_PPC_MAPLE */ +#ifdef CONFIG_PPC_BPA + &bpa_md, +#endif NULL }; Index: linus-2.5/arch/ppc64/kernel/traps.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/traps.c 2005-05-09 08:04:31.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/traps.c 2005-05-09 08:17:38.000000000 +0200 @@ -126,6 +126,10 @@ printk("POWERMAC "); nl = 1; break; + case PLATFORM_BPA: + printk("BPA "); + nl = 1; + break; } if (nl) printk("\n"); Index: linus-2.5/include/asm-ppc64/mmu.h =================================================================== --- linus-2.5.orig/include/asm-ppc64/mmu.h 2005-05-09 08:15:42.000000000 +0200 +++ linus-2.5/include/asm-ppc64/mmu.h 2005-05-09 08:20:31.000000000 +0200 @@ -47,9 +47,10 @@ #define SLB_VSID_KS ASM_CONST(0x0000000000000800) #define SLB_VSID_KP ASM_CONST(0x0000000000000400) #define SLB_VSID_N ASM_CONST(0x0000000000000200) /* no-execute */ -#define SLB_VSID_L ASM_CONST(0x0000000000000100) /* largepage 
16M */ +#define SLB_VSID_L ASM_CONST(0x0000000000000100) /* largepage */ #define SLB_VSID_C ASM_CONST(0x0000000000000080) /* class */ - +#define SLB_VSID_LS ASM_CONST(0x0000000000000070) /* size of largepage */ + #define SLB_VSID_KERNEL (SLB_VSID_KP|SLB_VSID_C) #define SLB_VSID_USER (SLB_VSID_KP|SLB_VSID_KS) Index: linus-2.5/include/asm-ppc64/processor.h =================================================================== --- linus-2.5.orig/include/asm-ppc64/processor.h 2005-05-09 08:04:46.000000000 +0200 +++ linus-2.5/include/asm-ppc64/processor.h 2005-05-09 08:17:38.000000000 +0200 @@ -217,14 +217,22 @@ #define HID0_ABE (1<<3) /* Address Broadcast Enable */ #define HID0_BHTE (1<<2) /* Branch History Table Enable */ #define HID0_BTCD (1<<1) /* Branch target cache disable */ +#define SPRN_HID6 0x3F9 /* Hardware Implementation Register 6 */ +#define HID6_LB (0x0F<<12) /* Concurrent Large Page Modes */ +#define HID6_DLP (1<<20) /* Disable all large page modes (4K only) */ #define SPRN_MSRDORM 0x3F1 /* Hardware Implementation Register 1 */ #define SPRN_HID1 0x3F1 /* Hardware Implementation Register 1 */ #define SPRN_IABR 0x3F2 /* Instruction Address Breakpoint Register */ #define SPRN_NIADORM 0x3F3 /* Hardware Implementation Register 2 */ #define SPRN_HID4 0x3F4 /* 970 HID4 */ #define SPRN_HID5 0x3F6 /* 970 HID5 */ -#define SPRN_TSC 0x3FD /* Thread switch control */ -#define SPRN_TST 0x3FC /* Thread switch timeout */ +#define SPRN_TSCR 0x399 /* Thread switch control on BE */ +#define SPRN_TTR 0x39A /* Thread switch timeout on BE */ +#define TSCR_DEC_ENABLE 0x200000 /* Decrementer Interrupt */ +#define TSCR_EE_ENABLE 0x100000 /* External Interrupt */ +#define TSCR_EE_BOOST 0x080000 /* External Interrupt Boost */ +#define SPRN_TSC 0x3FD /* Thread switch control on others */ +#define SPRN_TST 0x3FC /* Thread switch timeout on others */ #define SPRN_IAC1 0x3F4 /* Instruction Address Compare 1 */ #define SPRN_IAC2 0x3F5 /* Instruction Address Compare 2 */ #define SPRN_ICCR 
0x3FB /* Instruction Cache Cacheability Register */ @@ -411,8 +419,9 @@ #define PV_POWER5 0x003A #define PV_POWER5p 0x003B #define PV_970FX 0x003C -#define PV_630 0x0040 -#define PV_630p 0x0041 +#define PV_630 0x0040 +#define PV_630p 0x0041 +#define PV_BE 0x0070 /* Platforms supported by PPC64 */ #define PLATFORM_PSERIES 0x0100 @@ -421,6 +430,7 @@ #define PLATFORM_LPAR 0x0001 #define PLATFORM_POWERMAC 0x0400 #define PLATFORM_MAPLE 0x0500 +#define PLATFORM_BPA 0x1000 /* Compatibility with drivers coming from PPC32 world */ #define _machine (systemcfg->platform) @@ -432,6 +442,7 @@ #define IC_INVALID 0 #define IC_OPEN_PIC 1 #define IC_PPC_XIC 2 +#define IC_BPA_IIC 3 #define XGLUE(a,b) a##b #define GLUE(a,b) XGLUE(a,b) Index: linus-2.5/include/asm-ppc64/smp.h =================================================================== --- linus-2.5.orig/include/asm-ppc64/smp.h 2005-05-08 09:51:35.000000000 +0200 +++ linus-2.5/include/asm-ppc64/smp.h 2005-05-09 08:17:38.000000000 +0200 @@ -85,6 +85,14 @@ extern struct smp_ops_t *smp_ops; +#ifdef CONFIG_PPC_PSERIES +void vpa_init(int cpu); +#else +static inline void vpa_init(int cpu) +{ +} +#endif /* CONFIG_PPC_PSERIES */ + #endif /* __ASSEMBLY__ */ #endif /* !(_PPC64_SMP_H) */ From arnd at arndb.de Sat May 14 05:24:47 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 13 May 2005 21:24:47 +0200 Subject: [PATCH 3/8] ppc64: add a watchdog driver for rtas In-Reply-To: <200505132117.37461.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> Message-ID: <200505132124.48963.arnd@arndb.de> Add a watchdog using the RTAS OS surveillance service. This is provided as a simpler alternative to rtasd. The added value is that it works with standard watchdog client programs and can therefore also do user space monitoring. On BPA, rtasd is not really useful because the hardware does not have much to report with event-scan. The driver should also work on other platforms that support the OS surveillance rtas calls. 
From: Utz Bacher Signed-off-by: Arnd Bergmann --- linux-2.6-ppc.orig/drivers/char/watchdog/Kconfig 2005-03-18 07:08:59.836902728 -0500 +++ linux-2.6-ppc/drivers/char/watchdog/Kconfig 2005-03-18 07:09:12.047905480 -0500 @@ -414,6 +414,16 @@ config WATCHDOG_RIO machines. The watchdog timeout period is normally one minute but can be changed with a boot-time parameter. +# ppc64 RTAS watchdog +config WATCHDOG_RTAS + tristate "RTAS watchdog" + depends on WATCHDOG && PPC_RTAS + help + This driver adds watchdog support for the RTAS watchdog. + + To compile this driver as a module, choose M here. The module + will be called wdrtas. + # # ISA-based Watchdog Cards # --- linux-2.6-ppc.orig/drivers/char/watchdog/Makefile 2005-03-18 07:08:59.857899536 -0500 +++ linux-2.6-ppc/drivers/char/watchdog/Makefile 2005-03-18 07:09:52.344904960 -0500 @@ -33,6 +33,7 @@ obj-$(CONFIG_USBPCWATCHDOG) += pcwd_usb. obj-$(CONFIG_IXP4XX_WATCHDOG) += ixp4xx_wdt.o obj-$(CONFIG_IXP2000_WATCHDOG) += ixp2000_wdt.o obj-$(CONFIG_8xx_WDT) += mpc8xx_wdt.o +obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # Only one watchdog can succeed. We probe the hardware watchdog # drivers first, then the softdog driver. This means if your hardware --- linux-2.6-ppc.orig/drivers/char/watchdog/wdrtas.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-2.6-ppc/drivers/char/watchdog/wdrtas.c 2005-03-18 07:09:12.051904872 -0500 @@ -0,0 +1,691 @@ +/* + * FIXME: add wdrtas_get_status and wdrtas_get_boot_status as soon as + * RTAS calls are available + */ + +/* + * RTAS watchdog driver + * + * (C) Copyright IBM Corp. 2005 + * device driver to exploit watchdog RTAS functions + * + * Authors : Utz Bacher + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#define WDRTAS_MAGIC_CHAR 42 +#define WDRTAS_SUPPORTED_MASK (WDIOF_SETTIMEOUT | \ + WDIOF_MAGICCLOSE) + +MODULE_AUTHOR("Utz Bacher "); +MODULE_DESCRIPTION("RTAS watchdog driver"); +MODULE_LICENSE("GPL"); +MODULE_ALIAS_MISCDEV(WATCHDOG_MINOR); +MODULE_ALIAS_MISCDEV(TEMP_MINOR); + +#ifdef CONFIG_WATCHDOG_NOWAYOUT +static int wdrtas_nowayout = 1; +#else +static int wdrtas_nowayout = 0; +#endif + +static volatile int wdrtas_miscdev_open = 0; +static char wdrtas_expect_close = 0; + +static int wdrtas_interval; + +#define WDRTAS_THERMAL_SENSOR 3 +static int wdrtas_token_get_sensor_state; +#define WDRTAS_SURVEILLANCE_IND 9000 +static int wdrtas_token_set_indicator; +#define WDRTAS_SP_SPI 28 +static int wdrtas_token_get_sp; +static int wdrtas_token_event_scan; + +#define WDRTAS_DEFAULT_INTERVAL 300 + +#define WDRTAS_LOGBUFFER_LEN 128 +static char wdrtas_logbuffer[WDRTAS_LOGBUFFER_LEN]; + + +/*** watchdog access functions */ + +/** + * wdrtas_set_interval - sets the watchdog interval + * @interval: new interval + * + * returns 0 on success, <0 on failures + * + * wdrtas_set_interval sets the watchdog keepalive interval by calling the + * RTAS function set-indicator (surveillance). The unit of interval is + * seconds. 
+ */ +static int +wdrtas_set_interval(int interval) +{ + long result; + static int print_msg = 10; + + /* rtas uses minutes */ + interval = (interval + 59) / 60; + + result = rtas_call(wdrtas_token_set_indicator, 3, 1, NULL, + WDRTAS_SURVEILLANCE_IND, 0, interval); + if ( (result < 0) && (print_msg) ) { + printk("wdrtas: setting the watchdog to %i timeout failed: " + "%li\n", interval, result); + print_msg--; + } + + return result; +} + +/** + * wdrtas_get_interval - returns the current watchdog interval + * @fallback_value: value (in seconds) to use, if the RTAS call fails + * + * returns the interval + * + * wdrtas_get_interval returns the current watchdog keepalive interval + * as reported by the RTAS function ibm,get-system-parameter. The unit + * of the return value is seconds. + */ +static int +wdrtas_get_interval(int fallback_value) +{ + long result; + char value[4]; + + result = rtas_call(wdrtas_token_get_sp, 3, 1, NULL, + WDRTAS_SP_SPI, (void *)__pa(&value), 4); + if ( (value[0] != 0) || (value[1] != 2) || (value[3] != 0) || + (result < 0) ) { + printk("wdrtas: could not get sp_spi watchdog timeout (%li). 
" + "Continuing\n", result); + return fallback_value; + } + + /* rtas uses minutes */ + return ((int)value[2]) * 60; +} + +/** + * wdrtas_timer_start - starts watchdog + * + * wdrtas_timer_start starts the watchdog by calling the RTAS function + * set-interval (surveillance) + */ +static void +wdrtas_timer_start(void) +{ + wdrtas_set_interval(wdrtas_interval); +} + +/** + * wdrtas_timer_stop - stops watchdog + * + * wdrtas_timer_stop stops the watchdog timer by calling the RTAS function + * set-interval (surveillance) + */ +static void +wdrtas_timer_stop(void) +{ + wdrtas_set_interval(0); +} + +/** + * wdrtas_log_scanned_event - logs an event we received during keepalive + * + * wdrtas_log_scanned_event prints a message to the log buffer dumping + * the results of the last event-scan call + */ +static void +wdrtas_log_scanned_event(void) +{ + int i; + + for (i = 0; i < WDRTAS_LOGBUFFER_LEN; i += 16) + printk("wdrtas: dumping event (line %i/%i), data = " + "%02x %02x %02x %02x %02x %02x %02x %02x " + "%02x %02x %02x %02x %02x %02x %02x %02x\n", + (i / 16) + 1, (WDRTAS_LOGBUFFER_LEN / 16), + wdrtas_logbuffer[i + 0], wdrtas_logbuffer[i + 1], + wdrtas_logbuffer[i + 2], wdrtas_logbuffer[i + 3], + wdrtas_logbuffer[i + 4], wdrtas_logbuffer[i + 5], + wdrtas_logbuffer[i + 6], wdrtas_logbuffer[i + 7], + wdrtas_logbuffer[i + 8], wdrtas_logbuffer[i + 9], + wdrtas_logbuffer[i + 10], wdrtas_logbuffer[i + 11], + wdrtas_logbuffer[i + 12], wdrtas_logbuffer[i + 13], + wdrtas_logbuffer[i + 14], wdrtas_logbuffer[i + 15]); +} + +/** + * wdrtas_timer_keepalive - resets watchdog timer to keep system alive + * + * wdrtas_timer_keepalive restarts the watchdog timer by calling the + * RTAS function event-scan and repeats these calls as long as there are + * events available. All events will be dumped. 
+ */
+static void
+wdrtas_timer_keepalive(void)
+{
+	long result;
+
+	do {
+		result = rtas_call(wdrtas_token_event_scan, 4, 1, NULL,
+				   RTAS_EVENT_SCAN_ALL_EVENTS, 0,
+				   (void *)__pa(wdrtas_logbuffer),
+				   WDRTAS_LOGBUFFER_LEN);
+		if (result < 0)
+			printk("wdrtas: event-scan failed: %li\n",result);
+		if (result == 0)
+			wdrtas_log_scanned_event();
+	} while (result == 0);
+}
+
+/**
+ * wdrtas_get_temperature - returns current temperature
+ *
+ * returns temperature or <0 on failures
+ *
+ * wdrtas_get_temperature returns the current temperature in Fahrenheit. It
+ * uses the RTAS call get-sensor-state, token 3 to do so
+ */
+static int
+wdrtas_get_temperature(void)
+{
+	long result;
+	int temperature = 0;
+
+	result = rtas_call(wdrtas_token_get_sensor_state, 2, 2,
+			   (void *)__pa(&temperature),
+			   WDRTAS_THERMAL_SENSOR, 0);
+
+	if (result < 0)
+		printk("wdrtas: reading the thermal sensor failed: %li\n",
+		       result);
+	else
+		temperature = ((temperature * 9) / 5) + 32; /* fahrenheit */
+
+	return temperature;
+}
+
+/**
+ * wdrtas_get_status - returns the status of the watchdog
+ *
+ * returns a bitmask of defines WDIOF_... as defined in
+ * include/linux/watchdog.h
+ */
+static int
+wdrtas_get_status(void)
+{
+	return 0; /* TODO */
+}
+
+/**
+ * wdrtas_get_boot_status - returns the reason for the last boot
+ *
+ * returns a bitmask of defines WDIOF_... as defined in
+ * include/linux/watchdog.h, indicating why the watchdog rebooted the system
+ */
+static int
+wdrtas_get_boot_status(void)
+{
+	return 0; /* TODO */
+}
+
+/*** watchdog API and operations stuff */
+
+/* wdrtas_write - called when watchdog device is written to
+ * @file: file structure
+ * @buf: user buffer with data
+ * @len: amount of data written
+ * @ppos: position in file
+ *
+ * returns the number of successfully processed characters, which is always
+ * the number of bytes passed to this function
+ *
+ * wdrtas_write processes all the data given to it and looks for the magic
+ * character 'V'.
This character allows the watchdog device to be closed + * properly. + */ +static ssize_t +wdrtas_write(struct file *file, const char __user *buf, + size_t len, loff_t *ppos) +{ + int i; + char c; + + if (!len) + goto out; + + if (!wdrtas_nowayout) { + wdrtas_expect_close = 0; + /* look for 'V' */ + for (i = 0; i < len; i++) { + if (get_user(c, buf + i)) + return -EFAULT; + /* allow to close device */ + if (c == 'V') + wdrtas_expect_close = WDRTAS_MAGIC_CHAR; + } + } + + wdrtas_timer_keepalive(); + +out: + return len; +} + +/** + * wdrtas_ioctl - ioctl function for the watchdog device + * @inode: inode structure + * @file: file structure + * @cmd: command for ioctl + * @arg: argument pointer + * + * returns 0 on success, <0 on failure + * + * wdrtas_ioctl implements the watchdog API ioctls + */ +static int +wdrtas_ioctl(struct inode *inode, struct file *file, + unsigned int cmd, unsigned long arg) +{ + int __user *argp = (void *)arg; + int i; + static struct watchdog_info wdinfo = { + .options = WDRTAS_SUPPORTED_MASK, + .firmware_version = 0, + .identity = "wdrtas" + }; + + switch (cmd) { + case WDIOC_GETSUPPORT: + if (copy_to_user(argp, &wdinfo, sizeof(wdinfo))) + return -EFAULT; + return 0; + + case WDIOC_GETSTATUS: + i = wdrtas_get_status(); + return put_user(i, argp); + + case WDIOC_GETBOOTSTATUS: + i = wdrtas_get_boot_status(); + return put_user(i, argp); + + case WDIOC_GETTEMP: + if (wdrtas_token_get_sensor_state == RTAS_UNKNOWN_SERVICE) + return -EOPNOTSUPP; + + i = wdrtas_get_temperature(); + return put_user(i, argp); + + case WDIOC_SETOPTIONS: + if (get_user(i, argp)) + return -EFAULT; + if (i & WDIOS_DISABLECARD) + wdrtas_timer_stop(); + if (i & WDIOS_ENABLECARD) { + wdrtas_timer_keepalive(); + wdrtas_timer_start(); + } + if (i & WDIOS_TEMPPANIC) { + /* not implemented. 
Done by H8 */ + } + return 0; + + case WDIOC_KEEPALIVE: + wdrtas_timer_keepalive(); + return 0; + + case WDIOC_SETTIMEOUT: + if (get_user(i, argp)) + return -EFAULT; + + if (wdrtas_set_interval(i)) + return -EINVAL; + + wdrtas_timer_keepalive(); + + if (wdrtas_token_get_sp == RTAS_UNKNOWN_SERVICE) + wdrtas_interval = i; + else + wdrtas_interval = wdrtas_get_interval(i); + /* fallthrough */ + + case WDIOC_GETTIMEOUT: + return put_user(wdrtas_interval, argp); + + default: + return -ENOIOCTLCMD; + } +} + +/** + * wdrtas_open - open function of watchdog device + * @inode: inode structure + * @file: file structure + * + * returns 0 on success, -EBUSY if the file has been opened already, <0 on + * other failures + * + * function called when watchdog device is opened + */ +static int +wdrtas_open(struct inode *inode, struct file *file) +{ + /* only open once */ + if (xchg(&wdrtas_miscdev_open,1)) + return -EBUSY; + + wdrtas_timer_start(); + wdrtas_timer_keepalive(); + + return nonseekable_open(inode, file); +} + +/** + * wdrtas_close - close function of watchdog device + * @inode: inode structure + * @file: file structure + * + * returns 0 on success + * + * close function. Always succeeds + */ +static int +wdrtas_close(struct inode *inode, struct file *file) +{ + /* only stop watchdog, if this was announced using 'V' before */ + if (wdrtas_expect_close == WDRTAS_MAGIC_CHAR) + wdrtas_timer_stop(); + else { + printk("wdrtas: got unexpected close. 
Watchdog " + "not stopped.\n"); + wdrtas_timer_keepalive(); + } + + wdrtas_expect_close = 0; + xchg(&wdrtas_miscdev_open,0); + return 0; +} + +/** + * wdrtas_temp_read - gives back the temperature in fahrenheit + * @file: file structure + * @buf: user buffer + * @count: number of bytes to be read + * @ppos: position in file + * + * returns always 1 or -EFAULT in case of user space copy failures, <0 on + * other failures + * + * wdrtas_temp_read gives the temperature to the users by copying this + * value as one byte into the user space buffer. The unit is Fahrenheit... + */ +static ssize_t +wdrtas_temp_read(struct file *file, char __user *buf, + size_t count, loff_t *ppos) +{ + int temperature = 0; + + temperature = wdrtas_get_temperature(); + if (temperature < 0) + return temperature; + + if (copy_to_user(buf, &temperature, 1)) + return -EFAULT; + + return 1; +} + +/** + * wdrtas_temp_open - open function of temperature device + * @inode: inode structure + * @file: file structure + * + * returns 0 on success, <0 on failure + * + * function called when temperature device is opened + */ +static int +wdrtas_temp_open(struct inode *inode, struct file *file) +{ + return nonseekable_open(inode, file); +} + +/** + * wdrtas_temp_close - close function of temperature device + * @inode: inode structure + * @file: file structure + * + * returns 0 on success + * + * close function. 
Always succeeds
+ */
+static int
+wdrtas_temp_close(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+/**
+ * wdrtas_reboot - reboot notifier function
+ * @this: notifier block structure
+ * @code: reboot code
+ * @ptr: unused
+ *
+ * returns NOTIFY_DONE
+ *
+ * wdrtas_reboot stops the watchdog in case of a reboot
+ */
+static int
+wdrtas_reboot(struct notifier_block *this, unsigned long code, void *ptr)
+{
+	if ( (code==SYS_DOWN) || (code==SYS_HALT) )
+		wdrtas_timer_stop();
+
+	return NOTIFY_DONE;
+}
+
+/*** initialization stuff */
+
+static struct file_operations wdrtas_fops = {
+	.owner		= THIS_MODULE,
+	.llseek		= no_llseek,
+	.write		= wdrtas_write,
+	.ioctl		= wdrtas_ioctl,
+	.open		= wdrtas_open,
+	.release	= wdrtas_close,
+};
+
+static struct miscdevice wdrtas_miscdev = {
+	.minor =	WATCHDOG_MINOR,
+	.name =		"watchdog",
+	.fops =		&wdrtas_fops,
+};
+
+static struct file_operations wdrtas_temp_fops = {
+	.owner		= THIS_MODULE,
+	.llseek		= no_llseek,
+	.read		= wdrtas_temp_read,
+	.open		= wdrtas_temp_open,
+	.release	= wdrtas_temp_close,
+};
+
+static struct miscdevice wdrtas_tempdev = {
+	.minor =	TEMP_MINOR,
+	.name =		"temperature",
+	.fops =		&wdrtas_temp_fops,
+};
+
+static struct notifier_block wdrtas_notifier = {
+	.notifier_call =	wdrtas_reboot,
+};
+
+/**
+ * wdrtas_get_tokens - reads in RTAS tokens
+ *
+ * returns 0 on success, <0 on failure
+ *
+ * wdrtas_get_tokens reads in the tokens for the RTAS calls used in
+ * this watchdog driver. It tolerates "get-sensor-state" and
+ * "ibm,get-system-parameter" not being available.
+ */
+static int
+wdrtas_get_tokens(void)
+{
+	wdrtas_token_get_sensor_state = rtas_token("get-sensor-state");
+	if (wdrtas_token_get_sensor_state == RTAS_UNKNOWN_SERVICE) {
+		printk("wdrtas: couldn't get token for get-sensor-state. 
" + "Trying to continue without temperature support.\n"); + } + + wdrtas_token_get_sp = rtas_token("ibm,get-system-parameter"); + if (wdrtas_token_get_sp == RTAS_UNKNOWN_SERVICE) { + printk("wdrtas: couldn't get token for " + "ibm,get-system-parameter. Trying to continue with " + "a default timeout value of %i seconds.\n", + WDRTAS_DEFAULT_INTERVAL); + } + + wdrtas_token_set_indicator = rtas_token("set-indicator"); + if (wdrtas_token_set_indicator == RTAS_UNKNOWN_SERVICE) { + printk("wdrtas: couldn't get token for set-indicator. " + "Terminating watchdog code.\n"); + return -EIO; + } + + wdrtas_token_event_scan = rtas_token("event-scan"); + if (wdrtas_token_event_scan == RTAS_UNKNOWN_SERVICE) { + printk("wdrtas: couldn't get token for event-scan. " + "Terminating watchdog code.\n"); + return -EIO; + } + + return 0; +} + +/** + * wdrtas_unregister_devs - unregisters the misc dev handlers + * + * wdrtas_register_devs unregisters the watchdog and temperature watchdog + * misc devs + */ +static void +wdrtas_unregister_devs(void) +{ + misc_deregister(&wdrtas_miscdev); + if (wdrtas_token_get_sensor_state != RTAS_UNKNOWN_SERVICE) + misc_deregister(&wdrtas_tempdev); +} + +/** + * wdrtas_register_devs - registers the misc dev handlers + * + * returns 0 on succes, <0 on failure + * + * wdrtas_register_devs registers the watchdog and temperature watchdog + * misc devs + */ +static int +wdrtas_register_devs(void) +{ + int result; + + result = misc_register(&wdrtas_miscdev); + if (result) { + printk("wdrtas: couldn't register watchdog misc device. " + "Terminating watchdog code.\n"); + return result; + } + + if (wdrtas_token_get_sensor_state != RTAS_UNKNOWN_SERVICE) { + result = misc_register(&wdrtas_tempdev); + if (result) { + printk("wdrtas: couldn't register watchdog " + "temperature misc device. 
Continuing without "
+				"temperature support.\n");
+			wdrtas_token_get_sensor_state = RTAS_UNKNOWN_SERVICE;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * wdrtas_init - init function of the watchdog driver
+ *
+ * returns 0 on success, <0 on failure
+ *
+ * registers the file handlers and the reboot notifier
+ */
+static int __init
+wdrtas_init(void)
+{
+	if (wdrtas_get_tokens())
+		return -ENODEV;
+
+	if (wdrtas_register_devs())
+		return -ENODEV;
+
+	if (register_reboot_notifier(&wdrtas_notifier)) {
+		printk("wdrtas: could not register reboot notifier. "
+		       "Terminating watchdog code.\n");
+		wdrtas_unregister_devs();
+		return -ENODEV;
+	}
+
+	if (wdrtas_token_get_sp == RTAS_UNKNOWN_SERVICE)
+		wdrtas_interval = WDRTAS_DEFAULT_INTERVAL;
+	else
+		wdrtas_interval = wdrtas_get_interval(WDRTAS_DEFAULT_INTERVAL);
+
+	return 0;
+}
+
+/**
+ * wdrtas_exit - exit function of the watchdog driver
+ *
+ * unregisters the file handlers and the reboot notifier
+ */
+static void __exit
+wdrtas_exit(void)
+{
+	if (!wdrtas_nowayout)
+		wdrtas_timer_stop();
+
+	wdrtas_unregister_devs();
+
+	unregister_reboot_notifier(&wdrtas_notifier);
+}
+
+module_init(wdrtas_init);
+module_exit(wdrtas_exit);

From arnd at arndb.de Sat May 14 05:26:20 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Fri, 13 May 2005 21:26:20 +0200
Subject: [PATCH 5/8] ppc64: Add driver for BPA interrupt controllers
In-Reply-To: <200505132117.37461.arnd@arndb.de>
References: <200505132117.37461.arnd@arndb.de>
Message-ID: <200505132126.21571.arnd@arndb.de>

Add support for the integrated interrupt controller on BPA CPUs.
There is one of those for each SMT thread. The mapping of interrupt
numbers to HW interrupt sources is described in
arch/ppc64/kernel/bpa_iic.h. This version hardcodes the 'Spider' chip
as the secondary interrupt controller. That is not really generic for
the architecture, but at the moment it is the only secondary PIC that
exists.
A little more work will be needed on this as soon as we have boards with multiple external interrupt controllers. Signed-off-by: Arnd Bergmann Index: linus-2.5/arch/ppc64/Kconfig =================================================================== --- linus-2.5.orig/arch/ppc64/Kconfig 2005-04-22 06:59:52.000000000 +0200 +++ linus-2.5/arch/ppc64/Kconfig 2005-04-22 06:59:58.000000000 +0200 @@ -106,6 +106,21 @@ bool default y +config XICS + depends on PPC_PSERIES + bool + default y + +config MPIC + depends on PPC_PSERIES || PPC_PMAC || PPC_MAPLE + bool + default y + +config BPA_IIC + depends on PPC_BPA + bool + default y + # VMX is pSeries only for now until somebody writes the iSeries # exception vectors for it config ALTIVEC Index: linus-2.5/arch/ppc64/kernel/Makefile =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/Makefile 2005-04-22 06:59:52.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/Makefile 2005-04-22 07:01:07.000000000 +0200 @@ -28,13 +28,13 @@ mf.o HvLpEvent.o iSeries_proc.o iSeries_htab.o \ iSeries_iommu.o -obj-$(CONFIG_PPC_MULTIPLATFORM) += nvram.o i8259.o prom_init.o prom.o mpic.o +obj-$(CONFIG_PPC_MULTIPLATFORM) += nvram.o i8259.o prom_init.o prom.o obj-$(CONFIG_PPC_PSERIES) += pSeries_pci.o pSeries_lpar.o pSeries_hvCall.o \ pSeries_nvram.o rtasd.o ras.o pSeries_reconfig.o \ - xics.o pSeries_setup.o pSeries_iommu.o + pSeries_setup.o pSeries_iommu.o -obj-$(CONFIG_PPC_BPA) += bpa_setup.o bpa_nvram.o +obj-$(CONFIG_PPC_BPA) += bpa_setup.o bpa_nvram.o bpa_iic.o spider-pic.o obj-$(CONFIG_EEH) += eeh.o obj-$(CONFIG_PROC_FS) += proc_ppc64.o @@ -50,6 +50,8 @@ obj-$(CONFIG_BOOTX_TEXT) += btext.o obj-$(CONFIG_HVCS) += hvcserver.o obj-$(CONFIG_IBMVIO) += vio.o +obj-$(CONFIG_XICS) += xics.o +obj-$(CONFIG_MPIC) += mpic.o obj-$(CONFIG_PPC_PMAC) += pmac_setup.o pmac_feature.o pmac_pci.o \ pmac_time.o pmac_nvram.o pmac_low_i2c.o Index: linus-2.5/arch/ppc64/kernel/bpa_iic.c 
=================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linus-2.5/arch/ppc64/kernel/bpa_iic.c 2005-04-22 06:59:58.000000000 +0200 @@ -0,0 +1,270 @@ +/* + * BPA Internal Interrupt Controller + * + * (C) Copyright IBM Deutschland Entwicklung GmbH 2005 + * + * Author: Arnd Bergmann + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ */ + +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +#include "bpa_iic.h" + +struct iic_pending_bits { + u32 data; + u8 flags; + u8 class; + u8 source; + u8 prio; +}; + +enum iic_pending_flags { + IIC_VALID = 0x80, + IIC_IPI = 0x40, +}; + +struct iic_regs { + struct iic_pending_bits pending; + struct iic_pending_bits pending_destr; + u64 generate; + u64 prio; +}; + +struct iic { + struct iic_regs __iomem *regs; +}; + +static DEFINE_PER_CPU(struct iic, iic); + +void iic_local_enable(void) +{ + out_be64(&__get_cpu_var(iic).regs->prio, 0xff); +} + +void iic_local_disable(void) +{ + out_be64(&__get_cpu_var(iic).regs->prio, 0x0); +} + +static unsigned int iic_startup(unsigned int irq) +{ + return 0; +} + +static void iic_enable(unsigned int irq) +{ + iic_local_enable(); +} + +static void iic_disable(unsigned int irq) +{ +} + +static void iic_end(unsigned int irq) +{ + iic_local_enable(); +} + +static struct hw_interrupt_type iic_pic = { + .typename = " BPA-IIC ", + .startup = iic_startup, + .enable = iic_enable, + .disable = iic_disable, + .end = iic_end, +}; + +static int iic_external_get_irq(struct iic_pending_bits pending) +{ + int irq; + unsigned char node, unit; + + node = pending.source >> 4; + unit = pending.source & 0xf; + irq = -1; + + /* + * This mapping is specific to the Broadband + * Engine. We might need to get the numbers + * from the device tree to support future CPUs. + */ + switch (unit) { + case 0x00: + case 0x0b: + /* + * One of these units can be connected + * to an external interrupt controller. + */ + if (pending.prio > 0x3f || + pending.class != 2) + break; + irq = IIC_EXT_OFFSET + + spider_get_irq(pending.prio + node * IIC_NODE_STRIDE) + + node * IIC_NODE_STRIDE; + break; + case 0x01 ... 0x04: + case 0x07 ... 
0x0a: + /* + * These units are connected to the SPEs + */ + if (pending.class > 2) + break; + irq = IIC_SPE_OFFSET + + pending.class * IIC_CLASS_STRIDE + + node * IIC_NODE_STRIDE + + unit; + break; + } + if (irq == -1) + printk(KERN_WARNING "Unexpected interrupt class %02x, " + "source %02x, prio %02x, cpu %02x\n", pending.class, + pending.source, pending.prio, smp_processor_id()); + return irq; +} + +/* Get an IRQ number from the pending state register of the IIC */ +int iic_get_irq(struct pt_regs *regs) +{ + struct iic *iic; + int irq; + struct iic_pending_bits pending; + + iic = &__get_cpu_var(iic); + *(unsigned long *) &pending = + in_be64((unsigned long __iomem *) &iic->regs->pending_destr); + + irq = -1; + if (pending.flags & IIC_VALID) { + if (pending.flags & IIC_IPI) { + irq = IIC_IPI_OFFSET + (pending.prio >> 4); +/* + if (irq > 0x80) + printk(KERN_WARNING "Unexpected IPI prio %02x" + "on CPU %02x\n", pending.prio, + smp_processor_id()); +*/ + } else { + irq = iic_external_get_irq(pending); + } + } + return irq; +} + +static struct iic_regs __iomem *find_iic(int cpu) +{ + struct device_node *np; + int nodeid = cpu / 2; + unsigned long regs; + struct iic_regs __iomem *iic_regs; + + for (np = of_find_node_by_type(NULL, "cpu"); + np; + np = of_find_node_by_type(np, "cpu")) { + if (nodeid == *(int *)get_property(np, "node-id", NULL)) + break; + } + + if (!np) { + printk(KERN_WARNING "IIC: CPU %d not found\n", cpu); + iic_regs = NULL; + } else { + regs = *(long *)get_property(np, "iic", NULL); + + /* hack until we have decided on the devtree info */ + regs += 0x400; + if (cpu & 1) + regs += 0x20; + + printk(KERN_DEBUG "IIC for CPU %d at %lx\n", cpu, regs); + iic_regs = __ioremap(regs, sizeof(struct iic_regs), + _PAGE_NO_CACHE); + } + return iic_regs; +} + +#ifdef CONFIG_SMP +void iic_setup_cpu(void) +{ + out_be64(&__get_cpu_var(iic).regs->prio, 0xff); +} + +void iic_cause_IPI(int cpu, int mesg) +{ + out_be64(&per_cpu(iic, cpu).regs->generate, mesg); +} + 
+static irqreturn_t iic_ipi_action(int irq, void *dev_id, struct pt_regs *regs) +{ + + smp_message_recv(irq - IIC_IPI_OFFSET, regs); + return IRQ_HANDLED; +} + +static void iic_request_ipi(int irq, const char *name) +{ + /* IPIs are marked SA_INTERRUPT as they must run with irqs + * disabled */ + get_irq_desc(irq)->handler = &iic_pic; + get_irq_desc(irq)->status |= IRQ_PER_CPU; + request_irq(irq, iic_ipi_action, SA_INTERRUPT, name, NULL); +} + +void iic_request_IPIs(void) +{ + iic_request_ipi(IIC_IPI_OFFSET + PPC_MSG_CALL_FUNCTION, "IPI-call"); + iic_request_ipi(IIC_IPI_OFFSET + PPC_MSG_RESCHEDULE, "IPI-resched"); +#ifdef CONFIG_DEBUGGER + iic_request_ipi(IIC_IPI_OFFSET + PPC_MSG_DEBUGGER_BREAK, "IPI-debug"); +#endif /* CONFIG_DEBUGGER */ +} +#endif /* CONFIG_SMP */ + +static void iic_setup_spe_handlers(void) +{ + int be, isrc; + + /* Assume two threads per BE are present */ + for (be=0; be < num_present_cpus() / 2; be++) { + for (isrc = 0; isrc < IIC_CLASS_STRIDE * 3; isrc++) { + int irq = IIC_NODE_STRIDE * be + IIC_SPE_OFFSET + isrc; + get_irq_desc(irq)->handler = &iic_pic; + } + } +} + +void iic_init_IRQ(void) +{ + int cpu, irq_offset; + struct iic *iic; + + irq_offset = 0; + for_each_cpu(cpu) { + iic = &per_cpu(iic, cpu); + iic->regs = find_iic(cpu); + if (iic->regs) + out_be64(&iic->regs->prio, 0xff); + } + iic_setup_spe_handlers(); +} Index: linus-2.5/arch/ppc64/kernel/bpa_iic.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linus-2.5/arch/ppc64/kernel/bpa_iic.h 2005-04-22 06:59:58.000000000 +0200 @@ -0,0 +1,62 @@ +#ifndef ASM_BPA_IIC_H +#define ASM_BPA_IIC_H +#ifdef __KERNEL__ +/* + * Mapping of IIC pending bits into per-node + * interrupt numbers. 
+ *
+ * IRQ    FF CC SS PP   FF CC SS PP  Description
+ *
+ * 00-3f  80 02 +0 00 - 80 02 +0 3f  South Bridge
+ * 00-3f  80 02 +b 00 - 80 02 +b 3f  South Bridge
+ * 41-4a  80 00 +1 ** - 80 00 +a **  SPU Class 0
+ * 51-5a  80 01 +1 ** - 80 01 +a **  SPU Class 1
+ * 61-6a  80 02 +1 ** - 80 02 +a **  SPU Class 2
+ * 70-7f  C0 ** ** 00 - C0 ** ** 0f  IPI
+ *
+ *  F flags
+ *  C class
+ *  S source
+ *  P Priority
+ *  + node number
+ *  * don't care
+ *
+ * A node consists of a Broadband Engine and an optional
+ * south bridge device providing a maximum of 64 IRQs.
+ * The south bridge may be connected to either IOIF0
+ * or IOIF1.
+ * Each SPE is represented as three IRQ lines, one per
+ * interrupt class.
+ * 16 IRQ numbers are reserved for inter processor
+ * interruptions, although these are only used in the
+ * range of the first node.
+ *
+ * This scheme needs 128 IRQ numbers per BIF node ID,
+ * which means that with the total of 512 lines
+ * available, we can have a maximum of four nodes.
+ */
+
+enum {
+	IIC_EXT_OFFSET   = 0x00, /* Start of south bridge IRQs */
+	IIC_NUM_EXT      = 0x40, /* Number of south bridge IRQs */
+	IIC_SPE_OFFSET   = 0x40, /* Start of SPE interrupts */
+	IIC_CLASS_STRIDE = 0x10, /* SPE IRQs per class */
+	IIC_IPI_OFFSET   = 0x70, /* Start of IPI IRQs */
+	IIC_NUM_IPIS     = 0x10, /* IRQs reserved for IPI */
+	IIC_NODE_STRIDE  = 0x80, /* Total IRQs per node */
+};
+
+extern void iic_init_IRQ(void);
+extern int  iic_get_irq(struct pt_regs *regs);
+extern void iic_cause_IPI(int cpu, int mesg);
+extern void iic_request_IPIs(void);
+extern void iic_setup_cpu(void);
+extern void iic_local_enable(void);
+extern void iic_local_disable(void);
+
+
+extern void spider_init_IRQ(void);
+extern int spider_get_irq(unsigned long int_pending);
+
+#endif
+#endif /* ASM_BPA_IIC_H */
Index: linus-2.5/arch/ppc64/kernel/bpa_setup.c
===================================================================
--- linus-2.5.orig/arch/ppc64/kernel/bpa_setup.c 2005-04-22 06:59:52.000000000 +0200
+++
linus-2.5/arch/ppc64/kernel/bpa_setup.c 2005-04-22 06:59:58.000000000 +0200 @@ -45,6 +45,7 @@ #include #include "pci.h" +#include "bpa_iic.h" #ifdef DEBUG #define DBG(fmt...) udbg_printf(fmt) @@ -143,6 +144,9 @@ static void __init bpa_setup_arch(void) { + ppc_md.init_IRQ = iic_init_IRQ; + ppc_md.get_irq = iic_get_irq; + #ifdef CONFIG_SMP smp_init_pSeries(); #endif @@ -158,7 +162,7 @@ /* Find and initialize PCI host bridges */ init_pci_config_tokens(); find_and_init_phbs(); - + spider_init_IRQ(); #ifdef CONFIG_DUMMY_CONSOLE conswitchp = &dummy_con; #endif Index: linus-2.5/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/pSeries_smp.c 2005-04-22 06:58:22.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/pSeries_smp.c 2005-04-22 06:59:58.000000000 +0200 @@ -1,5 +1,5 @@ /* - * SMP support for pSeries machines. + * SMP support for pSeries and BPA machines. * * Dave Engebretsen, Peter Bergner, and * Mike Corrigan {engebret|bergner|mikec}@us.ibm.com @@ -47,6 +47,7 @@ #include #include "mpic.h" +#include "bpa_iic.h" #ifdef DEBUG #define DBG(fmt...) 
udbg_printf(fmt) @@ -286,6 +287,7 @@ return 1; } +#ifdef CONFIG_XICS static inline void smp_xics_do_message(int cpu, int msg) { set_bit(msg, &xics_ipi_message[cpu].value); @@ -334,6 +336,37 @@ rtas_set_indicator(GLOBAL_INTERRUPT_QUEUE, (1UL << interrupt_server_size) - 1 - default_distrib_server, 1); } +#endif /* CONFIG_XICS */ +#ifdef CONFIG_BPA_IIC +static void smp_iic_message_pass(int target, int msg) +{ + unsigned int i; + + if (target < NR_CPUS) { + iic_cause_IPI(target, msg); + } else { + for_each_online_cpu(i) { + if (target == MSG_ALL_BUT_SELF + && i == smp_processor_id()) + continue; + iic_cause_IPI(i, msg); + } + } +} + +static int __init smp_iic_probe(void) +{ + iic_request_IPIs(); + + return cpus_weight(cpu_possible_map); +} + +static void __devinit smp_iic_setup_cpu(int cpu) +{ + if (cpu != boot_cpuid) + iic_setup_cpu(); +} +#endif /* CONFIG_BPA_IIC */ static DEFINE_SPINLOCK(timebase_lock); static unsigned long timebase = 0; @@ -388,14 +421,15 @@ return 1; } - +#ifdef CONFIG_MPIC static struct smp_ops_t pSeries_mpic_smp_ops = { .message_pass = smp_mpic_message_pass, .probe = smp_mpic_probe, .kick_cpu = smp_pSeries_kick_cpu, .setup_cpu = smp_mpic_setup_cpu, }; - +#endif +#ifdef CONFIG_XICS static struct smp_ops_t pSeries_xics_smp_ops = { .message_pass = smp_xics_message_pass, .probe = smp_xics_probe, @@ -403,6 +437,16 @@ .setup_cpu = smp_xics_setup_cpu, .cpu_bootable = smp_pSeries_cpu_bootable, }; +#endif +#ifdef CONFIG_BPA_IIC +static struct smp_ops_t bpa_iic_smp_ops = { + .message_pass = smp_iic_message_pass, + .probe = smp_iic_probe, + .kick_cpu = smp_pSeries_kick_cpu, + .setup_cpu = smp_iic_setup_cpu, + .cpu_bootable = smp_pSeries_cpu_bootable, +}; +#endif /* This is called very early */ void __init smp_init_pSeries(void) @@ -411,10 +455,25 @@ DBG(" -> smp_init_pSeries()\n"); - if (ppc64_interrupt_controller == IC_OPEN_PIC) + switch (ppc64_interrupt_controller) { +#ifdef CONFIG_MPIC + case IC_OPEN_PIC: smp_ops = &pSeries_mpic_smp_ops; - else + break; 
+#endif +#ifdef CONFIG_XICS + case IC_PPC_XIC: smp_ops = &pSeries_xics_smp_ops; + break; +#endif +#ifdef CONFIG_BPA_IIC + case IC_BPA_IIC: + smp_ops = &bpa_iic_smp_ops; + break; +#endif + default: + panic("Invalid interrupt controller"); + } #ifdef CONFIG_HOTPLUG_CPU smp_ops->cpu_disable = pSeries_cpu_disable; Index: linus-2.5/arch/ppc64/kernel/smp.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/smp.c 2005-04-22 06:58:22.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/smp.c 2005-04-22 06:59:58.000000000 +0200 @@ -71,7 +71,7 @@ int smt_enabled_at_boot = 1; -#ifdef CONFIG_PPC_MULTIPLATFORM +#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_PMAC) || defined(CONFIG_PPC_MAPLE) void smp_mpic_message_pass(int target, int msg) { /* make sure we're sending something that translates to an IPI */ Index: linus-2.5/arch/ppc64/kernel/spider-pic.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linus-2.5/arch/ppc64/kernel/spider-pic.c 2005-04-22 06:59:58.000000000 +0200 @@ -0,0 +1,191 @@ +/* + * External Interrupt Controller on Spider South Bridge + * + * (C) Copyright IBM Deutschland Entwicklung GmbH 2005 + * + * Author: Arnd Bergmann + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ */ + +#include +#include + +#include +#include +#include + +#include "bpa_iic.h" + +/* register layout taken from Spider spec, table 7.4-4 */ +enum { + TIR_DEN = 0x004, /* Detection Enable Register */ + TIR_MSK = 0x084, /* Mask Level Register */ + TIR_EDC = 0x0c0, /* Edge Detection Clear Register */ + TIR_PNDA = 0x100, /* Pending Register A */ + TIR_PNDB = 0x104, /* Pending Register B */ + TIR_CS = 0x144, /* Current Status Register */ + TIR_LCSA = 0x150, /* Level Current Status Register A */ + TIR_LCSB = 0x154, /* Level Current Status Register B */ + TIR_LCSC = 0x158, /* Level Current Status Register C */ + TIR_LCSD = 0x15c, /* Level Current Status Register D */ + TIR_CFGA = 0x200, /* Setting Register A0 */ + TIR_CFGB = 0x204, /* Setting Register B0 */ + /* 0x208 ... 0x3ff Setting Register An/Bn */ + TIR_PPNDA = 0x400, /* Packet Pending Register A */ + TIR_PPNDB = 0x404, /* Packet Pending Register B */ + TIR_PIERA = 0x408, /* Packet Output Error Register A */ + TIR_PIERB = 0x40c, /* Packet Output Error Register B */ + TIR_PIEN = 0x444, /* Packet Output Enable Register */ + TIR_PIPND = 0x454, /* Packet Output Pending Register */ + TIRDID = 0x484, /* Spider Device ID Register */ + REISTIM = 0x500, /* Reissue Command Timeout Time Setting */ + REISTIMEN = 0x504, /* Reissue Command Timeout Setting */ + REISWAITEN = 0x508, /* Reissue Wait Control*/ +}; + +static void __iomem *spider_pics[4]; + +static void __iomem *spider_get_pic(int irq) +{ + int node = irq / IIC_NODE_STRIDE; + irq %= IIC_NODE_STRIDE; + + if (irq >= IIC_EXT_OFFSET && + irq < IIC_EXT_OFFSET + IIC_NUM_EXT && + spider_pics) + return spider_pics[node]; + return NULL; +} + +static int spider_get_nr(unsigned int irq) +{ + return (irq % IIC_NODE_STRIDE) - IIC_EXT_OFFSET; +} + +static void __iomem *spider_get_irq_config(int irq) +{ + void __iomem *pic; + pic = spider_get_pic(irq); + return pic + TIR_CFGA + 8 * spider_get_nr(irq); +} + +static void spider_enable_irq(unsigned int irq) +{ + void __iomem *cfg = 
spider_get_irq_config(irq); + irq = spider_get_nr(irq); + + out_be32(cfg, in_be32(cfg) | 0x3107000eu); + out_be32(cfg + 4, in_be32(cfg + 4) | 0x00020000u | irq); +} + +static void spider_disable_irq(unsigned int irq) +{ + void __iomem *cfg = spider_get_irq_config(irq); + irq = spider_get_nr(irq); + + out_be32(cfg, in_be32(cfg) & ~0x30000000u); +} + +static unsigned int spider_startup_irq(unsigned int irq) +{ + spider_enable_irq(irq); + return 0; +} + +static void spider_shutdown_irq(unsigned int irq) +{ + spider_disable_irq(irq); +} + +static void spider_end_irq(unsigned int irq) +{ + spider_enable_irq(irq); +} + +static void spider_ack_irq(unsigned int irq) +{ + spider_disable_irq(irq); + iic_local_enable(); +} + +static struct hw_interrupt_type spider_pic = { + .typename = " SPIDER ", + .startup = spider_startup_irq, + .shutdown = spider_shutdown_irq, + .enable = spider_enable_irq, + .disable = spider_disable_irq, + .ack = spider_ack_irq, + .end = spider_end_irq, +}; + + +int spider_get_irq(unsigned long int_pending) +{ + void __iomem *regs = spider_get_pic(int_pending); + unsigned long cs; + int irq; + + cs = in_be32(regs + TIR_CS); + + irq = cs >> 24; + if (irq != 63) + return irq; + + return -1; +} + +void spider_init_IRQ(void) +{ + int node; + struct device_node *dn; + unsigned int *property; + long spiderpic; + int n; + +/* FIXME: detect multiple PICs as soon as the device tree has them */ + for (node = 0; node < 1; node++) { + dn = of_find_node_by_path("/"); + n = prom_n_addr_cells(dn); + property = (unsigned int *) get_property(dn, + "platform-spider-pic", NULL); + + if (!property) + continue; + for (spiderpic = 0; n > 0; --n) + spiderpic = (spiderpic << 32) + *property++; + printk(KERN_DEBUG "SPIDER addr: %lx\n", spiderpic); + spider_pics[node] = __ioremap(spiderpic, 0x800, _PAGE_NO_CACHE); + for (n = 0; n < IIC_NUM_EXT; n++) { + int irq = n + IIC_EXT_OFFSET + node * IIC_NODE_STRIDE; + get_irq_desc(irq)->handler = &spider_pic; + + /* do not mask any 
interrupts because of level */ + out_be32(spider_pics[node] + TIR_MSK, 0x0); + + /* disable edge detection clear */ + /* out_be32(spider_pics[node] + TIR_EDC, 0x0); */ + + /* enable interrupt packets to be output */ + out_be32(spider_pics[node] + TIR_PIEN, + in_be32(spider_pics[node] + TIR_PIEN) | 0x1); + + /* Enable the interrupt detection enable bit. Do this last! */ + out_be32(spider_pics[node] + TIR_DEN, + in_be32(spider_pics[node] +TIR_DEN) | 0x1); + + } + } +}

From arnd at arndb.de Sat May 14 05:27:49 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 13 May 2005 21:27:49 +0200 Subject: [PATCH 6/8] ppc64: Add driver for BPA iommu In-Reply-To: <200505132117.37461.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> Message-ID: <200505132127.51070.arnd@arndb.de>

Implementation of software load support for the BE iommu. This is very different from other iommu code on ppc64, since we only do a static mapping. The mapping is currently hardcoded; it should really be read from the firmware, but the firmware does not set up the device nodes yet. There is a single 512MB DMA window for PCI, USB and ethernet at 0x20000000 for our RAM.

The Cell processor can put the I/O page table either in memory like the hashed page table (hardware load) or have the operating system write the entries into memory-mapped CPU registers (software load). I use the software load mechanism because I know that all I/O page table entries for the amount of installed physical memory fit into the IO TLB cache. At the point when we get machines with more than 4GB of installed memory, we can either use hardware I/O page table access like the other platforms do or dynamically update the I/O TLB entries when a page fault occurs in the I/O subsystem. The software load can then use the macros that I have implemented for the static mapping in order to do the TLB cache updates.
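The linear part of the static mapping described above can be sketched in plain C. This is a minimal user-space sketch under stated assumptions, not the kernel code: the sketch_* names are hypothetical, the masks are copied from bpa_iommu.h in this patch, and the ioid and the 0x20000000 offset are the values hardcoded (and marked FIXME) in bpa_iommu.c. The Spider I/O special case in map_iopt_entry() is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Masks copied from bpa_iommu.h in this patch. */
#define IOPT_PROT_MASK  0xc000000000000000ull
#define IOPT_PROT_RW    0xc000000000000000ull
#define IOPT_COHERENT   0x2000000000000000ull
#define IOPT_ORDER_VC   0x1800000000000000ull
#define IOPT_RPN_MASK   0x000003fffffff000ull
#define IOPT_IOID_MASK  0x00000000000007ffull

/* Hypothetical names for the values hardcoded in bpa_iommu.c. */
#define SKETCH_IOID         0x48aull
#define SKETCH_PHYS_OFFSET  0x20000000ull  /* start of the DMA window */

/* Same bit layout as get_iopt_entry() in the patch:
 * protection, coherency and ordering bits in the top byte,
 * real page number in the middle, I/O identifier at the bottom. */
static inline uint64_t sketch_iopt_entry(uint64_t real_address,
					 uint64_t ioid, uint64_t prot)
{
	return (prot & IOPT_PROT_MASK)
		| IOPT_COHERENT
		| IOPT_ORDER_VC
		| (real_address & IOPT_RPN_MASK)
		| (ioid & IOPT_IOID_MASK);
}

/* Linear case of map_iopt_entry(): an I/O address inside the window
 * maps to the real address obtained by subtracting the window start. */
static inline uint64_t sketch_map_linear(uint64_t io_address)
{
	return sketch_iopt_entry(io_address - SKETCH_PHYS_OFFSET,
				 SKETCH_IOID, IOPT_PROT_RW);
}

/* Usage example: the first two 16MB pages of the window. */
static inline void sketch_self_check(void)
{
	assert(sketch_map_linear(0x20000000ull) == 0xf80000000000048aull);
	assert(sketch_map_linear(0x21000000ull) == 0xf80000000100048aull);
}
```

Because every entry is a pure function of the I/O address, all entries can be computed once at boot and written into the IOPT cache, which is what the static mapping relies on.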
Signed-off-by: Arnd Bergmann Index: linus-2.5/arch/ppc64/kernel/Makefile =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/Makefile 2005-04-22 07:01:07.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/Makefile 2005-04-29 10:01:44.000000000 +0200 @@ -34,7 +34,8 @@ pSeries_nvram.o rtasd.o ras.o pSeries_reconfig.o \ pSeries_setup.o pSeries_iommu.o -obj-$(CONFIG_PPC_BPA) += bpa_setup.o bpa_nvram.o bpa_iic.o spider-pic.o +obj-$(CONFIG_PPC_BPA) += bpa_setup.o bpa_iommu.o bpa_nvram.o \ + bpa_iic.o spider-pic.o obj-$(CONFIG_EEH) += eeh.o obj-$(CONFIG_PROC_FS) += proc_ppc64.o Index: linus-2.5/arch/ppc64/kernel/bpa_iommu.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linus-2.5/arch/ppc64/kernel/bpa_iommu.c 2005-04-29 10:24:03.000000000 +0200 @@ -0,0 +1,377 @@ +/* + * IOMMU implementation for Broadband Processor Architecture + * We just establish a linear mapping at boot by setting all the + * IOPT cache entries in the CPU. + * The mapping functions should be identical to pci_direct_iommu, + * except for the handling of the high order bit that is required + * by the Spider bridge. These should be split into a separate + * file at the point where we get a different bridge chip. + * + * Copyright (C) 2005 IBM Deutschland Entwicklung GmbH, + * Arnd Bergmann + * + * Based on linear mapping + * Copyright (C) 2003 Benjamin Herrenschmidt (benh at kernel.crashing.org) + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ */ + +#undef DEBUG + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pci.h" +#include "bpa_iommu.h" + +static inline unsigned long +get_iopt_entry(unsigned long real_address, unsigned long ioid, + unsigned long prot) +{ + return (prot & IOPT_PROT_MASK) + | (IOPT_COHERENT) + | (IOPT_ORDER_VC) + | (real_address & IOPT_RPN_MASK) + | (ioid & IOPT_IOID_MASK); +} + +typedef struct { + unsigned long val; +} ioste; + +static inline ioste +mk_ioste(unsigned long val) +{ + ioste ioste = { .val = val, }; + return ioste; +} + +static inline ioste +get_iost_entry(unsigned long iopt_base, unsigned long io_address, unsigned page_size) +{ + unsigned long ps; + unsigned long iostep; + unsigned long nnpt; + unsigned long shift; + + switch (page_size) { + case 0x1000000: + ps = IOST_PS_16M; + nnpt = 0; /* one page per segment */ + shift = 5; /* segment has 16 iopt entries */ + break; + + case 0x100000: + ps = IOST_PS_1M; + nnpt = 0; /* one page per segment */ + shift = 1; /* segment has 256 iopt entries */ + break; + + case 0x10000: + ps = IOST_PS_64K; + nnpt = 0x07; /* 8 pages per io page table */ + shift = 0; /* all entries are used */ + break; + + case 0x1000: + ps = IOST_PS_4K; + nnpt = 0x7f; /* 128 pages per io page table */ + shift = 0; /* all entries are used */ + break; + + default: /* not a known compile time constant */ + BUILD_BUG_ON(1); + break; + } + + iostep = iopt_base + + /* need 8 bytes per iopte */ + (((io_address / page_size * 8) + /* align io page tables on 4k page boundaries */ + << shift) + /* nnpt+1 pages go into each iopt */ + & ~(nnpt << 12)); + + nnpt++; /* this seems to work, but the documentation is not clear + about wether we put nnpt or nnpt-1 into the ioste bits. + In theory, this can't work for 4k pages. 
*/ + return mk_ioste(IOST_VALID_MASK + | (iostep & IOST_PT_BASE_MASK) + | ((nnpt << 5) & IOST_NNPT_MASK) + | (ps & IOST_PS_MASK)); +} + +/* compute the address of an io pte */ +static inline unsigned long +get_ioptep(ioste iost_entry, unsigned long io_address) +{ + unsigned long iopt_base; + unsigned long page_size; + unsigned long page_number; + unsigned long iopt_offset; + + iopt_base = iost_entry.val & IOST_PT_BASE_MASK; + page_size = iost_entry.val & IOST_PS_MASK; + + /* decode page size to compute page number */ + page_number = (io_address & 0x0fffffff) >> (10 + 2 * page_size); + /* page number is an offset into the io page table */ + iopt_offset = (page_number << 3) & 0x7fff8ul; + return iopt_base + iopt_offset; +} + +/* compute the tag field of the iopt cache entry */ +static inline unsigned long +get_ioc_tag(ioste iost_entry, unsigned long io_address) +{ + unsigned long iopte = get_ioptep(iost_entry, io_address); + + return IOPT_VALID_MASK + | ((iopte & 0x00000000000000ff8ul) >> 3) + | ((iopte & 0x0000003fffffc0000ul) >> 9); +} + +/* compute the hashed 6 bit index for the 4-way associative pte cache */ +static inline unsigned long +get_ioc_hash(ioste iost_entry, unsigned long io_address) +{ + unsigned long iopte = get_ioptep(iost_entry, io_address); + + return ((iopte & 0x000000000000001f8ul) >> 3) + ^ ((iopte & 0x00000000000020000ul) >> 17) + ^ ((iopte & 0x00000000000010000ul) >> 15) + ^ ((iopte & 0x00000000000008000ul) >> 13) + ^ ((iopte & 0x00000000000004000ul) >> 11) + ^ ((iopte & 0x00000000000002000ul) >> 9) + ^ ((iopte & 0x00000000000001000ul) >> 7); +} + +/* same as above, but pretend that we have a simpler 1-way associative + pte cache with an 8 bit index */ +static inline unsigned long +get_ioc_hash_1way(ioste iost_entry, unsigned long io_address) +{ + unsigned long iopte = get_ioptep(iost_entry, io_address); + + return ((iopte & 0x000000000000001f8ul) >> 3) + ^ ((iopte & 0x00000000000020000ul) >> 17) + ^ ((iopte & 0x00000000000010000ul) >> 15) + ^ 
((iopte & 0x00000000000008000ul) >> 13) + ^ ((iopte & 0x00000000000004000ul) >> 11) + ^ ((iopte & 0x00000000000002000ul) >> 9) + ^ ((iopte & 0x00000000000001000ul) >> 7) + ^ ((iopte & 0x0000000000000c000ul) >> 8); +} + +static inline ioste +get_iost_cache(void __iomem *base, unsigned long index) +{ + unsigned long __iomem *p = (base + IOC_ST_CACHE_DIR); + return mk_ioste(in_be64(&p[index])); +} + +static inline void +set_iost_cache(void __iomem *base, unsigned long index, ioste ste) +{ + unsigned long __iomem *p = (base + IOC_ST_CACHE_DIR); + pr_debug("ioste %02lx was %016lx, store %016lx", index, + get_iost_cache(base, index).val, ste.val); + out_be64(&p[index], ste.val); + pr_debug(" now %016lx\n", get_iost_cache(base, index).val); +} + +static inline unsigned long +get_iopt_cache(void __iomem *base, unsigned long index, unsigned long *tag) +{ + unsigned long __iomem *tags = (void *)(base + IOC_PT_CACHE_DIR); + unsigned long __iomem *p = (void *)(base + IOC_PT_CACHE_REG); + + *tag = tags[index]; + rmb(); + return *p; +} + +static inline void +set_iopt_cache(void __iomem *base, unsigned long index, + unsigned long tag, unsigned long val) +{ + unsigned long __iomem *tags = base + IOC_PT_CACHE_DIR; + unsigned long __iomem *p = base + IOC_PT_CACHE_REG; + pr_debug("iopt %02lx was v%016lx/t%016lx, store v%016lx/t%016lx\n", + index, get_iopt_cache(base, index, &oldtag), oldtag, val, tag); + + out_be64(p, val); + out_be64(&tags[index], tag); +} + +static inline void +set_iost_origin(void __iomem *base) +{ + unsigned long __iomem *p = base + IOC_ST_ORIGIN; + unsigned long origin = IOSTO_ENABLE | IOSTO_SW; + + pr_debug("iost_origin %016lx, now %016lx\n", in_be64(p), origin); + out_be64(p, origin); +} + +static inline void +set_iocmd_config(void __iomem *base) +{ + unsigned long __iomem *p = base + 0xc00; + unsigned long conf; + + conf = in_be64(p); + pr_debug("iost_conf %016lx, now %016lx\n", conf, conf | IOCMD_CONF_TE); + out_be64(p, conf | IOCMD_CONF_TE); +} + +/* FIXME: 
get these from the device tree */ +#define ioc_base 0x20000511000ull +#define ioc_mmio_base 0x20000510000ull +#define ioid 0x48a +#define iopt_phys_offset (- 0x20000000) /* We have a 512MB offset from the SB */ +#define io_page_size 0x1000000 + +static unsigned long map_iopt_entry(unsigned long address) +{ + switch (address >> 20) { + case 0x600: + address = 0x24020000000ull; /* spider i/o */ + break; + default: + address += iopt_phys_offset; + break; + } + + return get_iopt_entry(address, ioid, IOPT_PROT_RW); +} + +static void iommu_bus_setup_null(struct pci_bus *b) { } +static void iommu_dev_setup_null(struct pci_dev *d) { } + +/* initialize the iommu to support a simple linear mapping + * for each DMA window used by any device. For now, we + * happen to know that there is only one DMA window in use, + * starting at iopt_phys_offset. */ +static void bpa_map_iommu(void) +{ + unsigned long address; + void __iomem *base; + ioste ioste; + unsigned long index; + + base = __ioremap(ioc_base, 0x1000, _PAGE_NO_CACHE); + pr_debug("%lx mapped to %p\n", ioc_base, base); + set_iocmd_config(base); + iounmap(base); + + base = __ioremap(ioc_mmio_base, 0x1000, _PAGE_NO_CACHE); + pr_debug("%lx mapped to %p\n", ioc_mmio_base, base); + + set_iost_origin(base); + + for (address = 0; address < 0x100000000ul; address += io_page_size) { + ioste = get_iost_entry(0x10000000000ul, address, io_page_size); + if ((address & 0xfffffff) == 0) /* segment start */ + set_iost_cache(base, address >> 28, ioste); + index = get_ioc_hash_1way(ioste, address); + pr_debug("addr %08lx, index %02lx, ioste %016lx\n", + address, index, ioste.val); + set_iopt_cache(base, + get_ioc_hash_1way(ioste, address), + get_ioc_tag(ioste, address), + map_iopt_entry(address)); + } + iounmap(base); +} + + +static void *bpa_alloc_coherent(struct device *hwdev, size_t size, + dma_addr_t *dma_handle, unsigned int __nocast flag) +{ + void *ret; + + ret = (void *)__get_free_pages(flag, get_order(size)); + if (ret != NULL) { + 
memset(ret, 0, size); + *dma_handle = virt_to_abs(ret) | BPA_DMA_VALID; + } + return ret; +} + +static void bpa_free_coherent(struct device *hwdev, size_t size, + void *vaddr, dma_addr_t dma_handle) +{ + free_pages((unsigned long)vaddr, get_order(size)); +} + +static dma_addr_t bpa_map_single(struct device *hwdev, void *ptr, + size_t size, enum dma_data_direction direction) +{ + return virt_to_abs(ptr) | BPA_DMA_VALID; +} + +static void bpa_unmap_single(struct device *hwdev, dma_addr_t dma_addr, + size_t size, enum dma_data_direction direction) +{ +} + +static int bpa_map_sg(struct device *hwdev, struct scatterlist *sg, + int nents, enum dma_data_direction direction) +{ + int i; + + for (i = 0; i < nents; i++, sg++) { + sg->dma_address = (page_to_phys(sg->page) + sg->offset) + | BPA_DMA_VALID; + sg->dma_length = sg->length; + } + + return nents; +} + +static void bpa_unmap_sg(struct device *hwdev, struct scatterlist *sg, + int nents, enum dma_data_direction direction) +{ +} + +static int bpa_dma_supported(struct device *dev, u64 mask) +{ + return mask < 0x100000000ull; +} + +void bpa_init_iommu(void) +{ + bpa_map_iommu(); + + /* Direct I/O, IOMMU off */ + ppc_md.iommu_dev_setup = iommu_dev_setup_null; + ppc_md.iommu_bus_setup = iommu_bus_setup_null; + + pci_dma_ops.alloc_coherent = bpa_alloc_coherent; + pci_dma_ops.free_coherent = bpa_free_coherent; + pci_dma_ops.map_single = bpa_map_single; + pci_dma_ops.unmap_single = bpa_unmap_single; + pci_dma_ops.map_sg = bpa_map_sg; + pci_dma_ops.unmap_sg = bpa_unmap_sg; + pci_dma_ops.dma_supported = bpa_dma_supported; +} Index: linus-2.5/arch/ppc64/kernel/bpa_iommu.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linus-2.5/arch/ppc64/kernel/bpa_iommu.h 2005-04-29 09:47:29.000000000 +0200 @@ -0,0 +1,65 @@ +#ifndef BPA_IOMMU_H +#define BPA_IOMMU_H + +/* some constants */ +enum { + /* segment table entries */ + IOST_VALID_MASK = 0x8000000000000000ul, + 
IOST_TAG_MASK = 0x3000000000000000ul, + IOST_PT_BASE_MASK = 0x000003fffffff000ul, + IOST_NNPT_MASK = 0x0000000000000fe0ul, + IOST_PS_MASK = 0x000000000000000ful, + + IOST_PS_4K = 0x1, + IOST_PS_64K = 0x3, + IOST_PS_1M = 0x5, + IOST_PS_16M = 0x7, + + /* iopt tag register */ + IOPT_VALID_MASK = 0x0000000200000000ul, + IOPT_TAG_MASK = 0x00000001fffffffful, + + /* iopt cache register */ + IOPT_PROT_MASK = 0xc000000000000000ul, + IOPT_PROT_NONE = 0x0000000000000000ul, + IOPT_PROT_READ = 0x4000000000000000ul, + IOPT_PROT_WRITE = 0x8000000000000000ul, + IOPT_PROT_RW = 0xc000000000000000ul, + IOPT_COHERENT = 0x2000000000000000ul, + + IOPT_ORDER_MASK = 0x1800000000000000ul, + /* order access to same IOID/VC on same address */ + IOPT_ORDER_ADDR = 0x0800000000000000ul, + /* similar, but only after a write access */ + IOPT_ORDER_WRITES = 0x1000000000000000ul, + /* Order all accesses to same IOID/VC */ + IOPT_ORDER_VC = 0x1800000000000000ul, + + IOPT_RPN_MASK = 0x000003fffffff000ul, + IOPT_HINT_MASK = 0x0000000000000800ul, + IOPT_IOID_MASK = 0x00000000000007fful, + + IOSTO_ENABLE = 0x8000000000000000ul, + IOSTO_ORIGIN = 0x000003fffffff000ul, + IOSTO_HW = 0x0000000000000800ul, + IOSTO_SW = 0x0000000000000400ul, + + IOCMD_CONF_TE = 0x0000800000000000ul, + + /* memory mapped registers */ + IOC_PT_CACHE_DIR = 0x000, + IOC_ST_CACHE_DIR = 0x800, + IOC_PT_CACHE_REG = 0x910, + IOC_ST_ORIGIN = 0x918, + IOC_CONF = 0x930, + + /* The high bit needs to be set on every DMA address, + only 2GB are addressable */ + BPA_DMA_VALID = 0x80000000, + BPA_DMA_MASK = 0x7fffffff, +}; + + +void bpa_init_iommu(void); + +#endif Index: linus-2.5/arch/ppc64/kernel/bpa_setup.c =================================================================== --- linus-2.5.orig/arch/ppc64/kernel/bpa_setup.c 2005-04-22 06:59:58.000000000 +0200 +++ linus-2.5/arch/ppc64/kernel/bpa_setup.c 2005-04-29 10:01:12.000000000 +0200 @@ -46,6 +46,7 @@ #include "pci.h" #include "bpa_iic.h" +#include "bpa_iommu.h" #ifdef DEBUG #define 
DBG(fmt...) udbg_printf(fmt) @@ -179,7 +180,7 @@ hpte_init_native(); - pci_direct_iommu_init(); + bpa_init_iommu(); ppc64_interrupt_controller = IC_BPA_IIC;

From arnd at arndb.de Sat May 14 05:29:06 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 13 May 2005 21:29:06 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <200505132117.37461.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> Message-ID: <200505132129.07603.arnd@arndb.de>

This is an early version of the SPU file system, which is used to run code on the Synergistic Processing Units of the Broadband Engine.

The file system provides a name space similar to posix shared memory or message queues. Users that have write permissions on the file system can create directories in the spufs root.

Every directory represents an SPU context, which is currently mapped to a physical SPU, but that is going to change to a virtualization scheme in future updates.

An SPU context directory contains a predefined set of files used for manipulating the state of the logical SPU. Users can change permissions on those files, but not actually add or remove files without removing the complete directory.

The current set of files is:

/mem
  the contents of the local store memory of the SPU. This can be accessed like a regular shared memory file and contains both code and data in the address space of the SPU. The implemented file operations currently are read(), write() and mmap(). We will need our own address space operations as soon as we allow the SPU context to be scheduled away from the physical SPU into page cache.

/run
  A stub file that lets us do ioctl. The only ioctl method we need is the spu_run() call. spu_run suspends the current thread from the host CPU and transfers the flow of execution to the SPU. The ioctl call returns to the calling thread when a state is entered that cannot be handled by the kernel, e.g. an error in the SPU code or an exit() from it. When a signal is pending for the host CPU thread, the ioctl is interrupted and the SPU stopped in order to call the signal handler.

/mbox
  The first SPU to CPU communication mailbox. This file is read-only and can be read in units of 32 bits. The file can only be used in non-blocking mode, and even poll() will not block on it. When no data is available in the mailbox, read() returns EAGAIN.

/ibox
  The second SPU to CPU communication mailbox. This file is similar to the first mailbox file, but can be read in blocking I/O mode, and the poll family of system calls can be used to wait for it.

/wbox
  The CPU to SPU communication mailbox. It is write-only and can be written in units of 32 bits. If the mailbox is full, write() will block and poll can be used to wait for it to become empty again.

Other files are planned but currently are not implemented or not functional.

Signed-off-by: Arnd Bergmann --- linux-cg.orig/arch/ppc64/kernel/Makefile 2005-05-13 15:23:43.019961032 -0400 +++ linux-cg/arch/ppc64/kernel/Makefile 2005-05-13 17:25:48.121935456 -0400 @@ -53,6 +53,7 @@ obj-$(CONFIG_HVCS) += hvcserver.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_XICS) += xics.o obj-$(CONFIG_MPIC) += mpic.o +obj-$(CONFIG_SPU_FS) += spu_base.o obj-$(CONFIG_PPC_PMAC) += pmac_setup.o pmac_feature.o pmac_pci.o \ pmac_time.o pmac_nvram.o pmac_low_i2c.o --- linux-cg.orig/arch/ppc64/kernel/spu_base.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/arch/ppc64/kernel/spu_base.c 2005-05-13 17:25:48.124935000 -0400 @@ -0,0 +1,579 @@ +/* + * Low-level SPU handling + * + * (C) Copyright IBM Deutschland Entwicklung GmbH 2005 + * + * Author: Arnd Bergmann + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version.
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#define DEBUG 1 + +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include + +#include "bpa_iic.h" + +static int __spu_trap_invalid_dma(struct spu *spu) +{ + pr_debug("%s\n", __FUNCTION__); + force_sig(SIGBUS, /* info, */ spu->task); + return 0; +} + +static int __spu_trap_dma_align(struct spu *spu) +{ + pr_debug("%s\n", __FUNCTION__); + force_sig(SIGBUS, /* info, */ spu->task); + return 0; +} + +static int __spu_trap_error(struct spu *spu) +{ + pr_debug("%s\n", __FUNCTION__); + force_sig(SIGILL, /* info, */ spu->task); + return 0; +} + +static int __spu_trap_data_seg(struct spu *spu, unsigned long ea) +{ + struct spu_priv2 __iomem *priv2; + struct mm_struct *mm; + + pr_debug("%s\n", __FUNCTION__); + + if (REGION_ID(ea) != USER_REGION_ID) { + printk("invalid region access at %016lx\n", ea); + return 1; + } + + priv2 = spu->priv2; + mm = spu->mm; + + if (spu->slb_replace >= 8) + spu->slb_replace = 0; + + out_be64(&priv2->slb_index_W, spu->slb_replace); + out_be64(&priv2->slb_vsid_RW, + (get_vsid(mm->context.id, ea) << SLB_VSID_SHIFT) + | SLB_VSID_USER); + out_be64(&priv2->slb_esid_RW, (ea & ESID_MASK) | SLB_ESID_V); + out_be64(&priv2->mfc_control_RW, MFC_CNTL_RESTART_DMA_COMMAND); + + printk("set slb %d context %lx, ea %016lx, vsid %016lx, esid %016lx\n", + spu->slb_replace, mm->context.id, ea, + (get_vsid(mm->context.id, ea) << SLB_VSID_SHIFT)| SLB_VSID_USER, + (ea & ESID_MASK) | SLB_ESID_V); + return 0; +} + +static int __spu_trap_data_map(struct spu *spu, unsigned 
long ea) +{ + unsigned long dsisr; + struct spu_priv1 __iomem *priv1; + + pr_debug("%s\n", __FUNCTION__); + priv1 = spu->priv1; + dsisr = in_be64(&priv1->mfc_dsisr_RW); + + if (dsisr & MFC_DSISR_PTE_NOT_FOUND) { + printk("pte lookup ea %016lx, dsisr %lx\n", ea, dsisr); + wake_up(&spu->stop_wq); + } else { + printk("unexpexted data fault ea %016lx, dsisr %lx\n", ea, dsisr); + } + + return 0; +} + +static int __spu_trap_mailbox(struct spu *spu) +{ + pr_debug("%s\n", __FUNCTION__); + wake_up(&spu->mbox_wq); + return 0; +} + +static int __spu_trap_stop(struct spu *spu) +{ + pr_debug("%s\n", __FUNCTION__); + spu->stop_code = in_be32(&spu->problem->spu_status_R); + wake_up(&spu->stop_wq); + return 0; +} + +static int __spu_trap_halt(struct spu *spu) +{ + pr_debug("%s\n", __FUNCTION__); + spu->stop_code = in_be32(&spu->problem->spu_status_R); + wake_up(&spu->stop_wq); + return 0; +} + +static int __spu_trap_tag_group(struct spu *spu) +{ + pr_debug("%s\n", __FUNCTION__); + /* wake_up(&spu->dma_wq); */ + return 0; +} + +static irqreturn_t +spu_irq_class_0(int irq, void *data, struct pt_regs *regs) +{ + struct spu *spu; + unsigned long stat; + + spu = data; + stat = in_be64(&spu->priv1->int_stat_class0_RW); + + if (stat & 1) /* invalid MFC DMA */ + __spu_trap_invalid_dma(spu); + + if (stat & 2) /* invalid DMA alignment */ + __spu_trap_dma_align(spu); + + if (stat & 4) /* error on SPU */ + __spu_trap_error(spu); + + out_be64(&spu->priv1->int_stat_class0_RW, stat); + return stat ? 
IRQ_HANDLED : IRQ_NONE; +} + +static irqreturn_t +spu_irq_class_1(int irq, void *data, struct pt_regs *regs) +{ + struct spu *spu; + unsigned long stat, dar; + + spu = data; + stat = in_be64(&spu->priv1->int_stat_class1_RW); + dar = in_be64(&spu->priv1->mfc_dar_RW); + + if (stat & 1) /* segment fault */ + __spu_trap_data_seg(spu, dar); + + if (stat & 2) { /* mapping fault */ + __spu_trap_data_map(spu, dar); + } + + if (stat & 4) /* ls compare & suspend on get */ + ; + + if (stat & 8) /* ls compare & suspend on put */ + ; + + out_be64(&spu->priv1->int_stat_class1_RW, stat); + return stat ? IRQ_HANDLED : IRQ_NONE; +} + +static irqreturn_t +spu_irq_class_2(int irq, void *data, struct pt_regs *regs) +{ + struct spu *spu; + unsigned long stat; + + spu = data; + stat = in_be64(&spu->priv1->int_stat_class2_RW); + + if (stat & 1) /* mailbox */ + __spu_trap_mailbox(spu); + + if (stat & 2) /* SPU stop-and-signal */ + __spu_trap_stop(spu); + + if (stat & 4) /* SPU halted */ + __spu_trap_halt(spu); + + if (stat & 8) /* DMA tag group complete */ + __spu_trap_tag_group(spu); + + out_be64(&spu->priv1->int_stat_class2_RW, stat); + return stat ? 
IRQ_HANDLED : IRQ_NONE; +} + +static int +spu_request_irqs(struct spu *spu) +{ + int ret; + int irq_base; + + irq_base = IIC_NODE_STRIDE * spu->node + IIC_SPE_OFFSET; + + snprintf(spu->irq_c0, sizeof (spu->irq_c0), "spe%02d.0", spu->number); + ret = request_irq(irq_base + spu->isrc, + spu_irq_class_0, 0, spu->irq_c0, spu); + if (ret) + goto out; + out_be64(&spu->priv1->int_mask_class0_RW, 0x7); + + snprintf(spu->irq_c1, sizeof (spu->irq_c1), "spe%02d.1", spu->number); + ret = request_irq(irq_base + IIC_CLASS_STRIDE + spu->isrc, + spu_irq_class_1, 0, spu->irq_c1, spu); + if (ret) + goto out1; + out_be64(&spu->priv1->int_mask_class1_RW, 0x3); + + snprintf(spu->irq_c2, sizeof (spu->irq_c2), "spe%02d.2", spu->number); + ret = request_irq(irq_base + 2*IIC_CLASS_STRIDE + spu->isrc, + spu_irq_class_2, 0, spu->irq_c2, spu); + if (ret) + goto out2; + out_be64(&spu->priv1->int_mask_class2_RW, 0xf); + goto out; + +out2: + free_irq(irq_base + IIC_CLASS_STRIDE + spu->isrc, spu); +out1: + free_irq(irq_base + spu->isrc, spu); +out: + return ret; +} + +static void +spu_free_irqs(struct spu *spu) +{ + int irq_base; + + irq_base = IIC_NODE_STRIDE * spu->node + IIC_SPE_OFFSET; + + free_irq(irq_base + spu->isrc, spu); + free_irq(irq_base + IIC_CLASS_STRIDE + spu->isrc, spu); + free_irq(irq_base + 2*IIC_CLASS_STRIDE + spu->isrc, spu); +} + +static LIST_HEAD(spu_list); +static DECLARE_MUTEX(spu_mutex); + +struct spu *spu_alloc(void) +{ + struct spu *spu; + + down(&spu_mutex); + if (!list_empty(&spu_list)) { + spu = list_entry(spu_list.next, struct spu, list); + list_del_init(&spu->list); + printk("Got SPU %x\n", spu->isrc); + } else { + printk("No SPU left\n"); + spu = NULL; + } + up(&spu_mutex); + return spu; +} +EXPORT_SYMBOL(spu_alloc); + +void spu_free(struct spu *spu) +{ + down(&spu_mutex); + list_add_tail(&spu->list, &spu_list); + up(&spu_mutex); +} +EXPORT_SYMBOL(spu_free); + +extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap); //XXX +static int 
spu_handle_pte_fault(struct spu *spu) +{ + struct spu_problem __iomem *prob; + struct spu_priv1 __iomem *priv1; + struct spu_priv2 __iomem *priv2; + unsigned long ea, access, is_write; + struct mm_struct *mm; + struct vm_area_struct *vma; + int ret; + + printk("%s\n", __FUNCTION__); + prob = spu->problem; + priv1 = spu->priv1; + priv2 = spu->priv2; + + ea = in_be64(&priv1->mfc_dar_RW); + access = _PAGE_PRESENT | _PAGE_USER; + is_write = in_be64(&priv1->mfc_dsisr_RW) & 0x02000000; + mm = spu->mm; + + ret = hash_page(ea, access, 0x300); + if (ret < 0) { + printk("error in hash_page!\n"); + ret = -EFAULT; + goto out_err; + } + + printk("current %ld, spu %ld, ea %ld\n", current->mm->context.id, mm->context.id, ea); + if (!ret) { + printk("hash inserted, vsid %lx\n", get_vsid(current->mm->context.id, ea)); + goto out_restart; + } + + ret = -EFAULT; + if (ea >= TASK_SIZE) + goto out_err; + + down_read(&mm->mmap_sem); + vma = find_vma(mm, ea); + if (!vma) + goto out; + + if (is_write) { + if (!(vma->vm_flags & VM_WRITE)) + goto out; + } + + ret = 0; +/* FIXME add missing code from do_page_fault */ + switch (handle_mm_fault(mm, vma, ea, is_write)) { + case VM_FAULT_MINOR: + printk("minor\n"); + current->min_flt++; + break; + case VM_FAULT_MAJOR: + printk("major\n"); + current->maj_flt++; + break; + case VM_FAULT_SIGBUS: + ret = -EFAULT; + break; + case VM_FAULT_OOM: + ret = -ENOMEM; + break; + default: + BUG(); + } +out: + up_read(&mm->mmap_sem); + if (ret) + goto out_err; +out_restart: + out_be64(&priv2->mfc_control_RW, MFC_CNTL_RESTART_DMA_COMMAND); +out_err: + printk("%s: returning %d\n", __FUNCTION__, ret); + return ret; +} + +int spu_run(struct spu *spu) +{ + struct spu_problem __iomem *prob; + struct spu_priv1 __iomem *priv1; + struct spu_priv2 __iomem *priv2; + unsigned long status; + int count = 10; + int ret; + + prob = spu->problem; + priv1 = spu->priv1; + priv2 = spu->priv2; + spu->mm = current->mm; + spu->task = current; + out_be32(&prob->spu_runcntl_RW, 
SPU_RUNCNTL_RUNNABLE); + + do { + ret = wait_event_interruptible(spu->stop_wq, + (!((status = in_be32(&prob->spu_status_R)) & 0x1)) + || (in_be64(&priv1->mfc_dsisr_RW) & MFC_DSISR_PTE_NOT_FOUND)); + + if (status & SPU_STATUS_STOPPED_BY_STOP) + ret = -EAGAIN; + else if (status & SPU_STATUS_STOPPED_BY_HALT) + ret = -EIO; + else if (in_be64(&priv1->mfc_dsisr_RW) & MFC_DSISR_PTE_NOT_FOUND) + ret = spu_handle_pte_fault(spu); + + } while (!ret && count--); + out_be32(&prob->spu_runcntl_RW, SPU_RUNCNTL_STOP); + out_be64(&priv2->slb_invalidate_all_W, 0); + spu->mm = NULL; + spu->task = NULL; + + return ret; +} +EXPORT_SYMBOL(spu_run); + +static void __iomem * __init map_spe_prop(struct device_node *n, + const char *name) +{ + struct address_prop { + unsigned long address; + unsigned int len; + } __attribute__((packed)) *prop; + + void *p; + int proplen; + + p = get_property(n, name, &proplen); + if (proplen != sizeof (struct address_prop)) + return NULL; + + prop = p; + + return ioremap(prop->address, prop->len); +} + +static void spu_unmap(struct spu *spu) +{ + iounmap(spu->priv2); + iounmap(spu->priv1); + iounmap(spu->problem); + iounmap((u8 __iomem *)spu->local_store); +} + +static int __init spu_map_device(struct spu *spu, struct device_node *spe) +{ + unsigned int *isrc_prop; + int ret; + + ret = -ENODEV; + isrc_prop = (u32 *)get_property(spe, "isrc", NULL); + if (!isrc_prop) + goto out; + spu->isrc = *isrc_prop; + + spu->name = get_property(spe, "name", NULL); + if (!spu->name) + goto out; + + /* we use local store as ram, not io memory */ + spu->local_store = (u8 __force *) map_spe_prop(spe, "local-store"); + if (!spu->local_store) + goto out; + + spu->problem= map_spe_prop(spe, "problem"); + if (!spu->problem) + goto out_unmap; + + spu->priv1= map_spe_prop(spe, "priv1"); + if (!spu->priv1) + goto out_unmap; + + spu->priv2= map_spe_prop(spe, "priv2"); + if (!spu->priv2) + goto out_unmap; + ret = 0; + goto out; + +out_unmap: + spu_unmap(spu); +out: + return ret; +} + 
+static int __init find_spu_node_id(struct device_node *spe) +{ + unsigned int *id; + struct device_node *cpu; + + cpu = spe->parent->parent; + id = (unsigned int *)get_property(cpu, "node-id", NULL); + + return id ? *id : 0; +} + +static int __init create_spu(struct device_node *spe) +{ + struct spu *spu; + int ret; + static int number; + + ret = -ENOMEM; + spu = kmalloc(sizeof (*spu), GFP_KERNEL); + if (!spu) + goto out; + + ret = spu_map_device(spu, spe); + if (ret) + goto out_free; + + spu->node = find_spu_node_id(spe); + spu->stop_code = 0; + spu->slb_replace = 0; + spu->mm = NULL; + + out_be64(&spu->priv1->mfc_sdr_RW, mfspr(SPRN_SDR1)); + out_be64(&spu->priv1->mfc_sr1_RW, 0x33); + + init_waitqueue_head(&spu->stop_wq); + init_waitqueue_head(&spu->mbox_wq); + + down(&spu_mutex); + spu->number = number++; + ret = spu_request_irqs(spu); + if (ret) + goto out_unmap; + + list_add(&spu->list, &spu_list); + up(&spu_mutex); + + printk(KERN_DEBUG "Using SPE %s %02x %p %p %p %p %d\n", + spu->name, spu->isrc, spu->local_store, + spu->problem, spu->priv1, spu->priv2, spu->number); + goto out; + +out_unmap: + up(&spu_mutex); + spu_unmap(spu); +out_free: + kfree(spu); +out: + return ret; +} + +static void destroy_spu(struct spu *spu) +{ + list_del_init(&spu->list); + + spu_free_irqs(spu); + spu_unmap(spu); + kfree(spu); +} + +static void cleanup_spu_base(void) +{ + struct spu *spu, *tmp; + down(&spu_mutex); + list_for_each_entry_safe(spu, tmp, &spu_list, list) + destroy_spu(spu); + up(&spu_mutex); +} +module_exit(cleanup_spu_base); + +static int __init init_spu_base(void) +{ + struct device_node *node; + int ret; + + ret = -ENODEV; + for (node = of_find_node_by_type(NULL, "spc"); + node; node = of_find_node_by_type(node, "spc")) { + ret = create_spu(node); + if (ret) { + printk(KERN_WARNING "%s: Error initializing %s\n", + __FUNCTION__, node->name); + cleanup_spu_base(); + break; + } + } + return ret; +} +module_init(init_spu_base); + +MODULE_LICENSE("GPL"); 
+MODULE_AUTHOR("Arnd Bergmann "); --- linux-cg.orig/arch/ppc64/mm/hash_utils.c 2005-05-13 15:15:07.870991152 -0400 +++ linux-cg/arch/ppc64/mm/hash_utils.c 2005-05-13 17:25:48.126934696 -0400 @@ -354,6 +354,7 @@ int hash_page(unsigned long ea, unsigned return ret; } +EXPORT_SYMBOL_GPL(hash_page); void flush_hash_page(unsigned long context, unsigned long ea, pte_t pte, int local) --- linux-cg.orig/fs/Kconfig 2005-05-13 15:15:07.872990848 -0400 +++ linux-cg/fs/Kconfig 2005-05-13 17:25:48.128934392 -0400 @@ -853,6 +853,16 @@ config HUGETLBFS config HUGETLB_PAGE def_bool HUGETLBFS +config SPU_FS + tristate "SPU file system" + default m + depends on PPC_BPA + help + The SPU file system is used to access Synergistic Processing + Units on machines implementing the Broadband Processor + Architecture. + + config RAMFS bool default y --- linux-cg.orig/fs/Makefile 2005-05-13 15:15:07.874990544 -0400 +++ linux-cg/fs/Makefile 2005-05-13 17:25:48.131933936 -0400 @@ -95,3 +95,4 @@ obj-$(CONFIG_BEFS_FS) += befs/ obj-$(CONFIG_HOSTFS) += hostfs/ obj-$(CONFIG_HPPFS) += hppfs/ obj-$(CONFIG_DEBUG_FS) += debugfs/ +obj-$(CONFIG_SPU_FS) += spufs/ --- linux-cg.orig/fs/spufs/Makefile 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/fs/spufs/Makefile 2005-05-13 17:25:48.133933632 -0400 @@ -0,0 +1,3 @@ +obj-$(CONFIG_SPU_FS) += spufs.o + +spufs-y += inode.o --- linux-cg.orig/fs/spufs/inode.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/fs/spufs/inode.c 2005-05-13 17:25:48.135933328 -0400 @@ -0,0 +1,991 @@ +/* + * SPU file system + * + * (C) Copyright IBM Deutschland Entwicklung GmbH 2005 + * + * Author: Arnd Bergmann + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include + +/* SPU context abstraction */ +struct spu_context { + struct spu *spu; /* pointer to a physical SPU if SPUFS_DIRECT */ + struct rw_semaphore backing_sema; /* protects the above */ + spinlock_t mmio_lock; /* protects mmio access */ + long sig; + + struct kref kref; +}; + +static struct spu_context * +alloc_spu_context(void) +{ + struct spu_context *ctx; + ctx = kmalloc(sizeof *ctx, GFP_KERNEL); + if (!ctx) + goto out; + ctx->spu = spu_alloc(); + if (!ctx->spu) + goto out_free; + init_rwsem(&ctx->backing_sema); + spin_lock_init(&ctx->mmio_lock); + kref_init(&ctx->kref); + goto out; +out_free: + kfree(ctx); + ctx = NULL; +out: + return ctx; +} + +static void +destroy_spu_context(struct kref *kref) +{ + struct spu_context *ctx; + ctx = container_of(kref, struct spu_context, kref); + if (ctx->spu) + spu_free(ctx->spu); + kfree(ctx); +} + +static struct spu_context * +get_spu_context(struct spu_context *ctx) +{ + kref_get(&ctx->kref); + return ctx; +} + +static void +put_spu_context(struct spu_context *ctx) +{ + kref_put(&ctx->kref, &destroy_spu_context); +} + +/* The magic number for our file system */ +enum { + SPUFS_MAGIC = 0x23c9b64e, +}; + +/* bits in the inode flags */ +enum { + SPUFS_DIRECT, /* Data resides on a physical SPU */ +}; + +struct spufs_inode_info { + struct spu_context *i_ctx; + struct inode vfs_inode; +}; + +static kmem_cache_t *spufs_inode_cache; +#define SPUFS_I(inode) 
container_of(inode, struct spufs_inode_info, vfs_inode) + +/* Information about the backing dev, same as ramfs */ + +static struct backing_dev_info spufs_backing_dev_info = { + .ra_pages = 0, /* No readahead */ + .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK | + BDI_CAP_MAP_DIRECT | BDI_CAP_MAP_COPY | BDI_CAP_READ_MAP | + BDI_CAP_WRITE_MAP, +}; + +static struct address_space_operations spufs_aops = { + .readpage = simple_readpage, + .prepare_write = simple_prepare_write, + .commit_write = simple_commit_write, +}; + +/* File operations */ + +static int +spufs_open(struct inode *inode, struct file *file) +{ + struct spufs_inode_info *i = SPUFS_I(inode); + file->private_data = i->i_ctx; + return 0; +} + +static ssize_t +spufs_read(struct file *file, char __user *buffer, size_t size, loff_t *pos) +{ + struct spu *spu; + struct spu_context *ctx; + int ret; + + ctx = file->private_data; + spu = ctx->spu; + + down_read(&ctx->backing_sema); + if (spu->number & 0/*1*/) { + ret = generic_file_read(file, buffer, size, pos); + goto out; + } + + ret = 0; + size = min_t(ssize_t, LS_SIZE - *pos, size); + if ((ssize_t)size <= 0) /* size is unsigned; catch *pos beyond LS_SIZE */ + goto out; + *pos += size; + ret = copy_to_user(buffer, spu->local_store + *pos - size, size); + ret = ret ? -EFAULT : size; + +out: + up_read(&ctx->backing_sema); + return ret; +} + +static ssize_t +spufs_write(struct file *file, const char __user *buffer, size_t size, loff_t *pos) +{ + struct spu_context *ctx = file->private_data; + struct spu *spu = ctx->spu; + + if (spu->number & 0) //1) + return generic_file_write(file, buffer, size, pos); + + size = min_t(ssize_t, LS_SIZE - *pos, size); + if ((ssize_t)size <= 0) /* size is unsigned; catch *pos beyond LS_SIZE */ + return -EFBIG; + *pos += size; + return copy_from_user(spu->local_store + *pos - size, + buffer, size) ? 
-EFAULT : size; +} + +static int +spufs_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct spu_context *ctx = file->private_data; + struct spu *spu = ctx->spu; + unsigned long pfn; + + if (spu->number & 0) //1) + return generic_file_mmap(file, vma); + + vma->vm_flags |= VM_RESERVED; + pfn = __pa(spu->local_store) >> PAGE_SHIFT; + /* + * This will work for actual SPUs, but not for vmalloc memory: + */ + if (remap_pfn_range(vma, vma->vm_start, pfn, + vma->vm_end-vma->vm_start, vma->vm_page_prot)) + return -EAGAIN; + /**/ + return 0; +} + +static struct file_operations spufs_mem_fops = { + .open = spufs_open, + .read = spufs_read, + .write = spufs_write, + .mmap = spufs_mmap, + .llseek = generic_file_llseek, +}; + +/* generic open function for all pipe-like files */ +static int spufs_pipe_open(struct inode *inode, struct file *file) +{ + struct spufs_inode_info *i = SPUFS_I(inode); + file->private_data = i->i_ctx; + + return nonseekable_open(inode, file); +} + +static ssize_t spufs_mbox_read(struct file *file, char __user *buf, + size_t len, loff_t *pos) +{ + struct spu_context *ctx; + struct spu_problem __iomem *prob; + u32 mbox_stat; + u32 mbox_data; + + if (len < 4) + return -EINVAL; + + ctx = file->private_data; + prob = ctx->spu->problem; + mbox_stat = in_be32(&prob->mb_stat_R); + if (!(mbox_stat & 0x0000ff)) + return -EAGAIN; + + mbox_data = in_be32(&prob->pu_mb_R); + + if (copy_to_user(buf, &mbox_data, sizeof mbox_data)) + return -EFAULT; + + return 4; +} + +static struct file_operations spufs_mbox_fops = { + .open = spufs_pipe_open, + .read = spufs_mbox_read, +}; + +static ssize_t spufs_ibox_read(struct file *file, char __user *buf, + size_t len, loff_t *pos) +{ + struct spu_context *ctx; + struct spu_problem __iomem *prob; + struct spu_priv2 __iomem *priv2; + u32 mbox_stat; + u32 ibox_data; + ssize_t ret; + + if (len < 4) + return -EINVAL; + + ctx = file->private_data; + prob = ctx->spu->problem; + priv2 = ctx->spu->priv2; + + mbox_stat = 
in_be32(&prob->mb_stat_R); + if (!(mbox_stat & 0xff0000)) + return -EAGAIN; + + ibox_data = in_be64(&priv2->puint_mb_R); + + ret = 4; + if (copy_to_user(buf, &ibox_data, sizeof ibox_data)) + ret = -EFAULT; + + return ret; +} + +static unsigned int spufs_ibox_poll(struct file *file, poll_table *wait) +{ + struct spu_context *ctx; + struct spu_problem __iomem *prob; + u32 mbox_stat; + unsigned int mask; + + ctx = file->private_data; + prob = ctx->spu->problem; + mbox_stat = in_be32(&prob->mb_stat_R); + + poll_wait(file, &ctx->spu->mbox_wq, wait); + + mask = 0; + if (mbox_stat & 0xff0000) + mask |= POLLIN | POLLRDNORM; + + return mask; +} + +static struct file_operations spufs_ibox_fops = { + .open = spufs_pipe_open, + .read = spufs_ibox_read, + .poll = spufs_ibox_poll, +}; + +static ssize_t spufs_wbox_write(struct file *file, const char __user *buf, + size_t len, loff_t *pos) +{ + struct spu_context *ctx; + struct spu_problem __iomem *prob; + u32 mbox_stat; + u32 wbox_data; + + if (len < 4) + return -EINVAL; + + ctx = file->private_data; + prob = ctx->spu->problem; + mbox_stat = in_be32(&prob->mb_stat_R); + if (!(mbox_stat & 0x00ff00)) + return -EAGAIN; + + if (copy_from_user(&wbox_data, buf, sizeof wbox_data)) + return -EFAULT; + + out_be32(&prob->spu_mb_W, wbox_data); + + return 4; +} + +static unsigned int spufs_wbox_poll(struct file *file, poll_table *wait) +{ + struct spu_context *ctx; + struct spu_problem __iomem *prob; + u32 mbox_stat; + unsigned int mask; + + ctx = file->private_data; + prob = ctx->spu->problem; + mbox_stat = in_be32(&prob->mb_stat_R); + + poll_wait(file, &ctx->spu->mbox_wq, wait); + + mask = 0; + if (mbox_stat & 0x00ff00) + mask = POLLOUT | POLLWRNORM; + + return mask; +} + +static struct file_operations spufs_wbox_fops = { + .open = spufs_pipe_open, + .write = spufs_wbox_write, + .poll = spufs_wbox_poll, +}; + +static int spufs_run_open(struct inode *inode, struct file *file) +{ + struct spufs_inode_info *i = SPUFS_I(inode); + 
file->private_data = i->i_ctx; + + return nonseekable_open(inode, file); +} + +struct spufs_run_arg { + u32 npc; /* inout: Next Program Counter */ + u32 status; /* out: SPU status */ +}; + +static long spufs_run_ioctl(struct file *file, unsigned int num, + unsigned long arg) +{ + struct spu_context *ctx; + struct spu_problem __iomem *prob; + struct spufs_run_arg data; + int ret; + + if (num != _IOWR('s', 0, struct spufs_run_arg)) + return -EINVAL; + + if (copy_from_user(&data, (void __user *)arg, sizeof data)) + return -EFAULT; + + ctx = file->private_data; + prob = ctx->spu->problem; + out_be32(&prob->spu_npc_RW, data.npc); + wmb(); + + ret = spu_run(ctx->spu); +/* + prob->spu_npc_RW = data.npc; + ctx->spu->mm = current->mm; + wmb(); + prob->spu_runcntl_RW = SPU_RUNCNTL_RUNNABLE; + mb(); + + ret = wait_event_interruptible(ctx->spu->stop_wq, + prob->spu_status_R & 0x3e); + + prob->spu_runcntl_RW = SPU_RUNCNTL_STOP; + ctx->spu->mm = NULL; +*/ + data.status = in_be32(&prob->spu_status_R); + data.npc = in_be32(&prob->spu_npc_RW); + if (copy_to_user((void __user *)arg, &data, sizeof data)) + ret = -EFAULT; + + return ret; +} + +static struct file_operations spufs_run_fops = { + .open = spufs_run_open, + .unlocked_ioctl = spufs_run_ioctl, + .compat_ioctl = spufs_run_ioctl, +}; + + +/**** spufs attributes + * + * Attributes in spufs behave similar to those in sysfs: + * + * Writing to an attribute immediately sets a value, an open file can be + * written to multiple times. + * + * Reading from an attribute creates a buffer from the value that might get + * read with multiple read calls. When the attribute has been read completely, + * no further read calls are possible until the file is opened again. + * + * All spufs attributes contain a text representation of a numeric value that + * are accessed with the get() and set() functions. + * + * Perhaps these file operations could be put in debugfs or libfs instead, + * they are not really SPU specific. 
+ */ + +struct spufs_attr { + long (*get)(struct spu_context *); + void (*set)(struct spu_context *, long); + struct spu_context *ctx; + char get_buf[24]; /* enough to store a long and "\n\0" */ + char set_buf[24]; + struct semaphore sem; /* protects access to these buffers */ +}; + +/* spufs_attr_open is called by an actual attribute open file operation + * to set the attribute specific access operations. */ +static int spufs_attr_open(struct inode *inode, struct file *file, + long (*get)(struct spu_context *), + void (*set)(struct spu_context *, long)) +{ + struct spufs_attr *attr; + + /* reading/writing needs the respective get/set operation */ + if (((file->f_mode & FMODE_READ) && !get) || + ((file->f_mode & FMODE_WRITE) && !set)) + return -EACCES; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) + return -ENOMEM; + + attr->get = get; + attr->set = set; + attr->ctx = SPUFS_I(inode)->i_ctx; + init_MUTEX(&attr->sem); + + file->private_data = attr; + + return nonseekable_open(inode, file); +} + +static int spufs_attr_close(struct inode *inode, struct file *file) +{ + kfree(file->private_data); + return 0; +} + +/* read from the buffer that is filled with the get function */ +static ssize_t spufs_attr_read(struct file *file, char __user *buf, + size_t len, loff_t *ppos) +{ + struct spufs_attr *attr; + size_t size; + ssize_t ret; + + attr = file->private_data; + + down(&attr->sem); + if (*ppos) /* continued read */ + size = strlen(attr->get_buf); + else /* first read */ + size = scnprintf(attr->get_buf, sizeof (attr->get_buf), + "%ld\n", attr->get(attr->ctx)); + + ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size); + up(&attr->sem); + return ret; +} + +/* interpret the buffer as a number to call the set function with */ +static ssize_t spufs_attr_write(struct file *file, const char __user *buf, + size_t len, loff_t *ppos) +{ + struct spufs_attr *attr; + long val; + size_t size; + ssize_t ret; + + + attr = file->private_data; + + 
down(&attr->sem); + ret = -EFAULT; + size = min(sizeof (attr->set_buf) - 1, len); + if (copy_from_user(attr->set_buf, buf, size)) + goto out; + + ret = len; /* claim we got the whole input */ + attr->set_buf[size] = '\0'; + val = simple_strtol(attr->set_buf, NULL, 0); + attr->set(attr->ctx, val); +out: + up(&attr->sem); + return ret; +} + +#define spufs_attribute(name) \ +static int name ## _open(struct inode *inode, struct file *file) \ +{ \ + return spufs_attr_open(inode, file, &name ## _get, &name ## _set); \ +} \ +static struct file_operations name = { \ + .open = name ## _open, \ + .release = spufs_attr_close, \ + .read = spufs_attr_read, \ + .write = spufs_attr_write, \ +}; + + +static void spufs_signal1_type_set(struct spu_context *ctx, long val) +{ + ctx->sig = val; +} + +static long spufs_signal1_type_get(struct spu_context *ctx) +{ + return ctx->sig; +} + +spufs_attribute(spufs_signal1_type); + +static void spufs_class0_stat_set(struct spu_context *ctx, long val) +{ + out_be64(&ctx->spu->priv1->int_stat_class0_RW, val); +} + +static long spufs_class0_stat_get(struct spu_context *ctx) +{ + return in_be64(&ctx->spu->priv1->int_stat_class0_RW); +} + +spufs_attribute(spufs_class0_stat); + +static void spufs_class1_stat_set(struct spu_context *ctx, long val) +{ + out_be64(&ctx->spu->priv1->int_stat_class1_RW, val); +} + +static long spufs_class1_stat_get(struct spu_context *ctx) +{ + return in_be64(&ctx->spu->priv1->int_stat_class1_RW); +} + +spufs_attribute(spufs_class1_stat); + +static void spufs_class2_stat_set(struct spu_context *ctx, long val) +{ + out_be64(&ctx->spu->priv1->int_stat_class2_RW, val); +} + +static long spufs_class2_stat_get(struct spu_context *ctx) +{ + return in_be64(&ctx->spu->priv1->int_stat_class2_RW); +} + +spufs_attribute(spufs_class2_stat); + +static void spufs_class0_mask_set(struct spu_context *ctx, long val) +{ + out_be64(&ctx->spu->priv1->int_mask_class0_RW, val); +} + +static long spufs_class0_mask_get(struct spu_context *ctx) 
+{ + return in_be64(&ctx->spu->priv1->int_mask_class0_RW); +} + +spufs_attribute(spufs_class0_mask); + +static void spufs_class1_mask_set(struct spu_context *ctx, long val) +{ + out_be64(&ctx->spu->priv1->int_mask_class1_RW, val); +} + +static long spufs_class1_mask_get(struct spu_context *ctx) +{ + return in_be64(&ctx->spu->priv1->int_mask_class1_RW); +} + +spufs_attribute(spufs_class1_mask); + +static void spufs_class2_mask_set(struct spu_context *ctx, long val) +{ + out_be64(&ctx->spu->priv1->int_mask_class2_RW, val); +} + +static long spufs_class2_mask_get(struct spu_context *ctx) +{ + return in_be64(&ctx->spu->priv1->int_mask_class2_RW); +} + +spufs_attribute(spufs_class2_mask); + +#define priv1_attr(name) \ +static void spufs_ ## name ## _set(struct spu_context *ctx, long val) \ +{ out_be64(&ctx->spu->priv1->name, val); } \ +static long spufs_ ## name ## _get(struct spu_context *ctx) \ +{ return in_be64(&ctx->spu->priv1->name); } \ +spufs_attribute(spufs_ ## name) + +#define priv2_attr(name) \ +static void spufs_ ## name ## _set(struct spu_context *ctx, long val) \ +{ out_be64(&ctx->spu->priv2->name, val); } \ +static long spufs_ ## name ## _get(struct spu_context *ctx) \ +{ return in_be64(&ctx->spu->priv2->name); } \ +spufs_attribute(spufs_ ## name) + +priv1_attr(mfc_sr1_RW); +priv1_attr(mfc_fir_R); +priv1_attr(mfc_fir_status_or_W); +priv1_attr(mfc_fir_status_and_W); +priv1_attr(mfc_fir_mask_R); +priv1_attr(mfc_fir_mask_or_W); +priv1_attr(mfc_fir_mask_and_W); +priv1_attr(mfc_fir_chkstp_enable_RW); +priv1_attr(mfc_cer_R); +priv1_attr(mfc_dsisr_RW); +priv1_attr(mfc_dsir_R); +priv1_attr(mfc_sdr_RW); +priv2_attr(mfc_control_RW); + +/* Inode operations */ + +static struct inode * +spufs_alloc_inode(struct super_block *sb) +{ + struct spufs_inode_info *ei; + + ei = kmem_cache_alloc(spufs_inode_cache, SLAB_KERNEL); + if (!ei) + return NULL; + return &ei->vfs_inode; +} + +static void +spufs_destroy_inode(struct inode *inode) +{ + kmem_cache_free(spufs_inode_cache, 
SPUFS_I(inode)); +} + +static void +spufs_init_once(void *p, kmem_cache_t * cachep, unsigned long flags) +{ + struct spufs_inode_info *ei = p; + + if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == + SLAB_CTOR_CONSTRUCTOR) { + inode_init_once(&ei->vfs_inode); + } +} + +static struct inode * +spufs_new_inode(struct super_block *sb, int mode) +{ + struct inode *inode; + + inode = new_inode(sb); + if (!inode) + goto out; + + inode->i_mode = mode; + inode->i_uid = current->fsuid; + inode->i_gid = current->fsgid; + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; +out: + return inode; +} + +static int +spufs_setattr(struct dentry *dentry, struct iattr *attr) +{ + struct inode *inode = dentry->d_inode; + +/* dump_stack(); + printk("ia_size %lld, i_size:%lld\n", attr->ia_size, inode->i_size); +*/ + if (attr->ia_size != inode->i_size) + return -EINVAL; + return inode_setattr(inode, attr); +} + +/* +static int +spufs_create(struct inode *dir, struct dentry *dentry, + int mode, struct nameidata *nd) +{ + static struct inode_operations iops = { + .getattr = simple_getattr, + .setattr = spufs_setattr, + }; + + + struct inode *inode; + int ret; + + ret = -ENOSPC; + inode = spufs_new_inode(dir->i_sb, S_IFREG | mode); + if (!inode) + goto out; + inode->i_op = &iops; + inode->i_fop = &spufs_mem_fops; + inode->i_size = LS_SIZE; + SPUFS_I(inode)->i_spu = spu_alloc(); + if (!SPUFS_I(inode)->i_spu) + goto out_iput; + inode->i_mapping->a_ops = &spufs_aops; + inode->i_mapping->backing_dev_info = &spufs_backing_dev_info; + d_instantiate(dentry, inode); + dget(dentry); + return 0; +out_iput: + iput(inode); +out: + return ret; +} +*/ + +static void +spufs_delete_inode(struct inode *inode) +{ + if (SPUFS_I(inode)->i_ctx) + put_spu_context(SPUFS_I(inode)->i_ctx); + clear_inode(inode); +} + +static struct tree_descr spufs_dir_contents[] = { + { "mem", &spufs_mem_fops, 0644, }, + { "run", &spufs_run_fops, 0400, 
}, + { "mbox", &spufs_mbox_fops, 0400, }, + { "ibox", &spufs_ibox_fops, 0400, }, + { "wbox", &spufs_wbox_fops, 0200, }, + { "signal1_type", &spufs_signal1_type, 0600, }, + { "signal2_type", &spufs_signal1_type, 0600, }, + +#if 1 /* debugging only */ + { "class0_mask", &spufs_class0_mask, 0600, }, + { "class1_mask", &spufs_class1_mask, 0600, }, + { "class2_mask", &spufs_class2_mask, 0600, }, + { "class0_stat", &spufs_class0_stat, 0600, }, + { "class1_stat", &spufs_class1_stat, 0600, }, + { "class2_stat", &spufs_class2_stat, 0600, }, + { "sr1", &spufs_mfc_sr1_RW, 0600, }, + { "fir", &spufs_mfc_fir_R, 0400, }, + { "fir_status_or", &spufs_mfc_fir_status_or_W, 0200, }, + { "fir_status_and", &spufs_mfc_fir_status_and_W, 0200, }, + { "fir_mask", &spufs_mfc_fir_mask_R, 0400, }, + { "fir_mask_or", &spufs_mfc_fir_mask_or_W, 0200, }, + { "fir_mask_and", &spufs_mfc_fir_mask_and_W, 0200, }, + { "fir_chkstp", &spufs_mfc_fir_chkstp_enable_RW, 0600, }, + { "cer", &spufs_mfc_cer_R, 0400, }, + { "dsisr", &spufs_mfc_dsisr_RW, 0600, }, + { "dsir", &spufs_mfc_dsir_R, 0200, }, + { "cntl", &spufs_mfc_control_RW, 0600, }, + { "sdr", &spufs_mfc_sdr_RW, 0600, }, +#endif + {}, +}; + +static int +spufs_fill_dir(struct dentry *dir, struct tree_descr *files, + int mode, struct spu_context *ctx) +{ + struct inode *inode; + struct dentry *dentry; + int ret; + + static struct inode_operations iops = { + .getattr = simple_getattr, + .setattr = spufs_setattr, + }; + + ret = -ENOSPC; + while (files->name && files->name[0]) { + dentry = d_alloc_name(dir, files->name); + if (!dentry) + goto out; + inode = spufs_new_inode(dir->d_sb, + S_IFREG | (files->mode & mode)); + if (!inode) + goto out; + inode->i_op = &iops; + inode->i_fop = files->ops; + inode->i_mapping->a_ops = &spufs_aops; + inode->i_mapping->backing_dev_info = &spufs_backing_dev_info; + SPUFS_I(inode)->i_ctx = get_spu_context(ctx); + + d_add(dentry, inode); + files++; + } + return 0; +out: + // FIXME: remove all files that are left + return 
ret; +} + +static int +spufs_mkdir(struct inode *dir, struct dentry *dentry, int mode) +{ + int ret; + struct inode *inode; + struct spu_context *ctx; + + ret = -ENOSPC; + inode = spufs_new_inode(dir->i_sb, mode | S_IFDIR); + if (!inode) + goto out; + + if (dir->i_mode & S_ISGID) { + inode->i_gid = dir->i_gid; + inode->i_mode |= S_ISGID; + } + ctx = alloc_spu_context(); + SPUFS_I(inode)->i_ctx = ctx; + if (!ctx) + goto out_iput; + + inode->i_op = &simple_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + ret = spufs_fill_dir(dentry, spufs_dir_contents, mode, ctx); + if (ret) + goto out_free_ctx; + + d_instantiate(dentry, inode); + dget(dentry); + dir->i_nlink++; + goto out; + +out_free_ctx: + put_spu_context(ctx); +out_iput: + iput(inode); +out: + return ret; +} + +/* This looks really wrong! */ +static int spufs_rmdir(struct inode *root, struct dentry *dir_dentry) +{ + struct dentry *dentry; + int err; + + spin_lock(&dcache_lock); + + /* check if any entry is used */ + err = -EBUSY; + list_for_each_entry(dentry, &dir_dentry->d_subdirs, d_child) { + if (d_unhashed(dentry) || !dentry->d_inode) + continue; + if (atomic_read(&dentry->d_count) != 1) + goto out; + } + /* remove all entries */ + err = 0; + list_for_each_entry(dentry, &dir_dentry->d_subdirs, d_child) { + if (d_unhashed(dentry) || !dentry->d_inode) + continue; + atomic_dec(&dentry->d_count); + __d_drop(dentry); + } +out: + spin_unlock(&dcache_lock); + if (!err) { + shrink_dcache_parent(dir_dentry); + err = simple_rmdir(root, dir_dentry); + } + return err; +} + +/* File system initialization */ + +static int +spufs_create_root(struct super_block *sb) { + static struct inode_operations spufs_dir_inode_operations = { + .lookup = simple_lookup, + .mkdir = spufs_mkdir, + .rmdir = spufs_rmdir, +// .rename = simple_rename, // XXX maybe + }; + + struct inode *inode; + int ret; + + ret = -ENOMEM; + inode = spufs_new_inode(sb, S_IFDIR | 0777); + + if (inode) { + inode->i_op = 
&spufs_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + SPUFS_I(inode)->i_ctx = NULL; + sb->s_root = d_alloc_root(inode); + if (!sb->s_root) + iput(inode); + else + ret = 0; + } + return ret; +} + +static int +spufs_fill_super(struct super_block *sb, void *data, int silent) +{ + static struct super_operations s_ops = { + .alloc_inode = spufs_alloc_inode, + .destroy_inode = spufs_destroy_inode, + .statfs = simple_statfs, + .delete_inode = spufs_delete_inode, + .drop_inode = generic_delete_inode, + }; + + sb->s_maxbytes = MAX_LFS_FILESIZE; + sb->s_blocksize = PAGE_CACHE_SIZE; + sb->s_blocksize_bits = PAGE_CACHE_SHIFT; + sb->s_magic = SPUFS_MAGIC; + sb->s_op = &s_ops; + + return spufs_create_root(sb); +} + +static struct super_block * +spufs_get_sb(struct file_system_type *fstype, int flags, + const char *name, void *data) +{ + return get_sb_single(fstype, flags, data, spufs_fill_super); +} + +static struct file_system_type spufs_type = { + .owner = THIS_MODULE, + .name = "spufs", + .get_sb = spufs_get_sb, + .kill_sb = kill_litter_super, +}; + +static int spufs_init(void) +{ + int ret; + ret = -ENOMEM; + spufs_inode_cache = kmem_cache_create("spufs_inode_cache", + sizeof(struct spufs_inode_info), 0, + SLAB_HWCACHE_ALIGN, spufs_init_once, NULL); + + if (!spufs_inode_cache) + goto out; + ret = register_filesystem(&spufs_type); + if (ret) + kmem_cache_destroy(spufs_inode_cache); +out: + return ret; +} +module_init(spufs_init); + +static void spufs_exit(void) +{ + unregister_filesystem(&spufs_type); + kmem_cache_destroy(spufs_inode_cache); +} +module_exit(spufs_exit); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Arnd Bergmann "); + --- linux-cg.orig/include/asm-ppc64/spu.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/include/asm-ppc64/spu.h 2005-05-13 17:25:48.137933024 -0400 @@ -0,0 +1,463 @@ +/* + * SPU core / file system interface and HW structures + * + * (C) Copyright IBM Deutschland Entwicklung GmbH 2005 + * + * Author: Arnd Bergmann + * + * This 
program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#ifndef _SPU_H +#define _SPU_H + +#define LS_ORDER (6) /* 256 kb */ + +#define LS_SIZE (PAGE_SIZE << LS_ORDER) + +struct spu { + char *name; + u8 *local_store; + struct spu_problem __iomem *problem; + struct spu_priv1 __iomem *priv1; + struct spu_priv2 __iomem *priv2; + struct list_head list; + int number; + u32 isrc; + u32 node; + struct kref kref; + size_t ls_size; + unsigned int slb_replace; + struct mm_struct *mm; + struct task_struct *task; + + u32 stop_code; + wait_queue_head_t stop_wq; + wait_queue_head_t mbox_wq; + + char irq_c0[8]; + char irq_c1[8]; + char irq_c2[8]; +}; + +struct spu *spu_alloc(void); +void spu_free(struct spu *spu); +int spu_run(struct spu *spu); + +/* + * This defines the Local Store, Problem Area and Privilege Area of an SPU. 
+ */ + +union MFC_TagSizeClassCmd { + struct { + u16 mfc_size; + u16 mfc_tag; + u8 pad; + u8 mfc_rclassid; + u16 mfc_cmd; + } u; + struct { + u32 mfc_size_tag32; + u32 mfc_class_cmd32; + } by32; + u64 all64; +}; + +struct MFC_cq_sr { + u64 mfc_cq_data0_RW; + u64 mfc_cq_data1_RW; + u64 mfc_cq_data2_RW; + u64 mfc_cq_data3_RW; +}; + +struct spu_problem { + u8 pad_0x0000_0x3000[0x3000 - 0x0000]; /* 0x0000 */ + + /* DMA Area */ + u8 pad_0x3000_0x3004[0x4]; /* 0x3000 */ + u32 mfc_lsa_W; /* 0x3004 */ + u64 mfc_ea_W; /* 0x3008 */ + union MFC_TagSizeClassCmd mfc_union_W; /* 0x3010 */ + u8 pad_0x3018_0x3104[0xec]; /* 0x3018 */ + u32 dma_qstatus_R; /* 0x3104 */ + u8 pad_0x3108_0x3204[0xfc]; /* 0x3108 */ + u32 dma_querytype_RW; /* 0x3204 */ + u8 pad_0x3208_0x321c[0x14]; /* 0x3208 */ + u32 dma_querymask_RW; /* 0x321c */ + u8 pad_0x3220_0x322c[0xc]; /* 0x3220 */ + u32 dma_tagstatus_R; /* 0x322c */ +#define DMA_TAGSTATUS_INTR_ANY 1u +#define DMA_TAGSTATUS_INTR_ALL 2u + u8 pad_0x3230_0x4000[0x4000 - 0x3230]; /* 0x3230 */ + + /* SPU Control Area */ + u8 pad_0x4000_0x4004[0x4]; /* 0x4000 */ + u32 pu_mb_R; /* 0x4004 */ + u8 pad_0x4008_0x400c[0x4]; /* 0x4008 */ + u32 spu_mb_W; /* 0x400c */ + u8 pad_0x4010_0x4014[0x4]; /* 0x4010 */ + u32 mb_stat_R; /* 0x4014 */ + u8 pad_0x4018_0x401c[0x4]; /* 0x4018 */ + u32 spu_runcntl_RW; /* 0x401c */ +#define SPU_RUNCNTL_STOP 0L +#define SPU_RUNCNTL_RUNNABLE 1L + u8 pad_0x4020_0x4024[0x4]; /* 0x4020 */ + u32 spu_status_R; /* 0x4024 */ +#define SPU_STATUS_STOPPED 0x0 +#define SPU_STATUS_RUNNING 0x1 +#define SPU_STATUS_STOPPED_BY_STOP 0x2 +#define SPU_STATUS_STOPPED_BY_HALT 0x4 +#define SPU_STATUS_WAITING_FOR_CHANNEL 0x8 +#define SPU_STATUS_SINGLE_STEP 0x10 + u8 pad_0x4028_0x402c[0x4]; /* 0x4028 */ + u32 spu_spe_R; /* 0x402c */ + u8 pad_0x4030_0x4034[0x4]; /* 0x4030 */ + u32 spu_npc_RW; /* 0x4034 */ + u8 pad_0x4038_0x14000[0x14000 - 0x4038]; /* 0x4038 */ + + /* Signal Notification Area */ + u8 pad_0x14000_0x1400c[0xc]; /* 0x14000 */ + u32 
signal_notify1; /* 0x1400c */ + u8 pad_0x14010_0x1c00c[0x7ffc]; /* 0x14010 */ + u32 signal_notify2; /* 0x1c00c */ +} __attribute__ ((aligned(0x20000))); + +/* SPU Privilege 2 State Area */ +struct spu_priv2 { + /* MFC Registers */ + u8 pad_0x0000_0x1100[0x1100 - 0x0000]; /* 0x0000 */ + + /* SLB Management Registers */ + u8 pad_0x1100_0x1108[0x8]; /* 0x1100 */ + u64 slb_index_W; /* 0x1108 */ +#define SLB_INDEX_MASK 0x7L + u64 slb_esid_RW; /* 0x1110 */ + u64 slb_vsid_RW; /* 0x1118 */ +#define SLB_VSID_SUPERVISOR_STATE (0x1ull << 11) +#define SLB_VSID_SUPERVISOR_STATE_MASK (0x1ull << 11) +#define SLB_VSID_PROBLEM_STATE (0x1ull << 10) +#define SLB_VSID_PROBLEM_STATE_MASK (0x1ull << 10) +#define SLB_VSID_EXECUTE_SEGMENT (0x1ull << 9) +#define SLB_VSID_NO_EXECUTE_SEGMENT (0x1ull << 9) +#define SLB_VSID_EXECUTE_SEGMENT_MASK (0x1ull << 9) +#define SLB_VSID_4K_PAGE (0x0 << 8) +#define SLB_VSID_LARGE_PAGE (0x1ull << 8) +#define SLB_VSID_PAGE_SIZE_MASK (0x1ull << 8) +#define SLB_VSID_CLASS_MASK (0x1ull << 7) +#define SLB_VSID_VIRTUAL_PAGE_SIZE_MASK (0x1ull << 6) + u64 slb_invalidate_entry_W; /* 0x1120 */ + u64 slb_invalidate_all_W; /* 0x1128 */ + u8 pad_0x1130_0x2000[0x2000 - 0x1130]; /* 0x1130 */ + + /* Context Save / Restore Area */ + struct MFC_cq_sr spuq[16]; /* 0x2000 */ + struct MFC_cq_sr puq[8]; /* 0x2200 */ + u8 pad_0x2300_0x3000[0x3000 - 0x2300]; /* 0x2300 */ + + /* MFC Control */ + u64 mfc_control_RW; /* 0x3000 */ +#define MFC_CNTL_RESUME_DMA_QUEUE (0ull << 0) +#define MFC_CNTL_SUSPEND_DMA_QUEUE (1ull << 0) +#define MFC_CNTL_SUSPEND_DMA_QUEUE_MASK (1ull << 0) +#define MFC_CNTL_NORMAL_DMA_QUEUE_OPERATION (0ull << 8) +#define MFC_CNTL_SUSPEND_IN_PROGRESS (1ull << 8) +#define MFC_CNTL_SUSPEND_COMPLETE (3ull << 8) +#define MFC_CNTL_SUSPEND_DMA_STATUS_MASK (3ull << 8) +#define MFC_CNTL_DMA_QUEUES_EMPTY (1ull << 14) +#define MFC_CNTL_DMA_QUEUES_EMPTY_MASK (1ull << 14) +#define MFC_CNTL_PURGE_DMA_REQUEST (1ull << 15) +#define MFC_CNTL_PURGE_DMA_IN_PROGRESS (1ull << 24) 
+#define MFC_CNTL_PURGE_DMA_COMPLETE (3ull << 24) +#define MFC_CNTL_PURGE_DMA_STATUS_MASK (3ull << 24) +#define MFC_CNTL_RESTART_DMA_COMMAND (1ull << 32) +#define MFC_CNTL_DMA_COMMAND_REISSUE_PENDING (1ull << 32) +#define MFC_CNTL_DMA_COMMAND_REISSUE_STATUS_MASK (1ull << 32) +#define MFC_CNTL_MFC_PRIVILEGE_STATE (2ull << 33) +#define MFC_CNTL_MFC_PROBLEM_STATE (3ull << 33) +#define MFC_CNTL_MFC_KEY_PROTECTION_STATE_MASK (3ull << 33) +#define MFC_CNTL_DECREMENTER_HALTED (1ull << 35) +#define MFC_CNTL_DECREMENTER_RUNNING (1ull << 40) +#define MFC_CNTL_DECREMENTER_STATUS_MASK (1ull << 40) + u8 pad_0x3008_0x4000[0x4000 - 0x3008]; /* 0x3008 */ + + /* Interrupt Mailbox */ + u64 puint_mb_R; /* 0x4000 */ + u8 pad_0x4008_0x4040[0x4040 - 0x4008]; /* 0x4008 */ + + /* SPU Control */ + u64 spu_privcntl_RW; /* 0x4040 */ +#define SPU_PRIVCNTL_MODE_NORMAL (0x0ull << 0) +#define SPU_PRIVCNTL_MODE_SINGLE_STEP (0x1ull << 0) +#define SPU_PRIVCNTL_MODE_MASK (0x1ull << 0) +#define SPU_PRIVCNTL_NO_ATTENTION_EVENT (0x0ull << 1) +#define SPU_PRIVCNTL_ATTENTION_EVENT (0x1ull << 1) +#define SPU_PRIVCNTL_ATTENTION_EVENT_MASK (0x1ull << 1) +#define SPU_PRIVCNT_LOAD_REQUEST_NORMAL (0x0ull << 2) +#define SPU_PRIVCNT_LOAD_REQUEST_ENABLE_MASK (0x1ull << 2) + u8 pad_0x4048_0x4058[0x10]; /* 0x4048 */ + u64 spu_lslr_RW; /* 0x4058 */ + u64 spu_chnlcntptr_RW; /* 0x4060 */ + u64 spu_chnlcnt_RW; /* 0x4068 */ + u64 spu_chnldata_RW; /* 0x4070 */ + u64 spu_cfg_RW; /* 0x4078 */ + u8 pad_0x4080_0x5000[0x5000 - 0x4080]; /* 0x4080 */ + + /* PV2_ImplRegs: Implementation-specific privileged-state 2 regs */ + u64 spu_pm_trace_tag_status_RW; /* 0x5000 */ + u64 spu_tag_status_query_RW; /* 0x5008 */ +#define TAG_STATUS_QUERY_CONDITION_BITS (0x3ull << 32) +#define TAG_STATUS_QUERY_MASK_BITS (0xffffffffull) + u64 spu_cmd_buf1_RW; /* 0x5010 */ +#define SPU_COMMAND_BUFFER_1_LSA_BITS (0x7ffffull << 32) +#define SPU_COMMAND_BUFFER_1_EAH_BITS (0xffffffffull) + u64 spu_cmd_buf2_RW; /* 0x5018 */ +#define 
SPU_COMMAND_BUFFER_2_EAL_BITS ((0xffffffffull) << 32) +#define SPU_COMMAND_BUFFER_2_TS_BITS (0xffffull << 16) +#define SPU_COMMAND_BUFFER_2_TAG_BITS (0x3full) + u64 spu_atomic_status_RW; /* 0x5020 */ +} __attribute__ ((aligned(0x20000))); + +/* SPU Privilege 1 State Area */ +struct spu_priv1 { + /* Control and Configuration Area */ + u64 mfc_sr1_RW; /* 0x000 */ +#define MFC_STATE1_LOCAL_STORAGE_DECODE_MASK 0x01ull +#define MFC_STATE1_BUS_TLBIE_MASK 0x02ull +#define MFC_STATE1_REAL_MODE_OFFSET_ENABLE_MASK 0x04ull +#define MFC_STATE1_PROBLEM_STATE_MASK 0x08ull +#define MFC_STATE1_RELOCATE_MASK 0x10ull +#define MFC_STATE1_MASTER_RUN_CONTROL_MASK 0x20ull + u64 mfc_lpid_RW; /* 0x008 */ + u64 spu_idr_RW; /* 0x010 */ + u64 mfc_vr_RO; /* 0x018 */ +#define MFC_VERSION_BITS (0xffff << 16) +#define MFC_REVISION_BITS (0xffff) +#define MFC_GET_VERSION_BITS(vr) (((vr) & MFC_VERSION_BITS) >> 16) +#define MFC_GET_REVISION_BITS(vr) ((vr) & MFC_REVISION_BITS) + u64 spu_vr_RO; /* 0x020 */ +#define SPU_VERSION_BITS (0xffff << 16) +#define SPU_REVISION_BITS (0xffff) +#define SPU_GET_VERSION_BITS(vr) (((vr) & SPU_VERSION_BITS) >> 16) +#define SPU_GET_REVISION_BITS(vr) ((vr) & SPU_REVISION_BITS) + u8 pad_0x28_0x100[0x100 - 0x28]; /* 0x28 */ + + + /* Interrupt Area */ + u64 int_mask_class0_RW; /* 0x100 */ +#define CLASS0_ENABLE_DMA_ALIGNMENT_INTR 0x1L +#define CLASS0_ENABLE_INVALID_DMA_COMMAND_INTR 0x2L +#define CLASS0_ENABLE_SPU_ERROR_INTR 0x4L +#define CLASS0_ENABLE_MFC_FIR_INTR 0x8L + u64 int_mask_class1_RW; /* 0x108 */ +#define CLASS1_ENABLE_SEGMENT_FAULT_INTR 0x1L +#define CLASS1_ENABLE_STORAGE_FAULT_INTR 0x2L +#define CLASS1_ENABLE_LS_COMPARE_SUSPEND_ON_GET_INTR 0x4L +#define CLASS1_ENABLE_LS_COMPARE_SUSPEND_ON_PUT_INTR 0x8L + u64 int_mask_class2_RW; /* 0x110 */ +#define CLASS2_ENABLE_MAILBOX_INTR 0x1L +#define CLASS2_ENABLE_SPU_STOP_INTR 0x2L +#define CLASS2_ENABLE_SPU_HALT_INTR 0x4L +#define CLASS2_ENABLE_SPU_DMA_TAG_GROUP_COMPLETE_INTR 0x8L + u8 pad_0x118_0x140[0x28]; /* 0x118 */ + u64 
int_stat_class0_RW; /* 0x140 */ + u64 int_stat_class1_RW; /* 0x148 */ + u64 int_stat_class2_RW; /* 0x150 */ + u8 pad_0x158_0x180[0x28]; /* 0x158 */ + u64 int_route_RW; /* 0x180 */ + + /* Interrupt Routing */ + u8 pad_0x188_0x200[0x200 - 0x188]; /* 0x188 */ + + /* Atomic Unit Control Area */ + u64 mfc_atomic_flush_RW; /* 0x200 */ +#define mfc_atomic_flush_enable 0x1L + u8 pad_0x208_0x280[0x78]; /* 0x208 */ + u64 resource_allocation_groupID_RW; /* 0x280 */ + u64 resource_allocation_enable_RW; /* 0x288 */ + u8 pad_0x290_0x380[0x380 - 0x290]; /* 0x290 */ + + /* MFC Fault Isolation Area */ + /* mfc_fir_R: MFC Fault Isolation Register. + * mfc_fir_status_or_W: MFC Fault Isolation Status OR Register. + * mfc_fir_status_and_W: MFC Fault Isolation Status AND Register. + * mfc_fir_mask_R: MFC FIR Mask Register. + * mfc_fir_mask_or_W: MFC FIR Mask OR Register. + * mfc_fir_mask_and_W: MFC FIR Mask AND Register. + * mfc_fir_chkstp_enable_W: MFC FIR Checkstop Enable Register. + */ + u64 mfc_fir_R; /* 0x380 */ + u64 mfc_fir_status_or_W; /* 0x388 */ + u64 mfc_fir_status_and_W; /* 0x390 */ + u64 mfc_fir_mask_R; /* 0x398 */ + u64 mfc_fir_mask_or_W; /* 0x3a0 */ + u64 mfc_fir_mask_and_W; /* 0x3a8 */ + u64 mfc_fir_chkstp_enable_RW; /* 0x3b0 */ + u8 pad_0x3b8_0x3c8[0x3c8 - 0x3b8]; /* 0x3b8 */ + + /* SPU_Cache_ImplRegs: Implementation-dependent cache registers */ + + u64 smf_sbi_signal_sel; /* 0x3c8 */ +#define smf_sbi_mask_lsb 56 +#define smf_sbi_shift (63 - smf_sbi_mask_lsb) +#define smf_sbi_mask (0x301LL << smf_sbi_shift) +#define smf_sbi_bus0_bits (0x001LL << smf_sbi_shift) +#define smf_sbi_bus2_bits (0x100LL << smf_sbi_shift) +#define smf_sbi2_bus0_bits (0x201LL << smf_sbi_shift) +#define smf_sbi2_bus2_bits (0x300LL << smf_sbi_shift) + u64 smf_ato_signal_sel; /* 0x3d0 */ +#define smf_ato_mask_lsb 35 +#define smf_ato_shift (63 - smf_ato_mask_lsb) +#define smf_ato_mask (0x3LL << smf_ato_shift) +#define smf_ato_bus0_bits (0x2LL << smf_ato_shift) +#define smf_ato_bus2_bits (0x1LL << 
smf_ato_shift) + u8 pad_0x3d8_0x400[0x400 - 0x3d8]; /* 0x3d8 */ + + /* TLB Management Registers */ + u64 mfc_sdr_RW; /* 0x400 */ + u8 pad_0x408_0x500[0xf8]; /* 0x408 */ + u64 tlb_index_hint_RO; /* 0x500 */ + u64 tlb_index_W; /* 0x508 */ + u64 tlb_vpn_RW; /* 0x510 */ + u64 tlb_rpn_RW; /* 0x518 */ + u8 pad_0x520_0x540[0x20]; /* 0x520 */ + u64 tlb_invalidate_entry_W; /* 0x540 */ + u64 tlb_invalidate_all_W; /* 0x548 */ + u8 pad_0x550_0x580[0x580 - 0x550]; /* 0x550 */ + + /* SPU_MMU_ImplRegs: Implementation-dependent MMU registers */ + u64 smm_hid; /* 0x580 */ +#define PAGE_SIZE_MASK 0xf000000000000000ull +#define PAGE_SIZE_16MB_64KB 0x2000000000000000ull + u8 pad_0x588_0x600[0x600 - 0x588]; /* 0x588 */ + + /* MFC Status/Control Area */ + u64 mfc_accr_RW; /* 0x600 */ +#define MFC_ACCR_EA_ACCESS_GET (1 << 0) +#define MFC_ACCR_EA_ACCESS_PUT (1 << 1) +#define MFC_ACCR_LS_ACCESS_GET (1 << 3) +#define MFC_ACCR_LS_ACCESS_PUT (1 << 4) + u8 pad_0x608_0x610[0x8]; /* 0x608 */ + u64 mfc_dsisr_RW; /* 0x610 */ +#define MFC_DSISR_PTE_NOT_FOUND (1 << 30) +#define MFC_DSISR_ACCESS_DENIED (1 << 27) +#define MFC_DSISR_ATOMIC (1 << 26) +#define MFC_DSISR_ACCESS_PUT (1 << 25) +#define MFC_DSISR_ADDR_MATCH (1 << 22) +#define MFC_DSISR_LS (1 << 17) +#define MFC_DSISR_L (1 << 16) +#define MFC_DSISR_ADDRESS_OVERFLOW (1 << 0) + u8 pad_0x618_0x620[0x8]; /* 0x618 */ + u64 mfc_dar_RW; /* 0x620 */ + u8 pad_0x628_0x700[0x700 - 0x628]; /* 0x628 */ + + /* Replacement Management Table (RMT) Area */ + u64 rmt_index_RW; /* 0x700 */ + u8 pad_0x708_0x710[0x8]; /* 0x708 */ + u64 rmt_data1_RW; /* 0x710 */ + u8 pad_0x718_0x800[0x800 - 0x718]; /* 0x718 */ + + /* Control/Configuration Registers */ + u64 mfc_dsir_R; /* 0x800 */ +#define MFC_DSIR_Q (1 << 31) +#define MFC_DSIR_SPU_QUEUE MFC_DSIR_Q + u64 mfc_lsacr_RW; /* 0x808 */ +#define MFC_LSACR_COMPARE_MASK ((~0ull) << 32) +#define MFC_LSACR_COMPARE_ADDR ((~0ull) >> 32) + u64 mfc_lscrr_R; /* 0x810 */ +#define MFC_LSCRR_Q (1 << 31) +#define MFC_LSCRR_SPU_QUEUE 
MFC_LSCRR_Q +#define MFC_LSCRR_QI_SHIFT 32 +#define MFC_LSCRR_QI_MASK ((~0ull) << MFC_LSCRR_QI_SHIFT) + u8 pad_0x818_0x900[0x900 - 0x818]; /* 0x818 */ + + /* Real Mode Support Registers */ + u64 mfc_rm_boundary; /* 0x900 */ + u8 pad_0x908_0x938[0x30]; /* 0x908 */ + u64 smf_dma_signal_sel; /* 0x938 */ +#define mfc_dma1_mask_lsb 41 +#define mfc_dma1_shift (63 - mfc_dma1_mask_lsb) +#define mfc_dma1_mask (0x3LL << mfc_dma1_shift) +#define mfc_dma1_bits (0x1LL << mfc_dma1_shift) +#define mfc_dma2_mask_lsb 43 +#define mfc_dma2_shift (63 - mfc_dma2_mask_lsb) +#define mfc_dma2_mask (0x3LL << mfc_dma2_shift) +#define mfc_dma2_bits (0x1LL << mfc_dma2_shift) + u8 pad_0x940_0xa38[0xf8]; /* 0x940 */ + u64 smm_signal_sel; /* 0xa38 */ +#define smm_sig_mask_lsb 12 +#define smm_sig_shift (63 - smm_sig_mask_lsb) +#define smm_sig_mask (0x3LL << smm_sig_shift) +#define smm_sig_bus0_bits (0x2LL << smm_sig_shift) +#define smm_sig_bus2_bits (0x1LL << smm_sig_shift) + u8 pad_0xa40_0xc00[0xc00 - 0xa40]; /* 0xa40 */ + + /* DMA Command Error Area */ + u64 mfc_cer_R; /* 0xc00 */ +#define MFC_CER_Q (1 << 31) +#define MFC_CER_SPU_QUEUE MFC_CER_Q + u8 pad_0xc08_0x1000[0x1000 - 0xc08]; /* 0xc08 */ + + /* PV1_ImplRegs: Implementation-dependent privileged-state 1 regs */ + /* DMA Command Error Area */ + u64 spu_ecc_cntl_RW; /* 0x1000 */ +#define SPU_ECC_CNTL_E (1ull << 0ull) +#define SPU_ECC_CNTL_ENABLE SPU_ECC_CNTL_E +#define SPU_ECC_CNTL_DISABLE (~SPU_ECC_CNTL_E & 1L) +#define SPU_ECC_CNTL_S (1ull << 1ull) +#define SPU_ECC_STOP_AFTER_ERROR SPU_ECC_CNTL_S +#define SPU_ECC_CONTINUE_AFTER_ERROR (~SPU_ECC_CNTL_S & 2L) +#define SPU_ECC_CNTL_B (1ull << 2ull) +#define SPU_ECC_BACKGROUND_ENABLE SPU_ECC_CNTL_B +#define SPU_ECC_BACKGROUND_DISABLE (~SPU_ECC_CNTL_B & 4L) +#define SPU_ECC_CNTL_I_SHIFT 3ull +#define SPU_ECC_CNTL_I_MASK (3ull << SPU_ECC_CNTL_I_SHIFT) +#define SPU_ECC_WRITE_ALWAYS (~SPU_ECC_CNTL_I & 12L) +#define SPU_ECC_WRITE_CORRECTABLE (1ull << SPU_ECC_CNTL_I_SHIFT) +#define 
SPU_ECC_WRITE_UNCORRECTABLE (3ull << SPU_ECC_CNTL_I_SHIFT) +#define SPU_ECC_CNTL_D (1ull << 5ull) +#define SPU_ECC_DETECTION_ENABLE SPU_ECC_CNTL_D +#define SPU_ECC_DETECTION_DISABLE (~SPU_ECC_CNTL_D & 32L) + u64 spu_ecc_stat_RW; /* 0x1008 */ +#define SPU_ECC_CORRECTED_ERROR (1ull << 0ul) +#define SPU_ECC_UNCORRECTED_ERROR (1ull << 1ul) +#define SPU_ECC_SCRUB_COMPLETE (1ull << 2ul) +#define SPU_ECC_SCRUB_IN_PROGRESS (1ull << 3ul) +#define SPU_ECC_INSTRUCTION_ERROR (1ull << 4ul) +#define SPU_ECC_DATA_ERROR (1ull << 5ul) +#define SPU_ECC_DMA_ERROR (1ull << 6ul) +#define SPU_ECC_STATUS_CNT_MASK (256ull << 8) + u64 spu_ecc_addr_RW; /* 0x1010 */ + u64 spu_err_mask_RW; /* 0x1018 */ +#define SPU_ERR_ILLEGAL_INSTR (1ull << 0ul) +#define SPU_ERR_ILLEGAL_CHANNEL (1ull << 1ul) + u8 pad_0x1020_0x1028[0x1028 - 0x1020]; /* 0x1020 */ + + /* SPU Debug-Trace Bus (DTB) Selection Registers */ + u64 spu_trig0_sel; /* 0x1028 */ + u64 spu_trig1_sel; /* 0x1030 */ + u64 spu_trig2_sel; /* 0x1038 */ + u64 spu_trig3_sel; /* 0x1040 */ + u64 spu_trace_sel; /* 0x1048 */ +#define spu_trace_sel_mask 0x1f1fLL +#define spu_trace_sel_bus0_bits 0x1000LL +#define spu_trace_sel_bus2_bits 0x0010LL + u64 spu_event0_sel; /* 0x1050 */ + u64 spu_event1_sel; /* 0x1058 */ + u64 spu_event2_sel; /* 0x1060 */ + u64 spu_event3_sel; /* 0x1068 */ + u64 spu_trace_cntl; /* 0x1070 */ +} __attribute__ ((aligned(0x2000))); + +#endif --- linux-cg.orig/mm/memory.c 2005-05-13 15:15:07.883989176 -0400 +++ linux-cg/mm/memory.c 2005-05-13 17:25:48.140932568 -0400 @@ -2194,6 +2194,7 @@ unsigned long vmalloc_to_pfn(void * vmal { return page_to_pfn(vmalloc_to_page(vmalloc_addr)); } +EXPORT_SYMBOL_GPL(handle_mm_fault); EXPORT_SYMBOL(vmalloc_to_pfn); From arnd at arndb.de Sat May 14 05:31:09 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 13 May 2005 21:31:09 +0200 Subject: [PATCH 0/8] ppc64: Introduce Cell/BPA platform, v2 Message-ID: <200505132117.37461.arnd@arndb.de> This series of patches add support for a fifth platform 
type in the ppc64 architecture tree. The Broadband Processor Architecture (BPA) is what machines using the Cell processor should be following, and currently only prototype hardware exists for it. Except for the last patch, these are functionally the same as the first version but are updated for 2.6.12-rc4 and contain changes based on the feedback I got so far. The first three patches add some infrastructure that is used by BPA machines but is not really specific to them and could be used by other new platform types as well. The next three patches add the actual platform code, which should be usable for any BPA-compatible implementation. Patch 7 introduces a new file system to make use of the SPUs inside the processors. This patch is still in a prototype stage and not intended for merging yet. The final patch adds some user space code in the Documentation directory that clarifies how to use the file system. This one should become a separate package at a later point. Arnd <>< From arnd at arndb.de Sat May 14 05:32:32 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Fri, 13 May 2005 21:32:32 +0200 Subject: [PATCH 8/8] ppc64: add spufs user library In-Reply-To: <200505132117.37461.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> Message-ID: <200505132132.33772.arnd@arndb.de> This adds a user space library as a counterpart to the kernel side spufs. Since the hardware is not available yet, this is mostly for documenting the spufs API and is not intended for merging into mainline. As the API matures, libspu will become a separate package. From: Dirk Herrendörfer Signed-off-by: Arnd Bergmann --- linux-cg.orig/Documentation/bpa/libspu/Makefile 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/Makefile 2005-05-13 11:37:31.481923136 -0400 @@ -0,0 +1,41 @@ +#* +#* libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS +#* Copyright (C) 2005 IBM Corp. 
+#* +#* This library is free software; you can redistribute it and/or modify it +#* under the terms of the GNU Lesser General Public License as published by +#* the Free Software Foundation; either version 2.1 of the License, +#* or (at your option) any later version. +#* +#* This library is distributed in the hope that it will be useful, but +#* WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +#* or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +#* License for more details. +#* +#* You should have received a copy of the GNU Lesser General Public License +#* along with this library; if not, write to the Free Software Foundation, +#* Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +#* + +CC := gcc +CTAGS = ctags + +CFLAGS := -O2 -m32 -Wall -I. -Iinclude -DDEBUG -g \ + +libspu_OBJS := elf_loader.o bpathread.o +OBJS := libbpathread.a $(libspu_OBJS) + +all: $(OBJS) + +libbpathread.a: $(libspu_OBJS) + ar -r $@ $(libspu_OBJS) + +tests: + make -C test/start-stop + +tags: + $(CTAGS) -R . + +clean: + rm -f $(OBJS) *~ tags + make -C test/start-stop clean --- linux-cg.orig/Documentation/bpa/libspu/README 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/README 2005-05-13 11:37:31.482922984 -0400 @@ -0,0 +1,7 @@ +This is an example of how to use the SPEs in applications. It is by +no means complete or perfect - it is meant as a reference of how to +use the SPUFS implementation in linux. +As more and more features of SPUFS become available, this code will +be extended too. + +D.Herrendoerfer --- linux-cg.orig/Documentation/bpa/libspu/bpathread.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/bpathread.c 2005-05-13 11:37:31.483922832 -0400 @@ -0,0 +1,314 @@ +/* + * libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS + * Copyright (C) 2005 IBM Corp. 
+ * + * This library is free software; you can redistribute it and/or modify it + * under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, + * or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. + * + * You should have received a copy of the GNU Lesser General Public License + * along with this library; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +#include +#include +#include + +#define __PRINTF(fmt, args...) { fprintf(stderr,fmt , ## args); } +#ifdef DEBUG +#define DEBUG_PRINTF(fmt, args...) __PRINTF(fmt , ## args) +#else +#define DEBUG_PRINTF(fmt, args...) 
+#endif + + +int thread_num = 0; + +/* + * Helpers + * + * */ + +struct thread_start_info +{ + char pathname[40]; /* */ + int thread_num; /* */ +}; + +struct thread_store +{ + pthread_t spe_thread; + int thread_return_value; + int fd_mbox; + unsigned int state; +}; + +static struct thread_store spe_thread_store[1024]; + +/* + * int spe_ldr[]: + * SPE code that performs the actual parameter setting: + */ +static int spe_ldr[] = { + 0x30fff083, + 0x30fff284, + 0x30fff485, + 0x30fff686, + 0x30fff000, + 0x35000000, + 0x00000000, + 0x00000000 +}; + +/** + * Library API + * + */ + +speid_t +spe_create_thread (int gid, void *start, void *argp, void *envp, int mask, + int flags) +{ + int rc, memfd; + addr64 argp64, envp64; + pthread_t thread; + char memname[40], pathname[40]; + void *spe_ld_buf; + ssize_t count = 0, num = 0; + struct spe_ld_info ld_info; + struct thread_start_info *thread_info; + struct spe_exec_params spe_params __attribute__ ((aligned (4096))); + + DEBUG_PRINTF ("spe_create_thread(0x%x, %p, %p, %p, 0x%x, 0x%x)\n", + gid, start, argp, envp, mask, flags); + + argp64.ull = (unsigned long long) (unsigned long) argp; + envp64.ull = (unsigned long long) (unsigned long) envp; + + /* Make the SPU Directory */ + + sprintf (pathname, "/spu/bpathread-%i-%i", getpid (), thread_num); + + DEBUG_PRINTF ("mkdir %s\n", pathname); + + rc = mkdir (pathname, S_IRUSR | S_IWUSR | S_IXUSR); + if (rc < 0) + { + DEBUG_PRINTF ("Could not make dir %s\n", pathname); + return -1; + } + + sprintf (memname, "%s/mem", pathname); + + /* Check SPE */ + + memfd = open (memname, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR); + if (memfd < 0) + { + DEBUG_PRINTF ("Could not open SPE mem file.\n"); + return -1; + } + + /* Prepare Loader */ + + spe_ld_buf = malloc (LS_SIZE); + thread_info = malloc (sizeof (*thread_info)); + + if (!spe_ld_buf || !thread_info) + { + DEBUG_PRINTF ("Could not allocate SPE memory. 
\n"); + errno = ENOMEM; + return -1; + } + + memset(spe_ld_buf, 0, LS_SIZE); + + rc = load_spe_elf (start, spe_ld_buf, &ld_info); + if (rc != 0) + { + DEBUG_PRINTF ("Load SPE ELF failed..\n"); + return -1; + } + + /* Add SPE exec program */ + + DEBUG_PRINTF ("Add exec prog dst:0x%04x size:0x%04x\n", + SPE_LDR_START, sizeof (spe_ldr)); + memcpy (spe_ld_buf + SPE_LDR_START, &spe_ldr, sizeof (spe_ldr)); + + /* Add SPE exec parameters */ + + spe_params.entry = ld_info.entry; + spe_params.gpr4[0] = argp64.ui[0]; + spe_params.gpr4[1] = argp64.ui[1]; + spe_params.gpr5[0] = envp64.ui[0]; + spe_params.gpr5[1] = envp64.ui[1]; + + DEBUG_PRINTF ("Add exec param dst:0x%04x size:0x%04x\n", + SPE_PARAM_START, sizeof (spe_params)); + memcpy (spe_ld_buf + SPE_PARAM_START, &spe_params, + sizeof (spe_params)); + + /* Copy SPE image to SPUfs */ + do + { + num = write (memfd, spe_ld_buf + count, LS_SIZE - count); + if (num == -1) + { + DEBUG_PRINTF ("Transfer SPE ELF failed..\n"); + return -1; + } + + count += num; + } + while (count < LS_SIZE && num); + close (memfd); + + /* Free the SPE Buffer */ + free (spe_ld_buf); + + strcpy (thread_info->pathname, pathname); + thread_info->thread_num = thread_num; + + spe_thread_store[thread_num].state = BPA_THREAD_START; + + rc = pthread_create (&thread, NULL, spe_thread, thread_info); + + rc = thread_num; + + while (spe_thread_store[thread_num].state != BPA_THREAD_IDLE) + { + thread_num++; + } + + spe_thread_store[rc].spe_thread = thread; + + return rc; +} + +int +spe_wait (speid_t speid, int *status, int options) +{ + int rc; + + DEBUG_PRINTF ("spu_wait(0x%x, %p, 0x%x)\n", speid, status, options); + + rc = pthread_join (spe_thread_store[speid].spe_thread, + (void **) status); + + spe_thread_store[speid].state = BPA_THREAD_IDLE; + + DEBUG_PRINTF ("Thread ended.\n"); + return rc; +} + +int +spe_kill (speid_t speid, int sig) +{ + int rc; + + rc = pthread_kill (spe_thread_store[speid].spe_thread, sig); + + return rc; +} + +/* + * Thread Code + * + 
* */ + +void * +spe_thread (void *ptr) +{ + char runname[40], mboxname[40], pathname[40]; + int runfd, mboxfd; + int num; + struct thread_start_info *thread_info; + + struct spufs_run_arg + { + unsigned npc; /* inout: Next Program Counter */ + unsigned short code; /* out: SPU status */ + unsigned short status; + }; + struct spufs_run_arg arg = { SPE_LDR_PROG_start, }; + + DEBUG_PRINTF ("In thread\n"); + + thread_info = (struct thread_start_info *) ptr; + + num = thread_info->thread_num; + strcpy (pathname, thread_info->pathname); + + free (thread_info); + + DEBUG_PRINTF ("thread: %i.\n", num); + DEBUG_PRINTF ("pathname: %s.\n", pathname); + + sprintf (runname, "%s/run", pathname); + runfd = open (runname, O_RDONLY); + if (runfd < 0) + { + DEBUG_PRINTF ("Could not open SPU run file.\n"); + spe_thread_store[num].thread_return_value = -EINVAL; + pthread_exit ((void *) spe_thread_store[num]. + thread_return_value); + } + + sprintf (mboxname, "%s/m_box", pathname); + mboxfd = open (mboxname, O_RDONLY); + if (mboxfd < 0) + { + DEBUG_PRINTF ("Could not open SPE mailbox file.\n"); + spe_thread_store[num].thread_return_value = -EINVAL; + pthread_exit ((void *) spe_thread_store[num]. + thread_return_value); + } + else + { + spe_thread_store[num].fd_mbox = mboxfd; + } + + spe_thread_store[num].state = BPA_THREAD_RUNNING; + + int ret = ioctl (runfd, _IOWR ('s', 0, struct spufs_run_arg), + &arg); /* unmasked, so a negative error return stays negative */ + if (ret < 0) + { + DEBUG_PRINTF ("Could not ioctl() on SPE run file.\n"); + spe_thread_store[num].thread_return_value = -EINVAL; + pthread_exit ((void *) spe_thread_store[num]. 
+ thread_return_value); + } + + close (runfd); + + spe_thread_store[num].state = BPA_THREAD_ENDED; + + DEBUG_PRINTF ("SPE thread result: %08x:%04x:%04x\n", arg.npc, + arg.code, arg.status); + + spe_thread_store[num].thread_return_value = arg.code; + return ((void *) spe_thread_store[num].thread_return_value); +} --- linux-cg.orig/Documentation/bpa/libspu/bpathread.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/bpathread.h 2005-05-13 11:37:31.484922680 -0400 @@ -0,0 +1,47 @@ +/* + * libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS + * Copyright (C) 2005 IBM Corp. + * + * This library is free software; you can redistribute it and/or modify it + * under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, + * or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. + * + * You should have received a copy of the GNU Lesser General Public License + * along with this library; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +#ifndef _bpathread_h_ +#define _bpathread_h_ + +typedef int speid_t; + +/* APIs for SPE threads. 
+ */ + +extern speid_t spe_create_thread (int gid, void *start, + void *argp, void *envp, + int mask, int flags); + +extern int spe_wait (speid_t speid, int *status, int options); + +extern int spe_kill (speid_t speid, int sig); + + +/* SPE-thread internals + */ + +void *spe_thread (void *ptr); + +#define BPA_THREAD_IDLE 0 +#define BPA_THREAD_START 1 +#define BPA_THREAD_RUNNING 2 +#define BPA_THREAD_ENDED 3 + + +#endif --- linux-cg.orig/Documentation/bpa/libspu/elf_loader.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/elf_loader.c 2005-05-13 11:37:31.484922680 -0400 @@ -0,0 +1,130 @@ +/* libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS + * Copyright (C) 2005 IBM Corp. + * This library is free software; you can redistribute it and/or modify it + * under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, + * or (at your option) any later version. + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. + * You should have received a copy of the GNU Lesser General Public License + * along with this library; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define __PRINTF(fmt, args...) { fprintf(stderr,fmt , ## args); } +#ifdef DEBUG +#define DEBUG_PRINTF(fmt, args...) __PRINTF(fmt , ## args) +#else +#define DEBUG_PRINTF(fmt, args...) 
+#endif + +int +load_spe_elf (void *elf_start, void *ld_buffer, struct spe_ld_info *ld_info) +{ + Elf32_Ehdr *ehdr; + static const unsigned char expected[EI_PAD] = { + [EI_MAG0] = ELFMAG0, + [EI_MAG1] = ELFMAG1, + [EI_MAG2] = ELFMAG2, + [EI_MAG3] = ELFMAG3, + [EI_CLASS] = ELFCLASS32, + [EI_DATA] = ELFDATA2MSB, + [EI_VERSION] = EV_CURRENT, + [EI_OSABI] = ELFOSABI_SYSV, + [EI_ABIVERSION] = 0 + }; + Elf32_Phdr *phdr; + Elf32_Phdr *ph; + int num_load_seg = 0; + + DEBUG_PRINTF ("load_spe_elf(%p, %p)\n", elf_start, ld_buffer); + ehdr = (Elf32_Ehdr *) elf_start; + + /* Validate ELF */ + if (memcmp (ehdr->e_ident, expected, EI_PAD) != 0) + { + DEBUG_PRINTF ("invalid ELF header.\n"); + DEBUG_PRINTF ("expected 0x%016llX != 0x%016llX\n", + *(long long *) expected, *(long long *) ehdr); + errno = EINVAL; + return -errno; + } + + /* Validate the machine type */ + if (ehdr->e_machine != 0x17) + { + DEBUG_PRINTF ("not an SPE ELF object"); + errno = EINVAL; + return -errno; + } + + /* Validate ELF object type. */ + if (ehdr->e_type != ET_EXEC) + { + DEBUG_PRINTF ("invalid SPE ELF type.\n"); + DEBUG_PRINTF ("SPU type %d != %d\n", ehdr->e_type, ET_EXEC); + errno = EINVAL; + DEBUG_PRINTF ("parse_spu_elf(): errno=%d.\n", errno); + return -errno; + } + + /* Start processing headers */ + phdr = (Elf32_Phdr *) ((char *) ehdr + ehdr->e_phoff); + + /* + * Load all PT_LOAD segments onto the SPU local store buffer. + */ + DEBUG_PRINTF ("Segments: 0x%x\n", ehdr->e_phnum); + for (ph = phdr; ph < &phdr[ehdr->e_phnum]; ++ph) + { + switch (ph->p_type) + { + case PT_LOAD: + /* DEBUG_PRINTF ("PT_LOAD)\n"); */ + /* Only LOAD non-zero segments. 
*/ + if (ph->p_filesz) + { + num_load_seg++; + + DEBUG_PRINTF + ("SPE_LOAD %p (0x%x) -> %p (0x%x) (%i bytes)\n", + ld_buffer + ph->p_vaddr, + ph->p_vaddr, + elf_start + ph->p_paddr, + ph->p_paddr, ph->p_filesz); + memcpy (ld_buffer + ph->p_vaddr, + elf_start + ph->p_paddr, + ph->p_filesz); + } + break; + } + } + if (num_load_seg == 0) + { + DEBUG_PRINTF ("no segments to load"); + errno = EINVAL; + return -errno; + } + + /* Remember where the code wants to be started */ + ld_info->entry = ehdr->e_entry; + DEBUG_PRINTF ("entry = 0x%x\n", ehdr->e_entry); + + return 0; + +} --- linux-cg.orig/Documentation/bpa/libspu/elf_loader.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/elf_loader.h 2005-05-13 11:37:31.485922528 -0400 @@ -0,0 +1,40 @@ +/* + * libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS + * Copyright (C) 2005 IBM Corp. + * + * This library is free software; you can redistribute it and/or modify it + * under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, + * or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public License + * along with this library; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#define LS_SIZE 0x40000 /* 256K (in bytes) */ + +#define SPE_LDR_PROG_start (LS_SIZE - 512) // location of spu_ld.so prog +#define SPE_LDR_PARAMS_start (LS_SIZE - 128) // location of spu_ldr_params + +typedef union +{ + unsigned long long ull; + unsigned int ui[2]; +} addr64; + +struct spe_ld_info +{ + unsigned int entry; /* Entry point of SPU image */ +}; + +/* + * Global API : */ + +int load_spe_elf (void *elf_start, void *ld_buffer, + struct spe_ld_info *ld_info); --- linux-cg.orig/Documentation/bpa/libspu/spe_exec.h 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/spe_exec.h 2005-05-13 11:37:31.485922528 -0400 @@ -0,0 +1,43 @@ +/* + * libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS + * Copyright (C) 2005 IBM Corp. + * + * This library is free software; you can redistribute it and/or modify it + * under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, + * or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public License + * along with this library; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _spe_exec_h_ +#define _spe_exec_h_ + +#define SPE_LDR_START 0x0003fe00 +#define SPE_PARAM_START 0x0003ff80 + + +/* + * struct spe_exec_params: + * + * Holds the (per thread) parameters for the spe program +*/ + +struct spe_exec_params +{ + unsigned int entry; /* entry point for application. */ + unsigned int gpr3[4]; /* initial setting for $3 */ + unsigned int gpr4[4]; /* initial setting for $4 */ + unsigned int gpr5[4]; /* initial setting for $5 */ + unsigned int gpr6[4]; /* initial setting for $6 */ + +}; + +#endif --- linux-cg.orig/Documentation/bpa/libspu/test/start-stop/Makefile 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/test/start-stop/Makefile 2005-05-13 11:37:31.486922376 -0400 @@ -0,0 +1,43 @@ +#* +#* libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS +#* Copyright (C) 2005 IBM Corp. +#* +#* This library is free software; you can redistribute it and/or modify it +#* under the terms of the GNU Lesser General Public License as published by +#* the Free Software Foundation; either version 2.1 of the License, +#* or (at your option) any later version. +#* +#* This library is distributed in the hope that it will be useful, but +#* WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY +#* or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public +#* License for more details. +#* +#* You should have received a copy of the GNU Lesser General Public License +#* along with this library; if not, write to the Free Software Foundation, +#* Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +#* + +CC := gcc +SPECC := spu-gcc +CTAGS = ctags + +CFLAGS := -O2 -m32 -Wall -I../.. 
-I../../include -g +SPECFLAGS := -O2 -Wall -I../../include + +LDFLAGS := -m32 +LIBS := -L../.. -l bpathread -l pthread + +SPE_OBJS := spe-start-stop +OBJS := ppe-start-stop + +all: $(OBJS) $(SPE_OBJS) + +clean: + rm -f $(OBJS) $(SPE_OBJS) + +ppe-start-stop: ppe-start-stop.c + $(CC) -o $@ $< $(CFLAGS) $(LDFLAGS) $(LIBS) + +spe-start-stop: spe-start-stop.c + $(SPECC) $(SPECFLAGS) -o $@ $< + --- linux-cg.orig/Documentation/bpa/libspu/test/start-stop/ppe-start-stop.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/test/start-stop/ppe-start-stop.c 2005-05-13 11:37:31.487922224 -0400 @@ -0,0 +1,89 @@ +/* + * libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS + * Copyright (C) 2005 IBM Corp. + * + * This library is free software; you can redistribute it and/or modify it + * under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, + * or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. 
+ * + * You should have received a copy of the GNU Lesser General Public License + * along with this library; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include + +#include +#include + +#include + + +void *load_binary(char *bin) +{ + int binfd; + struct stat statbuf; + int ret; + void *buf; + void *pos; + off_t size; + + binfd = open(bin, O_RDONLY); + if (binfd < 0) + return NULL; + + ret = fstat(binfd, &statbuf); + if (ret < 0) + return NULL; + + buf = malloc(statbuf.st_size + 16); + if (!buf) + return NULL; + + buf = (void *)(((unsigned long)buf + 16) & ~15); + pos = buf; + size = statbuf.st_size; + + do { + ret = read(binfd, pos, size); + if (ret > 0) { + pos += ret; + size -= ret; + } + if (ret < 0) + return NULL; + } while (size > 0 && ret); + + return buf; +} + +int main(int argc, char* argv[]) +{ + char *binary; + int threadnum,status; + + if (argc != 2) { + printf("usage: pu spu-executable\n"); + exit(1); + } + + binary = load_binary(argv[1]); + if (!binary) + exit(2); + + threadnum = spe_create_thread(0, binary, NULL, NULL, 0, 0); + + spe_wait(threadnum,&status,0); + + printf("Thread returned status: %04x\n",status); + return 0; +} --- linux-cg.orig/Documentation/bpa/libspu/test/start-stop/spe-start-stop.c 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/test/start-stop/spe-start-stop.c 2005-05-13 11:37:31.487922224 -0400 @@ -0,0 +1,23 @@ +/* + * libbpathread - A wrapper library to adapt the JSRE SPU usage model to SPUFS + * Copyright (C) 2005 IBM Corp. + * + * This library is free software; you can redistribute it and/or modify it + * under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, + * or (at your option) any later version. 
+ * + * This library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. + * + * You should have received a copy of the GNU Lesser General Public License + * along with this library; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +int main(void) +{ + return 0; +} --- linux-cg.orig/Documentation/bpa/libspu/tools/README 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/tools/README 2005-05-13 11:37:31.487922224 -0400 @@ -0,0 +1,8 @@ +Contents of this Directory + +elfspe-register: +Script to register a SPE-ELF loading app with binfmt_misc. + +embedspu: +Script to attach a SPE-ELF object to an executable. + --- linux-cg.orig/Documentation/bpa/libspu/tools/elfspe-register 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/tools/elfspe-register 2005-05-13 11:37:31.488922072 -0400 @@ -0,0 +1,4 @@ +#!/bin/sh + +echo ':spu:M::\x7fELF\x01\x02\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x17::/home/uweigand/runspu:' > /proc/sys/fs/binfmt_misc/register + --- linux-cg.orig/Documentation/bpa/libspu/tools/embedspu 1969-12-31 19:00:00.000000000 -0500 +++ linux-cg/Documentation/bpa/libspu/tools/embedspu 2005-05-13 11:37:31.488922072 -0400 @@ -0,0 +1,56 @@ +#/bin/sh + +# +# Embed SPE ELF executable into PPE object file, and define a +# global pointer variable refering to the embedded file. +# +# Usage: embedspu [flags] symbol_name input_filename output_filename +# +# input_filename: SPE ELF executable to be embedded +# output_filename: Resulting PPE object file +# symbol_name: Name of global pointer variable to be defined +# flags: GCC flags defining PPE object file format +# (e.g. 
-m32 or -m64) +# + +# Argument parsing +SYMBOL= +INFILE= +OUTFILE= +FLAGS= + +while [ -n "$1" ]; do + case $1 in + -*) FLAGS="${FLAGS} $1" + shift ;; + *) if [ -z $SYMBOL ]; then + SYMBOL=$1 + elif [ -z $INFILE ]; then + INFILE=$1 + elif [ -z $OUTFILE ]; then + OUTFILE=$1 + fi + shift ;; + esac +done + +if [ -z "$SYMBOL" -o -z "$INFILE" -o -z "$OUTFILE" ]; then + echo "Usage: $0 [symbol_name] [input_filename] [output_filename]" + exit 1 +fi + +# The section name as defined by the SPU ABI +SECTION=.spuelf.${INFILE} + +# Build object file holding pointer to embedded section +gcc ${FLAGS} -x c -c -o ${OUTFILE} - < References: <20050224011409.GE2088@austin.ibm.com> <421DDEF7.7080103@jp.fujitsu.com> <20050224231455.GH2088@austin.ibm.com> <421E9D16.3000606@jp.fujitsu.com> <20050312013251.GA2609@austin.ibm.com> <4235847F.3080705@jp.fujitsu.com> <20050314181420.GD498@austin.ibm.com> <1112685311.9518.35.camel@gaston> <20050506230506.GL11745@austin.ibm.com> <1116001556.31126.28.camel@sinatra.austin.ibm.com> <20050513172809.GA4138@austin.ibm.com> Message-ID: <1116026802.5128.27.camel@gaston> On Fri, 2005-05-13 at 12:28 -0500, Linas Vepstas wrote: > > Could you just use pci_device_to_OF_node(), then dn->phb? This way, > > you could avoid the loop. > > Yes, except that not all dn's have phb's. The goal here is to walk up > the tree, until a phb is found. All PCI dn's should have a PHB. Ben. From benh at kernel.crashing.org Sat May 14 09:29:32 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 14 May 2005 09:29:32 +1000 Subject: [PATCH] Updated: fix-pci-mmap-on-ppc-and-ppc64.patch In-Reply-To: <20050513161855.GC11089@kroah.com> References: <200505131744.11095.michael@ellerman.id.au> <20050513161855.GC11089@kroah.com> Message-ID: <1116026972.5128.29.camel@gaston> On Fri, 2005-05-13 at 09:18 -0700, Greg KH wrote: > On Fri, May 13, 2005 at 05:44:10PM +1000, Michael Ellerman wrote: > > Hi Andrew, > > Hm, why not send this to the pci maintainer? 
Because I told him to send it to Andrew as it's just an arch bugfix on
top of my patch that is already in mm ;)

> I'll add it to my queue.

Thanks,
Ben.

From benh at kernel.crashing.org Sat May 14 09:31:18 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Sat, 14 May 2005 09:31:18 +1000
Subject: [PATCH 7/8] ppc64: SPU file system
In-Reply-To: <200505132129.07603.arnd@arndb.de>
References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de>
Message-ID: <1116027079.5128.32.camel@gaston>

> /run	A stub file that lets us do ioctl. The only ioctl
>	method we need is the spu_run() call. spu_run suspends
>	the current thread from the host CPU and transfers
>	the flow of execution to the SPU.
>	The ioctl call return to the calling thread when a state
>	is entered that can not be handled by the kernel, e.g.
>	an error in the SPU code or an exit() from it.
>	When a signal is pending for the host CPU thread, the
>	ioctl is interrupted and the SPU stopped in order to
>	call the signal handler.

ioctl's are generally considered evil ... what about a write() method
writing a command ?

Ben.

From greg at kroah.com Sat May 14 17:45:25 2005
From: greg at kroah.com (Greg KH)
Date: Sat, 14 May 2005 00:45:25 -0700
Subject: [PATCH 7/8] ppc64: SPU file system
In-Reply-To: <200505132129.07603.arnd@arndb.de>
References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de>
Message-ID: <20050514074524.GC20021@kroah.com>

On Fri, May 13, 2005 at 09:29:06PM +0200, Arnd Bergmann wrote:
> This is an early version of the SPU file system, which is used
> to run code on the Synergistic Processing Units of the Broadband
> Engine.

The whitespace seems a bit damaged in places, check your tabs vs.
spaces...

> /run	A stub file that lets us do ioctl.

No, as Ben said, do not do this. Use write. And as you are only doing
1 type of ioctl, it shouldn't be an issue.
Also it will be faster than the ioctl due to lack of BKL usage :) And I don't quite think you do the proper permission and validate of the data in your code, you should verify this is all correct. > +/**** spufs attributes > + * > + * Attributes in spufs behave similar to those in sysfs: > + * > + * Writing to an attribute immediately sets a value, an open file can be > + * written to multiple times. > + * > + * Reading from an attribute creates a buffer from the value that might get > + * read with multiple read calls. When the attribute has been read completely, > + * no further read calls are possible until the file is opened again. > + * > + * All spufs attributes contain a text representation of a numeric value that > + * are accessed with the get() and set() functions. > + * > + * Perhaps these file operations could be put in debugfs or libfs instead, > + * they are not really SPU specific. Yes they should. I'll gladly take them for debugfs or like you state, libfs is probably the better place for them so everyone can use them. If you make up a patch, I'll fix up debugfs to use them properly. > +#define spufs_attribute(name) \ > +static int name ## _open(struct inode *inode, struct file *file) \ > +{ \ > + return spufs_attr_open(inode, file, &name ## _get, &name ## _set); \ > +} \ > +static struct file_operations name = { \ > + .open = name ## _open, \ > + .release = spufs_attr_close, \ > + .read = spufs_attr_read, \ > + .write = spufs_attr_write, \ > +}; No module owner set? Be careful if not... > +static struct tree_descr spufs_dir_contents[] = { > + { "mem", &spufs_mem_fops, 0644, }, Named identifiers are the better way to do this (yeah, longer code I know...) 
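To illustrate the "named identifiers" remark, here is a minimal user-space sketch using C99 designated initializers. The types `fops_stub` and `tree_descr_sketch`, the two stub objects and the helper `count_entries()` are invented for this example and are not part of the patch; only the field names (name, ops, mode) and the table-walking condition are taken from the quoted code.

```c
#include <stddef.h>

/* Stub standing in for struct file_operations (assumption for this sketch). */
struct fops_stub {
	int unused;
};

/* User-space stand-in for the kernel's struct tree_descr. */
struct tree_descr_sketch {
	const char *name;
	const struct fops_stub *ops;
	int mode;
};

static const struct fops_stub mem_fops = { 0 };
static const struct fops_stub run_fops = { 0 };

/* Every field is named, so reordering the struct cannot silently break this. */
static const struct tree_descr_sketch dir_contents[] = {
	{ .name = "mem", .ops = &mem_fops, .mode = 0644 },
	{ .name = "run", .ops = &run_fops, .mode = 0400 },
	{ .name = NULL },	/* empty entry terminates the table */
};

/* Count entries the same way the quoted fill loop walks the table. */
static size_t count_entries(const struct tree_descr_sketch *files)
{
	size_t n = 0;

	while (files[n].name && files[n].name[0])
		n++;
	return n;
}
```

The cost is a few more characters per entry; the gain is that a later change to the field order fails loudly or not at all, instead of putting a mode where a pointer was expected.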
> + { "run", &spufs_run_fops, 0400, }, > + { "mbox", &spufs_mbox_fops, 0400, }, > + { "ibox", &spufs_ibox_fops, 0400, }, > + { "wbox", &spufs_wbox_fops, 0200, }, > + { "signal1_type", &spufs_signal1_type, 0600, }, > + { "signal2_type", &spufs_signal1_type, 0600, }, > + > +#if 1 /* debugging only */ > + { "class0_mask", &spufs_class0_mask, 0600, }, > + { "class1_mask", &spufs_class1_mask, 0600, }, > + { "class2_mask", &spufs_class2_mask, 0600, }, > + { "class0_stat", &spufs_class0_stat, 0600, }, > + { "class1_stat", &spufs_class1_stat, 0600, }, > + { "class2_stat", &spufs_class2_stat, 0600, }, > + { "sr1", &spufs_mfc_sr1_RW, 0600, }, > + { "fir", &spufs_mfc_fir_R, 0400, }, > + { "fir_status_or", &spufs_mfc_fir_status_or_W, 0200, }, > + { "fir_status_and", &spufs_mfc_fir_status_and_W, 0200, }, > + { "fir_mask", &spufs_mfc_fir_mask_R, 0400, }, > + { "fir_mask_or", &spufs_mfc_fir_mask_or_W, 0200, }, > + { "fir_mask_and", &spufs_mfc_fir_mask_and_W, 0200, }, > + { "fir_chkstp", &spufs_mfc_fir_chkstp_enable_RW, 0600, }, > + { "cer", &spufs_mfc_cer_R, 0400, }, > + { "dsisr", &spufs_mfc_dsisr_RW, 0600, }, > + { "dsir", &spufs_mfc_dsir_R, 0200, }, > + { "cntl", &spufs_mfc_control_RW, 0600, }, > + { "sdr", &spufs_mfc_sdr_RW, 0600, }, > +#endif > + {}, > +}; > + > +static int > +spufs_fill_dir(struct dentry *dir, struct tree_descr *files, > + int mode, struct spu_context *ctx) > +{ > + struct inode *inode; > + struct dentry *dentry; > + int ret; > + > + static struct inode_operations iops = { > + .getattr = simple_getattr, > + .setattr = spufs_setattr, > + }; > + > + ret = -ENOSPC; > + while (files->name && files->name[0]) { > + dentry = d_alloc_name(dir, files->name); > + if (!dentry) > + goto out; > + inode = spufs_new_inode(dir->d_sb, > + S_IFREG | (files->mode & mode)); > + if (!inode) > + goto out; > + inode->i_op = &iops; > + inode->i_fop = files->ops; > + inode->i_mapping->a_ops = &spufs_aops; > + inode->i_mapping->backing_dev_info = &spufs_backing_dev_info; > + 
SPUFS_I(inode)->i_ctx = get_spu_context(ctx); > + > + d_add(dentry, inode); > + files++; > + } > + return 0; > +out: > + // FIXME: remove all files that are left > + return ret; > +} > + > +static int > +spufs_mkdir(struct inode *dir, struct dentry *dentry, int mode) > +{ > + int ret; > + struct inode *inode; > + struct spu_context *ctx; > + > + ret = -ENOSPC; > + inode = spufs_new_inode(dir->i_sb, mode | S_IFDIR); > + if (!inode) > + goto out; > + > + if (dir->i_mode & S_ISGID) { > + inode->i_gid = dir->i_gid; > + inode->i_mode |= S_ISGID; > + } > + ctx = alloc_spu_context(); > + SPUFS_I(inode)->i_ctx = ctx; > + if (!ctx) > + goto out_iput; > + > + inode->i_op = &simple_dir_inode_operations; > + inode->i_fop = &simple_dir_operations; > + ret = spufs_fill_dir(dentry, spufs_dir_contents, mode, ctx); > + if (ret) > + goto out_free_ctx; > + > + d_instantiate(dentry, inode); > + dget(dentry); > + dir->i_nlink++; > + goto out; > + > +out_free_ctx: > + put_spu_context(ctx); > +out_iput: > + iput(inode); > +out: > + return ret; > +} > + > +/* This looks really wrong! */ > +static int spufs_rmdir(struct inode *root, struct dentry *dir_dentry) Why do you need this? Doesn't 'simple_rmdir' work for you? The rest of your ramfs based fs code looks a bit complex. Can't it be as "simple" as the debugfs code is (only 100 lines for a fs.) Or is it doing different types of things that I'm completly misunderstanding? And I still think that 100 lines of code to make a ramfs type fs is a bit big, need to work on that one of these days... > +union MFC_TagSizeClassCmd { > + struct { > + u16 mfc_size; > + u16 mfc_tag; > + u8 pad; > + u8 mfc_rclassid; > + u16 mfc_cmd; > + } u; > + struct { > + u32 mfc_size_tag32; > + u32 mfc_class_cmd32; > + } by32; > + u64 all64; > +}; Remember __u16 and friends for structures that cross the user/kernel boundry (like your ioctl that you will be rewriting...) 
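As a rough illustration of why fixed-width types matter for any structure that crosses the user/kernel boundary, the sketch below restates the quoted union with `<stdint.h>` types (in kernel headers these would be `__u8`, `__u16`, `__u32` and `__u64`) and checks the layout at compile time. The type names `mfc_cmd_fields` and `mfc_cmd_sketch` are made up for the example, and `_Static_assert` assumes a C11 compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-width fields: the layout is the same for 32-bit and 64-bit callers. */
struct mfc_cmd_fields {
	uint16_t mfc_size;
	uint16_t mfc_tag;
	uint8_t  pad;
	uint8_t  mfc_rclassid;
	uint16_t mfc_cmd;
};

/* Same three views as the quoted union, with user-space type names. */
union mfc_cmd_sketch {
	struct mfc_cmd_fields u;
	struct {
		uint32_t mfc_size_tag32;
		uint32_t mfc_class_cmd32;
	} by32;
	uint64_t all64;
};

/* Compile-time layout checks (C11). */
_Static_assert(sizeof(struct mfc_cmd_fields) == 8,
	       "fields must pack into 8 bytes");
_Static_assert(sizeof(union mfc_cmd_sketch) == 8,
	       "union must be exactly 64 bits");
_Static_assert(offsetof(struct mfc_cmd_fields, mfc_cmd) == 6,
	       "mfc_cmd must start at byte 6");
```

With `long` or pointer members the same checks would give different answers for 32-bit and 64-bit builds, which is exactly the situation a compat layer then has to paper over.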
thanks, greg k-h From arnd at arndb.de Sat May 14 23:05:06 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Sat, 14 May 2005 15:05:06 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <20050514074524.GC20021@kroah.com> References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com> Message-ID: <200505141505.08999.arnd@arndb.de> On S?nnavend 14 Mai 2005 09:45, Greg KH wrote: > On Fri, May 13, 2005 at 09:29:06PM +0200, Arnd Bergmann wrote: > > /run A stub file that lets us do ioctl. > > No, as Ben said, do not do this. Use write. And as you are only doing > 1 type of ioctl, it shouldn't be an issue. Also it will be faster than > the ioctl due to lack of BKL usage :) I've been back and forth between a number of interfaces here and haven't found one that I'm really happy with. Using write() is probably my least favorite one, but these are the alternatives I've come up with so far: 1. ioctl: pro: - easy to do in a file system - can have both input and output arguments contra: - ugly - weakly typed - unpopular 2. sys_spufs_run(int fd, __u32 pc, __u32 *new_pc, __u32 *status): pro: - strong types - can have both input and output arguments contra: - does not fit file system semantics well - bad for prototyping 3. read: pro: - fits file system semantics - can still return a struct { __u32 new_pc; __u32 status; }; contra: - no way to pass updated instruction pointer directly 4. 
write:
   pro:
	- fits file system semantics
	- can take instruction pointer as input
   contra:
	- no output data

The main problem is the way that the ABI requires the main loop to work,
which is roughly:

	pc = initial_instruction_pointer;
	do {
		set_pc(pc);
		status = enter_spu();
		if ((status & 0xff00) == SPU_WANTS_EXIT)
			return (status & 0xff);
		if ((status & 0xffff) == SPU_LIBRARY_CALL) {
			pc = get_pc();
			do_library_call(*(unsigned int *)(local_store_pointer + pc));
			pc += 4;
		}
		if ((status & 0xffff) < SPU_USER_CODE)
			do_user_defined_stuff(status);
	} while (!((status & 0xffff0000) & ERROR_MASK));

Currently, I'm doing all this in user space, i.e. the kernel does not
need to know about the different status codes that are reserved for
exit or library calls. Having a new system call would keep the basic
concept of the ioctl and may or may not be nicer but is certainly
harder to debug for now.

One thing I could do instead is have the kernel automatically increment
the program counter when the spu requests a library call. This should be
ok, because the SPU_LIBRARY_CALL stuff has already been defined to have
a very specific operating system independent meaning and as soon as
we want to do system calls from the SPU, the kernel needs to know about
some status codes anyway.

In that case, I can make the SPU instruction pointer another file in the
SPU context directory that only needs to be written once. The operation
that starts the SPU code could be an eight byte read system call returning
the new instruction pointer (needed to get the library call arguments)
and the status code that lets the user determine the required action.

Using a write call instead of read makes the interface even more
complicated because it would require the user to read the status
from a separate file after write returns to check what needs to
be done and then use lseek() or yet another file to access the
instruction pointer.
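For readers who want to step through the dispatch loop described above, here is a self-contained sketch that runs it against a fake SPU replaying a scripted list of stop statuses. Everything in it is an assumption made for illustration: the numeric values of SPU_WANTS_EXIT, SPU_LIBRARY_CALL and ERROR_MASK are placeholders (the thread gives none), `struct fake_spu` and `run_spu_loop()` are invented names, and the user-defined-code branch is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder encodings: the thread names these codes but gives no values. */
#define SPU_WANTS_EXIT   0x2000u	/* compared against status & 0xff00 */
#define SPU_LIBRARY_CALL 0x2104u	/* compared against status & 0xffff */
#define ERROR_MASK       0xffff0000u

/* Fake SPU that replays a scripted list of stop statuses. */
struct fake_spu {
	uint32_t pc;
	const uint32_t *script;	/* must end with an exit or error status */
	size_t step;
	unsigned int library_calls;
};

/* Stand-in for "run the SPU until it stops": returns the next status. */
static uint32_t enter_spu(struct fake_spu *spu)
{
	return spu->script[spu->step++];
}

/* Returns the exit code (0..255), or -1 if an error status ends the loop. */
static int run_spu_loop(struct fake_spu *spu, uint32_t initial_pc)
{
	uint32_t status;

	spu->pc = initial_pc;
	do {
		status = enter_spu(spu);
		if ((status & 0xff00u) == SPU_WANTS_EXIT)
			return (int)(status & 0xffu);
		if ((status & 0xffffu) == SPU_LIBRARY_CALL) {
			/* A real loop would read the call argument at pc here. */
			spu->library_calls++;
			spu->pc += 4;	/* step over the argument word */
		}
	} while (!(status & ERROR_MASK));

	return -1;
}
```

The point of the exercise is only to show which part of the state (the instruction pointer) has to travel in each direction on every iteration, which is what makes a pure read() or pure write() interface awkward.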
> And I don't quite think you do the proper permission and validate of the > data in your code, you should verify this is all correct. Yes, I'm sure I got that wrong. I'll put that on my todo list. > > +/**** spufs attributes > > + * > > + * Perhaps these file operations could be put in debugfs or libfs instead, > > + * they are not really SPU specific. > > Yes they should. I'll gladly take them for debugfs or like you state, > libfs is probably the better place for them so everyone can use them. > > If you make up a patch, I'll fix up debugfs to use them properly. Ok. I'll do the patch for libfs then. I've been thinking about changing +#define spufs_attribute(name) \ +static int name ## _open(struct inode *inode, struct file *file) \ +{ \ + return spufs_attr_open(inode, file, &name ## _get, &name ## _set); \ +} \ +static struct file_operations name = { \ + .open = name ## _open, \ + .release = spufs_attr_close, \ + .read = spufs_attr_read, \ + .write = spufs_attr_write, \ +}; to take a format string argument as well, which is then used in the spufs_attr_read function instead of the hardcoded "%ld\n". Do you think I should do that or rather keep the current implementation? > > +#define spufs_attribute(name) \ > > +static int name ## _open(struct inode *inode, struct file *file) \ > > +{ \ > > + return spufs_attr_open(inode, file, &name ## _get, &name ## _set); \ > > +} \ > > +static struct file_operations name = { \ > > + .open = name ## _open, \ > > + .release = spufs_attr_close, \ > > + .read = spufs_attr_read, \ > > + .write = spufs_attr_write, \ > > +}; > > No module owner set? Be careful if not... Right. Is there ever a reason to have file operations without owner? Maybe dentry_open() could warn about this. > > +static struct tree_descr spufs_dir_contents[] = { > > + { "mem", &spufs_mem_fops, 0644, }, > > Named identifiers are the better way to do this (yeah, longer code I > know...) Ok. 
I took the concept from fs/nfsd/nfsctl.c, thinking that Al knows how to best do these things, but I can of course change this. > > +/* This looks really wrong! */ > > +static int spufs_rmdir(struct inode *root, struct dentry *dir_dentry) > > Why do you need this? Doesn't 'simple_rmdir' work for you? The idea was to keep the file system contents consistant with the underlying data structures. If I allow users to unlink context directories or files in there, there is no longer a way to extract reliable information from the file system, e.g. for the debugger or for implementing something like spu_ps. My solution was to force the dentries in each directory to be present. When the directory is created, the files are already there and unlinking a single file is impossible. To destroy the spu context, the user has to rmdir it, which will either remove all files in there as well or fail in the case that any file is still open. Of course that is not really Posix behavior, but it avoids some other pitfalls. > The rest of your ramfs based fs code looks a bit complex. Can't it be > as "simple" as the debugfs code is (only 100 lines for a fs.) Or is it > doing different types of things that I'm completly misunderstanding? Apart from my special directory semantics, I plan to have a rather unusual way to map the "mem" files: Each spu context can be either present on a physical SPU or stored in memory, so I can create a large number of SPU contexts despite the limitation of physical SPUs present in the machine. When the SPU context is executing code, it obviously has to be on a physical SPU and a memory map of the "mem" file accesses the actual local store memory that is accessible in the real address space of the kernel. The context save operation copies the local store memory into a virtual file that lives only in page cache, exactly how ramfs deals with its files. 
Switching between these two states should be possible without breaking user space programs that have the file mapped into their address space. This mechanism is not implemented yet, but some of my code is already prepared for it. I also intend to split some parts out from inode.c, probably have a file.c that contains all the file operations and another context.c that deals with the interface to the low level spu code and with abstracting logical spu context from physical spus. I suppose I should also go over my code to find unnecessary functionality. > Remember __u16 and friends for structures that cross the user/kernel > boundry (like your ioctl that you will be rewriting...) Yes. There are no data structures that are shared with user space except the current ioctl argument. The MFC_TagSizeClassCmd (yes, I need to remember to change the name some day, currently this still uses the identifiers from the spec) and the others are defined by the hardware interface. Thanks for all your comments, Arnd <>< From benh at kernel.crashing.org Sun May 15 16:29:06 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 15 May 2005 16:29:06 +1000 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <200505141505.08999.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com> <200505141505.08999.arnd@arndb.de> Message-ID: <1116138546.5095.6.camel@gaston> > Using a write call instead of read makes the interface even more > complicated because it would require the user to read the status > from a separate file after write returns to check what needs to > be done and then use lseek() or yet another file to access the > instruction pointer. Why not just write(pc) to start and read back status from the same file ? Ben. 
From pavel at ucw.cz Sun May 15 19:07:05 2005 From: pavel at ucw.cz (Pavel Machek) Date: Sun, 15 May 2005 11:07:05 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <1116027079.5128.32.camel@gaston> References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <1116027079.5128.32.camel@gaston> Message-ID: <20050515090705.GA2343@elf.ucw.cz> Hi! > > /run A stub file that lets us do ioctl. The only ioctl > > method we need is the spu_run() call. spu_run suspends > > the current thread from the host CPU and transfers > > the flow of execution to the SPU. > > The ioctl call return to the calling thread when a state > > is entered that can not be handled by the kernel, e.g. > > an error in the SPU code or an exit() from it. > > When a signal is pending for the host CPU thread, the > > ioctl is interrupted and the SPU stopped in order to > > call the signal handler. > > ioctl's are generally considered evil ... what about a write() method > writing a command ? That's even more evil than ioctl()... Try doing 32-vs-64bit conversion on write... Pavel -- Boycott Kodak -- for their patent abuse against Java. From arnd at arndb.de Sun May 15 20:08:52 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Sun, 15 May 2005 12:08:52 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <1116138546.5095.6.camel@gaston> References: <200505132117.37461.arnd@arndb.de> <200505141505.08999.arnd@arndb.de> <1116138546.5095.6.camel@gaston> Message-ID: <200505151208.54229.arnd@arndb.de> On S?nndag 15 Mai 2005 08:29, Benjamin Herrenschmidt wrote: > Why not just write(pc) to start and read back status from the same > file ? I suppose you are thinking of the simple_transaction_read() style interface. 
I've got the feeling that this is generally even less popular than
ioctl because
 - it is still an untyped interface (as would be a read() based one)
 - you can't do 32 bit emulation (doesn't matter for me, we only
   have 32 bit data)
 - it is non-atomic
 - it doubles the system call overhead

One operation that I want to allow is to have an infinite loop running
on the SPU that does a simple operation (e.g. process one MPEG macroblock)
and have that called by multiple unrelated processes in turn. When
my operation is not atomic, users need to have additional IPC serialization
of their accesses. Most would want that anyway, but it is not a requirement
with an interface that needs only a single system call.

For the extra syscall overhead, I would like to see measurements of a
real world application before I change to an interface that is slower
in theory. Do you have measurements for the time spent in a trivial
system call on G5 or Power4?

	Arnd <><

From ak at muc.de Sun May 15 21:24:13 2005
From: ak at muc.de (Andi Kleen)
Date: Sun, 15 May 2005 13:24:13 +0200
Subject: [PATCH 7/8] ppc64: SPU file system
In-Reply-To: <20050514074524.GC20021@kroah.com> (Greg KH's message of "Sat, 14 May 2005 00:45:25 -0700")
References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com>
Message-ID:

Greg KH writes:
>
> No, as Ben said, do not do this. Use write. And as you are only doing
> 1 type of ioctl, it shouldn't be an issue. Also it will be faster than
> the ioctl due to lack of BKL usage :)

The problem is that if something is wrong regarding 32bit/64bit
compatibility (I am not saying Arnd will get it wrong, but as a general
rule someone will get it wrong and it has happened, e.g. in usbfs)
then it is impossible to do any compat emulation on read/write.
So I would actually prefer ioctl because it is safer.
-Andi From benh at kernel.crashing.org Sun May 15 22:02:28 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 15 May 2005 22:02:28 +1000 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <20050515090705.GA2343@elf.ucw.cz> References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <1116027079.5128.32.camel@gaston> <20050515090705.GA2343@elf.ucw.cz> Message-ID: <1116158548.5095.17.camel@gaston> > > ioctl's are generally considered evil ... what about a write() method > > writing a command ? > > That's even more evil than ioctl()... Try doing 32-vs-64bit conversion > on write... I don't see the problem ... if you are passing a structure, you have to convert it anyway, and it's bad practice. I was thinking about passing ascii so it can be controlled by shell scripts. Ben. From arnd at arndb.de Sun May 15 21:50:04 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Sun, 15 May 2005 13:50:04 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <200505151208.54229.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <1116138546.5095.6.camel@gaston> <200505151208.54229.arnd@arndb.de> Message-ID: <200505151350.05692.arnd@arndb.de> On S?nndag 15 Mai 2005 12:08, Arnd Bergmann wrote: > On S?nndag 15 Mai 2005 08:29, Benjamin Herrenschmidt wrote: > > Why not just write(pc) to start and read back status from the same > > file ? I just remembered the strongest reason against using write() to set the instruction pointer: It breaks signal delivery during execution of SPU code. With an ioctl or system call based interface, the kernel simply updates the instruction pointer in process memory before calling a signal handler. When/if the signal handler returns, it does the same call again with the updated argument and the SPU continues to fetch code at the point where it stopped. If I do a read() based interface, there are no input parameters at all, so restarted system calls work as well. 
How about this one: read() starts execution and returns the status value in a four byte buffer. Calling lseek() on the "run" file updates the instruction pointer, so the library call can work like this plus error handling: extern char *mapped_local_store; uint32_t status; int runfd = open("run", O_RDONLY); lseek(runfd, INITIAL_INSTRUCTION, SEEK_SET); do { read(runfd, &status, 4); if (status == SPU_DO_LIBRARY_CALL) { size_t arg = lseek(runfd, 4, SEEK_CUR) - 4; do_library_call(mapped_local_store + arg); } } while (status != SPU_EXIT); Arnd <>< From arnd at arndb.de Sun May 15 22:33:09 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Sun, 15 May 2005 14:33:09 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <1116158548.5095.17.camel@gaston> References: <200505132117.37461.arnd@arndb.de> <20050515090705.GA2343@elf.ucw.cz> <1116158548.5095.17.camel@gaston> Message-ID: <200505151433.11108.arnd@arndb.de> On S?nndag 15 Mai 2005 14:02, Benjamin Herrenschmidt wrote: > > > That's even more evil than ioctl()... Try doing 32-vs-64bit conversion > > on write... > > I don't see the problem ... if you are passing a structure, you have to > convert it anyway, and it's bad practice. I was thinking about passing > ascii so it can be controlled by shell scripts. Parsing multi-value ascii data is error prone. in kernel space, I would not want to do anything more complex than a simple_strtoul(), if only for the reason of not giving bad examples. When passing binary structures, there is a significant difference between passing it through ioctl or read/write: We already have a rather complicated method of detecting if whether and how to convert them (f_op->compat_ioctl, hash lookup and the deprecated dynamic registration). For read/write, there is no way to tell if you need to do the conversion, even if the file operation is aware of the actual data layout of both variants. 
Moreover, a good implementation of a read/write file operation should be able to deal with resuming partial transfers. Regarding the shell scripting possibility, I don't really see the point. The only code that should actually use the kernel interfaces is something like an /lib/ld-spu.so interpreter and that is better implemented in C anyway because it needs to parse ELF structures and such. Arnd <>< From tzachi at marvell.com Sun May 15 23:08:06 2005 From: tzachi at marvell.com (Tzachi Perelstein) Date: Sun, 15 May 2005 16:08:06 +0300 Subject: io_block_mapping in PPC64? Message-ID: Hi all, I develop Linux (PPC64, 2.6.9) for the new Marvell 64470 evaluation board with the IBM-970FX. Inside the Marvell system controller there is a 2Mbit internal SRAM for applications that required fast memory access (traditionally for network driver descriptors). In the (32bit) PPC arch there is an io_block_mapping function that can be used to map the internal SRAM with configurable (e.g. cache coherence) attributes. Is there any way to it in PPC64 arch? I didn't find any reference for it. Any help will be appriciated. Thanks, Tzachi From benh at kernel.crashing.org Mon May 16 08:49:53 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 16 May 2005 08:49:53 +1000 Subject: io_block_mapping in PPC64? In-Reply-To: References: Message-ID: <1116197395.5095.27.camel@gaston> On Sun, 2005-05-15 at 16:08 +0300, Tzachi Perelstein wrote: > Hi all, > > I develop Linux (PPC64, 2.6.9) for the new Marvell 64470 evaluation > board with the IBM-970FX. Inside the Marvell system controller there is > a 2Mbit internal SRAM for applications that required fast memory access > (traditionally for network driver descriptors). > In the (32bit) PPC arch there is an io_block_mapping function that can > be used to map the internal SRAM with configurable (e.g. cache > coherence) attributes. Is there any way to it in PPC64 arch? I didn't > find any reference for it. 
> Any help will be appreciated.

io_block_mapping() isn't really an interface I would recommend for
anybody to use... What about __ioremap() instead?

I also notice that Marvell didn't bother submitting any patch for
supporting their board, I suppose they are happy with just providing
customers with their hacks and no attempt to have them validated &
merged upstream ...

Ben.

From benh at kernel.crashing.org Mon May 16 08:54:45 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 16 May 2005 08:54:45 +1000
Subject: io_block_mapping in PPC64?
In-Reply-To: <1116197395.5095.27.camel@gaston>
References: <1116197395.5095.27.camel@gaston>
Message-ID: <1116197685.5095.35.camel@gaston>

On Mon, 2005-05-16 at 08:49 +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2005-05-15 at 16:08 +0300, Tzachi Perelstein wrote:
> > Hi all,
> >
> > I develop Linux (PPC64, 2.6.9) for the new Marvell 64470 evaluation
> > board with the IBM-970FX. Inside the Marvell system controller there is
> > a 2Mbit internal SRAM for applications that require fast memory access
> > (traditionally for network driver descriptors).
> > In the (32bit) PPC arch there is an io_block_mapping function that can
> > be used to map the internal SRAM with configurable (e.g. cache
> > coherence) attributes. Is there any way to do it in PPC64 arch? I didn't
> > find any reference for it. Any help will be appreciated.
>
> io_block_mapping() isn't really an interface I would recommend for
> anybody to use... What about __ioremap() instead?
>
> I also notice that Marvell didn't bother submitting any patch for
> supporting their board, I suppose they are happy with just providing
> customers with their hacks and no attempt to have them validated &
> merged upstream ...

Hrm... just noticed your @marvell.com address, so forget my previous
comment. I would still recommend that any patch you come up with for
supporting this board be submitted on this list before it hits customers.
I'm trying hard to avoid ppc64 becoming the mess ppc32 has become...

Ben.

From david at gibson.dropbear.id.au Mon May 16 14:43:53 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Mon, 16 May 2005 14:43:53 +1000
Subject: Four level pagetables, now with hugepages
Message-ID: <20050516044353.GV6053@localhost.localdomain>

Ok, here's a new revised version of the four level pagetable patch,
with hugepages working again. However, it now must be applied on top
of the hugepage consolidation patch which is in the -mm tree. I think
this is now close to ready for merging.

This patch implements full four-level page tables for ppc64. It uses
a full page for the tables at the bottom and top level, and a quarter
page for the intermediate levels. This gives a total usable address
space of 44 bits (16T). This patch also tweaks the VSID allocation to
have a matching range for user addresses (thereby halving the number
of available contexts) and adds some #if and BUILD_BUG sanity checks.

Index: working-2.6/include/asm-ppc64/pgtable.h
===================================================================
--- working-2.6.orig/include/asm-ppc64/pgtable.h	2005-05-13 17:57:38.000000000 +1000
+++ working-2.6/include/asm-ppc64/pgtable.h	2005-05-13 17:57:39.000000000 +1000
@@ -15,19 +15,24 @@
 #include
 #endif /* __ASSEMBLY__ */
-#include
-
 /*
  * Entries per page directory level.  The PTE level must use a 64b record
  * for each page table entry.  The PMD and PGD level use a 32b record for
  * each entry by assuming that each entry is page aligned.
*/ #define PTE_INDEX_SIZE 9 -#define PMD_INDEX_SIZE 10 -#define PGD_INDEX_SIZE 10 +#define PMD_INDEX_SIZE 7 +#define PUD_INDEX_SIZE 7 +#define PGD_INDEX_SIZE 9 + +#define PTE_TABLE_SIZE (sizeof(pte_t) << PTE_INDEX_SIZE) +#define PMD_TABLE_SIZE (sizeof(pmd_t) << PMD_INDEX_SIZE) +#define PUD_TABLE_SIZE (sizeof(pud_t) << PUD_INDEX_SIZE) +#define PGD_TABLE_SIZE (sizeof(pgd_t) << PGD_INDEX_SIZE) #define PTRS_PER_PTE (1 << PTE_INDEX_SIZE) #define PTRS_PER_PMD (1 << PMD_INDEX_SIZE) +#define PTRS_PER_PUD (1 << PMD_INDEX_SIZE) #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE) /* PMD_SHIFT determines what a second-level page table entry can map */ @@ -35,8 +40,13 @@ #define PMD_SIZE (1UL << PMD_SHIFT) #define PMD_MASK (~(PMD_SIZE-1)) -/* PGDIR_SHIFT determines what a third-level page table entry can map */ -#define PGDIR_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +/* PUD_SHIFT determines what a third-level page table entry can map */ +#define PUD_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +#define PUD_SIZE (1UL << PUD_SHIFT) +#define PUD_MASK (~(PUD_SIZE-1)) + +/* PGDIR_SHIFT determines what a fourth-level page table entry can map */ +#define PGDIR_SHIFT (PUD_SHIFT + PUD_INDEX_SIZE) #define PGDIR_SIZE (1UL << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE-1)) @@ -45,15 +55,23 @@ /* * Size of EA range mapped by our pagetables. */ -#define EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ - PGD_INDEX_SIZE + PAGE_SHIFT) -#define EADDR_MASK ((1UL << EADDR_SIZE) - 1) +#define PGTABLE_EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ + PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT) +#define PGTABLE_RANGE (1UL << PGTABLE_EADDR_SIZE) + +#if TASK_SIZE_USER64 > PGTABLE_RANGE +#error TASK_SIZE_USER64 exceeds pagetable range +#endif + +#if TASK_SIZE_USER64 > (1UL << (USER_ESID_BITS + SID_SHIFT)) +#error TASK_SIZE_USER64 exceeds user VSID range +#endif /* * Define the address range of the vmalloc VM area. 
*/ #define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_SIZE (0x10000000000UL) +#define VMALLOC_SIZE (0x80000000000UL) #define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) /* @@ -154,8 +172,6 @@ #ifndef __ASSEMBLY__ int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local); - -void hugetlb_mm_free_pgd(struct mm_struct *mm); #endif /* __ASSEMBLY__ */ #define HAVE_ARCH_UNMAPPED_AREA @@ -163,7 +179,6 @@ #else #define hash_huge_page(mm,a,ea,vsid,local) -1 -#define hugetlb_mm_free_pgd(mm) do {} while (0) #endif @@ -197,39 +212,45 @@ #define pte_pfn(x) ((unsigned long)((pte_val(x) >> PTE_SHIFT))) #define pte_page(x) pfn_to_page(pte_pfn(x)) -#define pmd_set(pmdp, ptep) \ - (pmd_val(*(pmdp)) = __ba_to_bpn(ptep)) +#define pmd_set(pmdp, ptep) (pmd_val(*(pmdp)) = (unsigned long)(ptep)) #define pmd_none(pmd) (!pmd_val(pmd)) #define pmd_bad(pmd) (pmd_val(pmd) == 0) #define pmd_present(pmd) (pmd_val(pmd) != 0) #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0) -#define pmd_page_kernel(pmd) (__bpn_to_ba(pmd_val(pmd))) +#define pmd_page_kernel(pmd) (pmd_val(pmd)) #define pmd_page(pmd) virt_to_page(pmd_page_kernel(pmd)) -#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (__ba_to_bpn(pmdp))) +#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (unsigned long)(pmdp)) #define pud_none(pud) (!pud_val(pud)) -#define pud_bad(pud) ((pud_val(pud)) == 0UL) -#define pud_present(pud) (pud_val(pud) != 0UL) -#define pud_clear(pudp) (pud_val(*(pudp)) = 0UL) -#define pud_page(pud) (__bpn_to_ba(pud_val(pud))) +#define pud_bad(pud) ((pud_val(pud)) == 0) +#define pud_present(pud) (pud_val(pud) != 0) +#define pud_clear(pudp) (pud_val(*(pudp)) = 0) +#define pud_page(pud) (pud_val(pud)) + +#define pgd_set(pgdp, pudp) ({pgd_val(*(pgdp)) = (unsigned long)(pudp);}) +#define pgd_none(pgd) (!pgd_val(pgd)) +#define pgd_bad(pgd) (pgd_val(pgd) == 0) +#define pgd_present(pgd) (pgd_val(pgd) != 0) +#define pgd_clear(pgdp) (pgd_val(*(pgdp)) = 0) +#define 
pgd_page(pgd) (pgd_val(pgd)) /* * Find an entry in a page-table-directory. We combine the address region * (the high order N bits) and the pgd portion of the address. */ /* to avoid overflow in free_pgtables we don't use PTRS_PER_PGD here */ -#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x7ff) +#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x1ff) #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address)) -/* Find an entry in the second-level page table.. */ +#define pud_offset(pgdp, addr) \ + (((pud_t *) pgd_page(*(pgdp))) + (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))) + #define pmd_offset(pudp,addr) \ - ((pmd_t *) pud_page(*(pudp)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) + (((pmd_t *) pud_page(*(pudp))) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) -/* Find an entry in the third-level page table.. */ #define pte_offset_kernel(dir,addr) \ - ((pte_t *) pmd_page_kernel(*(dir)) \ - + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) + (((pte_t *) pmd_page_kernel(*(dir))) + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) #define pte_offset_map(dir,addr) pte_offset_kernel((dir), (addr)) #define pte_offset_map_nested(dir,addr) pte_offset_kernel((dir), (addr)) @@ -458,9 +479,11 @@ #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) #define pmd_ERROR(e) \ - printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) + printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e)) +#define pud_ERROR(e) \ + printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pud_val(e)) #define pgd_ERROR(e) \ - printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) + printk("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e)) extern pgd_t swapper_pg_dir[]; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-13 
17:57:39.000000000 +1000 @@ -46,6 +46,7 @@ #define ARCH_HAS_HUGEPAGE_ONLY_RANGE #define ARCH_HAS_PREPARE_HUGEPAGE_RANGE +#define ARCH_HAS_SETCLEAR_HUGE_PTE #define touches_hugepage_low_range(mm, addr, len) \ (LOW_ESID_MASK((addr), (len)) & mm->context.htlb_segs) @@ -125,36 +126,42 @@ * Entries in the pte table are 64b, while entries in the pgd & pmd are 32b. */ typedef struct { unsigned long pte; } pte_t; -typedef struct { unsigned int pmd; } pmd_t; -typedef struct { unsigned int pgd; } pgd_t; +typedef struct { unsigned long pmd; } pmd_t; +typedef struct { unsigned long pud; } pud_t; +typedef struct { unsigned long pgd; } pgd_t; typedef struct { unsigned long pgprot; } pgprot_t; #define pte_val(x) ((x).pte) #define pmd_val(x) ((x).pmd) +#define pud_val(x) ((x).pud) #define pgd_val(x) ((x).pgd) #define pgprot_val(x) ((x).pgprot) -#define __pte(x) ((pte_t) { (x) } ) -#define __pmd(x) ((pmd_t) { (x) } ) -#define __pgd(x) ((pgd_t) { (x) } ) -#define __pgprot(x) ((pgprot_t) { (x) } ) +#define __pte(x) ((pte_t) { (x) }) +#define __pmd(x) ((pmd_t) { (x) }) +#define __pud(x) ((pud_t) { (x) }) +#define __pgd(x) ((pgd_t) { (x) }) +#define __pgprot(x) ((pgprot_t) { (x) }) #else /* * .. 
while these make it easier on the compiler */ typedef unsigned long pte_t; -typedef unsigned int pmd_t; -typedef unsigned int pgd_t; +typedef unsigned long pmd_t; +typedef unsigned long pud_t; +typedef unsigned long pgd_t; typedef unsigned long pgprot_t; #define pte_val(x) (x) #define pmd_val(x) (x) +#define pud_val(x) (x) #define pgd_val(x) (x) #define pgprot_val(x) (x) #define __pte(x) (x) #define __pmd(x) (x) +#define __pud(x) (x) #define __pgd(x) (x) #define __pgprot(x) (x) @@ -208,9 +215,6 @@ #define USER_REGION_ID (0UL) #define REGION_ID(ea) (((unsigned long)(ea)) >> REGION_SHIFT) -#define __bpn_to_ba(x) ((((unsigned long)(x)) << PAGE_SHIFT) + KERNELBASE) -#define __ba_to_bpn(x) ((((unsigned long)(x)) & ~REGION_MASK) >> PAGE_SHIFT) - #define __va(x) ((void *)((unsigned long)(x) + KERNELBASE)) #ifdef CONFIG_DISCONTIGMEM Index: working-2.6/include/asm-ppc64/pgalloc.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgalloc.h 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgalloc.h 2005-05-13 17:57:39.000000000 +1000 @@ -6,7 +6,7 @@ #include #include -extern kmem_cache_t *zero_cache; +extern kmem_cache_t *pmd_cache; /* * This program is free software; you can redistribute it and/or @@ -18,13 +18,31 @@ static inline pgd_t * pgd_alloc(struct mm_struct *mm) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL); + return (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO); } static inline void pgd_free(pgd_t *pgd) { - kmem_cache_free(zero_cache, pgd); + free_page((unsigned long)pgd); +} + +#define pgd_populate(MM, PGD, PUD) pgd_set(PGD, PUD) + +static inline pud_t * +pud_alloc_one(struct mm_struct *mm, unsigned long addr) +{ + pud_t *pudp; + + pudp = kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pudp, 0, PUD_TABLE_SIZE); + return pudp; +} + +static inline void +pud_free(pud_t *pud) +{ + kmem_cache_free(pmd_cache, pud); } #define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD) @@ 
-32,13 +50,17 @@ static inline pmd_t * pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + pmd_t *pmdp; + + pmdp = kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pmdp, 0, PMD_TABLE_SIZE); + return pmdp; } static inline void pmd_free(pmd_t *pmd) { - kmem_cache_free(zero_cache, pmd); + kmem_cache_free(pmd_cache, pmd); } #define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, pte) @@ -47,44 +69,54 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) { - pte_t *pte = kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); - if (pte) - return virt_to_page(pte); - return NULL; + return alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline void pte_free_kernel(pte_t *pte) { - kmem_cache_free(zero_cache, pte); + free_page((unsigned long)pte); } static inline void pte_free(struct page *ptepage) { - kmem_cache_free(zero_cache, page_address(ptepage)); + __free_page(ptepage); } -struct pte_freelist_batch +typedef struct pgtable_free { + unsigned long val; +} pgtable_free_t; + +static inline pgtable_free_t pgtable_free_page(struct page *page) { - struct rcu_head rcu; - unsigned int index; - struct page * pages[0]; -}; + return (pgtable_free_t){.val = (unsigned long) page}; +} -#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch)) / \ - sizeof(struct page *)) +static inline pgtable_free_t pgtable_free_cache(void *p) +{ + return (pgtable_free_t){.val = ((unsigned long) p) | 1}; +} -extern void pte_free_now(struct page *ptepage); -extern void pte_free_submit(struct pte_freelist_batch *batch); +static inline void pgtable_free(pgtable_free_t pgf) +{ + if (pgf.val & 1) + kmem_cache_free(pmd_cache, (void 
*)(pgf.val & ~1)); + else + __free_page((struct page *)pgf.val); +} -DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf); -void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage); -#define __pmd_free_tlb(tlb, pmd) __pte_free_tlb(tlb, virt_to_page(pmd)) +#define __pte_free_tlb(tlb, ptepage) \ + pgtable_free_tlb(tlb, pgtable_free_page(ptepage)) +#define __pmd_free_tlb(tlb, pmd) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pmd)) +#define __pud_free_tlb(tlb, pmd) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pud)) #define check_pgt_cache() do { } while (0) Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-13 17:57:39.000000000 +1000 @@ -66,6 +66,14 @@ #include #include +#if PGTABLE_RANGE > USER_VSID_RANGE +#warning Limited user VSID range means pagetable space is wasted +#endif + +#if (TASK_SIZE_USER64 < PGTABLE_RANGE) && (TASK_SIZE_USER64 < USER_VSID_RANGE) +#warning TASK_SIZE is smaller than it needs to be. +#endif + int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; static unsigned long phbs_io_bot = PHBS_IO_BASE; @@ -225,7 +233,7 @@ * Before that, we map using addresses going * up from ioremap_bot. imalloc will use * the addresses from ioremap_bot through - * IMALLOC_END (0xE000001fffffffff) + * IMALLOC_END * */ pa = addr & PAGE_MASK; @@ -416,12 +424,6 @@ int index; int err; -#ifdef CONFIG_HUGETLB_PAGE - /* We leave htlb_segs as it was, but for a fork, we need to - * clear the huge_pgdir. 
*/ - mm->context.huge_pgdir = NULL; -#endif - again: if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL)) return -ENOMEM; @@ -452,8 +454,6 @@ spin_unlock(&mmu_context_lock); mm->context.id = NO_CONTEXT; - - hugetlb_mm_free_pgd(mm); } /* @@ -824,23 +824,18 @@ return virt_addr; } -kmem_cache_t *zero_cache; - -static void zero_ctor(void *pte, kmem_cache_t *cache, unsigned long flags) -{ - memset(pte, 0, PAGE_SIZE); -} +kmem_cache_t *pmd_cache; void pgtable_cache_init(void) { - zero_cache = kmem_cache_create("zero", - PAGE_SIZE, - 0, - SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, - zero_ctor, - NULL); - if (!zero_cache) - panic("pgtable_cache_init(): could not create zero_cache!\n"); + BUILD_BUG_ON(PTE_TABLE_SIZE != PAGE_SIZE); + BUILD_BUG_ON(PMD_TABLE_SIZE != PUD_TABLE_SIZE); + BUILD_BUG_ON(PGD_TABLE_SIZE != PAGE_SIZE); + + pmd_cache = kmem_cache_create("pmd", PMD_TABLE_SIZE, PMD_TABLE_SIZE, + 0, NULL, NULL); + if (! pmd_cache) + panic("pmd_pud_cache_init(): could not create pmd_pud_cache!\n"); } pgprot_t phys_mem_access_prot(struct file *file, unsigned long addr, Index: working-2.6/include/asm-ppc64/processor.h =================================================================== --- working-2.6.orig/include/asm-ppc64/processor.h 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/include/asm-ppc64/processor.h 2005-05-13 18:03:37.000000000 +1000 @@ -530,8 +530,8 @@ extern struct task_struct *last_task_used_math; extern struct task_struct *last_task_used_altivec; -/* 64-bit user address space is 41-bits (2TBs user VM) */ -#define TASK_SIZE_USER64 (0x0000020000000000UL) +/* 64-bit user address space is 44-bits (16TB user VM) */ +#define TASK_SIZE_USER64 (0x0000100000000000UL) /* * 32-bit user address space is 4GB - 1 page Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2005-05-13 
17:57:39.000000000 +1000 @@ -38,6 +38,7 @@ #include #include #include +#include #ifdef CONFIG_PPC_ISERIES #define DO_SOFT_DISABLE @@ -2117,17 +2118,17 @@ empty_zero_page: .space 4096 - .globl swapper_pg_dir -swapper_pg_dir: - .space 4096 - #ifdef CONFIG_SMP /* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */ .globl stab_array stab_array: .space 4096 * 48 #endif - + + .globl swapper_pg_dir +swapper_pg_dir: + .space PAGE_SIZE + /* * This space gets a copy of optional info passed to us by the bootstrap * Used to pass parameters into the kernel like root=/dev/sda1, etc. Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-13 17:57:39.000000000 +1000 @@ -31,7 +31,7 @@ break; if ((unsigned long)tmp->addr >= ioremap_bot) addr = tmp->size + (unsigned long) tmp->addr; - if (addr > IMALLOC_END-size) + if (addr >= IMALLOC_END-size) return 1; } *im_addr = addr; Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-13 17:57:39.000000000 +1000 @@ -298,7 +298,7 @@ int local = 0; cpumask_t tmp; - if ((ea & ~REGION_MASK) > EADDR_MASK) + if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) return 1; switch (REGION_ID(ea)) { Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2005-05-13 17:57:39.000000000 +1000 @@ -259,8 +259,10 @@ #define VSID_BITS 36 #define VSID_MODULUS ((1UL<index; i++) + pgtable_free(batch->tables[i]); + + free_page((unsigned long)batch); +} + +static void pte_free_submit(struct 
pte_freelist_batch *batch) +{ + INIT_RCU_HEAD(&batch->rcu); + call_rcu(&batch->rcu, pte_free_rcu_callback); +} + +void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf) { /* This is safe as we are holding page_table_lock */ cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); @@ -49,19 +100,19 @@ if (atomic_read(&tlb->mm->mm_users) < 2 || cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) { - pte_free(ptepage); + pgtable_free(pgf); return; } if (*batchp == NULL) { *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); if (*batchp == NULL) { - pte_free_now(ptepage); + pgtable_free_now(pgf); return; } (*batchp)->index = 0; } - (*batchp)->pages[(*batchp)->index++] = ptepage; + (*batchp)->tables[(*batchp)->index++] = pgf; if ((*batchp)->index == PTE_FREELIST_SIZE) { pte_free_submit(*batchp); *batchp = NULL; @@ -132,42 +183,6 @@ put_cpu(); } -#ifdef CONFIG_SMP -static void pte_free_smp_sync(void *arg) -{ - /* Do nothing, just ensure we sync with all CPUs */ -} -#endif - -/* This is only called when we are critically out of memory - * (and fail to get a page in pte_free_tlb). 
- */ -void pte_free_now(struct page *ptepage) -{ - pte_freelist_forced_free++; - - smp_call_function(pte_free_smp_sync, NULL, 0, 1); - - pte_free(ptepage); -} - -static void pte_free_rcu_callback(struct rcu_head *head) -{ - struct pte_freelist_batch *batch = - container_of(head, struct pte_freelist_batch, rcu); - unsigned int i; - - for (i = 0; i < batch->index; i++) - pte_free(batch->pages[i]); - free_page((unsigned long)batch); -} - -void pte_free_submit(struct pte_freelist_batch *batch) -{ - INIT_RCU_HEAD(&batch->rcu); - call_rcu(&batch->rcu, pte_free_rcu_callback); -} - void pte_free_finish(void) { /* This is safe as we are holding page_table_lock */ Index: working-2.6/arch/ppc64/mm/hugetlbpage.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c 2005-05-13 17:57:38.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hugetlbpage.c 2005-05-13 17:57:39.000000000 +1000 @@ -27,124 +27,93 @@ #include -#define HUGEPGDIR_SHIFT (HPAGE_SHIFT + PAGE_SHIFT - 3) -#define HUGEPGDIR_SIZE (1UL << HUGEPGDIR_SHIFT) -#define HUGEPGDIR_MASK (~(HUGEPGDIR_SIZE-1)) - -#define HUGEPTE_INDEX_SIZE 9 -#define HUGEPGD_INDEX_SIZE 10 - -#define PTRS_PER_HUGEPTE (1 << HUGEPTE_INDEX_SIZE) -#define PTRS_PER_HUGEPGD (1 << HUGEPGD_INDEX_SIZE) - -static inline int hugepgd_index(unsigned long addr) +/* Modelled after find_linux_pte() */ +pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) { - return (addr & ~REGION_MASK) >> HUGEPGDIR_SHIFT; -} + pgd_t *pg; + pud_t *pu; + pmd_t *pm; + pte_t *pt; -static pud_t *hugepgd_offset(struct mm_struct *mm, unsigned long addr) -{ - int index; + BUG_ON(! in_hugepage_area(mm->context, addr)); - if (! 
mm->context.huge_pgdir) - return NULL; + addr &= HPAGE_MASK; + pg = pgd_offset(mm, addr); + if (!pgd_none(*pg)) { + pu = pud_offset(pg, addr); + if (!pud_none(*pu)) { + pm = pmd_offset(pu, addr); + pt = (pte_t *)pm; + BUG_ON(!pmd_none(*pm) + && !(pte_present(*pt) && pte_huge(*pt))); + return pt; + } + } - index = hugepgd_index(addr); - BUG_ON(index >= PTRS_PER_HUGEPGD); - return (pud_t *)(mm->context.huge_pgdir + index); + return NULL; } -static inline pte_t *hugepte_offset(pud_t *dir, unsigned long addr) +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) { - int index; - - if (pud_none(*dir)) - return NULL; - - index = (addr >> HPAGE_SHIFT) % PTRS_PER_HUGEPTE; - return (pte_t *)pud_page(*dir) + index; -} + pgd_t *pg; + pud_t *pu; + pmd_t *pm; + pte_t *pt; -static pud_t *hugepgd_alloc(struct mm_struct *mm, unsigned long addr) -{ BUG_ON(! in_hugepage_area(mm->context, addr)); - if (! mm->context.huge_pgdir) { - pgd_t *new; - spin_unlock(&mm->page_table_lock); - /* Don't use pgd_alloc(), because we want __GFP_REPEAT */ - new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT); - BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE)); - spin_lock(&mm->page_table_lock); - - /* - * Because we dropped the lock, we should re-check the - * entry, as somebody else could have populated it.. - */ - if (mm->context.huge_pgdir) - pgd_free(new); - else - mm->context.huge_pgdir = new; - } - return hugepgd_offset(mm, addr); -} + addr &= HPAGE_MASK; -static pte_t *hugepte_alloc(struct mm_struct *mm, pud_t *dir, unsigned long addr) -{ - if (! pud_present(*dir)) { - pte_t *new; + pg = pgd_offset(mm, addr); + pu = pud_alloc(mm, pg, addr); - spin_unlock(&mm->page_table_lock); - new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT); - BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE)); - spin_lock(&mm->page_table_lock); - /* - * Because we dropped the lock, we should re-check the - * entry, as somebody else could have populated it.. 
- */ - if (pud_present(*dir)) { - if (new) - kmem_cache_free(zero_cache, new); - } else { - struct page *ptepage; - - if (! new) - return NULL; - ptepage = virt_to_page(new); - ptepage->mapping = (void *) mm; - ptepage->index = addr & HUGEPGDIR_MASK; - pud_populate(mm, dir, new); + if (pu) { + pm = pmd_alloc(mm, pu, addr); + if (pm) { + pt = (pte_t *)pm; + BUG_ON(!pmd_none(*pm) + && !(pte_present(*pt) && pte_huge(*pt))); + return pt; } } - return hugepte_offset(dir, addr); + return NULL; } -pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) -{ - pud_t *pud; +#define HUGEPTE_BATCH_SIZE (HPAGE_SIZE / PMD_SIZE) - BUG_ON(! in_hugepage_area(mm->context, addr)); +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + int i; - pud = hugepgd_offset(mm, addr); - if (! pud) - return NULL; + if (pte_present(*ptep)) { + pte_clear(mm, addr, ptep); + flush_tlb_pending(); + } - return hugepte_offset(pud, addr); + for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) { + *ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS); + ptep++; + } } -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) +pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) { - pud_t *pud; + unsigned long old = pte_update(ptep, ~0UL); + int i; - BUG_ON(! in_hugepage_area(mm->context, addr)); + if (old & _PAGE_HASHPTE) + hpte_update(mm, addr, old, 0); - pud = hugepgd_alloc(mm, addr); - if (! pud) - return NULL; + for (i = 1; i < HUGEPTE_BATCH_SIZE; i++) { + *ptep = __pte(0); + ptep++; + } - return hugepte_alloc(mm, pud, addr); + return __pte(old); } /* @@ -517,42 +486,6 @@ } } -void hugetlb_mm_free_pgd(struct mm_struct *mm) -{ - int i; - pgd_t *pgdir; - - spin_lock(&mm->page_table_lock); - - pgdir = mm->context.huge_pgdir; - if (! pgdir) - goto out; - - mm->context.huge_pgdir = NULL; - - /* cleanup any hugepte pages leftover */ - for (i = 0; i < PTRS_PER_HUGEPGD; i++) { - pud_t *pud = (pud_t *)(pgdir + i); - - if (! 
pud_none(*pud)) {
-			pte_t *pte = (pte_t *)pud_page(*pud);
-			struct page *ptepage = virt_to_page(pte);
-
-			ptepage->mapping = NULL;
-
-			BUG_ON(memcmp(pte, empty_zero_page, PAGE_SIZE));
-			kmem_cache_free(zero_cache, pte);
-		}
-		pud_clear(pud);
-	}
-
-	BUG_ON(memcmp(pgdir, empty_zero_page, PAGE_SIZE));
-	kmem_cache_free(zero_cache, pgdir);
-
- out:
-	spin_unlock(&mm->page_table_lock);
-}
-
 int hash_huge_page(struct mm_struct *mm, unsigned long access,
 		   unsigned long ea, unsigned long vsid, int local)
 {

--
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/people/dgibson

From tzachi at marvell.com Mon May 16 17:26:15 2005
From: tzachi at marvell.com (Tzachi Perelstein)
Date: Mon, 16 May 2005 10:26:15 +0300
Subject: io_block_mapping in PPC64?
Message-ID:

Hi Ben,
Thanks for your reply.
I don't know why you say that io_block_mapping shouldn't be used.
We traditionally use this function during platform setup to map all
memory mapped devices, registers, secondary memories, etc. The function
interface allows you to provide both physical and virtual addresses and
their associated attributes, so giving the same physical and virtual
mapping is very convenient. Anyway, your suggestion __ioremap is good
enough. I've used _ioremap_explicit after ioremap to set cache coherency
attributes, but I'll change it to one __ioremap call.

I'm not going to refer to your comment about Marvell on the mailing list,
but it's coming soon - FYEO.

Regards,
Tzachi

> -----Original Message-----
> From: Benjamin Herrenschmidt [mailto:benh at kernel.crashing.org]
> Sent: Monday, May 16, 2005 1:55 AM
> To: Tzachi Perelstein
> Cc: linuxppc64-dev at ozlabs.org
> Subject: Re: io_block_mapping in PPC64?
>
> On Mon, 2005-05-16 at 08:49 +1000, Benjamin Herrenschmidt wrote:
> > On Sun, 2005-05-15 at 16:08 +0300, Tzachi Perelstein wrote:
> > > Hi all,
> > >
> > > I develop Linux (PPC64, 2.6.9) for the new Marvell 64470 evaluation
> > > board with the IBM-970FX. Inside the Marvell system controller there is
> > > a 2Mbit internal SRAM for applications that require fast memory access
> > > (traditionally for network driver descriptors).
> > > In the (32bit) PPC arch there is an io_block_mapping function that can
> > > be used to map the internal SRAM with configurable (e.g. cache
> > > coherence) attributes. Is there any way to do it in PPC64 arch? I didn't
> > > find any reference for it. Any help will be appreciated.
> >
> > io_block_mapping() isn't really an interface I would recommend for
> > anybody to use... What about __ioremap() instead?
> >
> > I also notice that Marvell didn't bother submitting any patch for
> > supporting their board, I suppose they are happy with just providing
> > customers with their hacks and no attempt to have them validated &
> > merged upstream ...
>
> Hrm... just noticed your @marvell.com address, so forget my previous
> comment. I would still recommend that any patch you come up with for
> supporting this board be submitted on this list before it hits customers.
>
> I'm trying hard to avoid ppc64 becoming the mess ppc32 has become...
>
> Ben.
>

From benh at kernel.crashing.org Mon May 16 17:30:47 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 16 May 2005 17:30:47 +1000
Subject: io_block_mapping in PPC64?
In-Reply-To:
References:
Message-ID: <1116228648.5095.97.camel@gaston>

On Mon, 2005-05-16 at 10:26 +0300, Tzachi Perelstein wrote:
> Hi Ben,
> Thanks for your reply.
> I don't know why you say that io_block_mapping shouldn't be used.
> We traditionally use this function during platform setup to map all
> memory mapped devices, registers, secondary memories, etc.
> The function interface allows you to provide both physical and virtual
> addresses and their associated attributes, so giving the same physical
> and virtual mapping is very convenient. Anyway, your suggestion
> __ioremap is good enough. I've used _ioremap_explicit after ioremap to
> set cache coherency attributes, but I'll change it to one __ioremap call.

Whichever is good for you, but is this sram cache coherent? If it's
not, just use normal ioremap.

As for why io_block_mapping() is considered bad nowadays: in
general, hard coding a virtual address is a bad thing. For example, it's
the abuse of io_block_mapping() that is forcing us to still default to
only 2GB of TASK_SIZE instead of 3GB in ppc32. We decided on purpose
against providing such an interface on ppc64; you have to "do the right
thing" and request a mapping with {_,__}ioremap like everything else.
There are special cases that do allow explicit mappings in the ppc64
kernel (ioremap_explicit) but these are _absolutely_ reserved for use by
the PCI subsystem for the mapping of the PCI "IO" space of bridges. Your
bridge should provide proper "ranges" properties in the OF device tree
and thus you shouldn't have to call these yourself, just use the common
pci_process_OF_bridge_ranges().

> I'm not going to refer to your comment about Marvell on the mailing list,
> but it's coming soon - FYEO.

Ok, thanks. One thing I want to make sure of is that you are _not_
trying to do something without a device-tree. We decided to make the
presence of an OF-like device-tree mandatory for the ppc64 architecture.
If you have neither an Open Firmware implementation nor a bootloader
capable of providing an OF-like interface (like PIBS), you should at
least have a bootloader providing a simple (static) device-tree via the
flattened format which is currently used between prom_init.c and the rest
of the kernel and with kexec.
This should at least contain a /model property for identifying your platform, a /memory node with the physical memory layout, and eventually additional informations about your PCI host bridge and other non-PCI devices Regards, Ben. From tzachi at marvell.com Mon May 16 17:53:58 2005 From: tzachi at marvell.com (Tzachi Perelstein) Date: Mon, 16 May 2005 10:53:58 +0300 Subject: io_block_mapping in PPC64? Message-ID: See my comments bellow [Tzachi]. > On Mon, 2005-05-16 at 10:26 +0300, Tzachi Perelstein wrote: > > Hi Ben, > > Thanks for your reply. > > I don't know why you say that the io_block_mapping shouldn't be used. > > We traditionally use this function during platform setup to map all > > memory mapped devices, registers, secondary memories, etc. The function > > interface allows you to provide both physical and virtual addresses and > > their associated attributes, so giving the same physical and virtual > > mapping is very convenience. Anyway, your suggestion __ioremap is good > > enough. I've used _ioremap_explicit after ioremap to set cache coherency > > attributes, but I'll change it to one __ioremap call. > > Which every is good for you, but is this sram cache coherent ? If it's > not, just use normal ioremap. [Tzachi]: Yes it is cache coherence. I'll use __ioremap. > As for why io_block_mapping() is considered bad nowadays is that in > general, hard coding a virtual address is a bad thing. For example, it's > the abuse of io_block_mapping() that is forcing us to still default to > only 2Gb of TASK_SIZE instead of 3Gb in ppc32. We decided on purpose > against providing such an interface on ppc64, you have to "do the right > thing" and request a mapping with {_,__}ioremap like anything else. > There are special cases that do allow explicit mappings in the ppc64 > kernel (ioremap_explicit) but these are _absolutely_ reserved for use by > the PCI subsystem for the mapping of the PCI "IO" space of bridges. 
Your > bridge should provide proper "ranges" properties in the OF device tree > and thus you shouldn't have to call these yourself, just use the common > pci_process_OF_bridge_ranges(). [Tzachi]: Thanks clarifying it, I agree. > > > I'm not going to refer your comment about Marvell on the mailing list, > > but it's coming soon - FYEO. > > Ok, thanks. One thing I want to make sure of is that you are _not_ > trying to do something without a device-tree. We decided to make the > presence of an OF-like device-tree mandatory for the ppc64 architecture. > If you don't have an Open Firmware implementation nor a bootloader > capable of providing an OF-like interface (like PIBS), you should at > least have a bootloader providing a simple (static) device-tree via the > flattened format which is currently use between prom_init.c and the rest > of the kernel and with kexec. This should at least contain a /model > property for identifying your platform, a /memory node with the physical > memory layout, and eventually additional informations about your PCI > host bridge and other non-PCI devices [Tzachi]: I do not use device-tree. Our Linux is loaded by U-Boot which first time supports the IBM-970FX and first time executed in 64-bit mode. There is no real need for such complex structures when loaded by U-Boot. If you want ppc64 architecture to be spread to embedded systems too you should avoid such restrictions. > Regards, > Ben. > > From benh at kernel.crashing.org Mon May 16 18:06:56 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 16 May 2005 18:06:56 +1000 Subject: io_block_mapping in PPC64? In-Reply-To: References: Message-ID: <1116230816.5095.115.camel@gaston> On Mon, 2005-05-16 at 10:53 +0300, Tzachi Perelstein wrote: > [Tzachi]: I do not use device-tree. Our Linux is loaded by U-Boot which > first time supports the IBM-970FX and first time executed in 64-bit > mode. There is no real need for such complex structures when loaded by > U-Boot. 
If you want ppc64 architecture to be spread to embedded systems > too you should avoid such restrictions. No, this will not happen. Look at ppc32. Every single board vendor ended up hacking its own board_info structure, which quickly became a total maintenance nightmare. It is very easy and not complex at all to provide at least a simple device-tree to the kernel. What we might do to "help" here may be to provide some code for things like uboot to generate it based on a simple ASCII definition or something like that. But there is simply no way I will let arbitrary structures be invented for every new board out there and be passed around, turning the ppc64 kernel into the same kind of unmanageable mess that ppc32 is nowadays. There are also open implementations of Open Firmware floating around (but I agree a solution that can be fitted in uboot would be useful). I'm pretty confident that Paul Mackerras (ppc64 architecture maintainer) agrees with me here. Ben. From tzachi at marvell.com Mon May 16 20:47:22 2005 From: tzachi at marvell.com (Tzachi Perelstein) Date: Mon, 16 May 2005 13:47:22 +0300 Subject: io_block_mapping in PPC64? Message-ID: Please forgive me, but although I absolutely accept your motivation, I disagree with your conclusion. Linux ppc64 should have an alternative to the 'desktop' device-tree structure, something that will be more embedded systems friendly, like the compact U-Boot structure. It doesn't mean that ppc64 arch will become a mess. I will consult Wolfgang Denk, the U-Boot owner, about this matter. Regards, Tzachi > -----Original Message----- > From: Benjamin Herrenschmidt [mailto:benh at kernel.crashing.org] > Sent: Monday, May 16, 2005 11:07 AM > To: Tzachi Perelstein > Cc: linuxppc64-dev at ozlabs.org > Subject: RE: io_block_mapping in PPC64? > > On Mon, 2005-05-16 at 10:53 +0300, Tzachi Perelstein wrote: > > > [Tzachi]: I do not use device-tree.
Our Linux is loaded by U-Boot which > > first time supports the IBM-970FX and first time executed in 64-bit > > mode. There is no real need for such complex structures when loaded by > > U-Boot. If you want ppc64 architecture to be spread to embedded systems > > too you should avoid such restrictions. > > No, this will not happen. Look at ppc32. Every single board vendor ended > up hacking it's own board_info structure, which quickly became a total > maintainance nightmare. > > It is very easy and not complex at all to provide at least a simple > device-tree to the kernel. What we might do to "help" here may be to > provide some code for things like uboot to generate it based on a simple > ASCII definition or something like that. But there is simply no way I > will let arbitrary structures be invented for every new board out there > and be passed around, turning the ppc64 kernel into the same kind of > unmanageable mess that ppc32 is nowadays. > > There are also open implementations of Open Firmware floating around > (but I agree a solution that can be fitted in uboot would be useful). > > I'm pretty confident that Paul Mackerras (ppc64 architecture maintainer) > agrees with me here. > > Ben. > From hch at lst.de Mon May 16 20:58:41 2005 From: hch at lst.de (Christoph Hellwig) Date: Mon, 16 May 2005 12:58:41 +0200 Subject: io_block_mapping in PPC64? In-Reply-To: References: Message-ID: <20050516105841.GA9288@lst.de> On Mon, May 16, 2005 at 01:47:22PM +0300, Tzachi Perelstein wrote: > > Please forgive me, but although I absolutely accept your motivation, I > disagree with your conclusion. Linux ppc64 should have an alternative to > the 'desktop' device-tree structure, something that will be more > embedded systems friendly, like the compact U-Boot structure. It doesn't > mean that ppc64 arch will become a mess. So why exactly is the flat device tree not embedded device friendly? The IBM maple evaluation board uses it quite nicely. 
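[Editor's sketch] The messages above describe the flattened device tree as a structured blob that a bootloader builds and hands to the kernel. The C fragment below is a minimal sketch of that idea only: it serializes a root node with a model property and a memory node into a byte buffer. The tag values, the inline property names, struct blob, and every helper name are assumptions made for this illustration; this is a toy layout, not the blob format the ppc64 kernel actually parses.

```c
/*
 * Toy serializer for a "flattened" tree of nodes and properties.
 * Simplified illustration only: this is NOT the blob layout the ppc64
 * kernel expects. Tag values, struct blob and all helpers are hypothetical.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { TAG_BEGIN_NODE = 1, TAG_END_NODE = 2, TAG_PROP = 3, TAG_END = 9 };

struct blob {
    uint8_t buf[256];
    size_t len;
};

/* Append one 32-bit big-endian cell. */
static void put_cell(struct blob *b, uint32_t v)
{
    assert(b->len + 4 <= sizeof(b->buf));
    b->buf[b->len++] = (uint8_t)(v >> 24);
    b->buf[b->len++] = (uint8_t)(v >> 16);
    b->buf[b->len++] = (uint8_t)(v >> 8);
    b->buf[b->len++] = (uint8_t)v;
}

/* Append raw bytes, then zero-pad to the next 4-byte boundary. */
static void put_bytes(struct blob *b, const void *data, size_t n)
{
    assert(b->len + n + 3 <= sizeof(b->buf));
    memcpy(b->buf + b->len, data, n);
    b->len += n;
    while (b->len % 4)
        b->buf[b->len++] = 0;
}

static void begin_node(struct blob *b, const char *name)
{
    put_cell(b, TAG_BEGIN_NODE);
    put_bytes(b, name, strlen(name) + 1);   /* NUL-terminated node name */
}

static void end_node(struct blob *b)
{
    put_cell(b, TAG_END_NODE);
}

/* Property = tag, value length, NUL-terminated name, value bytes. */
static void add_prop(struct blob *b, const char *name,
                     const void *val, uint32_t len)
{
    put_cell(b, TAG_PROP);
    put_cell(b, len);
    put_bytes(b, name, strlen(name) + 1);
    put_bytes(b, val, len);
}

/* Build: / { model = "example,board"; memory { reg = <0 0x10000000>; }; } */
static size_t build_example(struct blob *b)
{
    static const char model[] = "example,board";              /* made-up name */
    static const uint8_t reg[8] = { 0, 0, 0, 0, 0x10, 0, 0, 0 };

    b->len = 0;
    begin_node(b, "");                      /* root node has an empty name */
    add_prop(b, "model", model, sizeof(model));
    begin_node(b, "memory");
    add_prop(b, "reg", reg, sizeof(reg));
    end_node(b);
    end_node(b);
    put_cell(b, TAG_END);
    return b->len;
}
```

A bootloader following this pattern would call something like build_example() once and pass the buffer address to the kernel. The point of the sketch is that the producer side needs only a handful of append helpers, whatever the real encoding is.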
From dwmw2 at infradead.org Mon May 16 21:00:26 2005 From: dwmw2 at infradead.org (David Woodhouse) Date: Mon, 16 May 2005 12:00:26 +0100 Subject: io_block_mapping in PPC64? In-Reply-To: References: Message-ID: <1116241227.25594.27.camel@localhost.localdomain> On Mon, 2005-05-16 at 13:47 +0300, Tzachi Perelstein wrote: > Please forgive me, but although I absolutely accept your motivation, I > disagree with your conclusion. Linux ppc64 should have an alternative to > the 'desktop' device-tree structure, something that will be more > embedded systems friendly, like the compact U-Boot structure. It doesn't > mean that ppc64 arch will become a mess. The U-Boot boardinfo structure _is_ an appalling mess, and we really need to avoid making the same mistake for ppc64. A compacted device-tree such as that which is passed over kexec() is perfectly "embedded systems friendly". It's just a structured blob which your bootloader provides to the Linux kernel. -- dwmw2 From benh at kernel.crashing.org Mon May 16 21:33:41 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 16 May 2005 21:33:41 +1000 Subject: io_block_mapping in PPC64? In-Reply-To: <20050516105841.GA9288@lst.de> References: <20050516105841.GA9288@lst.de> Message-ID: <1116243221.5095.126.camel@gaston> On Mon, 2005-05-16 at 12:58 +0200, Christoph Hellwig wrote: > On Mon, May 16, 2005 at 01:47:22PM +0300, Tzachi Perelstein wrote: > > > > Please forgive me, but although I absolutely accept your motivation, I > > disagree with your conclusion. Linux ppc64 should have an alternative to > > the 'desktop' device-tree structure, something that will be more > > embedded systems friendly, like the compact U-Boot structure. It doesn't > > mean that ppc64 arch will become a mess. > > So why exactly is the flat device tree not embedded device friendly? 
> The IBM maple evaluation board uses it quite nicely Sorry, Christoph, I have to correct you here :) The Maple board uses IBM PIBS firmware which implements the full open firmware client interface. However, I agree this can be a bit too complicated for an embedded firmware, and the compact device-tree format may be more suited. That was one of the motivations for implementing it, along with making kexec possible. The "bare metal" linux folks in IBM use it, I'm trying to figure out if some code can be provided to the community as an example on how to generate it. The kernel imposes only a few requirements regarding the device-tree. On purpose, I do not make it mandatory, for example, to have every PCI device enumerated by the firmware. Only host bridges, so those can provide proper memory/io windows and irq routing to the kernel. That along with a few nodes defining the board type, RAM layout, and maybe a few other options you want to pass along to the kernel. However, for things like on-chip devices, experience has shown that a device-tree like structure is far more convenient and flexible than any kind of hard-coded mechanism, it allows the firmware to easily inform the kernel on what driver to load, what address/interrupts to assign, what MAC address to use for ethernet, etc... without creating a "carved in stone" ABI like a board info structure does. Believe me, Tzachi, I've been there, doing embedded development and bring up, and I much prefer changing a property or two in the device tree provided by the firmware than having to add more hacks to the kernel image every time I decide to release a new revision of an embedded board. It is _not_ complicated, and it should be fairly easy to write a parser that turns an ASCII representation into a "blob" that can be flashed and passed along to the kernel. In fact, I'd be glad to help you implement some support for generating a simple device-tree from uboot.
I'm not trying to put "arbitrary" restrictions here, I'm really trying to promote a nicer and more maintainable way of defining the firmware<->kernel interface that is suitable for both "desktop" (and server) like solutions and embedded. In fact, if you read the linuxppc-embedded list archives, you'll notice that more than once, people have been trying to replace the current mess with something like that, though so far, nobody has actually implemented anything, mostly because there is the whole lot of existing legacy board_info crap to deal with. I'm trying to avoid that for the future of ppc64. Ben. From miltonm at bga.com Tue May 17 00:43:43 2005 From: miltonm at bga.com (Milton Miller) Date: Mon, 16 May 2005 09:43:43 -0500 Subject: Four level pagetables, now with hugepages Message-ID: <4c4d1fc805b3fe4e82de7de57023a487@bga.com> On Mon May 16 14:43:53 EST 2005, David Gibson wrote: > =================================================================== > --- working-2.6.orig/arch/ppc64/kernel/head.S 2005-05-13 > 17:57:38.000000000 +1000 > +++ working-2.6/arch/ppc64/kernel/head.S 2005-05-13 > 17:57:39.000000000 +1000 > @@ -38,6 +38,7 @@ > #include > #include > #include > +#include > > #ifdef CONFIG_PPC_ISERIES > #define DO_SOFT_DISABLE > @@ -2117,17 +2118,17 @@ > empty_zero_page: > .space 4096 > > - .globl swapper_pg_dir > -swapper_pg_dir: > - .space 4096 > - > #ifdef CONFIG_SMP > /* 1 page segment table per cpu (max 48, cpu0 allocated at > STAB0_PHYS_ADDR) */ > .globl stab_array > stab_array: > .space 4096 * 48 > #endif > - > + > + .globl swapper_pg_dir > +swapper_pg_dir: > + .space PAGE_SIZE > + > /* > * This space gets a copy of optional info passed to us by the > bootstrap > * Used to pass parameters into the kernel like root=/dev/sda1, etc. Does this change to PAGE_SIZE need to be > 4k aligned? If so we should define a new linker section.
milton From roland at topspin.com Tue May 17 06:14:58 2005 From: roland at topspin.com (Roland Dreier) Date: Mon, 16 May 2005 13:14:58 -0700 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <20050514074524.GC20021@kroah.com> (Greg KH's message of "Sat, 14 May 2005 00:45:25 -0700") References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com> Message-ID: <528y2flygt.fsf@topspin.com> Greg> No, as Ben said, do not do this. Use write. And as you are Greg> only doing 1 type of ioctl, it shouldn't be an issue. Also Greg> it will be faster than the ioctl due to lack of BKL usage :) This is no longer true. ioctls don't have to take the BKL now that struct file_operations has unlocked_ioctl and compat_ioctl. - R. From greg at kroah.com Tue May 17 06:58:25 2005 From: greg at kroah.com (Greg KH) Date: Mon, 16 May 2005 13:58:25 -0700 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <200505141505.08999.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com> <200505141505.08999.arnd@arndb.de> Message-ID: <20050516205825.GB11938@kroah.com> On Sat, May 14, 2005 at 03:05:06PM +0200, Arnd Bergmann wrote: > On Sünnavend 14 Mai 2005 09:45, Greg KH wrote: > > On Fri, May 13, 2005 at 09:29:06PM +0200, Arnd Bergmann wrote: > > > /run A stub file that lets us do ioctl. > > > > No, as Ben said, do not do this. Use write. And as you are only doing > > 1 type of ioctl, it shouldn't be an issue. Also it will be faster than > > the ioctl due to lack of BKL usage :) > > I've been back and forth between a number of interfaces here and haven't > found one that I'm really happy with. Using write() is probably my least > favorite one, but these are the alternatives I've come up with so far: > > 1. ioctl: > pro: > - easy to do in a file system > - can have both input and output arguments > contra: > - ugly > - weakly typed > - unpopular > > 2.
sys_spufs_run(int fd, __u32 pc, __u32 *new_pc, __u32 *status): > pro: > - strong types > - can have both input and output arguments > contra: > - does not fit file system semantics well > - bad for prototyping I suggest you do this. Based on what you say you want the code to do, I agree, write() doesn't really work out well (but it might, and if you want an example of how to do it, look at the ibmasm driver, it implements write() in a way much like what you are wanting to do.) > > > +/**** spufs attributes > > > + * > > > > + * Perhaps these file operations could be put in debugfs or libfs instead, > > > + * they are not really SPU specific. > > > > Yes they should. I'll gladly take them for debugfs or like you state, > > libfs is probably the better place for them so everyone can use them. > > > > If you make up a patch, I'll fix up debugfs to use them properly. > > Ok. I'll do the patch for libfs then. I've been thinking about > changing > > +#define spufs_attribute(name) \ > +static int name ## _open(struct inode *inode, struct file *file) \ > +{ \ > + return spufs_attr_open(inode, file, &name ## _get, &name ## _set); \ > +} \ > +static struct file_operations name = { \ > + .open = name ## _open, \ > + .release = spufs_attr_close, \ > + .read = spufs_attr_read, \ > + .write = spufs_attr_write, \ > +}; > > to take a format string argument as well, which is then used in the > spufs_attr_read function instead of the hardcoded "%ld\n". Do you think > I should do that or rather keep the current implementation? yeah, you probably need the format string. 
> > > +#define spufs_attribute(name) \ > > > +static int name ## _open(struct inode *inode, struct file *file) \ > > > +{ \ > > > + return spufs_attr_open(inode, file, &name ## _get, &name ## _set); \ > > > +} \ > > > +static struct file_operations name = { \ > > > + .open = name ## _open, \ > > > + .release = spufs_attr_close, \ > > > + .read = spufs_attr_read, \ > > > + .write = spufs_attr_write, \ > > > +}; > > > > No module owner set? Be careful if not... > > Right. Is there ever a reason to have file operations without owner? Code built into the kernel? > Maybe dentry_open() could warn about this. Would die a horrible death due to the above :) > > > +/* This looks really wrong! */ > > > +static int spufs_rmdir(struct inode *root, struct dentry *dir_dentry) > > > > Why do you need this? Doesn't 'simple_rmdir' work for you? > > The idea was to keep the file system contents consistant with the > underlying data structures. If I allow users to unlink context > directories or files in there, there is no longer a way to extract > reliable information from the file system, e.g. for the debugger > or for implementing something like spu_ps. > > My solution was to force the dentries in each directory to be > present. When the directory is created, the files are already > there and unlinking a single file is impossible. To destroy the > spu context, the user has to rmdir it, which will either remove > all files in there as well or fail in the case that any file is > still open. Ick. > Of course that is not really Posix behavior, but it avoids some > other pitfalls. Go with a syscall :) > > Remember __u16 and friends for structures that cross the user/kernel > > boundry (like your ioctl that you will be rewriting...) > > Yes. There are no data structures that are shared with user space > except the current ioctl argument. 
The MFC_TagSizeClassCmd (yes, I > need to remember to change the name some day, currently this still > uses the identifiers from the spec) and the others are defined > by the hardware interface. Identifiers that are named as per a spec are ok to leave alone. We did that with USB, as it makes sense to do it that way for anyone who reads the spec and the code. But if your spec is only for the Linux OS, well, that's a different issue... thanks, greg k-h From greg at kroah.com Tue May 17 06:53:58 2005 From: greg at kroah.com (Greg KH) Date: Mon, 16 May 2005 13:53:58 -0700 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <528y2flygt.fsf@topspin.com> References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com> <528y2flygt.fsf@topspin.com> Message-ID: <20050516205358.GA11938@kroah.com> On Mon, May 16, 2005 at 01:14:58PM -0700, Roland Dreier wrote: > Greg> No, as Ben said, do not do this. Use write. And as you are > Greg> only doing 1 type of ioctl, it shouldn't be an issue. Also > Greg> it will be faster than the ioctl due to lack of BKL usage :) > > This is no longer true. ioctls don't have to take the BKL now that > struct file_operations has unlocked_ioctl and compat_ioctl. Yes, but his patch did not use them :) thanks, greg k-h From arnd at arndb.de Tue May 17 08:01:05 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 17 May 2005 00:01:05 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <20050516205825.GB11938@kroah.com> References: <200505132117.37461.arnd@arndb.de> <200505141505.08999.arnd@arndb.de> <20050516205825.GB11938@kroah.com> Message-ID: <200505170001.10405.arnd@arndb.de> On Maandag 16 Mai 2005 22:58, Greg KH wrote: > On Sat, May 14, 2005 at 03:05:06PM +0200, Arnd Bergmann wrote: > > 2. 
sys_spufs_run(int fd, __u32 pc, __u32 *new_pc, __u32 *status): > > pro: > > - strong types > > - can have both input and output arguments > > contra: > > - does not fit file system semantics well > > - bad for prototyping > > I suggest you do this. Based on what you say you want the code to do, I > agree, write() doesn't really work out well The syscall approach has another small disadvantage in that I need to do a callback registration mechanism for it if I want to have spufs as a loadable module. I could of course require spufs to be builtin, but that complicates prototype testing (as mentioned) and enlarges combined pSeries/powermac/BPA distro kernels. I think I'll leave the ioctl for now and add a note saying that I need to replace it with a syscall or the write/read or lseek/read based approach when I arrive at a more feature complete point. > (but it might, and if you > want an example of how to do it, look at the ibmasm driver, it > implements write() in a way much like what you are wanting to do.) That would be the same write/read combination as Ben's second proposal and the nfsctl file system, right? > > My solution was to force the dentries in each directory to be > > present. When the directory is created, the files are already > > there and unlinking a single file is impossible. To destroy the > > spu context, the user has to rmdir it, which will either remove > > all files in there as well or fail in the case that any file is > > still open. > > Ick. > > > Of course that is not really Posix behavior, but it avoids some > > other pitfalls. > > Go with a syscall :) Sorry, I'm not following that reasoning. How does a syscall help with the problem of atomic context destruction? 
Arnd <>< From greg at kroah.com Tue May 17 08:27:36 2005 From: greg at kroah.com (Greg KH) Date: Mon, 16 May 2005 15:27:36 -0700 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <200505170001.10405.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <200505141505.08999.arnd@arndb.de> <20050516205825.GB11938@kroah.com> <200505170001.10405.arnd@arndb.de> Message-ID: <20050516222736.GA13350@kroah.com> On Tue, May 17, 2005 at 12:01:05AM +0200, Arnd Bergmann wrote: > On Maandag 16 Mai 2005 22:58, Greg KH wrote: > > On Sat, May 14, 2005 at 03:05:06PM +0200, Arnd Bergmann wrote: > > > > 2. sys_spufs_run(int fd, __u32 pc, __u32 *new_pc, __u32 *status): > > > pro: > > > - strong types > > > - can have both input and output arguments > > > contra: > > > - does not fit file system semantics well > > > - bad for prototyping > > > > I suggest you do this. Based on what you say you want the code to do, I > > agree, write() doesn't really work out well > > The syscall approach has another small disadvantage in that I need to > do a callback registration mechanism for it if I want to have spufs as > a loadable module. I could of course require spufs to be builtin, but > that complicates prototype testing (as mentioned) and enlarges combined > pSeries/powermac/BPA distro kernels. Huh? We can handle syscalls in modules these days pretty simply. Look at how nfs and others do it. > I think I'll leave the ioctl for now and add a note saying that I need > to replace it with a syscall or the write/read or lseek/read based > approach when I arrive at a more feature complete point. Nah, make it a syscall :) > > (but it might, and if you > > want an example of how to do it, look at the ibmasm driver, it > > implements write() in a way much like what you are wanting to do.) > > That would be the same write/read combination as Ben's second > proposal and the nfsctl file system, right? Yes. > > > My solution was to force the dentries in each directory to be > > > present. 
When the directory is created, the files are already > > > there and unlinking a single file is impossible. To destroy the > > > spu context, the user has to rmdir it, which will either remove > > > all files in there as well or fail in the case that any file is > > > still open. > > > > Ick. > > > > > Of course that is not really Posix behavior, but it avoids some > > > other pitfalls. > > > > Go with a syscall :) > > Sorry, I'm not following that reasoning. How does a syscall help > with the problem of atomic context destruction? Sorry, I thought they were referring to the same issue. greg k-h From arnd at arndb.de Tue May 17 08:22:56 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 17 May 2005 00:22:56 +0200 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <20050516222736.GA13350@kroah.com> References: <200505132117.37461.arnd@arndb.de> <200505170001.10405.arnd@arndb.de> <20050516222736.GA13350@kroah.com> Message-ID: <200505170022.57662.arnd@arndb.de> On Dinsdag 17 Mai 2005 00:27, Greg KH wrote: > Huh? We can handle syscalls in modules these days pretty simply. Look > at how nfs and others do it. Well afaics, nfs works around this issue by having fs/nfsctl.o always as a builtin and abstracting the calls through a file system using read/write. That would be Ben's idea again, i.e. not actually using a system call. The only widely used module that I'm aware of ever implementing a system call was the TUX web accelerator that used a hack in entry.S and its own dynamic registration.
Arnd <>< From greg at kroah.com Tue May 17 08:49:10 2005 From: greg at kroah.com (Greg KH) Date: Mon, 16 May 2005 15:49:10 -0700 Subject: [PATCH 7/8] ppc64: SPU file system In-Reply-To: <200505170022.57662.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <200505170001.10405.arnd@arndb.de> <20050516222736.GA13350@kroah.com> <200505170022.57662.arnd@arndb.de> Message-ID: <20050516224909.GB13866@kroah.com> On Tue, May 17, 2005 at 12:22:56AM +0200, Arnd Bergmann wrote: > On Dinsdag 17 Mai 2005 00:27, Greg KH wrote: > > Huh? We can handle syscalls in modules these days pretty simply. Look > > at how nfs and others do it. > > Well afaics, nfs works around this issue by having fs/nfsctl.o always > as a builtin and abstracting the calls through a file system using > read/write. That would be Ben's idea again, i.e. not actually > using a system call. > > The only widely used module that I'm aware of ever implementing a system > call was the TUX web accelerator that used a hack in entry.S > and its own dynamic registration. Sorry, but I was thinking of the cond_syscall() stuff, to allow syscalls in modules or code that just happens to not be built into the kernel.
thanks, greg k-h From david at gibson.dropbear.id.au Tue May 17 11:39:38 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Tue, 17 May 2005 11:39:38 +1000 Subject: Four level pagetables, now with hugepages In-Reply-To: <4c4d1fc805b3fe4e82de7de57023a487@bga.com> References: <4c4d1fc805b3fe4e82de7de57023a487@bga.com> Message-ID: <20050517013938.GA31420@localhost.localdomain> On Mon, May 16, 2005 at 09:43:43AM -0500, Milton Miller wrote: > On Mon May 16 14:43:53 EST 2005, David Gibson wrote: > >=================================================================== > >17:57:38.000000000 +1000 > >+++ working-2.6/arch/ppc64/kernel/head.S 2005-05-13 > >17:57:39.000000000 +1000 > >@@ -38,6 +38,7 @@ > > #include > > #include > > #include > >+#include > > > > #ifdef CONFIG_PPC_ISERIES > > #define DO_SOFT_DISABLE > >@@ -2117,17 +2118,17 @@ > > empty_zero_page: > > .space 4096 > > > >- .globl swapper_pg_dir > >-swapper_pg_dir: > >- .space 4096 > >- > > #ifdef CONFIG_SMP > > /* 1 page segment table per cpu (max 48, cpu0 allocated at > >STAB0_PHYS_ADDR) */ > > .globl stab_array > > stab_array: > > .space 4096 * 48 > > #endif > >- > >+ > >+ .globl swapper_pg_dir > >+swapper_pg_dir: > >+ .space PAGE_SIZE > >+ > > /* > > * This space gets a copy of optional info passed to us by the > >bootstrap > > * Used to pass parameters into the kernel like root=/dev/sda1, etc. > > Does this change to PAGE_SIZE need to be > 4k aligned? If so we > shouuld define a new linker section. No, this patch doesn't change PAGE_SIZE, it's still 4k. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From paulus at samba.org Tue May 17 13:26:46 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 17 May 2005 13:26:46 +1000 Subject: io_block_mapping in PPC64? 
In-Reply-To: References: Message-ID: <17033.25718.542891.670814@cargo.ozlabs.ibm.com> Tzachi Perelstein writes: > [Tzachi]: I do not use device-tree. Our Linux is loaded by U-Boot which > first time supports the IBM-970FX and first time executed in 64-bit > mode. There is no real need for such complex structures when loaded by > U-Boot. If you want ppc64 architecture to be spread to embedded systems > too you should avoid such restrictions. As maintainer I will not accept patches to make Linux run on new PPC64 machines without a device tree. It's not hard to create a device tree; it can be supplied as a binary blob by the boot loader, and there are tools to create the blob from a text representation. We will do far better by standardizing on one representation for system information at this stage than we would by replicating the rat's nest that we have in ppc32. Paul. From paulus at samba.org Tue May 17 13:18:53 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 17 May 2005 13:18:53 +1000 Subject: io_block_mapping in PPC64? In-Reply-To: References: Message-ID: <17033.25245.812537.720566@cargo.ozlabs.ibm.com> Tzachi Perelstein writes: > I don't know why you say that the io_block_mapping shouldn't be used. It's because it makes (and imposes) assumptions about the layout of kernel virtual address space, and it does so at the point where it is called. That point is in the caller of io_block_mapping, which is usually in board setup code, distant from the mm code, which makes it harder to maintain the mm code that lays out the kernel virtual space and limits its flexibility. In fact there is really no reason for platform and device drivers to require a specific virtual address for their device. It's not that hard to store the virtual address in a variable and dereference that. :) Paul. 
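[Editor's sketch] Paul's closing point, store whatever virtual address the mapping call returns and dereference that, can be illustrated with a small userspace mock. Nothing here is kernel code: mock_ioremap(), mock_iounmap(), struct sram_dev, and the sram_* helpers are hypothetical names made up for the example, and the "mapping" is simply heap memory standing in for the device.

```c
/*
 * Userspace mock of the pattern: request a mapping, keep the returned
 * address in a variable, and access the device relative to it.
 * mock_ioremap(), mock_iounmap(), struct sram_dev and the sram_*
 * helpers are hypothetical; they are not kernel APIs.
 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for ioremap(): the caller never picks the virtual address. */
static void *mock_ioremap(uint64_t phys_addr, size_t size)
{
    (void)phys_addr;            /* a real mapping would consume this */
    return calloc(1, size);     /* zero-filled fake "device" memory */
}

static void mock_iounmap(void *virt)
{
    free(virt);
}

struct sram_dev {
    volatile uint8_t *base;     /* address handed back by the mapper */
    size_t size;
};

/* Returns 0 on success, -1 if the mapping could not be set up. */
static int sram_probe(struct sram_dev *dev, uint64_t phys_addr, size_t size)
{
    dev->base = mock_ioremap(phys_addr, size);
    if (!dev->base)
        return -1;
    dev->size = size;
    return 0;
}

static void sram_remove(struct sram_dev *dev)
{
    mock_iounmap((void *)dev->base);
    dev->base = NULL;
    dev->size = 0;
}

/* Every access is an offset from the stored base, never a fixed address. */
static void sram_write32(struct sram_dev *dev, size_t off, uint32_t val)
{
    assert(off + sizeof(val) <= dev->size);
    *(volatile uint32_t *)(dev->base + off) = val;
}

static uint32_t sram_read32(struct sram_dev *dev, size_t off)
{
    assert(off + sizeof(uint32_t) <= dev->size);
    return *(volatile uint32_t *)(dev->base + off);
}
```

Because the board code only ever sees dev->base, the mm code stays free to place the mapping anywhere in kernel virtual space, which is the flexibility the message argues io_block_mapping() takes away.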
From paulus at samba.org Tue May 17 17:01:37 2005 From: paulus at samba.org (Paul Mackerras) Date: Tue, 17 May 2005 17:01:37 +1000 Subject: [PATCH 4/8] ppc64: add BPA platform type In-Reply-To: <200505132125.34358.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <200505132125.34358.arnd@arndb.de> Message-ID: <17033.38609.60873.138572@cargo.ozlabs.ibm.com> Arnd Bergmann writes: > This adds the basic support for running on BPA machines. > So far, this is only the IBM workstation, and it will > not run on others without a little more generalization. > +/* FIXME: consolidate this into rtas.c or similar */ > +static void __init pSeries_calibrate_decr(void) Shouldn't this be called bpa_calibrate_decr or something similar? > -#define PV_630 0x0040 > -#define PV_630p 0x0041 > +#define PV_630 0x0040 > +#define PV_630p 0x0041 Hmmm, I don't think your patch needs to clean up the whitespace here. Regards, Paul. From benh at kernel.crashing.org Tue May 17 17:31:33 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 17 May 2005 17:31:33 +1000 Subject: [PATCH] (RFC) slaughter imalloc Message-ID: <1116315093.6804.14.camel@gaston> Hi John ! Can you give me your opinion on this patch ? It will probably not apply "as-is" as I did it on top of another pile of not-yet merged patches that affect the same files, but I'm more interested in what you think of it at the moment than actual testing. The idea is to get rid of imalloc. I did 2 things here: - Normal ioremap's go to the vmalloc space like most other archs. Immediate benefit: things like modules that ioremap their chip registers when loaded will usually end up with the module code, data _and_ the virtual region of the registers in the same segment, thus less SLB misses. - Explicit ioremap just loses all references to the imalloc stuff. It wasn't necessary as far as I can tell. 
ioremap_explicit() will just establish PTEs directly, and iounmap_explicit() will remove PTEs & flush hash for the concerned area. Those PTEs are currently put above the vmalloc region, at basically the same place where the imalloc area used to be after David's patch to kill ioremap_mm. Note that there is lots of room left, more than we need; we could extend the vmalloc part of it eventually but I doubt we'll ever run out of space anyway. Of course, the latter is slightly less robust if you call it with crappy arguments, since previously it would do some tracking and now it does not, but does that ever happen in practice? (I do guard a bit anyway, by making sure you don't hit outside of the PHB IO range, since that's the only legitimate use of the _explicit() calls). Finally, I moved all ioremap-related code to a new file called ioremap.c instead of init.c for clarity. Index: linux-work/include/asm-ppc64/pgtable.h =================================================================== --- linux-work.orig/include/asm-ppc64/pgtable.h 2005-05-16 15:42:58.000000000 +1000 +++ linux-work/include/asm-ppc64/pgtable.h 2005-05-17 17:02:02.000000000 +1000 @@ -65,10 +65,15 @@ /* * Define the address range of the vmalloc VM area. */ -#define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_SIZE (0x80000000000UL) -#define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) - +#ifndef __ASSEMBLY__ +extern unsigned long ioremap_bot; +#define VMALLOC_BASE (0xD000000000000000ul) +#define VMALLOC_START (ioremap_bot) +#define VMALLOC_SIZE (0x80000000000UL - ioremap_bot) +#define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) +#define PHBS_IO_BASE (VMALLOC_END) +#define PHBS_IO_END (PHBS_IO_BASE + 80000000ul) +#endif /* __ASSEMBLY__ */ /* * Bits in a linux-style PTE. These match the bits in the * (hardware-defined) PowerPC PTE as closely as possible.
Index: linux-work/arch/ppc64/mm/imalloc.c =================================================================== --- linux-work.orig/arch/ppc64/mm/imalloc.c 2005-05-16 15:42:57.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,317 +0,0 @@ -/* - * c 2001 PPC 64 Team, IBM Corp - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public License - * as published by the Free Software Foundation; either version - * 2 of the License, or (at your option) any later version. - */ - -#include -#include - -#include -#include -#include -#include -#include -#include - -static DECLARE_MUTEX(imlist_sem); -struct vm_struct * imlist = NULL; - -static int get_free_im_addr(unsigned long size, unsigned long *im_addr) -{ - unsigned long addr; - struct vm_struct **p, *tmp; - - addr = ioremap_bot; - for (p = &imlist; (tmp = *p) ; p = &tmp->next) { - if (size + addr < (unsigned long) tmp->addr) - break; - if ((unsigned long)tmp->addr >= ioremap_bot) - addr = tmp->size + (unsigned long) tmp->addr; - if (addr >= IMALLOC_END-size) - return 1; - } - *im_addr = addr; - - return 0; -} - -/* Return whether the region described by v_addr and size is a subset - * of the region described by parent - */ -static inline int im_region_is_subset(unsigned long v_addr, unsigned long size, - struct vm_struct *parent) -{ - return (int) (v_addr >= (unsigned long) parent->addr && - v_addr < (unsigned long) parent->addr + parent->size && - size < parent->size); -} - -/* Return whether the region described by v_addr and size is a superset - * of the region described by child - */ -static int im_region_is_superset(unsigned long v_addr, unsigned long size, - struct vm_struct *child) -{ - struct vm_struct parent; - - parent.addr = (void *) v_addr; - parent.size = size; - - return im_region_is_subset((unsigned long) child->addr, child->size, - &parent); -} - -/* Return whether the region described by v_addr and size overlaps - * the 
region described by vm. Overlapping regions meet the - * following conditions: - * 1) The regions share some part of the address space - * 2) The regions aren't identical - * 3) Neither region is a subset of the other - */ -static int im_region_overlaps(unsigned long v_addr, unsigned long size, - struct vm_struct *vm) -{ - if (im_region_is_superset(v_addr, size, vm)) - return 0; - - return (v_addr + size > (unsigned long) vm->addr + vm->size && - v_addr < (unsigned long) vm->addr + vm->size) || - (v_addr < (unsigned long) vm->addr && - v_addr + size > (unsigned long) vm->addr); -} - -/* Determine imalloc status of region described by v_addr and size. - * Can return one of the following: - * IM_REGION_UNUSED - Entire region is unallocated in imalloc space. - * IM_REGION_SUBSET - Region is a subset of a region that is already - * allocated in imalloc space. - * vm will be assigned to a ptr to the parent region. - * IM_REGION_EXISTS - Exact region already allocated in imalloc space. - * vm will be assigned to a ptr to the existing imlist - * member. - * IM_REGION_OVERLAPS - Region overlaps an allocated region in imalloc space. - * IM_REGION_SUPERSET - Region is a superset of a region that is already - * allocated in imalloc space. 
- */ -static int im_region_status(unsigned long v_addr, unsigned long size, - struct vm_struct **vm) -{ - struct vm_struct *tmp; - - for (tmp = imlist; tmp; tmp = tmp->next) - if (v_addr < (unsigned long) tmp->addr + tmp->size) - break; - - if (tmp) { - if (im_region_overlaps(v_addr, size, tmp)) - return IM_REGION_OVERLAP; - - *vm = tmp; - if (im_region_is_subset(v_addr, size, tmp)) { - /* Return with tmp pointing to superset */ - return IM_REGION_SUBSET; - } - if (im_region_is_superset(v_addr, size, tmp)) { - /* Return with tmp pointing to first subset */ - return IM_REGION_SUPERSET; - } - else if (v_addr == (unsigned long) tmp->addr && - size == tmp->size) { - /* Return with tmp pointing to exact region */ - return IM_REGION_EXISTS; - } - } - - *vm = NULL; - return IM_REGION_UNUSED; -} - -static struct vm_struct * split_im_region(unsigned long v_addr, - unsigned long size, struct vm_struct *parent) -{ - struct vm_struct *vm1 = NULL; - struct vm_struct *vm2 = NULL; - struct vm_struct *new_vm = NULL; - - vm1 = (struct vm_struct *) kmalloc(sizeof(*vm1), GFP_KERNEL); - if (vm1 == NULL) { - printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); - return NULL; - } - - if (v_addr == (unsigned long) parent->addr) { - /* Use existing parent vm_struct to represent child, allocate - * new one for the remainder of parent range - */ - vm1->size = parent->size - size; - vm1->addr = (void *) (v_addr + size); - vm1->next = parent->next; - - parent->size = size; - parent->next = vm1; - new_vm = parent; - } else if (v_addr + size == (unsigned long) parent->addr + - parent->size) { - /* Allocate new vm_struct to represent child, use existing - * parent one for remainder of parent range - */ - vm1->size = size; - vm1->addr = (void *) v_addr; - vm1->next = parent->next; - new_vm = vm1; - - parent->size -= size; - parent->next = vm1; - } else { - /* Allocate two new vm_structs for the new child and - * uppermost remainder, and use existing parent one for the - * lower remainder of 
parent range - */ - vm2 = (struct vm_struct *) kmalloc(sizeof(*vm2), GFP_KERNEL); - if (vm2 == NULL) { - printk(KERN_ERR "%s() out of memory\n", __FUNCTION__); - kfree(vm1); - return NULL; - } - - vm1->size = size; - vm1->addr = (void *) v_addr; - vm1->next = vm2; - new_vm = vm1; - - vm2->size = ((unsigned long) parent->addr + parent->size) - - (v_addr + size); - vm2->addr = (void *) v_addr + size; - vm2->next = parent->next; - - parent->size = v_addr - (unsigned long) parent->addr; - parent->next = vm1; - } - - return new_vm; -} - -static struct vm_struct * __add_new_im_area(unsigned long req_addr, - unsigned long size) -{ - struct vm_struct **p, *tmp, *area; - - for (p = &imlist; (tmp = *p) ; p = &tmp->next) { - if (req_addr + size <= (unsigned long)tmp->addr) - break; - } - - area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL); - if (!area) - return NULL; - area->flags = 0; - area->addr = (void *)req_addr; - area->size = size; - area->next = *p; - *p = area; - - return area; -} - -static struct vm_struct * __im_get_area(unsigned long req_addr, - unsigned long size, - int criteria) -{ - struct vm_struct *tmp; - int status; - - status = im_region_status(req_addr, size, &tmp); - if ((criteria & status) == 0) { - return NULL; - } - - switch (status) { - case IM_REGION_UNUSED: - tmp = __add_new_im_area(req_addr, size); - break; - case IM_REGION_SUBSET: - tmp = split_im_region(req_addr, size, tmp); - break; - case IM_REGION_EXISTS: - /* Return requested region */ - break; - case IM_REGION_SUPERSET: - /* Return first existing subset of requested region */ - break; - default: - printk(KERN_ERR "%s() unexpected imalloc region status\n", - __FUNCTION__); - tmp = NULL; - } - - return tmp; -} - -struct vm_struct * im_get_free_area(unsigned long size) -{ - struct vm_struct *area; - unsigned long addr; - - down(&imlist_sem); - if (get_free_im_addr(size, &addr)) { - printk(KERN_ERR "%s() cannot obtain addr for size 0x%lx\n", - __FUNCTION__, size); - area = NULL; - 
goto next_im_done; - } - - area = __im_get_area(addr, size, IM_REGION_UNUSED); - if (area == NULL) { - printk(KERN_ERR - "%s() cannot obtain area for addr 0x%lx size 0x%lx\n", - __FUNCTION__, addr, size); - } -next_im_done: - up(&imlist_sem); - return area; -} - -struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, - int criteria) -{ - struct vm_struct *area; - - down(&imlist_sem); - area = __im_get_area(v_addr, size, criteria); - up(&imlist_sem); - return area; -} - -void im_free(void * addr) -{ - struct vm_struct **p, *tmp; - - if (!addr) - return; - if ((unsigned long) addr & ~PAGE_MASK) { - printk(KERN_ERR "Trying to %s bad address (%p)\n", __FUNCTION__, addr); - return; - } - down(&imlist_sem); - for (p = &imlist ; (tmp = *p) ; p = &tmp->next) { - if (tmp->addr == addr) { - *p = tmp->next; - - /* XXX: do we need the lock? */ - spin_lock(&init_mm.page_table_lock); - unmap_vm_area(tmp); - spin_unlock(&init_mm.page_table_lock); - - kfree(tmp); - up(&imlist_sem); - return; - } - } - up(&imlist_sem); - printk(KERN_ERR "Trying to %s nonexistent area (%p)\n", __FUNCTION__, - addr); -} Index: linux-work/arch/ppc64/mm/ioremap.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-work/arch/ppc64/mm/ioremap.c 2005-05-17 17:15:51.000000000 +1000 @@ -0,0 +1,276 @@ +/* + * ioremap & friends implementation + * + * extracted from arch/ppc64/mm/init.c (see (c) notice in there) + * + * Benjamin Herrenschmidt , IBM Corp. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ * + */ + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +unsigned long ioremap_bot = VMALLOC_BASE; +static unsigned long phbs_io_bot = PHBS_IO_BASE; + +#ifdef CONFIG_PPC_ISERIES + +void __iomem *ioremap(unsigned long addr, unsigned long size) +{ + return (void __iomem *)addr; +} + +extern void __iomem *__ioremap(unsigned long addr, unsigned long size, + unsigned long flags) +{ + return (void __iomem *)addr; +} + +void iounmap(volatile void __iomem *addr) +{ + return; +} + +#else /* CONFIG_PPC_ISERIES */ + +/* + * map_io_page currently only called by __ioremap + * map_io_page adds an entry to the ioremap page table + * and adds an entry to the HPT, possibly bolting it + */ +static int map_io_page(unsigned long ea, unsigned long pa, int flags) +{ + pgd_t *pgdp; + pud_t *pudp; + pmd_t *pmdp; + pte_t *ptep; + unsigned long vsid; + + if (mem_init_done) { + spin_lock(&init_mm.page_table_lock); + pgdp = pgd_offset_k(ea); + pudp = pud_alloc(&init_mm, pgdp, ea); + if (!pudp) + return -ENOMEM; + pmdp = pmd_alloc(&init_mm, pudp, ea); + if (!pmdp) + return -ENOMEM; + ptep = pte_alloc_kernel(&init_mm, pmdp, ea); + if (!ptep) + return -ENOMEM; + pa = abs_to_phys(pa); + set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, + __pgprot(flags))); + spin_unlock(&init_mm.page_table_lock); + } else { + unsigned long va, vpn, hash, hpteg; + + /* + * If the mm subsystem is not fully up, we cannot create a + * linux page table entry for this mapping. Simply bolt an + * entry in the hardware page table. 
+ */ + vsid = get_kernel_vsid(ea); + va = (vsid << 28) | (ea & 0xFFFFFFF); + vpn = va >> PAGE_SHIFT; + + hash = hpt_hash(vpn, 0); + + hpteg = ((hash & htab_hash_mask) * HPTES_PER_GROUP); + + /* Panic if a pte grpup is full */ + if (ppc_md.hpte_insert(hpteg, va, pa >> PAGE_SHIFT, 0, + _PAGE_NO_CACHE|_PAGE_GUARDED|PP_RWXX, + 1, 0) == -1) { + panic("map_io_page: could not insert mapping"); + } + } + return 0; +} + + +static void __iomem * __ioremap_com(unsigned long addr, unsigned long pa, + unsigned long ea, unsigned long size, + unsigned long flags) +{ + unsigned long i; + + if ((flags & _PAGE_PRESENT) == 0) + flags |= pgprot_val(PAGE_KERNEL); + + for (i = 0; i < size; i += PAGE_SIZE) + if (map_io_page(ea+i, pa+i, flags)) + return NULL; + + return (void __iomem *) (ea + (addr & ~PAGE_MASK)); +} + + +void __iomem * +ioremap(unsigned long addr, unsigned long size) +{ + return __ioremap(addr, size, _PAGE_NO_CACHE | _PAGE_GUARDED); +} + +void __iomem * __ioremap(unsigned long addr, unsigned long size, + unsigned long flags) +{ + unsigned long pa, ea; + void __iomem *ret; + + /* + * Choose an address to map it to. + * Once the imalloc system is running, we use it. + * Before that, we map using addresses going + * up from ioremap_bot. 
imalloc will use + * the addresses from ioremap_bot through + * IMALLOC_END + * + */ + pa = addr & PAGE_MASK; + size = PAGE_ALIGN(addr + size) - pa; + + if (size == 0) + return NULL; + + if (mem_init_done) { + struct vm_struct *area; + area = get_vm_area(size, VM_IOREMAP); + if (area == NULL) + return NULL; + ea = (unsigned long)(area->addr); + ret = __ioremap_com(addr, pa, ea, size, flags); + if (!ret) + vfree(area->addr); + } else { + ea = ioremap_bot; + ret = __ioremap_com(addr, pa, ea, size, flags); + if (ret) + ioremap_bot += size; + } + return ret; +} + +#define IS_PAGE_ALIGNED(_val) ((_val) == ((_val) & PAGE_MASK)) + +int __ioremap_explicit(unsigned long pa, unsigned long ea, + unsigned long size, unsigned long flags) +{ + void __iomem *ret; + + /* For now, require page-aligned values for pa, ea, and size */ + if (!IS_PAGE_ALIGNED(pa) || !IS_PAGE_ALIGNED(ea) || + !IS_PAGE_ALIGNED(size)) { + printk(KERN_ERR "unaligned value in %s\n", __FUNCTION__); + WARN_ON(1); + return 1; + } + if ((ea < PHBS_IO_BASE) || ((ea + size) > PHBS_IO_END)) { + printk(KERN_ERR "out of bounds value in %s\n", __FUNCTION__); + WARN_ON(1); + return 1; + } + + /* No record is kept of explicit maps for now */ + ret = __ioremap_com(pa, pa, ea, size, flags); + if (ret == NULL) { + printk(KERN_ERR "ioremap_explicit() allocation failure !\n"); + return 1; + } + if (ret != (void *) ea) { + printk(KERN_ERR "__ioremap_com() returned unexpected addr\n"); + return 1; + } + + return 0; +} + +/* + * Unmap an IO region + * Calls before mem_init_done (ie python_countermeasures()) can't be unmapped + * for now. 
+ */ +void iounmap(volatile void __iomem *token) +{ + unsigned long addr; + + if (!mem_init_done) + return; + + addr = (unsigned long __force) token & PAGE_MASK; + if (addr < ioremap_bot) + return; + + vfree((void *)addr); +} + +int iounmap_explicit(volatile void __iomem *start, unsigned long size) +{ + struct vm_struct area; + unsigned long addr = (unsigned long __force)start; + + if (!IS_PAGE_ALIGNED(addr) || !IS_PAGE_ALIGNED(size)) { + printk(KERN_ERR "unaligned value in %s\n", + __FUNCTION__); + WARN_ON(1); + return -EINVAL; + } + if ((addr < PHBS_IO_BASE) || ((addr + size) > PHBS_IO_END)) { + printk(KERN_ERR "out of bounds value in %s\n", __FUNCTION__); + WARN_ON(1); + return -EINVAL; + } + + /* We create a fake vm_struct here to please unmap_vm_area instead + * of re-implementing the 4 level page table iteration here. + * It's always fully aligned here so we are happy. The bit top be + * careful about here is that unmap_vm_area() never ends up wanting + * more of the vm_struct than addr and size. + */ + memset(&area, 0, sizeof(area)); + area.addr = (void *)((unsigned long __force) start & PAGE_MASK); + area.size = size; + unmap_vm_area(&area); + + return 0; +} + +#endif /* CONFIG_PPC_ISERIES */ + + +void __iomem * reserve_phb_iospace(unsigned long size) +{ + void __iomem *virt_addr; + + if (phbs_io_bot >= PHBS_IO_END) + panic("reserve_phb_iospace(): phb io space overflow\n"); + + virt_addr = (void __iomem *) phbs_io_bot; + phbs_io_bot += size; + + return virt_addr; +} + + +EXPORT_SYMBOL(ioremap); +EXPORT_SYMBOL(__ioremap); +EXPORT_SYMBOL(iounmap); +EXPORT_SYMBOL(ioremap_bot); /* poor XFS ... 
*/ Index: linux-work/arch/ppc64/mm/Makefile =================================================================== --- linux-work.orig/arch/ppc64/mm/Makefile 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc64/mm/Makefile 2005-05-17 16:55:07.000000000 +1000 @@ -4,7 +4,7 @@ EXTRA_CFLAGS += -mno-minimal-toc -obj-y := fault.o init.o imalloc.o hash_utils.o hash_low.o tlb.o \ +obj-y := fault.o init.o ioremap.o hash_utils.o hash_low.o tlb.o \ slb_low.o slb.o stab.o mmap.o obj-$(CONFIG_DISCONTIGMEM) += numa.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o Index: linux-work/include/asm-ppc64/imalloc.h =================================================================== --- linux-work.orig/include/asm-ppc64/imalloc.h 2005-05-16 15:42:57.000000000 +1000 +++ /dev/null 1970-01-01 00:00:00.000000000 +0000 @@ -1,26 +0,0 @@ -#ifndef _PPC64_IMALLOC_H -#define _PPC64_IMALLOC_H - -/* - * Define the address range of the imalloc VM area. - */ -#define PHBS_IO_BASE VMALLOC_END -#define IMALLOC_BASE (PHBS_IO_BASE + 0x80000000ul) /* Reserve 2 gigs for PHBs */ -#define IMALLOC_END (VMALLOC_START + PGTABLE_RANGE) - - -/* imalloc region types */ -#define IM_REGION_UNUSED 0x1 -#define IM_REGION_SUBSET 0x2 -#define IM_REGION_EXISTS 0x4 -#define IM_REGION_OVERLAP 0x8 -#define IM_REGION_SUPERSET 0x10 - -extern struct vm_struct * im_get_free_area(unsigned long size); -extern struct vm_struct * im_get_area(unsigned long v_addr, unsigned long size, - int region_type); -extern void im_free(void *addr); - -extern unsigned long ioremap_bot; - -#endif /* _PPC64_IMALLOC_H */ Index: linux-work/arch/ppc64/mm/init.c =================================================================== --- linux-work.orig/arch/ppc64/mm/init.c 2005-05-16 15:42:58.000000000 +1000 +++ linux-work/arch/ppc64/mm/init.c 2005-05-17 17:03:29.000000000 +1000 @@ -31,7 +31,6 @@ #include #include #include -#include #include #include #include @@ -40,7 +39,6 @@ #include #include -#include #include #include #include @@ -65,11 +63,8 @@ 
#include #include #include -#include int mem_init_done; -unsigned long ioremap_bot = IMALLOC_BASE; -static unsigned long phbs_io_bot = PHBS_IO_BASE; extern pgd_t swapper_pg_dir[]; extern struct task_struct *current_set[NR_CPUS]; @@ -115,271 +110,6 @@ printk("%ld pages swap cached\n", cached); } -#ifdef CONFIG_PPC_ISERIES - -void __iomem *ioremap(unsigned long addr, unsigned long size) -{ - return (void __iomem *)addr; -} - -extern void __iomem *__ioremap(unsigned long addr, unsigned long size, - unsigned long flags) -{ - return (void __iomem *)addr; -} - -void iounmap(volatile void __iomem *addr) -{ - return; -} - -#else - -/* - * map_io_page currently only called by __ioremap - * map_io_page adds an entry to the ioremap page table - * and adds an entry to the HPT, possibly bolting it - */ -static int map_io_page(unsigned long ea, unsigned long pa, int flags) -{ - pgd_t *pgdp; - pud_t *pudp; - pmd_t *pmdp; - pte_t *ptep; - unsigned long vsid; - - if (mem_init_done) { - spin_lock(&init_mm.page_table_lock); - pgdp = pgd_offset_k(ea); - pudp = pud_alloc(&init_mm, pgdp, ea); - if (!pudp) - return -ENOMEM; - pmdp = pmd_alloc(&init_mm, pudp, ea); - if (!pmdp) - return -ENOMEM; - ptep = pte_alloc_kernel(&init_mm, pmdp, ea); - if (!ptep) - return -ENOMEM; - pa = abs_to_phys(pa); - set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, - __pgprot(flags))); - spin_unlock(&init_mm.page_table_lock); - } else { - unsigned long va, vpn, hash, hpteg; - - /* - * If the mm subsystem is not fully up, we cannot create a - * linux page table entry for this mapping. Simply bolt an - * entry in the hardware page table. 
- */ - vsid = get_kernel_vsid(ea); - va = (vsid << 28) | (ea & 0xFFFFFFF); - vpn = va >> PAGE_SHIFT; - - hash = hpt_hash(vpn, 0); - - hpteg = ((hash & htab_hash_mask) * HPTES_PER_GROUP); - - /* Panic if a pte grpup is full */ - if (ppc_md.hpte_insert(hpteg, va, pa >> PAGE_SHIFT, 0, - _PAGE_NO_CACHE|_PAGE_GUARDED|PP_RWXX, - 1, 0) == -1) { - panic("map_io_page: could not insert mapping"); - } - } - return 0; -} - - -static void __iomem * __ioremap_com(unsigned long addr, unsigned long pa, - unsigned long ea, unsigned long size, - unsigned long flags) -{ - unsigned long i; - - if ((flags & _PAGE_PRESENT) == 0) - flags |= pgprot_val(PAGE_KERNEL); - - for (i = 0; i < size; i += PAGE_SIZE) - if (map_io_page(ea+i, pa+i, flags)) - return NULL; - - return (void __iomem *) (ea + (addr & ~PAGE_MASK)); -} - - -void __iomem * -ioremap(unsigned long addr, unsigned long size) -{ - return __ioremap(addr, size, _PAGE_NO_CACHE | _PAGE_GUARDED); -} - -void __iomem * __ioremap(unsigned long addr, unsigned long size, - unsigned long flags) -{ - unsigned long pa, ea; - void __iomem *ret; - - /* - * Choose an address to map it to. - * Once the imalloc system is running, we use it. - * Before that, we map using addresses going - * up from ioremap_bot. 
imalloc will use - * the addresses from ioremap_bot through - * IMALLOC_END - * - */ - pa = addr & PAGE_MASK; - size = PAGE_ALIGN(addr + size) - pa; - - if (size == 0) - return NULL; - - if (mem_init_done) { - struct vm_struct *area; - area = im_get_free_area(size); - if (area == NULL) - return NULL; - ea = (unsigned long)(area->addr); - ret = __ioremap_com(addr, pa, ea, size, flags); - if (!ret) - im_free(area->addr); - } else { - ea = ioremap_bot; - ret = __ioremap_com(addr, pa, ea, size, flags); - if (ret) - ioremap_bot += size; - } - return ret; -} - -#define IS_PAGE_ALIGNED(_val) ((_val) == ((_val) & PAGE_MASK)) - -int __ioremap_explicit(unsigned long pa, unsigned long ea, - unsigned long size, unsigned long flags) -{ - struct vm_struct *area; - void __iomem *ret; - - /* For now, require page-aligned values for pa, ea, and size */ - if (!IS_PAGE_ALIGNED(pa) || !IS_PAGE_ALIGNED(ea) || - !IS_PAGE_ALIGNED(size)) { - printk(KERN_ERR "unaligned value in %s\n", __FUNCTION__); - return 1; - } - - if (!mem_init_done) { - /* Two things to consider in this case: - * 1) No records will be kept (imalloc, etc) that the region - * has been remapped - * 2) It won't be easy to iounmap() the region later (because - * of 1) - */ - ; - } else { - area = im_get_area(ea, size, - IM_REGION_UNUSED|IM_REGION_SUBSET|IM_REGION_EXISTS); - if (area == NULL) { - /* Expected when PHB-dlpar is in play */ - return 1; - } - if (ea != (unsigned long) area->addr) { - printk(KERN_ERR "unexpected addr return from " - "im_get_area\n"); - return 1; - } - } - - ret = __ioremap_com(pa, pa, ea, size, flags); - if (ret == NULL) { - printk(KERN_ERR "ioremap_explicit() allocation failure !\n"); - return 1; - } - if (ret != (void *) ea) { - printk(KERN_ERR "__ioremap_com() returned unexpected addr\n"); - return 1; - } - - return 0; -} - -/* - * Unmap an IO region and remove it from imalloc'd list. - * Access to IO memory should be serialized by driver. 
- * This code is modeled after vmalloc code - unmap_vm_area() - * - * XXX what about calls before mem_init_done (ie python_countermeasures()) - */ -void iounmap(volatile void __iomem *token) -{ - void *addr; - - if (!mem_init_done) - return; - - addr = (void *) ((unsigned long __force) token & PAGE_MASK); - - im_free(addr); -} - -static int iounmap_subset_regions(unsigned long addr, unsigned long size) -{ - struct vm_struct *area; - - /* Check whether subsets of this region exist */ - area = im_get_area(addr, size, IM_REGION_SUPERSET); - if (area == NULL) - return 1; - - while (area) { - iounmap((void __iomem *) area->addr); - area = im_get_area(addr, size, - IM_REGION_SUPERSET); - } - - return 0; -} - -int iounmap_explicit(volatile void __iomem *start, unsigned long size) -{ - struct vm_struct *area; - unsigned long addr; - int rc; - - addr = (unsigned long __force) start & PAGE_MASK; - - /* Verify that the region either exists or is a subset of an existing - * region. In the latter case, split the parent region to create - * the exact region - */ - area = im_get_area(addr, size, - IM_REGION_EXISTS | IM_REGION_SUBSET); - if (area == NULL) { - /* Determine whether subset regions exist. If so, unmap */ - rc = iounmap_subset_regions(addr, size); - if (rc) { - printk(KERN_ERR - "%s() cannot unmap nonexistent range 0x%lx\n", - __FUNCTION__, addr); - return 1; - } - } else { - iounmap((void __iomem *) area->addr); - } - /* - * FIXME! 
This can't be right: - iounmap(area->addr); - * Maybe it should be "iounmap(area);" - */ - return 0; -} - -#endif - -EXPORT_SYMBOL(ioremap); -EXPORT_SYMBOL(__ioremap); -EXPORT_SYMBOL(iounmap); - void free_initmem(void) { unsigned long addr; @@ -795,19 +525,6 @@ local_irq_restore(flags); } -void __iomem * reserve_phb_iospace(unsigned long size) -{ - void __iomem *virt_addr; - - if (phbs_io_bot >= IMALLOC_BASE) - panic("reserve_phb_iospace(): phb io space overflow\n"); - - virt_addr = (void __iomem *) phbs_io_bot; - phbs_io_bot += size; - - return virt_addr; -} - kmem_cache_t *pmd_cache; void pgtable_cache_init(void) From arnd at arndb.de Tue May 17 21:05:12 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Tue, 17 May 2005 13:05:12 +0200 Subject: [PATCH 4/8] ppc64: add BPA platform type In-Reply-To: <17033.38609.60873.138572@cargo.ozlabs.ibm.com> References: <200505132117.37461.arnd@arndb.de> <200505132125.34358.arnd@arndb.de> <17033.38609.60873.138572@cargo.ozlabs.ibm.com> Message-ID: <200505171305.13471.arnd@arndb.de> On Dinsdag 17 Mai 2005 09:01, Paul Mackerras wrote: > Arnd Bergmann writes: > > > This adds the basic support for running on BPA machines. > > So far, this is only the IBM workstation, and it will > > not run on others without a little more generalization. > > > +/* FIXME: consolidate this into rtas.c or similar */ > > +static void __init pSeries_calibrate_decr(void) > > Shouldn't this be called bpa_calibrate_decr or something similar? The function is identical to the one for pSeries, and I'd prefer to have only one copy of it with a more generic name. Actually, it looks like maple and perhaps pmac have a very similar *_calibrate_decr function, so I could perhaps just put this into time.c as generic_calibrate_decr(). [ Ben, can you tell if pSeries_calibrate_decr should work on all G5 macs or if it can be changed to support them as well? 
] On a similar issue, I just remembered that I wanted to create a rtas_time.c to hold the rtc access functions for pSeries and BPA. Do you think that's a good idea? > > -#define PV_630 0x0040 > > -#define PV_630p 0x0041 > > +#define PV_630 0x0040 > > +#define PV_630p 0x0041 > > Hmmm, I don't think your patch needs to clean up the whitespace here. ok. Arnd <>< From apw at shadowen.org Tue May 17 23:08:48 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Tue, 17 May 2005 14:08:48 +0100 Subject: [PATCH] sparsemem-ppc64-flat-first-block-is-not-special In-Reply-To: <4280D72C.4090203@shadowen.org> Message-ID: Ok. Testing seems to show that indeed the initial memory blocks do not need to be treated specially on ppc64 non-numa systems. Andrew could you add this to the sparsemem patches please. Applies on top of 2.6.12-rc4-mm2. -apw Testing seems to confirm that we do not need to handle the first memory block specially in do_init_bootmem. Signed-off-by: Andy Whitcroft diffstat sparsemem-ppc64-flat-first-block-is-not-special --- init.c | 21 +++++++-------------- 1 files changed, 7 insertions(+), 14 deletions(-) diff -upN reference/arch/ppc64/mm/init.c current/arch/ppc64/mm/init.c --- reference/arch/ppc64/mm/init.c +++ current/arch/ppc64/mm/init.c @@ -538,14 +538,6 @@ void __init do_init_bootmem(void) unsigned long start, bootmap_pages; unsigned long total_pages = lmb_end_of_DRAM() >> PAGE_SHIFT; int boot_mapsize; - unsigned long start_pfn, end_pfn; - /* - * Note presence of first (logical/coalasced) LMB which will - * contain RMO region - */ - start_pfn = lmb.memory.region[0].physbase >> PAGE_SHIFT; - end_pfn = start_pfn + (lmb.memory.region[0].size >> PAGE_SHIFT); - memory_present(0, start_pfn, end_pfn); /* * Find an area to use for the bootmem bitmap. Calculate the size of @@ -562,18 +554,19 @@ void __init do_init_bootmem(void) max_pfn = max_low_pfn; /* Add all physical memory to the bootmem map, mark each area - * present. 
The first block has already been marked present above. + * present. */ for (i=0; i < lmb.memory.cnt; i++) { unsigned long physbase, size; + unsigned long start_pfn, end_pfn; physbase = lmb.memory.region[i].physbase; size = lmb.memory.region[i].size; - if (i) { - start_pfn = physbase >> PAGE_SHIFT; - end_pfn = start_pfn + (size >> PAGE_SHIFT); - memory_present(0, start_pfn, end_pfn); - } + + start_pfn = physbase >> PAGE_SHIFT; + end_pfn = start_pfn + (size >> PAGE_SHIFT); + memory_present(0, start_pfn, end_pfn); + free_bootmem(physbase, size); } From tzachi at marvell.com Tue May 17 23:49:18 2005 From: tzachi at marvell.com (Tzachi Perelstein) Date: Tue, 17 May 2005 16:49:18 +0300 Subject: device-tree in U-Boot Message-ID: Hi Ben, I'll make U-Boot loading fits to ppc64 standards using the device-tree. I was consulting Wolfgang Denk (U-Boot owner), and he confirmed it (more or less). Since I'm not familiar with OF and device tree, and since you generously offered your help, I contact you for some guidance. As a first step, I would like to do it as simple as possible. On the last stage of U-Boot before branching to Linux, I will collect the needed information and reassemble it into a device-tree. Can you please advise me how to do it without adding the whole device-tree implementation to U-Boot? Is there any code reference I can use? You have mentioned that there are only little requirements provided by the kernel regarding device tree. Can you please clarify? Best regards, Tzachi > -----Original Message----- > From: Benjamin Herrenschmidt [mailto:benh at kernel.crashing.org] > Sent: Monday, May 16, 2005 2:34 PM > To: Christoph Hellwig > Cc: Tzachi Perelstein; linuxppc64-dev at ozlabs.org > Subject: Re: io_block_mapping in PPC64? > > However, I agree this can be a bit too complicated for an embedded > firmware, and the compact device-tree format may be more suited. That > was one of the motivations for implementing it, along with making kexec > possible. 
The "bare metal" linux folks in IBM use it, I'm trying to > figure out if some code can be provided to the community as an example > on how to generate it. > > The kernel provides only little requirements regarding the device-tree. > On purpose, I do not make mandatory, for example, to have every PCI > device enumerated by the firmware. Only host bridges, so those can > provide proper memory/io windows and irq routing to the kernel. That > along with a few nodes defining the board type, RAM layout, and maybe a > few other options you want to pass along to the kernel. > > However, for things like on-chip devices, experience has shown that a > device-tree like structure is far more convenient and flexible than any > kind of hard-coded mechanism, it allows the firmware to easily inform the > kernel on what driver to load, what address/interrupts to assign, what > MAC address to use for ethernet, etc... without creating a "carved in > stone" ABI like a board info structure does. > > Believe me, Tzachi, I've been there, doing embedded development and > bring up, and I far more prefer changing a property or two in the device > tree provided by the firmware than having to add more hacks to the > kernel image every time I decide to release a new revision of an > embedded board. > > It is _not_ complicated, and it should be fairly easy to write a parser > that turns an ASCII representation into a "blob" that can be flashed and > passed along to the kernel. In fact, I'd be glad to help you implement > some support for generating a simple device-tree from uboot. I'm not > trying to put "arbitrary" restrictions here, I'm really trying to > promote a nicer and more maintainable way of defining the > firmware<->kernel interface that is suitable for both "desktop" (and > server) like solutions and embedded.
> In fact, if you read the
> linuxppc-embedded list archives, you'll notice that more than once,
> people have been trying to replace the current mess with something like
> that, though so far, nobody has actually implemented anything, mostly
> because there is the whole lot of existing legacy board_info crap to
> deal with. I'm trying to avoid that for the future of ppc64.
>
> Ben.
>
>

From johnrose at austin.ibm.com Wed May 18 06:15:08 2005
From: johnrose at austin.ibm.com (John Rose)
Date: Tue, 17 May 2005 15:15:08 -0500
Subject: [PATCH] (RFC) slaughter imalloc
In-Reply-To: <1116315093.6804.14.camel@gaston>
References: <1116315093.6804.14.camel@gaston>
Message-ID: <1116360908.21029.28.camel@sinatra.austin.ibm.com>

Hi Ben-

Great job! The patch cleans things up nicely.

For PHB removal, iounmap_explicit() will be called for the range of the PHB. In many cases, this range will have gaps of invalidated PTEs from children slots that have already been removed. Will unmap_vm_area() and/or the invalidate_pte hcall tolerate such gaps? I don't know enough to answer that w/o actually testing it.

While we're cleaning stuff up, we might look at naming the explicit functions consistently. I wrote them, so I'm the guilty party :) __ioremap_explicit <-> iounmap_explicit, etc.

Here are some more comments:

+/*
+ * map_io_page currently only called by __ioremap

Not totally accurate, since explicit() calls it too :) I should have removed this line...
+ * map_io_page adds an entry to the ioremap page table
+ * and adds an entry to the HPT, possibly bolting it
+ */
+static int map_io_page(unsigned long ea, unsigned long pa, int flags)
+{
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
+	unsigned long vsid;
+
+	if (mem_init_done) {
+		spin_lock(&init_mm.page_table_lock);
+		pgdp = pgd_offset_k(ea);
+		pudp = pud_alloc(&init_mm, pgdp, ea);
+		if (!pudp)
+			return -ENOMEM;
+		pmdp = pmd_alloc(&init_mm, pudp, ea);
+		if (!pmdp)
+			return -ENOMEM;
+		ptep = pte_alloc_kernel(&init_mm, pmdp, ea);
+		if (!ptep)
+			return -ENOMEM;
+		pa = abs_to_phys(pa);
+		set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT,
+							  __pgprot(flags)));
+		spin_unlock(&init_mm.page_table_lock);

Now that we're so vmalloc-hip, could we use map_vm_area() here?

+EXPORT_SYMBOL(ioremap_bot);	/* poor XFS ... */

Does XFS really use this!?

Thanks-
John

From ntl at pobox.com Wed May 18 06:40:29 2005
From: ntl at pobox.com (Nathan Lynch)
Date: Tue, 17 May 2005 15:40:29 -0500
Subject: [PATCH 3/8] ppc64: add a watchdog driver for rtas
In-Reply-To: <200505132124.48963.arnd@arndb.de>
References: <200505132117.37461.arnd@arndb.de> <200505132124.48963.arnd@arndb.de>
Message-ID: <20050517204029.GA2748@otto>

Arnd Bergmann wrote:
> +static volatile int wdrtas_miscdev_open = 0;
...
> +static int
> +wdrtas_open(struct inode *inode, struct file *file)
> +{
> +	/* only open once */
> +	if (xchg(&wdrtas_miscdev_open,1))
> +		return -EBUSY;

The volatile and xchg strike me as an obscure method for ensuring only one process at a time can open this file. Any reason a semaphore couldn't be used?

> +static int
> +wdrtas_close(struct inode *inode, struct file *file)
> +{
> +	/* only stop watchdog, if this was announced using 'V' before */
> +	if (wdrtas_expect_close == WDRTAS_MAGIC_CHAR)
> +		wdrtas_timer_stop();
> +	else {
> +		printk("wdrtas: got unexpected close. Watchdog "
> +		       "not stopped.\n");

printk's need a valid log level specified.
There are several in this file that lack them.

Nathan

From benh at kernel.crashing.org Wed May 18 09:14:02 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 18 May 2005 09:14:02 +1000
Subject: device-tree in U-Boot
In-Reply-To:
References:
Message-ID: <1116371643.5366.12.camel@gaston>

On Tue, 2005-05-17 at 16:49 +0300, Tzachi Perelstein wrote:
> Hi Ben,
>
> I'll make U-Boot loading fit the ppc64 standards using the device-tree.
> I consulted Wolfgang Denk (U-Boot owner), and he confirmed it (more
> or less).
>
> Since I'm not familiar with OF and the device tree, and since you generously
> offered your help, I am contacting you for some guidance.
> As a first step, I would like to do it as simply as possible. In the
> last stage of U-Boot before branching to Linux, I will collect the
> needed information and reassemble it into a device-tree.
> Can you please advise me how to do it without adding the whole
> device-tree implementation to U-Boot? Is there any code reference I can
> use?
> You have mentioned that there are only a few requirements imposed by
> the kernel regarding the device tree. Can you please clarify?

Sure, I will. Later today or tomorrow if you don't mind though. I'm gathering bits & pieces and will provide you with all the needed information.

Ben.

From benh at kernel.crashing.org Wed May 18 09:23:48 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 18 May 2005 09:23:48 +1000
Subject: [PATCH] (RFC) slaughter imalloc
In-Reply-To: <1116360908.21029.28.camel@sinatra.austin.ibm.com>
References: <1116315093.6804.14.camel@gaston> <1116360908.21029.28.camel@sinatra.austin.ibm.com>
Message-ID: <1116372228.5366.25.camel@gaston>

On Tue, 2005-05-17 at 15:15 -0500, John Rose wrote:
> Hi Ben-
>
> Great job! The patch cleans things up nicely.
>
> For PHB removal, iounmap_explicit() will be called for the range of the
> PHB.
> In many cases, this range will have gaps of invalidated PTEs from
> children slots that have already been removed. Will unmap_vm_area()
> and/or the invalidate_pte hcall tolerate such gaps? I don't know
> enough to answer that w/o actually testing it.

It should be ok, empty pgd/pud/pmd's are skipped, and ptep_test_and_clear() should be harmless. The hash flushing routines don't mind either. The only thing in there I see that tests the state of the actual PTE is that bit:

	WARN_ON(!pte_none(ptent) && !pte_present(ptent));

Which I think won't trigger, it's supposed to catch a non-present PTE that isn't also empty (that is a swap PTE I suppose) which will never happen in our case.

> While we're cleaning stuff up, we might look at naming the explicit
> functions consistently. I wrote them, so I'm the guilty party :)
> __ioremap_explicit <-> iounmap_explicit, etc.

Ok, though at the same time, those are really "internal" APIs not to be used by the common mortals so the __ does make some sense. I'll think about it.

> Here are some more comments:
>
> +/*
> + * map_io_page currently only called by __ioremap
>
> Not totally accurate, since explicit() calls it too :) I should have
> removed this line...

Hehe, will look, it may even be called elsewhere.
> + * map_io_page adds an entry to the ioremap page table
> + * and adds an entry to the HPT, possibly bolting it
> + */
> +static int map_io_page(unsigned long ea, unsigned long pa, int flags)
> +{
> +	pgd_t *pgdp;
> +	pud_t *pudp;
> +	pmd_t *pmdp;
> +	pte_t *ptep;
> +	unsigned long vsid;
> +
> +	if (mem_init_done) {
> +		spin_lock(&init_mm.page_table_lock);
> +		pgdp = pgd_offset_k(ea);
> +		pudp = pud_alloc(&init_mm, pgdp, ea);
> +		if (!pudp)
> +			return -ENOMEM;
> +		pmdp = pmd_alloc(&init_mm, pudp, ea);
> +		if (!pmdp)
> +			return -ENOMEM;
> +		ptep = pte_alloc_kernel(&init_mm, pmdp, ea);
> +		if (!ptep)
> +			return -ENOMEM;
> +		pa = abs_to_phys(pa);
> +		set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT,
> +							  __pgprot(flags)));
> +		spin_unlock(&init_mm.page_table_lock);
>
> Now that we're so vmalloc-hip, could we use map_vm_area() here?

I don't think map_vm_area() is suitable for ioremap like things. It wants struct page pointer arrays and such things. Even x86 doesn't use it.

> +EXPORT_SYMBOL(ioremap_bot);	/* poor XFS ... */
>
> Does XFS really use this!?

XFS has some hacks using VMALLOC_START which is now defined as using ioremap_bot.

Ben.

From benh at kernel.crashing.org Wed May 18 16:27:41 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 18 May 2005 16:27:41 +1000
Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO
Message-ID: <1116397662.918.8.camel@gaston>

Hi!

Here's the very first draft of my HOWTO about booting the linux/ppc64 kernel without Open Firmware. It's still incomplete, the main chapter describing which nodes & properties are required and their format is still missing (though it will basically be a subset of the Open Firmware specification & bindings). The format of the flattened device-tree is documented.

It's a first draft, so please, don't be too harsh :) Comments are welcome.

           Booting the Linux/ppc64 kernel without Open Firmware
           ----------------------------------------------------

(c) 2005 Benjamin Herrenschmidt, IBM Corp.

   May 18, 2005: Rev 0.1 - Initial draft, no chapter III yet.

I - Introduction
================

During the recent development of the Linux/ppc64 kernel, and more specifically the addition of new platform types outside of the old IBM pSeries/iSeries pair, it was decided to enforce some strict rules regarding the kernel entry and bootloader <-> kernel interfaces, in order to avoid the degeneration that the ppc32 kernel entry point, and the way a new platform gets added to the kernel, have become. The legacy iSeries platform breaks those rules as it predates this scheme, but no new board support will be accepted in the main tree that doesn't follow them properly.

1) Entry point
--------------

There is one and only one entry point to the kernel, at the start of the kernel image. That entry point supports two calling conventions:

  a) Boot from Open Firmware. If your firmware is compatible with Open Firmware (IEEE 1275) or provides an OF compatible client interface API (support for the "interpret" callback of Forth words isn't required), you can enter the kernel with:

        r5 : OF callback pointer as defined by the IEEE 1275 bindings to powerpc. Only the 32 bits client interface is currently supported

        r3, r4 : address & length of an initrd if any, or 0

     The MMU is either on or off; the kernel will run the trampoline located in arch/ppc64/kernel/prom_init.c to extract the device-tree and other information from Open Firmware and build a flattened device-tree as described in b). prom_init() will then re-enter the kernel using the second method. This trampoline code runs in the context of the firmware, which is supposed to handle all exceptions during that time.

  b) Direct entry with a flattened device-tree block.
     This entry point is called by a) after the OF trampoline and can also be called directly by a bootloader that does not support the Open Firmware client interface. It is also used by "kexec" to implement "hot" booting of a new kernel from a previous running one. This method is what I will describe in more detail in this document, as method a) is simply standard Open Firmware, and thus should be implemented according to the various standard documents defining it and its binding to the PowerPC platform. The entry point definition then becomes:

        r3 : physical pointer to the device-tree block (defined in chapter II)

        r4 : physical pointer to the kernel itself. This is used by the assembly code to properly disable the MMU in case you are entering the kernel with MMU enabled and a non-1:1 mapping.

        r5 : NULL (so as to differentiate it from method a)

2) Board support
----------------

Board supports (platforms) are not exclusive config options. An arbitrary set of board supports can be built in a single kernel image. The kernel will "know" what set of functions to use for a given platform based on the content of the device-tree. Thus, you should:

  a) add your platform support as a _boolean_ option in arch/ppc64/Kconfig, following the example of PPC_PSERIES, PPC_PMAC and PPC_MAPLE. The latter is probably a good example of a board support to start from.

  b) create your main platform file as "arch/ppc64/kernel/myboard_setup.c" and add it to the Makefile under the condition of your CONFIG_ option. This file will define a structure of type "ppc_md" containing the various callbacks that the generic code will use to get to your platform specific code

  c) add a reference to your "ppc_md" structure in the "machines" table in arch/ppc64/kernel/setup.c

  d) request and get assigned a platform number (see the PLATFORM_* constants in include/asm-ppc64/processor.h)

I will describe later the boot process and various callbacks that your platform should implement.
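
As a rough illustration of how the two calling conventions of 1) can be told apart, here is a minimal C sketch. It only models the register contract described above (r5 non-zero for an Open Firmware entry, r5 NULL for a direct entry with r3 pointing to the DT block). The names entry_kind, entry_regs and classify_entry are hypothetical, and this is not the kernel's actual entry code, which lives in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum entry_kind {
	ENTRY_OPEN_FIRMWARE,	/* method a): r5 holds the OF callback pointer */
	ENTRY_FLAT_DT		/* method b): r5 is NULL, r3 points to the DT block */
};

/* Registers as the firmware or bootloader left them at kernel entry. */
struct entry_regs {
	uint64_t r3;
	uint64_t r4;
	uint64_t r5;
};

/* Decide which calling convention was used, looking only at r5. */
static enum entry_kind classify_entry(const struct entry_regs *regs)
{
	assert(regs != NULL);
	if (regs->r5 != 0)
		return ENTRY_OPEN_FIRMWARE;
	return ENTRY_FLAT_DT;
}
```

A bootloader following method b) therefore has to make sure r5 is cleared before branching to the kernel, otherwise the kernel would take the value for an Open Firmware callback pointer.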

II - The DT block format
========================

This chapter defines the actual format of the flattened device-tree passed to the kernel. Its actual content and the kernel requirements are described later. You can find examples of code manipulating that format in various places, including arch/ppc64/kernel/prom_init.c, which generates a flattened device-tree from the Open Firmware representation, or the fs2dt utility, part of the kexec tools, which generates one from a filesystem representation. It is expected that a bootloader like uboot provides a bit more support; that will be discussed later as well.

1) Header
---------

The kernel is entered with r3 pointing to an area of memory that is roughly described in include/asm-ppc64/prom.h by the structure boot_param_header:

struct boot_param_header
{
	u32	magic;			/* magic word OF_DT_HEADER */
	u32	totalsize;		/* total size of DT block */
	u32	off_dt_struct;		/* offset to structure */
	u32	off_dt_strings;		/* offset to strings */
	u32	off_mem_rsvmap;		/* offset to memory reserve map */
	u32	version;		/* format version */
	u32	last_comp_version;	/* last compatible version */
	/* version 2 fields below */
	u32	boot_cpuid_phys;	/* Which physical CPU id we're booting on */
};

Along with the constants:

/* Definitions used by the flattened device tree */
#define OF_DT_HEADER		0xd00dfeed	/* 4: version, 4: total size */
#define OF_DT_BEGIN_NODE	0x1		/* Start node: full name */
#define OF_DT_END_NODE		0x2		/* End node */
#define OF_DT_PROP		0x3		/* Property: name off, size, content */
#define OF_DT_END		0x9

All values in this header are in big endian format; the various fields in this header are defined more precisely below. All "offset" values are in bytes from the start of the header, that is from the value of r3.

   - magic

     This is a magic value that "marks" the beginning of the device-tree block header.
     It contains the value 0xd00dfeed and is defined by the constant OF_DT_HEADER

   - totalsize

     This is the total size of the DT block including the header. The "DT" block should enclose all data structures defined in this chapter (which are pointed to by offsets in this header). That is, the device-tree structure, strings, and the memory reserve map.

   - off_dt_struct

     This is an offset from the beginning of the header to the start of the "structure" part of the device tree. (see 2) device tree)

   - off_dt_strings

     This is an offset from the beginning of the header to the start of the "strings" part of the device-tree

   - off_mem_rsvmap

     This is an offset from the beginning of the header to the start of the reserved memory map. This map is a list of pairs of 64 bits integers. Each pair is a physical address and a size. The list is terminated by an entry of size 0. This map provides the kernel with a list of physical memory areas that are "reserved" and thus not to be used for memory allocations, especially during early initialisation. The kernel needs to allocate memory during boot for things like un-flattening the device-tree, allocating an MMU hash table, etc... Those allocations must be done in such a way as to avoid overwriting critical things like, on Open Firmware capable machines, the RTAS instance, or on some pSeries, the TCE tables used for the iommu. Typically, the reserve map should contain _at least_ this DT block itself (header, totalsize). If you are passing an initrd to the kernel, you should reserve it as well. You do not need to reserve the kernel image itself. The map should be 64 bits aligned.

   - version

     This is the version of this structure. Version 1 stops here. Version 2 adds an additional field boot_cpuid_phys. You should always generate a structure of the highest version defined at the time of your implementation. That is version 2.

   - last_comp_version

     Last compatible version. This indicates down to what version of the DT block you are backward compatible with.
     For example, version 2 is backward compatible with version 1 (that is, a kernel built for version 1 will be able to boot with a version 2 format). You should put a 1 in this field unless a new incompatible version of the DT block is defined.

   - boot_cpuid_phys

     This field only exists on version 2 headers. It indicates which physical CPU ID is calling the kernel entry point. This is used, among others, by kexec. If you are on an SMP system, this value should match the content of the "reg" property of the CPU node in the device-tree corresponding to the CPU calling the kernel entry point (see further chapters for more information on the required device-tree contents)

So the typical layout of a DT block (though the various parts don't need to be in that order) looks like this (addresses go from top to bottom):

             ------------------------------
       r3 -> |  struct boot_param_header  |
             ------------------------------
             |      (alignment gap) (*)   |
             ------------------------------
             |      memory reserve map    |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |    device-tree structure   |
             |                            |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |     device-tree strings    |
             |                            |
      -----> ------------------------------
      |
      |
      --- (r3 + totalsize)

  (*) The alignment gaps are not necessarily present; their presence and size depend on the various alignment requirements of the individual data blocks.

2) Device tree generalities
---------------------------

This device-tree itself is separated in two different blocks, a structure block and a strings block. Both need to be page aligned.

First, let's quickly describe the device-tree concept before detailing the storage format. This chapter does _not_ describe the detail of the required types of nodes & properties for the kernel; this is done later in chapter III.

The device-tree layout is strongly inherited from the definition of the Open Firmware IEEE 1275 device-tree.
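
To make the header description in 1) more concrete, here is a minimal C sketch of how a bootloader might fill in and sanity-check a version 2 header. It assumes only what the text states (eight big endian 32 bits fields, magic 0xd00dfeed, version 2, last compatible version 1). The helper names put_be32, get_be32, dt_header_fields, dt_header_write and dt_header_ok are hypothetical and do not come from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OF_DT_HEADER	0xd00dfeedu	/* magic value, as listed in the text */
#define DT_HEADER_SIZE	32u		/* eight 32 bits fields (version 2) */

/* Store a 32 bits value in big endian order, whatever the host endianness. */
static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Read back a big endian 32 bits value. */
static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Host-order copy of the fields the caller chooses (hypothetical type). */
struct dt_header_fields {
	uint32_t totalsize;
	uint32_t off_dt_struct;
	uint32_t off_dt_strings;
	uint32_t off_mem_rsvmap;
	uint32_t boot_cpuid_phys;
};

/* Write a version 2 header that stays readable by version 1 kernels. */
static void dt_header_write(uint8_t *buf, const struct dt_header_fields *f)
{
	assert(f->totalsize >= DT_HEADER_SIZE);
	put_be32(buf + 0, OF_DT_HEADER);
	put_be32(buf + 4, f->totalsize);
	put_be32(buf + 8, f->off_dt_struct);
	put_be32(buf + 12, f->off_dt_strings);
	put_be32(buf + 16, f->off_mem_rsvmap);
	put_be32(buf + 20, 2);			/* version */
	put_be32(buf + 24, 1);			/* last_comp_version */
	put_be32(buf + 28, f->boot_cpuid_phys);
}

/* Return 1 if buf looks like a header this sketch understands, else 0. */
static int dt_header_ok(const uint8_t *buf, size_t len)
{
	uint32_t total;
	size_t off;

	if (len < DT_HEADER_SIZE || get_be32(buf) != OF_DT_HEADER)
		return 0;
	total = get_be32(buf + 4);
	if (total < DT_HEADER_SIZE || total > len)
		return 0;
	/* the three offsets must all point inside the block */
	for (off = 8; off <= 16; off += 4)
		if (get_be32(buf + off) >= total)
			return 0;
	/* we only know the format up to version 2 */
	return get_be32(buf + 24) <= 2;
}
```

Writing the fields byte by byte keeps the sketch independent of the endianness of the machine building the block, which matters if the blob is generated on a build host rather than on the target.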
It's basically a tree of nodes, each node having two or more named properties. A property can have a value or not.

It is a tree, so each node has one and only one parent except for the root node, which has no parent.

A node has 2 names. The actual node name is contained in a property of type "name" in the node property list whose value is a zero terminated string and is mandatory. There is also a "unit name" that is used to differentiate nodes with the same name at the same level; it is usually made of the node's name, the "@" sign, and a "unit address", whose definition is specific to the bus type the node sits on.

The unit name doesn't exist as a property per se but is included in the device-tree structure. It is typically used to represent a "path" in the device-tree. More details about these will be provided later.

The kernel ppc64 generic code does not make any formal use of the unit address (though some board support code may do), so the only real requirement here for the unit address is to ensure uniqueness of the node unit name at a given level. Nodes with no notion of address and no possible sibling of the same name (like /memory or /cpus) may omit the unit address in the context of this specification, or use the "@0" default unit address. The unit name is used to define a node "full path", which is the concatenation of all parent nodes' unit names separated with "/".

The root node is defined as being named "device-tree" and has no unit address (no @ symbol followed by a unit address). When manipulating device-tree "paths", the root of the tree is generally represented by a simple slash sign "/".

Every node which actually represents an actual device (that is, which isn't only a virtual "container" for more nodes, like "/cpus" is) is also required to have a "device_type" property indicating the type of node.

Finally, every node is required to have a "linux,phandle" property.
Real Open Firmware implementations don't provide it, as it's generated on the fly by the prom_init.c trampoline from the Open Firmware "phandle". Implementations providing a flattened device-tree directly should provide this property. This property is a 32 bits value that uniquely identifies a node. You are free to use whatever values or system of values, internal pointers, or whatever to generate these; the only requirement is that every single node of the tree you are passing to the kernel has a unique value in this property.

This can be used in some cases for nodes to reference other nodes.

Here is an example of a simple device-tree. In this example, an "o" designates a node followed by the node unit name. Properties are presented with their name followed by their content. "content" represents an ASCII string (zero terminated) value, while <content> represents a 32 bits hexadecimal value. The various nodes in this example will be discussed in a later chapter. At this point, it is only meant to give you an idea of what a device-tree looks like

  / o device-tree
      |- name = "device-tree"
      |- model = "MyBoardName"
      |- compatible = "MyBoardFamilyName"
      |- #address-cells = <2>
      |- #size-cells = <2>
      |- linux,phandle = <0>
      |
      o cpus
      | | - name = "cpus"
      | | - linux,phandle = <1>
      | |
      | o PowerPC,970@0
      |   |- name = "PowerPC,970"
      |   |- device_type = "cpu"
      |   |- reg = <0>
      |   |- clock-frequency = <5f5e1000>
      |   |- linux,boot-cpu
      |   |- linux,phandle = <2>
      |
      o memory@0
      | |- name = "memory"
      | |- device_type = "memory"
      | |- reg = <00000000 00000000 00000000 20000000>
      | |- linux,phandle = <3>
      |
      o chosen
        |- name = "chosen"
        |- bootargs = "root=/dev/sda2"
        |- linux,platform = <00000600>
        |- linux,phandle = <4>

This tree is an example of a minimal tree.
It pretty much contains the minimal set of required nodes and properties to boot a linux kernel, that is some basic model information at the root, the CPUs, the physical memory layout, and misc information passed through /chosen, like in this example the platform type (mandatory) and the kernel command line arguments (optional).

The /cpus/PowerPC,970@0/linux,boot-cpu property is an example of a property without a value. All other properties have a value. The meaning of the #address-cells and #size-cells properties will be explained in chapter III, which defines precisely the required nodes and properties and their content.

3) Device tree "structure" block
--------------------------------

The structure of the device tree is a linearized tree structure. The "OF_DT_BEGIN_NODE" token starts a new node, and the "OF_DT_END_NODE" token ends that node definition. Child nodes are simply defined before "OF_DT_END_NODE" (that is, nodes within the node). A 'token' is a 32 bits value.

Here's the basic structure of a single node:

     * token OF_DT_BEGIN_NODE (that is 0x00000001)
     * node full path as a zero terminated string
     * [align gap to next 4 bytes boundary]
     * for each property:
        * token OF_DT_PROP (that is 0x00000003)
        * 32 bits value of property value size in bytes (or 0 if no value)
        * 32 bits value of offset in string block of property name
        * [align gap to either next 4 bytes boundary if the property value size is less or equal to 4 bytes, or to next 8 bytes boundary if the property value size is larger than 4 bytes]
        * property value data if any
        * [align gap to next 4 bytes boundary]
     * [child nodes if any]
     * token OF_DT_END_NODE (that is 0x00000002)

So the node content can be summarised as a start token, a full path, a list of properties, a list of child nodes and an end token. Every child node is a full node structure itself as defined above.

4) Device tree "strings" block
------------------------------

In order to save space, property names, which are generally redundant, are stored separately in the "strings" block.
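
The node layout listed in 3) can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not the kernel's or any bootloader's code: the dt_writer type and all dt_* function names are hypothetical, the buffers are fixed-size for brevity, the token values are the ones given with the header constants (OF_DT_END_NODE = 0x2 closes a node), and OF_DT_END is assumed to terminate the whole structure block, which the text does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OF_DT_BEGIN_NODE	0x1
#define OF_DT_END_NODE		0x2
#define OF_DT_PROP		0x3
#define OF_DT_END		0x9

/* Hypothetical fixed-size writer; must be zeroed before use. */
struct dt_writer {
	uint8_t	structs[1024];	/* "structure" block being built */
	size_t	slen;
	char	strings[256];	/* "strings" block being built */
	size_t	nlen;
};

/* Append raw bytes to the structure block. */
static void dt_put(struct dt_writer *w, const void *p, size_t n)
{
	assert(n <= sizeof(w->structs) - w->slen);
	if (n)
		memcpy(w->structs + w->slen, p, n);
	w->slen += n;
}

/* Zero-pad the structure block up to the next multiple of align (<= 8). */
static void dt_align(struct dt_writer *w, size_t align)
{
	static const uint8_t zero[8];

	dt_put(w, zero, (align - w->slen % align) % align);
}

/* Append one big endian 32 bits cell (token, size or offset). */
static void dt_put32(struct dt_writer *w, uint32_t v)
{
	uint8_t b[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16),
			 (uint8_t)(v >> 8), (uint8_t)v };

	dt_put(w, b, sizeof(b));
}

/* Return the offset of name in the strings block, adding it if missing. */
static uint32_t dt_name_offset(struct dt_writer *w, const char *name)
{
	size_t off = 0, n = strlen(name) + 1;

	while (off < w->nlen) {
		if (strcmp(w->strings + off, name) == 0)
			return (uint32_t)off;	/* name already stored once */
		off += strlen(w->strings + off) + 1;
	}
	assert(n <= sizeof(w->strings) - w->nlen);
	memcpy(w->strings + w->nlen, name, n);
	w->nlen += n;
	return (uint32_t)off;
}

static void dt_begin_node(struct dt_writer *w, const char *full_path)
{
	dt_put32(w, OF_DT_BEGIN_NODE);
	dt_put(w, full_path, strlen(full_path) + 1);
	dt_align(w, 4);
}

static void dt_prop(struct dt_writer *w, const char *name,
		    const void *val, uint32_t len)
{
	dt_put32(w, OF_DT_PROP);
	dt_put32(w, len);
	dt_put32(w, dt_name_offset(w, name));
	dt_align(w, len <= 4 ? 4 : 8);	/* 8 bytes alignment for larger values */
	dt_put(w, val, len);
	dt_align(w, 4);
}

static void dt_end_node(struct dt_writer *w)
{
	dt_put32(w, OF_DT_END_NODE);
}

/* Close the structure block (assumption: OF_DT_END marks its end). */
static void dt_finish(struct dt_writer *w)
{
	dt_put32(w, OF_DT_END);
}
```

A caller would zero a struct dt_writer, call dt_begin_node() with the node full path, dt_prop() once per property, recurse for child nodes, then call dt_end_node(), and finally dt_finish() once for the whole tree. Because dt_name_offset() reuses an existing entry, a property name such as "linux,phandle" is stored once however many nodes carry it.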
This block is simply the whole bunch of zero terminated strings for all property names concatenated together. The device-tree property definitions in the structure block will contain offset values from the beginning of the strings block.

III - Required content of the device tree
=========================================

< to be written >

IV - Recommendation for a bootloader
====================================

Here are some various ideas/recommendations that have been proposed while all this has been defined and implemented.

  - It should be possible to write a parser that turns an ASCII representation of a device-tree (or even XML though I find that less readable) into a device-tree block. This would allow to basically build the device-tree structure and strings "blobs" at bootloader build time, and have the bootloader just pass them as-is to the kernel. In fact, the device-tree blob could then be separate from the bootloader itself, and be placed in a separate portion of the flash that can be "personalized" for different board types by flashing a different device-tree

  - The bootloader may want to be able to use the device-tree itself and may want to manipulate it (to add/edit some properties, like physical memory size or kernel arguments). At this point, 2 choices can be made. Either the bootloader works directly on the flattened format, or the bootloader has its own internal tree representation with pointers (similar to the kernel one) and re-flattens the tree when booting the kernel. The former is a bit more difficult to edit/modify, the latter probably requires a bit more code to handle the tree structure. Note that the structure format has been designed so it's relatively easy to "insert" properties or nodes or delete them by just memmove'ing things around. It contains no internal offsets or pointers for this purpose.

  - An example of code for iterating nodes & retrieving properties directly from the flattened tree format can be found in the kernel file arch/ppc64/kernel/prom.c; look at the scan_flat_dt() function, its usage in early_init_devtree(), and the corresponding various early_init_dt_scan_*() callbacks. That code can be re-used in a GPL bootloader, and as the author of that code, I would be happy to discuss possible free licensing to any vendor who wishes to integrate all or part of this code into a non-GPL bootloader.

From benh at kernel.crashing.org Wed May 18 17:09:10 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 18 May 2005 17:09:10 +1000
Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO
Message-ID: <1116400151.918.10.camel@gaston>

Hi!

Here's the very first draft of my HOWTO about booting the linux/ppc64 kernel without Open Firmware. It's still incomplete, the main chapter describing which nodes & properties are required and their format is still missing (though it will basically be a subset of the Open Firmware specification & bindings). The format of the flattened device-tree is documented.

It's a first draft, so please, don't be too harsh :) Comments are welcome.

           Booting the Linux/ppc64 kernel without Open Firmware
           ----------------------------------------------------

(c) 2005 Benjamin Herrenschmidt, IBM Corp.

   May 18, 2005: Rev 0.1 - Initial draft, no chapter III yet.

I - Introduction
================

During the recent development of the Linux/ppc64 kernel, and more specifically the addition of new platform types outside of the old IBM pSeries/iSeries pair, it was decided to enforce some strict rules regarding the kernel entry and bootloader <-> kernel interfaces, in order to avoid the degeneration that the ppc32 kernel entry point, and the way a new platform gets added to the kernel, have become.
The legacy iSeries platform breaks those rules as it predates this scheme, but no new board support will be accepted in the main tree that doesn't follow them properly.

1) Entry point
--------------

There is one and only one entry point to the kernel, at the start of the kernel image. That entry point supports two calling conventions:

  a) Boot from Open Firmware. If your firmware is compatible with Open Firmware (IEEE 1275) or provides an OF compatible client interface API (support for the "interpret" callback of Forth words isn't required), you can enter the kernel with:

        r5 : OF callback pointer as defined by the IEEE 1275 bindings to powerpc. Only the 32 bits client interface is currently supported

        r3, r4 : address & length of an initrd if any, or 0

     The MMU is either on or off; the kernel will run the trampoline located in arch/ppc64/kernel/prom_init.c to extract the device-tree and other information from Open Firmware and build a flattened device-tree as described in b). prom_init() will then re-enter the kernel using the second method. This trampoline code runs in the context of the firmware, which is supposed to handle all exceptions during that time.

  b) Direct entry with a flattened device-tree block. This entry point is called by a) after the OF trampoline and can also be called directly by a bootloader that does not support the Open Firmware client interface. It is also used by "kexec" to implement "hot" booting of a new kernel from a previous running one. This method is what I will describe in more detail in this document, as method a) is simply standard Open Firmware, and thus should be implemented according to the various standard documents defining it and its binding to the PowerPC platform. The entry point definition then becomes:

        r3 : physical pointer to the device-tree block (defined in chapter II)

        r4 : physical pointer to the kernel itself.
             This is used by the assembly code to properly disable the MMU in case you are entering the kernel with MMU enabled and a non-1:1 mapping.

        r5 : NULL (so as to differentiate it from method a)

2) Board support
----------------

Board supports (platforms) are not exclusive config options. An arbitrary set of board supports can be built in a single kernel image. The kernel will "know" what set of functions to use for a given platform based on the content of the device-tree. Thus, you should:

  a) add your platform support as a _boolean_ option in arch/ppc64/Kconfig, following the example of PPC_PSERIES, PPC_PMAC and PPC_MAPLE. The latter is probably a good example of a board support to start from.

  b) create your main platform file as "arch/ppc64/kernel/myboard_setup.c" and add it to the Makefile under the condition of your CONFIG_ option. This file will define a structure of type "ppc_md" containing the various callbacks that the generic code will use to get to your platform specific code

  c) add a reference to your "ppc_md" structure in the "machines" table in arch/ppc64/kernel/setup.c

  d) request and get assigned a platform number (see the PLATFORM_* constants in include/asm-ppc64/processor.h)

I will describe later the boot process and various callbacks that your platform should implement.

II - The DT block format
========================

This chapter defines the actual format of the flattened device-tree passed to the kernel. Its actual content and the kernel requirements are described later. You can find examples of code manipulating that format in various places, including arch/ppc64/kernel/prom_init.c, which generates a flattened device-tree from the Open Firmware representation, or the fs2dt utility, part of the kexec tools, which generates one from a filesystem representation. It is expected that a bootloader like uboot provides a bit more support; that will be discussed later as well.

1) Header
---------

The kernel is entered with r3 pointing to an area of memory that is roughly described in include/asm-ppc64/prom.h by the structure boot_param_header:

struct boot_param_header
{
	u32	magic;			/* magic word OF_DT_HEADER */
	u32	totalsize;		/* total size of DT block */
	u32	off_dt_struct;		/* offset to structure */
	u32	off_dt_strings;		/* offset to strings */
	u32	off_mem_rsvmap;		/* offset to memory reserve map */
	u32	version;		/* format version */
	u32	last_comp_version;	/* last compatible version */
	/* version 2 fields below */
	u32	boot_cpuid_phys;	/* Which physical CPU id we're booting on */
};

Along with the constants:

/* Definitions used by the flattened device tree */
#define OF_DT_HEADER		0xd00dfeed	/* 4: version, 4: total size */
#define OF_DT_BEGIN_NODE	0x1		/* Start node: full name */
#define OF_DT_END_NODE		0x2		/* End node */
#define OF_DT_PROP		0x3		/* Property: name off, size, content */
#define OF_DT_END		0x9

All values in this header are in big endian format; the various fields in this header are defined more precisely below. All "offset" values are in bytes from the start of the header, that is from the value of r3.

   - magic

     This is a magic value that "marks" the beginning of the device-tree block header. It contains the value 0xd00dfeed and is defined by the constant OF_DT_HEADER

   - totalsize

     This is the total size of the DT block including the header. The "DT" block should enclose all data structures defined in this chapter (which are pointed to by offsets in this header). That is, the device-tree structure, strings, and the memory reserve map.

   - off_dt_struct

     This is an offset from the beginning of the header to the start of the "structure" part of the device tree. (see 2) device tree)

   - off_dt_strings

     This is an offset from the beginning of the header to the start of the "strings" part of the device-tree

   - off_mem_rsvmap

     This is an offset from the beginning of the header to the start of the reserved memory map.
     This map is a list of pairs of 64-bit integers. Each pair is a
     physical address and a size. The list is terminated by an entry
     of size 0. This map provides the kernel with a list of physical
     memory areas that are "reserved" and thus not to be used for
     memory allocations, especially during early initialisation. The
     kernel needs to allocate memory during boot for things like
     un-flattening the device-tree, allocating an MMU hash table,
     etc... Those allocations must be done in such a way as to avoid
     overwriting critical things like, on Open Firmware capable
     machines, the RTAS instance, or on some pSeries, the TCE tables
     used for the iommu. Typically, the reserve map should contain
     _at least_ this DT block itself (header,total_size). If you are
     passing an initrd to the kernel, you should reserve it as well.
     You do not need to reserve the kernel image itself. The map
     should be 64-bit aligned.

   - version

     This is the version of this structure. Version 1 stops
     here. Version 2 adds an additional field boot_cpuid_phys. You
     should always generate a structure of the highest version defined
     at the time of your implementation. That is version 2.

   - last_comp_version

     Last compatible version. This indicates down to what version of
     the DT block you are backward compatible with. For example,
     version 2 is backward compatible with version 1 (that is, a
     kernel built for version 1 will be able to boot with a version 2
     format). You should put a 1 in this field unless a new
     incompatible version of the DT block is defined.

   - boot_cpuid_phys

     This field only exists on version 2 headers. It indicates which
     physical CPU ID is calling the kernel entry point. This is used,
     among others, by kexec.
     If you are on an SMP system, this value should match the content
     of the "reg" property of the CPU node in the device-tree
     corresponding to the CPU calling the kernel entry point (see
     further chapters for more information on the required
     device-tree contents).

So the typical layout of a DT block (though the various parts don't
need to be in that order) looks like (addresses go from top to bottom):

             ------------------------------
       r3 -> |  struct boot_param_header  |
             ------------------------------
             |      (alignment gap) (*)   |
             ------------------------------
             |      memory reserve map    |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |    device-tree structure   |
             |                            |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |     device-tree strings    |
             |                            |
      -----> ------------------------------
      |
      |
      --- (r3 + totalsize)

  (*) The alignment gaps are not necessarily present, their presence
      and size are dependent on the various alignment requirements of
      the individual data blocks.


2) Device tree generalities
---------------------------

The device-tree itself is separated into two different blocks, a
structure block and a strings block. Both need to be page aligned.

First, let's quickly describe the device-tree concept before detailing
the storage format. This chapter does _not_ describe the detail of the
required types of nodes & properties for the kernel, this is done
later in chapter III.

The device-tree layout is strongly inherited from the definition of
the Open Firmware IEEE 1275 device-tree. It's basically a tree of
nodes, each node having two or more named properties. A property can
have a value or not.

It is a tree, so each node has one and only one parent except for the
root node, which has no parent.

A node has 2 names. The actual node name is contained in a property of
type "name" in the node property list whose value is a zero terminated
string and is mandatory.
There is also a "unit name" that is used to differentiate nodes with
the same name at the same level. It is usually made of the node name,
the "@" sign, and a "unit address", whose definition is specific to
the bus type the node sits on.

The unit name doesn't exist as a property per se but is included in
the device-tree structure. It is typically used to represent a "path"
in the device-tree. More details about these will be provided later.

The kernel ppc64 generic code does not make any formal use of the unit
address (though some board support code may do) so the only real
requirement here for the unit address is to ensure uniqueness of the
node unit name at a given level. Nodes with no notion of address and
no possible sibling of the same name (like /memory or /cpus) may omit
the unit address in the context of this specification, or use the "@0"
default unit address. The unit name is used to define a node "full
path", which is the concatenation of all parent nodes' unit names
separated with "/".

The root node is defined as being named "device-tree" and has no unit
address (no @ symbol followed by a unit address). When manipulating
device-tree "paths", the root of the tree is generally represented by
a simple slash sign "/".

Every node which actually represents an actual device (that is, which
isn't only a virtual "container" for more nodes, like "/cpus" is) is
also required to have a "device_type" property indicating the type of
node.

Finally, every node is required to have a "linux,phandle"
property. Real open firmware implementations don't provide it as it's
generated on the fly by the prom_init.c trampoline from the Open
Firmware "phandle". Implementations providing a flattened device-tree
directly should provide this property. This property is a 32-bit value
that uniquely identifies a node.
You are free to use whatever values or system of values, internal
pointers, or whatever to generate these, the only requirement is that
every single node of the tree you are passing to the kernel has a
unique value in this property.

This can be used in some cases for nodes to reference other nodes.

Here is an example of a simple device-tree. In this example, a "o"
designates a node followed by the node unit name. Properties are
presented with their name followed by their content. "content"
represents an ASCII string (zero terminated) value, while <content>
represents a 32-bit hexadecimal value. The various nodes in this
example will be discussed in a later chapter. At this point, it is
only meant to give you an idea of what a device-tree looks like.

  / o device-tree
      |- name = "device-tree"
      |- model = "MyBoardName"
      |- compatible = "MyBoardFamilyName"
      |- #address-cells = <2>
      |- #size-cells = <2>
      |- linux,phandle = <0>
      |
      o cpus
      | | - name = "cpus"
      | | - linux,phandle = <1>
      | |
      | o PowerPC,970@0
      |   |- name = "PowerPC,970"
      |   |- device_type = "cpu"
      |   |- reg = <0>
      |   |- clock-frequency = <5f5e1000>
      |   |- linux,boot-cpu
      |   |- linux,phandle = <2>
      |
      o memory@0
      | |- name = "memory"
      | |- device_type = "memory"
      | |- reg = <00000000 00000000 00000000 20000000>
      | |- linux,phandle = <3>
      |
      o chosen
        |- name = "chosen"
        |- bootargs = "root=/dev/sda2"
        |- linux,platform = <00000600>
        |- linux,phandle = <4>

This tree is an example of a minimal tree. It pretty much contains the
minimal set of required nodes and properties to boot a linux kernel,
that is some basic model information at the root, the CPUs, the
physical memory layout, and misc information passed through /chosen
like, in this example, the platform type (mandatory) and the kernel
command line arguments (optional).

The /cpus/PowerPC,970@0/linux,boot-cpu property is an example of a
property without a value. All other properties have a value.
The meaning of the #address-cells and #size-cells properties will be
explained in chapter III, which defines precisely the required nodes
and properties and their content.


3) Device tree "structure" block

The structure of the device tree is a linearized tree structure. The
"OF_DT_BEGIN_NODE" token starts a new node, and the "OF_DT_END_NODE"
token ends that node definition. Child nodes are simply defined before
"OF_DT_END_NODE" (that is nodes within the node). A 'token' is a
32-bit value.

Here's the basic structure of a single node:

     * token OF_DT_BEGIN_NODE (that is 0x00000001)
     * node full path as a zero terminated string
     * [align gap to next 4 bytes boundary]
     * for each property:
        * token OF_DT_PROP (that is 0x00000003)
        * 32-bit value of property value size in bytes (or 0 if no value)
        * 32-bit value of offset in string block of property name
        * [align gap to either next 4 bytes boundary if the property
          value size is less or equal to 4 bytes, or to next 8 bytes
          boundary if the property value size is larger than 4 bytes]
        * property value data if any
        * [align gap to next 4 bytes boundary]
     * [child nodes if any]
     * token OF_DT_END_NODE (that is 0x00000002)

So the node content can be summarised as a start token, a full path,
a list of properties, a list of child nodes and an end token. Every
child node is a full node structure itself as defined above.


4) Device tree "strings" block

In order to save space, property names, which are generally redundant,
are stored separately in the "strings" block. This block is simply the
whole bunch of zero terminated strings for all property names
concatenated together. The device-tree property definitions in the
structure block will contain offset values from the beginning of the
strings block.
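
To make the header and structure block rules above more concrete, here
is a minimal sketch in C of a walker that checks the magic and
totalsize, then steps through the tokens and counts nodes and
properties. It is only an illustration under stated assumptions: the
names dt_be32, dt_align and dt_walk are hypothetical (they are not
kernel functions), the blob is assumed to start on an 8-byte boundary
so that aligning offsets is the same as aligning addresses, and the
token values are the ones from the constants listed with the header
(0x2 closes a node, 0x9 ends the whole structure block).

```c
#include <stddef.h>
#include <stdint.h>

/* Token values, as listed with struct boot_param_header. */
#define OF_DT_HEADER		0xd00dfeedu
#define OF_DT_BEGIN_NODE	0x1u
#define OF_DT_END_NODE		0x2u
#define OF_DT_PROP		0x3u
#define OF_DT_END		0x9u

/* Read one big-endian 32-bit value, independent of host byte order. */
static uint32_t dt_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Round off up to the next multiple of a (a must be a power of two). */
static size_t dt_align(size_t off, size_t a)
{
	return (off + a - 1) & ~(a - 1);
}

/*
 * Hypothetical helper: walk the structure block of the DT block at
 * "blob" ("len" bytes are readable) and count nodes and properties.
 * Returns 0 on success, -1 if the block is malformed or truncated.
 */
static int dt_walk(const uint8_t *blob, size_t len,
		   unsigned *nodes, unsigned *props)
{
	size_t total, off;
	int depth = 0;

	if (len < 32 || dt_be32(blob) != OF_DT_HEADER)
		return -1;
	total = dt_be32(blob + 4);	/* totalsize */
	off = dt_be32(blob + 8);	/* off_dt_struct */
	if (total > len)
		return -1;

	*nodes = *props = 0;
	while (off <= total && total - off >= 4) {
		uint32_t token = dt_be32(blob + off);
		uint32_t size;

		off += 4;
		switch (token) {
		case OF_DT_BEGIN_NODE:
			/* zero terminated path, then gap to 4 bytes */
			while (off < total && blob[off] != '\0')
				off++;
			if (off >= total)
				return -1;
			off = dt_align(off + 1, 4);
			depth++;
			(*nodes)++;
			break;
		case OF_DT_PROP:
			if (total - off < 8)
				return -1;
			size = dt_be32(blob + off);	/* value size */
			off += 8;	/* skip size and name offset */
			if (size > 4)	/* large values are 8-byte aligned */
				off = dt_align(off, 8);
			if (off > total || size > total - off)
				return -1;
			off = dt_align(off + size, 4);
			(*props)++;
			break;
		case OF_DT_END_NODE:
			if (--depth < 0)
				return -1;
			break;
		case OF_DT_END:
			return depth == 0 ? 0 : -1;
		default:
			return -1;
		}
	}
	return -1;	/* ran off the block without seeing OF_DT_END */
}
```

A bootloader could run something like this over the block it is about
to hand to the kernel as a cheap sanity check. The property name
offset is skipped here; a fuller version would also check it against
the size of the strings block.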


III - Required content of the device tree
=========================================

< to be written >


IV - Recommendation for a bootloader
====================================

Here are some various ideas/recommendations that have been proposed
while all this has been defined and implemented.

  - It should be possible to write a parser that turns an ASCII
    representation of a device-tree (or even XML though I find that
    less readable) into a device-tree block. This would allow building
    the device-tree structure and strings "blobs" at bootloader build
    time, and having the bootloader just pass them as-is to the
    kernel. In fact, the device-tree blob could then be separate from
    the bootloader itself, and be placed in a separate portion of the
    flash that can be "personalized" for different board types by
    flashing a different device-tree.

  - The bootloader may want to be able to use the device-tree itself
    and may want to manipulate it (to add/edit some properties,
    like physical memory size or kernel arguments). At this point, 2
    choices can be made. Either the bootloader works directly on the
    flattened format, or the bootloader has its own internal tree
    representation with pointers (similar to the kernel one) and
    re-flattens the tree when booting the kernel. The former is a bit
    more difficult to edit/modify, the latter probably requires a bit
    more code to handle the tree structure. Note that the structure
    format has been designed so it's relatively easy to "insert"
    properties or nodes or delete them by just memmove'ing things
    around. It contains no internal offsets or pointers for this
    purpose.

  - An example of code for iterating nodes & retrieving properties
    directly from the flattened tree format can be found in the kernel
    file arch/ppc64/kernel/prom.c, look at the scan_flat_dt() function,
    its usage in early_init_devtree(), and the corresponding various
    early_init_dt_scan_*() callbacks.
    That code can be re-used in a GPL bootloader, and as the author
    of that code, I would be happy to discuss possible free licencing
    to any vendor who wishes to integrate all or part of this code
    into a non-GPL bootloader.

From arnd at arndb.de Wed May 18 17:14:41 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Wed, 18 May 2005 09:14:41 +0200
Subject: [PATCH 3/8] ppc64: add a watchdog driver for rtas
In-Reply-To: <20050517204029.GA2748@otto>
References: <200505132117.37461.arnd@arndb.de> <200505132124.48963.arnd@arndb.de> <20050517204029.GA2748@otto>
Message-ID: <200505180914.44336.arnd@arndb.de>

On Tuesday 17 May 2005 22:40, Nathan Lynch wrote:
> Arnd Bergmann wrote:
> > +static volatile int wdrtas_miscdev_open = 0;
> ...
> > +static int
> > +wdrtas_open(struct inode *inode, struct file *file)
> > +{
> > +	/* only open once */
> > +	if (xchg(&wdrtas_miscdev_open,1))
> > +		return -EBUSY;
>
> The volatile and xchg strike me as an obscure method for ensuring only
> one process at a time can open this file. Any reason a semaphore
> couldn't be used?

A semaphore would also be the wrong approach since we don't want
processes to block but instead to fail opening the watchdog twice.
Other watchdog drivers use atomic_t or bitops to guard open, which
imho would be the better solution. Of course, there is also Wim's
plan to do a unified watchdog driver that would solve this once and
for all.

> > +		printk("wdrtas: got unexpected close. Watchdog "
> > +				"not stopped.\n");
>
> printk's need a valid log level specified. There are several in this
> file that lack them.

Right. Utz, do you have time to fix up these issues? If not, I probably
won't look into it this week either.

Thanks,

	Arnd <><

From mgroeger at sysgo.com Wed May 18 18:12:21 2005
From: mgroeger at sysgo.com (Marius Groeger)
Date: Wed, 18 May 2005 10:12:21 +0200 (CEST)
Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO
In-Reply-To: <1116400151.918.10.camel@gaston>
References: <1116400151.918.10.camel@gaston>
Message-ID:

Ben,

On Wed, 18 May 2005, Benjamin Herrenschmidt wrote:

> Here's the very first draft of my HOWTO about booting the linux/ppc64
> kernel without open firmware. It's still incomplete, the main chapter
         ^^^^^^^^^^^^^^^^^^^^^
One could argue whether the full-blown emulation of an OF device tree
may really be called this.... ;-)

> b) Direct entry with a flattened device-tree block. This entry
> point is called by a) after the OF trampoline and can also be
> called directly by a bootloader that does not support the Open
> Firmware client interface. It is also used by "kexec" to

For OF based systems, what you outline definitely makes an awful lot
of sense. For others I wonder what the costs of this are in terms of
the memory footprint (both RAM and ROM). Are there reference
implementations in existence?

Regards,
Marius

--
Marius Groeger
SYSGO AG Embedded and Real-Time Software
Voice: +49 6136 9948 0 FAX: +49 6136 9948 10
www.sysgo.com | www.elinos.com | www.osek.de | www.pikeos.com

Meet us: Embedded Systems Expo & Conference, Tokyo Big Sight
2005-JUN-29 - 2005-JUL-01 http://www.esec.jp/en/

From arnd at arndb.de Wed May 18 22:40:59 2005
From: arnd at arndb.de (Arnd Bergmann)
Date: Wed, 18 May 2005 14:40:59 +0200
Subject: [PATCH] libfs: add simple attribute files
In-Reply-To: <20050514074524.GC20021@kroah.com>
References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com>
Message-ID: <200505181441.01495.arnd@arndb.de>

Based on the discussion about spufs attributes, this is my suggestion
for a more generic attribute file support that can be used by both
debugfs and spufs.

Simple attribute files behave similarly to sequential files from
a kernel programmer's perspective in that a standard set of file
operations is provided and only an open operation needs to
be written that registers file specific get() and set() functions.

These operations are defined as

	void foo_set(void *data, long val);

and

	long foo_get(void *data);

where data is the inode->u.generic_ip pointer of the file and the
operations just need to make sense of that pointer. The infrastructure
makes sure this works correctly with concurrent access and partial
read calls.

A macro named DEFINE_SIMPLE_ATTRIBUTE is provided to further simplify
using the attributes.

This patch already contains the changes for debugfs to use attributes
for its internal file operations.
Signed-off-by: Arnd Bergmann --- fs/debugfs/file.c | 74 +++++++++++++++++++---------------------- fs/libfs.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 48 +++++++++++++++++++++++++++ 3 files changed, 177 insertions(+), 39 deletions(-) fs/libfs.c: needs update Index: linus-2.5/include/linux/fs.h =================================================================== --- linus-2.5.orig/include/linux/fs.h 2005-05-18 10:58:52.000000000 +0200 +++ linus-2.5/include/linux/fs.h 2005-05-18 14:07:10.000000000 +0200 @@ -1657,6 +1657,55 @@ ar->size = n; } +/* + * simple attribute files + * + * These attributes behave similar to those in sysfs: + * + * Writing to an attribute immediately sets a value, an open file can be + * written to multiple times. + * + * Reading from an attribute creates a buffer from the value that might get + * read with multiple read calls. When the attribute has been read + * completely, no further read calls are possible until the file is opened + * again. + * + * All spufs attributes contain a text representation of a numeric value + * that are accessed with the get() and set() functions. + * + * Perhaps these file operations could be put in debugfs or libfs instead, + * they are not really SPU specific. + */ +#define DEFINE_SIMPLE_ATTRIBUTE(__fops, __get, __set, __fmt) \ +static int __fops ## _open(struct inode *inode, struct file *file) \ +{ \ + __simple_attr_check_format(__fmt, 0ul); \ + return simple_attr_open(inode, file, &__get, &__set, __fmt); \ +} \ +static struct file_operations __fops = { \ + .owner = THIS_MODULE, \ + .open = __fops ## _open, \ + .release = simple_attr_close, \ + .read = simple_attr_read, \ + .write = simple_attr_write, \ +}; + +static inline void __attribute__((format(printf, 1, 2))) +__simple_attr_check_format(const char *fmt, ...) 
+{ + /* don't do anything, just let the compiler check the arguments; */ +} + +int simple_attr_open(struct inode *inode, struct file *file, + long (*get)(void *), void (*set)(void *, long), + const char *fmt); +int simple_attr_close(struct inode *inode, struct file *file); +ssize_t simple_attr_read(struct file *file, char __user *buf, + size_t len, loff_t *ppos); +ssize_t simple_attr_write(struct file *file, const char __user *buf, + size_t len, loff_t *ppos); + + #ifdef CONFIG_SECURITY static inline char *alloc_secdata(void) { Index: linus-2.5/fs/libfs.c =================================================================== --- linus-2.5.orig/fs/libfs.c 2005-05-18 10:58:52.000000000 +0200 +++ linus-2.5/fs/libfs.c 2005-05-18 12:06:49.000000000 +0200 @@ -519,6 +519,100 @@ return 0; } +/* Simple attribute files */ + +struct simple_attr { + long (*get)(void *); + void (*set)(void *, long); + char get_buf[24]; /* enough to store a long and "\n\0" */ + char set_buf[24]; + void *data; + const char *fmt; /* format for read operation */ + struct semaphore sem; /* protects access to these buffers */ +}; + +/* simple_attr_open is called by an actual attribute open file operation + * to set the attribute specific access operations. 
*/ +int simple_attr_open(struct inode *inode, struct file *file, + long (*get)(void *), void (*set)(void *, long), + const char *fmt) +{ + struct simple_attr *attr; + + attr = kmalloc(sizeof *attr, GFP_KERNEL); + if (!attr) + return -ENOMEM; + + /* reading/writing needs the respective get/set operation */ + if (((file->f_mode & FMODE_READ) && !get) || + ((file->f_mode & FMODE_WRITE) && !set)) + return -EACCES; + + attr->get = get; + attr->set = set; + attr->data = inode->u.generic_ip; + attr->fmt = fmt; + init_MUTEX(&attr->sem); + + file->private_data = attr; + + return nonseekable_open(inode, file); +} + +int simple_attr_close(struct inode *inode, struct file *file) +{ + kfree(file->private_data); + return 0; +} + +/* read from the buffer that is filled with the get function */ +ssize_t simple_attr_read(struct file *file, char __user *buf, + size_t len, loff_t *ppos) +{ + struct simple_attr *attr; + size_t size; + ssize_t ret; + + attr = file->private_data; + + down(&attr->sem); + if (*ppos) /* continued read */ + size = strlen(attr->get_buf); + else /* first read */ + size = scnprintf(attr->get_buf, sizeof (attr->get_buf), + attr->fmt, attr->get(attr->data)); + + ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size); + up(&attr->sem); + return ret; +} + +/* interpret the buffer as a number to call the set function with */ +ssize_t simple_attr_write(struct file *file, const char __user *buf, + size_t len, loff_t *ppos) +{ + struct simple_attr *attr; + long val; + size_t size; + ssize_t ret; + + attr = file->private_data; + + down(&attr->sem); + ret = -EFAULT; + size = min(sizeof (attr->set_buf) - 1, len); + if (copy_from_user(attr->set_buf, buf, size)) + goto out; + + ret = len; /* claim we got the whole input */ + attr->set_buf[size] = '\0'; + val = simple_strtol(attr->set_buf, NULL, 0); + attr->set(attr->data, val); +out: + up(&attr->sem); + return ret; +} + EXPORT_SYMBOL(dcache_dir_close); EXPORT_SYMBOL(dcache_dir_lseek); 
EXPORT_SYMBOL(dcache_dir_open); Index: linus-2.5/fs/debugfs/file.c =================================================================== --- linus-2.5.orig/fs/debugfs/file.c 2005-05-18 10:58:52.000000000 +0200 +++ linus-2.5/fs/debugfs/file.c 2005-05-18 12:22:16.000000000 +0200 @@ -45,45 +45,6 @@ .open = default_open, }; -#define simple_type(type, format, temptype, strtolfn) \ -static ssize_t read_file_##type(struct file *file, char __user *user_buf, \ - size_t count, loff_t *ppos) \ -{ \ - char buf[32]; \ - type *val = file->private_data; \ - \ - snprintf(buf, sizeof(buf), format "\n", *val); \ - return simple_read_from_buffer(user_buf, count, ppos, buf, strlen(buf));\ -} \ -static ssize_t write_file_##type(struct file *file, const char __user *user_buf,\ - size_t count, loff_t *ppos) \ -{ \ - char *endp; \ - char buf[32]; \ - int buf_size; \ - type *val = file->private_data; \ - temptype tmp; \ - \ - memset(buf, 0x00, sizeof(buf)); \ - buf_size = min(count, (sizeof(buf)-1)); \ - if (copy_from_user(buf, user_buf, buf_size)) \ - return -EFAULT; \ - \ - tmp = strtolfn(buf, &endp, 0); \ - if ((endp == buf) || ((type)tmp != tmp)) \ - return -EINVAL; \ - *val = tmp; \ - return count; \ -} \ -static struct file_operations fops_##type = { \ - .read = read_file_##type, \ - .write = write_file_##type, \ - .open = default_open, \ -}; -simple_type(u8, "%c", unsigned long, simple_strtoul); -simple_type(u16, "%hi", unsigned long, simple_strtoul); -simple_type(u32, "%i", unsigned long, simple_strtoul); - /** * debugfs_create_u8 - create a file in the debugfs filesystem that is used to read and write a unsigned 8 bit value. * @@ -109,6 +70,18 @@ * NULL or !NULL instead as to eliminate the need for #ifdef in the calling * code. 
*/ +static void debugfs_u8_set(void *data, long val) +{ + *(u8 *)data = val; +} + +static long debugfs_u8_get(void *data) +{ + return *(u8 *)data; +} + +DEFINE_SIMPLE_ATTRIBUTE(fops_u8, debugfs_u8_get, debugfs_u8_set, "%lu\n"); + struct dentry *debugfs_create_u8(const char *name, mode_t mode, struct dentry *parent, u8 *value) { @@ -141,6 +114,17 @@ * NULL or !NULL instead as to eliminate the need for #ifdef in the calling * code. */ +static void debugfs_u16_set(void *data, long val) +{ + *(u16 *)data = val; +} + +static long debugfs_u16_get(void *data) +{ + return *(u16 *)data; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_u16, debugfs_u16_get, debugfs_u16_set, "%lu\n"); + struct dentry *debugfs_create_u16(const char *name, mode_t mode, struct dentry *parent, u16 *value) { @@ -173,6 +157,18 @@ * NULL or !NULL instead as to eliminate the need for #ifdef in the calling * code. */ +static void debugfs_u32_set(void *data, long val) +{ + *(u32 *)data = val; +} + +static long debugfs_u32_get(void *data) +{ + return *(u32 *)data; +} + +DEFINE_SIMPLE_ATTRIBUTE(fops_u32, debugfs_u32_get, debugfs_u32_set, "%lu\n"); + struct dentry *debugfs_create_u32(const char *name, mode_t mode, struct dentry *parent, u32 *value) { From ntl at pobox.com Thu May 19 00:45:34 2005 From: ntl at pobox.com (Nathan Lynch) Date: Wed, 18 May 2005 09:45:34 -0500 Subject: [PATCH 3/8] ppc64: add a watchdog driver for rtas In-Reply-To: <200505180914.44336.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <200505132124.48963.arnd@arndb.de> <20050517204029.GA2748@otto> <200505180914.44336.arnd@arndb.de> Message-ID: <20050518144534.GA3723@otto> Arnd Bergmann wrote: > On Dinsdag 17 Mai 2005 22:40, Nathan Lynch wrote: > > Arnd Bergmann wrote: > > > +static volatile int wdrtas_miscdev_open = 0; > > ... 
> > > +static int > > > +wdrtas_open(struct inode *inode, struct file *file) > > > +{ > > > + /* only open once */ > > > + if (xchg(&wdrtas_miscdev_open,1)) > > > + return -EBUSY; > > > > The volatile and xchg strike me as an obscure method for ensuring only > > one process at a time can open this file. Any reason a semaphore > > couldn't be used? > > A semaphore would also be the wrong approach since we don't want > processes to block but instead to fail opening the watchdog twice. I should have been more explicit. What I had in mind was using down_trylock and returning -EBUSY if it failed. Nathan From arnd at arndb.de Thu May 19 00:39:51 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Wed, 18 May 2005 16:39:51 +0200 Subject: [PATCH 3/8] ppc64: add a watchdog driver for rtas In-Reply-To: <20050518144534.GA3723@otto> References: <200505132117.37461.arnd@arndb.de> <200505180914.44336.arnd@arndb.de> <20050518144534.GA3723@otto> Message-ID: <200505181639.52415.arnd@arndb.de> On Middeweken 18 Mai 2005 16:45, Nathan Lynch wrote: > > > A semaphore would also be the wrong approach since we don't want > > processes to block but instead to fail opening the watchdog twice. > > I should have been more explicit. ?What I had in mind was using > down_trylock and returning -EBUSY if it failed. Well, that's also pointless. 
If the only operations you ever do on a semaphore are down_trylock() and up(), you end up using only the atomic variable in there while wasting a few bytes of extra memory for storing the wait queue head ;-) Arnd <>< From greg at kroah.com Thu May 19 06:24:46 2005 From: greg at kroah.com (Greg KH) Date: Wed, 18 May 2005 13:24:46 -0700 Subject: [PATCH] libfs: add simple attribute files In-Reply-To: <200505181441.01495.arnd@arndb.de> References: <200505132117.37461.arnd@arndb.de> <200505132129.07603.arnd@arndb.de> <20050514074524.GC20021@kroah.com> <200505181441.01495.arnd@arndb.de> Message-ID: <20050518202446.GA20041@kroah.com> On Wed, May 18, 2005 at 02:40:59PM +0200, Arnd Bergmann wrote: > Based on the discussion about spufs attributes, this is my suggestion > for a more generic attribute file support that can be used by both > debugfs and spufs. Thanks for the patch. I've cleaned it up a bit (drop the spufs comments, changed the access check, and made the val be u64, and exported the symbols and cleaned up the debugfs portion) and added it to my tree. It should show up in the next -mm release. I've included the patch below so you can see my changes. thanks, greg k-h --------------- Based on the discussion about spufs attributes, this is my suggestion for a more generic attribute file support that can be used by both debugfs and spufs. Simple attribute files behave similarly to sequential files from a kernel programmers perspective in that a standard set of file operations is provided and only an open operation needs to be written that registers file specific get() and set() functions. These operations are defined as void foo_set(void *data, long val); and long foo_get(void *data); where data is the inode->u.generic_ip pointer of the file and the operations just need to make send of that pointer. The infrastructure makes sure this works correctly with concurrent access and partial read calls. 
A macro named DEFINE_SIMPLE_ATTRIBUTE is provided to further simplify using the attributes. This patch already contains the changes for debugfs to use attributes for its internal file operations. Signed-off-by: Arnd Bergmann Signed-off-by: Greg Kroah-Hartman --- fs/debugfs/file.c | 67 +++++++++++++++-------------------- fs/libfs.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 46 ++++++++++++++++++++++++ 3 files changed, 174 insertions(+), 38 deletions(-) --- gregkh-2.6.orig/include/linux/fs.h 2005-05-18 11:16:39.000000000 -0700 +++ gregkh-2.6/include/linux/fs.h 2005-05-18 11:16:51.000000000 -0700 @@ -1657,6 +1657,52 @@ ar->size = n; } +/* + * simple attribute files + * + * These attributes behave similar to those in sysfs: + * + * Writing to an attribute immediately sets a value, an open file can be + * written to multiple times. + * + * Reading from an attribute creates a buffer from the value that might get + * read with multiple read calls. When the attribute has been read + * completely, no further read calls are possible until the file is opened + * again. + * + * All attributes contain a text representation of a numeric value + * that are accessed with the get() and set() functions. + */ +#define DEFINE_SIMPLE_ATTRIBUTE(__fops, __get, __set, __fmt) \ +static int __fops ## _open(struct inode *inode, struct file *file) \ +{ \ + __simple_attr_check_format(__fmt, 0ul); \ + return simple_attr_open(inode, file, &__get, &__set, __fmt); \ +} \ +static struct file_operations __fops = { \ + .owner = THIS_MODULE, \ + .open = __fops ## _open, \ + .release = simple_attr_close, \ + .read = simple_attr_read, \ + .write = simple_attr_write, \ +}; + +static inline void __attribute__((format(printf, 1, 2))) +__simple_attr_check_format(const char *fmt, ...) 
+{ + /* don't do anything, just let the compiler check the arguments; */ +} + +int simple_attr_open(struct inode *inode, struct file *file, + u64 (*get)(void *), void (*set)(void *, u64), + const char *fmt); +int simple_attr_close(struct inode *inode, struct file *file); +ssize_t simple_attr_read(struct file *file, char __user *buf, + size_t len, loff_t *ppos); +ssize_t simple_attr_write(struct file *file, const char __user *buf, + size_t len, loff_t *ppos); + + #ifdef CONFIG_SECURITY static inline char *alloc_secdata(void) { --- gregkh-2.6.orig/fs/libfs.c 2005-05-18 11:16:39.000000000 -0700 +++ gregkh-2.6/fs/libfs.c 2005-05-18 11:19:09.000000000 -0700 @@ -519,6 +519,101 @@ return 0; } +/* Simple attribute files */ + +struct simple_attr { + u64 (*get)(void *); + void (*set)(void *, u64); + char get_buf[24]; /* enough to store a u64 and "\n\0" */ + char set_buf[24]; + void *data; + const char *fmt; /* format for read operation */ + struct semaphore sem; /* protects access to these buffers */ +}; + +/* simple_attr_open is called by an actual attribute open file operation + * to set the attribute specific access operations. 
*/ +int simple_attr_open(struct inode *inode, struct file *file, + u64 (*get)(void *), void (*set)(void *, u64), + const char *fmt) +{ + struct simple_attr *attr; + + attr = kmalloc(sizeof(*attr), GFP_KERNEL); + if (!attr) + return -ENOMEM; + + attr->get = get; + attr->set = set; + attr->data = inode->u.generic_ip; + attr->fmt = fmt; + init_MUTEX(&attr->sem); + + file->private_data = attr; + + return nonseekable_open(inode, file); +} + +int simple_attr_close(struct inode *inode, struct file *file) +{ + kfree(file->private_data); + return 0; +} + +/* read from the buffer that is filled with the get function */ +ssize_t simple_attr_read(struct file *file, char __user *buf, + size_t len, loff_t *ppos) +{ + struct simple_attr *attr; + size_t size; + ssize_t ret; + + attr = file->private_data; + + if (!attr->get) + return -EACCES; + + down(&attr->sem); + if (*ppos) /* continued read */ + size = strlen(attr->get_buf); + else /* first read */ + size = scnprintf(attr->get_buf, sizeof(attr->get_buf), + attr->fmt, attr->get(attr->data)); + + ret = simple_read_from_buffer(buf, len, ppos, attr->get_buf, size); + up(&attr->sem); + return ret; +} + +/* interpret the buffer as a number to call the set function with */ +ssize_t simple_attr_write(struct file *file, const char __user *buf, + size_t len, loff_t *ppos) +{ + struct simple_attr *attr; + u64 val; + size_t size; + ssize_t ret; + + attr = file->private_data; + + if (!attr->set) + return -EACCES; + + down(&attr->sem); + ret = -EFAULT; + size = min(sizeof(attr->set_buf) - 1, len); + if (copy_from_user(attr->set_buf, buf, size)) + goto out; + + ret = len; /* claim we got the whole input */ + attr->set_buf[size] = '\0'; + val = simple_strtol(attr->set_buf, NULL, 0); + attr->set(attr->data, val); +out: + up(&attr->sem); + return ret; +} + EXPORT_SYMBOL(dcache_dir_close); EXPORT_SYMBOL(dcache_dir_lseek); EXPORT_SYMBOL(dcache_dir_open); @@ -547,3 +642,7 @@ EXPORT_SYMBOL(simple_transaction_get); 
EXPORT_SYMBOL(simple_transaction_read); EXPORT_SYMBOL(simple_transaction_release); +EXPORT_SYMBOL_GPL(simple_attr_open); +EXPORT_SYMBOL_GPL(simple_attr_close); +EXPORT_SYMBOL_GPL(simple_attr_read); +EXPORT_SYMBOL_GPL(simple_attr_write); --- gregkh-2.6.orig/fs/debugfs/file.c 2005-05-18 11:16:39.000000000 -0700 +++ gregkh-2.6/fs/debugfs/file.c 2005-05-18 11:18:35.000000000 -0700 @@ -45,44 +45,15 @@ .open = default_open, }; -#define simple_type(type, format, temptype, strtolfn) \ -static ssize_t read_file_##type(struct file *file, char __user *user_buf, \ - size_t count, loff_t *ppos) \ -{ \ - char buf[32]; \ - type *val = file->private_data; \ - \ - snprintf(buf, sizeof(buf), format "\n", *val); \ - return simple_read_from_buffer(user_buf, count, ppos, buf, strlen(buf));\ -} \ -static ssize_t write_file_##type(struct file *file, const char __user *user_buf,\ - size_t count, loff_t *ppos) \ -{ \ - char *endp; \ - char buf[32]; \ - int buf_size; \ - type *val = file->private_data; \ - temptype tmp; \ - \ - memset(buf, 0x00, sizeof(buf)); \ - buf_size = min(count, (sizeof(buf)-1)); \ - if (copy_from_user(buf, user_buf, buf_size)) \ - return -EFAULT; \ - \ - tmp = strtolfn(buf, &endp, 0); \ - if ((endp == buf) || ((type)tmp != tmp)) \ - return -EINVAL; \ - *val = tmp; \ - return count; \ -} \ -static struct file_operations fops_##type = { \ - .read = read_file_##type, \ - .write = write_file_##type, \ - .open = default_open, \ -}; -simple_type(u8, "%c", unsigned long, simple_strtoul); -simple_type(u16, "%hi", unsigned long, simple_strtoul); -simple_type(u32, "%i", unsigned long, simple_strtoul); +static void debugfs_u8_set(void *data, u64 val) +{ + *(u8 *)data = val; +} +static u64 debugfs_u8_get(void *data) +{ + return *(u8 *)data; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_u8, debugfs_u8_get, debugfs_u8_set, "%lu\n"); /** * debugfs_create_u8 - create a file in the debugfs filesystem that is used to read and write a unsigned 8 bit value. 
@@ -116,6 +87,16 @@ } EXPORT_SYMBOL_GPL(debugfs_create_u8); +static void debugfs_u16_set(void *data, u64 val) +{ + *(u16 *)data = val; +} +static u64 debugfs_u16_get(void *data) +{ + return *(u16 *)data; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_u16, debugfs_u16_get, debugfs_u16_set, "%lu\n"); + /** * debugfs_create_u16 - create a file in the debugfs filesystem that is used to read and write a unsigned 8 bit value. * @@ -148,6 +129,16 @@ } EXPORT_SYMBOL_GPL(debugfs_create_u16); +static void debugfs_u32_set(void *data, u64 val) +{ + *(u32 *)data = val; +} +static u64 debugfs_u32_get(void *data) +{ + return *(u32 *)data; +} +DEFINE_SIMPLE_ATTRIBUTE(fops_u32, debugfs_u32_get, debugfs_u32_set, "%lu\n"); + /** * debugfs_create_u32 - create a file in the debugfs filesystem that is used to read and write a unsigned 8 bit value. * From benh at kernel.crashing.org Thu May 19 09:11:23 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 19 May 2005 09:11:23 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: References: <1116400151.918.10.camel@gaston> Message-ID: <1116457884.918.29.camel@gaston> On Wed, 2005-05-18 at 10:12 +0200, Marius Groeger wrote: > Ben, > > On Wed, 18 May 2005, Benjamin Herrenschmidt wrote: > > > Here's the very first draft of my HOWTO about booting the linux/ppc64 > > kernel without open firmware. It's still incomplete, the main chapter > ^^^^^^^^^^^^^^^^^^^^^ > One could argue whether the full-blown emulation of an OF device tree > may really be called this.... ;-) You must be kidding :) Honestly, a device tree is small and rather simple to layout, and would fix most of the issues with piling up crap like incompatible boot_info structures and that sort of thing that plague the ppc32 kernel. A full blown implementation of OF is a lot bigger. 
It requires at least 3 different interfaces (the user interface, the fcode interface, the client interface), along with all the bits & pieces to get a full runtime environment. > > > b) Direct entry with a flattened device-tree block. This entry > > point is called by a) after the OF trampoline and can also be > > called directly by a bootloader that does not support the Open > > Firmware client interface. It is also used by "kexec" to > > For OF based systems, what you outline definitely makes an awful lot of > sense. How so ? OF based system just implement the OF interface... > For others I wonder what the costs of this are in terms of the memory > footprint (both RAM and ROM). Are there reference implementations in > existence? You may not have noticed (well, I haven't filled part III yet so it may not be clear), but I'm only making a very small subset of the device-tree mandatory, though I do encourage people to provide an as complete as possible. For example, I will definitely not require the bootloader to provide a full tree of PCI devices, only host bridges, in order to get interrupt routing and resource mapping. However, I encourage people to put things like on-chip devices in there, it makes everything much more flexible. Regarding the cost, well, the device-tree itself is fairly small, maybe a couple of pages for a minimum one. As I wrote, embedded boards can decide to have it built at booloader build time, and simply embedded as a blob in the firmware and passed along to the kernel, that is 0 firmware code. However, it would be simple to add minimum capabilities to the firmware for editing/adding properties (for things like memory size or kernel command line). I wonder sometimes why people are so "afraid" of the device-tree concept... it's really simple, does not require that much code, and makes everything so much more flexible in the long run. Ben. 
From benh at kernel.crashing.org Thu May 19 09:32:16 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 19 May 2005 09:32:16 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: References: <1116400151.918.10.camel@gaston> Message-ID: <1116459136.918.39.camel@gaston> > For others I wonder what the costs of this are in terms of the memory > footprint (both RAM and ROM). Are there reference implementations in > existence? Oh, and to complete my answer, no there isn't per-se a reference implementation yet. What exist so far, outside of actual full fledged OF implementations, are IBM PIBS firmware for embedded which implements the full OF client interface, and the kexec tools using the flattened format. The reason why I'm writing this document is precisely to get that developement started as part of uboot. As it was said earlier, no new board support code will be accepted upstream if it doesn't use a device-tree. This decision has been taken a while ago and will not be changed. There are IBM internal stuffs used for bringup that implement this, so I can confirm it works :) But unfortunately, none of these can be distributed at the moment, and thus they don't constitute a reference implementation. Ben. From benh at kernel.crashing.org Thu May 19 14:56:53 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 19 May 2005 14:56:53 +1000 Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <1116400151.918.10.camel@gaston> References: <1116400151.918.10.camel@gaston> Message-ID: <1116478614.918.75.camel@gaston> On Wed, 2005-05-18 at 17:09 +1000, Benjamin Herrenschmidt wrote: > Hi ! > > Here's the very first draft of my HOWTO about booting the linux/ppc64 > kernel without open firmware. 
It's still incomplete, the main chapter
> describing which nodes & properties are required and their format is
> still missing (though it will basically be a subset of the Open Firmware
> specification & bindings). The format of the flattened device-tree is
> documented.

And here is a second draft with more info.

Booting the Linux/ppc64 kernel without Open Firmware
----------------------------------------------------

(c) 2005 Benjamin Herrenschmidt , IBM Corp.

May 18, 2005: Rev 0.1 - Initial draft, no chapter III yet.

May 19, 2005: Rev 0.2 - Add chapter III and bits & pieces here and there, and clarify the fact that a lot of things are optional: the kernel only requires a very small device tree, though it is encouraged to provide one that is as complete as possible.

I - Introduction
================

During the recent development of the Linux/ppc64 kernel, and more specifically the addition of new platform types outside of the old IBM pSeries/iSeries pair, it was decided to enforce some strict rules regarding the kernel entry and bootloader <-> kernel interfaces, in order to avoid the degeneration that has become the ppc32 kernel entry point and the way a new platform should be added to the kernel. The legacy iSeries platform breaks those rules as it predates this scheme, but no new board support will be accepted in the main tree that doesn't follow them properly.

The main requirement, which will be defined in more detail below, is the presence of a device-tree whose format is defined after the Open Firmware specification. However, in order to make life easier for embedded board vendors, the kernel doesn't require the device-tree to represent every device in the system and only requires some nodes and properties to be present. This will be described in detail in section III, but, for example, the kernel does not require you to create a node for every PCI device in the system.
It is a requirement to have a node for PCI host bridges in order to provide interrupt routing information and memory/IO ranges, among others. It is also recommended to define nodes for on-chip devices and other busses that don't specifically fit in an existing OF specification. This creates great flexibility in the way the kernel can then probe those and match drivers to devices, without having to hard code all sorts of tables. It also makes it more flexible for board vendors to do minor hardware upgrades without significantly impacting the kernel code or cluttering it with special cases.

1) Entry point
--------------

There is one and only one entry point to the kernel, at the start of the kernel image. That entry point supports two calling conventions:

a) Boot from Open Firmware. If your firmware is compatible with Open Firmware (IEEE 1275) or provides an OF compatible client interface API (support for the "interpret" callback of forth words isn't required), you can enter the kernel with:

    r5 : OF callback pointer as defined by IEEE 1275 bindings to powerpc. Only the 32 bits client interface is currently supported

    r3, r4 : address & length of an initrd if any, or 0

   The MMU is either on or off; the kernel will run the trampoline located in arch/ppc64/kernel/prom_init.c to extract the device-tree and other information from Open Firmware and build a flattened device-tree as described in b). prom_init() will then re-enter the kernel using the second method. This trampoline code runs in the context of the firmware, which is supposed to handle all exceptions during that time.

b) Direct entry with a flattened device-tree block. This entry point is called by a) after the OF trampoline and can also be called directly by a bootloader that does not support the Open Firmware client interface. It is also used by "kexec" to implement "hot" booting of a new kernel from a previous running one.
This method is what I will describe in more detail in this document, as method a) is simply standard Open Firmware, and thus should be implemented according to the various standard documents defining it and its binding to the PowerPC platform. The entry point definition then becomes:

    r3 : physical pointer to the device-tree block (defined in chapter II)

    r4 : physical pointer to the kernel itself. This is used by the assembly code to properly disable the MMU in case you are entering the kernel with MMU enabled and a non-1:1 mapping.

    r5 : NULL (so as to differentiate from method a)

Note about SMP entry: Either your firmware puts your other CPUs in some sleep loop or spin loop in ROM where you can get them out via a soft reset or some other means, in which case you don't need to care, or you'll have to enter the kernel with all CPUs. The way to do that with method b) will be described in a later revision of this document.

2) Board support
----------------

Board supports (platforms) are not exclusive config options. An arbitrary set of board supports can be built in a single kernel image. The kernel will "know" what set of functions to use for a given platform based on the content of the device-tree. Thus, you should:

a) add your platform support as a _boolean_ option in arch/ppc64/Kconfig, following the example of PPC_PSERIES, PPC_PMAC and PPC_MAPLE. The latter is probably a good example of a board support to start from.

b) create your main platform file as "arch/ppc64/kernel/myboard_setup.c" and add it to the Makefile under the condition of your CONFIG_ option.
This file will define a structure of type "ppc_md" containing the various callbacks that the generic code will use to get to your platform specific code.

c) Add a reference to your "ppc_md" structure in the "machines" table in arch/ppc64/kernel/setup.c

d) request and get assigned a platform number (see PLATFORM_* constants in include/asm-ppc64/processor.h)

I will describe later the boot process and various callbacks that your platform should implement.

II - The DT block format
========================

This chapter defines the actual format of the flattened device-tree passed to the kernel. The actual content of it and kernel requirements are described later. You can find examples of code manipulating that format in various places, including arch/ppc64/kernel/prom_init.c which will generate a flattened device-tree from the Open Firmware representation, or the fs2dt utility which is part of the kexec tools which will generate one from a filesystem representation. It is expected that a bootloader like uboot provides a bit more support, that will be discussed later as well.
1) Header
---------

The kernel is entered with r3 pointing to an area of memory that is roughly described in include/asm-ppc64/prom.h by the structure boot_param_header:

struct boot_param_header {
	u32	magic;			/* magic word OF_DT_HEADER */
	u32	totalsize;		/* total size of DT block */
	u32	off_dt_struct;		/* offset to structure */
	u32	off_dt_strings;		/* offset to strings */
	u32	off_mem_rsvmap;		/* offset to memory reserve map */
	u32	version;		/* format version */
	u32	last_comp_version;	/* last compatible version */
	/* version 2 fields below */
	u32	boot_cpuid_phys;	/* Which physical CPU id we're booting on */
};

Along with the constants:

/* Definitions used by the flattened device tree */
#define OF_DT_HEADER		0xd00dfeed	/* 4: version, 4: total size */
#define OF_DT_BEGIN_NODE	0x1		/* Start node: full name */
#define OF_DT_END_NODE		0x2		/* End node */
#define OF_DT_PROP		0x3		/* Property: name off, size, content */
#define OF_DT_END		0x9

All values in this header are in big endian format; the various fields in this header are defined more precisely below. All "offset" values are in bytes from the start of the header, that is from the value of r3.

 - magic

   This is a magic value that "marks" the beginning of the device-tree block header. It contains the value 0xd00dfeed and is defined by the constant OF_DT_HEADER

 - totalsize

   This is the total size of the DT block including the header. The "DT" block should enclose all data structures defined in this chapter (which are pointed to by offsets in this header). That is, the device-tree structure, strings, and the memory reserve map.

 - off_dt_struct

   This is an offset from the beginning of the header to the start of the "structure" part of the device tree. (see 2) device tree)

 - off_dt_strings

   This is an offset from the beginning of the header to the start of the "strings" part of the device-tree

 - off_mem_rsvmap

   This is an offset from the beginning of the header to the start of the reserved memory map.
This map is a list of pairs of 64 bits integers. Each pair is a physical address and a size. The list is terminated by an entry of size 0. This map provides the kernel with a list of physical memory areas that are "reserved" and thus not to be used for memory allocations, especially during early initialisation. The kernel needs to allocate memory during boot for things like un-flattening the device-tree, allocating an MMU hash table, etc... Those allocations must be done in such a way as to avoid overwriting critical things like, on Open Firmware capable machines, the RTAS instance, or on some pSeries, the TCE tables used for the iommu. Typically, the reserve map should contain _at least_ this DT block itself (header,total_size). If you are passing an initrd to the kernel, you should reserve it as well. You do not need to reserve the kernel image itself. The map should be 64 bits aligned.

 - version

   This is the version of this structure. Version 1 stops here. Version 2 adds an additional field boot_cpuid_phys. You should always generate a structure of the highest version defined at the time of your implementation. That is version 2.

 - last_comp_version

   Last compatible version. This indicates down to what version of the DT block you are backward compatible with. For example, version 2 is backward compatible with version 1 (that is, a kernel built for version 1 will be able to boot with a version 2 format). You should put a 1 in this field unless a new incompatible version of the DT block is defined.

 - boot_cpuid_phys

   This field only exists on version 2 headers. It indicates which physical CPU ID is calling the kernel entry point. This is used, among others, by kexec.
If you are on an SMP system, this value should match the content of the "reg" property of the CPU node in the device-tree corresponding to the CPU calling the kernel entry point (see further chapters for more information on the required device-tree contents)

So the typical layout of a DT block (though the various parts don't need to be in that order) looks like (addresses go from top to bottom):

             ------------------------------
       r3 -> |  struct boot_param_header  |
             ------------------------------
             |      (alignment gap) (*)   |
             ------------------------------
             |      memory reserve map    |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |    device-tree structure   |
             |                            |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |     device-tree strings    |
             |                            |
      -----> ------------------------------
      |
      |
      --- (r3 + totalsize)

(*) The alignment gaps are not necessarily present; their presence and size depend on the various alignment requirements of the individual data blocks.

2) Device tree generalities
---------------------------

The device-tree itself is separated into two different blocks, a structure block and a strings block. Both need to be page aligned.

First, let's quickly describe the device-tree concept before detailing the storage format. This chapter does _not_ describe the detail of the required types of nodes & properties for the kernel, this is done later in chapter III.

The device-tree layout is strongly inherited from the definition of the Open Firmware IEEE 1275 device-tree. It's basically a tree of nodes, each node having two or more named properties. A property can have a value or not.

It is a tree, so each node has one and only one parent except for the root node, which has no parent.

A node has 2 names. The actual node name is contained in a property of type "name" in the node property list whose value is a zero terminated string and is mandatory.
There is also a "unit name" that is used to differentiate nodes with the same name at the same level. It is usually made of the node name, the "@" sign, and a "unit address", whose definition is specific to the bus type the node sits on.

The unit name doesn't exist as a property per se but is included in the device-tree structure. It is typically used to represent a "path" in the device-tree. More details about these will be provided later.

The kernel ppc64 generic code does not make any formal use of the unit address (though some board support code may do), so the only real requirement here for the unit address is to ensure uniqueness of the node unit name at a given level. Nodes with no notion of address and no possible sibling of the same name (like /memory or /cpus) may omit the unit address in the context of this specification, or use the "@0" default unit address. The unit name is used to define a node "full path", which is the concatenation of all parent nodes' unit names separated with "/".

The root node is defined as being named "device-tree" and has no unit address (no @ symbol followed by a unit address). When manipulating a device-tree "path", the root of the tree is generally represented by a simple slash sign "/".

Every node that represents an actual device (that is, one that isn't only a virtual "container" for more nodes, like "/cpus" is) is also required to have a "device_type" property indicating the type of node.

Finally, every node is required to have a "linux,phandle" property. Real open firmware implementations don't provide it as it's generated on the fly by the prom_init.c trampoline from the Open Firmware "phandle". Implementations providing a flattened device-tree directly should provide this property. This property is a 32 bits value that uniquely identifies a node.
You are free to use whatever values or system of values, internal pointers, or whatever to generate these; the only requirement is that every single node of the tree you are passing to the kernel has a unique value in this property. This can be used in some cases for nodes to reference other nodes.

Here is an example of a simple device-tree. In this example, a "o" designates a node followed by the node unit name. Properties are presented with their name followed by their content. "content" represents an ASCII string (zero terminated) value, while <content> represents a 32 bits hexadecimal value. The various nodes in this example will be discussed in a later chapter. At this point, it is only meant to give you an idea of what a device-tree looks like.

  / o device-tree
      |- name = "device-tree"
      |- model = "MyBoardName"
      |- compatible = "MyBoardFamilyName"
      |- #address-cells = <2>
      |- #size-cells = <2>
      |- linux,phandle = <0>
      |
      o cpus
      | | - name = "cpus"
      | | - linux,phandle = <1>
      | | - #address-cells = <1>
      | | - #size-cells = <0>
      | |
      | o PowerPC,970@0
      |   |- name = "PowerPC,970"
      |   |- device_type = "cpu"
      |   |- reg = <0>
      |   |- clock-frequency = <5f5e1000>
      |   |- linux,boot-cpu
      |   |- linux,phandle = <2>
      |
      o memory@0
      | |- name = "memory"
      | |- device_type = "memory"
      | |- reg = <00000000 00000000 00000000 20000000>
      | |- linux,phandle = <3>
      |
      o chosen
        |- name = "chosen"
        |- bootargs = "root=/dev/sda2"
        |- linux,platform = <00000600>
        |- linux,phandle = <4>

This tree is an example of a minimal tree. It pretty much contains the minimal set of required nodes and properties to boot a linux kernel, that is some basic model information at the root, the CPUs, the physical memory layout, and misc information passed through /chosen like, in this example, the platform type (mandatory) and the kernel command line arguments (optional).

The /cpus/PowerPC,970@0/linux,boot-cpu property is an example of a property without a value. All other properties have a value.
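
To make the "reg" value of the memory node in the example tree concrete, here is a minimal sketch (not kernel code) of how such a property could be decoded. It assumes each cell is a 32 bits number, that cells are concatenated most significant first, and that the cells have already been converted from big endian to host byte order. The names dt_read_cells, dt_decode_reg and struct dt_range are made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: combine `ncells` 32 bits cells, most significant
 * cell first, into one 64 bits value. Cells are assumed to be in host
 * byte order already. More than 2 cells would not fit in 64 bits; this
 * sketch simply keeps the low 64 bits in that case. */
uint64_t dt_read_cells(const uint32_t *cells, unsigned int ncells)
{
	uint64_t value = 0;

	while (ncells--)
		value = (value << 32) | *cells++;
	return value;
}

/* One (address, size) pair taken from a "reg" property. */
struct dt_range {
	uint64_t addr;
	uint64_t size;
};

/* Hypothetical helper: decode the first entry of a "reg" property,
 * given the #address-cells (na) and #size-cells (ns) values that apply
 * to the node. */
struct dt_range dt_decode_reg(const uint32_t *reg, unsigned int na,
			      unsigned int ns)
{
	struct dt_range r;

	r.addr = dt_read_cells(reg, na);
	r.size = dt_read_cells(reg + na, ns);
	return r;
}
```

With the root node values from the example (#address-cells = <2>, #size-cells = <2>), the memory "reg" of <00000000 00000000 00000000 20000000> decodes to a range starting at address 0 with a size of 0x20000000 bytes.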
The meaning of the #address-cells and #size-cells properties will be explained in chapter III, which defines precisely the required nodes and properties and their content.

3) Device tree "structure" block
--------------------------------

The structure of the device tree is a linearized tree structure. The "OF_DT_BEGIN_NODE" token starts a new node, and the "OF_DT_END_NODE" token ends that node definition. Child nodes are simply defined before "OF_DT_END_NODE" (that is, nodes within the node). A "token" is a 32 bits value.

Here's the basic structure of a single node:

     * token OF_DT_BEGIN_NODE (that is 0x00000001)
     * node full path as a zero terminated string
     * [align gap to next 4 bytes boundary]
     * for each property:
        * token OF_DT_PROP (that is 0x00000003)
        * 32 bits value of property value size in bytes (or 0 if no value)
        * 32 bits value of offset in string block of property name
        * [align gap to either next 4 bytes boundary if the property value size is less or equal to 4 bytes, or to next 8 bytes boundary if the property value size is larger than 4 bytes]
        * property value data if any
        * [align gap to next 4 bytes boundary]
     * [child nodes if any]
     * token OF_DT_END_NODE (that is 0x00000002)

So the node content can be summarised as a start token, a full path, a list of properties, a list of child nodes and an end token. Every child node is a full node structure itself as defined above.

4) Device tree "strings" block
------------------------------

In order to save space, property names, which are generally redundant, are stored separately in the "strings" block. This block is simply the whole bunch of zero terminated strings for all property names concatenated together. The device-tree property definitions in the structure block will contain offset values from the beginning of the strings block.

III - Required content of the device tree
=========================================

WARNING: All "linux,*" properties defined in this document apply only to a flattened device-tree.
If your platform uses a real implementation of Open Firmware or an implementation compatible with the Open Firmware client interface, those properties will be created by the trampoline code in the kernel's prom_init() file. For example, that's where you'll have to add code to detect your board model and set the platform number. However, when using the flattened device-tree entry point, there is no prom_init() pass, and thus you have to provide those properties yourself.

1) Note about cells and address representation
----------------------------------------------

The general rule is documented in the various Open Firmware documents. If you choose to describe a bus with the device-tree and there exists an OF bus binding, then you should follow the specification. However, the kernel does not require every single device or bus to be described by the device tree.

In general, the format of an address for a device is defined by the parent bus type, based on the #address-cells and #size-cells properties. In the absence of such a property, the parent's parent values are used, etc... The kernel requires the root node to have those properties defining the address format for devices directly mapped on the processor bus.

Those 2 properties define "cells" for representing an address and a size. A "cell" is a 32 bits number. For example, if both contain 2 like in the example tree given above, then an address and a size are both composed of 2 cells, that is a 64 bits number (cells are concatenated and expected to be in big endian format). Another example is the way Apple firmware defines them, that is 2 cells for an address and one cell for a size.

A device's IO or MMIO areas on the bus are defined in the "reg" property. The format of this property depends on the bus the device is sitting on.
Standard bus types define their "reg" property format in the various OF bindings for those bus types. You are free to define your own "reg" format for proprietary busses or virtual busses enclosing on-chip devices, though it is recommended that the parts of the "reg" property containing addresses and sizes respect the defined #address-cells and #size-cells when those make sense.

Later, I will define more precisely some common address formats.

For a new ppc64 board, I recommend using either the 2/2 format or Apple's 2/1 format, which is slightly more compact since sizes usually fit in a single 32 bits word.

2) Note about "compatible" properties
-------------------------------------

Those properties are optional, but recommended in devices and the root node. The format of a "compatible" property is a list of concatenated zero terminated strings. They allow a device to express its compatibility with a family of similar devices, in some cases allowing a single driver to match against several devices regardless of their actual names.

3) Note about "name" properties
-------------------------------

While earlier users of Open Firmware like OldWorld macintoshes tended to use the actual device name for the "name" property, it's nowadays considered good practice to use a name that is closer to the device class (often equal to device_type). For example, nowadays, ethernet controllers are named "ethernet", with an additional "model" property defining precisely the chip type/model, and a "compatible" property defining the family in case a single driver can drive more than one of these chips. The kernel however doesn't generally put any restriction on the "name" property; it is simply considered good practice to follow the standard and its evolutions as closely as possible.
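
As an illustration of the "compatible" format described in note 2) above, here is a minimal sketch of how a driver could test whether a given family string appears in such a property. This is not the kernel's own matching code; the function name dt_is_compatible and the "acme,..." strings used below are made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return 1 if `wanted` is one of the entries of a
 * "compatible" property value, 0 otherwise.
 *
 * `prop` points at `len` bytes holding zero terminated strings laid end
 * to end, for example "acme,eth-v2\0acme,eth\0". */
int dt_is_compatible(const char *prop, size_t len, const char *wanted)
{
	const char *end = prop + len;

	while (prop < end) {
		/* Find the terminator of the current entry. */
		const char *nul = memchr(prop, '\0', (size_t)(end - prop));

		if (!nul)
			return 0;	/* malformed: last entry not terminated */
		if (strcmp(prop, wanted) == 0)
			return 1;
		prop = nul + 1;		/* step over the terminator */
	}
	return 0;
}
```

A driver written for the whole (hypothetical) "acme,eth" family would then match a device whose property lists "acme,eth-v2" followed by "acme,eth", whatever the "name" property of that device says.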
4) Required nodes and properties
--------------------------------

Note that every node should have a "name" and a "linux,phandle" property; those aren't specified explicitly below as their presence is considered implicit. The name property is defined in the cases where its content is defined or has a common practice.

a) The root node

The root node requires some properties to be present:

   - model : this is your board name/model
   - #address-cells : address representation for "root" devices
   - #size-cells : the size representation for "root" devices

Additionally, some recommended properties are:

   - name : this is generally "device-tree"
   - compatible : the board "family" generally finds its way here. For example, if you have 2 board models with a similar layout, that typically get driven by the same platform code in the kernel, you would use a different "model" property but put the same value in "compatible". The kernel doesn't directly use that value (see /chosen/linux,platform for how the kernel chooses a platform type) but it is generally useful.

It's also generally where you add additional properties specific to your board like the serial number if any, that sort of thing. It is recommended that if you add any "custom" property whose name may clash with standard defined ones, you prefix it with your vendor name and a comma.

b) The /cpus node

This node is the parent of all individual CPU nodes. It doesn't have any specific requirements, though it's generally good practice to have at least:

   #address-cells = <00000001>
   #size-cells = <00000000>

This defines that the "address" for a CPU is a single cell, and has no meaningful size. This is not necessary but the kernel will assume that format when reading the "reg" properties of a CPU node, see below.

c) The /cpus/* nodes

So under /cpus, you are supposed to create a node for every CPU on the machine.
There is no specific restriction on the name of the CPU, though it's common practice to call it PowerPC,<name>. For example, Apple uses PowerPC,G5 while IBM uses PowerPC,970FX.

Required properties:

   - device_type : has to be "cpu"
   - reg : This is the physical cpu number, it's a single 32 bits cell. This is also used as-is as the unit number for constructing the unit name in the full path. For example, with 2 CPUs, you would have the full paths:
       /cpus/PowerPC,970FX@0
       /cpus/PowerPC,970FX@1
     (unit addresses are not required to have leading zeros)
   - d-cache-line-size : one cell, L1 data cache line size in bytes
   - i-cache-line-size : one cell, L1 instruction cache line size in bytes
   - d-cache-size : one cell, size of L1 data cache in bytes
   - i-cache-size : one cell, size of L1 instruction cache in bytes

Recommended properties:

   - timebase-frequency : a cell indicating the frequency of the timebase in Hz. This is not directly used by the generic code, but you are welcome to copy/paste the pSeries code for setting the kernel timebase/decrementer calibration based on this value.
   - clock-frequency : a cell indicating the CPU core clock frequency in Hz. A new property will be defined for 64 bits values, but if your frequency is < 4GHz, one cell is enough. Here as well as for the above, the common code doesn't use that property, but you are welcome to re-use the pSeries or Maple one. A future kernel version might provide a common function for this.

You are welcome to add any property you find relevant to your board, like some information about the mechanism used to soft-reset the CPUs for example (Apple puts the GPIO number for CPU soft reset lines in there as a "soft-reset" property as they start secondary CPUs by soft-resetting them).

d) the /memory node(s)

To define the physical memory layout of your board, you should create one or more memory node(s). You can either create a single node with all memory ranges in its reg property, or you can create several nodes, as you wish.
  The unit address (@ part) used for the full path is the address of
  the first range of memory defined by a given node. If you use a
  single memory node, this will typically be @0.

  Required properties:

    - name : has to be "chosen"
    - device_type : has to be "memory"
    - reg : This property contains all the physical memory ranges of
      your board. It's a list of addresses/sizes concatenated
      together, the number of cells of each being defined by the
      #address-cells and #size-cells of the root node. For example,
      with both of these properties being 2 like in the example given
      earlier, a 970 based machine with 6GB of RAM could typically
      have a "reg" property here that looks like:

        00000000 00000000 00000000 80000000
        00000001 00000000 00000001 00000000

      That is a range starting at 0 of 0x80000000 bytes and a range
      starting at 0x100000000 of 0x100000000 bytes. You can see that
      there is no memory covering the IO hole between 2GB and 4GB.
      Some vendors prefer splitting those ranges into smaller
      segments; the kernel doesn't care.

c) The /chosen node

  This node is a bit "special". Normally, that's where open firmware
  puts some variable environment information, like the arguments, or
  phandle pointers to nodes like the main interrupt controller, or the
  default input/output devices.

  This specification makes a few of these mandatory, but also defines
  some linux specific properties that would normally be constructed by
  the prom_init() trampoline when booting with an OF client interface,
  but that you have to provide yourself when using the flattened
  format.

  Required properties:

    - name : has to be "chosen"
    - linux,platform : This is your platform number as assigned by the
      architecture maintainers

  Recommended properties:

    - bootargs : This zero terminated string is passed as the kernel
      command line
    - linux,stdout-path : This is the full path to your standard
      console device if any.
      Typically, if you have serial devices on your board, you may
      want to put the full path to the one set as the default console
      in the firmware here, for the kernel to pick it up as its own
      default console. If you look at the function
      set_preferred_console() in arch/ppc64/kernel/setup.c, you'll see
      that the kernel tries to find out the default console and has
      knowledge of various types like 8250 serial ports. You may want
      to extend this function to add your own.
    - interrupt-controller : This is one cell containing a phandle
      value that matches the "linux,phandle" property of your main
      interrupt controller node. May be used for interrupt routing.

  This is all that is currently required. However, it is strongly
  recommended that you expose PCI host bridges as documented in the
  PCI binding to open firmware, and your interrupt tree as documented
  in OF interrupt tree specification.


IV - Recommendation for a bootloader
====================================

Here are some various ideas/recommendations that have been proposed
while all this has been defined and implemented.

  - It should be possible to write a parser that turns an ASCII
    representation of a device-tree (or even XML, though I find that
    less readable) into a device-tree block. This would basically
    allow building the device-tree structure and strings "blobs" at
    bootloader build time, and have the bootloader just pass them
    as-is to the kernel. In fact, the device-tree blob could then be
    separate from the bootloader itself, and be placed in a separate
    portion of the flash that can be "personalized" for different
    board types by flashing a different device-tree.

  - The bootloader may want to be able to use the device-tree itself
    and may want to manipulate it (to add/edit some properties, like
    physical memory size or kernel arguments). At this point, 2
    choices can be made.
    Either the bootloader works directly on the flattened format, or
    the bootloader has its own internal tree representation with
    pointers (similar to the kernel one) and re-flattens the tree when
    booting the kernel. The former is a bit more difficult to
    edit/modify, the latter probably requires a bit more code to
    handle the tree structure. Note that the structure format has been
    designed so it's relatively easy to "insert" properties or nodes,
    or delete them, by just memmove'ing things around. It contains no
    internal offsets or pointers for this purpose.

  - An example of code for iterating nodes & retrieving properties
    directly from the flattened tree format can be found in the kernel
    file arch/ppc64/kernel/prom.c. Look at the scan_flat_dt()
    function, its usage in early_init_devtree(), and the corresponding
    various early_init_dt_scan_*() callbacks. That code can be re-used
    in a GPL bootloader, and as the author of that code, I would be
    happy to discuss possible free licencing to any vendor who wishes
    to integrate all or part of this code into a non-GPL bootloader.

From paulus at samba.org  Thu May 19 16:10:42 2005
From: paulus at samba.org (Paul Mackerras)
Date: Thu, 19 May 2005 16:10:42 +1000
Subject: PER_LINUX32 fixes reworked
Message-ID: <17036.11746.366471.915182@cargo.ozlabs.ibm.com>

I have reworked Juergen Kreileder's patch to fix some PER_LINUX32
things a bit. I took the opportunity to move the ppc32-specific uname
functions to sys_ppc32.c, and move the functions that are used for
both 32-bit and 64-bit processes from sys_ppc32.c to syscalls.c.

We still have one difference from x86_64 in that once a process has
set its personality to PER_LINUX32, if it asks what its personality is
it will be told PER_LINUX, and if it tries to set the personality to
PER_LINUX it will remain as PER_LINUX32. We do that for both 32-bit
and 64-bit processes, whereas x86_64 does it only for 32-bit processes
(sparc64 does it for both 32-bit and 64-bit).
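
The personality behaviour described above can be modelled in user
space. What follows is only a sketch under stated assumptions, not the
kernel code: model_sys_personality() is a hypothetical stand-in for the
generic syscall, current_personality stands in for
current->personality, and the masking done by the kernel's
personality() macro is left out. The PER_LINUX and PER_LINUX32 values
are the ones from include/linux/personality.h.

```c
#include <assert.h>

#define PER_LINUX	0x0000
#define PER_LINUX32	0x0008
#define PER_QUERY	0xffffffffUL	/* "report the current value, change nothing" */

/* Stand-in for current->personality (an assumption of this sketch). */
static unsigned long current_personality = PER_LINUX;

/*
 * Hypothetical model of the generic syscall: it returns the old value
 * and only changes the personality when asked for something other
 * than PER_QUERY.
 */
static long model_sys_personality(unsigned long personality)
{
	long old = (long)current_personality;

	if (personality != PER_QUERY)
		current_personality = personality;
	return old;
}

/*
 * Model of the ppc64 wrapper: PER_LINUX32 is sticky once set, and it
 * is reported back to the caller as PER_LINUX.
 */
long model_ppc64_personality(unsigned long personality)
{
	long ret;

	if (current_personality == PER_LINUX32 && personality == PER_LINUX)
		personality = PER_LINUX32;
	ret = model_sys_personality(personality);
	if (ret == PER_LINUX32)
		ret = PER_LINUX;
	return ret;
}

/* Walk through the three cases from the description above. */
void model_demo(void)
{
	current_personality = PER_LINUX;

	/* Setting PER_LINUX32 returns the previous value, PER_LINUX. */
	assert(model_ppc64_personality(PER_LINUX32) == PER_LINUX);
	/* Asking for the current value hides PER_LINUX32. */
	assert(model_ppc64_personality(PER_QUERY) == PER_LINUX);
	/* Trying to go back to PER_LINUX leaves PER_LINUX32 in place. */
	assert(model_ppc64_personality(PER_LINUX) == PER_LINUX);
	assert(current_personality == PER_LINUX32);
}
```

Calling model_demo() from a small test program exercises all three
cases; the asserts pass only if PER_LINUX32 stays set while the caller
is always told PER_LINUX.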

Note that the only difference PER_LINUX32 makes is to change the value
returned in the utsname machine field by the uname syscalls from
"ppc64" to "ppc", and thus uname -m will print ppc rather than ppc64.

I'd like to get rid of the "fakeppc" kernel command-line option
someday too. Does anyone actually still use it?

Comments? If no-one has any problems with this patch I'll send it
upstream.

Paul.

diff -urN linux-2.6/arch/ppc64/kernel/misc.S g5-ppc64/arch/ppc64/kernel/misc.S
--- linux-2.6/arch/ppc64/kernel/misc.S	2005-05-06 15:49:24.000000000 +1000
+++ g5-ppc64/arch/ppc64/kernel/misc.S	2005-05-18 16:02:37.000000000 +1000
@@ -792,7 +792,7 @@
 	.llong .compat_sys_newstat
 	.llong .compat_sys_newlstat
 	.llong .compat_sys_newfstat
-	.llong .sys_uname
+	.llong .sys32_uname
 	.llong .sys_ni_syscall		/* 110 old iopl syscall */
 	.llong .sys_vhangup
 	.llong .sys_ni_syscall		/* old idle syscall */
diff -urN linux-2.6/arch/ppc64/kernel/sys_ppc32.c g5-ppc64/arch/ppc64/kernel/sys_ppc32.c
--- linux-2.6/arch/ppc64/kernel/sys_ppc32.c	2005-04-26 15:37:55
+++ g5-ppc64/arch/ppc64/kernel/sys_ppc32.c	2005-05-19 15:30:59
@@ -788,34 +788,9 @@
 	}
 
 	return -EOPNOTSUPP;
-}
-
-
-asmlinkage int ppc64_newuname(struct new_utsname __user * name)
-{
-	int errno = sys_newuname(name);
-
-	if (current->personality == PER_LINUX32 && !errno) {
-		if(copy_to_user(name->machine, "ppc\0\0", 8)) {
-			errno = -EFAULT;
-		}
-	}
-	return errno;
-}
-
-asmlinkage int ppc64_personality(unsigned long personality)
-{
-	int ret;
-	if (current->personality == PER_LINUX32 && personality == PER_LINUX)
-		personality = PER_LINUX32;
-	ret = sys_personality(personality);
-	if (ret == PER_LINUX32)
-		ret = PER_LINUX;
-	return ret;
 }
 
 
-
 /* Note: it is necessary to treat mode as an unsigned int,
  * with the corresponding cast to a signed int to insure that the
  * proper conversion (sign extension) between the register representation of a signed int (msr in 32-bit mode)
@@ -1158,26 +1133,47 @@
 }
 #endif
 
+asmlinkage int sys32_uname(struct old_utsname __user * name)
+{
+	int err = 0;
+
+	down_read(&uts_sem);
+	if (copy_to_user(name, &system_utsname, sizeof(*name)))
+		err = -EFAULT;
+	up_read(&uts_sem);
+	if (!err && personality(current->personality) == PER_LINUX32) {
+		/* change "ppc64" to "ppc" */
+		if (__put_user(0, name->machine + 3)
+		    || __put_user(0, name->machine + 4))
+			err = -EFAULT;
+	}
+	return err;
+}
+
 asmlinkage int sys32_olduname(struct oldold_utsname __user * name)
 {
 	int error;
-
-	if (!name)
-		return -EFAULT;
+
 	if (!access_ok(VERIFY_WRITE,name,sizeof(struct oldold_utsname)))
 		return -EFAULT;
 
 	down_read(&uts_sem);
 	error = __copy_to_user(&name->sysname,&system_utsname.sysname,__OLD_UTS_LEN);
-	error -= __put_user(0,name->sysname+__OLD_UTS_LEN);
-	error -= __copy_to_user(&name->nodename,&system_utsname.nodename,__OLD_UTS_LEN);
-	error -= __put_user(0,name->nodename+__OLD_UTS_LEN);
-	error -= __copy_to_user(&name->release,&system_utsname.release,__OLD_UTS_LEN);
-	error -= __put_user(0,name->release+__OLD_UTS_LEN);
-	error -= __copy_to_user(&name->version,&system_utsname.version,__OLD_UTS_LEN);
-	error -= __put_user(0,name->version+__OLD_UTS_LEN);
-	error -= __copy_to_user(&name->machine,&system_utsname.machine,__OLD_UTS_LEN);
-	error = __put_user(0,name->machine+__OLD_UTS_LEN);
+	error |= __put_user(0,name->sysname+__OLD_UTS_LEN);
+	error |= __copy_to_user(&name->nodename,&system_utsname.nodename,__OLD_UTS_LEN);
+	error |= __put_user(0,name->nodename+__OLD_UTS_LEN);
+	error |= __copy_to_user(&name->release,&system_utsname.release,__OLD_UTS_LEN);
+	error |= __put_user(0,name->release+__OLD_UTS_LEN);
+	error |= __copy_to_user(&name->version,&system_utsname.version,__OLD_UTS_LEN);
+	error |= __put_user(0,name->version+__OLD_UTS_LEN);
+	error |= __copy_to_user(&name->machine,&system_utsname.machine,__OLD_UTS_LEN);
+	error |= __put_user(0,name->machine+__OLD_UTS_LEN);
+	if (personality(current->personality) == PER_LINUX32) {
+		/* change "ppc64" to "ppc" */
+		error |= __put_user(0, name->machine + 3);
+		error |= __put_user(0, name->machine + 4);
+	}
+
 	up_read(&uts_sem);
 
 	error = error ? -EFAULT : 0;
diff -urN linux-2.6/arch/ppc64/kernel/syscalls.c g5-ppc64/arch/ppc64/kernel/syscalls.c
--- linux-2.6/arch/ppc64/kernel/syscalls.c	2005-04-26 15:37:55.000000000 +1000
+++ g5-ppc64/arch/ppc64/kernel/syscalls.c	2005-05-18 17:09:02.000000000 +1000
@@ -208,15 +208,33 @@
 }
 __setup("fakeppc", set_fakeppc);
 
-asmlinkage int sys_uname(struct old_utsname __user * name)
+long ppc64_personality(unsigned long personality)
 {
-	int err = -EFAULT;
-
+	long ret;
+
+	if (personality(current->personality) == PER_LINUX32
+	    && personality == PER_LINUX)
+		personality = PER_LINUX32;
+	ret = sys_personality(personality);
+	if (ret == PER_LINUX32)
+		ret = PER_LINUX;
+	return ret;
+}
+
+long ppc64_newuname(struct new_utsname __user * name)
+{
+	int err = 0;
+
 	down_read(&uts_sem);
-	if (name && !copy_to_user(name, &system_utsname, sizeof (*name)))
-		err = 0;
+	if (copy_to_user(name, &system_utsname, sizeof(*name)))
+		err = -EFAULT;
 	up_read(&uts_sem);
-
+	if (!err && personality(current->personality) == PER_LINUX32) {
+		/* change ppc64 to ppc */
+		if (__put_user(0, name->machine + 3)
+		    || __put_user(0, name->machine + 4))
+			err = -EFAULT;
+	}
 	return err;
 }
 
diff -urN linux-2.6/include/asm-ppc64/elf.h g5-ppc64/include/asm-ppc64/elf.h
--- linux-2.6/include/asm-ppc64/elf.h	2005-05-02 08:29:41.000000000 +1000
+++ g5-ppc64/include/asm-ppc64/elf.h	2005-05-18 16:16:34.000000000 +1000
@@ -221,9 +221,7 @@
 		set_thread_flag(TIF_ABI_PENDING);		\
 	else							\
 		clear_thread_flag(TIF_ABI_PENDING);		\
-	if (ibcs2)						\
-		set_personality(PER_SVR4);			\
-	else if (current->personality != PER_LINUX32)		\
+	if (personality(current->personality) != PER_LINUX32)	\
 		set_personality(PER_LINUX);			\
 } while (0)
 

From anton at samba.org  Thu May 19 16:14:39 2005
From: anton at samba.org (Anton Blanchard)
Date: Thu, 19 May 2005 16:14:39 +1000
Subject: PER_LINUX32 fixes reworked
In-Reply-To: <17036.11746.366471.915182@cargo.ozlabs.ibm.com>
References: <17036.11746.366471.915182@cargo.ozlabs.ibm.com> Message-ID: <20050519061439.GG5829@krispykreme> Hi Paul, > I'd like to get rid of the "fakeppc" kernel command-line option > someday too. Does anyone actually still use it? I think the fakeppc hack was to allow an old distro to install by making it think it was a ppc32 box. I doubt anyone would miss it these days :) Anton From paulus at samba.org Thu May 19 16:26:54 2005 From: paulus at samba.org (Paul Mackerras) Date: Thu, 19 May 2005 16:26:54 +1000 Subject: PER_LINUX32 fixes reworked In-Reply-To: <20050519061439.GG5829@krispykreme> References: <17036.11746.366471.915182@cargo.ozlabs.ibm.com> <20050519061439.GG5829@krispykreme> Message-ID: <17036.12718.672142.226792@cargo.ozlabs.ibm.com> Hi Anton, On a completely different topic, an envelope for you has arrived here that looks like it has a new hertz card in it. Are you going to drop in here or would you like me to bring it to San Diego? Paul. From arnd at arndb.de Thu May 19 17:46:43 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Thu, 19 May 2005 09:46:43 +0200 Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <1116478614.918.75.camel@gaston> References: <1116400151.918.10.camel@gaston> <1116478614.918.75.camel@gaston> Message-ID: <200505190946.45219.arnd@arndb.de> On Dunnersdag 19 Mai 2005 06:56, Benjamin Herrenschmidt wrote: > d) the /memory node(s) > Required properties: > > - name : has to be "chosen" s/chosen/memory/ > c) The /chosen node > > - linux,platform : This is your platform number as assigned by the > architecture maintainers Does this mean you want a new platform number for every board type? I would guess that it might be easier to extend the maple platform to support all boards with ppc970 and similar CPUs (except the pmac and pSeries ones), just like I would like to extend the BPA platform for all Cell based systems. > This is all that is currently required. 
However, it is strongly > recommended that you expose PCI host bridges as documented in the > PCI binding to open firmware, and your interrupt tree as documented > in OF interrupt tree specification. AFAICS, the pci device tree is currently required if you want to use an IOMMU or if you want PCI-X or PCIe style devices with extended PCI config space. I wouldn't be surprised if other functionality also depends on it. Arnd <>< From benh at kernel.crashing.org Thu May 19 18:09:58 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 19 May 2005 18:09:58 +1000 Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <200505190946.45219.arnd@arndb.de> References: <1116400151.918.10.camel@gaston> <1116478614.918.75.camel@gaston> <200505190946.45219.arnd@arndb.de> Message-ID: <1116490198.918.89.camel@gaston> On Thu, 2005-05-19 at 09:46 +0200, Arnd Bergmann wrote: > On Dunnersdag 19 Mai 2005 06:56, Benjamin Herrenschmidt wrote: > > > d) the /memory node(s) > > Required properties: > > > > - name : has to be "chosen" > > s/chosen/memory/ Thanks, will fix. > > c) The /chosen node > > > > - linux,platform : This is your platform number as assigned by the > > architecture maintainers > > Does this mean you want a new platform number for every board type? > I would guess that it might be easier to extend the maple platform > to support all boards with ppc970 and similar CPUs (except the > pmac and pSeries ones), just like I would like to extend the BPA > platform for all Cell based systems. I'd rather have a different number per board family. Embedded vendors are likely to hard code all sort of things and deal with all sort of funky bits of hardware in their xxx_setup.c code among others, I'd rather have them have a separate platform. Though it is better if they could keep "similar" boards under the same platform number and use the device-tree to differenciate them. > > This is all that is currently required. 
However, it is strongly > > recommended that you expose PCI host bridges as documented in the > > PCI binding to open firmware, and your interrupt tree as documented > > in OF interrupt tree specification. > > AFAICS, the pci device tree is currently required if you want to > use an IOMMU or if you want PCI-X or PCIe style devices with > extended PCI config space. I wouldn't be surprised if other > functionality also depends on it. No, you can use the iommu without the PCI device tree. I've verified that it works on maple by disabling generation of the PCI device tree in PIBS. Extended config space should be fixed too, though it's not an issue with existing bridges yet. Ben. From arnd at arndb.de Thu May 19 18:29:06 2005 From: arnd at arndb.de (Arnd Bergmann) Date: Thu, 19 May 2005 10:29:06 +0200 Subject: [PATCH] libfs: add simple attribute files In-Reply-To: <20050518202446.GA20041@kroah.com> References: <200505132117.37461.arnd@arndb.de> <200505181441.01495.arnd@arndb.de> <20050518202446.GA20041@kroah.com> Message-ID: <200505191029.07970.arnd@arndb.de> On Middeweken 18 Mai 2005 22:24, Greg KH wrote: > Thanks for the patch. I've cleaned it up a bit (drop the spufs > comments, changed the access check, and made the val be u64, and > exported the symbols and cleaned up the debugfs portion) and added it to > my tree. It should show up in the next -mm release. I've included the > patch below so you can see my > changes. Great, thanks for cleaning up those mistakes. I noticed one small problem with the change from 'long' to 'u64', in that you did not change it in all places. In particular, using "%lu" to print a u64 value will always do the wrong thing on big-endian 32 bit platforms and maybe on some others. Since 'u64' is '%llu' on most platforms but '%lu' on some 64 bit platforms, I'd either do explicit cast to unsigned long long in the printf or use unsigned long long throughout the code. 
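
The explicit cast suggested here can be illustrated outside the
kernel. This is a minimal user-space sketch, assuming snprintf() as a
stand-in for the kernel's scnprintf() and using a hypothetical helper
name, format_u64_attr(); it is not code from the patch under
discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): format a 64-bit value
 * as decimal text followed by a newline, the way a simple attribute
 * read might. Returns what snprintf() returns.
 */
int format_u64_attr(char *buf, size_t size, uint64_t val)
{
	assert(buf != NULL && size > 0);

	/*
	 * A 64-bit integer type is "unsigned long" on some 64-bit ABIs
	 * and "unsigned long long" elsewhere, so no single bare format
	 * matches it everywhere. Casting to unsigned long long makes
	 * "%llu" correct on all of them.
	 */
	return snprintf(buf, size, "%llu\n", (unsigned long long)val);
}
```

The alternative mentioned above, using unsigned long long throughout,
avoids the cast at each call site at the cost of changing the get/set
prototypes.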
> void foo_set(void *data, long val); and ^^ u64 > long foo_get(void *data); ^^ u64 > +#define DEFINE_SIMPLE_ATTRIBUTE(__fops, __get, __set, __fmt) \ > +static int __fops ## _open(struct inode *inode, struct file *file) \ > +{ \ > + __simple_attr_check_format(__fmt, 0ul); \ ^^^^ 0ull > + else /* first read */ > + size = scnprintf(attr->get_buf, sizeof(attr->get_buf), > + attr->fmt, attr->get(attr->data)); ^^ (unsigned long long) > +DEFINE_SIMPLE_ATTRIBUTE(fops_u8, debugfs_u8_get, debugfs_u8_set, "%lu\n"); > +DEFINE_SIMPLE_ATTRIBUTE(fops_u16, debugfs_u16_get, debugfs_u16_set, "%lu\n"); > +DEFINE_SIMPLE_ATTRIBUTE(fops_u32, debugfs_u32_get, debugfs_u32_set, "%lu\n"); %llu ^^^^ I also noticed that it is not possible to pass NULL operations to DEFINE_SIMPLE_ATTRIBUTE() unless you change --- a/include/linux/fs.h 2005-05-19 10:17:53.000000000 +0200 +++ b/include/linux/fs.h 2005-05-19 10:14:57.000000000 +0200 @@ -1680,7 +1680,7 @@ static int __fops ## _open(struct inode *inode, struct file *file) \ { \ __simple_attr_check_format(__fmt, 0ul); \ - return simple_attr_open(inode, file, &__get, &__set, __fmt); \ + return simple_attr_open(inode, file, __get, __set, __fmt); \ } \ static struct file_operations __fops = { \ .owner = THIS_MODULE, \ I'm currently away from my test machine, so I think it's easier if you just update your patch yourself, but I could also send you an update patch later if you prefer. Arnd <>< From mgroeger at sysgo.com Thu May 19 19:52:53 2005 From: mgroeger at sysgo.com (Marius Groeger) Date: Thu, 19 May 2005 11:52:53 +0200 (CEST) Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: <1116457884.918.29.camel@gaston> References: <1116400151.918.10.camel@gaston> <1116457884.918.29.camel@gaston> Message-ID: On Thu, 19 May 2005, Benjamin Herrenschmidt wrote: >>> Here's the very first draft of my HOWTO about booting the linux/ppc64 >>> kernel without open firmware. 
It's still incomplete, the main chapter >> ^^^^^^^^^^^^^^^^^^^^^ >> One could argue whether the full-blown emulation of an OF device tree >> may really be called this.... ;-) > > You must be kidding :) > > Honestly, a device tree is small and rather simple to layout, and would > fix most of the issues with piling up crap like incompatible boot_info > structures and that sort of thing that plague the ppc32 kernel. Yes, I know, and I *was* kidding. :-) What I was trying to hint, at, really, was that there is just a bit more than resemblance to what real OF based systems will provide for a device tree. And rightly so, no need to it difficult for those. >> For OF based systems, what you outline definitely makes an awful lot of >> sense. > > How so ? OF based system just implement the OF interface... Er, yes, and that is why it makes sense to design it that way. Maybe striking out the word "awful" makes my statement clearer :-) > Regarding the cost, well, the device-tree itself is fairly small, maybe > a couple of pages for a minimum one. As I wrote, embedded boards can Without knowing the size of the code required for this, it would still mean an increase by a couple of hundred percent for the boot information. > I wonder sometimes why people are so "afraid" of the device-tree > concept... it's really simple, does not require that much code, and > makes everything so much more flexible in the long run. Oh, don't get me wrong: I'm not against the device tree per se, I was just pondering a little on your draft, according to the "RFC" bit in the subject. Actually I welcome your efforts a lot, since I, too, suffered from the mess we a currently in. So, by all means, please do go on! 
Cheers, Marius -- Marius Groeger SYSGO AG Embedded and Real-Time Software Voice: +49 6136 9948 0 FAX: +49 6136 9948 10 www.sysgo.com | www.elinos.com | www.osek.de | www.pikeos.com Meet us: Embedded Systems Expo & Conference, Tokyo Big Sight 2005-JUN-29 - 2005-JUL-01 http://www.esec.jp/en/ From benh at kernel.crashing.org Thu May 19 20:22:24 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 19 May 2005 20:22:24 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: References: <1116400151.918.10.camel@gaston> <1116457884.918.29.camel@gaston> Message-ID: <1116498144.918.97.camel@gaston> > Without knowing the size of the code required for this, it would still > mean an increase by a couple of hundred percent for the boot > information. Well, if you build the device-tree blob at bootloader build time (you can then embed it in your bootloader or maybe just put it somewhere in flash), there is little code involved, basically passing a pointer to it to the kernel. Now, if you mean the kernel code, oh well, have you seen how big a ppc64 kernel is anyway ? :) I would expect something like uboot to be a bit more smart though and provide optionally some functions to add nodes/properties, but heh, we'll see. I'll try to provide example code after I'm done with the spec part. Ben. From anton at samba.org Thu May 19 22:14:51 2005 From: anton at samba.org (Anton Blanchard) Date: Thu, 19 May 2005 22:14:51 +1000 Subject: [PATCH] ppc64: quieten RTAS printks Message-ID: <20050519121450.GI5829@krispykreme> Some rtasd printks were too loud. They would appear on a quiet boot even though they were only informational. 

Signed-off-by: Anton Blanchard

Index: foobar2/arch/ppc64/kernel/rtasd.c
===================================================================
--- foobar2.orig/arch/ppc64/kernel/rtasd.c	2005-05-04 13:16:13.000000000 +1000
+++ foobar2/arch/ppc64/kernel/rtasd.c	2005-05-04 15:49:04.836459502 +1000
@@ -440,7 +440,7 @@
 		goto error;
 	}
 
-	printk(KERN_ERR "RTAS daemon started\n");
+	printk(KERN_INFO "RTAS daemon started\n");
 
 	DEBUG("will sleep for %d jiffies\n", (HZ*60/rtas_event_scan_rate) / 2);
 
@@ -485,7 +485,7 @@
 	/* No RTAS, only warn if we are on a pSeries box */
 	if (rtas_token("event-scan") == RTAS_UNKNOWN_SERVICE) {
 		if (systemcfg->platform & PLATFORM_PSERIES)
-			printk(KERN_ERR "rtasd: no event-scan on system\n");
+			printk(KERN_INFO "rtasd: no event-scan on system\n");
 		return 1;
 	}
 

From jk at blackdown.de  Thu May 19 22:54:02 2005
From: jk at blackdown.de (Juergen Kreileder)
Date: Thu, 19 May 2005 14:54:02 +0200
Subject: PER_LINUX32 fixes reworked
In-Reply-To: <17036.11746.366471.915182@cargo.ozlabs.ibm.com> (Paul
	Mackerras's message of "Thu, 19 May 2005 16:10:42 +1000")
References: <17036.11746.366471.915182@cargo.ozlabs.ibm.com>
Message-ID: <878y2bfkb9.fsf@blackdown.de>

Hi Paul,

Paul Mackerras writes:

> Comments? If no-one has any problems with this patch I'll send it
> upstream.

looks fine to me.

	Juergen

-- 
Juergen Kreileder, Blackdown Java-Linux Team
http://blog.blackdown.de/

From wd at denx.de  Thu May 19 23:18:39 2005
From: wd at denx.de (Wolfgang Denk)
Date: Thu, 19 May 2005 15:18:39 +0200
Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2)
In-Reply-To: Your message of "Thu, 19 May 2005 14:56:53 +1000."
	<1116478614.918.75.camel@gaston>
Message-ID: <20050519131844.7D707C1512@atlas.denx.de>

Dear Ben,

in message <1116478614.918.75.camel at gaston> you wrote:
>
> And here is a second draft with more infos.
> > Booting the Linux/ppc64 kernel without Open Firmware Thanks a lot for taking the initiative to come to an agreement about the kernel boot interface. I have some concerns about the memory foot print and increased boot time that will result from the proposed solution. There are many embedded systems where resources are tight and requirements are aven tighter. It would be probably a good idea to also ask for feedback from these folks - for example by posting your RFC on the celinux-dev mailing list. But my biggest concern is that we should try to come up with a solution that has a wider acceptance. Especially from the U-Boot point of view it is not exactly nice that each of PowerPC, ARM and MIPS use their very own, completely incompatible way of passing in- formation from the boot loader to the kernel. As is, your proposal will add just another incompatible way of doing the same thing (of course we will have to stay backward compatible with U-Boot to allow booting older kernels, too). Why don't we try to come up with a solution that is acceptable to the other architectures as well? Maybe you want to post the RFC to lkml, or at least to the linux-arm-kernel and linux-mips mailing lists? Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de Real computer scientists don't comment their code. The identifiers are so long they can't afford the disk space. From wd at denx.de Thu May 19 23:18:41 2005 From: wd at denx.de (Wolfgang Denk) Date: Thu, 19 May 2005 15:18:41 +0200 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: Your message of "Thu, 19 May 2005 20:22:24 +1000." 
<1116498144.918.97.camel@gaston> Message-ID: <20050519131846.D5DD8C1512@atlas.denx.de> In message <1116498144.918.97.camel at gaston> you wrote: > > > Without knowing the size of the code required for this, it would still > > mean an increase by a couple of hundred percent for the boot > > information. > > Well, if you build the device-tree blob at bootloader build time (you > can then embed it in your bootloader or maybe just put it somewhere in > flash), there is little code involved, basically passing a pointer to it > to the kernel. Now, if you mean the kernel code, oh well, have you seen > how big a ppc64 kernel is anyway ? :) Marius was talking about the amount of data passed to the kernel. And yes, we are aware how big a ppc64 kernel is. One might argue that you need a 64 bit kernel only for big systems, so resources are cheap. On the other hand, we are also aware how big the 2.6 kernel is compared against 2.4, and how it suffers performancewise. My concern is that just adding a few kB of code here and there and passing a bit more data from A to B and using ASCII representation for the data and all of that will result in elegant and easily maintainable code on one side, but to even bigger memory footprints for boot loader and kernel and longer boot times on the other side, too. We have seen before how this works. A few tens or hundreds of milliseconds of boot time may not mean anything on a fast 64 bit machine which will spend ages anyway while scanning a lot of SCSI busses and all that, but it will *hurt* on many embedded systems. > I would expect something like uboot to be a bit more smart though and > provide optionally some functions to add nodes/properties, but heh, > we'll see. I'll try to provide example code after I'm done with the spec > part. It's not only an issue of being smart enough. It has also a lot do to with hardware restrictions. 
If you have a product that sells several 1e4 or 1e5 units per year which now works with just 4 MB of flash for boot loader and Linux kernel and application code you have hard times to explain that the next software generation will need bigger (and more expensive) flashes just because of using more elegant code. Yes, small *is* beautiful. We had this discussion before, several times. There once was a proposal by Mark A. Greer (see discussion on the linuxppc-embedded mailing list that started as "EV-64260-BP & GT64260 bi_recs" around March 19, 2002) which was elegant, flexible and lean. If it was not actually sad it could be funny that the general agreement will always end up to be the biggest and slowest of all possible solutioins. But my biggest concern here on the U-Boot list is: U-Boot is not only for PowerPC systems. We should also keep an eye on what ARM and MIPS is doing... See my other posting. Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de To get something done, a committee should consist of no more than three men, two of them absent. From segher at kernel.crashing.org Wed May 18 03:12:14 2005 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Tue, 17 May 2005 19:12:14 +0200 Subject: device-tree in U-Boot In-Reply-To: References: Message-ID: > You have mentioned that there are only little requirements provided by > the kernel regarding device tree. Can you please clarify? The kernel doesn't actually use much of all that is defined in the Open Firmware standard. My suggestion would be, to just start with an empty device tree, see what breaks, fix it, and repeat. Warning: part of what will break is the kernel not doing enough checking. 
It would be great to have this fixed up as a side effect ;-) Segher From panto at intracom.gr Thu May 19 23:16:32 2005 From: panto at intracom.gr (Pantelis Antoniou) Date: Thu, 19 May 2005 16:16:32 +0300 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <20050519131844.7D707C1512@atlas.denx.de> References: <20050519131844.7D707C1512@atlas.denx.de> Message-ID: <428C91B0.4090702@intracom.gr> Wolfgang Denk wrote: > Dear Ben, > > in message <1116478614.918.75.camel at gaston> you wrote: > >>And here is a second draft with more infos. >> >> Booting the Linux/ppc64 kernel without Open Firmware > > > Thanks a lot for taking the initiative to come to an agreement about > the kernel boot interface. > > I have some concerns about the memory foot print and increased boot > time that will result from the proposed solution. There are many > embedded systems where resources are tight and requirements are aven > tighter. It would be probably a good idea to also ask for feedback > from these folks - for example by posting your RFC on the celinux-dev > mailing list. > > But my biggest concern is that we should try to come up with a > solution that has a wider acceptance. Especially from the U-Boot > point of view it is not exactly nice that each of PowerPC, ARM and > MIPS use their very own, completely incompatible way of passing in- > formation from the boot loader to the kernel. > > As is, your proposal will add just another incompatible way of doing > the same thing (of course we will have to stay backward compatible > with U-Boot to allow booting older kernels, too). > > > Why don't we try to come up with a solution that is acceptable to the > other architectures as well? > > Maybe you want to post the RFC to lkml, or at least to the > linux-arm-kernel and linux-mips mailing lists? > I'm really interested in having this discussion. 
I'm forced to maintain my own u-boot based solution for doing this and I'd be very interested in whatever gets chosen. IMHO the current mess is considerable, and at this point I wouldn't really care if the resulting solution is less than optimal, as long as there is one. > Best regards, > > Wolfgang Denk > Regards Pantelis From dwmw2 at infradead.org Fri May 20 00:43:25 2005 From: dwmw2 at infradead.org (David Woodhouse) Date: Thu, 19 May 2005 15:43:25 +0100 Subject: io_block_mapping in PPC64? In-Reply-To: <17033.25718.542891.670814@cargo.ozlabs.ibm.com> References: <17033.25718.542891.670814@cargo.ozlabs.ibm.com> Message-ID: <1116513806.23972.222.camel@hades.cambridge.redhat.com> On Tue, 2005-05-17 at 13:26 +1000, Paul Mackerras wrote: > As maintainer I will not accept patches to make Linux run on new PPC64 > machines without a device tree. I'd suggest making the same decree for new ppc32 machines too. That way, when we backport arch/ppc64 to 32-bit hardware and mark arch/ppc32 obsolete, those platforms should continue to work. -- dwmw2 From segher at kernel.crashing.org Fri May 20 02:24:37 2005 From: segher at kernel.crashing.org (Segher Boessenkool) Date: Thu, 19 May 2005 18:24:37 +0200 Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <200505190946.45219.arnd@arndb.de> References: <1116400151.918.10.camel@gaston> <1116478614.918.75.camel@gaston> <200505190946.45219.arnd@arndb.de> Message-ID: <415dcbd84021ff1b4d11d9553d36993d@kernel.crashing.org> > AFAICS, the pci device tree is currently required if you want to > use an IOMMU Works fine without it, on Maple at least. > or if you want PCI-X or PCIe style devices with > extended PCI config space. Dunno. > I wouldn't be surprised if other > functionality also depends on it. If you're unlucky enough to have inherited your code from the pSeries port, then yes, it probably does. 
Segher

From apw at shadowen.org Fri May 20 02:43:23 2005
From: apw at shadowen.org (Andy Whitcroft)
Date: Thu, 19 May 2005 17:43:23 +0100
Subject: [0/4] [RFC] ppc64 DISCONTIGMEM split
Message-ID:

When we sent out the original SPARSEMEM patches for PPC64 there was
comment that the numa_memory_lookup_table was really a DISCONTIGMEM
data element. I've been looking at splitting this off such that
DISCONTIGMEM, SPARSEMEM and FLATMEM are independent.

Following this mail are four patches to introduce this split. These
are preliminary patches and, although they compile and boot, issues do
remain (if nothing else some warnings). But I thought it would be
helpful to get some feedback.

-apw

From apw at shadowen.org Fri May 20 02:44:23 2005
From: apw at shadowen.org (Andy Whitcroft)
Date: Thu, 19 May 2005 17:44:23 +0100
Subject: [1/4] [RFC] ppc64 discontig pull out numa_memory_lookup init
Message-ID:

It seems we have two complete copies of the code which initialises
the numa_memory_lookup_table. Pull this out to a common initialisation
routine in preparation to limiting its use to the DISCONTIGMEM memory
model only.
Signed-off-by: Andy Whitcroft diffstat ppc64-discontig-pull-out-numa_memory_lookup-init --- diff -upN reference/arch/ppc64/mm/numa.c current/arch/ppc64/mm/numa.c --- reference/arch/ppc64/mm/numa.c +++ current/arch/ppc64/mm/numa.c @@ -58,6 +58,20 @@ EXPORT_SYMBOL(numa_memory_lookup_table); EXPORT_SYMBOL(numa_cpumask_lookup_table); EXPORT_SYMBOL(nr_cpus_in_node); +static inline void discontig_init(unsigned long top) +{ + unsigned long entries = top >> MEMORY_INCREMENT_SHIFT; + unsigned long i; + + if (!numa_memory_lookup_table) { + numa_memory_lookup_table = + (char *)abs_to_virt(lmb_alloc(entries * sizeof(char), 1)); + memset(numa_memory_lookup_table, 0, entries * sizeof(char)); + for (i = 0; i < entries ; i++) + numa_memory_lookup_table[i] = ARRAY_INITIALISER; + } +} + static inline void map_cpu_to_node(int cpu, int node) { numa_cpu_lookup_table[cpu] = node; @@ -320,7 +334,6 @@ static int __init parse_numa_properties( struct device_node *memory = NULL; int addr_cells, size_cells; int max_domain = 0; - long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT; unsigned long i; if (numa_enabled == 0) { @@ -328,12 +341,7 @@ static int __init parse_numa_properties( return -1; } - numa_memory_lookup_table = - (char *)abs_to_virt(lmb_alloc(entries * sizeof(char), 1)); - memset(numa_memory_lookup_table, 0, entries * sizeof(char)); - - for (i = 0; i < entries ; i++) - numa_memory_lookup_table[i] = ARRAY_INITIALISER; + discontig_init(lmb_end_of_DRAM()); min_common_depth = find_min_common_depth(); @@ -457,21 +465,13 @@ static void __init setup_nonnuma(void) { unsigned long top_of_ram = lmb_end_of_DRAM(); unsigned long total_ram = lmb_phys_mem_size(); - unsigned long i; printk(KERN_INFO "Top of RAM: 0x%lx, Total RAM: 0x%lx\n", top_of_ram, total_ram); printk(KERN_INFO "Memory hole size: %ldMB\n", (top_of_ram - total_ram) >> 20); - if (!numa_memory_lookup_table) { - long entries = top_of_ram >> MEMORY_INCREMENT_SHIFT; - numa_memory_lookup_table = - (char 
*)abs_to_virt(lmb_alloc(entries * sizeof(char), 1)); - memset(numa_memory_lookup_table, 0, entries * sizeof(char)); - for (i = 0; i < entries ; i++) - numa_memory_lookup_table[i] = ARRAY_INITIALISER; - } + discontig_init(top_of_ram); map_cpu_to_node(boot_cpuid, 0); From apw at shadowen.org Fri May 20 02:45:23 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Thu, 19 May 2005 17:45:23 +0100 Subject: [2/4] [RFC] ppc64 early_pfn_to_nid stop using numa_memory_lookup Message-ID: Split early_pfn_to_nid off of numa_memory_lookup_table. Use the basic node configuration to locate node for memory pages. Signed-off-by: Andy Whitcroft diffstat ppc64-early_pfn_to_nid-stop-using-numa_memory_lookup --- diff -upN reference/arch/ppc64/mm/numa.c current/arch/ppc64/mm/numa.c --- reference/arch/ppc64/mm/numa.c +++ current/arch/ppc64/mm/numa.c @@ -735,3 +735,19 @@ static int __init early_numa(char *p) return 0; } early_param("numa", early_numa); + +/* Find the owning node for a pfn. */ +int early_pfn_to_nid(unsigned long pfn) +{ + int nid; + + for (nid = 0; nid < MAX_NUMNODES && + init_node_data[nid].node_end_pfn; nid++) { + unsigned long start = init_node_data[nid].node_start_pfn; + unsigned long end = init_node_data[nid].node_end_pfn; + if (start <= pfn && pfn <= end) + return nid; + } + + return 0; +} diff -upN reference/include/asm-ppc64/mmzone.h current/include/asm-ppc64/mmzone.h --- reference/include/asm-ppc64/mmzone.h +++ current/include/asm-ppc64/mmzone.h @@ -102,7 +102,7 @@ static inline int pa_to_nid(unsigned lon #endif /* CONFIG_NEED_MULTIPLE_NODES */ #ifdef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID -#define early_pfn_to_nid(pfn) pa_to_nid(((unsigned long)pfn) << PAGE_SHIFT) +extern int early_pfn_to_nid(unsigned long pfn); #endif #endif /* _ASM_MMZONE_H_ */ From apw at shadowen.org Fri May 20 02:46:23 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Thu, 19 May 2005 17:46:23 +0100 Subject: [3/4] [RFC] ppc64 discontig pfn_to_nid direct to numa_memory_lookup Message-ID: Split 
off pa_to_nid from the numa_memory_lookup_table. As a result pfn_to_nid for the DISCONTIGMEM model moved to directly accessing it. Signed-off-by: Andy Whitcroft diffstat ppc64-discontig-pfn_to_nid-direct-to-numa_memory_lookup --- diff -upN reference/include/asm-ppc64/mmzone.h current/include/asm-ppc64/mmzone.h --- reference/include/asm-ppc64/mmzone.h +++ current/include/asm-ppc64/mmzone.h @@ -41,22 +41,7 @@ extern int nr_cpus_in_node[]; /* NUMA debugging, will not work on a DLPAR machine */ #undef DEBUG_NUMA -static inline int pa_to_nid(unsigned long pa) -{ - int nid; - - nid = numa_memory_lookup_table[pa >> MEMORY_INCREMENT_SHIFT]; - -#ifdef DEBUG_NUMA - /* the physical address passed in is not in the map for the system */ - if (nid == -1) { - printk("bad address: %lx\n", pa); - BUG(); - } -#endif - - return nid; -} +#define pa_to_nid(pa) early_pfn_to_nid((pa) >> PAGE_SHIFT) #define node_localnr(pfn, nid) ((pfn) - NODE_DATA(nid)->node_start_pfn) @@ -72,12 +57,29 @@ static inline int pa_to_nid(unsigned lon #ifdef CONFIG_DISCONTIGMEM +/* Given a page frame number return the owning node. */ +static inline int pfn_to_nid(unsigned long pfn) +{ + int nid; + + nid = numa_memory_lookup_table[pfn >> + (MEMORY_INCREMENT_SHIFT - PAGE_SHIFT)]; + +#ifdef DEBUG_NUMA + /* the physical page passed in is not in the map for the system */ + if (nid == -1) { + printk("bad pfn: %lx\n", pfn); + BUG(); + } +#endif + + return nid; +} + /* * Given a kernel address, find the home node of the underlying memory. 
*/ -#define kvaddr_to_nid(kaddr) pa_to_nid(__pa(kaddr)) - -#define pfn_to_nid(pfn) pa_to_nid((unsigned long)(pfn) << PAGE_SHIFT) +#define kvaddr_to_nid(kaddr) pa_to_nid(__pa(kaddr) >> PAGE_SHIFT) /* Written this way to avoid evaluating arguments twice */ #define discontigmem_pfn_to_page(pfn) \ From apw at shadowen.org Fri May 20 02:47:23 2005 From: apw at shadowen.org (Andy Whitcroft) Date: Thu, 19 May 2005 17:47:23 +0100 Subject: [4/4] [RFC] ppc64 abstract discontig Message-ID: Make initialisation of numa_memory_lookup_table conditional on the memory model. Pull out populating the numa_memory_lookup_table into the memory_present interface. This effectivly splits out abstracts and splits DISCONTIGMEM from the other memory models. Signed-off-by: Andy Whitcroft diffstat ppc64-abstract-discontig --- diff -upN reference/arch/ppc64/Kconfig current/arch/ppc64/Kconfig --- reference/arch/ppc64/Kconfig +++ current/arch/ppc64/Kconfig @@ -243,6 +243,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID def_bool y depends on NEED_MULTIPLE_NODES +config ARCH_HAVE_MEMORY_PRESENT + def_bool y + depends on DISCONTIGMEM + # Some NUMA nodes have memory ranges that span # other nodes. Even though a pfn is valid and # between a node's start and end pfns, it may not diff -upN reference/arch/ppc64/mm/numa.c current/arch/ppc64/mm/numa.c --- reference/arch/ppc64/mm/numa.c +++ current/arch/ppc64/mm/numa.c @@ -34,7 +34,6 @@ static int numa_debug; int numa_cpu_lookup_table[NR_CPUS] = { [ 0 ... (NR_CPUS - 1)] = ARRAY_INITIALISER}; -char *numa_memory_lookup_table; cpumask_t numa_cpumask_lookup_table[MAX_NUMNODES]; int nr_cpus_in_node[MAX_NUMNODES] = { [0 ... (MAX_NUMNODES -1)] = 0}; @@ -52,14 +51,31 @@ static struct { unsigned long node_present_pages; } init_node_data[MAX_NUMNODES] __initdata; +#ifdef CONFIG_DISCONTIGMEM +char *numa_memory_lookup_table; + +void memory_present(int nid, unsigned long start, unsigned long end) +{ + unsigned long i; + + /* XXX/APW: fix the loop instead ... 
*/ + start <<= PAGE_SHIFT; + end <<= PAGE_SHIFT; + + for (i = start ; i < end; i += MEMORY_INCREMENT) + numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = nid; +} +EXPORT_SYMBOL(numa_memory_lookup_table); +#endif + EXPORT_SYMBOL(node_data); EXPORT_SYMBOL(numa_cpu_lookup_table); -EXPORT_SYMBOL(numa_memory_lookup_table); EXPORT_SYMBOL(numa_cpumask_lookup_table); EXPORT_SYMBOL(nr_cpus_in_node); static inline void discontig_init(unsigned long top) { +#ifdef CONFIG_DISCONTIGMEM unsigned long entries = top >> MEMORY_INCREMENT_SHIFT; unsigned long i; @@ -70,6 +86,7 @@ static inline void discontig_init(unsign for (i = 0; i < entries ; i++) numa_memory_lookup_table[i] = ARRAY_INITIALISER; } +#endif } static inline void map_cpu_to_node(int cpu, int node) @@ -445,9 +462,6 @@ new_range: size / PAGE_SIZE; } - for (i = start ; i < (start+size); i += MEMORY_INCREMENT) - numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = - numa_domain; memory_present(numa_domain, start >> PAGE_SHIFT, (start + size) >> PAGE_SHIFT); @@ -481,8 +495,6 @@ static void __init setup_nonnuma(void) init_node_data[0].node_end_pfn = lmb_end_of_DRAM() / PAGE_SIZE; init_node_data[0].node_present_pages = total_ram / PAGE_SIZE; - for (i = 0 ; i < top_of_ram; i += MEMORY_INCREMENT) - numa_memory_lookup_table[i >> MEMORY_INCREMENT_SHIFT] = 0; memory_present(0, 0, init_node_data[0].node_end_pfn); } @@ -499,6 +511,7 @@ static void __init dump_numa_topology(vo printk(KERN_INFO "Node %d Memory:", node); +#if 0 count = 0; for (i = 0; i < lmb_end_of_DRAM(); i += MEMORY_INCREMENT) { @@ -515,6 +528,7 @@ static void __init dump_numa_topology(vo if (count > 0) printk("-0x%lx", i); +#endif printk("\n"); } return; diff -upN reference/include/asm-ppc64/mmzone.h current/include/asm-ppc64/mmzone.h --- reference/include/asm-ppc64/mmzone.h +++ current/include/asm-ppc64/mmzone.h @@ -30,7 +30,9 @@ extern struct pglist_data *node_data[]; */ extern int numa_cpu_lookup_table[]; +#ifdef CONFIG_DISCONTIGMEM extern char 
*numa_memory_lookup_table; +#endif extern cpumask_t numa_cpumask_lookup_table[]; extern int nr_cpus_in_node[]; @@ -62,8 +64,8 @@ static inline int pfn_to_nid(unsigned lo { int nid; - nid = numa_memory_lookup_table[pfn >> - (MEMORY_INCREMENT_SHIFT - PAGE_SHIFT)]; + nid = numa_memory_lookup_table[pfn >> (MEMORY_INCREMENT_SHIFT - + PAGE_SHIFT)]; #ifdef DEBUG_NUMA /* the physical page passed in is not in the map for the system */ From jschopp at austin.ibm.com Fri May 20 02:55:13 2005 From: jschopp at austin.ibm.com (Joel Schopp) Date: Thu, 19 May 2005 11:55:13 -0500 Subject: [1/4] [RFC] ppc64 discontig pull out numa_memory_lookup init In-Reply-To: References: Message-ID: <428CC4F1.9030202@austin.ibm.com> > It seems we have two complete copies of the code which initialises > the numa_memory_lookup_table. Pull this out to a common initialisation > routine in preparation to limiting its use to the DISCONTIGMEM memory > model only. This one looks pretty independent and ready to merge. Acked-by: Joel Schopp > > Signed-off-by: Andy Whitcroft > > diffstat ppc64-discontig-pull-out-numa_memory_lookup-init > --- > > diff -upN reference/arch/ppc64/mm/numa.c current/arch/ppc64/mm/numa.c > --- reference/arch/ppc64/mm/numa.c > +++ current/arch/ppc64/mm/numa.c > @@ -58,6 +58,20 @@ EXPORT_SYMBOL(numa_memory_lookup_table); > EXPORT_SYMBOL(numa_cpumask_lookup_table); > EXPORT_SYMBOL(nr_cpus_in_node); > > +static inline void discontig_init(unsigned long top) > +{ > + unsigned long entries = top >> MEMORY_INCREMENT_SHIFT; > + unsigned long i; > + > + if (!numa_memory_lookup_table) { > + numa_memory_lookup_table = > + (char *)abs_to_virt(lmb_alloc(entries * sizeof(char), 1)); > + memset(numa_memory_lookup_table, 0, entries * sizeof(char)); > + for (i = 0; i < entries ; i++) > + numa_memory_lookup_table[i] = ARRAY_INITIALISER; > + } > +} > + > static inline void map_cpu_to_node(int cpu, int node) > { > numa_cpu_lookup_table[cpu] = node; > @@ -320,7 +334,6 @@ static int __init 
parse_numa_properties( > struct device_node *memory = NULL; > int addr_cells, size_cells; > int max_domain = 0; > - long entries = lmb_end_of_DRAM() >> MEMORY_INCREMENT_SHIFT; > unsigned long i; > > if (numa_enabled == 0) { > @@ -328,12 +341,7 @@ static int __init parse_numa_properties( > return -1; > } > > - numa_memory_lookup_table = > - (char *)abs_to_virt(lmb_alloc(entries * sizeof(char), 1)); > - memset(numa_memory_lookup_table, 0, entries * sizeof(char)); > - > - for (i = 0; i < entries ; i++) > - numa_memory_lookup_table[i] = ARRAY_INITIALISER; > + discontig_init(lmb_end_of_DRAM()); > > min_common_depth = find_min_common_depth(); > > @@ -457,21 +465,13 @@ static void __init setup_nonnuma(void) > { > unsigned long top_of_ram = lmb_end_of_DRAM(); > unsigned long total_ram = lmb_phys_mem_size(); > - unsigned long i; > > printk(KERN_INFO "Top of RAM: 0x%lx, Total RAM: 0x%lx\n", > top_of_ram, total_ram); > printk(KERN_INFO "Memory hole size: %ldMB\n", > (top_of_ram - total_ram) >> 20); > > - if (!numa_memory_lookup_table) { > - long entries = top_of_ram >> MEMORY_INCREMENT_SHIFT; > - numa_memory_lookup_table = > - (char *)abs_to_virt(lmb_alloc(entries * sizeof(char), 1)); > - memset(numa_memory_lookup_table, 0, entries * sizeof(char)); > - for (i = 0; i < entries ; i++) > - numa_memory_lookup_table[i] = ARRAY_INITIALISER; > - } > + discontig_init(top_of_ram); > > map_cpu_to_node(boot_cpuid, 0); > > _______________________________________________ > Linuxppc64-dev mailing list > Linuxppc64-dev at ozlabs.org > https://ozlabs.org/cgi-bin/mailman/listinfo/linuxppc64-dev > From haveblue at us.ibm.com Fri May 20 03:03:41 2005 From: haveblue at us.ibm.com (Dave Hansen) Date: Thu, 19 May 2005 10:03:41 -0700 Subject: [4/4] [RFC] ppc64 abstract discontig In-Reply-To: References: Message-ID: <1116522221.11566.5.camel@localhost> Nice patch set. Every line we move under CONFIG_DISCONTIG is another line we can easily kill later. Some tiny comments below. 
On Thu, 2005-05-19 at 17:47 +0100, Andy Whitcroft wrote: > @@ -499,6 +511,7 @@ static void __init dump_numa_topology(vo > > printk(KERN_INFO "Node %d Memory:", node); > > +#if 0 > count = 0; > > for (i = 0; i < lmb_end_of_DRAM(); i += MEMORY_INCREMENT) { > @@ -515,6 +528,7 @@ static void __init dump_numa_topology(vo > > if (count > 0) > printk("-0x%lx", i); > +#endif > printk("\n"); > } > return; Just kill this code section if it's really not needed. As it stands now, this patch series looks like it adds quite a bit of code. arch/ppc64/mm/numa.c | 44 +++++++++++++++++++++++++++++------ current/arch/ppc64/Kconfig | 4 +++ current/arch/ppc64/mm/numa.c | 32 ++++++++++++------------- current/include/asm-ppc64/mmzone.h | 2 - include/asm-ppc64/mmzone.h | 46 ++++++++++++++++++++----------------- 5 files changed, 83 insertions(+), 45 deletions(-) That should change pretty significantly once you kill that hunk. > diff -upN reference/include/asm-ppc64/mmzone.h current/include/asm-ppc64/mmzone.h > --- reference/include/asm-ppc64/mmzone.h > +++ current/include/asm-ppc64/mmzone.h > @@ -30,7 +30,9 @@ extern struct pglist_data *node_data[]; > */ > > extern int numa_cpu_lookup_table[]; > +#ifdef CONFIG_DISCONTIGMEM > extern char *numa_memory_lookup_table; > +#endif > extern cpumask_t numa_cpumask_lookup_table[]; > extern int nr_cpus_in_node[]; How about renaming that variable to not have "numa_" in it if it isn't actually numa-dependent? Another patch on top of these 4 to do s/numa(_memory_lookup_table)/discontig$1/ Would be nice. > @@ -62,8 +64,8 @@ static inline int pfn_to_nid(unsigned lo > { > int nid; > > - nid = numa_memory_lookup_table[pfn >> > - (MEMORY_INCREMENT_SHIFT - PAGE_SHIFT)]; > + nid = numa_memory_lookup_table[pfn >> (MEMORY_INCREMENT_SHIFT - > + PAGE_SHIFT)]; A little pet peeve of mine is when there is lots of shift arithmetic going on. What does FOO_SHIFT - BAR_SHIFT really *mean*? If you're going to modify this, can you fix it to something like? 
unsigned long pages_per_table_entry = MEMORY_INCREMENT_SHIFT - PAGE_SHIFT;

-- Dave

From linas at austin.ibm.com Fri May 20 05:37:31 2005
From: linas at austin.ibm.com (Linas Vepstas)
Date: Thu, 19 May 2005 14:37:31 -0500
Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO
In-Reply-To: <20050519131846.D5DD8C1512@atlas.denx.de>
References: <1116498144.918.97.camel@gaston> <20050519131846.D5DD8C1512@atlas.denx.de>
Message-ID: <20050519193731.GF4138@austin.ibm.com>

On Thu, May 19, 2005 at 03:18:41PM +0200, Wolfgang Denk was heard to remark:
>
> It's not only an issue of being smart enough. It has also a lot do to
> with hardware restrictions. If you have a product that sells several
> 1e4 or 1e5 units per year which now works with just 4 MB of flash for
> boot loader and Linux kernel and application code you have hard times
> to explain that the next software generation will need bigger (and
> more expensive) flashes just because of using more elegant code.
>
> Yes, small *is* beautiful.

:-/ I was once very dissatisfied with an earlier job I had because the
boss kept trying to make me use a "rabbitcore" which had only 1MB for
everything, and there was no way I'd be able to fit Linux into that.
Rabbitcore ran some tiny thing called rabbitOS, but the tools were all
on Windows. :( This was only in 2003, and I still see ads for the
rabbit in magazines.
--linas From dan at embeddededge.com Fri May 20 06:18:11 2005 From: dan at embeddededge.com (Dan Malek) Date: Thu, 19 May 2005 16:18:11 -0400 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: <20050519193731.GF4138@austin.ibm.com> References: <1116498144.918.97.camel@gaston> <20050519131846.D5DD8C1512@atlas.denx.de> <20050519193731.GF4138@austin.ibm.com> Message-ID: On May 19, 2005, at 3:37 PM, Linas Vepstas wrote: > :-/ I was once very disatisfied with an earlier job I had because the > boss kept trying to make me use a "rabbitcore" which had only 1MB for > everything, and there was no way I'd be able to fit Linux into that. People understand the trade off of the need for resources to get the features they want, which is why they choose Linux in the first place. Yes, sometimes people ask for what seems to be unreasonable in such products, but it also forces us to be clever about how we configure the systems. The difficult trade off is when some states they get the same feature set with one particular piece of software as they do with Linux, but in much less space. The advantage of Linux is open source and no royalties, but many of the RTOS systems these days no longer have royalties, just a one time up front cost. When they weigh that against the extra cost of memory for Linux and the number of systems, the Linux "royalty" is more than the purchase of the competing OS. It's really happening that way today. Thanks. -- Dan From kravetz at us.ibm.com Fri May 20 07:50:48 2005 From: kravetz at us.ibm.com (mike kravetz) Date: Thu, 19 May 2005 14:50:48 -0700 Subject: [2/4] [RFC] ppc64 early_pfn_to_nid stop using numa_memory_lookup In-Reply-To: References: Message-ID: <20050519215048.GA4605@w-mikek2.ibm.com> On Thu, May 19, 2005 at 05:45:23PM +0100, Andy Whitcroft wrote: > +/* Find the owning node for a pfn. 
*/ > +int early_pfn_to_nid(unsigned long pfn) > +{ > + int nid; > + > + for (nid = 0; nid < MAX_NUMNODES && > + init_node_data[nid].node_end_pfn; nid++) { > + unsigned long start = init_node_data[nid].node_start_pfn; > + unsigned long end = init_node_data[nid].node_end_pfn; > + if (start <= pfn && pfn <= end) > + return nid; > + } > + > + return 0; > +} Can't do this. Remember the 'swiss cheese' POWER numa model? init_node_data represents the 'range' of valid pfn's on a given node. These ranges may overlap. So, a given pfn may fall into the range of multiple nodes as represented by init_node_data. init_node_data is mostly used to 'prime' the zone initialization code. Currently, numa_memory_lookup_table is the only structure properly initialized/created to do pfn_to_nid lookups. -- Mike From benh at kernel.crashing.org Fri May 20 08:20:29 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 20 May 2005 08:20:29 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <20050519131844.7D707C1512@atlas.denx.de> References: <20050519131844.7D707C1512@atlas.denx.de> Message-ID: <1116541230.5153.8.camel@gaston> On Thu, 2005-05-19 at 15:18 +0200, Wolfgang Denk wrote: > Dear Ben, > > in message <1116478614.918.75.camel at gaston> you wrote: > > > > And here is a second draft with more infos. > > > > Booting the Linux/ppc64 kernel without Open Firmware > > Thanks a lot for taking the initiative to come to an agreement about > the kernel boot interface. > > I have some concerns about the memory foot print and increased boot > time that will result from the proposed solution. Like everybody it seems, which is funny in a way as I expect pretty much none (or a few Kb maybe). The kernel side code for managing a device-tree may represent more, but heh, have you seen the size of a ppc64 kernel anyways ? I don't think that is very relevant. On the bootloader side, I don't expect any significant impact. 
The device-tree can be very small, and the code required on the bootloader side ranges from nothing for a pre-built one, to a little bit if the bootloader has to be able to change/add properties/nodes. > There are many embedded systems where resources are tight and requirements > are aven tighter. Amen. (Though heh, this is ppc64, you can't be _that_ tight :) >It would be probably a good idea to also ask for feedback > from these folks - for example by posting your RFC on the celinux-dev > mailing list. I will do when I have a little bit more mature proposal. > But my biggest concern is that we should try to come up with a > solution that has a wider acceptance. No other solution will be accepted on the kernel side. At least for ppc64 > Especially from the U-Boot > point of view it is not exactly nice that each of PowerPC, ARM and > MIPS use their very own, completely incompatible way of passing in- > formation from the boot loader to the kernel. True. > As is, your proposal will add just another incompatible way of doing > the same thing (of course we will have to stay backward compatible > with U-Boot to allow booting older kernels, too). My proposal is the only supported way to boot a ppc64 kernel. There are talks about backporting support for that to ppc32 as well. Other architectures are welcome to use it too though :) The device-tree in the kernel is fully expanded into a tree structure on ppc, since it's heavily used by various pieces of code all over the place, but for other architectures that would like to use that, it's possible to limit themselves to the flattened format. The ppc64 kernel contains some code to access nodes & properties directly from the flattened format (used early during boot) which represents very little code. > Why don't we try to come up with a solution that is acceptable to the > other architectures as well? 
This has been discussed over and over again, that is the best way to
never come up with a solution as everybody will want something different
and nobody will ever agree. The present proposal is implemented today on
the ppc64 kernel already, and we have decided to not go backward on this
requirement.

> Maybe you want to post the RFC to lkml, or at least to the
> linux-arm-kernel and linux-mips mailing lists?
>
> Best regards,
>
> Wolfgang Denk
>

From benh at kernel.crashing.org Fri May 20 08:33:13 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Fri, 20 May 2005 08:33:13 +1000
Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO
In-Reply-To: <20050519131846.D5DD8C1512@atlas.denx.de>
References: <20050519131846.D5DD8C1512@atlas.denx.de>
Message-ID: <1116541993.5153.22.camel@gaston>

> Marius was talking about the amount of data passed to the kernel.

A few Kb maybe... Current implementations always provide a full featured
device-tree with pci devices so they aren't a good example (and I don't
have numbers in mind at the moment). I'll try to get some later today.
The property names are factored out (only one copy of a given name) to
avoid bloat, and the node format is very compact. A small device-tree
would be only about a dozen nodes (the minimal is 5 nodes including the
root) with only a few properties.

> And yes, we are aware how big a ppc64 kernel is. One might argue that
> you need a 64 bit kernel only for big systems, so resources are
> cheap. On the other hand, we are also aware how big the 2.6 kernel is
> compared against 2.4, and how it suffers performancewise.

I wouldn't say it suffered performance wise on all architectures. Small
embedded CPUs may have suffered (mostly because of larger memory
footprint impact on small TLBs), but ppc64 is definitely not something
you ever want to use with a 2.4 kernel, and I would expect 2.6 to be
faster on 6xx/7xx/7xxx type CPUs as well.
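
[Editorial sketch, not part of the original message.] The remark above that property names are "factored out (only one copy of a given name)" can be illustrated with a small string table: each distinct name is stored once, and properties refer to it by byte offset. This is only a minimal sketch under stated assumptions. The names `strtab`, `strtab_intern` and `STRTAB_MAX` are hypothetical and are not taken from the ppc64 kernel, U-Boot, or the RFC being discussed, and the real flattened-tree layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size string table, for illustration only. */
#define STRTAB_MAX 256

struct strtab {
	char   data[STRTAB_MAX];	/* NUL-terminated names, back to back */
	size_t used;			/* bytes of data[] currently in use */
};

/*
 * Return the byte offset of `name` inside the table, appending the
 * name first if it is not there yet. Returns (size_t)-1 when the
 * table has no room left for a new name.
 */
static size_t strtab_intern(struct strtab *t, const char *name)
{
	size_t off = 0;
	size_t len;

	assert(t != NULL && name != NULL);
	len = strlen(name) + 1;		/* include the terminating NUL */

	/* Walk the existing entries; each one is a NUL-terminated string. */
	while (off < t->used) {
		if (strcmp(t->data + off, name) == 0)
			return off;	/* already stored: reuse the offset */
		off += strlen(t->data + off) + 1;
	}

	if (len > STRTAB_MAX - t->used)
		return (size_t)-1;	/* table full */

	memcpy(t->data + t->used, name, len);
	off = t->used;
	t->used += len;
	return off;
}
```

With a table like this, a property record would only need to carry a small offset instead of a private copy of its name, so a name shared by many nodes costs its bytes once. The linear search is chosen here for brevity, on the assumption that the set of distinct names stays small.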
> My concern is that just adding a few kB of code here and there and > passing a bit more data from A to B and using ASCII representation > for the data and all of that will result in elegant and easily > maintainable code on one side, but to even bigger memory footprints > for boot loader and kernel and longer boot times on the other side, > too. We have seen before how this works. I don't think it will have any significant impact on the boot time. Not at all. In fact, I'm not even sure the code would be that much bigger neither. For example, all the code needed to declare all the device-specific platform devices used in some case would be _replaced_ by a generic routine that declares a device based on the device-tree data, that sort of thing. I honestly cannot tell what kind of bloat is to be expected, but I really don't think it will be relevant. Even the code for iterating the fully expanded device-tree & properties in the kernel isn't big, but as I wrote earlier, a non-ppc architecture wanting to use that proposal may want to work directly on the flattened tree. I REALLY think people are over-estimating the size & complexity of the device-tree. > A few tens or hundreds of milliseconds of boot time may not mean > anything on a fast 64 bit machine which will spend ages anyway while > scanning a lot of SCSI busses and all that, but it will *hurt* on > many embedded systems. I wouldn't even expect that much. > > I would expect something like uboot to be a bit more smart though and > > provide optionally some functions to add nodes/properties, but heh, > > we'll see. I'll try to provide example code after I'm done with the spec > > part. > > It's not only an issue of being smart enough. It has also a lot do to > with hardware restrictions. 
If you have a product that sells several
> 1e4 or 1e5 units per year which now works with just 4 MB of flash for
> boot loader and Linux kernel and application code you have hard times
> to explain that the next software generation will need bigger (and
> more expensive) flashes just because of using more elegant code.
>
> Yes, small *is* beautiful.

Did you read the "optional" above ? Let me repeat _AGAIN_ here: the
bootloader doesn't _need_ ANY code to deal with the device-tree if you
decide to just build the blob once for all, and embed it "as is".
However, not everybody is fighting after 10 bytes of flash, and thus it
would be useful if optionally, uboot could provide the machine specific
code with functions to do things like edit the memory or bootargs
properties in there.

> We had this discussion before, several times. There once was a
> proposal by Mark A. Greer (see discussion on the linuxppc-embedded
> mailing list that started as "EV-64260-BP & GT64260 bi_recs" around
> March 19, 2002) which was elegant, flexible and lean. If it was not
> actually sad it could be funny that the general agreement will always
> end up to be the biggest and slowest of all possible solutions.

Fuck it ! This is not by far the biggest and slowest of all solutions,
the tree format is on purpose very compact, it's not a few strings that
will make that much of a difference, damn. Do you really want me to
propose ACPI AML instead ? Besides, I know the bi_rec stuff well as I
proposed it in the first place, and nobody ever came to an agreement
about that either. Face it, there will NOT be any other way that will
be accepted upstream to boot a ppc64 kernel.

>
> But my biggest concern here on the U-Boot list is: U-Boot is not only
> for PowerPC systems. We should also keep an eye on what ARM and MIPS
> is doing... See my other posting.

Sure, you are welcome to do so.
I'm posting to this list because of Marvell's intent to use uboot as a
bootloader for what appears to be the first ppc64 platform not to
implement the OF command line interface.

Ben.

From wd at denx.de Fri May 20 09:14:41 2005
From: wd at denx.de (Wolfgang Denk)
Date: Fri, 20 May 2005 01:14:41 +0200
Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2)
In-Reply-To: Your message of "Fri, 20 May 2005 08:20:29 +1000." <1116541230.5153.8.camel@gaston>
Message-ID: <20050519231446.29487C1512@atlas.denx.de>

Dear Ben,

in message <1116541230.5153.8.camel at gaston> you wrote:
>
> > I have some concerns about the memory foot print and increased boot
> > time that will result from the proposed solution.
>
> Like everybody it seems, which is funny in a way as I expect pretty much
> none (or a few Kb maybe). The kernel side code for managing a
> device-tree may represent more, but heh, have you seen the size of a

I am not so narrow-minded to think only about U-Boot. I try to think
about the whole system, including boot loader, kernel, and any data
that might need to get passed between these two. And please believe
me, there are many, many systems out there where "a few Kb" really
matter.

> ppc64 kernel anyways ? I don't think that is very relevant. On the

I am aware that you think so, and I try to raise your awareness of
the fact that there is a huge number of small machines out there.
Please keep in mind that the same interface will be forced sooner or
later on small 8xx systems with maybe just 4 MB flash and 8 or 16 MB
RAM. And when you sell 100,000 of these units per year then "a few
Kb" may cost a lot of money. Or may cause other, proprietary OSes to
get used.

> bootloader side, I don't expect any significant impact. The device-tree
> can be very small, and the code required on the bootloader side ranges
> from nothing for a pre-built one, to a little bit if the bootloader has
> to be able to change/add properties/nodes.
It is IMHO wrong to have only the boot loader side in mind. We should consider the whole system. > > There are many embedded systems where resources are tight and requirements > > are even tighter. > > Amen. (Though heh, this is ppc64, you can't be _that_ tight :) I think you are aware that there are several people out there working on a similar boot interface for the "small" PPC systems, too. > >It would be probably a good idea to also ask for feedback > > from these folks - for example by posting your RFC on the celinux-dev > > mailing list. > > I will do when I have a little bit more mature proposal. Thanks in advance. > > But my biggest concern is that we should try to come up with a > > solution that has a wider acceptance. > > No other solution will be accepted on the kernel side. At least for > ppc64 This is not exactly a constructive position. When each architecture comes up with its own solution for the same problem and then claims that no other solution will be accepted we will stick with what we have now: a mess. If this is really your position we may as well stop here. > > As is, your proposal will add just another incompatible way of doing > > the same thing (of course we will have to stay backward compatible > > with U-Boot to allow booting older kernels, too). > > My proposal is the only supported way to boot a ppc64 kernel. There are Yes, of course. And using ATAGS is the only supported way to boot an ARM kernel, and so on. If everybody claims that his way of doing things is the only accepted solution we can really save all the time we are wasting on such a discussion. > talks about backporting support for that to ppc32 as well. Other > architectures are welcome to use it too though :) The device-tree in the Ummm.. Ben, I have really high respect for you, but such a position is simply arrogant. With the same right the ARM folks can say that ATAGS is the way to go and other architectures are welcome to use it. Actually they might have older rights.
> > Why don't we try to come up with a solution that is acceptable to the > > other architectures as well? > > This has been discussed over and over again, that is the best way to > never come up with a solution as everybody will want something different > and nobody will ever agree. With such a position I really wonder why you ever asked? > The present proposal is implemented today on the ppc64 kernel already, > and we have decided to not go backward on this requirement. Then why the heck do you call this an RFC or a proposal? To me it seems that you don't propose but dictate a solution - a solution which pretty much ignores everything but your own requirements. If everything has been decided already I may as well shut up. But please never claim that this has been _discussed_. Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de I have made mistakes, but have never made the mistake of claiming I never made one. - James G. Bennet From wd at denx.de Fri May 20 09:20:43 2005 From: wd at denx.de (Wolfgang Denk) Date: Fri, 20 May 2005 01:20:43 +0200 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: Your message of "Fri, 20 May 2005 08:33:13 +1000." <1116541993.5153.22.camel@gaston> Message-ID: <20050519232048.727D0C1512@atlas.denx.de> In message <1116541993.5153.22.camel at gaston> you wrote: > > > Yes, small *is* beautiful. > > Did you read the "optional" above ? Let me repeat _AGAIN_ here: the > bootloader doesn't _need_ ANY code to deal with the device-tree if you > decide to just build the blob once for all, and embed it "as is". And the blob has a zero memory footprint or what? > Face it, there will NOT be any other way that will be accepted upstream > to boot a ppc64 kernel. Then let's just stop here. We're just wasting time if there is nothing to discuss any more.
Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de Too many people are ready to carry the stool when the piano needs to be moved. From benh at kernel.crashing.org Fri May 20 09:28:29 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 20 May 2005 09:28:29 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <20050519231446.29487C1512@atlas.denx.de> References: <20050519231446.29487C1512@atlas.denx.de> Message-ID: <1116545309.5153.58.camel@gaston> > I am aware that you think so, and I try to raise your awareness of > the fact that there is a huge number of small machines out there. > > Please keep in mind that the same interface will be forced sooner or > later on small 8xx systems with maybe just 4 MB flash and 8 or 16 MB > RAM. I will not force it, but others may find it a good idea to do so :) > It is IMHO wrong to have only the boot loader side in mind. We should > consider the whole system. I do have the kernel in mind as well. The fact is the ppc64 kernel relies on an Open Firmware device tree and we do not want at any cost to get into the mess that is ppc32. We decided to define this flattened format for that purpose, and to allow kexec functionality. I did my best to keep the format as compact as possible (maybe a little bit more could be saved by changing the way the full paths are laid out, maybe we could even do a new version which gzips the whole blob, but overall, it's fairly small). On the kernel side, as I wrote as well, the code for dealing with the device-tree isn't that big, and will get smaller as I remove the post-processing of nodes in prom.c that we still have here. And as I wrote, if other platforms want to re-use that mechanism, they may want to just use the compact/flattened format directly.
The function for scanning nodes in the flattened tree is about 40 lines of C and the function for accessing a property in a flattened node is about as much. > > I think you are aware that there are several people out there working > on a similar boot interface for the "small" PPC systems, too. I know, and I was at the origin of the bi_rec proposal, a few years ago. I've simply never seen anything actually happening. > > No other solution will be accepted on the kernel side. At least for > > ppc64 > > This is not exactly a constructive position. When each architecture > comes up with its own solution for the same problem and then claims > that no other solution will be accepted we will stick with what we > have now: a mess. > > If this is really your position we may as well stop here. The ppc64 kernel relies on an open firmware style device tree. That will not change any time soon. This proposal is a way to define a subset of this device-tree along with a compact & flattened format so that one doesn't have to do a full Open Firmware implementation and so that minimal trees can be used. > Yes, of course. And using ATAGS is the only supported way to boot an > ARM kernel, and so on. > > If everybody claims that his way of doing things is the only accepted > solution we can really save all the time we are wasting on such a > discussion. Maybe. I'd rather have this proposal completed and have actual comments about the _content_ of it rather than such a debate at this point. Once we have that working, we can talk about extending it. > > talks about backporting support for that to ppc32 as well. Other > > architectures are welcome to use it too though :) The device-tree in the > > Ummm.. Ben, I have really high respect for you, but such a position > is simply arrogant. With the same right the ARM folks can say that > ATAGS is the way to go and other architectures are welcome to use it. > Actually they might have older rights. May well be. But that is off topic.
The decision has been made already. > > > Why don't we try to come up with a solution that is acceptable to the > > > other architectures as well? > > > > This has been discussed over and over again, that is the best way to > > never come up with a solution as everybody will want something different > > and nobody will ever agree. > > With such a position I really wonder why you ever asked? I'm asking for comments about the content of the proposal and posting to inform people of what's going on. You are the one wanting to extend it to other architectures :) > > The present proposal is implemented today on the ppc64 kernel already, > > and we have decided to not go backward on this requirement. > > The why the heck do you call this a RFC or a proposal? To me it seems > that you don't propose but dictate a solution - a solution which > pretty much ignores everything but your own requirements. If > everything has been decided already I can as well shut up. I'm asking for comments about the actual details of it, if something was overlooked in the format (though that actually works today), if my wording is wrong in parts, if we should define in more details some aspect of it. > But please never claim that this has been _discusssed_. No, what I meant earlier is that trying to come up with something like that, as you stated earlier, has been discussed again and again and again without any useful result. From benh at kernel.crashing.org Fri May 20 09:42:10 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 20 May 2005 09:42:10 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: <20050519232048.727D0C1512@atlas.denx.de> References: <20050519232048.727D0C1512@atlas.denx.de> Message-ID: <1116546130.5153.60.camel@gaston> On Fri, 2005-05-20 at 01:20 +0200, Wolfgang Denk wrote: > In message <1116541993.5153.22.camel at gaston> you wrote: > > > > > Yes, small *is* beautiful. 
> > > > Did you read the "optional" above ? Let me repeat _AGAIN_ here: the > > bootloader doesn't _need_ ANY code to deal with the device-tree if you > > decide to just build the blob once for all, and embed it "as is". > > And the blob has a zero memory footprint or what? Don't be ridiculous please. But definitely a small one. > > Face it, there will NOT be any other way that will be accepted upstream > > to boot a ppc64 kernel. > > Then let's just stop here. We're just wasting time if there is > nothing to discuss any more. You are welcome to discuss aspects of the content of the proposal. Ben. From sfr at canb.auug.org.au Fri May 20 10:08:24 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Fri, 20 May 2005 10:08:24 +1000 Subject: io_block_mapping in PPC64? In-Reply-To: <1116513806.23972.222.camel@hades.cambridge.redhat.com> References: <17033.25718.542891.670814@cargo.ozlabs.ibm.com> <1116513806.23972.222.camel@hades.cambridge.redhat.com> Message-ID: <20050520100824.2ad5f72e.sfr@canb.auug.org.au> On Thu, 19 May 2005 15:43:25 +0100 David Woodhouse wrote: > > That way, when we backport arch/ppc64 to 32-bit hardware and mark > arch/ppc32 obsolete, those platforms should continue to work. I like a man with a sense of humour :-) -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/
From benh at kernel.crashing.org Fri May 20 13:11:30 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 20 May 2005 13:11:30 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: <1116541993.5153.22.camel@gaston> References: <20050519131846.D5DD8C1512@atlas.denx.de> <1116541993.5153.22.camel@gaston> Message-ID: <1116558690.5153.87.camel@gaston> On Fri, 2005-05-20 at 08:33 +1000, Benjamin Herrenschmidt wrote: > > Marius was talking about the amount of data passed to the kernel. > > A few Kb maybe... Current implementations always provide a full featured > device-tree with pci devices so they aren't a good example (and I don't > have numbers in mind at the moment). I'll try to get some later today. > The property names are factored out (only one copy of a given name) to > avoid bloat, the node format is very compact, A small device-tree would > be only about a dozen node (the minimal is 5 nodes including the root) > with only a few properties Ok, I got some numbers here. (I have removed the page alignment constraint for the DT block and the strings block in the "blob" passed in btw, I forgot to update v2) - The minimal example device-tree given as an example in the document (exactly identical as the one in v2 of the document, which means it may even shrink more, see below) fits in a blob (complete with header) of 764 bytes. - The complete device-tree of my PowerMac laptop (this is _huge_, Apple puts a _lot_ of stuff in there, way more than most embedded board even the most complex ones will ever need) fits into a 37k blob.
I will come up with more numbers soon including a good "average" example that is a Maple board with all the ISA/serial stuff (which is very useful to have there) but without the individual PCI devices. On an additional note, I'm also rev'ing up the blob format with additional space savings in mind: - Current version is 2. That's what the kernel recognises and what current kexec tools generate (well... they actually generate a version 1 but the difference is minor). - Version 3 will be backward compatible and just adds a "string table size" field to the header to help the kernel do better memory management with the flattened device-tree. kexec can implement it, older kernels will still understand the tree. - Version 16 will not be backward compatible (will require kernel patches, but that should be ok for new board vendors) but allows more space saving. For this version, I'm planning the following changes for now: * Relax some alignment restrictions (already did it for the numbers above) * Allow replacing of the full path string with only the "name at unit address" part, letting the kernel reconstruct the full path. With this change, the "name" property can be dropped in each node as well, as it can be reconstructed by the kernel. There is a lot of redundancy in the full path, so that should save a bit. Side effect is also to remove any name requirement for the root node. * Make the "linux,phandle" property optional. It will only be required for nodes that are referenced by another node using a phandle value (typically, nodes part of the interrupt tree). With those changes, the example minimal tree may shrink down to about 600 bytes (rough estimate), which would mean an average tree with a few devices would be between one and 3Kb (rough estimate too). Ben.
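As an illustration of the planned path change in the message above, here is a minimal sketch in C of how a consumer of the blob might rebuild full paths when each node stores only its unit name ("name@unit-address") instead of the whole path. This is not code from the kernel or from the proposal: the helper names dt_push_unit_name and dt_pop_unit_name, the caller-supplied buffer, and the example node names are all assumptions made up for the example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the kernel): append one unit name to
 * the full path of its parent, e.g. "/cpus" + "PowerPC,970@0".
 * Returns the new path length, or -1 if the buffer is too small
 * (in which case the path is left untouched).
 */
static int dt_push_unit_name(char *path, size_t cap, const char *unit)
{
    size_t len = strlen(path);
    size_t ulen = strlen(unit);
    /* The root path "/" already ends in a separator. */
    int need_sep = (len > 0 && path[len - 1] != '/');

    if (len + (need_sep ? 1 : 0) + ulen + 1 > cap)
        return -1;
    if (need_sep)
        path[len++] = '/';
    memcpy(path + len, unit, ulen + 1); /* copies the trailing NUL too */
    return (int)(len + ulen);
}

/* Hypothetical helper: drop the last component when leaving a node. */
static void dt_pop_unit_name(char *path)
{
    char *slash = strrchr(path, '/');

    if (slash == NULL)
        return;
    if (slash == path)
        slash[1] = '\0';    /* back at the root, keep "/" */
    else
        *slash = '\0';
}
```

A tree walker would call the push helper on entering a node and the pop helper on leaving it, so only one path buffer is needed and the blob no longer has to carry the repeated parent prefixes.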
From hollis at penguinppc.org Fri May 20 13:51:56 2005 From: hollis at penguinppc.org (Hollis Blanchard) Date: Thu, 19 May 2005 22:51:56 -0500 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <20050519131844.7D707C1512@atlas.denx.de> References: <20050519131844.7D707C1512@atlas.denx.de> Message-ID: <6fcc07be88e5091ac1428e9bbde6d92f@penguinppc.org> On May 19, 2005, at 8:18 AM, Wolfgang Denk wrote: > > But my biggest concern is that we should try to come up with a > solution that has a wider acceptance. Especially from the U-Boot > point of view it is not exactly nice that each of PowerPC, ARM and > MIPS use their very own, completely incompatible way of passing in- > formation from the boot loader to the kernel. > > As is, your proposal will add just another incompatible way of doing > the same thing (of course we will have to stay backward compatible > with U-Boot to allow booting older kernels, too). > > Why don't we try to come up with a solution that is acceptable to the > other architectures as well? > > Maybe you want to post the RFC to lkml, or at least to the > linux-arm-kernel and linux-mips mailing lists? As you observe, having multiple incompatible communication mechanisms is an issue of u-boot code maintenance. Since you are the most affected party, perhaps you could propose something for all the architectures? You're obviously much more in tune with the needs of ARM and MIPS... In the meantime, it sounds like this device tree stuff solves ppc64's problem in a way the maintainers are happy with, so it's hard to ask them to come up with a solution to a problem they don't have. 
-Hollis From paulus at samba.org Fri May 20 14:24:14 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 20 May 2005 14:24:14 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <20050519131844.7D707C1512@atlas.denx.de> References: <1116478614.918.75.camel@gaston> <20050519131844.7D707C1512@atlas.denx.de> Message-ID: <17037.26222.23591.13083@cargo.ozlabs.ibm.com> Wolfgang Denk writes: > But my biggest concern is that we should try to come up with a > solution that has a wider acceptance. Especially from the U-Boot > point of view it is not exactly nice that each of PowerPC, ARM and > MIPS use their very own, completely incompatible way of passing in- > formation from the boot loader to the kernel. I am familiar with birecs and I have looked at the ARM atags structure, which is the same as birecs at an abstract level, i.e. a list of arbitrary blobs of data, each with a binary tag and a size. As far as MIPS is concerned, there didn't seem to be any single consistent way of passing information from the bootloader to the kernel. They seem to be in a similar mess to ppc32 in this respect. I want to avoid that mess for ppc64 by stating now, while there is only one embedded ppc64 board that runs linux (the Maple eval board) that there is one true way to pass information into the kernel at boot time, and that is a flattened device tree. Birecs and atags are both OK at representing a specified, limited set of items of information, such as the location and size of an initrd image or the total amount of memory in a system. They fall down when it comes to giving information about the devices in the system and their interconnections. For instance, atags has a structure for representing a frame buffer - but what if you have two video cards in your system? 
Essentially, each element in the birecs/atags list is like a property in a device tree that has only one node, and the entire birecs/atags list is like a 1-node device tree. What the device tree gives you is the ability to organize those pieces of information hierarchically so that it becomes obvious when you have multiple instances of a device (e.g. a PCI host bridge), what pieces of information apply to which device instances, and which devices have to be used to get to certain other devices. Thus, my opinion is that the device tree is technically superior to the birecs/atags approach. The device tree has also proven itself to be capable of representing the information that the kernel needs about all sorts of systems from the very small to the very large. Unless you can come up with something even better, ppc64 won't be changing. In particular we're not going to go back to anything like birecs or atags. Also, given that a minimal flattened device tree fits in well under 1kB, any arguments about "excessive" memory usage will need to be accompanied by specific code and data sizes of a real-world example. > As is, your proposal will add just another incompatible way of doing > the same thing (of course we will have to stay backward compatible > with U-Boot to allow booting older kernels, too). U-Boot currently doesn't support any ppc64 machines, does it? So how is there a backward compatibility issue? Ben's proposal is for ppc64, at least as present. If the ppc32 embedded developers decide they want to use a device tree, that would be good, but it will proceed by > Why don't we try to come up with a solution that is acceptable to the > other architectures as well? Other architectures are welcome to move to using a device tree. The problem is going to be convincing them to spend the effort to make the change. None of the other architectures currently have a solution that is appealing. Paul. 
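To make the comparison in the message above concrete, here is a rough sketch in C of the two data models: a flat list of tagged records (the bi_recs/ATAGS style) next to a tree of nodes that own their properties. All type names, function names and tag values here are hypothetical and chosen for illustration only; they do not match the real ATAGS, bi_rec or flattened device-tree layouts.

```c
#include <stddef.h>
#include <string.h>

/* Flat model: one list of tagged blobs, each with a tag and a size. */
struct tagged_rec {
    unsigned int tag;       /* binary tag naming the record type */
    unsigned int size;      /* payload size in bytes */
    const void *data;       /* payload */
};

/* Tree model: a node owns its properties and its child nodes. */
struct dt_prop {
    const char *name;
    const void *value;
    unsigned int len;
};

struct dt_node {
    const char *name;
    const struct dt_prop *props;
    size_t nprops;
    const struct dt_node *children;
    size_t nchildren;
};

/*
 * Flat lookup: returns the first record with this tag. If two records
 * share a tag, nothing says which device each one belongs to.
 */
static const struct tagged_rec *rec_find(const struct tagged_rec *recs,
                                         size_t n, unsigned int tag)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (recs[i].tag == tag)
            return &recs[i];
    return NULL;
}

/* Tree lookup, step 1: pick one child of a node by its name. */
static const struct dt_node *dt_find_child(const struct dt_node *parent,
                                           const char *name)
{
    size_t i;

    if (parent == NULL)
        return NULL;
    for (i = 0; i < parent->nchildren; i++)
        if (strcmp(parent->children[i].name, name) == 0)
            return &parent->children[i];
    return NULL;
}

/* Tree lookup, step 2: a property is resolved relative to that node. */
static const struct dt_prop *dt_find_prop(const struct dt_node *node,
                                          const char *name)
{
    size_t i;

    for (i = 0; i < node->nprops; i++)
        if (strcmp(node->props[i].name, name) == 0)
            return &node->props[i];
    return NULL;
}
```

With two display devices behind two different host bridges, the tree lookup reaches each one through its parent, while the flat lookup can only hand back "a" frame buffer record.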
From paulus at samba.org Fri May 20 14:28:44 2005 From: paulus at samba.org (Paul Mackerras) Date: Fri, 20 May 2005 14:28:44 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <17037.26222.23591.13083@cargo.ozlabs.ibm.com> References: <1116478614.918.75.camel@gaston> <20050519131844.7D707C1512@atlas.denx.de> <17037.26222.23591.13083@cargo.ozlabs.ibm.com> Message-ID: <17037.26492.942608.110951@cargo.ozlabs.ibm.com> I wrote: > Ben's proposal is for ppc64, at least as present. If the ppc32 > embedded developers decide they want to use a device tree, that would > be good, but it will proceed by ... and got interrupted. I meant to write "proceed by persuasion and consensus, not fiat". Paul. From hollis at penguinppc.org Fri May 20 14:26:38 2005 From: hollis at penguinppc.org (Hollis Blanchard) Date: Thu, 19 May 2005 23:26:38 -0500 Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <1116478614.918.75.camel@gaston> References: <1116400151.918.10.camel@gaston> <1116478614.918.75.camel@gaston> Message-ID: <2806b111f694b5dff9ae70aefc8f9029@penguinppc.org> On May 18, 2005, at 11:56 PM, Benjamin Herrenschmidt wrote: > > - name has to be "chosen" > - linux,platform : This is your platform number as assigned by the > architecture maintainers Given the seemingly endless embedded boards not developed by "core community" folks, wouldn't it be better for firmware to identify itself with a distributed namespace like "vendor.model" and let the kernel figure out whatever unique number that should be? Requiring everyone to request a special number from kernel maintainers seems unnecessary. Or perhaps you're trying to enforce tighter development interaction...? 
-Hollis From benh at kernel.crashing.org Fri May 20 15:04:10 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 20 May 2005 15:04:10 +1000 Subject: RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <2806b111f694b5dff9ae70aefc8f9029@penguinppc.org> References: <1116400151.918.10.camel@gaston> <1116478614.918.75.camel@gaston> <2806b111f694b5dff9ae70aefc8f9029@penguinppc.org> Message-ID: <1116565451.5153.95.camel@gaston> On Thu, 2005-05-19 at 23:26 -0500, Hollis Blanchard wrote: > On May 18, 2005, at 11:56 PM, Benjamin Herrenschmidt wrote: > > > > - name has to be "chosen" > > - linux,platform : This is your platform number as assigned by the > > architecture maintainers > > Given the seemingly endless embedded boards not developed by "core > community" folks, wouldn't it be better for firmware to identify itself > with a distributed namespace like "vendor.model" and let the kernel > figure out whatever unique number that should be? This is something I've been thinking about. > Requiring everyone to request a special number from kernel maintainers > seems unnecessary. Or perhaps you're trying to enforce tighter > development interaction...? Nope. The platform number is an existing thing, and the kernel isn't yet completely ready for getting rid of it, though I'd like to. It would be nice indeed to rely only on /model and /compatible (or whatever other properties). In fact, the kernel already iterates through ppc_md board structures and calls a probe() function to select which one to use ! However, all of them currently just test the platform number :) The reason for that is partly historical. We have some code, including low level assembly code, that tests the platform number for things like LPAR interaction with a hypervisor.
We also have a bit of platform specific code that runs very early in things like the parsing of the interrupt tree or processor node that needs to differentiate between powermac and pseries due to differences in the way those lay things out. However, it would definitely make sense to define a single platform number "PLATFORM_GENERIC" for every new board that doesn't need such low level interactions (I would expect something like a Xen port to require a new platform number for the sake of the low level assembly stuff but not every new embedded board) and fix the remaining places where we actually test it for things like detecting the northbridge type. I'll see what can be done after I finish version 3 of the proposal which already contains a lot of changes and associated kernel patches :) Ben. From Stefan.Nickl at kontron.com Fri May 20 16:44:37 2005 From: Stefan.Nickl at kontron.com (Stefan Nickl) Date: Fri, 20 May 2005 08:44:37 +0200 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: <20050519231446.29487C1512@atlas.denx.de> References: <20050519231446.29487C1512@atlas.denx.de> Message-ID: <1116571478.12823.20.camel@lucy.pep-kaufbeuren.de> On Fri, 2005-05-20 at 01:14 +0200, Wolfgang Denk wrote: > > ppc64 kernel anyways ? I don't think that is very relevant. On the > > I am aware that you think so, and I try to raise your awareness of > the fact that there is a huge number of small machines out there. > > Please keep in mind that the same interface will be forced sooner or > later on small 8xx systems with maybe just 4 MB flash and 8 or 16 MB > RAM. I don't seem to be getting the point: As you proved conclusively on your website, 2.6 (and IMHO very likely anything that will come after it) does not scale down well to small systems like the 8xx any more anyways. And I don't think such a major change will be "forced" upon the mostly frozen 2.4 tree.
So why try to stop the folks that want to unite the current "mess" in a proven superset datastructure that seems to suit quite fine with all chips that came into production for (at least) the last five years? -- Stefan Nickl Kontron Modular Computers From wd at denx.de Fri May 20 16:59:31 2005 From: wd at denx.de (Wolfgang Denk) Date: Fri, 20 May 2005 08:59:31 +0200 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (#2) In-Reply-To: Your message of "Thu, 19 May 2005 22:51:56 CDT." <6fcc07be88e5091ac1428e9bbde6d92f@penguinppc.org> Message-ID: <20050520065936.1B51EC1512@atlas.denx.de> In message <6fcc07be88e5091ac1428e9bbde6d92f at penguinppc.org> you wrote: > > > Maybe you want to post the RFC to lkml, or at least to the > > linux-arm-kernel and linux-mips mailing lists? > > As you observe, having multiple incompatible communication mechanisms > is an issue of u-boot code maintenance. Since you are the most affected No, it's vice versa. U-Boot has always been just implementing what the kernel does. There are many other boot loaders around that all have to adhere to the interface(s) imposed on them by the kernel. > In the meantime, it sounds like this device tree stuff solves ppc64's > problem in a way the maintainers are happy with, so it's hard to ask > them to come up with a solution to a problem they don't have. Well, actually nobody has problems: the ARM and MIPS folks have working solutions, too. The next architecture will implement yet another way of passing information to the kernel, implement it and state that they will not accept any other solution, and so on. Best regards, Wolfgang Denk -- Software Engineering: Embedded and Realtime Systems, Embedded Linux Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd at denx.de Where people stand is not as important as which way they face. 
- Terry Pratchett & Stephen Briggs, _The Discworld Companion_ From mgroeger at sysgo.com Fri May 20 17:11:22 2005 From: mgroeger at sysgo.com (Marius Groeger) Date: Fri, 20 May 2005 09:11:22 +0200 (CEST) Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: <1116558690.5153.87.camel@gaston> References: <20050519131846.D5DD8C1512@atlas.denx.de> <1116541993.5153.22.camel@gaston> <1116558690.5153.87.camel@gaston> Message-ID: On Fri, 20 May 2005, Benjamin Herrenschmidt wrote: > Ok, I got some numbers here. (I have removed the page alignment Thanks! I think we'll just have to try all that out for ourselves. Simple boards will probably be at the lower end of your figures, which *should* be fine for most people. How do you view this, though: couldn't it happen in the future, once the dev-tree has been widely established, that more and more drivers are converted to pull their properties off the tree, because it is so convenient? That *could* lead to rising expectations toward the firmware, and make what once was a small blob a big blob. Is it reasonable to assume drivers will #ifdef such behaviour? Again, I'm just thinking here, no opinions yet. Well, if you want one: Actually I always liked the idea of clever firmware, which usually knows the underlying hardware best. > - The complete device-tree of my PowerMac laptop (this is _huge_, Apple > puts a _lot_ of stuff in there, way more than most embedded board even > the most complex ones will ever need) fits into a 37k blob. Don't underestimate embedded hardware. The MPC5554 has 286(!) selectable-priority interrupt sources... 
:-) Cheers, Marius -- Marius Groeger SYSGO AG Embedded and Real-Time Software Voice: +49 6136 9948 0 FAX: +49 6136 9948 10 www.sysgo.com | www.elinos.com | www.osek.de | www.pikeos.com Meet us: Embedded Systems Expo & Conference, Tokyo Big Sight 2005-JUN-29 - 2005-JUL-01 http://www.esec.jp/en/ From david at gibson.dropbear.id.au Fri May 20 17:23:00 2005 From: david at gibson.dropbear.id.au (David Gibson) Date: Fri, 20 May 2005 17:23:00 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: References: <20050519131846.D5DD8C1512@atlas.denx.de> <1116541993.5153.22.camel@gaston> <1116558690.5153.87.camel@gaston> Message-ID: <20050520072300.GI12758@localhost.localdomain> On Fri, May 20, 2005 at 09:11:22AM +0200, Marius Groeger wrote: > On Fri, 20 May 2005, Benjamin Herrenschmidt wrote: > > >Ok, I got some numbers here. (I have removed the page alignment > > Thanks! > > I think we'll just have to try all that out for ourselves. Simple > boards will probably be at the lower end of your figures, which > *should* be fine for most people. > > How do you view this, though: couldn't it happen in the future, once > the dev-tree has been widely established, that more and more drivers > are converted to pull their properties off the tree, because it is so > convenient? That *could* lead to rising expectations toward the > firmware, and make what once was a small blob a big blob. Is it > reasonable to assume drivers will #ifdef such behaviour? Bear in mind that if a driver chooses to take its information from the device tree, it's presumably because the code is simpler that way. Which means any such increase in the necessary device tree size is (at least partially) offset by a reduction in code size.. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! 
http://www.ozlabs.org/people/dgibson From benh at kernel.crashing.org Fri May 20 17:27:34 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Fri, 20 May 2005 17:27:34 +1000 Subject: [U-Boot-Users] RFC: Booting the Linux/ppc64 kernel without Open Firmware HOWTO In-Reply-To: References: <20050519131846.D5DD8C1512@atlas.denx.de> <1116541993.5153.22.camel@gaston> <1116558690.5153.87.camel@gaston> Message-ID: <1116574054.5153.111.camel@gaston> On Fri, 2005-05-20 at 09:11 +0200, Marius Groeger wrote: > On Fri, 20 May 2005, Benjamin Herrenschmidt wrote: > > > Ok, I got some numbers here. (I have removed the page alignment > > Thanks! > > I think we'll just have to try all that out for ourselves. Simple > boards will probably be at the lower end of your figures, which > *should* be fine for most people. I expect so. Apple device-trees are really bloated :) I'm also rev'ing up the format to be even a bit more compact. On the other hand, the low figure I posted is a really very minimal tree with no device at all in it. It would be interesting to see what Marvell comes up with. > How do you view this, though: couldn't it happen in the future, once > the dev-tree has been widely established, that more and more drivers > are converted to pull their properties off the tree, because it is so > convenient? That *could* lead to rising expectations toward the > firmware, and make what once was a small blob a big blob. Is it > reasonable to assume drivers will #ifdef such behaviour? It's very difficult to foresee. But most modern busses like PCI, PCIX, PCIE etc... have their own "probing" facilities and as such don't need devices to be present in the tree. (It is handy to put some there when ancillary data has to be passed along, like MAC addresses, but that isn't mandatory at this point).
It would be nice, however, if busses without those facilities (or pseudo busses), like on-chip devices or superio chips, exposed their internals via the device-tree, but again, there is no need to bloat them with a gazillion properties. Just the basics needed to be identified, matched to a driver, and given address/port/interrupt mappings. In fact, the device-tree "bloat" to expose that information may well be less than the code bloat for hard-coding all possible combinations in the kernel, especially if you want a given kernel image to deal with more than one revision of a board (which you _really_ want, or am I the only one to have had bad experiences with production and customers screwing up updates in the past ?) > Again, I'm just thinking here, no opinions yet. Well, if you want one: > Actually I always liked the idea of clever firmware, which > usually knows the underlying hardware best. The goal of this compact format is to allow for both clever and non-clever firmwares. You can have a pre-built device-tree "blob" that you just pass around, or really build one on the fly, though in that latter case, it may be worth simply implementing the OF client interface :) I'm also hoping there will soon be an open source release of a complete Open Firmware implementation (fully in Forth/fcode on top of the engine) though I really can't tell much more about it at this point, and there is the openbios project which also aims to be an OF implementation (that one using a lot of C code) > Don't underestimate embedded hardware. The MPC5554 has 286(!) > selectable-priority interrupt sources... :-) Yes, but you don't need a node for each of them, nor even a property :) You typically only need an interrupt-related property per device having an interrupt (a given property can contain values for several interrupts if a device has more than one) or per bridge for interrupt-maps (like PCI).
Though if you actually _use_ all of them (like wire 200 GPIOs used as IRQs on your board or such thing :), well, it may be worth spending a few Kb's of device-tree to avoid a hard coding mess in your kernel. > Cheers, > Marius > From olh at suse.de Sat May 21 01:50:48 2005 From: olh at suse.de (Olaf Hering) Date: Fri, 20 May 2005 17:50:48 +0200 Subject: oops in i2c_keywest_init Message-ID: <20050520155048.GA31591@suse.de> The new G5 models have a new i2c keywest, the kernel dies when I boot with 'nosmp', 2.6.11 and 2.6.12rc4. i2c_keywest_init of_register_driver ... create_iface If booted without 'nosmp', smp_core99_cypress_tb_freeze() will panic because it gets -EIO from the first pmac_low_i2c_xfer() call. From benh at kernel.crashing.org Sat May 21 08:33:24 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 21 May 2005 08:33:24 +1000 Subject: oops in i2c_keywest_init In-Reply-To: <20050520155048.GA31591@suse.de> References: <20050520155048.GA31591@suse.de> Message-ID: <1116628405.5153.126.camel@gaston> On Fri, 2005-05-20 at 17:50 +0200, Olaf Hering wrote: > The new G5 models have a new i2c keywest, the kernel dies when I boot > with 'nosmp', 2.6.11 and 2.6.12rc4. Yes, I've been notified of that. It seems there is a problem with parsing the interrupt tree and the driver isn't testing properly if it's actually getting passed an interrupt. I'm trying to find out what's wrong. Ben. From benh at kernel.crashing.org Sat May 21 12:45:46 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sat, 21 May 2005 12:45:46 +1000 Subject: oops in i2c_keywest_init In-Reply-To: <20050520155048.GA31591@suse.de> References: <20050520155048.GA31591@suse.de> Message-ID: <1116643546.5153.160.camel@gaston> Ok, it looks like the device-tree is bogus on these and lacks interrupt informations. That plus the i2c-keywest driver doesn't properly test for the presence of an interrupt in the device node and crashes if there is none. Can you test this patch ?
I haven't even tried building so it may need some fixup. It tries to "fixup" the device-tree at boot, and also adds some guard to i2c-keywest. Let me know if it fixes booting and if i2c works propertly. Index: linux-work/arch/ppc64/kernel/prom_init.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/prom_init.c 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc64/kernel/prom_init.c 2005-05-21 12:43:28.000000000 +1000 @@ -1750,7 +1750,41 @@ prom_printf("Device tree struct 0x%x -> 0x%x\n", RELOC(dt_struct_start), RELOC(dt_struct_end)); - } +} + + +static void fixup_device_tree(void) +{ + phandle u3, i2c, mpic; + u32 u3_rev; + u32 interrupts[2]; + u32 parent; + + /* Some G5s have a missing interrupt definition, fix it up here */ + u3 = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000")); + if ((long)u3 <= 0) + return; + i2c = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000/i2c at f8001000")); + if ((long)i2c <= 0) + return; + mpic = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000/mpic at f8040000")); + if ((long)mpic <= 0) + return; + + if (prom_getprop(u3, "device-rev", &u3_rev, sizeof(u3_rev)) <= 0) + return; + if (u3_rev != 0x35) + return; + if (prom_getproplen(i2c, "interrupts") <= 0) + return; + /* interrupt on this revision of u3 is number 0 and level */ + interrupts[0] = 0; + interrupts[1] = 1; + prom_setprop(i2c, "interrupts", &interrupts, sizeof(interrupts)); + parent = (u32)mpic; + prom_setprop(i2c, "interrupt-parent", &parent, sizeof(parent)); +} + static void __init prom_find_boot_cpu(void) { @@ -1920,6 +1954,11 @@ } /* + * Fixup any known bugs in the device-tree + */ + fixup_device_tree(); + + /* * Now finally create the flattened device-tree */ prom_printf("copying OF device tree ...\n"); Index: linux-work/drivers/i2c/busses/i2c-keywest.c =================================================================== --- linux-work.orig/drivers/i2c/busses/i2c-keywest.c 2005-05-02 
10:48:09.000000000 +1000 +++ linux-work/drivers/i2c/busses/i2c-keywest.c 2005-05-21 08:38:40.000000000 +1000 @@ -516,6 +516,11 @@ u32 *psteps, *prate; int rc; + if (np->n_intrs < 1 || np->n_addrs < 1) { + printk(KERN_ERR "%s: Missing interrupt or address !\n", + np->full_name); + return -ENODEV; + } if (pmac_low_i2c_lock(np)) return -ENODEV; From olh at suse.de Sun May 22 19:24:28 2005 From: olh at suse.de (Olaf Hering) Date: Sun, 22 May 2005 11:24:28 +0200 Subject: oops in i2c_keywest_init In-Reply-To: <1116643546.5153.160.camel@gaston> References: <20050520155048.GA31591@suse.de> <1116643546.5153.160.camel@gaston> Message-ID: <20050522092428.GA9901@suse.de> On Sat, May 21, Benjamin Herrenschmidt wrote: > Ok, it looks like the device-tree is bogus on these and lacks interrupt > informations. That plus the i2c-keywest driver doesn't properly test for > the presence of an interrupt in the device node and crashes if there is > none. Can you test this patch ? I haven't even tried building so it may > need some fixup. It tries to "fixup" the device-tree at boot, and also > adds some guard to i2c-keywest. Let me know if it fixes booting and if > i2c works propertly. It allows booting with 'nosmp' at least. lsprop and dmesg sent in private mail.
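The guard that the patch above adds to i2c-keywest can be tried outside the kernel. What follows is only a minimal user-space sketch under stated assumptions, not the driver code: `struct node_info` and `check_node_resources()` are hypothetical names invented for this illustration, standing in for the `n_intrs`, `n_addrs` and `full_name` fields of the kernel's `struct device_node` that the patch reads, and `fprintf` stands in for `printk`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the few device_node fields the guard reads. */
struct node_info {
	const char *full_name; /* device-tree path, used only for the message */
	int n_intrs;           /* number of interrupts found for the node */
	int n_addrs;           /* number of address ranges found for the node */
};

/*
 * Return 0 when the node carries at least one interrupt and one address,
 * or -ENODEV when either is missing, so the caller can bail out before
 * touching resources that do not exist.
 */
static int check_node_resources(const struct node_info *np)
{
	assert(np != NULL);

	if (np->n_intrs < 1 || np->n_addrs < 1) {
		fprintf(stderr, "%s: Missing interrupt or address !\n",
			np->full_name);
		return -ENODEV;
	}
	return 0;
}
```

A node with no interrupt (the case reported on the new G5 models) is rejected with -ENODEV instead of being probed, which is presumably why the machine gets further with 'nosmp' once the guard is in place.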
From olh at suse.de Sun May 22 22:30:10 2005 From: olh at suse.de (Olaf Hering) Date: Sun, 22 May 2005 14:30:10 +0200 Subject: oops in i2c_keywest_init In-Reply-To: <1116643546.5153.160.camel@gaston> References: <20050520155048.GA31591@suse.de> <1116643546.5153.160.camel@gaston> Message-ID: <20050522123010.GA10908@suse.de> On Sat, May 21, Benjamin Herrenschmidt wrote: > + if (prom_getproplen(i2c, "interrupts") <= 0) > + return; Make that '>=' and everyone is happy. From benh at kernel.crashing.org Sun May 22 23:26:12 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Sun, 22 May 2005 23:26:12 +1000 Subject: oops in i2c_keywest_init In-Reply-To: <20050522123010.GA10908@suse.de> References: <20050520155048.GA31591@suse.de> <1116643546.5153.160.camel@gaston> <20050522123010.GA10908@suse.de> Message-ID: <1116768372.6002.5.camel@gaston> On Sun, 2005-05-22 at 14:30 +0200, Olaf Hering wrote: > On Sat, May 21, Benjamin Herrenschmidt wrote: > > > + if (prom_getproplen(i2c, "interrupts") <= 0) > > + return; > > Make that '>=' and everyone is happy. Ah good. Updated patch is here, I'll send to linus/andrew tomorrow. 
Index: linux-work/arch/ppc64/kernel/prom_init.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/prom_init.c 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc64/kernel/prom_init.c 2005-05-22 23:25:17.000000000 +1000 @@ -1750,7 +1750,44 @@ prom_printf("Device tree struct 0x%x -> 0x%x\n", RELOC(dt_struct_start), RELOC(dt_struct_end)); - } +} + + +static void fixup_device_tree(void) +{ + unsigned long offset = reloc_offset(); + phandle u3, i2c, mpic; + u32 u3_rev; + u32 interrupts[2]; + u32 parent; + + /* Some G5s have a missing interrupt definition, fix it up here */ + u3 = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000")); + if ((long)u3 <= 0) + return; + i2c = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000/i2c at f8001000")); + if ((long)i2c <= 0) + return; + mpic = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000/mpic at f8040000")); + if ((long)mpic <= 0) + return; + + /* check if proper rev of u3 */ + if (prom_getprop(u3, "device-rev", &u3_rev, sizeof(u3_rev)) <= 0) + return; + if (u3_rev != 0x35) + return; + /* does it need fixup ? 
*/ + if (prom_getproplen(i2c, "interrupts") > 0) + return; + /* interrupt on this revision of u3 is number 0 and level */ + interrupts[0] = 0; + interrupts[1] = 1; + prom_setprop(i2c, "interrupts", &interrupts, sizeof(interrupts)); + parent = (u32)mpic; + prom_setprop(i2c, "interrupt-parent", &parent, sizeof(parent)); +} + static void __init prom_find_boot_cpu(void) { @@ -1920,6 +1957,11 @@ } /* + * Fixup any known bugs in the device-tree + */ + fixup_device_tree(); + + /* * Now finally create the flattened device-tree */ prom_printf("copying OF device tree ...\n"); Index: linux-work/drivers/i2c/busses/i2c-keywest.c =================================================================== --- linux-work.orig/drivers/i2c/busses/i2c-keywest.c 2005-05-02 10:48:09.000000000 +1000 +++ linux-work/drivers/i2c/busses/i2c-keywest.c 2005-05-21 08:38:40.000000000 +1000 @@ -516,6 +516,11 @@ u32 *psteps, *prate; int rc; + if (np->n_intrs < 1 || np->n_addrs < 1) { + printk(KERN_ERR "%s: Missing interrupt or address !\n", + np->full_name); + return -ENODEV; + } if (pmac_low_i2c_lock(np)) return -ENODEV; From raffi at raffi.at Mon May 23 06:18:16 2005 From: raffi at raffi.at (Raffael Himmelreich) Date: Sun, 22 May 2005 22:18:16 +0200 Subject: RS/6000 7017-S7A hangs on boot Message-ID: <20050522201816.GA8254@exception.at> Hello, I encounter problems when I try to boot the machine noticed in the subject with a cross compiled 2.6.11.2 kernel. No matter which console= arguments I provide, the kernel simply hangs after printing the messages appended and the LCD display says nothing but `0000'. Does anyone have some hints for me? 
regards, raffi === boot log goes here === 0 > boot net:192.168.1.1,,192.168.1.2 BOOTP R = 1 BOOTP S = 1 FILE: zImage Load Addr=0x4000 Max Size=0xbfc000 Packet Count = 100 Packet Count = 200 Packet Count = 300 Packet Count = 400 Packet Count = 500 Packet Count = 600 Packet Count = 700 Packet Count = 800 Packet Count = 900 Packet Count = 1000 Packet Count = 1100 Packet Count = 1200 Packet Count = 1300 Packet Count = 1400 Packet Count = 1500 Packet Count = 1600 Packet Count = 1700 Packet Count = 1800 Packet Count = 1900 Packet Count = 2000 Packet Count = 2100 Packet Count = 2200 Packet Count = 2300 Packet Count = 2400 Packet Count = 2500 Packet Count = 2600 Packet Count = 2700 Packet Count = 2800 Packet Count = 2900 Packet Count = 3000 Packet Count = 3100 Packet Count = 3200 Packet Count = 3300 Packet Count = 3400 Packet Count = 3500 Packet Count = 3600 Packet Count = 3700 Packet Count = 3800 Packet Count = 3900 Packet Count = 4000 Packet Count = 4100 Packet Count = 4200 Packet Count = 4300 Packet Count = 4400 Packet Count = 4500 Packet Count = 4600 Packet Count = 4700 Packet Count = 4800 Packet Count = 4900 Packet Count = 5000 Packet Count = 5100 Packet Count = 5200 Packet Count = 5300 Packet Count = 5400 Packet Count = 5500 Packet Count = 5600 FINAL Packet Count = 5617 zImage starting: loaded at 0x400000 Allocating 0x8f7000 bytes for kernel ... 
gunzipping (0x1c00000 <- 0x407000:0x6aca1a)...done 0x79ac90 bytes 0xe208 bytes of heap consumed, max in use 0xa2e8 OF stdout device is: /pci at f8400000/isa at f/serial at i3f8 klimit=0xc0000000007f7000 offset=0xbffffffffe3f0000 command line: root_addr_cells: 0000000000000002 root_size_cells: 0000000000000002 scanning memory: node /memory at 0 : 0000000000000000 0000000080000000 0000000100000000 0000000040000000 memory layout at init: alloc_bottom : 000000000240b000 alloc_top : 0000000040000000 alloc_top_hi : 0000000140000000 rmo_top : 0000000040000000 ram_top : 0000000140000000 Booting CPU hw index = 0x0000000000000000 Looking for displays starting prom_initialize_tce_table alloc_down(0000000000400000, 0000000000800000, (high)) -> 000000013f800000 alloc_bottom : 000000000240b000 alloc_top : 0000000040000000 alloc_top_hi : 000000013f800000 rmo_top : 0000000040000000 ram_top : 0000000140000000 TCE table: /pci at f8400000 node = 0x0000000000cbc300 base = 0x000000013f800000 size = 0x0000000000400000 opening PHB /pci at f8400000... done alloc_down(0000000000400000, 0000000000800000, (high)) -> 000000013f000000 alloc_bottom : 000000000240b000 alloc_top : 0000000040000000 alloc_top_hi : 000000013f000000 rmo_top : 0000000040000000 ram_top : 0000000140000000 TCE table: /pci at f8500000 node = 0x0000000000cbfec0 base = 0x000000013f000000 size = 0x0000000000400000 opening PHB /pci at f8500000... done alloc_down(0000000000400000, 0000000000800000, (high)) -> 000000013e800000 alloc_bottom : 000000000240b000 alloc_top : 0000000040000000 alloc_top_hi : 000000013e800000 rmo_top : 0000000040000000 ram_top : 0000000140000000 TCE table: /pci at f8600000 node = 0x0000000000cc29f0 base = 0x000000013e800000 size = 0x0000000000400000 opening PHB /pci at f8600000... 
done alloc_down(0000000000400000, 0000000000800000, (high)) -> 000000013e000000 alloc_bottom : 000000000240b000 alloc_top : 0000000040000000 alloc_top_hi : 000000013e000000 rmo_top : 0000000040000000 ram_top : 0000000140000000 TCE table: /pci at f8700000 node = 0x0000000000cc5520 base = 0x000000013e000000 size = 0x0000000000400000 opening PHB /pci at f8700000... done ending prom_initialize_tce_table prom_instantiate_rtas: start... prom_rtas: 0000000000cad6c8 alloc_down(0000000000029000, 0000000000001000, (low)) trying: 0x000000003ffd7000 -> 000000003ffd7000 alloc_bottom : 000000000240b000 alloc_top : 000000003ffd7000 alloc_top_hi : 000000013e000000 rmo_top : 0000000040000000 ram_top : 0000000140000000 instantiating rtas at 0x000000003ffd7000... done rtas base = 0x000000003ffd7000 rtas entry = 0x000000003ffd76dc rtas size = 0x0000000000029000 prom_instantiate_rtas: end... prom_hold_cpus: start... 1) spinloop = 0x0000000000000008 1) *spinloop = 0x0000000000000000 1) acknowledge = 0x0000000000000010 1) *acknowledge = 0x0000000000000000 1) secondary_hold = 0x0000000000000060 cpuid = 0x0000000000000000 cpu hw idx = 0x0000000000000000 0000000000000000 : boot cpu 0000000000000000 cpuid = 0x0000000000000001 cpu hw idx = 0x0000000000000001 0000000000000001 : starting cpu hw idx 0000000000000001... done cpuid = 0x0000000000000002 cpu hw idx = 0x0000000000000002 0000000000000002 : starting cpu hw idx 0000000000000002... done cpuid = 0x0000000000000003 cpu hw idx = 0x0000000000000003 0000000000000003 : starting cpu hw idx 0000000000000003... done prom_hold_cpus: end... copying OF device tree ... starting device tree allocs at 000000000240b000 alloc_up(0000000000100000, 0000000000001000) trying: 0x000000000240b000 trying: 0x000000000250b000 -> 000000000250b000 alloc_bottom : 000000000250b000 alloc_top : 000000003ffd7000 alloc_top_hi : 000000013e000000 rmo_top : 0000000040000000 ram_top : 0000000140000000 Building dt strings... Building dt structure... 
reserved memory map: 000000013e000000 - 0000000002000000 000000003ffd7000 - 0000000000029000 000000000250b000 - 000000000000b000 Device tree strings 0x000000000250c000 -> 0x000000000250cd7c Device tree struct 0x000000000250d000 -> 0x0000000002516000 Calling quiesce ... returning from prom_init ->dt_header_start=0x000000000250b000 ->phys=0x0000000001c10000 firmware_features = 0x0 Starting Linux PPC64 2.6.11.2 ----------------------------------------------------- ppc64_pft_size = 0x1a ppc64_debug_switch = 0x0 ppc64_interrupt_controller = 0x1 systemcfg = 0xc000000000005000 systemcfg->platform = 0x100 systemcfg->processorCount = 0x4 systemcfg->physicalMemorySize = 0xc0000000 ppc64_caches.dcache_line_size = 0x80 ppc64_caches.icache_line_size = 0x80 htab_address = 0xc000000138000000 htab_hash_mask = 0x7ffff ----------------------------------------------------- [boot]0100 MM Init IO Hole assumed to be 80000000 -> ffffffff [boot]0100 MM Init Done Linux version 2.6.11.2 (root at localhost.localdomain) (gcc version 3.4.3) #8 SMP Mon May 2 16:05:42 CEST 2005 [boot]0012 Setup Arch Top of RAM: 0x140000000, Total RAM: 0xc0000000 Memory hole size: 2048MB mpic: Setting up MPIC " MPIC " version at ffc00000, max 1 CPUs mpic: ISU size: 16, shift: 4, mask: f No ramdisk, default root is /dev/sda2 Python workaround: reg0: 218e3b88 Python workaround: reg0: 218e3b88 Python workaround: reg0: 218e3b88 Python workaround: reg0: 218e3b88 PPC64 nvram contains 122880 bytes Using default idle loop [boot]0015 Setup Done Built 1 zonelists Kernel command line: mpic: Initializing for 64 sources PID hash table entries: 4096 (order: 12, 131072 bytes) time_init: decrementer frequency = 262.732702 MHz time_init: processor frequency = 251.781200 MHz cpu 0x1: Vector: 300 (Data Access) at [c00000003ff87b70] pc: c00000000002de3c From anton at samba.org Mon May 23 06:54:28 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 23 May 2005 06:54:28 +1000 Subject: RS/6000 7017-S7A hangs on boot In-Reply-To: 
<20050522201816.GA8254@exception.at> References: <20050522201816.GA8254@exception.at> Message-ID: <20050522205427.GE20174@krispykreme> Hi, > I encounter problems when I try to boot the machine > noticed in the subject with a cross compiled 2.6.11.2 kernel. ... > cpu 0x1: Vector: 300 (Data Access) at [c00000003ff87b70] > pc: c00000000002de3c Can you look up the pc in your System.map? Is xmon on? You might get some more info on the oops if xmon is turned off (it looks like it hung). Anton From tzachi at marvell.com Mon May 23 07:43:32 2005 From: tzachi at marvell.com (Tzachi Perelstein) Date: Mon, 23 May 2005 00:43:32 +0300 Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO(#2) Message-ID: Seems like it was a *hot* weekend... Ben, I agree with Wolfgang that while taking footprint and simplicity into account we should "try to think about the whole system, including boot loader, kernel, and any data that might need to get passed between these two". My ppc64 Linux is booted from U-Boot on Marvell platform and it uses the simple board_info instead of the device-tree. There are only some minor hacks in arch files that by-pass the whole device-tree usage in head.S and setup.c, e.g. [1] early_init_board_info() instead of early_init_devtree(), and [2] skipping on unflatten_device_tree() and finish_device_tree(). As you know, the majority of these functions and sub-functions just parse and handle the device-tree. Frankly, although I'm willing to give device-tree a try, I think it is not the wise thing to do. It is simple and effective, and it can be nicer with minor _general_ support in arch files to non-OF loaders. A slightly different loading for embedded systems is not necessarily a mess. I'm sure that the ppc experience will make it be nice and structured even this way. Imagine emerging ppc64 Linux embedded systems living outside the kernel due to embedded incompatibilities; this might make much more mess in the long run... 
Best Regards, Tzachi From benh at kernel.crashing.org Mon May 23 08:29:36 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 23 May 2005 08:29:36 +1000 Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO(#2) In-Reply-To: References: Message-ID: <1116800976.6002.16.camel@gaston> On Mon, 2005-05-23 at 00:43 +0300, Tzachi Perelstein wrote: > Seems like it was a *hot* weekend... > > Ben, I agree with Wolfgang that while taking footprint and simplicity > into account we should "try to think about the whole system, including > boot loader, kernel, and any data that might need to get passed between > these two". This is what this is doing. Please read all that was said. > My ppc64 Linux is booted from U-Boot on Marvell platform and it uses the > simple board_info instead of the device-tree. There are only some minor > hacks in arch files that by-pass the whole device-tree usage in head.S > and setup.c, e.g. [1] early_init_board_info() instead of > early_init_devtree(), and [2] skipping on unflatten_device_tree() and > finish_device_tree(). Minor hacks -> too many hacks. Please. Understand that it means everybody will end up creating slightly different board-info, and the whole thing will end up like an ifdef clutter. Honestly, I don't see where is your problem with a device-tree, it seems we have proven it was both small and much more flexible. > As you know, the majority of these functions and > sub-functions just parse and handle the device-tree. > Frankly, although I'm willing to give device-tree a try, I think it is > not the wise thing to do. Well, I think you are wrong. > It is simple and effective, and it can be > nicer with minor _general_ support in arch files to non-OF loaders. There is, it's called a flattened device-tree. Damn, I can't see what is your problem here. A minimum tree is a few hundred bytes dammit ! 
You can probably even define one that exposes the north bridge's devices (assuming you have ethernet similar to other Marvell chips) etc... in a few hundred more bytes, thus making the whole probing of these & matching with drivers a lot cleaner on the kernel side. > A > slightly different loading for embedded systems is not necessarily a > mess. I'm sure that the ppc experience will make it be nice and > structured even this way. Nope, it won't. A structure definition is the _worst_ thing to do. > Imagine emerging ppc64 Linux embedded systems living outside the kernel > due to embedded incompatibilities; this might make much more mess in the > long run... I don't understand your point. It is _simple_ to set up a device-tree. It's probably even more flexible for your own embedded needs. Ben. From benh at kernel.crashing.org Mon May 23 10:03:52 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 23 May 2005 10:03:52 +1000 Subject: [PATCH] ppc64: Fix booting on latest G5 models Message-ID: <1116806632.20084.8.camel@gaston> Hi ! The latest speedbumped Apple G5 models have a "bug" in the Open Firmware device tree: it lacks the proper interrupt routing information for the northbridge i2c controller. Apple's driver silently falls back to a sub-optimal "polled" mode (heh, maybe they didn't even notice the bug because of that :), while our driver didn't check properly and crashed :( This patch fixes our driver to not crash, and adds code to the prom_init() OF trampoline code that detects the "bug" and adds the missing information back for this chipset revision. This fixes booting and thermal control on these models.
Signed-off-by: Benjamin Herrenschmidt Index: linux-work/arch/ppc64/kernel/prom_init.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/prom_init.c 2005-05-02 10:48:08.000000000 +1000 +++ linux-work/arch/ppc64/kernel/prom_init.c 2005-05-23 08:23:33.000000000 +1000 @@ -1750,7 +1750,44 @@ prom_printf("Device tree struct 0x%x -> 0x%x\n", RELOC(dt_struct_start), RELOC(dt_struct_end)); - } +} + + +static void __init fixup_device_tree(void) +{ + unsigned long offset = reloc_offset(); + phandle u3, i2c, mpic; + u32 u3_rev; + u32 interrupts[2]; + u32 parent; + + /* Some G5s have a missing interrupt definition, fix it up here */ + u3 = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000")); + if ((long)u3 <= 0) + return; + i2c = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000/i2c at f8001000")); + if ((long)i2c <= 0) + return; + mpic = call_prom("finddevice", 1, 1, ADDR("/u3 at 0,f8000000/mpic at f8040000")); + if ((long)mpic <= 0) + return; + + /* check if proper rev of u3 */ + if (prom_getprop(u3, "device-rev", &u3_rev, sizeof(u3_rev)) <= 0) + return; + if (u3_rev != 0x35) + return; + /* does it need fixup ? 
*/ + if (prom_getproplen(i2c, "interrupts") > 0) + return; + /* interrupt on this revision of u3 is number 0 and level */ + interrupts[0] = 0; + interrupts[1] = 1; + prom_setprop(i2c, "interrupts", &interrupts, sizeof(interrupts)); + parent = (u32)mpic; + prom_setprop(i2c, "interrupt-parent", &parent, sizeof(parent)); +} + static void __init prom_find_boot_cpu(void) { @@ -1920,6 +1957,11 @@ } /* + * Fixup any known bugs in the device-tree + */ + fixup_device_tree(); + + /* * Now finally create the flattened device-tree */ prom_printf("copying OF device tree ...\n"); Index: linux-work/drivers/i2c/busses/i2c-keywest.c =================================================================== --- linux-work.orig/drivers/i2c/busses/i2c-keywest.c 2005-05-02 10:48:09.000000000 +1000 +++ linux-work/drivers/i2c/busses/i2c-keywest.c 2005-05-21 08:38:40.000000000 +1000 @@ -516,6 +516,11 @@ u32 *psteps, *prate; int rc; + if (np->n_intrs < 1 || np->n_addrs < 1) { + printk(KERN_ERR "%s: Missing interrupt or address !\n", + np->full_name); + return -ENODEV; + } if (pmac_low_i2c_lock(np)) return -ENODEV; From benh at kernel.crashing.org Mon May 23 12:59:50 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Mon, 23 May 2005 12:59:50 +1000 Subject: Fix HW clock sync on G5s Message-ID: <1116817190.20084.30.camel@gaston> Hi Owen, Olaf ! I got it wrong when I told you the patch for fixing that was already upstream. It is not, it's in a local tree though. Here it is, please let me know if it helps. 
Index: linux-work/arch/ppc64/kernel/pmac_smp.c =================================================================== --- linux-work.orig/arch/ppc64/kernel/pmac_smp.c 2005-01-24 17:09:25.000000000 +1100 +++ linux-work/arch/ppc64/kernel/pmac_smp.c 2005-02-11 19:11:43.365862704 +1100 @@ -68,6 +68,7 @@ static void (*pmac_tb_freeze)(int freeze); static struct device_node *pmac_tb_clock_chip_host; +static u8 pmac_tb_pulsar_addr; static DEFINE_SPINLOCK(timebase_lock); static unsigned long timebase; @@ -106,12 +107,9 @@ u8 data; int rc; - /* Strangely, the device-tree says address is 0xd2, but darwin - * accesses 0xd0 ... - */ pmac_low_i2c_setmode(pmac_tb_clock_chip_host, pmac_low_i2c_mode_combined); rc = pmac_low_i2c_xfer(pmac_tb_clock_chip_host, - 0xd4 | pmac_low_i2c_read, + pmac_tb_pulsar_addr | pmac_low_i2c_read, 0x2e, &data, 1); if (rc != 0) goto bail; @@ -120,7 +118,7 @@ pmac_low_i2c_setmode(pmac_tb_clock_chip_host, pmac_low_i2c_mode_stdsub); rc = pmac_low_i2c_xfer(pmac_tb_clock_chip_host, - 0xd4 | pmac_low_i2c_write, + pmac_tb_pulsar_addr | pmac_low_i2c_write, 0x2e, &data, 1); bail: if (rc != 0) { @@ -185,6 +183,12 @@ if (ncpus <= 1) return 1; + /* HW sync only on these platforms */ + if (!machine_is_compatible("PowerMac7,2") && + !machine_is_compatible("PowerMac7,3") && + !machine_is_compatible("RackMac3,1")) + goto nohwsync; + /* Look for the clock chip */ for (cc = NULL; (cc = of_find_node_by_name(cc, "i2c-hwclock")) != NULL;) { struct device_node *p = of_get_parent(cc); @@ -198,11 +202,18 @@ goto next; switch (*reg) { case 0xd2: - pmac_tb_freeze = smp_core99_cypress_tb_freeze; - printk(KERN_INFO "Timebase clock is Cypress chip\n"); + if (device_is_compatible(cc, "pulsar-legacy-slewing")) { + pmac_tb_freeze = smp_core99_pulsar_tb_freeze; + pmac_tb_pulsar_addr = 0xd2; + printk(KERN_INFO "Timebase clock is Pulsar chip\n"); + } else if (device_is_compatible(cc, "cy28508")) { + pmac_tb_freeze = smp_core99_cypress_tb_freeze; + printk(KERN_INFO "Timebase clock is Cypress 
chip\n"); + } break; case 0xd4: pmac_tb_freeze = smp_core99_pulsar_tb_freeze; + pmac_tb_pulsar_addr = 0xd4; printk(KERN_INFO "Timebase clock is Pulsar chip\n"); break; } @@ -210,12 +221,15 @@ pmac_tb_clock_chip_host = p; smp_ops->give_timebase = smp_core99_give_timebase; smp_ops->take_timebase = smp_core99_take_timebase; + of_node_put(cc); + of_node_put(p); break; } next: of_node_put(p); } + nohwsync: mpic_request_ipis(); return ncpus; From olh at suse.de Mon May 23 16:23:21 2005 From: olh at suse.de (Olaf Hering) Date: Mon, 23 May 2005 08:23:21 +0200 Subject: Fix HW clock sync on G5s In-Reply-To: <1116817190.20084.30.camel@gaston> References: <1116817190.20084.30.camel@gaston> Message-ID: <20050523062321.GA24392@suse.de> On Mon, May 23, Benjamin Herrenschmidt wrote: > Hi Owen, Olaf ! > > I got it wrong when I told you the patch for fixing that was already > upstream. It is not, it's in a local tree though. Here it is, please let > me know if it helps. Yes, it works for me. From olh at suse.de Mon May 23 19:33:34 2005 From: olh at suse.de (Olaf Hering) Date: Mon, 23 May 2005 11:33:34 +0200 Subject: openpower 720, unknown Firmware/POST Error Code Message-ID: <20050523093334.GA27320@suse.de> Any idea what this number on the openpower 720 frontpanel means? b2018127 It did not come back after a reboot, did not respond to the white button on the front panel, had to pull the plug. It runs in SMP mode. I found no error code in this range in the 'Firmware/POST Error Codes' for other pseries systems.
From jfaslist at yahoo.fr Tue May 24 08:23:50 2005 From: jfaslist at yahoo.fr (jfaslist) Date: Mon, 23 May 2005 15:23:50 -0700 Subject: device-tree in U-Boot In-Reply-To: <1116371643.5366.12.camel@gaston> References: <1116371643.5366.12.camel@gaston> Message-ID: <429257F6.7030506@yahoo.fr> for what it's worth...you can give a look at what IBM provides in its ppc64 reference design for the 970FX CPU. They use a firmware called PIBS that have a lightweight implementation of OF and a device tree. See under /bsp_970fx/oflib/oflib $ ls discover.c dumptree.o of_canon.o of_stub.o openfirm.o version.c discover.o makefile of_path.c of_stub.s treecreate.c version.o dumptree.c of_canon.c of_path.o openfirm.c treecreate.o The code can be obtained from: http://www-128.ibm.com/developerworks/power/ppc970fx/?S_TACT=105AGX37&S_CMP=TAPPA -jf simon - themis computer Benjamin Herrenschmidt wrote: >On Tue, 2005-05-17 at 16:49 +0300, Tzachi Perelstein wrote: > > >>Hi Ben, >> >>I'll make U-Boot loading fits to ppc64 standards using the device-tree. >>I was consulting Wolfgang Denk (U-Boot owner), and he confirmed it (more >>or less). >> >>Since I'm not familiar with OF and device tree, and since you generously >>offered your help, I contact you for some guidance. >>As a first step, I would like to do it as simple as possible. On the >>last stage of U-Boot before branching to Linux, I will collect the >>needed information and reassemble it into a device-tree. >>Can you please advise me how to do it without adding the whole >>device-tree implementation to U-Boot? Is there any code reference I can >>use? >>You have mentioned that there are only little requirements provided by >>the kernel regarding device tree. Can you please clarify? >> >> > >Sure, I will. Later today or tomorrow if you don't mind though. I'm >gather bits & pieces and will provide you with all the needed >informations. > >Ben.
From dwm at maxeymade.com  Mon May 23 23:09:12 2005
From: dwm at maxeymade.com (Doug Maxey)
Date: Mon, 23 May 2005 08:09:12 -0500
Subject: openpower 720, unknown Firmware/POST Error Code
In-Reply-To: <20050523093334.GA27320@suse.de>
Message-ID: <200505231309.j4ND9Cc9027300@falcon30.maxeymade.com>

On Mon, 23 May 2005 11:33:34 +0200, Olaf Hering wrote:
>
>Any idea what this number on the openpower 720 frontpanel means?
>
>b2018127
>

B2pp8127
Description: Phyp was unable to allocate the memory needed to build
the PFDS.

pp == partition number.

Not sure about the PFDS. Sounds like something was changed,
configuration wise.

>It did not come back after a reboot, did not respond to the white
>button on the front panel, had to pull the plug. It runs in SMP mode.
>I found no error code in this range in the 'Firmware/POST Error Codes'
>for other pseries systems.

From tzachi at marvell.com  Tue May 24 00:04:21 2005
From: tzachi at marvell.com (Tzachi Perelstein)
Date: Mon, 23 May 2005 17:04:21 +0300
Subject: device-tree in U-Boot
Message-ID:

Thanks! I got it.

From segher at kernel.crashing.org  Tue May 24 01:06:46 2005
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Mon, 23 May 2005 17:06:46 +0200
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO(#2)
In-Reply-To:
References:
Message-ID:

> Imagine emerging ppc64 Linux embedded systems living outside the kernel
> due to embedded incompatibilities; this might make much more mess in the
> long run...

The mess won't be _in the main kernel repository_, though. I.e.,
whoever caused it, would have to deal with it; not everyone else.

Segher

From benh at kernel.crashing.org  Tue May 24 09:00:07 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 24 May 2005 09:00:07 +1000
Subject: device-tree in U-Boot
In-Reply-To:
References:
Message-ID: <1116889207.5124.10.camel@gaston>

On Mon, 2005-05-23 at 17:04 +0300, Tzachi Perelstein wrote:
> Thanks! I got it.

Note that this is a full featured implementation of the OF client
interface code, which may be a lot more than what you are looking for.
David and I will come up soon (this week) with bits of code showing how
to manipulate the flattened device-tree and a device-tree compiler that
allow you to generate one from a text source file. This should result in
much more compact code & data in the bootloader.

Ben.

From benh at kernel.crashing.org  Tue May 24 14:34:45 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 24 May 2005 14:34:45 +1000
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (v3)
Message-ID: <1116909286.4992.17.camel@gaston>

Hi !

Here's revision 3 of the spec for the booting of linux/ppc64 with a
flattened device-tree. The novelty is that I added a new more compact
format.
A followup mail will have the kernel patches to add support for this
new format; I'll submit them upstream for after 2.6.12, I think.

David and I are still working on sample code & tools. We have a
prototype of a device-tree "compiler" that can build the flattened
blob from a textual representation. We'll release that soon, hopefully
this week.

--------

           Booting the Linux/ppc64 kernel without Open Firmware
           ----------------------------------------------------

(c) 2005 Benjamin Herrenschmidt, IBM Corp.

   May 18, 2005: Rev 0.1 - Initial draft, no chapter III yet.

   May 19, 2005: Rev 0.2 - Add chapter III and bits & pieces here and
                           there; clarify the fact that a lot of
                           things are optional: the kernel only
                           requires a very small device tree, though
                           it is encouraged to provide as complete a
                           one as possible.

   May 24, 2005: Rev 0.3 - Specify that the DT block has to be in RAM
                         - Misc fixes
                         - Define version 3 and new format version 16
                           for the DT block (version 16 needs kernel
                           patches, will be fwd separately).
                           String block now has a size, and full path
                           is replaced by unit name for more
                           compactness.
                           linux,phandle is made optional, only nodes
                           that are referenced by other nodes need it.
                           "name" property is now automatically
                           deduced from the unit name

 ToDo:
        - Add some definitions of interrupt tree (simple/complex)
        - Add some definitions for pci host bridges

I - Introduction
================

During the recent development of the Linux/ppc64 kernel, and more
specifically the addition of new platform types outside of the old
IBM pSeries/iSeries pair, it was decided to enforce some strict rules
regarding the kernel entry and bootloader <-> kernel interfaces, in
order to avoid the degeneration that has become the ppc32 kernel entry
point and the way a new platform should be added to the kernel. The
legacy iSeries platform breaks those rules as it predates this scheme,
but no new board support will be accepted in the main tree that
doesn't follow them properly.
The main requirement that will be defined in more detail below is
the presence of a device-tree whose format is defined after the Open
Firmware specification. However, in order to make life easier for
embedded board vendors, the kernel doesn't require the device-tree to
represent every device in the system and only requires some nodes and
properties to be present. This will be described in detail in chapter
III, but, for example, the kernel does not require you to create a
node for every PCI device in the system. It is a requirement to have a
node for PCI host bridges in order to provide interrupt routing
information and memory/IO ranges, among others. It is also recommended
to define nodes for on chip devices and other busses that don't
specifically fit in an existing OF specification. This creates great
flexibility in the way the kernel can then probe those and match
drivers to devices, without having to hard code all sorts of tables.
It also makes it more flexible for board vendors to do minor hardware
upgrades without significantly impacting the kernel code or cluttering
it with special cases.


1) Entry point
--------------

   There is one and only one entry point to the kernel, at the start
   of the kernel image. That entry point supports two calling
   conventions:

        a) Boot from Open Firmware. If your firmware is compatible
        with Open Firmware (IEEE 1275) or provides an OF compatible
        client interface API (support for "interpret" callback of
        forth words isn't required), you can enter the kernel with:

              r5 : OF callback pointer as defined by IEEE 1275
              bindings to powerpc. Only the 32 bits client interface
              is currently supported

              r3, r4 : address & length of an initrd if any or 0

              MMU is either on or off, the kernel will run the
              trampoline located in arch/ppc64/kernel/prom_init.c to
              extract the device-tree and other information from open
              firmware and build a flattened device-tree as described
              in b).
              prom_init() will then re-enter the kernel using the
              second method. This trampoline code runs in the
              context of the firmware, which is supposed to handle all
              exceptions during that time.

        b) Direct entry with a flattened device-tree block. This entry
        point is called by a) after the OF trampoline and can also be
        called directly by a bootloader that does not support the Open
        Firmware client interface. It is also used by "kexec" to
        implement "hot" booting of a new kernel from a previous
        running one. This method is what I will describe in more
        detail in this document, as method a) is simply standard Open
        Firmware, and thus should be implemented according to the
        various standard documents defining it and its binding to the
        PowerPC platform. The entry point definition then becomes:

                r3 : physical pointer to the device-tree block
                (defined in chapter II) in RAM

                r4 : physical pointer to the kernel itself. This is
                used by the assembly code to properly disable the MMU
                in case you are entering the kernel with MMU enabled
                and a non-1:1 mapping.

                r5 : NULL (so as to differentiate it from method a)

        Note about SMP entry: Either your firmware puts your other
        CPUs in some sleep loop or spin loop in ROM where you can get
        them out via a soft reset or some other means, in which case
        you don't need to care, or you'll have to enter the kernel
        with all CPUs. The way to do that with method b) will be
        described in a later revision of this document.


2) Board support
----------------

   Board supports (platforms) are not exclusive config options. An
   arbitrary set of board supports can be built in a single kernel
   image. The kernel will "know" what set of functions to use for a
   given platform based on the content of the device-tree. Thus, you
   should:

        a) add your platform support as a _boolean_ option in
        arch/ppc64/Kconfig, following the example of PPC_PSERIES,
        PPC_PMAC and PPC_MAPLE. The latter is probably a good
        example of a board support to start from.
        b) create your main platform file as
        "arch/ppc64/kernel/myboard_setup.c" and add it to the Makefile
        under the condition of your CONFIG_ option. This file will
        define a structure of type "ppc_md" containing the various
        callbacks that the generic code will use to get to your
        platform specific code

        c) Add a reference to your "ppc_md" structure in the
        "machines" table in arch/ppc64/kernel/setup.c

        d) request and get assigned a platform number (see PLATFORM_*
        constants in include/asm-ppc64/processor.h)

   I will describe later the boot process and various callbacks that
   your platform should implement.


II - The DT block format
========================

   This chapter defines the actual format of the flattened device-tree
   passed to the kernel. The actual content of it and kernel
   requirements are described later. You can find examples of code
   manipulating that format in various places, including
   arch/ppc64/kernel/prom_init.c which will generate a flattened
   device-tree from the Open Firmware representation, or the fs2dt
   utility which is part of the kexec tools which will generate one
   from a filesystem representation. It is expected that a bootloader
   like uboot provides a bit more support, that will be discussed
   later as well.

   Note: The block has to be in main memory. It has to be accessible
   in both real mode and virtual mode with no other mapping than main
   memory. If you are writing a simple flash bootloader, it should
   copy the block to RAM before passing it to the kernel.
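
A minimal sketch of the copy step mentioned in the note above, assuming
the blob in flash starts with the magic word and the big endian
"totalsize" field laid out as in the "1) Header" section of this
chapter. The names dt_copy_to_ram() and be32_at() are hypothetical
helpers introduced for illustration only; they are not part of the
kernel or of U-Boot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OF_DT_HEADER 0xd00dfeedu   /* magic word, as defined in this spec */

/* Read one big endian 32 bits value (hypothetical helper). */
static uint32_t be32_at(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Copy a DT block from flash to RAM before handing it to the kernel.
 * Assumption: magic is at byte offset 0 and totalsize at byte offset 4.
 * Returns the number of bytes copied, or 0 if the magic is wrong or the
 * destination area is too small.
 */
size_t dt_copy_to_ram(uint8_t *ram, size_t ram_cap, const uint8_t *flash)
{
    uint32_t totalsize;

    assert(ram != NULL && flash != NULL);

    if (be32_at(flash) != OF_DT_HEADER)
        return 0;

    totalsize = be32_at(flash + 4);
    if (totalsize < 8 || totalsize > ram_cap)
        return 0;

    memcpy(ram, flash, totalsize);
    return totalsize;
}
```

The RAM address of the copy is then what would be passed in r3. The
copy itself would normally also be listed in the memory reserve map
described below.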
1) Header
---------

   The kernel is entered with r3 pointing to an area of memory that is
   roughly described in include/asm-ppc64/prom.h by the structure
   boot_param_header:

struct boot_param_header
{
        u32     magic;                  /* magic word OF_DT_HEADER */
        u32     totalsize;              /* total size of DT block */
        u32     off_dt_struct;          /* offset to structure */
        u32     off_dt_strings;         /* offset to strings */
        u32     off_mem_rsvmap;         /* offset to memory reserve map */
        u32     version;                /* format version */
        u32     last_comp_version;      /* last compatible version */

        /* version 2 fields below */
        u32     boot_cpuid_phys;        /* Which physical CPU id we're
                                           booting on */
        /* version 3 fields below */
        u32     size_dt_strings;        /* size of the strings block */
};

   Along with the constants:

/* Definitions used by the flattened device tree */
#define OF_DT_HEADER            0xd00dfeed      /* 4: version, 4: total size */
#define OF_DT_BEGIN_NODE        0x1             /* Start node: full name */
#define OF_DT_END_NODE          0x2             /* End node */
#define OF_DT_PROP              0x3             /* Property: name off, size, content */
#define OF_DT_END               0x9

   All values in this header are in big endian format. The various
   fields in this header are defined more precisely below. All
   "offset" values are in bytes from the start of the header, that is
   from the r3 value.

     - magic

       This is a magic value that "marks" the beginning of the
       device-tree block header. It contains the value 0xd00dfeed and is
       defined by the constant OF_DT_HEADER

     - totalsize

       This is the total size of the DT block including the header. The
       "DT" block should enclose all data structures defined in this
       chapter (which are pointed to by offsets in this header). That is,
       the device-tree structure, strings, and the memory reserve map.

     - off_dt_struct

       This is an offset from the beginning of the header to the start
       of the "structure" part of the device tree.
       (see 2) device tree)

     - off_dt_strings

       This is an offset from the beginning of the header to the start
       of the "strings" part of the device-tree

     - off_mem_rsvmap

       This is an offset from the beginning of the header to the start
       of the reserved memory map. This map is a list of pairs of 64
       bits integers. Each pair is a physical address and a size. The
       list is terminated by an entry of size 0. This map provides the
       kernel with a list of physical memory areas that are "reserved"
       and thus not to be used for memory allocations, especially during
       early initialisation. The kernel needs to allocate memory during
       boot for things like un-flattening the device-tree, allocating an
       MMU hash table, etc... Those allocations must be done in such a
       way as to avoid overwriting critical things like, on Open
       Firmware capable machines, the RTAS instance, or on some pSeries,
       the TCE tables used for the iommu. Typically, the reserve map
       should contain _at least_ this DT block itself (header,total_size).
       If you are passing an initrd to the kernel, you should reserve it
       as well. You do not need to reserve the kernel image itself. The
       map should be 64 bits aligned.

     - version

       This is the version of this structure. Version 1 stops
       here. Version 2 adds an additional field boot_cpuid_phys.
       Version 3 adds the size of the strings block, allowing the kernel
       to reallocate it easily at boot and free up the unused flattened
       structure after expansion. Version 16 introduces a new more
       "compact" format for the tree itself that is however not backward
       compatible. You should always generate a structure of the highest
       version defined at the time of your implementation. Currently
       that is version 16, unless you explicitly aim at being backward
       compatible

     - last_comp_version

       Last compatible version. This indicates down to what version of
       the DT block you are backward compatible.
       For example, version 2 is backward compatible with version 1
       (that is, a kernel built for version 1 will be able to boot with
       a version 2 format). You should put a 1 in this field if you
       generate a device tree of version 1 to 3, or 0x10 if you generate
       a tree of version 0x10 using the new unit name format.

     - boot_cpuid_phys

       This field only exists on version 2 (and later) headers. It
       indicates which physical CPU ID is calling the kernel entry
       point. This is used, among others, by kexec. If you are on an SMP
       system, this value should match the content of the "reg" property
       of the CPU node in the device-tree corresponding to the CPU
       calling the kernel entry point (see further chapters for more
       information on the required device-tree contents)


   So the typical layout of a DT block (though the various parts don't
   need to be in that order) looks like (addresses go from top to
   bottom):

             ------------------------------
       r3 -> |  struct boot_param_header  |
             ------------------------------
             |      (alignment gap) (*)   |
             ------------------------------
             |      memory reserve map    |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |    device-tree structure   |
             |                            |
             ------------------------------
             |      (alignment gap)       |
             ------------------------------
             |                            |
             |     device-tree strings    |
             |                            |
      -----> ------------------------------
      |
      |
      --- (r3 + totalsize)

  (*) The alignment gaps are not necessarily present; their presence
      and size are dependent on the various alignment requirements of
      the individual data blocks.


2) Device tree generalities
---------------------------

   This device-tree itself is separated in two different blocks, a
   structure block and a strings block. Both need to be page
   aligned.

   First, let's quickly describe the device-tree concept before
   detailing the storage format. This chapter does _not_ describe the
   detail of the required types of nodes & properties for the kernel,
   this is done later in chapter III.
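
As an illustration of the header described in section 1) above, here is
a hedged sketch of how a consumer might read it. It assumes the fields
are packed 32 bits big endian values in the order given by struct
boot_param_header. The names dt_header_info, dt_parse_header() and
dt_reader_compatible() are hypothetical and only serve this example;
the kernel's own code is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OF_DT_HEADER 0xd00dfeedu

/* Host-endian copy of the header fields (hypothetical type). */
struct dt_header_info {
    uint32_t totalsize;
    uint32_t off_dt_struct;
    uint32_t off_dt_strings;
    uint32_t off_mem_rsvmap;
    uint32_t version;
    uint32_t last_comp_version;
    uint32_t boot_cpuid_phys;   /* 0 unless version >= 2 */
    uint32_t size_dt_strings;   /* 0 unless version >= 3 */
};

static uint32_t dt_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Returns 0 on success, -1 if the block is too short or the magic is
 * wrong, -2 if a versioned field is missing or an offset points
 * outside the block.
 */
int dt_parse_header(const uint8_t *blob, size_t avail,
                    struct dt_header_info *out)
{
    assert(blob != NULL && out != NULL);

    if (avail < 28 || dt_be32(blob) != OF_DT_HEADER)
        return -1;

    out->totalsize         = dt_be32(blob + 4);
    out->off_dt_struct     = dt_be32(blob + 8);
    out->off_dt_strings    = dt_be32(blob + 12);
    out->off_mem_rsvmap    = dt_be32(blob + 16);
    out->version           = dt_be32(blob + 20);
    out->last_comp_version = dt_be32(blob + 24);
    out->boot_cpuid_phys   = 0;
    out->size_dt_strings   = 0;

    if (out->version >= 2) {            /* version 2 field */
        if (avail < 32)
            return -2;
        out->boot_cpuid_phys = dt_be32(blob + 28);
    }
    if (out->version >= 3) {            /* version 3 field */
        if (avail < 36)
            return -2;
        out->size_dt_strings = dt_be32(blob + 32);
    }

    if (out->off_dt_struct  >= out->totalsize ||
        out->off_dt_strings >= out->totalsize ||
        out->off_mem_rsvmap >= out->totalsize)
        return -2;

    return 0;
}

/* A reader that understands format 'reader_version' can use the block
 * if the block is backward compatible down to that version. */
int dt_reader_compatible(const struct dt_header_info *h,
                         uint32_t reader_version)
{
    return h->last_comp_version <= reader_version;
}
```

The offset sanity check is a choice made for this sketch; the text
above only states that all parts should be enclosed by totalsize.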
   The device-tree layout is strongly inherited from the definition of
   the Open Firmware IEEE 1275 device-tree. It's basically a tree of
   nodes, each node having two or more named properties. A property
   can have a value or not.

   It is a tree, so each node has one and only one parent except for
   the root node which has no parent.

   A node has 2 names. The actual node name is generally contained in
   a property of type "name" in the node property list whose value is
   a zero terminated string and is mandatory for version 1 to 3 of the
   format definition (as it is in Open Firmware). Version 0x10 makes
   it optional as it can generate it from the unit name defined below.

   There is also a "unit name" that is used to differentiate nodes
   with the same name at the same level. It is usually made of the
   node name, the "@" sign, and a "unit address", whose definition is
   specific to the bus type the node sits on.

   The unit name doesn't exist as a property per-se but is included in
   the device-tree structure. It is typically used to represent "path"
   in the device-tree. More details about the actual format of these
   will be below.

   The kernel ppc64 generic code does not make any formal use of the
   unit address (though some board support code may do) so the only
   real requirement here for the unit address is to ensure uniqueness
   of the node unit name at a given level of the tree. Nodes with no
   notion of address and no possible sibling of the same name (like
   /memory or /cpus) may omit the unit address in the context of this
   specification, or use the "@0" default unit address. The unit name
   is used to define a node "full path", which is the concatenation of
   all parent nodes unit names separated with "/".

   The root node doesn't have a defined name, and isn't required to
   have a name property either if you are using version 3 or earlier
   of the format. It also has no unit address (no @ symbol followed by
   a unit address). The root node unit name is thus an empty string.
   The full path to the root node is "/"

   Every node that actually represents an actual device (that is, one
   that isn't only a virtual "container" for more nodes, like "/cpus"
   is) is also required to have a "device_type" property indicating
   the type of node

   Finally, every node that can be referenced from a property in
   another node is required to have a "linux,phandle" property. Real
   open firmware implementations do provide a unique "phandle" value
   for every node that the "prom_init()" trampoline code turns into
   "linux,phandle" properties. However, this is made optional if the
   flattened device tree is used directly. An example of a node
   referencing another node via "phandle" is when laying out the
   interrupt tree which will be described in a further version of this
   document.

   This property is a 32 bits value that uniquely identifies a node.
   You are free to use whatever values or system of values, internal
   pointers, or whatever to generate these, the only requirement is
   that every node for which you provide that property has a unique
   value for it.

   Here is an example of a simple device-tree. In this example, an "o"
   designates a node followed by the node unit name. Properties are
   presented with their name followed by their content. "content"
   represents an ASCII string (zero terminated) value, while <content>
   represents a 32 bits hexadecimal value. The various nodes in this
   example will be discussed in a later chapter. At this point, it is
   only meant to give you an idea of what a device-tree looks like. I
   have on purpose kept the "name" and "linux,phandle" properties
   which aren't necessary in order to give you a better idea of what
   the tree looks like in practice.
  / o device-tree
      |- name = "device-tree"
      |- model = "MyBoardName"
      |- compatible = "MyBoardFamilyName"
      |- #address-cells = <2>
      |- #size-cells = <2>
      |- linux,phandle = <0>
      |
      o cpus
      | | - name = "cpus"
      | | - linux,phandle = <1>
      | | - #address-cells = <1>
      | | - #size-cells = <0>
      | |
      | o PowerPC,970@0
      |   |- name = "PowerPC,970"
      |   |- device_type = "cpu"
      |   |- reg = <0>
      |   |- clock-frequency = <5f5e1000>
      |   |- linux,boot-cpu
      |   |- linux,phandle = <2>
      |
      o memory@0
      | |- name = "memory"
      | |- device_type = "memory"
      | |- reg = <00000000 00000000 00000000 20000000>
      | |- linux,phandle = <3>
      |
      o chosen
        |- name = "chosen"
        |- bootargs = "root=/dev/sda2"
        |- linux,platform = <00000600>
        |- linux,phandle = <4>

   This tree is almost a minimal tree. It pretty much contains the
   minimal set of required nodes and properties to boot a linux
   kernel, that is some basic model information at the root, the CPUs,
   the physical memory layout, and misc information passed through
   /chosen like in this example, the platform type (mandatory) and the
   kernel command line arguments (optional).

   The /cpus/PowerPC,970@0/linux,boot-cpu property is an example of a
   property without a value. All other properties have a value. The
   meaning of the #address-cells and #size-cells properties will be
   explained in chapter III which defines precisely the required nodes
   and properties and their content.


3) Device tree "structure" block

   The structure of the device tree is a linearized tree structure. The
   "OF_DT_BEGIN_NODE" token starts a new node, and the "OF_DT_END_NODE"
   token ends that node definition. Child nodes are simply defined
   before "OF_DT_END_NODE" (that is nodes within the node). A 'token'
   is a 32 bits value.

   Here's the basic structure of a single node:

     * token OF_DT_BEGIN_NODE (that is 0x00000001)
     * for version 1 to 3, this is the node full path as a zero
       terminated string, starting with "/".
       For version 16 and later, this is the node unit name only
       (or an empty string for the root node)
     * [align gap to next 4 bytes boundary]
     * for each property:
        * token OF_DT_PROP (that is 0x00000003)
        * 32 bits value of property value size in bytes (or 0 if no
          value)
        * 32 bits value of offset in string block of property name
        * [align gap to either next 4 bytes boundary if the property
          value size is less than or equal to 4 bytes, or to next 8
          bytes boundary if the property value size is larger than 4
          bytes]
        * property value data if any
        * [align gap to next 4 bytes boundary]
     * [child nodes if any]
     * token OF_DT_END_NODE (that is 0x00000002)

   So the node content can be summarised as a start token, a full
   path, a list of properties, a list of child nodes and an end token.
   Every child node is a full node structure itself as defined above.

4) Device tree "strings" block

   In order to save space, property names, which are generally
   redundant, are stored separately in the "strings" block. This block
   is simply the whole bunch of zero terminated strings for all
   property names concatenated together. The device-tree property
   definitions in the structure block will contain offset values from
   the beginning of the strings block.


III - Required content of the device tree
=========================================

WARNING: All "linux,*" properties defined in this document apply only
to a flattened device-tree. If your platform uses a real
implementation of Open Firmware or an implementation compatible with
the Open Firmware client interface, those properties will be created
by the trampoline code in the kernel's prom_init() file. For example,
that's where you'll have to add code to detect your board model and
set the platform number. However, when using the flattened device-tree
entry point, there is no prom_init() pass, and thus you have to
provide those properties yourself.
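
The node layout from the 'Device tree "structure" block' section of
chapter II can be sketched as a small emitter. This is an illustrative
sketch only, assuming the version 16 format (unit name instead of full
path) and assuming that offset 0 of the output buffer is the suitably
aligned start of the structure block. The type dt_buf and the dt_*()
functions are hypothetical names made up for this example. Property
name offsets into the strings block are supplied by the caller. The
OF_DT_END constant is listed with the other constants but its use is
not described in the text above, so this sketch does not emit it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OF_DT_BEGIN_NODE 0x1u
#define OF_DT_END_NODE   0x2u
#define OF_DT_PROP       0x3u

/* Hypothetical output buffer; offset 0 is the start of the structure
 * block. */
struct dt_buf {
    uint8_t *data;
    size_t   len;
    size_t   cap;
};

/* Append zero bytes until len is a multiple of align. */
static void dt_pad(struct dt_buf *b, size_t align)
{
    while (b->len % align != 0) {
        assert(b->len < b->cap);
        b->data[b->len++] = 0;
    }
}

/* Append raw bytes. */
static void dt_put(struct dt_buf *b, const void *src, size_t n)
{
    assert(n <= b->cap - b->len);
    if (n != 0)
        memcpy(b->data + b->len, src, n);
    b->len += n;
}

/* Append one 32 bits token or value in big endian order. */
static void dt_put32(struct dt_buf *b, uint32_t v)
{
    const uint8_t be[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16),
                            (uint8_t)(v >> 8),  (uint8_t)v };
    dt_put(b, be, sizeof be);
}

/* Start token, zero terminated unit name, gap to 4 bytes boundary. */
void dt_begin_node(struct dt_buf *b, const char *unit_name)
{
    dt_put32(b, OF_DT_BEGIN_NODE);
    dt_put(b, unit_name, strlen(unit_name) + 1);
    dt_pad(b, 4);
}

/* Token, value size, name offset, gap (4 or 8 bytes boundary depending
 * on the value size), value data, gap to 4 bytes boundary. */
void dt_prop(struct dt_buf *b, uint32_t name_off,
             const void *value, uint32_t size)
{
    dt_put32(b, OF_DT_PROP);
    dt_put32(b, size);
    dt_put32(b, name_off);
    dt_pad(b, size > 4 ? 8 : 4);
    dt_put(b, value, size);
    dt_pad(b, 4);
}

/* End token; child nodes, if any, go before this call. */
void dt_end_node(struct dt_buf *b)
{
    dt_put32(b, OF_DT_END_NODE);
}
```

A root node would be emitted as dt_begin_node(&b, "") followed by its
properties, its child nodes (each one a full begin/props/end sequence)
and a final dt_end_node(&b). A property without a value, like
linux,boot-cpu in the example tree, is emitted with size 0.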
1) Note about cells and address representation
----------------------------------------------

   The general rule is documented in the various Open Firmware
   documentations. If you choose to describe a bus with the device-tree
   and there exists an OF bus binding, then you should follow the
   specification. However, the kernel does not require every single
   device or bus to be described by the device tree.

   In general, the format of an address for a device is defined by the
   parent bus type, based on the #address-cells and #size-cells
   properties. In the absence of such a property, the parent's parent
   values are used, etc... The kernel requires the root node to have
   those properties defining addresses format for devices directly
   mapped on the processor bus.

   Those 2 properties define 'cells' for representing an address and a
   size. A "cell" is a 32 bits number. For example, if both contain 2
   like the example tree given above, then an address and a size are
   both composed of 2 cells, that is a 64 bits number (cells are
   concatenated and expected to be in big endian format). Another
   example is the way Apple firmware defines them, that is 2 cells for
   an address and one cell for a size.

   A device's IO or MMIO areas on the bus are defined in the "reg"
   property. The format of this property depends on the bus the device
   is sitting on. Standard bus types define their "reg" properties
   format in the various OF bindings for those bus types. You are free
   to define your own "reg" format for proprietary busses or virtual
   busses enclosing on-chip devices, though it is recommended that the
   parts of the "reg" property containing addresses and sizes do
   respect the defined #address-cells and #size-cells when those make
   sense.

   Later, I will define more precisely some common address formats.

   For a new ppc64 board, I recommend using either the 2/2 format or
   Apple's 2/1 format which is slightly more compact since sizes
   usually fit in a single 32 bits word.
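
To make the cell arithmetic above concrete, here is a minimal sketch
that decodes a "reg" value made of plain (address, size) pairs, such as
the one in the memory node of the example tree. It assumes at most 2
cells per address or size, and it does not cover bus specific "reg"
formats. The names dt_range, dt_read_cells() and dt_decode_reg() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One decoded (address, size) pair (hypothetical type). */
struct dt_range {
    uint64_t addr;
    uint64_t size;
};

/* Concatenate ncells big endian 32 bits cells into one number.
 * Assumption: at most 2 cells, so the result fits in 64 bits. */
static uint64_t dt_read_cells(const uint8_t *p, unsigned ncells)
{
    uint64_t v = 0;
    unsigned i;

    assert(ncells <= 2);
    for (i = 0; i < ncells * 4; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Decode up to 'max' entries of a "reg" property of 'len' bytes.
 * Returns the number of entries written to 'out', or -1 if 'len' is
 * not a multiple of the entry size implied by the cell counts.
 */
int dt_decode_reg(const uint8_t *reg, size_t len,
                  unsigned addr_cells, unsigned size_cells,
                  struct dt_range *out, int max)
{
    size_t entry = 4u * (addr_cells + size_cells);
    size_t off;
    int n = 0;

    assert(reg != NULL && out != NULL);

    if (entry == 0 || len % entry != 0)
        return -1;

    for (off = 0; off < len && n < max; off += entry, n++) {
        out[n].addr = dt_read_cells(reg + off, addr_cells);
        out[n].size = dt_read_cells(reg + off + 4u * addr_cells,
                                    size_cells);
    }
    return n;
}
```

With the 2/2 format of the example tree, the memory node value
<00000000 00000000 00000000 20000000> decodes to one range at address 0
with a size of 0x20000000 bytes. With Apple's 2/1 format, each entry
would be 12 bytes instead of 16.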
2) Note about "compatible" properties
-------------------------------------

   Those properties are optional, but recommended in devices and the
   root node. The format of a "compatible" property is a list of
   concatenated zero terminated strings. They allow a device to express
   its compatibility with a family of similar devices, in some cases
   allowing a single driver to match against several devices regardless
   of their actual names

3) Note about "name" properties
-------------------------------

   While earlier users of Open Firmware like OldWorld macintoshes tended
   to use the actual device name for the "name" property, it's nowadays
   considered a good practice to use a name that is closer to the device
   class (often equal to device_type). For example, nowadays, ethernet
   controllers are named "ethernet", with an additional "model" property
   defining precisely the chip type/model, and a "compatible" property
   defining the family in case a single driver can drive more than one
   of these chips. The kernel however doesn't generally put any
   restriction on the "name" property, it is simply considered good
   practice to follow the standard and its evolutions as closely as
   possible.

   Note also that the new format version 16 makes the "name" property
   optional. If it's absent for a node, then the node's unit name is
   used to reconstruct the name. That is, the part of the unit name
   before the "@" sign is used (or the entire unit name if no "@" sign
   is present).

4) Note about node and property names and character set
-------------------------------------------------------

   While open firmware provides more flexible usage of 8859-1, this
   specification enforces more strict rules. Nodes and properties should
   be comprised only of ASCII characters 'a' to 'z', '0' to
   '9', ',', '.', '_', '+', '#', '?', and '-'. Node names additionally
   allow uppercase characters 'A' to 'Z' (property names should be
   lowercase. The fact that vendors like Apple don't respect this rule
   is irrelevant here).
   Additionally, node and property names should always begin with a
   character in the range 'a' to 'z' (or 'A' to 'Z' for node names).

   The maximum number of characters for both nodes and property names
   is 31. In the case of node names, this is only the leftmost part of
   a unit name (the pure "name" property), it doesn't include the unit
   address which can extend beyond that limit.


5) Required nodes and properties
--------------------------------

  a) The root node

  The root node requires some properties to be present:

    - model : this is your board name/model
    - #address-cells : address representation for "root" devices
    - #size-cells : the size representation for "root" devices

  Additionally, some recommended properties are:

    - compatible : the board "family" generally finds its way here,
      for example, if you have 2 board models with a similar layout,
      that typically get driven by the same platform code in the
      kernel, you would use a different "model" property but put a
      value in "compatible". The kernel doesn't directly use that
      value (see /chosen/linux,platform for how the kernel chooses a
      platform type) but it is generally useful.

  It's also generally where you add additional properties specific
  to your board like the serial number if any, that sort of thing. It
  is recommended that if you add any "custom" property whose name may
  clash with standard defined ones, you prefix them with your vendor
  name and a comma.

  b) The /cpus node

  This node is the parent of all individual CPU nodes. It doesn't
  have any specific requirements, though it's generally good practice
  to have at least:

               #address-cells = <00000001>
               #size-cells    = <00000000>

  This defines that the "address" for a CPU is a single cell, and has
  no meaningful size. This is not necessary but the kernel will assume
  that format when reading the "reg" properties of a CPU node, see
  below

  c) The /cpus/* nodes

  So under /cpus, you are supposed to create a node for every CPU on
  the machine.
  There is no specific restriction on the name of the CPU, though
  it's common practice to call it PowerPC,<name>. For example, Apple
  uses PowerPC,G5 while IBM uses PowerPC,970FX.

  Required properties:

    - device_type : has to be "cpu"
    - reg : This is the physical cpu number, it's a single 32 bits
      cell. This is also used as-is as the unit number for
      constructing the unit name in the full path, for example, with
      2 CPUs, you would have the full paths:
        /cpus/PowerPC,970FX@0
        /cpus/PowerPC,970FX@1
      (unit addresses do not require leading zeroes)
    - d-cache-line-size : one cell, L1 data cache line size in bytes
    - i-cache-line-size : one cell, L1 instruction cache line size in
      bytes
    - d-cache-size : one cell, size of L1 data cache in bytes
    - i-cache-size : one cell, size of L1 instruction cache in bytes

  Recommended properties:

    - timebase-frequency : a cell indicating the frequency of the
      timebase in Hz. This is not directly used by the generic code,
      but you are welcome to copy/paste the pSeries code for setting
      the kernel timebase/decrementer calibration based on this
      value.
    - clock-frequency : a cell indicating the CPU core clock frequency
      in Hz. A new property will be defined for 64 bits values, but if
      your frequency is < 4GHz, one cell is enough. Here as well as
      for the above, the common code doesn't use that property, but
      you are welcome to re-use the pSeries or Maple one. A future
      kernel version might provide a common function for this.

  You are welcome to add any property you find relevant to your board,
  like some information about the mechanism used to soft-reset the
  CPUs. For example, Apple puts the GPIO number for CPU soft reset
  lines in there as a "soft-reset" property as they start secondary
  CPUs by soft-resetting them.

  d) the /memory node(s)

  To define the physical memory layout of your board, you should
  create one or more memory node(s). You can either create a single
  node with all memory ranges in its reg property, or you can create
  several nodes, as you wish.
The unit address (@ part) used for the full path is the address of the
first range of memory defined by a given node. If you use a single
memory node, this will typically be @0.

Required properties:

    - device_type : has to be "memory"
    - reg : This property contains all the physical memory ranges of
      your board. It's a list of addresses/sizes concatenated
      together, the number of cells of each being defined by the
      #address-cells and #size-cells of the root node. For example,
      with both of these properties being 2 like in the example given
      earlier, a 970 based machine with 6Gb of RAM could typically
      have a "reg" property here that looks like:

          00000000 00000000 00000000 80000000
          00000001 00000000 00000001 00000000

      That is a range starting at 0 of 0x80000000 bytes and a range
      starting at 0x100000000 and of 0x100000000 bytes. You can see
      that there is no memory covering the IO hole between 2Gb and
      4Gb. Some vendors prefer splitting those ranges into smaller
      segments; the kernel doesn't care.

e) The /chosen node

This node is a bit "special". Normally, that's where Open Firmware
puts some variable environment information, like the arguments, or
phandle pointers to nodes like the main interrupt controller, or the
default input/output devices.

This specification makes a few of these mandatory, but also defines
some Linux-specific properties that would normally be constructed by
the prom_init() trampoline when booting with an OF client interface,
but that you have to provide yourself when using the flattened format.

Required properties:

    - linux,platform : This is your platform number as assigned by the
      architecture maintainers

Recommended properties:

    - bootargs : This zero-terminated string is passed as the kernel
      command line
    - linux,stdout-path : This is the full path to your standard
      console device if any.
      Typically, if you have serial devices on your board, you may
      want to put the full path to the one set as the default console
      in the firmware here, for the kernel to pick it up as its own
      default console. If you look at the function
      set_preferred_console() in arch/ppc64/kernel/setup.c, you'll see
      that the kernel tries to find out the default console and has
      knowledge of various types like 8250 serial ports. You may want
      to extend this function to add your own.
    - interrupt-controller : This is one cell containing a phandle
      value that matches the "linux,phandle" property of your main
      interrupt controller node. May be used for interrupt routing.

This is all that is currently required. However, it is strongly
recommended that you expose PCI host bridges as documented in the
PCI binding to Open Firmware, and your interrupt tree as documented
in the OF interrupt tree specification.


IV - Recommendation for a bootloader
====================================

Here are some various ideas/recommendations that have been proposed
while all this has been defined and implemented.

  - It should be possible to write a parser that turns an ASCII
    representation of a device-tree (or even XML, though I find that
    less readable) into a device-tree block. This would basically
    allow building the device-tree structure and strings "blobs" at
    bootloader build time, and have the bootloader just pass them
    as-is to the kernel. In fact, the device-tree blob could then be
    separate from the bootloader itself, and be placed in a separate
    portion of the flash that can be "personalized" for different
    board types by flashing a different device-tree.

  - The bootloader may want to be able to use the device-tree itself
    and may want to manipulate it (to add/edit some properties,
    like physical memory size or kernel arguments). At this point, 2
    choices can be made.
    Either the bootloader works directly on the flattened format, or
    the bootloader has its own internal tree representation with
    pointers (similar to the kernel one) and re-flattens the tree
    when booting the kernel. The former is a bit more difficult to
    edit/modify, the latter probably requires a bit more code to
    handle the tree structure. Note that the structure format has
    been designed so it's relatively easy to "insert" properties or
    nodes or delete them by just memmove'ing things around. It
    contains no internal offsets or pointers for this purpose.

  - An example of code for iterating nodes & retrieving properties
    directly from the flattened tree format can be found in the
    kernel file arch/ppc64/kernel/prom.c; look at the scan_flat_dt()
    function, its usage in early_init_devtree(), and the
    corresponding various early_init_dt_scan_*() callbacks. That
    code can be re-used in a GPL bootloader, and as the author of
    that code, I would be happy to discuss possible free licensing
    with any vendor who wishes to integrate all or part of this code
    into a non-GPL bootloader.

From benh at kernel.crashing.org Tue May 24 14:35:51 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 24 May 2005 14:35:51 +1000
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (v3)
In-Reply-To: <1116909286.4992.17.camel@gaston>
References: <1116909286.4992.17.camel@gaston>
Message-ID: <1116909351.4992.19.camel@gaston>

On Tue, 2005-05-24 at 14:34 +1000, Benjamin Herrenschmidt wrote:
> Hi !
>
> Here's revision 3 of the spec for the booting of linux/ppc64 with a
> flattened device-tree. The novelty is that I added a new more compact
> format. A followup mail will have the kernel patches to add support to
> this new format, I'll submit them upstream for after 2.6.12 I think.
>
> David and I are still working on sample code & tools. We have a
> prototype of a device-tree "compiler" that can build the flattened blob
> from a textual representation.
> We'll release that soon, hopefully this
> week.

And here is the patch:

Index: linux-work/arch/ppc64/kernel/prom_init.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/prom_init.c	2005-05-23 18:15:28.000000000 +1000
+++ linux-work/arch/ppc64/kernel/prom_init.c	2005-05-24 14:10:00.000000000 +1000
@@ -1514,7 +1514,14 @@
 	return 0;
 }

-static void __init scan_dt_build_strings(phandle node, unsigned long *mem_start,
+/*
+ * The Open Firmware 1275 specification states properties must be 31 bytes or
+ * less, however not all firmwares obey this. Make it 64 bytes to be safe.
+ */
+#define MAX_PROPERTY_NAME 64
+
+static void __init scan_dt_build_strings(phandle node,
+					 unsigned long *mem_start,
 					 unsigned long *mem_end)
 {
 	unsigned long offset = reloc_offset();
@@ -1527,14 +1534,19 @@
 	/* get and store all property names */
 	prev_name = RELOC("");
 	for (;;) {
-
-		/* 32 is max len of name including nul. */
-		namep = make_room(mem_start, mem_end, 32, 1);
+		namep = make_room(mem_start, mem_end, MAX_PROPERTY_NAME, 1);
 		if (call_prom("nextprop", 3, 1, node, prev_name, namep) <= 0) {
 			/* No more nodes: unwind alloc */
 			*mem_start = (unsigned long)namep;
 			break;
 		}
+
+		/* skip "name" */
+		if (strcmp(namep, RELOC("name")) == 0) {
+			*mem_start = (unsigned long)namep;
+			prev_name = RELOC("name");
+			continue;
+		}
 		soff = dt_find_string(namep);
 		if (soff != 0) {
 			*mem_start = (unsigned long)namep;
@@ -1555,12 +1567,6 @@
 	}
 }

-/*
- * The Open Firmware 1275 specification states properties must be 31 bytes or
- * less, however not all firmwares obey this. Make it 64 bytes to be safe.
- */
-#define MAX_PROPERTY_NAME 64
-
 static void __init scan_dt_build_struct(phandle node, unsigned long *mem_start,
 					unsigned long *mem_end)
 {
@@ -1570,10 +1576,8 @@
 	unsigned long soff;
 	unsigned char *valp;
 	unsigned long offset = reloc_offset();
-	char pname[MAX_PROPERTY_NAME];
-	char *path;
-
-	path = RELOC(prom_scratch);
+	static char pname[MAX_PROPERTY_NAME] __initdata;
+	char *path, *p, *lp;

 	dt_push_token(OF_DT_BEGIN_NODE, mem_start, mem_end);

@@ -1588,10 +1592,20 @@
 			call_prom("package-to-path", 3, 1, node, namep, l);
 		}
 		namep[l] = '\0';
-		*mem_start = _ALIGN(((unsigned long) namep) + strlen(namep) + 1, 4);
+		/* now try to find the unit name in that mess */
+		p = namep;
+		lp = NULL;
+		for (lp = NULL; *p; p++)
+			if (*p == '/')
+				lp = p + 1;
+		if (lp != NULL)
+			memmove(namep, lp, strlen(lp) + 1);
+		*mem_start = _ALIGN(((unsigned long) namep) +
+				    strlen(namep) + 1, 4);
 	}

 	/* get it again for debugging */
+	path = RELOC(prom_scratch);
 	memset(path, 0, PROM_SCRATCH_SIZE);
 	call_prom("package-to-path", 3, 1, node, path, PROM_SCRATCH_SIZE-1);

@@ -1599,20 +1613,24 @@
 	prev_name = RELOC("");
 	sstart = (char *)RELOC(dt_string_start);
 	for (;;) {
-		if (call_prom("nextprop", 3, 1, node, prev_name, pname) <= 0)
+		if (call_prom("nextprop", 3, 1, node, prev_name,
+			      RELOC(pname)) <= 0)
 			break;
-
+		if (strcmp(RELOC(pname), RELOC("name")) == 0) {
+			prev_name = RELOC("name");
+			continue;
+		}
 		/* find string offset */
-		soff = dt_find_string(pname);
+		soff = dt_find_string(RELOC(pname));
 		if (soff == 0) {
-			prom_printf("WARNING: Can't find string index for <%s>, node %s\n",
-				    pname, path);
+			prom_printf("WARNING: Can't find string index "
+				    "for <%s>, node %s\n", RELOC(pname), path);
 			break;
 		}
 		prev_name = sstart + soff;

 		/* get length */
-		l = call_prom("getproplen", 2, 1, node, pname);
+		l = call_prom("getproplen", 2, 1, node, RELOC(pname));

 		/* sanity checks */
 		if (l < 0)
@@ -1621,7 +1639,7 @@
 			prom_printf("WARNING: ignoring large property ");
 			/* It seems OF doesn't null-terminate the path
:-( */
 			prom_printf("[%s] ", path);
-			prom_printf("%s length 0x%x\n", pname, l);
+			prom_printf("%s length 0x%x\n", RELOC(pname), l);
 			continue;
 		}

@@ -1633,15 +1651,15 @@
 		/* push property content */
 		align = (l >= 8) ? 8 : 4;
 		valp = make_room(mem_start, mem_end, l, align);
-		call_prom("getprop", 4, 1, node, pname, valp, l);
+		call_prom("getprop", 4, 1, node, RELOC(pname), valp, l);
 		*mem_start = _ALIGN(*mem_start, 4);
 	}

 	/* Add a "linux,phandle" property. */
 	soff = dt_find_string(RELOC("linux,phandle"));
 	if (soff == 0)
-		prom_printf("WARNING: Can't find string index for "
-			    " node %s\n", path);
+		prom_printf("WARNING: Can't find string index for"
+			    " node %s\n", path);
 	else {
 		dt_push_token(OF_DT_PROP, mem_start, mem_end);
 		dt_push_token(4, mem_start, mem_end);
@@ -1691,7 +1709,8 @@

 	/* Build header and make room for mem rsv map */
 	mem_start = _ALIGN(mem_start, 4);
-	hdr = make_room(&mem_start, &mem_end, sizeof(struct boot_param_header), 4);
+	hdr = make_room(&mem_start, &mem_end,
+			sizeof(struct boot_param_header), 4);
 	RELOC(dt_header_start) = (unsigned long)hdr;
 	rsvmap = make_room(&mem_start, &mem_end, sizeof(mem_reserve_map), 8);

@@ -1704,11 +1723,11 @@
 	namep = make_room(&mem_start, &mem_end, 16, 1);
 	strcpy(namep, RELOC("linux,phandle"));
 	mem_start = (unsigned long)namep + strlen(namep) + 1;
-	RELOC(dt_string_end) = mem_start;

 	/* Build string array */
 	prom_printf("Building dt strings...\n");
 	scan_dt_build_strings(root, &mem_start, &mem_end);
+	RELOC(dt_string_end) = mem_start;

 	/* Build structure */
 	mem_start = PAGE_ALIGN(mem_start);
@@ -1723,9 +1742,11 @@
 	hdr->totalsize = RELOC(dt_struct_end) - RELOC(dt_header_start);
 	hdr->off_dt_struct = RELOC(dt_struct_start) - RELOC(dt_header_start);
 	hdr->off_dt_strings = RELOC(dt_string_start) - RELOC(dt_header_start);
+	hdr->dt_strings_size = RELOC(dt_string_end) - RELOC(dt_string_start);
 	hdr->off_mem_rsvmap = ((unsigned long)rsvmap) - RELOC(dt_header_start);
 	hdr->version = OF_DT_VERSION;
-	hdr->last_comp_version = 1;
+	/* Version
16 is not backward compatible */
+	hdr->last_comp_version = 0x10;

 	/* Reserve the whole thing and copy the reserve map in, we
 	 * also bump mem_reserve_cnt to cause further reservations to
Index: linux-work/arch/ppc64/kernel/setup.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/setup.c	2005-05-23 18:15:28.000000000 +1000
+++ linux-work/arch/ppc64/kernel/setup.c	2005-05-23 18:16:09.000000000 +1000
@@ -10,7 +10,7 @@
  * 2 of the License, or (at your option) any later version.
  */

-#undef DEBUG
+#define DEBUG

 #include
 #include
Index: linux-work/arch/ppc64/kernel/prom.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/prom.c	2005-05-23 18:15:28.000000000 +1000
+++ linux-work/arch/ppc64/kernel/prom.c	2005-05-24 14:12:35.000000000 +1000
@@ -15,7 +15,7 @@
  * 2 of the License, or (at your option) any later version.
  */

-#undef DEBUG
+#define DEBUG

 #include
 #include
@@ -635,20 +635,23 @@
  * unflatten the tree
  */
 static int __init scan_flat_dt(int (*it)(unsigned long node,
-					  const char *full_path, void *data),
+					  const char *uname, int depth, void *data),
 			       void *data)
 {
 	unsigned long p = ((unsigned long)initial_boot_params) +
 		initial_boot_params->off_dt_struct;
 	int rc = 0;
+	int depth = -1;

 	do {
 		u32 tag = *((u32 *)p);
 		char *pathp;

 		p += 4;
-		if (tag == OF_DT_END_NODE)
+		if (tag == OF_DT_END_NODE) {
+			depth --;
 			continue;
+		}
 		if (tag == OF_DT_END)
 			break;
 		if (tag == OF_DT_PROP) {
@@ -664,9 +667,18 @@
 			       " device tree !\n", tag);
 			return -EINVAL;
 		}
+		depth++;
 		pathp = (char *)p;
 		p = _ALIGN(p + strlen(pathp) + 1, 4);
-		rc = it(p, pathp, data);
+		if ((*pathp) == '/') {
+			char *lp, *np;
+			for (lp = NULL, np = pathp; *np; np++)
+				if ((*np) == '/')
+					lp = np+1;
+			if (lp != NULL)
+				pathp = lp;
+		}
+		rc = it(p, pathp, depth, data);
 		if (rc != 0)
 			break;
 	} while(1);
@@ -727,13 +739,16 @@
 static unsigned long __init unflatten_dt_node(unsigned long mem,
 					      unsigned long *p,
 					      struct device_node
*dad,
-					      struct device_node ***allnextpp)
+					      struct device_node ***allnextpp,
+					      unsigned long fpsize)
 {
 	struct device_node *np;
 	struct property *pp, **prev_pp = NULL;
 	char *pathp;
 	u32 tag;
-	unsigned int l;
+	unsigned int l, allocl;
+	int has_name = 0;
+	int new_format = 0;

 	tag = *((u32 *)(*p));
 	if (tag != OF_DT_BEGIN_NODE) {
@@ -742,21 +757,56 @@
 	}
 	*p += 4;
 	pathp = (char *)*p;
-	l = strlen(pathp) + 1;
+	l = allocl = strlen(pathp) + 1;
 	*p = _ALIGN(*p + l, 4);

-	np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + l,
+	/* version 0x10 has a more compact unit name here instead of the full
+	 * path. we accumulate the full path size using "fpsize", we'll rebuild
+	 * it later. We detect this because the first character of the name is
+	 * not '/'.
+	 */
+	if ((*pathp) != '/') {
+		new_format = 1;
+		if (fpsize == 0) {
+			/* root node: special case. fpsize accounts for path
+			 * plus terminating zero. root node only has '/', so
+			 * fpsize should be 2, but we want to avoid the first
+			 * level nodes to have two '/' so we use fpsize 1 here
+			 */
+			fpsize = 1;
+			allocl = 2;
+		} else {
+			/* account for '/' and path size minus terminal 0
+			 * already in 'l'
+			 */
+			fpsize += l;
+			allocl = fpsize;
+		}
+	}
+
+
+	np = unflatten_dt_alloc(&mem, sizeof(struct device_node) + allocl,
 				__alignof__(struct device_node));
 	if (allnextpp) {
 		memset(np, 0, sizeof(*np));
 		np->full_name = ((char*)np) + sizeof(struct device_node);
-		memcpy(np->full_name, pathp, l);
+		if (new_format) {
+			char *p = np->full_name;
+			/* rebuild full path for new format */
+			if (dad && dad->parent) {
+				strcpy(p, dad->full_name);
+				p += strlen(p);
+			}
+			*(p++) = '/';
+			memcpy(p, pathp, l);
+		} else
+			memcpy(np->full_name, pathp, l);
 		prev_pp = &np->properties;
 		**allnextpp = np;
 		*allnextpp = &np->allnext;
 		if (dad != NULL) {
 			np->parent = dad;
-			/* we temporarily use the `next' field as `last_child'.
*/
+			/* we temporarily use the next field as `last_child'*/
 			if (dad->next == 0)
 				dad->child = np;
 			else
@@ -782,6 +832,8 @@
 			printk("Can't find property name in list !\n");
 			break;
 		}
+		if (strcmp(pname, "name") == 0)
+			has_name = 1;
 		l = strlen(pname) + 1;
 		pp = unflatten_dt_alloc(&mem, sizeof(struct property),
 					__alignof__(struct property));
@@ -801,6 +853,29 @@
 		}
 		*p = _ALIGN((*p) + sz, 4);
 	}
+	/* with version 0x10 we may not have the name property, recreate
+	 * it here from the unit name if absent
+	 */
+	if (!has_name) {
+		char *p = pathp;
+		int sz;
+
+		DBG("no name, building from %s\n", p);
+		while (*p && (*p) != '@')
+			p++;
+		sz = (p - pathp) + 1;
+		pp = unflatten_dt_alloc(&mem, sizeof(struct property) + sz,
+					__alignof__(struct property));
+		if (allnextpp) {
+			pp->name = "name";
+			pp->length = sz;
+			pp->value = (unsigned char *)(pp + 1);
+			*prev_pp = pp;
+			prev_pp = &pp->next;
+			memcpy(pp->value, pathp, sz - 1);
+			((char *)pp->value)[sz - 1] = 0;
+		}
+	}
 	if (allnextpp) {
 		*prev_pp = NULL;
 		np->name = get_property(np, "name", NULL);
@@ -812,7 +887,7 @@
 			np->type = "";
 	}
 	while (tag == OF_DT_BEGIN_NODE) {
-		mem = unflatten_dt_node(mem, p, np, allnextpp);
+		mem = unflatten_dt_node(mem, p, np, allnextpp, fpsize);
 		tag = *((u32 *)(*p));
 	}
 	if (tag != OF_DT_END_NODE) {
@@ -842,7 +917,7 @@
 	/* First pass, scan for size */
 	start = ((unsigned long)initial_boot_params) +
 		initial_boot_params->off_dt_struct;
-	size = unflatten_dt_node(0, &start, NULL, NULL);
+	size = unflatten_dt_node(0, &start, NULL, NULL, 0);

 	DBG(" size is %lx, allocating...\n", size);

@@ -854,7 +929,7 @@
 	/* Second pass, do actual unflattening */
 	start = ((unsigned long)initial_boot_params) +
 		initial_boot_params->off_dt_struct;
-	unflatten_dt_node(mem, &start, NULL, &allnextp);
+	unflatten_dt_node(mem, &start, NULL, &allnextp, 0);
 	if (*((u32 *)start) != OF_DT_END)
 		printk(KERN_WARNING "Weird tag at end of tree: %x\n", *((u32 *)start));
 	*allnextp = NULL;
@@ -880,7 +955,7 @@


 static int __init
early_init_dt_scan_cpus(unsigned long node,
-					  const char *full_path, void *data)
+					  const char *uname, int depth, void *data)
 {
 	char *type = get_flat_dt_prop(node, "device_type", NULL);
 	u32 *prop;
@@ -933,13 +1008,15 @@
 }

 static int __init early_init_dt_scan_chosen(unsigned long node,
-					    const char *full_path, void *data)
+					    const char *uname, int depth, void *data)
 {
 	u32 *prop;
 	u64 *prop64;
 	extern unsigned long memory_limit, tce_alloc_start, tce_alloc_end;

-	if (strcmp(full_path, "/chosen") != 0)
+	DBG("search \"chosen\", depth: %d, uname: %s\n", depth, uname);
+
+	if (depth != 1 || strcmp(uname, "chosen") != 0)
 		return 0;

 	/* get platform type */
@@ -989,18 +1066,20 @@
 }

 static int __init early_init_dt_scan_root(unsigned long node,
-					  const char *full_path, void *data)
+					  const char *uname, int depth, void *data)
 {
 	u32 *prop;

-	if (strcmp(full_path, "/") != 0)
+	if (depth != 0)
 		return 0;

 	prop = (u32 *)get_flat_dt_prop(node, "#size-cells", NULL);
 	dt_root_size_cells = (prop == NULL) ? 1 : *prop;
-
+	DBG("dt_root_size_cells = %x\n", dt_root_size_cells);
+
 	prop = (u32 *)get_flat_dt_prop(node, "#address-cells", NULL);
 	dt_root_addr_cells = (prop == NULL) ?
2 : *prop;
+	DBG("dt_root_addr_cells = %x\n", dt_root_addr_cells);

 	/* break now */
 	return 1;
@@ -1028,7 +1107,7 @@


 static int __init early_init_dt_scan_memory(unsigned long node,
-					    const char *full_path, void *data)
+					    const char *uname, int depth, void *data)
 {
 	char *type = get_flat_dt_prop(node, "device_type", NULL);
 	cell_t *reg, *endp;
@@ -1044,7 +1123,9 @@

 	endp = reg + (l / sizeof(cell_t));

-	DBG("memory scan node %s ...\n", full_path);
+	DBG("memory scan node %s ..., reg size %ld, data: %x %x %x %x, ...\n",
+	    uname, l, reg[0], reg[1], reg[2], reg[3]);
+
 	while ((endp - reg) >= (dt_root_addr_cells + dt_root_size_cells)) {
 		unsigned long base, size;

@@ -1455,10 +1536,11 @@
 	struct device_node *np = allnodes;

 	read_lock(&devtree_lock);
-	for (; np != 0; np = np->allnext)
+	for (; np != 0; np = np->allnext) {
 		if (np->full_name != 0 && strcasecmp(np->full_name, path) == 0
 		    && of_node_get(np))
 			break;
+	}
 	read_unlock(&devtree_lock);
 	return np;
 }
Index: linux-work/include/asm-ppc64/prom.h
===================================================================
--- linux-work.orig/include/asm-ppc64/prom.h	2005-05-23 18:15:28.000000000 +1000
+++ linux-work/include/asm-ppc64/prom.h	2005-05-23 18:16:09.000000000 +1000
@@ -22,13 +22,14 @@
 #define RELOC(x) (*PTRRELOC(&(x)))

 /* Definitions used by the flattened device tree */
-#define OF_DT_HEADER		0xd00dfeed	/* 4: version, 4: total size */
-#define OF_DT_BEGIN_NODE	0x1		/* Start node: full name */
+#define OF_DT_HEADER		0xd00dfeed	/* marker */
+#define OF_DT_BEGIN_NODE	0x1		/* Start of node, full name */
 #define OF_DT_END_NODE		0x2		/* End node */
-#define OF_DT_PROP		0x3		/* Property: name off, size, content */
+#define OF_DT_PROP		0x3		/* Property: name off, size,
+						 * content */
 #define OF_DT_END		0x9

-#define OF_DT_VERSION		1
+#define OF_DT_VERSION		0x10

 /*
  * This is what gets passed to the kernel by prom_init or kexec
@@ -54,7 +55,9 @@
 	u32	version;		/* format version */
 	u32	last_comp_version;	/* last compatible version */
 	/* version 2 fields
below */
-	u32	boot_cpuid_phys;	/* Which physical CPU id we're booting on */
+	u32	boot_cpuid_phys;	/* Physical CPU id we're booting on */
+	/* version 3 fields below */
+	u32	dt_strings_size;	/* size of the DT strings block */
 };


@@ -169,17 +172,20 @@
  */
 #define HAVE_ARCH_DEVTREE_FIXUPS

-static inline void set_node_proc_entry(struct device_node *dn, struct proc_dir_entry *de)
+static inline void set_node_proc_entry(struct device_node *dn,
+				       struct proc_dir_entry *de)
 {
 	dn->pde = de;
 }

-static void inline set_node_name_link(struct device_node *dn, struct proc_dir_entry *de)
+static void inline set_node_name_link(struct device_node *dn,
+				      struct proc_dir_entry *de)
 {
 	dn->name_link = de;
 }

-static void inline set_node_addr_link(struct device_node *dn, struct proc_dir_entry *de)
+static void inline set_node_addr_link(struct device_node *dn,
+				      struct proc_dir_entry *de)
 {
 	dn->addr_link = de;
 }

From segher at kernel.crashing.org Tue May 24 17:50:30 2005
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 24 May 2005 09:50:30 +0200
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (v3)
In-Reply-To: <1116909351.4992.19.camel@gaston>
References: <1116909286.4992.17.camel@gaston> <1116909351.4992.19.camel@gaston>
Message-ID: <78325f65fc807da213be423c20ce6dcd@kernel.crashing.org>

> +/*
> + * The Open Firmware 1275 specification states properties must be 31
> bytes or
> + * less, however not all firmwares obey this. Make it 64 bytes to be
> safe.
> + */

Not true. There is no restriction on the length of properties
(or property _names_, which I guess is what you meant). There
is the 31-char restriction on _node_ names, though (and the node
name is only the part before the optional @ or : -- and at a
minimum, the part after @ but before : is needed to disambiguate
the node).

I didn't review the rest of the code, sorry.
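
To make the node naming rule above concrete, here is a small sketch
that splits one path component of the form name@unit-address:arguments
and applies the 31-character limit to the name part only. The struct
and function names are made up for this illustration; they do not exist
in the kernel or in Open Firmware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Limit on the node name part, as stated in the discussion above. */
#define OF_NODE_NAME_MAX 31

/* Pieces of one path component "name@unit-address:arguments" (illustrative). */
struct unit_name_parts {
	size_t name_len;	/* length of the node name proper */
	const char *addr;	/* unit address, or NULL if there is no '@' */
	size_t addr_len;	/* length of the unit address */
	const char *args;	/* arguments after ':', or NULL */
};

/* Split a path component in place; nothing is copied or modified. */
void split_unit_name(const char *s, struct unit_name_parts *out)
{
	const char *rest;

	/* The node name ends at the first '@' or ':' (or end of string). */
	out->name_len = strcspn(s, "@:");
	out->addr = NULL;
	out->addr_len = 0;
	out->args = NULL;

	rest = s + out->name_len;
	if (*rest == '@') {
		/* The unit address runs up to the optional ':' */
		out->addr = rest + 1;
		out->addr_len = strcspn(out->addr, ":");
		rest = out->addr + out->addr_len;
	}
	if (*rest == ':')
		out->args = rest + 1;
}

/* The 31-character limit applies to the name part only. */
bool node_name_len_ok(const struct unit_name_parts *parts)
{
	return parts->name_len > 0 && parts->name_len <= OF_NODE_NAME_MAX;
}
```

For "PowerPC,970FX@0" this gives a 13-character node name and the unit
address "0", with no arguments; a plain "chosen" has neither a unit
address nor arguments.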

Segher

From segher at kernel.crashing.org Tue May 24 17:53:41 2005
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 24 May 2005 09:53:41 +0200
Subject: device-tree in U-Boot
In-Reply-To: <1116889207.5124.10.camel@gaston>
References: <1116889207.5124.10.camel@gaston>
Message-ID: <46aadf0955bce007db9d859af2fe324d@kernel.crashing.org>

> Note that this is a full featured implementation of the OF client
> interface code,

This is very much not true. It is impossible to fully implement
the OF client interface without implementing _all_ of OF.

The Linux kernel is fine with it (right now), though. It doesn't
do much more than just copying the device tree, anyway.

> which may be a lot more than what you are looking for.

True.

Segher

From benh at kernel.crashing.org Tue May 24 17:55:02 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Tue, 24 May 2005 17:55:02 +1000
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (v3)
In-Reply-To: <78325f65fc807da213be423c20ce6dcd@kernel.crashing.org>
References: <1116909286.4992.17.camel@gaston> <1116909351.4992.19.camel@gaston> <78325f65fc807da213be423c20ce6dcd@kernel.crashing.org>
Message-ID: <1116921302.6395.6.camel@gaston>

On Tue, 2005-05-24 at 09:50 +0200, Segher Boessenkool wrote:
> > +/*
> > + * The Open Firmware 1275 specification states properties must be 31
> > bytes or
> > + * less, however not all firmwares obey this. Make it 64 bytes to be
> > safe.
> > + */
>
> Not true. There is no restriction on the length of properties
> (or property _names_, which I guess is what you meant).

Yes, we meant names, and yes, there is a limit of 31 characters + 0,
that is 32 bytes (see "3.2.2.1.1 Property names" in 1275. BTW. Do you
have a useable version of this document ? mine is a totally broken PDF
made from a crappy .ps :)

IBM firmware broke that limit in a couple of places though...
> There > is the 31-char restriction on _node_ names, though (and the node > name is only the part before the optional @ or : -- and at a > minimum, the part after @ but before : is needed to disambiguate > the node). Yes, I know all that. > I didn't review the rest of the code, sorry. No worries :) Ben. From benh at kernel.crashing.org Tue May 24 17:56:57 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Tue, 24 May 2005 17:56:57 +1000 Subject: device-tree in U-Boot In-Reply-To: <46aadf0955bce007db9d859af2fe324d@kernel.crashing.org> References: <1116889207.5124.10.camel@gaston> <46aadf0955bce007db9d859af2fe324d@kernel.crashing.org> Message-ID: <1116921417.6395.9.camel@gaston> On Tue, 2005-05-24 at 09:53 +0200, Segher Boessenkool wrote: > > Note that this is a full featured implementation of the OF client > > interface code, > > This is very much not true. It is impossible to fully implement > the OF client interface without implementing _all_ of OF. Heh, come on Segher, please :) Ok, it's a _partial_ implementation of the OF client interface that provides enough to make the kernel's prom_init() trampoline happy. It's still a lot more code that providing just a flattened device tree though :) > The Linux kernel is fine with it (right now), though. It doesn't > do much more than just copying the device tree, anyway. Well, part of this spec is to also define what content in the device tree the kernel is expecting. Ben. 

From segher at kernel.crashing.org Tue May 24 18:01:32 2005
From: segher at kernel.crashing.org (Segher Boessenkool)
Date: Tue, 24 May 2005 10:01:32 +0200
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (v3)
In-Reply-To: <1116921302.6395.6.camel@gaston>
References: <1116909286.4992.17.camel@gaston> <1116909351.4992.19.camel@gaston> <78325f65fc807da213be423c20ce6dcd@kernel.crashing.org> <1116921302.6395.6.camel@gaston>
Message-ID:

> Yes, we meant names, and yes, there is a limit of 31 characters + 0,

Nah, it's thirty-one chars, not 31. Sorry for missing this.

> that is 32 bytes (see "3.2.2.1.1 Property names" in 1275. BTW. Do you
> have a useable version of this document ? mine is a totally broken PDF
> made from a crappy .ps :)

Mine is a perfectly working PDF from a perfectly working PS. I'll
send it to you.

> IBM firmware broke that limit in a couple of places though...

That's what happens if there isn't a testsuite anywhere.

> Yes, I know all that.

Good :-)

Segher

From jdl at freescale.com Wed May 25 01:11:15 2005
From: jdl at freescale.com (Jon Loeliger)
Date: Tue, 24 May 2005 10:11:15 -0500
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (v3)
In-Reply-To: <1116909286.4992.17.camel@gaston>
References: <1116909286.4992.17.camel@gaston>
Message-ID: <1116947474.4720.3.camel@cashmere.sps.mot.com>

On Mon, 2005-05-23 at 23:34, Benjamin Herrenschmidt wrote:
> Hi !

> David and I are still working on sample code & tools. We have a
> prototype of a device-tree "compiler" that can build the flattened blob
> from a textual representation. We'll release that soon, hopefully this
> week.

Hi Ben,

This is most excellent. In particular, I won't start to
do just that step too... :-) So, can you put me on the
beta tester list for those tools? Or are you just going
to post it all here?

Thanks,
jdl

From benh at kernel.crashing.org Wed May 25 07:48:29 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Wed, 25 May 2005 07:48:29 +1000
Subject: Booting the Linux/ppc64 kernel without Open Firmware HOWTO (v3)
In-Reply-To: <1116947474.4720.3.camel@cashmere.sps.mot.com>
References: <1116909286.4992.17.camel@gaston> <1116947474.4720.3.camel@cashmere.sps.mot.com>
Message-ID: <1116971310.6395.13.camel@gaston>

On Tue, 2005-05-24 at 10:11 -0500, Jon Loeliger wrote:
> On Mon, 2005-05-23 at 23:34, Benjamin Herrenschmidt wrote:
> > Hi !
>
> > David and I are still working on sample code & tools. We have a
> > prototype of a device-tree "compiler" that can build the flattened blob
> > from a textual representation. We'll release that soon, hopefully this
> > week.
>
> Hi Ben,
>
> This is most excellent. In particular, I won't start to
> do just that step too... :-) So, can you put me on the
> beta tester list for those tools? Or are you just going
> to post it all here?

We'll post it all here :)

Ben.

From sleddog at us.ibm.com Wed May 25 08:23:16 2005
From: sleddog at us.ibm.com (Dave C Boutcher)
Date: Tue, 24 May 2005 17:23:16 -0500
Subject: openpower 720, unknown Firmware/POST Error Code
In-Reply-To: <200505231309.j4ND9Cc9027300@falcon30.maxeymade.com>
References: <20050523093334.GA27320@suse.de> <200505231309.j4ND9Cc9027300@falcon30.maxeymade.com>
Message-ID: <20050524222316.GA23671@cs.umn.edu>

On Mon, May 23, 2005 at 08:09:12AM -0500, Doug Maxey wrote:
>
> On Mon, 23 May 2005 11:33:34 +0200, Olaf Hering wrote:
> >
> >Any idea what this number on the openpower 720 frontpanel means?
> >
> >b2018127
> >
> B2pp8127 Description:
>
> Phyp was unable to allocate the memory needed to build the PFDS.
>
> pp == partition number. Not sure about the PFDS. Sounds like
> something was changed, configuration wise.

FWIW, PFDS is "Partition Firmware Data Structure". I believe this
is the structure passed to Open Firmware from which it builds the
device tree.

--
Dave Boutcher

From sfr at canb.auug.org.au Wed May 25 13:41:26 2005
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 25 May 2005 13:41:26 +1000
Subject: [PATCH] ppc64: fix initialisation of gettimeofday calculations
Message-ID: <20050525134126.3382e519.sfr@canb.auug.org.au>

Hi Andrew,

On PPC64, we keep track of when we need to update jiffies (and the
variables used to calculate the time of day) based on the time base. If
the time base frequency is sufficiently high compared to the processor
clock frequency, then it is possible for the time of day variables to be
corrupted at the time of the first decrementer interrupt we take.

This became obvious on a legacy iSeries where the time base frequency is
the same as the processor clock.

This one line patch fixes the initialisation so that the time of day
variables and the indicator we use to tell when updates are due are
better synchronised.

Signed-off-by: Stephen Rothwell

Please apply and send to Linus.
--
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruNp linus/arch/ppc64/kernel/time.c linus-clock.1/arch/ppc64/kernel/time.c
--- linus/arch/ppc64/kernel/time.c	2005-05-17 09:23:08.000000000 +1000
+++ linus-clock.1/arch/ppc64/kernel/time.c	2005-05-25 13:11:14.000000000 +1000
@@ -515,6 +515,7 @@ void __init time_init(void)
 	do_gtod.varp = &do_gtod.vars[0];
 	do_gtod.var_idx = 0;
 	do_gtod.varp->tb_orig_stamp = tb_last_stamp;
+	get_paca()->next_jiffy_update_tb = tb_last_stamp + tb_ticks_per_jiffy;
 	do_gtod.varp->stamp_xsec = xtime.tv_sec * XSEC_PER_SEC;
 	do_gtod.tb_ticks_per_sec = tb_ticks_per_sec;
 	do_gtod.varp->tb_to_xs = tb_to_xs;
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050525/f80f8994/attachment.pgp

From olh at suse.de Wed May 25 15:59:27 2005
From: olh at suse.de (Olaf Hering)
Date: Wed, 25 May 2005 07:59:27 +0200
Subject: openpower 720, unknown Firmware/POST Error Code
In-Reply-To: <20050524222316.GA23671@cs.umn.edu>
References: <20050523093334.GA27320@suse.de> <200505231309.j4ND9Cc9027300@falcon30.maxeymade.com> <20050524222316.GA23671@cs.umn.edu>
Message-ID: <20050525055927.GA6642@suse.de>

On Tue, May 24, Dave C Boutcher wrote:

> On Mon, May 23, 2005 at 08:09:12AM -0500, Doug Maxey wrote:
> >
> > On Mon, 23 May 2005 11:33:34 +0200, Olaf Hering wrote:
> > >
> > >Any idea what this number on the openpower 720 frontpanel means?
> > >
> > >b2018127
> > >
> > B2pp8127 Description:
> >
> > Phyp was unable to allocate the memory needed to build the PFDS.
> >
> > pp == partition number. Not sure about the PFDS. Sounds like
> > something was changed, configuration wise.
>
> FWIW, PFDS is "Partition Firmware Data Structure". I believe this
> is the structure passed to Open Firmware from which it builds the
> device tree.

It rebooted ok in a second attempt. I don't know if anyone changed
anything, it's very unlikely. Firmware is version SF222_075

From sfr at canb.auug.org.au Wed May 25 16:29:26 2005
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Wed, 25 May 2005 16:29:26 +1000
Subject: [PATCH] ppc64 iSeries: fix boot time setting
Message-ID: <20050525162926.7691bb34.sfr@canb.auug.org.au>

Hi all,

For quite a while, there has existed a hypervisor bug on legacy iSeries
which means that we do not get the boot time set in the kernel. This
patch works around that bug. This was most noticeable when the root
partition needed to be checked at every boot as the kernel thought it was
some time in 1905 until user mode reset the time correctly.
Signed-off-by: Stephen Rothwell --- arch/ppc64/kernel/mf.c | 85 +++++++++++++++++++++++++++++++---------- arch/ppc64/kernel/rtc.c | 39 ------------------ include/asm-ppc64/iSeries/mf.h | 1 3 files changed, 67 insertions(+), 58 deletions(-) Please apply and send to Linus. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff -ruNp linus-clock.1/arch/ppc64/kernel/mf.c linus-clock.2/arch/ppc64/kernel/mf.c --- linus-clock.1/arch/ppc64/kernel/mf.c 2005-05-17 09:23:07.000000000 +1000 +++ linus-clock.2/arch/ppc64/kernel/mf.c 2005-05-25 14:44:14.000000000 +1000 @@ -1,7 +1,7 @@ /* * mf.c * Copyright (C) 2001 Troy D. Armstrong IBM Corporation - * Copyright (C) 2004 Stephen Rothwell IBM Corporation + * Copyright (C) 2004-2005 Stephen Rothwell IBM Corporation * * This modules exists as an interface between a Linux secondary partition * running on an iSeries and the primary partition's Virtual Service @@ -36,10 +36,12 @@ #include #include +#include #include #include #include #include +#include /* * This is the structure layout for the Machine Facilites LPAR event @@ -696,36 +698,23 @@ static void get_rtc_time_complete(void * complete(&rtc->com); } -int mf_get_rtc(struct rtc_time *tm) +static int rtc_set_tm(int rc, u8 *ce_msg, struct rtc_time *tm) { - struct ce_msg_comp_data ce_complete; - struct rtc_time_data rtc_data; - int rc; - - memset(&ce_complete, 0, sizeof(ce_complete)); - memset(&rtc_data, 0, sizeof(rtc_data)); - init_completion(&rtc_data.com); - ce_complete.handler = &get_rtc_time_complete; - ce_complete.token = &rtc_data; - rc = signal_ce_msg_simple(0x40, &ce_complete); - if (rc) - return rc; - wait_for_completion(&rtc_data.com); tm->tm_wday = 0; tm->tm_yday = 0; tm->tm_isdst = 0; - if (rtc_data.rc) { + if (rc) { tm->tm_sec = 0; tm->tm_min = 0; tm->tm_hour = 0; tm->tm_mday = 15; tm->tm_mon = 5; tm->tm_year = 52; - return rtc_data.rc; + return rc; } - if ((rtc_data.ce_msg.ce_msg[2] == 0xa9) || - (rtc_data.ce_msg.ce_msg[2] == 0xaf)) 
{ + if ((ce_msg[2] == 0xa9) || + (ce_msg[2] == 0xaf)) { /* TOD clock is not set */ tm->tm_sec = 1; tm->tm_min = 1; @@ -736,7 +725,6 @@ int mf_get_rtc(struct rtc_time *tm) mf_set_rtc(tm); } { - u8 *ce_msg = rtc_data.ce_msg.ce_msg; u8 year = ce_msg[5]; u8 sec = ce_msg[6]; u8 min = ce_msg[7]; @@ -765,6 +753,63 @@ int mf_get_rtc(struct rtc_time *tm) return 0; } +int mf_get_rtc(struct rtc_time *tm) +{ + struct ce_msg_comp_data ce_complete; + struct rtc_time_data rtc_data; + int rc; + + memset(&ce_complete, 0, sizeof(ce_complete)); + memset(&rtc_data, 0, sizeof(rtc_data)); + init_completion(&rtc_data.com); + ce_complete.handler = &get_rtc_time_complete; + ce_complete.token = &rtc_data; + rc = signal_ce_msg_simple(0x40, &ce_complete); + if (rc) + return rc; + wait_for_completion(&rtc_data.com); + return rtc_set_tm(rtc_data.rc, rtc_data.ce_msg.ce_msg, tm); +} + +struct boot_rtc_time_data { + int busy; + struct ce_msg_data ce_msg; + int rc; +}; + +static void get_boot_rtc_time_complete(void *token, struct ce_msg_data *ce_msg) +{ + struct boot_rtc_time_data *rtc = token; + + memcpy(&rtc->ce_msg, ce_msg, sizeof(rtc->ce_msg)); + rtc->rc = 0; + rtc->busy = 0; +} + +int mf_get_boot_rtc(struct rtc_time *tm) +{ + struct ce_msg_comp_data ce_complete; + struct boot_rtc_time_data rtc_data; + int rc; + + memset(&ce_complete, 0, sizeof(ce_complete)); + memset(&rtc_data, 0, sizeof(rtc_data)); + rtc_data.busy = 1; + ce_complete.handler = &get_boot_rtc_time_complete; + ce_complete.token = &rtc_data; + rc = signal_ce_msg_simple(0x40, &ce_complete); + if (rc) + return rc; + /* We need to poll here as we are not yet taking interrupts */ + while (rtc_data.busy) { + extern unsigned long lpevent_count; + struct ItLpQueue *lpq = get_paca()->lpqueue_ptr; + if (lpq && ItLpQueue_isLpIntPending(lpq)) + lpevent_count += ItLpQueue_process(lpq, NULL); + } + return rtc_set_tm(rtc_data.rc, rtc_data.ce_msg.ce_msg, tm); +} + int mf_set_rtc(struct rtc_time *tm) { char ce_time[12]; diff -ruNp 
linus-clock.1/arch/ppc64/kernel/rtc.c linus-clock.2/arch/ppc64/kernel/rtc.c --- linus-clock.1/arch/ppc64/kernel/rtc.c 2005-05-17 09:23:08.000000000 +1000 +++ linus-clock.2/arch/ppc64/kernel/rtc.c 2005-05-25 13:59:16.000000000 +1000 @@ -292,47 +292,10 @@ int iSeries_set_rtc_time(struct rtc_time void iSeries_get_boot_time(struct rtc_time *tm) { - unsigned long time; - static unsigned long lastsec = 1; - - u32 dataWord1 = *((u32 *)(&xSpCommArea.xBcdTimeAtIplStart)); - u32 dataWord2 = *(((u32 *)&(xSpCommArea.xBcdTimeAtIplStart)) + 1); - int year = 1970; - int year1 = ( dataWord1 >> 24 ) & 0x000000FF; - int year2 = ( dataWord1 >> 16 ) & 0x000000FF; - int sec = ( dataWord1 >> 8 ) & 0x000000FF; - int min = dataWord1 & 0x000000FF; - int hour = ( dataWord2 >> 24 ) & 0x000000FF; - int day = ( dataWord2 >> 8 ) & 0x000000FF; - int mon = dataWord2 & 0x000000FF; - if ( piranha_simulator ) return; - BCD_TO_BIN(sec); - BCD_TO_BIN(min); - BCD_TO_BIN(hour); - BCD_TO_BIN(day); - BCD_TO_BIN(mon); - BCD_TO_BIN(year1); - BCD_TO_BIN(year2); - year = year1 * 100 + year2; - - time = mktime(year, mon, day, hour, min, sec); - time += ( jiffies / HZ ); - - /* Now THIS is a nasty hack! - * It ensures that the first two calls get different answers. - * That way the loop in init_time (time.c) will not think - * the clock is stuck. 
- */ - if ( lastsec ) { - time -= lastsec; - --lastsec; - } - - to_tm(time, tm); - tm->tm_year -= 1900; + mf_get_boot_rtc(tm); tm->tm_mon -= 1; } #endif diff -ruNp linus-clock.1/include/asm-ppc64/iSeries/mf.h linus-clock.2/include/asm-ppc64/iSeries/mf.h --- linus-clock.1/include/asm-ppc64/iSeries/mf.h 2005-05-17 09:25:49.000000000 +1000 +++ linus-clock.2/include/asm-ppc64/iSeries/mf.h 2005-05-25 14:12:40.000000000 +1000 @@ -52,6 +52,7 @@ extern void mf_clear_src(void); extern void mf_init(void); extern int mf_get_rtc(struct rtc_time *tm); +extern int mf_get_boot_rtc(struct rtc_time *tm); extern int mf_set_rtc(struct rtc_time *tm); #endif /* _ASM_PPC64_ISERIES_MF_H */ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050525/634ba722/attachment.pgp From olh at suse.de Wed May 25 16:54:10 2005 From: olh at suse.de (Olaf Hering) Date: Wed, 25 May 2005 08:54:10 +0200 Subject: panic reboot stuck in rtas_os_term In-Reply-To: <20050508091533.GA30450@suse.de> References: <20050508083331.GA30329@suse.de> <20050508091533.GA30450@suse.de> Message-ID: <20050525065410.GA9430@suse.de> On Sun, May 08, Olaf Hering wrote: > On Sun, May 08, Olaf Hering wrote: > > > > > A panic does not trigger a reboot anymore on JS20, rtas_os_term() is stuck in > > RTAS. .config is the defconfig. > > The panic reboot works on a p630, but I miss the 'rebooting in 180 seconds' message. > > Any ideas how to fix that? 
> >
> > VFS: Cannot open root device "" or unknown-block(8,2)
> > Please append a correct "root=" boot option
> > Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,2)
> > rtas [swapper]: Entering rtas_call
> > rtas [swapper]: token = 0x1a
> > rtas [swapper]: nargs = 1
> > rtas [swapper]: nret = 1
> > rtas [swapper]: &outputs = 0x0
> > rtas [swapper]: narg[0] = 0x72cfa8
> > rtas [swapper]: entering rtas with 0x72c728
>
> This appears to be a firmware bug, fails now also with sles9 sp1 kernel.
> ibm,os-term was apparently added with recent firmware updates.

What is the purpose of ibm,os-term anyway?
I just noticed that a POWER4 lpar did not reboot after a panic.
That's clearly not the way panic=$bignum was designed.
Can we make ppc_md.panic optional please?

From tzachi at marvell.com  Wed May 25 21:00:34 2005
From: tzachi at marvell.com (Tzachi Perelstein)
Date: Wed, 25 May 2005 14:00:34 +0300
Subject: io_block_mapping in PPC64?
Message-ID:

> > > In the (32bit) PPC arch there is an io_block_mapping function that can
> > > be used to map the internal SRAM with configurable (e.g. cache
> > > coherence) attributes. Is there any way to do it in PPC64 arch?
> >
> > io_block_mapping() isn't really an interface I would recommend for
> > anybody to use... What about __ioremap() instead ?
> >

I've noticed that CPU access to this SRAM is cache inhibited, although I
called __ioremap with the _PAGE_COHERENT flag. Are there any other _PAGE
flags required?

From ananth at in.ibm.com  Thu May 26 03:01:59 2005
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Wed, 25 May 2005 13:01:59 -0400
Subject: [PATCH] kprobes: fix single-step out of line
Message-ID: <20050525170159.GA9364@in.ibm.com>

Hi,

On Power4 and above, single-stepping out of line fails with an
Instruction Access exception when the instruction copy is in a
kmalloc'ed memory area. Here is a patch that fixes it.
Thanks,
Ananth

The single-step out of line fails with a Trap 0x400 (Instruction Access)
if the copy of the instruction is in a kmalloced memory area. Fix that.
While we are there, correct the case of a kprobe on a trap variant.

Signed-off-by: Ananth N Mavinakayanahalli

 arch/ppc64/kernel/kprobes.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6.12-rc5/arch/ppc64/kernel/kprobes.c
===================================================================
--- linux-2.6.12-rc5.orig/arch/ppc64/kernel/kprobes.c	2005-05-24 23:31:20.000000000 -0400
+++ linux-2.6.12-rc5/arch/ppc64/kernel/kprobes.c	2005-05-25 12:47:33.000000000 -0400
@@ -42,6 +42,7 @@ static struct kprobe *current_kprobe;
 static unsigned long kprobe_status, kprobe_saved_msr;
 static struct pt_regs jprobe_saved_regs;
+static kprobe_opcode_t stepped_insn;
 
 int arch_prepare_kprobe(struct kprobe *p)
 {
@@ -71,11 +72,14 @@ static inline void disarm_kprobe(struct
 static inline void prepare_singlestep(struct kprobe *p, struct pt_regs *regs)
 {
 	regs->msr |= MSR_SE;
-	/*single step inline if it a breakpoint instruction*/
-	if (p->opcode == BREAKPOINT_INSTRUCTION)
+	stepped_insn = *p->ainsn.insn;
+
+	/* single step inline if it is a trap variant */
+	if (IS_TW(stepped_insn) || IS_TD(stepped_insn) ||
+			IS_TWI(stepped_insn) || IS_TDI(stepped_insn))
 		regs->nip = (unsigned long)p->addr;
 	else
-		regs->nip = (unsigned long)&p->ainsn.insn;
+		regs->nip = (unsigned long)&stepped_insn;
 }
 
 static inline int kprobe_handler(struct pt_regs *regs)

From george at mvista.com  Thu May 26 03:30:58 2005
From: george at mvista.com (George Anzinger)
Date: Wed, 25 May 2005 10:30:58 -0700
Subject: [PATCH] ppc64 iSeries: fix boot time setting
In-Reply-To: <20050525162926.7691bb34.sfr@canb.auug.org.au>
References: <20050525162926.7691bb34.sfr@canb.auug.org.au>
Message-ID: <4294B652.1090509@mvista.com>

Am I the only one who has problems with the signature mark ("-- ") prior to the patch?
For example, this is a reply with quoting the original message. My mailer (mozilla) dropped everything after the "-- " in the original. If someone knows how to change mozilla to fix this, I welcome the help. George Stephen Rothwell wrote: > Hi all, > > For quite a while, there has existed a hypervisor bug on legacy iSeries > which means that we do not get the boot time set in the kernel. This > patch works around that bug. This was most noticable when the root > partition needed to be checked at every boot as the kernel thought it was > some time in 1905 until user mode reset the time correctly. > > Signed-off-by: Stephen Rothwell > --- > > arch/ppc64/kernel/mf.c | 85 +++++++++++++++++++++++++++++++---------- > arch/ppc64/kernel/rtc.c | 39 ------------------ > include/asm-ppc64/iSeries/mf.h | 1 > 3 files changed, 67 insertions(+), 58 deletions(-) > > Please apply and send to Linus. -- George Anzinger george at mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ From moilanen at austin.ibm.com Thu May 26 03:53:52 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 25 May 2005 12:53:52 -0500 Subject: panic reboot stuck in rtas_os_term In-Reply-To: <20050525065410.GA9430@suse.de> References: <20050508083331.GA30329@suse.de> <20050508091533.GA30450@suse.de> <20050525065410.GA9430@suse.de> Message-ID: <20050525125352.2d08dc52.moilanen@austin.ibm.com> > What is the purpose of ibm,os-term anyway? > I just noticed that a POWER4 lpar did not reboot after a panic. > Thats clearly not the way panic=$bignum was designed. > Can we make ppc_md.panic optional please? ibm,os-term is used to tell RTAS that the OS has finished normal operation. RTAS will enact the service policy at that point (for example on old systems it would "call home"). I remember finding a bug (which I thought I fixed) that fixed a problem w/ the schedule timeout in the panic code: Novell 75391. 
Jake From olh at suse.de Thu May 26 04:24:13 2005 From: olh at suse.de (Olaf Hering) Date: Wed, 25 May 2005 20:24:13 +0200 Subject: panic reboot stuck in rtas_os_term In-Reply-To: <20050525125352.2d08dc52.moilanen@austin.ibm.com> References: <20050508083331.GA30329@suse.de> <20050508091533.GA30450@suse.de> <20050525065410.GA9430@suse.de> <20050525125352.2d08dc52.moilanen@austin.ibm.com> Message-ID: <20050525182413.GA18053@suse.de> On Wed, May 25, Jake Moilanen wrote: > > What is the purpose of ibm,os-term anyway? > > I just noticed that a POWER4 lpar did not reboot after a panic. > > Thats clearly not the way panic=$bignum was designed. > > Can we make ppc_md.panic optional please? > > ibm,os-term is used to tell RTAS that the OS has finished normal > operation. RTAS will enact the service policy at that point (for > example on old systems it would "call home"). How do I change that policy on a js20? Having a stuck system is as bad as this stupid panic blinking on i386. > I remember finding a bug (which I thought I fixed) that fixed a problem > w/ the schedule timeout in the panic code: Novell 75391. But the rtas call does not return, so the kernel cant do anything. From moilanen at austin.ibm.com Thu May 26 04:46:17 2005 From: moilanen at austin.ibm.com (Jake Moilanen) Date: Wed, 25 May 2005 13:46:17 -0500 Subject: panic reboot stuck in rtas_os_term In-Reply-To: <20050525182413.GA18053@suse.de> References: <20050508083331.GA30329@suse.de> <20050508091533.GA30450@suse.de> <20050525065410.GA9430@suse.de> <20050525125352.2d08dc52.moilanen@austin.ibm.com> <20050525182413.GA18053@suse.de> Message-ID: <20050525134617.24a2c940.moilanen@austin.ibm.com> > How do I change that policy on a js20? Having a stuck system is as bad > as this stupid panic blinking on i386. Normally that is setup through the service processor (interfaced through the ASM, HMC, or SP menus). But on a JS20 there is not a fully featured SP. So there is no simple method that I know of. 
You may be able to do it through the ibm,set-system-parameter RTAS call.
It would require some research to figure out exactly what needs to be
passed in.

Is it bothering you that much?

> > I remember finding a bug (which I thought I fixed) that fixed a
> > problem w/ the schedule timeout in the panic code: Novell 75391.
>
> But the rtas call does not return, so the kernel can't do anything.

There was a firmware problem w/ os-term awhile ago (returned hardware
error). It may be related. How old is your firmware?

Jake

From holindho at cs.helsinki.fi  Thu May 26 04:42:42 2005
From: holindho at cs.helsinki.fi (Heikki Lindholm)
Date: Wed, 25 May 2005 21:42:42 +0300
Subject: Finding module TOC address
Message-ID: <4294C722.4030207@cs.helsinki.fi>

Hello,
I need to use an assembly jump to a function in a module. How do I find out
the TOC that the function expects to find in r2? A quick peek at
kernel/module.c didn't help much, maybe someone here could enlighten me.

Regards,
Heikki Lindholm

From benh at kernel.crashing.org  Thu May 26 07:29:58 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Thu, 26 May 2005 07:29:58 +1000
Subject: io_block_mapping in PPC64?
In-Reply-To:
References:
Message-ID: <1117056598.5076.4.camel@gaston>

On Wed, 2005-05-25 at 14:00 +0300, Tzachi Perelstein wrote:
> > > > In the (32bit) PPC arch there is an io_block_mapping function that can
> > > > be used to map the internal SRAM with configurable (e.g. cache
> > > > coherence) attributes. Is there any way to do it in PPC64 arch?
> > >
> > > io_block_mapping() isn't really an interface I would recommend for
> > > anybody to use... What about __ioremap() instead ?
> > >
>
> I've noticed that CPU access to this SRAM is cache inhibited, although I
> called __ioremap with the _PAGE_COHERENT flag. Are there any other _PAGE
> flags required?

No, if you didn't set _PAGE_NOCACHE, it should be set as _PAGE_KERNEL |
_PAGE_COHERENT which is fine.
I don't see why you would get cache inhibited accesses that way. Ben From benh at kernel.crashing.org Thu May 26 08:06:07 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 26 May 2005 08:06:07 +1000 Subject: Finding module TOC address In-Reply-To: <4294C722.4030207@cs.helsinki.fi> References: <4294C722.4030207@cs.helsinki.fi> Message-ID: <1117058767.5076.27.camel@gaston> On Wed, 2005-05-25 at 21:42 +0300, Heikki Lindholm wrote: > Hello, > I need to use assembly jump to a function in a module. How do I find out > the TOC that the function expects to find in r2? A quick peek at > kernel/module.c didn't help much, maybe someone here could enlighten me. You don't, there is no TOC on ppc32 ABI, at least the kernel version of it. r2 is reserved in the kernel and always contains "current". Ben. From strosake at austin.ibm.com Thu May 26 11:16:13 2005 From: strosake at austin.ibm.com (Mike Strosaker) Date: Wed, 25 May 2005 20:16:13 -0500 Subject: panic reboot stuck in rtas_os_term In-Reply-To: <20050525134617.24a2c940.moilanen@austin.ibm.com> References: <20050508083331.GA30329@suse.de> <20050508091533.GA30450@suse.de> <20050525065410.GA9430@suse.de> <20050525125352.2d08dc52.moilanen@austin.ibm.com> <20050525182413.GA18053@suse.de> <20050525134617.24a2c940.moilanen@austin.ibm.com> Message-ID: <4295235D.3010908@austin.ibm.com> Jake Moilanen wrote: >>How do I change that policy on a js20? Having a stuck system is as bad >>as this stupid panic blinking on i386. > > > Normally that is setup through the service processor (interfaced > through the ASM, HMC, or SP menus). But on a JS20 there is not a fully > featured SP. So there is no simple method that I know of. You may be > able to do it through the ibm,set-sysetm-parameter RTAS call. It would > require some research to figure out exactly what needs to be passed in. 
The ibm,os-auto-restart variable in NVRAM (in the common partition, I think), is set to 'x' by default, indicating that the OS doesn't have a restart policy set. The OS is supposed to set the policy to '1' or '0' to indicate whether it wants to be restarted after a crash. Unfortunately, the variable is always reset to 'x' on reboot. I don't know if these exist on the JS20, but there may also be sp-plt-reboot and sp-os-plt-reboot variables in NVRAM. The former sets the platform's restart policy; the latter indicates whether the platform's or OS's reboot policy should be used. If they exist, I would expect that setting the former to 'y' and the latter to 'n' would also cause os-term to reboot. Mike From benh at kernel.crashing.org Thu May 26 16:49:58 2005 From: benh at kernel.crashing.org (Benjamin Herrenschmidt) Date: Thu, 26 May 2005 16:49:58 +1000 Subject: Finding module TOC address In-Reply-To: <1117058767.5076.27.camel@gaston> References: <4294C722.4030207@cs.helsinki.fi> <1117058767.5076.27.camel@gaston> Message-ID: <1117090199.9076.115.camel@gaston> On Thu, 2005-05-26 at 08:06 +1000, Benjamin Herrenschmidt wrote: > On Wed, 2005-05-25 at 21:42 +0300, Heikki Lindholm wrote: > > Hello, > > I need to use assembly jump to a function in a module. How do I find out > > the TOC that the function expects to find in r2? A quick peek at > > kernel/module.c didn't help much, maybe someone here could enlighten me. > > You don't, there is no TOC on ppc32 ABI, at least the kernel version of > it. r2 is reserved in the kernel and always contains "current". Ooops ! As you stated privately, wrong list :) So yes, ppc64 has a TOC :) When you take the function pointer, what you should obtain is a descriptor containing the actual pointer and the TOC (unless you are calling the .xxx symbol, but that isn't very good to do). So if you are doing a pointer-based jump, just load the TOC pointer along with the function pointer from the descriptor. 
If you are doing a direct jump (bl .blabla), I'm not 100% sure, you'll
probably just load the non-dot symbol address in a register to get the
descriptor, and then peek it for the toc value... Alan ? Is there some
better way ?

Ben.

From sfr at canb.auug.org.au  Thu May 26 17:18:06 2005
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Thu, 26 May 2005 17:18:06 +1000
Subject: [PATCH] ppc64 iSeries: make virtual DVD-RAMs writable again
Message-ID: <20050526171806.2a996350.sfr@canb.auug.org.au>

Hi Andrew,

It appears that another test has been added in the Uniform CDROM layer
that must be passed before a DVD-RAM is considered writable. This patch
implements an emulation of the needed packet command for the viocd
driver.

Signed-off-by: Stephen Rothwell
-- 
Cheers,
Stephen Rothwell                    sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

diff -ruNp linus/drivers/cdrom/viocd.c linus-viocd.dvd/drivers/cdrom/viocd.c
--- linus/drivers/cdrom/viocd.c	2005-05-20 09:03:44.000000000 +1000
+++ linus-viocd.dvd/drivers/cdrom/viocd.c	2005-05-13 16:00:10.000000000 +1000
@@ -488,6 +488,20 @@ static int viocd_packet(struct cdrom_dev
 				& (CDC_DVD_RAM | CDC_RAM)) != 0;
 		}
 		break;
+	case GPCMD_GET_CONFIGURATION:
+		if (cgc->cmd[3] == CDF_RWRT) {
+			struct rwrt_feature_desc *rfd = (struct rwrt_feature_desc *)(cgc->buffer + sizeof(struct feature_header));
+
+			if ((buflen >=
+			     (sizeof(struct feature_header) + sizeof(*rfd))) &&
+			    (cdi->ops->capability & ~cdi->mask
+			     & (CDC_DVD_RAM | CDC_RAM))) {
+				rfd->feature_code = cpu_to_be16(CDF_RWRT);
+				rfd->curr = 1;
+				ret = 0;
+			}
+		}
+		break;
 	default:
 		if (cgc->sense) {
 			/* indicate Unknown code */

From amodra at bigpond.net.au  Thu May 26 17:33:07 2005
From: amodra at bigpond.net.au (Alan Modra)
Date: Thu, 26 May 2005 17:03:07 +0930
Subject: Finding module TOC address
In-Reply-To: <1117090199.9076.115.camel@gaston>
References: <4294C722.4030207@cs.helsinki.fi>
	<1117058767.5076.27.camel@gaston>
	<1117090199.9076.115.camel@gaston>
Message-ID:
<20050526073307.GO11771@bubble.grove.modra.org>

On Thu, May 26, 2005 at 04:49:58PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2005-05-26 at 08:06 +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2005-05-25 at 21:42 +0300, Heikki Lindholm wrote:
> > > Hello,
> > > I need to use assembly jump to a function in a module. How do I find out
> > > the TOC that the function expects to find in r2? A quick peek at
> > > kernel/module.c didn't help much, maybe someone here could enlighten me.
> >
> > You don't, there is no TOC on ppc32 ABI, at least the kernel version of
> > it. r2 is reserved in the kernel and always contains "current".
>
> Ooops ! As you stated privately, wrong list :)
>
> So yes, ppc64 has a TOC :)
>
> When you take the function pointer, what you should obtain is a
> descriptor containing the actual pointer and the TOC (unless you are
> calling the .xxx symbol, but that isn't very good to do).
>
> So if you are doing a pointer-based jump, just load the TOC pointer
> along with the function pointer from the descriptor. If you are doing a
> direct jump (bl .blabla), I'm not 100% sure, you'll probably just load
> the non-dot symbol address in a register to get the descriptor, and then
> peek it for the toc value... Alan ? Is there some better way ?

Rusty will know for sure. I haven't kept up with how kernel modules
work, but I think that you just use

	bl .foo
	nop

The kernel module support turns this into

	bl foo_stub
	ld 2,40(1)

and generates foo_stub to save r2 at 40(1), load r2 from the descriptor
and jump to the address in the descriptor. That means the module itself
had better have .opd entries for functions.
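
For illustration only, here is a minimal C sketch of the descriptor lookup discussed in this thread. It assumes the usual ppc64 ELF ABI layout of an .opd entry (entry address, TOC base, environment pointer); the struct and helper names are made up for this sketch and are not the kernel's own.

```c
#include <stdint.h>

/*
 * Assumed layout of a ppc64 ELF ABI function descriptor (.opd entry).
 * The names below are hypothetical, chosen for this sketch only.
 */
struct func_desc {
	uint64_t entry;	/* address of the code, i.e. the ".foo" symbol */
	uint64_t toc;	/* TOC base the function expects in r2 */
	uint64_t env;	/* environment pointer, unused by C code */
};

/* Address to branch to when calling through the descriptor. */
static inline uint64_t desc_entry(const struct func_desc *d)
{
	return d->entry;
}

/* Value the caller must place in r2 before branching. */
static inline uint64_t desc_toc(const struct func_desc *d)
{
	return d->toc;
}
```

A caller written in assembly would save its own r2, load both doublewords from the descriptor, branch to the entry address, and restore r2 once the call returns.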
-- 
Alan Modra
IBM OzLabs - Linux Technology Centre

From holindho at cs.helsinki.fi  Thu May 26 19:22:25 2005
From: holindho at cs.helsinki.fi (Heikki Lindholm)
Date: Thu, 26 May 2005 12:22:25 +0300
Subject: Finding module TOC address
In-Reply-To: <1117090199.9076.115.camel@gaston>
References: <4294C722.4030207@cs.helsinki.fi>
	<1117058767.5076.27.camel@gaston>
	<1117090199.9076.115.camel@gaston>
Message-ID: <42959551.2050003@cs.helsinki.fi>

Benjamin Herrenschmidt wrote:
> On Thu, 2005-05-26 at 08:06 +1000, Benjamin Herrenschmidt wrote:
>
>>On Wed, 2005-05-25 at 21:42 +0300, Heikki Lindholm wrote:
>>
>>>Hello,
>>>I need to use assembly jump to a function in a module. How do I find out
>>>the TOC that the function expects to find in r2? A quick peek at
>>>kernel/module.c didn't help much, maybe someone here could enlighten me.
>>
>>You don't, there is no TOC on ppc32 ABI, at least the kernel version of
>>it. r2 is reserved in the kernel and always contains "current".
>
>
> Ooops ! As you stated privately, wrong list :)
>
> So yes, ppc64 has a TOC :)
>
> When you take the function pointer, what you should obtain is a
> descriptor containing the actual pointer and the TOC (unless you are
> calling the .xxx symbol, but that isn't very good to do).
>
> So if you are doing a pointer-based jump, just load the TOC pointer
> along with the function pointer from the descriptor. If you are doing a
> direct jump (bl .blabla), I'm not 100% sure, you'll probably just load
> the non-dot symbol address in a register to get the descriptor, and then
> peek it for the toc value... Alan ? Is there some better way ?
>
> Ben.
>

Thanks! Problem solved.
-- Heikki Lindholm From olh at suse.de Thu May 26 19:57:46 2005 From: olh at suse.de (Olaf Hering) Date: Thu, 26 May 2005 11:57:46 +0200 Subject: panic reboot stuck in rtas_os_term In-Reply-To: <4295235D.3010908@austin.ibm.com> References: <20050508083331.GA30329@suse.de> <20050508091533.GA30450@suse.de> <20050525065410.GA9430@suse.de> <20050525125352.2d08dc52.moilanen@austin.ibm.com> <20050525182413.GA18053@suse.de> <20050525134617.24a2c940.moilanen@austin.ibm.com> <4295235D.3010908@austin.ibm.com> Message-ID: <20050526095746.GA24463@suse.de> On Wed, May 25, Mike Strosaker wrote: > The ibm,os-auto-restart variable in NVRAM (in the common partition, I > think), is set to 'x' by default, indicating that the OS doesn't have a > restart policy set. The OS is supposed to set the policy to '1' or '0' to > indicate whether it wants to be restarted after a crash. Unfortunately, > the variable is always reset to 'x' on reboot. Thanks for the hint, I will try this. > I don't know if these exist on the JS20, but there may also be > sp-plt-reboot and sp-os-plt-reboot variables in NVRAM. The former sets the > platform's restart policy; the latter indicates whether the platform's or > OS's reboot policy should be used. If they exist, I would expect that > setting the former to 'y' and the latter to 'n' would also cause os-term to > reboot. A JS20 model 'IBM,8842-41X' has ibm,os-auto-restart, but model 'IBM,8842-21X' doesnt have it. I can probably add it with nvsetenv. From sven.luther at wanadoo.fr Fri May 27 23:25:38 2005 From: sven.luther at wanadoo.fr (Sven Luther) Date: Fri, 27 May 2005 15:25:38 +0200 Subject: [PATCH] [PPC64] override command line AS/LD/CC variables when adding -m64 and co for biarch compilers. 
Message-ID: <20050527132538.GA28407@pegasos>

Hello,

The following kind of call currently fails:

  make ARCH=ppc64 CC="gcc-3.4"

This is because the code for detecting a biarch compiler and adding the
needed 64bit magic argument fails if the AS/LD/CC commands are
overridden on the command line.

The attached patch fixes this by using the make override and +=
directives, but I am not 100% sure this will work without gmake, as I am
no Makefile expert.

Friendly,

Sven Luther
-------------- next part --------------
--- linux-2.6.11/arch/ppc64/Makefile.orig	2005-05-27 13:24:25.000000000 +0000
+++ linux-2.6.11/arch/ppc64/Makefile	2005-05-27 13:24:36.000000000 +0000
@@ -17,9 +17,9 @@
 
 HAS_BIARCH := $(call cc-option-yn, -m64)
 ifeq ($(HAS_BIARCH),y)
-AS := $(AS) -a64
-LD := $(LD) -m elf64ppc
-CC := $(CC) -m64
+override AS += -a64
+override LD += -m elf64ppc
+override CC += -m64
 endif
 
 new_nm := $(shell if $(NM) --help 2>&1 | grep -- '--synthetic' > /dev/null; then echo y; else echo n; fi)

From ananth at in.ibm.com  Sat May 28 01:37:44 2005
From: ananth at in.ibm.com (Ananth N Mavinakayanahalli)
Date: Fri, 27 May 2005 11:37:44 -0400
Subject: [PATCH 1/2] kprobes: return correct error value on kprobe registration failure
Message-ID: <20050527153744.GA6395@in.ibm.com>

Hi,

This patch adds stricter checks during kprobe registration. We weren't
returning a correct value on a registration request at an invalid
address. Please apply.

Regards,
Ananth

Add stricter checks during kprobe registration. Return correct error
value so insmod doesn't succeed. Also printk reason for registration
failure.
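
As a rough stand-alone illustration of the alignment test this patch adds (not the patch itself): PowerPC instructions are 4 bytes long, so a valid probe address has its two low bits clear. The function name below is hypothetical and exists only for this sketch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical user-space version of the alignment check.
 * Returns 0 for a word-aligned address and -EINVAL otherwise,
 * mirroring the error value the patch returns on failure.
 */
static inline int check_probe_addr(uintptr_t addr)
{
	if (addr & 0x03)
		return -EINVAL;	/* unaligned: refuse the registration */
	return 0;
}
```

Returning a negative errno rather than 1 is what lets the caller, and ultimately insmod, see the registration as failed.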
Signed-off-by: Ananth N Mavinakayanahalli arch/ppc64/kernel/kprobes.c | 13 +++++++++---- 1 files changed, 9 insertions(+), 4 deletions(-) Index: linux-2.6.12-rc5/arch/ppc64/kernel/kprobes.c =================================================================== --- linux-2.6.12-rc5.orig/arch/ppc64/kernel/kprobes.c 2005-05-27 14:35:12.000000000 -0400 +++ linux-2.6.12-rc5/arch/ppc64/kernel/kprobes.c 2005-05-27 14:40:42.000000000 -0400 @@ -45,12 +45,17 @@ static kprobe_opcode_t stepped_insn; int arch_prepare_kprobe(struct kprobe *p) { + int ret = 0; kprobe_opcode_t insn = *p->addr; - if (IS_MTMSRD(insn) || IS_RFID(insn)) - /* cannot put bp on RFID/MTMSRD */ - return 1; - return 0; + if ((unsigned long)p->addr & 0x03) { + printk("Attempt to register kprobe at an unaligned address\n"); + ret = -EINVAL; + } else if (IS_MTMSRD(insn) || IS_RFID(insn)) { + printk("Cannot register a kprobe on rfid or mtmsrd\n"); + ret = -EINVAL; + } + return ret; } void arch_copy_kprobe(struct kprobe *p) From ananth at in.ibm.com Sat May 28 01:43:32 2005 From: ananth at in.ibm.com (Ananth N Mavinakayanahalli) Date: Fri, 27 May 2005 11:43:32 -0400 Subject: [PATCH 2/2] kprobes: remove spurious MSR_SE masking Message-ID: <20050527154332.GB6395@in.ibm.com> Hi, single_step_exception() masks MSR_SE upon entry. We don't have to do it again during kprobe post processing. This also will help applications like itrace that'd want to single-step more than once, after a kprobe hit. Regards, Ananth Remove spurious MSR_SE reset during kprobe processing. single_step_exception() already does it for us. Reset it to be safe when executing the fault_handler. 
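
A small sketch of the bit handling the fault path ends up doing, for illustration only. The MSR_SE bit position is assumed here and the helper name is made up; the real code operates directly on regs->msr.

```c
/* Single-step enable bit; the value is assumed for this sketch. */
#define MSR_SE (1UL << 10)

/*
 * Hypothetical helper: clear single-step, then OR the saved MSR
 * value back in, in the same order as the fault handler does.
 */
static inline unsigned long fault_restore_msr(unsigned long msr,
					      unsigned long saved_msr)
{
	msr &= ~MSR_SE;
	msr |= saved_msr;
	return msr;
}
```

If the saved value had MSR_SE set, single-stepping stays enabled after the restore, which is the behaviour a tool that steps more than once would rely on.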
Signed-off-by: Ananth N Mavinakayanahalli arch/ppc64/kernel/kprobes.c | 3 +-- 1 files changed, 1 insertion(+), 2 deletions(-) Index: linux-2.6.12-rc5/arch/ppc64/kernel/kprobes.c =================================================================== --- linux-2.6.12-rc5.orig/arch/ppc64/kernel/kprobes.c 2005-05-27 14:40:42.000000000 -0400 +++ linux-2.6.12-rc5/arch/ppc64/kernel/kprobes.c 2005-05-27 14:50:38.000000000 -0400 @@ -214,8 +214,6 @@ static void resume_execution(struct kpro ret = emulate_step(regs, p->ainsn.insn[0]); if (ret == 0) regs->nip = (unsigned long)p->addr + 4; - - regs->msr &= ~MSR_SE; } static inline int post_kprobe_handler(struct pt_regs *regs) @@ -260,6 +258,7 @@ static inline int kprobe_fault_handler(s if (kprobe_status & KPROBE_HIT_SS) { resume_execution(current_kprobe, regs); + regs->msr &= ~MSR_SE; regs->msr |= kprobe_saved_msr; unlock_kprobes(); From anton at samba.org Mon May 30 07:32:42 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 30 May 2005 07:32:42 +1000 Subject: [PATCH] remove io_page_mask In-Reply-To: <20050504054258.GK13590@krispykreme> References: <20050504043745.GJ13590@krispykreme> <20050504054258.GK13590@krispykreme> Message-ID: <20050529213242.GA11066@krispykreme> Hi, > I found an issue with the io_page_mask code when pci_probe_only is not > set (we dont initialise io_page_mask and bad things happen). I was > about to fix it up when I wondered if we can remove it now. > > Ben changed the serial code to check before it goes pounding on > addresses. Im not sure if there were other issues with badly behaving > drivers but my js20 boots here with the following removal patch. I tried inserting and removing every module that we currently compile. I found one more fail in the pcspkr driver. Is someone in the mood to convert these two drivers to use the legacy IO helpers? If so we can merge the io port removal patch. 
Anton cpu 0x0: Vector: 300 (Data Access) at [c000000001fb7a20] pc: d0000000001be1b8: .pcspkr_event+0x1b8/0x228 [pcspkr] lr: d0000000001be054: .pcspkr_event+0x54/0x228 [pcspkr] dar: 61 0:mon> t c000000000077f04 .sys_delete_module+0x22c/0x330 cpu 0x1: Vector: 300 (Data Access) at [c000000002e33970] pc: d0000000002ec2f8: .parport_pc_probe_port+0x67c/0xbac [parport_pc] lr: d0000000002ec2d4: .parport_pc_probe_port+0x658/0xbac [parport_pc] dar: 3be 1:mon> t d0000000002edbb8 .parport_pc_init+0x2bc/0x494 [parport_pc] c000000000079fc8 .sys_init_module+0x2cc/0x4a8 From anton at samba.org Mon May 30 09:30:05 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 30 May 2005 09:30:05 +1000 Subject: [PATCH] ppc64: allow timer based profiling on iseries Message-ID: <20050529233005.GE11066@krispykreme> Hi, We used to have an iseries specific profiler that used /proc/profile. Now thats gone we can use the generic timer based stuff. Signed-off-by: Anton Blanchard ===== arch/ppc64/kernel/time.c 1.40 vs edited ===== Index: linux-2.6.12-rc2/arch/ppc64/kernel/time.c =================================================================== --- linux-2.6.12-rc2.orig/arch/ppc64/kernel/time.c 2005-04-30 00:17:20.303032159 +1000 +++ linux-2.6.12-rc2/arch/ppc64/kernel/time.c 2005-05-01 23:42:46.357508524 +1000 @@ -325,9 +325,7 @@ irq_enter(); -#ifndef CONFIG_PPC_ISERIES profile_tick(CPU_PROFILING, regs); -#endif lpaca->lppaca.int_dword.fields.decr_int = 0; From anton at samba.org Mon May 30 09:37:26 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 30 May 2005 09:37:26 +1000 Subject: [PATCH] ppc64: use cpu_has_feature macro Message-ID: <20050529233726.GF11066@krispykreme> Hi, Use the new cpu_has_feature macros instead of open coding it. 
Signed-off-by: Anton Blanchard Index: linux-2.6.12-rc2/arch/ppc64/kernel/pSeries_smp.c =================================================================== --- linux-2.6.12-rc2.orig/arch/ppc64/kernel/pSeries_smp.c 2005-04-26 19:32:20.834218097 +1000 +++ linux-2.6.12-rc2/arch/ppc64/kernel/pSeries_smp.c 2005-04-26 19:32:23.970192943 +1000 @@ -385,7 +385,7 @@ * cpus are assumed to be secondary threads. */ if (system_state < SYSTEM_RUNNING && - cur_cpu_spec->cpu_features & CPU_FTR_SMT && + cpu_has_feature(CPU_FTR_SMT) && !smt_enabled_at_boot && nr % 2 != 0) return 0; @@ -429,8 +429,8 @@ #endif /* Mark threads which are still spinning in hold loops. */ - if (cur_cpu_spec->cpu_features & CPU_FTR_SMT) - for_each_present_cpu(i) { + if (cpu_has_feature(CPU_FTR_SMT)) { + for_each_present_cpu(i) { if (i % 2 == 0) /* * Even-numbered logical cpus correspond to @@ -438,8 +438,9 @@ */ cpu_set(i, of_spin_map); } - else + } else { of_spin_map = cpu_present_map; + } cpu_clear(boot_cpuid, of_spin_map); From anton at samba.org Mon May 30 09:41:54 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 30 May 2005 09:41:54 +1000 Subject: [PATCH] ppc64: set/clear SMT capable bit at boot Message-ID: <20050529234154.GG11066@krispykreme> Paul/Nathan, does this look OK to you? Anton -- Allow the SMT bit to be set/reset at boot, like the ALTIVEC bit. This means we will enable SMT on unknown cpus that support it. 
Signed-off-by: Anton Blanchard Index: linux-2.6.12-rc2/arch/ppc64/kernel/prom.c =================================================================== --- linux-2.6.12-rc2.orig/arch/ppc64/kernel/prom.c 2005-04-26 19:32:20.834218097 +1000 +++ linux-2.6.12-rc2/arch/ppc64/kernel/prom.c 2005-04-26 19:32:23.974192213 +1000 @@ -880,6 +880,7 @@ { char *type = get_flat_dt_prop(node, "device_type", NULL); u32 *prop; + unsigned long size; /* We are scanning "cpu" nodes only */ if (type == NULL || strcmp(type, "cpu") != 0) @@ -925,6 +926,17 @@ cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_ALTIVEC; } + /* + * Check for an SMT capable CPU and set the CPU feature. We do + * this by looking at the size of the ibm,ppc-interrupt-server#s + * property + */ + prop = (u32 *)get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s", + &size); + cur_cpu_spec->cpu_features &= ~CPU_FTR_SMT; + if (prop && ((size / sizeof(u32)) > 1)) + cur_cpu_spec->cpu_features |= CPU_FTR_SMT; + return 0; } From anton at samba.org Mon May 30 10:05:08 2005 From: anton at samba.org (Anton Blanchard) Date: Mon, 30 May 2005 10:05:08 +1000 Subject: [PATCH] ppc64: cleanup SPR definitions Message-ID: <20050530000508.GI11066@krispykreme> There are a bunch of irrelevant SPR definitions in asm/processer.h. Cut them down a bit, also add a DABR_TRANSLATION define which will be used shortly. 
Signed-off-by: Anton Blanchard Index: foobar2/include/asm-ppc64/processor.h =================================================================== --- foobar2.orig/include/asm-ppc64/processor.h 2005-05-04 13:16:21.000000000 +1000 +++ foobar2/include/asm-ppc64/processor.h 2005-05-22 12:29:27.090183375 +1000 @@ -120,103 +120,18 @@ /* Special Purpose Registers (SPRNs)*/ -#define SPRN_CDBCR 0x3D7 /* Cache Debug Control Register */ #define SPRN_CTR 0x009 /* Count Register */ #define SPRN_DABR 0x3F5 /* Data Address Breakpoint Register */ -#define SPRN_DAC1 0x3F6 /* Data Address Compare 1 */ -#define SPRN_DAC2 0x3F7 /* Data Address Compare 2 */ +#define DABR_TRANSLATION (1UL << 2) #define SPRN_DAR 0x013 /* Data Address Register */ -#define SPRN_DBCR 0x3F2 /* Debug Control Regsiter */ -#define DBCR_EDM 0x80000000 -#define DBCR_IDM 0x40000000 -#define DBCR_RST(x) (((x) & 0x3) << 28) -#define DBCR_RST_NONE 0 -#define DBCR_RST_CORE 1 -#define DBCR_RST_CHIP 2 -#define DBCR_RST_SYSTEM 3 -#define DBCR_IC 0x08000000 /* Instruction Completion Debug Evnt */ -#define DBCR_BT 0x04000000 /* Branch Taken Debug Event */ -#define DBCR_EDE 0x02000000 /* Exception Debug Event */ -#define DBCR_TDE 0x01000000 /* TRAP Debug Event */ -#define DBCR_FER 0x00F80000 /* First Events Remaining Mask */ -#define DBCR_FT 0x00040000 /* Freeze Timers on Debug Event */ -#define DBCR_IA1 0x00020000 /* Instr. Addr. Compare 1 Enable */ -#define DBCR_IA2 0x00010000 /* Instr. Addr. Compare 2 Enable */ -#define DBCR_D1R 0x00008000 /* Data Addr. Compare 1 Read Enable */ -#define DBCR_D1W 0x00004000 /* Data Addr. Compare 1 Write Enable */ -#define DBCR_D1S(x) (((x) & 0x3) << 12) /* Data Adrr. Compare 1 Size */ -#define DAC_BYTE 0 -#define DAC_HALF 1 -#define DAC_WORD 2 -#define DAC_QUAD 3 -#define DBCR_D2R 0x00000800 /* Data Addr. Compare 2 Read Enable */ -#define DBCR_D2W 0x00000400 /* Data Addr. Compare 2 Write Enable */ -#define DBCR_D2S(x) (((x) & 0x3) << 8) /* Data Addr. 
Compare 2 Size */ -#define DBCR_SBT 0x00000040 /* Second Branch Taken Debug Event */ -#define DBCR_SED 0x00000020 /* Second Exception Debug Event */ -#define DBCR_STD 0x00000010 /* Second Trap Debug Event */ -#define DBCR_SIA 0x00000008 /* Second IAC Enable */ -#define DBCR_SDA 0x00000004 /* Second DAC Enable */ -#define DBCR_JOI 0x00000002 /* JTAG Serial Outbound Int. Enable */ -#define DBCR_JII 0x00000001 /* JTAG Serial Inbound Int. Enable */ -#define SPRN_DBCR0 0x3F2 /* Debug Control Register 0 */ -#define SPRN_DBCR1 0x3BD /* Debug Control Register 1 */ -#define SPRN_DBSR 0x3F0 /* Debug Status Register */ -#define SPRN_DCCR 0x3FA /* Data Cache Cacheability Register */ -#define DCCR_NOCACHE 0 /* Noncacheable */ -#define DCCR_CACHE 1 /* Cacheable */ -#define SPRN_DCMP 0x3D1 /* Data TLB Compare Register */ -#define SPRN_DCWR 0x3BA /* Data Cache Write-thru Register */ -#define DCWR_COPY 0 /* Copy-back */ -#define DCWR_WRITE 1 /* Write-through */ -#define SPRN_DEAR 0x3D5 /* Data Error Address Register */ #define SPRN_DEC 0x016 /* Decrement Register */ -#define SPRN_DMISS 0x3D0 /* Data TLB Miss Register */ #define SPRN_DSISR 0x012 /* Data Storage Interrupt Status Register */ #define DSISR_NOHPTE 0x40000000 /* no translation found */ #define DSISR_PROTFAULT 0x08000000 /* protection fault */ #define DSISR_ISSTORE 0x02000000 /* access was a store */ #define DSISR_DABRMATCH 0x00400000 /* hit data breakpoint */ #define DSISR_NOSEGMENT 0x00200000 /* STAB/SLB miss */ -#define SPRN_EAR 0x11A /* External Address Register */ -#define SPRN_ESR 0x3D4 /* Exception Syndrome Register */ -#define ESR_IMCP 0x80000000 /* Instr. Machine Check - Protection */ -#define ESR_IMCN 0x40000000 /* Instr. Machine Check - Non-config */ -#define ESR_IMCB 0x20000000 /* Instr. Machine Check - Bus error */ -#define ESR_IMCT 0x10000000 /* Instr. 
Machine Check - Timeout */ -#define ESR_PIL 0x08000000 /* Program Exception - Illegal */ -#define ESR_PPR 0x04000000 /* Program Exception - Priveleged */ -#define ESR_PTR 0x02000000 /* Program Exception - Trap */ -#define ESR_DST 0x00800000 /* Storage Exception - Data miss */ -#define ESR_DIZ 0x00400000 /* Storage Exception - Zone fault */ -#define SPRN_EVPR 0x3D6 /* Exception Vector Prefix Register */ -#define SPRN_HASH1 0x3D2 /* Primary Hash Address Register */ -#define SPRN_HASH2 0x3D3 /* Secondary Hash Address Resgister */ #define SPRN_HID0 0x3F0 /* Hardware Implementation Register 0 */ -#define HID0_EMCP (1<<31) /* Enable Machine Check pin */ -#define HID0_EBA (1<<29) /* Enable Bus Address Parity */ -#define HID0_EBD (1<<28) /* Enable Bus Data Parity */ -#define HID0_SBCLK (1<<27) -#define HID0_EICE (1<<26) -#define HID0_ECLK (1<<25) -#define HID0_PAR (1<<24) -#define HID0_DOZE (1<<23) -#define HID0_NAP (1<<22) -#define HID0_SLEEP (1<<21) -#define HID0_DPM (1<<20) -#define HID0_ICE (1<<15) /* Instruction Cache Enable */ -#define HID0_DCE (1<<14) /* Data Cache Enable */ -#define HID0_ILOCK (1<<13) /* Instruction Cache Lock */ -#define HID0_DLOCK (1<<12) /* Data Cache Lock */ -#define HID0_ICFI (1<<11) /* Instr. Cache Flash Invalidate */ -#define HID0_DCI (1<<10) /* Data Cache Invalidate */ -#define HID0_SPD (1<<9) /* Speculative disable */ -#define HID0_SGE (1<<7) /* Store Gathering Enable */ -#define HID0_SIED (1<<7) /* Serial Instr. 
Execution [Disable] */ -#define HID0_BTIC (1<<5) /* Branch Target Instruction Cache Enable */ -#define HID0_ABE (1<<3) /* Address Broadcast Enable */ -#define HID0_BHTE (1<<2) /* Branch History Table Enable */ -#define HID0_BTCD (1<<1) /* Branch target cache disable */ #define SPRN_MSRDORM 0x3F1 /* Hardware Implementation Register 1 */ #define SPRN_HID1 0x3F1 /* Hardware Implementation Register 1 */ #define SPRN_IABR 0x3F2 /* Instruction Address Breakpoint Register */ @@ -225,23 +140,8 @@ #define SPRN_HID5 0x3F6 /* 970 HID5 */ #define SPRN_TSC 0x3FD /* Thread switch control */ #define SPRN_TST 0x3FC /* Thread switch timeout */ -#define SPRN_IAC1 0x3F4 /* Instruction Address Compare 1 */ -#define SPRN_IAC2 0x3F5 /* Instruction Address Compare 2 */ -#define SPRN_ICCR 0x3FB /* Instruction Cache Cacheability Register */ -#define ICCR_NOCACHE 0 /* Noncacheable */ -#define ICCR_CACHE 1 /* Cacheable */ -#define SPRN_ICDBDR 0x3D3 /* Instruction Cache Debug Data Register */ -#define SPRN_ICMP 0x3D5 /* Instruction TLB Compare Register */ -#define SPRN_ICTC 0x3FB /* Instruction Cache Throttling Control Reg */ -#define SPRN_IMISS 0x3D4 /* Instruction TLB Miss Register */ -#define SPRN_IMMR 0x27E /* Internal Memory Map Register */ #define SPRN_L2CR 0x3F9 /* Level 2 Cache Control Regsiter */ #define SPRN_LR 0x008 /* Link Register */ -#define SPRN_PBL1 0x3FC /* Protection Bound Lower 1 */ -#define SPRN_PBL2 0x3FE /* Protection Bound Lower 2 */ -#define SPRN_PBU1 0x3FD /* Protection Bound Upper 1 */ -#define SPRN_PBU2 0x3FF /* Protection Bound Upper 2 */ -#define SPRN_PID 0x3B1 /* Process ID */ #define SPRN_PIR 0x3FF /* Processor Identification Register */ #define SPRN_PIT 0x3DB /* Programmable Interval Timer */ #define SPRN_PURR 0x135 /* Processor Utilization of Resources Register */ @@ -249,9 +149,6 @@ #define SPRN_RPA 0x3D6 /* Required Physical Address Register */ #define SPRN_SDA 0x3BF /* Sampled Data Address Register */ #define SPRN_SDR1 0x019 /* MMU Hash Base Register */ 
-#define SPRN_SGR 0x3B9 /* Storage Guarded Register */ -#define SGR_NORMAL 0 -#define SGR_GUARDED 1 #define SPRN_SIA 0x3BB /* Sampled Instruction Address Register */ #define SPRN_SPRG0 0x110 /* Special Purpose Register General 0 */ #define SPRN_SPRG1 0x111 /* Special Purpose Register General 1 */ @@ -264,49 +161,8 @@ #define SPRN_TBWL 0x11C /* Time Base Lower Register (super, W/O) */ #define SPRN_TBWU 0x11D /* Time Base Write Upper Register (super, W/O) */ #define SPRN_HIOR 0x137 /* 970 Hypervisor interrupt offset */ -#define SPRN_TCR 0x3DA /* Timer Control Register */ -#define TCR_WP(x) (((x)&0x3)<<30) /* WDT Period */ -#define WP_2_17 0 /* 2^17 clocks */ -#define WP_2_21 1 /* 2^21 clocks */ -#define WP_2_25 2 /* 2^25 clocks */ -#define WP_2_29 3 /* 2^29 clocks */ -#define TCR_WRC(x) (((x)&0x3)<<28) /* WDT Reset Control */ -#define WRC_NONE 0 /* No reset will occur */ -#define WRC_CORE 1 /* Core reset will occur */ -#define WRC_CHIP 2 /* Chip reset will occur */ -#define WRC_SYSTEM 3 /* System reset will occur */ -#define TCR_WIE 0x08000000 /* WDT Interrupt Enable */ -#define TCR_PIE 0x04000000 /* PIT Interrupt Enable */ -#define TCR_FP(x) (((x)&0x3)<<24) /* FIT Period */ -#define FP_2_9 0 /* 2^9 clocks */ -#define FP_2_13 1 /* 2^13 clocks */ -#define FP_2_17 2 /* 2^17 clocks */ -#define FP_2_21 3 /* 2^21 clocks */ -#define TCR_FIE 0x00800000 /* FIT Interrupt Enable */ -#define TCR_ARE 0x00400000 /* Auto Reload Enable */ -#define SPRN_THRM1 0x3FC /* Thermal Management Register 1 */ -#define THRM1_TIN (1<<0) -#define THRM1_TIV (1<<1) -#define THRM1_THRES (0x7f<<2) -#define THRM1_TID (1<<29) -#define THRM1_TIE (1<<30) -#define THRM1_V (1<<31) -#define SPRN_THRM2 0x3FD /* Thermal Management Register 2 */ -#define SPRN_THRM3 0x3FE /* Thermal Management Register 3 */ -#define THRM3_E (1<<31) -#define SPRN_TSR 0x3D8 /* Timer Status Register */ -#define TSR_ENW 0x80000000 /* Enable Next Watchdog */ -#define TSR_WIS 0x40000000 /* WDT Interrupt Status */ -#define 
TSR_WRS(x) (((x)&0x3)<<28) /* WDT Reset Status */ -#define WRS_NONE 0 /* No WDT reset occurred */ -#define WRS_CORE 1 /* WDT forced core reset */ -#define WRS_CHIP 2 /* WDT forced chip reset */ -#define WRS_SYSTEM 3 /* WDT forced system reset */ -#define TSR_PIS 0x08000000 /* PIT Interrupt Status */ -#define TSR_FIS 0x04000000 /* FIT Interrupt Status */ #define SPRN_USIA 0x3AB /* User Sampled Instruction Address Register */ #define SPRN_XER 0x001 /* Fixed Point Exception Register */ -#define SPRN_ZPR 0x3B0 /* Zone Protection Register */ #define SPRN_VRSAVE 0x100 /* Vector save */ /* Performance monitor SPRs */ @@ -352,28 +208,19 @@ #define CTR SPRN_CTR /* Counter Register */ #define DAR SPRN_DAR /* Data Address Register */ #define DABR SPRN_DABR /* Data Address Breakpoint Register */ -#define DCMP SPRN_DCMP /* Data TLB Compare Register */ #define DEC SPRN_DEC /* Decrement Register */ -#define DMISS SPRN_DMISS /* Data TLB Miss Register */ #define DSISR SPRN_DSISR /* Data Storage Interrupt Status Register */ -#define EAR SPRN_EAR /* External Address Register */ -#define HASH1 SPRN_HASH1 /* Primary Hash Address Register */ -#define HASH2 SPRN_HASH2 /* Secondary Hash Address Register */ #define HID0 SPRN_HID0 /* Hardware Implementation Register 0 */ #define MSRDORM SPRN_MSRDORM /* MSR Dormant Register */ #define NIADORM SPRN_NIADORM /* NIA Dormant Register */ #define TSC SPRN_TSC /* Thread switch control */ #define TST SPRN_TST /* Thread switch timeout */ #define IABR SPRN_IABR /* Instruction Address Breakpoint Register */ -#define ICMP SPRN_ICMP /* Instruction TLB Compare Register */ -#define IMISS SPRN_IMISS /* Instruction TLB Miss Register */ -#define IMMR SPRN_IMMR /* PPC 860/821 Internal Memory Map Register */ #define L2CR SPRN_L2CR /* PPC 750 L2 control register */ #define __LR SPRN_LR #define PVR SPRN_PVR /* Processor Version */ #define PIR SPRN_PIR /* Processor ID */ #define PURR SPRN_PURR /* Processor Utilization of Resource Register */ -//#define RPA SPRN_RPA 
/* Required Physical Address Register */ #define SDR1 SPRN_SDR1 /* MMU hash base register */ #define SPR0 SPRN_SPRG0 /* Supervisor Private Registers */ #define SPR1 SPRN_SPRG1 @@ -389,10 +236,6 @@ #define TBRU SPRN_TBRU /* Time Base Read Upper Register */ #define TBWL SPRN_TBWL /* Time Base Write Lower Register */ #define TBWU SPRN_TBWU /* Time Base Write Upper Register */ -#define ICTC 1019 -#define THRM1 SPRN_THRM1 /* Thermal Management Register 1 */ -#define THRM2 SPRN_THRM2 /* Thermal Management Register 2 */ -#define THRM3 SPRN_THRM3 /* Thermal Management Register 3 */ #define XER SPRN_XER /* Processor Version Register (PVR) field extraction */ From sfr at canb.auug.org.au Mon May 30 13:38:00 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Mon, 30 May 2005 13:38:00 +1000 Subject: [PATCH] remove platform support for pci_dac_dma_supported Message-ID: <20050530133800.7c409a24.sfr@canb.auug.org.au> Hi Anton, Currently no PPC64 platform actually implements the dac_dma_supported method in the dma_mapping_ops structure. This patch just tidies it up. Does it look OK to you? 
Signed-off-by: Stephen Rothwell -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ diff -ruNp linus/include/asm-ppc64/dma-mapping.h linus-pci_dac_dma_supported/include/asm-ppc64/dma-mapping.h --- linus/include/asm-ppc64/dma-mapping.h 2005-05-20 09:05:54.000000000 +1000 +++ linus-pci_dac_dma_supported/include/asm-ppc64/dma-mapping.h 2005-05-04 17:56:06.000000000 +1000 @@ -130,7 +130,6 @@ struct dma_mapping_ops { void (*unmap_sg)(struct device *dev, struct scatterlist *sg, int nents, enum dma_data_direction direction); int (*dma_supported)(struct device *dev, u64 mask); - int (*dac_dma_supported)(struct device *dev, u64 mask); }; #endif /* _ASM_DMA_MAPPING_H */ diff -ruNp linus/include/asm-ppc64/pci.h linus-pci_dac_dma_supported/include/asm-ppc64/pci.h --- linus/include/asm-ppc64/pci.h 2005-05-20 09:05:56.000000000 +1000 +++ linus-pci_dac_dma_supported/include/asm-ppc64/pci.h 2005-05-04 17:57:02.000000000 +1000 @@ -68,15 +68,7 @@ extern unsigned int pcibios_assign_all_b extern struct dma_mapping_ops pci_dma_ops; -/* For DAC DMA, we currently don't support it by default, but - * we let the platform override this - */ -static inline int pci_dac_dma_supported(struct pci_dev *hwdev,u64 mask) -{ - if (pci_dma_ops.dac_dma_supported) - return pci_dma_ops.dac_dma_supported(&hwdev->dev, mask); - return 0; -} +#define pci_dac_dma_supported(hwdev, mask) 0 extern int pci_domain_nr(struct pci_bus *bus); -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://ozlabs.org/pipermail/linuxppc64-dev/attachments/20050530/7bcce932/attachment.pgp

From sfr at canb.auug.org.au Mon May 30 14:19:56 2005
From: sfr at canb.auug.org.au (Stephen Rothwell)
Date: Mon, 30 May 2005 14:19:56 +1000
Subject: [PATCH] ppc64: allow timer based profiling on iseries
In-Reply-To: <20050529233005.GE11066@krispykreme>
References: <20050529233005.GE11066@krispykreme>
Message-ID: <20050530141956.437f0b13.sfr@canb.auug.org.au>

On Mon, 30 May 2005 09:30:05 +1000 Anton Blanchard wrote:
>
> We used to have an iseries specific profiler that used /proc/profile.
> Now thats gone we can use the generic timer based stuff.
>
> Signed-off-by: Anton Blanchard

Ack-by: Stephen Rothwell

-- 
Cheers,
Stephen Rothwell sfr at canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

From benh at kernel.crashing.org Mon May 30 14:56:08 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 30 May 2005 14:56:08 +1000
Subject: [PATCH] ppc64: set/clear SMT capable bit at boot
In-Reply-To: <20050529234154.GG11066@krispykreme>
References: <20050529234154.GG11066@krispykreme>
Message-ID: <1117428968.5228.34.camel@gaston>

> +	/*
> +	 * Check for an SMT capable CPU and set the CPU feature.
We do
> +	 * this by looking at the size of the ibm,ppc-interrupt-server#s
> +	 * property
> +	 */
> +	prop = (u32 *)get_flat_dt_prop(node, "ibm,ppc-interrupt-server#s",
> +				       &size);
> +	cur_cpu_spec->cpu_features &= ~CPU_FTR_SMT;
> +	if (prop && ((size / sizeof(u32)) > 1))
> +		cur_cpu_spec->cpu_features |= CPU_FTR_SMT;
> +
>  	return 0;

The above will always invalidate the bit that is present in the
cputable, thus if you don't have the "ibm,ppc-interrupt-server#s"
property, there will be no SMT at all, is that what you want?

I would have rather done something like

	if (prop) {
		clear_bit
		if (check-content-of-property)
			set_bit
	}

I need to "fix" the altivec one too so that the firmware can "invalidate"
the altivec by setting this property with a value of "0".

Ben.

From benh at kernel.crashing.org Mon May 30 15:07:00 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 30 May 2005 15:07:00 +1000
Subject: symlinks in /proc/device-tree
Message-ID: <1117429620.5228.38.camel@gaston>

Hi!

Is anybody actually using one of these things in /proc/device-tree?

 - for a node name@address, symlinks from name->node and @address->node
 - the fact that we remove the @address part if address is 0

I'm considering killing both of these things for 2.6.13. That will
reduce the memory footprint of the device-tree significantly (bloat in
inode cache) and it will be a more exact representation of the actual
tree. Besides, the symlinks are not really useful in practice.

Ben.

From olh at suse.de Mon May 30 16:17:37 2005
From: olh at suse.de (Olaf Hering)
Date: Mon, 30 May 2005 08:17:37 +0200
Subject: symlinks in /proc/device-tree
In-Reply-To: <1117429620.5228.38.camel@gaston>
References: <1117429620.5228.38.camel@gaston>
Message-ID: <20050530061737.GB28195@suse.de>

On Mon, May 30, Benjamin Herrenschmidt wrote:

> Hi!
>
> Is anybody actually using one of these things in /proc/device-tree?

Just make sure the devspec file in sysfs doesn't contain symlinks.
From benh at kernel.crashing.org Mon May 30 16:22:11 2005
From: benh at kernel.crashing.org (Benjamin Herrenschmidt)
Date: Mon, 30 May 2005 16:22:11 +1000
Subject: symlinks in /proc/device-tree
In-Reply-To: <20050530061737.GB28195@suse.de>
References: <1117429620.5228.38.camel@gaston> <20050530061737.GB28195@suse.de>
Message-ID: <1117434132.5228.40.camel@gaston>

On Mon, 2005-05-30 at 08:17 +0200, Olaf Hering wrote:
> On Mon, May 30, Benjamin Herrenschmidt wrote:
>
> > Hi!
> >
> > Is anybody actually using one of these things in /proc/device-tree?
>
> Just make sure the devspec file in sysfs doesn't contain symlinks.

It doesn't.

Ben.

From anton at samba.org Mon May 30 17:07:34 2005
From: anton at samba.org (Anton Blanchard)
Date: Mon, 30 May 2005 17:07:34 +1000
Subject: [PATCH] ppc64: cleanup iseries runlight support
Message-ID: <20050530070734.GB9193@krispykreme>

The iseries has a bar graph on the front panel that shows how busy it
is. The operating system sets and clears a bit in the CTRL register to
control it.

Instead of going to the complexity of using a thread info bit, just set
and clear it in the idle loop.

Also create two helper functions, ppc64_runlatch_on and
ppc64_runlatch_off. Finally, don't use the short form of the SPR defines.
Signed-off-by: Anton Blanchard Ack-by: Stephen Rothwell Index: foobar2/arch/ppc64/kernel/process.c =================================================================== --- foobar2.orig/arch/ppc64/kernel/process.c 2005-05-22 12:25:14.977598356 +1000 +++ foobar2/arch/ppc64/kernel/process.c 2005-05-22 12:29:38.185046661 +1000 @@ -378,9 +378,6 @@ childregs->gpr[1] = sp + sizeof(struct pt_regs); p->thread.regs = NULL; /* no user register state */ clear_ti_thread_flag(p->thread_info, TIF_32BIT); -#ifdef CONFIG_PPC_ISERIES - set_ti_thread_flag(p->thread_info, TIF_RUN_LIGHT); -#endif } else { childregs->gpr[1] = usp; p->thread.regs = childregs; Index: foobar2/include/asm-ppc64/processor.h =================================================================== --- foobar2.orig/include/asm-ppc64/processor.h 2005-05-22 12:29:27.090183375 +1000 +++ foobar2/include/asm-ppc64/processor.h 2005-05-22 12:29:38.188046431 +1000 @@ -164,6 +164,9 @@ #define SPRN_USIA 0x3AB /* User Sampled Instruction Address Register */ #define SPRN_XER 0x001 /* Fixed Point Exception Register */ #define SPRN_VRSAVE 0x100 /* Vector save */ +#define SPRN_CTRLF 0x088 +#define SPRN_CTRLT 0x098 +#define CTRL_RUNLATCH 0x1 /* Performance monitor SPRs */ #define SPRN_SIAR 780 @@ -279,12 +282,6 @@ #define XGLUE(a,b) a##b #define GLUE(a,b) XGLUE(a,b) -/* iSeries CTRL register (for runlatch) */ - -#define CTRLT 0x098 -#define CTRLF 0x088 -#define RUNLATCH 0x0001 - #ifdef __ASSEMBLY__ #define _GLOBAL(name) \ @@ -499,6 +496,24 @@ #define HAVE_ARCH_PICK_MMAP_LAYOUT +static inline void ppc64_runlatch_on(void) +{ + unsigned long ctrl; + + ctrl = mfspr(SPRN_CTRLF); + ctrl |= CTRL_RUNLATCH; + mtspr(SPRN_CTRLT, ctrl); +} + +static inline void ppc64_runlatch_off(void) +{ + unsigned long ctrl; + + ctrl = mfspr(SPRN_CTRLF); + ctrl &= ~CTRL_RUNLATCH; + mtspr(SPRN_CTRLT, ctrl); +} + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */ Index: foobar2/arch/ppc64/kernel/idle.c 
=================================================================== --- foobar2.orig/arch/ppc64/kernel/idle.c 2005-05-22 12:25:14.977598356 +1000 +++ foobar2/arch/ppc64/kernel/idle.c 2005-05-22 12:29:38.190046278 +1000 @@ -75,13 +75,9 @@ { struct paca_struct *lpaca; long oldval; - unsigned long CTRL; /* ensure iSeries run light will be out when idle */ - clear_thread_flag(TIF_RUN_LIGHT); - CTRL = mfspr(CTRLF); - CTRL &= ~RUNLATCH; - mtspr(CTRLT, CTRL); + ppc64_runlatch_off(); lpaca = get_paca(); @@ -111,7 +107,9 @@ } } + ppc64_runlatch_on(); schedule(); + ppc64_runlatch_off(); } return 0; Index: foobar2/arch/ppc64/kernel/sysfs.c =================================================================== --- foobar2.orig/arch/ppc64/kernel/sysfs.c 2005-05-22 12:25:14.977598356 +1000 +++ foobar2/arch/ppc64/kernel/sysfs.c 2005-05-22 12:29:38.192046125 +1000 @@ -113,7 +113,6 @@ #ifdef CONFIG_PPC_PSERIES unsigned long set, reset; int ret; - unsigned int ctrl; #endif /* CONFIG_PPC_PSERIES */ /* Only need to enable them once */ @@ -167,11 +166,8 @@ * On SMT machines we have to set the run latch in the ctrl register * in order to make PMC6 spin. 
*/ - if (cpu_has_feature(CPU_FTR_SMT)) { - ctrl = mfspr(CTRLF); - ctrl |= RUNLATCH; - mtspr(CTRLT, ctrl); - } + if (cpu_has_feature(CPU_FTR_SMT)) + ppc64_runlatch_on(); #endif /* CONFIG_PPC_PSERIES */ } Index: foobar2/include/asm-ppc64/thread_info.h =================================================================== --- foobar2.orig/include/asm-ppc64/thread_info.h 2005-05-22 12:25:14.975598509 +1000 +++ foobar2/include/asm-ppc64/thread_info.h 2005-05-22 12:29:38.193046048 +1000 @@ -96,7 +96,7 @@ #define TIF_POLLING_NRFLAG 4 /* true if poll_idle() is polling TIF_NEED_RESCHED */ #define TIF_32BIT 5 /* 32 bit binary */ -#define TIF_RUN_LIGHT 6 /* iSeries run light */ +/* #define SPARE 6 */ #define TIF_ABI_PENDING 7 /* 32/64 bit switch needed */ #define TIF_SYSCALL_AUDIT 8 /* syscall auditing active */ #define TIF_SINGLESTEP 9 /* singlestepping active */ @@ -110,7 +110,7 @@ #define _TIF_NEED_RESCHED (1< References: <20050529234154.GG11066@krispykreme> <1117428968.5228.34.camel@gaston> Message-ID: <20050530073506.GC9193@krispykreme> Hi, > The above will always invalidate the bit that is present in the > cputable, thus if you don't have the "ibm,ppc-interrupt-server#s" > property, there will be no SMT at all, is that what you want ? Firmware uses that property to tell us what server number each thread is, if you dont have that property then SMT wont work. In fact we saw this issue with some internal firmware where it only supplies one interrupt server property on a POWER5 chip. Having said that I bet rs64 HMT is different, but its pretty heavily broken and needs to be rewritten. 
Anton

From jschopp at austin.ibm.com Tue May 31 15:42:23 2005
From: jschopp at austin.ibm.com (Joel Schopp)
Date: Tue, 31 May 2005 00:42:23 -0500
Subject: [PATCH] ppc64: set/clear SMT capable bit at boot
In-Reply-To: <20050529234154.GG11066@krispykreme>
References: <20050529234154.GG11066@krispykreme>
Message-ID: <429BF93F.1030301@austin.ibm.com>

> +	if (prop && ((size / sizeof(u32)) > 1))

If this hasn't gone out yet it might be nice to add a
BUG_ON((size / sizeof(u32)) > 2). I don't know of any processors that do
more than 2 way SMT, but if one comes out several years from now it
would be nice to catch it early. Mostly I'd like the extra BUG_ON() in
order to have one more check against buggy firmware.

Patch looks fine with or without the extra BUG_ON().

From david at gibson.dropbear.id.au Tue May 31 16:43:46 2005
From: david at gibson.dropbear.id.au (David Gibson)
Date: Tue, 31 May 2005 16:43:46 +1000
Subject: Fixed version of 4-level pagetables patch
Message-ID: <20050531064345.GC7824@localhost.localdomain>

Turns out the patch for 4-level pagetables I posted earlier had a
serious bug: there was a hardcoded check in the SLB miss handler for the
old 2TB address space, so accessing anything beyond that caused endless
calls to hash_page(). Here's a corrected version.

This patch implements full four-level page tables for ppc64. It uses a
full page for the tables at the bottom and top level, and a quarter page
for the intermediate levels. This gives a total usable address space of
44 bits (16T). This patch also tweaks the VSID allocation to have a
matching range for user addresses (thereby halving the number of
available contexts) and adds some #if and BUILD_BUG sanity checks.
Index: working-2.6/include/asm-ppc64/pgtable.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgtable.h 2005-05-31 13:23:45.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgtable.h 2005-05-31 15:45:02.000000000 +1000 @@ -15,19 +15,24 @@ #include #endif /* __ASSEMBLY__ */ -#include - /* * Entries per page directory level. The PTE level must use a 64b record * for each page table entry. The PMD and PGD level use a 32b record for * each entry by assuming that each entry is page aligned. */ #define PTE_INDEX_SIZE 9 -#define PMD_INDEX_SIZE 10 -#define PGD_INDEX_SIZE 10 +#define PMD_INDEX_SIZE 7 +#define PUD_INDEX_SIZE 7 +#define PGD_INDEX_SIZE 9 + +#define PTE_TABLE_SIZE (sizeof(pte_t) << PTE_INDEX_SIZE) +#define PMD_TABLE_SIZE (sizeof(pmd_t) << PMD_INDEX_SIZE) +#define PUD_TABLE_SIZE (sizeof(pud_t) << PUD_INDEX_SIZE) +#define PGD_TABLE_SIZE (sizeof(pgd_t) << PGD_INDEX_SIZE) #define PTRS_PER_PTE (1 << PTE_INDEX_SIZE) #define PTRS_PER_PMD (1 << PMD_INDEX_SIZE) +#define PTRS_PER_PUD (1 << PMD_INDEX_SIZE) #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE) /* PMD_SHIFT determines what a second-level page table entry can map */ @@ -35,8 +40,13 @@ #define PMD_SIZE (1UL << PMD_SHIFT) #define PMD_MASK (~(PMD_SIZE-1)) -/* PGDIR_SHIFT determines what a third-level page table entry can map */ -#define PGDIR_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +/* PUD_SHIFT determines what a third-level page table entry can map */ +#define PUD_SHIFT (PMD_SHIFT + PMD_INDEX_SIZE) +#define PUD_SIZE (1UL << PUD_SHIFT) +#define PUD_MASK (~(PUD_SIZE-1)) + +/* PGDIR_SHIFT determines what a fourth-level page table entry can map */ +#define PGDIR_SHIFT (PUD_SHIFT + PUD_INDEX_SIZE) #define PGDIR_SIZE (1UL << PGDIR_SHIFT) #define PGDIR_MASK (~(PGDIR_SIZE-1)) @@ -45,15 +55,23 @@ /* * Size of EA range mapped by our pagetables. 
*/ -#define EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ - PGD_INDEX_SIZE + PAGE_SHIFT) -#define EADDR_MASK ((1UL << EADDR_SIZE) - 1) +#define PGTABLE_EADDR_SIZE (PTE_INDEX_SIZE + PMD_INDEX_SIZE + \ + PUD_INDEX_SIZE + PGD_INDEX_SIZE + PAGE_SHIFT) +#define PGTABLE_RANGE (1UL << PGTABLE_EADDR_SIZE) + +#if TASK_SIZE_USER64 > PGTABLE_RANGE +#error TASK_SIZE_USER64 exceeds pagetable range +#endif + +#if TASK_SIZE_USER64 > (1UL << (USER_ESID_BITS + SID_SHIFT)) +#error TASK_SIZE_USER64 exceeds user VSID range +#endif /* * Define the address range of the vmalloc VM area. */ #define VMALLOC_START (0xD000000000000000ul) -#define VMALLOC_SIZE (0x10000000000UL) +#define VMALLOC_SIZE (0x80000000000UL) #define VMALLOC_END (VMALLOC_START + VMALLOC_SIZE) /* @@ -154,8 +172,6 @@ #ifndef __ASSEMBLY__ int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local); - -void hugetlb_mm_free_pgd(struct mm_struct *mm); #endif /* __ASSEMBLY__ */ #define HAVE_ARCH_UNMAPPED_AREA @@ -163,7 +179,6 @@ #else #define hash_huge_page(mm,a,ea,vsid,local) -1 -#define hugetlb_mm_free_pgd(mm) do {} while (0) #endif @@ -197,39 +212,45 @@ #define pte_pfn(x) ((unsigned long)((pte_val(x) >> PTE_SHIFT))) #define pte_page(x) pfn_to_page(pte_pfn(x)) -#define pmd_set(pmdp, ptep) \ - (pmd_val(*(pmdp)) = __ba_to_bpn(ptep)) +#define pmd_set(pmdp, ptep) (pmd_val(*(pmdp)) = (unsigned long)(ptep)) #define pmd_none(pmd) (!pmd_val(pmd)) #define pmd_bad(pmd) (pmd_val(pmd) == 0) #define pmd_present(pmd) (pmd_val(pmd) != 0) #define pmd_clear(pmdp) (pmd_val(*(pmdp)) = 0) -#define pmd_page_kernel(pmd) (__bpn_to_ba(pmd_val(pmd))) +#define pmd_page_kernel(pmd) (pmd_val(pmd)) #define pmd_page(pmd) virt_to_page(pmd_page_kernel(pmd)) -#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (__ba_to_bpn(pmdp))) +#define pud_set(pudp, pmdp) (pud_val(*(pudp)) = (unsigned long)(pmdp)) #define pud_none(pud) (!pud_val(pud)) -#define pud_bad(pud) ((pud_val(pud)) == 0UL) -#define 
pud_present(pud) (pud_val(pud) != 0UL) -#define pud_clear(pudp) (pud_val(*(pudp)) = 0UL) -#define pud_page(pud) (__bpn_to_ba(pud_val(pud))) +#define pud_bad(pud) ((pud_val(pud)) == 0) +#define pud_present(pud) (pud_val(pud) != 0) +#define pud_clear(pudp) (pud_val(*(pudp)) = 0) +#define pud_page(pud) (pud_val(pud)) + +#define pgd_set(pgdp, pudp) ({pgd_val(*(pgdp)) = (unsigned long)(pudp);}) +#define pgd_none(pgd) (!pgd_val(pgd)) +#define pgd_bad(pgd) (pgd_val(pgd) == 0) +#define pgd_present(pgd) (pgd_val(pgd) != 0) +#define pgd_clear(pgdp) (pgd_val(*(pgdp)) = 0) +#define pgd_page(pgd) (pgd_val(pgd)) /* * Find an entry in a page-table-directory. We combine the address region * (the high order N bits) and the pgd portion of the address. */ /* to avoid overflow in free_pgtables we don't use PTRS_PER_PGD here */ -#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x7ff) +#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & 0x1ff) #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address)) -/* Find an entry in the second-level page table.. */ +#define pud_offset(pgdp, addr) \ + (((pud_t *) pgd_page(*(pgdp))) + (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))) + #define pmd_offset(pudp,addr) \ - ((pmd_t *) pud_page(*(pudp)) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) + (((pmd_t *) pud_page(*(pudp))) + (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))) -/* Find an entry in the third-level page table.. 
*/ #define pte_offset_kernel(dir,addr) \ - ((pte_t *) pmd_page_kernel(*(dir)) \ - + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) + (((pte_t *) pmd_page_kernel(*(dir))) + (((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))) #define pte_offset_map(dir,addr) pte_offset_kernel((dir), (addr)) #define pte_offset_map_nested(dir,addr) pte_offset_kernel((dir), (addr)) @@ -458,9 +479,11 @@ #define pte_same(A,B) (((pte_val(A) ^ pte_val(B)) & ~_PAGE_HPTEFLAGS) == 0) #define pmd_ERROR(e) \ - printk("%s:%d: bad pmd %08x.\n", __FILE__, __LINE__, pmd_val(e)) + printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e)) +#define pud_ERROR(e) \ + printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pud_val(e)) #define pgd_ERROR(e) \ - printk("%s:%d: bad pgd %08x.\n", __FILE__, __LINE__, pgd_val(e)) + printk("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e)) extern pgd_t swapper_pg_dir[]; Index: working-2.6/include/asm-ppc64/page.h =================================================================== --- working-2.6.orig/include/asm-ppc64/page.h 2005-05-31 13:23:45.000000000 +1000 +++ working-2.6/include/asm-ppc64/page.h 2005-05-31 15:45:02.000000000 +1000 @@ -46,6 +46,7 @@ #define ARCH_HAS_HUGEPAGE_ONLY_RANGE #define ARCH_HAS_PREPARE_HUGEPAGE_RANGE +#define ARCH_HAS_SETCLEAR_HUGE_PTE #define touches_hugepage_low_range(mm, addr, len) \ (LOW_ESID_MASK((addr), (len)) & mm->context.htlb_segs) @@ -125,36 +126,42 @@ * Entries in the pte table are 64b, while entries in the pgd & pmd are 32b. 
*/ typedef struct { unsigned long pte; } pte_t; -typedef struct { unsigned int pmd; } pmd_t; -typedef struct { unsigned int pgd; } pgd_t; +typedef struct { unsigned long pmd; } pmd_t; +typedef struct { unsigned long pud; } pud_t; +typedef struct { unsigned long pgd; } pgd_t; typedef struct { unsigned long pgprot; } pgprot_t; #define pte_val(x) ((x).pte) #define pmd_val(x) ((x).pmd) +#define pud_val(x) ((x).pud) #define pgd_val(x) ((x).pgd) #define pgprot_val(x) ((x).pgprot) -#define __pte(x) ((pte_t) { (x) } ) -#define __pmd(x) ((pmd_t) { (x) } ) -#define __pgd(x) ((pgd_t) { (x) } ) -#define __pgprot(x) ((pgprot_t) { (x) } ) +#define __pte(x) ((pte_t) { (x) }) +#define __pmd(x) ((pmd_t) { (x) }) +#define __pud(x) ((pud_t) { (x) }) +#define __pgd(x) ((pgd_t) { (x) }) +#define __pgprot(x) ((pgprot_t) { (x) }) #else /* * .. while these make it easier on the compiler */ typedef unsigned long pte_t; -typedef unsigned int pmd_t; -typedef unsigned int pgd_t; +typedef unsigned long pmd_t; +typedef unsigned long pud_t; +typedef unsigned long pgd_t; typedef unsigned long pgprot_t; #define pte_val(x) (x) #define pmd_val(x) (x) +#define pud_val(x) (x) #define pgd_val(x) (x) #define pgprot_val(x) (x) #define __pte(x) (x) #define __pmd(x) (x) +#define __pud(x) (x) #define __pgd(x) (x) #define __pgprot(x) (x) @@ -208,9 +215,6 @@ #define USER_REGION_ID (0UL) #define REGION_ID(ea) (((unsigned long)(ea)) >> REGION_SHIFT) -#define __bpn_to_ba(x) ((((unsigned long)(x)) << PAGE_SHIFT) + KERNELBASE) -#define __ba_to_bpn(x) ((((unsigned long)(x)) & ~REGION_MASK) >> PAGE_SHIFT) - #define __va(x) ((void *)((unsigned long)(x) + KERNELBASE)) #ifdef CONFIG_DISCONTIGMEM Index: working-2.6/include/asm-ppc64/pgalloc.h =================================================================== --- working-2.6.orig/include/asm-ppc64/pgalloc.h 2005-05-31 13:23:45.000000000 +1000 +++ working-2.6/include/asm-ppc64/pgalloc.h 2005-05-31 15:45:02.000000000 +1000 @@ -6,7 +6,7 @@ #include #include -extern 
kmem_cache_t *zero_cache; +extern kmem_cache_t *pmd_cache; /* * This program is free software; you can redistribute it and/or @@ -18,13 +18,31 @@ static inline pgd_t * pgd_alloc(struct mm_struct *mm) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL); + return (pgd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO); } static inline void pgd_free(pgd_t *pgd) { - kmem_cache_free(zero_cache, pgd); + free_page((unsigned long)pgd); +} + +#define pgd_populate(MM, PGD, PUD) pgd_set(PGD, PUD) + +static inline pud_t * +pud_alloc_one(struct mm_struct *mm, unsigned long addr) +{ + pud_t *pudp; + + pudp = kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pudp, 0, PUD_TABLE_SIZE); + return pudp; +} + +static inline void +pud_free(pud_t *pud) +{ + kmem_cache_free(pmd_cache, pud); } #define pud_populate(MM, PUD, PMD) pud_set(PUD, PMD) @@ -32,13 +50,17 @@ static inline pmd_t * pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + pmd_t *pmdp; + + pmdp = kmem_cache_alloc(pmd_cache, GFP_KERNEL|__GFP_REPEAT); + memset(pmdp, 0, PMD_TABLE_SIZE); + return pmdp; } static inline void pmd_free(pmd_t *pmd) { - kmem_cache_free(zero_cache, pmd); + kmem_cache_free(pmd_cache, pmd); } #define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, pte) @@ -47,44 +69,54 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); + return (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address) { - pte_t *pte = kmem_cache_alloc(zero_cache, GFP_KERNEL|__GFP_REPEAT); - if (pte) - return virt_to_page(pte); - return NULL; + return alloc_page(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO); } static inline void pte_free_kernel(pte_t *pte) { - kmem_cache_free(zero_cache, pte); + free_page((unsigned long)pte); } static inline void pte_free(struct page *ptepage) { - 
kmem_cache_free(zero_cache, page_address(ptepage)); + __free_page(ptepage); } -struct pte_freelist_batch +typedef struct pgtable_free { + unsigned long val; +} pgtable_free_t; + +static inline pgtable_free_t pgtable_free_page(struct page *page) { - struct rcu_head rcu; - unsigned int index; - struct page * pages[0]; -}; + return (pgtable_free_t){.val = (unsigned long) page}; +} -#define PTE_FREELIST_SIZE ((PAGE_SIZE - sizeof(struct pte_freelist_batch)) / \ - sizeof(struct page *)) +static inline pgtable_free_t pgtable_free_cache(void *p) +{ + return (pgtable_free_t){.val = ((unsigned long) p) | 1}; +} -extern void pte_free_now(struct page *ptepage); -extern void pte_free_submit(struct pte_freelist_batch *batch); +static inline void pgtable_free(pgtable_free_t pgf) +{ + if (pgf.val & 1) + kmem_cache_free(pmd_cache, (void *)(pgf.val & ~1)); + else + __free_page((struct page *)pgf.val); +} -DECLARE_PER_CPU(struct pte_freelist_batch *, pte_freelist_cur); +void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf); -void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage); -#define __pmd_free_tlb(tlb, pmd) __pte_free_tlb(tlb, virt_to_page(pmd)) +#define __pte_free_tlb(tlb, ptepage) \ + pgtable_free_tlb(tlb, pgtable_free_page(ptepage)) +#define __pmd_free_tlb(tlb, pmd) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pmd)) +#define __pud_free_tlb(tlb, pmd) \ + pgtable_free_tlb(tlb, pgtable_free_cache(pud)) #define check_pgt_cache() do { } while (0) Index: working-2.6/arch/ppc64/mm/init.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/init.c 2005-05-31 15:44:52.000000000 +1000 +++ working-2.6/arch/ppc64/mm/init.c 2005-05-31 15:45:02.000000000 +1000 @@ -66,6 +66,14 @@ #include #include +#if PGTABLE_RANGE > USER_VSID_RANGE +#warning Limited user VSID range means pagetable space is wasted +#endif + +#if (TASK_SIZE_USER64 < PGTABLE_RANGE) && (TASK_SIZE_USER64 < USER_VSID_RANGE) +#warning TASK_SIZE is smaller 
than it needs to be. +#endif + int mem_init_done; unsigned long ioremap_bot = IMALLOC_BASE; static unsigned long phbs_io_bot = PHBS_IO_BASE; @@ -225,7 +233,7 @@ * Before that, we map using addresses going * up from ioremap_bot. imalloc will use * the addresses from ioremap_bot through - * IMALLOC_END (0xE000001fffffffff) + * IMALLOC_END * */ pa = addr & PAGE_MASK; @@ -416,12 +424,6 @@ int index; int err; -#ifdef CONFIG_HUGETLB_PAGE - /* We leave htlb_segs as it was, but for a fork, we need to - * clear the huge_pgdir. */ - mm->context.huge_pgdir = NULL; -#endif - again: if (!idr_pre_get(&mmu_context_idr, GFP_KERNEL)) return -ENOMEM; @@ -452,8 +454,6 @@ spin_unlock(&mmu_context_lock); mm->context.id = NO_CONTEXT; - - hugetlb_mm_free_pgd(mm); } /* @@ -824,23 +824,18 @@ return virt_addr; } -kmem_cache_t *zero_cache; - -static void zero_ctor(void *pte, kmem_cache_t *cache, unsigned long flags) -{ - memset(pte, 0, PAGE_SIZE); -} +kmem_cache_t *pmd_cache; void pgtable_cache_init(void) { - zero_cache = kmem_cache_create("zero", - PAGE_SIZE, - 0, - SLAB_HWCACHE_ALIGN | SLAB_MUST_HWCACHE_ALIGN, - zero_ctor, - NULL); - if (!zero_cache) - panic("pgtable_cache_init(): could not create zero_cache!\n"); + BUILD_BUG_ON(PTE_TABLE_SIZE != PAGE_SIZE); + BUILD_BUG_ON(PMD_TABLE_SIZE != PUD_TABLE_SIZE); + BUILD_BUG_ON(PGD_TABLE_SIZE != PAGE_SIZE); + + pmd_cache = kmem_cache_create("pmd", PMD_TABLE_SIZE, PMD_TABLE_SIZE, + 0, NULL, NULL); + if (! 
pmd_cache) + panic("pmd_pud_cache_init(): could not create pmd_pud_cache!\n"); } pgprot_t phys_mem_access_prot(struct file *file, unsigned long addr, Index: working-2.6/include/asm-ppc64/processor.h =================================================================== --- working-2.6.orig/include/asm-ppc64/processor.h 2005-05-31 13:23:45.000000000 +1000 +++ working-2.6/include/asm-ppc64/processor.h 2005-05-31 15:45:02.000000000 +1000 @@ -530,8 +530,8 @@ extern struct task_struct *last_task_used_math; extern struct task_struct *last_task_used_altivec; -/* 64-bit user address space is 41-bits (2TBs user VM) */ -#define TASK_SIZE_USER64 (0x0000020000000000UL) +/* 64-bit user address space is 44-bits (16TB user VM) */ +#define TASK_SIZE_USER64 (0x0000100000000000UL) /* * 32-bit user address space is 4GB - 1 page Index: working-2.6/arch/ppc64/kernel/head.S =================================================================== --- working-2.6.orig/arch/ppc64/kernel/head.S 2005-05-31 13:23:45.000000000 +1000 +++ working-2.6/arch/ppc64/kernel/head.S 2005-05-31 15:45:02.000000000 +1000 @@ -38,6 +38,7 @@ #include #include #include +#include #ifdef CONFIG_PPC_ISERIES #define DO_SOFT_DISABLE @@ -2117,17 +2118,17 @@ empty_zero_page: .space 4096 - .globl swapper_pg_dir -swapper_pg_dir: - .space 4096 - #ifdef CONFIG_SMP /* 1 page segment table per cpu (max 48, cpu0 allocated at STAB0_PHYS_ADDR) */ .globl stab_array stab_array: .space 4096 * 48 #endif - + + .globl swapper_pg_dir +swapper_pg_dir: + .space PAGE_SIZE + /* * This space gets a copy of optional info passed to us by the bootstrap * Used to pass parameters into the kernel like root=/dev/sda1, etc. 
Index: working-2.6/arch/ppc64/mm/imalloc.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/imalloc.c 2005-05-31 13:23:45.000000000 +1000 +++ working-2.6/arch/ppc64/mm/imalloc.c 2005-05-31 15:45:02.000000000 +1000 @@ -31,7 +31,7 @@ break; if ((unsigned long)tmp->addr >= ioremap_bot) addr = tmp->size + (unsigned long) tmp->addr; - if (addr > IMALLOC_END-size) + if (addr >= IMALLOC_END-size) return 1; } *im_addr = addr; Index: working-2.6/arch/ppc64/mm/hash_utils.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hash_utils.c 2005-05-31 15:44:52.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hash_utils.c 2005-05-31 16:32:57.000000000 +1000 @@ -298,7 +298,7 @@ int local = 0; cpumask_t tmp; - if ((ea & ~REGION_MASK) > EADDR_MASK) + if ((ea & ~REGION_MASK) >= PGTABLE_RANGE) return 1; switch (REGION_ID(ea)) { Index: working-2.6/include/asm-ppc64/mmu.h =================================================================== --- working-2.6.orig/include/asm-ppc64/mmu.h 2005-05-31 15:44:52.000000000 +1000 +++ working-2.6/include/asm-ppc64/mmu.h 2005-05-31 15:45:02.000000000 +1000 @@ -259,8 +259,10 @@ #define VSID_BITS 36 #define VSID_MODULUS ((1UL<index; i++) + pgtable_free(batch->tables[i]); + + free_page((unsigned long)batch); +} + +static void pte_free_submit(struct pte_freelist_batch *batch) +{ + INIT_RCU_HEAD(&batch->rcu); + call_rcu(&batch->rcu, pte_free_rcu_callback); +} + +void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf) { /* This is safe as we are holding page_table_lock */ cpumask_t local_cpumask = cpumask_of_cpu(smp_processor_id()); @@ -49,19 +100,19 @@ if (atomic_read(&tlb->mm->mm_users) < 2 || cpus_equal(tlb->mm->cpu_vm_mask, local_cpumask)) { - pte_free(ptepage); + pgtable_free(pgf); return; } if (*batchp == NULL) { *batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC); if (*batchp == NULL) { - pte_free_now(ptepage); + 
pgtable_free_now(pgf); return; } (*batchp)->index = 0; } - (*batchp)->pages[(*batchp)->index++] = ptepage; + (*batchp)->tables[(*batchp)->index++] = pgf; if ((*batchp)->index == PTE_FREELIST_SIZE) { pte_free_submit(*batchp); *batchp = NULL; @@ -132,42 +183,6 @@ put_cpu(); } -#ifdef CONFIG_SMP -static void pte_free_smp_sync(void *arg) -{ - /* Do nothing, just ensure we sync with all CPUs */ -} -#endif - -/* This is only called when we are critically out of memory - * (and fail to get a page in pte_free_tlb). - */ -void pte_free_now(struct page *ptepage) -{ - pte_freelist_forced_free++; - - smp_call_function(pte_free_smp_sync, NULL, 0, 1); - - pte_free(ptepage); -} - -static void pte_free_rcu_callback(struct rcu_head *head) -{ - struct pte_freelist_batch *batch = - container_of(head, struct pte_freelist_batch, rcu); - unsigned int i; - - for (i = 0; i < batch->index; i++) - pte_free(batch->pages[i]); - free_page((unsigned long)batch); -} - -void pte_free_submit(struct pte_freelist_batch *batch) -{ - INIT_RCU_HEAD(&batch->rcu); - call_rcu(&batch->rcu, pte_free_rcu_callback); -} - void pte_free_finish(void) { /* This is safe as we are holding page_table_lock */ Index: working-2.6/arch/ppc64/mm/hugetlbpage.c =================================================================== --- working-2.6.orig/arch/ppc64/mm/hugetlbpage.c 2005-05-31 15:44:52.000000000 +1000 +++ working-2.6/arch/ppc64/mm/hugetlbpage.c 2005-05-31 15:45:02.000000000 +1000 @@ -27,124 +27,93 @@ #include -#define HUGEPGDIR_SHIFT (HPAGE_SHIFT + PAGE_SHIFT - 3) -#define HUGEPGDIR_SIZE (1UL << HUGEPGDIR_SHIFT) -#define HUGEPGDIR_MASK (~(HUGEPGDIR_SIZE-1)) - -#define HUGEPTE_INDEX_SIZE 9 -#define HUGEPGD_INDEX_SIZE 10 - -#define PTRS_PER_HUGEPTE (1 << HUGEPTE_INDEX_SIZE) -#define PTRS_PER_HUGEPGD (1 << HUGEPGD_INDEX_SIZE) - -static inline int hugepgd_index(unsigned long addr) +/* Modelled after find_linux_pte() */ +pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) { - return (addr & ~REGION_MASK) 
>> HUGEPGDIR_SHIFT; -} + pgd_t *pg; + pud_t *pu; + pmd_t *pm; + pte_t *pt; -static pud_t *hugepgd_offset(struct mm_struct *mm, unsigned long addr) -{ - int index; + BUG_ON(! in_hugepage_area(mm->context, addr)); - if (! mm->context.huge_pgdir) - return NULL; + addr &= HPAGE_MASK; + pg = pgd_offset(mm, addr); + if (!pgd_none(*pg)) { + pu = pud_offset(pg, addr); + if (!pud_none(*pu)) { + pm = pmd_offset(pu, addr); + pt = (pte_t *)pm; + BUG_ON(!pmd_none(*pm) + && !(pte_present(*pt) && pte_huge(*pt))); + return pt; + } + } - index = hugepgd_index(addr); - BUG_ON(index >= PTRS_PER_HUGEPGD); - return (pud_t *)(mm->context.huge_pgdir + index); + return NULL; } -static inline pte_t *hugepte_offset(pud_t *dir, unsigned long addr) +pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) { - int index; - - if (pud_none(*dir)) - return NULL; - - index = (addr >> HPAGE_SHIFT) % PTRS_PER_HUGEPTE; - return (pte_t *)pud_page(*dir) + index; -} + pgd_t *pg; + pud_t *pu; + pmd_t *pm; + pte_t *pt; -static pud_t *hugepgd_alloc(struct mm_struct *mm, unsigned long addr) -{ BUG_ON(! in_hugepage_area(mm->context, addr)); - if (! mm->context.huge_pgdir) { - pgd_t *new; - spin_unlock(&mm->page_table_lock); - /* Don't use pgd_alloc(), because we want __GFP_REPEAT */ - new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT); - BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE)); - spin_lock(&mm->page_table_lock); - - /* - * Because we dropped the lock, we should re-check the - * entry, as somebody else could have populated it.. - */ - if (mm->context.huge_pgdir) - pgd_free(new); - else - mm->context.huge_pgdir = new; - } - return hugepgd_offset(mm, addr); -} + addr &= HPAGE_MASK; -static pte_t *hugepte_alloc(struct mm_struct *mm, pud_t *dir, unsigned long addr) -{ - if (! 
pud_present(*dir)) { - pte_t *new; + pg = pgd_offset(mm, addr); + pu = pud_alloc(mm, pg, addr); - spin_unlock(&mm->page_table_lock); - new = kmem_cache_alloc(zero_cache, GFP_KERNEL | __GFP_REPEAT); - BUG_ON(memcmp(new, empty_zero_page, PAGE_SIZE)); - spin_lock(&mm->page_table_lock); - /* - * Because we dropped the lock, we should re-check the - * entry, as somebody else could have populated it.. - */ - if (pud_present(*dir)) { - if (new) - kmem_cache_free(zero_cache, new); - } else { - struct page *ptepage; - - if (! new) - return NULL; - ptepage = virt_to_page(new); - ptepage->mapping = (void *) mm; - ptepage->index = addr & HUGEPGDIR_MASK; - pud_populate(mm, dir, new); + if (pu) { + pm = pmd_alloc(mm, pu, addr); + if (pm) { + pt = (pte_t *)pm; + BUG_ON(!pmd_none(*pm) + && !(pte_present(*pt) && pte_huge(*pt))); + return pt; } } - return hugepte_offset(dir, addr); + return NULL; } -pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr) -{ - pud_t *pud; +#define HUGEPTE_BATCH_SIZE (HPAGE_SIZE / PMD_SIZE) - BUG_ON(! in_hugepage_area(mm->context, addr)); +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + int i; - pud = hugepgd_offset(mm, addr); - if (! pud) - return NULL; + if (pte_present(*ptep)) { + pte_clear(mm, addr, ptep); + flush_tlb_pending(); + } - return hugepte_offset(pud, addr); + for (i = 0; i < HUGEPTE_BATCH_SIZE; i++) { + *ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS); + ptep++; + } } -pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr) +pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr, + pte_t *ptep) { - pud_t *pud; + unsigned long old = pte_update(ptep, ~0UL); + int i; - BUG_ON(! in_hugepage_area(mm->context, addr)); + if (old & _PAGE_HASHPTE) + hpte_update(mm, addr, old, 0); - pud = hugepgd_alloc(mm, addr); - if (! 
pud) - return NULL; + for (i = 1; i < HUGEPTE_BATCH_SIZE; i++) { + *ptep = __pte(0); + ptep++; + } - return hugepte_alloc(mm, pud, addr); + return __pte(old); } /* @@ -517,42 +486,6 @@ } } -void hugetlb_mm_free_pgd(struct mm_struct *mm) -{ - int i; - pgd_t *pgdir; - - spin_lock(&mm->page_table_lock); - - pgdir = mm->context.huge_pgdir; - if (! pgdir) - goto out; - - mm->context.huge_pgdir = NULL; - - /* cleanup any hugepte pages leftover */ - for (i = 0; i < PTRS_PER_HUGEPGD; i++) { - pud_t *pud = (pud_t *)(pgdir + i); - - if (! pud_none(*pud)) { - pte_t *pte = (pte_t *)pud_page(*pud); - struct page *ptepage = virt_to_page(pte); - - ptepage->mapping = NULL; - - BUG_ON(memcmp(pte, empty_zero_page, PAGE_SIZE)); - kmem_cache_free(zero_cache, pte); - } - pud_clear(pud); - } - - BUG_ON(memcmp(pgdir, empty_zero_page, PAGE_SIZE)); - kmem_cache_free(zero_cache, pgdir); - - out: - spin_unlock(&mm->page_table_lock); -} - int hash_huge_page(struct mm_struct *mm, unsigned long access, unsigned long ea, unsigned long vsid, int local) { Index: working-2.6/arch/ppc64/mm/slb_low.S =================================================================== --- working-2.6.orig/arch/ppc64/mm/slb_low.S 2005-05-24 14:12:22.000000000 +1000 +++ working-2.6/arch/ppc64/mm/slb_low.S 2005-05-31 16:41:52.000000000 +1000 @@ -91,7 +91,7 @@ 0: /* user address: proto-VSID = context<<15 | ESID */ li r11,SLB_VSID_USER - srdi. r9,r3,13 + srdi. r9,r3,USER_ESID_BITS bne- 8f /* invalid ea bits set */ #ifdef CONFIG_HUGETLB_PAGE -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/people/dgibson From anton at samba.org Tue May 31 17:01:22 2005 From: anton at samba.org (Anton Blanchard) Date: Tue, 31 May 2005 17:01:22 +1000 Subject: [PATCH] ppc64: remove decr_overclock Message-ID: <20050531070122.GA28982@krispykreme> Hi, Now that we have HZ=1000 there is much less of a need for decr_overclock. 
Remove it. Leave spread_lpevents but move it into iSeries_setup.c. We should look at making event spreading the default some day. Signed-off-by: Anton Blanchard Acked-by: Stephen Rothwell ===== arch/ppc64/kernel/iSeries_setup.c 1.32 vs edited ===== Index: linux-2.6.git-work/arch/ppc64/kernel/iSeries_setup.c =================================================================== --- linux-2.6.git-work.orig/arch/ppc64/kernel/iSeries_setup.c 2005-05-23 17:40:56.000000000 +1000 +++ linux-2.6.git-work/arch/ppc64/kernel/iSeries_setup.c 2005-05-23 18:22:15.000000000 +1000 @@ -852,6 +852,28 @@ late_initcall(iSeries_src_init); +static int set_spread_lpevents(char *str) +{ + unsigned long i; + unsigned long val = simple_strtoul(str, NULL, 0); + + /* + * The parameter is the number of processors to share in processing + * lp events. + */ + if (( val > 0) && (val <= NR_CPUS)) { + for (i = 1; i < val; ++i) + paca[i].lpqueue_ptr = paca[0].lpqueue_ptr; + + printk("lpevent processing spread over %ld processors\n", val); + } else { + printk("invalid spread_lpevents %ld\n", val); + } + + return 1; +} +__setup("spread_lpevents=", set_spread_lpevents); + void __init iSeries_early_setup(void) { iSeries_fixup_klimit(); Index: linux-2.6.git-work/arch/ppc64/kernel/setup.c =================================================================== --- linux-2.6.git-work.orig/arch/ppc64/kernel/setup.c 2005-05-23 17:40:57.000000000 +1000 +++ linux-2.6.git-work/arch/ppc64/kernel/setup.c 2005-05-23 18:22:15.000000000 +1000 @@ -103,11 +103,6 @@ extern void smp_release_cpus(void); -unsigned long decr_overclock = 1; -unsigned long decr_overclock_proc0 = 1; -unsigned long decr_overclock_set = 0; -unsigned long decr_overclock_proc0_set = 0; - int have_of = 1; int boot_cpuid = 0; int boot_cpuid_phys = 0; @@ -1120,64 +1115,15 @@ printk("[dump]%04x %s\n", src, msg); } -int set_spread_lpevents( char * str ) -{ - /* The parameter is the number of processors to share in processing lp events */ - unsigned long i; - 
unsigned long val = simple_strtoul( str, NULL, 0 ); - if ( ( val > 0 ) && ( val <= NR_CPUS ) ) { - for ( i=1; idefault_decr = tb_ticks_per_jiffy / decr_overclock_proc0; + lpaca->default_decr = tb_ticks_per_jiffy; lpaca->next_jiffy_update_tb = get_tb() + tb_ticks_per_jiffy; } -int set_decr_overclock_proc0( char * str ) -{ - unsigned long val = simple_strtoul( str, NULL, 0 ); - if ( ( val >= 1 ) && ( val <= 48 ) ) { - decr_overclock_proc0_set = 1; - decr_overclock_proc0 = val; - printk("proc 0 decrementer overclock factor of %ld\n", val); - } - else - printk("invalid proc 0 decrementer overclock factor of %ld\n", val); - return 1; -} - -int set_decr_overclock( char * str ) -{ - unsigned long val = simple_strtoul( str, NULL, 0 ); - if ( ( val >= 1 ) && ( val <= 48 ) ) { - decr_overclock_set = 1; - decr_overclock = val; - printk("decrementer overclock factor of %ld\n", val); - } - else - printk("invalid decrementer overclock factor of %ld\n", val); - return 1; - -} - -__setup("spread_lpevents=", set_spread_lpevents ); -__setup("decr_overclock_proc0=", set_decr_overclock_proc0 ); -__setup("decr_overclock=", set_decr_overclock ); - #ifndef CONFIG_PPC_ISERIES /* * This function can be used by platforms to "find" legacy serial ports. 
Index: linux-2.6.git-work/arch/ppc64/kernel/smp.c =================================================================== --- linux-2.6.git-work.orig/arch/ppc64/kernel/smp.c 2005-05-23 17:40:57.000000000 +1000 +++ linux-2.6.git-work/arch/ppc64/kernel/smp.c 2005-05-23 18:22:15.000000000 +1000 @@ -334,7 +334,6 @@ } } -extern unsigned long decr_overclock; extern struct gettimeofday_struct do_gtod; struct thread_info *current_set[NR_CPUS]; @@ -491,7 +490,7 @@ if (smp_ops->cpu_bootable && !smp_ops->cpu_bootable(cpu)) return -EINVAL; - paca[cpu].default_decr = tb_ticks_per_jiffy / decr_overclock; + paca[cpu].default_decr = tb_ticks_per_jiffy; if (!cpu_has_feature(CPU_FTR_SLB)) { void *tmp; From sfr at canb.auug.org.au Tue May 31 17:24:58 2005 From: sfr at canb.auug.org.au (Stephen Rothwell) Date: Tue, 31 May 2005 17:24:58 +1000 Subject: Fixed version of 4-level pagetables patch In-Reply-To: <20050531064345.GC7824@localhost.localdomain> References: <20050531064345.GC7824@localhost.localdomain> Message-ID: <20050531172458.672fd8d9.sfr@canb.auug.org.au> On Tue, 31 May 2005 16:43:46 +1000 David Gibson wrote: > > This patch implements full four-level page tables for ppc64. It uses > a full page for the tables at the bottom and top level, and a quarter > page for the intermediate levels. This gives a total usable address > space of 44 bits (16T). This patch also tweaks the VSID allocation to > have a matching range for user addresses (thereby halving the number > of available contexts) and adds some #if and BUILD_BUG sanity checks. Builds, boots and runs a 64bit HelloWorld program on legacy iSeries. -- Cheers, Stephen Rothwell sfr at canb.auug.org.au http://www.canb.auug.org.au/~sfr/ -------------- next part -------------- A non-text attachment was scrubbed... 
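The sizing in the quoted description can be checked with a little arithmetic. The sketch below is not part of the patch. It copies the four index widths from the pgtable.h hunk and assumes 4K pages (PAGE_SHIFT of 12), 8-byte table entries, and the usual PMD_SHIFT definition, which lies outside the posted hunk. The split_ea() helper and struct pt_indices are hypothetical names used only to show how a 44-bit effective address breaks into the four table indices.

```c
#include <assert.h>
#include <stdint.h>

/* Index widths as posted in the pgtable.h hunk. */
#define PTE_INDEX_SIZE 9
#define PMD_INDEX_SIZE 7
#define PUD_INDEX_SIZE 7
#define PGD_INDEX_SIZE 9

/* Assumptions for this sketch: 4K pages and 8-byte (unsigned long) entries. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define ENTRY_SIZE 8UL

/* Shifts built up level by level. PMD_SHIFT is assumed to be defined this
 * way; its definition is not visible in the posted hunk. */
#define PMD_SHIFT          (PAGE_SHIFT + PTE_INDEX_SIZE)
#define PUD_SHIFT          (PMD_SHIFT + PMD_INDEX_SIZE)
#define PGDIR_SHIFT        (PUD_SHIFT + PUD_INDEX_SIZE)
#define PGTABLE_EADDR_SIZE (PGDIR_SHIFT + PGD_INDEX_SIZE)

/* Full page at the top and bottom level, quarter page in between. */
static_assert((ENTRY_SIZE << PTE_INDEX_SIZE) == PAGE_SIZE, "PTE table is one page");
static_assert((ENTRY_SIZE << PGD_INDEX_SIZE) == PAGE_SIZE, "PGD table is one page");
static_assert((ENTRY_SIZE << PMD_INDEX_SIZE) == PAGE_SIZE / 4, "PMD table is a quarter page");
static_assert((ENTRY_SIZE << PUD_INDEX_SIZE) == PAGE_SIZE / 4, "PUD table is a quarter page");

/* 9 + 7 + 7 + 9 + 12 = 44 bits, i.e. 16TB of mappable address space. */
static_assert(PGTABLE_EADDR_SIZE == 44, "page tables cover 44 bits");

/* Hypothetical helper type: the per-level indices for one effective address. */
struct pt_indices {
    unsigned int pgd, pud, pmd, pte, offset;
};

/* Split an effective address into its four table indices and page offset. */
static struct pt_indices split_ea(uint64_t ea)
{
    struct pt_indices ix;

    ix.pgd    = (unsigned int)((ea >> PGDIR_SHIFT) & ((1u << PGD_INDEX_SIZE) - 1));
    ix.pud    = (unsigned int)((ea >> PUD_SHIFT)   & ((1u << PUD_INDEX_SIZE) - 1));
    ix.pmd    = (unsigned int)((ea >> PMD_SHIFT)   & ((1u << PMD_INDEX_SIZE) - 1));
    ix.pte    = (unsigned int)((ea >> PAGE_SHIFT)  & ((1u << PTE_INDEX_SIZE) - 1));
    ix.offset = (unsigned int)(ea & (PAGE_SIZE - 1));
    return ix;
}
```

The compile-time checks play the same role as the BUILD_BUG_ON lines in pgtable_cache_init(): if someone changes one index width, the build fails instead of the tables silently changing size.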
From rcblach at blach.dnsalias.org Tue May 31 22:19:20 2005 From: rcblach at blach.dnsalias.org (Ralph Blach) Date: Tue, 31 May 2005 08:19:20 -0400 Subject: book e idea Message-ID: <429C5648.2040501@blach.dnsalias.org> I had this idea for the Book E 44x processors. Given that Book E processors are really 33-bit processors, could the OS be run in translation state 0, and the users in translation state 1? The advantage would be that a user task could have true 4 GB addressing, and so could the kernel. This would eliminate the tuning between the virtual addresses of the kernel and user space. The disadvantage would be that the OS would be slower. This is just an idea, and I admit it might NOT be practical, but here it is anyway. Chip
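One way to picture the idea: in Book E each TLB entry carries a translation space (TS) bit that is compared against the address space selected by MSR[IS] or MSR[DS], so the same 32-bit effective address can map to different pages in each space. The following is only a toy software model of that match rule, not kernel code and not a proposal for how the split would be implemented. The struct layout, field names, and the tlb_match(), tlb_lookup() and tlb_model_demo() helpers are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a Book E style TLB entry (illustrative fields only). */
struct tlb_entry {
    bool     valid;
    uint8_t  ts;    /* translation space this entry belongs to: 0 or 1 */
    uint8_t  tid;   /* 0 matches any PID, otherwise must equal the PID */
    uint32_t epn;   /* effective page base, aligned to size */
    uint32_t size;  /* page size in bytes, a power of two */
    uint32_t rpn;   /* real page base, aligned to size */
};

/* An entry hits only if the address space bit, the PID and the page match. */
static bool tlb_match(const struct tlb_entry *e, uint8_t as, uint8_t pid,
                      uint32_t ea)
{
    uint32_t mask = ~(e->size - 1);

    return e->valid && e->ts == as &&
           (e->tid == 0 || e->tid == pid) &&
           (ea & mask) == e->epn;
}

/* Returns true and stores the real address in *ra on a hit. */
static bool tlb_lookup(const struct tlb_entry *tlb, size_t n, uint8_t as,
                       uint8_t pid, uint32_t ea, uint32_t *ra)
{
    for (size_t i = 0; i < n; i++) {
        if (tlb_match(&tlb[i], as, pid, ea)) {
            *ra = tlb[i].rpn | (ea & (tlb[i].size - 1));
            return true;
        }
    }
    return false;
}

/* Kernel (space 0) and a user task (space 1, PID 5) both use EA 0xC0000234
 * but reach different real pages, so neither has to give up address range. */
static void tlb_model_demo(void)
{
    const struct tlb_entry tlb[] = {
        { true, 0, 0, 0xC0000000u, 0x10000000u, 0x00000000u }, /* kernel, 256MB */
        { true, 1, 5, 0xC0000000u, 0x00001000u, 0x10000000u }, /* user, 4K */
    };
    uint32_t ra = 0;

    assert(tlb_lookup(tlb, 2, 0, 0, 0xC0000234u, &ra) && ra == 0x00000234u);
    assert(tlb_lookup(tlb, 2, 1, 5, 0xC0000234u, &ra) && ra == 0x10000234u);
    assert(!tlb_lookup(tlb, 2, 1, 6, 0xC0000234u, &ra)); /* wrong PID: miss */
}
```

The model says nothing about the cost side of the trade-off raised above; it only shows why the two spaces would not need to share one 4 GB effective address range.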