[PATCH v2] [POWERPC] Optimize counting distinct entries in the relocation sections
Medve Emilian
Emilian.Medve at freescale.com
Wed Nov 28 01:54:07 EST 2007
Hi Paul,
I'm just wondering what your latest thoughts are on this patch
(http://patchwork.ozlabs.org/linuxppc/patch?id=14707).
Cheers,
Emil.
> -----Original Message-----
> From: Medve Emilian
> Sent: Tuesday, November 13, 2007 10:24 AM
> To: paulus at samba.org; rusty at rustcorp.com.au; ntl at pobox.com;
> sfr at canb.auug.org.au; behlendorf1 at llnl.gov;
> galak at kernel.crashing.org; linuxppc-dev at ozlabs.org;
> linuxppc-embedded at ozlabs.org
> Cc: Medve Emilian
> Subject: [PATCH v2] [POWERPC] Optimize counting distinct
> entries in the relocation sections
>
> When a module has relocation sections with tens of thousands of
> entries, counting only the distinct/unique entries (i.e. no
> duplicates) at load time can take tens of seconds and up to minutes.
> The sore point is the count_relocs() function, which is called as
> part of the architecture-specific module loading path:
>
> -> load_module() generic
> -> module_frob_arch_sections() arch specific
> -> get_plt_size() 32-bit
> -> get_stubs_size() 64-bit
> -> count_relocs()
>
> (It's not clear why the relocation tables contain so many duplicates
> and why they are not trimmed by the linker at build time. In some
> test cases, out of 35K relocation entries only 1.5K were
> distinct/unique.)
>
> The previous counting algorithm had O(n^2) complexity. Basically, two
> solutions were proposed on the mailing list: a hash-based approach
> and a sort-based approach.
>
> The hash-based approach is the fastest (O(n)), but it needs
> additional memory, and in certain corner cases it could take lots of
> memory due to the degeneration of the hash. One such proposal was
> submitted here:
>
> http://ozlabs.org/pipermail/linuxppc-dev/2007-June/037641.html
>
> In this proposal, the symbol + addend are hashed to generate a key,
> and a pointer to the relocation entry is stored under it. The hash is
> implemented as a linked list of memory pages, each holding PAGE_SIZE /
> sizeof(Elfxx_Rela *) entries. In case of collisions in all the
> existing pages, a new page is added to the list to accommodate the
> new distinct relocation entry.
>
> For 32-bit PowerPCs with 4K pages, a page can accommodate 1K pointers
> to relocation entries. In the 35K-entry scenario, as many as six (6)
> pages could be allocated, using 24K of extra memory during the module
> load.
>
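> For illustration, a minimal sketch of that page-list hash as I read
> it, using the 32-bit types for concreteness -- the struct layout,
> names and error handling below are my assumptions, not the code from
> that link:
>
> 	#include <linux/mm.h>		/* PAGE_SIZE, get_zeroed_page() */
> 	#include <linux/errno.h>
> 	#include <linux/elf.h>
>
> 	/* One page worth of pointers to already-seen relocation entries */
> 	struct rela_hash_page {
> 		struct rela_hash_page *next;
> 		const Elf32_Rela *slot[];
> 	};
>
> 	#define RELA_SLOTS ((PAGE_SIZE - sizeof(struct rela_hash_page)) \
> 			    / sizeof(Elf32_Rela *))
>
> 	/* 1 = new distinct entry, 0 = duplicate, -ENOMEM on failure */
> 	static int rela_hash_insert(struct rela_hash_page **head,
> 				    const Elf32_Rela *r)
> 	{
> 		unsigned long key = (ELF32_R_SYM(r->r_info) + r->r_addend)
> 				    % RELA_SLOTS;
> 		struct rela_hash_page *p;
>
> 		for (p = *head; p; p = p->next) {
> 			const Elf32_Rela *e = p->slot[key];
>
> 			if (!e) {		/* free slot: a new entry */
> 				p->slot[key] = r;
> 				return 1;
> 			}
> 			if (ELF32_R_SYM(e->r_info) == ELF32_R_SYM(r->r_info)
> 			    && e->r_addend == r->r_addend)
> 				return 0;	/* seen before */
> 		}
>
> 		/* Key collided in every existing page: chain one more page */
> 		p = (struct rela_hash_page *)get_zeroed_page(GFP_KERNEL);
> 		if (!p)
> 			return -ENOMEM;
> 		p->next = *head;
> 		*head = p;
> 		p->slot[key] = r;
> 		return 1;
> 	}
>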
> The sort-based approach is slower (O(n * log n + n)), but if the
> sorting is done "in place" it doesn't need additional memory. A
> proposal was submitted here:
>
> http://ozlabs.org/pipermail/linuxppc-dev/2007-November/045854.html
> (http://patchwork.ozlabs.org/linuxppc/patch?filter=default&id=14573)
>
> In this proposal, an array of pointers to the relocation entries is
> built and then sorted using the kernel sort() utility function. This
> is basically a heapsort, whose O(n * log n) complexity is stable (no
> degenerate worst case). Once sorted, counting the distinct/unique
> entries is just linear (O(n)) complexity. The problem is the extra
> memory needed by this proposal: in the 35K relocation entries test
> case it can be as much as 140K (for 32-bit PowerPCs; double that for
> 64-bit). This is much more than the memory needed by the hash-based
> approach described above, but it doesn't hide potential degenerative
> corner cases.
>
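> Roughly, that approach looks like the sketch below -- again my own
> reconstruction rather than the code from the link; relaptrcmp and
> count_relocs_ptrsort are made-up names, and the R_PPC_REL24 filter
> from count_relocs() is elided:
>
> 	#include <linux/sort.h>
> 	#include <linux/vmalloc.h>
> 	#include <linux/elf.h>
>
> 	static int relaptrcmp(const void *_x, const void *_y)
> 	{
> 		const Elf32_Rela *x = *(const Elf32_Rela **)_x;
> 		const Elf32_Rela *y = *(const Elf32_Rela **)_y;
>
> 		if (x->r_info != y->r_info)
> 			return x->r_info < y->r_info ? -1 : 1;
> 		if (x->r_addend != y->r_addend)
> 			return x->r_addend < y->r_addend ? -1 : 1;
> 		return 0;
> 	}
>
> 	static unsigned int count_relocs_ptrsort(const Elf32_Rela *rela,
> 						 unsigned int num)
> 	{
> 		const Elf32_Rela **p;
> 		unsigned int i, distinct = 0;
>
> 		p = vmalloc(num * sizeof(*p));	/* the extra ~140K */
> 		if (!p)
> 			return 0;
> 		for (i = 0; i < num; i++)
> 			p[i] = &rela[i];
>
> 		/* Kernel heapsort; NULL selects the generic swap */
> 		sort(p, num, sizeof(*p), relaptrcmp, NULL);
>
> 		/* Duplicates are now adjacent: one linear pass counts them */
> 		for (i = 0; i < num; i++)
> 			if (i == 0 || relaptrcmp(&p[i - 1], &p[i]) != 0)
> 				distinct++;
>
> 		vfree(p);
> 		return distinct;
> 	}
>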
> The current patch is a happy compromise between the two proposals
> above: O(n + n * log n) performance with no additional memory
> requirements, since the relocation table itself is sorted in place.
>
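> (Rough arithmetic, assuming comparisons dominate: with n = 35K
> entries, the old algorithm's ~n^2/2 comparisons come to roughly 600
> million, while n * log2(n) is on the order of 550 thousand -- about a
> thousandfold reduction.)
>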
> Signed-off-by: Emil Medve <Emilian.Medve at Freescale.com>
> ---
>
> This patch only addresses counting the distinct R_PPC_REL24 entries
> for the purpose of sizing the PLT. For large relocation tables, this
> operation was/can be flagged by the kernel's soft-lockup detection
> logic.
>
> A related optimization (one that could require rewriting this patch)
> is to speed up the PLT search in do_plt_call(), but that doesn't seem
> to be a problem right now.
>
> The errors below are false positives, caused by checkpatch.pl
> mistaking the Elfxx_Rela types for variables/operands:
>
> linux-2.6> scripts/checkpatch.pl 0001-POWERPC-Optimize-counting-distinct-entries-in-the.patch
> ERROR: need consistent spacing around '*' (ctx:WxV)
> #116: FILE: arch/powerpc/kernel/module_32.c:78:
> + const Elf32_Rela *x, *y;
> ^
>
> ERROR: need consistent spacing around '*' (ctx:WxV)
> #228: FILE: arch/powerpc/kernel/module_64.c:122:
> + const Elf64_Rela *x, *y;
> ^
>
> total: 2 errors, 0 warnings, 218 lines checked
> Your patch has style problems, please review. If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
>
>  arch/powerpc/kernel/module_32.c |   77 ++++++++++++++++++++++++++++++-------
>  arch/powerpc/kernel/module_64.c |   81 ++++++++++++++++++++++++++++++--------
>  2 files changed, 127 insertions(+), 31 deletions(-)
>
> diff --git a/arch/powerpc/kernel/module_32.c b/arch/powerpc/kernel/module_32.c
> index 07a89a3..eab3138 100644
> --- a/arch/powerpc/kernel/module_32.c
> +++ b/arch/powerpc/kernel/module_32.c
> @@ -24,6 +24,7 @@
> #include <linux/kernel.h>
> #include <linux/cache.h>
> #include <linux/bug.h>
> +#include <linux/sort.h>
>
> #include "setup.h"
>
> @@ -54,22 +55,60 @@ void module_free(struct module *mod, void *module_region)
> addend) */
> static unsigned int count_relocs(const Elf32_Rela *rela, unsigned int num)
> {
> - unsigned int i, j, ret = 0;
> -
> - /* Sure, this is order(n^2), but it's usually short, and not
> - time critical */
> - for (i = 0; i < num; i++) {
> - for (j = 0; j < i; j++) {
> - /* If this addend appeared before, it's
> - already been counted */
> - if (ELF32_R_SYM(rela[i].r_info)
> - == ELF32_R_SYM(rela[j].r_info)
> - && rela[i].r_addend == rela[j].r_addend)
> - break;
> + unsigned int i, r_info, r_addend, _count_relocs;
> +
> + _count_relocs = 0;
> + r_info = 0;
> + r_addend = 0;
> + for (i = 0; i < num; i++)
> + /* Only count 24-bit relocs, others don't need stubs */
> + if (ELF32_R_TYPE(rela[i].r_info) == R_PPC_REL24 &&
> + (r_info != ELF32_R_SYM(rela[i].r_info) ||
> + r_addend != rela[i].r_addend)) {
> + _count_relocs++;
> + r_info = ELF32_R_SYM(rela[i].r_info);
> + r_addend = rela[i].r_addend;
> }
> - if (j == i) ret++;
> +
> + return _count_relocs;
> +}
> +
> +static int relacmp(const void *_x, const void *_y)
> +{
> + const Elf32_Rela *x, *y;
> +
> + y = (Elf32_Rela *)_x;
> + x = (Elf32_Rela *)_y;
> +
> + /* Compare the entire r_info (as opposed to ELF32_R_SYM(r_info) only) to
> + * make the comparison cheaper/faster. It won't affect the sorting or
> + * the counting algorithms' performance
> + */
> + if (x->r_info < y->r_info)
> + return -1;
> + else if (x->r_info > y->r_info)
> + return 1;
> + else if (x->r_addend < y->r_addend)
> + return -1;
> + else if (x->r_addend > y->r_addend)
> + return 1;
> + else
> + return 0;
> +}
> +
> +static void relaswap(void *_x, void *_y, int size)
> +{
> + uint32_t *x, *y, tmp;
> + int i;
> +
> + y = (uint32_t *)_x;
> + x = (uint32_t *)_y;
> +
> + for (i = 0; i < sizeof(Elf32_Rela) / sizeof(uint32_t); i++) {
> + tmp = x[i];
> + x[i] = y[i];
> + y[i] = tmp;
> }
> - return ret;
> }
>
> /* Get the potential trampolines size required of the init and
> @@ -100,6 +139,16 @@ static unsigned long get_plt_size(const Elf32_Ehdr *hdr,
> DEBUGP("Ptr: %p. Number: %u\n",
> (void *)hdr + sechdrs[i].sh_offset,
> sechdrs[i].sh_size / sizeof(Elf32_Rela));
> +
> + /* Sort the relocation information based on a symbol and
> + * addend key. This is a stable O(n*log n) complexity
> + * algorithm but it will reduce the complexity of
> + * count_relocs() to linear complexity O(n)
> + */
> + sort((void *)hdr + sechdrs[i].sh_offset,
> + sechdrs[i].sh_size / sizeof(Elf32_Rela),
> + sizeof(Elf32_Rela), relacmp, relaswap);
> +
> ret += count_relocs((void *)hdr + sechdrs[i].sh_offset,
> sechdrs[i].sh_size / sizeof(Elf32_Rela));
> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index 75c7c4f..3a82b02 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -24,6 +24,7 @@
> #include <asm/module.h>
> #include <asm/uaccess.h>
> #include <asm/firmware.h>
> +#include <linux/sort.h>
>
> #include "setup.h"
>
> @@ -81,25 +82,23 @@ static struct ppc64_stub_entry ppc64_stub =
> different addend) */
> static unsigned int count_relocs(const Elf64_Rela *rela, unsigned int num)
> {
> - unsigned int i, j, ret = 0;
> + unsigned int i, r_info, r_addend, _count_relocs;
>
> /* FIXME: Only count external ones --RR */
> - /* Sure, this is order(n^2), but it's usually short, and not
> - time critical */
> - for (i = 0; i < num; i++) {
> + _count_relocs = 0;
> + r_info = 0;
> + r_addend = 0;
> + for (i = 0; i < num; i++)
> /* Only count 24-bit relocs, others don't need stubs */
> - if (ELF64_R_TYPE(rela[i].r_info) != R_PPC_REL24)
> - continue;
> - for (j = 0; j < i; j++) {
> - /* If this addend appeared before, it's
> - already been counted */
> - if (rela[i].r_info == rela[j].r_info
> - && rela[i].r_addend == rela[j].r_addend)
> - break;
> + if (ELF64_R_TYPE(rela[i].r_info) == R_PPC_REL24 &&
> + (r_info != ELF64_R_SYM(rela[i].r_info) ||
> + r_addend != rela[i].r_addend)) {
> + _count_relocs++;
> + r_info = ELF64_R_SYM(rela[i].r_info);
> + r_addend = rela[i].r_addend;
> }
> - if (j == i) ret++;
> - }
> - return ret;
> +
> + return _count_relocs;
> }
>
> void *module_alloc(unsigned long size)
> @@ -118,6 +117,44 @@ void module_free(struct module *mod, void *module_region)
> table entries. */
> }
>
> +static int relacmp(const void *_x, const void *_y)
> +{
> + const Elf64_Rela *x, *y;
> +
> + y = (Elf64_Rela *)_x;
> + x = (Elf64_Rela *)_y;
> +
> + /* Compare the entire r_info (as opposed to ELF64_R_SYM(r_info) only) to
> + * make the comparison cheaper/faster. It won't affect the sorting or
> + * the counting algorithms' performance
> + */
> + if (x->r_info < y->r_info)
> + return -1;
> + else if (x->r_info > y->r_info)
> + return 1;
> + else if (x->r_addend < y->r_addend)
> + return -1;
> + else if (x->r_addend > y->r_addend)
> + return 1;
> + else
> + return 0;
> +}
> +
> +static void relaswap(void *_x, void *_y, int size)
> +{
> + uint64_t *x, *y, tmp;
> + int i;
> +
> + y = (uint64_t *)_x;
> + x = (uint64_t *)_y;
> +
> + for (i = 0; i < sizeof(Elf64_Rela) / sizeof(uint64_t); i++) {
> + tmp = x[i];
> + x[i] = y[i];
> + y[i] = tmp;
> + }
> +}
> +
> /* Get size of potential trampolines required. */
> static unsigned long get_stubs_size(const Elf64_Ehdr *hdr,
> const Elf64_Shdr *sechdrs)
> @@ -133,6 +170,16 @@ static unsigned long get_stubs_size(const Elf64_Ehdr *hdr,
> DEBUGP("Ptr: %p. Number: %lu\n",
> (void *)sechdrs[i].sh_addr,
> sechdrs[i].sh_size / sizeof(Elf64_Rela));
> +
> + /* Sort the relocation information based on a symbol and
> + * addend key. This is a stable O(n*log n) complexity
> + * algorithm but it will reduce the complexity of
> + * count_relocs() to linear complexity O(n)
> + */
> + sort((void *)sechdrs[i].sh_addr,
> + sechdrs[i].sh_size / sizeof(Elf64_Rela),
> + sizeof(Elf64_Rela), relacmp, relaswap);
> +
> relocs += count_relocs((void *)sechdrs[i].sh_addr,
> sechdrs[i].sh_size
> / sizeof(Elf64_Rela));
> @@ -343,7 +390,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
> /* Simply set it */
> *(u32 *)location = value;
> break;
> -
> +
> case R_PPC64_ADDR64:
> /* Simply set it */
> *(unsigned long *)location = value;
> @@ -399,7 +446,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
> }
>
> /* Only replace bits 2 through 26 */
> - *(uint32_t *)location
> + *(uint32_t *)location
> = (*(uint32_t *)location & ~0x03fffffc)
> | (value & 0x03fffffc);
> break;
> --
> 1.5.3.GIT