[RFC/PATCH] ppc64: Add mem=X option

Olof Johansson olof at austin.ibm.com
Thu Feb 24 15:38:44 EST 2005


Hi,

On Thu, Feb 24, 2005 at 03:17:58PM +1100, Michael Ellerman wrote:

> Thanks for the comments, I'll send an updated patch later today or tomorrow.
> I'll fix 

Cool, a couple of comments/answers below.

> On Wed, 23 Feb 2005 14:40, Olof Johansson wrote:
> > > + for (i = 0; i < mem->cnt; i++) {
> > > +  total += mem->region[i].size;
> > > +
> > > +  if (total <= memory_limit)
> > > +   continue;
> > > +
> > > +  crop = (memory_limit - (total - mem->region[i].size));
> ...
> > > +  mem->region[i].size = crop;
> > > +  mem->cnt = i + 1;
> > > +  break;
> >
> > Also, if you just reduce the size of memory_limit instead of keep the
> > rolling total, the crop calculation will be simpler. It took a bit of
> > thinking to make sure it's right the way it's written now.
> 
> I've rewritten this twice already, and I agree it's a bit ugly. I hadn't 
> thought of decrementing the limit. I'll fix it to avoid a 0 sized region too.
> 
> > > + /* Bolt kernel mappings for all of memory */
> > > + iSeries_bolt_kernel(0, systemcfg->physicalMemorySize);
> >
> > Do you want to bolt it all, even if there's a mem= limitation?
> 
> I'm not sure, it seems to boot happily either way. But I think it's better not 
> to bolt it all - that more closely resembles the situation we're trying to 
> simulate, ie that there is no more RAM than the limit.

The only drawback of always bolting it all is that noone will find bugs
caused by referencing kernel memory that's past the limit. :-)

> > > + if (unlikely(memory_limit && tce_alloc_start && tce_alloc_end)) {
> >
> > Do you need to check both tce_alloc_start and tce_alloc_end? Won't both
> > be set if one is?
> 
> Yeah either both tce variables should be set or none.
> 
> > > +  if (base + size >= tce_alloc_start)
> > > +   tce_alloc_start = base + size + 1;
> > > +
> > > +  create_pte_mapping(tce_alloc_start, tce_alloc_end,
> > > +   mode_rw, use_largepages);
> >
> > Could tce_alloc_end ever be below memory_limit too?
> 
> No. tce_alloc_end is just ram_top in prom_init.c (before memory_limit) 
> But I might change the prom_init.c code to make that more explicit, rather 
> than the current code where we copy ram_top -> alloc_top_high -> 
> tce_alloc_end.

I'm happy with the way it is now, I just wanted to ask.

> > You might need to check tce_alloc_start and end to make sure you can use
> > 16MB pages, or if you need 4K because of alignment/size constraints.
> >
> > Even if you don't have to, a comment related to it could be warranted.
> 
> Not sure what you mean here. If start/end aren't 16MB aligned then I should 
> use 4K pages for the mapping?

Well, the current create_pte_mapping will, if 16MB pages are used,
potentially map a bit more than the limit if it's not on a 16MB
boundary. Likewise, if the tce tables aren't starting/ending on 16MB
boundaries, more will be mapped.  Either you can just ignore it and
round accordingly, or you'll need to downgrade to use 4KB mappings with
less granularity. The problem then becomes that you can't mix the page
sizes in the same segment, so you'd need to map either the whole kernel
or just the segments in question with 4K pages.

Rounding is probably better and easier. Since you already do it for
memory_limit, adding it for the tce range (start and end) is all that's
needed.

> > > +  for (i++; i <= max_domain; i++) {
> >
> > I think I would change this to a regular while(++i < max_domain) loop
> > instead.
> 
> Agreed. That's ugly, I just copied the for loop that was there.
> 
> > > +   dbg("NUMA: offlining node %ld for memory_limit\n", i);
> > > +   node_set_offline(i);
> >
> > Just because the node doesn't get memory allocated we can't set it
> > offline. I.e. the cpus will still be online.
> 
> Ah crud, I thought I was just offlining the memory.
> 
> > > +  for (i = 0; i <= max_domain; i++) 
> > > +   node_set_online(i);
> >
> > Good question, I think can go. That code was added by Matt Dobson earlier
> > this year.
> 
> Well, if I'm not calling node_set_offline() above then it doesn't matter.
> He added that when he added the for_each_online_node() stuff, and if there are 
> gaps in the node numbering then the extra for loop does have an effect, I'm 
> not game to touch it in case it's doing something subtle that I'm missing.

Sure, it can always be taken out later if someone feels up for it and
can prove it won't break anything. NUMA is messy in the sense that it
takes a while for a weird config case to show up that will break
assumptions, etc.

> > > +static unsigned long __initdata memory_limit;
> > > +static unsigned long __initdata tce_alloc_start;
> > > +static unsigned long __initdata tce_alloc_end;
> >
> > A little confusing to have the same variable names here as local
> > statics. Maybe rename them? Or use the global.
> 
> Benh wants to keep the prom_init <=> kernel interface clean so I'll keep them 
> but rename them.

Good point. Separate is fine with me.

> > > + while (isxdigit(*cp) &&
> > > +        (value = isdigit(*cp) ? *cp - '0' : toupper(*cp) - 'A' + 10) <
> > > base) { +  result = result * base + value;
> > > +  cp++;
> > > + }
> >
> > Hmm. Would you mind breaking that up? It'll be a few more lines but much
> > easier to read.
> 
> Yeah it's not pretty, I just copied it from the regular strtoul but I'll clean 
> it up.

Ah, might as well leave it be then, there's value in keeping the code
common even if the original was a bit messy.

> > Do other architectures swallow/tolerate a b/B after the unit? Could be
> > nice.
> 
> Yes and no. That's copied from the generic memparse() which everyone else 
> uses. And although it doesn't have a b/B case it shouldn't choke if we do 
> have one, it'll just be ignored and the next strstr() will skip it.

Good point. Should be fine the way it is then.

> > > +  /* Align to 16 MB == size of large page */
> > > +  RELOC(memory_limit) = ALIGN(RELOC(memory_limit), 0x1000000);
> >
> > Maybe a printk to say that it's been rounded up, so we don't surprise
> > the user?
> 
> Good plan.
> 
> > > +  if (RELOC(memory_limit) <= RELOC(alloc_bottom)) {
> > > +   prom_printf("Ignoring mem=%x <= alloc_bottom.\n",
> > > +    RELOC(memory_limit));
> > > +   RELOC(memory_limit) = 0;
> >
> > ...or should it just be bumped up to include alloc_bottom instead?
> 
> That would just mean the first allocation will fail and we'll panic, because 
> alloc_bottom == memory_limit == alloc_top (see alloc_up() in prom_init.c). We 
> could just get rid of the checking but I figure it's nicer to complain and 
> still boot than panic.

Yep, that's better. The alternative would be to add some sort of
arbitrary margin for the machine to have enough memory to come up,
but that's way ugly.


-Olof



More information about the Linuxppc64-dev mailing list