[Skiboot] [PATCH] fast-reboot: parallel memory clearing

Stewart Smith stewart at linux.ibm.com
Tue Apr 17 20:11:41 AEST 2018


Nicholas Piggin <nicholas.piggin at gmail.com> writes:

> On Tue, 17 Apr 2018 18:46:07 +1000
> Stewart Smith <stewart at linux.ibm.com> wrote:
>
>> A real simple hack to split things up into 16GB jobs,
>> and name them as how far we are into zeroing, so we can
>> simply print progress out as XGB cleared as we wait for
>> the jobs to finish.
>> 
>> This seems to cut at least ~40% time from memory zeroing on
>> fast-reboot on a 256GB Boston system.
>
> Cool! I'm assuming this is just an initial hack, but I'll give some
> comments anyway.

Initial hack *while* distracted and trying not to accidentally 'git
reset --hard' the wrong tree. Miracle it boots at all. git-send-email is
my backup system.

>> Signed-off-by: Stewart Smith <stewart at linux.ibm.com>
>> ---
>>  core/mem_region.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 61 insertions(+), 2 deletions(-)
>> 
>> diff --git a/core/mem_region.c b/core/mem_region.c
>> index 8ae49bb790fd..a06e725db3ec 100644
>> --- a/core/mem_region.c
>> +++ b/core/mem_region.c
>> @@ -1206,19 +1206,47 @@ static void mem_clear_range(uint64_t s, uint64_t e)
>>  		return;
>>  	}
>>  
>> -	prlog(PR_NOTICE, "Clearing region %llx-%llx\n",
>> +	prlog(PR_DEBUG, "Clearing region %llx-%llx\n",
>>  	      (long long)s, (long long)e);
>>  	memset((void *)s, 0, e - s);
>>  }
>>  
>> +static void mem_region_clear_job(void *data)
>> +{
>> +	uint64_t *arg = (uint64_t*)data;
>> +	mem_clear_range(arg[0], arg[1]);
>> +}
>> +
>> +#define MEM_REGION_CLEAR_JOB_SIZE (16ULL*(1<<30))
>> +
>>  void mem_region_clear_unused(void)
>>  {
>> +	int njobs = 0;
>> +	struct cpu_job **jobs;
>>  	struct mem_region *r;
>> +	uint64_t *job_args;
>> +	uint64_t s,l;
>> +	uint64_t total = 0;
>> +	char **job_names;
>> +	int i;
>
> Can you do a struct for all of these dynamic allocations?

Yeah, that makes a lot more sense. I should probably name the jobs
separately from what I log too.

>>  	lock(&mem_region_lock);
>>  	assert(mem_regions_finalised);
>>  
>> +	list_for_each(&regions, r, list) {
>> +		if (!(r->type == REGION_OS))
>> +			continue;
>> +		njobs++;
>> +		/* One job per 16GB */
>> +		njobs += r->len / MEM_REGION_CLEAR_JOB_SIZE;
>
> A quick glance and I thought you had an off by one there in case of not
> 16GB length, but no, because that initial increment. So maybe you have
> one too many if it is 16GB multiple?
>
> How about
>
>   njobs += (r->len + MEM_REGION_CLEAR_JOB_SIZE - 1) /
>                    MEM_REGION_CLEAR_JOB_SIZE;

Yeah, that looks better.

>> @@ -1226,9 +1254,40 @@ void mem_region_clear_unused(void)
>>  
>>  		assert(r != &skiboot_heap);
>>  
>> -		mem_clear_range(r->start, r->start + r->len);
>> +		s = r->start;
>> +		l = r->len;
>> +		while(l > MEM_REGION_CLEAR_JOB_SIZE) {
>> +			job_args[i*2] = s+l - MEM_REGION_CLEAR_JOB_SIZE;
>> +			job_args[i*2+1] = s+l;
>> +			l-=MEM_REGION_CLEAR_JOB_SIZE;
>> +			job_names[i] = malloc(sizeof(char)*40);
>> +			total+=MEM_REGION_CLEAR_JOB_SIZE;
>> +			snprintf(job_names[i], 40, "%lldGB cleared", total/(1<<30));
>> +			jobs[i] = cpu_queue_job(NULL, job_names[i],
>> +						mem_region_clear_job,
>> +						&job_args[i*2]);
>> +			i++;
>
> I'd be going the other way and incrementing s by the job size on each
> round, and assigning a length of min(job size, remaining) each
> iteration, so you should get them all into the one loop. Just my
> preference.

Yeah, that's nicer.

> Bonus points for scheduling memory clearing on to node local cores.
> That looks like it will take a bit of work in the job scheduler code,
> so probably best left to patch 2.

Yeah, I initially started down that route and quickly realised that it
needed multiple patches :)

-- 
Stewart Smith
OPAL Architect, IBM.


