[PATCH] powerpc/pseries/vas: Migration suspend waits for no in-progress open windows

Thu Oct 12 07:36:58 AEDT 2023

Haren Myneni <haren at linux.ibm.com> writes:
>> Haren Myneni <haren at linux.ibm.com> writes:
>>> The hypervisor returns migration failure if all VAS windows are not
>>> closed. During pre-migration stage, vas_migration_handler() sets
>>> migration_in_progress flag and closes all windows from the list.
>>> The allocate VAS window routine checks the migration flag, setup
>>> the window and then add it to the list. So there is possibility of
>>> the migration handler missing the window that is still in the
>>> process of setup.
>>>
>>> t1: Allocate and open VAS	t2: Migration event
>>>      window
>>>
>>> lock vas_pseries_mutex
>>> If migration_in_progress set
>>>    unlock vas_pseries_mutex
>>>    return
>>> open window HCALL
>>> unlock vas_pseries_mutex
>>> Modify window HCALL		lock vas_pseries_mutex
>>> setup window			migration_in_progress=true
>>> 				Closes all windows from
>>> 				the list
>>> 				unlock vas_pseries_mutex
>>> lock vas_pseries_mutex		return
>>> if nr_closed_windows == 0
>>>    // No DLPAR CPU or migration
>>>    add to the list
>>>    unlock vas_pseries_mutex
>>>    return
>>> unlock vas_pseries_mutex
>>> Close VAS window
>>> // due to DLPAR CPU or migration
>>> return -EBUSY
>> 
>> Could the the path t1 takes simply hold the mutex for the duration of
>> its execution instead of dropping and reacquiring it in the middle?
>> 
>> Here's the relevant code from vas_allocate_window():
>> 
>> 	mutex_lock(&vas_pseries_mutex);
>> 	if (migration_in_progress)
>> 		rc = -EBUSY;
>> 	else
>> 		rc = allocate_setup_window(txwin, (u64 *)&domain[0],
>> 				   cop_feat_caps->win_type);
>> 	mutex_unlock(&vas_pseries_mutex);
>> 	if (rc)
>> 		goto out;
>> 
>> 	rc = h_modify_vas_window(txwin);
>> 	if (!rc)
>> 		rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
>> 	if (rc)
>> 		goto out_free;
>> 
>> 	txwin->win_type = cop_feat_caps->win_type;
>> 	mutex_lock(&vas_pseries_mutex);
>> 	if (!caps->nr_close_wins) {
>> 		list_add(&txwin->win_list, &caps->list);
>> 		caps->nr_open_windows++;
>> 		mutex_unlock(&vas_pseries_mutex);
>> 		vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
>> 		return &txwin->vas_win;
>> 	}
>> 	mutex_unlock(&vas_pseries_mutex);
>> 
>> Is there something about h_modify_vas_window() or get_vas_user_win_ref()
>> that requires temporarily dropping the lock?
>> 
>
> Thanks Nathan for your comments.
>
> vas_pseries_mutex protects window ID and IRQ allocation between alloc 
> and free window HCALLs, and window list. Generally try to not using 
> mutex in HCALLs, but we need this mutex with these HCALLs.
>
> We can add h_modify_vas_window() or get_vas_user_win_ref() with in the 
> mutex context, but not needed.

Hmm. I contend that it would fix your bug in a simpler way that
eliminates the race instead of coping with it by adding more state and
complicating the locking model. With your change, readers of the
migration_in_progress flag check it under the mutex, but the writer
updates it outside of the mutex, which seems strange and unlikely to be
correct.

> Also free HCALL can take longer depends 
> on pending NX requests since the hypervisor waits for all requests to be 
> completed before closing the window.
>
> Applications can issue window open / free calls at the same time which 
> can experience mutex contention especially on the large system where 
> 100's of credits are available. Another example: The migration event can 
> wait longer (or timeout) to get this mutex if many threads issue 
> open/free window calls. Hence added h_modify_vas_window() (modify HCALL) 
> or get_vas_user_win_ref() outside of mutex.

OK. I believe you're referring to this code, which can run under the lock:

static long hcall_return_busy_check(long rc)
{
	/* Check if we are stalled for some time */
	if (H_IS_LONG_BUSY(rc)) {
		msleep(get_longbusy_msecs(rc));
		rc = H_BUSY;
	} else if (rc == H_BUSY) {
		cond_resched();
	}

	return rc;
}

...

static int h_deallocate_vas_window(u64 winid)
{
	long rc;

	do {
		rc = plpar_hcall_norets(H_DEALLOCATE_VAS_WINDOW, winid);

		rc = hcall_return_busy_check(rc);
	} while (rc == H_BUSY);

...

[ with similar loops in the window allocate and modify functions ]

If get_longbusy_msecs() typically returns low values (<20), then you
should prefer usleep_range() over msleep() to avoid over-sleeping. For
example, msleep(1) can schedule away for much longer than 1ms -- often
more like 20ms.

If mutex hold times have been a problem in this code, then it's probably
worth improving the way it handles the busy/retry statuses under the
lock.