[PATCH] powerpc/pseries/vas: Migration suspend waits for no in-progress open windows
Nathan Lynch
nathanl at linux.ibm.com
Thu Oct 12 07:36:58 AEDT 2023
Haren Myneni <haren at linux.ibm.com> writes:
>> Haren Myneni <haren at linux.ibm.com> writes:
>>> The hypervisor returns migration failure if all VAS windows are not
>>> closed. During pre-migration stage, vas_migration_handler() sets
>>> migration_in_progress flag and closes all windows from the list.
>>> The allocate VAS window routine checks the migration flag, setup
>>> the window and then add it to the list. So there is possibility of
>>> the migration handler missing the window that is still in the
>>> process of setup.
>>>
>>> t1: Allocate and open VAS      t2: Migration event
>>>     window
>>>
>>> lock vas_pseries_mutex
>>> If migration_in_progress set
>>>   unlock vas_pseries_mutex
>>>   return
>>> open window HCALL
>>> unlock vas_pseries_mutex
>>> Modify window HCALL            lock vas_pseries_mutex
>>> setup window                   migration_in_progress=true
>>>                                Closes all windows from
>>>                                the list
>>>                                unlock vas_pseries_mutex
>>> lock vas_pseries_mutex         return
>>> if nr_closed_windows == 0
>>>   // No DLPAR CPU or migration
>>>   add to the list
>>>   unlock vas_pseries_mutex
>>>   return
>>> unlock vas_pseries_mutex
>>> Close VAS window
>>> // due to DLPAR CPU or migration
>>> return -EBUSY
>>
>> Could the path t1 takes simply hold the mutex for the duration of
>> its execution instead of dropping and reacquiring it in the middle?
>>
>> Here's the relevant code from vas_allocate_window():
>>
>>         mutex_lock(&vas_pseries_mutex);
>>         if (migration_in_progress)
>>                 rc = -EBUSY;
>>         else
>>                 rc = allocate_setup_window(txwin, (u64 *)&domain[0],
>>                                            cop_feat_caps->win_type);
>>         mutex_unlock(&vas_pseries_mutex);
>>         if (rc)
>>                 goto out;
>>
>>         rc = h_modify_vas_window(txwin);
>>         if (!rc)
>>                 rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
>>         if (rc)
>>                 goto out_free;
>>
>>         txwin->win_type = cop_feat_caps->win_type;
>>         mutex_lock(&vas_pseries_mutex);
>>         if (!caps->nr_close_wins) {
>>                 list_add(&txwin->win_list, &caps->list);
>>                 caps->nr_open_windows++;
>>                 mutex_unlock(&vas_pseries_mutex);
>>                 vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
>>                 return &txwin->vas_win;
>>         }
>>         mutex_unlock(&vas_pseries_mutex);
>>
>> Is there something about h_modify_vas_window() or get_vas_user_win_ref()
>> that requires temporarily dropping the lock?
>>
>
> Thanks Nathan for your comments.
>
> vas_pseries_mutex protects window ID and IRQ allocation between the
> alloc and free window HCALLs, and it protects the window list. We
> generally try not to hold a mutex across HCALLs, but we need it with
> these HCALLs.
>
> We can call h_modify_vas_window() or get_vas_user_win_ref() within the
> mutex context, but it is not needed.
Hmm. I contend that it would fix your bug in a simpler way that
eliminates the race instead of coping with it by adding more state and
complicating the locking model. With your change, readers of the
migration_in_progress flag check it under the mutex, but the writer
updates it outside of the mutex, which seems strange and unlikely to be
correct.
> Also the free HCALL can take longer depending on pending NX requests,
> since the hypervisor waits for all requests to be completed before
> closing the window.
>
> Applications can issue window open / free calls at the same time, which
> can cause mutex contention, especially on large systems where hundreds
> of credits are available. Another example: the migration event can wait
> longer (or time out) trying to take this mutex if many threads issue
> open/free window calls. Hence h_modify_vas_window() (the modify HCALL)
> and get_vas_user_win_ref() are called outside of the mutex.
OK. I believe you're referring to this code, which can run under the lock:
static long hcall_return_busy_check(long rc)
{
        /* Check if we are stalled for some time */
        if (H_IS_LONG_BUSY(rc)) {
                msleep(get_longbusy_msecs(rc));
                rc = H_BUSY;
        } else if (rc == H_BUSY) {
                cond_resched();
        }

        return rc;
}
...
static int h_deallocate_vas_window(u64 winid)
{
        long rc;

        do {
                rc = plpar_hcall_norets(H_DEALLOCATE_VAS_WINDOW, winid);

                rc = hcall_return_busy_check(rc);
        } while (rc == H_BUSY);
...
[ with similar loops in the window allocate and modify functions ]
If get_longbusy_msecs() typically returns low values (<20), then you
should prefer usleep_range() over msleep() to avoid over-sleeping. For
example, msleep(1) can schedule away for much longer than 1ms -- often
more like 20ms.
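One possible (untested) shape for that branch, assuming
get_longbusy_msecs() continues to return milliseconds, would be
something like:

```
	if (H_IS_LONG_BUSY(rc)) {
		unsigned int ms = get_longbusy_msecs(rc);

		/* usleep_range() takes microseconds; prefer it for
		 * short delays so we don't oversleep by a jiffy or
		 * more, per Documentation/timers/timers-howto.rst. */
		if (ms < 20)
			usleep_range(ms * USEC_PER_MSEC,
				     (ms + 1) * USEC_PER_MSEC);
		else
			msleep(ms);
		rc = H_BUSY;
	}
```

The 20ms threshold follows the guidance in timers-howto.rst, which
recommends usleep_range() for delays between ~10us and ~20ms and
msleep() only above that.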
If mutex hold times have been a problem in this code, then it's probably
worth improving the way it handles the busy/retry statuses under the
lock.