[Skiboot] [PATCH v3 06/13] hiomap: Enable Async IPMI messaging

Deb McLemore debmc at linux.ibm.com
Tue Jan 7 23:50:17 AEDT 2020

On 1/6/20 9:43 PM, Oliver O'Halloran wrote:
> On Fri, Jan 3, 2020 at 3:38 AM Deb McLemore <debmc at linux.ibm.com> wrote:
>> *snip*
>>>>>>         unlock(&flash_lock);
>>>>> I'm not sure that we even need to use try_lock() with patch 03/13
>>>>> applied. With that patch the flash's lock should only be held briefly
>>>>> so we should be able to use the normal (blocking) lock() since we
>>>>> don't need to deal with waiting for a 64MB flash read to complete.
>>> ^^^^^^^^^^^^^^^^^^
>>> This is what I really wanted a response to.
>>> All the games above with try_lock() and checking the busy flag outside
>>> the lock only make sense to me if we're need to avoid being stuck
>>> waiting for a lock in skiboot. The changes in 3/13 result in the lock
>>> only being held briefly since the only thing that needs to be done
>>> under the lock is checking the busy flag. Why not do this instead:
>>> diff --git a/core/flash.c b/core/flash.c
>>> index d97d88edbdd5..8fcc9d46ca59 100644
>>> --- a/core/flash.c
>>> +++ b/core/flash.c
>>> @@ -64,8 +64,7 @@ static bool flash_reserve(struct flash *flash)
>>>  {
>>>       bool rc = false;
>>> -     if (!try_lock(&flash_lock))
>>> -             return false;
>>> +     lock(&flash_lock);
>>>       if (!flash->busy) {
>>>               flash->busy = true;
>>> That change probably should have been in Cyril's patch. Oh well, people
>>> write bugs sometimes.
>> The logic added is to add resiliency.
> Resiliency against what? Performing multiple flash operations in
> parallel is something that we've never really supported  and the
> caller needs to handle the OPAL_BUSY case anyway. If the operation
> can't be done now then we should return BUSY to the caller and let it
> re-try the operation. What does this hack help with? What breaks if we
> don't do this?
>> If the flash_lock is secured, this
>> does not mean the flash->busy will always be available so we check the
>> status of the data and
> I know.
>> allow some iterations to maybe persist in the current operation.
> Again, why is this worth doing?

During testing with a single caller there are timing windows in which
the lock can be acquired just after another thread has freed it but
while the busy flag is still set.  Without the retry logic, simple
requests fail because of those races.
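The race being described can be sketched roughly as follows. This is an illustrative, self-contained simulation of the try_lock-plus-busy-flag reservation pattern under discussion, not the actual skiboot code; the names (flash_try_reserve, FLASH_RETRY_LIMIT, try_lock_sim) are made up for the example:

```c
/* Sketch: reserving the flash can hit a window where the lock is
 * free but the busy flag has not yet cleared, so the caller retries
 * a bounded number of times instead of failing on the first miss.
 * Hypothetical names; not skiboot's implementation. */
#include <stdbool.h>

#define FLASH_RETRY_LIMIT 10

struct flash_state {
	bool locked;	/* stands in for flash_lock */
	bool busy;	/* stands in for flash->busy */
};

static bool try_lock_sim(struct flash_state *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

static void unlock_sim(struct flash_state *f)
{
	f->locked = false;
}

static bool flash_try_reserve(struct flash_state *f)
{
	for (int i = 0; i < FLASH_RETRY_LIMIT; i++) {
		if (!try_lock_sim(f))
			continue;	/* lock held elsewhere: retry */
		if (!f->busy) {
			f->busy = true;	/* reservation succeeded */
			unlock_sim(f);
			return true;
		}
		unlock_sim(f);		/* busy: drop lock and retry */
	}
	return false;	/* caller would report OPAL_BUSY */
}
```

With a plain blocking lock() the busy check happens exactly once, which is the behaviour Oliver is arguing is sufficient; the bounded retry above is what the patch's extra iterations buy.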

>>>>>> @@ -279,12 +340,31 @@ static void flash_poll(struct timer *t __unused, void *data, uint64_t now __unus
>>>>>>                 assert(0);
>>>>>>         }
>>>>>> -       if (rc)
>>>>>> -               rc = OPAL_HARDWARE;
>>>>>> +       if (rc == 0) {
>>>>>> +               /*
>>>>>> +                * if we are good to proceed forward
>>>>>> +                * otherwise we may have to try again
>>>>>> +                */
>>>>>> +               flash->async.pos += len;
>>>>>> +               flash->async.buf += len;
>>>>>> +               flash->async.len -= len;
>>>>>> +               /* if we good clear */
>>>>>> +               flash->async.retry_counter = 0;
>>>>>> +               /*
>>>>>> +                * clear the IN_PROGRESS flags
>>>>>> +                * we only need IN_PROGRESS active on missed_cc
>>>>>> +                */
>>>>>> +               flash->bl->flags &= IN_PROGRESS_MASK;
>>>>>> +               /* reset time for scheduling gap */
>>>>>> +               flash->async.in_progress_schedule_delay = FLASH_SCHEDULE_DELAY;
>>>> async flash layer is the one that is managing the work,
>>>> block layer is a worker which does not know how the
>>>> work is being managed.
>>>>> The default delay should come from the blocklevel rather than being
>>>>> assumed by the user of libflash. The bl has a better idea about what
>>>>> sort of interval is required between ops and what a reasonable timeout
>>>>> is.
>>> What do you mean by "managing"? The general pattern for async IO in
>>> skiboot is:
>>> upper layer:
>>>       1. Sets up a job
>>>       2. Submit job it to the worker
>>>       3. Periodically polls the worker until completed.
>>> worker:
>>>       1. Do all the work you can until a wait is required.
>>>       2. Return a wait time to the upper layer.
>>> All the details of how the job is handled are supposed to be owned by
>>> the worker. The upper layer does not, and should not, know any of the
>>> details of how the job is actually handled. All the upper layer really
>>> wants is a completion code from the worker. This does mean you need to
>>> keep a bit more of the transaction state in the blocklayer, but I think
>>> that's fine.
>>>> Problem cases are when the async has split the work up into
>>>> chunks.  If a chunk desired to be read straddles the window
>>>> (meaning that only a subset of the chunk is available in the
>>>> current window) we have to bump down the async chunk
>>>> desired to be read so that it will succeed.  Then we can
>>>> acquire a new window which we can then do subsequent reads
>>>> (again at the async flash layer).
>>>>> I can't parse this, but it seems like it's all blocklevel specific
>>>>> stuff that should be documented there. From the perspective of a
>>>>> libflash user we don't want to know or care about *why* we need to do
>>>>> a async delay, just that we have to.
>>> As mentioned above, I don't see why this can't be handled in the
>>> blocklayer. Splitting a single read because it crosses a window
>>> boundary is an implementation detail of the blocklayer driver and
>>> should be handled there.
>> The async design for this feature is to allow an alternative management
>> of the typical skiboot polling during the window movement we cannot
>> allow those operations
> Why not?
>> (pushing the transaction state to the blocklayer was not the
>> design point of this feature managing and communicating success
>> and failure of each individual worker request seemed a bit much).
> What is "the design point?" and how (and why) did you decide that was the point?
> My main gripe with this series (aside from the 1300 line patch...) is
> that there's a lot of strange code which seems like it's there to hack
> around issues caused by the weird split between the blocklayer and the
> "transaction state". So, why did that design get chosen? What were the
> alternatives and why is this better? I'm sure there's real problems
> you're trying to address, but I'm having a hard time working out what
> they are...

Recursive polling in skiboot is not allowed, therefore we need a design
for opal_flash_op which allows asynchronous operations (including the
synchronous ipmi window move).  Performing the synchronous window move
takes a nondeterministic amount of time, so we need to return to Linux
to prevent stalls, and the success or failure of those move operations
has to be managed.  Tracking the success and failure of each individual
block request also has to be managed, so this design was chosen to
manage the async work.

The other suggestion offered for consideration,
posix_fadvise(POSIX_FADV_WILLNEED), was also reviewed.
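The chunked flow being described can be modelled as a small state machine: each poll step makes progress on one chunk, a missed window move bumps a retry counter and asks to be rescheduled, and completion is signalled once the remaining length reaches zero. The sketch below is a simplified, synchronous simulation with illustrative names (flash_poll_step, MAX_RETRY); it mirrors the shape of the patch's flash_poll but is not the skiboot implementation:

```c
/* Illustrative model of the chunked async state machine.
 * Returns true while more polling is required. */
#include <stdbool.h>
#include <stdint.h>

#define CHUNK		4096
#define MAX_RETRY	3

struct async_op {
	uint64_t pos, len;	/* progress through the request */
	unsigned retry_counter;	/* consecutive window-move misses */
	bool done;		/* completion sent back to Linux */
};

static bool flash_poll_step(struct async_op *op, bool window_ready)
{
	if (!window_ready) {
		/* Window move still in flight: back off and retry,
		 * giving up after MAX_RETRY consecutive misses. */
		if (++op->retry_counter > MAX_RETRY)
			return false;	/* caller reports the error */
		return true;		/* reschedule the timer */
	}
	uint64_t step = op->len < CHUNK ? op->len : CHUNK;
	op->pos += step;
	op->len -= step;
	op->retry_counter = 0;		/* progress made: reset */
	if (op->len == 0) {
		op->done = true;	/* completion to Linux */
		return false;
	}
	return true;			/* more chunks: reschedule */
}
```

Returning to the caller between steps, instead of looping inside opal_flash_op, is what avoids the recursive-polling problem: skiboot never blocks waiting on the window move.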

>>>>>> @@ -487,7 +596,7 @@ static int64_t opal_flash_op(enum flash_op op, uint64_t id, uint64_t offset,
>>>>>>                 uint64_t buf, uint64_t size, uint64_t token)
>>>>>>  {
>>>>>>         struct flash *flash = NULL;
>>>>>> -       uint64_t len;
>>>>>> +       uint64_t len = 0;
>>>> Yes
>>>>> Is that needed?
>>> Doesn't appear to be. Adding a bit more context to the hunk below:
>>>         len = MIN(size, flash->block_size*10);
>>>         printf("Flash op %d len %llu\n", op, len);
>>>         flash->async.op = op;
>>>         flash->async.token = token;
>>>         flash->async.buf = buf + len;
>>>         flash->async.len = size - len;
>>> So it's already explicitly initialised.
>>         [CC]  core/flash.o
>> In file included from core/flash.c:11:0:
>> core/flash.c: In function ‘opal_flash_op’:
>> include/skiboot.h:81:31: error: ‘len’ may be used uninitialized in this
>> function [-Werror=maybe-uninitialized]
>>  #define prlog(l, f, ...) do { _prlog(l, pr_fmt(f), ##__VA_ARGS__); }
>> while(0)
>>                                ^~~~~~
>> core/flash.c:612:11: note: ‘len’ was declared here
>>   uint64_t len;
>>            ^~~
>>         [CC]  core/sensor.o
>> cc1: all warnings being treated as errors
>> make[2]: *** [core/flash.o] Error 1
> Right, that's fine then. That error message is pretty terrible though
> since it doesn't even mention the line that causes it (I think it's
> the prlog()s after the out: label). Might be worth filing a GCC bug
> about it.
>>>>>> @@ -516,6 +625,10 @@ static int64_t opal_flash_op(enum flash_op op, uint64_t id, uint64_t offset,
>>>>>>         flash->async.buf = buf + len;
>>>>>>         flash->async.len = size - len;
>>>>>>         flash->async.pos = offset + len;
>>>>>> +       flash->async.retry_counter = 0;
>>>>>> +       flash->async.in_progress_schedule_delay = FLASH_SCHEDULE_DELAY;
>>>>>> +       flash->bl->flags |= ASYNC_REQUIRED;
>>>>>> +       ++flash->async.transaction_id;
>>>> This is needed.
>>>>> Use post-increment unless you absolutely have to use pre-increment.
>>>>> There's only a few instances where pre-increment is used in skiboot
>>>>> and I'd like to keep it that way.
>>> Why? Is a transaction ID of zero signifigant in some way? If so, is
>>> that documented?
>> The entry of opal_flash_op is the transition point of when we know a new
>> transaction is being  requested, so we bump the transaction_id on entry
>> (subsequent continuation paths enter through flash_poll).
> What does this have to do with using a pre-increment rather than a
> post-increment? I'm complaining about the coding style, not asking why
> the transaction_id exists.

The pre-increment is used at the point where we know the operation is a
new transaction; this keeps the bump in a single location instead of
scattering it across multiple exit paths.
