[PATCH arm/aspeed/ast2500 v2] ipmi: add an Aspeed KCS IPMI BMC driver

Wed Jan 31 12:52:41 AEDT 2018

On 01/30/2018 07:37 PM, Wang, Haiyue wrote:
>
>
> On 2018-01-31 09:25, Corey Minyard wrote:
>> On 01/30/2018 07:02 PM, Wang, Haiyue wrote:
>>>
>>>
>>> On 2018-01-31 08:52, Corey Minyard wrote:
>>>> On 01/30/2018 06:02 PM, Wang, Haiyue wrote:
>>>>>
>>>>>
>>>>> On 2018-01-30 21:49, Corey Minyard wrote:
>>>>>> On 01/29/2018 07:57 AM, Wang, Haiyue wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2018-01-26 22:48, Corey Minyard wrote:
>>>>>>>> On 01/26/2018 12:08 AM, Wang, Haiyue wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2018-01-25 01:48, Corey Minyard wrote:
>>>>>>>>>> On 01/24/2018 10:06 AM, Haiyue Wang wrote:
>>>>>>>>>>> The KCS (Keyboard Controller Style) interface is used to 
>>>>>>>>>>> perform in-band
>>>>>>>>>>> IPMI communication between a server host and its BMC 
>>>>>>>>>>> (BaseBoard Management
>>>>>>>>>>> Controllers).
>>>>>>>>>>>
>>>>>>>>>>> This driver exposes the KCS interface on ASpeed SOCs 
>>>>>>>>>>> (AST2400 and AST2500)
>>>>>>>>>>> as a character device. Such SOCs are commonly used as BMCs 
>>>>>>>>>>> and this driver
>>>>>>>>>>> implements the BMC side of the KCS interface.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Haiyue Wang <haiyue.wang at linux.intel.com>
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> v1->v2
>>>>>>>>>>>
>>>>>>>>>>> - Divide the driver into two parts, one handles the BMC KCS 
>>>>>>>>>>> IPMI 2.0 state;
>>>>>>>>>>>    the other handles the BMC KCS controller such as AST2500 
>>>>>>>>>>> IO accessing.
>>>>>>>>>>> - Use the spin lock APIs to handle the device file 
>>>>>>>>>>> operations and BMC chip
>>>>>>>>>>>    IRQ inferface for accessing the same KCS BMC data structure.
>>>>>>>>>>> - Enhanced the phases handling of the KCS BMC.
>>>>>>>>>>> - Unified the IOCTL definition for IPMI BMC, it will be used 
>>>>>>>>>>> by KCS and BT.
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>> +static void kcs_bmc_handle_data(struct kcs_bmc *kcs_bmc)
>>>>>>>>>>> +{
>>>>>>>>>>> +    u8 data;
>>>>>>>>>>> +
>>>>>>>>>>> +    switch (kcs_bmc->phase) {
>>>>>>>>>>> +    case KCS_PHASE_WRITE:
>>>>>>>>>>> +        set_state(kcs_bmc, WRITE_STATE);
>>>>>>>>>>> +
>>>>>>>>>>> +        /* set OBF before reading data */
>>>>>>>>>>> +        write_data(kcs_bmc, KCS_ZERO_DATA);
>>>>>>>>>>> +
>>>>>>>>>>> +        if (kcs_bmc->data_in_idx < KCS_MSG_BUFSIZ)
>>>>>>>>>>> + kcs_bmc->data_in[kcs_bmc->data_in_idx++] =
>>>>>>>>>>> +                        read_data(kcs_bmc);
>>>>>>>>
>>>>>>>> I missed this earlier, you need to issue a length error if the 
>>>>>>>> data is too large.
>>>>>>>>
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case KCS_PHASE_WRITE_END:
>>>>>>>>>>> +        set_state(kcs_bmc, READ_STATE);
>>>>>>>>>>> +
>>>>>>>>>>> +        if (kcs_bmc->data_in_idx < KCS_MSG_BUFSIZ)
>>>>>>>>>>> + kcs_bmc->data_in[kcs_bmc->data_in_idx++] =
>>>>>>>>>>> +                        read_data(kcs_bmc);
>>>>>>>>>>> +
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_WAIT_READ;
>>>>>>>>>>> +        if (kcs_bmc->running) {
>>>>>>>>>>
>>>>>>>>>> Why do you only do this when running is set?  It won't hurt 
>>>>>>>>>> anything if it's not
>>>>>>>>>> set.  As it is, you have a race if something opens the device 
>>>>>>>>>> while this code
>>>>>>>>>> runs.
>>>>>>>>>>
>>>>>>>>>> Also, don't set the state to wait read until the "write" has 
>>>>>>>>>> finished (userland has
>>>>>>>>>> read the data out of the buffer.  More on that later.
>>>>>>>>>>
>>>>>>>>> Understood.
>>>>>>>>>>> + kcs_bmc->data_in_avail = true;
>>>>>>>>>>> + wake_up_interruptible(&kcs_bmc->queue);
>>>>>>>>>>> +        }
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case KCS_PHASE_READ:
>>>>>>>>>>> +        if (kcs_bmc->data_out_idx == kcs_bmc->data_out_len)
>>>>>>>>>>> +            set_state(kcs_bmc, IDLE_STATE);
>>>>>>>>>>> +
>>>>>>>>>>> +        data = read_data(kcs_bmc);
>>>>>>>>>>> +        if (data != KCS_CMD_READ_BYTE) {
>>>>>>>>>>> +            set_state(kcs_bmc, ERROR_STATE);
>>>>>>>>>>> +            write_data(kcs_bmc, KCS_ZERO_DATA);
>>>>>>>>>>> +            break;
>>>>>>>>>>> +        }
>>>>>>>>>>> +
>>>>>>>>>>> +        if (kcs_bmc->data_out_idx == kcs_bmc->data_out_len) {
>>>>>>>>>>> +            write_data(kcs_bmc, KCS_ZERO_DATA);
>>>>>>>>>>> +            kcs_bmc->phase = KCS_PHASE_IDLE;
>>>>>>>>>>> +            break;
>>>>>>>>>>> +        }
>>>>>>>>>>> +
>>>>>>>>>>> +        write_data(kcs_bmc,
>>>>>>>>>>> + kcs_bmc->data_out[kcs_bmc->data_out_idx++]);
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case KCS_PHASE_ABORT_ERROR1:
>>>>>>>>>>> +        set_state(kcs_bmc, READ_STATE);
>>>>>>>>>>> +
>>>>>>>>>>> +        /* Read the Dummy byte */
>>>>>>>>>>> +        read_data(kcs_bmc);
>>>>>>>>>>> +
>>>>>>>>>>> +        write_data(kcs_bmc, kcs_bmc->error);
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_ABORT_ERROR2;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case KCS_PHASE_ABORT_ERROR2:
>>>>>>>>>>> +        set_state(kcs_bmc, IDLE_STATE);
>>>>>>>>>>> +
>>>>>>>>>>> +        /* Read the Dummy byte */
>>>>>>>>>>> +        read_data(kcs_bmc);
>>>>>>>>>>> +
>>>>>>>>>>> +        write_data(kcs_bmc, KCS_ZERO_DATA);
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_IDLE;
>>>>>>>>>>> +
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    default:
>>>>>>>>>>> +        set_state(kcs_bmc, ERROR_STATE);
>>>>>>>>>>> +
>>>>>>>>>>> +        /* Read the Dummy byte */
>>>>>>>>>>> +        read_data(kcs_bmc);
>>>>>>>>>>> +
>>>>>>>>>>> +        write_data(kcs_bmc, KCS_ZERO_DATA);
>>>>>>>>>>> +        break;
>>>>>>>>>>> +    }
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static void kcs_bmc_handle_command(struct kcs_bmc *kcs_bmc)
>>>>>>>>>>> +{
>>>>>>>>>>> +    u8 cmd;
>>>>>>>>>>> +
>>>>>>>>>>> +    set_state(kcs_bmc, WRITE_STATE);
>>>>>>>>>>> +
>>>>>>>>>>> +    /* Dummy data to generate OBF */
>>>>>>>>>>> +    write_data(kcs_bmc, KCS_ZERO_DATA);
>>>>>>>>>>> +
>>>>>>>>>>> +    cmd = read_data(kcs_bmc);
>>>>>>>>>>
>>>>>>>>>> Shouldn't you check the phase in all the cases below and do 
>>>>>>>>>> error
>>>>>>>>>> handling if the phase isn't correct?
>>>>>>>>>>
>>>>>>>>>> Similar thing if the device here isn't open. You need to handle
>>>>>>>>>> that gracefully.
>>>>>>>>>>
>>>>>>>>>> Also, you should remove data_in_avail and data_in_idx setting 
>>>>>>>>>> from
>>>>>>>>>> here, for reasons I will explain later.
>>>>>>>>>>
>>>>>>>>> If host software sends the data twice such as a retry before 
>>>>>>>>> the BMC's IPMI service starts,
>>>>>>>>> then the two IPMI requests will be merged into one, if not 
>>>>>>>>> clear data_in_idx after receving
>>>>>>>>> KCS_CMD_WRITE_START. Most of the states are driven by host 
>>>>>>>>> software (SMS). :(
>>>>>>>>
>>>>>>>> True, but what if the host issues WRITE_START or a WRITE_END 
>>>>>>>> while this driver is in read
>>>>>>>> state?  The spec is unclear on this, but it really only makes 
>>>>>>>> sense for the host to issue
>>>>>>>> WRITE_START in idle stat and WRITE_END in write state. IMHO it 
>>>>>>>> should go to error
>>>>>>>> state.  You might make the case that a WRITE_START anywhere 
>>>>>>>> restarts the transaction,
>>>>>>>> but the feel of the error state machine kind of goes against 
>>>>>>>> that. WRITE_END is definitely
>>>>>>>> wrong anywhere but write state.
>>>>>>>>
>>>>>>>> I just found the following in the spec (section 9.12):
>>>>>>>>
>>>>>>>>    Thus, since the interface will allow a command transfer to be
>>>>>>>>    started or restarted
>>>>>>>>    at any time when the input buffer is empty, software could 
>>>>>>>> elect to
>>>>>>>>    simply retry
>>>>>>>>    the command upon detecting an error condition, or issue a 
>>>>>>>> ‘known good’
>>>>>>>>    command in order to clear ERROR_STATE
>>>>>>>>
>>>>>>>> So a WRITE_START anywhere is ok.  A WRITE_END in the wrong 
>>>>>>>> state should probably
>>>>>>>> still go to error state.  This means the user needs to be able 
>>>>>>>> to handle a write error at
>>>>>>>> any time.  It also means it's very important to make sure the 
>>>>>>>> user does a read before
>>>>>>>> doing a write.  If the host re-issues a WRITE_START and writes 
>>>>>>>> a new command
>>>>>>>> between the time the use reads the data and writes the 
>>>>>>>> response, the response would
>>>>>>>> be for the wrong command.
>>>>>>>>
>>>>>>>>>>> +    switch (cmd) {
>>>>>>>>>>> +    case KCS_CMD_WRITE_START:
>>>>>>>>>>> +        kcs_bmc->data_in_avail = false;
>>>>>>>>>>> +        kcs_bmc->data_in_idx   = 0;
>>>>>>>>>>> +        kcs_bmc->phase         = KCS_PHASE_WRITE;
>>>>>>>>>>> +        kcs_bmc->error         = KCS_NO_ERROR;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case KCS_CMD_WRITE_END:
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_WRITE_END;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case KCS_CMD_ABORT:
>>>>>>>>>>> +        if (kcs_bmc->error == KCS_NO_ERROR)
>>>>>>>>>>> +            kcs_bmc->error = KCS_ABORTED_BY_COMMAND;
>>>>>>>>>>> +
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_ABORT_ERROR1;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    default:
>>>>>>>>>>> +        kcs_bmc->error = KCS_ILLEGAL_CONTROL_CODE;
>>>>>>>>>>> +        set_state(kcs_bmc, ERROR_STATE);
>>>>>>>>>>> +        write_data(kcs_bmc, kcs_bmc->error);
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_ERROR;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +    }
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +int kcs_bmc_handle_event(struct kcs_bmc *kcs_bmc)
>>>>>>>>>>> +{
>>>>>>>>>>> +    unsigned long flags;
>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>> +    u8 status;
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_lock_irqsave(&kcs_bmc->lock, flags);
>>>>>>>>>>> +
>>>>>>>>>>> +    status = read_status(kcs_bmc) & (KCS_STATUS_IBF | 
>>>>>>>>>>> KCS_STATUS_CMD_DAT);
>>>>>>>>>>> +
>>>>>>>>>>> +    switch (status) {
>>>>>>>>>>> +    case KCS_STATUS_IBF | KCS_STATUS_CMD_DAT:
>>>>>>>>>>> +        kcs_bmc_handle_command(kcs_bmc);
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case KCS_STATUS_IBF:
>>>>>>>>>>> +        kcs_bmc_handle_data(kcs_bmc);
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    default:
>>>>>>>>>>> +        ret = -1;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> + spin_unlock_irqrestore(&kcs_bmc->lock, flags);
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +EXPORT_SYMBOL(kcs_bmc_handle_event);
>>>>>>>>>>> +
>>>>>>>>>>> +static inline struct kcs_bmc *file_kcs_bmc(struct file *filp)
>>>>>>>>>>> +{
>>>>>>>>>>> +    return container_of(filp->private_data, struct kcs_bmc, 
>>>>>>>>>>> miscdev);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int kcs_bmc_open(struct inode *inode, struct file 
>>>>>>>>>>> *filp)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct kcs_bmc *kcs_bmc = file_kcs_bmc(filp);
>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    if (!kcs_bmc->running) {
>>>>>>>>>>> +        kcs_bmc->running       = 1;
>>>>>>>>>>> +        kcs_bmc->phase         = KCS_PHASE_IDLE;
>>>>>>>>>>> +        kcs_bmc->data_in_avail = false;
>>>>>>>>>>
>>>>>>>>>> If you do everything right, setting the phase and 
>>>>>>>>>> data_in_avail should not
>>>>>>>>>> be necessary here.
>>>>>>>>>>
>>>>>>>>>>> +    } else {
>>>>>>>>>>> +        ret = -EBUSY;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static unsigned int kcs_bmc_poll(struct file *filp, 
>>>>>>>>>>> poll_table *wait)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct kcs_bmc *kcs_bmc = file_kcs_bmc(filp);
>>>>>>>>>>> +    unsigned int mask = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +    poll_wait(filp, &kcs_bmc->queue, wait);
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    if (kcs_bmc->data_in_avail)
>>>>>>>>>>> +        mask |= POLLIN;
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    return mask;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static ssize_t kcs_bmc_read(struct file *filp, char *buf,
>>>>>>>>>>> +                size_t count, loff_t *offset)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct kcs_bmc *kcs_bmc = file_kcs_bmc(filp);
>>>>>>>>>>> +    ssize_t ret = -EAGAIN;
>>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>> This function still has some issues.
>>>>>>>>>>
>>>>>>>>>> You can't call copy_to_user() with a spinlock held or 
>>>>>>>>>> interrupts disabled.
>>>>>>>>>> To handle readers, you probably need a separate mutex.
>>>>>>>>>>
>>>>>>>>>> Also, this function can return -EAGAIN even if O_NONBLOCK is 
>>>>>>>>>> not set if
>>>>>>>>>> kcs_bmc->data_in_avail changes between when you wait on the 
>>>>>>>>>> event
>>>>>>>>>> and when you check it under the lock.
>>>>>>>>>>
>>>>>>>>>> You also clear data_in_avail even if the copy_to_user() 
>>>>>>>>>> fails, which is
>>>>>>>>>> wrong.
>>>>>>>>>>
>>>>>>>>>> I believe the best way to handle this would be to have the 
>>>>>>>>>> spinlock
>>>>>>>>>> protect the inner workings of the state machine and a mutex 
>>>>>>>>>> handle
>>>>>>>>>> copying data out, setting/clearing the running flag (thus a 
>>>>>>>>>> mutex
>>>>>>>>>> instead of spinlock in open and release) and the ioctl 
>>>>>>>>>> settings (except
>>>>>>>>>> for abort where you will need to grab the spinlock).
>>>>>>>>>>
>>>>>>>>>> After the wait event below, grab the mutex.  If data is not 
>>>>>>>>>> available
>>>>>>>>>> and O_NONBLOCK is not set, drop the mutex and retry. Otherwise
>>>>>>>>>> this is the only place (besides release) that sets 
>>>>>>>>>> data_in_avail to false.
>>>>>>>>>> Do the copy_to_user(), grab the spinlock, clear data_in_avail 
>>>>>>>>>> and
>>>>>>>>>> data_in_idx, then release the lock and mutex. If you are really
>>>>>>>>>> adventurous you can do this without grabbing the lock using
>>>>>>>>>> barriers, but it's probably not necessary here.
>>>>>>>>>>
>>>>>>>>
>>>>>>>> With the state machine being able to be restarted at any time, 
>>>>>>>> you need
>>>>>>>> something a little different here.  You still need the mutex to 
>>>>>>>> handle
>>>>>>>> multiple readers and the copy.  I think the function should be 
>>>>>>>> something
>>>>>>>> like:
>>>>>>>>
>>>>>>> Since KCS is not a multi-reader protocol from BMC's view, you 
>>>>>>> makes things complex. :-)
>>>>>>
>>>>>> No, I don't think you understand.  The primary purpose of the 
>>>>>> complexity
>>>>>> here is to protect the driver from the host system (on the other 
>>>>>> side of
>>>>>> the KCS interface).  Without this protection, it is possible for 
>>>>>> the host
>>>>>> system to start a new write while the user on the BMC side is 
>>>>>> reading
>>>>>> data out, resulting in corrupt data being read.
>>>>>>
>>>>>> I haven't thought too much about this.  There may be a simpler way,
>>>>>> but the protection needs to be there.
>>>>>>
>>>>>> And you may not think you need to protect the driver against a
>>>>>> malicious BMC side user code, but you would be wrong. You can
>>>>>> only have one opener, but with threads or a fork you can have
>>>>>> multiple readers.  And you don't know if a malicious piece of
>>>>>> code has taken over userland.  You always need to protect the
>>>>>> kernel.
>>>>>>
>>>>> Sure, the read/write have protected the critical data area with 
>>>>> IRQ, and also, these
>>>>> functions should be thread local safe I believe.
>>>>>
>>>>> spin_lock_irq(&kcs_bmc->lock);
>>>>> ...
>>>>> spin_unlock_irq(&kcs_bmc->lock);
>>>>>
>>>>
>>>> But remember, you can't call copy_to_user() when IRQs are off or 
>>>> when you are holding
>>>> a spinlock.  That is an absolute no.  It can crash the kernel.
>>>>
>>>> So you need a design that takes this into account, but will not 
>>>> result in the possibility
>>>> of bad data being read.
>>>>
>>> Yes, sure, as I said before: access_ok(VERIFY_WRITE, to, n), then 
>>> memcpy in spin_lock.
>>
>> Where did you get the idea that this was ok?  It's not. access_ok() 
>> is not actually very
>> useful, since the permissions on memory can change at any time unless 
>> you are holding
>> the mm lock, which is also not an ok thing to do.  It is entirely 
>> possible for access_ok()
>> to pass and copy_to_user() to fail.
>>
> I thought memcpy will not fail. :(

Oh, memcpy won't fail as long as the source and destination is kernel 
memory.
I was a little confused by the access_ok() thing, it's common for people to
assume that if they do access_ok(), that copy_to_user() won't fail.

>> I'm not exactly sure what you are saying, though.  In any event, a 
>> well-designed read()/write()
>> operation should leave the system unchanged if it gets an error.
>>
> I saw BT use a local buffer, If I change the '#define 
> KCS_MSG_BUFSIZ    1024' to ".. 512", should it be OK
> as BT ?
>
> static ssize_t bt_bmc_read(struct file *file, char __user *buf,
>                size_t count, loff_t *ppos)
> {
>     struct bt_bmc *bt_bmc = file_bt_bmc(file);
>     u8 len;
>     int len_byte = 1;
>     u8 kbuffer[BT_BMC_BUFFER_SIZE];  --> #define BT_BMC_BUFFER_SIZE 256

It's good practice to keep larger things off the stack, which is why I 
dynamically
allocated it.  But if you have a mutex, you can put that buffer in 
struct bt_bmc
since it would only be accessed when holding the mutex.

>
>> -corey
>>
>>>>>>>>    static ssize_t kcs_bmc_read(struct file *filp, char *buf,
>>>>>>>>                     size_t count, loff_t *offset)
>>>>>>>>    {
>>>>>>>>         struct kcs_bmc *kcs_bmc = file_kcs_bmc(filp);
>>>>>>>>         ssize_t ret;
>>>>>>>>         bool avail;
>>>>>>>>         size_t data_size;
>>>>>>>>         u8 *data;
>>>>>>>>
>>>>>>>>         data = kmalloc(KCS_MSG_BUFSIZ, GFP_KERNEL);
>>>>>>>>         if (!data)
>>>>>>>>             return -ENOMEM;
>>>>>>>>
>>>>>>>>    retry:
>>>>>>>>         ret = -EAGAIN;
>>>>>>>>         if (!(filp->f_flags & O_NONBLOCK))
>>>>>>>> wait_event_interruptible(kcs_bmc->queue,
>>>>>>>>                          kcs_bmc->data_in_avail);
>>>>>>>>
>>>>>>>>         mutex_lock(&kcs_bmc->read_mutex);
>>>>>>>>
>>>>>>>>         spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>         avail = kcs_bmc->data_in_avail;
>>>>>>>>         if (avail) {
>>>>>>>>             memcpy(data, kcs_bmc->data_in, kcs_bmc->data_in_idx);
>>>>>>>>             data_size = kcs_bmc->data_in_idx;
>>>>>>>>         }
>>>>>>>>         spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>
>>>>>>>>         if (!avail) {
>>>>>>>>             if (filp->f_flags & O_NONBLOCK)
>>>>>>>>                 goto out_mutex_unlock;
>>>>>>>> mutex_unlock(&kcs_bmc->read_mutex);
>>>>>>>>             goto retry;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         if (count < data_size) {
>>>>>>>>             ret = -EOVERFLOW;
>>>>>>>>              ? I'm not sure about the error, but userspace 
>>>>>>>> needs to know.
>>>>>>>>             goto out_mutex_unlock;
>>>>>>
>>>>>> Maybe a length error to the host side here?
>>>>
>>>> You didn't comment on this or the other length error.  That needs 
>>>> to be
>>>> handled.
>>>>
>>> Yes, will send a length error by following KCS spec.
>>>>>>
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         if (!copy_to_user(buf, data, data_size)) {
>>>>>>>>             ret = -EFAULT;
>>>>>>>>             goto out_mutex_unlock;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         ret = data_size;
>>>>>>>>
>>>>>>>>         spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>
>>>>>>>>         if (kcs_bmc->phase != KCS_PHASE_WRITE_END_DONE)
>>>>>>>>             /* Something aborted or restarted the state 
>>>>>>>> machine. */
>>>>>>>>             ? Maybe restart if O_NONBLOCK is not set and 
>>>>>>>> -EAGAIN if it is?
>>>>>>>>             ret = -EIO;
>>>>>>>>         } else {
>>>>>>>>             kcs_bmc->phase = KCS_PHASE_WAIT_READ;
>>>>>>>>             kcs_bmc->data_in_avail = false;
>>>>>>>>             kcs_bmc->data_in_idx = 0;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>
>>>>>>>>    out_mutex_unlock:
>>>>>>>>         mutex_unlock(&kcs_bmc->read_mutex);
>>>>>>>>
>>>>>>>>         kfree(data);
>>>>>>>>
>>>>>>>>         return ret;
>>>>>>>>    }
>>>>>>>> Note that I added a state, KCS_PHASE_WRITE_END_DONE, which 
>>>>>>>> would be
>>>>>>>> set after the final byte from the host is received. You want 
>>>>>>>> the read here
>>>>>>>> done before you can do the write below to avoid the race I 
>>>>>>>> talked about.
>>>>>>>>
>>>>>>>> There is a local copy made of the data.  What you *never* want 
>>>>>>>> to happen
>>>>>>>> here is for the state machine to start processing a new write 
>>>>>>>> command
>>>>>>>> while the data is being copied.  It could result in corrupt 
>>>>>>>> data being read
>>>>>>>> and some random operation being done by the BMC.
>>>>>>>>
>>>>>>>> If you want to avoid the local copy, it could be done, but it's 
>>>>>>>> more complex.
>>>>>>>>
>>>>>>>>>>> +    if (!(filp->f_flags & O_NONBLOCK))
>>>>>>>>>>> + wait_event_interruptible(kcs_bmc->queue,
>>>>>>>>>>> + kcs_bmc->data_in_avail);
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    if (kcs_bmc->data_in_avail) {
>>>>>>>>>>> +        kcs_bmc->data_in_avail = false;
>>>>>>>>>>> +
>>>>>>>>>>> +        if (count > kcs_bmc->data_in_idx)
>>>>>>>>>>> +            count = kcs_bmc->data_in_idx;
>>>>>>>>>>> +
>>>>>>>>>>> +        if (!copy_to_user(buf, kcs_bmc->data_in, count))
>>>>>>>>>>> +            ret = count;
>>>>>>>>>>> +        else
>>>>>>>>>>> +            ret = -EFAULT;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static ssize_t kcs_bmc_write(struct file *filp, const char 
>>>>>>>>>>> *buf,
>>>>>>>>>>> +                 size_t count, loff_t *offset)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct kcs_bmc *kcs_bmc = file_kcs_bmc(filp);
>>>>>>>>>>> +    ssize_t ret = count;
>>>>>>>>>>> +
>>>>>>>>>>> +    if (count < 1 || count > KCS_MSG_BUFSIZ)
>>>>>>>>>>> +        return -EINVAL;
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    if (kcs_bmc->phase == KCS_PHASE_WAIT_READ) {
>>>>>>>>>>> +        if (copy_from_user(kcs_bmc->data_out, buf, count)) {
>>>>>>>>>>> + spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +            return -EFAULT;
>>>>>>>>>>> +        }
>>>>>>>>>>> +
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_READ;
>>>>>>>>>>> +        kcs_bmc->data_out_idx = 1;
>>>>>>>>>>> +        kcs_bmc->data_out_len = count;
>>>>>>>>>>> +        write_data(kcs_bmc, kcs_bmc->data_out[0]);
>>>>>>>>>>> +    } else if (kcs_bmc->phase == KCS_PHASE_READ) {
>>>>>>>>>>> +        ret = -EBUSY;
>>>>>>>>>>> +    } else {
>>>>>>>>>>> +        ret = -EINVAL;
>>>>>>>>>>
>>>>>>>>>> Is there a reason you return -EINVAL here?  Why not just 
>>>>>>>>>> -EBUSY in all
>>>>>>>>>> cases?  Is there something that userland will need to do 
>>>>>>>>>> differently?
>>>>>>>>>>
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static long kcs_bmc_ioctl(struct file *filp, unsigned int cmd,
>>>>>>>>>>> +              unsigned long arg)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct kcs_bmc *kcs_bmc = file_kcs_bmc(filp);
>>>>>>>>>>> +    long ret = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    switch (cmd) {
>>>>>>>>>>> +    case IPMI_BMC_IOCTL_SET_SMS_ATN:
>>>>>>>>>>> +        update_status_bits(kcs_bmc, KCS_STATUS_SMS_ATN,
>>>>>>>>>>> +                        KCS_STATUS_SMS_ATN);
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case IPMI_BMC_IOCTL_CLEAR_SMS_ATN:
>>>>>>>>>>> +        update_status_bits(kcs_bmc, KCS_STATUS_SMS_ATN,
>>>>>>>>>>> +                        0);
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    case IPMI_BMC_IOCTL_FORCE_ABORT:
>>>>>>>>>>> +        set_state(kcs_bmc, ERROR_STATE);
>>>>>>>>>>> +        read_data(kcs_bmc);
>>>>>>>>>>> +        write_data(kcs_bmc, KCS_ZERO_DATA);
>>>>>>>>>>> +
>>>>>>>>>>> +        kcs_bmc->phase = KCS_PHASE_ERROR;
>>>>>>>>>>> +        kcs_bmc->data_in_avail = false;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +
>>>>>>>>>>> +    default:
>>>>>>>>>>> +        ret = -EINVAL;
>>>>>>>>>>> +        break;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    return ret;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>> +static int kcs_bmc_release(struct inode *inode, struct file 
>>>>>>>>>>> *filp)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct kcs_bmc *kcs_bmc = file_kcs_bmc(filp);
>>>>>>>>>>> +
>>>>>>>>>>
>>>>>>>>>> What happens if the device gets closed in the middle of a 
>>>>>>>>>> transaction?  That's
>>>>>>>>>> an important case to handle.  If something is in process, you 
>>>>>>>>>> need to abort it.
>>>>>>>>>>
>>>>>>>>> The device just provides the read & write data, the 
>>>>>>>>> transaction is handled in the KCS
>>>>>>>>> controller's IRQ handler.
>>>>>>>>
>>>>>>>> From the spec, section 9.14:
>>>>>>>>
>>>>>>>>    The BMC must change the status to ERROR_STATE on any 
>>>>>>>> condition where it
>>>>>>>>    aborts a command transfer in progress.
>>>>>>>>
>>>>>>>> So you need to do something here.
>>>>>>>>
>>>>>>> In practice, we do this as spec said in ipmid, NOT in driver, 
>>>>>>> driver can't handle anything, let's
>>>>>>> make it simple, thanks!
>>>>>>
>>>>>> If ipmid crashes or is killed, how does it accomplish this?
>>>>>>
>>>>> Every time ipmids (or kcsd) crashed or killed, it needs start to 
>>>>> call FORCE_ARBORT firstly, to sync with
>>>>> host side software.
>>>>>>>
>>>>>>> Whenever the BMC is reset (from power-on or a hard reset), the 
>>>>>>> State Bits are initialized to “11 - Error State”. Doing so
>>>>>>> allows SMS to detect that the BMC has been reset and that any 
>>>>>>> message in process has been terminated by the BMC.
>>>>>>
>>>>>> Right, that's fine, like it should be.  But we are not talking 
>>>>>> about a reset.
>>>>>>
>>>>> I think the final error handling solution is that kcsd (user land) 
>>>>> runs, otherwise, the host software side still got stuck. We meet
>>>>> this kind of issue, so in general, we just doesn't handle some 
>>>>> mirror errors in driver, then in kcsd, when it can provide the real
>>>>> IPMI service, it will reset the channel firstly to sync with host 
>>>>> side software.
>>>>
>>>> "Userland will do the right thing" is not very convincing to a 
>>>> kernel developer.
>>>>
>>>> Plus if the above is true, I would think that you would just want 
>>>> to hold the device
>>>> in an error state when it wasn't opened.
>>>>
>>> I understand your concern, of course, driver need handles things 
>>> well. But in fact, if a user app is truly a bad boy, it still can hang
>>> the host side: set SMS_ATN, but no message returned when software 
>>> host side requests, then host open-ipmi driver will hang, we
>>> meet this kind of error to hang the customer's host. :) In my 
>>> understanding, kcs-bmc should do the right thing about read and write,
>>> the real transaction should be handled correctly by the kcsd.
>>>
>>> And if no kcsd starts, then this kind of BMC can't be sold out. :)
>>
>> True.  I'm not as concerned about this sort of thing.  It's nicer to 
>> the host side if
>> it can detect problems quickly, but it will eventually time out.
>>
>> From what I can tell from the current design, if the BMC userland is 
>> not running,
>> the driver will step through the state machine until it hits read 
>> state, then it
>> will sit there until the host times out and aborts the operation.
>>
>> IMHO, it would be better for the host side if the driver just stayed 
>> in error state
>> if nothing had it open.  It would think the spec says that in the 
>> quote I referenced
>> above, but that quote, like many things in the IPMI spec, is fairly 
>> vague and could
>> be interpreted many ways.
>>
> Well, I will try to fix this errors as possible.
>> -corey
>>
>>
>>>> -corey
>>>>
>>>>>> -corey
>>>>>>
>>>>>>>>>>> + spin_lock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    kcs_bmc->running = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +    spin_unlock_irq(&kcs_bmc->lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    return 0;
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>