[PATCH] powerpc/pseries: Add papr-platform-dump character driver for dump retrieval

Haren Myneni haren at linux.ibm.com
Tue Dec 10 20:18:29 AEDT 2024


On Thu, 2024-12-05 at 11:42 +0100, Michal Suchánek wrote:
> On Wed, Dec 04, 2024 at 06:14:06PM -0800, Haren Myneni wrote:
> > On Wed, 2024-12-04 at 17:57 +0100, Michal Suchánek wrote:
> > > On Mon, Dec 02, 2024 at 08:40:05PM -0800, Haren Myneni wrote:
> > > > On Wed, 2024-11-27 at 10:11 +0100, Michal Suchánek wrote:
> > > > > On Tue, Nov 26, 2024 at 12:40:20PM -0800, Haren Myneni wrote:
> > > > > > On Wed, 2024-11-27 at 00:42 +0530, Mahesh J Salgaonkar
> > > > > > wrote:
> > > > > > > On 2024-11-23 21:20:39 Sat, Haren Myneni wrote:
> > > > > > > [...]
> > > > > > > > +static ssize_t papr_platform_dump_handle_read(struct file *file,
> > > > > > > > +		char __user *buf, size_t size, loff_t *off)
> > > > > > > > +{
> > > > > > > > +	struct ibm_platform_dump_params *params = file->private_data;
> > > > > > > > +	u64 total_bytes;
> > > > > > > > +	s32 fwrc;
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * Dump already completed with the previous read calls.
> > > > > > > > +	 * In case if the user space issues further reads, returns
> > > > > > > > +	 * -EINVAL.
> > > > > > > > +	 */
> > > > > > > > +	if (!params->buf_length) {
> > > > > > > > +		pr_warn_once("Platform dump completed for dump ID %llu\n",
> > > > > > > > +			(u64) (((u64)params->dump_tag_hi << 32) |
> > > > > > > > +				params->dump_tag_lo));
> > > > > > > > +		return -EINVAL;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (size < SZ_1K) {
> > > > > > > > +		pr_err_once("Buffer length should be minimum 1024 bytes\n");
> > > > > > > > +		return -EINVAL;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * The hypervisor returns status 0 if no more data available to
> > > > > > > > +	 * download. Then expects the last RTAS call with NULL buffer
> > > > > > > > +	 * to invalidate the dump which means dump will be freed in the
> > > > > > > > +	 * hypervisor.
> > > > > > > > +	 */
> > > > > > > > +	if (params->status == RTAS_IBM_PLATFORM_DUMP_COMPLETE) {
> > > > > > > > +		params->buf_length = 0;
> > > > > > > > +		fwrc = rtas_ibm_platform_dump(params, 0, 0);
> > > > > > > > +		/*
> > > > > > > > +		 * Returns 0 (success) to the user space so
> > > > > > > > +		 * that user space read stops.
> > > > > > > 
> > > > > > > Does this implicitly invalidate the dump, irrespective of
> > > > > > > whether userspace has requested it or not?
> > > > > > 
> > > > > > No, the RTAS call from the last read() will invalidate the
> > > > > > dump. The user space is expected to make read() calls until
> > > > > > one returns 0, which means the last read() returns 0 after
> > > > > > invalidating the dump.
> > > > > > 
> > > > > > size_t read()
> > > > > > {
> > > > > >    if (params->status == RTAS_IBM_PLATFORM_DUMP_COMPLETE) {
> > > > > >            RTAS call to invalidate dump
> > > > > >            return 0;
> > > > > >    }
> > > > > >    get the data from RTAS call
> > > > > >    if the RTAS call return status is DUMP_COMPLETE
> > > > > >            params->status = RTAS_IBM_PLATFORM_DUMP_COMPLETE
> > > > > >    return bytes_written
> > > > > > }
> > > > > > 
> > > > > > If the RTAS call returns DUMP_COMPLETE, the hypervisor
> > > > > > expects one more RTAS call to invalidate the dump, which is
> > > > > > done as part of the last read().
> > > > Yes, providing support for an FD from ioctl for the following
> > > > reasons:
> > > > 
> > > > - The get-vpd char driver provides read() only on the FD
> > > > returned from ioctl. So I thought of using the same interface
> > > > for platform-dump, so that we have a consistent interface for
> > > > all RTAS function char drivers.
> > > > 
> > > > - file->private_data is assigned to the miscdevice in
> > > > misc_register() and also assigned to some other miscdevice
> > > > struct in driver-specific code. So I did not want to break the
> > > > semantics of the existing code if I used private_data to save
> > > > the internal param struct.
> > > > 
> > > > Please let me know if you prefer to use the FD from open() for
> > > > the platform-dump read().
> > > 
> > > There is no way to specify the dump id with open() of the
> > > character device. So for open() some other mechanism would have to
> > > be used, such as sysfs/debugfs, which can list the IDs as
> > > filenames, or something similar.
> > 
> > You mentioned standard file handling code using the FD from open().
> > So I was referring to passing the dump ID with ioctl as in the
> > current patch, but using the FD returned from open() for the read()
> > calls, like:
> > fd = open()
> > ret = ioctl(fd, .. dumpid)
> > read(fd, buf, size)
> > 
> > As in get-vpd, we will have one other RTAS stream function,
> > ibm,get-indices.
> 
> It would be much clearer if these two related functions were
> implemented together.

I need more time to implement ibm,get-indices support. Do you prefer to
wait platform-dump RTAS call support along with ibm,get-indices?

> 
> > I will add this support later. So I am proposing the same
> > interfaces for RTAS functions as provided by papr-vpd, so that the
> > same type of interface will be used across all RTAS functions.
> 
> The proposed interface is similar but subtly different. While
> papr_vpd accumulates the data in the kernel and returns it in a
> single call, this API requires the user to call read() repeatedly to
> obtain the data.
> 
> With the requirement to introduce a new API, it can as well be
> designed to fit the call in question.

For papr-vpd, the user space can issue a single read call to get all
the data, but can also use multiple read calls until EOF. See the
rtas_get_vpd() implementation in the librtas sources:

get_vpd_chardev() -
https://github.com/ibm-power-utilities/librtas/blob/next/librtas_src/vpd.c

Adding support for the following stream RTAS calls in the kernel for
lockdown:

ibm,get-vpd: included in papr-vpd.c. Takes the address of a location
code as input and starts with sequence 0. Interleaved RTAS calls are
not supported, so the complete buffer is retrieved within ioctl and is
then available to the user space with read() calls.

Current kernel implementation:
ioctl:
- first RTAS call with sequence 0, continuing until the complete
buffer is retrieved
- if a restart is needed, start with sequence 0 again

Interface usage in the user space (see the librtas code):
fd = open()
devfd = ioctl(fd ..)
while (read());

ibm,platform-dump: The user space passes a dump ID as input with
ioctl(). Interleaved RTAS calls are supported - with multiple dump
IDs. Also, the dump may need a large buffer (I noticed a 3GB dump in
my testing) if the complete dump is retrieved within ioctl. So I am
proposing one RTAS call per read(). Then the dump has to be
invalidated at the end to remove it from the hypervisor.

fd = open()
devfd = ioctl(fd, ) - sets sequence = 1

read(devfd, )
{
  RTAS call with sequence and dump ID, which returns the next
sequence#

  until end of dump
}
ioctl(devfd, ..) - invalidate dump
   -- If this ioctl is not called by the user space, the dump is still
valid and the user space can restart the sequence with the original
ioctl(fd, ..) which sets sequence = 1.

The current platform-dump implementation:
ppc64-diag/rtas_errd/extract_platdump.c
{
status = First RTAS call with sequence = 1
while (status == dump_complete)
  status = RTAS call with next sequence

status = RTAS call to invalidate dump
}

get-indices RTAS call: Takes an indicator or sensor, and the type, as
inputs from the user space with ioctl(). Multiple interleaved calls
with different indices are not supported. So I will be proposing a
similar interface as in get-vpd - read the complete buffer in ioctl()
and make the buffer available with read(). In the case of get-vpd,
each RTAS call returns the number of bytes written and the user space
can read a continuous buffer. Whereas for get-indices, the RTAS call
does not tell the number of bytes written, and the output has the
following format for each RTAS call:

- Number Returned: 32-bit integer representing the number of indicator
indices returned on this call
- Sets of (32-bit integer index, 32-bit integer length of location
code including NULLs, location code string (NULL terminated and padded
to the nearest 4-byte boundary)), one set per indicator or sensor,
with the number of sets indicated by the first integer in this work
buffer

The hypervisor determines how many indices fit into the requested
buffer and fills the buffer. For example, if a 1K buffer is requested,
the data may not fill the whole 1K. So I will be getting a (fixed) 1K
buffer for each RTAS call and appending it to one buffer, as in
get-vpd, during ioctl. Each read returns 1K to the user space (as in
the current implementation in librtas).

ioctl()
{ start with sequence 1
  get a 1K buffer from the RTAS call for each sequence
  copy the 1K into one large buffer
  if a restart is needed, start with sequence 1 again
}

read()
{ return 1K of the buffer until EOF
}

The current implementation in the user space:
https://github.com/ibm-power-utilities/librtas/tree/next/librtas_src
> 
> > fd = open()
> > devfd = ioctl(fd,.. dumpid)
> > read(devfd, buf, size)    --> will just return 0 if the previous
> > RTAS call returns dump complete status, but will not invalidate the
> > dump.
> > ret = ioctl(devfd, .. dumpid) - can be used to invalidate the dump
> > as you suggested.
> > 
> > > > > But also the size 0 read both indicates the EOF and
> > > > > invalidates the dump; these should be separate.
> > > > > 
> > > > > Which differs from the hypervisor API in that it makes it
> > > > > impossible to save the dump without deleting it, and
> > > > > introduces a regression.
> > > > 
> > > > The hypervisor API says to invalidate the dump after saving it,
> > > > and the current interface does that - the user space makes the
> > > > last read() (which returns 0) after saving the complete dump.
> > > > 
> > > > All read() calls returning size (> 0) == RTAS calls for dump
> > > > data
> > > > The read() expected to return 0 == the same RTAS call,
> > > > invalidating the dump
> > > > 
> > > > So the only difference is if the user does not call to
> > > > invalidate the dump explicitly even though the dump was saved,
> > > > but we do not have that usage currently. The extract-dump
> > > > command is the only user now, and please note that this command
> > > > is used on non-HMC-managed systems. It is not used on
> > > > HMC-managed systems, where the HMC has its own command that
> > > > gets the dump directly from the hypervisor.
> > > 
> > > However, the hypervisor does provide the option to read the dump
> > > without invalidating it.
> > > 
> > > At the very least this option is useful for testing. Not making
> > > it a separate call makes it needlessly difficult for people
> > > testing the feature.
> > > 
> > > If as you say this is used exclusively by extract-dump, and it
> > > already calls an IOCTL to obtain a FD, it can call another IOCTL
> > > on it to do the invalidation. There is the corner case that the
> > > kernel might need to finish the reading of the dump on its own if
> > > the userspace did not perform the IOCTL, or return an error if
> > > the dump is not fully read yet.
> > 
> > The kernel should not read the remaining dump if the user space
> > did not issue reads, nor return an error if the complete dump is
> > not read. It should be left to the user space as it is now. As
> > Mahesh pointed out before, what if the user space exits for some
> > reason in the middle of dump collection?
> > 
> > It should be the same as before, but with one more ioctl to
> > invalidate the dump:
> > 
> > read()
> > {
> >    if the status is 'dump complete'
> >          return 0;
> > 
> >    if the status is 'dump invalidated'
> >          return -EINVAL  -- means the user space issued the ioctl
> > to invalidate before; not expected, but just in case
> > 
> >    RTAS call to get the dump
> >    return size;
> > }
> > 
> > ioctl(devfd, ...) 
> > {
> >   if the status is 'dump complete'
> >       RTAS call to invalidate dump
> >       return 0;
> > 
> >   return -EINVAL --- expects the user space to issue this ioctl
> > after
> > read() returns 0
> > }
> 
> That does not make for a very nice API, though.
> 
> Also EINVAL is not exactly helpful: the call arguments are valid;
> the kernel merely refused to do the remaining reads until it can
> perform the invalidation according to the hypervisor API.

We can change it to some other errno, e.g. -EPERM - not allowed since
the dump is not complete.

The 'dump complete' status is returned from the RTAS call only after
all data is retrieved. So you mean the user space would be reading
again, starting from sequence 1. I am not sure about this special
case, but even if the tool starts again, it can restart the dump
sequence with a new ioctl(fd, ..) before the dump is invalidated.
ex:
close(devfd)
devfd = ioctl(fd, ..) - sets sequence 1 and the next read() calls will
get the complete dump
 
> 
> > > > > If you are doing IOCTLs anyway the invalidation could be an
> > > > > IOCTL. Or you could really follow the RTAS ABI and only
> > > > > invalidate if the user passes NULL and 0.
> > > > 
> > > > I could use this ioctl interface to invalidate the dump.
> > > > 
> > > > devfd = ioctl(fd ..) for dump ID
> > > > read(devfd)
> > > > ret = ioctl(devfd ...) to invalidate dump 
> > > > 
> > > > I will make changes if you are OK with this interface
> > > 
> > > Yes, for the ioctl interface this looks like a better option.
> > 
> > Thanks for your suggestion. As I mentioned above, will add this
> > change.
> > > > > But more generally, the previously added RTAS wrappers do not
> > > > > closely follow the RTAS ABI, and do accumulation of read data
> > > > > in the kernel, passing the whole buffer to userspace
> > > > > afterwards.
> > > > > 
> > > > > Why cannot it be done in this case?
> > > > 
> > > > platform-dump RTAS returns more data, unlike get-vpd. In one
> > > > case I noticed 3GB of data, which is not ideal to save in a
> > > > kernel buffer within ioctl.
> > > > 
> > > > Also, the platform-dump RTAS function supports interleaved
> > > > calls for multiple dump IDs at the same time, unlike the
> > > > get-vpd case.
> > > 
> > > It would be nice to document why this call is incompatible with
> > > the
> > > existing implementations somewhere in the commit message or a
> > > comment.
> > 
> > I think I mentioned the interleaved calls. Sure, I will update the
> > comments.
> > 
> > > > > Even more generally if the dump IDs are stable these could be
> > > > > listed
> > > > > in
> > > > > some sysfs or debugfs directory, and provide standard file
> > > > > operations,
> > > > > including unlink() to remove the dump.
> > > > 
> > > > dump IDs are not stable. The hypervisor can have several dumps
> > > > with different IDs at the same time, so unlink() cannot be
> > > > used.
> > > 
> > > I mean stable in the sense that the same dump always corresponds
> > > to the same ID. Not in the sense that new dumps cannot be added,
> > > and old ones removed.
> > > 
> > > It is indeed unclear what the lifecycle of these dump IDs is,
> > > how they are created, and how the userspace is obtaining them.
> > 
> > Dump IDs can be 64-bit values, so I think they may not be reused.
> > Whenever the hypervisor has a dump, or if the user initiates a
> > dump with the BMC interface, it notifies the partition with the
> > dump ID and the extract_platdump command is executed
> > automatically.
> 
> So the reason you can get away with not implementing get-indices is
> that
> the dump ID is passed in an event that invokes saving the dump.

Sorry if I created confusion. platform-dump and get-indices are
separate RTAS calls for different purposes.

> 
> That should perhaps be documented as well.
> 
> Also it does sound like get-indices together with these dump reading
> functions would work with a filesystem API, with readdir returning
> the get-indices, and the filename in open() or unlink() giving the
> dump ID.
> 
> It's not really a problem if dump IDs are reused over time; it's
> also possible to delete a file and create one with the same name
> later. What would be a problem for this is if get-indices gave
> different results each time even without the available dumps
> changing.
> 
> However, it would make unlink() quite inefficient, requiring the
> dump to be read a second time.

Are you proposing the following interfaces with unlink() usage?
fd = open()
ret = ioctl(dumpid) - which should create a <dumpID> file name under
some sysfs interface
devfd = open(sysfs-dumpid-file)
read(devfd,..) -- Note: the first read should pass sequence=1 to the
RTAS call
unlink(sysfs-dumpid-file)
close(devfd)

Suppose the tool would like to read the dump again; the read() call
has to start with sequence 1, and I am not sure how we would tell
read() to restart the dump. I think it becomes a more complex
interface. So I am thinking of having a similar interface for all
RTAS stream calls to keep it simple.

> 
> Also if the only known tool used for reading dumps always reads the
> dump once and then invalidates it, repeated dump reading may trigger
> not yet encountered hypervisor bugs.

Not sure what kind of bugs would be exposed here. The tool is expected
to follow the same semantics as in the current usage.

As I described the behavior of the 3 different RTAS calls above,
please let me know if you have any comments.

Thanks
Haren

> 
> Thanks
> 
> Michal
> 
> > We should be OK since we are using the ioctl to invalidate:
> > 
> > fd = open()
> > devfd = ioctl(fd,.. dumpid)
> > read(devfd, buf, size)
> > ret = ioctl(devfd, .. dumpid)
> > 
> > Will implement the above change. Please let me know if you have
> > any concerns.
> > 
> > Thanks
> > Haren
> > 
> > > Thanks
> > > 
> > > Michal
> > > 
> > > > > With the bonus that at least one of these filesystems has
> > > > > some provisions for confidentiality lockdown. The
> > > > > implementation could use that to make the dumps unavailable
> > > > > in confidentiality lockdown level if the dumps are considered
> > > > > confidential, without reimplementing the check.
> > > > 
> > > > We are adding the new driver (interfaces) to support lockdown;
> > > > otherwise the current usage is sufficient. But we could modify
> > > > the interface to restrict it for confidentiality lockdown in
> > > > the future if we have that restriction.
> > > > 
> > > > Thanks
> > > > Haren
> > > > 
> > > > > Thanks
> > > > > 
> > > > > Michal


