From mahesh at linux.ibm.com Sat Sep 20 03:19:43 2025
From: mahesh at linux.ibm.com (Mahesh Salgaonkar)
Date: Fri, 19 Sep 2025 22:49:43 +0530
Subject: [Skiboot] [PATCH 1/3] external/opal-prd: Fix compilation error
In-Reply-To: <20250919171945.3947759-1-mahesh@linux.ibm.com>
References: <20250919171945.3947759-1-mahesh@linux.ibm.com>
Message-ID: <20250919171945.3947759-2-mahesh@linux.ibm.com>

Upstream opal-prd fails to build with the following errors:

$ make -C external/opal-prd/
make: Entering directory '/home/mahesh/skiboot/external/opal-prd'
[...]
 CC thunk.o
/usr/include/bits/types.h: Assembler messages:
/usr/include/bits/types.h:30: Error: unrecognized opcode: `typedef'
/usr/include/bits/types.h:31: Error: unrecognized opcode: `typedef'
/usr/include/bits/types.h:32: Error: unrecognized opcode: `typedef'
[...]

The file thunk.S includes <endian.h>, which gets picked up from
skiboot's 'include/endian.h' (instead of /usr/include/endian.h). That
header then includes 'ccan/endian/endian.h' -> /usr/include/bits/types.h,
where the assembler fails on the C typedefs.

The 'include/endian.h' header was introduced by commit 18c9b278e747
("include: Provide endian conversion functions") when support for PLDM
was added. It was meant to be used only by C files under pldm/,
core/pldm and libmctp/.

To restrict this header to pldm/libmctp, move endian.h under the
'pldm/include' folder. This ensures that all files outside 'pldm' and
'libmctp' pick up the default header from '/usr/include/endian.h', as
they did prior to commit 18c9b278e747 ("include: Provide endian
conversion functions").

After moving endian.h under 'pldm/include', opal-prd compiles
successfully.
Fixes: 18c9b278e747 ("include: Provide endian conversion functions")
Signed-off-by: Mahesh Salgaonkar
---
 {include => pldm/include}/endian.h | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename {include => pldm/include}/endian.h (100%)

diff --git a/include/endian.h b/pldm/include/endian.h
similarity index 100%
rename from include/endian.h
rename to pldm/include/endian.h
--
2.51.0

From mahesh at linux.ibm.com Sat Sep 20 03:19:44 2025
From: mahesh at linux.ibm.com (Mahesh Salgaonkar)
Date: Fri, 19 Sep 2025 22:49:44 +0530
Subject: [Skiboot] [PATCH 2/3] external/opal-prd: Fix opal-prd service shutdown on memory errors
In-Reply-To: <20250919171945.3947759-1-mahesh@linux.ibm.com>
References: <20250919171945.3947759-1-mahesh@linux.ibm.com>
Message-ID: <20250919171945.3947759-3-mahesh@linux.ibm.com>

Whenever a memory error is reported, opal-prd tries to spawn a child
process using fork() and delegate the memory offline work to it. After
handling the memory error, the child process is supposed to exit.
However, instead of delegating the task to the child process, the main
process itself handles the memory error and exits. This sends the
opal-prd service into a stop/restart loop that eventually hits the
systemd restart limit, leaving the opal-prd service unavailable.

opal-prd[49096]: MEM: Memory error: range 0000000eeb445700-0000000eeb445700, type: correctable
opal-prd[49096]: MEM: Offlined 0000000eeb445700,0000000eeb455700, type correctable: No such file or directory
systemd[1]: opal-prd.service: Service RestartSec=100ms expired, scheduling restart.
systemd[1]: opal-prd.service: Scheduled restart job, restart counter is at 7.
systemd[1]: opal-prd.service: Start request repeated too quickly.
systemd[1]: opal-prd.service: Failed with result 'start-limit-hit'.
systemd[1]: Failed to start OPAL PRD daemon

The fork() function, on success, returns the PID of the child process
(pid > 0) in the parent and 0 in the child.
Instead of invoking the memory worker when the return value is
pid == 0, the code invokes the worker when pid > 0, which is the parent
process itself.

	pid = fork();
	if (pid > 0)
		exit(memory_error_worker(sysfsfile, typestr, i_start_addr, i_endAddr));

The above logic causes the parent process to exit after handling the
memory error. Fix this by changing the if condition to (pid == 0).

Fixes: 8cbd0de88d16 ("opal-prd: Have a worker process handle page offlining")
Signed-off-by: Mahesh Salgaonkar
---
 external/opal-prd/opal-prd.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/external/opal-prd/opal-prd.c b/external/opal-prd/opal-prd.c
index 1c610da4c..da947c827 100644
--- a/external/opal-prd/opal-prd.c
+++ b/external/opal-prd/opal-prd.c
@@ -755,9 +755,13 @@ int hservice_memory_error(uint64_t i_start_addr, uint64_t i_endAddr,
 	/*
 	 * HBRT expects the memory offlining process to happen in the background
 	 * after the notification is delivered.
+	 *
+	 * fork() return value:
+	 * On success, the PID of the child process is returned in the parent,
+	 * and 0 is returned in the child.
 	 */
 	pid = fork();
-	if (pid > 0)
+	if (pid == 0)
 		exit(memory_error_worker(sysfsfile, typestr, i_start_addr, i_endAddr));
 
 	if (pid < 0) {
--
2.51.0

From mahesh at linux.ibm.com Sat Sep 20 03:19:42 2025
From: mahesh at linux.ibm.com (Mahesh Salgaonkar)
Date: Fri, 19 Sep 2025 22:49:42 +0530
Subject: [Skiboot] [PATCH 0/3] external/opal-prd fixes
Message-ID: <20250919171945.3947759-1-mahesh@linux.ibm.com>

This series fixes an opal-prd compilation failure and adds a couple
more fixes for issues in the memory error handling code path.

Thanks,
-Mahesh.
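Patch 2/3 above hinges on the fork() return-value convention. The
following is a minimal sketch of that convention, not the opal-prd
code: spawn_worker(), reap_worker(), worker_fn and demo_worker() are
hypothetical names introduced only for illustration, and the sketch
calls _exit() in the child where opal-prd calls exit().

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a worker such as memory_error_worker(). */
typedef int (*worker_fn)(void *arg);

/*
 * Run fn(arg) in a child process.
 * In the parent: returns the child's PID, or -1 if fork() failed.
 * In the child: never returns, the worker's result becomes the exit status.
 */
static pid_t spawn_worker(worker_fn fn, void *arg)
{
	pid_t pid;

	assert(fn != NULL);

	pid = fork();
	if (pid == 0) {
		/* Child: fork() returned 0, so do the work and exit. */
		_exit(fn(arg));
	}

	/* Parent: pid > 0 is the child's PID, pid < 0 means fork() failed. */
	return pid;
}

/* Illustrative worker: exit status 0 for a non-NULL argument, else 1. */
static int demo_worker(void *arg)
{
	return arg ? 0 : 1;
}

/* Wait for the child; return its exit status, or -1 on abnormal exit. */
static int reap_worker(pid_t pid)
{
	int status;

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would check the result of spawn_worker(demo_worker, &arg) for
a negative value and otherwise keep running; with the inverted test
(pid > 0) it would be the caller, not the child, that runs the worker
and exits. reap_worker() is here only so the sketch can observe the
child's exit status.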
Mahesh Salgaonkar (3):
  external/opal-prd: Fix compilation error
  external/opal-prd: Fix opal-prd service shutdown on memory errors
  external/opal-prd: remove misleading errno error message

 external/opal-prd/opal-prd.c | 8 ++++++--
 {include => pldm/include}/endian.h | 0
 2 files changed, 6 insertions(+), 2 deletions(-)
 rename {include => pldm/include}/endian.h (100%)

--
2.51.0

From mahesh at linux.ibm.com Sat Sep 20 03:19:45 2025
From: mahesh at linux.ibm.com (Mahesh Salgaonkar)
Date: Fri, 19 Sep 2025 22:49:45 +0530
Subject: [Skiboot] [PATCH 3/3] external/opal-prd: remove misleading errno error message
In-Reply-To: <20250919171945.3947759-1-mahesh@linux.ibm.com>
References: <20250919171945.3947759-1-mahesh@linux.ibm.com>
Message-ID: <20250919171945.3947759-4-mahesh@linux.ibm.com>

While handling a memory error, opal-prd displays a misleading errno
error message even when the handler successfully soft/hard offlined the
requested memory page.

opal-prd[49096]: MEM: Memory error: range 0000000eeb445700-0000000eeb445700, type: correctable
opal-prd[49096]: MEM: Offlined 0000000eeb445700,0000000eeb455700, type correctable: No such file or directory

In the above example, the error message 'No such file or directory' was
displayed even after the memory was offlined successfully. This is
because the printf in the success case was using '%m', which prints the
errno-based error message. The value in errno is significant only when
the return value of the call indicated an error. The value of errno is
never set to zero by any system call or library function.
Hence, in the success case, do not use '%m' in printf, to avoid a
misleading error message.

Signed-off-by: Mahesh Salgaonkar
---
 external/opal-prd/opal-prd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/external/opal-prd/opal-prd.c b/external/opal-prd/opal-prd.c
index da947c827..d85e3e9bd 100644
--- a/external/opal-prd/opal-prd.c
+++ b/external/opal-prd/opal-prd.c
@@ -721,7 +721,7 @@ static int memory_error_worker(const char *sysfsfile, const char *type,
 			ret = 1;
 		}
 	}
-	pr_log(LOG_CRIT, "MEM: Offlined %016lx,%016lx, type %s: %m\n",
+	pr_log(LOG_CRIT, "MEM: Offlined %016lx,%016lx, type %s\n",
 			i_start_addr, addr, type);
 
 	close(memfd);
--
2.51.0
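The errno rule that patch 3/3 relies on can be sketched as below. This
is an illustration under assumptions, not opal-prd code:
format_result() is a hypothetical helper, and it uses strerror() on an
errno value saved by the caller instead of the '%m' conversion. The
errno text is appended only when the return code reports a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a result message into buf.
 * rc is the return code of the operation; saved_errno is the errno value
 * captured by the caller right after the call. The errno text is used
 * only when rc signals an error, because errno may hold a stale value
 * from an earlier, unrelated failure after a successful call.
 */
static int format_result(char *buf, size_t len, const char *what,
			 int rc, int saved_errno)
{
	assert(buf != NULL && what != NULL);

	if (rc < 0)
		return snprintf(buf, len, "%s failed: %s", what,
				strerror(saved_errno));

	/* Success: errno is not meaningful here, so it is not printed. */
	return snprintf(buf, len, "%s succeeded", what);
}
```

With errno left at ENOENT by some earlier call, format_result(buf,
sizeof(buf), "offline", 0, errno) produces only "offline succeeded",
whereas a success-path format string ending in ': %m' would append the
stale error text, as in the log above.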