Fixing sdbusplus CI issues

Lei YU mine260309 at gmail.com
Mon Dec 16 20:30:11 AEDT 2019


This email is to introduce a few changes to sdbusplus to fix the CI issues.

You may have noticed that sdbusplus has no change for a long time, and
the CI is broken on the latest OpenBMC.

The CI has two issues:
1. The case VtableTest.SameContent() fails
2. Valgrind reports an error about "Syscall param epoll_ctl(event)
points to uninitialised byte(s)", on ppc64le, but not on x86-64.

For 1), it is the bug in sdbusplus that vtable.hpp does not handle the
member added in the newer version of systemd.
It could be fixed with
https://gerrit.openbmc-project.xyz/c/openbmc/sdbusplus/+/27949, see
the gerrit link for details.

For 2), it is a bit more complicated, it's caused by code in systemd,
and eventually, it's a GCC bug. (And thanks @Andrew Jeffery for the
help on debugging the issue)
See below for details.

TL;DR

Here let me explain the above issue 2).

1. GCC has a bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77992,
that the padding bytes are not initialized in case of such code:

    struct S {
        long l;
        char c;
    };

    void main () {
        struct S s ={
            .l = 0,
            .c = 0
        };
    }

2. In systemd, the code in [libsystemd/sd-event/sd-event.c][1] using
epoll_event hit the above bug:

    ev = (struct epoll_event) {
        .events = EPOLLIN,
        .data.ptr = d,
    };

Because epoll_event is defined as

    typedef union epoll_data
    {
      void *ptr;
      int fd;
      uint32_t u32;
      uint64_t u64;
    } epoll_data_t;

    struct epoll_event
    {
      uint32_t events;      /* Epoll events */
      epoll_data_t data;    /* User data variable */
    } __EPOLL_PACKED;

In glibc, on x86, `__EPOLL_PACKED` is defined as `__attribute__
((__packed__))`[2], so it's packed and there are no internal padding
bytes;
On other architectures (e.g. ppc64le), __EPOLL_PACKED is not defined
and thus there are 4 internal padding bytes between `events` and
`data`, that are not initialized.

3. When epoll_ctl() is invoked, in [Linux kernel][3], it does a
copy_from_user() to copy the whole struct into kernel space. That's
why Valgrind reports "epoll_ctl(event) points to uninitialised
byte(s)", only for non-x86 platforms.

4. sdbusplus has a timer test that invokes sd_event_add_time() and
eventually hit the above code. And in OpenBMC CI there are x86-64 and
ppc64le systems.
   * When the build is run on x86-64, there is no Valgrind error;
   * When the build is run on ppc64le, we got the error.

To fix that issue, we could:
1. Wait for the GCC bug to be fixed, which is not likely to happen.
The bug was opened in 2016 on GCC 5.4.0, now the issue remains on GCC
9.2.1 as well...
2. Workaround the issue by initializing strict epoll_event to 0. I
send a PR to systemd https://github.com/systemd/systemd/pull/14353,
but not sure if it will be accepted or not.
3. If the above is not accepted, we have to suppress the Valgrind
error. It is sent to gerrit as well:
   https://gerrit.openbmc-project.xyz/c/openbmc/sdbusplus/+/25548

So if you think GCC or systemd could be fixed, could you please go to
the GCC bug or the systemd PR and leave comments to help to push the
fix?
Or if you think it's acceptable to suppress the Valgrind error, please
give +1 on the gerrit review.

Thanks!

[1]: https://github.com/systemd/systemd/blob/v242/src/libsystemd/sd-event/sd-event.c#L956-L959
[2]: https://github.com/bminor/glibc/blob/f1a0eb5b6762b315517469da47735c51bde6f4ad/sysdeps/unix/sysv/linux/x86/bits/epoll.h#L29
[3]: https://github.com/torvalds/linux/blob/d1eef1c619749b2a57e514a3fa67d9a516ffa919/fs/eventpoll.c#L2095


More information about the openbmc mailing list