[c-lightning] Shared-memory Routemap

Tue Sep 24 15:48:57 AEST 2019

Good morning all,

Some further thoughts.

Entity-Component Model
======================

In https://github.com/ElementsProject/lightning/pull/3075#issuecomment-534348482 Rusty expresses:

>  It's better for an extension to use the same array approach for their metadata as we use internally; though they can always use an array of void * and simulate this approach if space isn't a concern.

This seems a good idea to have.

It also encourages us to use shared-memory routemaps, since the routemaps, when synchronized, will now be exactly the same across plugins, and there would be little or no point to duplicating them across every plugin that wants to get data on routing information.

Surprisingly, the interface I proposed before works with this style: the `routemap_node` and `routemap_chan` can be treated by the plugin as a meaningless handle identifying the object.
We only need to add a note that `routemap_node` and `routemap_chan` are expected to be small integers and can be used as 1-based indexes to an array of metadata locally stored by the plugin in its own process space.
(it is 1-based as we use `0` as `NULL`.)

However, this complicates memory management.
Not only do we need to track which entries in the arrays are in-use and which may be reused later, we also need to maintain two large contiguous arrays.

My initial thought to this is to simply split the shared memory into two areas: an array of channel objects and an array of node objects.
The array of node objects can grow freely, but if the array of channel objects has to grow, we need to move the node objects array.
The cost of the movement can be amortized by putting some space between the channel object array and node object array where the channel array can grow.

Alternately, we could just maintain two different shared memories, one for channels and one for nodes.
This greatly reduces our memory management overhead, so we only need to use a freelist to manage used-ness of entries in each array.
This lets us move the problem of finding memory for the two arrays as they grow without end to the OS.

Entry 0 for each array will not be used, as mentioned above, due to using `0` index as `NULL`.
We can store some data into it, such as the root of the red-black trees for channels / nodes.

Deletion
========

By use of the above, plugins can easily manage their metadata by using the `routemap_node` and `routemap_chan` as indexes to an array, possibly 1-indexed.

However, we should take care that metadata stored by plugins may need to be deleted when a channel or node is deleted.
This is important since the same index may be reused in the future.
A node or channel might be semantically "new" (i.e. the plugin has not encountered it yet and has not put a valid metadata into it) but because it reuses an existing entry, would take on the metadata of the previously-deleted node or channel.

We can currently use notifications, informing all interested plugins about deletion of nodes / channels, so they can clear the metadata for those indexes as appropriate for a new object.

The problem with notifications is that they are asynchronous.
However, we should note that it is the *reuse* of the same index that may cause problems like the above.

What we could do would be:

* Have the freelist form a deque rather than a stack.
* Put freed objects into the tail of the queue.
  * When putting freed objects into the freelist, also record the time it was deleted.
    * This implies that least-recent objects are at the front of the queue while most-recent are at the tail of the queue.
* Get new objects from the front of the queue.
  * If the object at the front of the queue is too recent (e.g. less than 10 minutes since it was deleted), do not get it.
    * Instead, increase the array size and put new objects to the front of the queue, with time at 0.

This allows any deletion notification to hopefully "propagate" to all plugins by the time we will reuse the index of the deleted object for a new object.

Synchronization
===============

As noted before, using `pthread_rwlock_rdlock` has a risk where a plugin crashes while holding a readlock, then prevents the `routemapd` from acquiring a writelock and updating the routemap shared memory.

As plugins are third-party entities, arbitrary crashes of those plugins must be protected against.

Rather than use a `pthread_rwlock_t` in the shared memory, we can instead use an `AF_UNIX` socket in the lightningdir.

The `routemapd` `listen()`s to this socket.

When it `accept()`s an incoming connection, it reopens the shared memories and sends its file descriptors over the new connection.
(`AF_UNIX` sockets *can* send file descriptors, it is what we use extensively between other C-Lightning daemons)
Then it stops updating the routemap (and possibly just `accept()`s further plugins) as long as any plugin has this socket open.

Thus, at plugin side:

* Acquire readlock ==> connect to socket and get file descriptors for shared memory.
* Release readlock ==> close connection to socket.

As process termination implies `close` of all file descriptors it has open, it implies that unexpected process termination of plugins also closes the connection to the socket, releasing the readlock.

Of course, a plugin may still misbehave by entering an infinite loop, or calling `pause()`, while holding a readlock (i.e. having a connection on the socket open).
Then `routemapd` can keep track of the age of every connection.
If the connection becomes too old (say 30 minutes) the `routemapd` can force-close the connection, revoking the readlock.

Finally, we still need some kind of memory fence, as the processor might cache data that was recently written by the `routemapd`.
We can simply have `routemapd` maintain a local `pthread_mutex_t`, and use `pthread_mutex_unlock` as a portable memory fence, since all writes before the `pthread_mutex_unlock` are assured to occur before the `pthread_mutex_unlock` as per Posix, updating the shared memory.

Cleanup
=======

`routemapd` itself may be buggy and crash.

In that case we should clean up the shared memory so that it does not take up space in `/dev/shm`.

We already have a technique for determining if another process has crashed (i.e. `pipe` or `socket` with the other end kept open by the other process).
We can launch a daemonized monitor process which waits for the main `routemapd` to die, then cleans up the shared memory.
This is really very simple:

    void monitor_process(int fd, char const *shm_name)
    {
        char buf[1024];
        for (;;) {
            ssize_t res = read(fd, buff, 1024);
            if (res < 0 && errno == EINTR)
                continue;
            else if (res < 0)
                exit(1);
            else if (res == 0)
                break;
        }
        (void) shm_unlink(shm_name);
        exit(0);
    }

Regards,
ZmnSCPxj