[c-lightning] Shared-memory Routemap

ZmnSCPxj ZmnSCPxj at protonmail.com
Mon Sep 23 01:17:47 AEST 2019

Hi all,

Rusty expressed the opinion, to move routing techniques to plugins.
And he made also https://github.com/ElementsProject/lightning/pull/3075 .

Simply put, the consideration of this pull request is to reduce representation of nodes and channels, so contains fields only for purpose of routemaps.
And each plugin separately reads the `gossip_store` file to create an in-memory routemap.

My concern, is that this routemap is (to my mind, needlessly) duplicated across each plugin instance.
Consider that we may eventually have a `getroutequick` and `permuteroute` and `recalcroute` implementat1ions (as well as legacy `getroute` implementation), possibly in separate plugins for easier code management.

I want to propose, to use a shared-memory routemap, by use of `shm_open` and then `mmap`ing the memory into each plugin.
At low-level, `shm_open` simply creates files in `/dev/shm` (or `/run/shm` depending on distro) in a Linux system.
This `/dev/shm` is assuredly using `tmpfs`, which is backed by swap space.
(Of note is that regular program read/writeable memory is "the same" kind of memory that `tmpfs` uses, thus it has no kernel-level overhead compared to ordinary use of `malloc`)

I checked also, and for Raspberry Pis, it seems total `/dev/shm` can be as small as 50Mbytes.
As the routemap currently may be 10Mb or so, this is a substantial fraction of the 50Mbyte maximum size of `/dev/shm`, but it may still be workable.

A separate `routemapd` is the sole writer to this shared memory routemap.
It reads the `gossip_store` to derive this shared routemap, and keeps up-to-date on the `gossip_store`.
Plugins simply read this shared memory routemap.

This has the advantage that the routemap is not needlessly duplicated across plugins.
This should help us make more plugins that need the routemap, without less worry about memory load on smaller devices.

Below are some notes.


As the memory is shared, we should take care the synchronization across multiple threads-of-control.
Thus, we should care point, to use a proper synchronization structure.

Fortunately, Posix Threads `pthread_rwlock_t` structure fits this very well.
There is the attribute `pthread_rwlockattrib_setpshared` which can be set to `PTHREAD_PROCESS_SHARED`.
This allows the structure to be placed on `shm_open` space and then used to synchronize across entire processes using shared-memory.

In actuality the Linux version of `pthread_rwlock_t` ignore this setting and is always possible to use the structure to synchronize between threads and processes.
However, other operating systems might require explicitly this `PTHREAD_PROCESS_SHARED` and thus it should still be used.

`pthread_rwlock_t` is best here, as we have a privileged writer (`routemapd`) of the routemap while plugins simply read the routemap.

* While `routemapd` has not caught up to latest `gossip_store`, it acquires a writelock, performs the update, then releases the writelock.
* Whenever a routing plugin wants to get a route or otherwise look at routemap, it acquires a readlock, performs the routing operation, the releases the readlock.
  * This can be enforced by proper object-oriented design: a library provides a `routemap` object, whose only operation is a `routemap_get_view` that constructs a `routemap_view` object, which is the actual object that is capable of looking up nodes and etc.
    * Construction of a `routemap_view` object acquires the readlock, while its destruction releases the readlock, as normal RAII pattern.

This allows routing plugins to run in parallel (for example for background routemap processing needed by `getroutequick`, while `pay` uses the normal `getroute` to get a route while the initial `getroutequick` acceleration cache is not yet populated).

The problem, of course, is misbehaving plugins!
A plugin which constructs a `routemap_view` but then crashes without releasing it will prevent the routemap from ever being updated.
This is the biggest problem with this style.

Another method of synchronization would be to somehow acquire a pipe file descriptor (the writeable end) from `routemapd`.
As long as the writeable end is kept open, `select` on the readable end will indicate non-readiness for reading.
However, if the only process with the write-end file descriptor open terminates for any reason, the write-end file descriptor is automatically closed, and then the read-end will end-of-file and become ready for reading.
However, this is relatively heavyweight, and potentially spends limited OS resources.

Pointers Within Shared Memory

We can use `u32` as our "pointer-to-shared structure" type.
This indicates an offset from the start of our shared memory.
This reduces the size of pointers in 64-bit environments to 4 bytes and helps reduce the size of routemaps.

This limits us to having routemaps of less than 4Gb.
However, this would represent a network size ~300 times larger than current, so may be acceptable for a few years more.

The `pthread_rwlock_t` can be placed at offset 0 of the shared memory, so that a `u32` of 0 can also serve as `NULL` for us.

Memory Management

Memory needs to be managed.
Unfortunately, the "built-in" `malloc` and `free` memory management algorithms inherently assume an unshared process-private memory.
Thus we need to write our own implementations of memory management, targeting specifically our use-case.

Fortunately, we primarily need to only implement representations of Lightning nodes and channels.
We can determine the size of these objects at compile time.
Then, we need only to use slab allocators, one for node objects and one for channel objects.

A slab allocator is a specialized memory manager which can manage multiple objects of a single fixed size.
It uses a simple singly-linked list to store currently-free objects; the singly-linked list forms a stack, and allocating an object simply pops from this stack, and freeing an object simply pushes onto this stack.

If the freelist is empty when an object is to be allocated, it requests an entire "slab" of memory, i.e. a large piece of memory, then splits it into objects and pushes them onto the freelist.
In shared-memory terms, this means performing an `ftruncate` to increase the shared memory by the slab size (nearly equivalent to a `sbrk` call) to allocate a slab of memory for the shared-memory area.

Finally, as `routemapd` is the sole privileged writer, the memory management structures only need to exist in its process-private memory space.
Readers of the routemap (i.e. routing plugins) do not have to allocate objects on the routemap, they just read the objects that exist there.

Node Lookup and Channel Iteration

We need a fast way to look up nodes.
Currently we use hashtables, but this is difficult to use with our own shared memory routemap:

* It requires a single array of varying size.
  This makes it hard to fit in slab allocator, which is for managing multiple objects of the same size.

Instead, I consider to use a red-black binary search tree.
The Lightning node objects themselves serve as the nodes in the binary search tree.

The red/black indicator can be a single byte.
As nodes have a 33-byte node id, it is best to place the single red/black byte after the node id so that it will be in the space that would otherwise not be usable due to alignment.

Assuming the hashtable keeps less than 50% residency before resizing, then the number of pointers in the hashtable array should be equal to the number of left + right pointers added to each node (each node has one pointer in the hashtable array, but if the hashtable has 50% residency, then the size of the backing array is at least twice the number of nodes).

Drawbacks are as follows:

* O(log n) lookup instead of average O(1).
* Node objects increased by two pointers-to-node-objects (4 bytes, since we will use `u32` for pointer-inside-shared-memory even on 64-bit platforms) to represent left and right sub-trees.
  * These pointers will be "cold" relative to the other fields of the node; generally, once a routing algorithm has located the start and goal nodes, it no longer has to look at these fields.
    However, they will remain in the CPU cache while traversing the channels of a node due to proximity, wasting precious CPU cache.
* Iterator objects over all nodes, as they iterate over a binary tree, requires O(log n) space rather than O(1) space as for hashtables.

For representing the channels of a node, we will similarly replace the array with a singly-linked list, for a similar reason.

For the singly-linked list, as a channel "belongs" to two nodes, each channel has two `next` pointers.
One is for `node[0]` of the channel while other is for `node[1]`.
Iterator for the nodes of a channel would then need to know which node is being iterated, so it can test `node[0]` and `node[1]` to determine whether to take `next[0]` or `next[1]` when stepping.
Deletion of a particular channel does become more complicated.

We prefer a singly-linked list here, despite the complexity of deletion, since an array of pointers to channels has only one pointer per channel that a node owns, so to keep storage space small, we should use a singly-linked list (so one pointer per channel a node owns) instead of doubly-linked.
Doubly-linked lists only have an advantage during deletion anyway, which we *hope* is rare.

Similarly we can make a red-black binary search tree of channels also, for fast channel lookup.

The root pointers to the trees can be placed in the 0th slab near the "`NULL`" location.

`mmap` Notes

We should consider the `mmap`-defined virtual memory address as a "viewport" onto the shared memory routemap.
As routemap size grows, the viewport might not be enough to see the entire routemap.
(On 64-bit systems this is not as restrictive, but on 32-bit systems it would probably do good to have a small viewport, so that the plugin can use the rest of the virtual memory address space for ordinary memory usage.)

First we should reserve space in the virtual memory space by calling `mmap` with `PROT_NONE` and `MAP_ANON` (`MAP_ANON` is not in POSIX but is supported by most modern Unixlikes; if not present we could just open `/dev/zero` and `MAP_PRIVATE` a `PROT_NONE` area with it).
This is the viewport we can use in the virtual memory space to view the shared memory routemap.

Switching the viewport location is done by using `mmap` with the same memory area and using `MAP_FIXED`.

One thing we can observe is that due to the slab allocator, no object will ever cross a slab boundary.
So we can make the viewport a multiple of the slab size.
When we try to locate an object in the shared memory, we only need to put its slab within the viewport.

We can have the `routemap` object maintain the viewport and what part of the shared memory it currently has mapped.

As taking a *reader* lock requires us to write into the memory reserved for the `pthread_rwlock_t`, even plugins must get read and write access to the memory, at least at the time it locks and unlocks the shared memory.

SHM Naming

The name to be passed to `shm_open` will on Linux simply be appended to `/dev/shm`.

The question is what name should be used.
Of note is that in theory we could run multiple `lightningd` instances on the same system.

However, each such `lightningd` must have its own distinct `lightningdir`.
So a simple thought is to simply hash the `lightningdir`, prepend `lightningd` to the hash, and use that as the name for SHM.

A plugin that is started for a particular `lightningd` will need to know what this SHM name is.
We can provide a `getroutemap` command which returns this SHM name.
Thus, plugins should:

1.  Issue the `getroutemap` command.
2.  Use the routemap library to open the SHM name and get access to the shared routemap via the library.

Object Orientation

Below is description of public interface.
Suffice it to say, that I fully intend to abuse `static inline` and `#define` to ensure as few calls as possible are actually compiled, even if all data is accessed via what syntactically look like functions.

* Type `routemap_attrib` - defined in header, but fields are not part of public interface.
  * `void routemap_attrib_init(struct routemap_attrib *attrib)` - Set up an attribute object with the default values.
  * `unsigned int routemap_attrib_getviewportsize(const struct routemap_attrib *attrib)` - Get viewport size in the attributes object.
  * `void routemap_attrib_setviewportsize(struct routemap_attrib *attrib, unsigned int viewportsize)` - Set viewport size in the attributes object.
  * `void routemap_attrib_destroy(struct routemap_attrib *attrib)` - Cleans up any resources the attribute object might have.

* Type `routemap` - declared only in header.
  * `struct routemap *routemap_new(const struct routemap_attrib *attrib, const char *shm_name, unsigned int shm_name_len)` - Allocate (using `malloc`) a routemap and initialize the object.
    - `attrib` can be `NULL`, in which case the routemap is created with default attributes.
    - `shm_name` must be the name acquired via the `getroutemap` command.
      As the name will come from a JSON-RPC command we require a `shm_name_len` as well so as not to require zero-termination.
  * `struct routemap_view *routemap_get_view(struct routemap *routemap)` - Get a view of the routemap.
    - Only one view at a time can be used, if a view already exists, this will lead to undefined behavior!
  * `bool routemap_is_view_gotten(const struct routemap *routemap)` - Determine if a view of the routemap is already in use and has not been deleted yet.
  * `void routemap_delete(struct routemap *routemap)` - Destroy the routemap and free all resources it is using.
    - Any views and sub-objects of the view will also be destroyed.
    - If `routemap` is `NULL`, does nothing.
    - IMPORTANT: your program should make an effort to do this even under strange conditions, such as after receiving `SIGINT` or `SIGTERM`.
      Do note it is not safe to call this in the signal handler directly.

* Type `routemap_node` - `typedef uint32_t routemap_node`
  * `ROUTEMAP_NODE_NULL` - `NULL`/nonexistent node.
    Numerically 0, so can be checked via direct `if`.
  * `void routemap_node_get_id(const struct routemap_view *view, routemap_node node, unsigned char id_output[33])` - Get the ID of the routemap.
  * `routemap_chan routemap_node_first_chan(const struct routemap_view *view, routemap_node node)` - Get the first channel of a node.
    - Returns `ROUTEMAP_CHAN_NULL` (0) if node has no channels.
  * `routemap_chan routemap_node_next_chan(const struct routemap_view *view, routemap_node node, routemap_chan chan)` - Get the next channel of a node.
    - Returns `ROUTEMAP_CHAN_NULL` (0) if node has no more channels.

* Type `routemap_chan` - `typedef uint32_t routemap_chan`
  * `ROUTEMAP_CHAN_NULL` - `NULL`/nonexistend channel.
    Numerically 0, `if`, blah.
  * `uint64_t routemap_chan_get_id(const struct routemap_view *view, routemap_chan channel)` - Get the short-channel-id of the node.
  * `int routemap_chan_get_dir(const struct routemap_view *view, routemap_chan channel, routemap_node node)` - Get which direction (0 or 1) the specified `node` owns in the `channel`.
    - Result is undefined if the given `node` is not one of the participants of the channel.
  * `uint32_t routemap_chan_get_fee_base(const struct routemap_view *view, routemap_chan channel, int dir)`.
    `uint32_t routemap_chan_get_fee_proportional(const struct routemap_view *view, routemap_chan channel, int dir)`.
    `uint32_t routemap_chan_get_delay(const struct routemap_view *view, routemap_chan channel, int dir)`.
    `uint64_t routemap_chan_get_htlc_min(const struct routemap_view *view, routemap_chan channel, int dir)`.
    `uint64_t routemap_chan_get_htlc_max(const struct routemap_view *view, routemap_chan channel, int dir)`.
    `routemap_node routemap_chan_get_node(const struct routemap_view *view, routemap_chan channel, int dir)`.
    - Access the appropriate field of the given channel-direction.

* Type `routemap_view` - defined in header, but fields are not part of public interface.
  * `void routemap_view_delete(struct routemap_view *view)` - Release the view.
    - IMPORTANT: your program should make an effort to do this even under strange conditions, such as after receiving `SIGINT` or `SIGTERM`.
      Do note it is not safe to call this in the signal handler directly.
    - Deletion of the containing `routemap` via `routemap_delete` implicitly deletes any `routemap_view` currently in use (this can be relied upon after receiving `SIGINT` or `SIGTERM`).
  * `routemap_node routemap_view_find_node(const struct routemap_view *view, uint8_t id[33])` - Find the node with the specified node ID.
    - Returns `ROUTEMAP_NODE_NULL` (0) if not found.
  * `routemap_chan routemap_view_find_chan(const struct routemap_view *view, uint64_t scid)` - Find the channel with the specified short-channel-ID.
    - Returns `ROUTEMAP_CHAN_NULL` (0) if not found.

* Type `routemap_allnode_iter` - defined in header, but fields are not part of public interface.
  * `routemap_node routemap_allnode_iter_first(const struct routemap_view *view, struct routemap_allnode_iter *iter)` - Start iterating over all nodes in the routemap view.
    - Returns `ROUTEMAP_NODE_NULL` (0) if no nodes.
  * `routemap_node routemap_allnode_iter_next(const struct routemap_view *view, struct routemap_allnode_iter *iter)` - Get next node in the iteration of all nodes.
    - Returns `ROUTEMAP_NODE_NULL` (0) if no more nodes.
  * `void routemap_allnode_iter_destroy(struct routemap_allnode_iter *iter)` - Clean up the node iterator.
    - IMPORTANT: If you called `routemap_allnode_iter_first`, you MUST call `routemap_allnode_iter_destroy` whether or not you completed iteration, otherwise memory will be leaked.

* Type `routemap_allchan_iter` - defined in header, but fields are not part of public interface.
  * `routemap_chan routemap_allcan_iter_first(const struct routemap_view *view, struct routemap_allchan_iter *iter)` - Start iterating over all channels in the routemap view.
    - Returns `ROUTEMAP_CHAN_NULL` (0) if no channels.
  * `routemap_chan routemap_allcan_iter_next(const struct routemap_view *view, struct routemap_allchan_iter *iter)` - Get next channel in the iteration of all channels.
    - Returns `ROUTEMAP_CHAN_NULL` (0) if no channels.
  * `void routemap_allchan_iter_destroy(struct routemap_allchan_iter *iter)` - Clean up the channel iterator.
    - IMPORTANT: If you called `routemap_allchan_iter_first`, you MUST call `routemap_allchan_iter_destroy` whether or not you completed iteration, otherwise memory will be leaked.


Some routing algorithms might want to keep track of per-node or per-channel data of interest only to itself.
For example, `getroutequick` will annotate a distance-from-self to each node.

The simplest way to implement this would be to create a mapping within your plugin from the node id / channel id to the per-node / per-channel data.

However, note that nodes and channels may eventually be deleted from the routemap.
Further, especially for nodes in particular, node IDs are large (33 bytes) and might very well dwarf the per-node or per-channel annotation.

We can signal deletion of channels and nodes from the routemap via the ordinary notification mechanism.
Asynchronous delivery of this notification is fine, since it is an advisory to tell plugins to remove the corresponding node IDs or channel SCIDs from their internal maps.
Notifications also allow multiple plugins to be attached to them.

The issue of large node ID size is harder to fix.
It would be desirable to use the `routemap_node` type instead (which is really just an offset within the shared memory), as we do not move objects in the shared memory anyway.
However, this may potentially lead to ABA problems, where the node is deleted, then its `routemap_node` is reused later for a new node, causing any annotations of the previous instance to magically appear on the new instance, possibly leading to incorrect or buggy behavior.

Initially, I thought to use hooks.
Basically, the `routemapd` would invoke a hook, and only once all hooked plugins have returned, will `routemapd` continue with the deletion.
This should be fine since only the `routemapd` "should" write to the shared memory routemap.

Deletion would take the following steps:

1.  `routemapd` takes the writelock.
2.  The node or channel is disconnected from the object graph.
    i.e. it is removed from the red-black tree, and (for channels) from the linked lists it is in.
3.  `routemapd` releases the writelock, to prevent deadlock conditions (i.e. the plugin is waiting on the readlock, while the `routemapd` is waiting for a reply from the hook invocation).
4.  At this point, if the plugin traverses the routemap, it will no longer encounter the object to be deleted.
    Presumably, the plugin will not make annotations if it never encounters the object in the routemap.
5.  `routemapd` invokes the hook (and invokes the notification as well) and suspends until all hooks have completed.
6.  Plugins clean up their annotation data (mappings from `routemap_node` or `routemap_chan` to per-node or per-channel data).
7.  `routemapd` finds that all hooked plugins have completed.
8.  `routemapd` takes the writelock.
9.  `routemapd` returns the object(s) removed to their respective slab allocators.
10.  `routemapd` releases the writelock.

The above requires that hook chaining be implemented first, though.
I might implement this later, as it can reduce the size of annotations on nodes, by reducing the key from a `struct node_id` (33 bytes, which has bad alignment properties if e.g. you put it in a `struct` with `u64`s) to a `u32` (4 bytes).

In many ways this is similar to the issue of weak pointers.
We want a weak mapping between the node object to its annotated data, so when the node object is in a position to have its memory reused, the mapping has to be destroyed at that point.


More information about the c-lightning mailing list