[PATCH 19/26] netfs: New writeback implementation
Vadim Fedorenko
vadim.fedorenko at linux.dev
Sat Mar 30 12:03:21 AEDT 2024
On 28/03/2024 16:34, David Howells wrote:
> The current netfslib writeback implementation creates writeback requests of
> contiguous folio data and then separately tiles subrequests over the space
> twice, once for the server and once for the cache. This creates a few
> issues:
>
> (1) Every time there's a discontiguity or a change between writing to only
> one destination or writing to both, it must create a new request.
> This makes it harder to do vectored writes.
>
> (2) The folios don't have the writeback mark removed until the end of the
> request - and a request could be hundreds of megabytes.
>
> (3) In future, I want to support a larger cache granularity, which will
> require aggregation of some folios that contain unmodified data (which
> only need to go to the cache) and some which contain modifications
> (which need to be uploaded and stored to the cache) - but, currently,
> these are treated as discontiguous.
>
> There's also a move to get everyone to use writeback_iter() to extract
> writable folios from the pagecache. That said, currently writeback_iter()
> has some issues that make it less than ideal:
>
> (1) there's no way to cancel the iteration, even if you find a "temporary"
> error that means the current folio and all subsequent folios are going
> to fail;
>
> (2) there's no way to filter the folios being written back - something
> that will impact Ceph with its ordered snap system;
>
> (3) and if you get a folio you can't immediately deal with (say you need
> to flush the preceding writes), you are left with a folio hanging in
> the locked state for the duration, when really we should unlock it and
> relock it later.
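
(For context: the stock writeback_iter() loop is shaped roughly like the
sketch below - example_writepages() and example_write_folio() are made-up
names, and the per-folio helper is expected to unlock the folio itself.
There's no hook in that loop for any of the three points above.)

	static int example_writepages(struct address_space *mapping,
				      struct writeback_control *wbc)
	{
		struct folio *folio = NULL;
		int error = 0;

		/* NULL starts the walk; passing the previous folio back
		 * in lets the iterator finish with it.  Nothing here
		 * lets the caller cleanly abort the walk, filter the
		 * folios returned or unlock a folio and come back to it
		 * later.
		 */
		while ((folio = writeback_iter(mapping, wbc, folio, &error)))
			error = example_write_folio(folio, wbc);

		return error;
	}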
>
> In this new implementation, I use writeback_iter() to pump folios,
> progressively creating two parallel, but separate streams and cleaning up
> the finished folios as the subrequests complete. Either or both streams
> can contain gaps, and the subrequests in each stream can be of variable
> size, don't need to align with each other and don't need to align with the
> folios.
>
> Indeed, subrequests can cross folio boundaries, may cover several folios or
> a folio may be spanned by multiple subrequests, e.g.:
>
>          +---+---+-----+-----+---+----------+
>  Folios: |   |   |     |     |   |          |
>          +---+---+-----+-----+---+----------+
>
>          +------+------+     +----+----+
>  Upload: |      |      |.....|    |    |
>          +------+------+     +----+----+
>
>          +------+------+------+------+------+
>  Cache:  |      |      |      |      |      |
>          +------+------+------+------+------+
>
> The progressive subrequest construction permits the algorithm to be
> preparing both the next upload to the server and the next write to the
> cache whilst the previous ones are already in progress. Throttling can be
> applied to control the rate of production of subrequests - and, in any
> case, we probably want to write them to the server in ascending order,
> particularly if the file will be extended.
>
> Content crypto can also be prepared at the same time as the subrequests and
> run asynchronously, with the prepped requests being stalled until the
> crypto catches up with them. This might also be useful for transport
> crypto, but that happens at a lower layer, so probably would be harder to
> pull off.
>
> The algorithm is split into three parts:
>
> (1) The issuer. This walks through the data, packaging it up, encrypting
> it and creating subrequests. The part of this that generates
> subrequests only deals with file positions and spans and so is usable
> for DIO/unbuffered writes as well as buffered writes.
>
> (2) The collector. This asynchronously collects completed subrequests,
> unlocks folios, frees crypto buffers and performs any retries. This
> runs in a work queue so that the issuer can return to the caller for
> writeback (so that the VM can have its kswapd thread back) or async
> writes.
>
> (3) The retryer. This pauses the issuer, waits for all outstanding
> subrequests to complete and then goes through the failed subrequests
> to reissue them. This may involve reprepping them (with cifs, the
> credits must be renegotiated, and a subrequest may need splitting),
> and doing RMW for content crypto if there's a conflicting change on
> the server.
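
FWIW, my mental model of the issuer/collector handoff is roughly the
shape below - all of the type and function names are made up, this is
just to check my understanding of the completion path:

	struct example_wreq {
		struct work_struct	work;		/* collector work item */
		struct list_head	done_subreqs;	/* completed, uncollected */
		spinlock_t		lock;
	};

	struct example_subreq {
		struct list_head	link;
		int			error;
	};

	/* Subrequest I/O completion: park the subrequest for the
	 * collector and punt the cleanup (folio unlocking, retries) to
	 * a workqueue so that neither the issuer nor kswapd ever
	 * blocks on it.
	 */
	static void example_subreq_done(struct example_wreq *wreq,
					struct example_subreq *subreq)
	{
		unsigned long flags;

		spin_lock_irqsave(&wreq->lock, flags);
		list_add_tail(&subreq->link, &wreq->done_subreqs);
		spin_unlock_irqrestore(&wreq->lock, flags);

		queue_work(system_unbound_wq, &wreq->work);
	}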
>
> [!] Note that some of the functions are prefixed with "new_" to avoid
> clashes with existing functions. These will be renamed in a later patch
> that cuts over to the new algorithm.
>
> Signed-off-by: David Howells <dhowells at redhat.com>
> cc: Jeff Layton <jlayton at kernel.org>
> cc: Eric Van Hensbergen <ericvh at kernel.org>
> cc: Latchesar Ionkov <lucho at ionkov.net>
> cc: Dominique Martinet <asmadeus at codewreck.org>
> cc: Christian Schoenebeck <linux_oss at crudebyte.com>
> cc: Marc Dionne <marc.dionne at auristor.com>
> cc: v9fs at lists.linux.dev
> cc: linux-afs at lists.infradead.org
> cc: netfs at lists.linux.dev
> cc: linux-fsdevel at vger.kernel.org
[..snip..]
> +/*
> + * Begin a write operation for writing through the pagecache.
> + */
> +struct netfs_io_request *new_netfs_begin_writethrough(struct kiocb *iocb, size_t len)
> +{
> +	struct netfs_io_request *wreq = NULL;
> +	struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
> +
> +	mutex_lock(&ictx->wb_lock);
> +
> +	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
> +				      iocb->ki_pos, NETFS_WRITETHROUGH);
> +	if (IS_ERR(wreq))
> +		mutex_unlock(&ictx->wb_lock);
> +
> +	wreq->io_streams[0].avail = true;
If IS_ERR(wreq) is true, execution falls through with wb_lock already
dropped and this dereference is on an ERR_PTR - the error path needs an
early return.
> +	trace_netfs_write(wreq, netfs_write_trace_writethrough);
Not sure we still need the trace call in the error case either.
> +	return wreq;
> +}
> +
[..snip..]
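
Something like this (untested, based purely on the quoted code) would
cover both points:

	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
				      iocb->ki_pos, NETFS_WRITETHROUGH);
	if (IS_ERR(wreq)) {
		/* Unlock and bail out before wreq is dereferenced. */
		mutex_unlock(&ictx->wb_lock);
		return wreq;
	}

	wreq->io_streams[0].avail = true;
	trace_netfs_write(wreq, netfs_write_trace_writethrough);
	return wreq;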