[PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads
Gao Xiang
hsiangkao at linux.alibaba.com
Tue Feb 7 13:55:40 AEDT 2023
On 2023/2/7 03:41, Sandeep Dhavale wrote:
> On Mon, Feb 6, 2023 at 2:01 AM Gao Xiang <xiang at kernel.org> wrote:
>>
>> Hi Sandeep,
>>
>> On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
>>> Using a per-cpu thread pool, we can reduce the scheduling latency compared
>>> to the workqueue implementation. With this patch, scheduling latency and its
>>> variation are reduced because the per-cpu threads are high-priority
>>> kthread_workers.
>>>
>>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>>
>>> The table below shows resulting improvements of total scheduling latency
>>> for the same app launch benchmark runs with 50 iterations. Scheduling
>>> latency is the latency between when the task (workqueue kworker vs
>>> kthread_worker) became eligible to run to when it actually started
>>> running.
>>> +-------------------------+-----------+----------------+---------+
>>> |                         | workqueue | kthread_worker |  diff   |
>>> +-------------------------+-----------+----------------+---------+
>>> | Average (us)            |     15253 |           2914 | -80.89% |
>>> | Median (us)             |     14001 |           2912 | -79.20% |
>>> | Minimum (us)            |      3117 |           1027 | -67.05% |
>>> | Maximum (us)            |     30170 |           3805 | -87.39% |
>>> | Standard deviation (us) |      7166 |            359 |         |
>>> +-------------------------+-----------+----------------+---------+
>>>
>>> Background: Boot times and cold app launch benchmarks are very
>>> important to the Android ecosystem as they directly translate to
>>> responsiveness from the user's point of view. While erofs provides
>>> a lot of important features like space savings, we saw some
>>> performance penalty in cold app launch benchmarks in a few scenarios.
>>> Analysis showed that the significant variance was coming from the
>>> scheduling cost, while the decompression cost was more or less the same.
>>>
>>> With a per-cpu thread pool, the table above shows that scheduling
>>> latency drops by ~80% on average. This problem was discussed
>>> at LPC 2022. The LPC 2022 slides and talk are linked at [1].
>>>
>>> [1] https://lpc.events/event/16/contributions/1338/
>>>
>>> Signed-off-by: Sandeep Dhavale <dhavale at google.com>
>>> ---
>>> V3 -> V4
>>> * Updated commit message with background information
>>> V2 -> V3
>>> * Fix a warning Reported-by: kernel test robot <lkp at intel.com>
>>> V1 -> V2
>>> * Changed name of kthread_workers from z_erofs to erofs_worker
>>> * Added kernel configuration to run kthread_workers at normal or
>>> high priority
>>> * Added cpu hotplug support
>>> * Added wrapped kthread_workers under worker_pool
>>> * Added one unbound thread in a pool to handle a context where
>>> we already stopped per-cpu kthread worker
>>> * Updated commit message
>>
>> I've just modified your v4 patch based on erofs -dev branch with
>> my previous suggestion [1], but I haven't tested it.
>>
>> Could you help check if the updated patch looks good to you and
>> test it on your side? If there are unexpected behaviors, please
>> help update as well, thanks!
> Thanks Xiang, I was working on the same. I see that you have cleaned it up.
> I will test it and report/fix any problems.
>
> Thanks,
> Sandeep.
Thanks! Looking forward to your test results. BTW, we have less than 2 weeks
before 6.3, so I'd like to get this fixed this week so that we can catch the
6.3 merge window.
I've fixed some cpu hotplug errors as below and pushed the result to a branch
for 0day CI testing.
Thanks,
Gao Xiang
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 73198f494a6a..92a9e20948b0 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -398,7 +398,7 @@ static inline void erofs_destroy_percpu_workers(void) {}
 static inline int erofs_init_percpu_workers(void) { return 0; }
 #endif
 
-#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_EROFS_FS_PCPU_KTHREAD)
 static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
 static enum cpuhp_state erofs_cpuhp_state;
 
@@ -408,7 +408,7 @@ static int erofs_cpu_online(unsigned int cpu)
 
 	worker = erofs_init_percpu_worker(cpu);
 	if (IS_ERR(worker))
-		return ERR_PTR(worker);
+		return PTR_ERR(worker);
 
 	spin_lock(&z_erofs_pcpu_worker_lock);
 	old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
@@ -428,7 +428,7 @@ static int erofs_cpu_offline(unsigned int cpu)
 	spin_lock(&z_erofs_pcpu_worker_lock);
 	worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
 			lockdep_is_held(&z_erofs_pcpu_worker_lock));
-	rcu_assign_pointer(worker_pool.workers[cpu], NULL);
+	rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
 	spin_unlock(&z_erofs_pcpu_worker_lock);
 
 	synchronize_rcu();
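
A note on the second hunk above: erofs_cpu_online() returns int, so an error
pointer has to be converted back to an errno with PTR_ERR(); ERR_PTR() goes
the other way, from errno to pointer. The following is only a minimal
userspace sketch of that convention. The three helpers are simplified
stand-ins for the ones in <linux/err.h>, and struct worker, create_worker()
and cpu_online() are hypothetical names for illustration, not erofs code.

```c
#include <errno.h>

/* Simplified userspace stand-ins for the <linux/err.h> helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)error;		/* errno -> pointer */
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;		/* pointer -> errno */
}

static inline int IS_ERR(const void *ptr)
{
	/* the top MAX_ERRNO addresses are reserved for error codes */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

struct worker {				/* hypothetical placeholder type */
	int cpu;
};

/* Hypothetical creator: fails with -ENOMEM for odd CPU numbers. */
static struct worker *create_worker(int cpu)
{
	static struct worker w;

	if (cpu & 1)
		return ERR_PTR(-ENOMEM);
	w.cpu = cpu;
	return &w;
}

/* An int-returning callback must hand back PTR_ERR(), not ERR_PTR(). */
static int cpu_online(int cpu)
{
	struct worker *worker = create_worker(cpu);

	if (IS_ERR(worker))
		return (int)PTR_ERR(worker);
	return 0;
}
```

In this sketch cpu_online(0) returns 0 and cpu_online(1) returns -ENOMEM,
which is the shape a cpuhp callback is expected to report.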
>
>>
>> [1] https://lore.kernel.org/r/5e1b7191-9ea6-3781-7928-72ac4cd88591@linux.alibaba.com/
>>
>> Thanks,
>> Gao Xiang
>>
>> From 2e87235abc745c0fef8e32abcd3a51546b4378ad Mon Sep 17 00:00:00 2001
>> From: Sandeep Dhavale <dhavale at google.com>
>> Date: Mon, 6 Feb 2023 17:53:39 +0800
>> Subject: [PATCH] erofs: add per-cpu threads for decompression
>>
>> Using a per-cpu thread pool, we can reduce the scheduling latency compared
>> to the workqueue implementation. With this patch, scheduling latency and its
>> variation are reduced because the per-cpu threads are high-priority
>> kthread_workers.
>>
>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>
>> The table below shows resulting improvements of total scheduling latency
>> for the same app launch benchmark runs with 50 iterations. Scheduling
>> latency is the latency between when the task (workqueue kworker vs
>> kthread_worker) became eligible to run to when it actually started
>> running.
>> +-------------------------+-----------+----------------+---------+
>> |                         | workqueue | kthread_worker |  diff   |
>> +-------------------------+-----------+----------------+---------+
>> | Average (us)            |     15253 |           2914 | -80.89% |
>> | Median (us)             |     14001 |           2912 | -79.20% |
>> | Minimum (us)            |      3117 |           1027 | -67.05% |
>> | Maximum (us)            |     30170 |           3805 | -87.39% |
>> | Standard deviation (us) |      7166 |            359 |         |
>> +-------------------------+-----------+----------------+---------+
>>
>> Background: Boot times and cold app launch benchmarks are very
>> important to the Android ecosystem as they directly translate to
>> responsiveness from the user's point of view. While erofs provides
>> a lot of important features like space savings, we saw some
>> performance penalty in cold app launch benchmarks in a few scenarios.
>> Analysis showed that the significant variance was coming from the
>> scheduling cost, while the decompression cost was more or less the same.
>>
>> With a per-cpu thread pool, the table above shows that scheduling
>> latency drops by ~80% on average. This problem was discussed
>> at LPC 2022. The LPC 2022 slides and talk are linked at [1].
>>
>> [1] https://lpc.events/event/16/contributions/1338/
>>
>> Signed-off-by: Sandeep Dhavale <dhavale at google.com>
>> Signed-off-by: Gao Xiang <hsiangkao at linux.alibaba.com>
>> ---
>> fs/erofs/Kconfig | 18 +++++
>> fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
>> 2 files changed, 189 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
>> index 85490370e0ca..704fb59577e0 100644
>> --- a/fs/erofs/Kconfig
>> +++ b/fs/erofs/Kconfig
>> @@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
>> read support.
>>
>> If unsure, say N.
>> +
>> +config EROFS_FS_PCPU_KTHREAD
>> + bool "EROFS per-cpu decompression kthread workers"
>> + depends on EROFS_FS_ZIP
>> + help
>> + Saying Y here enables per-CPU kthread workers pool to carry out
>> + async decompression for low latencies on some architectures.
>> +
>> + If unsure, say N.
>> +
>> +config EROFS_FS_PCPU_KTHREAD_HIPRI
>> + bool "EROFS high priority per-CPU kthread workers"
>> + depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
>> + help
>> + This permits EROFS to configure per-CPU kthread workers to run
>> + at higher priority.
>> +
>> + If unsure, say N.
>> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
>> index 384f64292f73..73198f494a6a 100644
>> --- a/fs/erofs/zdata.c
>> +++ b/fs/erofs/zdata.c
>> @@ -7,6 +7,8 @@
>> #include "compress.h"
>> #include <linux/prefetch.h>
>> #include <linux/psi.h>
>> +#include <linux/slab.h>
>> +#include <linux/cpuhotplug.h>
>>
>> #include <trace/events/erofs.h>
>>
>> @@ -109,6 +111,7 @@ struct z_erofs_decompressqueue {
>> union {
>> struct completion done;
>> struct work_struct work;
>> + struct kthread_work kthread_work;
>> } u;
>> bool eio, sync;
>> };
>> @@ -341,24 +344,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)
>>
>> static struct workqueue_struct *z_erofs_workqueue __read_mostly;
>>
>> -void z_erofs_exit_zip_subsystem(void)
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +static struct kthread_worker __rcu **z_erofs_pcpu_workers;
>> +
>> +static void erofs_destroy_percpu_workers(void)
>> {
>> - destroy_workqueue(z_erofs_workqueue);
>> - z_erofs_destroy_pcluster_pool();
>> + struct kthread_worker *worker;
>> + unsigned int cpu;
>> +
>> + for_each_possible_cpu(cpu) {
>> + worker = rcu_dereference_protected(
>> + z_erofs_pcpu_workers[cpu], 1);
>> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
>> + if (worker)
>> + kthread_destroy_worker(worker);
>> + }
>> + kfree(z_erofs_pcpu_workers);
>> }
>>
>> -static inline int z_erofs_init_workqueue(void)
>> +static struct kthread_worker *erofs_init_percpu_worker(int cpu)
>> {
>> - const unsigned int onlinecpus = num_possible_cpus();
>> + struct kthread_worker *worker =
>> + kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
>>
>> - /*
>> - * no need to spawn too many threads, limiting threads could minimum
>> - * scheduling overhead, perhaps per-CPU threads should be better?
>> - */
>> - z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
>> - WQ_UNBOUND | WQ_HIGHPRI,
>> - onlinecpus + onlinecpus / 4);
>> - return z_erofs_workqueue ? 0 : -ENOMEM;
>> + if (IS_ERR(worker))
>> + return worker;
>> + if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
>> + sched_set_fifo_low(worker->task);
>> + else
>> + sched_set_normal(worker->task, 0);
>> + return worker;
>> +}
>> +
>> +static int erofs_init_percpu_workers(void)
>> +{
>> + struct kthread_worker *worker;
>> + unsigned int cpu;
>> +
>> + z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
>> + sizeof(struct kthread_worker *), GFP_ATOMIC);
>> + if (!z_erofs_pcpu_workers)
>> + return -ENOMEM;
>> +
>> + for_each_online_cpu(cpu) { /* could miss cpu{off,on}line? */
>> + worker = erofs_init_percpu_worker(cpu);
>> + if (!IS_ERR(worker))
>> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
>> + }
>> + return 0;
>> +}
>> +#else
>> +static inline void erofs_destroy_percpu_workers(void) {}
>> +static inline int erofs_init_percpu_workers(void) { return 0; }
>> +#endif
>> +
>> +#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
>> +static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
>> +static enum cpuhp_state erofs_cpuhp_state;
>> +
>> +static int erofs_cpu_online(unsigned int cpu)
>> +{
>> + struct kthread_worker *worker, *old;
>> +
>> + worker = erofs_init_percpu_worker(cpu);
>> + if (IS_ERR(worker))
>> + return ERR_PTR(worker);
>> +
>> + spin_lock(&z_erofs_pcpu_worker_lock);
>> + old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
>> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
>> + if (!old)
>> + rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
>> + spin_unlock(&z_erofs_pcpu_worker_lock);
>> + if (old)
>> + kthread_destroy_worker(worker);
>> + return 0;
>> +}
>> +
>> +static int erofs_cpu_offline(unsigned int cpu)
>> +{
>> + struct kthread_worker *worker;
>> +
>> + spin_lock(&z_erofs_pcpu_worker_lock);
>> + worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
>> + lockdep_is_held(&z_erofs_pcpu_worker_lock));
>> + rcu_assign_pointer(worker_pool.workers[cpu], NULL);
>> + spin_unlock(&z_erofs_pcpu_worker_lock);
>> +
>> + synchronize_rcu();
>> + if (worker)
>> + kthread_destroy_worker(worker);
>> + return 0;
>> +}
>> +
>> +static int erofs_cpu_hotplug_init(void)
>> +{
>> + int state;
>> +
>> + state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>> + "fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
>> + if (state < 0)
>> + return state;
>> +
>> + erofs_cpuhp_state = state;
>> + return 0;
>> +}
>> +
>> +static void erofs_cpu_hotplug_destroy(void)
>> +{
>> + if (erofs_cpuhp_state)
>> + cpuhp_remove_state_nocalls(erofs_cpuhp_state);
>> +}
>> +#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
>> +static inline int erofs_cpu_hotplug_init(void) { return 0; }
>> +static inline void erofs_cpu_hotplug_destroy(void) {}
>> +#endif
>> +
>> +void z_erofs_exit_zip_subsystem(void)
>> +{
>> + erofs_cpu_hotplug_destroy();
>> + erofs_destroy_percpu_workers();
>> + destroy_workqueue(z_erofs_workqueue);
>> + z_erofs_destroy_pcluster_pool();
>> }
>>
>> int __init z_erofs_init_zip_subsystem(void)
>> @@ -366,10 +473,29 @@ int __init z_erofs_init_zip_subsystem(void)
>> int err = z_erofs_create_pcluster_pool();
>>
>> if (err)
>> - return err;
>> - err = z_erofs_init_workqueue();
>> + goto out_error_pcluster_pool;
>> +
>> + z_erofs_workqueue = alloc_workqueue("erofs_worker",
>> + WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
>> + if (!z_erofs_workqueue)
>> + goto out_error_workqueue_init;
>> +
>> + err = erofs_init_percpu_workers();
>> if (err)
>> - z_erofs_destroy_pcluster_pool();
>> + goto out_error_pcpu_worker;
>> +
>> + err = erofs_cpu_hotplug_init();
>> + if (err < 0)
>> + goto out_error_cpuhp_init;
>> + return err;
>> +
>> +out_error_cpuhp_init:
>> + erofs_destroy_percpu_workers();
>> +out_error_pcpu_worker:
>> + destroy_workqueue(z_erofs_workqueue);
>> +out_error_workqueue_init:
>> + z_erofs_destroy_pcluster_pool();
>> +out_error_pcluster_pool:
>> return err;
>> }
>>
>> @@ -1305,11 +1431,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
>>
>> DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
>> z_erofs_decompress_queue(bgq, &pagepool);
>> -
>> erofs_release_pages(&pagepool);
>> kvfree(bgq);
>> }
>>
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
>> +{
>> + z_erofs_decompressqueue_work((struct work_struct *)work);
>> +}
>> +#endif
>> +
>> static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>> int bios)
>> {
>> @@ -1324,9 +1456,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>>
>> if (atomic_add_return(bios, &io->pending_bios))
>> return;
>> - /* Use workqueue and sync decompression for atomic contexts only */
>> + /* Use (kthread_)work and sync decompression for atomic contexts only */
>> if (in_atomic() || irqs_disabled()) {
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> + struct kthread_worker *worker;
>> +
>> + rcu_read_lock();
>> + worker = rcu_dereference(
>> + z_erofs_pcpu_workers[raw_smp_processor_id()]);
>> + if (!worker) {
>> + INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
>> + queue_work(z_erofs_workqueue, &io->u.work);
>> + } else {
>> + kthread_queue_work(worker, &io->u.kthread_work);
>> + }
>> + rcu_read_unlock();
>> +#else
>> queue_work(z_erofs_workqueue, &io->u.work);
>> +#endif
>> /* enable sync decompression for readahead */
>> if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
>> sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
>> @@ -1455,7 +1602,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
>> *fg = true;
>> goto fg_out;
>> }
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> + kthread_init_work(&q->u.kthread_work,
>> + z_erofs_decompressqueue_kthread_work);
>> +#else
>> INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
>> +#endif
>> } else {
>> fg_out:
>> q = fgq;
>> @@ -1640,7 +1792,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
>>
>> /*
>> * although background is preferred, no one is pending for submission.
>> - * don't issue workqueue for decompression but drop it directly instead.
>> + * don't issue decompression but drop it directly instead.
>> */
>> if (!*force_fg && !nr_bios) {
>> kvfree(q[JQ_SUBMIT]);
>> --
>> 2.30.2
>>
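
The dispatch decision in z_erofs_decompress_kickoff() above can be read as:
use the current CPU's kthread worker if one is published, otherwise fall back
to the unbound workqueue. What follows is only a rough userspace sketch of
that selection logic, under loud assumptions: it leaves out RCU, the spinlock
and all real queueing, and NR_CPUS, struct worker, dispatch(),
worker_online() and worker_offline() are hypothetical names, not part of the
patch.

```c
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical CPU count for the sketch */

enum dispatch_path {
	PATH_PCPU_WORKER,	/* would call kthread_queue_work() */
	PATH_WORKQUEUE,		/* would call queue_work() as the fallback */
};

struct worker {			/* hypothetical placeholder type */
	int cpu;
};

/* A slot stays NULL while its CPU is offline or worker creation failed. */
static struct worker *pcpu_workers[NR_CPUS];

/* Pick the path for a request completing on @cpu. */
static enum dispatch_path dispatch(int cpu)
{
	struct worker *worker = pcpu_workers[cpu];

	if (!worker)
		return PATH_WORKQUEUE;
	return PATH_PCPU_WORKER;
}

/* Publish a worker for @cpu; an already published worker is kept. */
static void worker_online(int cpu, struct worker *w)
{
	if (!pcpu_workers[cpu])
		pcpu_workers[cpu] = w;
}

/* Unpublish and return the worker for @cpu so the caller can destroy it. */
static struct worker *worker_offline(int cpu)
{
	struct worker *old = pcpu_workers[cpu];

	pcpu_workers[cpu] = NULL;
	return old;
}
```

In the patch itself the slot is read under rcu_read_lock() and cleared under
z_erofs_pcpu_worker_lock followed by synchronize_rcu(); the sketch only shows
why a NULL slot has to route requests to the workqueue.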