[PATCH v4] erofs: replace erofs_unzipd workqueue with per-cpu threads

Tue Feb 7 13:55:40 AEDT 2023

On 2023/2/7 03:41, Sandeep Dhavale wrote:
> On Mon, Feb 6, 2023 at 2:01 AM Gao Xiang <xiang at kernel.org> wrote:
>>
>> Hi Sandeep,
>>
>> On Fri, Jan 06, 2023 at 07:35:01AM +0000, Sandeep Dhavale wrote:
>>> Using per-cpu thread pool we can reduce the scheduling latency compared
>>> to workqueue implementation. With this patch scheduling latency and
>>> variation is reduced as per-cpu threads are high priority kthread_workers.
>>>
>>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>>
>>> The table below shows resulting improvements of total scheduling latency
>>> for the same app launch benchmark runs with 50 iterations. Scheduling
>>> latency is the latency between when the task (workqueue kworker vs
>>> kthread_worker) became eligible to run to when it actually started
>>> running.
>>> +-------------------------+-----------+----------------+---------+
>>> |                         | workqueue | kthread_worker |  diff   |
>>> +-------------------------+-----------+----------------+---------+
>>> | Average (us)            |     15253 |           2914 | -80.89% |
>>> | Median (us)             |     14001 |           2912 | -79.20% |
>>> | Minimum (us)            |      3117 |           1027 | -67.05% |
>>> | Maximum (us)            |     30170 |           3805 | -87.39% |
>>> | Standard deviation (us) |      7166 |            359 |         |
>>> +-------------------------+-----------+----------------+---------+
>>>
>>> Background: Boot times and cold app launch benchmarks are very
>>> important to the android ecosystem as they directly translate to
>>> responsiveness from user point of view. While erofs provides
>>> a lot of important features like space savings, we saw some
>>> performance penalty in cold app launch benchmarks in few scenarios.
>>> Analysis showed that the significant variance was coming from the
>>> scheduling cost while decompression cost was more or less the same.
>>>
>>> Having per-cpu thread pool we can see from the above table that this
>>> variation is reduced by ~80% on average. This problem was discussed
>>> at LPC 2022. Link to LPC 2022 slides and
>>> talk at [1]
>>>
>>> [1] https://lpc.events/event/16/contributions/1338/
>>>
>>> Signed-off-by: Sandeep Dhavale <dhavale at google.com>
>>> ---
>>> V3 -> V4
>>> * Updated commit message with background information
>>> V2 -> V3
>>> * Fix a warning Reported-by: kernel test robot <lkp at intel.com>
>>> V1 -> V2
>>> * Changed name of kthread_workers from z_erofs to erofs_worker
>>> * Added kernel configuration to run kthread_workers at normal or
>>>    high priority
>>> * Added cpu hotplug support
>>> * Added wrapped kthread_workers under worker_pool
>>> * Added one unbound thread in a pool to handle a context where
>>>    we already stopped per-cpu kthread worker
>>> * Updated commit message
>>
>> I've just modified your v4 patch based on erofs -dev branch with
>> my previous suggestion [1], but I haven't tested it.
>>
>> Could you help check if the updated patch looks good to you and
>> test it on your side?  If there are unexpected behaviors, please
>> help update as well, thanks!
> Thanks Xiang, I was working on the same. I see that you have cleaned it up.
> I will test it and report/fix any problems.
> 
> Thanks,
> Sandeep.

Thanks! Look forward to your test. BTW, we have < 2 weeks for 6.3, so I'd
like to fix it this week so that we could catch 6.3 merge window.


I've fixed some cpu hotplug errors as below and added to a branch for 0day CI
testing.

Thanks,
Gao Xiang

diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 73198f494a6a..92a9e20948b0 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -398,7 +398,7 @@ static inline void erofs_destroy_percpu_workers(void) {}
  static inline int erofs_init_percpu_workers(void) { return 0; }
  #endif

-#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_EROFS_FS_PCPU_KTHREAD)
  static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
  static enum cpuhp_state erofs_cpuhp_state;

@@ -408,7 +408,7 @@ static int erofs_cpu_online(unsigned int cpu)

  	worker = erofs_init_percpu_worker(cpu);
  	if (IS_ERR(worker))
-		return ERR_PTR(worker);
+		return PTR_ERR(worker);

  	spin_lock(&z_erofs_pcpu_worker_lock);
  	old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
@@ -428,7 +428,7 @@ static int erofs_cpu_offline(unsigned int cpu)
  	spin_lock(&z_erofs_pcpu_worker_lock);
  	worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
  			lockdep_is_held(&z_erofs_pcpu_worker_lock));
-	rcu_assign_pointer(worker_pool.workers[cpu], NULL);
+	rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
  	spin_unlock(&z_erofs_pcpu_worker_lock);

  	synchronize_rcu();

> 
>>
>> [1] https://lore.kernel.org/r/5e1b7191-9ea6-3781-7928-72ac4cd88591@linux.alibaba.com/
>>
>> Thanks,
>> Gao Xiang
>>
>>  From 2e87235abc745c0fef8e32abcd3a51546b4378ad Mon Sep 17 00:00:00 2001
>> From: Sandeep Dhavale <dhavale at google.com>
>> Date: Mon, 6 Feb 2023 17:53:39 +0800
>> Subject: [PATCH] erofs: add per-cpu threads for decompression
>>
>> Using per-cpu thread pool we can reduce the scheduling latency compared
>> to workqueue implementation. With this patch scheduling latency and
>> variation is reduced as per-cpu threads are high priority kthread_workers.
>>
>> The results were evaluated on arm64 Android devices running 5.10 kernel.
>>
>> The table below shows resulting improvements of total scheduling latency
>> for the same app launch benchmark runs with 50 iterations. Scheduling
>> latency is the latency between when the task (workqueue kworker vs
>> kthread_worker) became eligible to run to when it actually started
>> running.
>> +-------------------------+-----------+----------------+---------+
>> |                         | workqueue | kthread_worker |  diff   |
>> +-------------------------+-----------+----------------+---------+
>> | Average (us)            |     15253 |           2914 | -80.89% |
>> | Median (us)             |     14001 |           2912 | -79.20% |
>> | Minimum (us)            |      3117 |           1027 | -67.05% |
>> | Maximum (us)            |     30170 |           3805 | -87.39% |
>> | Standard deviation (us) |      7166 |            359 |         |
>> +-------------------------+-----------+----------------+---------+
>>
>> Background: Boot times and cold app launch benchmarks are very
>> important to the android ecosystem as they directly translate to
>> responsiveness from user point of view. While erofs provides
>> a lot of important features like space savings, we saw some
>> performance penalty in cold app launch benchmarks in few scenarios.
>> Analysis showed that the significant variance was coming from the
>> scheduling cost while decompression cost was more or less the same.
>>
>> Having per-cpu thread pool we can see from the above table that this
>> variation is reduced by ~80% on average. This problem was discussed
>> at LPC 2022. Link to LPC 2022 slides and
>> talk at [1]
>>
>> [1] https://lpc.events/event/16/contributions/1338/
>>
>> Signed-off-by: Sandeep Dhavale <dhavale at google.com>
>> Signed-off-by: Gao Xiang <hsiangkao at linux.alibaba.com>
>> ---
>>   fs/erofs/Kconfig |  18 +++++
>>   fs/erofs/zdata.c | 190 ++++++++++++++++++++++++++++++++++++++++++-----
>>   2 files changed, 189 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/erofs/Kconfig b/fs/erofs/Kconfig
>> index 85490370e0ca..704fb59577e0 100644
>> --- a/fs/erofs/Kconfig
>> +++ b/fs/erofs/Kconfig
>> @@ -108,3 +108,21 @@ config EROFS_FS_ONDEMAND
>>            read support.
>>
>>            If unsure, say N.
>> +
>> +config EROFS_FS_PCPU_KTHREAD
>> +       bool "EROFS per-cpu decompression kthread workers"
>> +       depends on EROFS_FS_ZIP
>> +       help
>> +         Saying Y here enables per-CPU kthread workers pool to carry out
>> +         async decompression for low latencies on some architectures.
>> +
>> +         If unsure, say N.
>> +
>> +config EROFS_FS_PCPU_KTHREAD_HIPRI
>> +       bool "EROFS high priority per-CPU kthread workers"
>> +       depends on EROFS_FS_ZIP && EROFS_FS_PCPU_KTHREAD
>> +       help
>> +         This permits EROFS to configure per-CPU kthread workers to run
>> +         at higher priority.
>> +
>> +         If unsure, say N.
>> diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
>> index 384f64292f73..73198f494a6a 100644
>> --- a/fs/erofs/zdata.c
>> +++ b/fs/erofs/zdata.c
>> @@ -7,6 +7,8 @@
>>   #include "compress.h"
>>   #include <linux/prefetch.h>
>>   #include <linux/psi.h>
>> +#include <linux/slab.h>
>> +#include <linux/cpuhotplug.h>
>>
>>   #include <trace/events/erofs.h>
>>
>> @@ -109,6 +111,7 @@ struct z_erofs_decompressqueue {
>>          union {
>>                  struct completion done;
>>                  struct work_struct work;
>> +               struct kthread_work kthread_work;
>>          } u;
>>          bool eio, sync;
>>   };
>> @@ -341,24 +344,128 @@ static void z_erofs_free_pcluster(struct z_erofs_pcluster *pcl)
>>
>>   static struct workqueue_struct *z_erofs_workqueue __read_mostly;
>>
>> -void z_erofs_exit_zip_subsystem(void)
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +static struct kthread_worker __rcu **z_erofs_pcpu_workers;
>> +
>> +static void erofs_destroy_percpu_workers(void)
>>   {
>> -       destroy_workqueue(z_erofs_workqueue);
>> -       z_erofs_destroy_pcluster_pool();
>> +       struct kthread_worker *worker;
>> +       unsigned int cpu;
>> +
>> +       for_each_possible_cpu(cpu) {
>> +               worker = rcu_dereference_protected(
>> +                                       z_erofs_pcpu_workers[cpu], 1);
>> +               rcu_assign_pointer(z_erofs_pcpu_workers[cpu], NULL);
>> +               if (worker)
>> +                       kthread_destroy_worker(worker);
>> +       }
>> +       kfree(z_erofs_pcpu_workers);
>>   }
>>
>> -static inline int z_erofs_init_workqueue(void)
>> +static struct kthread_worker *erofs_init_percpu_worker(int cpu)
>>   {
>> -       const unsigned int onlinecpus = num_possible_cpus();
>> +       struct kthread_worker *worker =
>> +               kthread_create_worker_on_cpu(cpu, 0, "erofs_worker/%u", cpu);
>>
>> -       /*
>> -        * no need to spawn too many threads, limiting threads could minimum
>> -        * scheduling overhead, perhaps per-CPU threads should be better?
>> -        */
>> -       z_erofs_workqueue = alloc_workqueue("erofs_unzipd",
>> -                                           WQ_UNBOUND | WQ_HIGHPRI,
>> -                                           onlinecpus + onlinecpus / 4);
>> -       return z_erofs_workqueue ? 0 : -ENOMEM;
>> +       if (IS_ERR(worker))
>> +               return worker;
>> +       if (IS_ENABLED(CONFIG_EROFS_FS_PCPU_KTHREAD_HIPRI))
>> +               sched_set_fifo_low(worker->task);
>> +       else
>> +               sched_set_normal(worker->task, 0);
>> +       return worker;
>> +}
>> +
>> +static int erofs_init_percpu_workers(void)
>> +{
>> +       struct kthread_worker *worker;
>> +       unsigned int cpu;
>> +
>> +       z_erofs_pcpu_workers = kcalloc(num_possible_cpus(),
>> +                       sizeof(struct kthread_worker *), GFP_ATOMIC);
>> +       if (!z_erofs_pcpu_workers)
>> +               return -ENOMEM;
>> +
>> +       for_each_online_cpu(cpu) {      /* could miss cpu{off,on}line? */
>> +               worker = erofs_init_percpu_worker(cpu);
>> +               if (!IS_ERR(worker))
>> +                       rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
>> +       }
>> +       return 0;
>> +}
>> +#else
>> +static inline void erofs_destroy_percpu_workers(void) {}
>> +static inline int erofs_init_percpu_workers(void) { return 0; }
>> +#endif
>> +
>> +#if defined(CONFIG_HOTPLUG_CPU) && defined(EROFS_FS_PCPU_KTHREAD)
>> +static DEFINE_SPINLOCK(z_erofs_pcpu_worker_lock);
>> +static enum cpuhp_state erofs_cpuhp_state;
>> +
>> +static int erofs_cpu_online(unsigned int cpu)
>> +{
>> +       struct kthread_worker *worker, *old;
>> +
>> +       worker = erofs_init_percpu_worker(cpu);
>> +       if (IS_ERR(worker))
>> +               return ERR_PTR(worker);
>> +
>> +       spin_lock(&z_erofs_pcpu_worker_lock);
>> +       old = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
>> +                       lockdep_is_held(&z_erofs_pcpu_worker_lock));
>> +       if (!old)
>> +               rcu_assign_pointer(z_erofs_pcpu_workers[cpu], worker);
>> +       spin_unlock(&z_erofs_pcpu_worker_lock);
>> +       if (old)
>> +               kthread_destroy_worker(worker);
>> +       return 0;
>> +}
>> +
>> +static int erofs_cpu_offline(unsigned int cpu)
>> +{
>> +       struct kthread_worker *worker;
>> +
>> +       spin_lock(&z_erofs_pcpu_worker_lock);
>> +       worker = rcu_dereference_protected(z_erofs_pcpu_workers[cpu],
>> +                       lockdep_is_held(&z_erofs_pcpu_worker_lock));
>> +       rcu_assign_pointer(worker_pool.workers[cpu], NULL);
>> +       spin_unlock(&z_erofs_pcpu_worker_lock);
>> +
>> +       synchronize_rcu();
>> +       if (worker)
>> +               kthread_destroy_worker(worker);
>> +       return 0;
>> +}
>> +
>> +static int erofs_cpu_hotplug_init(void)
>> +{
>> +       int state;
>> +
>> +       state = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>> +                       "fs/erofs:online", erofs_cpu_online, erofs_cpu_offline);
>> +       if (state < 0)
>> +               return state;
>> +
>> +       erofs_cpuhp_state = state;
>> +       return 0;
>> +}
>> +
>> +static void erofs_cpu_hotplug_destroy(void)
>> +{
>> +       if (erofs_cpuhp_state)
>> +               cpuhp_remove_state_nocalls(erofs_cpuhp_state);
>> +}
>> +#else /* !CONFIG_HOTPLUG_CPU || !CONFIG_EROFS_FS_PCPU_KTHREAD */
>> +static inline int erofs_cpu_hotplug_init(void) { return 0; }
>> +static inline void erofs_cpu_hotplug_destroy(void) {}
>> +#endif
>> +
>> +void z_erofs_exit_zip_subsystem(void)
>> +{
>> +       erofs_cpu_hotplug_destroy();
>> +       erofs_destroy_percpu_workers();
>> +       destroy_workqueue(z_erofs_workqueue);
>> +       z_erofs_destroy_pcluster_pool();
>>   }
>>
>>   int __init z_erofs_init_zip_subsystem(void)
>> @@ -366,10 +473,29 @@ int __init z_erofs_init_zip_subsystem(void)
>>          int err = z_erofs_create_pcluster_pool();
>>
>>          if (err)
>> -               return err;
>> -       err = z_erofs_init_workqueue();
>> +               goto out_error_pcluster_pool;
>> +
>> +       z_erofs_workqueue = alloc_workqueue("erofs_worker",
>> +                       WQ_UNBOUND | WQ_HIGHPRI, num_possible_cpus());
>> +       if (!z_erofs_workqueue)
>> +               goto out_error_workqueue_init;
>> +
>> +       err = erofs_init_percpu_workers();
>>          if (err)
>> -               z_erofs_destroy_pcluster_pool();
>> +               goto out_error_pcpu_worker;
>> +
>> +       err = erofs_cpu_hotplug_init();
>> +       if (err < 0)
>> +               goto out_error_cpuhp_init;
>> +       return err;
>> +
>> +out_error_cpuhp_init:
>> +       erofs_destroy_percpu_workers();
>> +out_error_pcpu_worker:
>> +       destroy_workqueue(z_erofs_workqueue);
>> +out_error_workqueue_init:
>> +       z_erofs_destroy_pcluster_pool();
>> +out_error_pcluster_pool:
>>          return err;
>>   }
>>
>> @@ -1305,11 +1431,17 @@ static void z_erofs_decompressqueue_work(struct work_struct *work)
>>
>>          DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED);
>>          z_erofs_decompress_queue(bgq, &pagepool);
>> -
>>          erofs_release_pages(&pagepool);
>>          kvfree(bgq);
>>   }
>>
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +static void z_erofs_decompressqueue_kthread_work(struct kthread_work *work)
>> +{
>> +       z_erofs_decompressqueue_work((struct work_struct *)work);
>> +}
>> +#endif
>> +
>>   static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>>                                         int bios)
>>   {
>> @@ -1324,9 +1456,24 @@ static void z_erofs_decompress_kickoff(struct z_erofs_decompressqueue *io,
>>
>>          if (atomic_add_return(bios, &io->pending_bios))
>>                  return;
>> -       /* Use workqueue and sync decompression for atomic contexts only */
>> +       /* Use (kthread_)work and sync decompression for atomic contexts only */
>>          if (in_atomic() || irqs_disabled()) {
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +               struct kthread_worker *worker;
>> +
>> +               rcu_read_lock();
>> +               worker = rcu_dereference(
>> +                               z_erofs_pcpu_workers[raw_smp_processor_id()]);
>> +               if (!worker) {
>> +                       INIT_WORK(&io->u.work, z_erofs_decompressqueue_work);
>> +                       queue_work(z_erofs_workqueue, &io->u.work);
>> +               } else {
>> +                       kthread_queue_work(worker, &io->u.kthread_work);
>> +               }
>> +               rcu_read_unlock();
>> +#else
>>                  queue_work(z_erofs_workqueue, &io->u.work);
>> +#endif
>>                  /* enable sync decompression for readahead */
>>                  if (sbi->opt.sync_decompress == EROFS_SYNC_DECOMPRESS_AUTO)
>>                          sbi->opt.sync_decompress = EROFS_SYNC_DECOMPRESS_FORCE_ON;
>> @@ -1455,7 +1602,12 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb,
>>                          *fg = true;
>>                          goto fg_out;
>>                  }
>> +#ifdef CONFIG_EROFS_FS_PCPU_KTHREAD
>> +               kthread_init_work(&q->u.kthread_work,
>> +                                 z_erofs_decompressqueue_kthread_work);
>> +#else
>>                  INIT_WORK(&q->u.work, z_erofs_decompressqueue_work);
>> +#endif
>>          } else {
>>   fg_out:
>>                  q = fgq;
>> @@ -1640,7 +1792,7 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f,
>>
>>          /*
>>           * although background is preferred, no one is pending for submission.
>> -        * don't issue workqueue for decompression but drop it directly instead.
>> +        * don't issue decompression but drop it directly instead.
>>           */
>>          if (!*force_fg && !nr_bios) {
>>                  kvfree(q[JQ_SUBMIT]);
>> --
>> 2.30.2
>>