sched/debug: CPU hotplug operation suffers in a large cpu systems

Wed Nov 9 02:38:31 AEDT 2022

On Tue, Nov 08, 2022 at 08:21:00PM +0530, Srikar Dronamraju wrote:
> * Greg Kroah-Hartman <gregkh at linuxfoundation.org> [2022-11-08 13:24:39]:
> 
> > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> 
> Hi Greg, 
> 
> > > 
> > > Thanks Greg & Peter for your direction. 
> > > 
> > > While we pursue the idea of having debugfs based on kernfs, we thought about
> > > having a boot time parameter which would disable creating and updating of the
> > > sched_domain debugfs files and this would also be useful even when the kernfs
> > > solution kicks in, as users who may not care about these debugfs files would
> > > benefit from a faster CPU hotplug operation.
> > 
> > Ick, no, you would be adding a new user/kernel api that you will be
> > required to support for the next 20+ years.  Just to get over a
> > short-term issue before you solve the problem properly.
> > 
> > If you really do not want these debugfs files, just disable debugfs from
> > your system.  That should be a better short-term solution, right?
> > 
> > Or better yet, disable SCHED_DEBUG, why can't you do that?
> 
> Thanks a lot for your quick inputs.
> 
> CONFIG_SCHED_DEBUG disables a lot more stuff than just updation of debugfs
> files. Information like /sys/kernel/debug/sched/debug and system-wide and
> per process wide information would be lost when that config is disabled.
> 
> Most users would still be using distribution kernels and most distribution
> kernels that I know of seem to have CONFIG_SCHED_DEBUG enabled.

Then work with the distros to remove that option if it doesn't do well
on very large systems.

Odds are they really do not want that enabled either, but that's not our
issue, that's theirs :)

> In a large system, lets say close to 2000 CPUs and we are offlining around
> 1750 CPUs. For example ppc64_cpu --smt=1  on a powerpc. Even if we move to a
> lesser overhead kernfs based implementation, we would still be creating
> files and deleting files for every CPU offline. Most users may not even be
> aware of these files. However for a few users who may be using these files
> once a while, we end up creating and deleting these files for all users. The
> overhead increases exponentially with the number of CPUs. I would assume the
> max number of CPUs are going to increase in future further.

I understand the issue, you don't have to explain it again.  The
scheduler developers like to see these files, and for them it's useful.
Perhaps for distros that is not a useful thing to have around, that's
up to them.

> Hence our approach was to reduce the overhead for those users who are sure
> they don't depend on these files. We still keep the creating of the files as
> the default approach so that others who depend on it are not going to be
> impacted.

No, you are adding a new user/kernel api to the kernel that you then
have to support for the next 20+ years because you haven't fixed the
real issue here.

I think you could have done the kernfs conversion already, it shouldn't
be that complex, right?

Note, when you do it, you might want to move away from returning a raw
dentry from debugfs calls, and instead use an opaque type "debugfs_file"
or something like that, instead, which might make this easier over time.

thanks,

greg k-h