[RFC] sched/eevdf: sched feature to dismiss lag on wakeup

Fri Mar 15 00:45:41 AEDT 2024

On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
> On 2/28/24 16:10, Tobias Huschle wrote:
> > 
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> >    - kworker and a cgroup are supposed to execute on the same CPU
> >    - one task within the cgroup is executing and wakes up the kworker
> >    - kworker with 0 lag, gets picked immediately and finishes its
> >      execution within ~5000ns
> >    - on dequeue, kworker gets assigned a negative lag
> >    Is this expected behavior? With this short execution time, I would
> >    expect the kworker to be fine.
> 
> That strikes me as a bit odd as well. Have you been able to determine how a negative lag
> is assigned to the kworker after such a short runtime?
> 

I did some more trace reading and found something.

What I observed if everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
  - vhost starts alone on an otherwise empty runqueue
  - it seems like it never gets dequeued
    (unless another unrelated task joins or migration hits)
  - if vhost wakes up the kworker, the kworker gets selected
  - vhost runtime > kworker runtime 
    --> kworker gets positive lag and gets selected immediately next time

What happens if it does go wrong:
From what I gather, there seem to be occasions where the vhost either
executes surprisingly quickly, or the kworker surprisingly slowly. If these
outliers reach critical values, it can happen that
   vhost runtime < kworker runtime
which now causes the kworker to get the negative lag.
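
The sign flip can be illustrated with a toy model. This is not kernel code:
it assumes two equal-weight entities that start a cycle at the same vruntime,
that vruntime advances 1:1 with runtime, and that lag is the average vruntime
minus the entity's own vruntime. The function name and the nanosecond values
are made up for illustration.

```python
def kworker_lag_at_dequeue(vhost_runtime_ns, kworker_runtime_ns):
    """Lag of the kworker when it leaves the runqueue (toy model)."""
    # Assumption: both entities start the cycle at vruntime 0 and have
    # equal weight, so vruntime advances by exactly the runtime consumed.
    vhost_vruntime = vhost_runtime_ns
    kworker_vruntime = kworker_runtime_ns
    # With equal weights the average vruntime is the plain mean.
    avg_vruntime = (vhost_vruntime + kworker_vruntime) / 2
    # Positive lag: the entity is owed service. Negative: it ran ahead.
    return avg_vruntime - kworker_vruntime


# Regular case: vhost runs longer than the kworker.
print(kworker_lag_at_dequeue(20_000, 5_000))   # 7500.0
# Outlier: quick vhost, slow kworker.
print(kworker_lag_at_dequeue(3_000, 12_000))   # -4500.0
```

In this model the sign of the kworker's lag depends only on which of the two
ran longer, which is why a single slow kworker run is enough to flip it.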

In this case it seems that the vhost is very fast in waking up
the kworker and, coincidentally, the kworker takes more time than usual
to finish. We are talking about 4-digit to low 5-digit nanoseconds.

So, for these outliers, the scheduler extrapolates that the kworker 
out-consumes the vhost and should be slowed down, although in the majority
of other cases this does not happen.

Therefore this particular use case would profit from being able to ignore
such outliers, or from being able to ignore a certain amount of difference
in the lag values, i.e. introduce some grace value around the average runtime
for which lag is not accounted. But I'm not sure if I like that idea.

So the negative lag can be somewhat justified, but for this particular case
it leads to a problem where one outlier can cause havoc. As mentioned in the
vhost discussion, it could also be argued that the vhost should not rely on
the kworker always getting scheduled on wakeup, since these
timing issues can always happen.

Hence, the two options:
- offer the alternative strategy which dismisses lag on wakeup for workloads
  where we know that a task usually finishes faster than others but should
  not be punished by rare outliers (if that is predictable, I don't know)
- require vhost to address this issue on their side (if possible without
  creating an armada of side effects)
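
To make the first option concrete, here is a toy sketch of wakeup placement
with and without dismissing the saved lag. It assumes the entity is placed at
the average vruntime minus its saved lag and ignores the weight scaling the
kernel applies. The function and the dismiss_lag flag are illustrative names,
not the actual sched feature.

```python
def place_on_wakeup(avg_vruntime, saved_lag, dismiss_lag=False):
    """Return the vruntime a waking entity is placed at (toy model)."""
    # With the hypothetical flag set, the saved lag is simply dropped.
    lag = 0 if dismiss_lag else saved_lag
    # Negative lag places the entity after the average, so it has to
    # wait. Zero lag places it exactly at the average.
    return avg_vruntime - lag


# A kworker that saved a negative lag from one outlier run:
print(place_on_wakeup(100_000, -4_500))                    # 104500
print(place_on_wakeup(100_000, -4_500, dismiss_lag=True))  # 100000
```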

(plus the third one mentioned above, but that requires a magic cutoff value, meh)

> I was looking at a different thread (https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/) that
> uncovers a potential overflow in the eligibility calculation. Though I don't think that is the case for this particular
> vhost problem.

Yea, the numbers I see do not look very overflowy.