 Goals, Design and Implementation of the ultra-scalable O(1) scheduler by
 Ingo Molnar and the Staircase Deadline cpu scheduler policy designed by
 Con Kolivas.


  This was originally an edited version of an email Ingo Molnar sent to
  lkml on 4 Jan 2002.  It describes the goals, design, and implementation
  of Ingo's ultra-scalable O(1) scheduler. It now contains a description
  of the Staircase Deadline priority scheduler that was built on this
  design.
  Last Updated: Fri, 4 May 2007


Goal
====

The main goal of the new scheduler is to keep all the good things we know
and love about the current Linux scheduler:

 - good interactive performance even during high load: if the user
   types or clicks then the system must react instantly and must execute
   the user tasks smoothly, even during considerable background load.

 - good scheduling/wakeup performance with 1-2 runnable processes.

 - fairness: no process should stay without any timeslice for any
   unreasonable amount of time. No process should get an unjustly high
   amount of CPU time.

 - priorities: less important tasks can be started with lower priority,
   more important tasks with higher priority.

 - SMP efficiency: no CPU should stay idle if there is work to do.

 - SMP affinity: processes which run on one CPU should stay affine to
   that CPU. Processes should not bounce between CPUs too frequently.

 - plus additional scheduler features: RT scheduling, CPU binding.

and the goal is also to add a few new things:

 - fully O(1) scheduling. Are you tired of the recalculation loop
   blowing the L1 cache away every now and then? Do you think the goodness
   loop is taking a bit too long to finish if there are lots of runnable
   processes? This new scheduler takes no prisoners: wakeup(), schedule(),
   the timer interrupt are all O(1) algorithms. There is no recalculation
   loop. There is no goodness loop either.

 - 'perfect' SMP scalability. With the new scheduler there is no 'big'
   runqueue_lock anymore - it's all per-CPU runqueues and locks - two
   tasks on two separate CPUs can wake up, schedule and context-switch
   completely in parallel, without any interlocking. All
   scheduling-relevant data is structured for maximum scalability.

 - better SMP affinity. The old scheduler has a particular weakness that
   causes random bouncing of tasks between CPUs if/when there are higher
   priority/interactive tasks; this was observed and reported by many
   people. The reason is that the timeslice recalculation loop first needs
   every currently running task to consume its timeslice. But when this
   happens on eg. an 8-way system, then this property starves an
   increasing number of CPUs from executing any process. Once the last
   task that has a timeslice left has finished using up that timeslice,
   the recalculation loop is triggered and other CPUs can start executing
   tasks again - after having idled around for a number of timer ticks.
   The more CPUs, the worse this effect.

   Furthermore, this same effect causes the bouncing effect as well:
   whenever there is such a 'timeslice squeeze' of the global runqueue,
   idle processors start executing tasks which are not affine to that CPU.
   (because the affine tasks have finished off their timeslices already.)

   The new scheduler solves this problem by distributing timeslices on a
   per-CPU basis, without having any global synchronization or
   recalculation.

 - batch scheduling. A significant proportion of computing-intensive tasks
   benefit from batch-scheduling, where timeslices are long and processes
   are roundrobin scheduled. The new scheduler does such batch-scheduling
   of the lowest priority tasks - so nice +19 jobs will get
   'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
   in essence SCHED_IDLE, from an interactiveness point of view.

 - handle extreme loads more smoothly, without breakdown and scheduling
   storms.

 - O(1) RT scheduling. For those RT folks who are paranoid about the
   O(nr_running) property of the goodness loop and the recalculation loop.

 - run fork()ed children before the parent. Andrea has pointed out the
   advantages of this a few months ago, but patches for this feature
   do not work with the old scheduler as well as they should,
   because idle processes often steal the new child before the fork()ing
   CPU gets to execute it.


Design
======

The core of the new scheduler contains the following mechanisms:

 - *two* priority-ordered 'priority arrays' per CPU. There is an 'active'
   array and an 'expired' array. The active array contains all tasks that
   are affine to this CPU and have timeslices left. The expired array
   contains all tasks which have used up their timeslices - but this array
   is kept sorted as well. The active and expired arrays are not accessed
   directly; they are accessed through two pointers in the per-CPU runqueue
   structure. If all active tasks are used up then we 'switch' the two
   pointers and from now on the ready-to-go (former-) expired array is the
   active array - and the empty active array serves as the new collector
   for expired tasks.

 - there is a 64-bit bitmap cache for array indices. Finding the highest
   priority task is thus a matter of two x86 BSFL bit-search instructions.

the split-array solution enables us to have an arbitrary number of active
and expired tasks, and the recalculation of timeslices can be done
immediately when the timeslice expires. Because the arrays are always
accessed through the pointers in the runqueue, switching the two arrays can
be done very quickly.

this is a hybrid priority-list approach coupled with roundrobin
scheduling and the array-switch method of distributing timeslices.
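
for illustration only, a rough userspace C sketch of the two-array runqueue
and the bitmap lookup follows; the names and the number of priority levels
are simplified and do not match the real kernel structures:

#define MAX_PRIO        140     /* illustrative number of priority levels */
#define BITS_PER_LONG   (8 * sizeof(unsigned long))
#define BITMAP_LONGS    ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct task {
        int prio;               /* index into the queue[] below */
        struct task *next;      /* simple per-priority FIFO for this sketch */
};

struct prio_array {
        unsigned long bitmap[BITMAP_LONGS];     /* bit set: level non-empty */
        struct task *queue[MAX_PRIO];
};

struct runqueue {
        struct prio_array *active, *expired;    /* the two array pointers */
        struct prio_array arrays[2];
};

/*
 * O(1) pick: find the first non-empty priority level via the bitmap.
 * __builtin_ctzl stands in for the BSFL-style bit search mentioned above.
 */
static int find_first_prio(const struct prio_array *a)
{
        for (unsigned int i = 0; i < BITMAP_LONGS; i++)
                if (a->bitmap[i])
                        return i * BITS_PER_LONG +
                                __builtin_ctzl(a->bitmap[i]);
        return -1;              /* array is empty */
}

/*
 * O(1) array switch: when the active array has no tasks left, the two
 * pointers are simply exchanged; no recalculation loop is needed.
 */
static void array_switch(struct runqueue *rq)
{
        struct prio_array *tmp = rq->active;

        rq->active = rq->expired;
        rq->expired = tmp;
}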

 - there is a per-task 'load estimator'.

one of the toughest things to get right is good interactive feel during
heavy system load. While playing with various scheduler variants i found
that the best interactive feel is achieved not by 'boosting' interactive
tasks, but by 'punishing' tasks that want to use more CPU time than there
is available. This method is also much easier to do in an O(1) fashion.

to establish the actual 'load' the task contributes to the system, a
complex-looking but pretty accurate method is used: there is a 4-entry
'history' ringbuffer of the task's activities during the last 4 seconds.
This ringbuffer is operated without much overhead. The entries tell the
scheduler a pretty accurate load-history of the task: has it used up more
CPU time or less during the past N seconds. [the size '4' and the interval
of 4x 1 seconds was found by lots of experimentation - this part is
flexible and can be changed in both directions.]

the penalty a task gets for generating more load than the CPU can handle
is a priority decrease - there is a maximum amount to this penalty
relative to their static priority, so even fully CPU-bound tasks will
observe each other's priorities, and will share the CPU accordingly.
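
as a rough illustration of the estimator and the bounded penalty (the
4-entry history comes from the description above; the field names and the
exact penalty formula below are only assumptions for this sketch):

#define HISTORY_SLOTS   4       /* one ring slot per second, as above */
#define MAX_PENALTY     5       /* assumed cap on the dynamic penalty */

struct load_history {
        unsigned int cpu_ticks[HISTORY_SLOTS]; /* CPU time used per slot */
        unsigned int slot;                     /* current ring position */
};

/* advance the ring once a second, e.g. from the timer tick */
static void history_roll(struct load_history *h)
{
        h->slot = (h->slot + 1) % HISTORY_SLOTS;
        h->cpu_ticks[h->slot] = 0;
}

/* charge the currently running task for one timer tick */
static void history_charge(struct load_history *h)
{
        h->cpu_ticks[h->slot]++;
}

/*
 * dynamic priority: the static priority plus a penalty that grows with
 * recent CPU usage but never exceeds MAX_PENALTY, so fully CPU-bound
 * tasks of different nice levels still share the CPU proportionally
 * (a numerically larger value means lower priority in this sketch).
 */
static int effective_prio(int static_prio, const struct load_history *h,
                          unsigned int ticks_per_slot)
{
        unsigned int used = 0, penalty;
        int i;

        for (i = 0; i < HISTORY_SLOTS; i++)
                used += h->cpu_ticks[i];

        penalty = used * MAX_PENALTY / (HISTORY_SLOTS * ticks_per_slot);
        if (penalty > MAX_PENALTY)
                penalty = MAX_PENALTY;

        return static_prio + penalty;
}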

the SMP load-balancer can be extended/switched with additional parallel
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
can be supported easily by changing the load-balancer. Right now it's
tuned for my SMP systems.

i skipped the prev->mm == next->mm advantage - no workload i know of shows
any sensitivity to this. It can be added back by sacrificing O(1)
schedule() [the current and one-lower priority list can be searched for a
that->mm == current->mm condition], but costs a fair number of cycles
during a number of important workloads, so i wanted to avoid this as much
as possible.

- the SMP idle-task startup code was still racy and the new scheduler
triggered this. So i streamlined the idle-setup code a bit. We do not call
into schedule() before all processors have started up fully and all idle
threads are in place.

- the patch also cleans up a number of aspects of sched.c - moves code
into other areas of the kernel where it's appropriate, and simplifies
certain code paths and data constructs. As a result, the new scheduler's
code is smaller than the old one.

	Ingo


Staircase Deadline cpu scheduler policy
=======================================

Design summary
==============

A novel design which incorporates a foreground-background descending priority
system (the staircase) via a bandwidth allocation matrix according to nice
level.


Features
========

A starvation free, strict fairness O(1) scalable design with interactivity
as good as the above restrictions can provide. There is no interactivity
estimator, no sleep/run measurements and only simple fixed accounting.
The design has strict enough accounting that task behaviour
can be modelled and maximum scheduling latencies can be predicted by
the virtual deadline mechanism that manages runqueues. The prime concern
in this design is to maintain fairness at all costs determined by nice level,
yet to maintain as good interactivity as can be allowed within the
constraints of strict fairness.


Design description
==================

SD works off the principle of providing each task a quota of runtime that it is
allowed to run at a number of priority levels determined by its static priority
(ie. its nice level). If the task uses up its quota it has its priority
decremented to the next level determined by a priority matrix. Once the
runtime quota of every priority level has been consumed, the task is queued
on the "expired" array. When no other tasks exist with quota, the expired
array is activated and fresh quotas are handed out. This is all done in O(1).

Design details
==============

Each task keeps a record of its own entitlement of cpu time. Most of the rest of
these details apply to non-realtime tasks, as rt task management is
straightforward.

Each runqueue keeps a record of what major epoch it is up to in the
rq->prio_rotation field which is incremented on each major epoch. It also
keeps a record of the current prio_level for each static priority task.

Each task keeps a record of what major runqueue epoch it was last running
on in p->rotation. It also keeps a record of what priority levels it has
already been allocated quota from during this epoch in a bitmap p->bitmap.
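
For illustration, the bookkeeping described so far might be sketched in C as
follows. The field names prio_rotation, rotation, bitmap, quota and
time_slice follow the text; the surrounding structures are simplified and do
not reflect the real task_struct or runqueue layouts.

#include <stdint.h>

#define PRIO_RANGE      40      /* one dynamic priority slot per nice level */

struct sd_rq {
        unsigned long prio_rotation;    /* incremented on every major epoch */
        int prio_level[PRIO_RANGE];     /* current prio level per static prio */
};

struct sd_task {
        unsigned long rotation; /* last major epoch this task ran in */
        uint64_t bitmap;        /* prio levels already given quota this epoch */
        int64_t time_slice;     /* remaining quota, in nanoseconds */
        int64_t quota;          /* per-level entitlement set by nice level */
        int prio;               /* current dynamic priority slot */
        int static_prio;        /* static priority, i.e. the nice level */
};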

The only tunable that determines all other details is the RR_INTERVAL. This
is set to 8ms, and is scaled gently upwards with more cpus. This value is
tunable via a /proc interface.

All tasks are initially given a quota based on RR_INTERVAL. This is equal to
RR_INTERVAL between nice values of -6 and 0, half that size above nice 0, and
progressively larger for nice values from -1 to -20. This is assigned to
p->quota and only changes with changes in nice level.

As a task is first queued, it checks in recalc_task_prio to see if it has run at
this runqueue's current priority rotation. If it has not, it will have its
p->prio level set according to the first slot in a "priority matrix" and will be
given a p->time_slice equal to the p->quota, and will have its allocation
bitmap bit set in p->bitmap for this prio level. It is then queued on the
current active priority array.

If a task has already been running during this major epoch, and it has
p->time_slice left and the rq->prio_quota for the task's p->prio still
has quota, it will be placed back on the active array, but no more quota
will be added.

If a task has been running during this major epoch, but does not have
p->time_slice left, it will find the next lowest priority in its bitmap that it
has not been allocated quota from. It then gets a full quota in
p->time_slice. It is then queued on the current active priority array at the
newly determined lower priority.

If a task has been running during this major epoch, and does not have
any entitlement left in p->bitmap and no time_slice left, it will have its
bitmap cleared, and be queued at its best prio again, but on the expired
priority array.

When a task is queued, it has its relevant bit set in the array->prio_bitmap.
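
Putting the cases above together, the queueing decision can be sketched
roughly as follows, continuing the simplified sd_task/sd_rq sketch. The
helper functions are hypothetical stand-ins rather than real kernel
interfaces, and the rq->prio_quota check is omitted for brevity.

struct sd_array;                /* opaque priority array for this sketch */

extern struct sd_array *rq_active(struct sd_rq *rq);
extern struct sd_array *rq_expired(struct sd_rq *rq);
extern void enqueue(struct sd_array *a, struct sd_task *p);
extern int first_prio_slot(int static_prio);    /* first slot in the matrix */
extern int next_prio_slot(int static_prio, uint64_t used); /* next free slot */
extern uint64_t all_slots(int static_prio);     /* every slot for this nice */

static void sd_recalc_task_prio(struct sd_task *p, struct sd_rq *rq)
{
        if (p->rotation != rq->prio_rotation) {
                /* first run this major epoch: start at the first slot of
                 * the priority matrix with a full quota */
                p->rotation = rq->prio_rotation;
                p->bitmap = 0;
                p->prio = first_prio_slot(p->static_prio);
                p->bitmap |= 1ULL << p->prio;
                p->time_slice = p->quota;
                enqueue(rq_active(rq), p);
        } else if (p->time_slice > 0) {
                /* quota left at the current level: requeue on the active
                 * array, but hand out no more quota */
                enqueue(rq_active(rq), p);
        } else if (p->bitmap != all_slots(p->static_prio)) {
                /* slice used up: drop to the next unused slot in the
                 * matrix and refill p->time_slice from p->quota */
                p->prio = next_prio_slot(p->static_prio, p->bitmap);
                p->bitmap |= 1ULL << p->prio;
                p->time_slice = p->quota;
                enqueue(rq_active(rq), p);
        } else {
                /* every slot consumed: clear the bitmap and wait on the
                 * expired array for the next major epoch */
                p->bitmap = 0;
                p->prio = first_prio_slot(p->static_prio);
                p->time_slice = p->quota;
                enqueue(rq_expired(rq), p);
        }
}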

p->time_slice is stored in nanoseconds and is updated via update_cpu_clock on
schedule() and scheduler_tick. If p->time_slice is below zero then
recalc_task_prio is run again and the task is rescheduled.
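
A matching sketch of the slice accounting, with resched_task as a
hypothetical stand-in for the real reschedule request:

extern void resched_task(struct sd_task *p);    /* hypothetical reschedule */

/*
 * charge the running task for the nanoseconds that elapsed since the last
 * update; once the slice goes negative the task is rescheduled and later
 * requeued through the logic sketched above
 */
static void sd_update_cpu_clock(struct sd_task *p, int64_t ns_elapsed)
{
        p->time_slice -= ns_elapsed;
        if (p->time_slice < 0)
                resched_task(p);
}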


Priority Matrix
===============

In order to minimise the latencies between tasks of different nice levels
running concurrently, the dynamic priority slots where different nice levels
are queued are dithered instead of being sequential. What this means is that
there are 40 priority slots where a task may run during one major rotation,
and the allocation of slots is dependent on nice level. In the
following table, a zero represents a slot where the task may run.

PRIORITY:0..................20.................39
nice -20 0000000000000000000000000000000000000000
nice -10 1000100010001000100010001000100010010000
nice   0 1010101010101010101010101010101010101010
nice   5 1011010110110101101101011011010110110110
nice  10 1110111011101110111011101110111011101110
nice  15 1111111011111110111111101111111011111110
nice  19 1111111111111111111111111111111111111110

As can be seen, a nice -20 task runs in every priority slot whereas a nice 19
task only runs one slot per major rotation. This dithered table allows for the
smallest possible maximum latencies between tasks of varying nice levels, thus
allowing vastly different nice levels to be used.
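
One plausible way to produce such a dithered allocation is to spread the
(20 - nice) runnable slots as evenly as possible across the 40 slots with a
Bresenham-style error accumulator. The sketch below only illustrates the
dithering idea; it is not the kernel's table generation code, and the phase
of some rows may differ slightly from the table above.

#include <stdio.h>

#define PRIO_SLOTS 40

static void print_slots(int nice)
{
        int runnable = 20 - nice;       /* nice -20: 40 slots, nice 19: 1 */
        int err = 0;
        int slot;

        printf("nice %3d ", nice);
        for (slot = 0; slot < PRIO_SLOTS; slot++) {
                err += runnable;
                if (err >= PRIO_SLOTS) {
                        err -= PRIO_SLOTS;
                        putchar('0');   /* 0: task may run in this slot */
                } else {
                        putchar('1');   /* 1: slot skipped at this nice level */
                }
        }
        putchar('\n');
}

int main(void)
{
        static const int nices[] = { -20, -10, 0, 5, 10, 15, 19 };
        unsigned int i;

        for (i = 0; i < sizeof(nices) / sizeof(nices[0]); i++)
                print_slots(nices[i]);
        return 0;
}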

SCHED_BATCH tasks are managed slightly differently, receiving only the top
slots from their priority bitmap, which gives them the same cpu share as
SCHED_NORMAL but with slightly higher latencies.


Modelling deadline behaviour
============================

As the accounting in this design is strict and not modified by sleep average
calculations or interactivity modifiers, it is possible to accurately
predict the maximum latency that a task may experience under different
conditions. This is a virtual deadline mechanism enforced by mandatory
timeslice expiration and not outside bandwidth measurement.

The maximum duration a task can run during one major epoch is determined by its
nice value. Nice 0 tasks can run at 19 different priority levels for RR_INTERVAL
duration during each epoch. Nice 10 tasks can run at 9 priority levels for each
epoch, and so on. The table in the priority matrix above demonstrates how this
is enforced.

Therefore the maximum duration a runqueue epoch can take is determined by
the number of tasks running, and their nice level. After that, the maximum
time a task can wait before it gets scheduled is determined by the position
of its first slot on the matrix.

In the following examples, these are _worst case scenarios_ and would rarely
occur, but can be modelled nonetheless to determine the maximum possible
latency.

So for example, if two nice 0 tasks are running, and one has just expired as
another is activated for the first time receiving a full quota for this
runqueue rotation, the first task will wait:

nr_tasks * max_duration + nice_difference * rr_interval
1 * 19 * RR_INTERVAL + 0 = 152ms

In the presence of a nice 10 task, a nice 0 task would wait a maximum of
1 * 10 * RR_INTERVAL + 0 = 80ms

In the presence of a nice 0 task, a nice 10 task would wait a maximum of
1 * 19 * RR_INTERVAL + 1 * RR_INTERVAL = 160ms

More useful than these values, though, are the average latencies which are
a matter of determining the average distance between priority slots of
different nice values and multiplying them by the tasks' quota. For example
in the presence of a nice -10 task, a nice 0 task will wait either one or
two slots. Given that nice -10 tasks have a quota 2.5 times the RR_INTERVAL,
this means the latencies will alternate between 2.5 and 5 RR_INTERVALs or
20 and 40ms respectively (on uniprocessor at 1000HZ).


Achieving interactivity
=======================

A requirement of this scheduler design was to achieve good interactivity
despite being a completely fair deadline based design. The disadvantage of
designs that try to achieve interactivity is that they usually do so at
the expense of maintaining fairness. As cpu speeds increase, the requirement
for some sort of metered unfairness towards interactive tasks becomes a less
desirable phenomenon, but low latency and fairness remains mandatory to
good interactive performance.

This design relies on the fact that interactive tasks, by their nature,
sleep often. Most fair scheduling designs end up penalising such tasks
indirectly giving them less than their fair possible share because of the
sleep, and have to use a mechanism of bonusing their priority to offset
this based on the duration they sleep. This becomes increasingly inaccurate
as the number of running tasks rises and more tasks spend time waiting on
runqueues rather than sleeping, and it is impossible to tell whether the
task that's waiting on a runqueue only intends to run for a short period and
then sleep again after that runqueue wait. Furthermore, all such designs rely
on a period of time to pass to accumulate some form of statistic on the task
before deciding on how much to give them preference. The shorter this period,
the more rapidly bursts of cpu usage ruin the interactive task's behaviour. The
longer this period, the longer it takes for interactive tasks to get low
scheduling latencies and fair cpu.

This design does not measure sleep time at all. Interactive tasks that sleep
often will wake up having consumed very little if any of their quota for
the current major priority rotation. The longer they have slept, the less
likely they are to even be on the current major priority rotation. Once
woken up, though, they get to use up their quota for that epoch, whether a
partial quota remains or a full one. Overall, however, they
can still only run as much cpu time for that epoch as any other task of the
same nice level. This means that two tasks behaving completely differently
from fully cpu bound to waking/sleeping extremely frequently will still
get the same quota of cpu, but the latter will be using its quota for that
epoch in bursts rather than continuously. This guarantees that interactive
tasks get the same amount of cpu as cpu bound ones.

The other requirement of interactive tasks is also to obtain low latencies
for when they are scheduled. Unlike fully cpu bound tasks and the maximum
latencies possible described in the modelling deadline behaviour section
above, tasks that sleep will wake up with quota available usually at the
current runqueue's priority_level or better. This means that the most latency
they are likely to see is one RR_INTERVAL, and often they will preempt the
current task if it is not of a sleeping nature. This then guarantees very
low latency for interactive tasks, and the lowest latencies for the least
cpu bound tasks.
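
As a rough sketch of the preemption check implied here, reusing the
simplified sd_task fields from earlier (a lower numeric prio means an
earlier, better slot in the matrix, and resched_task is again a hypothetical
stand-in):

/*
 * on wakeup: a woken task that still has quota at a slot at least as good
 * as the one the CPU is currently serving preempts the running task
 */
static void sd_wakeup_preempt(struct sd_task *woken, struct sd_task *curr)
{
        if (woken->time_slice > 0 && woken->prio < curr->prio)
                resched_task(curr);
}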


Fri, 4 May 2007
Con Kolivas <kernel@kolivas.org>
