Multicore Energy-Aware Scheduler in Linux

Background

Loadable Kernel Module (LKM)

An LKM is simply a chunk of binary code that can be loaded into and unloaded from the kernel on demand. Kernel modules are used to extend the functionality of the kernel without the need to reboot the system. Their counterpart for user-space programs is the shared library (e.g., a DLL). For example, one type of module is a character device driver, which enables the kernel to interact with a particular piece of hardware connected to the system. Loadable kernel modules keep the kernel source modular and allow third-party components to be distributed in binary form (without the source code).

Essential utilities for working with modules:

lsmod: list currently loaded modules

insmod: load a module by a path to a .ko file

rmmod: unload a module by name

modprobe: load a module by name together with its dependencies

modinfo: display information about a module

Making a 'hello world' LKM: a module that just prints 'Hello World', and loading and unloading the module.
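A minimal sketch of such a module might look like the following (the file name hello.c and the messages are arbitrary):

/* hello.c -- minimal 'hello world' loadable kernel module (illustrative sketch) */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

static int __init hello_init(void)
{
    printk(KERN_INFO "Hello World\n");    /* runs when the module is loaded */
    return 0;
}

static void __exit hello_exit(void)
{
    printk(KERN_INFO "Goodbye World\n");  /* runs when the module is unloaded */
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Hello World example module");

It can be built against the kernel headers with a one-line Makefile entry (obj-m += hello.o), loaded with 'sudo insmod hello.ko', verified with dmesg, and removed with 'sudo rmmod hello'.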

System Call

System calls provide an interface between userspace processes and the kernel. For example, the I/O interface is provided by the system calls open, close, read, and write. A userspace process cannot access kernel memory and it cannot (directly) call kernel functions. It is forced to use system calls.

To make a system call, the process loads the arguments into registers, loads the system call number into a register, and then executes a special instruction (called a software interrupt or trap), which jumps to a well-defined location in the kernel. The hardware automatically switches the processor from restricted user mode to privileged kernel mode. The trap handler checks the system call number, which identifies the kernel service the process requested. This service in turn corresponds to a particular OS function that needs to be called. The OS looks at the table of system calls to determine the address of the kernel function to call. The OS calls this function and, when it returns, control goes back to the process.

On ARM architectures, the system call table is defined in arch/arm/tools/syscall.tbl. During the kernel compilation process, this file is read to automatically generate the actual header/assembly files needed by the kernel. For example, if you want to see where the system call sched_setparam is defined, type 'grep -rnI sched_setparam arch/arm/'. This will display syscall.tbl and other auto-generated files (those in arch/arm/include/generated).

Adding a custom system call called check_andrewid that takes one char-array input. The system call should return success (0) when your AndrewID is entered and failure (-1) when anything else is entered.

The check_andrewid system call must have the following format: check_andrewid(char* andrewid)
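A kernel-side sketch of this syscall might look like the following; the hard-coded "myandrewid" string is a placeholder for your own AndrewID, and the new syscall must additionally be registered in syscall.tbl and declared in the syscall headers:

/* Illustrative sketch of the check_andrewid syscall */
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/string.h>

SYSCALL_DEFINE1(check_andrewid, const char __user *, andrewid)
{
    char buf[32];

    /* Copy the string from userspace; fail on a bad pointer */
    if (strncpy_from_user(buf, andrewid, sizeof(buf)) < 0)
        return -1;
    buf[sizeof(buf) - 1] = '\0';   /* ensure termination on truncation */

    /* "myandrewid" is a placeholder for the expected AndrewID */
    return (strcmp(buf, "myandrewid") == 0) ? 0 : -1;
}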

Then, write a user-level application that takes a name as input from the command line, makes the check_andrewid system call, and outputs “valid” or “error”:
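A sketch of that user-level program follows; the syscall number 398 is an assumption and must match the entry added to syscall.tbl:

/* check_andrewid_test.c -- user-level test program (illustrative sketch) */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_check_andrewid 398   /* assumed syscall number */

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <andrewid>\n", argv[0]);
        return 1;
    }

    long ret = syscall(__NR_check_andrewid, argv[1]);
    printf("%s\n", ret == 0 ? "valid" : "error");
    return 0;
}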


Task Scheduling and Monitoring

Periodic Task Parameters

In real-time systems, tasks are conventionally modeled as a periodic sequence of jobs. Jobs are released every T units of time and each job needs to finish within a relative deadline of D units of time from its release. To guarantee schedulability, we only admit a task into the system if the schedulability of the resulting task set is guaranteed. To perform a schedulability test, it is necessary to calculate the amount of resources each task uses, e.g., task τi's utilization Ui = Ci/Ti, or the worst-case response time.
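As one standard example of such a test, for implicit-deadline tasks (D = T) scheduled by EDF on a single core, the taskset is schedulable exactly when the total utilization does not exceed one: U = C1/T1 + C2/T2 + ... + Cn/Tn <= 1. A per-core version of this test is what the partitioned-EDF parts of this project rely on.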

Processes and Threads

In the Linux kernel, the kernel object corresponding to each process/thread is a task. Every process/thread running under Linux is dynamically allocated a struct task_struct structure, called Task Control Block (TCB). The TCB is declared in <kernel_srcs>/include/linux/sched.h.

The Linux kernel assigns a task ID to each task. Hence, both processes and threads have task IDs, and each task's ID is stored in its TCB in the member variable pid_t pid (not 'tid', for historical reasons). In a single-threaded process, the task ID is equal to the process ID. In a multi-threaded process, all threads have the same process ID, but each one has a unique task ID. The process ID is also called the thread group ID (TGID) and is stored in pid_t tgid of the TCB. The TCBs of tasks are maintained in a doubly-linked list in the kernel. To find a task's TCB with a task ID (pid), you can use find_task_by_pid_ns(<pid>, &init_pid_ns);.
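As a sketch, a kernel-side lookup could be wrapped like this (the exact headers needed vary by kernel version, e.g., <linux/sched/signal.h> on newer kernels):

/* Illustrative sketch: look up a task's TCB from its task ID */
#include <linux/sched.h>
#include <linux/pid_namespace.h>
#include <linux/rcupdate.h>
#include <linux/printk.h>

static void print_task_comm(pid_t pid)
{
    struct task_struct *task;

    rcu_read_lock();                                 /* protect the pid lookup */
    task = find_task_by_pid_ns(pid, &init_pid_ns);   /* NULL if no such task */
    if (task)
        printk(KERN_INFO "task %d: %s\n", task->pid, task->comm);
    else
        printk(KERN_INFO "no task with pid %d\n", pid);
    rcu_read_unlock();
}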

Each task is associated with a priority level. In Linux, the priority value range is divided into general-purpose priorities and real-time priorities (see rt_priority field in struct task_struct). All real-time priorities supersede all general-purpose priorities. A real-time task is a task whose scheduling class is a real-time class and rt_priority value is within the real-time priority value range (1 to 99). The chrt command can be used to give a process/thread real-time priority in a shell terminal. sched_setscheduler() is a system call to set real-time priority in C programs.
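For illustration, a small userspace helper built on sched_setscheduler() might look like the sketch below; the FIFO policy and the priority value are example choices, not a requirement of the assignment:

/* Illustrative sketch: give a task a real-time priority from userspace */
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

int make_realtime(pid_t pid, int prio)   /* prio in 1..99 */
{
    struct sched_param param = { .sched_priority = prio };

    /* SCHED_FIFO is one of Linux's real-time scheduling classes */
    if (sched_setscheduler(pid, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}

The shell equivalent would be something like 'sudo chrt -f -p 80 <pid>'.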

Task Scheduling

To keep track of the amount of computation resources a task consumes, the task will need to be followed throughout its lifetime: birth, context switch in, context switch out, and death. The kernel shares the available CPU(s) among the tasks in the system. The kernel must be able to suspend the instruction stream of one task and resume the instruction stream of another previously-suspended task. This activity is referred to as a context switch.

When executing on one CPU, all tasks share the CPU registers. Hence, after suspending a task, the kernel must save the values of the registers so they can be restored when the task is resumed. The set of data that must be loaded into the registers before the task resumes its execution on the CPU is called the hardware context. A context switch involves saving the context of the task being switched out (kept in its TCB, struct task_struct) and restoring that of the task being switched in. The context switch in the Linux kernel is initiated from one well-defined point: the schedule() function in kernel/sched/core.c. The schedule() function is central to the kernel scheduler. Its objective is to find a task in the run-queue (the list of tasks ready to run) and assign the CPU to it. The function involves several kernel routines. It sets a local variable prev, which points to the TCB of the currently-running task on the current CPU. Next, it picks the next task to be run and sets a local variable next to that task's TCB. If next is different from prev, a context switch takes place. If no runnable task has higher priority than the current task, then next coincides with prev and no context switch takes place.

Writing a Periodic Real-time User-level Test Program

A user-level application that takes C, T, and CPUID arguments on the command line and busy-loops (e.g., a for-loop with some number of iterations) for C milliseconds every T milliseconds on CPU CPUID. It supports C and T values up to 10,000 ms (10 secs). C and T values should be greater than 0 (C <= T). The RPi has 4 CPUs (CPU0-3); hence the valid range of CPUID is 0 to 3. The program should also be runnable on the stock kernel: it does not rely on any of the modifications to the kernel. The program should continue to run until the user presses 'Ctrl-C' or terminates it by sending a KILL signal. Once launched successfully, it should print out its PID and the other input parameters.

The timing error of this program is expected to be around 1 ms. The application task can be preempted, and C is the pure execution time of the task (not the response time). In case of deadline misses due to preemptions, the task continues to execute the current job in the next period.
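A minimal sketch of such a program is shown below. The busy loop measures per-thread CPU time so that C reflects pure execution time rather than response time; argument validation and signal handling are simplified:

/* Illustrative sketch of periodic.c: busy-loop for C ms every T ms on one CPU */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <sched.h>

static double cpu_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);   /* CPU time used by this thread */
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <C ms> <T ms> <cpuid>\n", argv[0]);
        return 1;
    }
    long C = atol(argv[1]), T = atol(argv[2]);
    int cpuid = atoi(argv[3]);
    if (C <= 0 || T <= 0 || C > T || T > 10000 || cpuid < 0 || cpuid > 3) {
        fprintf(stderr, "invalid arguments\n");
        return 1;
    }

    cpu_set_t set;                                  /* pin this process to the given core */
    CPU_ZERO(&set);
    CPU_SET(cpuid, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid=%d C=%ld ms T=%ld ms cpu=%d\n", getpid(), C, T, cpuid);

    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);          /* release time of the first job */
    for (;;) {
        double start = cpu_ms();
        while (cpu_ms() - start < (double)C)        /* burn C ms of CPU time */
            ;

        long long ns = next.tv_nsec + (long long)T * 1000000LL;
        next.tv_sec += ns / 1000000000LL;           /* absolute next release, avoids drift */
        next.tv_nsec = ns % 1000000000LL;
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return 0;
}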

Testing the program: Task 1 (10, 50) in blue, Task 2 (20, 80) in grey, Task 3 (50, 200) in green. We do not set any type of scheduler in periodic.c, so the tasks are run by the default Linux scheduler, which gives a fair chance to all the tasks. In the zoomed-in screenshot, it can easily be seen that the tasks are given CPU time proportional to their utilization. Hence Task 2 and Task 3 have almost the same CPU time while Task 1 has comparatively less CPU time.

Writing an EDF application

Write a userspace application that takes a task PID and other arguments, and sets a task to use Linux's SCHED_DEADLINE scheduling policy.

 Arguments: <PID> <execution time C in ms> <period T in ms>

After spinning up a periodic task (periodic.c) with a C of 250 ms, T of 500 ms, and PID of 1101: $ ./test_edf 1101 250 500
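glibc provides no wrapper for sched_setattr, so the application can invoke the raw syscall directly; a sketch follows, with struct sched_attr declared locally using the layout documented in the sched_setattr(2) man page (SYS_sched_setattr is assumed to be defined by the installed libc headers, and setting SCHED_DEADLINE requires root privileges):

/* Illustrative sketch of test_edf.c: put an existing task under SCHED_DEADLINE */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>        /* SCHED_DEADLINE */

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;        /* used by SCHED_OTHER */
    uint32_t sched_priority;    /* used by SCHED_FIFO/RR */
    uint64_t sched_runtime;     /* SCHED_DEADLINE: budget, in ns */
    uint64_t sched_deadline;    /* SCHED_DEADLINE: relative deadline, in ns */
    uint64_t sched_period;      /* SCHED_DEADLINE: period, in ns */
};

int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <pid> <C ms> <T ms>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)atoi(argv[1]);
    uint64_t C_ns = (uint64_t)atoll(argv[2]) * 1000000ULL;
    uint64_t T_ns = (uint64_t)atoll(argv[3]) * 1000000ULL;

    struct sched_attr attr = {
        .size           = sizeof(attr),
        .sched_policy   = SCHED_DEADLINE,
        .sched_runtime  = C_ns,
        .sched_deadline = T_ns,   /* implicit deadline: D = T */
        .sched_period   = T_ns,
    };

    if (syscall(SYS_sched_setattr, pid, &attr, 0) != 0) {
        perror("sched_setattr");
        return 1;
    }
    printf("pid %d now runs under SCHED_DEADLINE (C=%s ms, T=%s ms)\n",
           pid, argv[2], argv[3]);
    return 0;
}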



Multicore EDF support

Creating a set_edf syscall 

A task can be assigned to use Linux's EDF scheduler (SCHED_DEADLINE) through the sched_setattr syscall. Unfortunately, SCHED_DEADLINE tasks cannot have an affinity mask smaller than the entire root domain they are created on. This is a conservative measure to simplify schedulability testing. When the affinity masks of different SCHED_DEADLINE tasks partially overlap, the schedulability test becomes an NP-hard problem. In order to improve scalability, SCHED_DEADLINE tasks are restricted to having an affinity mask no smaller than the root domain, and schedulability is tested at the root-domain level. We focus on partitioned scheduling; thus, the schedulability of a taskset can be tested on a per-core basis.

The cpuset utility does allow us to pin a SCHED_DEADLINE task to a particular processor; however, it can be a little cumbersome to use, especially if you wish to spawn multiple processes and pin them to different cores. To get around this limitation, we implement a new set of syscalls to set a task to run under the Linux kernel's SCHED_DEADLINE scheduler on a particular core.

The implemented syscalls are:

int set_edf(pid_t pid, unsigned int C, unsigned int T, int cpuid);
int cancel_edf(pid_t pid);

The pid argument is the target task's ID. C is the allowed execution time and T is the period and deadline. Both C and T are in milliseconds and have a maximum of 10,000 ms and a minimum of 1 ms. Assume the deadline equals the period. Finally, cpuid specifies the core to which the task should be pinned.

If pid equals 0, then the syscalls apply to the calling thread. The syscalls check whether the input arguments are all valid (e.g., valid pointers, a valid pid, and C and T values within the maximum/minimum range). The set_edf syscall must test for schedulability. The syscalls return 0 if there is no error and -1 if there was any error (invalid input, not schedulable, canceling a non-EDF task, etc.). The syscalls work with processes as well as threads. When the given pid is that of a thread, the setting applies to that thread and not to any of its siblings, parent, or children.
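From userspace, the new syscalls can be invoked through syscall(); the numbers below are placeholders that must match the entries added to syscall.tbl:

/* Illustrative userspace wrappers for the custom syscalls */
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#define __NR_set_edf    399   /* assumed: match your syscall.tbl entries */
#define __NR_cancel_edf 400

static inline int set_edf(pid_t pid, unsigned int C, unsigned int T, int cpuid)
{
    return (int)syscall(__NR_set_edf, pid, C, T, cpuid);
}

static inline int cancel_edf(pid_t pid)
{
    return (int)syscall(__NR_cancel_edf, pid);
}

/* Example: pin the calling thread to core 2 under EDF with C = 20 ms, T = 80 ms:
 *   if (set_edf(0, 20, 80, 2) != 0) { ... handle the error ... }
 */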

Printing Task’s Timing Parameters 

Implementing a syscall that prints out a task's timing parameters (C and T values):
int print_edf(pid_t pid);
print_edf returns 0 if it found a task with pid or if pid is -1. Otherwise, it returns -1. 

The pid argument is the target task's ID. print_edf prints out the target task's C and T parameters and its CPU number, along with its task ID, using printk. If pid is -1, then print_edf prints the C and T values of all tasks set to use EDF scheduling by set_edf. The syscall works with processes as well as threads.

EDF Test Application

The program now also accepts a new CPU core ID argument: Arguments: <PID> <execution time C in ms> <period T in ms> <CPU core ID>

For example, to set a task with PID 1101 to a C of 250 ms and a T of 500 ms on CPU 1: $ ./test_edf 1101 250 500 1

Testing the implementation: Task 1 (10, 50) in orange, Task 2 (20, 80) in purple, Task 3 (50, 200) in blue. Compared to the part 1 and part 2 screenshots, this run shows a greater utilization. This is because the governor, which is set to schedutil, scales the frequency down for energy saving, whereas in the first two parts the frequency was fixed at 1.5 GHz. The utilization does not get near 100% because schedutil scales the frequency conservatively to avoid a system crash.

Multicore Energy Aware Scheduling

Multicore bin packing

For each C and T pair, the program spawns a new thread/process and schedules it to run on one of the RPi's four cores using the set_edf syscall. "Best-Fit Decreasing" is used as the task-partitioning heuristic: tasks are considered in decreasing order of utilization, and each task is assigned to the CPU core that has the minimum remaining capacity among the cores on which the new task is still schedulable. The program reports when the taskset is unschedulable.
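A sketch of the core-selection step is shown below; the function name, the NUM_CORES constant, and the per-core utilization bound of 1.0 (which follows from partitioned EDF with implicit deadlines) are illustrative assumptions:

/* Best-Fit Decreasing core selection (illustrative sketch) */
#define NUM_CORES 4

/* Assign a task with utilization u to the core with the least remaining
 * capacity that can still accommodate it. Returns the chosen core ID, or -1
 * if the task fits on no core (taskset unschedulable). Tasks are assumed to
 * be processed in decreasing order of utilization. */
static int best_fit_core(double util[NUM_CORES], double u)
{
    int best = -1;
    double best_remaining = 2.0;                 /* larger than any real remaining capacity */

    for (int cpu = 0; cpu < NUM_CORES; cpu++) {
        double remaining = 1.0 - util[cpu];      /* EDF: per-core utilization must stay <= 1 */
        if (u <= remaining && remaining < best_remaining) {
            best = cpu;
            best_remaining = remaining;
        }
    }
    if (best >= 0)
        util[best] += u;                         /* commit the assignment */
    return best;
}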

Frequency Scaling

The RPi supports CPU frequency scaling on the fly, which allows power consumption to be decreased. This is accomplished through the cpufreq subsystem. The CPU frequency is controlled by “cpufreq governors.” The ondemand governor is used by default, which sets the CPU frequency based on processor load. Here, we use the userspace governor to allow the CPU frequency to be set manually. The program adjusts the CPU frequency based on utilization to minimize power consumption while maintaining schedulability. When the governor is not set to userspace, the program does not attempt to modify the CPU frequency. Assume that the C values given as input in the .csv correspond to execution of the task at the RPi's highest frequency of 1.5 GHz.
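With the userspace governor selected, the frequency can be set by writing a value in kHz to cpufreq's sysfs interface; a sketch follows (the write requires root, and the value must be one of the steps listed in scaling_available_frequencies):

/* Illustrative sketch: set a core's frequency via the userspace cpufreq governor */
#include <stdio.h>

static int set_cpu_freq(int cpu, long freq_khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);

    FILE *f = fopen(path, "w");   /* only writable when the userspace governor is active */
    if (!f) {
        perror("fopen");
        return -1;
    }
    fprintf(f, "%ld\n", freq_khz);
    fclose(f);
    return 0;
}

/* Example: set core 0 to 1.0 GHz -> set_cpu_freq(0, 1000000); */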

Energy-Aware Scheduling with schedutil

The RPi also supports the schedutil governor. When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the GRUB-PA algorithm, minimizing CPU frequency while still ensuring deadlines are met.

Comparison Experiment

Running the following tasks with the userspace governor and the schedutil governor and comparing performance.


Conclusion

Userspace Governor

The tasks are partitioned across the cores using the best-fit decreasing heuristic so that the cores are utilized as evenly as possible. Then the frequency is scaled such that the core with the maximum utilization still maintains schedulability. We can easily observe that the C values increased in proportion to the frequency reduction from 1.5 GHz to 1.1 GHz. Only one task, with C = 50 and T = 80 (a utilization of 0.625), is assigned to CPU 2. However, in the trace file, CPU 2's utilization is as high as 0.85, which shows that the frequency of the cores has indeed been reduced.

Schedutil Governor

Here the governor is set to schedutil, so the user-level application does not set the scaling frequency. It just partitions the tasks using best-fit decreasing and allows the kernel to manipulate the frequency. As before, only one task is assigned to core 2. Its utilization looks a bit higher than the 0.625 expected at 1.5 GHz, but not as high as it was at 1.1 GHz. I believe the frequency is scaled, but conservatively, to avoid a system crash.
