A Comparison of I/O Schedulers
Mulyadi Santosa and Fawad Lateef
What do you think of when you read the term "scheduler"? If you think of the mechanism that schedules the order in which processes are served, then you already have an idea of what this article is about. The term "scheduler" itself is broad, and in this article we will narrow it down to I/O, specifically disk I/O.
The first question to consider is why you need an I/O scheduler. To answer that, let's briefly see how a disk (ATA/SATA HDD or "hard disk drive") works. A typical HDD provides two ways of accessing a specific location:
- C/H/S (Cylinder/Head/Sector)
- LBA (Logical Block Address)
Most operating systems use LBA (especially Linux), which is based on sectors only, so the OS doesn't need to take into account cylinders and heads (each sector has a size of 512 bytes). When the OS asks a sector to be read, the disk head moves to the target sector, stops, fetches the content, puts it in disk's buffer, and generates an interrupt. This process is the same for all disk access whether it's for reading or writing. Note that the head moves mechanically, unlike the RAM/Memory, which has hardwired addresses and is directly accessible by the processor. So accessing a sector on disk can be hundreds or thousands of times slower than RAM access, and the seeking latency for a specific sector can vary greatly.
Also, the mechanical nature of disk head movement causes another issue: More time spent moving the head means longer actual access time. It's preferable if all the data can be read with just one sweep, but of course that's an ideal case.
So, given these large variations in seeking time, what can be done to decrease read/write latency? This is the question the I/O scheduler tries to solve. You can imagine it as a traffic controller. In a nutshell, the scheduler can postpone a request, re-order requests, and merge adjacent requests into one. The goal of an I/O scheduler is to optimize head movement so more things can be done during a given time (resulting in technically higher throughput). With this idea in mind, let's look at I/O schedulers in more detail.
Overview of I/O Schedulers
There are four built-in I/O schedulers ready to use in Linux; they are:
- Noop
- Anticipatory
- Deadline
- CFQ (Complete Fair Queueing)
Except for Noop, all of the above schedulers use the "queueing" concept as a way to manage incoming I/O requests. Queue here doesn't necessarily mean that requests will be processed in FIFO (first-in-first-out) order; the queue is just a container that holds the requests to be manipulated further. A scheduler maintains one or more queues according to its implementation. In general, we see two queues: The I/O scheduler's queue and the driver's queue.
Conceptually, the scheduler's queue sits between the VFS (Virtual File System) layer and the block device driver layer. Using this approach, the VFS layer needs no major modification to utilize the schedulers. Because the VFS layer already provides the basic elevator algorithm (disk access algo), the only work needed was to modify that elevator support to provide a generic interface for I/O schedulers. So, requests are submitted into a predefined I/O queue where the schedulers manage them. The same thing happens when the scheduler submits a request to the block device driver's queue. Thanks to this "object-oriented-like" approach, tweaking or even creating a new scheduler doesn't necessarily affect how other layers work.
With the latest kernels, you can also change the effective scheduler on the fly at runtime for each available block device. This means that if you have two disks, they can use different I/O schedulers, so you don't need to select just one when booting the kernel for all the disks in the system. Using different schedulers on different block devices is a neat way to adapt to different I/O characteristics.
All I/O schedulers have some tunable parameters that allow users to tweak them according to their needs. And, each parameter can do a lot to change performance.
I/O Schedulers in Detail
Let's take a brief look at each of these I/O schedulers.
Noop
Noop is the simplest scheduler present in Linux. It is mostly suitable for truly random-access block devices, such as flash memory cards. The incoming requests for read/write are kept in FIFO order, and only the current request added at the end of the queue is tested for the possibility of merging. The downside is clear; you get no optimization at all, especially in situations where there are many competing read/write processes. Per process sequential readings will be seen as "random" access. So, it's better not to use this scheduler with any sequential access device, such as a hard disk.
Anticipatory
The Anticipatory scheduler is designed around the fact that a user tends to access adjacent sectors instead of jumping to other disk areas during certain periods of time. Most file systems in Linux use a specific approach to ensure that sectors belonging to a file are kept adjacent as much as possible. Between these reads, it is very likely that the task sleeps for a while before continuing reading.
So, the basic idea behind the Anticipatory scheduler is that it anticipates subsequent reads between sleeps. By staying a bit longer in the same head position, it can reduce the back-and-forth seek movement to a certain degree. How long it should stay is sometimes determined by looking into the next incoming read request, or in some cases by forecasting based on statistics. It is clear then that the Anticipatory scheduler isn't really great for time-sensitive read/write processes. The CFQ scheduler shares the same cons, although it is a bit better because of its "fairness" ability, which we'll describe later.
Deadline
If you have a lot of requests, you may want to make sure a request is serviced within a strict period of time. This situation is quite common for a busy file server, but you can see it on desktop workload, too (e.g., graphics rendering). The Deadline scheduler was created to address this need. Read and write processes are guaranteed to be serviced within a strict amount of time, and read is prioritized over write. So, the Deadline scheduler pays attention to latency issues but not necessarily throughput.
Read is favored over write, because read is assumed to be more important; however, write operations aren't allowed to be starved too much. This behavior can also be tweaked with some tunables, which we'll talk more about later.
The Deadline scheduler is a sound fit for high-performance I/O because it decreases the interactivity to a certain degree. For someone who demands smooth desktop experience, however, this might not be the scheduler to choose.
CFQ
CFQ is the default scheduler used in the latest kernel versions, and it is typically suited for people seeking balanced I/O service. Stressing "fairness", it tries to manage read/write processes using priorities so that nothing dominates or fully dictates the way the kernel serves these requests. This is done by dividing the bandwidth of each block device fairly among the processes performing I/O to that device. Lately it has also been improved by the time slice-based operation, meaning it does what process schedulers do: Serves a request within a specific duration, stops, switches to another one, and repeats.
When a process submits a request for disk access, the request can be classified as synchronous or asynchronous. In the case of synchronous requests, the process waits for the request's completion; for asynchronous requests, it simply sends the request to disk and starts doing some other work. Hence, the CFQ scheduler gives more time/priority to synchronous requests rather than to asynchronous requests. CFQ gives the best balance (theoretically) between throughput and interactivity.
I/O Schedulers' Tunables
Most likely the default values of the schedulers' tunables don't meet your needs, so next we'll describe how to change them. You can read about the tunables in detail in the documentation/block directory inside the Linux kernel source tree.
You will find these tunables under the /sys/block//queue/iosched directory. You can set them or view the setting using "echo" or "cat", for example:
# cat /sys/block/hdc/queue/iosched/writes_starved
Note that the content of the "iosched" directory changes according to the effective scheduler applied on the related block device.
To change the effective scheduler itself, do:
# echo "deadline" > /sys/block/hdc/queue/scheduler
Replace "deadline" with the scheduler you want to use.
The possible (case-sensitive) strings you can use are:
- noop
- anticipatory
- deadline
- cfq
Note that "hdc" here refers to the block device that will use this scheduler.
At boot time, you can use the elevator= kernel parameter to pick the effective scheduler applied for the whole block devices (notice this behavior!). The possible strings are similar to above:
kernel /boot/vmlinuz-2.6.20.3 ro root=LABEL=/1 elevator=cfq
All the information is based on the Linux kernel 2.6.20.3.
Anticipatory Tunables (All Units Are in Milliseconds)
- antic_expire -- This parameter controls how long the scheduler should wait while expecting read requests on nearby disk sectors. Note that if you set this to zero, the Anticipatory scheduler works just like the Deadline scheduler, so maybe that's not something you want to do. The default is fine, but if you think your disk suffers from seek latency, try to increase gradually until you get a reasonable result. The default value is 6.
- read_expire -- When a read request stays in the queue longer than the time specified here, it becomes expired. An expired request moves into the expired list and will soon be dispatched to the disk. Decreasing read_expire will make requests get serviced faster, but that really depends on how near the request is to the head of the expired list. The default is 125.
- write expire -- This parameter works the same as read_expire, only for write requests. The default is 250.
- read_batch_expire -- The Anticipatory scheduler alternates dispatches between read and write requests, so they are not mixed. When read requests are served, the scheduler keeps serving them during read_batch_expire interval. After exceeding it, the scheduler turns to serve write requests. The default value is 500.
- write_batch_expire -- This is the same as read_batch_expire but applied to write requests. The default value is 125.
Deadline Tunables
- read_expire (in milliseconds) -- This works the same as in the Anticipatory scheduler. The default value is 500.
- write_expire (in milliseconds) -- This is same as read_expire except for write requests. The default value is 5000.
These separate expire times for read and write request provide the ability to change the default behavior by giving priority to read requests. However, altering this behavior in most cases is not very feasible.
- fifo_batch (number of requests) -- This value determines how many requests (i.e., batch of requests) must be moved to the block-device queue when the read request reaches its deadline. The idea is not to move a single request, because that might create a lot of seeks, but rather to send a batch of requests in which some requests might be seeks and some might require contiguous data thereby reducing seek time.
The default value here is 16, and it's optimized for most scenarios. Increasing it might create more seeks, which would eventually decrease the device throughput. This is the main performance parameter of the Deadline I/O scheduler.
- writes_starved (number of read dispatches before write) -- The writes_starved parameter tells how many read dispatches (to driver queue from I/O scheduler) will be done before doing one write dispatch. The default value is 2, which tells the I/O scheduler to do two read dispatches and then dispatch some writes for I/O based on the same criteria used for dispatching reads.
- front_merges (Boolean value) -- This parameter allows you to enable/disable front merge of requests. In practice, many requests that enter the I/O scheduler are contiguous with a request that is already in the queue and can fit either at the back or at the front of that request. This is called a back-merge or front-merge candidate. Most of the time, file systems keep files in such a way that back merges are used rather than front merges, and in some workloads we don't even need to do front merges.
In this case, disabling front merges will save scheduler time spent looking for front-merge candidates. For example, during a streaming session of audio/video, you will most likely access a single file at a time and requests are generated for the next file block. Thus, there won't be any requests requiring front merge and disabling this capability (by setting it to 0) will greatly increase I/O performance. By default, the front_merges parameter is enabled.
CFQ Tunables
- back_seek_max (in KB) -- Primarily, CFQ tries to fulfill requests in one direction only, but it also provides support for handling requests in the other direction. This parameter specifies how much it can seek backward to fulfill requests rather than handling forward requests, but it comes with a penalty (defined by back_seek_penalty). The default value is 16384 (16MB).
- back_seek_penalty (in numbers) -- This defines how much penalty is given to backward requests. For example, given two requests of the same distance, if one goes forward and the other goes backward, then after multiplying the back_seek_penalty to the backward distance, it becomes greater than the forward distance. Therefore, the forward request is serviced. The default value here is 2, which means the backward request will be served only if the backward request's distance is less than half of the forward request's distance.
- fifo_expire_sync (in milliseconds) -- This parameter defines the expiration time of synchronous requests. The queued request is sent to the block driver's queue after pending longer than this expiry time. The default value is 124.
- fifo_expire_async (in milliseconds) -- This is quite similar to the fifo_expire_sync value, except it defines the expiration time for asynchronous requests. The default value is 248.
- quantum (in terms of requests) -- This tells scheduler the maximum number of requests to dispatch to the device queue. The default is 4.
- slice_sync (in milliseconds) -- This parameter defines the amount of time the scheduler will spend on sending synchronous requests to the device request queue. The default is 100.
- slice_async (in milliseconds) -- This is the same as slice_sync but for synchronous request handling. Its default value is 40. Asynchronous I/O is usually given less slice because it doesn't block the process.
- slice_async_rq (in number of requests) -- It defines how many asynchronous requests are dispatched to the device whenever slice_async expires. However, the actual number of requests dispatched is calculated according to priority using this value as the base request number. Thus, requests dispatched will be at least the number specified here. The default value is 2.
- slice_idle (in milliseconds) -- CFQ provides support of idle I/O priority. If a process has idle priority, then it will get disk time when other processes are not asking for disk. This value defines the time for dispatching requests with idle priority. The default value is 8 ms.
Besides tunables, CFQ provides different classes and priorities for each process that can be modified with the ionice tool, thus giving more freedom to the user to manage I/O. See the man page for more details about ionice. This support is available on Linux 2.6.13 and later with CFQv3.
There are three classes, and the classes have eight priority levels. The classes (quoted from the ionice man page are:
- Idle -- A program running with idle I/O priority will only get disk when no other process is asking for disk I/O, thus idle I/O processes have almost zero effect on normal system activity. This class doesn't have any priority.
- Best effort -- This is the default scheduling class for a process if it doesn't ask for any I/O priority explicitly. This class takes priority from 0 to 7, where a lower number means higher priority. Programs running in the best effort class with the same priority are served in round-robin fashion.
- Real time (RT) -- The processes having the RT class set are given highest priority over other classes, but this class should be used with care because it can starve other processes. This class also defines eight priority levels, which denote how large a time slice will be assigned to each process in each scheduling window.
Benchmarking
To show how all those I/O schedulers really work, we will summarize the results of several benchmarks that we performed. The three main tests are:
1. Running IOZone. IOZone is a file system benchmark tool that generates and measures various file operations. For this test, we used throughput mode and forked one up to four POSIX threads for every test item. Each thread reads and writes on a 64 MB file.
We also used IOZone to test how each scheduler scaled on an increasing number of read/write processes. This time, the target file size was 512 KB. We also took flush()/fsync() time into account, so we counted the time until the data was written completely. The number of running threads was 8n, where 1<=n<=8.
2. Grepping a string in the kernel source tree. The grep test represents a single reader test case. Linux kernel 2.6.20.3 contains around 21,000 files and six levels of directories with a total size of 279 MB, enough to put pressure on the I/O schedulers.
3. Copying kernel source using rsync. We tested this in two ways -- between directories in the same partition on the same disk and between directories in different disks. Specifically for the latter test, both disks are tested using the same schedulers.
The test was conducted with the following machine specifications:
- AMD Athlon XP 1800+
- RAM 256 MB
- 40-GB EIDE disk as the testing target. Inside this disk, we created a 2.8 GB partition formatted as ext2.
- 80-GB hosting OS (Fedora Core 2 + kernel 2.6.20.3)
Since we're talking about disk I/O, here is the detailed information about the target disk:
Model : Maxtor 6E040L0
Interface : ATA
Transfer mode : PIO, DMA and UDMA. Effectively running at UDMA6
Buffer size : 2 MB
Write Cache : Enabled
C/H/S : 16383/16/63 (max)
4047/16/255 (current)
Between each repetition, we mounted and unmounted the target partition to flush the page cache out of RAM. All the tests were done in runlevel 1 to make sure there were as few competing processes as possible. All schedulers used their default tunable settings. Read-ahead was enabled, and the size of read-ahead was 128 KB.
Let's begin with a look at the grep test. In this test, we searched for occurrences of the string "HZ" within the kernel source 2.6.20.3. HZ is a constant that represents the frequency of timer interrupt. Initially, the kernel source was extracted onto a temporary partition and then copied (using rsync) to the freshly formatted ext2 partition, so there was almost no fragmentation.
The test was done three times per scheduler, and we took the average and the standard deviation. Standard output was redirected to /dev/null, so console access would not interfere with the results.
It's pretty interesting to see that the Noop scheduler yielded the fastest real time among all schedulers (see Table 1). This fact can be explained when we look again at how the data physically is laid out on the disk. Recall that this is freshly copied kernel source on a freshly formatted partition. The data blocks are sequentially written and so are the inodes. So, even with no "clever" scheduling, read can be done quite quickly. The other schedulers add a certain degree of overhead, and the worst performing one is Anticipatory (~1.5 sec).
Checking the standard deviation of real time, we see that CFQ shows the most fluctuation, while Noop shows the least. If we take this deviation into account, we see that CFQ, Deadline, and Noop actually are fairly equal in completion time (real time). The Anticipatory scheduler works poorly here because anticipation isn't really needed; we keep reading with almost no delay, and the data is mostly next of the other.
Next, we introduce "sleep" time as real time subtracted with user+system time. It represents the cumulative time a process waits before the data is completely fetched from disk. Reading data from page cache is assumed to take zero delay. The results shown in Table 2 show that CFQ gives us the least latency, so conceptually CFQ works as it was originally designed. The Deadline scheduler is supposed to be the winner here, but it performs a bit worse than Noop in our tests. The Anticipatory scheduler is shown to contribute the highest latency.
Table 3 shows the results of the rsync test. The results are divided into two cases, so we can see how the schedulers react if read and write occur on different disks as well as on the same disk. Between disks, Noop becomes the winner, and in second place is the Anticipatory scheduler. However, the Anticipatory scheduler is fastest when copying on the same disk. This shows us that the Anticipatory works fairly well if you do both read and write from/to files scattered over directories.
Now let's look at the IOZone results. In the IOZone write test (see Table 4), Noop is the fastest when there is only single writer, but once many threads are operating, CFQ begins to shine.
The IOZone read test (see Table 5) again shows that CFQ works best in most cases. However, notice that the Deadline scheduler has the best results in the in the 4-thread case. If we combine these results, we could say that multi-process I/O operations are better served with the CFQ scheduler.
In the IOZone stride read test, IOZone performs the following read pattern: read 4 KB, seek forward 200 KB, read 4KB again, and so on (see Table 6). With this read pattern, merges rarely happen, and we see how these schedulers avoid starvation. Noop competes very closely with the other schedulers, with an exception when three processes are used. The Deadline scheduler outperforms the others with ~40% increased throughput. It still performs a lot better than both Anticipatory and CFQ in the 4-thread case.
With random operations, schedulers show their real strength. In the random read case (Table 7), CFQ works best and shows relatively stable performance from 1 to 4 processes. The read fairness of CFQ is really proven here; however, in the case of three processes, the Anticipatory scheduler is the fastest. This could be due to the fact that three is an odd number, so bandwidth split is not really fair here.
In the random write case (Table 8), we see things turning upside down. Because the schedulers usually prioritize read over write, we get lower numbers here. CFQ gives the finest throughput up to two processes, but Anticipatory wins when there are more processes involved.
Finally, let's look at the scalability test. We used a small file size (512 KB) because we just needed a quick look at what happens when more processes stress the disk. However, we did take flushing time into account, so you can really see how long it takes to write back the dirty buffers. The results are presented in four bar charts. For the write results, Figure 1 shows minimum throughput, and Figure 2 shows the average throughput. For the read results, Figure 3 shows the minimum, and Figure 4 shows the average.
Why are we interested in minimum throughput? As you may know, when there are many busy processes, one or more processes might work more slowly because of certain decisions taken by the schedulers. Ideally, all processes should give fairly equal throughput, but our tests show that this is not the reality. There is always a big gap between minimal and average throughput in both read/write tests.
The good news is that we saw no big difference between schedulers in a minimum throughput context. The kernel developers did a good job avoiding heavy I/O starvation. Quite a different situation is seen in average throughput, however. For writing, throughput can go higher if more processes are forked. We chose not to draw any conclusions about this specific issue because the pattern is not recognizable. The speculation is that it relates to the way asynchronous write operations are done by the kernel.
Note that reading does indeed slow down when more processes are working. The chart shows that the CFQ scheduler is a relatively good choice, although it's not the winner for every number of processes. Note that when we used 64 workers, the Anticipatory scheduler was the fastest. So, the Anticipatory scheduler would seem to be a good alternative if CFQ disappoints you in busy servers.
Conclusion
Choosing which scheduler to use is a matter of continuous adaptation. Generally, CFQ is a safe bet, especially if there are many read/write processes. Surprisingly, the Noop scheduler gives quite nice results for various access patterns if the file layout is nearly ideal. You will likely see this kind of layout on a freshly formatted partition. As seen in our rsync test, the Anticipatory scheduler works best with many small- and medium-sized files scattered over directories (like kernel source). The Deadline scheduler is an alternative to CFQ when you need lower latency but can accept lower throughput.
Make sure you observe your I/O characteristics with tools like sar or iostat. You can download these as a part of the sysstat package from:
http://perso.orange.fr/sebastien.godard/
These tools provide a broad range of fields you can observe, such as disk utilization and average service time. Pick one or two fields as the focus of your optimization and work gradually to achieve optimum results. You cannot get the best result for all aspects.
It is important to remember that I/O schedulers are under active development, so upgrade to the latest kernel stable version if you want to achieve better results. Once you get optimal results for your case, stick with that kernel version.
Acknowledgements
The authors acknowledge these persons for their help and advice: Tejun Heo, Jens Axboe, Chris Wedgewood, Mithlesh Thukral, and Rik van Riel.
References
1. Man page of ionice
2. Bovet, Daniel P. and Marco Cesati. 2000. Understanding the Linux Kernel, Third Ed. O'Reilly.
3. Kernel Korner: I/O Schedulers by Robert Love -- http://www.linuxjournal.com/article/6931
4. A collection of I/O benchmark tools -- http://www.acnc.com/benchmarks.html
5. Kernel documentations under the block directory
Mulyadi Santosa is 28 years old and is a freelance IT writer and consultant living in Indonesia. His interests include clustering, Linux kernel analysis, and system performance tuning. He holds both RHCE and MCSE certifications and runs a startup that teaches Linux courses.
Fawad Lateef is 26 years old and is a Lead Software Engineer working at Aglix (Pvt.) Ltd in Pakistan. His interests and major work include: memory/process management, file systems, block layer of Linux kernel, and embedded development based on specialized network security processors.
|