Using Adaptive Read-Ahead to Improve I/O Throughput on Linux
Mulyadi Santosa and Fengguang Wu
For years, I/O has been a major problem when dealing
with latency. CPU, memory, network connection, and bus are getting faster
in significant way. But what about disk? With the birth of SATA, disk is
getting some acceleration, but the problem with disk access is that it is
so slow compared to memory access.
Disk cache was invented to overcome this. When a read
operation is issued, data is brought to RAM. From there, the CPU can access
it. Furthermore, when data is written to the disk, it is put to buffer and
later pushed to the disk. This is done for two reasons. The first reason is
for reading operations. Here we are caching the data to anticipate further
read requests on the same blocks. As stated in the temporal locality
principle, a program tends to read same data over and over again during a
certain interval. If we apply that to disk access, on the second access,
there is no need to read from the disk, the CPU can just access it through
disk cache.
The second reason is for writing operations. If the CPU
must wait until the data is actually written to the disk, the CPU will be
completely blocked. A solution for this is using buffer as temporary
storage. The CPU just puts the data on buffer and continues to work.
There's no need to wait for lengthy disk access.
Another feature can help you overcome disk latency:
read-ahead. The basic idea is that if you read a block from a file, there is a chance that you will read the next block of the
same file. Thus, if somehow the kernel can predict it, it can "read-ahead" to the next block(s). So, when the data is
needed, it is already on the disk cache. This read-ahead concept is the
focus of this article.
Overview of Read-Ahead
Let's see how the stock read-ahead logic works.
The kernel performs this job by setting up two areas called
"windows", each of which is related to a certain area inside
the file. One is called the current window, the other is called read-ahead
window. Whenever the kernel recognizes a sequential read access inside the
current window, it will submit I/O request for pages that belong to read-ahead window. However, anytime the kernel detects that it is no longer doing sequential reads, the
read-ahead will be cancelled. This is done to prevent too much bloated
caching.
Adaptive read-ahead (ARA) is the new logic added to the
Linux kernel. One major goal of ARA is to provide support for aggressive
read-ahead. Besides sequential reads, it can also detect some forms of
semi-sequential scanning, such as:
- Parallel/interleaved sequential scans on
one file descriptor
- Sequential reads across file open/close
lifetime
- Mixed sequential/random accesses
- Sparse/skimming sequential reads
To use ARA, you must download the appropriate patches
from:
http://www.vanheusden.com/ara
and do manual patching [1]. Another alternative is to
use the -mm patchset [2]. Use the matching kernel source version to avoid
problems.
After taking a quick look on read-ahead basic, three
questions arise:
1. How far should we read-ahead? If it is too far, the reader will need less than we have
pre-fetched, and we will be wasting memory and bandwidth. If it's too
little, we experience more latency and, in case of concurrent readers, more
seeks.
2. When do we start read-ahead? Logically, it should
start when recognized patterns are detected, such as sequential
forward/backward reads. Thus, it is up the logic itself to interpret what
kind of disk access the program currently does. Since there are many types,
the logic should just try to catch certain common patterns.
3. When do we stop the read-ahead? If we stop too
soon, the kernel will lose the chance to speed up I/O. If we stop too late,
it will waste precious time seeking and reading on disk (which is really a
disadvantage, since again disk access is so slow compared to memory
access).
There are two ways to start read-ahead. First, it is
initiated by the kernel itself when the built-in heuristic says this disk
read will be done sequentially. Second, the user instructs the kernel to do
read-ahead, usually via certain system calls, such as posix_fadvise() on Linux kernel
version 2.6.x.
Most often, read-ahead is triggered through the first
method. This is normal because most applications leave the I/O management
completely to the kernel. However, certain applications do interfere in
I/O. An example is a database application, where the kernel is forced to
predict what the user is trying to do, almost without explicit clues.
Fortunately, there are implicit clues. Generally,
read-ahead logic will be started on two conditions: the file is freshly
opened and read from first offset, and the first page accessed is the next
page of the previously accessed page. Generally, these rules assume forward
reading. ARA improves this logic by doing backward pre-fetching if the
reader also does backward reading.
How many blocks we will read-ahead? This is trivial
work. Conventional read-ahead works somewhat slowly. It normally starts
with 8 pages (32 KB), then doubles until it reaches maximum 32 blocks (128
KB). Whenever the logic detects cache miss, but it believes that the page
has been recently cached, the read-ahead size is subtracted by 2 blocks.
Four blocks (16 KB) is, however, the minimum number of pages in a
read-ahead window.
ARA improves this by starting at a larger number (128
KB). For special cases, when a file is freshly opened and read from offset
0, it tries more aggressively and quickly reaches 1MB data to be
read-ahead.
If you want to see the current maximum disk read-ahead,
use this command:
# blockdev --getra <device>
Or manually set it by:
# blockdev --setra <sectors> <device>
Fetching more blocks brings two advantages. First,
speculatively, the kernel might anticipate some patterns of sparse reading.
Second, assuming the file is not terribly fragmented, data can be read in
one head seek. This is more efficient than doing move-read-move-read many
times due to small requests.
As you might guess, the disadvantage is that RAM is
filled up faster as pages cache and the number of readers increases.
However, the effect is minimal if you have just one concurrent reader.
To alleviate this, ARA adds thrashing protection logic -- something that doesn't exist on
stock read-ahead. Simply speaking, the protection is done by careful
computation based on the global available memory and the current reading
speed. It can also detect whether the current stream has suffered from a
read-ahead thrash, and properly handle it. A file page is said to be
"read-ahead thrashed" if it is freshly paged-in by read-ahead,
but falls to the tail of the inactive list (semi-FIFO list for cold cache
pages), and is abandoned before being read. For a task that slowly reads a
file, this is a loss because the block has to be paged-in again. ARA solves
this problem by not reading ahead pages more than the threshing threshold,
and moving pages back to the list head when danger of thrashing is
detected.
Finally, we come to the question of when to stop the
read-ahead. Again, there are two conditions: if the reading becomes
non-sequential and if the pages currently being read are already in page
cache. In conventional read-ahead, 256 pages found in cache is the
system-defined maximum value at which to stop read-ahead.
Note that read-ahead logic always considers the level
of disk congestion. Under heavy congestion, only necessary read-ahead will
be performed.
To control how ARA behaves, two tunable kernel
parameters have been added:
/proc/sys/vm/readahead_ratio -- This
knob limits read-ahead size to a percentage of the thrashing threshold. Set
it to a smaller value if you don't have enough memory for all the
concurrent readers or if the I/O loads fluctuate a lot. But if
there's plenty of memory (>2 MB per reader), a bigger value may
help performance.
Readahead_ratio also selects the read-ahead logic, as follows:
VALUE CODE PATH
--------------------------------------------
0 disable read-ahead totally
1 select the stock read-ahead logic
2-inf select the adaptive read-ahead logic
---------------------------------------------
The default value is 50. Reasonable values would be between 50 and 100.
/proc/sys/vm/readahead_hit_rate -- This is the maximum value allowed for number of read-ahead pages divided by
the number of accessed pages. This is useful only when readahead_ratio >=
1, that is, if ARA logic is selected. If the previous read-ahead request
has a bad hit rate, the kernel will be reluctant to do the next read-ahead.
Larger values help catch more sparse access patterns.
Be aware that read-ahead of the sparse patterns sacrifices memory for
speed. The default value is 1. It is recommended to keep the value below
16.
Benchmarking
So far, we discussed the theory behind read-ahead
logic, now it is the time to test how it performs in the real world. We
will perform these tests on machines with the following specifications:
For the dd and Iozone test:
- CPU AMD Athlon XP 1800+ Palomino Core, L1
cache 128 K, L2 cache 256 K
- Motherboard MSI KT4V with chipset VIA KT400
- Memory 256 MB DDR PC2700
- Maxtor Diamond Max Plus 8 Parallel ATA 40
GB, 2-MB buffer cache, 7200 RPM; formatted as ext2 and hosting benchmark
files
- Maxtor Diamond Max Plus 9 Parallel ATA 80
GB, 2-MB buffer cache, 7200 RPM; formatted as ext3 and running the OS
together with test program
For the NFS (Network File System) test:
- Dell PowerEdge 1750
- CPU 3.2-GHz Intel Xeon DP processor with 1-MB L3 cache
- ServerWorks GC-LE chipset 533-MHz FSB
- 2-GB PC2100 ECC SDRAM
- Dell PERC 4/Di RAID controller with 128-MB cache memory
- 3 x 73-GB Seagate Cheetah 10-K RPM Ultra320 hard disks, configured as hardware RAID
5; disks formatted as reiserfs and set as LVM2 volume
All tests were done using Linux kernel 2.6.17 with -mm6 [2] patch.
Before each of the tests, we did the following pre-conditioning:
- System is booted in run level 1.
- The deadline scheduler is set as the
default I/O scheduler for all disks.
- Swap devices are completely unmounted.
- /proc/sys/vm/readahead_hit_rate is set to 1.
- Maximum read ahead size is 1MB (according to the output of /proc/sys/<block-device>/queue/read_ahead_kb).
The initial test is reading a big file.
"dd" is used as the reader program, and the target file is 2
GB. Note that the file size exceeds the RAM size, thus real disk transfer
speed is also stressed here.
The tests are done by issuing ($ is the shell prompt):
$ dd if=<test-file> of=/dev/null bs=4k
and:
$ dd if=<test-file> of=/dev/null bs=2M
Reading is done four times, and we grab the user,
system, and the real time of the completion. The results are shown on Table
1. These results show that ARA brings no improvement in overall completion
time. However, it's a good sign that ARA doesn't add any
significant overhead. Also note that ARA profits from a larger block size.
More detailed data is available by running Iozone.
Iozone, which is found at:
http://www.iozone.org
is a file system benchmark tool that generates and
measures various file operations. We used Iozone version 3.263 compiled
with the make linux command.
Since we are focusing on reading operations, several types of tests are selected:
- Read -- Simply reading a file
sequentially.
- Re-read -- Reading a file that was
recently read. Since the operating system usually caches data from last
read operations, this test also shows how well the caching system works.
- Backward read -- Reading a file
backwards. Although perhaps unusual, this is a nice way to see whether
ARA's backward-reading logic really works.
- Stride read -- Reading a file using
read-seek-read-seek pattern. This is also known as sparse reading.
- Random read -- Reading a file in random locations. It could go forward or backward.
- Fread -- File is read using fread() function. This is a
kind of blocking and buffered read.
- Re-Fread -- Same as re-read test,
only this time we used fread().
The above tests are selected to represent real-world
reading processes done by major applications.
Iozone is invoked using the following options:
./iozone -S 256 -L 64 -E -Ra -i 0 -i 1 -i 2 -i 3 -i 5 -i 7 -i 10 -i 12
-g 512m -U /mnt/disk-linux/ -f /mnt/disk-linux/test.img \
-b ./iozone-result-adaptive.xls
-i option selects the above mentioned tests plus the write test
because it is needed to create the test file. -g defines the maximum size of the test file, which is 512 MB. -U tells Iozone to unmount
the mount point (/mnt/disk-linux) each time a single run of a certain test
is done. By doing unmount, cache is immediately flushed.
The -S option tells Iozone how much processor cache we have. Here, we
assume that this is the size of L2 cache that is 256 KB. -L tells Iozone about the
processor cache line size. We set both parameters manually since Iozone
can't automatically probe it. Please note that all the Iozone tests
are done in single-stream mode.
Since Iozone produces bulks of numbers, we have
narrowed the explanation into the following scope:
1. Comparison is done head to head between ARA
and conventional read-ahead in the following tests:
a. Reader test.
b. Re-reader test.
c. Backward reader.
d. Stride reader.
e. Random reader.
f. Fread reader.
2. Observation is focused on the following file size:
a. Equal to L2 cache size.
b. Equal to disk cache size.
c. Half of RAM size.
d. Equal to RAM size.
e. Twice RAM size.
The test results shown in Table 2 show that ARA makes
an improvement over stock logic in 43% of sequential reading cases. The
increase is not big for file sizes much greater than on-disk cache (128,
256, 512 MB); in fact it is rather detrimental. For those files, using
512-KB record size (half of maximum read-ahead size) is the best case. But
overall, we would likely see gain if the record size is between L2 cache
size and on-disk cache size.
Theoretically, in Table 3 we should see higher numbers
in re-reader tests (compared to Table 2) for every file size. However, due
to unmounting, memory cache is also flushed, so we get more or less the
same result as with the reader tests. Here, ARA works better in 36% of
cases. We still hardly see improvement for big files (>=128 MB).
Combined with Table 2, the chance to get acceleration for sequential read
is about 40%.
It's hard to define "random read"; therefore what we see in Table 4 is the way
Iozone does random read -- not necessarily "random" in its true nature. In 36% of cases, ARA
works better, and in big files (>=128 MB) ARA
gave a 600-800 KB/sec boost. Here, we see the best results if the record
size is equal with L1 data cache size.
Table 5 shows a promising result for ARA. We see
acceleration in 68% of cases, and mostly we have the best results if the
record size is either equal to L1 or L2 cache. The boost is around
5000-9000 KB/sec for large file. For both random and stride read,
increasing the readahead_hit_rate might help. Thus, we suggest tweaking
this parameter to get even better performance.
Backward read (see Table 6) is an unusual case and, for
stock logic, we can safely say that read-ahead is disabled. Only ARA still
works here, so what we see is actually ARA versus no read-ahead. ARA makes
reading faster in 48% of cases. We get the best results when using a 64-KB
record size (equal with L1 data-cache size). The gain drops below 1000
KB/sec once we use a record size greater or equal to 1 MB (maximum
read-ahead size).
Fread is the worst case for ARA (see Table 7). ARA
works better in 34% of cases, the lowest compared to the other Iozone test
items. However, we still see gain in file sizes greater or equal to 128 MB,
but only if we use big record sizes (8 MB and 16 MB). This is true since in
blocking operation, it is more efficient to submit requests as big as we
can per read operation.
So far, we can make the following conclusions:
1. ARA is good for stride reading. Optimal results are
achieved if the record size is equal to L1/L2 cache size.
2. If your application does backward reads, you should
consider using ARA to improve the performance since stock read-ahead simply
don't do backward read-ahead.
3. Random read is a guessing game for read-ahead (ARA
and stock). Thus, either simply turn off read-ahead or play with
readahead_hit_rate and see if you get any luck there. Experimentation shows
that reading chunks smaller than the maximum read-ahead size is the best
way to boost performance with ARA.
4. Forward sequential read is improved with ARA if you
use large record sizes (somewhere between 256 KB and 2 MB). However, even
with large records, ARA is fairly equal with stock logic in performance.
5. It is better to not use ARA if your application
uses fread().
Next, we did an NFS test to measure concurrent read
performance. NFS was selected to represent one of the oldest remote file
systems still widely used. NFS server runs with eight working threads.
For this NFS test, untar the Linux kernel 2.6.16 and
2.6.17 source packages and do a recursive "diff" between the
two. Here, the NFS volume is mounted locally, thus no network traffic is
involved. The NFS volume is mounted with the following command for the
first test:
mount -o tcp,rsize=32768 localhost:$ROOT $NFSROOT
And this for the second test:
mount -o tcp,rsize=524288 localhost:$ROOT $NFSROOT
Note that "rsize" is the number of bytes
NFS uses when reading files from an NFS server (in bytes). Looking at the
combined NFS test results in Table 8, we see that increasing
"rsize" makes ARA works faster. This is similar to the Iozone
tests, where increasing the record size brings improvement.
However, ARA does not work better than stock logic in
NFS tests. Further investigation shows the suspect. The NFS daemons'
read requests can be out of order, concurrent, and with no read-ahead state
information (internal variables that record the progress of read-ahead
work). They are handled by the context-based read-ahead method, which does
the job in the following steps:
1. Scan in page cache.
2. Make read-ahead decisions.
3. Allocate new pages.
4. Insert new pages to page cache.
A single read-ahead chunk in the client side will be
disassembled and serviced by many concurrent NFS daemons in the server
side. It is highly possible for two or more of these parallel nfsd
instances to be in steps 1, 2, and 3 at the same time. Without knowing that
others are working on the same file region, they will issue overlapping
read-ahead requests, which lead to many conflicts at step 4.
Finally, we also asked some people from the PostgreSQL
community to test ARA. A few people responded, and you can see the complete
results at the PostgreSQL mailing list archives [3]. (Click on the
"Follow-Ups" links for the details.)
Table 9 is a copy of the relevant test results using
odbc-bench taken from the PostgreSQL mailing list archives [9]. The
odbc-bench results show that ARA increases transactions per second from
around 6 to 30. This is a slight improvement for mixed-read operations
since database can do any type of read at any time. Setting
readahead_hit_rate to 1 works better here, thus giving us an idea that
read-ahead should be done as gently as possible. We didn't get any
information about swapping condition; thus, it's possible that
swapping could be the detrimental factor in the odbc-bench test.
How to Achieve the Best Results
The kernel only does a good job of reading ahead for
sequential accesses, but since that's the common case, most
applications are well served by the kernel-level read-ahead. However, in
some cases it would be better to bypass a bunch of tricky logic trying to
determine what the application is doing. There are ways to inform the
kernel that you either are or are not doing a sequential read, or even to
explicitly read-ahead/drop pages.
Fengguang Wu, the ARA author, considers the best method
would be to leave obvious patterns to the kernel (i.e., all kinds of
[semi-]sequential reads) and to do fadvise() for non-obvious patterns on critical points (e.g., the
database index scans).
To implement the second item would involve using posix_fadvise(POSIX_FADV_RANDOM) to disable kernel read-ahead and then using posix_fadvise(POSIX_FADV_WILLNEED) to launch application-level
read-ahead. Here are some code snippet examples:
This call will disable read-ahead totally for a file descriptor:
posix_fadvise(fd, any, any, POSIX_FADV_RANDOM);
Or, to do so for a whole file system:
# blockdev --setra 0 /dev/hda1
This one will re-enable read-ahead:
posix_fadvise(fd, any, any, POSIX_FADV_NORMAL);
This one will enable read-ahead and set the maximum
read-ahead window to 2 times the default size:
posix_fadvise(fd, any, any, POSIX_FADV_SEQUENTIAL);
The ARA works fine with user level read-ahead via the readahead() or posix_fadvise() syscall. That is, if a
program does the necessary readahead() calls to cover all the file pages requested by the following read() calls, it avoids triggering unnecessary kernel read-aheads.
If you see that you still have plenty of free RAM, try
to increase both readahead_hit_rate and readahead_ratio. This will consume
more memory but might increase performance as well.
Conclusions
Read-ahead is one nice feature implemented in Linux to
help reading performance. With ARA, it works especially well for sequential
forward/backward and stride reading. Applications such as databases, file
servers, and Web servers could also benefit from it.
As always, manual tuning is needed if you think the
reading pattern is somewhat unpredictable. It is suggested that you
temporarily disable any read-ahead logic if you believe reading is too
random or that only small portions of files are ever read, thus you can
save precious disk seek time.
The authors would like to thank Chakravarthi Pasumarthy
for reviewing the manuscript.
Resources
1. Latest adaptive read-ahead patches -- http://www.vanheusden.com/ara
2. Andrew Morton's -mm patchset -- http://kernel.org/pub/linux/kernel/people/akpm/
3. Announcement from Fengguang Wu about adaptive read-ahead on pgsql-performance mailing list.
The "Follow-Ups" link contains several test results done by the
community -- http://archives.postgresql.org/pgsql-performance/2006-04/msg00635.php
4. Scripts and source code to test several aspects of file systems -- http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz
5. Articles on LinuxWeeklyNews that describe how adaptive read-ahead works -- http://lwn.net/Articles/155510/; http://lwn.net/Articles/176279/
6. A Linux Symposium 2004 paper explaining Linux 2.6
performance improvement through read-ahead optimization -- http://www.linuxsymposium.org/proceedings/reprints/Reprint-Pai-OLS2004.pdf
7. Iozone Web site -- http://www.iozone.org
8. Bonnie++, yet another filesystem benchmark suite -- http://bonnie.sourceforge.net/
9. Odbc test results from PostgreSQL mailing list archives -- http://archives.postgresql.org/pgsql-performance/2006-04/msg00613.php
Mulyadi Santosa is a freelance writer and freelance IT
consultant and currently resides in Indonesia. His interests include
clustering, process management, and scheduling. He also runs a startup
company that teaches Linux courses for audiences ranging from newbies to
experts. He can be reached at: mulyadi.santosa@gmail.com.
Fengguang Wu is a Ph.D. student at University of
Science and Technology China. One of his research topics is I/O
optimization. Contact Wu at: wfg@mail.ustc.edu.cn.
|