Rock
Your Wiki!
Christian Folini
Web content is often generated on the fly, triggered by the request
of the user. Unfortunately, recreating the HTML content anew for
every request takes a lot of time. If we take a closer
look at the dynamic pages of many sites, we note that every user
gets the same content without personalized data or session context;
the same content is generated over and over again. The Web site
has been made dynamic to be easier to maintain; however, the user
expecting a fast browsing experience has been left out.
These days, many content management systems (CMS) come with some
kind of application cache, which improves the performance to a certain
extent. In this article, I will show that even those without a cache
can be boosted on the Web server and that the application cache
inside the CMS is slow compared to the method presented. While this
approach works on a wide set of applications, I will use twiki as
a test case [1].
Wikis are a special type of CMS. Although they are very useful
collaboration tools, they are also known to be slow. The reason
lies in their richness of features, of which the powerful markup
language is just one. The twiki implementation has a large userbase,
especially in the enterprise or intranet environment. This is where
network latency is very small, so a cache yields the biggest
performance gain.
Twiki comes with a so-called cache plug-in that improves the
performance of a twiki installation [2]. By the way, the documentation
is a good read on application caches in general. In terms of performance,
we can do better than the cache plug-in. The general approach proposed
in this article can be applied to other applications, too.
I will not explain how to set up twiki in this article. Please
see the twiki installation guide on the Web site for detailed instruction
on that [3]. Beyond installation, it should be simple to adapt the
method explained here to an application of your choice.
Caching
Documents can be cached at multiple levels. Caching on the browser
level does not work, as the browser does not know about page updates
on the wiki in question. At the other end of the transaction is
the application itself, which can be configured to cache pages via
the twiki cache plug-in. However, this still results in a call to the
application, which gives away part of the possible speed gain. We get
the best performance by answering
the request before it hits the application. We aim for a server-sided
cache outside of the application on the Web server or even a lean
reverse proxy in front of the Web server. The configuration discussed
here will work in the same way with Apache 1.3 and Apache 2.0.
Apache supports caching in a typical way defined by the version
1.1 of the HTTP protocol [4]. Usually the caching functionality
of mod_proxy is used to maintain a cache and to determine the expiry
of a cached document [5]. This form of caching is recommended for
files that remain static over a longer period of time. Usual candidates
are pdfs, icons, and images that can be set to be cached for days
and even weeks.
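For such static files, a sketch of what this could look like with mod_expires follows; the module choice and the lifetimes are assumptions for illustration, not part of the setup described in this article:

```apache
# Sketch only: let browsers and proxies cache static files.
# The lifetimes below are example values, not recommendations.
ExpiresActive On
ExpiresByType image/png "access plus 1 week"
ExpiresByType image/gif "access plus 1 week"
ExpiresByType application/pdf "access plus 1 month"
```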
HTML content is not recommended for caching by means of mod_proxy,
and the content of a wiki especially not, because the frequent
updates to the content will conflict with the cache. Therefore,
mod_proxy and similar caching modules are ruled out. Instead, we
will use mod_rewrite [6]. At first sight, mod_rewrite has nothing
to do with caching, yet its flexibility makes it a valid choice.
Mod_rewrite is capable of serving a cached document, of observing
updates to the content, and consequently of rendering the HTML document
anew in order to deliver the latest version to the user.
Using mod_rewrite
The mod_rewrite documentation states that it's the "Swiss Army
Knife of URL manipulation!". Since I am Swiss myself, I am very
fond of mod_rewrite because it provides a way to rewrite URL requests
on the fly. It can be configured via a set of rules defined in the
Apache config.
Before tweaking our twiki installation with mod_rewrite, we need
to have performance data so that we will be able to assess the improvements
later on. In our case, twiki is running on the host kieran, a venerable
Pentium II Debian Gnu/Linux host used as intranet Web server. A
quick glimpse will tell us that running twiki on top of kieran will
drive users mad:
$ time curl http://kieran.netnea.com/twiki/bin/view/Main/WebHome \
>/dev/null 2>&1
real 0m4.941s
user 0m0.020s
sys 0m0.010s
I ran this command a couple of times and always got about five seconds
for the start page of the wiki. This is really bad performance. So,
let's try to improve the performance of kieran.
To begin, we create a cache directory beneath the twiki installation
(/var/www/twiki/cache). We make sure it is writeable by the Web
server. Then we make sure that the mod_rewrite module is loaded
by Apache (which is likely to be the case by default). Afterwards,
we activate the rewrite engine with the following commands in the
Apache config file:
RewriteEngine On
...
RewriteLog /var/log/apache/rewrite.log
RewriteLogLevel 3
RewriteLock /var/log/apache/rewrite.lock
After the initialization via "RewriteEngine On", we define a logfile
and loglevel 3, which is fine for setup and debugging. On a production
system, this should be set to 0. The last line defines a lockfile,
which makes sure that only a single request at a time accesses the
scripts we are going to use. (The RewriteLock directive must be
placed outside any VirtualHost stanza.)
Before juggling the cache, it is important to distinguish the
requests we want to cache from the ones we want to leave unaltered.
In the case of twiki, this distinction is quite simple -- all cacheable
requests access the view script. Furthermore, we have to treat the
requests to the save-script in a special manner, too. Therefore,
we set up a rewrite rule:
RewriteRule !/twiki/bin/(view|save)/ - [last]
This rule processes every request URI that does not match (note the
negation with "!") a path resembling /twiki/bin/view/ or /twiki/bin/save/.
We do not rewrite the request ("-"), but we do tell Apache that this
was the last rewrite statement for every request that matched; all
following rewrite rules of the configuration are skipped for it.
So far, we got rid of the non-cacheable requests. Now, we have
to find out which wiki page the user actually wants to see:
RewriteRule /twiki/bin/(view|save)/(.*)/(.*)$ - \
[ENV=TWIKIWEB:$2,ENV=TWIKIPAGE:$3]
This defines a rewrite rule with a pattern resembling the one above.
It will match all cacheable requests, as they follow this regex pattern.
See, for example, the curl request shown previously. The view (or
save) in the path is followed by the wiki Web parameter and finally
the wiki page parameter. Again, we do not rewrite this request, but
we store the content of the second parenthesized group in the
Apache environment variable TWIKIWEB, which is bound to the request.
The same applies to the third group, which is saved as TWIKIPAGE.
(If you are not used to regexes, note that "(view|save)" is the
first parenthesized group.)
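To see what the pattern captures for our example request, you can replay the regex in the shell; the sed invocation is just an illustration and not part of the Apache setup:

```shell
# Replay the rewrite pattern on the example URI:
# group 2 is the wiki Web, group 3 the wiki page.
echo "/twiki/bin/view/Main/WebHome" \
  | sed -E 's#/twiki/bin/(view|save)/(.*)/(.*)$#web=\2 page=\3#'
# -> web=Main page=WebHome
```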
Now let's see whether we have this page in the cache. If it is
present, we will serve it. This works as follows:
RewriteCond /var/www/twiki/cache/%{ENV:TWIKIWEB}.%{ENV:TWIKIPAGE}.html -s
RewriteRule /twiki/bin/view/ \
/twiki/cache/%{ENV:TWIKIWEB}.%{ENV:TWIKIPAGE}.html [last]
This is another rewrite statement, but this time with a complication
in the form of an extended condition, which the RewriteRule statement
alone cannot handle. Basically, we look for a file such as
/var/www/twiki/cache/Main.WebHome.html.
The condition with "-s" matches if the file exists and if it has
a filesize above 0 (a good additional check!). If that is the case,
we move on to the RewriteRule itself. Its pattern matches
view-requests only; if this is a view-request, we rewrite it.
In our example, we end up with the URI /twiki/cache/Main.WebHome.html.
Then, we tell Apache that we are done with rewrite rules and that
it should proceed with the delivery of the request ("last"). To
test this, place a file with the proper filename in the cache folder
and see whether it is delivered. It does not need to be real content;
a "Hello World" will do for the test.
If you take a closer look at the HTTP response headers returned
by Apache (you can do so using curl -v), you will note that
the twiki call and the request delivered from the cache differ in
one detail -- the response generated by twiki does not include
the HTTP headers that the browser uses to cache the file (e.g.,
the Last-Modified and ETag headers). This is done to prevent
the browser from caching dynamic content. When we deliver the page
from the cache defined previously, the normal HTML content handler
of Apache comes into play, and Apache will send said headers to the
browser, keeping it from reloading the same document a second time.
It proves to be difficult to convince the Web server to suppress
these headers. The easier way around is to use mod_headers to add
a header of the form "Pragma: no-cache" to the response. This will
prevent the browser from caching the document locally for reuse:
Inside a Directory or Location stanza:
<Directory /var/www/twiki/cache/>
Header add Pragma no-cache
</Directory>
Make sure you have enabled mod_headers, and keep in mind that Directory
stanzas are not cumulative. Therefore, you may have to repeat other
options (notably authentication) in this stanza if you have defined
them for the rest of the server.
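If, for example, the rest of the server is protected by basic authentication, the stanza would have to repeat those directives. A hypothetical sketch follows; the AuthName and the htpasswd path are made up for illustration:

```apache
<Directory /var/www/twiki/cache/>
    Header add Pragma no-cache
    # Repeat the server's authentication here, as Directory
    # stanzas are not cumulative (values below are examples).
    AuthType Basic
    AuthName "Intranet Wiki"
    AuthUserFile /etc/apache/htpasswd
    require valid-user
</Directory>
```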
Generating the Cache
If the RewriteRule to access the cache does not catch, then we
have to generate the cache before we can fulfill the request. The
generation is carried out by the means of a Perl script that will
have the page assembled and saved as the cached file.
Let's take a look at this script:
script: /var/www/twiki/bin/content-cache-generator.pl
#!/usr/bin/perl -w
use strict;
use LWP;

$| = 1; # Turn off buffering

my $str;
my $res;
my $ua = LWP::UserAgent->new(timeout => 10);

while (<STDIN>) {
    chomp($_);
    $str = $_;
    # Fetch the page ourselves; the x-ignore query string keeps
    # Apache from routing this internal request back into this script.
    $res = $ua->get("http://kieran.netnea.com/twiki/bin/view/$str?x-ignore");
    if ($res->is_success && $res->status_line eq "200 OK") {
        s/\//./; # Main/WebHome -> Main.WebHome
        if (open(FILE, ">/var/www/twiki/cache/$_.html")) {
            print FILE $res->content;
            close(FILE);
            print "/twiki/cache/$_.html\n";
        }
        else {
            print "/twiki/bin/view/$str\n";
        }
    }
    else {
        print "/twiki/bin/view/$str\n";
    }
}
Because Apache calls this script itself, we must be extremely
careful with input and output handling. If this script were to print
error messages to STDOUT, the map protocol (exactly one line of
output per line of input) would break, and the Apache processes
using the map would hang.
So, first we load the "strict" pragma and the "LWP" module, and then
we turn off output buffering, which is mandatory in this context.
Two variables are declared, and a "user agent" object (a browser-like
request object) is created. Because our server, kieran, is known
to be slow, the timeout of the request is set to 10 seconds.
Second, we enter a loop reading from STDIN. Thus, this
script will initialize, then wait for input, process the input,
and wait for the next input. Each line read is taken, the trailing
newline character is chopped off, and the value is saved in a variable
called str. After this, the request for the page is issued.
Note that this is almost the same request that the user sent out,
but this time, we add x-ignore to the query string of the
request. Without this flag, the Apache Web server would handle it
as another user request, access this script, and enter a recursive
loop. We will see shortly how to make Apache aware of the x-ignore
flag.
Once the request is complete, the script checks whether it was
successful. If it was, the content is saved in the cached file.
We then print out the URI pointing to the cached file to STDOUT.
This means that Apache and the script will communicate with each
other with the help of STDIN and STDOUT.
If the request has failed, we give up and tell Apache to deliver
the page using the twiki view-script. This way, the user's request
has been put on hold for a few seconds, but eventually he or she
will receive the desired content, even if the cache generation failed.
Before we can try out the script itself, we must configure it
inside Apache's mod_rewrite. Before the first RewriteRule command:
RewriteMap generator \
prg:/var/www/twiki/bin/content-cache-generator.pl
RewriteCond %{QUERY_STRING} !^$
RewriteRule . - [last]
After the last RewriteRule command:
RewriteRule /twiki/bin/view/(.*) \
${generator:%{ENV:TWIKIWEB}/%{ENV:TWIKIPAGE}} [last]
Next, we define a RewriteMap called generator, which points to
the script described above. Rewrite maps are a special construct that
gives Apache the ability to let an external program decide how to
rewrite a request. The script decides whether to deliver the newly
generated cache file or to fall back to the plain view request when
the generation failed. That the script also issues the cache
generation request is, as far as the rewrite map is concerned,
merely a side effect.
Rewrite maps are started together with the Apache master process,
and all the Apache children share the same map. This is why we defined
a lock above, and this is also why we have to be so careful: when
the map script hangs, the Apache child using it hangs, too.
The next child will queue up behind it, waiting in vain for the lock
to be released. More and more children will stall, and eventually
the Apache master process will run out of children and hang. To
put it simply, a single error in the RewriteMap script will pile
up and eventually hose your Web server. So, be extremely careful
when adapting the script to your needs.
Afterwards, we tell Apache how to handle the twiki requests issued
by the script, that is, requests with x-ignore in the
query string (the parameter list of the request). We do not have
access to the query string in the rewrite rule itself, so we use
a separate rewrite condition again. We could check for the
x-ignore flag specifically; however, it is best to leave requests
with query strings alone completely. This is safer because query
strings also carry twiki parameters that generate special forms of
the same page; such requests would hit our cache, which is ignorant
of these parameters, and the cache would deliver the wrong content.
Thus, instead of messing with special requests, we concentrate on
the 90% of plain cacheable requests and pass the rest on to twiki.
After the last RewriteRule (the one that checks for the cached
file), we call the map we have defined as "generator" for all
remaining view requests. The syntax is a bit special, but you will
certainly get it -- the map script receives the two environment
variables, joined by a slash, on its standard input. This is the same
Web/Page part of the URL that the script uses to form its request;
replacing the slash with a dot and appending ".html" yields the
filename of the cached document.
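The mapping from map input to cache filename can be illustrated in the shell, using the paths from this article:

```shell
# Derive the cache filename from the Web/Page pair handed to the map:
# replace the slash with a dot and append ".html".
page="Main/WebHome"
cached="/var/www/twiki/cache/$(echo "$page" | tr '/' '.').html"
echo "$cached"
# -> /var/www/twiki/cache/Main.WebHome.html
```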
After we have defined the handling of the x-ignore flag
in the Apache config, we can try out the script from the command
line. Remember that the script communicates via standard input and
output:
$ /var/www/twiki/bin/content-cache-generator.pl
Main/WebHome (input by user)
/twiki/cache/Main.WebHome.html (output by script)
We call the script and enter Main/WebHome as STDIN. Then we
wait for a few seconds and hope we get the desired path to the cached
file. If this worked, we check that the file is really present and
has the right content. If not, we have to debug the script. It helps
to place some temporary debug messages in the script. You can also
look at the rewrite log or issue an strace call against the
script (if you are running Linux).
Testing the Setup
If all goes well, we can now test the full setup from the browser.
When we request a page for the first time, the Web server will take
a few moments to generate the cache, but subsequent deliveries will
be very fast. We can also check the cache folder and see how the
files pop up.
Now we have caching, but we have not yet defined a way to get
rid of outdated cache files. One frequently used method is a
cronjob that periodically erases the content of the cache.
But because we are dealing with a wiki, which is updated constantly,
we have to update the cache just as constantly. We do this via a
second rewrite map, which is called whenever a page is saved. Next
to the "generator" RewriteMap:
RewriteMap remover prg:/var/www/twiki/bin/content-cache-remover.pl
After the last command:
RewriteRule /twiki/bin/save/(.*) \
/twiki/bin/save/$1?${remover:%{ENV:TWIKIWEB}/%{ENV:TWIKIPAGE}} [last]
The first line is the initialization of a rewrite map named "remover".
The next line is a little trick -- we place a rewrite rule that
rewrites the request to itself. Note that $1 refers to the
parenthesized group in the rule's pattern. After the question mark
comes the new query string of the request. We do not actually need
a query string, but we use its generation to call our remover, with
the known environment variables passed on STDIN:
script /var/www/twiki/bin/content-cache-remover.pl:
#!/usr/bin/perl -w
use strict;

$| = 1; # Turn off buffering

while (<STDIN>) {
    chomp($_);
    s/\//./; # Main/WebHome -> Main.WebHome
    unlink "/var/www/twiki/cache/$_.html";
    print "\n";
}
This simple script reads the Web/Page pair from STDIN, converts it
into the cache filename, and tries to remove the file. The nice thing
about Perl's "unlink" function is that it does not complain if the
file is missing, so there is no need for error handling. After
passing a newline character to STDOUT, we are done: the cached
file is removed, and the query string is empty. Apache then
carries on with the save-request, and the cache will be generated
anew upon the next view request. If you edit a page and your browser
displays the former version instead of the new one, you are facing
a caching error, probably on the browser side. See whether reloading
helps, then check for the presence of the HTTP header Pragma: no-cache
in the server response. You can also look at the timestamp
of the cached file to see whether it has been updated.
Once mod_rewrite caching has been completely implemented, it is
time to look at the performance of the Web server again. Before
you do, you may want to look at the sidebar "mod_rewrite config
roundup," which presents the complete cache configuration in one
piece.
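Assembled from the snippets earlier in this article, the complete mod_rewrite cache configuration reads roughly as follows; treat it as a sketch and verify the directive order against your own setup:

```apache
RewriteEngine On
RewriteLog /var/log/apache/rewrite.log
# Loglevel 3 is useful during setup; use 0 in production.
RewriteLogLevel 0
RewriteLock /var/log/apache/rewrite.lock

RewriteMap generator \
prg:/var/www/twiki/bin/content-cache-generator.pl
RewriteMap remover prg:/var/www/twiki/bin/content-cache-remover.pl

# Leave requests with a query string (e.g., x-ignore) alone.
RewriteCond %{QUERY_STRING} !^$
RewriteRule . - [last]

# Only view and save requests concern the cache.
RewriteRule !/twiki/bin/(view|save)/ - [last]

# Remember the wiki Web and page of the request.
RewriteRule /twiki/bin/(view|save)/(.*)/(.*)$ - \
[ENV=TWIKIWEB:$2,ENV=TWIKIPAGE:$3]

# Serve from the cache if a non-empty cached file exists.
RewriteCond /var/www/twiki/cache/%{ENV:TWIKIWEB}.%{ENV:TWIKIPAGE}.html -s
RewriteRule /twiki/bin/view/ \
/twiki/cache/%{ENV:TWIKIWEB}.%{ENV:TWIKIPAGE}.html [last]

# Otherwise, have the map script generate the cache file.
RewriteRule /twiki/bin/view/(.*) \
${generator:%{ENV:TWIKIWEB}/%{ENV:TWIKIPAGE}} [last]

# On save, remove the outdated cache file.
RewriteRule /twiki/bin/save/(.*) \
/twiki/bin/save/$1?${remover:%{ENV:TWIKIWEB}/%{ENV:TWIKIPAGE}} [last]

<Directory /var/www/twiki/cache/>
    Header add Pragma no-cache
</Directory>
```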
$ time curl http://kieran.netnea.com/twiki/bin/view/Main/WebHome \
>/dev/null 2>&1
real 0m4.744s
user 0m0.040s
sys 0m0.000s
$ time curl http://kieran.netnea.com/twiki/bin/view/Main/WebHome \
>/dev/null 2>&1
real 0m0.043s
user 0m0.030s
sys 0m0.010s
The first request hits an empty cache. The delivery time in
this example is faster than with the unaltered setup we started
with, but the difference lies within the statistical variation. It is
important to note, however, that a request is not slowed down by our
tweaking. Now look at a subsequent request: it is more
than a hundred times faster. You are unlikely to get this big
an increase outside an intranet setting, due to network latency.
Regardless of where the caching is applied, however, users will
encounter a real boost in performance!
References
1. Twiki -- http://www.twiki.org
2. Twiki Cache Plug-in -- http://twiki.org/cgi-bin/view/Plugins/CacheAddOn
3. Twiki Installation Guide -- http://twiki.org/cgi-bin/view/TWiki/TWikiInstallationGuide
4. Hypertext Transfer Protocol (HTTP) -- http://www.w3.org/Protocols/rfc2616/rfc2616.html
5. Apache mod_proxy -- http://httpd.apache.org/docs/1.3/mod/mod_proxy.html
6. Apache mod_rewrite -- http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html
Christian Folini, PhD has studied Medieval History and Computer
Science in Fribourg (Switzerland) and Berlin. He works as a consultant
for netnea and he assisted in the construction of the access layer
of yellownet, the online banking platform of Swiss Post.