Article

apr2006.tar

Rock Your Wiki!

Christian Folini

Web content is often generated on the fly. This generation is triggered by the request of the user. Unfortunately, it takes a lot of time to recreate the html content anew. If we take a closer look at the dynamic pages of many sites, we note that every user gets the same content without personalized data or session context; the same content is generated over and over again. The Web site has been made dynamic to be easier to maintain; however, the user expecting a fast browsing experience has been left out.

These days, many content management systems (CMS) come with some kind of application cache, which improves the performance to a certain extent. In this article, I will show that even those without a cache can be boosted on the Web server and that the application cache inside the CMS is slow compared to the method presented. While this approach works on a wide set of applications, I will use twiki as a test case [1].

Wikis are a special type of CMS. Although they are very useful collaboration tools, they are also known to be slow. The reason lies in their richness of features, of which the powerful markup language is just one. The twiki implementation has a large userbase, especially in the enterprise or intranet environment. This is where the network latency is very small, and a cache will give the biggest performance gain in regard to speed.

Twiki comes with a so-called cache plug-in that brings a better performance for the twiki installation [2]. By the way, the documentation is a good read on application caches in general. In terms of performance, we can do better than the cache plug-in. The general approach proposed in this article can be applied to other applications, too.

I will not explain how to set up twiki in this article. Please see the twiki installation guide on the Web site for detailed instruction on that [3]. Beyond installation, it should be simple to adapt the method explained here to an application of your choice.

Caching

Documents can be cached at multiple levels. Caching on the browser level does not work, as the browser does not know about page updates on the wiki in question. At the other end of the transaction is the application that can be configured to do caching in the twiki cache. However, this still results in a call to the application, which gives away a speed gain. We get the best performance by answering the request before it hits the application. We aim for a server-sided cache outside of the application on the Web server or even a lean reverse proxy in front of the Web server. The configuration discussed here will work in the same way with Apache 1.3 and Apache 2.0.

Apache supports caching in a typical way defined by the version 1.1 of the HTTP protocol [4]. Usually the caching functionality of mod_proxy is used to maintain a cache and to determine the expiry of a cached document [5]. This form of caching is recommended for files that remain static over a longer period of time. Usual candidates are pdfs, icons, and images that can be set to be cached for days and even weeks.

Html content is not recommended for caching by the means of mod_proxy and especially not in a wiki content. This is because the frequent updates to the content will conflict with the cache. Therefore, mod_proxy and similar caching modules are ruled out. Instead, we will use mod_rewrite [6]. At first sight, mod_rewrite has nothing to do with caching, yet its flexibility makes it a valid choice. Mod_rewrite is capable of serving a cached document, of observing updates to the content, and consequently of rendering the html document anew in order to deliver the latest version to the user.

Using mod_rewrite

The mod_rewrite documentation states that it's the "Swiss Army Knife of URL manipulation!". Since I am Swiss myself, I am very fond of mod_rewrite because it provides a way to rewrite URL requests on the fly. It can be configured via a set of rules defined in the Apache config.

Before tweaking our twiki installation with mod_rewrite, we need to have performance data so that we will be able to assess the improvements later on. In our case, twiki is running on the host kieran, a venerable Pentium II Debian Gnu/Linux host used as intranet Web server. A quick glimpse will tell us that running twiki on top of kieran will drive users mad:

$ time curl http://kieran.netnea.com/twiki/bin/view/Main/WebHome \
  >/dev/null 2>&1

real    0m4.941s
user    0m0.020s
sys     0m0.010s

I ran this command a couple of times and always got about five seconds for the start page of the wiki. This is really bad performance. So, let's try to improve the performance of kieran.

To begin, we create a cache directory beneath the twiki installation (/var/www/twiki/cache). We make sure it is writeable by the Web server. Then we make sure that the mod_rewrite module is loaded by Apache (which is likely to be the case by default). Afterwards, we activate the rewrite engine with the following commands in the Apache config file:

RewriteEngine   On
...
RewriteLog      /var/log/apache/rewrite.log
RewriteLogLevel 3
RewriteLock     /var/log/apache/rewrite.lock

After the initialization via "RewriteEngine On" (which must be placed outside a VirtualHost stanza), we define a logfile and the loglevel 3, which is fine for setup and debugging. On a production system, this should be set to 0. The last line is a lockfile, which makes sure that only a single request at the time is accessing the scripts we are going to use.

Before juggling the cache, it is important to distinguish the requests we want to cache from the ones we want to leave unaltered. In the case of twiki, this distinction is quite simple -- all cacheable requests access the view script. Furthermore, we have to treat the requests to the save-script in a special manner, too. Therefore, we set up a rewrite rule:

RewriteRule !/twiki/bin/(view|save)/ -       [last]

This will process every request URI that does not match (note the negation with "!") a path resembling /twiki/bin/view/ or /twiki/bin/save/. We do not rewrite the request ("-"), but we do tell Apache that this has been the last rewrite statement for all requests matched. So, instead of applying more rewrite rules to the request, it will avoid the following rewrite rules of the configuration.

So far, we got rid of the non-cacheable requests. Now, we have to find out which wiki page the user actually wants to see:

RewriteRule /twiki/bin/(view|save)/(.*)/(.*)$  -  \
  [ENV=TWIKIWEB:$2,ENV=TWIKIPAGE:$3]

This defines a rewrite rule with a condition resembling the one above. It will match all cacheable requests as they follow this regex pattern. See, for example, the curl request shown previously. The view (or save) in the path is followed by the wiki Web parameter and finally the wiki page parameter. Again, we do not rewrite this request, but we put the content of the second parameter in the brackets in the Apache environment variable TWIKIWEB, which is bound to the request. The same applies to the third parameter, which is saved as TWIKIPAGE. (If you are not used to regexes, you have to realize that "(view|save)" is the first parameter in brackets.)

Now let's see whether we have this page in the cache. If it is present, we will serve it. This works as follows:

RewriteCond /var/www/twiki/cache/%{ENV:TWIKIWEB}.%{ENV:TWIKIPAGE}.html  -s
RewriteRule /twiki/bin/view/  \
  /twiki/cache/%{ENV:TWIKIWEB}.%{ENV:TWIKIPAGE}.html  [last]

This is another rewrite statement, but this time with a complication in form of an extended condition, which the RewriteRule statement cannot handle anymore. Basically, we look for a file like /var/www/twiki/cache/Main.WebHome.html.

The condition with "-s" matches if the file exists and if it has a filesize above 0 (good additional check!). If that is the case, we follow with the RewriteRule itself. The rule has another condition, which applies to view-requests only. If this is a view-request, we will rewrite it. In our example, we end up with the URI /twiki/cache/Main.WebHome.html. Then, we tell Apache that we are done with rewrite rules and that it should proceed with the delivery of the request ("last"). To test this, place a file with the proper filename in the cache folder and see whether it is delivered. It does not need to be real content; a "Hello World" will do for the test.

If you take a closer look at the HTTP response headers returned by Apache (you can do so using curl -v) you will note that the twiki call and the request delivered by the cache differ in one detail -- the request carried out by twiki will not include the http-headers that the browser uses to cache the file (e.g., the creation date and the etag headers). This is done to prevent it from caching dynamic content. When we deliver the page from the cache defined previously, the normal html content handler of Apache comes into play, and Apache will submit the said headers to the browser to keep it from reloading the same documents a second time.

It proves to be difficult to convince the Web server to suppress these headers. The easier way around is to use mod_headers to add a header of the form "Pragma: no-cache" to the request. This will prevent the browser from caching the document locally for reuse:

Near Directory or Location:

    <Directory /var/www/twiki/cache/>
        Header add Pragma no-cache
    </Directory>

Make sure you have enabled mod_headers and keep in mind that the directory stanzas are not cummulative. Therefore, you may have to add other options (notably authentication) to this stanza if you have defined them for the rest of the server.

Generating the Cache

If the RewriteRule to access the cache does not catch, then we have to generate the cache before we can fulfill the request. The generation is carried out by the means of a Perl script that will have the page assembled and saved as the cached file.

Let's take a look at this script:

script: /var/www/twiki/bin/content-cache-generator.pl

#!/usr/bin/perl -w
use strict;
use LWP;

$| = 1; # Turn off buffering

my ($str, $res) = "";
my $ua = LWP::UserAgent->new(timeout => 10);

while (<STDIN>) {
    chomp($_);
    $str = $_;
    $res = $ua->get("http://kieran.netnea.com/twiki/bin/view/ \
       $str\?x-ignore");

    if ($res->is_success && $res->status_line eq "200 OK") {
        s/\//./;
        if (open(FILE, ">/var/www/twiki/cache/$_.html")) {
            print FILE $res->content;
            close(FILE);
            print "/twiki/cache/$_.html\n";
        }
        else {
            print "/twiki/bin/view/$str\n";
        }
    }
    else {
        print "/twiki/bin/view/$str\n";
    }
}

Because Apache has to call this script itself, we must be extremely careful with input and output handling. If this script were to print out error messages, the whole Apache process would hang because the buffer would be filled instantly.

So, first we source the "strict" and the "LWP" module and then we turn off buffered output, which is mandatory in this context. Two variables are initialized, and a "user agent" object (=browser request object) is generated. Because our server, kieran, is known to be slow, the timeout of the request is set to 10 seconds.

Second, we enter a loop with STDIN as a parameter. Thus, this script will initialize and then wait for input, process the input, and wait for the next input. The parameter is taken, the trailing newline character is chopped off, and then it is saved in a variable called str. After this, the request to the page is initialized. Note that this is almost the same request that the user sent out. But this time, we add x-ignore to the query string of the request. Without this flag, the Apache Web server would handle it as another user request, access this script, and enter a recursive loop. We will see shortly how to make Apache be aware of the x-ignore flag.

Once the request is complete, the script checks whether it was successful. If it was, the content is saved in the cached file. We then print out the URI pointing to the cached file to STDOUT. This means that Apache and the script will communicate with each other with the help of STDIN and STDOUT.

If the request has failed, we give up and tell Apache to deliver the page using the twiki view-script. This way, the user's request has been put on hold for a few seconds, but eventually he or she will receive the desired content, even if the cache generation failed.

Before we can try out the script itself, we must configure it inside Apache's mod_rewrite. Before the first RewriteRule command:

RewriteMap   generator            prg:/var/www/twiki/bin/ \
                                      content-cache-generator.pl
RewriteCond    %{QUERY_STRING}       !^$
RewriteRule    .          -          [last]

After the last RewriteRule command:

RewriteRule /twiki/bin/view/(.*)   \
  ${generator:%{ENV:TWIKIWEB}/%{ENV:TWIKIPAGE}}     [last]

Next, we define a RewriteMap called generator, which is the script described above. Rewrite maps are a special construct that give Apache the ability to have an external program decide how to rewrite a request. The script decides whether to deliver the newly generated cache or to go with the plain request itself when the generation failed. Of course, we issue the cache generation request, but this is merely a side note as far as the rewrite map is concerned.

Rewrite maps are started when the Apache master process is started. All the Apache children share the same map. This is why we defined a lock above, and this is also why we have to be so careful. When the map script hangs, the Apache child using it will hang, too. The next child will queue up behind waiting for the lock to be released but in vain. More and more children will be stalled and eventually the Apache master process will run out of children and hang. To put it simply, a single error in the RewriteMap script will pile up and eventually hose your Web server. So, be extremely careful when adopting the script to your needs.

Afterwards, we tell Apache how to handle the twiki request issued by the script, that is, requests with the x-ignore in the query string (the parameter list of the request). We do not have access to the query string in the rewrite rule itself, so we use a separate rewrite condition again. We would want to check for the x-ignore flag. However, it is best to avoid requests with query strings completely. This is safer because they also carry twiki parameters that would generate special forms of the same page, thus hitting our cache, which is ignorant of these parameters. Consequently, the cache would deliver the wrong content. Thus, instead of messing with special requests, we concentrate on the 90% of plain cacheable requests and avoid the rest, which will be delivered by twiki.

After the last RewriteRule (the one that checks for the cached file), we call the map we have defined as "generator" for all the view requests. The syntax is a bit special, but you will certainly get it -- the map script gets a combination of the two environment variables divided by a slash as standard input. Thus, by making up the part of the URL we use in the script to form the request and replacing the slash with a dot and putting ".html" as suffix, we get the filename of the cached document.

After we have defined the handling of the x-ignore flag in the Apache config, we can try out the script from the command line. Remember that the script communicates via standard input and output:

$ /var/www/twiki/bin/content-cache-generator.pl
Main/WebHome                   (input by user)
/twiki/cache/Main.WebHome.html (output by script)

We call the script and enter Main/WebHome as STDIN. Then we wait for a few seconds and hope we get the desired path to the cached file. If this worked, we check that the file is really present and has the right content. If not, we have to debug the script. It helps to place some temporary debug messages in the script. You can also look at the rewrite log or issue an strace call against the script (if you are running Linux).

Testing the Setup

If all goes well, we can now test the full setup from the browser. When we request a page for the first time, the Web server will take a few moments to generate the cache, but subsequent deliveries will be very fast. We can also check the cache folder and see how the files pop up.

Now we have caching, but we have not yet defined a way to get rid of the outdated cache. One frequently used method to remove a cache is with a cronjob that erases the content of the cache. But, as we deal with a wiki, which is updated constantly, we have to update the cache in a similar way. We do this via a second rewrite map, which is called whenever a page is saved. On to the "generator" RewriteMap:

RewriteMap   remover   prg:/var/www/twiki/bin/content-cache-remover.pl

After the last command:

RewriteRule /twiki/bin/save/(.*)   \
  /twiki/bin/save/$1?${remover:%{ENV:TWIKIWEB}/%{ENV:TWIKIPAGE}} [last]

The first line is the initialization of a rewrite map named "remover". The next line is a little trick -- it places a rewrite rule with a condition, and we do not actually do any rewriting. Instead, we rewrite the statement to itself. Note that the $1 refers to the brackets in the condition. After the question mark comes the new query string of the request. We do not actually need a query string, but we use the generation of the query string to call our remover with the known environment variable as STDIN:

script /var/www/twiki/bin/content-cache-remover.pl:

#!/usr/bin/perl -w
use strict;

$| = 1; # Turn off buffering

while (<STDIN>) {
    chomp($_);
    s/\//./;
    unlink "/var/www/twiki/cache/$_.html";
                print "\n";
}

This simple script takes STDIN as filename and tries to remove the file. The nice thing about the "unlink" function in Perl is that it does not complain. Thus, there is no need for error handling. After we pass a newline character to STDOUT, we are done because the cached file will be removed, and the query string will be empty. Apache will carry on with the save-request afterwards and the cache will be generated anew upon the next view request. If you edit a page and your browser does not display the new version, but the former one, you are facing a caching error, probably on the browser side. See whether reloading helps, then check the presence of the http-header Pragma: no-cache in the server response. You can also have a look at the timestamp of the cached file to see whether it has been updated.

Once mod_rewrite caching has been completely implemented, it is time to look at the performance of the Web server again. Before you do, you may want to look at the sidebar "mod_rewrite config roundup", which brings the complete cache configuration in a single piece.

?> time curl  \
   http://kieran.netnea.com/twiki/bin/ \
   view/Main/WebHome >/dev/null 2>&1

real    0m4.744s
user    0m0.040s
sys     0m0.000s
?> time curl  \
   http://kieran.netnea.com/twiki/bin/ \
   view/Main/WebHome >/dev/null 2>/dev/null

real    0m0.043s
user    0m0.030s
sys     0m0.010s

The first request is one with an empty cache. The delivery time in this example is faster than the one with the non-altered setup we started with, but this is within the statistical variation. It is important to note, however, that a request is not slowed down by our tweaking. But then have a look at a subsequent request. It is more than a hundred times faster. It is unlikely that you'll get this big of an increase outside an intranet setting, due to network latency. Regardless of where the caching is applied, however, users will encounter a real boost in performance!

References

1. Twiki -- http://www.twiki.org

2. Twiki Cache Plug-in -- http://twiki.org/cgi-bin/view/Plugins/CacheAddOn

3. Twiki Installation Guide -- http://twiki.org/cgi-bin/view/TWiki/TWikiInstallationGuide

4. Hypertext Transfer Protocol (HTTP) -- http://www.w3.org/Protocols/rfc2616/rfc2616.html

5. Apache mod_proxy -- http://httpd.apache.org/docs/1.3/mod/mod_proxy.html

6. Apache mod_rewrite -- http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html

Christian Folini, PhD has studied Medieval History and Computer Science in Fribourg (Switzerland) and Berlin. He works as a consultant for netnea and he assisted in the construction of the access layer of yellownet, the online banking platform of Swiss Post.