Minimizing Contact Form and Forum Spam
Johannes Ullrich
A few years ago, like many other companies, we decided to remove our email address from our Web sites in order to reduce the amount of spam we received. Instead, we offered contact forms through which our users could provide feedback. However, over the past year, we have received spam submitted via those contact forms, bulletin boards, and other random forms we use on our site. In fact, the spam far outnumbered valid submissions.
This problem is not unique to our site. "Form Spam" or "Forum Spam" has become a huge problem, and, like email spam, is a big time sink. Besides being a nuisance, this type of spam presents a security threat because it may link to browser exploits or use cross-site scripting (XSS) to steal cookies and other session information from administrators.
To combat this problem, we investigated a number of countermeasures. It was our goal to find a countermeasure that would selectively increase the work required by spammers and would not prevent valid posts or cause additional load on the server. In this article, I'll compare the method we finally selected (form field hiding with style sheets) to other commonly employed methods (see Table 1).
CAPTCHA
CAPTCHA, which stands for "Completely Automated Public Turing test to tell Computers and Humans Apart," is an image that is displayed to the user. The image shows a number of distorted letters, and the user is asked to type the text displayed by the image into a form field. The system then verifies whether the text entered matches the text displayed. Such tests were proposed as early as 1996 by Moni Naor [2]. Later, the term CAPTCHA was coined for the particular implementation of such a test using images. Moni Naor's paper suggests other similar tests as well (e.g., simple multiple-choice quiz questions).
CAPTCHAs have a low false-negative rate. We define a false negative as an automated script that makes it past a CAPTCHA or other security measure. A false positive, on the other hand, is a regular user who is not able to pass the test. Few, if any, automated processes are able to decipher CAPTCHAs. As a result, they probably have the lowest false-negative rate and are very useful if one has to reliably discriminate against automated processes. Examples of CAPTCHAs at work include registration pages that allow users to register for a discussion forum. These functions should also include secondary validation mechanisms, such as sending an email to the new user.
Unfortunately, CAPTCHAs also have a high false-positive rate. A CAPTCHA can be hard for humans to decipher. Users with visual impairments, in particular, find it difficult to recognize the often badly distorted images. To provide some accommodation, some sites complement CAPTCHAs with audio files.
To implement CAPTCHAs, the page has to generate an image and needs a way to track which image was displayed to the user. There are a number of methods available to do so. The simplest is probably to use cookies. Other methods include using hidden form variables or the name of the CAPTCHA itself. See the PHP code snippet in Listing 1.
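As a rough illustration of the hidden form variable approach (this is not the article's Listing 1, and it is written in Python rather than PHP), the sketch below stores only a keyed hash of the expected answer in the form. The names `SECRET_KEY`, `new_challenge`, `sign`, and `verify` are hypothetical, and the image rendering itself is left out.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; a real site would load it from configuration.
SECRET_KEY = secrets.token_bytes(32)


def sign(text):
    """Keyed hash of the expected answer; this value can go into a hidden field."""
    return hmac.new(SECRET_KEY, text.upper().encode("utf-8"), hashlib.sha256).hexdigest()


def new_challenge(length=5):
    """Pick random CAPTCHA text and return (text, token).

    The text is drawn into the image; only the token is placed in the form,
    so neither the form nor the image name reveals the answer.
    """
    alphabet = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"  # leaves out I, O, 0 and 1
    text = "".join(secrets.choice(alphabet) for _ in range(length))
    return text, sign(text)


def verify(answer, token):
    """Return True if the typed answer matches the token from the hidden field."""
    expected = sign(answer.strip()).encode("ascii")
    return hmac.compare_digest(expected, token.encode("utf-8"))
```

A token built this way can be replayed together with its solved answer, so a real deployment would probably also tie it to a session or an expiry time.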
CAPTCHAs have to be well enough obfuscated to make optical character recognition (OCR) difficult or impossible. OCR techniques can be very simple, and some CAPTCHA implementations can be defeated by simple Perl scripts. Additionally, you have to be careful not to give away the content of the CAPTCHA via URL parameters and image names.
The PEAR project provides a module (Text_CAPTCHA) that implements CAPTCHAs in PHP. However, this module is only in alpha release and has not seen any recent activity. Another implementation in PHP, PhpCaptcha, is provided by Ed Eliot [3]. It includes audio CAPTCHAs.
A complete sample implementation of CAPTCHAs in PHP and other languages can be found at:
http://captcha.net/sample
KittenAuth
KittenAuth [1] is a specific extension of the CAPTCHA idea. Instead of asking users to type in text, it shows a set of images and asks the user to click on all images in a given category (e.g., all kittens, or, in another implementation, all squirrels). This task is somewhat easier than identifying scrambled text, but it requires significant screen real estate.
If a sufficient number of images is used, guessing the right set via brute force methods can be quite difficult. Each image represents a bit (selected or not selected). So, for nine images (a convenient 3x3 square), 512 different combinations are possible. Code that implements KittenAuth can be found at kittenauth.com. Various plugins for software such as Drupal are in the works.
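The arithmetic behind the 512 figure can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and it assumes a bot that picks one of the possible selections uniformly at random.

```python
def selection_count(num_images):
    """Each image is either selected or not, giving 2**n possible selections."""
    return 2 ** num_images


def blind_guess_probability(num_images):
    """Chance that a single uniformly random selection is the correct one."""
    return 1 / selection_count(num_images)


for n in (9, 16):
    print(n, "images:", selection_count(n), "combinations")
```

For the 3x3 grid this gives 512 combinations, so a single blind guess succeeds about 0.2% of the time under that assumption.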
Session Tracking
But, let's get back to our original problem, form spam. Form spam protection can be a bit less stringent than the protection needed on a Web site registration page. However, you should attempt to limit the spam to a reasonable level without deterring regular users from seeking customer support or contributing to a forum. We found that CAPTCHAs actually reduce legitimate contributions by about 20%. This is not acceptable for many applications.
Session tracking uses cookies or URL parameters to track a user from page view to page view. As a simple anti-spam measure, it will identify users who submit a form without first downloading the appropriate form page. This will only work to deter some of the simplest spam scripts. Many spammers now will first download the form and carefully submit cookies, URL parameters, and hidden form fields. However, this method is still useful in assigning spam scores.
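One way to sketch this idea in Python is to hand out a signed timestamp when the form page is served and to check it on submission. Everything named here (`SECRET_KEY`, `issue_token`, `seconds_since_issue`, `spam_score`, and the five-second threshold) is a hypothetical choice for illustration, not part of the setup described in this article.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"replace-with-a-random-server-side-secret"  # hypothetical placeholder


def _mac(issued):
    return hmac.new(SECRET_KEY, issued.encode("ascii"), hashlib.sha256).hexdigest()


def issue_token(now=None):
    """Called when the form page is served; send the result as a cookie or hidden field."""
    issued = str(int(time.time() if now is None else now))
    return issued + ":" + _mac(issued)


def seconds_since_issue(token, now=None):
    """Return the age of a valid token in seconds, or None if it is missing or forged."""
    issued, sep, mac = (token or "").partition(":")
    if not sep or not (issued.isascii() and issued.isdigit()):
        return None
    if not hmac.compare_digest(_mac(issued).encode("ascii"), mac.encode("utf-8")):
        return None
    current = time.time() if now is None else now
    return int(current) - int(issued)


def spam_score(token, now=None):
    """Contribute to a spam score instead of rejecting the post outright."""
    age = seconds_since_issue(token, now)
    if age is None:
        return 2  # the form page was never fetched, or the token was altered
    if age < 5:
        return 1  # submitted only seconds after the form was downloaded
    return 0
```

Because bots that fetch the form first will pass the token check, the result is better used as one input to a score than as a hard block.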
Key Word Filtering
Key word filtering is probably the simplest solution to form spam. Unlike email spam, form spam is at this point hardly ever obfuscated, and it frequently includes links to repeat-offender sites like rlk-host. A short key word library of 10-20 words can very successfully reduce form spam. However, this will not work for a forum with a wide range of topics. Finding appropriate disallowed key words is particularly hard if the forum includes a discussion of form spam or is a specialty site, like those discussing pharmaceuticals.
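A minimal key word filter might look like the following Python sketch. Apart from rlk-host, which is mentioned above, the entries in `BLOCKED_WORDS` are made-up examples, and a real list would be built from the spam a site actually receives.

```python
import re

# Hypothetical block list; only "rlk-host" comes from the text above.
BLOCKED_WORDS = ["casino", "viagra", "rlk-host"]

# Match each entry as a whole word, ignoring case.
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")(?!\w)",
    re.IGNORECASE,
)


def blocked_words_in(message):
    """Return the sorted list of distinct blocked words found in the message."""
    return sorted({match.group(0).lower() for match in _PATTERN.finditer(message)})


def is_spam(message):
    """True if the message contains at least one blocked word."""
    return bool(blocked_words_in(message))
```

Whole-word matching keeps the filter from firing on longer words that merely contain an entry, at the cost of missing simple variations such as plurals.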
Javascript
One goal of any defense is to increase the work an attacker must do, while not increasing the work required of a regular user or a defender. Any spam defense that increases the defender's work load more than it does the attacker's has the potential of resulting in a denial of service attack, as it will disadvantage a regular user over an attacker in the case of limited system resources. Javascript is one method that can be used to move processing to the client. For example, Javascript can be used to encrypt (or better "encode") an HTML page or a form. The client is then required to decrypt the page before submitting it.
Such a system can be bypassed by a determined attacker by including Javascript parsers in form spam bots. However, their inclusion significantly increases the attacker's work. Changing the Javascript to create the form is much easier than changing the script to parse it.
To the user, the Javascript will be totally invisible. The user is unlikely to experience any performance impact. The server will not care about Javascript either, as the script is delivered just like other static content and is never executed by the server (see Listing 2).
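The following Python sketch shows one simple way to produce such an encoded form on the server side. It is not the article's Listing 2: the function name `encode_form_as_script` is hypothetical, and percent-encoding is used here only as an easy example of an encoding that a browser can undo with the standard `decodeURIComponent` function.

```python
import json
from urllib.parse import quote


def encode_form_as_script(form_html):
    """Return a <script> element that rebuilds form_html in the browser.

    The markup is percent-encoded, so the page source contains no literal
    <form> or <textarea> tags for a simple scraper to find.
    """
    encoded = quote(form_html, safe="")
    return (
        '<script type="text/javascript">'
        "document.write(decodeURIComponent(" + json.dumps(encoded) + "));"
        "</script>"
    )


if __name__ == "__main__":
    print(encode_form_as_script(
        '<form method="post" action="/contact">'
        '<textarea name="message"></textarea></form>'
    ))
```

Percent-encoding leaves letters readable, so field names still appear in the page source. A stronger encoding would raise the cost for the attacker further, and the sketch assumes visitors have Javascript enabled.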
Style Sheets
Style sheets, similar to Javascript, will ask the user to prove that they are using an actual regular browser to visit the page. Style sheets have the ability to hide certain content from users. However, a script that is not considering style sheets is not able to distinguish the "hidden" parts of a page from visible ones. Note that this is not the same as using the "hidden" input elements. Instead, the "display: none" style is applied. In its simplest form, a page using this method would look like Listing 3.
More complex and harder to decode implementations are simple to achieve in PHP. A generic function to draw input form fields can easily be modified to draw a number of decoy form fields as well. See Listing 4 for a simple function to implement this technique for text areas.
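The decoy idea can be sketched in Python as follows (the article's own listings are in PHP, so this is an illustration under assumptions, not a copy of Listing 3 or Listing 4). The class name `extra`, the field names, and both function names are hypothetical.

```python
import html

# Rule for the site's style sheet; the class name "extra" is a made-up example.
DECOY_CSS = ".extra { display: none; }"


def textarea_with_decoys(real_name, decoy_names, rows=5, cols=40):
    """Return HTML for one visible textarea plus decoys hidden by a style sheet class."""
    parts = []
    for name in decoy_names:
        parts.append(
            '<div class="extra"><textarea name="%s" rows="%d" cols="%d"></textarea></div>'
            % (html.escape(name, quote=True), rows, cols)
        )
    parts.append(
        '<textarea name="%s" rows="%d" cols="%d"></textarea>'
        % (html.escape(real_name, quote=True), rows, cols)
    )
    return "\n".join(parts)


def looks_automated(submitted, decoy_names):
    """A browser user never sees the decoys, so text in any of them suggests a script."""
    return any(submitted.get(name, "").strip() for name in decoy_names)
```

Giving the decoys the obvious names (for example `message`) and the real field a bland one (for example `f7`) makes it more likely that a script fills in a decoy, at which point `looks_automated` returns True.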
Summary
For our own Web site, the contact form received approximately 100-150 spam messages per day. Initially, we used CAPTCHAs. As expected, form spam disappeared. However, legitimate messages dropped by about 10-20%, which we found unacceptable. After all, deleting a spam message doesn't take that much time.
Later, we implemented the style sheet method described above. Just one decoy form field, which was named in a more enticing manner than the actual valid form field, reduced form spam to 2-3 messages per day. Interestingly, the remaining messages typically came from one system that submitted the 2-3 messages within a few minutes. To learn more about the spam, we also measured the time between downloading the form and submitting it. For this spam, it was rather consistently 2-5 seconds. The systems used to spam were located in Korea, India, and Malaysia. I assume that these are "spam farms", companies that specialize in hiring cheap labor to submit such spam to various forums.
We implemented this method on two very "spam heavy" forms on our Web site. Spam to non-protected forms stayed at the previous high level. Even our "poll", which offers a one-line comment form, receives a lot of spam.
References
1. KittenAuth -- http://www.kittenauth.com
2. Paper by Moni Naor -- http://www.wisdom.weizmann.ac.il/~naor/PAPERS/human_abs.html
3. PhpCAPTCHA -- http://www.ejeliot.com/pages/2
As Chief Research Officer for the SANS Institute, Johannes is currently responsible for the SANS Internet Storm Center (ISC) and the GIAC Gold program. He founded DShield.org in 2000, which is now the data collection engine behind the ISC. His work with the ISC has been widely recognized, and in 2004, Network World named him one of the 50 most powerful people in the networking industry. Prior to working for SANS, Johannes worked as a lead support engineer for a web development company and as a research physicist. Johannes holds a Ph.D. in Physics from SUNY Albany and is located in Jacksonville, FL.