
Combating Inappropriate Content in Web-Based Environments

Jeffrey Fulmer

"Markets are conversations," according to Thesis One of the 1999 Cluetrain Manifesto. Its authors claim new forms of communication have the power to transform contemporary business practices. The Internet provides a means for people to reconnect to the traditional marketplace and exchange information about products and services. During the age of mass media, most information flowed in one direction -- from producer to consumer. Communication between consumers was limited in large part due to geography. You could warn your friends about a faulty product, but we all relied on industry watchdogs to convey such information on a global scale. That is no longer the case.

Consumers now rely on a wealth of online data to help guide their purchasing decisions. They can compare pricing information in just a few minutes. They can meet satisfied and dissatisfied consumers from all over the world. The marketplace is considerably more transparent, and companies now realize they cannot control the conversations. At the same time, they now see a need for active participation. If their products are inaccurately portrayed, then it is necessary to correct the record. If consumers have legitimate complaints, then improvements must be made. Slowly but surely, they are opening consumer channels.

As an administrator, developer, or systems architect, you may be asked to help open new channels for customers. This could take the form of a comment section on your corporate Web site, a message board, or an affiliated blog. Business people who have long controlled the conversations will undoubtedly be reluctant to take such steps. They will worry about the types of things that will appear in those forums. They have every right to worry.

In this article, I will examine pitfalls associated with a Web-based open message forum and discuss techniques to provide such an environment safely. This information will help alleviate concerns during the planning and architectural stages and provide a strategy to prevent inappropriate content from appearing in venues associated with your company.

Threats

When you design and build a system, it's important to recognize its potential threats. For the purpose of this discussion, we are concerned specifically with threats that arise from user-generated content. Since you're an astute Sys Admin reader, let's assume that you've applied due diligence and hardened your environment against the many documented threats to a server on an open network. We're concerned with a more nuanced threat that manifests in two forms: inappropriate comments and link spam.

The Internet contains a vast array of self-correcting information. After Abe Vigoda was wrongly reported dead, a Firefox extension was provided free of charge for anyone who wanted up-to-the-minute information on the aging actor's status. (At the time of this writing, Mr. Vigoda was still very much alive.) Despite an abundance of information, most people learn the same introductory online lesson -- the world is full of social deviants. An open conversation invites participation from all types of people, but just one jerk can detract from its value. On a woodworking site, I watched in horror as one reader laughed at a woman whose dog died from eating glue. By almost any standard, that's inappropriate. If you provide an open forum, you can expect some loutish behavior. Fortunately, most people aren't anti-social. If you put good measures in place to handle this sort of behavior, you can still provide user-generated content.

Another concern arises from the incredible success of a company whose code of conduct simply reads "Don't be Evil". Google Inc. may have met that standard, but its technology has helped unleash "link spammers". The framework in which they operate began with a good idea. Google's founders, Lawrence Page and Sergey Brin, applied a well-worn academic rule to the Internet. Among scholars, it is widely held that the more a paper is referenced, the greater its importance. My work has been referenced just once in academic papers, while some guy named Einstein appears quite frequently. In the world of Google, Einstein ranks with nytimes.com while I'm a neglected blog. Thus, Einstein's PageRank is higher than mine.

Link spammers exploit sites that allow user-generated content in order to boost their Google PageRank. They'll add quaint comments like "nice site" with a link to an online gambling venture. These seemingly innocuous comments are not only annoying, but they can be damaging. To prevent people from gaming the system with a bunch of inexpensive sites that all link to one another -- a link farm -- Google punishes any site that references a link farm. If link spammers flood your site with link farm links, then imagine what happens to the high PageRank you worked so hard to achieve. It takes a nosedive.

Planning

Before you open the corporate Web site to user-generated content, it's important to meet with all the business units to discuss the ramifications. The best way to limit the damage that Internet deviants can inflict is to limit the things that they can do. During these meetings, you should gather the minimum requirements. Should posting be available to all, or just to registered users? Keep in mind, registration will not prevent inappropriate behavior; it will only provide a hurdle. Ask questions and try to steer your audience toward the tightest permissible features. If it is not necessary to post images, then the easiest way to prevent pornography from appearing on your Web site is to remove the IMG tag. It's unlikely that users will need to add JavaScript code to your Web site, so be sure you strip that out, too.

Once you establish what users are allowed to do, you need to characterize inappropriate content. The simplest place to begin is foul language. While George Carlin's Seven Dirty Words may be acceptable on some Web sites, they provide a good starting point for a banned-word list at most businesses. Other types of behavior are more difficult to characterize. For example, you may agree that it's wrong to berate another user but that correcting others is permitted. You probably don't want inaccurate information on your site, but how do you proceed once you encounter it? You may want someone to approve every comment before it's published. These requirements will drive your design.

Beyond business requirements, there are some provisions you will need in order to administer this new feature. A comprehensive list of all new comments is necessary to make sure inappropriate content doesn't appear without your knowledge. If users can contribute to discussions throughout the site, then the only way you can monitor everything is with one comprehensive view. You will also need the ability to flag content by words, URLs, and IP address. Finally, you should log all content marked as questionable. This will give you a chance to catch mistakes and to hone your filtering.

If this is starting to sound a lot like SMTP spam filtering, you're right. It's exactly like that. You may even employ many of the same tools to combat this scourge. Once you have your requirements, you can architect your system. You have the option of leveraging existing anti-spam tools or rolling your own system.

Fortunately, the biggest threats have limited resources. Both are persistent, but neither has a lot to say. Anti-social types have a small vocabulary of vulgar words, and they tend to rely on a single IP address issued by their ISPs. Link spammers have access to an incredible number of open proxy servers, but their payload is limited. They are trying to add their URLs to your Web site, and a link spammer can only purchase so many domains before his venture becomes cost-ineffective. If you block content that contains vulgar words, links to link farms, or posts from "bad" IP addresses, then you'll eventually stop the bad guys in their tracks. Given a limited number of things to block, I recently decided to roll my own system.

Implementation

I created four configuration files, one for each of the problem areas above: one file for keywords, one for authors' names, one for IP addresses, and another for links. Each file is constructed in INI style, where each section header is the point value assigned for a match in that section. Negative headers serve as a type of whitelisting. If a post scores above a predetermined threshold, then it is flagged as spam. Here is a snippet from my IP address file:

#
# IP Address list
# Weighted scores are placed inside brackets
# Whitelist addresses with negative values
#
[-10]
172.16.24.1    # white list our friends
[5]
82.234.238.65  # France [spr69-1-82-234-238-65.fbx.proxad.net]
68.49.160.138  # United States [pcp09561363pcs.rtchrd01.md.comcast.net]
[10]
220.95.169.196 # 80; open anonymous proxy in Australia [Unknown host]
222.65.1.205   # 80; open anonymous proxy in United States [Unknown host]

My threshold is 8 points, which means some IP addresses will not be allowed to post under any circumstances. I generally flag wide-open proxies for immediate denial. Dynamic IP addresses are given a high hurdle, but since their ownership changes, we don't want to deny one person for the sins of another. Listing 1 is a sample PHP code snippet used to match the IP address of the current poster against the INI file above.

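Listing 1 holds the actual matching code, but the idea can be sketched in a few lines. PHP's parse_ini_file() expects key=value pairs, so a file of bare addresses needs a small hand-rolled reader; the function name and logic below are illustrative, not the listing's:

```php
<?php
// Score an IP address against an INI-style file whose [N] section
// headers are point values and whose entries are bare addresses.
// (Illustrative sketch; names do not match the downloadable code.)
function score_ip($file, $ip) {
    $score  = 0;
    $points = 0;                     // weight of the current section
    foreach (file($file) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '') continue;
        if (preg_match('/^\[(-?\d+)\]$/', $line, $m)) {
            $points = (int)$m[1];    // new section header
        } elseif ($line === $ip) {
            $score += $points;       // add this section's weight
        }
    }
    return $score;
}
?>
```

A total at or above the threshold marks the post as spam, and a whitelisted address in a negative section can offset points scored elsewhere.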
We use a similar function to match against keywords and authors. Here is a sample keywords INI file:

# each section indicates the number of 
# points awarded for each match on the 
# list. To whitelist an entry, place it
# in a negative [-10] category.
#
[1]
  party
  invited
  odds
  download
  pill
  free
  online
  texas
[2]
  drugs
  stud
  mortgage
  poker
  check out

Words that score higher start to get a little more risqué. Our SpamCheck object uses a method called _isSpamKeyword to parse this file and return a score. The same is true of its _isSpamAuthor method. The code for this article can be downloaded from the Sys Admin Web site: http://www.sysadminmag.com/code/.

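The same walk-the-file approach works for keywords: remember the current section's weight and add it for each case-insensitive match in the post. The sketch below is illustrative; it is not the actual _isSpamKeyword from the downloadable code:

```php
<?php
// Sum points for every listed word or phrase found in $content.
// Section headers like [2] carry the weight; entries are phrases.
// (Illustrative sketch, not the article's actual _isSpamKeyword.)
function score_keywords($file, $content) {
    $score  = 0;
    $points = 0;
    foreach (file($file) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '') continue;
        if (preg_match('/^\[(-?\d+)\]$/', $line, $m)) {
            $points = (int)$m[1];    // new section header
        } elseif (stripos($content, $line) !== false) {
            $score += $points;       // phrase found in the post
        }
    }
    return $score;
}
?>
```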
Link spammers are equally obnoxious, but we have a special function just for them. Whenever a link spammer adds a link to our site, we want to kill the entire post. URL paths are cheap, but domain names are not, so our configuration file is filled with nothing but hostnames and domains. If a user's URL matches one of them, the entire post dies. Here is a configuration snippet:

# This file contains domains that spammers
# link to. For each domain in the file, spam
# detection will return 10 points.
vsymphony.com
vthought.com
nemasoft.com
luxuryrenting.net
knowtax.net
windowscasino.com
mydivx.info
petsellers.net
poker4spain.com
vcrap[s]?.com
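
A check against this list might treat each entry as a pattern, which is why an entry like vcrap[s]?.com works. The function below is an illustrative sketch, not the downloadable code:

```php
<?php
// Return true if any URL in $content points at a listed spam
// domain. Each line of $file is treated as a pattern, so entries
// like vcrap[s]?.com also match. (Illustrative sketch.)
function links_to_spammer($file, $content) {
    foreach (file($file) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '') continue;
        // match the domain anywhere inside an http(s):// URL
        if (preg_match('#https?://[^\s"\'>]*' . $line . '#i', $content)) {
            return true;   // one bad domain kills the whole post
        }
    }
    return false;
}
?>
```

Because a single matching domain kills the whole post, there is no need to weight these entries.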

Finally, here is the function that pulls everything together:

<?php
  $threshold = 8;
  require 'include/spam.php';
  $spam = new spamcheck();
  // $name and $content come from the submitted comment form
  $name    = $_POST['name'];
  $content = $_POST['content'];
  /**
   * isSpam returns true (1) if the content scores
   * greater than $threshold; false (0) if not.
   */
  if($spam->isSpam($threshold, $content, $name,
     $_SERVER['REMOTE_ADDR'])){
    // log the transaction for intelligence gathering
    echo "This smells like spam<br>";
  } else {
    // allow the user to post the comment
    if(!empty($name) && !empty($content)){
      echo "The following comment is NOT spam:<br>";
      echo "$name said, $content";
    }
  }
?>

When a post's score exceeds $threshold, isSpam returns true and we log vital details from the post. This allows us to build our configuration files and stop new spam before it happens. If our software eats a spammer's post, he's likely to try again, since the reason for failure remains a mystery. If he attacks us from another IP address, then we can add it to our ip.ini file. As mentioned above, link spammers and social deviants have limited vocabularies. The more information we can collect on them, the sooner we can dispense with them.
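The log itself can be as simple as one line per rejected post. The format below is my own invention, not the downloadable code's, but it captures the fields worth mining later:

```php
<?php
// Append one line per rejected post: timestamp, IP, author, and
// the content itself, so new IPs and domains can be harvested
// for the configuration files. (Illustrative sketch; the log
// format is an assumption.)
function log_spam($logfile, $ip, $name, $content) {
    $line = sprintf("%s|%s|%s|%s\n",
        date('Y-m-d H:i:s'), $ip, $name,
        str_replace("\n", ' ', $content));  // keep one post per line
    file_put_contents($logfile, $line, FILE_APPEND);
}
?>
```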

Other Strategies

For 18 years, the International Obfuscated C Code Contest has provided a safe forum for blurry unreadable code. Although you may not win this year's contest, obfuscation can help you keep spammers at bay. While we may flatter ourselves that we're the center of the universe, spammers actually care more about search engines than they do about us. If we can make their links unreadable to googlebots, then we can discourage them from cluttering our site with spam. Rather than publishing links as entered, you can encrypt them and use JavaScript to decrypt the URL when a user clicks the link. For example:

<a href="javascript:decryptMe('uggc://jjj.wbrqbt.bet/')">Click here</a>

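The cipher in that example is just ROT13 (the href above decodes to http://www.joedog.org/), so the server-side encoding step could be as simple as PHP's built-in str_rot13(). The helper below is an illustrative sketch; decryptMe() is assumed to be a small client-side JavaScript decoder:

```php
<?php
// Encode a user-supplied URL so googlebot sees gibberish while a
// small JavaScript helper can still decode it on click. ROT13 is
// enough: the goal is obfuscation, not secrecy. (Illustrative
// sketch; decryptMe() is a hypothetical client-side decoder.)
function obfuscate_link($url, $text) {
    return sprintf('<a href="javascript:decryptMe(\'%s\')">%s</a>',
        str_rot13($url), htmlspecialchars($text));
}

echo obfuscate_link('http://www.joedog.org/', 'Click here');
// <a href="javascript:decryptMe('uggc://jjj.wbrqbt.bet/')">Click here</a>
?>
```
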

As the astute reader can see, the method of encryption doesn't have to be strong; it just has to confuse the googlebot. (I look forward to seeing some of you on that site.) You can also insert the rel="nofollow" attribute, which Google introduced so that sites can be referenced without contributing to their PageRank. Here's an example:

<a href="http://www.whitehouse.gov/" rel="nofollow">Click here</a>

Once spammers locate your site, they are likely to automate the process of adding links. One method of prevention is the Turing test, an assessment designed to differentiate computers from humans. One of the more popular offerings is the CAPTCHA test, a Completely Automated Public Turing test to tell Computers and Humans Apart. If you spend any time on the World Wide Web, then you've seen this one. You have to enter the letters that appear in a wavy image to verify that you are neither a computer script nor a cyborg. The downside of CAPTCHA tests is their intrusiveness. Personally, I don't care for them.
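A lighter-weight Turing test is a plain-text challenge question. The sketch below is my own example, not part of the CAPTCHA project: it stores the answer in the session and checks it when the comment is submitted.

```php
<?php
// A low-tech Turing test: ask a small arithmetic question and
// check the answer on submit. Scripts that blindly POST links
// fail it; humans barely notice. (Illustrative sketch.)
session_start();

function make_challenge() {
    $a = rand(1, 9);
    $b = rand(1, 9);
    $_SESSION['answer'] = $a + $b;   // remember the right answer
    return "What is $a + $b?";
}

function check_challenge($reply) {
    return isset($_SESSION['answer'])
        && (int)$reply === $_SESSION['answer'];
}
?>
```

Print the question from make_challenge() next to the comment form, and reject the post unless check_challenge() approves the reply.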

Conclusion

Combating inappropriate user-generated content is a cyclical process. Build a system that allows you to use an abuser's own content against him. As you catch or flag inappropriate posts, you can add their content to your configuration files, and that new information will enable you to flag still more content and further refine your filters. No system is perfect, so make sure you create the means to review all new user-generated content, and actively monitor your site.

References

Abe Vigoda Firefox Extension -- http://www.vesterman.com/FirefoxExtensions/AbeVigodaStatus

The International Obfuscated C Code Contest -- http://www0.us.ioccc.org/main.html

The Carnegie Mellon CAPTCHA Project -- http://www.captcha.net/

Jeffrey Fulmer has administered enterprise computer systems professionally since 1995. He is an open source software developer and the primary author of siege. He currently resides in Pennsylvania with his wife and English bulldog.