Combating
Inappropriate Content in Web-Based Environments
Jeffrey Fulmer
"Markets are conversations," according to Thesis One of the 1999
Cluetrain Manifesto. Its authors claim new forms of communication
have the power to transform contemporary business practices. The
Internet provides a means for people to reconnect to the traditional
marketplace and exchange information about products and services.
During the age of mass media, most information flowed in one direction
-- from producer to consumer. Communication between consumers was
limited in large part by geography. You could warn your friends
about a faulty product, but on a global scale everyone relied on
industry watchdogs to convey such information. That is no longer
the case.
Consumers now rely on a wealth of online data to help guide their
purchasing decisions. They can compare pricing information in just
a few minutes. They can meet satisfied and dissatisfied consumers
from all over the world. The marketplace is considerably more transparent,
and companies now realize they cannot control the conversations.
At the same time, they see a need for active participation.
If their products are inaccurately portrayed, then it is necessary
to correct the record. If consumers have legitimate complaints,
then improvements must be made. Slowly but surely, they are opening
consumer channels.
As an administrator, developer, or systems architect, you may
be asked to help open new channels for customers. This could take
the form of a comment section on your corporate Web site, a message
board, or an affiliated blog. Business people who have long controlled
the conversations will undoubtedly be reluctant to take such steps.
They will worry about the types of things that will appear in those
forums. They have every right to worry.
In this article, I will examine pitfalls associated with a Web-based
open message forum and discuss techniques to provide such an environment
safely. This information will help alleviate concerns during the
planning and architectural stages and provide a strategy to prevent
inappropriate content from appearing in venues associated with your
company.
Threats
When you design and build a system, it's important to recognize
its potential threats. For the purpose of this discussion, we are
concerned specifically with threats that arise from user-generated
content. As an astute Sys Admin reader, you have no doubt
applied due diligence and hardened your environment against
the many documented threats to a server on an open network. We're
concerned with a more nuanced threat that manifests in two forms:
inappropriate comments and link spam.
The Internet contains a vast array of self-correcting information.
After Abe Vigoda was wrongly reported dead, a Firefox extension
was provided free of charge for anyone who wanted up-to-the-minute
information on the aging actor's status. (At the time of this writing,
Mr. Vigoda was still very much alive.) Despite that abundance of
information, most people learn the same introductory lesson online
-- the world is full of social deviants. An open conversation invites participation
from all types of people, but just one jerk can detract from its
value. On a woodworking site, I watched in horror as one reader
laughed at a woman whose dog died from eating glue. By almost any
standard, that's inappropriate. If you provide an open forum, you
can expect some loutish behavior. Fortunately, most people aren't
anti-social. If you put good measures in place to handle this sort of behavior,
you can still provide user-generated content.
Another concern arises from the incredible success of a company
whose code of conduct simply reads "Don't be evil". Google Inc.
may have met that standard, but its technology has helped unleash
"link spammers". The framework in which they operate began with
a good idea. Google's founders, Lawrence Page and Sergey Brin, applied
a well-worn academic rule to the Internet. Among scholars, it is
widely held that the more a paper is referenced, the greater its
importance. My work has been referenced just once in academic papers,
while some guy named Einstein appears quite frequently. In the world
of Google, Einstein ranks with nytimes.com while I'm a neglected
blog. Thus, Einstein's PageRank is higher than mine.
Link spammers exploit sites that allow user-generated content
in order to boost their Google PageRank. They'll add quaint comments
like "nice site" with a link to an online gambling venture. These
seemingly innocuous comments are not only annoying, but they can be damaging.
To prevent people from gaming the system with a bunch of inexpensive
sites that all link to one another -- a link farm -- Google punishes
any site that references one. If link spammers flood your
site with link-farm URLs, then imagine what happens to the high
PageRank you worked so hard to achieve. It takes a nosedive.
Planning
Before you open the corporate Web site to user-generated content,
it's important to meet with all the business units to discuss the
ramifications. The best way to limit the damage that Internet deviants
can inflict is to limit the things that they can do. During these
meetings, you should gather the minimum requirements. Should posting
be available to all, or just to registered users? Keep in mind,
registration will not prevent inappropriate behavior; it will only
provide a hurdle. Ask questions and try to steer your audience toward
the tightest permissible features. If it is not necessary to post
images, then the easiest way to prevent pornography from appearing
on your Web site is to remove the IMG tag. It's unlikely
that users will need to add JavaScript code to your Web site, so
be sure you strip that out, too.
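One minimal way to enforce such a policy in PHP is strip_tags() with a short allowlist. The helper name and the allowed tags below are illustrative assumptions, not a prescription:

```php
<?php
// Sketch: scrub a user comment down to a few harmless tags.
// The allowlist here is an assumption -- tighten it to match
// your own requirements.
function sanitize_comment($text) {
    // Remove <script> blocks entirely, body included, since
    // strip_tags alone would leave the script's text behind
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $text);
    // Drop every remaining tag except basic inline formatting;
    // in particular, IMG does not survive
    return strip_tags($text, '<b><i><em><strong>');
}

echo sanitize_comment('Nice <b>site</b>! <img src="x.jpg">');
?>
```

Anything more permissive than this deserves scrutiny during the requirements meetings described above.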
Once you establish what users are allowed to do, you need to characterize
inappropriate content. The simplest place to begin is foul language.
While George Carlin's Seven Dirty Words may be acceptable on some
Web sites, they provide a good starting point for a block list at
most businesses. Other types of behavior are more difficult to characterize.
For example, you may agree that it's wrong to berate another user
but permissible to correct one. You probably don't want inaccurate
information on your site, but how do you proceed once you encounter
it? You may want someone to approve every comment before it's published.
These requirements will drive your design.
Beyond business requirements, there are some provisions you will
need in order to administer this new feature. A comprehensive list
of all new comments is necessary to make sure inappropriate content
doesn't appear without your knowledge. If users can contribute to
discussions throughout the site, then the only way you can monitor
everything is with one comprehensive view. You will also need the
ability to flag content by words, URLs, and IP address. Finally,
you should log all content marked as questionable. This will give
you a chance to catch mistakes and to hone your filtering.
If this is starting to sound a lot like SMTP spam filtering, then
you're right. It's exactly like that. You may even employ many of
the same tools to combat this scourge. Once you have your requirements,
you can architect your system. You have the option of leveraging
existing anti-spam tools or rolling your own.
Fortunately, both of the biggest threats operate with limited resources.
Both are persistent, but neither has a lot to say. Anti-social types
have a small vocabulary of vulgar words, and they tend to rely on
a single IP address issued by their ISPs. Link spammers have access
to an incredible number of open proxy servers, but their payload
is limited. They are trying to add their URLs to your Web site.
A link spammer can only purchase so many domains before his venture
becomes cost-ineffective. If you block content that contains vulgar
words, links to link farms, or posts from "bad" IP addresses, then
you'll eventually stop the bad guys in their tracks. Given a limited
number of things to block, I recently decided to roll my own system.
Implementation
I matched four configuration files to the problem areas described
above: one for keywords, one for authors' names, one for IP addresses,
and one for links. Each file is constructed in INI style, where
each section header is the point value assigned for a match in
that section. Negative headers serve as a type of whitelisting.
If a post scores above a predetermined threshold, then it is flagged
as spam. Here is a snippet from my IP address file:
#
# IP Address list
# Weighted scores are placed inside brackets
# Whitelist addresses with negative values
#
[-10]
172.16.24.1 # white list our friends
[5]
82.234.238.65 # France [spr69-1-82-234-238-65.fbx.proxad.net]
68.49.160.138 # United States [pcp09561363pcs.rtchrd01.md.comcast.net]
[10]
220.95.169.196 # 80; open anonymous proxy in Australia [Unknown host]
222.65.1.205 # 80; open anonymous proxy in United States [Unknown host]
My threshold is 8 points, which means some IP addresses will not be
allowed to post under any circumstances. I generally flag wide-open
proxies for immediate denial. Dynamic IP addresses are given a high
hurdle, but since their ownership changes, we don't want to deny one
person for the sins of another. Listing 1 is a sample PHP code snippet
used to match the IP address of the current poster against the INI
file above.
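Listing 1 itself ships with the article's downloadable code; the following is only a hedged reconstruction of the idea, assuming the INI layout shown above, with score_ip as an invented helper name:

```php
<?php
// Sketch of an IP-scoring routine in the spirit of Listing 1.
// Section headers such as [5] carry the score awarded to any
// address listed beneath them; negative sections whitelist.
function score_ip($file, $ip) {
    $score   = 0;
    $current = 0;                    // weight of the current section
    foreach (file($file) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // drop comments
        if ($line === '') {
            continue;
        }
        if (preg_match('/^\[(-?\d+)\]$/', $line, $m)) {
            $current = (int)$m[1];   // entering a new section
        } elseif ($line === $ip) {
            $score += $current;      // address matched this section
        }
    }
    return $score;
}
?>
```

With the snippet above, 220.95.169.196 would score 10 and be rejected outright, while the whitelisted 172.16.24.1 would score -10.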
We use a similar function to match against keywords and authors.
Here is a sample keywords INI file:
# each section indicates the number of
# points awarded for each match on the
# list. To whitelist an entry, place it
# in a negative [-10] category.
#
[1]
party
invited
odds
download
pill
free
online
texas
[2]
drugs
stud
mortgage
poker
check out
Words in the higher-scoring sections get a little more risqué.
Our SpamCheck object uses a method called _isSpamKeyword to parse
this file and return a score. The same is true of its _isSpamAuthor
method. The code for this article can be downloaded from the Sys
Admin Web site: http://www.sysadminmag.com/code/.
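The actual _isSpamKeyword method is part of that download; this sketch merely shows how such scoring might work against the keywords file above, with the function name and parsing details as assumptions:

```php
<?php
// Hedged sketch of keyword scoring in the spirit of _isSpamKeyword.
// Matching is a case-insensitive substring test, so multi-word
// entries such as "check out" work the same way as single words.
function score_keywords($file, $content) {
    $score   = 0;
    $current = 0;                    // weight of the current section
    $content = strtolower($content);
    foreach (file($file) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));
        if ($line === '') {
            continue;
        }
        if (preg_match('/^\[(-?\d+)\]$/', $line, $m)) {
            $current = (int)$m[1];   // entering a new section
        } elseif (strpos($content, strtolower($line)) !== false) {
            $score += $current;      // listed phrase found in the post
        }
    }
    return $score;
}
?>
```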
While all spammers are equally obnoxious, we have a special function
just for link spammers. Whenever a link spammer adds a link to our site,
we want to kill the entire post. URL paths are cheap, but domain
names are not. Our configuration file is filled with nothing but
hostnames and domains. If a user's URL matches a domain, then we
kill the entire post. Here is a configuration snippet:
# This file contains domains that spammers
# link to. For each domain in the file, spam
# detection will return 10 points.
vsymphony.com
vthought.com
nemasoft.com
luxuryrenting.net
knowtax.net
windowscasino.com
mydivx.info
petsellers.net
poker4spain.com
vcrap[s]?.com
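One way such a check might be implemented is sketched below; the helper name and parsing details are assumptions, not the article's downloadable code. Because entries such as vcrap[s]?.com are regular expressions, each line is anchored against the hostname and tested with preg_match:

```php
<?php
// Sketch: pull every URL out of a post and test its hostname
// against the patterns in the domain file. One match kills
// the whole post.
function has_spam_link($file, $content) {
    // capture the hostname portion of each http(s) URL
    preg_match_all('#https?://([^/\s"\'>]+)#i', $content, $m);
    // strip comments and blank lines from the domain file
    $patterns = array_filter(array_map('trim',
        preg_replace('/#.*/', '', file($file))));
    foreach ($m[1] as $host) {
        foreach ($patterns as $p) {
            // match the listed domain at the end of the hostname
            if (preg_match('/(^|\.)' . $p . '$/i', $host)) {
                return true;
            }
        }
    }
    return false;
}
?>
```

Anchoring at the end of the hostname means www.windowscasino.com is caught by the bare windowscasino.com entry, while an unlisted domain passes.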
Finally, here is the function that pulls everything together:
<?php
require 'include/spam.php';

$threshold = 8;
$spam = new SpamCheck();

/**
 * isSpam returns true if the content scores
 * greater than $threshold; false if not.
 */
if ($spam->isSpam($threshold, $content, $name,
                  $_SERVER['REMOTE_ADDR'])) {
    // log the transaction for intelligence gathering
    echo "This smells like spam<br>";
} else {
    // allow the user to post the comment
    if (!empty($name) && !empty($content)) {
        echo "The following comment is NOT spam:<br>";
        echo "$name said, $content";
    }
}
?>
If $spam->isSpam returns true -- meaning the post scored above
$threshold -- we log vital details from the post. This allows us to build our configuration
files and stop new spam before it happens. If our software eats a
spammer's post, he's likely to try again since the reason for failure
remains a mystery. If he attacks us from another IP address, then
we can add another one to our ip.ini file. As mentioned above, link
spammers and social deviants have limited vocabularies. The more information
we can collect on them, the sooner we can dispense with them.
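That logging step can be as humble as one tab-separated line per flagged post, so new IPs, names, and URLs can be harvested back into the INI files. The helper and format below are illustrative assumptions:

```php
<?php
// Sketch: append one tab-separated record per flagged post.
// The path and field order are assumptions for illustration.
function log_spam($logfile, $score, $name, $ip, $content) {
    $line = implode("\t", array(
        date('Y-m-d H:i:s'),
        $score,
        $ip,
        $name,
        // flatten the post so the record stays on one line
        str_replace(array("\t", "\n"), ' ', $content),
    ));
    file_put_contents($logfile, $line . "\n", FILE_APPEND);
}
?>
```

A nightly pass over this file makes it easy to spot a spammer's new IP address or domain and promote it into the configuration.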
Other Strategies
For 18 years, the International Obfuscated C Code Contest has
provided a safe forum for blurry, unreadable code. Although you may
not win this year's contest, obfuscation can help you keep spammers
at bay. While we may flatter ourselves that we're the center of
the universe, spammers actually care more about search engines than
they do about us. If we can make their links unreadable to googlebots,
then we can discourage them from cluttering our site with spam.
Rather than publishing links as entered, you can encrypt them and
use JavaScript to decrypt the URL when a user clicks the link. For
example:
<a href="javascript:decryptMe('uggc://jjj.wbrqbt.bet/')">Click here</a>
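The scrambled URL in that example is simply ROT13. On the server side, PHP's built-in str_rot13() can produce it before the page is written, and a client-side decryptMe() (not shown) would apply the same substitution in reverse. A sketch, with make_safe_link as an invented helper:

```php
<?php
// Sketch: rotate each stored URL with str_rot13() before output
// so the href is meaningless to a crawler. A client-side
// decryptMe() would apply the same ROT13 substitution to restore it.
function make_safe_link($url, $text) {
    $scrambled = str_rot13($url);
    return '<a href="javascript:decryptMe(\'' . $scrambled . '\')">'
         . htmlspecialchars($text) . '</a>';
}

echo make_safe_link('http://www.joedog.org/', 'Click here');
?>
```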
As the astute reader can see, the method of encryption doesn't have
to be strong; it just has to confuse the googlebot. (I look forward
to seeing some of you on that site.) You can also insert the nofollow
attribute that Google introduced so that sites can be referenced without
contributing to their PageRank. Here's an example:
<a href="http://www.whitehouse.gov/" rel="nofollow">Click here</a>
Once spammers locate your site, they are likely to automate the process
of adding links. One method of prevention is the Turing test,
an assessment designed to distinguish computers from humans.
One of the more popular offerings is the CAPTCHA test, a Completely
Automated Public Turing Test to tell Computers and Humans Apart. If
you spend any time on the World Wide Web, then you've seen this one.
You have to enter the letters that appear in a wavy image to verify
that you are neither a computer script nor a cyborg. The downside
of CAPTCHA tests is their intrusiveness. Personally, I don't care
for them.
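If a CAPTCHA image feels too intrusive, a plain-text challenge in the same Turing-test spirit can raise the bar for scripts at little cost to humans. This is only a sketch under stated assumptions -- the question format and field handling are my own, and it presumes session_start() has already been called:

```php
<?php
// Sketch: a lightweight arithmetic challenge. Far weaker than a
// CAPTCHA image, but far less intrusive. Assumes session_start()
// has been called so $_SESSION is available.
function make_challenge() {
    $a = rand(1, 9);
    $b = rand(1, 9);
    $_SESSION['answer'] = $a + $b;   // remember the expected reply
    return "What is $a + $b?";
}

function check_challenge($reply) {
    return isset($_SESSION['answer'])
        && (int)$reply === $_SESSION['answer'];
}
?>
```

The form would print make_challenge() next to the comment box and call check_challenge() before any spam scoring is done.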
Conclusion
Combating inappropriate user-generated content is a cyclic process.
Build a system that allows you to use the abuser's own content against
him. As you catch or flag inappropriate posts, you can add additional
content to your configuration files. New information will enable
you to flag more content and further enhance your configuration
files. No system is perfect. Make sure you create the means to review
all new user-generated content, and actively monitor your site.
References
Abe Vigoda Firefox Extension -- http://www.vesterman.com/FirefoxExtensions/AbeVigodaStatus
The International Obfuscated C Code Contest -- http://www0.us.ioccc.org/main.html
The Carnegie Mellon CAPTCHA Project -- http://www.captcha.net/
Jeffrey Fulmer has administered enterprise computer systems
professionally since 1995. He is an open source software developer
and the primary author of siege. He currently resides in Pennsylvania
with his wife and English bulldog.