Combating Link Spam
Jeffrey Fulmer
In the first article in this series, I discussed
techniques and strategies to combat inappropriate content in Web-based
message formats (see March, 2006 Sys Admin). As companies continue to leverage the Internet to open
new avenues of communication with customers,
they make their Web sites vulnerable to inappropriate content. Abusive
behavior is one concern. The pseudo-anonymous nature of the Internet often serves to break obnoxious caterpillars out of their cocoons.
Many quiet wallflowers have found themselves transformed into loud bullies
in Web-based message forums. Fortunately, this type of behavior is easy to
stop.
A common characteristic among Internet bullies is the
language they employ. They like vulgar words. The simplest way to minimize
their obnoxiousness is to eliminate their repertoire. In the first article,
I showed how to reject articles based on keyword filtering. A bully can
still be obnoxious without vulgar words, but
his attacks are far less offensive. You can
eliminate most persistent problems with IP blacklisting. A typical bully
will be discouraged by these tactics.
Link spammers are a more persistent problem. They tend
to have programming skills and a comprehensive understanding of Internet
protocols. More importantly, they are highly motivated. Last year, The Register interviewed a
link spammer who claims the PPC (porn, pills, and casinos) sites to which
he links generate revenues from £100,000 to £200,000 a month.
He earns a piece of that action. The link spammer floods tens of thousands
of blogs and message boards with links to his landing sites. He generates
income by sending click-throughs to the PPC sites. Never underestimate the
motivating power of money.
One of the sites I administer was locked down with
the techniques described in my previous article. The filtering system
successfully caught more than 200 spam attempts each day. Legitimate users
registered no complaints of false positive blocks. We were literally
batting 1.000. Life was good.
Suddenly things turned for the worse. A link spammer
started to penetrate our defenses. Each morning the comment section was
flooded with links to PPC sites. Our spammer was able to post these
messages because his comments were in Italian. I collected keywords,
domains, and IP addresses from his messages and appended them to our
filter. The comments kept coming. As fast as I added domains, he acquired
new ones. Sometimes he linked to sites he compromised with content uploaded
to, for example, a geology department's Web site at a major
university.
According to the logs, our filter was capturing the
vast majority of his attempts. Each morning, we would log hundreds of
failed efforts before we were hit by a flood of link spam. Our spammer
invested hours crafting messages through various IP addresses until he
finally hit the right combination and his comment made it through. He
composed these messages by hand in a Web browser. Once the payload cleared
the filter, he loaded its contents in a script and flooded us with hundreds
of messages. Make no mistake: link spammers are highly motivated people.
Know Thy Enemy
This experience somewhat changed my approach to spam
filtering. While I considered link spammers to be similar to email
spammers, one difference was apparent. The former have tangible evidence to
evaluate the success or failure of their efforts. When an email spammer
floods a server with messages, he does not know whether the filter ate the
payload. A link spammer, on the other hand, can see the result in his
browser. If the message was posted, then he was successful. At that point
he can script a flood.
As I studied the actions of my antagonists, there was
another revelation -- there are a lot of ways to write, "Porn,
pills, and casinos." To date, I have not seen link spammers resort to
1337 spellings (e.g., p0rn), which characterize the payloads of their email
counterparts. As more Web sites rely on automated filtering, it's
only a matter of time until we see those spellings on message boards. For the
time being, link spammers are motivated by a desire to rank well for Internet
search terms, which requires correctly spelled words.
It was also apparent that link spammers do indeed make
money. They may claim all kinds of revenues in an interview with The Register, but those claims
are empty without some sort of tangible evidence. During the past several
months, I've watched quite a few link spammers tap what appears to be
a bottomless pool of domain names. While I could not independently
substantiate the revenues associated with link spamming, it is fair to say
they make enough to stay motivated and keep a supply of new domains.
The techniques discussed in the first article may
provide you with adequate protection against abuse, but should your site
fall on the radar of a highly motivated, highly skilled link spammer, you
will need to expand those defenses.
A Self-Adjusting System
Even the most dedicated systems administrator needs
time away from his systems. Unfortunately, the most dedicated link spammers
don't allow themselves that luxury. In this article, I introduce new
tools and techniques to help you build a self-adjusting system. This system
is built on techniques I discussed in the first article.
A dedicated link spammer will spend a great deal of
effort constructing a payload to pass our defenses. We want to capture
information as he tries to slip a message past our filter. On each
unsuccessful attempt, we capture IP addresses and domains that are not
currently in our database. Before he can construct a message with the right
combination of elements, we will have more information to block him.
To illustrate the effectiveness of such a system,
let's consider the following log snippet:
# Date               | Total | IP Address     | A | B | C | D
17/Aug/2006:06:44:40 | 18    | 83.138.144.208 | 0 | 0 | 0 | 18
17/Aug/2006:06:46:57 | 8     | 83.138.144.208 | 2 | 4 | 0 | 2
17/Aug/2006:06:49:25 | 16    | 83.138.144.208 | 4 | 6 | 0 | 6
Our spam threshold is 8 points; the total is the sum
of columns A through D. A represents the total points for each blacklisted
domain in the body of the message. B is the total points for a blacklisted
IP address. C is blacklisted authors, and D is blacklisted keywords. In the
example above, a link spammer attempted to penetrate our defenses with a
new domain from a new IP address. On his first attempt, his message scored
18 points from keyword matches. Since it was flagged as spam, we extracted
the domain and inserted it into the database. We did the same with his IP
address. We assign 2 points to each suspicious new domain and 4 points for
each suspicious new IP address. On each ensuing instance of abuse, we
increment both by 2 points.
On his second attempt, the spammer honed his message
with less offensive keywords. He responded to our system, but our system
responded to him. While his keyword score dropped from 18 to 2, the points
we've assigned his domain and IP address put him over the edge. By
the third attempt, his IP address was blacklisted and his new domain nearly
worthless. Most importantly, this adjustment occurred while I was still
soundly asleep.
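The arithmetic behind those three log lines can be replayed in a few lines of PHP. What follows is only a sketch: the function name, the in-memory arrays standing in for the database tables, and the domain `spam-example.test` are all made up for illustration. It assumes, as described above, that a flagged message adds its domain at 2 points and its IP address at 4, and that each repeat offense adds 2 more.

```php
<?php
// Hypothetical in-memory stand-in for the domain and IP tables.
const SPAM_THRESHOLD = 8;

/**
 * Score one attempt, then, if it is spam, add or bump the stored
 * scores for its domain and IP address.
 * Returns [total, A (domain), B (IP), C (author), D (keywords)].
 */
function score_attempt(array &$domains, array &$ips, string $domain,
                       string $ip, int $keywordPts): array
{
    $a = $domains[$domain] ?? 0;   // points already held by the domain
    $b = $ips[$ip] ?? 0;           // points already held by the address
    $c = 0;                        // author scoring is left out of this sketch
    $total = $a + $b + $c + $keywordPts;

    if ($total >= SPAM_THRESHOLD) {
        // New entries start at 2 (domain) and 4 (IP); repeats gain 2.
        $domains[$domain] = isset($domains[$domain]) ? $domains[$domain] + 2 : 2;
        $ips[$ip]         = isset($ips[$ip]) ? $ips[$ip] + 2 : 4;
    }
    return [$total, $a, $b, $c, $keywordPts];
}

$domains = [];
$ips = [];
$rows = [];
// Keyword scores (column D) taken from the three log lines above.
foreach ([18, 2, 6] as $keywordPts) {
    $rows[] = score_attempt($domains, $ips, 'spam-example.test',
                            '83.138.144.208', $keywordPts);
}
foreach ($rows as $row) {
    echo implode(' | ', $row), "\n";
}
```

Feeding in the keyword scores from the log reproduces the Total, A, B, C, and D columns line for line. Even a keyword score of 2 is enough to flag the second attempt.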
Every business case is different so this exact system
may not work for you. I'll present its specific components so that
you may apply any that fit your environment.
IP Blacklisting
In the first article, we blacklisted IP addresses as
bad behavior merited it. We kept a paper trail and manually added IP
addresses to our filter after users violated our terms of service. If we
want to deter link spammers, a more proactive approach is necessary. Our
self-adjusting system checks IP addresses against lists of troublesome
addresses and adds new addresses automatically as comments are flagged as
spam. It enters each new address below the spam threshold and raises
its score every additional time a comment from that address is flagged.
A good hacker never reinvents the wheel. One tool we
can employ has been honed for years by people who combat email spam.
DNS-based Block Lists (DNSBLs) publish lists of IP addresses that are
easily queried with existing Internet protocols. These addresses have
already been flagged as troublesome. Many servers exploited by email
spammers are used by link spammers. The chief culprit in both cases is
often an open SOCKS relay poorly administered by a person who probably
doesn't read this magazine. Since SOCKS resides between the
application layer and the transport layer, it can be used seamlessly by
both email and link spammers. As part of our self-adjusting system, we want
to leverage DNSBLs to prevent spam before it happens.
Fortunately, this problem has already been solved by
a contributor to the PHP Web site. You can use the following function to
check an address against SpamCop, DSBL, and Spamhaus:
private function _is_blacklisted($ip) {
    // written by satmd, do what you want with it,
    // but keep the author please
    $srvs = array("bl.spamcop.net", "list.dsbl.org",
                  "sbl.spamhaus.org");
    if ($ip) {
        // DNSBLs are queried with the octets reversed, e.g.
        // 1.2.3.4 is looked up as 4.3.2.1.bl.spamcop.net.
        $octs = explode(".", $ip);
        $rip  = $octs[3].".".$octs[2].".".$octs[1].".".$octs[0];
        for ($i = 0; $i < count($srvs); $i++) {
            // An A record for the reversed name means the address is listed
            if (checkdnsrr($rip.".".$srvs[$i].".", "A")) {
                return true;
            }
        }
    }
    return false;
}
Feel free to add or remove any DNSBL.
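The only unusual step in that function is the reversed-octet query name, so it may help to see it in isolation. The helper below is a sketch of my own, not part of satmd's function, and its name is hypothetical. It builds the name that would be handed to checkdnsrr() and returns null for anything that is not a valid IPv4 address, a case the listing above does not guard against.

```php
<?php
/**
 * Build the DNS name a DNSBL lookup would query for an IPv4 address,
 * or return null if the input is not a valid IPv4 address.
 */
function dnsbl_query_name(string $ip, string $zone): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    // Reverse the octets: 83.138.144.208 becomes 208.144.138.83
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    // The trailing dot makes the name fully qualified
    return $reversed . '.' . $zone . '.';
}

echo dnsbl_query_name('83.138.144.208', 'bl.spamcop.net'), "\n";
// prints 208.144.138.83.bl.spamcop.net.
```

Because the helper never touches the network, you can check the names it builds before pointing real lookups at a block list.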
Since the first article in this series was published,
we moved our IP addresses into a MySQL database. The number of addresses
we've been forced to manage has simply grown too large for INI-style
text files. As mentioned above, the database is updated automatically. If a
customer posts a message flagged as spam, we check whether the poster's IP
address is already in the database. If it is, we add 2 points to its score.
If not, we add it with a base score of 4 (our spam threshold is 8 points):
function _add_address($ip) {
    // Escape the address before interpolating it into SQL
    $ip  = mysql_real_escape_string($ip, $this->dbh);
    $sql = "SELECT score FROM ipaddr WHERE ipaddr='$ip'";
    $res = mysql_query($sql, $this->dbh);
    $row = $res ? mysql_fetch_object($res) : false;
    if (!$row) {
        // First offense: enter the address below the spam threshold
        $sql = "INSERT INTO ipaddr (id, score, ipaddr) VALUES(NULL, 4, '$ip')";
    } else {
        // Repeat offense: raise the existing score by 2 points
        $pts = $row->score + 2;
        $sql = "UPDATE ipaddr SET score=$pts WHERE ipaddr='$ip'";
    }
    mysql_query($sql, $this->dbh);
    return true;
}
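If your PHP build offers PDO, the same add-or-increment logic can be written with prepared statements, which keep the address out of the SQL text entirely. This is a sketch under assumptions rather than the code we run. The function name is hypothetical, and the demo uses an in-memory SQLite database (so it needs the pdo_sqlite extension) with a simplified copy of the ipaddr table.

```php
<?php
/**
 * Add a flagged address with a base score of 4, or raise an
 * existing address by 2. Returns the new score.
 */
function add_address(PDO $db, string $ip): int
{
    $sel = $db->prepare('SELECT score FROM ipaddr WHERE ipaddr = ?');
    $sel->execute([$ip]);
    $score = $sel->fetchColumn();   // false when no row matches

    if ($score === false) {
        $ins = $db->prepare('INSERT INTO ipaddr (score, ipaddr) VALUES (4, ?)');
        $ins->execute([$ip]);
        return 4;
    }
    $upd = $db->prepare('UPDATE ipaddr SET score = score + 2 WHERE ipaddr = ?');
    $upd->execute([$ip]);
    return (int) $score + 2;
}

// Demo table: a simplified, assumed version of the ipaddr schema.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE ipaddr (id INTEGER PRIMARY KEY, score INTEGER, ipaddr TEXT UNIQUE)');

$first  = add_address($db, '83.138.144.208');
$second = add_address($db, '83.138.144.208');
// A hostile value is stored as a literal string, not executed as SQL
$third  = add_address($db, "1.2.3.4' OR '1'='1");
echo "$first $second $third\n";
```

The two calls for the same address give the 4-then-6 progression described above. The third call shows that a quote in the input cannot change the query.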
Keywords
Since the first article, we moved our keywords from an
INI file into the database. An Italian link spammer was giving us fits, so
I ended that nuisance by adding nearly 100,000 Italian words to the
keywords database. The site is English-only but we made sure that words
like "pizza" and "lasagna" were not added along
with "anglicanesimo" and "cardinalesche".
A keywords query is trickier than one that checks for
an IP address. In the latter case, we check for a record that exactly
matches an address string:
$sql = "SELECT score FROM ipaddr WHERE ipaddr='$ip'";
To match keywords, we have to locate records
containing a single word from a larger string of many words. In this case,
the large string contains the message that a customer posted in the comment
form. Fortunately, MySQL provides us with the RLIKE regular-expression
operator, which makes a difficult query easy:
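private function _score_keyword($str="") {
    $pts = 0;
    // The message is user input, so escape it before building the query
    $str = mysql_real_escape_string($str, $this->dbh);
    // Each stored word is applied to the message as a regular expression
    $sql = "SELECT id, score FROM keywords WHERE '$str' RLIKE word";
    $res = mysql_query($sql, $this->dbh);
    while (list($id, $score) = mysql_fetch_row($res)) {
        $pts += $score;
    }
    return $pts;
}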
We simply loop through our result set and increment
the corresponding points for every keyword match.
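One caveat is worth knowing: RLIKE with a plain word matches that word anywhere in the message, including inside a longer word. If that produces false positives on your site, the same scoring can be done in PHP with word boundaries. The function below is an illustrative alternative, not the query we use. Its name, the keyword list, and the scores are all made up.

```php
<?php
/**
 * Sum the points of every keyword that appears in the message as a
 * whole word. $keywords maps word => points.
 */
function score_keywords(string $message, array $keywords): int
{
    $pts = 0;
    foreach ($keywords as $word => $score) {
        // preg_quote() keeps any regex metacharacters in the word literal;
        // \b anchors the match at word boundaries; i = case-insensitive.
        $pattern = '/\b' . preg_quote((string) $word, '/') . '\b/i';
        if (preg_match($pattern, $message) === 1) {
            $pts += $score;
        }
    }
    return $pts;
}

$keywords = ['casino' => 4, 'pills' => 4, 'poker' => 2];  // made-up scores
echo score_keywords('Cheap PILLS and casino bonus', $keywords), "\n"; // 8
echo score_keywords('The milk spills over', $keywords), "\n";        // 0
```

The second call scores 0 because "pills" only occurs inside "spills". The trade-off is that every keyword must be loaded into PHP, which may not suit a table of 100,000 words.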
Domains
The task of stopping link spammers would be
considerably easier if they didn't have the resources necessary to
acquire thousands of domains. Remember, they don't want to post just
anything on your Web site. They want to post links to their PPC sites. Most
spammers add multiple hosts to each domain, but we simplify our task by
filtering the domain itself. Again, RLIKE is our friend:
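private function _score_domain($str="") {
    $pts = 0;
    // The message is user input, so escape it before building the query
    $str = mysql_real_escape_string($str, $this->dbh);
    // Each stored domain is applied to the message as a regular
    // expression, so an unescaped dot matches any character
    $sql = "SELECT id, score FROM domains WHERE '$str' RLIKE domain";
    $res = mysql_query($sql, $this->dbh);
    while (list($id, $score) = mysql_fetch_row($res)) {
        $pts += $score;
    }
    return $pts;
}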
When a message is flagged as spam (i.e., its score is
greater than or equal to 8), we parse the message for domains then add them
to our database. It's a self-adjusting system, remember:
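$spam = new spamfilter($dbh);
if ($spam->isSpam($threshold, $content, $name, _getIp())) {
    echo "This smells like spam<br>";
    sleep(10); // slow down repeated attempts
} else {
    // allow the comment
    if (!empty($name) && !empty($content)) {
        echo "The following comment is NOT spam:<br>";
        // escape user input before echoing it into the page
        echo htmlspecialchars($name) . " said: " .
            htmlspecialchars($content);
    }
}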
Notice that the program sleeps for 10 seconds after a
message is flagged as spam. This is to lengthen the time it takes our
spammer to compose various attempts to penetrate our defenses. We borrowed
this idea from the delay that follows a failed Unix login.
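The step that parses a flagged message for domains is not shown in the listing above. One way to sketch it is below. The function name and sample message are hypothetical. The sketch stops at the host name of each link; collapsing hosts to their parent domain, as our filter does, is left out because doing it reliably requires a list of public suffixes.

```php
<?php
/**
 * Return the unique, lower-cased host names of all http(s) links
 * found in a message, with any leading "www." removed.
 */
function extract_link_hosts(string $message): array
{
    // Grab anything that looks like an http(s) URL up to whitespace,
    // a quote, or an angle bracket.
    preg_match_all('~https?://[^\s"\'<>]+~i', $message, $matches);
    $hosts = [];
    foreach ($matches[0] as $url) {
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host) || $host === '') {
            continue;
        }
        $host = strtolower($host);
        if (strpos($host, 'www.') === 0) {
            $host = substr($host, 4);
        }
        $hosts[$host] = true;   // array keys de-duplicate
    }
    return array_keys($hosts);
}

$msg = 'Visit <a href="http://www.Pills-Example.test/buy">here</a> '
     . 'or http://casino.example.test/?a=1 and http://pills-example.test';
print_r(extract_link_hosts($msg));
```

Each host returned could then be added to the domains table, or have its score raised, in the same way flagged IP addresses are.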
Conclusion
If you chose this profession, then chances are you
love spending time on computers, but even the most dedicated administrator
needs time with family and friends. One way to ensure you get that time is
to have computers handle your grunt work. A self-adjusting link spam filter
will help you do that. As always, you'll have to tailor
these ideas to fit your business case. Log as much information as possible.
Use that information to tune your system, and you'll be able to keep
the bad guys at bay.
Code for this article is available for download at: http://www.samag.com/code/.
References
Register: Interview with a Link Spammer -- http://www.theregister.co.uk/2005/01/31/link_spamer_interview/
1337 Spellings -- http://en.wikipedia.org/wiki/Leet
Jeffrey Fulmer has administered enterprise computer
systems professionally since 1995. He is an open source software developer
and the primary author of siege. He currently resides in Pennsylvania with
his wife and English bulldog.