Combating Link Spam

Jeffrey Fulmer

In the first article in this series, I discussed techniques and strategies to combat inappropriate content in Web-based message formats (see March, 2006 Sys Admin). As companies continue to leverage the Internet to open new avenues of communication with customers, they make their Web sites vulnerable to inappropriate content. Abusive behavior is one concern. The pseudo-anonymous nature of the Internet often serves to break obnoxious caterpillars out of their cocoons. Many quiet wallflowers have found themselves transformed into loud bullies in Web-based message forums. Fortunately, this type of behavior is easy to stop.

A common characteristic among Internet bullies is the language they employ. They like vulgar words. The simplest way to minimize their obnoxiousness is to eliminate their repertoire. In the first article, I showed how to reject articles based on keyword filtering. A bully can still be obnoxious without vulgar words, but his attacks are far less offensive. You can eliminate most persistent problems with IP blacklisting. A typical bully will be discouraged by these tactics.

Link spammers are a more persistent problem. They tend to have programming skills and a comprehensive understanding of Internet protocols. More importantly, they are highly motivated. Last year, The Register interviewed a link spammer who claims the PPC (porn, pills, and casinos) sites to which he links generate revenues from £100,000 to £200,000 a month. He earns a piece of that action. The link spammer floods tens of thousands of blogs and message boards with links to his landing sites. He generates income by sending click-throughs to the PPC sites. Never underestimate the motivating power of money.

One of the sites I administer was locked down with the techniques described in my previous article. The filtering system successfully caught more than 200 spam attempts each day. Legitimate users registered no complaints of false positive blocks. We were literally batting 1.000. Life was good.

Suddenly things took a turn for the worse. A link spammer started to penetrate our defenses. Each morning the comment section was flooded with links to PPC sites. Our spammer was able to post these messages because his comments were in Italian. I collected keywords, domains, and IP addresses from his messages and appended them to our filter. The comments kept coming. As fast as I added domains, he acquired new ones. Sometimes he linked to content he had uploaded to sites he had compromised, for example, a geology department's Web site at a major university.

According to the logs, our filter was capturing the vast majority of his attempts. Each morning, we would log hundreds of failed efforts before we were hit by a flood of link spam. Our spammer invested hours crafting messages through various IP addresses until he finally hit the right combination and his comment made it through. He composed these messages by hand in a Web browser. Once the payload cleared the filter, he loaded its contents in a script and flooded us with hundreds of messages. Make no mistake, they are highly motivated people.

Know Thy Enemy

This experience somewhat changed my approach to spam filtering. While I considered link spammers to be similar to email spammers, one difference was apparent. The former have tangible evidence to evaluate the success or failure of their efforts. When an email spammer floods a server with messages, he does not know whether the filter ate the payload. A link spammer, on the other hand, can see the result in his browser. If the message was posted, then he was successful. At that point he can script a flood.

As I studied the actions of my antagonists, there was another revelation -- there are a lot of ways to write, "Porn, pills, and casinos." To date, I have not seen link spammers resort to 1337 spellings (e.g., p0rn), which characterize the payloads of their email counterparts. As more Web sites rely on automated filtering, it's only a matter of time until we see those spellings on message boards. For the time being, link spammers stick to conventional spellings because they want to score well on Internet search terms.

It was also apparent that link spammers do indeed make money. They may claim all kinds of revenues in an interview with The Register, but those claims are empty without some sort of tangible evidence. During the past several months, I've watched quite a few link spammers tap what appears to be a bottomless pool of domain names. While I could not independently substantiate the revenues associated with link spamming, it is fair to say they make enough to stay motivated and keep a supply of new domains.

The techniques discussed in the first article may provide you with adequate protection against abuse, but should your site fall on the radar of a highly motivated, highly skilled link spammer, you will need to expand those defenses.

A Self-Adjusting System

Even the most dedicated systems administrator needs time away from his systems. Unfortunately, the most dedicated link spammers don't allow themselves that luxury. In this article, I introduce new tools and techniques to help you build a self-adjusting system. This system is built on techniques I discussed in the first article.

A dedicated link spammer will spend a great deal of effort constructing a payload to pass our defenses. We want to capture information as he tries to slip a message past our filter. On each unsuccessful attempt, we capture IP addresses and domains that are not currently in our database. Before he can construct a message with the right combination of elements, we will have more information to block him.

To illustrate the effectiveness of such a system, let's consider the following log snippet:

# Date               | Total | IP Address     | A | B | C |  D
17/Aug/2006:06:44:40 |    18 | 83.138.144.208 | 0 | 0 | 0 | 18
17/Aug/2006:06:46:57 |     8 | 83.138.144.208 | 2 | 4 | 0 |  2
17/Aug/2006:06:49:25 |    16 | 83.138.144.208 | 4 | 6 | 0 |  6

Our spam threshold is 8 points; the total is the sum of columns A through D. A represents the total points for each blacklisted domain in the body of the message. B is the total points for a blacklisted IP address. C is blacklisted authors, and D is blacklisted keywords. In the example above, a link spammer attempted to penetrate our defenses with a new domain from a new IP address. On his first attempt, his message scored 18 points from keyword matches. Since it was flagged as spam, we extracted the domain and inserted it into the database. We did the same with his IP address. We assign 2 points to each suspicious new domain and 4 points for each suspicious new IP address. On each ensuing instance of abuse, we increment both by 2 points.

On his second attempt, the spammer honed his message with less offensive keywords. He responded to our system, but our system responded to him. While his keyword score dropped from 18 to 2, the points we've assigned his domain and IP address put him over the edge. By the third attempt, his IP address was blacklisted and his new domain nearly worthless. Most importantly, this adjustment occurred while I was still soundly asleep.
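
The arithmetic behind that log can be replayed in a few lines of PHP. This is only a sketch under stated assumptions: the in-memory arrays stand in for our database tables, the function name score_attempt() and the domain example.test are hypothetical, and the keyword scores (18, 2, 6) are copied from column D of the log rather than computed.

```php
<?php
// Hypothetical in-memory stand-in for the domain and IP address tables.
const SPAM_THRESHOLD = 8;

// Score one attempt and, if it is flagged, adjust the stored points.
function score_attempt(array &$domains, array &$ips, string $domain, string $ip, int $keywordPts): array
{
    $a = $domains[$domain] ?? 0;   // column A: points for the domain
    $b = $ips[$ip] ?? 0;           // column B: points for the IP address
    $total = $a + $b + $keywordPts;

    if ($total >= SPAM_THRESHOLD) {
        // New entries start below the threshold (2 and 4 points);
        // entries we already know gain 2 points per offense.
        $domains[$domain] = isset($domains[$domain]) ? $domains[$domain] + 2 : 2;
        $ips[$ip]         = isset($ips[$ip]) ? $ips[$ip] + 2 : 4;
    }
    return ['total' => $total, 'A' => $a, 'B' => $b, 'D' => $keywordPts];
}

$domains = [];
$ips     = [];
$log     = [];
foreach ([18, 2, 6] as $keywordPts) {
    $log[] = score_attempt($domains, $ips, 'example.test', '83.138.144.208', $keywordPts);
}
foreach ($log as $row) {
    printf("%3d | %d | %d | %2d\n", $row['total'], $row['A'], $row['B'], $row['D']);
}
```

The three rows it prints carry the same totals as the log: 18, 8, and 16. The spammer's second message would have passed on keywords alone, and it is the 6 points inherited from his first failure that stop it.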

Every business case is different, so this exact system may not work for you. I'll present its specific components so that you may apply any that fit your environment.

IP Blacklisting

In the first article, we blacklisted IP addresses as bad behavior merited it. We kept a paper trail and manually added IP addresses to our filter after users violated our terms of service. If we want to deter link spammers, a more proactive approach is necessary. Our self-adjusting system checks IP addresses against lists of troublesome addresses and adds new addresses automatically as comments are flagged as spam. It enters those addresses below the spam threshold and increments their point values each additional time they are flagged.

A good hacker never reinvents the wheel. One tool we can employ has been honed for years by people who combat email spam. DNS-based Block Lists (DNSBLs) publish lists of IP addresses that are easily queried with existing Internet protocols. These addresses have already been flagged as troublesome. Many servers exploited by email spammers are also used by link spammers. The chief culprit in both cases is often an open SOCKS relay poorly administered by a person who probably doesn't read this magazine. Since SOCKS resides between the application layer and the transport layer, it can be used seamlessly by both email and link spammers. As part of our self-adjusting system, we want to leverage DNSBLs to prevent spam before it happens.

Fortunately, this problem has already been solved by a contributor to the PHP Web site. You can use the following function to check an address against SpamCop, DSBL, and Spamhaus:

private function _is_blacklisted($ip) {
    // written by satmd, do what you want with it,
    // but keep the author please
    $srvs = array("bl.spamcop.net", "list.dsbl.org",
                  "sbl.spamhaus.org");
    if($ip){
      // reverse the octets: 1.2.3.4 is queried as 4.3.2.1.<dnsbl>
      $octs = explode(".",$ip);
      if(count($octs) != 4) return false;   // not a dotted-quad address
      $rip  = $octs[3].".".$octs[2].".".$octs[1].".".$octs[0];
      for($i=0; $i<count($srvs); $i++) {
        // an A record for the reversed name means the address is listed
        if(checkdnsrr($rip.".".$srvs[$i].".","A")) {
          return true;
        }
      }
    }
    return false;
  }

Feel free to add or remove any DNSBL.
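
The heart of that function is the name it hands to DNS: the octets of the address in reverse order, followed by the block list's zone. The sketch below isolates that step so it can be tested without touching the network. The function name dnsbl_query_name() is hypothetical, and the filter_var() validation is my addition, not part of satmd's code.

```php
<?php
// Build the DNS name that a DNSBL lookup queries: reversed octets + zone.
// Returns null for anything that is not a dotted-quad IPv4 address.
function dnsbl_query_name(string $ip, string $zone): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }
    $reversed = implode('.', array_reverse(explode('.', $ip)));
    // The trailing dot marks the name as fully qualified.
    return $reversed . '.' . $zone . '.';
}

echo dnsbl_query_name('83.138.144.208', 'bl.spamcop.net'), "\n";
// 208.144.138.83.bl.spamcop.net.
```

Passing the returned name to checkdnsrr() with a type of "A" performs the same test as the listing above.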

Since the first article in this series was published, we moved our IP addresses into a MySQL database. The number of addresses we've been forced to manage has simply grown too large for INI-style text files. As mentioned above, the database is updated automatically. If a customer posts a message that is flagged as spam, we check whether the customer's IP address is already in the database. If it is, we add 2 points to its score. If not, we add it with a base score of 4 (our spam threshold is 8 points):

 
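function _add_address($ip) {
    // escape the address before embedding it in a query
    $ip  = mysql_real_escape_string($ip, $this->dbh);
    $sql = "SELECT score FROM ipaddr WHERE ipaddr='$ip'";
    $res = mysql_query($sql,$this->dbh);
    // mysql_fetch_object() returns false when there is no matching row
    $row = $res ? mysql_fetch_object($res) : false;
    if(!$row || $row->score < 1){
      // new address: start below the spam threshold
      $sql = "INSERT INTO ipaddr (id, score, ipaddr) VALUES(NULL, 4, '$ip')";
      $res = mysql_query($sql,$this->dbh);
    } else {
      // repeat offender: add 2 points
      $pts = $row->score + 2;
      $sql = "UPDATE ipaddr SET score=$pts WHERE ipaddr='$ip'";
      $res = mysql_query($sql,$this->dbh);
    }
    return true;
  }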
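
The mysql_* functions used in this article were removed in PHP 7.0, so on a current installation the same logic has to go through another interface. What follows is a minimal sketch of one way to do that with PDO prepared statements, not the code that ships with this article. The two-column table layout is an assumption, and an in-memory SQLite database is used only so the example runs without a server; a MySQL DSN would take its place in production.

```php
<?php
// Assumed schema: ipaddr(ipaddr TEXT PRIMARY KEY, score INTEGER).
// Adds 2 points to a known address, or inserts a new one at 4 points.
function add_address(PDO $dbh, string $ip): int
{
    $sel = $dbh->prepare('SELECT score FROM ipaddr WHERE ipaddr = ?');
    $sel->execute([$ip]);
    $score = $sel->fetchColumn();   // false when the address is unknown
    $sel->closeCursor();

    if ($score === false) {
        $score = 4;                 // new address: below the threshold of 8
        $ins = $dbh->prepare('INSERT INTO ipaddr (ipaddr, score) VALUES (?, ?)');
        $ins->execute([$ip, $score]);
    } else {
        $score = (int)$score + 2;   // repeat offender
        $upd = $dbh->prepare('UPDATE ipaddr SET score = ? WHERE ipaddr = ?');
        $upd->execute([$score, $ip]);
    }
    return $score;
}

$dbh = new PDO('sqlite::memory:');
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$dbh->exec('CREATE TABLE ipaddr (ipaddr TEXT PRIMARY KEY, score INTEGER NOT NULL)');

$first  = add_address($dbh, '83.138.144.208');   // first offense
$second = add_address($dbh, '83.138.144.208');   // second offense
echo "$first then $second\n";
```

Because the address is bound as a parameter, it never has to be escaped by hand.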

Keywords

Since the first article, we moved our keywords from an INI file into the database. An Italian link spammer was giving us fits, so I ended that nuisance by adding nearly 100,000 Italian words to the keywords database. The site is English-only, but we made sure that Italian words common in English, like "pizza" and "lasagna", were not added along with "anglicanesimo" and "cardinalesche".

A keywords query is trickier than one that checks for an IP address. In the latter case, we check for a record that exactly matches an address string:

$sql = "SELECT score FROM ipaddr WHERE ipaddr='$ip'";

To match keywords, we have to locate records containing a single word from a larger string of many words. In this case, the large string contains the message that a customer posted in the comment form. Fortunately, MySQL provides us with the RLIKE regular expression operator. It makes a difficult query easy:

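private function _score_keyword($str="") {
    $pts  = 0;
    // the message is user input: escape it before embedding it in SQL
    $str  = mysql_real_escape_string($str, $this->dbh);
    $sql  = "SELECT id, score FROM keywords WHERE '$str' RLIKE word";
    $res  = mysql_query($sql,$this->dbh);
    if(!$res) return $pts;   // query failed: score nothing
    while(list($id, $score) = mysql_fetch_row($res)){
       $pts += $score;
    }
    return $pts;
  }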

We simply loop through our result set and add up the points for every keyword match.
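
Note that RLIKE treats each stored word as a pattern and matches it anywhere in the message, so a short keyword can fire inside a longer, innocent word. The sketch below performs the same scoring in plain PHP with word boundaries added. It is an illustration under assumptions rather than a replacement for the query: the function name, the sample keywords, and their point values are all hypothetical, and the \b anchors are my addition.

```php
<?php
// $keywords maps word => points, a hypothetical stand-in for the keywords table.
function score_keywords(string $message, array $keywords): int
{
    $pts = 0;
    foreach ($keywords as $word => $score) {
        // preg_quote() keeps the keyword literal; \b limits matches to whole words.
        $pattern = '/\b' . preg_quote((string)$word, '/') . '\b/iu';
        if (preg_match($pattern, $message) === 1) {
            $pts += $score;
        }
    }
    return $pts;
}

$keywords = ['casino' => 4, 'pills' => 4, 'ass' => 2];
echo score_keywords('Cheap PILLS and casino bonus', $keywords), "\n";   // 8
echo score_keywords('A first-class geology lecture', $keywords), "\n";  // 0
```

Without the boundaries, the second message would collect 2 points for the "ass" inside "class". The same effect can be had in the database by storing each keyword with boundary markers, at the cost of maintaining patterns instead of plain words.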

Domains

The task of stopping link spammers would be considerably easier if they didn't have the resources necessary to acquire thousands of domains. Remember, they don't want to post just anything on your Web site. They want to post links to their PPC sites. Most spammers add multiple hosts to each domain, but we simplify our task by filtering the domain itself. Again, RLIKE is our friend:

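private function _score_domain($str="") {
    $pts  = 0;
    // the message is user input: escape it before embedding it in SQL
    $str  = mysql_real_escape_string($str, $this->dbh);
    $sql  = "SELECT id, score FROM domains WHERE '$str' RLIKE domain";
    $res  = mysql_query($sql,$this->dbh);
    if(!$res) return $pts;   // query failed: score nothing
    while(list($id, $score) = mysql_fetch_row($res)){
       $pts += $score;
    }
    return $pts;
  }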

When a message is flagged as spam (i.e., its score is greater than or equal to 8), we parse the message for domains and then add them to our database. It's a self-adjusting system, remember. The filter is invoked like this:

 
$spam = new spamfilter($dbh);
if($spam->isSpam($threshold, $content, $name, _getIp())){
  echo "This smells like spam<br>";
  sleep(10); // slow down the spammer's next attempt
} else {
  // allow the comment
  if(!empty($name) && !empty($content)){
    echo "The following comment is NOT spam:<br>";
    echo "$name said: " .
    $content;
  }
}

Notice that the program sleeps for 10 seconds after a message is flagged as spam. This is to lengthen the time it takes our spammer to compose various attempts to penetrate our defenses. We borrowed this idea from the failed Unix login.
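
The step that feeds the domains table, parsing a flagged message for the sites it links to, might look something like the following. This is a rough sketch and not the code that ships with this article; extract_hosts() is a hypothetical name and the sample message is invented.

```php
<?php
// Return the unique host names linked in a message body, lowercased.
function extract_hosts(string $content): array
{
    $hosts = [];
    // Grab anything that looks like an http or https URL.
    if (preg_match_all('~https?://[^\s"\'<>]+~i', $content, $m)) {
        foreach ($m[0] as $url) {
            $host = parse_url($url, PHP_URL_HOST);
            if (is_string($host) && $host !== '') {
                $hosts[strtolower($host)] = true;   // array keys de-duplicate
            }
        }
    }
    return array_keys($hosts);
}

$body = 'Visit <a href="http://pills.example.com/buy">here</a> or http://CASINO.example.net';
print_r(extract_hosts($body));
```

For the sample message this yields pills.example.com and casino.example.net. The sketch stops at host names. Reducing each host to its registered domain, which is what we actually filter on, takes more care than dropping the leftmost label because of suffixes such as co.uk.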

Conclusion

If you chose this profession, then chances are you love spending time on computers, but even the most dedicated administrator needs time with family and friends. One way to ensure you get that time is to have computers handle your grunt work. A self-adjusting link spam filter will help you reclaim it. As always, you'll have to tailor these ideas to fit your business case. Log as much information as possible. Use that information to tune your system, and you'll be able to keep the bad guys at bay.

Code for this article is available for download at: http://www.samag.com/code/.

References

1337 Spellings -- http://en.wikipedia.org/wiki/Leet

Jeffrey Fulmer has administered enterprise computer systems professionally since 1995. He is an open source software developer and the primary author of siege. He currently resides in Pennsylvania with his wife and English bulldog.