Linux.com

Feature: Enterprise Applications

Stemming the menace of wiki spamming

By Rob Sutherland on June 28, 2005 (8:00:00 AM)


Do you maintain a wiki that you haven't checked for a while? Better take a look at it. Over the past year or so, an increasing number of wikis and wiki variants have become the target of spam. Finding the right line to walk between complete vulnerability and complete inflexibility is the problem that wiki developers and operators are trying to solve. How can the wiki community keep wikis open and easy to use, yet stamp out the spam?

Spam that is directed against a wiki usually consists of lists of links inserted in standard pages, such as user profiles or sandbox areas. Although much of the spam consists of innocent-appearing link text, the actual URL points to a phishing, gambling, or porn site. By putting out hundreds or thousands of spam pages, wiki spammers increase the number of links pointing at their sites, which gives people more links to click on and bumps those sites higher in search engine rankings.

A list of reported wiki spam incidents and a Google search on known wiki spam names show that this is a growing problem.

The wiki community has been grappling at length with spamming. The theory in the past has been that wikis are self-maintaining in the sense that community editing will remove or polish entries over time. That may not be true. To see why, consider that a wiki posting may go unnoticed for some time, especially if it's posted to a "ghost wiki" that is running more or less unattended. Even if it is noticed, because wikis retain revision histories by default, cleaning up the spam is more work than simply deleting an offending page. Complicating the problem even for popular wikis is the fact that legitimate users are less likely to participate in a given wiki if many of the postings are bogus and link to unappealing and dangerous sites, which means there will be fewer people to notice and help clean up.

Another factor that attracts spammers is that a wiki's registration and entry contribution process needs to be as streamlined and simple as possible. If it's not, legitimate users will become frustrated and go elsewhere. But the greatest vulnerability of a wiki is ironically its greatest strength. Wikis' openness and ease of modification allow the irresponsible and criminal to add bogus content and dilute the value of the shared resource.

There seems to be agreement that "wiki spam is a wikiwide problem that needs to be solved wikiwide." Furthermore, the wiki spam problem has to be solved by different approaches than email or blog comment spam; both of these are subject to bottlenecks in a way that wikis are not. There is also a community element that has to be addressed -- it's not just a case of putting in a Bayesian filter, or implementing something like greylisting, or setting up a blacklist. You have to consider the impact of use and misuse of a particular tool on a varied (very varied) community. For example, if you set up a blacklist as a wiki page and allow wiki members to submit blackhat IPs so as to take advantage of community support, how do you stop people from using this tool to take revenge on their enemies in the flame war du jour?

The best we can do right now is put a toolkit together and see what works. The current toolset consists of three main categories:

  • Troll control tools: Behavior-based banlists, blacklists, whitelists, schemes to flag spamlike behavior (such as too many posts in too short a time), and content analysis to pick out spamlike posts. There's some discussion of using SpamAssassin to do this.
  • User verification: CAPTCHA -- Completely Automated Public Turing Test to Tell Computers and Humans Apart -- is usually a generated image of some text you have to type to proceed. This will stop automated spamming; if a real person is doing the spamming, it at least limits the rate of spam to what a person can type. An email verification process combined with a blacklist of known spam domains will also slow down human-generated spam.
  • After the fact: Checking for and removing spam pages, contributing information on spammers to the community, complaining to the abusers' ISPs, and publicizing their use of illegal and inappropriate methods. The chongqed wiki employs an interesting technique to publicize spam-fighting techniques. It encourages wiki operators to link known spam keywords to the chongqed.org site so that over time, searches will point there rather than to the spam target site. Chongqed also maintains a list of sites that have had links put out in wiki spam.
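As a rough illustration of the first category, flagging "too many posts in too short a time" can be done with a sliding window per client. The sketch below is only one way to do it; the class name, the default thresholds, and the choice of identifying clients by a string such as an IP address are all assumptions for illustration, not features of any particular wiki engine.

```python
from collections import defaultdict, deque


class EditRateLimiter:
    """Hypothetical helper: refuse a client that saves more than
    max_edits pages within `window` seconds (sliding window)."""

    def __init__(self, max_edits=5, window=60.0):
        self.max_edits = max_edits
        self.window = window
        # client id (e.g. IP address or user name) -> timestamps of accepted edits
        self._history = defaultdict(deque)

    def allow(self, client, now):
        """Return True and record the edit if `client` is under the limit at time `now`."""
        recent = self._history[client]
        # Forget edits that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_edits:
            return False  # spamlike burst: reject, or queue for review
        recent.append(now)
        return True
```

A wiki would call `allow()` from its save handler with the current time and either reject the save or hold it for a human to look at. A legitimate user who edits quickly will occasionally trip such a limit, so the thresholds need tuning per community.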

So, when you discover your wiki has been spammed, what should you do (after you remove the spam, of course)?

  • Go to the support/development site for the wiki that you use and look for resources and discussion on the best practices for blocking wiki spamming. Last week a TWiki wiki I am responsible for was hit by a wiki spam bot that registered using a stolen ID and then added a number of links to spam and phishing sites based in China. I found a topic on wiki spam that led me to the BlackList plugin for TWiki. I installed it and added the offending IP address. I haven't seen any spam since then, but I'm sure the fun isn't over.
  • Google recommends the use of the "nofollow" attribute on hyperlinks in order to prevent spammers from gaining anything from comment spam. This technique is also effective for links put into wiki topics by spammers and is part of the TWiki BlackList plugin.
  • If you've put up a ghost wiki that isn't being used, consider taking it down. If it's being used for evaluation, consider putting a .htaccess login on the main directory or using some other mechanism to prevent access by spammers.
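To make the "nofollow" suggestion concrete, here is a minimal sketch of a link renderer that adds rel="nofollow" to every link unless its host is on a trusted list. The function name and the trusted-host set are assumptions made for this example; this is not how the TWiki plugin is implemented.

```python
from html import escape
from urllib.parse import urlsplit


def render_link(url, text, trusted_hosts=frozenset()):
    """Hypothetical renderer: emit an <a> tag, adding rel="nofollow"
    unless the link's host is in trusted_hosts (an assumed whitelist)."""
    host = (urlsplit(url).hostname or "").lower()
    rel = "" if host in trusted_hosts else ' rel="nofollow"'
    # Escape both the URL and the link text so user input cannot inject markup.
    return f'<a href="{escape(url, quote=True)}"{rel}>{escape(text)}</a>'
```

For example, `render_link("http://spam.example/x", "cheap pills")` produces a link carrying rel="nofollow", while a link to a host in `trusted_hosts` is rendered without it. Search engines that honor the attribute then give the spammer no ranking benefit, although the link is still visible to readers.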

Naturally, there are a lot of wiki pages discussing the various evils of and responses to wiki spam. My own page is here. Chongqed has a good list of resources and the Meatball wiki has a fairly good article with background and definitions.

Like everything else about the world of wikis, the discussion of spam fighting is both extensive and opinionated. You may have to do some digging to get your particular questions answered, but persevere. Despite the overabundance of rhetoric, there are numerous skilled individuals trying to protect their personal investment and community from misuse.

Luckily, most wiki spammers are fairly crude in their efforts and therefore easy to spot and block. However, that could quickly change as wiki spammers become more subtle and capable of getting around current impediments, such as blacklists, by using the same techniques email spammers do. When that occurs, we'll have to update our tools and create newer and more sophisticated ones.

Unfortunately, there is no clear victory in sight. The best we can hope for is to react quickly, utilizing our community support and rapid dissemination of information, to defeat each new round of wiki spammers.


Comments

on Stemming the menace of wiki spamming

Note: Comments are owned by the poster. We are not responsible for their content.

rel=nofollow

Posted by: JelleB on June 28, 2005 11:39 PM
Wasn't there a technique to add rel=nofollow to a link so that Google (and any other search bot) would ignore the link in its rankings? It seems to me the first thing to do, as the spamming is mostly directed at search engines, not the individual readers.

#

Re:rel=nofollow

Posted by: Anonymous Coward on June 29, 2005 02:56 AM
Yes, I was imagining a system where URLs have the rel="nofollow" attribute tacked on until they have been whitelisted by a trusted account.

#

Re:rel=nofollow

Posted by: Anonymous Coward on June 29, 2005 06:15 PM
Gentle suggestion: read the fucking article before posting a comment!

#

Re:rel=nofollow

Posted by: Anonymous Coward on June 30, 2005 02:00 PM
While I agree with your sentiment about reading TFA, it was only mentioned once and not talked about much. It would be easy to skim the article and miss it.

#

good enough solutions in place

Posted by: vtre17 on June 29, 2005 12:07 AM
Wikis are hit only accidentally by spambots (which mainly target blog and comment forms). It is not necessary to contribute to the locked web with forced user account registrations, and CAPTCHAs especially should be seen as a last resort only.

Like the author mentioned, it is pretty simple to stop the currently rather stupid bots. It often suffices to introduce a forced save/submit delay or add a checkbox à la "[x] I'm no spambot". Another very effective mechanism is counting the added external links and rejecting the whole submission if too many links were put in (often seen in Chinese spam).
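The link-counting check described here is easy to sketch. The example below compares the URLs in the old and new page text and rejects the edit if too many new ones appear; the URL pattern, the function names, and the limit of three links are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Deliberately simple URL pattern (an assumption; real wikis have their own link syntax).
URL_RE = re.compile(r"https?://[^\s<>\"'\]]+", re.IGNORECASE)


def count_added_links(old_text, new_text):
    """Count external links present in new_text that were not already in old_text."""
    old = Counter(URL_RE.findall(old_text))
    new = Counter(URL_RE.findall(new_text))
    # Counter subtraction keeps only the links whose count went up.
    return sum((new - old).values())


def accept_edit(old_text, new_text, max_new_links=3):
    """Reject the whole submission if it adds more than max_new_links links."""
    return count_added_links(old_text, new_text) <= max_new_links
```

Because only newly added links are counted, pages that already contain many legitimate links can still be edited normally.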

Bayesian filters may become necessary, but aren't currently. And if a general distributed blocking system emerges, it will be more effective if it blocks not only IP addresses but also ISPs and web space providers, or works via reverse WHOIS lookups (block the spammers, not their pages).

Mass edits can also easily be reverted, so cleaning up a spammed Wiki is only difficult if you use the wrong software or database scheme.

The rel=nofollow is mostly non-effective, because spammers don't check for it. And then it is merely a protective invention by Google, which created the current mess in the first place by throwing its PageRank at the Web (and now tries to persuade other people to fix it). I don't even clean my SandBox anymore - bogus and misplaced links are entirely Google's problem.

The only real problems with WikiSpam are injections and manipulations of existing hyperlinks. But as this often happens manually, it can be easily detected and cleaned up by typical contributors and visitors.

#

A solved problem

Posted by: Anonymous Coward on June 29, 2005 12:13 AM
Moin Moin basically fixed this a while back with their AntiSpamGlobalSolution. That's a frequently updated central blacklist of regexps that match spammy URLs.

Since that came out I've only had a couple of instances (in many months) of wiki spamming, and then I just go and add the appropriate regexps to my local blacklist and the central blacklist submission page, and that spammer is locked out again.


http://moinmoin.wikiwikiweb.de/AntiSpamGlobalSolution
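A regexp blacklist of this kind can be sketched in a few lines. The example below is not MoinMoin's code; the one-pattern-per-line format with "#" comments, the function names, and the sample patterns are assumptions made for illustration.

```python
import re


def compile_blacklist(lines):
    """Compile one regexp per line, skipping blank lines and '#' comments
    (an assumed file format, not necessarily the one MoinMoin uses)."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def first_match(text, patterns):
    """Return the first blacklist pattern found in text, or None if the text is clean."""
    for pattern in patterns:
        if pattern.search(text):
            return pattern.pattern
    return None
```

A save handler would call `first_match()` on the submitted page text and refuse the edit when it returns a pattern. Merging a local list with a downloaded central list is then just a matter of concatenating the lines before compiling.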

#

Why would anyone be surprised?

Posted by: Anonymous Coward on June 29, 2005 04:02 AM
Wikis are a broken idea. They would work in some alternate universe in which everybody lives only for the good of the community.

The various parasites that flourish in the web haven't even started to attack Wikis yet, probably because not enough millions of naive users are looking at Wikis yet. When they do, Wikis will go the way of Usenet.

#

Not again!?!?!?!

Posted by: Anonymous Coward on June 29, 2005 04:12 AM
Give it up already! It is NOT possible to have an open website/forum on the internet and not have it used in an undesirable fashion. Like a freshly painted wall in Harlem, open wikis WILL be defaced or spammed! There is no "solution", without restricting access.

Give up on the Wiki concept. The open access to editors was a bad idea and still is. Even Wikipedia is restricting access to certain pages and open pages are checked and reverted frequently because they are regularly defaced.

Just be glad that it is only spam. Imagine if it were goatse, like the LA Times (http://news.webindia123.com/news/showdetails.asp?id=90406&n_date=20050622&cat=Entertainment) wiki experienced.

#

Re:Not again!?!?!?!

Posted by: Anonymous Coward on June 29, 2005 05:21 AM
Despite the pros/cons of wikis, to make sure that bots cannot post to a wiki it is simple to deploy a method similar to the Word Verification aspect of https://www.google.com/accounts/NewAccount

#

Re:Not again!?!?!?!

Posted by: The_Wilschon on June 29, 2005 07:50 AM
One big problem with "Word Verification" or CAPTCHA is the accessibility aspect. See the W3C document on the topic: http://www.w3.org/TR/turingtest/

#

Change logging

Posted by: The_Wilschon on June 29, 2005 07:53 AM
It seems to me that implementing some sort of logging of changes, sent to a trusted user (wiki owner?) would be very useful. This could be sorted by size of change (large changes are very likely defacement), or type of change (if all that was changed was a hyperlink, it was most likely either a spam or a broken link being fixed. Broken links will most likely not be all that common, so the signal to noise ratio here ought to be high), or any number of other things. Then, the trusted user(s) could check up on the likely spams, and ignore those changes which were unlikely to be spams.
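One way to sketch this kind of change log is to summarize each edit by how many lines it touched and whether only hyperlinks changed, then sort the review queue so the most suspicious edits come first. The field names and the ordering (link-only edits first, then largest changes) are assumptions for illustration, not part of any existing wiki.

```python
import difflib
import re

URL_RE = re.compile(r"https?://\S+")


def summarize_change(page, old_text, new_text):
    """Describe an edit by its size and by whether only hyperlinks changed."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # Count added and removed lines; ndiff's "? " hint lines are ignored.
    changed = sum(1 for line in diff if line.startswith(("+ ", "- ")))
    links_differ = set(URL_RE.findall(old_text)) != set(URL_RE.findall(new_text))
    # A link-only edit leaves the text identical once every URL is masked out.
    same_prose = URL_RE.sub("URL", old_text) == URL_RE.sub("URL", new_text)
    return {"page": page, "changed_lines": changed, "link_only": links_differ and same_prose}


def review_queue(changes):
    """Order changes for a trusted reviewer: link-only edits first, then biggest edits."""
    return sorted(changes, key=lambda c: (not c["link_only"], -c["changed_lines"]))
```

The resulting list could be mailed to the wiki owner once a day, so that the reviewer checks the top of the list and skims or skips the rest.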

#

Wikis are a joke

Posted by: Anonymous Coward on June 29, 2005 09:21 AM
How can people (mostly programmers) be so naive and let their own web pages open to criminals?!

I have never understood how this was possible. Why not use wikis for corporate web sites? No one will put insults and kiddy porn spam after all, we're in the best of all worlds, aren't we?

Other than that, I found wikis are for developers and spammer bots only, because the only time I set up a wiki, people complained it was too complicated to use, and preferred the forum. Too complicated, you betcha: unless you start learning the wiki language with its own markup, you can't edit or add any entry. Unless you're a... Programmer.

Simple, isn't it?

#

Re:Wikis are a joke

Posted by: walt-sjc on June 29, 2005 10:37 PM
Your one poorly implemented experiment went bad so all wikis are bad? Good logic that...

Wikis are not forums. They can't be replaced with forums. Some wikis do have forums as an extra module however to form a complete solution. Forums are about discussions. Wikis are about collaborative documents. The two are not the same.

In our multiple well-implemented wikis, we have hundreds of technical and non-technical users who have no issues with wiki notation. Furthermore, good wikis have tools that allow conversion from HTML to wiki notation. Spend 5 minutes learning the notation or turn on the online help and learning wiki notation is a non-issue. After those 5 minutes of learning, it certainly is much faster to type than HTML.

Wiki spam is no different than forum spam. Same problem, same solutions.

#

Spam by ASN

Posted by: Karsten M. Self on June 29, 2005 09:38 AM

I've been tracking email spam by ASN for some time. The basic theory is that there are some places from which abuse is far more likely to come than others.

I'm finding that the same rule applies to wikis, as I also admin TWikIWeThey (http://twiki.iwethey.org/). In our case, it's AS4134 (China Telecom) which has been the overwhelming source of spam. The entire AS (you can get assignments from the CIDR Report, http://www.cidr-report.org/) is now null-routed at the server.

Looking over the spam reports at the Portland Pattern Repository, I'm finding a pretty familiar AS distribution. Frequency, AS number, and AS name follow:


  • 35 4134 CHINANET-BACKBONE
  • 33 4837 CHINA169-BACKBONE CNCGROUP China169 Backbone
  • 25 4814 CHINA169-BBN CNCGROUP IP network China169 Beijing Broadband Network
  • 10 6800 SAMARA-INTERNET-AS Samara-Internet, Ltd
  • 6 4812 CHINANET-SH-AP China Telecom (Group)
  • 4 7470 ASIAINFO-AS-AP ASIA INFONET Co.,Ltd.
  • 3 9394 CRNET CHINA RAILWAY Internet(CRNET)
  • 3 9304 HUTCHISON-AS-AP Hutchison Global Communications
  • 2 9931 CAT-AP The Communication Authority of Thailand, CAT
  • 2 8866 BTC-AS Bulgarian Telecommunication Company Plc.
  • 2 7482 APOL-AS Asia Pacific On-line Service Inc.
  • 2 3209 Arcor IP-Network
  • 2 1680 NetVision Ltd.
  • 2 15471 SNR-RO SNR - Societatea Nationala de Radiocomunicatii



To map an IP address to an AS, you can use the reverse DNS server at asn.routeviews.org (TXT record). See the Routeviews Project (http://www.routeviews.org/) homepage for more information.
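As a rough sketch of that lookup, the helper below only builds the DNS name to query. It assumes the zone follows the usual reverse-DNS convention of reversed IPv4 octets prepended to the zone name; that format and the function name are assumptions, and the comment above does not spell them out.

```python
import ipaddress


def routeviews_query_name(ip, zone="asn.routeviews.org"):
    """Build the TXT query name for an IPv4 address, assuming the zone
    uses reversed octets the way in-addr.arpa does."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError for malformed input
    octets = str(addr).split(".")
    return ".".join(reversed(octets)) + "." + zone
```

The resulting name can then be looked up with any DNS tool that can fetch TXT records, for example `dig +short TXT` followed by the name. Check the Routeviews documentation for the exact format of the answer before relying on it.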

#

Have a good backend

Posted by: blindcoder on June 29, 2005 04:45 PM
The cleanup procedure can be made a whole lot easier if you have a good backend to your wiki.
For example, on ROCKDoc I use SubWiki, which uses Subversion as its backend. This makes it easy to remove spam:
svn merge -r spammyrev:spammyrev-1 .
And it's gone. I'm also using the chongqed blacklist to check for spam in contributions and comments, and I maintain a little blacklist of my own. So far there have been 3 or 4 successful spammings in almost 300 revisions, which isn't too bad. None of them survived longer than a few hours, and each was also removed manually from the backend. The part now shows '[SPAM REMOVED]'.

#

WikiSpam - a solved problem for us

Posted by: Anonymous Coward on August 12, 2005 12:27 AM
I run a fair number of wikis (somewhere in the region of 20), all with a low turnover, so I guess you'd call them "ghost" wikis, although they have useful info on them.

We've solved the spam problem by having all changes mailed to a small group of checkers. In the rare case where the change is spam, one of the checkers will just revert the page. Average time for spam to survive is an hour or so.

We also noticed that the spammers were using humans to edit the pages - I can imagine some sort of sweatshop in China filled with people editing wikis. So CAPTCHAs are NOT going to work, contrary to the advice in the article.

Spammers also come from searches like "wiki sandbox" on Google. We have added a robots.txt file so that our sandbox pages are hidden from Google. This seriously reduced the number of spammers arriving.
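A robots.txt rule of this kind can be checked before deployment with Python's standard robots.txt parser. In the sketch below, the /wiki/SandBox path and the wiki.example host are hypothetical; substitute whatever URL prefix your own sandbox pages use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides a sandbox page from all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/SandBox
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With these rules, `is_crawlable(ROBOTS_TXT, "http://wiki.example/wiki/SandBox")` is False while other pages remain crawlable. Note that robots.txt only asks well-behaved crawlers to stay away; it keeps the sandbox out of search results but does nothing to stop a spammer who already knows the URL.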

Rich.

#
