It seems that over the past two days, a bot or two has begun spamming my comment form with junk content. In general, I'm pretty trusting of users who comment on articles, requiring only that they pass a simple CAPTCHA. I use reCAPTCHA, but apparently either it has been bypassed well enough to let spammers through, or there are malicious human beings out there trying to make a buck by spamming my comment forms.
Even though all HTML is escaped before being displayed on my comment pages, the spam is still a nuisance and crowds out the legitimate comments on my site. To deal with this, I have added a couple of new protections. I now record the IP address of each comment (for spam-banning purposes alone), and have built in the ability for me to mark items as spam. I'm actually keeping the spam in the database, because if I need to, I will apply some Bayesian filtering, which needs examples of spam to learn from.
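To make those two protections concrete, here is a rough Python sketch of the idea. It is not the code running this site: the Comment record, the mark_as_spam and is_banned helpers, and the banned range (203.0.113.0/24, an address block reserved for documentation) are all hypothetical names and values chosen for illustration. Checking the stored IP against a banned range is just one way it could be used.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical comment record: the IP is stored only for spam banning."""
    author: str
    body: str
    ip: str
    is_spam: bool = False


# Placeholder ban list; 203.0.113.0/24 is a documentation-only range.
BANNED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_banned(ip: str) -> bool:
    """Return True if the address falls inside any banned network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BANNED_NETWORKS)


def mark_as_spam(comment: Comment) -> None:
    """Flag the comment instead of deleting it, so it stays available
    as a training example for a filter later on."""
    comment.is_spam = True


if __name__ == "__main__":
    junk = Comment("bot", "buy cheap pills", "203.0.113.45")
    mark_as_spam(junk)
    print(junk.is_spam, is_banned(junk.ip))
```

Flagging rather than deleting is what keeps the spam around as data; a comment list would simply skip anything with is_spam set.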
How would Bayesian spam filtering work?
Bayes' Theorem describes the probability of one event given that another event has occurred. For more detailed information, check out the Wikipedia page for Bayes' Theorem. Using this theorem, a computer system can look at past messages that have been marked as spam or not spam and, based on the contents of a new message, assign a probability that it is spam, without human interaction. This is most often used with email systems, but it would work just as well for my commenting system.
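As a minimal sketch of what such a filter might look like, here is a toy word-count naive Bayes classifier in Python. This is an illustration under simplifying assumptions, not a finished implementation: the class name, the "spam"/"ham" labels, the tokenizer, and the add-one smoothing are all my own choices for the example, and a real filter would need far more training data than four comments.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy naive Bayes classifier over word counts (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase and split into simple word tokens.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Record one past comment as 'spam' or 'ham' (not spam)."""
        self.word_counts[label].update(self.tokenize(text))
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | words) using Bayes' Theorem."""
        total = sum(self.message_counts.values())
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        words = self.tokenize(text)
        log_scores = {}
        for label in ("spam", "ham"):
            # Prior P(label), smoothed so an untrained filter returns 0.5.
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            n_words = sum(self.word_counts[label].values())
            for word in words:
                # Likelihood P(word | label) with add-one smoothing; the
                # extra +1 leaves room for words never seen in training.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalize the two log scores into a probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


if __name__ == "__main__":
    spam_filter = NaiveBayesSpamFilter()
    spam_filter.train("buy cheap pills now", "spam")
    spam_filter.train("cheap pills online buy", "spam")
    spam_filter.train("great article thanks for sharing", "ham")
    spam_filter.train("I enjoyed this post about filtering", "ham")
    print(spam_filter.spam_probability("cheap pills"))
    print(spam_filter.spam_probability("great post thanks"))
```

Working in log space avoids multiplying many tiny probabilities together, and the smoothing keeps a single unseen word from forcing the result to zero. A comment scoring above some threshold (0.9, say) could then be held for review rather than published.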
Hopefully the spammers will realize it is futile, or I can just ban an IP block. If neither of those works, I don't want to have to groom every comment on my site by hand, so I may implement a Bayesian filter. If I do, I'll post details about the implementation, and possibly source code, at that time. This could also have a big impact on my video submission system, but time will tell. I've already had to change the URL of my comment system several times, and I've considered eliminating it entirely.
What do you think about spam and what steps are reasonable to vet visitors and content submitters?