
Human-powered comment spam


I'm not really sure what to make of this. There've been half a dozen times in the past month when a new kind of comment has shown up on various of my journal entries: in each case, the comment was pretty clearly written by a real human who appeared to have actually read the entry in question, but in each case the name and email address and URL given suggested that the real goal in posting the comment was to create traffic to a commercial website. (In some sense this is an old kind of comment, in that it's probably how comment spam was originally done, but I haven't seen so much of it before.)

I'm not sure what to do with these. On the one hand, I don't want to allow comment spam; on the other hand, they appear to be real people who are, at least to some degree, contributing to conversation.

In one case, I allowed the comment but removed the advertising/link. In another case, I took down the comment, but haven't decided whether to restore it yet, or in what form.

I was especially entertained by the one where I redacted the comment but allowed it (with a note from me explaining why I'd redacted it and telling them to get in touch with me if they were a real person), because a week or two later I got another comment on the same entry, claiming to be from another person, saying that they know the original commenter, and that the original commenter would be distressed to learn that I'd thought his comment was spam, because really he just likes to link to his website, which is: [and then the name and URL of the original commenter's website]. I've certainly received meta-spam before, but it's usually been email about how to stop email spam, rather than a comment on a specific other comment. Note to anyone planning to try this approach: if person B, whom I don't know, vouches for the trustworthiness of person A, whom I don't know, then from my point of view that's no different than person A vouching for their own trustworthiness. And that's true even if A and B are actually different people, which is hard to confirm online.

All this is causing me to think more about my criteria for legitimate comments. (I have a feeling I've said some or all of the following before, but I'm too lazy right now to go look for it.)

At one end of the spectrum, if a regular reader makes a substantive comment and gives the URL of their non-commercial home page or journal, that's obviously fine; at the other end of the spectrum, if a bot posts a substanceless comment that isn't related to the entry, and gives a product or business name as their name, and gives the URL of a site that obviously exists only to harvest AdSense clicks, that's obviously spam.

But what about the in-between areas?

Bots are getting a little smarter--for example, some of them have started quoting entry text in their spam comments, a technique I thought of quite a while ago but didn't mention 'cause I didn't want to give the spammers any ideas. But the bots are still generally really obviously bots. Even the cleverest bot comments are extremely generic and don't appear to have much connection to the topic except in the most general kind of way. (Insert note about various people's ideas that spam will lead to AI here.)

And until recently, it wasn't cost-effective for humans to post much comment spam. Why bother, when a bot can do it faster and cheaper and nearly as effectively?

But now, with humans regularly creating comments in that gray zone, I need to come up with some workable rules of thumb for myself.

Certainly if a regular reader links to a commercial site that appears to me to be of value to my readers, that's fine. And if anyone (even a regular reader) links to a really obviously spammy site, I'll probably remove at least the link, and maybe the comment too. (Please don't test this; I don't claim to be perfect or consistent, and it will only annoy me.)

So the main uncertain area is commercial sites that look vaguely relevant but not all that relevant. (And that look at least semi-useful, not just an AdSense harvester site.) If such a URL is attached to a really good comment, or comes from a regular reader or someone else I know and trust, that's probably fine. For that matter, if I know you and you want the URL attached to your comment to be the URL of an unrelated commercial site that you're affiliated with, that's probably fine too. (I'm drawing a distinction here between the URL you put in the URL field of your comment, and URLs included in the text of the comment as links--the former don't need to be as relevant as the latter, especially if I know you or you're a regular reader.)

But if I don't know you, and you post a vaguely relevant comment that contains a link to a vaguely relevant commercial site, I think I'm going to continue to at least remove the URLs, and probably remove the comments as well. In such cases, if the main purpose of your comment wasn't to increase traffic to your site, then feel free to drop me a note in email and let me know, and I may consider allowing the comment after all.

I guess the key overall point here is that I have no interest in allowing my journal to be used by a random stranger (even if they're a human instead of a bot) to increase traffic to their website, especially if the goal of their website is to make money from ads without supplying any value to visitors. There are at least two components to that: I don't want my journal to be used to improve the PageRank of such sites, and I don't want my journal to be used to lead my readers to such sites. On the other hand, I'm happy for my journal to be used to increase traffic and PageRank for useful, valuable, and/or interesting websites, especially if those sites are affiliated with my friends and/or regular readers.

. . . While I'm on the subject of spam, I wanted to mention three things I've been amused by in a lot of spam lately:

  • 419 scam/advance fee fraud letters that start out with a salutation like "My dearest love," and then immediately say "I'm sure you will be surprised to see this, because we have never met."
  • A specific piece of spam (that I've received many copies of) that contains the line "I am ready to kill myself and eat my dog, if medicine prices here ([URL removed by Jed]) are bad."
  • [Added the next day when I realized I'd left it out.] A specific piece of comment spam (that I've received many copies of) that contains the phrase "goose bumps and e-motions, the design of your web page really got me!!!" I really like the idea of "e-motions," which I guess is short for "electronic emotions"?

8 Comments

Wow, the perils of having enough readers to actually generate comment spam. I have few enough readers of my blog that I have never once had any comment spam...

and btw, I have a can't miss method for everyone to double their money in four months! All you need to do is email me at...


Laurie Edison and I get these with some frequency on Body Impolitic. We delete the links and leave the comment standing, which defeats the purpose but opens the conversation.


"if person B, whom I don’t know, vouches for the trustworthiness of person A, whom I don’t know, then from my point of view that’s no different than person A vouching for their own trustworthiness."

That's an interesting statement, since unknown people vouching for other unknown people/companies is a basis for trust on other major websites. The eBay feedback system is entirely based on this assumption, for example. And naturally, this system gets abused on those sites as well (fraudulent eBay accounts building up huge feedback ratings by doing loads of micropayment transactions with automated sellers, for example).

By the way, having a blog with few readers is in no way a safeguard against comment spam (from bots at least). A friend of mine has a blog with very few readers outside a specific (Dutch speaking) community, but he has had bots placing comments on his site (in English, so incredibly obvious).


Jay and Jacob re number of readers: I think probably the main defining factor in comment spam is how likely your blog is to appear high in search results for a given keyword. I've always assumed that the main way comment spammers find blogs to spam is by searching for terms relevant to their spam.... Though they also appear to look for blogs that have a high general PageRank, regardless of specific terms. The central idea being that most comment spam is primarily aimed at search engines (mostly Google, because of the way PageRank works) rather than at site visitors.

Debbie: That sounds like a good approach. In most of the cases I've seen so far in my blog, the comments are pretty vague and insubstantial, not really lending themselves to furthering the discussion per se, but maybe I'll shift toward allowing (but redacting) them as my default.

Jacob: As you note, the system does get abused on those sites. What makes the system work at all on those sites is volume (plus various safeguards built into the system); if an eBay seller has only one point of feedback, and it's a five-star rating that says "THIS GUY IS THE BEST EVER!!!! TOTALLY AWESOME!!!! AND REALLY STUDLY, TOO!!! YOU SHOULD TOTALLY GO OUT WITH HIM!!!", that doesn't make me think "Huh, I guess I should buy stuff from him, and maybe go out with him too." Instead, I tend to think "Huh, I guess he thinks he can fool people by giving himself feedback." This happens with Amazon all the time -- there'll be an unknown book by an unknown author, from a vanity press, that has five anonymous rave reviews that all say essentially the same thing, and everyone who looks at the page says "Oh, I see the author has posted a bunch of fake reviews of their own work."

So, yeah, if I get a hundred notes vouching for person A, and they don't appear to have the same agenda as the original posting from person A, and they give the strong appearance (under careful scrutiny) of actually being from different people, then I might be a little more likely to believe that person A is for real. Then again, as you noted, feedback systems get abused despite all their safeguards, so I'd still be reluctant to put a whole lot of weight on those hundred notes.

I should add that this is only an issue (in the case of comments on my blog) when person A has behaved in a spammy way in the first place. If person A has posted an interesting and/or substantive and/or funny contribution to the conversation, and didn't link to some irrelevant commercial site, then I pretty much don't care whether they're the same person as some other persona that's also posting comments.


Plus -- do you know who's who? I mean, for example, I tend to post here as Jacob, but there's a post above from a different Jacob. Perhaps you don't know that Jacob, but you might have assumed it was me, and given credence to whatever they posted since I've posted here before many times (and you know me in, as they say, Real Life). Or do you see more identifying info at your end (such as IP address, or the fact that I've signed in through Typekey)?


[There was a piece of comment spam here that consisted simply of a spam URL. I'm removing that, but leaving this here as a placeholder so as not to spoil V.'s subsequent joke.]


Boy, I'm tempted to post here as Nbkvqyb now, but I suppose I shouldn't.

Thanks,
-V.


Jacob: Yeah, good point--I noticed that the other Jacob's email address wasn't the same as yours, but other readers might well not have known that.

V: :) But just as well that you didn't; I've now added any name starting with "Nbkvq" to my comment-spam filter, so all such comments will henceforth be automatically junked.

Funny thing: in the few days since I posted this entry, the amount of (obviously machine-generated) comment spam that's been getting through my comment-spam filter has gone through the roof. Several dozen such comments a day. Fortunately, almost all of those comments go to moderation for various reasons (mostly because they're posted to entries more than a couple of weeks old), so they never appear publicly (for example, there were a bunch more Nbkvq comments on other entries, but those others went to moderation); still, annoying. And although I'm ever-grateful for Movable Type's spam filter, unfortunately it doesn't learn automatically; I have to figure out what each new wave of spam has in common, and manually add stuff to the filter. Not terribly onerous or time-consuming, but adds a little to the annoyance of the spam.
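
For anyone curious, the two rules mentioned above (junk any name starting with "Nbkvq", and hold comments on older entries for moderation) might look roughly like this. This is only a sketch in Python, not how Movable Type's filter actually works; the `Comment` class, the `classify` function, the prefix list, and the 14-day cutoff are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical values for illustration; not Movable Type's actual settings.
JUNK_NAME_PREFIXES = ("nbkvq",)        # manually maintained, lowercase
MODERATION_AGE = timedelta(days=14)    # "a couple of weeks", assumed


@dataclass
class Comment:
    author_name: str
    entry_date: date   # when the entry being commented on was posted
    posted_on: date    # when the comment was submitted


def classify(comment: Comment) -> str:
    """Return 'junk', 'moderate', or 'publish' for a comment."""
    name = comment.author_name.strip().lower()
    # Rule 1: names matching the manually maintained prefix list are junked.
    if name.startswith(JUNK_NAME_PREFIXES):
        return "junk"
    # Rule 2: comments on entries older than the cutoff wait for moderation.
    if comment.posted_on - comment.entry_date > MODERATION_AGE:
        return "moderate"
    return "publish"
```

The prefix list is the part that has to be updated by hand with each new wave of spam; the age check catches most of the rest without any maintenance.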