
Spam-marking


A couple of days ago, I set up my email to use Pair's spam-marking system, which is based on the highly regarded SpamAssassin. It seems to be working pretty well so far. It apparently assigns points (and fractions of points) to each message, based on a variety of spam-identifying criteria; for example, a message being 70% to 80% HTML is worth half a point, while phrases suggesting a baldness cure are worth 2.8 points. If the total point value of the message is above a certain threshold (I'm using the default threshold of 4.0), the message is marked as spam. In my mailreader (Eudora), I have a filter that moves all messages with the "X-Spam-Flag" header set to "YES" into a special Spam folder, which I glance over quickly every once in a while before deleting its contents.
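Here's a rough sketch of that scoring idea in Python, just to make the arithmetic concrete. It is not SpamAssassin's actual code: the rule names are made up for illustration, and treating a total exactly equal to the threshold as spam is an assumption. The second function mimics what the mailreader filter does, which is to look only at the "X-Spam-Flag" header.

```python
from email import message_from_string

THRESHOLD = 4.0  # the default threshold mentioned above


def total_score(hits):
    """Add up the points for every rule a message triggered."""
    return sum(hits.values())


def is_spam(hits, threshold=THRESHOLD):
    # Assumption: a total exactly equal to the threshold also counts as spam.
    return total_score(hits) >= threshold


def belongs_in_spam_folder(raw_message):
    """Mimic the mailreader filter: check only the X-Spam-Flag header."""
    msg = message_from_string(raw_message)
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"


# Hypothetical rule names; the point values are the two examples given above.
hits = {"MOSTLY_HTML": 0.5, "BALDNESS_CURE": 2.8}
print(is_spam(hits))  # these two rules alone stay under 4.0

flagged = "X-Spam-Flag: YES\nSubject: hello\n\nbody text\n"
print(belongs_in_spam_folder(flagged))
```

The split matters: the server does the scoring and stamps the header, and the mailreader only has to do a simple header match.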

I wish I could customize the values. (I know I could, by installing SpamAssassin on my own machine and doing some complicated mail maneuvering, but that would be more of a pain than it's worth.) For example, a message sent to me that consists entirely of a single image is almost 100% likely to be spam, because (thankfully) friends don't send me such messages; so I'd like to give that situation a point value of about 3 or 4. But I think SpamAssassin assigns only about half a point to that situation. That and similarly low point-values for certain items that are invariably spam for me mean that maybe half a dozen spam messages a day aren't coming anywhere near the 4.0 threshold, and thus aren't being marked as spam. But that's still a big improvement over the several dozen a day that were getting through my ad hoc spam filters before.
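To show what a custom value would buy me, here's a small Python sketch. Everything in it is hypothetical: the rule names are invented, and I'm assuming an image-only message picks up about half a point for the image and another half point from an HTML rule. The only thing it demonstrates is how bumping one rule's value moves a message across the 4.0 threshold.

```python
THRESHOLD = 4.0


def rescore(hits, overrides):
    """Total the points, using a custom value wherever one is supplied."""
    return sum(overrides.get(rule, points) for rule, points in hits.items())


# Hypothetical hits for a message that is nothing but a single image.
hits = {"IMAGE_ONLY": 0.5, "MOSTLY_HTML": 0.5}

default_total = rescore(hits, {})                   # stock values
custom_total = rescore(hits, {"IMAGE_ONLY": 3.5})   # my preferred value

print(default_total, default_total >= THRESHOLD)  # 1.0 False
print(custom_total, custom_total >= THRESHOLD)    # 4.0 True
```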

I'm also getting maybe as many as one or two false positives per day (items that are being marked as spam but shouldn't be), and it's possible that some submissions (the ones that are sent in all-HTML, use red boldface text, and don't have the right subject line for submissions) will get marked as spam. But with luck, I'll catch those. (I know I have no real responsibility to authors who deviate that far from our formatting guidelines, but I try to be nice to 'em anyway.)

At some point I'll do a more precise study, counting the number of false negatives, barely-negatives that aren't spam, barely-positives that are spam, and false positives, over (say) a 24-hour period. But for now it seems to be doing a better job than what I had before. And it catches the Nigerian 419 spam, which is all to the good.
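For that study, the tallying could look something like the Python sketch below. It assumes I've labeled each message by hand as spam or not, and it assumes "barely" means within one point of the threshold; that margin is an arbitrary choice, not anything SpamAssassin defines. The sample scores are made up.

```python
from collections import Counter


def classify(score, really_spam, threshold=4.0, margin=1.0):
    """Sort one hand-labeled message into a category for the study."""
    marked = score >= threshold                  # assumption: ties count as spam
    barely = abs(score - threshold) <= margin    # assumption: "barely" = within margin
    if marked and not really_spam:
        return "false positive"
    if not marked and really_spam:
        return "false negative"
    if marked and barely:
        return "barely-positive spam"
    if not marked and barely:
        return "barely-negative non-spam"
    return "clear-cut"


def tally(messages):
    """Count categories over (score, really_spam) pairs."""
    return Counter(classify(score, spam) for score, spam in messages)


# Made-up sample: (SpamAssassin score, my own judgment of whether it's spam).
sample = [(5.5, True), (4.3, True), (3.6, False),
          (0.2, False), (2.1, True), (4.8, False)]
for category, count in tally(sample).items():
    print(category, count)
```

The two "barely" buckets are the interesting ones, since they show how much a small change to the threshold would matter.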

Most of the spam I get appears to get scores around 5 or 6. But the highest score a spam message has received so far was 31.4. Some highlights from the score report that SpamAssassin adds to the header of the message (these numbers, plus others, are added together to get the total score), where "BODY" refers to things found in the message body (rather than headers):

  • 3.5—Forged mail pretending to be from MS Outlook
  • 2.9—BODY: Removes Wrinkles
  • 2.8—BODY: Cures Baldness
  • 2.5—BODY: Talks about exercise with an exclamation!
  • 2.4—BODY: HTML has very strong "shouting" markup
  • 2.1—BODY: Reverses Aging
  • 1.9—BODY: As seen on national TV!
  • 1.5—URI: URL contains username and (optional) password
  • 1.2—BODY: Human Growth Hormone
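As a quick check on the arithmetic, the highlights listed above can be added up in a few lines of Python. The rounding is only there to hide floating-point noise.

```python
# The nine highlighted point values from the score report.
highlights = [3.5, 2.9, 2.8, 2.5, 2.4, 2.1, 1.9, 1.5, 1.2]

listed = round(sum(highlights), 1)
remainder = round(31.4 - listed, 1)  # points contributed by tests not listed

print(listed)     # 20.8
print(remainder)  # 10.6
```

So the highlights account for 20.8 of the 31.4 points, and the tests I didn't list make up the other 10.6.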

3 Comments

I have reason to believe that you can customize your SpamAssassin scoring rules in a config file in your home directory, assuming that Pair's SA configuration is set up that way. Their docs about tests suggest that "score NAME_OF_TEST 3.0" will let you change the score of a test, for example. (They also list the full set of tests, which is pretty amusing anyway.)


Ooh! Way cool. Thanks! I'll definitely try this out.


  • 2.1—BODY: Reverses Aging
  • 1.9—BODY: As seen on national TV!

Out of context--such as when I'm scrolling UP my LJ-friends page--this is too durn funny.

