Can a computer determine an author's sex?

Interesting item in Nature: "Computer program detects author gender." Claim is that "[the] simple scan of key words and syntax is around 80% accurate on both fiction and non-fiction" in distinguishing male authors from female authors.

I'm immediately dubious, of course; that 20% is too big a chunk to ignore. Still, their generalities about writing by male and female authors are interesting if taken as generalities rather than as Universal Truths.

Among the books misclassified: Possession and Remains of the Day. I'd love to give them some Tiptree. And perhaps Raphael Carter's "Congenital Agenesis of Gender Ideation."

Also interesting: they claim their software can distinguish fiction from nonfiction with 98% accuracy. That's pretty cool—I wonder how they've classified the Weekly World News.

(Okay, I know, that was a cheap shot. They presumably mean something like "book-length material published by a major publisher as nonfiction" versus the same published as fiction. Still, the claim sets up a dichotomy that I don't think is justified. Where does memoir fit? What about fictional memoir? What about creative nonfiction? What about epistolary novels containing news clippings? And are books of mythology fiction or nonfiction?)

In my experience with statistical NLP tools, 80% accuracy is really quite crappy. Robust parsers are an exception -- they're usually in the 80-85% accuracy range -- but for tasks such as author identification, you'd really expect something above 90%.

That said, any statistical NLP tool is only going to be as good as the corpus that feeds it, so if they're leaving out authors who cross the line between "male" and "female" writing (if there even is a line, I mean), that will of course skew the results. It's a weird sort of tyranny of the majority, though that's a hyperbolic way of putting it.