Computer can determine author sex?

Interesting item in Nature: "Computer program detects author gender." The claim is that "[the] simple scan of key words and syntax is around 80% accurate on both fiction and non-fiction" at distinguishing male authors from female authors.

I'm immediately dubious, of course; that 20% is too big a chunk to ignore. Still, their generalities about writing by male and female authors are interesting if taken as generalities rather than as Universal Truths.

Among the books misclassified: Possession and Remains of the Day. I'd love to give them some Tiptree. And perhaps Raphael Carter's "Congenital Agenesis of Gender Ideation."

Also interesting: they claim their software can distinguish fiction from nonfiction with 98% accuracy. That's pretty cool—I wonder how they've classified the Weekly World News.

(Okay, I know, that was a cheap shot. They presumably mean something like "book-length material published by a major publisher as nonfiction" vs. the same published as fiction. Still, the claim sets up a dichotomy that I don't think is justified. Where does memoir fit? What about fictional memoir? What about creative nonfiction? What about epistolary novels containing news clippings? And are books of mythology fiction or nonfiction?)

1 Comment

In my experience with statistical NLP tools, 80% accuracy is really quite crappy. Robust parsers are an exception (they're usually in the 80-85% accuracy range), but for tasks such as author identification, you'd really expect accuracy above 90%.

That said, any statistical NLP tool is only going to be as good as the corpus which feeds it, so if they're avoiding line-crossing authors (if there even is a line, I mean), that will of course skew the results. It's a weird sort of tyranny of the majority, though that's a hyperbolic way of putting it.
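
To make the corpus point concrete, here's a rough sketch of how a headline accuracy figure depends on who is in the test set. Every number in it is made up for illustration: the per-group accuracies (85% on "conventional" authors, 40% on line-crossing ones) and the function name `overall_accuracy` are assumptions, not anything reported in the Nature item.

```python
def overall_accuracy(share_line_crossing,
                     acc_conventional=0.85,   # hypothetical accuracy on "conventional" authors
                     acc_line_crossing=0.40): # hypothetical accuracy on line-crossing authors
    """Overall accuracy as a weighted average of the two per-group accuracies."""
    share_conventional = 1.0 - share_line_crossing
    return (share_conventional * acc_conventional
            + share_line_crossing * acc_line_crossing)


# Show how the single reported number shifts with the corpus mix.
for share in (0.0, 0.1, 0.3):
    print(f"{share:.0%} line-crossing authors -> "
          f"{overall_accuracy(share):.1%} overall")
```

The tool itself never changes here; only the mix of authors does. A corpus with no line-crossing authors reports the best-case figure, and the number falls as more of them are included.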