
Data accuracy


I want an "approximate date" data structure.

I know a friend's birthday, and I know their age to within five years, but I don't know their actual birth year. I want to be able to tell my database (or my calendar) that the year is, say, "1970 +3/-5."

I know the year that a book was published, and even the month, but not the exact day. I want to be able to tell Delicious Library that the publication date is, say, "May [unknown], 1953."
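A rough Python sketch of what such an "approximate date" type might look like. Everything here is an assumption for illustration: the name `ApproximateDate`, its fields, and the `bounds()` helper are hypothetical, and the friend's birthday of June 15 is made up.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class ApproximateDate:
    """Hypothetical date type in which any part may be unknown (None)."""
    year: Optional[int] = None   # best-guess year
    month: Optional[int] = None  # None means "unknown month"
    day: Optional[int] = None    # None means "unknown day"
    year_plus: int = 0           # the true year may be this many years later
    year_minus: int = 0          # ...or this many years earlier

    def bounds(self) -> Tuple[date, date]:
        """Return the earliest and latest dates consistent with what is known."""
        if self.year is None:
            raise ValueError("no year estimate, so no bounds")
        lo_year = self.year - self.year_minus
        hi_year = self.year + self.year_plus
        lo_month = self.month or 1
        hi_month = self.month or 12
        # Clamp the day so that Feb 29 in a non-leap year does not raise.
        lo_last = calendar.monthrange(lo_year, lo_month)[1]
        hi_last = calendar.monthrange(hi_year, hi_month)[1]
        lo_day = min(self.day or 1, lo_last)
        hi_day = hi_last if self.day is None else min(self.day, hi_last)
        return date(lo_year, lo_month, lo_day), date(hi_year, hi_month, hi_day)


# "1970 +3/-5" with a known (here invented) month and day
birthday = ApproximateDate(year=1970, month=6, day=15, year_plus=3, year_minus=5)
# "May [unknown], 1953"
book = ApproximateDate(year=1953, month=5)

print(birthday.bounds())
print(book.bounds())
```

Storing the unknown parts as `None` rather than as a placeholder like "the first of the month" keeps the record honest about what is actually known; a concrete range is computed only when something needs one.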

It's not only dates that have uncertainty attached to them. In the SH fiction submission database, I have a field for author gender; I leave it set to "unknown" unless/until I'm pretty certain of the author's gender. In uncertain cases, sometimes I go looking; when I do that, reasonably often, I find several non-definite clues/hints/signals--for example, the author may have a blog where they refer to their wife, or there may be a photo that looks fairly male, or a reviewer who doesn't know the author personally may use a gendered third-person pronoun to refer to the author. But I don't fill in the field until I'm sure. I'd like to be able to say "probably male, according to 3 strong signals." (Yes, I know I can create my own database fields to simulate this, and I sometimes consider doing that. But I'd really prefer it to be a built-in data type.)
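As a sketch of the "simulate it with my own fields" approach, assuming Python: the `Signal` and `Uncertain` names below are hypothetical, and rating each clue as simply "strong" or "weak" is an assumption made to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Generic, List, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Signal:
    """One non-definite clue pointing toward a value (hypothetical type)."""
    description: str
    strength: str  # assumed to be either "strong" or "weak"


@dataclass
class Uncertain(Generic[T]):
    """A value that stays a guess until it is explicitly confirmed."""
    guess: Optional[T] = None
    signals: List[Signal] = field(default_factory=list)
    confirmed: bool = False

    def summary(self) -> str:
        if self.guess is None:
            return "unknown"
        if self.confirmed:
            return str(self.guess)
        strong = sum(1 for s in self.signals if s.strength == "strong")
        noun = "signal" if strong == 1 else "signals"
        return f"probably {self.guess}, according to {strong} strong {noun}"


gender = Uncertain(
    guess="male",
    signals=[
        Signal("blog refers to their wife", "strong"),
        Signal("photo looks fairly male", "strong"),
        Signal("reviewer uses a gendered pronoun", "strong"),
    ],
)
print(gender.summary())
```

The point of keeping `confirmed` separate from `guess` is that the field can still read as "unknown" or "probably" in reports until someone is actually sure.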

Which brings me to another related issue: I want to be able to specify the source of a piece of information.

So for example, if Wikipedia tells me that no mammals lay eggs, then I'd like to attach a "source:Wikipedia[URLGoesHere][2007-03-30 11:19:33]" tag to that bit of information. At any given time, I can adjust my global level of trust in Wikipedia--the strength of its signals--and thereby get an idea of the likelihood that that particular factoid is true. At this particular moment, I would probably assign Britannica a veracity level of about 98%, and Wikipedia a veracity level of about 85%, but those numbers fluctuate for me.
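One way this might be sketched in Python, assuming a hypothetical `SourcedFact` record and a global `trust` table. Treating a source's trust level as the likelihood of each individual claim is a simplification made for the example, not a real model of veracity.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional

# Global, adjustable trust levels per source (the numbers from the text).
trust: Dict[str, float] = {"Britannica": 0.98, "Wikipedia": 0.85}


@dataclass(frozen=True)
class SourcedFact:
    """A claim plus where and when it was found (hypothetical type)."""
    claim: str
    source: str
    url: str
    retrieved: datetime

    def tag(self) -> str:
        """Render the source tag in the format used above."""
        return f"source:{self.source}[{self.url}][{self.retrieved:%Y-%m-%d %H:%M:%S}]"

    def likelihood(self, trust_levels: Dict[str, float]) -> Optional[float]:
        """Look up the current trust in this fact's source; None if untracked."""
        return trust_levels.get(self.source)


fact = SourcedFact(
    claim="No mammals lay eggs",
    source="Wikipedia",
    url="URLGoesHere",  # placeholder, as in the text
    retrieved=datetime(2007, 3, 30, 11, 19, 33),
)
print(fact.tag())
print(fact.likelihood(trust))

# Lowering trust in the source changes the estimate for every fact from it.
trust["Wikipedia"] = 0.70
print(fact.likelihood(trust))
```

Because the fact stores only the name of its source, not a copy of the trust number, adjusting the global level later re-scores everything that came from that source.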

You may be wondering why I would put the (false, obviously) info about mammals not laying eggs into a database. I wouldn't, which brings me to the last and most pie-in-the-sky part of this wishlist:

I want those sourcing/veracity tags on data in my head.

I come across huge amounts of data and huge numbers of factoids all the time. I try to mentally tag them with info about where I encountered them and how much I trust that source, but my memory's not really good enough to accurately track that. So when I read the other day in Wikipedia about possible theories for the history of modern salutes, I mentally filed that under both "unverified speculation" (because Wikipedia said the explanations were theories and that the true answer wasn't known) and "read it in Wikipedia so probably but not necessarily accurate." But chances are that in a few months or years I'll forget where I read it, and it'll become yet another unsourced factoid floating around in my head.

1 Comment

Can't help with tagging your head, but I totally share your wish that databases could tolerate more approximate info -- even just a universally acceptable symbol for "I don't know this bit." When I was a school nurse, I was supposed to enter the immunization dates on the computer, but many records came with just month and year -- I had to put all of them in as the first of that month, raising the unpleasant idea that I was somewhat fabricating the medical record. And now at my current job I want to enter the TB test results of clients at a methadone clinic, at least for those who have had a positive PPD test at some point in their murky past. But the moment I click "yes" for the result, the computer insists I must fill in the fields for test date, birthplace, and number of mm of swelling on the test -- what part of "heroin addicts aren't renowned for their record keeping" don't they get? Sheesh.
