As we're confronted with an ever-vaster quantity of unsorted online information, one of the greatest challenges of the coming decades may be finding ways to locate a given needle of information in the haystack of the Web.
Indexing is the process of providing multiple topic-based entry points to a collection of data—specifically, providing a way for the audience to find information on a particular topic, given a work that's too large to easily search by hand.
Indexing is one of the best tools we have (so far) for information access and retrieval. A good index allows a reader to find desired information quickly and easily. Like other systems of organization, an index provides an ordered/structured framework for making sense out of what could otherwise be seen as chaos.
So it shouldn't be terribly surprising that creating a good index requires skill and time. The American Society of Indexers is a nonprofit organization founded in 1968 to promote good indexing. Professional indexers carefully create organized, coherent, internally consistent indexes that make it easy for readers to find information.
However, if an indexing system is haphazard or arbitrary, oddities result. Imagine how difficult it would be to use taxonomies like the Dewey Decimal System (or the Library of Congress numbering system) if one top-level category ("100," say) meant "purple things that are bigger than a breadbasket" while another top-level category ("200," for instance) indicated "natural science."
Plenty of such semi-arbitrary indexing systems have accreted on the Net. For instance, the Internet Movie Database, or IMDb (now owned by Amazon), associates keywords with each movie, but the set of keywords has grown organically over time, with new keywords added (I believe) by users of the system. As a result, the keywords are not all at the same level of abstraction, and many films end up with an entertaining hodgepodge of keywords. That hodgepodge is the basis for the game of Keywords: one player presents a list of keywords that could be used in a movie database to describe a film (or novel or other item), and the other players try to guess what the described item is. Here are a couple of real keyword lists from the IMDb:
- oz, based-on-novel, private-detective, smoking, road, surreal, violence, dominant-mother, elvis, hitman, murder (answer)
- drama, plastic-surgery, love, torture, dreams, fascism (answer)
(This second list leaves out some of the keywords used in the IMDb, to avoid making the answer too obvious.)
And here are some invented keyword lists not derived from the IMDb:
- dog, queer, love, insanity, waitress (answer) (from Kam)
- cross-dressing, siblings, mistaken-identity, based-on-play (answer)
- based-on-novel, aristocracy, murder, comedy, inheritance, memoirs, adultery (answer)
For keyword suggestions, see the IMDb's list of keywords. Unfortunately, there's no obvious way to add new keywords in the new IMDb.
Usenet news-posting software provides a keywords field for each posting. About five years ago, to help people who find themselves at a loss for keywords that concisely describe their postings, Dominus created a script that randomly chooses n keywords from user-specified files. (By default, the script generates 4 keywords chosen at random from the UNIX word-list file, /usr/dict/words.)
- Keywords: Bellatrix clearance covert planetoid
- Keywords: cavemen Madagascar pee pit
- Keywords: exclaim glamor Lima Manama
- Keywords: arrear censor German transect
- Keywords: forgetting grope Kochab slice
- Keywords: dazzle rebutted shipshape signify
Of course, you could use your message text as the source file for keywords if you wanted something vaguely relevant-seeming but not quite useful.
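The idea behind such a script can be sketched in a few lines of Python. This is not Dominus's script, only a minimal illustration under stated assumptions: the function names (`load_words`, `pick_keywords`, `keywords_header`) and their parameters are invented here, and the case-insensitive sorting is a guess based on the sample lines above, which all happen to be in alphabetical order.

```python
import random
from pathlib import Path


def load_words(paths):
    """Collect whitespace-separated words from each file in paths."""
    words = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        words.extend(text.split())
    return words


def pick_keywords(words, n=4, rng=None):
    """Return n distinct words chosen at random, sorted case-insensitively.

    The sorting is an assumption made to match the sample output;
    the original script may behave differently.
    """
    rng = rng or random.Random()
    unique = sorted(set(words))  # drop duplicates so no keyword repeats
    if n > len(unique):
        raise ValueError("not enough distinct words to choose from")
    return sorted(rng.sample(unique, n), key=str.lower)


def keywords_header(words, n=4, rng=None):
    """Format the chosen words as a Usenet-style 'Keywords:' line."""
    return "Keywords: " + " ".join(pick_keywords(words, n, rng))
```

To draw from a system word list, pass its path to `load_words` (on many current systems the file lives at /usr/share/dict/words rather than /usr/dict/words). To get the "vaguely relevant-seeming" effect, pass the words of the message itself, for example `keywords_header(message_text.split())`, where `message_text` is a hypothetical string holding the posting.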
These days there are plenty of automated search engines that attempt to index some subset of the Web. Such search engines don't create really effective indexes, for the same reason that automatically generated book indexes aren't generally very good: they index based on words rather than concepts. A search for a given term won't turn up pages that don't happen to use that specific term, even if those pages use synonyms for it. The META tag in HTML provides one way for page authors to help with such searches; the author of the page can include a dozen synonyms and relevant search terms in the "content" portion of the META tag, making it easier for searches to turn up relevant matches. Of course, all of the problems with keywords described above apply here as well: the resulting index of the Web is a hodgepodge of terms with no structure, and unscrupulous site owners (and surrealists) can include irrelevant keywords to lure unsuspecting visitors.
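A toy example may make the words-versus-concepts problem concrete. The sketch below builds a simple word-to-pages index in Python; it is an illustration only, not a description of how any real search engine works. The page names, page texts, and function names are invented, and the `meta_keywords` dictionary merely stands in for the terms an author might list in a META tag.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(pages, meta_keywords=None):
    """Map each word to the set of pages whose text or keyword list contains it."""
    meta_keywords = meta_keywords or {}
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
        # Author-supplied keywords are indexed exactly like body words.
        for keyword in meta_keywords.get(name, []):
            for word in tokenize(keyword):
                index[word].add(name)
    return index


def search(index, term):
    """Return the sorted names of pages indexed under the exact term."""
    return sorted(index.get(term.lower(), set()))


# Two hypothetical pages about the same concept, using different words.
pages = {
    "cars.html": "Tips for buying a used car",
    "autos.html": "Tips for buying a used automobile",
}

plain = build_index(pages)
helped = build_index(pages, {"autos.html": ["car", "vehicle"]})

print(search(plain, "car"))   # ['cars.html']
print(search(helped, "car"))  # ['autos.html', 'cars.html']
```

The plain index misses the second page because it never uses the word "car"; the author-supplied synonyms fix that. Nothing in the index checks whether those keywords are honest, though, so a page listing "elvis" as a keyword would turn up in searches for Elvis just as readily.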
Another problem with automated indexing is that the indexing software doesn't understand context, and so doesn't know what's important to include. For instance, on the page in which I explain my domain name, I mention that in this context the term "kith" has nothing to do with a certain roleplaying game. That explanation would be helpful to someone who visited my site thinking they would find information about that game, but it also means that search engines index that page under both "kith" and the name of the roleplaying game, making it more likely that searchers will visit that page expecting to find roleplaying information. I'm carefully avoiding using the name of the game in this column, to prevent the same indexing problem from mistakenly leading people to this page.