Railing against the categorical imperative

Wikipedia, Wikimedia Commons and other Wikimedia wikis use a system of categorisation for navigation and (to some extent) classification. Commons is a particularly important case because, until images are searchable, they are pretty reliant on categorisation for even basic functionality.

The current system

The theory behind this seems straightforward. On the English Wikipedia, there are categories for British sportspeople; American Ivy League universities; economic terminology. These are clearly useful, no?

Well, yes. But to understand the development of the present category system, you have to understand that, historically, category intersection was impossible. What this means is that you can’t just tell Wikipedia that Jessica Ennis-Hill has the properties of being a sportsperson and British, and expect her to appear in the category for British sportspeople automatically. Rather, a volunteer has to (a) make that link and (b) believe it’s a significant intersection.

When people are asked to say what they know about someone or something (in this case Ms Ennis-Hill), they say things like “she’s English”, “she’s a heptathlete”, “she’s 27”, “she won a gold medal in the Olympics”. In other words, they name properties. As above, however, Wikipedia’s categorisation system doesn’t record properties; it records intersections. Thus her article ends up with a spew of permutations, including both “British heptathletes” and “English athletes”.

Suppose I now decide (as someone did in the past) that her ancestry is an important characteristic (a point I’ll come back to momentarily). Instead of just adding one statement about her nationality, I have to consider adding a multitude of possible categories that relate her ancestry to other characteristics about her: in this case, Black English sportspeople and English people of Jamaican descent.
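To make the contrast concrete, here is a minimal sketch of what recording properties (rather than intersections) could look like. The property names, the `matching` helper and the second, made-up entry are all illustrative assumptions; this is not how MediaWiki or Wikidata actually store anything.

```python
# Hypothetical data: each subject carries plain properties, not
# hand-made intersection categories. Names and values are illustrative.
people = {
    "Jessica Ennis-Hill": {
        "nationality": {"British", "English"},
        "occupation": {"sportsperson", "heptathlete", "hurdler"},
    },
    "Example Novelist": {  # made-up entry, purely for contrast
        "nationality": {"British"},
        "occupation": {"novelist"},
    },
}


def matching(people, **wanted):
    """Return the names whose properties include every requested value."""
    return sorted(
        name
        for name, props in people.items()
        if all(value in props.get(key, set()) for key, value in wanted.items())
    )


# The "British sportspeople" intersection is computed, not curated.
british_sportspeople = matching(people, nationality="British", occupation="sportsperson")

# Recording ancestry is one statement, not a batch of new categories.
people["Jessica Ennis-Hill"]["ancestry"] = {"Jamaican"}
english_jamaican = matching(people, nationality="English", ancestry="Jamaican")
```

Under this model nobody has to decide in advance which intersections are significant: any combination can be asked for after the fact.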

The problems

Underdescription: On the one hand, this system clearly risks underdescription. Since adding an ‘n’th property requires the addition of n categories, an editor simply may not bother.

Overdescription: On the other, editors are tempted to leave a parent category (e.g. “English athletes”) when adding a more specific category (“English heptathletes”), to avoid losing any information — in this case, that Ennis-Hill is not just a heptathlete. [She hurdles, she high-jumps; in the former case editors can be bothered to list her as a hurdler, but her high-jumping does not seem to warrant a separate category.] The result is overcategorisation.

Selectivity: The element of selectivity (working out which intersections deserve categories) creates controversy, as when the creation of a “women novelists” category rightly caused a media storm earlier this year. That was despite the two individual properties the category intersected (women and novelists) being uncontroversial.

Emergence: As Wikidata’s Denny Vrandečić (ever keen to show just how much he deserved his PhD in the subject!) argued recently, strong classification of the kind that categories lead to seems to hit a nerve in the human psyche. (The title of this blog post is borrowed from Denny’s.) To take the classic example, was George Washington “an American” anything? Why? Was Tesla a Serb? Russian? Something else?

Ambiguity: Categories can be ambiguous. Can you be an “LGBTQ blogger” without blogging about LGBTQ issues? Does a “Russian statesman” have to practise his craft in Russia? And so on. Properties (sexuality, place of birth, occupation) are rarely as ambiguous.
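The underdescription point can be made concrete with a little counting. The sketch below rests on a simplifying assumption of mine: that only two-way intersections get categories (three-way ones such as “Black English sportspeople” would only make matters worse). The `pairwise_categories` helper is hypothetical. On that assumption, each new property brings one candidate category per property already recorded, which is the same order of growth as described above.

```python
from itertools import combinations


def pairwise_categories(properties):
    """List every two-way intersection that could become a category."""
    return list(combinations(sorted(properties), 2))


before = pairwise_categories(["nationality", "occupation"])
after = pairwise_categories(["nationality", "occupation", "ancestry"])

# Intersections that only exist because the third property was added.
new_candidates = [pair for pair in after if pair not in before]
```

With two properties there is a single candidate intersection; recording a third adds two more, a fourth would add three, and so on. Recording the property itself is always one statement.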

The solution

The solution to all these problems would be to move towards a system of recording properties directly, and there are signs that projects such as Wikidata are increasing interest in this “new” kind of system. Of course, it is not new. It’s called “tagging” and we’ve been used to it on the internet for a decade; but finally Wikimedians may be coming round to it.

The power of the Wikidata object-property model as well as refinements to search mean that tagging is becoming possible at last, so maybe the days of the category are numbered. Maybe. Here’s to hoping: tagging probably has its own problems; but it’s got to be better than categorisation.
