Empirical tags experiment



Looks like Prof Chuck and his graduate student might be about to do an interesting experiment.

“a study where we compare information retrieval times and errors when people are using 2 kinds of info organizations: traditional hierarchical/taxonomic categories versus tags.

Between groups: We are thinking of having subjects come in and sort 100 photographs into categories or tag 100 pictures. Then, a few days or one week later, they would have to retrieve a subset of those pictures. Of course we would control the amount of time spent on the pictures.”

It got me thinking about maths proofs of the same problem. Presumably, you could do some statistical analysis of the problem? And that might let you know something about the relevant pros and cons of, say, how many tags to use?

(This all then made me realise how much maths I’d forgotten or never knew, but anyway, here goes some thinking out loud…)

  • Let’s say P(x) is the probability you find x within a certain time. And P(x|t) and P(x|h) are the probabilities that you find x given that it is either tagged or filed in a hierarchy. (This is where I remember normally stopping in my proofs, but again – “anyway” …)
  • And um, let’s assume that there’s only one route to x in a hierarchy, and that there are a number of different routes to x via tags (routes being determined by the different folders you have to open or the different tags you have to click). The exceptional case is where x only has one tag, in which case the tagsonomy is a hierarchy by another name. I suppose I’m also assuming that this is all in a digital world: the speed with which you can follow a route is essentially the speed you click your mouse, so pretty much equal in both hierarchies and tag-based systems.
  • Intuitively, assuming you don’t know where x is beforehand, I’d guess that tags make for more probable quick discovery. Whereas in a hierarchy there’s only one possible route to x, in a tagged system there are normally several, i.e. P(x|t) > P(x|h).
  • One of the things it might be useful to factor in is knowledge of the system. If K is that knowledge and we say K can have a value in the range of 1 (perfect knowledge) to 0 (perfect ignorance) then we can say a couple of things.
    1. If K=1, then the only differentiating factor is the time it takes to get there, or length of route (if you click at the same pace). For hierarchies the length of route is determined by x’s level in the hierarchy. Nothing else. So if x is at the bottom of a hundred-tier hierarchy, you get there in a hundred clicks.

      For tag-systems, the length of route is more complicated. It’s presumably some function of the number of items with exactly the same tags. But essentially the length of route is 1 (the tag you know it’s under) and however many pages of items with the same tag you have to scroll past to get to x. One thing that’s interesting here is that you can improve your speed of retrieval by knowing how many items there are with each tag. [e.g. If with perfect knowledge, I know that this page is tagged under “stats” and “idiocy”, and if I see something like stats(2) idiocy (100,000) I can speed up my route.]

    2. How about if K=0? Well for hierarchies, I’d guess P(x|h) is some function of the number of levels and the number of branches at each level. So for a one-level classification system with two branches I’ve got a fifty-fifty chance of finding x in one step (average route length = 1.5). And, erm, the maximum number of steps is the total number of possible locations. Which on a regular hierarchy I think is the sum, for l = 1 to the number of levels, of (number of branches)^l. [So a 4-level hierarchy with 4 branches at each level should have a maximum route length of 340 (4 + 16 + 64 + 256).]

      For tag-systems, it depends on how many tags an item has. If an item x has t different tags, and there are n different possible tags x could have, then the maximum route to x is n-(t-1). So the more tags you give something the easier it is to find. Interestingly, if I’ve got any of this right, that should mean that if you’re searching a tag-system you know nothing about (like del.icio.us in Sanskrit) your worst-case scenario is always better than it would have been in a hierarchy.
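
To check the arithmetic in the two cases above, here is a rough Python sketch. The function names are made up for illustration, and so is the tag example with n = 20 and t = 3. It also assumes that a blind search tries each location (or each tag) exactly once, in no particular order, and that every location is equally likely to hold x.

```python
def hierarchy_max_route(levels: int, branches: int) -> int:
    """K=0 worst case in a regular hierarchy: the total number of
    locations, i.e. the sum of branches**l for l = 1..levels."""
    return sum(branches ** l for l in range(1, levels + 1))


def tag_max_route(n_tags: int, t_tags_on_x: int) -> int:
    """K=0 worst case in a tag system: n - (t - 1) tags to try
    before one of x's t tags must turn up."""
    return n_tags - (t_tags_on_x - 1)


def blind_average_route(locations: int) -> float:
    """Average number of steps to hit one of `locations` equally
    likely spots, trying each once: (N + 1) / 2."""
    return (locations + 1) / 2


def rarest_tag(tag_counts: dict[str, int]) -> str:
    """K=1 shortcut: start from the tag with the fewest items,
    so there is less to scroll past."""
    return min(tag_counts, key=tag_counts.get)


print(hierarchy_max_route(4, 4))   # 4 + 16 + 64 + 256 = 340
print(blind_average_route(2))      # 1.5
print(tag_max_route(20, 3))        # 18 (hypothetical n and t)
print(rarest_tag({"stats": 2, "idiocy": 100_000}))  # stats
```

Note that the two worst cases count different things (locations visited versus tags tried), so comparing them directly is only as good as the assumption that a step costs the same in both systems.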

  • Hmm. Beginning to lose the thread. Might well come back to this, but if there are any stats whizzes who haven’t sighed too heavily at all this, I’d love to see some more rigorous thinking about this sort of thing.

  • And if K is somewhere between 0 and 1 (you’ve got some idea where it is but aren’t exactly sure)?