## The Voting/Experience System: Uncertainty and Reputation

A thought occurred to me a while back: E2 reputations should not just take into account the net number of votes for a WU but also some sense of uncertainty in that number, similar to the way you estimate uncertainty in a poll or a measurement of a physical quantity.

When you make a measurement, you must assess the uncertainty in that measurement. When you make a claim like, "My measurement is consistent with X," usually the salient question is how far the measurement is from X compared to the size of the error bars. When it comes to nodes, if a node has 20 upvotes and no downvotes, odds are that most people on E2 would say it's a quality node. On the other hand, if a node has stats like (+250/-200), then even though it has a rep of 50 it's clear that the E2 audience has a more mixed opinion of the quality. I thought perhaps this should be taken into account in the reputation. Now I'm not sure it's good to take the XP system too seriously, but I thought it might be interesting to consider such an improvement.

The first idea I had in this vein was not a good one. Suppose a WU has U upvotes and D downvotes. The reputation is

R = U - D.

I thought one might define an uncertainty

S = (U + D),

which would just be the total number of votes. The measure of quality would then be something like Q = R/S, or how many error bars away it is from zero, and we could define Q = 0 when no one has yet voted on the node. Now, of course this is a number between -1 and 1, and would be the same for a node with 2 votes or 20. That's not quite right for a reputation. But if you wanted something that goes up when a node gets more votes, you could just take the rep and scale it by the quality, to come up with a new "corrected" rep

R' = |Q|*R = R*|R|/S,

where the absolute value ensures that a net-negative rep stays negative. If a node gets all upvotes or all downvotes, its rep is the same as under the original system, but in the mixed case above of (+250/-200) it would go from R = 50 to R' ≈ 5.6. If you think that's too much of a reduction, you could really use any function f(|Q|) to get R' = f(|Q|)*R. As long as f monotonically increases with |Q|, has a maximum of 1, and is non-negative, it would have all the same qualitative features. Still, the system is based on a rather arbitrary definition of uncertainty.
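A minimal sketch of this corrected rep in Python (the function name is mine, not anything E2 actually runs); I scale by |Q| rather than Q so that an all-downvotes writeup keeps its negative rep:

```python
def corrected_rep(upvotes, downvotes):
    """Scale the raw rep R = U - D by the size of the quality |Q| = |R| / (U + D)."""
    total = upvotes + downvotes
    if total == 0:
        return 0.0              # define Q = 0 when nobody has voted yet
    rep = upvotes - downvotes
    quality = rep / total       # Q lies between -1 and 1
    return abs(quality) * rep   # |Q| preserves the sign of the rep

corrected_rep(50, 0)     # unanimous: stays 50.0
corrected_rep(250, 200)  # contested: pulled down to about 5.6
corrected_rep(0, 10)     # all downvotes: stays -10.0
```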

Then I wondered whether one could ascribe a more meaningful uncertainty. Well, that all depends on what the reputation is supposed to "measure". I suggested before that we might be interested in whether the E2 usership would judge it to be "quality" (i.e. something we want on E2). The easiest way to do this would be to have everyone on the site vote on it, but that doesn't work for myriad reasons (the time involved, coercing the users to actually do it, infrequent or fled users, etc.). However, by looking at the vote totals, we can get some sense of how E2 feels about a WU. This brings up another point. Not only does a big split in votes (between up and down) show uncertainty, but we are probably a lot less certain of the quality of a node with (+4/-0) than one with (+20/-0).

So, we could estimate the uncertainty in the reputation by assuming the users that voted on a node are a random sampling of noders and asking either, "If this is representative of all noders and if we chose another random sampling of users and had them vote, how different a rep would the node be likely to get?" or, "How different could the actual feelings of noders be (the distribution of how they would vote if they read the node) and still be likely to lead to this result?" And of course, one would have to choose some appropriate confidence level to use in those assessments. Unfortunately, the math there gets kind of messy. If we take the case where the number of upvotes U is small compared to the total number of noders who would hypothetically upvote the WU and the same for the downvotes, then we can say that if the total proportion of users who would upvote the WU is p, the probability of a random small sample giving U upvotes and D downvotes is

Prob(U,D) = p^U * (1-p)^D

which is just the binomial distribution, up to a combinatorial factor that doesn't depend on p. The math is a bit more complicated if those assumptions are not valid, namely if a significant portion of those who would upvote (or downvote) a WU have already voted.

With the simple formula above, one would try to assign uncertainties. Take the example of the node with (+20/-0). If we supposed that in reality only 80% of noders would upvote it (so p = 0.8), then the probability that it would happen to get the score (+20/-0) by random chance is Prob(20,0) = (0.8)^20 ≈ 0.01, or about 1%. Let's say we want to give the range of possible values of p with 95% confidence; then we want to find the range of p values that give Prob(U,D) ≥ 0.05. For a score of (+20/-0), we can then say that p ≥ (0.05)^(1/20) ≈ 0.86. So from the votes that have been cast, we can say with 95% confidence that the actual p falls between 0.86 and 1. Unfortunately, if there is a mixture of votes, solving for p will involve finding the root of a polynomial. Likewise, we could use Prob(U,D) to figure out where the rep would be likely to fall if we took a different sample. If we did these more complicated statistical things, then we'd have to figure out how we were going to use this uncertainty to give a corrected rep. Before, we were talking about how many error bars away from zero rep a WU was. One could do something similar to that, or ask questions like how far it is from being perfect (p = 1).
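The root-finding step for the mixed-vote case can be sketched with simple bisection. This is my own hypothetical illustration, not E2 code; it just finds the range of p where Prob(U,D) = p^U * (1-p)^D stays at or above the 0.05 threshold:

```python
def prob(p, U, D):
    """Probability of one particular sequence of U upvotes and D downvotes."""
    return p**U * (1 - p)**D

def bisect(a, b, U, D, threshold):
    """Find where prob crosses the threshold on [a, b]; exactly one
    endpoint must be below the threshold."""
    below_at_a = prob(a, U, D) < threshold
    for _ in range(100):
        mid = (a + b) / 2
        if (prob(mid, U, D) < threshold) == below_at_a:
            a = mid
        else:
            b = mid
    return (a + b) / 2

def confidence_range(U, D, threshold=0.05):
    """Range of p consistent with the vote totals, or None when even the
    best-fit p = U/(U+D) falls below the threshold (which happens quickly
    for large mixed samples under this crude criterion)."""
    peak = 1.0 if D == 0 else (0.0 if U == 0 else U / (U + D))
    if prob(peak, U, D) < threshold:
        return None
    lo = 0.0 if prob(0.0, U, D) >= threshold else bisect(0.0, peak, U, D, threshold)
    hi = 1.0 if prob(1.0, U, D) >= threshold else bisect(peak, 1.0, U, D, threshold)
    return (lo, hi)

confidence_range(20, 0)  # recovers the closed-form bound p >= 0.05^(1/20), about 0.86
confidence_range(4, 1)   # the polynomial case: roughly (0.59, 0.93)
```

For (+20/-0) this reproduces the hand calculation above; for a mixed score it does numerically what solving the polynomial would do analytically.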

The problem with this more detailed statistical approach is that it's too complicated and rests on faulty assumptions. It's too complicated because we'd like a relatively simple and computationally inexpensive way to assign reputation to a WU, and it's probably unjustified because the samples of those who vote on WUs are not at all random. Furthermore, sometimes people make a deliberate choice not to vote at all. There are also questions like "what do you mean by 'all' noders". Ultimately, going through all that business for an answer that's probably wrong is kinda silly, so perhaps the naive idea at the beginning (or something like it) is better.

Then there's the question of whether this whole idea of "error bars" makes any sense to begin with. The motivation at the beginning was the idea that the rep of 50 for a (+250/-200) WU is less meaningful than for (+50/-0) WU, but maybe that's bullshit. After all, noders might be less decided about the quality of the first WU, but at least it probably sparked a lot of thought and discussion. WUs with very few downvotes are often dry factuals, which are good but not necessarily the "be all, end all" of noding. Would a system of corrected reps like the ones I mentioned just lead to a lot of bland, inoffensive factuals? Is that really what we want for E2, or is that better left for the likes of Wikipedia?