Ranking, ordering, weighted sums, hyperbolic discount and other beasties in the 5-star rating zoo

A very non-rigorous adventure into the realm of calculating ratings. All this is prompted by a problem I'm trying to find a fair and fairly efficient solution suitable for use with databases.

The problem can be stated as:

We have a collection of 5-star [0-4] rated items, with user ratings and creation and last update timestamps. Retrieve the most important items by rank, taking into account the item ratings, be fairer than simple average item rating.

Using the average rating for an item is not nice because we don't have an idea of what to expect from an item, items with different number of votes have different uncertainty about the truthfulness of the average, there is no accounting for freshness of either votes or the item itself, which in our application is important.

What do we know about the importance of an item?

  • Following common sense, newer items are more likely to be important, as the owner of the item is probably more likely to respond to queries about them
  • Similarly, items with no votes, are likely to be of average interest, we just don't know more
  • The quality of an item, is likely to be higher, if the user's other items have high score
  • We do know the average rating of an item, it would be stupid not to use it

The idea is to express and blend the above intuition

Freshness - we can use either or both of the timestamps. For simplicity only one timestamp will be used, the last update, as it makes sense encouraging updates from business perspective. The idea is simple - discount the item rank with a discount factor dependent on the update timestamp.

rank = f(t) * rank'
   where 
      f(t) = 1 / (1 + k*t)

We use the hyperbolic discount function, although the exponential one can be used as well ( exp( -k*t ) ). The literature hints that the hyperbolic function better approximates to human behaviour.

Global average rating (collection average rating) - just calculate the average item rating and the average number of votes

Average item rating - the arithmetic mean of the votes for the item

For simplicity we will ignore the item-user, item-item and user-user distance measures at the moment. They can be added later.

The adjusted item rating can be derived from applying the Bayes formula to find the probability for an item to be good, with the global average rating being used as a prior. That is a dubious explanation, so it won't be used further, but it is an inspiration. Probably for the same reason the tubes sometimes name it as bayesian averaging.

So the global average and item rating are blended into one measure. Intuitively, when we have no votes, the global average predicts the rating for the item, if the item has far more votes than average, the global average rating becomes insignificant.

rating = (ga * av + ia * iv) / (av + iv)
   where
       ga - global average rating
       av - average number of votes per item in the collection
       ia - average item vote
       iv - number of votes for the item

The rating is a simple weighted average, the weights are derived from data, so no guesswork is required.

Subtituting rank' to rating above we get the final form


rank(item) = f * rating
   where 
      f = 1 / (1 + collection.discount * item.updateTime )
      rating = (collection.averageRating * collection.averageNumVotes 
               + item.averagerating * item.numVotes) 
               / (collection.averageNumVotes  + item.numVotes)

The discount factor can be easily tuned based on business logic, for example, it can be assumed that the item is half as useful after a week (or a month, or whatever). This is ok as an initial, cold start assumption. We can apply statistical inferencing at a later stage to tune it.

This ranking method is a decent starting compromise. We can add more assumptions into this scheme - average rating of user's items and the analogous average rating of items across all users. We can use a richer freshness estimation by taking into account user's activity, for example their last web site visit, the probability of them replying to a message and so on. The system can track a lot of information, turning that into predictions is just a blend of common sense, maths and programming.

One problem with the above method is that if we use it to update ranks in large datasets we will have to recalculate the rank on each update, which is very inefficient. The search is on, for methods with similar behaviour, but allowing for more efficient updates and queries.

The biggest problem is the (*) operation. It is what changes the order of otherwise nicely, easily updated ordered sets.

One obvious approach for reducing the work during updates is to introduce windowing. Inserting or updating an item within a time window, does not change its order (rank), with respect to the other items within the same window. For example, we can create windows corresponding to items updated within the last hour, day, week, month, quarter, year, whenever

This will introduce a few simple partial order rules:

  • all items updated within the last hour rank higher than items updated through the rest of the day
  • similarly all items updated today rank higher than the rest of the items updated this week
  • ...

The actual intervals and their number can be chosen to approximate the hyperbolic discount function with some predefined error margin.

This scheme allows for simpler and more importantly more efficient queries. We can build two indexes - rating and time-interval windows. The ordering is the first by time-window, then made more precise by rating.

In some cases there will be bigger differences in the ordering the items than fall within the desired error. We should just ask ourselves if we care about these extreme cases, how important they are and is the extra precision worth the loss of efficiency and the associated extra costs in computational power required.

Powered by Drupal, an open source content management system