Data Science: Normalizing Judging Scores in R

R has some powerful tools and packages for aggregating data. One of the best is plyr. It is slower than R's own aggregating functions such as apply, but the convenience it provides usually makes up for that. More often than not, you will be using plyr's ddply function. In this write-up, I'll walk through a simple aggregation use case with the plyr package.

Assume you are tasked with aggregating judges' scores in a competition. Further assume that there are three judges and that each of them judges three different categories of participants. Any method we use should eliminate biases to the extent possible. For example, a particular judge might give low scores in one category while giving higher scores in another. The workaround for minimizing such biases is to normalize the scores in some way. First, let us look at some example data.
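A minimal sketch of such data, assuming the two-judge data frame from the code later in this post can stand in for the example (and that its judges J and D correspond to John and David). The name `spread` is only illustrative.

```r
# Example data: same values as the data frame used in the code further down.
scores <- data.frame(
  judge      = c(rep("J", 3), rep("D", 3)),
  contestant = c("x", "y", "z", "x", "y", "z"),
  scores     = c(1, 2, 8, 7, 8, 9)
)
print(scores)

# Gap between each judge's highest and lowest score.
spread <- tapply(scores$scores, scores$judge, function(s) max(s) - min(s))
print(spread)
```

With these values, judge J's scores span 7 points while judge D's span only 2.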

A couple of things are going on here. Judge John has more extreme views: he likes to grade at the extremes, whereas judge David keeps his grades in a tighter range. This also results in a heavy skew in values. The problem gets worse if there are many more categories (other than A) and a judge's rating behaviour changes by category. A simple and robust way to normalize for all of this is to transform each judge's scores so that they have a mean of 0 and a standard deviation of 1. With $\bar{x}$ the mean and $\sigma$ the standard deviation of that judge's scores, the following transform achieves this:
$$x_i \leftarrow \frac{x_i - \bar{x}}{\sigma} $$
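As a quick check of the transform, here is a sketch that applies it by hand to one judge's scores (the values 1, 2, 8 that judge J gives in this post's data) and compares the result with base R's scale(). The variable names are only illustrative.

```r
# One judge's raw scores (judge J in this post's data).
raw <- c(1, 2, 8)

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z_manual <- (raw - mean(raw)) / sd(raw)

# scale() centers and scales the same way; as.vector() drops its matrix shape.
z_scale <- as.vector(scale(raw))

print(z_manual)
print(all.equal(z_manual, z_scale))
```

Note that sd() computes the sample standard deviation (dividing by n - 1), so that is the sense in which the normalized scores have a standard deviation of 1.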

The R code that achieves all of the above is shown below.

#!/usr/bin/Rscript

library(plyr)

x = data.frame(
  judge = c(rep('J',3),rep('D',3)),
  contestant = c('x','y','z','x','y','z'),
  scores = c(1,2,8,7,8,9)
  )
x.tmp = ddply(x[,c(1,3)], # select a group of dimensions
  .(judge), # group by this dimension
  summarize, # we are aggregating, so summarize.
  mu=mean(scores), # the function we are computing
  s=sd(scores)) # another function we are computing

# Merge in the summary of the scores
x = merge(x,x.tmp,by="judge",all.x=TRUE)

# Compute the normalized scores
x$nscores = (x$scores - x$mu)/x$s

# Finally aggregate by the normalized scores
x = ddply(x,
  .(contestant),
  summarize,
  tot.score = sum(nscores))
x
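For comparison, the same pipeline can be sketched with base R only, which avoids the merge step. This is an alternative on the same data, not the plyr approach shown above: ave() returns a per-judge value aligned with the original rows, and aggregate() sums by contestant. The name `d` is arbitrary.

```r
d <- data.frame(
  judge      = c(rep("J", 3), rep("D", 3)),
  contestant = c("x", "y", "z", "x", "y", "z"),
  scores     = c(1, 2, 8, 7, 8, 9)
)

# Normalize within each judge: ave() applies FUN per group and keeps row order.
d$nscores <- ave(d$scores, d$judge, FUN = function(s) (s - mean(s)) / sd(s))

# Sum the normalized scores for each contestant.
totals <- aggregate(nscores ~ contestant, data = d, FUN = sum)
names(totals)[2] <- "tot.score"
print(totals)
```

With this data, contestant z ends up with the highest total and contestant x with the lowest.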

1 comment:

Anonymous, August 14, 2014 at 7:12 AM
Thanks, quite interesting. I think the intermediate code could be simplified, even to one line. See below...

From the input
> x = data.frame(
judge = c(rep('J',3),rep('D',3)),
contestant = c('x','y','z','x','y','z'),
scores = c(1,2,8,7,8,9) )
> x
  judge contestant scores
1     J          x      1
2     J          y      2
3     J          z      8
4     D          x      7
5     D          y      8
6     D          z      9
The result of the previous computation is
> head(x)
  judge contestant scores       mu        s    nscores
1     D          x      7 8.000000 1.000000 -1.0000000
2     D          y      8 8.000000 1.000000  0.0000000
3     D          z      9 8.000000 1.000000  1.0000000
4     J          x      1 3.666667 3.785939 -0.7043607
5     J          y      2 3.666667 3.785939 -0.4402255
6     J          z      8 3.666667 3.785939  1.1445862

Let's use transform instead of summarize; this removes the need for the merge call.

> x = ddply( x, ~ judge, transform, mu = mean(scores))
> x = ddply( x, ~ judge, transform, s = sd(scores))
> x$nscores = with( x, (scores - mu) / s )
> x
  judge contestant scores       mu        s    nscores
1     D          x      7 8.000000 1.000000 -1.0000000
2     D          y      8 8.000000 1.000000  0.0000000
3     D          z      9 8.000000 1.000000  1.0000000
4     J          x      1 3.666667 3.785939 -0.7043607
5     J          y      2 3.666667 3.785939 -0.4402255
6     J          z      8 3.666667 3.785939  1.1445862

Let's use within

> x = ddply( x, ~ judge, within, { mu = mean(scores) ; s = sd(scores) ; nscores = (scores - mu) / s } )
> x
  judge contestant scores    nscores        s       mu
1     D          x      7 -1.0000000 1.000000 8.000000
2     D          y      8  0.0000000 1.000000 8.000000
3     D          z      9  1.0000000 1.000000 8.000000
4     J          x      1 -0.7043607 3.785939 3.666667
5     J          y      2 -0.4402255 3.785939 3.666667
6     J          z      8  1.1445862 3.785939 3.666667

Tuesday, April 2, 2013
