Assume you are tasked with aggregating judges' scores in a competition. Further assume that there are three judges, each of whom scores three different categories of participants. Any method we use should eliminate biases to the extent possible. For example, a particular judge might give low scores in one category while giving higher scores in another. The workaround for minimizing such biases is to normalize the scores in some way. First, let us look at some example data.
Notice that a couple of things are going on here. Judge John has more extreme views: he likes to grade at the extremes, whereas judge David keeps his grades within a tighter range. This also results in a heavy skew in values. The problem gets worse if there are many more categories (other than A) and a judge's rating behaviour changes with the category. A simple and robust way to normalize for all of this is to transform each judge's scores so that they have a mean of 0 and a standard deviation of 1. The following transform achieves this, where $\bar{x}$ and $\sigma$ are the mean and standard deviation of that judge's scores:
$$x_i \leftarrow \frac{x_i - \bar{x}}{\sigma} $$
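As a quick sanity check of the transform, here is a minimal sketch in base R. It assumes the three scores 1, 2 and 8 (the ones judge J gives in the code further down) and relies on R's `sd()`, which computes the sample standard deviation with an n - 1 denominator. The variable names are illustrative only.

```r
# Scores from one judge (assumed example values)
scores <- c(1, 2, 8)

# Centre on the judge's mean and scale by the judge's standard deviation.
# Note: sd() uses the sample (n - 1) denominator.
z <- (scores - mean(scores)) / sd(scores)

print(round(z, 4))
print(mean(z))  # approximately 0, up to floating-point error
print(sd(z))    # approximately 1
```

Whatever range the judge originally used, the transformed scores end up on the same scale, which is what makes them comparable across judges.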
The R code that achieves all of the above is shown below.
#!/usr/bin/Rscript
library(plyr)

x = data.frame(
  judge = c(rep('J',3),rep('D',3)),
  contestant = c('x','y','z','x','y','z'),
  scores = c(1,2,8,7,8,9)
)

x.tmp = ddply(x[,c(1,3)],       # select a group of dimensions
              .(judge),         # group by this dimension
              summarize,        # we are aggregating, so summarize.
              mu=mean(scores),  # the function we are computing
              s=sd(scores))     # another function we are computing

# Merge in the summary of the scores
x = merge(x,x.tmp,by="judge",all.x=TRUE)

# Compute the normalized scores
x$nscores = (x$scores - x$mu)/x$s

# Finally aggregate by the normalized scores
x = ddply(x, .(contestant), summarize, tot.score = sum(nscores))
x
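As a cross-check that does not depend on plyr, the same per-judge normalization and per-contestant total can be sketched in base R. This is only one possible formulation; the helper name `zscore` and the variable `totals` are hypothetical and not part of the original script.

```r
# Same example data as in the script above
x <- data.frame(
  judge      = c(rep("J", 3), rep("D", 3)),
  contestant = c("x", "y", "z", "x", "y", "z"),
  scores     = c(1, 2, 8, 7, 8, 9)
)

# Hypothetical helper: z-score a numeric vector (sample standard deviation)
zscore <- function(v) (v - mean(v)) / sd(v)

# ave() applies zscore within each judge and returns values in row order
x$nscores <- ave(x$scores, x$judge, FUN = zscore)

# Sum the normalized scores per contestant
totals <- aggregate(nscores ~ contestant, data = x, FUN = sum)
print(totals)
```

With this data, contestant z ends up with the highest normalized total and contestant x with the lowest.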
Some good books to pick up the art of data science
Machine Learning (Tom Mitchell)
Algorithms in a Nutshell (O'Reilly)
Programming Collective Intelligence: Building Smart Web 2.0 Applications
Thanks, quite interesting. I think the intermediate code could be simplified, even down to one line. See below...
From the input
> x = data.frame(
judge = c(rep('J',3),rep('D',3)),
contestant = c('x','y','z','x','y','z'),
scores = c(1,2,8,7,8,9) )
> x
  judge contestant scores
1     J          x      1
2     J          y      2
3     J          z      8
4     D          x      7
5     D          y      8
6     D          z      9
The result of the previous computation is
> head(x)
  judge contestant scores       mu        s    nscores
1     D          x      7 8.000000 1.000000 -1.0000000
2     D          y      8 8.000000 1.000000  0.0000000
3     D          z      9 8.000000 1.000000  1.0000000
4     J          x      1 3.666667 3.785939 -0.7043607
5     J          y      2 3.666667 3.785939 -0.4402255
6     J          z      8 3.666667 3.785939  1.1445862
Let's use transform instead of summarize; this removes the need for the merge call.
> x = ddply( x, ~ judge, transform, mu = mean(scores))
> x = ddply( x, ~ judge, transform, s = sd(scores))
> x$nscores = with( x, (scores - mu) / s )
> x
  judge contestant scores       mu        s    nscores
1     D          x      7 8.000000 1.000000 -1.0000000
2     D          y      8 8.000000 1.000000  0.0000000
3     D          z      9 8.000000 1.000000  1.0000000
4     J          x      1 3.666667 3.785939 -0.7043607
5     J          y      2 3.666667 3.785939 -0.4402255
6     J          z      8 3.666667 3.785939  1.1445862
Let's use within
> x = ddply( x, ~ judge, within, { mu = mean(scores) ; s = sd(scores) ; nscores = (scores - mu) / s } )
> x
  judge contestant scores    nscores        s       mu
1     D          x      7 -1.0000000 1.000000 8.000000
2     D          y      8  0.0000000 1.000000 8.000000
3     D          z      9  1.0000000 1.000000 8.000000
4     J          x      1 -0.7043607 3.785939 3.666667
5     J          y      2 -0.4402255 3.785939 3.666667
6     J          z      8  1.1445862 3.785939 3.666667
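The same idea can probably be pushed one step further with base R's `scale()`, which by default centres a vector and divides it by its sample standard deviation. The sketch below assumes the plyr package is installed and that only the normalized score is needed, so the intermediate `mu` and `s` columns are not kept. It also adds the per-contestant total from the original post; the name `totals` is illustrative.

```r
library(plyr)

x <- data.frame(
  judge      = c(rep("J", 3), rep("D", 3)),
  contestant = c("x", "y", "z", "x", "y", "z"),
  scores     = c(1, 2, 8, 7, 8, 9)
)

# scale() returns a one-column matrix, so as.vector() flattens it
# back into a plain numeric column
x <- ddply(x, ~ judge, transform, nscores = as.vector(scale(scores)))

# Sum the normalized scores per contestant
totals <- ddply(x, ~ contestant, summarize, tot.score = sum(nscores))
print(totals)
```

The `nscores` column should match the one produced by the `within` version above, since `scale()` and `sd()` both use the n - 1 denominator.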