Data Science: April 2013

R has some powerful tools and packages for aggregating data. One of the most convenient is plyr. It is often slower than R's own aggregating functions such as apply, but the convenience it provides usually makes up for that. More often than not, you will be using the ddply function, which splits a data frame into groups, applies a function to each group, and combines the results into a new data frame. In this write-up, I'll walk through a simple aggregation example using the plyr package.


Assume you are tasked with aggregating judges' scores in a competition. Further assume that there are three judges and that each of them judges three different categories of participants. Any method we use should eliminate biases as far as possible. For example, a judge might give low scores in one category and higher scores in another. One way to minimize such biases is to normalize the scores. First, let us look at some example data.
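As a minimal sketch, the example data can be written as an R data frame. The values are the ones used in the script later in this post; reading the judge codes 'J' and 'D' as John and David is an assumption, and the name `scores.example` is only illustrative.

```r
# Example data: two judges each score the same three contestants.
# 'J' and 'D' are assumed to stand for John and David.
scores.example <- data.frame(
  judge = c(rep('J', 3), rep('D', 3)),
  contestant = c('x', 'y', 'z', 'x', 'y', 'z'),
  scores = c(1, 2, 8, 7, 8, 9)
)
print(scores.example)

# Spread (max - min) of each judge's scores
spread <- tapply(scores.example$scores, scores.example$judge,
  function(s) diff(range(s)))
print(spread)
```

The spread makes the difference in grading style visible: one judge uses a range of 7 points, the other a range of 2.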

Notice that there are a couple of things going on here. Judge John has more extreme views: he likes to grade at the extremes, whereas judge David keeps his grades in a tighter range. This also skews the values heavily. The problem gets worse if there are many more categories (other than A) and a judge's rating behaviour changes with the category. A simple and robust way to normalize for all of this is to transform each judge's scores so that they have a mean of 0 and a standard deviation of 1. The following transform achieves this:
$$x_i \leftarrow \frac{x_i - \bar{x}}{\sigma} $$
Here the mean and the standard deviation are computed separately over each judge's scores.
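To make the transform concrete, here is a small sketch in base R that applies it to one judge's scores at a time. The vectors hold the scores from the script in this post, and the helper name `zscore` is hypothetical. Note that `sd()` computes the sample standard deviation (dividing by n - 1).

```r
# Hypothetical helper: centre on the mean and divide by the standard deviation
zscore <- function(s) (s - mean(s)) / sd(s)

john <- c(1, 2, 8)   # wide spread
david <- c(7, 8, 9)  # tight spread

john.n <- zscore(john)
david.n <- zscore(david)

print(round(john.n, 3))
print(david.n)  # -1 0 1

# After the transform each judge's scores have mean 0 (up to rounding)
# and standard deviation 1
print(c(mean(john.n), sd(john.n)))
```

After the transform, both judges' scores live on the same scale, so they can be added together without one judge dominating the total.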


The R code that achieves all of the above is shown below.

#!/usr/bin/Rscript

library(plyr)

x = data.frame(
  judge = c(rep('J',3),rep('D',3)),
  contestant = c('x','y','z','x','y','z'),
  scores = c(1,2,8,7,8,9)
  )
x.tmp = ddply(x[,c("judge","scores")], # keep only the columns we need
  .(judge), # group by judge
  summarize, # collapse each group to a single summary row
  mu=mean(scores), # per-judge mean
  s=sd(scores)) # per-judge (sample) standard deviation

# Merge in the summary of the scores
x = merge(x,x.tmp,by="judge",all.x=TRUE)

# Compute the normalized scores
x$nscores = (x$scores - x$mu)/x$s

# Finally, total the normalized scores for each contestant
x = ddply(x,
  .(contestant),
  summarize,
  tot.score = sum(nscores))
x
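As an alternative sketch, the summarize and merge steps can be folded into one call by passing `transform` to ddply, which adds the normalized column within each judge's group while keeping every row. This is a variation on the script above, not a replacement for it; the names `scores.df`, `normalized` and `totals` are illustrative.

```r
library(plyr)

scores.df <- data.frame(
  judge = c(rep('J', 3), rep('D', 3)),
  contestant = c('x', 'y', 'z', 'x', 'y', 'z'),
  scores = c(1, 2, 8, 7, 8, 9)
)

# Normalize within each judge; transform keeps every row and adds nscores
normalized <- ddply(scores.df, .(judge), transform,
  nscores = (scores - mean(scores)) / sd(scores))

# Sum the normalized scores per contestant
totals <- ddply(normalized, .(contestant), summarize,
  tot.score = sum(nscores))
print(totals)
```

With this data, contestant z comes out on top and x at the bottom, which matches how both judges ranked them.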



Tuesday, April 2, 2013

Normalizing Judging Scores in R