Data Science: 2013

Tuesday, April 2, 2013

Normalizing Judging Scores in R

R has some powerful tools and packages for aggregating data, and one of the best is plyr. It is slower than R's built-in aggregation functions such as apply, but the convenience it provides usually makes up for that. More often than not, you will be using plyr's ddply function. In this write-up, I'll walk through a simple aggregation use case with the plyr package.


Assume you are tasked with aggregating judges' scores in a competition. Further assume that there are three judges and that they all judge three different categories of participants. Any method we use should eliminate biases to the extent possible. For example, a particular judge might give low scores in one category while giving higher scores in another. The workaround for minimizing such biases is to normalize the scores in some way. First, let us look at some example data.

Notice that there are a couple of things going on here. Judge John has more extreme views: he likes to grade at the extremes, whereas judge David keeps his grades in a tighter range. This also results in a heavy skew in the values. The problem gets worse if there are many more categories (other than A) and a judge's rating behaviour changes from category to category. A simple and robust way to normalize for all of this is to transform each judge's scores so that they have a mean of 0 and a standard deviation of 1. The following transform achieves this:
$$x_i \leftarrow \frac{x_i - \bar{x}}{\sigma} $$
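
As a quick check of the transform, the sketch below applies it to a single judge's scores (the values 1, 2, 8 are the ones used for judge 'J' in this post's code). It assumes that $\sigma$ is the sample standard deviation, which is what R's `sd` computes; the variable names are only illustrative.

```r
# One judge's raw scores (same values used for judge 'J' in the post's code)
scores <- c(1, 2, 8)

# Apply the transform by hand: subtract the mean, divide by the standard deviation
z <- (scores - mean(scores)) / sd(scores)

# scale() performs the same centring and scaling; it returns a one-column matrix
z.scale <- as.numeric(scale(scores))

print(round(z, 3))
print(all.equal(z, z.scale))          # the two approaches should agree
print(c(mean = mean(z), sd = sd(z)))  # mean close to 0, sd equal to 1
```

After the transform, a judge who grades at the extremes and a judge who grades in a narrow band both end up on the same scale, so their scores can be added together.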


The R code that achieves all of the above is shown below.

#!/usr/bin/Rscript

library(plyr)

x = data.frame(
  judge = c(rep('J',3),rep('D',3)),
  contestant = c('x','y','z','x','y','z'),
  scores = c(1,2,8,7,8,9)
  )
x.tmp = ddply(x[,c(1,3)], # select a group of dimensions
  .(judge), # group by this dimension
  summarize, # we are aggregating, so summarize.
  mu=mean(scores), # the function we are computing
  s=sd(scores)) # another function we are computing

# Merge in the summary of the scores
x = merge(x,x.tmp,by="judge",all.x=TRUE)

# Compute the normalized scores
x$nscores = (x$scores - x$mu)/x$s

# Finally aggregate by the normalized scores
x = ddply(x,
  .(contestant),
  summarize,
  tot.score = sum(nscores))
x
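
For comparison, the same totals can be sketched in base R without plyr. This is only an alternative, assuming `ave` and `aggregate` are acceptable stand-ins for the two `ddply` calls; the names `d` and `totals` are illustrative and not part of the code above.

```r
# Same example data as in the plyr version
d <- data.frame(
  judge = c(rep('J', 3), rep('D', 3)),
  contestant = c('x', 'y', 'z', 'x', 'y', 'z'),
  scores = c(1, 2, 8, 7, 8, 9)
)

# ave() applies a function within each group and returns a vector
# aligned with the original rows, so no merge step is needed
d$nscores <- ave(d$scores, d$judge,
                 FUN = function(s) (s - mean(s)) / sd(s))

# Sum the normalized scores per contestant
totals <- aggregate(nscores ~ contestant, data = d, FUN = sum)
names(totals)[2] <- "tot.score"
print(totals)
```

Because each judge's normalized scores sum to zero, the contestant totals also sum to zero, which is a handy sanity check on the output.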


Wednesday, March 13, 2013

Web Scraping with R

In this write-up I'll describe an R function that I use to fetch stock data from the web. The nice thing about this function is that it is written in pure R, and the data it returns is a data frame, which can then be analysed or charted for different metrics. Here is the code.


#!/usr/bin/Rscript

getData = function(instrument,lookback=365){
  # instrument tells the stock ticker
  # default lookback is 365 days

  # Format the start date
  starty = format(Sys.time()-60*60*24*lookback,"%Y")
  startm = format(Sys.time()-60*60*24*lookback,"%m")
  startd = format(Sys.time()-60*60*24*lookback,"%d")

  # Format the stop date
  endy = format(Sys.time(),"%Y")
  endm = format(Sys.time(),"%m")
  endd = format(Sys.time(),"%d")

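  # Create a url. Using Y! Finance here
  # You can use any site which offers this data
  # Note: this endpoint expects zero-based months (January is 0),
  # so subtract 1 from the calendar month
  url = paste("http://ichart.finance.yahoo.com/table.csv?s=",
    instrument,"&a=",
    as.integer(startm) - 1,"&b=",
    startd,"&c=",
    starty,"&d=",
    as.integer(endm) - 1,"&e=",
    endd,"&f=",
    endy,"&g=d&ignore=.csv",
    sep="")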

  # The destination file to write out this data
  # Use the pid of the process
  # It's a simple way to get uniqueness
  # sep="" avoids a space inside the file name
  destfile = paste("out.txt.",Sys.getpid(),sep="")
  print(url)
  cat("Fetching data from Y! Finance for ",
      instrument," Ranging ",
      starty,startm," -> ",
      endy,endm,"\n")
  
  # Fetch that data
  status = download.file(url,destfile,method="auto",quiet=TRUE,cacheOK=FALSE)
  if(status != 0){
    # Some error. Stop!
    unlink(destfile)
    stop(paste("Download error, status ",status))
  }
  nlines = length(count.fields(destfile,sep="\n"))
  if(nlines == 1){
    # Site didn't return data
    unlink(destfile)
    stop(paste("No data available for",instrument))
  }
  # Read the data in as a table
  data = read.table(file=destfile,sep=",",header=T,as.is=T)
  # Delete the temporary file
  unlink(destfile)
  # Return the data
  data
}

argv = commandArgs(trailingOnly=T)
if(length(argv) != 1){
  cat("Usage: this-file.r <ticker>\n")
  q()
}

# Get 60 days look back data
x = getData(argv[1],60)

# Check to see if it's there
head(x)
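
Once the data frame is back, a typical next step is to parse the dates and put the rows in chronological order. The sketch below is hedged on two assumptions: that the CSV carries `Date`, `Volume` and `Adj Close` columns, and that rows arrive newest first. The inline values are made up for illustration and are not real quotes.

```r
# Made-up sample in the assumed layout (not real market data)
csv.text <- "Date,Open,High,Low,Close,Volume,Adj Close
2013-03-12,10.2,10.6,10.1,10.5,1200,10.5
2013-03-11,10.0,10.3,9.9,10.2,900,10.2
2013-03-08,9.8,10.1,9.7,10.0,1000,10.0"

# read.csv turns the header 'Adj Close' into the column name 'Adj.Close'
quotes <- read.csv(text = csv.text, header = TRUE, as.is = TRUE)

# Convert the date strings and sort oldest first
quotes$Date <- as.Date(quotes$Date)
quotes <- quotes[order(quotes$Date), ]

# Day-over-day fractional change in the adjusted close
quotes$change <- c(NA, diff(quotes$Adj.Close) / head(quotes$Adj.Close, -1))
print(quotes[, c("Date", "Adj.Close", "change")])
```

The first row has no prior day, so its change is left as `NA` rather than guessed.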


Sunday, March 10, 2013

A Technical Chart for Run ups and downs in R

Two of the most important things to watch in technical analysis and trading are the percentage movement in price, up or down, and the associated volume. A large upward percentage move on large volume is a sign that lots of big fish are jumping into the stock. Likewise, a large downward percentage move on large volume indicates that traders are dumping the stock, and it's a falling knife you want to stay away from. However, most day-to-day movements don't fall into either of these buckets. Worse, it is generally difficult to place the price and volume in perspective with respect to past price movements. In this write-up, I'll explain a charting technique that addresses this issue. A charting technique does not give out a number like a metric; instead, it gives a visual perspective on where the stock stands as of a given day.


To begin, we will need two relatively simple metrics in place: a volume index and the percentage change in price.

Volume Index:
This is the ratio of the volume of shares traded on a given day to the average daily volume of the stock. The higher this number, the greater the momentum in a given direction. So if the average volume of shares traded in a day is $v_{avg}$ and the volume on a given day is $v_{t}$, the volume index $v_{index}$ is
$$v_{index} = \frac{v_t}{v_{avg}}$$

Percentage change:
The fractional change in price from one day to the next. If the price on day $t$ is $p_t$ and on day $t+1$ is $p_{t+1}$, then the percentage change is given as
$$\Delta_p = \frac{p_{t+1} - p_{t}}{p_{t}}$$


The goal now is to group each series of run-ups and run-downs together and place them on an X-Y chart. For example, if the price, volume and average volume were $[\{2,100,90\},\{3,110,90\},\{4,80,90\},\{3.5,70,90\}\ldots]$ then the data gets mapped to pairs of price change and volume index as $\{\frac{3-2}{2}=0.5,\frac{100}{90}=1.11\},\{\frac{4-3}{3}=0.33,\frac{110}{90}=1.22\},\{\frac{3.5 - 4}{4}=-0.125,\frac{80}{90} = 0.89\},\ldots$. Next, we tag each run of consecutive positive or negative price changes with a number. This is mainly a preparatory step for plotting the data. R has a nice function "rle", which stands for "run length encoding", for doing exactly this. Our goal is to make an overlaid line chart of all the positive and negative runs. This chart is useful for visualizing how any recent run-up or run-down on a stock stacks up against previous run-ups and run-downs of that same stock.
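
The mapping and tagging described above can be sketched on the four example data points from the text. This is a minimal illustration, not the charting code itself; the names `run.id` and `day.in.run` are hypothetical, and a change of exactly zero is treated as positive here.

```r
# Example series from the text: price, volume, and a fixed average volume
price   <- c(2, 3, 4, 3.5)
volume  <- c(100, 110, 80, 70)
vol.avg <- 90

# Fractional price change from each day to the next
delta <- diff(price) / head(price, -1)

# Volume index, paired with the change that starts on that day
volindex <- head(volume, -1) / vol.avg

# Tag each change, then number the runs and the position inside each run
tag  <- ifelse(delta < 0, 'Negative', 'Positive')
runs <- rle(tag)
run.id     <- rep(seq_along(runs$lengths), runs$lengths)
day.in.run <- sequence(runs$lengths)

print(data.frame(delta = round(delta, 3), volindex = round(volindex, 2),
                 tag, run.id, day.in.run))
```

Here the two rising days form run 1 and the falling day starts run 2, so each run can be drawn as its own line with the day-in-run on the x-axis.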


Finally, we need historical price data for the stock we wish to analyse, say the past 6 months of McDonald's stock (MCD) from Yahoo! Finance. We will save this file as "mcd.csv". In the following code I'll use Hadley Wickham's ggplot2 package to chart this. For $v_{avg}$ I'll use the overall average volume for the trailing 6 months (5722305) for simplicity, but you can tweak this. The code below charts the continuous run-up and run-down price changes of the MCD stock.

The chart generated is shown below

The code used is shown below.


#!/usr/bin/Rscript

library(ggplot2)
library(grid)
library(scales)

# Read in the file
x = read.csv("mcd.csv",sep=",",header=T)

# Pick the Volume and Price column
x = x[,c(6,7)]
colnames(x) = c('Volume','Price')

# 6 month average
# Change this to some sliding window
vol.avg = 5722305

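# Compute the volume index and price change
# The rows are assumed to be ordered newest first, so -diff() gives
# (price on a day - price on the prior day), divided by the prior day's price
x.volindex = x$Volume/vol.avg
x.volindex = x.volindex[1:(length(x.volindex) - 1)]
x.delta    = -diff(x$Price)/x$Price[2:length(x$Price)]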


# Create a vector which has the same length as x.delta
# and is tagged as 'Positive' when x.delta[i] is zero or greater and
# is tagged as 'Negative' when x.delta[i] is less than 0.
x.tag = rep('Positive',length(x.delta))
x.tag[x.delta < 0] = 'Negative'
x.tmp = rle(x.tag)
# Number each run (lab) and count the days within each run (index)
x.t = data.frame(lab = rep(seq_along(x.tmp$lengths),x.tmp$lengths))
x.index = data.frame(index = sequence(x.tmp$lengths))

x = data.frame(
  delta.price = x.delta,
  index = x.index$index,
  volindex = x.volindex,
  tag = x.t$lab
  )

p1 = ggplot(x,aes(index,delta.price,group=tag)) +
  geom_line(alpha=0.3) +
  scale_y_continuous(labels=percent,
                     breaks=round(
                       seq(min(x$delta.price),
                           max(x$delta.price),
                           by=0.005),3)) +
  scale_x_continuous(breaks=seq(min(x$index),max(x$index),by=1)) +
  xlab(" Days in Run") + ylab(" Change in Price") +
  geom_hline(yintercept=0,linetype="longdash")

p2 = ggplot(x,aes(index,volindex,group=tag)) +
  geom_line(alpha=0.3) +
  scale_y_continuous(breaks=round(
                       seq(min(x$volindex),
                           max(x$volindex),
                           by=0.5),2)) +
  scale_x_continuous(breaks=seq(min(x$index),max(x$index),by=1)) +
  xlab(" Days in Run") + ylab(" Volume Index")


png(filename="t.png",width=800,height=800)
# Change p1 to p2 to get the volume index chart.
print(p1)
dev.off()
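
The fixed `vol.avg` above could be swapped for a sliding window. The sketch below is one possible way to do that with a trailing moving average from `stats::filter`; the window length of 3 and the volume values are assumptions for illustration only, and the series is taken to be ordered oldest first (a newest-first download would need `rev()` applied first).

```r
# Hypothetical daily volumes, oldest first (illustrative values only)
volume <- c(100, 110, 80, 70, 90)
window <- 3  # assumed window length in trading days

# One-sided moving average: each value averages the current day and the
# (window - 1) days before it; the first window - 1 entries are NA
vol.avg <- as.numeric(stats::filter(volume, rep(1 / window, window), sides = 1))

# Volume index relative to the trailing average
volindex <- volume / vol.avg
print(round(volindex, 3))
```

The leading `NA` values mark days without enough history; those rows would need to be dropped before plotting.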