2009-05-05: Simple Statistics with RSRuby

In this post, I start exploring the RSRuby interface and its facilities for handling some simple statistical operations. At the moment I am not interested in many of the options available for the different procedures, but just want to understand how to pass data to R from Ruby, and do some useful computations.

Simple Statistics

Given a list of readings, such as [4, 2, 35, 10, 17, 3, 6, 8], we can compute the mean, median, variance, and standard deviation as follows:

irb(main):052:0> require 'rsruby'
irb(main):053:0> r = RSRuby.instance
irb(main):054:0> a = [4, 2, 35, 10, 17, 3, 6, 8]
irb(main):055:0> r.mean(a)
=> 10.625
irb(main):056:0> r.median(a)
=> 7.0
irb(main):057:0> r.var(a)
=> 119.982142857143
irb(main):058:0> r.sd(a)
=> 10.9536360564491

This is straightforward. The RSRuby bridge converts the Ruby array stored in the variable 'a' into an R vector, calls the R function of the same name, and returns the result as a Ruby value, as illustrated.
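As a cross-check on the numbers above, the same statistics can be sketched in plain Ruby. The helper names below are my own and are not part of RSRuby; the sketch assumes Ruby 2.4 or later (for `sum`). R's var and sd use the sample formulas, dividing by n - 1, and the sketch does the same.

```ruby
# Pure-Ruby cross-check of the values returned by R.
# All helper names here are hypothetical; none of them come from RSRuby.

def mean(xs)
  xs.sum.to_f / xs.size
end

def median(xs)
  sorted = xs.sort
  mid = sorted.size / 2
  # For an even count, average the two middle values.
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Sample variance: divides by n - 1, as R's var() does.
def sample_variance(xs)
  m = mean(xs)
  xs.sum { |x| (x - m)**2 } / (xs.size - 1)
end

def sample_sd(xs)
  Math.sqrt(sample_variance(xs))
end

readings = [4, 2, 35, 10, 17, 3, 6, 8]
puts mean(readings)                      # 10.625
puts median(readings)                    # 7.0
puts sample_variance(readings).round(6)  # 119.982143
puts sample_sd(readings).round(6)        # 10.953636
```

Dividing by n instead of n - 1 would give the population variance, which is smaller and would not match R's output.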

Statistics on Multiple Datasets

Often in modelling experiments we have a set of observed data and a set of predicted data, and we would like to measure how well the model's predictions fit the observations. There are various statistics to help us quantify this.

The covariance is one measure of how two variables change together. It is positive if the variables tend to change in the same direction, negative if they tend to change in opposite directions, and zero if there is no linear relation between them (as is the case when they are unrelated). Covariances are computed easily in RSRuby, passing the data as two arrays:

irb(main):074:0> a = [1,2,3,4,5,6]
=> [1, 2, 3, 4, 5, 6]
irb(main):075:0> b = [2,4,6,8,10,12]
=> [2, 4, 6, 8, 10, 12]
irb(main):076:0> c = [6,5,4,3,2,1]
=> [6, 5, 4, 3, 2, 1]
irb(main):077:0> d=[2,2,2,2,2,2]
=> [2, 2, 2, 2, 2, 2]
irb(main):078:0> r.cov(a,b)
=> 7.0
irb(main):079:0> r.cov(a,c)
=> -3.5
irb(main):080:0> r.cov(a,d)
=> 0.0

Notice that 'b' changes in the same direction as 'a', 'c' changes in the opposite direction to 'a', and 'd' is constant, so it does not vary with 'a' at all.
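To make the computation concrete, here is a minimal plain-Ruby sketch of the sample covariance, which reproduces the three values above. The name `sample_covariance` is hypothetical and not an RSRuby method; the sketch assumes Ruby 2.4 or later and divides by n - 1, as R's cov does by default.

```ruby
# Sample covariance of two equal-length arrays, dividing by n - 1.
# `sample_covariance` is a hypothetical helper, not part of RSRuby.
def sample_covariance(xs, ys)
  raise ArgumentError, "arrays must have equal length" unless xs.size == ys.size

  mean_x = xs.sum.to_f / xs.size
  mean_y = ys.sum.to_f / ys.size
  # Sum the products of the paired deviations from each mean.
  xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) } / (xs.size - 1)
end

a = [1, 2, 3, 4, 5, 6]
puts sample_covariance(a, [2, 4, 6, 8, 10, 12])  # 7.0
puts sample_covariance(a, [6, 5, 4, 3, 2, 1])    # -3.5
puts sample_covariance(a, [2, 2, 2, 2, 2, 2])    # 0.0
```

The constant array gives zero because every one of its deviations from the mean is zero, whatever the other array does.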

The covariance is not dimensionless: its units are the product of the units of the two variables, so its magnitude depends on the scale of the data. By contrast, correlation is a dimensionless measure, indicating the strength of the relation between the two variables (a linear relation, in the case of Pearson's coefficient). Correlations lie in the range [-1, 1], and so are comparable between experiments.
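The difference can be illustrated with a small plain-Ruby sketch. It computes Pearson's correlation as the covariance divided by the product of the two standard deviations, and shows that rescaling one variable changes the covariance but not the correlation. The helper names and the units (metres, centimetres, seconds) are my own illustrative assumptions; Ruby 2.4 or later is assumed.

```ruby
# Hypothetical helpers for illustration; none of these come from RSRuby.
def mean(xs)
  xs.sum.to_f / xs.size
end

# Sample covariance, dividing by n - 1.
def covariance(xs, ys)
  mx = mean(xs)
  my = mean(ys)
  xs.zip(ys).sum { |x, y| (x - mx) * (y - my) } / (xs.size - 1)
end

# Pearson's r: covariance normalised by both standard deviations.
# (covariance(xs, xs) is the sample variance of xs.)
def pearson(xs, ys)
  covariance(xs, ys) / Math.sqrt(covariance(xs, xs) * covariance(ys, ys))
end

metres      = [1, 2, 3, 4, 5, 6]
centimetres = metres.map { |m| m * 100 }   # same lengths, different unit
seconds     = [2, 4, 6, 8, 10, 12]

puts covariance(metres, seconds)       # 7.0
puts covariance(centimetres, seconds)  # 700.0
puts pearson(metres, seconds)          # 1.0
puts pearson(centimetres, seconds)     # 1.0
```

Note that `pearson` is undefined when either array is constant, since the divisor is then zero.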

R provides a 'cor' function to compute the correlation between two variables. Three forms of correlation may be computed: Pearson's r, Kendall's tau, and Spearman's rho. As can be seen in the sample below, the three can give clearly different values for the same data:

irb(main):082:0> r.cor([1,2,3],[2,1,4], :method => "pearson")
=> 0.654653670707977
irb(main):083:0> r.cor([1,2,3],[2,1,4], :method => "spearman")
=> 0.5
irb(main):084:0> r.cor([1,2,3],[2,1,4], :method => "kendall")
=> 0.333333333333333
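To see where the two rank-based values come from, here is a rough plain-Ruby sketch of Spearman's rho and Kendall's tau. It handles only data without ties (as in the sample above) and uses the textbook shortcut formulas; it is not a description of how R implements 'cor'. All helper names are hypothetical, and Ruby 2.4 or later is assumed.

```ruby
# Rank each value (1 = smallest). Only valid when there are no ties.
def ranks(xs)
  order = xs.each_with_index.sort_by { |value, _index| value }
  result = Array.new(xs.size)
  order.each_with_index do |(_value, original_index), rank|
    result[original_index] = rank + 1
  end
  result
end

# Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties allowed.
def spearman(xs, ys)
  n = xs.size
  d_squared = ranks(xs).zip(ranks(ys)).sum { |rx, ry| (rx - ry)**2 }
  1.0 - (6.0 * d_squared) / (n * (n**2 - 1))
end

# Kendall's tau: (concordant pairs - discordant pairs) / total pairs.
def kendall(xs, ys)
  pairs = xs.zip(ys).combination(2).to_a
  score = pairs.sum do |(x1, y1), (x2, y2)|
    # +1 if both variables move the same way, -1 if they move oppositely.
    ((x2 - x1) <=> 0) * ((y2 - y1) <=> 0)
  end
  score.to_f / pairs.size
end

puts spearman([1, 2, 3], [2, 1, 4])          # 0.5
puts kendall([1, 2, 3], [2, 1, 4]).round(6)  # 0.333333
```

With three points there are only three pairs; two are concordant and one is discordant, giving (2 - 1) / 3 for Kendall's tau.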

Note the way 'cor' is called through RSRuby. The trailing `:method => "pearson"` is a Ruby hash, the idiom Ruby uses for keyword-style arguments; RSRuby passes its entries to R as optional named arguments. For instance, the equivalent of r.cor([1,2,3],[2,1,4], :method => "kendall") in R is:

> cor(c(1,2,3), c(2,1,4), method="kendall")
[1] 0.3333333
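The correspondence between the two calls can be sketched in plain Ruby. The toy function below takes arguments the same way the r.cor call does and renders the R source text shown above. It is only an illustration of the calling convention: `r_call_string` and `to_r_literal` are hypothetical names, and this is not how RSRuby itself talks to R.

```ruby
# Render a Ruby value as R source text (numbers, strings and arrays only).
# Hypothetical helper for illustration; not part of RSRuby.
def to_r_literal(value)
  case value
  when Array  then "c(#{value.map { |v| to_r_literal(v) }.join(',')})"
  when String then value.inspect
  else value.to_s
  end
end

# Build the text of an R call. A trailing hash is treated as named arguments.
def r_call_string(name, *args)
  named = args.last.is_a?(Hash) ? args.pop : {}
  parts = args.map { |arg| to_r_literal(arg) }
  parts += named.map { |key, val| "#{key}=#{to_r_literal(val)}" }
  "#{name}(#{parts.join(', ')})"
end

puts r_call_string(:cor, [1, 2, 3], [2, 1, 4], :method => "kendall")
# cor(c(1,2,3), c(2,1,4), method="kendall")
```

The point is that the `:method => "kendall"` part arrives in the method as an ordinary Hash at the end of the argument list, so a bridge can separate positional from named arguments.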

Page from Peter's Scrapbook, output from a VimWiki on 2024-01-29.