2009-05-14

In my previous post, I looked at creating some graphs and doing other analysis with RSRuby. The problem with using RSRuby directly is that the syntax can be clunky; several housekeeping steps stand between you and the result. For example, to save a graph of data, you must start up an instance of R, tell it to create a file of the right type, plot the graph, and finally tell the device that you are finished. What would be convenient, at least for simple cases, is to have a single function which does all the above in one go. In this post, I explore how easy it is to create a wrapper around some of the RSRuby calls, and so devise a personal library of functions with just the right level of exposed complexity.

I shall give my wrapper library a simple name, 'L', as I don't want long names to type in. All methods are module methods. 'L' includes an instance of R and the methods in 'L' will do the hard work of interacting with R. Here is the start of the library:

require 'rsruby'

module L
  R = RSRuby.instance # keep a constant R interpreter
end

This is very simple, of course. The instance of R is accessible as 'L::R', and can be called in the usual way. I think this is important - the wrapper library does not prevent all the usual ways of interacting with R, but will make some functions simpler.

Let us add some simple statistical functions. I would like to compute the mean of an array of numbers by calling: 'L::mean [1,2,3,4,5,6,7]', and I can do this by providing functions like:

  # --- stats
  def L.mean items
    R.mean items
  end

  def L.stddev items
    R.sd items
  end

This library is for my own use, so I can use whatever names I like. I sometimes prefer more verbose forms than R uses, as you can see in the stddev function.
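
As a cross-check on what these two wrappers should return, the same statistics can be sketched in plain Ruby, with no R involved. The names plain_mean and plain_stddev are hypothetical and not part of 'L'; the standard deviation here uses the n-1 denominator, which is what R's sd computes.

```ruby
# Hypothetical plain-Ruby equivalents, intended for checking results only
def plain_mean(items)
  items.inject(0.0) { |sum, x| sum + x } / items.length
end

# sample standard deviation (n - 1 denominator), matching R's sd
def plain_stddev(items)
  m = plain_mean(items)
  sum_sq = items.inject(0.0) { |sum, x| sum + (x - m)**2 }
  Math.sqrt(sum_sq / (items.length - 1))
end

data = [1, 2, 3, 4, 5, 6, 7]
puts "Mean: #{plain_mean(data)} SD: #{plain_stddev(data)}"
```

If 'L::mean' and 'L::stddev' disagree with these values on the same array, the likely culprit is the data passed in rather than R.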

The process of creating and saving a graph in RSRuby required a few steps. We can simplify those steps using our wrapper, creating a function to accept the name of the final graph along with the parameters for constructing the graph. We use here the usual Ruby trick of collecting trailing key => value arguments into a single Hash, so L.save_histogram("sample.png", data, :main => "Title", :xlab => "x label") will treat the parameters :main => "Title", :xlab => "x label" as a single argument, labelled params, which can then be passed to R's own hist function.

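  def L.save_histogram(filename, data, params)
    R.png filename            # open a PNG graphics device writing to filename
    R.hist(data, params)      # params hash supplies R's named arguments
    R.eval_R("dev.off()")     # close the device so the file is written
  end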
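
To see the argument handling and the open/plot/close sequence without R installed, here is a small sketch. FakeR and the module name Sketch are hypothetical stand-ins invented for this illustration: FakeR only records the calls it receives, in place of the real RSRuby instance.

```ruby
# FakeR is a hypothetical stand-in for the RSRuby instance:
# it records each call instead of talking to R
class FakeR
  attr_reader :calls

  def initialize
    @calls = []
  end

  def png(filename)
    @calls << [:png, filename]
  end

  def hist(data, params)
    @calls << [:hist, data, params]
  end

  def eval_R(code)
    @calls << [:eval_R, code]
  end
end

module Sketch
  R = FakeR.new

  # same shape as L.save_histogram, but talking to the recorder
  def Sketch.save_histogram(filename, data, params)
    R.png filename
    R.hist(data, params)
    R.eval_R("dev.off()")
  end
end

Sketch.save_histogram("sample.png", [1, 2, 2, 3],
                      :main => "Title", :xlab => "x label")
calls = Sketch::R.calls
p calls[1][2].class   # the trailing key => value pairs arrive as one Hash
```

The recorded calls show the device being opened, the plot drawn with a single params hash, and the device closed, in that order.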

The machine-learning and data-mining communities have a fairly standard format for representing data instances: comma-separated values, or CSV. Each line of a text file is taken to represent a single data instance. The features of the data are separated by commas. For example, a table of data representing information about cars might hold features about the colour, engine size and whether there was a roofrack:

white, 1800, no
blue, 1200, yes
...

Many such files contain header information, describing what the column headings are, for instance. We would like a function that will read such data into a Ruby data structure. The aims of the function are:

to accept as input the filename of the data to read in;
to optionally accept a number of header lines to ignore;
to optionally accept an alternative separator symbol to the comma;
to return an array in which each element is an array of the features for one instance; and
to convert the feature values into Integer or Float types, where appropriate.

The functions I created are:

  # --- reading in data
  # try to convert string item into an Integer or a Float, else return item
  def L.convert_item item
    begin
      Integer item
    rescue
      begin
        Float item
      rescue 
        item
      end
    end
  end

  # convert every item in given list of items
  def L.convert_items items
    items.collect {|i| L.convert_item i}
  end
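
A quick way to see what the conversion does is to run it on one row of the car data. The sketch below copies the two functions into a module named Conv (a hypothetical name, chosen only so that the example does not need RSRuby) and also shows one caveat worth knowing: Ruby's Integer() honours radix prefixes, so a field with a leading zero is read as octal.

```ruby
# Copy of the two conversion functions, placed in a hypothetical module Conv
module Conv
  def Conv.convert_item item
    begin
      Integer item
    rescue
      begin
        Float item
      rescue
        item
      end
    end
  end

  def Conv.convert_items items
    items.collect {|i| Conv.convert_item i}
  end
end

row = Conv.convert_items(["white", "1800", "1.6", "no"])
p row   # prints ["white", 1800, 1.6, "no"]

# Caveat: a leading zero is treated as an octal prefix
p Conv.convert_item("010")   # prints 8
# Assumption: if such fields are meant as decimal, passing a base is one option
p Integer("010", 10)         # prints 10
```

Whether the octal behaviour matters depends on the data; for fields such as zero-padded identifiers it probably does.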

  # return a list of the lines from given file
  # -- ignore the top ignore_n lines
  # -- convert every item in each line using the above
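  def L.read_data_file(filename, ignore_n=0, split_char=",")
    data = []

    # the block form closes the file even if an error is raised
    File.open(filename, "r") do |file|
      ignore_n.times { file.gets }
      while line = file.gets
        next if line.strip.empty? # ignore blank lines
        # strip each field, so that string features do not keep
        # stray spaces or the trailing newline
        fields = line.split(split_char).collect {|i| i.strip}
        data << L.convert_items(fields)
      end
    end

    data
  end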
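
The aims listed above can be checked against a small temporary file. This is a sketch only: read_rows and convert_field are hypothetical, compact variants written for this example so that it runs without RSRuby, and the file contents are made up from the car example.

```ruby
require 'tempfile'

# Hypothetical compact variant of the conversion step
def convert_field(field)
  Integer(field)
rescue ArgumentError, TypeError
  begin
    Float(field)
  rescue ArgumentError, TypeError
    field
  end
end

# Hypothetical compact variant of the reader: skip ignore_n header lines,
# drop blank lines, split on split_char, strip and convert each field
def read_rows(filename, ignore_n = 0, split_char = ",")
  File.readlines(filename).drop(ignore_n).
    reject { |line| line.strip.empty? }.
    map { |line| line.split(split_char).map { |f| convert_field(f.strip) } }
end

# made-up sample: one header line, a blank line, and two instances
sample = Tempfile.new(["cars", ".csv"])
sample.write("colour, engine, roofrack\nwhite, 1800, no\n\nblue, 1200, yes\n")
sample.close

rows = read_rows(sample.path, 1)
sample.unlink
p rows   # prints [["white", 1800, "no"], ["blue", 1200, "yes"]]
```

Each instance comes back as an array of features, with the engine size converted to an Integer and the other fields left as strings.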

Finally, we can put all these pieces together by simplifying our example from last time. There, we took an example dataset in comma-separated format, whose first five lines are header information that we do not need. Reading in these data, extracting the 11th feature, printing its mean and standard deviation, and finally creating a histogram of its values, can now be done quite directly:

# -- process the image.txt data
image_data = L::read_data_file("image.txt", 5, ",")
column_data = image_data.collect{|i| i[10]} # 11th feature is at index 10
puts "Mean: #{L::mean(column_data)} SD: #{L::stddev(column_data)}"
L::save_histogram("image.png", column_data,
                  :xlab => "X from image", :main=>"From image")

Of course, what I've shown above is just the beginning, but it does demonstrate how easy Ruby makes it to provide tailored libraries that simplify daily tasks. Collect a few dozen such functions together, and suddenly we have a budding wrapper library. How far you go towards a complete wrapper library really depends on your stamina - the R project comes with a manual of over 1600 pages!

2009-05-14: Constructing a Wrapper around RSRuby

Creating a Wrapper

Accessing Statistical Functions

Graphical Functions

Reading Data

Example

Just a Beginning ...