In my previous post, I looked at creating some graphs and doing other analysis within RSRuby. The problem with using RSRuby directly is that the syntax can be clunky; there are many things that interfere with simply getting the result. For example, to save a graph of data, you must start up an instance of R, tell it to create a file of the right type, plot the graph, then tell the device you are finished, etc. What would be convenient, at least for simple cases, is to have a single function which does all the above in one go. In this post, I explore how easy it is to create a wrapper around some of the RSRuby calls, and so devise a personal library of functions with just the right level of exposed complexity.
I shall give my wrapper library a simple name, 'L', as I don't want long names to type in. All methods are module methods. 'L' includes an instance of R and the methods in 'L' will do the hard work of interacting with R. Here is the start of the library:
require 'rsruby'

module L
  R = RSRuby.instance   # keep a constant R interpreter
end
This is very simple, of course. The instance of R is accessible as 'L::R', and can be called in the usual way. I think this is important - the wrapper library does not prevent all the usual ways of interacting with R, but will make some functions simpler.
Let us add some simple statistical functions. I would like to compute the mean of an array of numbers by calling: 'L::mean [1,2,3,4,5,6,7]', and I can do this by providing functions like:
# --- stats

def L.mean items
  R.mean items
end

def L.stddev items
  R.sd items
end
This library is for my own use, so I can use whatever names I like. I sometimes prefer more verbose forms than R uses, as you can see in the stddev function.
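When writing wrappers like these, it can help to have a pure-Ruby reference to compare against, so the wrappers can be checked without starting R. The sketch below is an illustration and not part of 'L': the module name 'Check' is hypothetical. It relies on the fact that R's sd computes the sample standard deviation, dividing by n - 1.

```ruby
# Pure-Ruby reference versions, useful for checking the R-backed wrappers.
# The module name Check is hypothetical and not part of the L library.
module Check
  # arithmetic mean of a non-empty array of numbers
  def self.mean(items)
    items.sum.to_f / items.size
  end

  # sample standard deviation (divides by n - 1, as R's sd does);
  # needs at least two items
  def self.stddev(items)
    m = mean(items)
    sum_sq = items.sum { |x| (x - m)**2 }
    Math.sqrt(sum_sq / (items.size - 1))
  end
end

data = [1, 2, 3, 4, 5, 6, 7]
puts "Mean: #{Check.mean(data)} SD: #{Check.stddev(data).round(4)}"
```

If 'L::mean' and 'L::stddev' agree with these values on a few small arrays, the wrappers are most likely passing their arguments to R correctly.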
The process of creating and saving a graph in RSRuby requires a few steps. We can simplify those steps using our wrapper, creating a function which accepts the name of the final graph file along with the parameters for constructing the graph. We use here the usual Ruby trick of collecting trailing key => value arguments into a single Hash, so save_histogram("sample.png", data, :main => "Title", :xlab => "x label") will treat the parameters :main => "Title", :xlab => "x label" as a single argument, labelled params, which can then be passed to R's own hist function.
def L.save_histogram(filename, data, params)
  R.png filename
  R.hist(data, params)
  R.eval_R("dev.off()")
end
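The way the trailing key => value pairs are gathered into one Hash can be seen without R at all. In the minimal sketch below, describe_plot is a hypothetical stand-in for save_histogram which only reports what it received. It also gives params a default of {}, which is my own assumption about a convenient variation and not something the function above does.

```ruby
# describe_plot is a hypothetical stand-in for L.save_histogram; it makes no
# R calls and only reports what it received.
def describe_plot(filename, data, params = {})
  "#{filename}: #{data.size} values, options #{params.keys.inspect}"
end

# The trailing key => value pairs arrive as one Hash in params.
puts describe_plot("sample.png", [1, 2, 3], :main => "Title", :xlab => "x label")

# With the default of {}, the options can be left out altogether.
puts describe_plot("plain.png", [1, 2, 3])
```

Because params is an ordinary Hash, the wrapper could also merge in its own defaults before handing the options on to R.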
The machine-learning and data-mining communities have a fairly standard format for representing data instances: comma-separated values, or CSV. Each line of a text file is taken to represent a single data instance. The features of the data are separated by commas. For example, a table of data representing information about cars might hold features about the colour, engine size and whether there was a roofrack:
white, 1800, no
blue, 1200, yes
...
Many such files contain header information, describing what the column headings are, for instance. We would like a function that will read such data into a Ruby data structure. The aims of the function are:
- to accept as input the filename of the data to read in;
- to optionally accept a number of header lines to ignore;
- to optionally accept an alternative separator symbol to the comma;
- to return an array in which each element is an array of the features for one instance; and
- to convert the feature values into Integer or Float types, where appropriate.
The functions I created are:
# --- reading in data

# try to convert string item into an Integer or a Float, else return item
def L.convert_item item
  begin
    Integer item
  rescue
    begin
      Float item
    rescue
      item
    end
  end
end

# convert every item in given list of items
def L.convert_items items
  items.collect {|i| L.convert_item i}
end

# return a list of the lines from given file
# -- ignore the top ignore_n lines
# -- convert every item in each line using the above
def L.read_data_file(filename, ignore_n=0, split_char=",")
  data = []
  file = File.open(filename, "r")
  ignore_n.times { file.gets }
  while line = file.gets
    unless line.strip == ""   # ignore blank lines
      data << L.convert_items(line.split(split_char))
    end
  end
  file.close
  data
end
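One detail worth noting is that fields which stay as strings keep any surrounding whitespace, including the newline on the last field of each line. The sketch below shows one possible variation which strips each field before converting it. It is an illustration under my own assumptions and not a replacement for the functions above: the module name 'Tidy' and the sample file contents are hypothetical, and the code is kept separate from 'L' so that it runs without R.

```ruby
require 'tempfile'

# Hypothetical variation on L.read_data_file, kept in its own module so it
# can be run without R. Each field is stripped before conversion.
module Tidy
  # try Integer, then Float, else return the stripped string
  def self.convert_item(item)
    text = item.strip
    Integer(text)
  rescue ArgumentError
    begin
      Float(text)
    rescue ArgumentError
      text
    end
  end

  # read rows from filename, skipping ignore_n header lines and blank lines
  def self.read_data_file(filename, ignore_n = 0, split_char = ",")
    rows = []
    File.open(filename, "r") do |file|
      ignore_n.times { file.gets }          # skip the header lines
      file.each_line do |line|
        next if line.strip.empty?           # ignore blank lines
        rows << line.split(split_char).map { |field| convert_item(field) }
      end
    end
    rows
  end
end

# Write a small sample file: one header line, then the car data.
sample = Tempfile.new("cars")
sample.write("colour, engine, roofrack\nwhite, 1800, no\n\nblue, 1200.5, yes\n")
sample.close

p Tidy.read_data_file(sample.path, 1)
sample.unlink
```

Using the block form of File.open also means the file is closed even if a line fails to parse.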
Finally, we can put all these pieces together by simplifying our example from last time. There, we took an example dataset which had a number of fields in comma-separated format. That is, each line represents a single data instance, and the information for each instance's features is separated by a comma. The first five lines of our example dataset are header information, which we do not need. Reading in these data, extracting the 11th feature from the data, printing the feature's mean and standard deviation, and finally creating a histogram of its values, can now be done quite directly:
# -- process the image.txt data
image_data = L::read_data_file("image.txt", 5, ",")
column_data = image_data.collect{|i| i[10]}   # the 11th feature of each instance
puts "Mean: #{L::mean(column_data)} SD: #{L::stddev(column_data)}"
L::save_histogram("image.png", column_data,
                  :xlab => "X from image", :main => "From image")
Of course, what I've shown above is just the beginning. It does demonstrate how easy Ruby makes it to provide tailored libraries to simplify daily tasks. Collect a few dozen such functions together, and suddenly we have a budding wrapper library. How far you go towards a complete wrapper library really depends on your stamina - the R project comes with a manual of over 1600 pages!