A fairly typical scenario I find myself in is having a program run an experiment, outputting its results into a text file. I want to automate the data analysis so that large amounts of data are converted into tables, graphs and summarising statistics with as little effort as possible. This is the kind of work computers should be doing for us.
To illustrate a simple case, I have an optimisation program running various experiments, and producing text output, which consists of many lines like:
    results-14-51-3-0.txt, 26, 7, 128, 18, 0.1, 4.62769, 6.45275e+07, 27.6596, 8, 256, 19, 0.1, 9.25538, 4.79442e+08, 27.6596, 9, 512, 21, 0.1, 18.5108, 3.75805e+09, 27.6596
    results-14-51-3-1.txt, 21, 7, 128, 16, 0.1, 5.72952, 1.14546e+07, 22.3404, 8, 256, 17, 0.1, 11.459, 6.22772e+07, 22.3404, 9, 512, 18, 0.1, 22.9181, 3.37488e+08, 22.3404
    ...
    results-14-59-3-0.txt, 3, 7, 128, 3, 0.35, 35.84, 53775.3, 3.57143, 8, 256, 3, 0.45, 80.2133, 257532, 3.19149, 9, 512, 3, 0.45, 160.427, 1.03329e+06, 3.19149
    results-14-59-3-1.txt, 24, 7, 128, 16, 0.1, 5.01333, 2.77391e+07, 25.5319, 8, 256, 20, 0.1, 10.0267, 2.46826e+08, 25.5319, 9, 512, 20, 0.1, 20.0533, 1.39406e+09, 25.5319
    ...
These provide the following information:
- encoded within the name are the values of 'n', 'm' and 'k', which define the experiment, and the 'instance' number; each experiment has been run 20 times.
- next is the actual number of minima located in the file.
- finally we have three sets of seven values, encoding: l, "2 to the power of l" (the number of samples made), the number of minima located, values for gamma, r_appr and T_gamma_r, and finally N_appr, which is an estimate of the actual number of minima.
I want to collect some aggregate results out of these data and generate the results in a format suitable to use in LaTeX. (One of the great benefits of using LaTeX for writing articles is that its text format can be created by your programs!) The first time I tried this, I imported the file into a spreadsheet program as CSV, wrote some cell formulae, and then manually copied all the results over to my text editor - not a process I want to repeat.
As we are using Ruby, we first need some classes to hold the data. There will be a class for the individual results:
    class Results_l
      attr_reader :l, :num_found, :gamma, :r_appr, :t_gamma_r, :n_appr

      # Arguments arrive as strings and are converted here.
      # pow_l (2 to the power of l) is accepted but not stored.
      def initialize(l, pow_l, num_found, gamma, r_appr, t_gamma_r, n_appr)
        @l = l.to_i
        @num_found = num_found.to_i
        @gamma = gamma.to_f
        @r_appr = r_appr.to_f
        @t_gamma_r = t_gamma_r.to_f
        @n_appr = n_appr.to_f
      end
    end
And a class for the complete experiment:
    class Results_nmk
      attr_reader :n, :m, :k, :instance, :true_N, :results_1, :results_2, :results_3

      def initialize(n, m, k, instance, true_N, results_1, results_2, results_3)
        @n = n
        @m = m
        @k = k
        @instance = instance
        @true_N = true_N
        @results_1 = results_1
        @results_2 = results_2
        @results_3 = results_3
      end
    end
The data is stored in consecutive lines of a text file. Our main program will read in each text line, extract the fields that it needs, and construct instances of the above classes with the relevant data. Each line will be stored as an instance of Results_nmk, and these instances will be stored in an array of results. Ruby provides a number of useful techniques to achieve this.
First, we need to split a string up into pieces, based on a separator token. The String#split method is key to making our process here work. If the variable line holds one of the lines read in from the text file, then line.split(",") will divide the line up into an array of strings, separating it at each of the commas.
    "a,b,c,dd,ee".split(",") -> ["a", "b", "c", "dd", "ee"]
Second, we want some of the strings to be treated as numbers, not as strings. We can do this with the String#to_i and String#to_f methods, which convert strings into Integers (Fixnums in older versions of Ruby) or Floats, respectively.
    "25.2".to_f -> 25.2
    "19x1".to_i -> 19   # notice that the initial number is read, and the rest ignored
Third, we need to extract items from the input array and construct a class instance using these items as its input parameters. Ruby provides a nice way to 'explode' an array when passing it to a method: prefix the array with * (the splat operator), and each item in the array gets seen by the method as a separate input argument.
For analysing my experiment data, I split the input text line and stored it in a local variable called line_arr. Then I could create an instance of Results_l by exploding the relevant part of line_arr, e.g.
Results_l.new(*line_arr[2..8])
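The three techniques can be seen working together in a minimal sketch. The Sample class below is a hypothetical stand-in for Results_l, cut down to four constructor arguments, and the input is only the start of the first data line shown earlier.

```ruby
# Hypothetical stand-in for Results_l, keeping three of its fields.
class Sample
  attr_reader :l, :num_found, :gamma

  # pow_l is accepted but not stored, mirroring the article's class.
  def initialize(l, pow_l, num_found, gamma)
    @l = l.to_i
    @num_found = num_found.to_i
    @gamma = gamma.to_f
  end
end

# The start of the first data line shown earlier.
line = "results-14-51-3-0.txt, 26, 7, 128, 18, 0.1"

line_arr = line.split(",")             # pieces keep their leading space, e.g. " 26"
true_n   = line_arr[1].to_i            # to_i skips the leading whitespace
sample   = Sample.new(*line_arr[2..5]) # four strings become four arguments

puts true_n
puts sample.l
puts sample.num_found
puts sample.gamma
```

Because to_i and to_f ignore leading whitespace, the space left after each comma does not need to be stripped before conversion.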
The complete method to read in and construct class instances for my data file follows. I have made the classes themselves responsible for converting their arguments from strings to Integers or Floats as required.
    def analyse_file filename
      results = []
      # File.foreach closes the file once every line has been read
      File.foreach(filename) do |line|
        line_arr = line.split(",")
        results << Results_nmk.new(
          get_n(line_arr[0]),
          get_m(line_arr[0]),
          get_k(line_arr[0]),
          get_instance(line_arr[0]),
          line_arr[1].to_i,
          Results_l.new(*line_arr[2..8]),
          Results_l.new(*line_arr[9..15]),
          Results_l.new(*line_arr[16..22]))
      end
      results  # return the array of Results_nmk instances
    end
The get_n, get_m, get_k and get_instance methods all work by splitting the "results-14-55-3-0.txt" name string on the "-" separator, and converting the relevant part of the array to an integer. For example:
    def get_k name
      name.split("-")[3].to_i
    end
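As a sketch of how the four helpers relate, the name can also be split once and all four numbers read together. The parse_name method is hypothetical, and the positions of n, m and the instance number are an assumption inferred from get_k and from the order in which the values are listed above.

```ruby
# Hypothetical helper: split the file name once and read all four values.
# Assumes the layout "results-<n>-<m>-<k>-<instance>.txt".
def parse_name(name)
  _prefix, n, m, k, instance = name.split("-")
  # "0.txt".to_i reads the leading 0 and ignores ".txt"
  { n: n.to_i, m: m.to_i, k: k.to_i, instance: instance.to_i }
end

p parse_name("results-14-51-3-0.txt")
```

The trailing ".txt" needs no special handling, for the same reason that "19x1".to_i gives 19.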
In my experiment, I have 20 instances for each value of the three key variables, n, m and k. For my tables, I want to aggregate these results to compute mean values and some extra relations. The output should be in a LaTeX friendly form, so I can just copy the output directly into my article. I didn't need anything too complex here, but of course, the opportunity is there to automate more complex statistical analysis.
Again, there are two steps, three if you include the output as something separate.
The first step is to collect together values for each of the items I am interested in. For each value of n, m, k and l there are 20 experimental results. I want to collect all of these 20 results together for each set of n, m, k and l. This step is complicated by the fact that each instance of Results_nmk contains three instances of Results_l! I would like to use an iterator to step through an array of possible nmkl values, so what I do first is create an array of 4-tuples, where each 4-tuple is an array of n, m, k, l values. The code to create the array of possible nmkl values is below. Notice the 'unless' modifiers, which make sure each nmkl value is only stored once.
    nmkl_values = []
    results.each do |result|
      nmkl_value_1 = [result.n, result.m, result.k, result.results_1.l]
      nmkl_values << nmkl_value_1 unless nmkl_values.include? nmkl_value_1
      nmkl_value_2 = [result.n, result.m, result.k, result.results_2.l]
      nmkl_values << nmkl_value_2 unless nmkl_values.include? nmkl_value_2
      nmkl_value_3 = [result.n, result.m, result.k, result.results_3.l]
      nmkl_values << nmkl_value_3 unless nmkl_values.include? nmkl_value_3
    end
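The same list of distinct tuples could also be built with flat_map and uniq. This is only a sketch of an alternative, not the code used for the article: the Run struct is a hypothetical stand-in for Results_nmk that holds its three l values in one array.

```ruby
# Hypothetical stand-in for Results_nmk; ls holds the three l values of a run.
Run = Struct.new(:n, :m, :k, :ls)

runs = [
  Run.new(14, 51, 3, [7, 8, 9]),
  Run.new(14, 51, 3, [7, 8, 9]),  # a second instance of the same experiment
  Run.new(14, 59, 3, [7, 8, 9])
]

# Build every [n, m, k, l] tuple, then keep the first occurrence of each.
nmkl_values = runs.flat_map { |run| run.ls.map { |l| [run.n, run.m, run.k, l] } }.uniq

nmkl_values.each { |tuple| p tuple }
```

Array#uniq keeps the tuples in order of first appearance, so the result is ordered the same way as with the explicit 'unless' checks.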
We can now iterate through the list of nmkl values, and retrieve the matching Results_nmk instances and, through them, the matching Results_l instances. We will add the following methods to Results_nmk:
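    class Results_nmk
      # true if this experiment has the given n, m, k and holds a result for l
      def matches_nmkl?(n, m, k, l)
        @n == n && @m == m && @k == k &&
          (@results_1.l == l || @results_2.l == l || @results_3.l == l)
      end

      # return whichever of the three results was computed for the given l
      def get_result_l l
        if @results_1.l == l
          @results_1
        elsif @results_2.l == l
          @results_2
        elsif @results_3.l == l
          @results_3
        else
          raise ArgumentError, "no result for l = #{l}"  # there must be a legal value for l
        end
      end
    end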
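If the three results were held in an array rather than in three separate variables, both methods would reduce to a search. The sketch below assumes that different layout; ResultL, Experiment and result_for are hypothetical names, not part of the program described here.

```ruby
# Hypothetical stand-ins, reduced to what the lookup needs.
ResultL = Struct.new(:l, :n_appr)

class Experiment
  def initialize(n, m, k, results)
    @n, @m, @k = n, m, k
    @results = results  # array of ResultL, one per value of l
  end

  # true if n, m, k agree and some result was computed for l
  def matches_nmkl?(n, m, k, l)
    [@n, @m, @k] == [n, m, k] && @results.any? { |r| r.l == l }
  end

  # return the result for l, or fail loudly if there is none
  def result_for(l)
    @results.find { |r| r.l == l } or raise ArgumentError, "no result for l = #{l}"
  end
end

experiment = Experiment.new(14, 51, 3, [ResultL.new(7, 27.6596), ResultL.new(8, 27.6596)])
puts experiment.matches_nmkl?(14, 51, 3, 7)
puts experiment.result_for(8).n_appr
```

The trade-off is that the constructor takes one array instead of three named results, which would also change how each line is read in.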
The second step is to compute the results and generate output in the correct format. Computing the results is simple enough, as I only care about the mean for each set of instances; I compute this value using RSRuby. To get the results into the correct format for LaTeX, I print out each line of results as text, with the numbers separated by an ampersand sign. These lines can then be copied into my article, where the header and other formatting information is waiting for them. An instance of R is initialised through RSRuby at the start of the program with:
    require 'rsruby'
    @@r_instance = RSRuby.instance
The code is straightforward: run through each one of the nmkl values, retrieve the values for each of the items of interest, and compute the average. Within the 'puts' I use the format #{"%5.2f" % true_N}, which displays every number with 2 decimal places. First is the general loop, and second is one of the methods which gets called to collect the values. Because of the way I have constructed my data, sometimes you want a list of the overall results, of type Results_nmk, and sometimes you want a list of the individual results, of type Results_l. My code computes both these lists, and passes them to the individual methods for computing the desired averages. The method 'show_mean' takes an array as an argument, uses R to compute the mean, and returns a neatly formatted string.
    nmkl_values.each do |nmkl_value|
      instances = results.find_all {|result| result.matches_nmkl?(*nmkl_value)}
      l_instances = instances.collect {|result| result.get_result_l(nmkl_value[3])}
      avg_true_N = show_mean(get_true_N(instances))
      avg_beta_m = show_mean(get_beta_m(l_instances))
      avg_gamma = show_mean(get_gamma(l_instances))
      avg_r_appr = show_mean(get_r_appr(l_instances))
      avg_n_appr = show_mean(get_n_appr(l_instances))
      avg_n_appr_by_n = show_mean(get_n_appr_by_n(instances, nmkl_value[3]))
      # "\\\\" in a double-quoted string prints the two backslashes that end a LaTeX table row
      puts "#{"%5.2f" % nmkl_value[1]} & #{avg_true_N} & $2^#{nmkl_value[3]}$ & #{avg_beta_m} & #{avg_gamma} & #{avg_r_appr} & #{avg_n_appr} & #{avg_n_appr_by_n} \\\\"
    end

    def show_mean values
      "%5.2f" % @@r_instance.mean(values)
    end

    def get_n_appr(l_instances)
      l_instances.collect {|instance| instance.n_appr}
    end
The final output is:
    51.00 & 22.40 & $2^7$ & 16.60 & 0.12 & 8.23 & 23.76 & 1.07 \\
    51.00 & 22.40 & $2^8$ & 18.45 & 0.13 & 17.17 & 23.93 & 1.06 \\
    51.00 & 22.40 & $2^9$ & 19.20 & 0.12 & 33.23 & 24.21 & 1.07 \\
    ...
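For a simple mean, the averaging and row formatting can also be sketched in plain Ruby, without R. The plain_mean and latex_row helpers are hypothetical names, and the two input values are just the N_appr figures at l = 7 from the first two data lines shown at the start, not a full set of 20 instances.

```ruby
# Hypothetical helper: arithmetic mean in plain Ruby, formatted like show_mean.
def plain_mean(values)
  "%5.2f" % (values.sum.to_f / values.size)
end

# Hypothetical helper: join the cells with ampersands and end the row
# with the two backslashes LaTeX expects.
def latex_row(cells)
  cells.join(" & ") + " \\\\"
end

n_appr_values = [27.6596, 22.3404]  # N_appr at l = 7, first two data lines
puts latex_row(["$2^7$", plain_mean(n_appr_values)])
```

Keeping R in the loop still pays off once the analysis goes beyond means, since its statistical functions can be called in the same way as mean.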