2009-05-12: Analysing Numeric Data in Text Files with Ruby

A fairly typical scenario I find myself in is having a program run an experiment and write its results to a text file. I want to automate the data analysis so that large amounts of data are converted into tables, graphs and summary statistics with as little effort as possible. This is the kind of work computers should be doing for us.

To illustrate a simple case, I have an optimisation program running various experiments, and producing text output, which consists of many lines like:

results-14-51-3-0.txt, 26, 7, 128, 18, 0.1, 4.62769, 6.45275e+07, 27.6596, 8, 256, 19, 0.1, 9.25538, 4.79442e+08, 27.6596, 9, 512, 21, 0.1, 18.5108, 3.75805e+09, 27.6596
results-14-51-3-1.txt, 21, 7, 128, 16, 0.1, 5.72952, 1.14546e+07, 22.3404, 8, 256, 17, 0.1, 11.459, 6.22772e+07, 22.3404, 9, 512, 18, 0.1, 22.9181, 3.37488e+08, 22.3404
...
results-14-59-3-0.txt, 3, 7, 128, 3, 0.35, 35.84, 53775.3, 3.57143, 8, 256, 3, 0.45, 80.2133, 257532, 3.19149, 9, 512, 3, 0.45, 160.427, 1.03329e+06, 3.19149
results-14-59-3-1.txt, 24, 7, 128, 16, 0.1, 5.01333, 2.77391e+07, 25.5319, 8, 256, 20, 0.1, 10.0267, 2.46826e+08, 25.5319, 9, 512, 20, 0.1, 20.0533, 1.39406e+09, 25.5319
...

These provide the following information:

  1. the file name encodes the values of 'n', 'm' and 'k', which define the experiment, followed by the 'instance' number; each experiment has been run 20 times.
  2. next is the actual number of minima located in the file.
  3. finally we have three sets of seven values, encoding: l and "2 to the power of l" (the number of samples made), the number of minima located, values for gamma, r_appr and T_gamma_r, and finally N_appr, which is an estimate of the actual number of minima.

I want to collect some aggregate results out of these data and generate the results in a format suitable to use in LaTeX. (One of the great benefits of using LaTeX for writing articles is that its text format can be created by your programs!) The first time I tried this, I imported the file into a spreadsheet program as CSV, wrote some cell formulae, and then manually copied all the results over to my text editor - not a process I want to repeat.

Classes for the Data

As we are using Ruby, we first need some classes to hold the data. There will be a class for the individual results:

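class Results_l
  attr_reader :l, :num_found, :gamma, :r_appr, :t_gamma_r, :n_appr

  # All arguments arrive as strings and are converted here.
  # pow_l (2 to the power of l) is accepted but not stored.
  def initialize(l, pow_l, num_found, gamma, r_appr, t_gamma_r, n_appr)
    @l = l.to_i
    @num_found = num_found.to_i
    @gamma = gamma.to_f
    @r_appr = r_appr.to_f
    @t_gamma_r = t_gamma_r.to_f  # name must match the attr_reader above
    @n_appr = n_appr.to_f
  end
end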

And a class for the complete experiment:

class Results_nmk
  attr_reader :n, :m, :k, :instance, :true_N, :results_1, :results_2, :results_3

  def initialize(n, m, k, instance, true_N, results_1, results_2, results_3)
    @n = n
    @m = m
    @k = k
    @instance = instance
    @true_N = true_N
    @results_1 = results_1
    @results_2 = results_2
    @results_3 = results_3
  end
end

Reading in the Data

The data is stored in consecutive lines of a text file. Our main program will read in each text line, extract the fields that it needs, and construct instances of the above classes with the relevant data. Each line will be stored as an instance of Results_nmk, and these instances will be stored in an array of results. Ruby provides a number of useful techniques to achieve this.

First, we need to split a string up into pieces, based on a separator token. The String.split method is key to making our process here work. If the variable line holds one of the lines read in from the text file, then line.split(",") will divide the line up into an array of strings, separating it at each of the commas. (Called without an argument, split separates on whitespace instead.)

"a,b,c,dd,ee".split(",") -> ["a", "b", "c", "dd", "ee"]

Second, we want some of the strings to be treated as numbers, not as strings. We can do this with the String.to_i and String.to_f methods, which convert strings into Fixnums or Floats, respectively.

"25.2".to_f -> 25.2
"19x1".to_i -> 19 # notice that the initial number is read, and the rest ignored

Third, we need to extract items from the input array and construct a class instance using these items as its input parameters. Ruby provides a nice way to 'explode' an array when passing it to a method, so that each item in the array gets seen by the method as a separate input argument.

For analysing my experiment data, I split the input text line and stored it in a local variable called line_arr. Then I could create an instance of Results_l by exploding the relevant part of line_arr, e.g. Results_l.new(*line_arr[2..8])
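As a rough sketch of how the splitting, exploding and converting fit together, the three blocks of seven values can also be picked out with each_slice. The Block struct and the blocks_from method are hypothetical stand-ins, used here only so that the example runs on its own; they are not part of my program. Note that the fields keep the space that follows each comma, and that to_i and to_f skip it.

```ruby
# Hypothetical stand-in for Results_l, so that this sketch runs on its own.
Block = Struct.new(:l, :pow_l, :num_found, :gamma, :r_appr, :t_gamma_r, :n_appr)

# Hypothetical helper: turn one line of the data file into three Blocks.
def blocks_from(line)
  line_arr = line.split(",")
  # Fields 2..22 are three groups of seven values. each_slice(7) yields one
  # group at a time, and * passes its seven strings as seven arguments.
  line_arr[2..22].each_slice(7).map { |seven| Block.new(*seven) }
end

SAMPLE = "results-14-51-3-0.txt, 26, 7, 128, 18, 0.1, 4.62769, 6.45275e+07, 27.6596, " \
         "8, 256, 19, 0.1, 9.25538, 4.79442e+08, 27.6596, " \
         "9, 512, 21, 0.1, 18.5108, 3.75805e+09, 27.6596"

blocks = blocks_from(SAMPLE)
puts blocks.length                      # 3
p blocks.map { |block| block.l.to_i }   # [7, 8, 9]
p blocks[0].t_gamma_r                   # " 6.45275e+07" (still a string, with its leading space)
p blocks[0].t_gamma_r.to_f              # 64527500.0
```

The struct stores the raw strings, which is why the conversions appear at the point of use here; in my own classes the conversion happens inside initialize instead.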

The complete method to read in and construct class instances for my data file follows. I have made the classes themselves responsible for converting their arguments from strings to Fixnums or Floats as required.

def analyse_file filename
  results = []
  # File.foreach closes the file once every line has been read
  File.foreach(filename) do |line|
    line_arr = line.split(",")
    results << Results_nmk.new(
        get_n(line_arr[0]),
        get_m(line_arr[0]),
        get_k(line_arr[0]),
        get_instance(line_arr[0]),
        line_arr[1].to_i,
        Results_l.new(*line_arr[2..8]),
        Results_l.new(*line_arr[9..15]),
        Results_l.new(*line_arr[16..22]))
  end
  # ... the analysis code from the next section goes here, working on results
end

The get_n, get_m, get_k and get_instance methods all work by splitting the "results-14-55-3-0.txt" name string on the "-" separator, and converting the relevant part of the array to an integer. For example:

def get_k name
  name.split("-")[3].to_i
end
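The four helpers could equally be folded into one. The sketch below assumes the name layout "results-n-m-k-instance.txt", which is consistent with get_k reading position 3; parse_name is a hypothetical method, not one from my program.

```ruby
# Hypothetical helper: decode all four values from a file name in one pass.
# Assumes the layout "results-<n>-<m>-<k>-<instance>.txt".
def parse_name(name)
  _prefix, n, m, k, instance = name.strip.split("-")
  # "1.txt".to_i reads the leading digits and ignores the ".txt" suffix
  { n: n.to_i, m: m.to_i, k: k.to_i, instance: instance.to_i }
end

p parse_name("results-14-59-3-1.txt")
```

Returning a hash keeps the four values together, at the cost of changing how Results_nmk.new would be called.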

Calculating Statistics

In my experiment, I have 20 instances for each value of the three key variables, n, m and k. For my tables, I want to aggregate these results to compute mean values and some extra relations. The output should be in a LaTeX friendly form, so I can just copy the output directly into my article. I didn't need anything too complex here, but of course, the opportunity is there to automate more complex statistical analysis.

Again, there are two steps, three if you include the output as something separate.

The first step is to collect together values for each of the items I am interested in. For each value of n, m, k and l there are 20 experimental results. I want to collect all of these 20 results together for each set of n, m, k and l. This step is complicated by the fact that each instance of Results_nmk contains three instances of Results_l! I would like to use an iterator to step through an array of possible nmkl values, so what I do first is create an array of 4-tuples, where each 4-tuple is an array of n, m, k, l values. The code to create the array of possible nmkl values is below. Notice the 'unless' clauses, which make sure each nmkl value is only stored once.

  nmkl_values = []
  results.each do |result|
    nmkl_value_1 = [result.n, result.m, result.k, result.results_1.l]
    nmkl_values << nmkl_value_1 unless nmkl_values.include? nmkl_value_1
    nmkl_value_2 = [result.n, result.m, result.k, result.results_2.l]
    nmkl_values << nmkl_value_2 unless nmkl_values.include? nmkl_value_2
    nmkl_value_3 = [result.n, result.m, result.k, result.results_3.l]
    nmkl_values << nmkl_value_3 unless nmkl_values.include? nmkl_value_3
  end
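The same list can probably be built more compactly with flat_map and uniq, which removes duplicates while keeping the order of first appearance. This is only a sketch: ResultL and ResultNmk are hypothetical structs standing in for my two classes, and distinct_nmkl is a made-up method name.

```ruby
# Minimal stand-ins for the two result classes, only so this sketch runs alone.
ResultL   = Struct.new(:l)
ResultNmk = Struct.new(:n, :m, :k, :results_1, :results_2, :results_3)

# Collect every distinct [n, m, k, l] combination, in order of first appearance.
def distinct_nmkl(results)
  results.flat_map do |result|
    [result.results_1, result.results_2, result.results_3].map do |result_l|
      [result.n, result.m, result.k, result_l.l]
    end
  end.uniq
end

results = [
  ResultNmk.new(14, 51, 3, ResultL.new(7), ResultL.new(8), ResultL.new(9)),
  ResultNmk.new(14, 51, 3, ResultL.new(7), ResultL.new(8), ResultL.new(9)),
  ResultNmk.new(14, 59, 3, ResultL.new(7), ResultL.new(8), ResultL.new(9))
]
p distinct_nmkl(results)
```

With the three sample results above, the duplicate second entry contributes nothing, so six 4-tuples remain.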

We can now iterate through the list of nmkl values, and retrieve the matching Results_nmk instances and, through them, the matching Results_l. We will add the following methods to Results_nmk:

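class Results_nmk
  # true if this experiment has the given n, m, k and holds results for l
  def matches_nmkl?(n, m, k, l)
    @n == n && @m == m && @k == k &&
      (@results_1.l == l || @results_2.l == l || @results_3.l == l)
  end

  # return whichever of the three sets of results was computed for l
  def get_result_l l
    if @results_1.l == l
      @results_1
    elsif @results_2.l == l
      @results_2
    elsif @results_3.l == l
      @results_3
    else
      # there must be a legal value for l
      raise ArgumentError, "no results stored for l = #{l}"
    end
  end
end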

The second step is to compute the results and generate output in the correct format. Computing the results is simple enough, as I only care about the mean for each set of instances; I compute this value using RSRuby. To get the results into the correct format for LaTeX, I print out each line of results as text, with the numbers separated by an ampersand sign. These lines can then be copied into my article, where the header and other formatting information is waiting for them. An instance of R is initialised through RSRuby at the start of the program with:

require 'rsruby'
@@r_instance = RSRuby.instance

The code is straightforward: run through each one of the nmkl values, retrieve the values for each of the items of interest, and compute the average. Within the 'puts' I use the format: #{"%5.2f" % nmkl_value[1]} which displays the number with 2 decimal places. Below are the general loop, followed by the 'show_mean' method and one of the methods which gets called to collect the values. Because of the way I have constructed my data, sometimes you want a list of the overall results, of type Results_nmk, and sometimes you want a list of the individual results, of type Results_l. My code computes both these lists, and passes them to the individual methods for computing the desired averages. The method 'show_mean' takes an array as an argument, uses R to compute the mean, and returns a string formatted neatly.

  nmkl_values.each do |nmkl_value|
    instances = results.find_all {|result| result.matches_nmkl?(*nmkl_value)}
    l_instances = instances.collect{|result| result.get_result_l(nmkl_value[3])}

    avg_true_N = show_mean(get_true_N(instances))
    avg_beta_m = show_mean(get_beta_m(l_instances))
    avg_gamma = show_mean(get_gamma(l_instances))
    avg_r_appr = show_mean(get_r_appr(l_instances))
    avg_n_appr = show_mean(get_n_appr(l_instances))
    avg_n_appr_by_n = show_mean(get_n_appr_by_n(instances, nmkl_value[3]))

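    # Adjacent string literals are joined, so the row stays on one output line;
    # "\\\\" in a double-quoted string prints the two backslashes LaTeX needs.
    puts "#{"%5.2f" % nmkl_value[1]} & #{avg_true_N} & $2^#{nmkl_value[3]}$ & #{avg_beta_m} & " \
         "#{avg_gamma} & #{avg_r_appr} & #{avg_n_appr} & #{avg_n_appr_by_n} \\\\"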
  end

def show_mean values
  "%5.2f" % @@r_instance.mean(values)
end

def get_n_appr(l_instances)
  l_instances.collect {|instance| instance.n_appr}
end
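For a plain mean, R is not strictly needed. A minimal sketch in pure Ruby follows; show_mean_plain is a hypothetical stand-in for show_mean, and it swaps the call to R for a simple sum and divide.

```ruby
# Hypothetical plain-Ruby replacement for show_mean: no R instance needed.
def show_mean_plain(values)
  # Start the sum at 0.0 so that integer inputs still give a float mean.
  mean = values.inject(0.0) { |sum, value| sum + value } / values.length
  "%5.2f" % mean
end

puts show_mean_plain([27.6596, 22.3404])   # 25.00
```

An empty array gives NaN rather than an error, so it may be worth checking for that case before formatting. R earns its place once the analysis goes beyond means.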

The final output is:

51.00 & 22.40 & $2^7$ & 16.60 & 0.12 & 8.23 & 23.76 & 1.07 \\
51.00 & 22.40 & $2^8$ & 18.45 & 0.13 & 17.17 & 23.93 & 1.06 \\
51.00 & 22.40 & $2^9$ & 19.20 & 0.12 & 33.23 & 24.21 & 1.07 \\
...
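Rows in this shape can also be assembled from an array of cells that have already been formatted. The latex_row method below is a hypothetical helper, sketched only to show the joining with ampersands and the escaping of the row terminator.

```ruby
# Hypothetical helper: join already-formatted cells into one LaTeX table row.
def latex_row(cells)
  # In a double-quoted string, "\\\\" yields the two backslashes that end a row.
  cells.join(" & ") + " \\\\"
end

puts latex_row(["51.00", "22.40", "$2^7$", "16.60"])
```

Keeping the cells in an array makes it easier to add or reorder columns than editing one long interpolated string.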

Page from Peter's Scrapbook, output from a VimWiki on 2024-01-29.