2024-02-03: Fantom<->Java Interop for Iris Data Analysis

As in a few other notes, I will explore some simple data analysis tools applied to the iris dataset, but this time using Fantom. The example illustrates how to call a Java library from Fantom.

The Java library used here is Apache Commons Math, version 3.6.1. Download commons-math3-3.6.1.jar and place it in "FANTOM/lib/java/ext", ready for Fantom to find and use.

Loading a CSV File

The iris dataset, "iris.data", is a CSV formatted dataset:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
...

There are four numeric fields, followed by a class name.

The first part of the program must load the dataset, creating a list of IrisInstance instances, where each IrisInstance holds the data from one row of the file.

The IrisInstance class is:

class IrisInstance
{
  Float sepalLength
  Float sepalWidth
  Float petalLength
  Float petalWidth
  Str label

  new make(Str[] row)
  {
    this.sepalLength = row[0].toFloat
    this.sepalWidth = row[1].toFloat
    this.petalLength = row[2].toFloat
    this.petalWidth = row[3].toFloat
    this.label = row[4]
  }
}

The constructor accepts a list of strings, converts the first four values into Floats, and stores the fifth as the label.

Fantom supports CSV file handling in its standard library: the "util" library has a CsvInStream class for reading CSV files. With this, the code to read the CSV file row by row and convert each row into an IrisInstance is:

  static Void main(Str[] args) 
  {
    // step 1: read in CSV file
    echo("File name " + args[0])                                    // <1>
    data := IrisInstance[,]                                         // <2>

    util::CsvInStream(args[0].toUri.toFile.in).eachRow |Str[] row|  // <3>
    {
      if (row.size == 5)                                            // <4>
      {
        data.add(IrisInstance(row))
      }
    }
    echo("Read " + data.size + " instances")                        // <5>
  }
  1. Get the filename from the input args.
  2. Create an empty list to store the IrisInstance instances.
  3. Use CsvInStream to read in each row of the given file.
  4. If the row has enough fields, convert the row into an IrisInstance instance and add it to the data list.
  5. Finally, check the data was read OK by printing out the number of instances.

Output:

> fan IrisData.fan iris.data
File name iris.data
Read 150 instances
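
The row.size check guards against blank or incomplete lines in the file (an assumption about the input; the listing above shows only complete rows). As a minimal sketch, CsvInStream can be tried on an in-memory string instead of a file. The class name CsvSketch and the sample text are made up for illustration:

```fantom
using util

class CsvSketch
{
  static Void main()
  {
    // two complete rows and one incomplete row (hypothetical sample text)
    text := "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n4.7,3.2\n"

    kept := Str[][,]
    CsvInStream(text.in).eachRow |Str[] row|
    {
      // keep only rows with all five fields, as in the main program
      if (row.size == 5) kept.add(row)
    }
    echo("Kept " + kept.size + " complete rows")
  }
}
```

Only rows with exactly five fields reach the list, so a short row never gets as far as the IrisInstance constructor.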

Descriptive Statistics

The Apache Commons Math library provides the class DescriptiveStatistics, which stores a collection of values and provides methods to access statistical properties, such as arithmetic mean, standard deviation, etc.

We can collect the values for a given attribute and return an instance of DescriptiveStatistics using the following method, which accepts an accessor function to retrieve the required attribute value:

  static DescriptiveStatistics getStatistics(IrisInstance[] data, 
                                             |IrisInstance->Float| accessor)
  {
    ds := DescriptiveStatistics()               // <1>
    data.each { ds.addValue(accessor(it)) }     // <2>
    return ds
  }
  1. Creates an instance of the required class from the library.
  2. Uses an it-block to apply the accessor to every item in the data set.

It is then straightforward to print out information for an attribute from the instance of DescriptiveStatistics:

  static Void displayStatistics(Str name, DescriptiveStatistics statistics)
  {
    echo(name)
    echo(" -- Minimum: " + statistics.getMin.toLocale("0.00"))    // <1>
    echo(" -- Maximum: " + statistics.getMax.toLocale("0.00"))
    echo(" -- Mean:    " + statistics.getMean.toLocale("0.00"))
    echo(" -- Stddev:  " + statistics.getStandardDeviation.toLocale("0.00"))
  }
  1. toLocale lets us specify how to represent the printed numbers.

Finally, we call these methods using closures to specify each attribute value:

    displayStatistics("Sepal Length", 
                      getStatistics(data) { it.sepalLength })
    displayStatistics("Sepal Width", 
                      getStatistics(data) { it.sepalWidth })
    displayStatistics("Petal Length", 
                      getStatistics(data) { it.petalLength })
    displayStatistics("Petal Width", 
                      getStatistics(data) { it.petalWidth })

This produces the following output:

Sepal Length
 -- Minimum: 4.30
 -- Maximum: 7.90
 -- Mean:    5.84
 -- Stddev:  0.83
Sepal Width
 -- Minimum: 2.00
 -- Maximum: 4.40
 -- Mean:    3.05
 -- Stddev:  0.43
Petal Length
 -- Minimum: 1.00
 -- Maximum: 6.90
 -- Mean:    3.76
 -- Stddev:  1.76
Petal Width
 -- Minimum: 0.10
 -- Maximum: 2.50
 -- Mean:    1.20
 -- Stddev:  0.76
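
As a cross-check on what these figures mean, here is a minimal sketch in plain Fantom of the mean and the bias-corrected (n - 1) sample standard deviation, which is what DescriptiveStatistics reports by default. The class and method names are my own, not part of the library:

```fantom
class StatsSketch
{
  // arithmetic mean of a list of values
  static Float mean(Float[] xs)
  {
    sum := 0f
    xs.each |Float x| { sum += x }
    return sum / xs.size.toFloat
  }

  // sample standard deviation, dividing by (n - 1)
  static Float stddev(Float[] xs)
  {
    m := mean(xs)
    sumSq := 0f
    xs.each |Float x| { sumSq += (x - m) * (x - m) }
    return (sumSq / (xs.size - 1).toFloat).sqrt
  }

  static Void main()
  {
    // first four sepal lengths from the listing at the top of this note
    xs := [5.1f, 4.9f, 4.7f, 4.6f]
    echo("Mean:   " + mean(xs).toLocale("0.00"))
    echo("Stddev: " + stddev(xs).toLocale("0.00"))
  }
}
```

Passing all 150 sepal lengths instead should agree with the library's figures, up to rounding.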

Clustering: Fantom<->Java Interop

We now want to cluster the Iris instances. The Apache Commons Math library provides the class KMeansPlusPlusClusterer. Properties of the model are set when constructing an instance of this class, including the number of clusters, the maximum number of iterations and the distance measure.

As the Iris dataset has three class labels, this example uses 3 as the expected number of clusters, leaving the other properties to their default values.
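
For illustration only, the other properties can be set through the overloaded constructors. This sketch assumes the commons-math3 jar is installed as described earlier. The iteration limit of 100 is an arbitrary value of mine, and EuclideanDistance is, as far as I know, the measure already used by default:

```fantom
using [java]org.apache.commons.math3.ml.clustering::KMeansPlusPlusClusterer
using [java]org.apache.commons.math3.ml.distance::EuclideanDistance

class ClustererSketch
{
  static Void main()
  {
    // number of clusters only; everything else left at its default
    basic := KMeansPlusPlusClusterer(3)

    // number of clusters plus an iteration limit (100 is arbitrary)
    limited := KMeansPlusPlusClusterer(3, 100)

    // also naming the distance measure explicitly
    withMeasure := KMeansPlusPlusClusterer(3, 100, EuclideanDistance())

    echo("k = " + basic.getK)
    echo("max iterations = " + limited.getMaxIterations)
    echo("k with measure = " + withMeasure.getK)
  }
}
```

The rest of this note sticks with the single-argument form.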

The model is built by calling the cluster method with a collection of Clusterable instances, and returns a list of CentroidCluster instances. Each centroid cluster has a "centre", and the list of points assigned to that cluster.

So we need to supply our "data" list to the cluster method as a collection of Clusterable instances. This requires first implementing the Clusterable interface on the IrisInstance class, and then passing the list as a collection to the cluster method.

First, IrisInstance must implement the Clusterable interface. This requires a getPoint method which returns Java's double[], a type that Fantom's interop library maps to DoubleArray.

class IrisInstance : Clusterable
{
  // ... same as above

  override DoubleArray? getPoint() {
    result := DoubleArray(4)
    
    result[0] = sepalLength
    result[1] = sepalWidth
    result[2] = petalLength
    result[3] = petalWidth

    return result
  }
}

Second, we need to convert Fantom's List type to and from Java collections. Fantom automatically converts a List to and from a Java array, so we use Java's Arrays#asList method to turn that array into a collection.

The return value of the cluster method is a Java List, which we convert back into an array using Java's Collection#toArray method. Fantom then treats that array as a List, so we can call each on it.

    model := KMeansPlusPlusClusterer(3)
    model.cluster(Arrays.asList(data)).toArray.each |CentroidCluster cluster|
    {
      echo("Centre: " + Arrays.toString(cluster.getCenter.getPoint))  // <1>
      echo("Cluster has: " + cluster.getPoints.size + " points")
    }
  1. Arrays.toString is another useful Java method, formatting the array into a printable format.

Output (the order of the clusters may differ between runs, as the initial centres are chosen at random):

Centre: [5.901612903225807, 2.748387096774194, 4.393548387096775, 1.4338709677419357]
Cluster has: 62 points
Centre: [6.8500000000000005, 3.073684210526315, 5.742105263157893, 2.0710526315789473]
Cluster has: 38 points
Centre: [5.005999999999999, 3.4180000000000006, 1.464, 0.2439999999999999]
Cluster has: 50 points
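
The List-to-collection round trip can also be tried on its own, without the clustering library. This is a minimal sketch using only java.util.Arrays; the class name and the sample strings are made up for illustration:

```fantom
using [java]java.util::Arrays

class RoundTripSketch
{
  static Void main()
  {
    names := ["setosa", "versicolor", "virginica"]

    // Fantom List -> Java array (automatic) -> java.util.List
    javaList := Arrays.asList(names)

    // java.util.List -> Java array -> Fantom List (automatic), so each works
    javaList.toArray.each |Obj? name|
    {
      echo(name)
    }
  }
}
```

The closure parameter is typed Obj? here because toArray returns a plain Object[]. The main program instead declares it as CentroidCluster and lets Fantom cast each item.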

Complete Program

The complete program includes the using statements needed to access the Java classes and Fantom's interop type.

// IrisData analysis

using [java]org.apache.commons.math3.stat.descriptive::DescriptiveStatistics
using [java]org.apache.commons.math3.ml.clustering::CentroidCluster
using [java]org.apache.commons.math3.ml.clustering::Clusterable
using [java]org.apache.commons.math3.ml.clustering::KMeansPlusPlusClusterer
using [java]java.util::Arrays
using [java]fanx.interop::DoubleArray

class IrisData 
{
  static Void main(Str[] args) 
  {
    // step 1: read in CSV file
    echo("File name " + args[0])
    data := IrisInstance[,]

    util::CsvInStream(args[0].toUri.toFile.in).eachRow |Str[] row| 
    {
      if (row.size == 5)
      {
        data.add(IrisInstance(row))
      }
    }
    echo("Read " + data.size + " instances")

    // step 2: print out some statistics
    displayStatistics("Sepal Length", 
                      getStatistics(data) { it.sepalLength })
    displayStatistics("Sepal Width", 
                      getStatistics(data) { it.sepalWidth })
    displayStatistics("Petal Length", 
                      getStatistics(data) { it.petalLength })
    displayStatistics("Petal Width", 
                      getStatistics(data) { it.petalWidth })

    // step 3: cluster the instances
    model := KMeansPlusPlusClusterer(3)
    model.cluster(Arrays.asList(data)).toArray.each |CentroidCluster cluster|
    {
      echo("Centre: " + Arrays.toString(cluster.getCenter.getPoint))
      echo("Cluster has: " + cluster.getPoints.size + " points")
    }
  }

  static Void displayStatistics(Str name, DescriptiveStatistics statistics)
  {
    echo(name)
    echo(" -- Minimum: " + statistics.getMin.toLocale("0.00"))
    echo(" -- Maximum: " + statistics.getMax.toLocale("0.00"))
    echo(" -- Mean:    " + statistics.getMean.toLocale("0.00"))
    echo(" -- Stddev:  " + statistics.getStandardDeviation.toLocale("0.00"))
  }

  static DescriptiveStatistics getStatistics(IrisInstance[] data, 
                                             |IrisInstance->Float| accessor)
  {
    ds := DescriptiveStatistics()
    data.each { ds.addValue(accessor(it)) }
    return ds
  }
}

class IrisInstance : Clusterable
{
  Float sepalLength
  Float sepalWidth
  Float petalLength
  Float petalWidth
  Str label

  new make(Str[] row)
  {
    this.sepalLength = row[0].toFloat
    this.sepalWidth = row[1].toFloat
    this.petalLength = row[2].toFloat
    this.petalWidth = row[3].toFloat
    this.label = row[4]
  }

  override DoubleArray? getPoint() {
    result := DoubleArray(4)
    
    result[0] = sepalLength
    result[1] = sepalWidth
    result[2] = petalLength
    result[3] = petalWidth

    return result
  }
}

Page from Peter's Scrapbook, output from a VimWiki on 2024-02-06.