As I've done in a few other notes, I will explore some simple data analysis tools applied to the iris dataset, but this time using Fantom. This will illustrate how to interact with a Java library from Fantom for simple purposes.
The Java library used here is Apache Commons Math, version 3.6.1. Download commons-math3-3.6.1.jar and place it in "FANTOM/lib/java/ext", where Fantom will find it.
The iris dataset, "iris.data", is a CSV formatted dataset:
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    ...
There are four numeric fields, followed by a class name.
The first part of the program must load the dataset, creating a list of IrisInstance instances, where each IrisInstance holds the data from one row of the file.
The IrisInstance class is:
    class IrisInstance
    {
      Float sepalLength
      Float sepalWidth
      Float petalLength
      Float petalWidth
      Str label

      new make(Str[] row)
      {
        this.sepalLength = row[0].toFloat
        this.sepalWidth = row[1].toFloat
        this.petalLength = row[2].toFloat
        this.petalWidth = row[3].toFloat
        this.label = row[4]
      }
    }
The constructor accepts a list of strings, converts the first four values into Floats, and stores the fifth as the label.
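The constructor above assumes every field parses cleanly. As a minimal sketch of a more defensive conversion, the hypothetical helper below (parseMeasurements is not part of the original program) uses the unchecked form of Str.toFloat, which returns null instead of throwing when a field is not a number:

```fantom
class ParseSketch
{
  // Hypothetical helper: returns the four measurements,
  // or null if the row does not have the expected shape.
  static Float[]? parseMeasurements(Str[] row)
  {
    if (row.size != 5) return null
    values := Float[,]
    for (i := 0; i < 4; ++i)
    {
      // toFloat(false) yields null rather than raising ParseErr
      Float? value := row[i].toFloat(false)
      if (value == null) return null
      values.add(value)
    }
    return values
  }

  static Void main()
  {
    echo(parseMeasurements("5.1,3.5,1.4,0.2,Iris-setosa".split(',')))
    echo(parseMeasurements("5.1,oops,1.4,0.2,Iris-setosa".split(',')))
  }
}
```

The second call should yield null, because "oops" cannot be converted to a Float.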
Fantom supports CSV file handling in its standard library: the "util" library has a CsvInStream class for reading CSV files. With this, the code to read in the CSV file line-by-line and convert into IrisInstance instances is:
    static Void main(Str[] args)
    {
      // step 1: read in CSV file
      echo("File name " + args[0])                              // <1>
      data := IrisInstance[,]                                   // <2>
      util::CsvInStream(args[0].toUri.toFile.in).eachRow |Str[] row|  // <3>
      {
        if (row.size == 5)                                      // <4>
        {
          data.add(IrisInstance(row))
        }
      }
      echo("Read " + data.size + " instances")                  // <5>
    }
- Get the filename from the input args.
- Create an empty list to store the IrisInstance instances.
- Use CsvInStream to read in each row of the given file.
- If the row has enough fields, convert the row into an IrisInstance instance and add it to the data list.
- Finally, check the data was read OK by printing out the number of instances.
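To try the row-filtering step without a file on disk, the following sketch feeds CsvInStream from an in-memory string. The class name CsvSketch, the helper keepValid and the sample text (including its trailing blank line) are assumptions made for illustration:

```fantom
class CsvSketch
{
  // Hypothetical helper: returns only the rows that have
  // four measurements followed by a label.
  static Str[][] keepValid(Str text)
  {
    kept := Str[][,]
    util::CsvInStream(text.in).eachRow |Str[] row|
    {
      if (row.size == 5) kept.add(row)
    }
    return kept
  }

  static Void main()
  {
    // two data rows, then a blank line
    text := "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n\n"
    kept := keepValid(text)
    echo("Kept " + kept.size + " rows")
    kept.each |Str[] row| { echo(row[4] + ": " + row[0..3].join(" ")) }
  }
}
```

Only the two complete rows should survive the size check, which is the same guard the main program applies to each row.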
Output:
    > fan IrisData.fan iris.data
    File name iris.data
    Read 150 instances
The Apache Commons Math library provides the class DescriptiveStatistics, which stores a collection of values and provides methods to access statistical properties, such as arithmetic mean, standard deviation, etc.
We can collect the values for a given attribute and return an instance of DescriptiveStatistics using the following method, which accepts an accessor function to retrieve the required attribute value:
    static DescriptiveStatistics getStatistics(IrisInstance[] data, |IrisInstance->Float| accessor)
    {
      ds := DescriptiveStatistics()              // <1>
      data.each { ds.addValue(accessor(it)) }    // <2>
      return ds
    }
- Creates an instance of the required class from the library.
- Uses an it-block to apply the accessor to every item in the data set.
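The accessor pattern does not depend on the Java library. As a rough sketch, the hypothetical mean method below stands in for getStatistics and works on plain lists of Floats; it shows the same function parameter being supplied once as an it-block and once as a closure with an explicit signature:

```fantom
class AccessorSketch
{
  // Hypothetical stand-in for getStatistics: averages whatever
  // value the accessor picks out of each row.
  static Float mean(Float[][] rows, |Float[]->Float| accessor)
  {
    total := 0f
    rows.each |Float[] row| { total += accessor(row) }
    return total / rows.size.toFloat
  }

  static Void main()
  {
    rows := [[5.1f, 3.5f], [4.9f, 3.0f], [4.7f, 3.2f]]
    // it-block form, as used in this note
    echo(mean(rows) { it[0] })
    // equivalent closure with an explicit signature
    echo(mean(rows) |Float[] row->Float| { return row[1] })
  }
}
```

The it-block form is shorter, while the explicit signature makes the expected parameter and return types visible at the call site.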
It is then straightforward to print out information for an attribute from the instance of DescriptiveStatistics:
    static Void displayStatistics(Str name, DescriptiveStatistics statistics)
    {
      echo(name)
      echo(" -- Minimum: " + statistics.getMin.toLocale("0.00"))   // <1>
      echo(" -- Maximum: " + statistics.getMax.toLocale("0.00"))
      echo(" -- Mean: " + statistics.getMean.toLocale("0.00"))
      echo(" -- Stddev: " + statistics.getStandardDeviation.toLocale("0.00"))
    }
- toLocale lets us specify how to represent the printed numbers.
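For a quick feel of what the pattern controls, here is a small sketch using Float.toLocale on an arbitrary sample value. In the pattern, each 0 stands for a digit that is always printed; the decimal separator follows the current locale, so the exact characters may differ outside an English locale:

```fantom
class FormatSketch
{
  static Void main()
  {
    value := 5.843333f             // arbitrary sample value
    echo(value.toLocale("0.00"))   // two decimal places, as used in this note
    echo(value.toLocale("0.0"))    // one decimal place
    echo(value.toLocale("0.0000")) // four decimal places
  }
}
```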
Finally, we call these methods using closures to specify each attribute value:
    displayStatistics("Sepal Length", getStatistics(data) { it.sepalLength })
    displayStatistics("Sepal Width", getStatistics(data) { it.sepalWidth })
    displayStatistics("Petal Length", getStatistics(data) { it.petalLength })
    displayStatistics("Petal Width", getStatistics(data) { it.petalWidth })
This produces the following output:
    Sepal Length
     -- Minimum: 4.30
     -- Maximum: 7.90
     -- Mean: 5.84
     -- Stddev: 0.83
    Sepal Width
     -- Minimum: 2.00
     -- Maximum: 4.40
     -- Mean: 3.05
     -- Stddev: 0.43
    Petal Length
     -- Minimum: 1.00
     -- Maximum: 6.90
     -- Mean: 3.76
     -- Stddev: 1.76
    Petal Width
     -- Minimum: 0.10
     -- Maximum: 2.50
     -- Mean: 1.20
     -- Stddev: 0.76
We now want to cluster the Iris instances. The Apache Commons Math library provides the class KMeansPlusPlusClusterer. Properties of the model are set on constructing an instance of this class, including:
- expected number of clusters
- maximum number of iterations
- strategy for dealing with empty clusters
- distance measure
As the Iris dataset has three class labels, this example uses 3 as the expected number of clusters, leaving the other properties at their default values.
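If the defaults are not wanted, the library also offers constructors that take some of these properties explicitly. The sketch below uses the three-argument form (number of clusters, maximum iterations, distance measure); the cap of 100 iterations is an arbitrary value chosen for illustration, and the code assumes the same jar is installed as described earlier:

```fantom
using [java]org.apache.commons.math3.ml.clustering::KMeansPlusPlusClusterer
using [java]org.apache.commons.math3.ml.distance::EuclideanDistance

class ClustererSketch
{
  static Void main()
  {
    // 3 clusters, at most 100 iterations (illustrative cap),
    // Euclidean distance between points
    model := KMeansPlusPlusClusterer(3, 100, EuclideanDistance())
    echo("k = " + model.getK)
    echo("max iterations = " + model.getMaxIterations)
  }
}
```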
The model is built by calling the cluster method with a collection of Clusterable instances, and returns a list of CentroidCluster instances. Each centroid cluster has a "centre", and the list of points assigned to that cluster.
So we need to supply our "data" list to the cluster method as a collection of Clusterable instances. This requires first implementing the Clusterable interface on the IrisInstance class, and then passing the list as a collection to the cluster method.
First, we need to make IrisInstance implement the Clusterable interface. This requires a getPoint method to return an instance of Java's double[]; this type is mapped to DoubleArray by Fantom's interop library.
    class IrisInstance : Clusterable
    {
      // ... same as above

      override DoubleArray? getPoint()
      {
        result := DoubleArray(4)
        result[0] = sepalLength
        result[1] = sepalWidth
        result[2] = petalLength
        result[3] = petalWidth
        return result
      }
    }
Second, we need to do some conversions on Fantom's List type, to and from Java. Fantom already has the ability to automatically convert List to and from an array, so we take advantage of Java's Arrays#asList method to convert the array into a collection. The return value is a Java List, and we can convert that back into an array using Java's Collection#toArray method.
    model := KMeansPlusPlusClusterer(3)
    model.cluster(Arrays.asList(data)).toArray.each |CentroidCluster cluster|
    {
      echo("Centre: " + Arrays.toString(cluster.getCenter.getPoint))   // <1>
      echo("Cluster has: " + cluster.getPoints.size + " points")
    }
- Arrays.toString is another useful Java method, formatting the array into a printable format.
Output:
    Centre: [5.901612903225807, 2.748387096774194, 4.393548387096775, 1.4338709677419357]
    Cluster has: 62 points
    Centre: [6.8500000000000005, 3.073684210526315, 5.742105263157893, 2.0710526315789473]
    Cluster has: 38 points
    Centre: [5.005999999999999, 3.4180000000000006, 1.464, 0.2439999999999999]
    Cluster has: 50 points
The complete program includes the using statements, to access the required Java libraries and Fantom interop type.
    // IrisData analysis

    using [java]org.apache.commons.math3.stat.descriptive::DescriptiveStatistics
    using [java]org.apache.commons.math3.ml.clustering::CentroidCluster
    using [java]org.apache.commons.math3.ml.clustering::Clusterable
    using [java]org.apache.commons.math3.ml.clustering::KMeansPlusPlusClusterer
    using [java]java.util::Arrays
    using [java]fanx.interop::DoubleArray

    class IrisData
    {
      static Void main(Str[] args)
      {
        // step 1: read in CSV file
        echo("File name " + args[0])
        data := IrisInstance[,]
        util::CsvInStream(args[0].toUri.toFile.in).eachRow |Str[] row|
        {
          if (row.size == 5)
          {
            data.add(IrisInstance(row))
          }
        }
        echo("Read " + data.size + " instances")

        // step 2: print out some statistics
        displayStatistics("Sepal Length", getStatistics(data) { it.sepalLength })
        displayStatistics("Sepal Width", getStatistics(data) { it.sepalWidth })
        displayStatistics("Petal Length", getStatistics(data) { it.petalLength })
        displayStatistics("Petal Width", getStatistics(data) { it.petalWidth })

        // step 3: cluster the instances
        model := KMeansPlusPlusClusterer(3)
        model.cluster(Arrays.asList(data)).toArray.each |CentroidCluster cluster|
        {
          echo("Centre: " + Arrays.toString(cluster.getCenter.getPoint))
          echo("Cluster has: " + cluster.getPoints.size + " points")
        }
      }

      static Void displayStatistics(Str name, DescriptiveStatistics statistics)
      {
        echo(name)
        echo(" -- Minimum: " + statistics.getMin.toLocale("0.00"))
        echo(" -- Maximum: " + statistics.getMax.toLocale("0.00"))
        echo(" -- Mean: " + statistics.getMean.toLocale("0.00"))
        echo(" -- Stddev: " + statistics.getStandardDeviation.toLocale("0.00"))
      }

      static DescriptiveStatistics getStatistics(IrisInstance[] data, |IrisInstance->Float| accessor)
      {
        ds := DescriptiveStatistics()
        data.each { ds.addValue(accessor(it)) }
        return ds
      }
    }

    class IrisInstance : Clusterable
    {
      Float sepalLength
      Float sepalWidth
      Float petalLength
      Float petalWidth
      Str label

      new make(Str[] row)
      {
        this.sepalLength = row[0].toFloat
        this.sepalWidth = row[1].toFloat
        this.petalLength = row[2].toFloat
        this.petalWidth = row[3].toFloat
        this.label = row[4]
      }

      override DoubleArray? getPoint()
      {
        result := DoubleArray(4)
        result[0] = sepalLength
        result[1] = sepalWidth
        result[2] = petalLength
        result[3] = petalWidth
        return result
      }
    }