This example computes descriptive statistics from a list of data points, using the Apache Commons Math library.
The example used here is the Iris data set from the UCI Machine Learning Repository, distributed as the file iris.data.
The four attributes are stored as double values in a Java record, and the label as a string:
record IrisInstance( double sepalLength, double sepalWidth, double petalLength, double petalWidth, String label ) {}
Assume the data have been stored in a List<IrisInstance>, e.g. by using the
technique in Reading CSV Files.
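For readers without that article to hand, the following is a minimal sketch of one way to build the list, assuming each line of iris.data holds four comma-separated numbers followed by the label. The class IrisCsvSketch and its parse method are hypothetical names used only for illustration; they are not taken from Reading CSV Files.

```java
import java.util.ArrayList;
import java.util.List;

// Same record as in the article, repeated so the sketch compiles on its own.
record IrisInstance(double sepalLength, double sepalWidth,
                    double petalLength, double petalWidth, String label) {}

class IrisCsvSketch {
    // Hypothetical helper: turns lines such as "5.1,3.5,1.4,0.2,Iris-setosa"
    // into IrisInstance records, skipping blank lines.
    static List<IrisInstance> parse(List<String> lines) {
        List<IrisInstance> data = new ArrayList<>();
        for (String line : lines) {
            if (line.isBlank()) {
                continue; // the file may contain empty lines, e.g. at the end
            }
            String[] fields = line.trim().split(",");
            data.add(new IrisInstance(
                    Double.parseDouble(fields[0]),
                    Double.parseDouble(fields[1]),
                    Double.parseDouble(fields[2]),
                    Double.parseDouble(fields[3]),
                    fields[4]));
        }
        return data;
    }
}
```

With the file in the working directory, the list could then be obtained with IrisCsvSketch.parse(Files.readAllLines(Path.of("iris.data"))). The sketch does no error handling for malformed lines.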
The Apache Commons Math library provides the class DescriptiveStatistics, which
accepts a collection of values and provides methods to access statistical
properties, such as arithmetic mean, standard deviation, etc.
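To make those two properties concrete, here is a plain-Java sketch of the arithmetic mean and the standard deviation. It uses the bias-corrected form (dividing by n - 1), which is the form getStandardDeviation reports in Commons Math 3. The class PlainStats and its method names are illustrative assumptions, not part of the library.

```java
class PlainStats {
    // Arithmetic mean: the sum of the values divided by their count.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Sample standard deviation: square root of the sum of squared
    // deviations from the mean, divided by (n - 1).
    // Assumes at least two values.
    static double sampleStdDev(double[] values) {
        double m = mean(values);
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return Math.sqrt(sumOfSquares / (values.length - 1));
    }
}
```

In practice the library class is preferable, since it also offers further statistics from the same set of stored values.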
Using an attribute-accessor method reference, the following method can return
an instance of DescriptiveStatistics for any of the attributes:
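import java.util.List;
import java.util.function.Function;
// Package name as used in Commons Math 3.
import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

public static DescriptiveStatistics statistics (List<IrisInstance> data,
    Function<IrisInstance, Double> accessor) {                // <1>
  DescriptiveStatistics ds = new DescriptiveStatistics ();   // <2>
  for (IrisInstance instance : data) {                       // <3>
    ds.addValue(accessor.apply(instance));                   // <4>
  }
  return ds;
}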
- The second argument is a method reference to access an attribute.
- The DescriptiveStatistics instance is created.
- Loop through every instance in the data, and ...
- Add the value obtained by applying the attribute-accessor to each instance.
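The same accessor pattern can be tried without the library. The sketch below is a comparison only, not the method used in this article: it passes a ToDoubleFunction (which avoids boxing each value into a Double) to the JDK's own DoubleSummaryStatistics, which reports count, minimum, maximum and mean but has no standard deviation. The class name SummarySketch is a hypothetical name.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Same record as in the article, repeated so the sketch compiles on its own.
record IrisInstance(double sepalLength, double sepalWidth,
                    double petalLength, double petalWidth, String label) {}

class SummarySketch {
    // Applies the accessor to every instance and collects the resulting
    // primitive doubles into the JDK's summary object.
    static DoubleSummaryStatistics summary(List<IrisInstance> data,
                                           ToDoubleFunction<IrisInstance> accessor) {
        return data.stream().mapToDouble(accessor).summaryStatistics();
    }
}
```

A call looks the same as before, e.g. SummarySketch.summary(data, IrisInstance::sepalLength). DescriptiveStatistics remains the better fit when the standard deviation is required.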
Having obtained an instance of DescriptiveStatistics for an attribute, some
information can be printed, e.g. the minimum and maximum values, and the
mean and standard deviation:
public static void displayStatistics (String attributeName, DescriptiveStatistics statistics) {
  System.out.println(attributeName);
  System.out.println(" -- Minimum: " + statistics.getMin());
  System.out.println(" -- Maximum: " + statistics.getMax());
  System.out.println(String.format(" -- Mean: %.2f", statistics.getMean()));
  System.out.println(String.format(" -- Stddev: %.2f", statistics.getStandardDeviation()));
}
Finally, the code to analyse each attribute in turn and display its information:
displayStatistics ("Sepal Length", statistics(data, IrisInstance::sepalLength));
displayStatistics ("Sepal Width", statistics(data, IrisInstance::sepalWidth));
displayStatistics ("Petal Length", statistics(data, IrisInstance::petalLength));
displayStatistics ("Petal Width", statistics(data, IrisInstance::petalWidth));
This produces the following output:
Sepal Length
 -- Minimum: 4.3
 -- Maximum: 7.9
 -- Mean: 5.84
 -- Stddev: 0.83
Sepal Width
 -- Minimum: 2.0
 -- Maximum: 4.4
 -- Mean: 3.05
 -- Stddev: 0.43
Petal Length
 -- Minimum: 1.0
 -- Maximum: 6.9
 -- Mean: 3.76
 -- Stddev: 1.76
Petal Width
 -- Minimum: 0.1
 -- Maximum: 2.5
 -- Mean: 1.20
 -- Stddev: 0.76