Reading a CSV file of data is a common task for machine-learning applications. This note shows how to read a file of CSV data, using the Apache Commons CSV library, and convert it into a list of instances, where each instance is an appropriate Java record.
The example used here is the Iris dataset from the UCI Machine Learning Repository: download iris.data.
This dataset has four attributes and a label, stored within a CSV file. The first few records are:
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    ...
The four attributes are stored as double values in a Java record, and the label as a string:
    record IrisInstance(
        double sepalLength,
        double sepalWidth,
        double petalLength,
        double petalWidth,
        String label
    ) {}
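As a quick illustration of what the record declaration provides, the sketch below constructs an instance from the first row of the data and uses the generated accessors, equals and toString. The IrisRecordDemo class is a hypothetical name used only for this example.

```java
// The record from the text: five components, no extra code needed.
record IrisInstance(
        double sepalLength,
        double sepalWidth,
        double petalLength,
        double petalWidth,
        String label
) {}

// Hypothetical demo class, not part of the original note.
class IrisRecordDemo {
    public static void main(String[] args) {
        IrisInstance first = new IrisInstance(5.1, 3.5, 1.4, 0.2, "Iris-setosa");
        IrisInstance copy = new IrisInstance(5.1, 3.5, 1.4, 0.2, "Iris-setosa");

        // Accessor methods are named after the record components.
        System.out.println(first.sepalLength()); // prints 5.1
        System.out.println(first.label());       // prints Iris-setosa

        // Records compare by component values, not by identity.
        System.out.println(first.equals(copy));  // prints true

        // The generated toString lists every component by name.
        System.out.println(first);
    }
}
```

Because equality is value-based, instances read from the file can be compared directly or used as keys in a set or map without writing any extra methods.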
The Apache Commons CSV library provides functionality to handle CSV files - here, we focus on reading a CSV file.
The CSVFormat class provides a number of predefined formats, as CSV is not a standardised format. For straightforward files, we can use the RFC 4180 definition via CSVFormat.RFC4180. This provides access to a parse method, which reads from a Reader and returns a CSVParser: an iterable collection of CSVRecord instances.
It is convenient to process the parsed output as a stream:
    // file is a java.nio.file.Path pointing at iris.data
    try (Reader in = Files.newBufferedReader(file)) {                    // <1>
        List<IrisInstance> data = CSVFormat.RFC4180.parse(in).stream()   // <2>
            .map(IrisInstance::fromCSVRecord)                            // <3>
            .flatMap(Optional::stream)                                   // <4>
            .toList();                                                   // <5>
    }
- <1> Opens an input reader using try-with-resources, to ensure it is safely closed.
- <2> Parses the input reader and creates a Java stream of CSVRecord instances.
- <3> Converts each CSVRecord into an optional IrisInstance.
- <4> Removes the empty optional items.
- <5> Converts the stream to a list.
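Steps <3> and <4> rely on a general idiom: map each item to an Optional, then flatten with Optional::stream so that the empty results disappear. A minimal, library-free sketch of that idiom follows. OptionalStreamDemo and its parse and parseAll helpers are hypothetical names used for illustration, not part of Apache Commons CSV.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical demo class illustrating map + flatMap(Optional::stream).
class OptionalStreamDemo {

    // Parse a string as a double, or return an empty optional if it is malformed.
    static Optional<Double> parse(String text) {
        try {
            return Optional.of(Double.parseDouble(text));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    static List<Double> parseAll(List<String> inputs) {
        return inputs.stream()
                .map(OptionalStreamDemo::parse)  // Stream<Optional<Double>>
                .flatMap(Optional::stream)       // empty optionals contribute nothing
                .toList();                       // unmodifiable list of the valid values
    }

    public static void main(String[] args) {
        System.out.println(parseAll(List.of("5.1", "oops", "3.5"))); // prints [5.1, 3.5]
    }
}
```

Optional::stream turns a present value into a one-element stream and an empty optional into an empty stream, so flatMap both filters and unwraps in a single step.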
The library represents each line of data from the CSV file as an instance of its CSVRecord class: this must be converted into an IrisInstance record, to properly represent the data.
The conversion should ensure there are enough fields in the line, and also convert the first four attributes into doubles, checking that each attribute is correctly formatted.
As it is possible for a CSVRecord not to be a valid IrisInstance, the conversion uses Optional, with an empty optional used when the conversion could not be made: as shown above, these are then filtered out.
The conversion method is added to the IrisInstance record as a static method:
    public static Optional<IrisInstance> fromCSVRecord(CSVRecord record) {
        // ensure we have enough parts
        if (record.size() == 5) {                                          // <1>
            try {
                double sepalLength = Double.parseDouble(record.get(0));    // <2>
                double sepalWidth = Double.parseDouble(record.get(1));
                double petalLength = Double.parseDouble(record.get(2));
                double petalWidth = Double.parseDouble(record.get(3));

                return Optional.of(new IrisInstance(sepalLength,           // <3>
                    sepalWidth, petalLength, petalWidth, record.get(4)));
            } catch (NumberFormatException e) {
                // malformed number: fall through to the empty result      // <4>
            }
        }
        return Optional.empty();                                           // <5>
    }
- <1> Check that the record has exactly five fields (4 attributes + 1 label).
- <2> Parse each of the four attributes into a double.
- <3> Assuming no errors, return an optional of IrisInstance with the appropriate values.
- <4> If there is a number format error, fall through (or report the error).
- <5> Something went wrong, so return an empty optional.
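To exercise this validation logic without the CSV library on the classpath, one option is a stand-in that accepts already-split fields. The sketch below is a simplification under stated assumptions: fromFields and ConversionDemo are hypothetical names, and the naive split on commas does not handle quoting the way a real CSV parser does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

record IrisInstance(double sepalLength, double sepalWidth,
                    double petalLength, double petalWidth, String label) {

    // Hypothetical stand-in for fromCSVRecord: same checks, but on a plain list of fields.
    static Optional<IrisInstance> fromFields(List<String> fields) {
        if (fields.size() != 5) {
            return Optional.empty(); // wrong number of fields
        }
        try {
            return Optional.of(new IrisInstance(
                    Double.parseDouble(fields.get(0)),
                    Double.parseDouble(fields.get(1)),
                    Double.parseDouble(fields.get(2)),
                    Double.parseDouble(fields.get(3)),
                    fields.get(4)));
        } catch (NumberFormatException e) {
            return Optional.empty(); // at least one attribute is not a valid number
        }
    }
}

// Hypothetical demo class, not part of the original note.
class ConversionDemo {
    static List<IrisInstance> convert(List<String> lines) {
        return lines.stream()
                .map(line -> Arrays.asList(line.split(","))) // assumption: no quoted fields
                .map(IrisInstance::fromFields)
                .flatMap(Optional::stream)
                .toList();
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "5.1,3.5,1.4,0.2,Iris-setosa",  // valid
                "4.9,abc,1.4,0.2,Iris-setosa",  // malformed number
                "4.7,3.2,1.3,0.2",              // too few fields
                "");                            // blank line
        System.out.println(convert(lines).size()); // prints 1
    }
}
```

Only the first line survives: the others are rejected by the field-count check or by the number-format check, and none of them interrupts the processing of the remaining lines.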