2022-04-08: Reading CSV Files

Reading a CSV file of data is a common task for machine-learning applications. This note shows how to read a file of CSV data, using the Apache Commons CSV library, and convert it into a list of instances, where each instance is an appropriate Java record.

Example: Iris Dataset

The example used here is the Iris dataset from the UCI Machine Learning Repository, distributed as the file iris.data.

This dataset has four attributes and a label, stored within a CSV file. The first few records are:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
...

The four attributes are stored as double values in a Java record, and the label as a string:

record IrisInstance(
    double sepalLength, 
    double sepalWidth,
    double petalLength,
    double petalWidth,
    String label
  ) {}
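As a brief illustration of what the record declaration provides, the sketch below builds the first row of the dataset by hand and uses the accessors and value-based equality that the compiler generates. The RecordDemo class and its firstRow helper are hypothetical names used only for this example; Java 16 or later is assumed.

```java
// Hypothetical demo class: shows the members a record declaration generates.
class RecordDemo {
  // Builds the first row of the dataset by hand (values copied from the listing above).
  static IrisInstance firstRow() {
    return new IrisInstance(5.1, 3.5, 1.4, 0.2, "Iris-setosa");
  }

  public static void main(String[] args) {
    IrisInstance iris = firstRow();
    System.out.println(iris.sepalLength());       // accessor named after the component
    System.out.println(iris.label());
    System.out.println(iris.equals(firstRow()));  // records compare component by component
    System.out.println(iris);                     // generated toString lists all components
  }
}

// Same declaration as in the note above.
record IrisInstance(
    double sepalLength,
    double sepalWidth,
    double petalLength,
    double petalWidth,
    String label) {}
```

Because records are immutable and compare by value, a list of them can be sorted, grouped or de-duplicated without writing any extra boilerplate.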

Apache Commons CSV

The Apache Commons CSV library provides functionality to handle CSV files - here, we focus on reading a CSV file.

The CSVFormat class provides a set of predefined formats, as CSV is not a standardised format. For straightforward files, we can use the RFC 4180 definition via CSVFormat.RFC4180. This provides access to a parse method, which reads from a Reader and produces a CSVParser, an iterable over CSVRecord instances.

It is convenient to process the parsed output as a stream:

    try (Reader in = Files.newBufferedReader (file)) {                // <1>
      List<IrisInstance> data = CSVFormat.RFC4180.parse(in).stream()  // <2>
        .map(IrisInstance::fromCSVRecord)                             // <3>
        .flatMap(Optional::stream)                                    // <4>
        .toList();                                                    // <5>
    }
  1. Opens an input reader using try-with-resources, to ensure it is safely closed.
  2. Parses the input reader and creates a Java stream of CSVRecord instances.
  3. Converts each CSVRecord into an optional IrisInstance.
  4. Removes the empty optional items.
  5. Converts the stream to a list.
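The map / flatMap(Optional::stream) / toList combination in steps 3 to 5 can be tried in isolation, without the CSV library. The sketch below applies the same shape of pipeline to plain strings; the OptionalStreamDemo class and its parse and parseAll methods are hypothetical stand-ins for this example, and Java 16 or later is assumed for Stream.toList.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical demo class: the same pipeline shape, applied to plain strings.
class OptionalStreamDemo {
  // Stand-in for IrisInstance.fromCSVRecord: parse one value, or give up quietly.
  static Optional<Double> parse(String text) {
    try {
      return Optional.of(Double.parseDouble(text));
    } catch (NumberFormatException e) {
      return Optional.empty();
    }
  }

  // Map each item to an Optional, drop the empty ones, collect the rest.
  static List<Double> parseAll(List<String> texts) {
    return texts.stream()
        .map(OptionalStreamDemo::parse)
        .flatMap(Optional::stream)   // an empty Optional contributes no elements
        .toList();
  }

  public static void main(String[] args) {
    // "oops" and the empty string cannot be parsed, so only two values remain.
    System.out.println(parseAll(List.of("5.1", "oops", "4.9", "")));
  }
}
```

One consequence of this design is that malformed items disappear silently; if the number of rejected rows matters, it may be worth counting or logging them before the flatMap step.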

Converting CSVRecord to IrisInstance

The library represents each line of data from the CSV file as an instance of its CSVRecord class: this must be converted into an IrisInstance record, to properly represent the data.

The conversion should ensure there are enough fields in the line, and also convert the first four attributes into doubles, checking that each attribute is correctly formatted.

As it is possible for a CSVRecord not to be a valid IrisInstance, the conversion uses Optional, with an empty optional used when the conversion could not be made: as shown above, these are then filtered out.

The conversion method is added to the IrisInstance record as a static method:

  public static Optional<IrisInstance> fromCSVRecord (CSVRecord record) {
    if (record.size() == 5) { // ensure we have enough parts    // <1>
      try {
        double sepalLength = Double.parseDouble(record.get(0)); // <2>
        double sepalWidth = Double.parseDouble(record.get(1));
        double petalLength = Double.parseDouble(record.get(2));
        double petalWidth = Double.parseDouble(record.get(3));

        return Optional.of(new IrisInstance(sepalLength,        // <3>
              sepalWidth, petalLength, petalWidth, record.get(4)));

      } catch (NumberFormatException e) {
        // malformed number: fall through                       // <4>
      }
    }

    return Optional.empty();                                    // <5>
  }
  1. Check that the record has exactly five fields (4 attributes + 1 label).
  2. Parse each of the four attributes as a double.
  3. Assuming no errors, return an optional of IrisInstance with the appropriate values.
  4. If there is a number format error, fall through (or report the error).
  5. Something went wrong, so return an empty optional.
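
To check the validation rules without a CSV file or the library on the classpath, the same logic can be written against a plain list of strings. This is only a sketch: ConversionDemo and fromFields are hypothetical names introduced here, not part of the original code, and the sample rows are made up to exercise each branch.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical demo class: the conversion rules, applied to a list of field strings.
class ConversionDemo {
  // Mirrors fromCSVRecord, but takes the fields as a List<String>.
  static Optional<IrisInstance> fromFields(List<String> fields) {
    if (fields.size() != 5) {          // need 4 attributes + 1 label
      return Optional.empty();
    }
    try {
      return Optional.of(new IrisInstance(
          Double.parseDouble(fields.get(0)),
          Double.parseDouble(fields.get(1)),
          Double.parseDouble(fields.get(2)),
          Double.parseDouble(fields.get(3)),
          fields.get(4)));
    } catch (NumberFormatException e) {
      return Optional.empty();         // one of the attributes was not a number
    }
  }

  public static void main(String[] args) {
    // A well-formed row, a row with a malformed number, and a row with too few fields.
    System.out.println(fromFields(List.of("5.1", "3.5", "1.4", "0.2", "Iris-setosa")).isPresent());
    System.out.println(fromFields(List.of("5.1", "abc", "1.4", "0.2", "Iris-setosa")).isPresent());
    System.out.println(fromFields(List.of("")).isPresent());
  }
}

// Same declaration as in the note above.
record IrisInstance(
    double sepalLength,
    double sepalWidth,
    double petalLength,
    double petalWidth,
    String label) {}
```

Only the first row should produce a value; the other two are rejected by the number parsing and the field-count check respectively.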

Page from Peter's Scrapbook, output from a VimWiki on 2024-01-29.