2024-12-09: Analysing Iris Dataset with Zig

As I've done in a few other notes, I will explore some simple data analysis tools applied to the iris dataset, but this time using Zig.

$ zig version
0.13.0

Files available as: iris-in-zig.tar.gz

Task Description

The iris dataset, "iris.data", is a CSV-formatted dataset:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
...

There are four numeric fields (sepal length, sepal width, petal length and petal width, in that order), followed by a class name.

Step 1 is to load in the data from this CSV file, turning each instance into a struct value, and storing the instances in an ArrayList.

Step 2 will do some simple analysis of the columns of data, printing out summary statistics such as the mean value of each column.

Project Setup

To create our working environment:

$ mkdir iris-analysis
$ cd iris-analysis
$ zig init

Also, extract the "iris.data" file from the iris zip download.

$ ls
build.zig  build.zig.zon  iris.data  src

The file "src/root.zig" is not needed - I deleted it.

Libraries

This little program will use two libraries.

Reading CSV

Although the CSV file is easy to read and parse manually, I decided to use a library: zig_csv. Installation requires first using zig fetch to add the dependency to "build.zig.zon":

$ zig fetch --save git+https://github.com/matthewtolman/zig_csv#main
info: resolved ref 'main' to commit 749a0f7d0133847a562621e1a53fabaca4b619a5

Second, the dependency needs adding to the program in "build.zig", as described below.
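For comparison, here is a minimal sketch of what parsing one line by hand might look like, using only the standard library. The names ParsedLine and parseLine are hypothetical, and the sketch assumes fields never contain quotes or embedded commas, which holds for "iris.data" but not for CSV in general:

```zig
const std = @import("std");

// Hypothetical container for one parsed line (not part of the original program).
const ParsedLine = struct {
    values: [4]f64,
    label: []const u8,
};

// Splits a line on commas and parses the first four fields as floats.
// Assumption: no quoting or escaping is needed for this dataset.
fn parseLine(line: []const u8) !ParsedLine {
    var fields = std.mem.splitScalar(u8, line, ',');
    var values: [4]f64 = undefined;
    for (&values) |*value| {
        const text = fields.next() orelse return error.MissingField;
        value.* = try std.fmt.parseFloat(f64, text);
    }
    // The fifth field is the class name; it is a slice into the input line.
    const label = fields.next() orelse return error.MissingField;
    return .{ .values = values, .label = label };
}

pub fn main() !void {
    const parsed = try parseLine("5.1,3.5,1.4,0.2,Iris-setosa");
    std.debug.print("{d:.1} {s}\n", .{ parsed.values[0], parsed.label });
}
```

A library is still the safer choice once quoting, escaping or unusual line endings come into play.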

Descriptive Statistics

I didn't find anything immediately that I could use, so I wrote my own.

Installing means adding the file into "src" and creating an import in "main.zig":

const stats = @import("descriptive-statistics.zig");

Dependencies in build.zig

Then edit "build.zig". I simplified my file a bit. I don't need to build a lib, or run unit tests. After adding the zig_csv library, my file looks like:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "iris-analysis",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    const zcsv = b.dependency("zcsv", .{                      // <1>
        .target = target,
        .optimize = optimize,
    });

    exe.root_module.addImport("zcsv", zcsv.module("zcsv"));   // <2>

    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);

    run_cmd.step.dependOn(b.getInstallStep());

    if (b.args) |args| {
        run_cmd.addArgs(args);
    }

    const run_step = b.step("run", "Run the app");
    run_step.dependOn(&run_cmd.step);
}

<1> Creates the "zig_csv" dependency, named "zcsv".
<2> Adds to the "exe" module, so we can @import("zcsv") in our code.

Reading a CSV File

The function readIrisData is the top-level function that opens a given CSV file and returns an ArrayList of iris instances. The logic is straightforward, and I decided to handle all errors in "main", so try is used throughout.

fn readIrisData(allocator: std.mem.Allocator, filename: []const u8) !std.ArrayList(IrisInstance) {
    const file = try std.fs.cwd().openFile(filename, .{});            // <1>
    defer file.close();
    var csv = zcsv.allocs.column.init(allocator, file.reader(), .{}); // <2>

    // read the lines, storing structs into result
    var result = std.ArrayList(IrisInstance).init(allocator);         // <3>
    errdefer result.deinit(); // free the list if a later step fails
    while (csv.next()) |row| {                                        // <4>
        defer row.deinit();                                           // <5>

        if (row.len() == 5) {                                         // <6>
            const instance = try IrisInstance.fromRow(row);           // <7>
            try result.append(instance);                              // <8>
        }
    }

    return result;
}

<1> Open the given filename
<2> and create a CSV column reader.
<3> The instances will be stored in an ArrayList, to be returned at the end of the function.
<4> Iterate through the rows in the CSV file:
<5> rows are allocated, so we must deinit them.
<6> Instances have 5 columns; there is a blank line at the end of "iris.data", so make sure we skip that one.
<7> Now create our instance from the given row,
<8> and add it to our result ArrayList.

The IrisInstance is a simple struct with five fields. An enum is used to represent the three labels. A function declared inside the enum returns the appropriate enum value for a string, as found in the data file:

const IrisLabel = enum {
    Setosa,
    Versicolour,
    Virginica,

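    fn fromString(str: []const u8) IrisLabel {
        // Labels are matched exactly as they appear in "iris.data".
        if (std.mem.eql(u8, str, "Iris-setosa")) {
            return .Setosa;
        } else if (std.mem.eql(u8, str, "Iris-versicolor")) {
            return .Versicolour;
        } else {
            // Any other string falls through to Virginica.
            return .Virginica;
        }
    }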
};
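Because the final branch above maps any unrecognised string to Virginica, a typo in the data would pass silently. A stricter variant could return an optional instead. The sketch below is an alternative, not the code used in this note. The name parse is hypothetical, and it assumes the labels appear exactly as in the data file ("Iris-setosa", "Iris-versicolor", "Iris-virginica"):

```zig
const std = @import("std");

const IrisLabel = enum {
    Setosa,
    Versicolour,
    Virginica,

    // Hypothetical stricter variant: returns null for an unrecognised string
    // rather than defaulting to one of the classes.
    fn parse(str: []const u8) ?IrisLabel {
        if (std.mem.eql(u8, str, "Iris-setosa")) return .Setosa;
        if (std.mem.eql(u8, str, "Iris-versicolor")) return .Versicolour;
        if (std.mem.eql(u8, str, "Iris-virginica")) return .Virginica;
        return null;
    }
};

pub fn main() void {
    // The caller decides what to do about an unknown label.
    const label = IrisLabel.parse("Iris-setosa") orelse {
        std.debug.print("unknown label\n", .{});
        return;
    };
    std.debug.print("{s}\n", .{@tagName(label)});
}
```

The trade-off is that every caller must now handle the null case, for example by skipping the row or returning an error.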

Each instance is constructed from a CSV row, with the four numeric fields parsed into f64 values.

const IrisInstance = struct {
    label: IrisLabel,
    petal_length: f64,
    petal_width: f64,
    sepal_length: f64,
    sepal_width: f64,

    fn fromRow(row: zcsv.allocs.column.Row) !IrisInstance {
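        // Column order in "iris.data": sepal length, sepal width,
        // petal length, petal width, class name.
        const sepal_length = try row.field(0);
        const sepal_width = try row.field(1);
        const petal_length = try row.field(2);
        const petal_width = try row.field(3);
        const label = try row.field(4);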

        return IrisInstance {
            .label = IrisLabel.fromString(label.data()),
            .petal_length = try std.fmt.parseFloat(f64, petal_length.data()),
            .petal_width = try std.fmt.parseFloat(f64, petal_width.data()),
            .sepal_length = try std.fmt.parseFloat(f64, sepal_length.data()),
            .sepal_width = try std.fmt.parseFloat(f64, sepal_width.data()),
        };
    }
};

Summary Statistics

For the summary statistics, a function pointer tells the printStatistics function which attribute to analyse. The pointed-to function simply takes an instance and returns the relevant field. This is a little clunky to set up, but makes the calling code quite tidy.

The accessor functions are given a type and set up as follows:

const IrisAccessor = *const fn(instance: IrisInstance) f64;

fn irisPetalLength (instance: IrisInstance) f64 {
    return instance.petal_length;
}

// etc
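As an aside, Zig can also select a struct field at compile time with the @field builtin, which avoids writing one accessor function per attribute. The sketch below is an alternative to the function-pointer approach, not what this note uses. Instance is a cut-down stand-in for IrisInstance and meanOf is a hypothetical helper:

```zig
const std = @import("std");

// Cut-down stand-in for IrisInstance, for illustration only.
const Instance = struct {
    petal_length: f64,
    petal_width: f64,
};

// Hypothetical helper: averages the field named by field_name.
// The field is resolved at compile time, so a misspelt name is a compile error.
fn meanOf(comptime field_name: []const u8, instances: []const Instance) f64 {
    if (instances.len == 0) return 0.0;
    var total: f64 = 0.0;
    for (instances) |instance| total += @field(instance, field_name);
    return total / @as(f64, @floatFromInt(instances.len));
}

pub fn main() void {
    const data = [_]Instance{
        .{ .petal_length = 1.4, .petal_width = 0.2 },
        .{ .petal_length = 1.6, .petal_width = 0.4 },
    };
    std.debug.print("{d:.2}\n", .{meanOf("petal_length", &data)});
}
```

The function-pointer version keeps the choice of attribute a runtime value, whereas this version generates a separate function body for each field name used.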

The following function uses the DescriptiveStatistics struct to store all the data values and then retrieve summary statistics for display:

fn printStatistics(allocator: std.mem.Allocator, instances: *std.ArrayList(IrisInstance), 
        attribute_name: []const u8, instance_accessor: IrisAccessor) void {         // <1>
    var ds = stats.DescriptiveStatistics.init(allocator);                           // <2>
    defer ds.deinit();

    for (instances.items) |instance| {                                              // <3>
        ds.add(instance_accessor(instance));                                        // <4>
    }

    std.debug.print("{s}\n", .{attribute_name});
    std.debug.print(" -- Minimum {d:.2}\n", .{ds.min() orelse 0.0});                // <5>
    std.debug.print(" -- Maximum {d:.2}\n", .{ds.max() orelse 0.0});
    std.debug.print(" -- Mean    {d:.2}\n", .{ds.mean()});
    std.debug.print(" -- Stddev  {d:.2}\n", .{ds.standardDeviation() catch 0.0});
}

<1> The IrisAccessor type is used to pass in a function pointer.
<2> Creates an instance of the DescriptiveStatistics struct.
<3> Loop through the instances and
<4> ... call the function pointer to access the relevant attribute value.
<5> Print out some summary statistics.
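
The contents of "descriptive-statistics.zig" are not shown in this note. As a rough guide, a struct that supports the calls made in printStatistics might look like the sketch below. The method names and return types are inferred from those calls. Everything else is an assumption: the ArrayList storage, the panic on allocation failure (so that add returns nothing), the error name NotEnoughData, and the choice of the sample standard deviation (dividing by n - 1).

```zig
const std = @import("std");

// Hypothetical sketch only: mirrors the calls made in printStatistics,
// but the internals are assumptions, not the author's actual file.
pub const DescriptiveStatistics = struct {
    values: std.ArrayList(f64),

    pub fn init(allocator: std.mem.Allocator) DescriptiveStatistics {
        return .{ .values = std.ArrayList(f64).init(allocator) };
    }

    pub fn deinit(self: *DescriptiveStatistics) void {
        self.values.deinit();
    }

    // Assumption: allocation failure is treated as fatal, so add returns void.
    pub fn add(self: *DescriptiveStatistics, value: f64) void {
        self.values.append(value) catch @panic("out of memory");
    }

    // Returns null when no values have been added.
    pub fn min(self: DescriptiveStatistics) ?f64 {
        if (self.values.items.len == 0) return null;
        var result = self.values.items[0];
        for (self.values.items[1..]) |v| result = @min(result, v);
        return result;
    }

    // Returns null when no values have been added.
    pub fn max(self: DescriptiveStatistics) ?f64 {
        if (self.values.items.len == 0) return null;
        var result = self.values.items[0];
        for (self.values.items[1..]) |v| result = @max(result, v);
        return result;
    }

    // Returns 0.0 for an empty collection (an assumption).
    pub fn mean(self: DescriptiveStatistics) f64 {
        if (self.values.items.len == 0) return 0.0;
        var total: f64 = 0.0;
        for (self.values.items) |v| total += v;
        return total / @as(f64, @floatFromInt(self.values.items.len));
    }

    // Sample standard deviation; needs at least two values.
    pub fn standardDeviation(self: DescriptiveStatistics) !f64 {
        const n = self.values.items.len;
        if (n < 2) return error.NotEnoughData;
        const m = self.mean();
        var sum_sq: f64 = 0.0;
        for (self.values.items) |v| sum_sq += (v - m) * (v - m);
        return @sqrt(sum_sq / @as(f64, @floatFromInt(n - 1)));
    }
};
```

With this shape, min and max return optionals and standardDeviation returns an error union, which is why printStatistics uses orelse for the first two and catch for the last.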