2024-12-09: Analysing Iris Dataset with Zig

As I've done in a few other notes, I will explore some simple data analysis tools applied to the iris dataset, but this time using Zig.

$ zig version
0.13.0

Files available as: iris-in-zig.tar.gz

Task Description

The iris dataset, "iris.data", is a CSV formatted dataset:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
...

There are four numeric fields (sepal length, sepal width, petal length and petal width, in that order), followed by a class name.

Step 1 is to load in the data from this CSV file, turning each instance into a struct value, and storing the instances in an ArrayList.

Step 2 will do some simple analysis of the columns of data, printing out summary statistics such as the mean value of each column.

Project Setup

To create our working environment:

$ mkdir iris-analysis
$ cd iris-analysis
$ zig init

Also, extract the "iris.data" file from the iris zip download.

$ ls
build.zig  build.zig.zon  iris.data  src

The file "src/root.zig" is not needed - I deleted it.

Libraries

This little program will use two libraries.

Reading CSV

Although the CSV file is easy to read and parse manually, I decided to use a library: zig_csv. Installation requires first using zig fetch to add the dependency to "build.zig.zon":

$ zig fetch --save git+https://github.com/matthewtolman/zig_csv#main
info: resolved ref 'main' to commit 749a0f7d0133847a562621e1a53fabaca4b619a5

Second, the dependency needs adding to the program in "build.zig", as described below.

Descriptive Statistics

I didn't immediately find a descriptive statistics library that I could use, so I wrote my own.

Installing means adding the file "descriptive-statistics.zig" into "src" and importing it in "main.zig":

const stats = @import("descriptive-statistics.zig");
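
The contents of "descriptive-statistics.zig" are not shown in this note. As a rough idea of what a small module of this kind could look like, here is a minimal sketch. The name RunningStats and all of its fields are hypothetical, and it is not the author's implementation: instead of storing the values it updates the mean and variance incrementally (Welford's method), so it needs no allocator.

```zig
const std = @import("std");

/// Hypothetical stand-in for a descriptive statistics module.
/// Values are folded into running totals rather than stored.
pub const RunningStats = struct {
    count: usize = 0,
    mean_value: f64 = 0.0,
    m2: f64 = 0.0, // running sum of squared deviations from the mean
    min_value: ?f64 = null,
    max_value: ?f64 = null,

    pub fn add(self: *RunningStats, x: f64) void {
        self.count += 1;
        const delta = x - self.mean_value;
        self.mean_value += delta / @as(f64, @floatFromInt(self.count));
        self.m2 += delta * (x - self.mean_value);
        self.min_value = if (self.min_value) |m| @min(m, x) else x;
        self.max_value = if (self.max_value) |m| @max(m, x) else x;
    }

    /// Null until at least one value has been added.
    pub fn min(self: RunningStats) ?f64 {
        return self.min_value;
    }

    pub fn max(self: RunningStats) ?f64 {
        return self.max_value;
    }

    /// Returns 0.0 when no values have been added.
    pub fn mean(self: RunningStats) f64 {
        return self.mean_value;
    }

    /// Sample standard deviation (n - 1 denominator); needs two or more values.
    pub fn standardDeviation(self: RunningStats) error{NotEnoughData}!f64 {
        if (self.count < 2) return error.NotEnoughData;
        return @sqrt(self.m2 / @as(f64, @floatFromInt(self.count - 1)));
    }
};
```

The method names mirror the calls made in printStatistics, so min and max return optionals and standardDeviation returns an error union, but the author's file may well differ in its details.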

Dependencies in build.zig

Then edit "build.zig". I simplified my file a bit. I don't need to build a lib, or run unit tests. After adding the zig_csv library, my file looks like:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "iris-analysis",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    const zcsv = b.dependency("zcsv", .{                      // <1>
        .target = target,
        .optimize = optimize,
    });

    exe.root_module.addImport("zcsv", zcsv.module("zcsv"));   // <2>

    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);

    run_cmd.step.dependOn(b.getInstallStep());

    if (b.args) |args| {
        run_cmd.addArgs(args);
    }

    const run_step = b.step("run", "Run the app");
    run_step.dependOn(&run_cmd.step);
}
  1. Creates the "zig_csv" dependency, named "zcsv".
  2. Adds the "zcsv" module to the root module of "exe", so we can @import("zcsv") in our code.

Reading a CSV File

The function readIrisData is the top-level function that opens a given CSV file and returns an ArrayList of iris instances. The logic is straightforward, and I decided to handle all errors in "main", so try is used throughout.

fn readIrisData(allocator: std.mem.Allocator, filename: []const u8) !std.ArrayList(IrisInstance) {
    const file = try std.fs.cwd().openFile(filename, .{});            // <1>
    defer file.close();
    var csv = zcsv.allocs.column.init(allocator, file.reader(), .{}); // <2>

    // read the lines, storing structs into result
    var result = std.ArrayList(IrisInstance).init(allocator);         // <3>
    errdefer result.deinit(); // free the partial list if a later step fails
    while (csv.next()) |row| {                                        // <4>
        defer row.deinit();                                           // <5>

        if (row.len() == 5) {                                         // <6>
            const instance = try IrisInstance.fromRow(row);           // <7>
            try result.append(instance);                              // <8>
        }
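    }
    // next() also returns null when parsing fails, so check the parser's
    // "err" field (as in the zig_csv README) before trusting the result.
    if (csv.err) |err| return err;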

    return result;
}
  1. Open the given filename
  2. and create a csv column reader.
  3. The instances will be stored in an ArrayList, to be returned at the end of the function.
  4. Iterate through the rows in the csv file:
  5. rows are allocated, so we must deinit them.
  6. Instances have 5 columns - there is a blank line at the end of "iris.data", so make sure we skip that one.
  7. Now create our instance from the given row,
  8. and add it to our result ArrayList.
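
The note says that errors are propagated with try and handled in "main", but "main" itself is not shown. The following is a minimal, self-contained sketch of that pattern using only the standard library. The helper countDataLines is hypothetical (it just counts lines with five comma-separated fields, rather than building instances), and the error message wording is an assumption.

```zig
const std = @import("std");

// Hypothetical helper: every failure is passed up to the caller with `try`.
fn countDataLines(allocator: std.mem.Allocator, filename: []const u8) !usize {
    const file = try std.fs.cwd().openFile(filename, .{});
    defer file.close();

    // Read the whole file; 1 MiB is ample for a file the size of iris.data.
    const contents = try file.readToEndAlloc(allocator, 1024 * 1024);
    defer allocator.free(contents);

    var count: usize = 0;
    var lines = std.mem.tokenizeScalar(u8, contents, '\n');
    while (lines.next()) |line| {
        // Five fields means four commas; anything else is skipped.
        if (std.mem.count(u8, line, ",") == 4) count += 1;
    }
    return count;
}

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // The single place where errors are turned into a message for the user.
    const count = countDataLines(gpa.allocator(), "iris.data") catch |err| {
        std.debug.print("Could not read iris.data: {s}\n", .{@errorName(err)});
        return;
    };
    std.debug.print("Read {d} instances\n", .{count});
}
```

Because "main" returns void here, every error has to be dealt with in the catch block; declaring it as `pub fn main() !void` and using try would instead let the runtime print the error name and a trace.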

The IrisInstance is a simple struct with five fields. An enum is used to represent the three labels. A function declared inside the enum returns the appropriate enum value given a string, as found in the data file:

const IrisLabel = enum {
    Setosa,
    Versicolour,
    Virginica,

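    fn fromString(str: []const u8) IrisLabel {
        // The data file stores the full class name, e.g. "Iris-setosa".
        if (std.mem.eql(u8, str, "Iris-setosa")) {
            return .Setosa;
        } else if (std.mem.eql(u8, str, "Iris-versicolor")) {
            return .Versicolour;
        } else {
            // Any other string, including "Iris-virginica", maps to Virginica.
            return .Virginica;
        }
    }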
};

The instance is constructed from a csv row. The numeric fields are converted to floats.

const IrisInstance = struct {
    label: IrisLabel,
    petal_length: f64,
    petal_width: f64,
    sepal_length: f64,
    sepal_width: f64,

    fn fromRow(row: zcsv.allocs.column.Row) !IrisInstance {
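        // Column order in "iris.data": sepal length, sepal width,
        // petal length, petal width, class name.
        const sepal_length = try row.field(0);
        const sepal_width = try row.field(1);
        const petal_length = try row.field(2);
        const petal_width = try row.field(3);
        const label = try row.field(4);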

        return IrisInstance {
            .label = IrisLabel.fromString(label.data()),
            .petal_length = try std.fmt.parseFloat(f64, petal_length.data()),
            .petal_width = try std.fmt.parseFloat(f64, petal_width.data()),
            .sepal_length = try std.fmt.parseFloat(f64, sepal_length.data()),
            .sepal_width = try std.fmt.parseFloat(f64, sepal_width.data()),
        };
    }
};
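
To make the conversion step concrete without depending on zig_csv, here is a minimal sketch that parses a single line of "iris.data" by hand. The ParsedLine struct and the parseIrisLine function are hypothetical and are not part of the original program; the sketch assumes plain comma-separated fields with no quoting.

```zig
const std = @import("std");

// Hypothetical result type: the four measurements plus the raw class name.
const ParsedLine = struct {
    values: [4]f64,
    label: []const u8, // slice into the input line, not a copy
};

// Splits one line on commas and converts the first four fields to floats.
fn parseIrisLine(line: []const u8) !ParsedLine {
    var fields = std.mem.splitScalar(u8, line, ',');

    var values: [4]f64 = undefined;
    for (&values) |*value| {
        const text = fields.next() orelse return error.MissingField;
        value.* = try std.fmt.parseFloat(f64, text);
    }

    const label = fields.next() orelse return error.MissingField;
    return ParsedLine{ .values = values, .label = label };
}

pub fn main() !void {
    const parsed = try parseIrisLine("5.1,3.5,1.4,0.2,Iris-setosa");
    std.debug.print("{d:.1} {d:.1} {d:.1} {d:.1} {s}\n", .{
        parsed.values[0], parsed.values[1],
        parsed.values[2], parsed.values[3],
        parsed.label,
    });
}
```

A short line yields error.MissingField and a non-numeric field yields whatever error std.fmt.parseFloat returns, which plays the same role as the try calls in fromRow.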

Summary Statistics

For the summary statistics, a function pointer is used to specify which attribute to analyse in the printStatistics function. The function it points to simply takes an instance and returns the relevant field - this is a little clunky to set up, but makes the calling code quite tidy.

The accessor functions are given a type and set up as follows:

const IrisAccessor = *const fn (instance: IrisInstance) f64;

fn irisPetalLength(instance: IrisInstance) f64 {
    return instance.petal_length;
}

// etc

The following function uses the DescriptiveStatistics struct to store all the data values and then retrieve summary statistics for display:

fn printStatistics(allocator: std.mem.Allocator, instances: *std.ArrayList(IrisInstance), 
        attribute_name: []const u8, instance_accessor: IrisAccessor) void {         // <1>
    var ds = stats.DescriptiveStatistics.init(allocator);                           // <2>
    defer ds.deinit();

    for (instances.items) |instance| {                                              // <3>
        ds.add(instance_accessor(instance));                                        // <4>
    }

    std.debug.print("{s}\n", .{attribute_name});
    std.debug.print(" -- Minimum {d:.2}\n", .{ds.min() orelse 0.0});                // <5>
    std.debug.print(" -- Maximum {d:.2}\n", .{ds.max() orelse 0.0});
    std.debug.print(" -- Mean    {d:.2}\n", .{ds.mean()});
    std.debug.print(" -- Stddev  {d:.2}\n", .{ds.standardDeviation() catch 0.0});
}
  1. The IrisAccessor type is used to pass in a function pointer.
  2. Creates an instance of the DescriptiveStatistics struct.
  3. Loop through the instances and
  4. ... call the function pointer to access the relevant attribute value.
  5. Print out some summary statistics.
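
The calling code is not shown above, so here is a small self-contained sketch of the accessor pattern together with a call site. Flower, Accessor, petalLength, petalWidth and columnMax are hypothetical stand-ins for IrisInstance, IrisAccessor, the accessor functions and printStatistics; the sketch only illustrates how a function is passed where a function pointer is expected.

```zig
const std = @import("std");

// Hypothetical stand-in for IrisInstance, cut down to two fields.
const Flower = struct {
    petal_length: f64,
    petal_width: f64,
};

// Same shape as IrisAccessor: a pointer to a function from instance to f64.
const Accessor = *const fn (flower: Flower) f64;

fn petalLength(flower: Flower) f64 {
    return flower.petal_length;
}

fn petalWidth(flower: Flower) f64 {
    return flower.petal_width;
}

// Returns the largest value of the chosen column, or null for an empty slice.
fn columnMax(flowers: []const Flower, accessor: Accessor) ?f64 {
    var best: ?f64 = null;
    for (flowers) |flower| {
        const value = accessor(flower);
        best = if (best) |b| @max(b, value) else value;
    }
    return best;
}

pub fn main() void {
    const flowers = [_]Flower{
        .{ .petal_length = 1.4, .petal_width = 0.2 },
        .{ .petal_length = 4.7, .petal_width = 1.4 },
    };
    // One line per column: only the accessor argument changes.
    std.debug.print("Petal Length max {d:.2}\n", .{columnMax(&flowers, petalLength) orelse 0.0});
    std.debug.print("Petal Width  max {d:.2}\n", .{columnMax(&flowers, petalWidth) orelse 0.0});
}
```

With printStatistics, the equivalent call would presumably pass the allocator, a pointer to the ArrayList, a display name and one of the accessor functions, once per column.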

Page from Peter's Scrapbook, output from a VimWiki on 2024-12-13.