2024-12-09: Analysing Iris Dataset with Zig
As I've done in a few other notes, I will explore some simple data analysis tools applied to the iris dataset, but this time using Zig.
$ zig version
0.13.0
Files available as: iris-in-zig.tar.gz
Task Description
The iris dataset, "iris.data", is a CSV formatted dataset:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
...
Each row has four numeric fields (sepal length, sepal width, petal length and petal width, in that order), followed by a class name.
Step 1 is to load in the data from this CSV file, turning each instance into a struct value, and storing the instances in an ArrayList.
Step 2 will do some simple analysis of the columns of data, printing out summary statistics such as the mean value of each column.
Project Setup
To create our working environment:
$ mkdir iris-analysis
$ cd iris-analysis
$ zig init
Also, extract the "iris.data" file from the iris zip download.
$ ls
build.zig  build.zig.zon  iris.data  src
The file "src/root.zig" is not needed - I deleted it.
Libraries
This little program will use two libraries.
Reading CSV
Although the CSV file is easy to read and parse manually, I decided to use a library: zig_csv. Installation first requires using zig fetch to add the dependency to "build.zig.zon":
$ zig fetch --save git+https://github.com/matthewtolman/zig_csv#main
info: resolved ref 'main' to commit 749a0f7d0133847a562621e1a53fabaca4b619a5
Second, the dependency needs adding to the program in "build.zig", as described below.
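For comparison, here is a minimal sketch of the manual route, using only the standard library. It assumes a line has no quoted fields or embedded commas, which holds for "iris.data" but not for CSV in general. The names ParsedLine and parseLine are made up for this illustration and are not part of zig_csv or of the program in this note.

```zig
const std = @import("std");

// Hypothetical holder for one parsed line: four numbers and the class name.
const ParsedLine = struct {
    values: [4]f64,
    label: []const u8,
};

// Splits a line on commas and parses the first four fields as floats.
// Assumes no quoting or escaping, so it is not a general CSV parser.
fn parseLine(line: []const u8) !ParsedLine {
    var fields = std.mem.splitScalar(u8, line, ',');
    var values: [4]f64 = undefined;
    for (&values) |*value| {
        const text = fields.next() orelse return error.MissingField;
        value.* = try std.fmt.parseFloat(f64, text);
    }
    // The label is a slice into `line`, so it is only valid while `line` is.
    const label = fields.next() orelse return error.MissingField;
    return .{ .values = values, .label = label };
}

pub fn main() !void {
    const parsed = try parseLine("5.1,3.5,1.4,0.2,Iris-setosa");
    std.debug.print("{d:.1} {s}\n", .{ parsed.values[0], parsed.label });
}
```

A library still earns its place once quoting, escaping or different line endings come into play.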
Descriptive Statistics
I didn't immediately find a statistics library that I could use, so I wrote my own.
Installing means adding the file into "src" and creating an import in "main.zig":
const stats = @import("descriptive-statistics.zig");
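The contents of "descriptive-statistics.zig" are not listed in this note, so the following is only a sketch of what such a file might contain. The method names (init, deinit, add, min, max, mean, standardDeviation) are inferred from how the struct is called later on. Everything else is an assumption: storing values in an ArrayList, treating out-of-memory as fatal in add, the InsufficientData error name, and using the sample (n - 1) standard deviation.

```zig
const std = @import("std");

// Hypothetical sketch; the real file used in this note may differ.
pub const DescriptiveStatistics = struct {
    values: std.ArrayList(f64),

    pub fn init(allocator: std.mem.Allocator) DescriptiveStatistics {
        return .{ .values = std.ArrayList(f64).init(allocator) };
    }

    pub fn deinit(self: *DescriptiveStatistics) void {
        self.values.deinit();
    }

    // Assumption: out-of-memory is treated as fatal so callers need no `try`.
    pub fn add(self: *DescriptiveStatistics, value: f64) void {
        self.values.append(value) catch @panic("out of memory");
    }

    // Returns null when no values have been added.
    pub fn min(self: DescriptiveStatistics) ?f64 {
        if (self.values.items.len == 0) return null;
        var result = self.values.items[0];
        for (self.values.items[1..]) |v| result = @min(result, v);
        return result;
    }

    // Returns null when no values have been added.
    pub fn max(self: DescriptiveStatistics) ?f64 {
        if (self.values.items.len == 0) return null;
        var result = self.values.items[0];
        for (self.values.items[1..]) |v| result = @max(result, v);
        return result;
    }

    // Arithmetic mean; returns 0.0 for an empty collection.
    pub fn mean(self: DescriptiveStatistics) f64 {
        if (self.values.items.len == 0) return 0.0;
        var sum: f64 = 0.0;
        for (self.values.items) |v| sum += v;
        return sum / @as(f64, @floatFromInt(self.values.items.len));
    }

    // Sample standard deviation (divides by n - 1); needs at least two values.
    pub fn standardDeviation(self: DescriptiveStatistics) error{InsufficientData}!f64 {
        const n = self.values.items.len;
        if (n < 2) return error.InsufficientData;
        const m = self.mean();
        var sum_sq: f64 = 0.0;
        for (self.values.items) |v| sum_sq += (v - m) * (v - m);
        return @sqrt(sum_sq / @as(f64, @floatFromInt(n - 1)));
    }
};

pub fn main() void {
    var ds = DescriptiveStatistics.init(std.heap.page_allocator);
    defer ds.deinit();
    for ([_]f64{ 5.1, 4.9, 4.7, 4.6 }) |value| ds.add(value);
    std.debug.print("mean {d:.2}\n", .{ds.mean()});
}
```

Keeping every value is simple but costs memory. A running computation would avoid the allocator entirely, at the price of a little more arithmetic.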
Dependencies in build.zig
Then edit "build.zig". I simplified my file a bit. I don't need to build a lib, or run unit tests. After adding the zig_csv library, my file looks like:
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    const exe = b.addExecutable(.{
        .name = "iris-analysis",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    const zcsv = b.dependency("zcsv", .{ // <1>
        .target = target,
        .optimize = optimize,
    });
    exe.root_module.addImport("zcsv", zcsv.module("zcsv")); // <2>

    b.installArtifact(exe);

    const run_cmd = b.addRunArtifact(exe);
    run_cmd.step.dependOn(b.getInstallStep());
    if (b.args) |args| {
        run_cmd.addArgs(args);
    }

    const run_step = b.step("run", "Run the app");
    run_step.dependOn(&run_cmd.step);
}
- Creates the "zig_csv" dependency, named "zcsv".
- Adds it to the "exe" module, so we can @import("zcsv") in our code.
Reading a CSV File
The function readIrisData is the top-level function that opens a given CSV file and returns an ArrayList of iris data. The logic is straightforward, and I decided to handle all errors in "main", so try is used throughout.
fn readIrisData(allocator: std.mem.Allocator, filename: []const u8) !std.ArrayList(IrisInstance) {
    const file = try std.fs.cwd().openFile(filename, .{}); // <1>
    defer file.close();
    var csv = zcsv.allocs.column.init(allocator, file.reader(), .{}); // <2>

    // read the lines, storing structs into result
    var result = std.ArrayList(IrisInstance).init(allocator); // <3>
    while (csv.next()) |row| { // <4>
        defer row.deinit(); // <5>
        if (row.len() == 5) { // <6>
            const instance = try IrisInstance.fromRow(row); // <7>
            try result.append(instance); // <8>
        }
    }

    return result;
}
- Open the given filename
- and create a csv column reader.
- The instances will be stored in an ArrayList, to be returned at the end of the function.
- Iterate through the rows in the csv file:
- rows are allocated, so we must deinit them.
- Instances have 5 columns. There is a blank line at the end of "iris.data", so make sure we skip that one.
- Now create our instance from the given row,
- and add it to our result ArrayList.
The IrisInstance is a simple struct with five fields. An enum is used to represent the three labels. A function declared inside the enum returns the appropriate enum value for a given string, as found in the data file:
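const IrisLabel = enum {
    Setosa,
    Versicolour,
    Virginica,

    // Class names in "iris.data" carry an "Iris-" prefix, e.g. "Iris-setosa".
    // Any string that is not recognised falls through to .Virginica.
    fn fromString(str: []const u8) IrisLabel {
        if (std.mem.eql(u8, str, "Iris-setosa")) {
            return .Setosa;
        } else if (std.mem.eql(u8, str, "Iris-versicolor")) {
            return .Versicolour;
        } else {
            return .Virginica;
        }
    }
};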
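Because the final else branch maps every unrecognised string to Virginica, a typo in the data would go unnoticed. A stricter variant could return an optional instead. The sketch below is an illustration only: the name parse is hypothetical, and the class names are assumed to be spelled as in the sample rows near the top of this note ("Iris-setosa", and by extension "Iris-versicolor" and "Iris-virginica").

```zig
const std = @import("std");

const IrisLabel = enum {
    Setosa,
    Versicolour,
    Virginica,

    // Hypothetical stricter lookup: returns null for unknown class names.
    fn parse(str: []const u8) ?IrisLabel {
        if (std.mem.eql(u8, str, "Iris-setosa")) return .Setosa;
        if (std.mem.eql(u8, str, "Iris-versicolor")) return .Versicolour;
        if (std.mem.eql(u8, str, "Iris-virginica")) return .Virginica;
        return null;
    }
};

pub fn main() void {
    // The caller decides what an unknown label means; here we just report it.
    if (IrisLabel.parse("Iris-setosa")) |label| {
        std.debug.print("parsed {s}\n", .{@tagName(label)});
    } else {
        std.debug.print("unknown label\n", .{});
    }
}
```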
The instance is constructed from a csv row. The numeric fields are converted to floats.
const IrisInstance = struct {
    label: IrisLabel,
    petal_length: f64,
    petal_width: f64,
    sepal_length: f64,
    sepal_width: f64,

    // Column order in "iris.data": sepal length, sepal width,
    // petal length, petal width, class name.
    fn fromRow(row: zcsv.allocs.column.Row) !IrisInstance {
        const sepal_length = try row.field(0);
        const sepal_width = try row.field(1);
        const petal_length = try row.field(2);
        const petal_width = try row.field(3);
        const label = try row.field(4);

        return IrisInstance{
            .label = IrisLabel.fromString(label.data()),
            .petal_length = try std.fmt.parseFloat(f64, petal_length.data()),
            .petal_width = try std.fmt.parseFloat(f64, petal_width.data()),
            .sepal_length = try std.fmt.parseFloat(f64, sepal_length.data()),
            .sepal_width = try std.fmt.parseFloat(f64, sepal_width.data()),
        };
    }
};
Summary Statistics
For the summary statistics, a function pointer is used to specify which attribute to analyse in the printStatistics function. The function it points to simply takes an instance and returns the relevant field. This is a little clunky to set up, but makes the calling code quite tidy.
The accessor functions are given a type and set up as follows:
const IrisAccessor = *const fn(instance: IrisInstance) f64;

fn irisPetalLength (instance: IrisInstance) f64 {
    return instance.petal_length;
}
// etc
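To show the accessor pattern in isolation, here is a small self-contained sketch. The cut-down Instance struct, the Accessor alias and the columnMean helper are hypothetical stand-ins for illustration, not the definitions used in this program.

```zig
const std = @import("std");

// Hypothetical cut-down instance with just two numeric attributes.
const Instance = struct {
    petal_length: f64,
    petal_width: f64,
};

// Pointer to a function that extracts one numeric attribute.
const Accessor = *const fn (instance: Instance) f64;

fn petalLength(instance: Instance) f64 {
    return instance.petal_length;
}

fn petalWidth(instance: Instance) f64 {
    return instance.petal_width;
}

// Averages whichever attribute the accessor selects.
// An empty slice yields NaN (0.0 / 0.0), so callers should pass data.
fn columnMean(instances: []const Instance, accessor: Accessor) f64 {
    var sum: f64 = 0.0;
    for (instances) |instance| sum += accessor(instance);
    return sum / @as(f64, @floatFromInt(instances.len));
}

pub fn main() void {
    const data = [_]Instance{
        .{ .petal_length = 1.4, .petal_width = 0.2 },
        .{ .petal_length = 1.6, .petal_width = 0.4 },
    };
    // The same loop serves both columns; only the accessor changes.
    std.debug.print("length {d:.2}\n", .{columnMean(&data, petalLength)});
    std.debug.print("width  {d:.2}\n", .{columnMean(&data, petalWidth)});
}
```

The function name coerces to the pointer type at the call site, which is what keeps the calling code short.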
The following function uses the DescriptiveStatistics struct to store all the data values and then retrieve summary statistics for display:
fn printStatistics(allocator: std.mem.Allocator,
                   instances: *std.ArrayList(IrisInstance),
                   attribute_name: []const u8,
                   instance_accessor: IrisAccessor) void { // <1>
    var ds = stats.DescriptiveStatistics.init(allocator); // <2>
    defer ds.deinit();

    for (instances.items) |instance| { // <3>
        ds.add(instance_accessor(instance)); // <4>
    }

    std.debug.print("{s}\n", .{attribute_name});
    std.debug.print(" -- Minimum {d:.2}\n", .{ds.min() orelse 0.0}); // <5>
    std.debug.print(" -- Maximum {d:.2}\n", .{ds.max() orelse 0.0});
    std.debug.print(" -- Mean {d:.2}\n", .{ds.mean()});
    std.debug.print(" -- Stddev {d:.2}\n", .{ds.standardDeviation() catch 0.0});
}
- The IrisAccessor type is used to pass in a function pointer.
- Creates an instance of the DescriptiveStatistics struct.
- Loop through the instances and
- ... call the function pointer to access the relevant attribute value.
- Print out some summary statistics.