2024-10-28: String Splitting
I've been looking at one of the first examples used on the Raku language site: Raku by Example. This provides a little test case for exploring some string splitting and related Lisp libraries.
The program reads in a file of data and does some pre-processing. What struck me as interesting was the way the data was split up into fields.
The file format for a single line is:
Ana Dave | 3:0
and the Raku code to get this into four variables is:
    my ($pairing, $result) = $line.split(' | ');
    my ($p1, $p2) = $pairing.words;
    my ($r1, $r2) = $result.split(':');
What does this look like in Lisp?
No Libraries
The "no library" version could be written as:
    (let* ((space-posn (position #\space line))
           (bar-posn (position #\| line))
           (colon-posn (position #\: line))
           (pairing-1 (subseq line 0 space-posn))
           (pairing-2 (subseq line (+ space-posn 1) (- bar-posn 1)))
           (result-1 (subseq line (+ bar-posn 2) colon-posn))
           (result-2 (subseq line (+ colon-posn 1))))
      ; ...
      )
This uses position to extract the index of the three key split points in the string, and then careful subseq operations to retrieve the four required values.
This is not too bad to write for small examples, but it is clearly not robust (an extra space or a missing separator breaks the index arithmetic), and it is not clear how the splitting relates to the structure of the string.
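One way to make a hand-written version follow the structure of the line is a small helper that splits once on a substring, built on the standard search function. This is only a sketch: the names split-once and parse-line are hypothetical, and it assumes each separator occurs exactly once, as in the sample line.

```lisp
;; Hypothetical helper (not from any library): split STRING at the first
;; occurrence of the substring SEPARATOR, returning the parts as a list.
;; If SEPARATOR is absent, the whole string is returned as the only part.
(defun split-once (separator string)
  (let ((posn (search separator string)))
    (if posn
        (list (subseq string 0 posn)
              (subseq string (+ posn (length separator))))
        (list string))))

;; Split in the same three steps as the Raku code.
(defun parse-line (line)
  (destructuring-bind (pairing result) (split-once " | " line)
    (destructuring-bind (pairing-1 pairing-2) (split-once " " pairing)
      (destructuring-bind (result-1 result-2) (split-once ":" result)
        (list pairing-1 pairing-2 result-1 result-2)))))

(parse-line "Ana Dave | 3:0")
;; => ("Ana" "Dave" "3" "0")
```

A line with a missing separator makes destructuring-bind signal an error rather than silently producing wrong fields, which is arguably an improvement over the index arithmetic.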
Using UIOP Library
The UIOP library is an important one, because it comes along with ASDF, so you get it "for free" when using the library system. The library offers a useful function, split-string, which splits strings based on one or more separating characters.
In this way, our string is split by using all three separator characters in one go:
    * (uiop:split-string "Ann Dave | 3:0" :separator " |:")
    ("Ann" "Dave" "" "" "3" "0")
The above code can then be rewritten, using destructuring-bind and removing all the empty strings:
    (destructuring-bind (pairing-1 pairing-2 result-1 result-2)
        (remove-if #'uiop:emptyp (uiop:split-string line :separator " |:"))
      ; ...
      )
This works, but it again ignores the structure of the data, which the Raku program reflected: first splitting the data into two halves, and then dealing with each half in a different way.
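The two-stage structure can be approximated with UIOP alone by splitting on the bar first and trimming each half with the standard string-trim. The following is a rough sketch under that assumption; parse-line-uiop is a hypothetical name, and it still relies on single spaces between the two player names.

```lisp
(require "asdf") ; UIOP is bundled with ASDF

;; Hypothetical helper: split on the bar only, then trim the spaces
;; around each half, so that each half can be split in its own way.
(defun parse-line-uiop (line)
  (destructuring-bind (pairing result)
      (mapcar (lambda (part) (string-trim " " part))
              (uiop:split-string line :separator "|"))
    (append (uiop:split-string pairing :separator " ")
            (uiop:split-string result :separator ":"))))

(parse-line-uiop "Ann Dave | 3:0")
;; => ("Ann" "Dave" "3" "0")
```

The extra trimming step is the price of split-string accepting only separator characters, not a separator substring.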
The Raku split method also has an additional ability over UIOP's split-string: it can split a string using a substring, returning the parts before and after the substring.
    [3] > 'a-b-c | d-e-f'.split(' | ')
    (a-b-c d-e-f)
    [4] > '(a-b-c | d-e-f'.split('|')
    ((a-b-c   d-e-f)
The first split has removed the vertical bar and surrounding spaces. Can we do this in Lisp?
Using str Library
The str library provides many important string-processing functions.
The function we are looking at is str's split function. This does several things, but, to answer the points above, it can split on a character or on a substring:
    * (str:split #\| "Ann Dave | 3:0")
    ("Ann Dave " " 3:0")
    * (str:split " | " "Ann Dave | 3:0")
    ("Ann Dave" "3:0")
and it has a flag to remove empty strings from the result:
    * (str:split #\space " one two  three ")
    ("" "one" "two" "" "three" "")
    * (str:split #\space " one two  three " :omit-nulls t)
    ("one" "two" "three")
We use split and words to replicate the three split statements from the Raku code:
    (destructuring-bind (pairing result) (str:split " | " line)
      (destructuring-bind (pairing-1 pairing-2) (str:words pairing)
        (destructuring-bind (result-1 result-2) (str:split ":" result)
          ; ...
          )))
Of course, the binding part looks a bit ugly with those nested, repeated destructuring-bind forms. We clean this up with another library: the metabang-bind library, whose bind macro combines let, destructuring-bind, and multiple-value-bind.
With this, we write three successive bindings (each can use the variables bound before it), splitting the line in steps, echoing the original Raku program:
    (metabang-bind:bind (((pairing result) (str:split " | " line))
                         ((pairing-1 pairing-2) (str:words pairing))
                         ((result-1 result-2) (str:split ":" result)))
      ; ...
      )
The only remaining point about Raku's treatment of the input data is that its arithmetic operators coerce their operands, so Raku is happy to take the strings returned for the results and use them as numbers:
    [0] > 1 + 2
    3
    [1] > 1 + '2'
    3
    [2] > "1" + 2
    3
    [3] > "1" + "2"
    3
whereas strongly-typed Lisp is not so happy:
    * (+ 1 2)
    3
    * (+ 1 "2")
    debugger invoked on a TYPE-ERROR @B800002FFF in thread
    #<THREAD tid=2954 "main thread" RUNNING {1000B90003}>:
      The value "2" is not of type NUMBER
For Lisp, we need to convert the results to integers, which is done with parse-integer after splitting the result variable.
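As a brief illustration of that conversion, the standard parse-integer function turns each result string into an integer. The input strings below are made up for the example.

```lisp
;; parse-integer is standard Common Lisp. It signals an error on
;; non-numeric input unless :junk-allowed is true.
(parse-integer "3")                       ; => 3
(mapcar #'parse-integer '("3" "0"))       ; => (3 0)
(parse-integer "3 sets" :junk-allowed t)  ; => 3 (second value: 1)
(parse-integer "abc" :junk-allowed t)     ; => NIL
(+ 1 (parse-integer "2"))                 ; => 3
```

Leaving :junk-allowed at its default means a malformed score line stops the program with an error instead of being counted as something else.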
Final Program
For completeness, my final Lisp program is:
    (asdf:load-system "metabang-bind")
    (asdf:load-system "str")

    (format t "Tournament results:~%~%")

    (let* ((lines (uiop:read-file-lines "scores.txt"))
           (names (str:words (first lines)))
           (matches (make-hash-table :test #'equalp))
           (sets (make-hash-table :test #'equalp)))
      ;
      (dolist (line (rest lines))
        (unless (str:emptyp line)
          (metabang-bind:bind (((pairing result) (str:split " | " line))
                               ((player-1 player-2) (str:words pairing))
                               ((result-1 result-2) (mapcar #'parse-integer
                                                            (str:split ":" result))))
            (incf (gethash player-1 sets 0) result-1)
            (incf (gethash player-2 sets 0) result-2)
            (if (> result-1 result-2)
                (incf (gethash player-1 matches 0))
                (incf (gethash player-2 matches 0))))))
      ;
      ;; Sort by sets first, then stably by matches, so that sets break ties.
      ;; The default of 0 covers players with no recorded wins.
      (setf names (sort names #'< :key #'(lambda (name) (gethash name sets 0))))
      (setf names (stable-sort names #'< :key #'(lambda (name) (gethash name matches 0))))
      ;
      (dolist (name (reverse names))
        (format t "~a has won ~d ~:*~[matches~;match~:;matches~] and ~d set~:p~&"
                name (gethash name matches 0) (gethash name sets 0))))

    (uiop:quit)
Note the use of format conditionals to handle plurals:
- ~:* steps back in the list of arguments, to reuse the count of matches
- ~[...~] picks the relevant match/matches based on whether the count is 0, 1, or more
- ~:p adds an "s" if the previous argument is not 1
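To see these directives in isolation, here is a small sketch that builds the same kind of sentence as a string. The function name describe-wins and the sample counts are hypothetical. The control string is written with no space between ~:* and ~[ so that only one space follows the number.

```lisp
;; Hypothetical helper showing the plural directives on their own.
(defun describe-wins (name matches sets)
  (format nil "~a has won ~d ~:*~[matches~;match~:;matches~] and ~d set~:p"
          name matches sets))

(describe-wins "Ana" 1 8)   ; => "Ana has won 1 match and 8 sets"
(describe-wins "Beth" 2 1)  ; => "Beth has won 2 matches and 1 set"
(describe-wins "Dave" 0 0)  ; => "Dave has won 0 matches and 0 sets"
```

For a regular plural such as "match/matches", the shorter ~:p form cannot add "es", which is why the ~[...~] conditional is used for matches and ~:p only for sets.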