Code Examples: Robin

Some examples of the libraries and procedures in the robin namespace of r7rs-libs.

(robin abbrev)

Create unique abbreviations for a list of strings. This library is a port of the equivalent Ruby library.

abbrev

Input:

a list of strings
an optional prefix

Output:

an association list mapping unique abbreviations to its matching string

For example:

gosh[r7rs.user]$ (import (robin abbrev))
gosh[r7rs.user]$ (abbrev '("scheme" "scala" "lisp"))
(("lisp" . "lisp") ("lis" . "lisp") ("li" . "lisp") ("l" . "lisp")
 ("scala" . "scala") ("scal" . "scala") ("sca" . "scala") ("scheme" . "scheme")
 ("schem" . "scheme") ("sche" . "scheme") ("sch" . "scheme"))

The first line shows that any prefix to "lisp" is an acceptable, unique abbreviation: because no other word in the list of strings starts with the letter "l". For Scala and Scheme, the unique abbreviations start with three letters, as "s" and "sc" are prefixes to both.

The optional prefix will only include words that have that prefix, allowing us to select a subset of the strings to generate abbreviations for. For example:

gosh[r7rs.user]$ (abbrev '("scheme" "scala" "lisp") "s")
(("scala" . "scala") ("scal" . "scala") ("sca" . "scala") ("scheme" . "scheme")
 ("schem" . "scheme") ("sche" . "scheme") ("sch" . "scheme"))

With the "s" prefix included, "lisp" is no longer in the output.

As a simple application, the abbreviations can be used as a word or command expander:

gosh[r7rs.user]$ (define shortcuts (abbrev '("scheme" "scala" "lisp")))
shortcuts
gosh[r7rs.user]$ (cdr (assoc "l" shortcuts))
"lisp"
gosh[r7rs.user]$ (cdr (assoc "sch" shortcuts))
"scheme"

(robin confusion-matrix)

A confusion matrix represents "the relative frequencies with which each of a number of stimuli is mistaken for each of the others by a person in a task requiring recognition or identification of stimuli" (R. Colman, A Dictionary of Psychology, 2008). Each row represents the predicted label of an instance, and each column represents the observed label of that instance. Numbers at each (row, column) reflect the total number of instances of predicted label "row" which were observed as having label "column".

A two-class example is:

    Observed        Observed      |
    Positive        Negative      | Predicted
    ------------------------------+------------
        a               b         | Positive
        c               d         | Negative

Here the value:

a - is the number of true positives (those predicted positive and observed positive)
b - is the number of false negatives (those predicted positive but observed negative)
c - is the number of false positives (those predicted negative but observed positive)
d - is the number of true negatives (those predicted negative and observed negative)

From this table we can calculate statistics like:

true positive rate - a/(a+b)
positive recall - a/(a+c)

As statistics can also be calculated for the negative label, e.g. the true negative rate is d/(c+d), the functions below have an optional for-label parameter, to specify which label they are calculated for: the default is to report for the first label named when the matrix is created

The implementation supports confusion matrices with more than two labels. When more than two labels are in use, the statistics are calculated as if the first, or named, label were positive and all the other labels are grouped as if negative.

The following example creates a simple two-label confusion matrix, prints a few statistics and displays the table - notice that the package "confusion-matrix" has the alias "cm".

(import (scheme base)
        (scheme write)
        (robin confusion-matrix))

(define cm (make-confusion-matrix (list 'pos 'neg)))

(confusion-matrix-add cm 'pos 'pos 10)
(confusion-matrix-add cm 'pos 'neg 3)
(confusion-matrix-add cm 'neg 'neg 20)
(confusion-matrix-add cm 'neg 'pos 5)

(display "Confusion Matrix") (newline) (newline)
(confusion-matrix-display cm)

(display "Precision: ") (display (inexact (precision cm))) (newline)
(display "Recall: ") (display (inexact (recall cm))) (newline)
(display "Matthews-Correlation: ")
(display (inexact (matthews-correlation cm)))
(newline)

which outputs:

> gosh -I. .\robin-examples\confusion-matrix-eg.sps
Confusion Matrix

Observed
  pos  neg | Predicted
-----------+-----------
   10    3 | pos
    5   20 | neg
Precision: 0.6666666666666666
Recall: 0.7692307692307693
Matthews-Correlation: 0.5524850114241865

(robin csv)

A CSV reader/writer library. RFC4180 compliant - http://tools.ietf.org/html/rfc4180 - but allowing:

choice of separator character, e.g. tab instead of comma
choice of quote character, e.g. single instead of double-quote
"CR LF", "CR" or "LF" are all accepted as line endings when reading

The CSV reader/writer uses a standard input or output port. Records are retrieved or written as lists of field values. Fields are always read as strings. Fields are written using display unless they are strings needing quoting, when they are written character by character, or symbols, which are first converted to strings.

make-csv-reader

Creates a csv-reader object:

no arguments: uses current input port, comma as separator and double-quote for quote character
one argument: specifies input port
two arguments: specifies input port and separator character
three arguments: specifies input port, separator character and quote character

csv-read

Reads a single record (list of fields) from given reader, or current-input-port if no reader given.

csv-read-all

Reads a list of records (list of fields) from given reader, or current-input-port if no reader given.

$ sash -r7 -L .
sash[r7rs]> (import (robin csv) (scheme file))
sash[r7rs]> (with-input-from-file "data.csv" (lambda () (csv-read-all)))
(("123" "field 2" "456") ("789" "see, comma" "12"))
sash[r7rs]>

make-csv-writer

Creates a csv-writer object:

no arguments: uses current-output-port, comma as separator and double-quote for quote character
one argument: specifies output port
two arguments: specifies output port and separator character
three arguments: specifies output port, separator character and quote character

csv-write

Writes a single record (list of fields) in CSV format:

one argument: the record, writes to the current-output-port with comma as separator and double-quote for quote character
two arguments: the record and a csv-writer object

csv-write-all

Writes a list of records in CSV format:

one argument: the records, writes to the current-output-port with comma as separator and double-quote for quote character
two arguments: the records and a csv-writer object

$ sash -r7 -L .
sash[r7rs]> (import (robin csv) (scheme file))
sash[r7rs]> (with-output-to-file "data.csv"
   (lambda () (csv-write-all '((123 "field 2" 456) (789 "see, comma" 012)))))
sash[r7rs]>
$ more data.csv
123,field 2,456
789,"see, comma",12

(robin disjoint-set)

A disjoint-set data structure holds items in distinct sets (or groups). Efficient procedures are provided for finding a representative of the set any item is contained in, and also for joining two sets together.

make-disjoint-set

make-disjoint-set takes one or two parameters: a single comparator or the equality function to use on terms, and a hash function (e.g. hash, string-hash or default-hash). The return value is a disjoint set.

disjoint-set?

disjoint-set? checks if its argument is a disjoint-set instance or not, returning a boolean value.

disjoint-set:make

disjoint-set:make places an item into the disjoint-set as its own group. Takes two arguments: the disjoint set and the item. Return value is undefined.

disjoint-set:find

disjoint-set:find takes two arguments, the disjoint set and an item, and an optional default value. The function returns the representative item for the group that the given item is in, or the default value if not present (default is 'item-not-found if not provided).

disjoint-set:union

disjoint-set:union takes three arguments, the disjoint set and two items. The disjoint set is modified to combine the groups that the two items are in.

disjoint-set:size

disjoint-set:size takes a disjoint set and returns the number of distinct groups it contains.

Example: Kruskal’s Algorithm

This example illustrates how disjoint sets can be used in Kruskal’s algorithm to find a minimal spanning set. (The complete code is in "robin-examples/kruskal.sps")

(define (kruskal graph)
  (let ((result '())
        (nodes (delete-duplicates (append (map car graph) (map cadr graph)) eq?)))
    ; 1. make a disjoint set, with each node an item
    (let ((ds (make-disjoint-set eq? hash-by-identity)))        ; 
      (for-each (lambda (node) (disjoint-set:make ds node))     ; 
                nodes)
      ; 2. set 'links' holds all the links in graph, sorted with the shortest first
      (let loop ((links (sort graph (lambda (a b) (< (caddr a) (caddr b))))))
        ; 3. if links non-empty and size > 1
        (when (and (not (null? links))
                   (> (disjoint-set:size ds) 1))                ; 
          (let ((link (car links)))
            (unless (eq? (disjoint-set:find ds (car link))      ; 
                         (disjoint-set:find ds (cadr link)))
              (set! result (cons link result))
              (disjoint-set:union ds (car link) (cadr link))))  ; 
          (loop (cdr links)))))
    (reverse result)))

;; using it
(let* ((graph '((a b 3) (a e 1) (b c 5) (b e 4) (c d 2) (c e 6) (d e 7)))
       (res (kruskal graph)))
  (format #t "MST has ~a links~&" (length res))
  (format #t "~{   : ~a~&~}" res)                               ; 
  (format #t "Total length: ~a~&" (fold + 0 (map caddr res))))

	Creates a disjoint set using `eq?` for equality and `hash-by-identity` for the hash function, because individual items are symbols.
	Each node is added to the disjoint set, initially in its own group.
	The number of groups in the disjoint set tells us how many links remain to be added.
	Look for a link where its end points are in different groups: tested by finding the representative item of each end point’s group in the disjoint set.
	When we add a link, connect the groups in the disjoint set.
	Note use of `format` to print a list of items

Output:

MST has 4 links
   : (a e 1)
   : (c d 2)
   : (a b 3)
   : (b c 5)
Total length: 11

(robin simulated-annealing)

Simulated Annealing is a popular optimisation technique, first introduced by:

Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1092.

simulated annealing

Three functions must be provided:

make-instance: create a new random instance. This function takes an optional argument, which is the previous instance.
evaluate-instance: given an instance, return a real number estimating the quality of that instance, with 0.0 being the best value.
num-iterations: the number of internal cycles to perform.

An optional fourth function can be provided to stop the algorithm before all the iterations have been performed:

can-stop?: function that accepts the result of evaluate-instance and returns #t if the algorithm should stop.

An example:

Find value for x so that x² - 7x + 10 = 0, with x in range [0,10].

The make-instance function must work differently depending on whether it is given an optional argument. Without an argument, it should simply return a random number in the range [0,10]. With an argument, it should return a neighbour, such as a small delta from the given value:

(define make-instance
  (case-lambda
    (()                             ; 
     (* (random-real) 10.0))
    ((x)                            ; 
     (let ((new-x (+ x (* 0.02 (random-real)) -0.01)))
       (cond ((< new-x 0)           ; 
              0.0)
             ((> new-x 10)
              10.0)
             (else
               new-x))))))

	No arguments - return a random number.
	One argument - return a neighbour.
	Make sure the new value is within the required range.

The evaluate-instance function should compute the value of the quadratic, returning a positive number with 0.0 representing the best value:

(define (evaluate-instance x)
  (let ((y = (+ (* x x) (* -7 x) 10)))
    (square y)))

Running the algorithm also requires a number of iterations:

(let ((result (simulated-annealing make-instance evaluate-instance 10000)))
  (display "Final value of x: ") (display result) (newline))

Depending on the particular random values it chooses, the algorithm will return one of other of the two roots:

> gosh .\robin-examples\simulated-annealing-eg1.sps
Final value of x: 2.0092664951926014
> gosh .\robin-examples\simulated-annealing-eg1.sps
Final value of x: 5.002658285287288

The file "robin-examples/simulated-annealing-eg1.sps" also uses (slib charplot) to plot the performance graph in the terminal. An example:

evaluation _____________________________________________
        80|-*                                           |
          | *                                           |
        70|-*                                           |
          | **                                          |
        60|-:*                                          |
          | :*                                          |
        50|-:**                                         |
          | :**                                         |
        40|-: *                                         |
          | : *                                         |
        30|-: **                                        |
          | :  **                                       |
        20|-:   *                                       |
          | :   **                                      |
        10|-:    **                                     |
          | :     *******                               |
         0|-:----------******************************---|
          |_:____.____:____.____:____.____:____.____:___|
cycle numbe 0        2500      5000      7500     10000

(robin srfi64-utils)

Some convenience functions for unit testing with SRFI 64.

test-all-equal

test-all-equal accepts two arguments: a function to test and an association list of ( argument . result ) pairs. The function effectively calls (test-equal result (function argument)) for all pairs.

(test-all-equal daitch-mokotoff-soundex '(("MANHEIM" . "665600")
                                          ("MINTZ" . "664000")
                                          ("TOPF" . "370000")
                                          ("AUERBACH" . "097500")))

test-compare

test-compare checks if two given items satisfy the given comparison procedure.

This is useful for testing equality of more complex data. For example, we might want to check if just the first item of sublists are the same:

sash> (import (scheme list))
#<unspecified>
sash> (define (equal-cars? i1 i2) (every (lambda (l1 l2) (equal? (car l1) (car l2))) i1 i2))
#<unspecified>
sash> (test-compare equal-cars? '((1 2) (3 4)) '((1 5) (3 6)))
#<unspecified>
sash> (test-compare equal-cars? '((1 2) (3 4)) '((1 5) (4 6)))
FAIL '((1 5) (4 6))
#<unspecified>

test-no-error

test-no-error is used to confirm that a piece of code has not thrown an error. In the following example, the first line raises an error, and so the test fails. The second line does not raise an error, and so the test passes.

sash> (test-no-error (if (zero? 0) (error "") #t))
FAIL #f
#<unspecified>
sash> (test-no-error (if (zero? 1) (error "") #t))
#<unspecified>

(robin statistics)

A library of functions to compute statistical or other information about sets of data.

arithmetic-mean

Same as mean.

arithmetic-geometric-mean

The arithmetic-geometric-mean is defined as the limit of a sequence of two terms, replacing the first term with their arithmetic mean and the second with their geometric mean, until the two are equal. The function is useful in mathematical applications, see description at http://mathworld.wolfram.com/Arithmetic-GeometricMean.html

The function takes two arguments and an optional tolerance:

sash> (import (robin statistics))
#<unspecified>
sash> (arithmetic-geometric-mean 1 (/ 1 (sqrt 2)))
0.8472130848351929
sash> (arithmetic-geometric-mean 1 (/ 1 (sqrt 2)) 1e-12)
0.8472130847939792

coefficient-of-variation

coefficient-of-variation describes the amount of spread in a dataset relative to its mean. In the following example, both lists have the same mean (3), but the spread is greater in the second.

sash> (coefficient-of-variation '(1 2 3 4 5))
52.704627669472984
sash> (coefficient-of-variation '(-2 2 3 4 8))
120.18504251546631

combinations

combinations returns the number of ways to take k things from n without replacement, when the order does not matter. The form is (combinations n k)

sash> (combinations 5 2)
10
sash> (combinations 1000 500)
270288240945436569515614693625975275496152008446548287007392875106625428705522193898612483924502370165362606085021546104802209750050679917549894219699518475423665484263751733356162464079737887344364574161119497604571044985756287880514600994219426752366915856603136862602484428109296905863799821216320

geometric-mean

The geometric mean takes the nth root of the product of the numbers, where n is the size of the data. This yields the central number within a geometric progression.

sash> (geometric-mean '(1 2 3 4 5))
2.605171084697352
sash> (geometric-mean '(1 3 9 27 81))
9.000000000000002
sash> (geometric-mean '(1))
1
sash> (geometric-mean '(1 3))
1.7320508075688772

harmonic-mean

The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.

sash> (harmonic-mean '(1 2 3 4 5))
300/137
sash> (inexact (harmonic-mean '(1 2 3 4 5)))
2.18978102189781

jaccard-distance

The Jaccard Distance] is simply (1 - (jaccard-index)), and measures how dissimilar the two sets are.

sash> (jaccard-distance '(1 2 3 4 5) '(3 4 5 6 7))
4/7

jaccard-index

The Jaccard Index] returns a number from 0 to 1 indicating how similar two sets are. Sets here are represented as lists, and duplicates are ignored. An optional third argument provides the set-equality function (which defaults to equal?).

sash> (jaccard-index '(1 2 3 4 5) '(3 4 5 6 7))
3/7
sash> (jaccard-index '(1 2 3 4 5) '(3 4 5 1 2 2 1 1))
1
sash> (jaccard-index '("a" "b" "c") '("A" "B" "F"))
0
sash> (jaccard-index '("a" "b" "c") '("A" "B" "D") string-ci=?)
1/2

Note

Interestingly, the Jaccard Index is a distance metric satisfying, in particular, the triangle inequality.

mean

Computes the arithmetic mean of a list of numbers. This is the familiar "add up all the numbers, divide by the total" average.

#|kawa:2|# (mean '(1 2 3 4 5))
3

median

The median is the central number, when all the numbers are put in order (or, when there are an even number of numbers, the average of the two middle numbers).

#|kawa:8|# (median '(5 3 1 2 4))
3

mode

The mode is the number appearing most often. The mode function returns two values: a list of the most frequent number or numbers, and their number of occurrences.

#|kawa:9|# (mode '(5 3 1 2 4))
(1 2 3 4 5) 1
#|kawa:11|# (let-values (((items count) (mode '(5 3 5 1 2 4))))
#|.....12|# (display items) (newline) (display count) (newline))
(5)
2

percentile

percentile takes two arguments, a list of data and a percent value, and returns the item in the list at that percentile position along the list, or an average of the nearest two values. percent values must be in the range 1 to 99, inclusive. Note that a percent of 50 corresponds to the median.

#|kawa:13|# (percentile '(1 2 3 4 5) 50)
3
#|kawa:14|# (percentile '(1 2 3 4 5) 75)
4
#|kawa:16|# (percentile '(1 2 3 4 5) 99)
5
#|kawa:19|# (percentile '(1 2 3 4 5) 85)
5
#|kawa:20|# (percentile '(1 2 3 4 5) 80)
9/2

perlin-noise

perlin-noise is a function returning a value computed using a gradient function. Perlin Noise was invented by Ken Perlin and has been widely applied in computer graphics. The implementation here is a conversion of the Java reference implementation.

perlin-noise takes three numbers as input, and returns a single number (all numbers inexact).

sash> (perlin-noise 12.3 4.2 5.7)
0.4085747160714242

permutations

permutations returns the number of ways to take k things from n without replacement, when the order matters. The form is (permutations n k)

#|kawa:4|# (permutations 5 2)
20
#|kawa:5|# (permutations 100 20)
1303995018204712451095685346159820800000

population-standard-deviation

population-standard-deviation of a list of data:

#|kawa:28|# (population-standard-deviation '(1 2 3 4 5))
1.4142135623730951

Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it:

#|kawa:39|# (population-standard-deviation '(1 2 3 4 5) 3)
1.4142135623730951

population-variance

population-variance of a list of data:

#|kawa:29|# (population-variance '(1 2 3 4 5))
2

Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it.

random-normal

random-normal is used to return a random number normally distributed with a given mean and standard deviation.

#|kawa:2|# (random-normal 0 10)
3017/1250
#|kawa:3|# (inexact (random-normal 0 10))
0.05948
#|kawa:4|# (inexact (random-normal 0 10))
0.48029
#|kawa:5|# (inexact (random-normal 0 10))
-3.83553
#|kawa:7|# (inexact (random-normal 20 10))
21.81053
#|kawa:8|# (inexact (random-normal 20 10))
15.54491

random-pick

random-pick returns a random selection from a given list

#|kawa:20|# (random-pick '(1 2 3 4 5 6))
6
#|kawa:21|# (random-pick '(1 2 3 4 5 6))
3
#|kawa:22|# (random-pick '(1 2 3 4 5 6))
2

random-sample

random-sample returns a random sample of n chosen from a list of items, without replacement

#|kawa:25|# (random-sample 3 '(1 2 3 4 5 6))
(1 4 5)
#|kawa:26|# (random-sample 3 '(1 2 3 4 5 6))
(6 5 1)
#|kawa:27|# (random-sample 3 '(1 2 3 4 5 6))
(2 3 6)

random-weighted-sample

random-weighted-sample returns a random sample of n chosen from a list of items, without replacement, but with the items weighted by the given list of weights.

The following gives a heavy bias to choosing 1 and 6:

#|kawa:28|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(10 1 1 1 1 10))
(6 1 5)
#|kawa:29|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(10 1 1 1 1 10))
(1 6 4)
#|kawa:30|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(10 1 1 1 1 10))
(1 6 5)

The following is a lighter bias to choosing 1 and 4:

#|kawa:34|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(2 1 1 2 1 1))
(4 6 2)
#|kawa:35|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(2 1 1 2 1 1))
(5 4 3)
#|kawa:36|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(2 1 1 2 1 1))
(1 3 2)

sign

sign returns 1 for positive numbers, 0 for zero, -1 for negative numbers

#|kawa:36|# (sign 10)
1
#|kawa:37|# (sign 0)
0
#|kawa:38|# (sign -0.7)
-1

sorenson-dice-index

The Sorenson Dice Index is a measure of the similarity of two sets. Sets here are represented as lists, and duplicates are ignored. An optional third argument provides the set-equality function (which defaults to equal?).

#|kawa:47|# (sorenson-dice-index '(1 2 3 4 5) '(3 4 5 6 7))
3/5
#|kawa:48|# (sorenson-dice-index '("a" "b" "c") '("A" "B" "D"))
0
#|kawa:49|# (sorenson-dice-index '("a" "b" "c") '("A" "B" "D") string-ci=?)
2/3

standard-deviation

standard-deviation of a list of data:

#|kawa:30|# (standard-deviation '(1 2 3 4 5))
1.5811388300841898

Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it.

standard-error-of-the-mean

standard-error-of-the-mean describes the mean of the sampling distribution.

#|kawa:35|# (standard-error-of-the-mean '(1 2 3 4 5))
0.7071067811865476

variance

variance of a list of data:

#|kawa:31|# (variance '(1 2 3 4 5))
5/2

Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it.

(robin text)

The text library contains a collection of functions for working with strings or text documents, including similarity measures, a stemmer and layout.

daitch-mokotoff-soundex

The Daitch-Mokotoff Soundex algorithm is a variant of the Russell Soundex algorithm, designed to work better for Slavic and Yiddish names. The implementation here uses the table from http://www.jewishgen.org/InfoFiles/soundex.html.

#|kawa:6|# (daitch-mokotoff-soundex "LONDON")
863600
#|kawa:9|# (daitch-mokotoff-soundex "LEWINSKY")
876450
#|kawa:10|# (daitch-mokotoff-soundex "LEVINSKI")
876450

For some words, multiple codes are possible - pass an optional second argument 'all to get a list of codes:

#|kawa:2|# (daitch-mokotoff-soundex "auerbach")
097500
#|kawa:3|# (daitch-mokotoff-soundex "auerbach" 'all)
(097500 097400)

hamming-distance

The hamming-distance is the number of mismatched items between two equal-length sequences. The hamming-distance function takes two arguments and an optional comparison procedure. The sequences can be strings, lists, vectors or bytevectors. The comparison procedure defaults to char=? for strings, = for bytevectors, and equal? for everything else.

#|kawa:2|# (hamming-distance "This string" "that strong")
4
#|kawa:3|# (hamming-distance "This string" "that strong" char-ci=?)
3
#|kawa:4|# (hamming-distance #(1 2 3 4) #(0 2 3 5))
2

levenshtein-distance

The levenshtein-distance counts the minimum number of insertions/deletions/substitutions to convert one sequence into another. The levenshtein-distance function takes two arguments and an optional comparison procedure. The sequences can be strings, lists, vectors or bytevectors. The comparison procedure defaults to char=? for strings, = for bytevectors, and equal? for everything else.

#|kawa:2|# (levenshtein-distance "sitting" "kitten")
3
#|kawa:3|# (levenshtein-distance "Saturday" "sunday")
4
#|kawa:4|# (levenshtein-distance "Saturday" "sunday" char-ci=?)
3

metaphone

The Metaphone Algorithm was created as an improvement on Soundex, better taking account of variations in English pronounciation. The algorithm was created by Lawrence Philips and published in "Computer Language" December 1990 issue. There have been many published variants of Metaphone. A summary can be found at http://aspell.net/metaphone/.

The metaphone function simply takes a word to convert into a coded representation of its sound:

#|kawa:2|# (metaphone "smith")
SM0
#|kawa:3|# (metaphone "smythe")
SM0
#|kawa:4|# (metaphone "lewinsky")
LWNSK
#|kawa:5|# (metaphone "levinski")
LFNSK

Note, the output code is all upper-case letters, with "0" standing in for "TH".

optimal-string-alignment-distance

optimal-string-alignment-distance is a modification of the Levenshtein distance to include transpositions as well as deletions, insertions or substitutions. A transposition is where two characters have been swapped, such as when typing too quiclky. The optimal-string-alignment-distance function takes two arguments and an optional comparison procedure. The sequences can be strings, lists, vectors or bytevectors. The comparison procedure defaults to char=? for strings, = for bytevectors, and equal? for everything else.

> (levenshtein-distance "kitten" "sitting")
3
> (optimal-string-alignment-distance "kitten" "sitting")
3
> (levenshtein-distance "this string" "that strnig")
4
> (optimal-string-alignment-distance "this string" "that strnig")
3

Notice the difference between the two algorithms in the last case, where "n" and "i" have been transposed.

porter-stem

porter-stem is an implementation of the well-known Porter Stemming Algorithm], for reducing words to a base form. More details of the algorithm are at https://tartarus.org/martin/PorterStemmer/

The function simply takes the word to change, and returns the stemmed form:

> (porter-stem "running")
"run"
> (porter-stem "apples")
"appl"
> (porter-stem "apple")
"appl"
> (porter-stem "approximation")
"approxim"
> (porter-stem "sympathize")
"sympath"
> (porter-stem "sympathise")
"sympathis"

russell-soundex

russell-soundex is the same as soundex, exported from (slib soundex)

sorenson-dice-similarity

sorenson-dice-similarity returns a measure of how similar two strings are, based on n-grams of characters. An optional third argument provides the number of characters in the n-grams, which defaults to 2:

> (sorenson-dice-similarity "rabbit" "racket")
1/5
> (sorenson-dice-similarity "sympathize" "sympthise")
10/17
> (sorenson-dice-similarity "sympathize" "sympthise" 1)
8/9
> (sorenson-dice-similarity "sympathize" "sympthise" 3)
2/5

soundex

soundex is exported from (slib soundex).

string→n-grams

string→n-grams separates a string into overlapping groups of n letters.

> (string->n-grams "ABCDE" 1)
("A" "B" "C" "D" "E")
> (string->n-grams "ABCDE" 3)
("ABC" "BCD" "CDE")

For n greater than the length of the string, the string itself is returned. If n is less than 1, an error is raised.

words→with-commas

words→with-commas is a function taking a list of strings and adding commas in between the items up to the last item, which is preceded by an "and". For example:

> (words->with-commas '())
""
> (words->with-commas '("apple"))
"apple"
> (words->with-commas '("apple" "banana"))
"apple and banana"
> (words->with-commas '("apple" "banana" "chikoo"))
"apple, banana and chikoo"
> (words->with-commas '("apple" "banana" "chikoo" "damson"))
"apple, banana, chikoo and damson"
> (words->with-commas '("apple" "banana" "chikoo" "damson") #t)
"apple, banana, chikoo, and damson"

An optional second argument controls whether the final "and" should be preceded by a comma. The default is not to have the comma.

word-wrap

word-wrap takes two arguments, a string and a target width. It returns a list of strings, each item in the list representing a line of text. Each line is formatted so words do not go beyond the target width. The algorithm is a simple greedy algorithm. If a single word is longer than the target width, it is allowed to overlap.

For example, setting:

(define *text* "Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary. Scheme demonstrates that a very small number of rules for forming expressions, with no restrictions on how they are composed, suffice to form a practical and efficient programming language that is flexible enough to support most of the major programming paradigms in use today.")

We can wrap to a width of 50 characters using:

> (word-wrap *text* 50)
Programming languages should be designed not by
piling feature on top of feature, but by removing
the weaknesses and restrictions that make
additional features appear necessary. Scheme
demonstrates that a very small number of rules for
forming expressions, with no restrictions on how
they are composed, suffice to form a practical and
efficient programming language that is flexible
enough to support most of the major programming
paradigms in use today.

and we can wrap to a width of 60 characters using:

> (word-wrap *text* 60)
Programming languages should be designed not by piling
feature on top of feature, but by removing the weaknesses
and restrictions that make additional features appear
necessary. Scheme demonstrates that a very small number of
rules for forming expressions, with no restrictions on how
they are composed, suffice to form a practical and efficient
programming language that is flexible enough to support most
of the major programming paradigms in use today.