Some examples of the libraries and procedures in the robin namespace of r7rs-libs.
(robin abbrev)
Create unique abbreviations for a list of strings. This library is a port of the equivalent Ruby library.
abbrev
Input:
-
a list of strings
-
an optional prefix
Output:
-
an association list mapping unique abbreviations to its matching string
For example:
gosh[r7rs.user]$ (import (robin abbrev))
gosh[r7rs.user]$ (abbrev '("scheme" "scala" "lisp"))
(("lisp" . "lisp") ("lis" . "lisp") ("li" . "lisp") ("l" . "lisp")
("scala" . "scala") ("scal" . "scala") ("sca" . "scala") ("scheme" . "scheme")
("schem" . "scheme") ("sche" . "scheme") ("sch" . "scheme"))
The first line shows that any prefix to "lisp" is an acceptable, unique abbreviation: because no other word in the list of strings starts with the letter "l". For Scala and Scheme, the unique abbreviations start with three letters, as "s" and "sc" are prefixes to both.
The optional prefix will only include words that have that prefix, allowing us to select a subset of the strings to generate abbreviations for. For example:
gosh[r7rs.user]$ (abbrev '("scheme" "scala" "lisp") "s")
(("scala" . "scala") ("scal" . "scala") ("sca" . "scala") ("scheme" . "scheme")
("schem" . "scheme") ("sche" . "scheme") ("sch" . "scheme"))
With the "s" prefix included, "lisp" is no longer in the output.
As a simple application, the abbreviations can be used as a word or command expander:
gosh[r7rs.user]$ (define shortcuts (abbrev '("scheme" "scala" "lisp")))
shortcuts
gosh[r7rs.user]$ (cdr (assoc "l" shortcuts))
"lisp"
gosh[r7rs.user]$ (cdr (assoc "sch" shortcuts))
"scheme"
(robin confusion-matrix)
A confusion matrix represents "the relative frequencies with which each of a number of stimuli is mistaken for each of the others by a person in a task requiring recognition or identification of stimuli" (R. Colman, A Dictionary of Psychology, 2008). Each row represents the predicted label of an instance, and each column represents the observed label of that instance. Numbers at each (row, column) reflect the total number of instances of predicted label "row" which were observed as having label "column".
A two-class example is:
Observed Observed | Positive Negative | Predicted ------------------------------+------------ a b | Positive c d | Negative
Here the value:
-
a - is the number of true positives (those predicted positive and observed positive)
-
b - is the number of false negatives (those predicted positive but observed negative)
-
c - is the number of false positives (those predicted negative but observed positive)
-
d - is the number of true negatives (those predicted negative and observed negative)
From this table we can calculate statistics like:
-
true positive rate - a/(a+b)
-
positive recall - a/(a+c)
As statistics can also be calculated for the negative label, e.g. the true
negative rate is d/(c+d), the functions below have an optional for-label
parameter, to specify which label they are calculated for: the default is to
report for the first label named when the matrix is created
The implementation supports confusion matrices with more than two labels. When more than two labels are in use, the statistics are calculated as if the first, or named, label were positive and all the other labels are grouped as if negative.
The following example creates a simple two-label confusion matrix, prints a few statistics and displays the table - notice that the package "confusion-matrix" has the alias "cm".
(import (scheme base)
(scheme write)
(robin confusion-matrix))
(define cm (make-confusion-matrix (list 'pos 'neg)))
(confusion-matrix-add cm 'pos 'pos 10)
(confusion-matrix-add cm 'pos 'neg 3)
(confusion-matrix-add cm 'neg 'neg 20)
(confusion-matrix-add cm 'neg 'pos 5)
(display "Confusion Matrix") (newline) (newline)
(confusion-matrix-display cm)
(display "Precision: ") (display (inexact (precision cm))) (newline)
(display "Recall: ") (display (inexact (recall cm))) (newline)
(display "Matthews-Correlation: ")
(display (inexact (matthews-correlation cm)))
(newline)
which outputs:
> gosh -I. .\robin-examples\confusion-matrix-eg.sps Confusion Matrix Observed pos neg | Predicted -----------+----------- 10 3 | pos 5 20 | neg Precision: 0.6666666666666666 Recall: 0.7692307692307693 Matthews-Correlation: 0.5524850114241865
(robin csv)
A CSV reader/writer library. RFC4180 compliant - http://tools.ietf.org/html/rfc4180 - but allowing:
-
choice of separator character, e.g. tab instead of comma
-
choice of quote character, e.g. single instead of double-quote
-
"CR LF", "CR" or "LF" are all accepted as line endings when reading
The CSV reader/writer uses a standard input or output port. Records are
retrieved or written as lists of field values. Fields are always read as
strings. Fields are written using display
unless they are strings needing
quoting, when they are written character by character, or symbols, which are
first converted to strings.
make-csv-reader
Creates a csv-reader object:
-
no arguments: uses current input port, comma as separator and double-quote for quote character
-
one argument: specifies input port
-
two arguments: specifies input port and separator character
-
three arguments: specifies input port, separator character and quote character
csv-read
Reads a single record (list of fields) from given reader, or current-input-port if no reader given.
csv-read-all
Reads a list of records (list of fields) from given reader, or current-input-port if no reader given.
$ sash -r7 -L . sash[r7rs]> (import (robin csv) (scheme file)) sash[r7rs]> (with-input-from-file "data.csv" (lambda () (csv-read-all))) (("123" "field 2" "456") ("789" "see, comma" "12")) sash[r7rs]>
make-csv-writer
Creates a csv-writer object:
-
no arguments: uses current-output-port, comma as separator and double-quote for quote character
-
one argument: specifies output port
-
two arguments: specifies output port and separator character
-
three arguments: specifies output port, separator character and quote character
csv-write
Writes a single record (list of fields) in CSV format:
-
one argument: the record, writes to the current-output-port with comma as separator and double-quote for quote character
-
two arguments: the record and a csv-writer object
csv-write-all
Writes a list of records in CSV format:
-
one argument: the records, writes to the current-output-port with comma as separator and double-quote for quote character
-
two arguments: the records and a csv-writer object
$ sash -r7 -L . sash[r7rs]> (import (robin csv) (scheme file)) sash[r7rs]> (with-output-to-file "data.csv" (lambda () (csv-write-all '((123 "field 2" 456) (789 "see, comma" 012))))) sash[r7rs]> $ more data.csv 123,field 2,456 789,"see, comma",12
(robin disjoint-set)
A disjoint-set data structure holds items in distinct sets (or groups). Efficient procedures are provided for finding a representative of the set any item is contained in, and also for joining two sets together.
make-disjoint-set
make-disjoint-set
takes one or two parameters: a single comparator
or the equality function to use on terms, and a hash function (e.g. hash
, string-hash
or default-hash
).
The return value is a disjoint set.
disjoint-set?
disjoint-set?
checks if its argument is a disjoint-set instance or not, returning a boolean value.
disjoint-set:make
disjoint-set:make
places an item into the disjoint-set as its own group.
Takes two arguments: the disjoint set and the item. Return value is undefined.
disjoint-set:find
disjoint-set:find
takes two arguments, the disjoint set and an item, and an
optional default value. The function returns the representative item for the group that
the given item is in, or the default value if not present (default is 'item-not-found
if not provided).
disjoint-set:union
disjoint-set:union
takes three arguments, the disjoint set and two items. The
disjoint set is modified to combine the groups that the two items are in.
disjoint-set:size
disjoint-set:size
takes a disjoint set and returns the number of
distinct groups it contains.
Example: Kruskal’s Algorithm
This example illustrates how disjoint sets can be used in Kruskal’s algorithm to find a minimal spanning set. (The complete code is in "robin-examples/kruskal.sps")
(define (kruskal graph)
(let ((result '())
(nodes (delete-duplicates (append (map car graph) (map cadr graph)) eq?)))
; 1. make a disjoint set, with each node an item
(let ((ds (make-disjoint-set eq? hash-by-identity))) ;
(for-each (lambda (node) (disjoint-set:make ds node)) ;
nodes)
; 2. set 'links' holds all the links in graph, sorted with the shortest first
(let loop ((links (sort graph (lambda (a b) (< (caddr a) (caddr b))))))
; 3. if links non-empty and size > 1
(when (and (not (null? links))
(> (disjoint-set:size ds) 1)) ;
(let ((link (car links)))
(unless (eq? (disjoint-set:find ds (car link)) ;
(disjoint-set:find ds (cadr link)))
(set! result (cons link result))
(disjoint-set:union ds (car link) (cadr link)))) ;
(loop (cdr links)))))
(reverse result)))
;; using it
(let* ((graph '((a b 3) (a e 1) (b c 5) (b e 4) (c d 2) (c e 6) (d e 7)))
(res (kruskal graph)))
(format #t "MST has ~a links~&" (length res))
(format #t "~{ : ~a~&~}" res) ;
(format #t "Total length: ~a~&" (fold + 0 (map caddr res))))
Creates a disjoint set using eq? for equality and hash-by-identity for the hash function,
because individual items are symbols. |
|
Each node is added to the disjoint set, initially in its own group. | |
The number of groups in the disjoint set tells us how many links remain to be added. | |
Look for a link where its end points are in different groups: tested by finding the representative item of each end point’s group in the disjoint set. | |
When we add a link, connect the groups in the disjoint set. | |
Note use of format to print a list of items |
Output:
MST has 4 links : (a e 1) : (c d 2) : (a b 3) : (b c 5) Total length: 11
(robin simulated-annealing)
Simulated Annealing is a popular optimisation technique, first introduced by:
-
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1092.
simulated annealing
Three functions must be provided:
- make-instance
-
create a new random instance. This function takes an optional argument, which is the previous instance.
- evaluate-instance
-
given an instance, return a real number estimating the quality of that instance, with 0.0 being the best value.
- num-iterations
-
the number of internal cycles to perform.
An optional fourth function can be provided to stop the algorithm before all the iterations have been performed:
- can-stop?
-
function that accepts the result of evaluate-instance and returns #t if the algorithm should stop.
An example:
Find value for x so that x2 - 7x + 10 = 0, with x in range [0,10].
The make-instance
function must work differently depending on whether it is given an optional
argument. Without an argument, it should simply return a random number in the range [0,10].
With an argument, it should return a neighbour, such as a small delta from the given value:
(define make-instance
(case-lambda
(() ;
(* (random-real) 10.0))
((x) ;
(let ((new-x (+ x (* 0.02 (random-real)) -0.01)))
(cond ((< new-x 0) ;
0.0)
((> new-x 10)
10.0)
(else
new-x))))))
No arguments - return a random number. | |
One argument - return a neighbour. | |
Make sure the new value is within the required range. |
The evaluate-instance
function should compute the value of the quadratic, returning a
positive number with 0.0 representing the best value:
(define (evaluate-instance x)
(let ((y = (+ (* x x) (* -7 x) 10)))
(square y)))
Running the algorithm also requires a number of iterations:
(let ((result (simulated-annealing make-instance evaluate-instance 10000)))
(display "Final value of x: ") (display result) (newline))
Depending on the particular random values it chooses, the algorithm will return one of other of the two roots:
> gosh .\robin-examples\simulated-annealing-eg1.sps Final value of x: 2.0092664951926014 > gosh .\robin-examples\simulated-annealing-eg1.sps Final value of x: 5.002658285287288
The file "robin-examples/simulated-annealing-eg1.sps" also uses (slib
charplot)
to plot the performance graph in the terminal. An example:
evaluation _____________________________________________ 80|-* | | * | 70|-* | | ** | 60|-:* | | :* | 50|-:** | | :** | 40|-: * | | : * | 30|-: ** | | : ** | 20|-: * | | : ** | 10|-: ** | | : ******* | 0|-:----------******************************---| |_:____.____:____.____:____.____:____.____:___| cycle numbe 0 2500 5000 7500 10000
(robin srfi64-utils)
Some convenience functions for unit testing with SRFI 64.
test-all-equal
test-all-equal
accepts two arguments: a function to test and an
association list of ( argument . result )
pairs. The function effectively
calls (test-equal result (function argument))
for all pairs.
(test-all-equal daitch-mokotoff-soundex '(("MANHEIM" . "665600")
("MINTZ" . "664000")
("TOPF" . "370000")
("AUERBACH" . "097500")))
test-compare
test-compare
checks if two given items satisfy the given comparison procedure.
This is useful for testing equality of more complex data. For example, we might want to check if just the first item of sublists are the same:
sash> (import (scheme list))
#<unspecified>
sash> (define (equal-cars? i1 i2) (every (lambda (l1 l2) (equal? (car l1) (car l2))) i1 i2))
#<unspecified>
sash> (test-compare equal-cars? '((1 2) (3 4)) '((1 5) (3 6)))
#<unspecified>
sash> (test-compare equal-cars? '((1 2) (3 4)) '((1 5) (4 6)))
FAIL '((1 5) (4 6))
#<unspecified>
test-no-error
test-no-error
is used to confirm that a piece of code has not thrown an error.
In the following example, the first line raises an error
, and so the test fails.
The second line does not raise an error
, and so the test passes.
sash> (test-no-error (if (zero? 0) (error "") #t))
FAIL #f
#<unspecified>
sash> (test-no-error (if (zero? 1) (error "") #t))
#<unspecified>
(robin statistics)
A library of functions to compute statistical or other information about sets of data.
arithmetic-mean
Same as mean
.
arithmetic-geometric-mean
The arithmetic-geometric-mean
is defined as the
limit of a sequence of two terms, replacing the first term with
their arithmetic mean and the second with their geometric mean,
until the two are equal. The function is useful in mathematical
applications, see description at
http://mathworld.wolfram.com/Arithmetic-GeometricMean.html
The function takes two arguments and an optional tolerance:
sash> (import (robin statistics))
#<unspecified>
sash> (arithmetic-geometric-mean 1 (/ 1 (sqrt 2)))
0.8472130848351929
sash> (arithmetic-geometric-mean 1 (/ 1 (sqrt 2)) 1e-12)
0.8472130847939792
coefficient-of-variation
coefficient-of-variation
describes the amount of spread in a dataset relative to its mean.
In the following example, both lists have the same mean (3), but the spread is greater in the second.
sash> (coefficient-of-variation '(1 2 3 4 5))
52.704627669472984
sash> (coefficient-of-variation '(-2 2 3 4 8))
120.18504251546631
combinations
combinations
returns the number of ways to take k things from n without replacement,
when the order does not matter. The form is (combinations n k)
sash> (combinations 5 2)
10
sash> (combinations 1000 500)
270288240945436569515614693625975275496152008446548287007392875106625428705522193898612483924502370165362606085021546104802209750050679917549894219699518475423665484263751733356162464079737887344364574161119497604571044985756287880514600994219426752366915856603136862602484428109296905863799821216320
geometric-mean
The geometric mean takes the nth root of the
product of the numbers, where n
is the size of the data. This yields the
central number within a geometric progression.
sash> (geometric-mean '(1 2 3 4 5))
2.605171084697352
sash> (geometric-mean '(1 3 9 27 81))
9.000000000000002
sash> (geometric-mean '(1))
1
sash> (geometric-mean '(1 3))
1.7320508075688772
harmonic-mean
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
sash> (harmonic-mean '(1 2 3 4 5))
300/137
sash> (inexact (harmonic-mean '(1 2 3 4 5)))
2.18978102189781
jaccard-distance
The Jaccard Distance] is simply (1 - (jaccard-index))
, and measures how dissimilar
the two sets are.
sash> (jaccard-distance '(1 2 3 4 5) '(3 4 5 6 7))
4/7
jaccard-index
The Jaccard Index] returns a number from 0 to 1 indicating how similar
two sets are. Sets here are represented as lists, and duplicates are ignored. An optional
third argument provides the set-equality function (which defaults to equal?
).
sash> (jaccard-index '(1 2 3 4 5) '(3 4 5 6 7))
3/7
sash> (jaccard-index '(1 2 3 4 5) '(3 4 5 1 2 2 1 1))
1
sash> (jaccard-index '("a" "b" "c") '("A" "B" "F"))
0
sash> (jaccard-index '("a" "b" "c") '("A" "B" "D") string-ci=?)
1/2
mean
Computes the arithmetic mean
of a list of numbers.
This is the familiar "add up all the numbers, divide by the total" average.
#|kawa:2|# (mean '(1 2 3 4 5))
3
median
The median
is the central number, when all the numbers are put in order
(or, when there are an even number of numbers, the average of the two middle numbers).
#|kawa:8|# (median '(5 3 1 2 4))
3
mode
The mode
is the number appearing most often. The mode
function returns
two values: a list of the most frequent number or numbers, and their number of occurrences.
#|kawa:9|# (mode '(5 3 1 2 4))
(1 2 3 4 5) 1
#|kawa:11|# (let-values (((items count) (mode '(5 3 5 1 2 4))))
#|.....12|# (display items) (newline) (display count) (newline))
(5)
2
percentile
percentile
takes two arguments, a list of data and a percent
value, and returns the item in the list at that percentile position along the
list, or an average of the nearest two values. percent
values must be in the
range 1 to 99, inclusive. Note that a percent
of 50 corresponds to the
median
.
#|kawa:13|# (percentile '(1 2 3 4 5) 50)
3
#|kawa:14|# (percentile '(1 2 3 4 5) 75)
4
#|kawa:16|# (percentile '(1 2 3 4 5) 99)
5
#|kawa:19|# (percentile '(1 2 3 4 5) 85)
5
#|kawa:20|# (percentile '(1 2 3 4 5) 80)
9/2
perlin-noise
perlin-noise
is a function returning a value computed using a gradient function.
Perlin Noise was invented by Ken Perlin and has been widely
applied in computer graphics. The implementation here is a conversion of the
Java reference implementation.
perlin-noise
takes three numbers as input, and returns a single number (all numbers inexact).
sash> (perlin-noise 12.3 4.2 5.7)
0.4085747160714242
permutations
permutations
returns the number of ways to take k things from n without replacement, when the order matters.
The form is (permutations n k)
#|kawa:4|# (permutations 5 2)
20
#|kawa:5|# (permutations 100 20)
1303995018204712451095685346159820800000
population-standard-deviation
population-standard-deviation
of a list of data:
#|kawa:28|# (population-standard-deviation '(1 2 3 4 5))
1.4142135623730951
Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it:
#|kawa:39|# (population-standard-deviation '(1 2 3 4 5) 3)
1.4142135623730951
population-variance
population-variance
of a list of data:
#|kawa:29|# (population-variance '(1 2 3 4 5))
2
Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it.
random-normal
random-normal
is used to return a random number normally distributed with
a given mean and standard deviation.
#|kawa:2|# (random-normal 0 10)
3017/1250
#|kawa:3|# (inexact (random-normal 0 10))
0.05948
#|kawa:4|# (inexact (random-normal 0 10))
0.48029
#|kawa:5|# (inexact (random-normal 0 10))
-3.83553
#|kawa:7|# (inexact (random-normal 20 10))
21.81053
#|kawa:8|# (inexact (random-normal 20 10))
15.54491
random-pick
random-pick
returns a random selection from a given list
#|kawa:20|# (random-pick '(1 2 3 4 5 6))
6
#|kawa:21|# (random-pick '(1 2 3 4 5 6))
3
#|kawa:22|# (random-pick '(1 2 3 4 5 6))
2
random-sample
random-sample
returns a random sample of n
chosen from a list of items,
without replacement
#|kawa:25|# (random-sample 3 '(1 2 3 4 5 6))
(1 4 5)
#|kawa:26|# (random-sample 3 '(1 2 3 4 5 6))
(6 5 1)
#|kawa:27|# (random-sample 3 '(1 2 3 4 5 6))
(2 3 6)
random-weighted-sample
random-weighted-sample
returns a random sample of n
chosen from a list of items,
without replacement, but with the items weighted by the given list of weights.
The following gives a heavy bias to choosing 1 and 6:
#|kawa:28|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(10 1 1 1 1 10))
(6 1 5)
#|kawa:29|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(10 1 1 1 1 10))
(1 6 4)
#|kawa:30|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(10 1 1 1 1 10))
(1 6 5)
The following is a lighter bias to choosing 1 and 4:
#|kawa:34|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(2 1 1 2 1 1))
(4 6 2)
#|kawa:35|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(2 1 1 2 1 1))
(5 4 3)
#|kawa:36|# (random-weighted-sample 3 '(1 2 3 4 5 6) '(2 1 1 2 1 1))
(1 3 2)
sign
sign
returns 1 for positive numbers, 0 for zero, -1 for negative numbers
#|kawa:36|# (sign 10)
1
#|kawa:37|# (sign 0)
0
#|kawa:38|# (sign -0.7)
-1
sorenson-dice-index
The Sorenson Dice Index is a
measure of the similarity of two sets. Sets here are represented as lists, and
duplicates are ignored. An optional third argument provides the set-equality
function (which defaults to equal?
).
#|kawa:47|# (sorenson-dice-index '(1 2 3 4 5) '(3 4 5 6 7))
3/5
#|kawa:48|# (sorenson-dice-index '("a" "b" "c") '("A" "B" "D"))
0
#|kawa:49|# (sorenson-dice-index '("a" "b" "c") '("A" "B" "D") string-ci=?)
2/3
standard-deviation
standard-deviation
of a list of data:
#|kawa:30|# (standard-deviation '(1 2 3 4 5))
1.5811388300841898
Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it.
standard-error-of-the-mean
standard-error-of-the-mean
describes the mean of the sampling distribution.
#|kawa:35|# (standard-error-of-the-mean '(1 2 3 4 5))
0.7071067811865476
variance
variance
of a list of data:
#|kawa:31|# (variance '(1 2 3 4 5))
5/2
Optionally, if you know the mean of your data, pass that as a second argument so the function does not have to recompute it.
(robin text)
The text library contains a collection of functions for working with strings or text documents, including similarity measures, a stemmer and layout.
daitch-mokotoff-soundex
The Daitch-Mokotoff Soundex algorithm is a variant of the Russell Soundex algorithm, designed to work better for Slavic and Yiddish names. The implementation here uses the table from http://www.jewishgen.org/InfoFiles/soundex.html.
#|kawa:6|# (daitch-mokotoff-soundex "LONDON")
863600
#|kawa:9|# (daitch-mokotoff-soundex "LEWINSKY")
876450
#|kawa:10|# (daitch-mokotoff-soundex "LEVINSKI")
876450
For some words, multiple codes are possible - pass an optional second argument 'all
to get a list of codes:
#|kawa:2|# (daitch-mokotoff-soundex "auerbach")
097500
#|kawa:3|# (daitch-mokotoff-soundex "auerbach" 'all)
(097500 097400)
hamming-distance
The hamming-distance
is the number of mismatched items between two equal-length sequences.
The hamming-distance
function takes two arguments and an optional comparison procedure. The sequences
can be strings, lists, vectors or bytevectors. The comparison procedure defaults to char=?
for strings,
=
for bytevectors, and equal?
for everything else.
#|kawa:2|# (hamming-distance "This string" "that strong")
4
#|kawa:3|# (hamming-distance "This string" "that strong" char-ci=?)
3
#|kawa:4|# (hamming-distance #(1 2 3 4) #(0 2 3 5))
2
levenshtein-distance
The levenshtein-distance
counts the minimum number of
insertions/deletions/substitutions to convert one sequence into another. The
levenshtein-distance
function takes two arguments and an optional comparison
procedure. The sequences can be strings, lists, vectors or bytevectors. The
comparison procedure defaults to char=?
for strings, =
for bytevectors, and
equal?
for everything else.
#|kawa:2|# (levenshtein-distance "sitting" "kitten")
3
#|kawa:3|# (levenshtein-distance "Saturday" "sunday")
4
#|kawa:4|# (levenshtein-distance "Saturday" "sunday" char-ci=?)
3
metaphone
The Metaphone Algorithm was created as an improvement on Soundex, better taking account of variations in English pronounciation. The algorithm was created by Lawrence Philips and published in "Computer Language" December 1990 issue. There have been many published variants of Metaphone. A summary can be found at http://aspell.net/metaphone/.
The metaphone
function simply takes a word to convert into a coded representation of its sound:
#|kawa:2|# (metaphone "smith")
SM0
#|kawa:3|# (metaphone "smythe")
SM0
#|kawa:4|# (metaphone "lewinsky")
LWNSK
#|kawa:5|# (metaphone "levinski")
LFNSK
Note, the output code is all upper-case letters, with "0" standing in for "TH".
optimal-string-alignment-distance
optimal-string-alignment-distance
is a modification of the
Levenshtein distance to include transpositions as well as deletions, insertions
or substitutions. A transposition is where two characters have been swapped,
such as when typing too quiclky. The optimal-string-alignment-distance
function takes two arguments and an optional comparison procedure. The
sequences can be strings, lists, vectors or bytevectors. The comparison
procedure defaults to char=?
for strings, =
for bytevectors, and equal?
for everything else.
> (levenshtein-distance "kitten" "sitting")
3
> (optimal-string-alignment-distance "kitten" "sitting")
3
> (levenshtein-distance "this string" "that strnig")
4
> (optimal-string-alignment-distance "this string" "that strnig")
3
Notice the difference between the two algorithms in the last case, where "n" and "i" have been transposed.
porter-stem
porter-stem
is an implementation of the well-known Porter Stemming Algorithm], for
reducing words to a base form. More details of the algorithm are at https://tartarus.org/martin/PorterStemmer/
The function simply takes the word to change, and returns the stemmed form:
> (porter-stem "running")
"run"
> (porter-stem "apples")
"appl"
> (porter-stem "apple")
"appl"
> (porter-stem "approximation")
"approxim"
> (porter-stem "sympathize")
"sympath"
> (porter-stem "sympathise")
"sympathis"
russell-soundex
russell-soundex
is the same as soundex
, exported from (slib soundex)
sorenson-dice-similarity
sorenson-dice-similarity
returns a measure of how similar two strings are, based
on n-grams of characters. An optional third argument provides the number
of characters in the n-grams, which defaults to 2:
> (sorenson-dice-similarity "rabbit" "racket")
1/5
> (sorenson-dice-similarity "sympathize" "sympthise")
10/17
> (sorenson-dice-similarity "sympathize" "sympthise" 1)
8/9
> (sorenson-dice-similarity "sympathize" "sympthise" 3)
2/5
soundex
soundex
is exported from (slib soundex)
.
string→n-grams
string→n-grams
separates a string into overlapping groups of n
letters.
> (string->n-grams "ABCDE" 1)
("A" "B" "C" "D" "E")
> (string->n-grams "ABCDE" 3)
("ABC" "BCD" "CDE")
For n
greater than the length of the string, the string itself is returned.
If n
is less than 1, an error is raised.
words→with-commas
words→with-commas
is a function taking a list of strings and
adding commas in between the items up to the last item, which is preceded
by an "and". For example:
> (words->with-commas '())
""
> (words->with-commas '("apple"))
"apple"
> (words->with-commas '("apple" "banana"))
"apple and banana"
> (words->with-commas '("apple" "banana" "chikoo"))
"apple, banana and chikoo"
> (words->with-commas '("apple" "banana" "chikoo" "damson"))
"apple, banana, chikoo and damson"
> (words->with-commas '("apple" "banana" "chikoo" "damson") #t)
"apple, banana, chikoo, and damson"
An optional second argument controls whether the final "and" should be preceded by a comma. The default is not to have the comma.
word-wrap
word-wrap
takes two arguments, a string and a target width. It
returns a list of strings, each item in the list representing a line of text.
Each line is formatted so words do not go beyond the target width. The
algorithm is a simple greedy algorithm. If a single word is longer than the
target width, it is allowed to overlap.
For example, setting:
(define *text* "Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary. Scheme demonstrates that a very small number of rules for forming expressions, with no restrictions on how they are composed, suffice to form a practical and efficient programming language that is flexible enough to support most of the major programming paradigms in use today.")
We can wrap to a width of 50 characters using:
> (word-wrap *text* 50)
Programming languages should be designed not by
piling feature on top of feature, but by removing
the weaknesses and restrictions that make
additional features appear necessary. Scheme
demonstrates that a very small number of rules for
forming expressions, with no restrictions on how
they are composed, suffice to form a practical and
efficient programming language that is flexible
enough to support most of the major programming
paradigms in use today.
and we can wrap to a width of 60 characters using:
> (word-wrap *text* 60)
Programming languages should be designed not by piling
feature on top of feature, but by removing the weaknesses
and restrictions that make additional features appear
necessary. Scheme demonstrates that a very small number of
rules for forming expressions, with no restrictions on how
they are composed, suffice to form a practical and efficient
programming language that is flexible enough to support most
of the major programming paradigms in use today.