This is a rewrite of *Text Classification with jRuby and Weka* using Kawa Scheme. To keep this note self-contained, I have repeated most of the description.
For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.
So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.
The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)
First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. So "Address: 3 High Street, London." becomes the four words "address high street london".
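To make this cleaning rule concrete, here is a small Python sketch of the same idea (illustrative only; the script itself does this with Kawa's `(kawa regex)` library):

```python
import re

def clean(text):
    # Replace every non-alphabetic character with a space, then
    # lowercase and collapse runs of whitespace into single spaces.
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(words)

print(clean("Address: 3 High Street, London."))  # address high street london
```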
Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that different words like "walks" and "walking" are reduced to their stem "walk"; stemming increases the number of words in different documents which match.
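In Python this simplification might be sketched as follows; the stop-list here is a tiny made-up sample, and `crude_stem` is far simpler than the Lovins stemmer the script uses, but the shape of the step is the same:

```python
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}  # tiny illustrative sample

def crude_stem(word):
    # Strip a couple of common suffixes; a real stemmer (e.g. Lovins
    # or Porter) applies a much larger, carefully ordered rule set.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def simplify(words):
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

print(simplify(["the", "dog", "walks", "and", "walking"]))  # ['dog', 'walk', 'walk']
```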
Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).
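A Python sketch of this representation, with an invented two-document example; each instance becomes a 0/1 vector over the combined vocabulary, paired with its class label:

```python
def to_instances(documents):
    # documents: list of (word-list, class-label) pairs
    vocabulary = sorted({w for words, _ in documents for w in words})
    instances = []
    for words, label in documents:
        present = set(words)
        instances.append(([1 if w in present else 0 for w in vocabulary], label))
    return vocabulary, instances

vocab, insts = to_instances([
    (["film", "star", "award"], "entertainment"),
    (["chip", "software", "award"], "tech"),
])
print(vocab)     # ['award', 'chip', 'film', 'software', 'star']
print(insts[0])  # ([1, 0, 1, 0, 1], 'entertainment')
```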
Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size.
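One common approach, and the one the script uses via Weka's `CorrelationAttributeEval`, is to rank attributes by their correlation with the class and keep the top k. A rough Python sketch of that idea (not Weka's exact computation, and with made-up data):

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_attributes(columns, labels, k):
    # columns: {attribute-name: 0/1 value per instance}; labels: 0/1 class per instance
    ranked = sorted(columns, key=lambda a: abs(pearson(columns[a], labels)), reverse=True)
    return ranked[:k]

cols = {"award": [1, 1, 0, 0], "chip": [0, 0, 1, 1], "star": [1, 0, 1, 0]}
print(select_attributes(cols, [0, 0, 1, 1], 2))  # ['award', 'chip']
```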
Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each of these algorithms, 10-fold cross validation is used to derive an overall accuracy.
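The bookkeeping behind k-fold cross-validation can be sketched in Python. This interleaved split is only to show the idea; Weka's `trainCV`/`testCV` carve the (shuffled) data into roughly contiguous folds instead:

```python
def cross_validation_folds(instances, k):
    # Fold i holds out every k-th instance (starting at i) as the test
    # set and trains on the rest, so each instance is tested exactly once.
    for i in range(k):
        test = [x for j, x in enumerate(instances) if j % k == i]
        train = [x for j, x in enumerate(instances) if j % k != i]
        yield train, test

folds = list(cross_validation_folds(list(range(10)), 5))
print(folds[0])  # ([1, 2, 3, 4, 6, 7, 8, 9], [0, 5])
```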
The following code works with Kawa version 3.1.1, and requires `(robin confusion-matrix)` from r7rs-libs and Weka 3.9.6.
The script follows. It is actually one file, but broken up with some explanations.
At the start of the script, import the required classes. Most of these are from Weka, but we also need the `(robin confusion-matrix)` library, for convenient statistics on the results, and `(kawa regex)`, for regular expressions.
```scheme
(import (class weka.attributeSelection CorrelationAttributeEval Ranker)
        (class weka.classifiers.bayes NaiveBayes)
        (class weka.classifiers.functions SMO)
        (class weka.classifiers.trees J48)
        (class weka.core.converters ArffSaver TextDirectoryLoader)
        (class weka.core.stemmers LovinsStemmer)
        (class weka.core.stopwords Rainbow)
        (class weka.core.tokenizers WordTokenizer)
        (class weka.filters.supervised.attribute AttributeSelection)
        (class weka.filters.unsupervised.attribute StringToWordVector)
        (robin confusion-matrix)
        (kawa regex))
```
The following procedure was created because our pre-processing steps apply two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter the current batch has finished, and then retrieve the instances. Notice that the output result may have a different structure (number and type of attributes) to the input instances. Kawa conveniently enables us to use `for-each` on the Weka `Instances` object, as a kind of sequence.
```scheme
(define (apply-filter instances filter-fn)
  (filter-fn:setInputFormat instances)
  (for-each filter-fn:input instances)                ; <1>
  (filter-fn:batchFinished)
  (let ((result (filter-fn:getOutputFormat)))
    (do ((instance (filter-fn:output) (filter-fn:output)))
        ((eq? #!null instance) result)                ; <2>
      (result:add instance))))
```
1. Apply a Java method to a Java collection!
2. Java nulls do not respond to `null?`.
The `pre-process` procedure covers the first four(!) steps described above.
Weka provides `TextDirectoryLoader` to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).
Step 1 is done using a regular expression, to replace all non-alphabetic characters with spaces. Notice that the strings have to be copied: the strings stored in the instance are Java strings, and so immutable. We need a mutable version to modify with the regular expression, and `string-copy` converts the immutable string to a mutable one.
Steps 2-3 are done using a `StringToWordVector` filter. In this filter, I set the stemmer and stopwords handlers, tell it to convert the text to lower case, and tokenise the string as words (rather than character sequences). Setting `outputWordCounts` to false means the values will be 1 or 0, not actual word counts.
Step 4 is achieved using a second filter, `AttributeSelection`, with `CorrelationAttributeEval` as its evaluator and a `Ranker` search to pick the 300 most predictive attributes.
```scheme
(define (pre-process text-directory)
  (let ((loader (TextDirectoryLoader)))
    (loader:setSource (java.io.File text-directory))
    (let ((instances (loader:getDataSet)))
      ; remove numbers/punctuation - step 1
      (for-each (lambda (instance)
                  (let ((text (string-copy (instance:stringValue 0)))) ; the text is in the first attribute
                    (regex-replace* #/[^a-zA-Z]/ text " ")             ; remove non ASCII-letters
                    (instance:setValue 0 text)))
                instances)
      ; turn into vector of words, applying filters - steps 2 & 3
      (let ((string->words (StringToWordVector)))
        (string->words:setLowerCaseTokens #t)
        (string->words:setOutputWordCounts #f)
        (string->words:setStemmer (LovinsStemmer))
        (string->words:setStopwordsHandler (Rainbow))
        (string->words:setTokenizer (WordTokenizer))
        ; -- apply the filter
        (set! instances (apply-filter instances string->words)))
      ; identify the class label
      (instances:setClassIndex 0)
      ; reduce number of attributes to 300 - step 4
      (let ((selector (AttributeSelection))
            (ranker (Ranker)))
        (selector:setEvaluator (CorrelationAttributeEval))
        (ranker:setNumToSelect 300)
        (selector:setSearch ranker)
        ; -- apply the filter
        (set! instances (apply-filter instances selector)))
      ; randomise order of data
      (instances:randomize (java.util.Random))
      instances)))
```
Step 5 is the task of the `evaluate-classifier` procedure, used to test a given classification algorithm. Weka provides methods on instances to access train/test sets for k-fold cross-validation, so we use those to build and evaluate a classifier for each fold.
```scheme
(define (evaluate-classifier classifier-name classifier-constructor instances k)
  (define (number->class-label value)                        ; <1>
    (if (> value 0.5) 'positive 'negative))
  (let ((cm (make-confusion-matrix)))                        ; <2>
    (do ((i 0 (+ 1 i)))                                      ; <3>
        ((= k i)                    ; display results at end   <4>
         (format #t "Classifier: ~a~&" classifier-name)
         (format #t " -- Precision ~,3f~&" (precision cm))
         (format #t " -- Recall ~,3f~&" (recall cm))
         (format #t " -- Geometric mean ~,3f~&" (geometric-mean cm)))
      (let ((model (classifier-constructor))                 ; <5>
            (train (instances:trainCV k i))
            (test (instances:testCV k i)))
        (model:buildClassifier train)                        ; <6>
        (for-each (lambda (instance)                         ; <7>
                    (confusion-matrix-add                    ; <8>
                      cm
                      (number->class-label (instance:classValue))              ; actual class
                      (number->class-label (model:classifyInstance instance))  ; predicted class
                      ))
                  test)))))
```
1. Converts the numeric class label into a positive/negative symbol.
2. A confusion matrix is used to store the results.
3. Loops through each fold.
4. After checking all folds, displays summary statistics from the confusion matrix.
5. Creates a new model on each fold, ...
6. ... trains the model on the training dataset for this fold, ...
7. ... and tests the model on the test dataset for this fold.
8. Results are accumulated in the confusion matrix, from the actual and predicted class.
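For reference, here is a Python sketch of how such statistics come out of the counts in a two-class confusion matrix. Taking the geometric mean as sqrt(precision × recall) is an assumption about the `(robin confusion-matrix)` library, though it is consistent with the figures in the output below:

```python
from math import sqrt

def metrics(tp, fp, fn):
    # tp/fp/fn: true positive, false positive, false negative counts
    precision = tp / (tp + fp)  # of instances predicted positive, how many were positive
    recall = tp / (tp + fn)     # of actually positive instances, how many were found
    # Assumed definition: geometric mean of precision and recall.
    return precision, recall, sqrt(precision * recall)

p, r, g = metrics(tp=50, fp=2, fn=1)
print(round(p, 3), round(r, 3), round(g, 3))  # 0.962 0.98 0.971
```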
The final step is to create some actual data and build some classification models. Notice how the Weka class names are passed to `evaluate-classifier`, which uses them to create the classifiers.
```scheme
(let ((data (pre-process "bbc/")))
  (evaluate-classifier "Decision Tree" J48 data 10)
  (evaluate-classifier "Naive Bayes" NaiveBayes data 10)
  (evaluate-classifier "Support Vector Machine" SMO data 10))
```
On my system, the script runs through in about 12 seconds. The output is:
```
> java --add-opens java.base/java.lang=ALL-UNNAMED -cp "weka.jar;kawa.jar;r7rs-libs.jar" kawa.repl --no-warn-unknown-member .\text-classification.scm
Classifier: Decision Tree
 -- Precision 0.957
 -- Recall 0.940
 -- Geometric mean 0.948
Classifier: Naive Bayes
 -- Precision 0.971
 -- Recall 0.985
 -- Geometric mean 0.977
Classifier: Support Vector Machine
 -- Precision 0.990
 -- Recall 0.993
 -- Geometric mean 0.991
```
The `--add-opens` flag is useful on later Java versions to suppress an error caused, I believe, by changes in reflection.