2022-07-25: Text Classification with Kawa and Weka

This is a rewrite of "Text Classification with jRuby and Weka" using Kawa Scheme. To keep this note self-contained, I have repeated most of the description.

For data, I use the BBC dataset from Kaggle, selecting two classes: entertainment and tech.

So I have a parent folder, "bbc/", with two child folders, "entertainment" and "tech". Each file in the child folders is a text document, and will form an instance for the classifier to work on. The classes each have around 400 documents, and the total word count is 330,000.

The overall process can be divided into several steps. (Here, I keep things relatively simple, to investigate the API.)

First, the text must be converted into words only, so all numbers and punctuation are removed, and all letters are converted to lowercase. So "Address: 3 High Street, London." becomes the four words "address high street london".

Second, it is useful to simplify the words themselves. Common words (like "the", "a") are removed, using a stop-list. Also, stemming is applied so that related forms like "walks" and "walking" are reduced to their stem "walk"; this increases the number of words that match across different documents.

Third, the text files are converted into a representation where each text file is an instance made from the values for a number of attributes, and a class label. The class label has two values, one for "entertainment" and one for "tech", depending on which folder the text file is in. Each attribute represents one word, and has values 1 or 0 (whether the word for that attribute is present in this instance or not).

Fourth, the number of attributes is usually very large at this stage, because natural language uses a lot of words. An attribute selection technique reduces the number of attributes to a manageable size (300 in this script).

Fifth, build some classification models. Three algorithms are used: Naive Bayes, Decision Trees and Support Vector Machines. For each of these algorithms, 10-fold cross validation is used to derive an overall accuracy.
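The first three steps can be illustrated on a single sentence in plain R7RS Scheme, without Weka. This is only a minimal sketch: the helper names and the tiny stop-list are made up for illustration, stemming is left out, and Weka's own tokenizer and stop-list differ in detail.

```scheme
(import (scheme base) (scheme char) (scheme write))

;; Step 1: keep ASCII letters (lower-cased); every other character becomes a space.
(define (normalise-text str)
  (string-map (lambda (c)
                (if (or (and (char<=? #\a c) (char<=? c #\z))
                        (and (char<=? #\A c) (char<=? c #\Z)))
                    (char-downcase c)
                    #\space))
              str))

;; Split a string on spaces, dropping empty pieces.
(define (split-words str)
  (let loop ((chars (string->list str)) (current '()) (words '()))
    ;; words collected so far, plus the word being built (if any)
    (define (flush)
      (if (null? current)
          words
          (cons (list->string (reverse current)) words)))
    (cond ((null? chars) (reverse (flush)))
          ((char=? (car chars) #\space) (loop (cdr chars) '() (flush)))
          (else (loop (cdr chars) (cons (car chars) current) words)))))

;; Step 2 (stop-list only): a made-up stop-list, not Weka's Rainbow list.
(define stop-words '("the" "a" "is" "of"))

(define (remove-stop-words words)
  (let loop ((rest words) (kept '()))
    (cond ((null? rest) (reverse kept))
          ((member (car rest) stop-words) (loop (cdr rest) kept))
          (else (loop (cdr rest) (cons (car rest) kept))))))

;; Step 3: one 0/1 value per vocabulary word (1 = word present in the document).
(define (words->binary-vector vocabulary words)
  (map (lambda (term) (if (member term words) 1 0)) vocabulary))

(define example-words
  (remove-stop-words
    (split-words (normalise-text "Address: 3 High Street, London."))))

(display example-words)
(newline)
(display (words->binary-vector '("street" "film" "london") example-words))
(newline)
```

For the example sentence, the word list is the four words "address", "high", "street" and "london", and the vector for the vocabulary ("street" "film" "london") is (1 0 1).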

The following code works with Kawa version 3.1.1 and Weka 3.9.6, and requires `(robin confusion-matrix)` from r7rs-libs.

The script follows. It is actually one file, but broken up with some explanations.

At the start of the script, import the required classes. Most of these are from Weka, but we also need the (robin confusion-matrix) library, for convenient statistics on the results, and (kawa regex), for regular expressions.

(import (class weka.attributeSelection
               CorrelationAttributeEval Ranker)
        (class weka.classifiers.bayes
               NaiveBayes)
        (class weka.classifiers.functions
               SMO)
        (class weka.classifiers.trees
               J48)
        (class weka.core.converters
               ArffSaver TextDirectoryLoader)
        (class weka.core.stemmers
               LovinsStemmer)
        (class weka.core.stopwords
               Rainbow)
        (class weka.core.tokenizers
               WordTokenizer)
        (class weka.filters.supervised.attribute
               AttributeSelection)
        (class weka.filters.unsupervised.attribute
               StringToWordVector)
        (robin confusion-matrix)
        (kawa regex))

The following procedure exists because the pre-processing steps apply two of Weka's filters. Weka requires us to input each instance in turn to the filter, tell the filter that the current batch has finished, and then retrieve the output instances. Notice that the result may have a different structure (number and type of attributes) from the input instances. Kawa conveniently lets us use for-each on the Weka Instances object, treating it as a kind of sequence.

(define (apply-filter instances filter-fn)
  (filter-fn:setInputFormat instances)
  (for-each filter-fn:input instances)                      ; <1>
  (filter-fn:batchFinished)

  (let ((result (filter-fn:getOutputFormat)))
    (do ((instance (filter-fn:output) (filter-fn:output)))
      ((eq? #!null instance) result)                        ; <2>
      (result:add instance))))

<1> Apply a Java method to a Java collection!
<2> Java nulls do not respond to null?, so the result is compared against #!null using eq?.
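Weka also ships a static helper, weka.filters.Filter.useFilter, which runs the same input / batchFinished / output loop. A minimal sketch of an equivalent procedure follows, assuming the input format must still be set first. The name apply-filter-with-helper is made up here, and this variant was not part of the original script.

```scheme
;; Hypothetical alternative to apply-filter, relying on Weka's static helper.
;; Needs weka.jar on the classpath.
(define (apply-filter-with-helper instances filter-fn)
  (filter-fn:setInputFormat instances)                  ; fix the input structure first
  (weka.filters.Filter:useFilter instances filter-fn))  ; returns the filtered Instances
```

Writing the loop out by hand, as above, has the advantage of showing exactly what Weka expects from a caller.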

The pre-process procedure covers the first four(!) steps described above.

Weka provides TextDirectoryLoader to load the text documents from the two folders. This process leaves each instance with two attributes: one is the text of the document, and the second is its class label (the name of the child folder).

Step 1 is done using a regular expression, to replace all non-alphabetic characters with spaces. Notice that regex-replace* does not modify its argument: the strings stored in the instance are Java strings and so immutable, and the procedure returns a new string, which must be stored back in the instance.

Steps 2-3 are done using a StringToWordVector filter. In this filter, I set the stemmer and stopwords handlers, tell it to convert the text to lower case and tokenise the string as words (rather than character sequences). Setting outputWordCounts to false means the values will be 1 or 0, not actual word counts.

Step 4 is achieved using a second filter, AttributeSelection, which uses CorrelationAttributeEval to score each attribute, along with a ranking algorithm to pick the most predictive 300 attributes.

(define (pre-process text-directory)
  (let ((loader (TextDirectoryLoader)))
    (loader:setSource (java.io.File text-directory))
    (let ((instances (loader:getDataSet)))

      ; remove numbers/punctuation - step 1
      (for-each (lambda (instance)
                  ; the text is in the first attribute;
                  ; replace non ASCII-letters with spaces, keeping the returned string
                  (let ((text (regex-replace* #/[^a-zA-Z]/ (instance:stringValue 0) " ")))
                    (instance:setValue 0 text)))
                instances)

      ; turn into vector of words, applying filters - steps 2 & 3
      (let ((string->words (StringToWordVector)))
        (string->words:setLowerCaseTokens #t)
        (string->words:setOutputWordCounts #f)
        (string->words:setStemmer (LovinsStemmer))
        (string->words:setStopwordsHandler (Rainbow))
        (string->words:setTokenizer (WordTokenizer))
        ; -- apply the filter
        (set! instances (apply-filter instances string->words)))

      ; identify the class label
      (instances:setClassIndex 0)

      ; reduce number of attributes to 300 - step 4
      (let ((selector (AttributeSelection))
            (ranker (Ranker)))
        (selector:setEvaluator (CorrelationAttributeEval))
        (ranker:setNumToSelect 300)
        (selector:setSearch ranker)
        ; -- apply the filter
        (set! instances (apply-filter instances selector)))

      ; randomise order of data
      (instances:randomize (java.util.Random))

      instances)))

Step 5 is the task of the evaluate-classifier procedure, used to test a given classification algorithm. Weka provides methods on instances to access train/test sets for k-fold cross-validation, so we use those to build and evaluate a classifier for each fold.

(define (evaluate-classifier classifier-name classifier-constructor instances k)
  (define (number->class-label value)                                 ; <1>
    (if (> value 0.5) 'positive 'negative))
  ;
  (let ((cm (make-confusion-matrix)))                                 ; <2>
    (do ((i 0 (+ 1 i)))                                               ; <3>
      ((= k i) ; display results at end                               ; <4>
       (format #t "Classifier: ~a~&" classifier-name)
       (format #t " -- Precision      ~,3f~&" (precision cm))
       (format #t " -- Recall         ~,3f~&" (recall cm))
       (format #t " -- Geometric mean ~,3f~&" (geometric-mean cm)))
      ;
      (let ((model (classifier-constructor))                          ; <5>
            (train (instances:trainCV k i))
            (test (instances:testCV k i)))
        (model:buildClassifier train)                                 ; <6>
        (for-each (lambda (instance)                                  ; <7>
                    (confusion-matrix-add                             ; <8>
                      cm
                      (number->class-label (instance:classValue)) ; predicted class
                      (number->class-label (model:classifyInstance instance)) ; observed class
                      ))
                  test)))))

<1> Converts the numeric class label into a positive/negative symbol.
<2> A confusion matrix is used to store the results.
<3> Loops through each fold.
<4> After checking every fold, displays summary statistics from the confusion matrix.
<5> Creates a new model on each fold,
<6> ... trains the model on the training dataset for this fold,
<7> ... and tests the model on the test dataset for this fold.
<8> Results are accumulated in the confusion matrix, from the predicted and observed class.
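To make the fold arithmetic concrete, here is a simplified sketch in plain R7RS Scheme of splitting a list into training and test parts for fold i of k. It illustrates the idea only: cv-split and sublist are hypothetical helpers, and Weka's trainCV and testCV may order or size their folds differently.

```scheme
(import (scheme base) (scheme write))

;; Hypothetical helper: the len items of lst starting at position start.
(define (sublist lst start len)
  (let loop ((rest (list-tail lst start)) (n len) (acc '()))
    (if (= n 0)
        (reverse acc)
        (loop (cdr rest) (- n 1) (cons (car rest) acc)))))

;; Returns (train . test) for fold i (0-based) of k.
;; Folds are contiguous blocks; the first (n mod k) folds hold one extra item.
(define (cv-split items k i)
  (let* ((n (length items))
         (base (quotient n k))
         (extra (remainder n k))
         (len (if (< i extra) (+ base 1) base))
         (start (+ (* i base) (min i extra)))
         (test (sublist items start len))
         (train (append (sublist items 0 start)
                        (sublist items (+ start len) (- n start len)))))
    (cons train test)))

(define fold-0 (cv-split '(a b c d e) 2 0))
(display (car fold-0)) (newline)   ; training part
(display (cdr fold-0)) (newline)   ; test part
```

With five items and two folds, fold 0 tests on (a b c) and trains on (d e), while fold 1 tests on (d e) and trains on (a b c). Every item is therefore tested exactly once across the folds, which is why the confusion matrix can be accumulated over all of them.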

The final step is to load the actual data and build some classification models. Notice how the Weka classes themselves are passed to evaluate-classifier, which calls them as constructors to create a new classifier for each fold.

(let ((data (pre-process "bbc/")))
  (evaluate-classifier "Decision Tree" J48 data 10)
  (evaluate-classifier "Naive Bayes" NaiveBayes data 10)
  (evaluate-classifier "Support Vector Machine" SMO data 10))

On my system, the script runs through in about 12 seconds. The output is:

> java --add-opens java.base/java.lang=ALL-UNNAMED -cp "weka.jar;kawa.jar;r7rs-libs.jar" kawa.repl --no-warn-unknown-member .\text-classification.scm
Classifier: Decision Tree
 -- Precision      0.957
 -- Recall         0.940
 -- Geometric mean 0.948
Classifier: Naive Bayes
 -- Precision      0.971
 -- Recall         0.985
 -- Geometric mean 0.977
Classifier: Support Vector Machine
 -- Precision      0.990
 -- Recall         0.993
 -- Geometric mean 0.991

The --add-opens flag is useful on later Java versions to suppress an error caused, I believe, by tighter restrictions on reflective access.
